Talend: Data deduplication with tUniqRow
In Talend, several components offer data deduplication functionality. For example, I discussed tFuzzyMatch in my previous blog post. Here, we are going to look at data deduplication again, this time using the tUniqRow component.
I have five contact records. Each has a unique Id and a different name, and some records share the same phone number.
| Id | Name | Phone Number |
|---|---|---|
| 001 | wdci pty ltd | 03 8322 0360 |
| 002 | talend Open Studio | (714) 786 8140 |
| 004 | WDCI Pty Ltd | 03-8322 0360 |
| 005 | wdci Sydney | 61 2 9432 7834 |
As you can see, there are five records in the sample data. At first glance, you will probably notice that record 1 and record 4 are duplicates: their Name values differ only in letter case, and their Phone Number values differ only in the separator used.
Now, let’s start to build a Talend job to identify the duplicate record.
Secondly, drag the tUniqRow component into the design workspace and link the output row of the tFileInputDelimited component to tUniqRow. Once you are done with step 2, your Talend job should look like Figure 1:
The next step is to set the “Key attribute”, which the tUniqRow component uses to identify duplicate records. In this example, I will use the Name column as the unique key by checking the “Key attribute” checkbox next to it. Please see Figure 2.
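To make the idea of a key attribute more concrete, here is a minimal Python sketch of key-based deduplication. It is not Talend's generated code, only an illustration of the concept: the first row seen for each key value is treated as unique, and any later row with the same key value is treated as a duplicate. The function name `split_unique` and the `case_sensitive` flag are my own hypothetical names, and the sketch assumes the Name comparison ignores letter case. Only the four rows shown in the table above are used.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# The four sample rows visible in the table above (Id, Name, Phone Number).
CONTACTS: List[Record] = [
    {"Id": "001", "Name": "wdci pty ltd", "Phone": "03 8322 0360"},
    {"Id": "002", "Name": "talend Open Studio", "Phone": "(714) 786 8140"},
    {"Id": "004", "Name": "WDCI Pty Ltd", "Phone": "03-8322 0360"},
    {"Id": "005", "Name": "wdci Sydney", "Phone": "61 2 9432 7834"},
]


def split_unique(
    rows: List[Record], key: str, case_sensitive: bool = False
) -> Tuple[List[Record], List[Record]]:
    """Split rows into (uniques, duplicates) based on a single key column.

    Hypothetical helper: the first row for each key value is kept as unique,
    every later row with the same key value is reported as a duplicate.
    """
    seen = set()
    uniques: List[Record] = []
    duplicates: List[Record] = []
    for row in rows:
        # Assumption: when case_sensitive is False, compare lower-cased values.
        value = row[key] if case_sensitive else row[key].lower()
        if value in seen:
            duplicates.append(row)
        else:
            seen.add(value)
            uniques.append(row)
    return uniques, duplicates


uniques, duplicates = split_unique(CONTACTS, key="Name")
print("uniques:   ", [r["Id"] for r in uniques])     # ['001', '002', '005']
print("duplicates:", [r["Id"] for r in duplicates])  # ['004']
```

Note that with `case_sensitive=True` the sketch finds no duplicates at all, because "wdci pty ltd" and "WDCI Pty Ltd" are then different key values. It is worth checking how case sensitivity is configured for the key column in your own tUniqRow settings.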
After you run the process, you should see the following result printed in the console: