Talend: Data deduplication with tUniqRow

In Talend, several components offer data deduplication functionality. For example, I discussed tFuzzyMatch in my previous blog post. Here, we look at data deduplication again, this time using the tUniqRow component.

 

Use Case: 

I have 5 contact records. Each has a unique ID, the names are all written differently, and some records share the same phone number.

 

ID   Name                Phone
001  wdci pty ltd        03 8322 0360
002  talend Open Studio  (714) 786 8140
003  Salesforce          (415) 901-7010
004  WDCI Pty Ltd        03-8322 0360
005  wdci Sydney         61 2 9432 7834

 

As you can see, there are five records in the sample data. At first glance, you will probably notice that records 001 and 004 are duplicates: their names differ only in letter case, and their phone numbers differ only in punctuation.

Now, let’s start to build a Talend job to identify the duplicate record.

First, write those 5 records into a text file to serve as the source data. Then use a tFileInputDelimited component to read the data row by row in the job.

Second, drag the tUniqRow component into the design workspace and link the output row of the tFileInputDelimited component to tUniqRow. Once you are done with this step, your Talend job should look like Figure 1:

 

 

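The next step is to set the “Key attribute”, which the tUniqRow component uses to identify duplicate records. In this example, I will use Name as the unique key by checking the ‘Key attribute’ checkbox next to it. Because the names in records 001 and 004 differ only in letter case, leave the ‘Case sensitive’ checkbox for Name unchecked so that both are treated as the same key. Please see Figure 2.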

 

 

Every tUniqRow component has two output rows:

Uniques – to capture the unique rows
Duplicates – to capture the duplicated rows

Here, I will use tLogRow to display the unique records only. Please look at Figure 3.
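
To make the behaviour concrete, here is a minimal Java sketch of roughly what the tUniqRow step does in this job. It is not Talend's generated code: the class and method names (UniqRowSketch, splitByName) are made up for illustration, and it assumes that the first row seen for a key goes to Uniques while any later row with the same key goes to Duplicates.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of tUniqRow-style deduplication on the Name column.
public class UniqRowSketch {

    // One contact row from the sample data.
    static final class Contact {
        final String id;
        final String name;
        final String phone;

        Contact(String id, String name, String phone) {
            this.id = id;
            this.name = name;
            this.phone = phone;
        }
    }

    // The two output flows: Uniques and Duplicates.
    static final class Result {
        final List<Contact> uniques = new ArrayList<>();
        final List<Contact> duplicates = new ArrayList<>();
    }

    // Assumption: the first row for a key is "unique", later rows are "duplicates".
    static Result splitByName(List<Contact> rows, boolean caseSensitive) {
        Set<String> seenKeys = new HashSet<>();
        Result result = new Result();
        for (Contact row : rows) {
            String key = caseSensitive ? row.name : row.name.toLowerCase(Locale.ROOT);
            if (seenKeys.add(key)) {
                result.uniques.add(row);      // key not seen before
            } else {
                result.duplicates.add(row);   // key already seen
            }
        }
        return result;
    }

    // The five records from the use case above.
    static List<Contact> sampleData() {
        List<Contact> rows = new ArrayList<>();
        rows.add(new Contact("001", "wdci pty ltd", "03 8322 0360"));
        rows.add(new Contact("002", "talend Open Studio", "(714) 786 8140"));
        rows.add(new Contact("003", "Salesforce", "(415) 901-7010"));
        rows.add(new Contact("004", "WDCI Pty Ltd", "03-8322 0360"));
        rows.add(new Contact("005", "wdci Sydney", "61 2 9432 7834"));
        return rows;
    }

    public static void main(String[] args) {
        Result result = splitByName(sampleData(), false);
        for (Contact c : result.uniques) {
            System.out.println("unique:    " + c.id + "|" + c.name + "|" + c.phone);
        }
        for (Contact c : result.duplicates) {
            System.out.println("duplicate: " + c.id + "|" + c.name + "|" + c.phone);
        }
    }
}
```

With caseSensitive set to false, record 004 lands in the duplicates list and the other four records are reported as unique. With caseSensitive set to true, all five names count as different keys and nothing is flagged.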

After you run the process, you should see the following result printed in the console:

Easy, right? As the example above shows, tUniqRow is easier to use than tFuzzyMatch. However, tFuzzyMatch provides more sophisticated matching by using the “Levenshtein” or “Metaphone” algorithm. Therefore, you might want to use both tUniqRow and tFuzzyMatch in the same job.
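
To give a feel for what a Levenshtein-based comparison measures, here is a small Java sketch of the edit distance itself. It is a textbook implementation, not tFuzzyMatch's own code, and the class name LevenshteinSketch is hypothetical.

```java
// Hypothetical sketch: classic two-row Levenshtein edit distance.
public class LevenshteinSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the arrays for the next row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Phone numbers of records 001 and 004 from the sample data.
        System.out.println(distance("03 8322 0360", "03-8322 0360"));
    }
}
```

For the two phone numbers of records 001 and 004, the distance is 1 (a space replaced by a hyphen), which is why a fuzzy match can catch pairs that an exact key comparison treats as different.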
