Talend Tip: Data deduplication with Talend tFuzzyMatch
July 6, 2012
Here’s some information you may find helpful if you plan to set up data deduplication in Talend using the tFuzzyMatch component.
Before we look at the tFuzzyMatch component itself, let me introduce a sample use case and show how the component solves it.
I have 5 contact records. Each has a unique ID and a different name, and some records share the same phone number. With only 5 records, the human eye can easily spot the duplicates, but what if you have a million records? You need an automated way to handle it. Take a look at my sample data below:
|ID||Name||Phone|
|001||wdci pty ltd||03 8322 0360|
|002||Talend Open Studio||(714) 786 8140|
|004||WDCI Group Pty Ltd||03 8322 0360|
|005||wdci Sydney||61 2 9432 7834|
In this scenario, records 001, 004, and 005 have similar names. If we look at the Phone field carefully, though, we find that 005's number differs from that of 001 and 004. Figure 1 shows how the Talend tFuzzyMatch component with the Levenshtein matching type solves this problem.
In the tFuzzyMatch_1 component, I use the “Levenshtein” matching type, which measures edit distance. Once “Levenshtein” is selected, you need to set the minimum and maximum distance (Figure 2). The distance is the number of single-character insertions, deletions, and substitutions required to turn one string into the other. For example, “wdci pty ltd” and “WDCI Group Pty Ltd” are a distance of 6 apart (ignoring case).
In this example, I set the minimum distance to 0 and the maximum to 8, meaning any pair of names no more than 8 edits apart is reported as a match. However, a single tFuzzyMatch component is rarely enough. For one thing, it is difficult to pick the right distance threshold up front; and even with a sensible threshold, the filtering is still not accurate enough on its own. We therefore use a second tFuzzyMatch component to refine the results, and in general, the more matching logic you can apply, the better your duplicate detection will be.
In addition, I leave the “Case Sensitive” checkbox unchecked. If it were checked, the distance between “wdci pty ltd” and “WDCI Group Pty Ltd” would grow to 11, because “wdci” and “WDCI” would be treated as different strings.
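To make the distance concrete, here is a minimal sketch of the Levenshtein computation in plain Java (the language Talend generates under the hood). The class and method names are my own illustration, not part of the tFuzzyMatch API:

```java
public class LevenshteinDemo {

    // Classic dynamic-programming edit distance: d[i][j] is the number of
    // insertions, deletions, and substitutions needed to turn the first i
    // characters of a into the first j characters of b.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Case-insensitive (the default): inserting "Group " costs 6 edits.
        System.out.println(levenshtein("wdci pty ltd".toLowerCase(),
                                       "WDCI Group Pty Ltd".toLowerCase())); // 6
        // Case-sensitive: the differing capitals push the distance up to 11.
        System.out.println(levenshtein("wdci pty ltd",
                                       "WDCI Group Pty Ltd")); // 11
    }
}
```

Running this prints 6 and then 11, matching the two distances discussed above.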
The result from the tFuzzyMatch_1 component in this example will be:
|ID||Source Name||Source Phone||VALUE||MATCHING|
|001||wdci pty ltd||03 8322 0360||0,6,6||wdci pty ltd,WDCI Group Pty Ltd,wdci Sydney|
|002||Talend Open Studio||(714) 786 8140||0||Talend Open Studio|
|004||WDCI Group Pty Ltd||03 8322 0360||6,0||wdci pty ltd,WDCI Group Pty Ltd|
|005||wdci Sydney||61 2 9432 7834||6,0||wdci pty ltd,wdci Sydney|
The VALUE and MATCHING columns are default output columns of the tFuzzyMatch component and are not editable. A VALUE of exactly 0 means the record matched only itself, i.e. no duplicate was found, so we need to filter out the rows whose VALUE equals 0. In the NameMap component, I use the expression row3.VALUE.equals("0") to identify those rows and an expression filter to remove them from the “out1” output table. You can see the example in Figure 3.
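The filtering step can be mimicked in plain Java. The class, method, and variable names below are illustrative only (in the actual job, the tMap expression filter does this work):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ValueFilterDemo {

    // Keep only the IDs whose VALUE string is not exactly "0", i.e. records
    // that matched something other than themselves. This mirrors the tMap
    // expression row3.VALUE.equals("0") used to drop the unique rows.
    static List<String> filterSuspects(Map<String, String> valueById) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, String> row : valueById.entrySet()) {
            if (!row.getValue().equals("0")) {
                suspects.add(row.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // VALUE column as produced by tFuzzyMatch_1 for each record ID:
        // a comma-separated list of distances, one per candidate match.
        Map<String, String> valueById = new LinkedHashMap<>();
        valueById.put("001", "0,6,6");
        valueById.put("002", "0");
        valueById.put("004", "6,0");
        valueById.put("005", "6,0");

        System.out.println(filterSuspects(valueById)); // [001, 004, 005]
    }
}
```

Record 002 is dropped because its VALUE is exactly "0"; the other three survive as duplicate suspects.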
In the second component, tFuzzyMatch_2, I again use “Levenshtein” matching, but with the minimum distance set to 0, the maximum distance set to 5, and the matching column changed to Phone (Figure 4).
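To sketch what this second pass adds, the snippet below (again my own plain-Java illustration, not Talend-generated code) re-checks the name-match survivors on their Phone values with the tighter threshold of 5:

```java
public class PhonePassDemo {

    // Same dynamic-programming edit distance as in the name pass.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Phone numbers of the records that survived the name pass.
        String phone001 = "03 8322 0360";
        String phone004 = "03 8322 0360";
        String phone005 = "61 2 9432 7834";

        // With max distance 5, the identical numbers of 001 and 004 pass...
        System.out.println(levenshtein(phone001, phone004) <= 5); // true
        // ...while 005's unrelated number is now filtered out.
        System.out.println(levenshtein(phone001, phone005) <= 5); // false
    }
}
```

This is why 005, which looked like a duplicate by name alone, drops out once the phone numbers are compared.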
At this stage, we have the IDs of the duplicate records. With those IDs, you can plan your next clean-up strategy.