Python Dedupe

Finding duplicate records in a file is one of those things that seems easy… until you dive in. Here are some of the issues. First, if you have N records and you want to compare all pairs, then there are (N x (N-1))/2 pairs. If you have 10,000 records, that’s about 50 million pairs.
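
As a quick sanity check on that arithmetic:

# Number of unique pairs among N records: N * (N - 1) / 2
n = 10_000
print(n * (n - 1) // 2)  # 49995000, i.e. roughly 50 million comparisons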

Second, let's say a record has several fields. It seems reasonable to start by comparing each field individually, with something like FuzzyWuzzy, and then combining the scores into a single score to compare against some threshold. But when you think about it carefully, it is not obvious how to combine the scores.

For example, I am deduping a list of summer camp directors. Each summer camp can have more than one director. The fields include:

  • first name
  • last name
  • camp name
  • email
  • phone number

Sometimes the directors are relatives. In that case, first name and part of the email address are critical for determining that it’s not a duplicate. For example:

  • Carl Smith, Camp Big Lake, carl@campbiglake.com, 111-222-3333
  • Cathy Smith, Camp Big Lake, cathy@campbiglake.com, 111-222-3333

Comparing the fields might give scores something like 0.6, 1.0, 1.0, 0.8, and 1.0, giving an average of 0.88. Based on that score, it seems like a duplicate. Yet to a human, who knows about relatives and email addresses, it is clearly not a duplicate.
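
To make the pitfall concrete, here is a minimal sketch of the naive field-by-field approach using FuzzyWuzzy. The field names and the simple averaging are just for illustration, not anything Python Dedupe does:

from fuzzywuzzy import fuzz

# The two camp-director records from the example above.
record_a = {'first_name': 'Carl', 'last_name': 'Smith',
            'camp_name': 'Camp Big Lake',
            'email': 'carl@campbiglake.com', 'phone': '111-222-3333'}
record_b = {'first_name': 'Cathy', 'last_name': 'Smith',
            'camp_name': 'Camp Big Lake',
            'email': 'cathy@campbiglake.com', 'phone': '111-222-3333'}

# Naive approach: score each field with fuzz.ratio and average the results.
scores = [fuzz.ratio(record_a[f], record_b[f]) / 100.0 for f in record_a]
print(sum(scores) / len(scores))  # high enough to look like a duplicate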

Third, there may be triplicates and even larger groups of duplicates in the data.

Finally, after resolving all the above issues, you might realize that what you want to do is not deduping!

Maybe rolling your own is not the best answer. Python Dedupe to the rescue.

Python Dedupe

To get started, go to this page and read it all carefully. Maybe even look over the examples. Wow. Aren’t you glad you are not rolling your own? If you need more convincing, check out the Dedupe class – that’s right, it uses parallel processing on multi-core machines.

Getting Started with the CSV example

Since my problem was a lot like the CSV example, I used that example as a starting point. The first thing I was not sure about was:

deduper.sample(data_d, sample_size=15000)

15000 samples? What does that mean? Do I need to manually rate 15000 samples? That does not make any sense. The test data set only contains 3500 records. I am guessing 15000 is the number of pairs that are used for training. And no, you do not have to manually rate them all.
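
For context, that call sits in a setup that looks roughly like this (following the 1.x-era API the csv_example uses; the field names are placeholders for my data, and data_d is the usual dict mapping record IDs to field dicts):

import dedupe

# Describe the fields the model should compare.
fields = [
    {'field': 'first_name', 'type': 'String'},
    {'field': 'last_name', 'type': 'String'},
    {'field': 'account_name', 'type': 'String'},
    {'field': 'email1', 'type': 'String'},
]

deduper = dedupe.Dedupe(fields)

# Draw random pairs of records to feed the active-learning step.
deduper.sample(data_d, sample_size=15000)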

Running csv_example.py starts by showing you pairs of records and asking you to indicate whether each is a duplicate or not. After each pair, you will see something like:

3/10 positive, 5/10 negative

When does it stop? I am not sure. I got to 12/10 positive and 13/10 negative and it kept going. So I pressed (f) for finish and the code completed. I looked over the resulting deduped file. It was pretty impressive. I could not find any misclassifications.
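
That interactive loop is just the convenience function from the example, followed by training on whatever you labeled (again a sketch of the 1.x-style calls):

# Show candidate pairs in the console, tally "n/10 positive, m/10 negative",
# and stop when you press (f) for finish.
dedupe.consoleLabel(deduper)

# Fold the labeled pairs into the model.
deduper.train()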

One thing that made training somewhat confusing is that the docs say the manually labeled samples will be pairs for which the correct answer is not obvious. However, several of the pairs I was shown were very clearly not duplicates. Almost none of the fields were similar. I am not sure how the algorithm was improved by those samples. Maybe the code was testing me to see if I was paying attention?

Training

My data set had 15,000 records. It had already been manually deduped. I was concerned that there would not be enough duplicates to train the model. For my first attempt at training, I set:

deduper.sample(data_d, sample_size=200)

I made the sample size small in the hope that it would speed things up; I would fail fast and move on. As expected, I did not get any duplicates during training. The results were so-so (not terrible).

My plan was to write a function to add duplicates to the training data. This must be a common problem given that Python Dedupe has a function for it (trainingDataDedupe). I did not plan on using that function because I wanted the duplicates to have a certain structure.

In the meantime, while dealing with some other issues in the code, I set:

deduper.sample(data_d, sample_size=15000)

Now I was seeing duplicates during training. Evidently, with such a large sample size, there were duplicates, and the training algorithm was able to find them and present them for manual review. After reviewing about 50 samples, I had more than 10 positives and 10 negatives. Once trained, the algorithm found about 100 duplicates. Scanning through the resulting CSV showed that the algorithm worked well.

What About Blocking?

The docs make it clear that blocking is important for speeding things up. I had some ideas for which fields to use for blocking. How do I implement them? It turns out that, by default, Python Dedupe learns its blocking rules automatically. In my case, the logging messages indicated that Dedupe had settled on the blocking I was thinking would be good. Nice!

You might wonder why I was de-duping data that had already been de-duped. The reason was to be able to check new records before they were added.

What About the Threshold?

Python Dedupe has a function to help you set the threshold. From the CSV example:

threshold = deduper.threshold(data_d, recall_weight=1)

Of course, you read the docs, so I do not need to go into detail about what the keyword “recall_weight” does. But to refresh your memory:

recall_weight — Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
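
The threshold then feeds straight into the matching call (a sketch based on the 1.x csv_example; each cluster comes back as a tuple of record IDs plus their scores):

# Cluster the records using the threshold chosen above.
clustered_dupes = deduper.match(data_d, threshold)

for record_ids, scores in clustered_dupes:
    print(record_ids, scores)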

Oops, I Really Do Not Want Deduping

Since I thought my list was already deduped, what I really wanted to do was prevent new records from being dupes! That is kind of like deduping. It still involves comparing records and coming up with some sort of similarity score. But one big difference is that only N comparisons are needed: just the new record against each of the N deduped records.

Luckily, the authors of Python Dedupe provide that functionality. It is called Gazetteer. From the docs:

Gazetteer matching is for matching a messy data set against a ‘canonical dataset’, i.e. one that does not have any duplicates. This class is useful for such tasks as matching messy addresses against a clean list.

In my case, I needed to run dedupe first to create the canonical dataset. Is there a way to reuse the dedupe training for the Gazetteer? If there is, I did not find it. But I found something close to it: markPairs.

Dedupe produced a file with clusters of duplicates. Singletons were clusters with one record. I generated the canonical dataset by selecting one record from each cluster. I made the messy data set from the remaining records in each cluster. The details are here.
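
Here is a rough sketch of how the Gazetteer pieces fit together, reusing the fields definition from earlier and assuming messy_d and canonical_d are dicts built from the cluster file as just described (1.x-style API; the threshold value is a placeholder):

# Train a Gazetteer much like a Dedupe object, but on two data sets:
# the messy records and the canonical (already de-duplicated) list.
gazetteer = dedupe.Gazetteer(fields)
gazetteer.sample(messy_d, canonical_d, sample_size=15000)
dedupe.consoleLabel(gazetteer)
gazetteer.train()

# Index the canonical records once, then match messy records against them.
gazetteer.index(canonical_d)
matches = gazetteer.match(messy_d, threshold=0.5)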

Speeding things up

The keyword “index_predicates” has a huge impact on performance, especially for the gazetteer. If you are only looking to see whether a few samples are in a long canonical list, it is not clear why you would want to index the data first. In my use case, setting “index_predicates” to False gave about a factor of 20 improvement in speed!
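
If I remember the signature correctly, it is just a keyword argument to train():

# Skip the expensive index predicates when learning blocking rules.
gazetteer.train(index_predicates=False)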

Scoring One Pair

Python Dedupe is not trivial to use. In my case, I had a new record that was clearly a duplicate, yet Python Dedupe was not finding the other record. It seemed useful to try to get the score for that single pair. Here is how I did it:

import os

import dedupe
from dedupe.core import scoreDuplicates

# Load the gazetteer from the settings file saved after training.
if os.path.exists(settings_file):
    with open(settings_file, 'rb') as f:
        linker = dedupe.StaticGazetteer(f)

# A list containing a single candidate pair. Each record is a tuple of
# (record id, field dict, set of ids) -- see the note below about the sets.
candidate_records = [[
    (1, {'first_name': 'jim',
         'last_name': 'conroy',
         'account_name': None,
         'email1': 'jim@camplakes.org',
         'phone_work': None,
         'phone_mobile': None,
         'primary_address_state': 'wi'},
     {10}),

    (2, {'first_name': 'jim',
         'last_name': 'conroy',
         'account_name': 'camp lakes inc',
         'email1': 'jim@camplakes.org',
         'phone_work': None,
         'phone_mobile': None,
         'primary_address_state': 'wi'},
     {11}),
]]

matches = scoreDuplicates(candidate_records, linker.data_model, linker.classifier)
print(matches)

Note that each record has three elements. The last one is a set of ids. I am not sure what it is for, but as long as the sets for the two records are disjoint, the scoring function works as expected.

Error: Is the data you are trying to match like the data you trained on?

This error is somewhat frustrating. It is possible to have a new record that is sufficiently unique that you get this error. My initial approach was to put the matching call in a “try” block and return [] when the error occurred.
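
That workaround looked roughly like this; in the version I was using the message surfaced as a ValueError, but check the exact exception type in yours:

def safe_match(gazetteer, messy_d, threshold=0.5):
    # Return matches, or an empty list when nothing blocks together.
    try:
        return gazetteer.match(messy_d, threshold=threshold)
    except ValueError:
        # "... Is the data you are trying to match like the data you trained on?"
        return []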

However, it is surprisingly easy to have a new record that is a duplicate and still get this error. In my case, I was taking data from a web form and looking for it in a CRM. I trained on records that were already in the CRM. One of the fields was “organization name”. This is a required field in the CRM, but not required on the web form. That caused the above error any time the “organization name” was blank on the web form, even if all the other fields were perfect matches. I was able to solve this by using markPairs to manually add some matches that did not have organization names.
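
The fix looked roughly like the sketch below, using the same field names as the scoring example above: one hand-labeled “match” pair whose web-form side has no organization name (the labeled-pairs dict is the format markPairs expects in the 1.x API):

labeled_pairs = {
    'match': [
        ({'first_name': 'jim', 'last_name': 'conroy', 'account_name': None,
          'email1': 'jim@camplakes.org', 'phone_work': None,
          'phone_mobile': None, 'primary_address_state': 'wi'},
         {'first_name': 'jim', 'last_name': 'conroy',
          'account_name': 'camp lakes inc', 'email1': 'jim@camplakes.org',
          'phone_work': None, 'phone_mobile': None,
          'primary_address_state': 'wi'}),
    ],
    'distinct': [],
}

# Teach the model that a blank organization name can still be a match.
gazetteer.markPairs(labeled_pairs)
gazetteer.train()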

Error: Affinegap ZeroDivisionError

File "affinegap/affinegap.pyx", line 115, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1788)
 File "affinegap/affinegap.pyx", line 134, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1615)
ZeroDivisionError: float division

This occurs when two empty strings are compared. In my case, I was adding some marked pairs and had accidentally used empty strings for missing values instead of None, as the docs call for.
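
A small cleaning step, similar in spirit to the pre-processing in the csv_example, avoids handing empty strings to the string comparators:

def clean_record(record):
    # Replace empty or whitespace-only strings with None, as dedupe expects.
    return {k: (v.strip() or None) if isinstance(v, str) else v
            for k, v in record.items()}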
