Python Mock Functions Not Being Called

Are you using mock to replace some functions during testing, and mysteriously your tests are ignoring the mock decorator?

One way this can happen is if the function being mocked is defined in, or imported directly into, the module being tested. Patching the function in the module where it was originally defined then has no effect, because the module under test holds its own reference to it; you need to patch the name in the namespace where it is looked up. How to get around this?
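To make the failure concrete, here is a minimal, self-contained sketch. The module names datasource and mymodule are made up for illustration and built in-memory, so the snippet runs as-is; it shows that patching a function at its original location does not affect a module that imported it with from ... import:

```python
import sys
import types
from unittest import mock

# Stand-ins for two hypothetical modules, built in-memory for the demo.
datasource = types.ModuleType('datasource')
datasource.get_data = lambda: [10, 20, 30]
sys.modules['datasource'] = datasource

# mymodule does the equivalent of:  from datasource import get_data
mymodule = types.ModuleType('mymodule')
mymodule.get_data = datasource.get_data
mymodule.process = lambda: sum(mymodule.get_data())
sys.modules['mymodule'] = mymodule

# Patching the original location is silently ignored by mymodule:
with mock.patch('datasource.get_data', return_value=[1, 2, 3]):
    ignored = mymodule.process()   # 60: mymodule still holds the real function

# Patching the name where it is looked up works:
with mock.patch('mymodule.get_data', return_value=[1, 2, 3]):
    patched = mymodule.process()   # 6

print(ignored, patched)
```

The second patch target is the fix: point mock.patch at the module under test, not at the module that originally defined the function.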

In my case, after I wasted time figuring out why mock was not working, I ended up modifying the code in my module so that I did not need mock. In the end, I think the modified code is better anyway. Often, making code testable makes it better too.


Working in Python 3.6 on Ubuntu 14.04

This link shows how to install python 3.6 on Ubuntu 14.04:

When installing python 3.6, you will get a warning about not removing python 3.5. The commands in the linked instructions will not remove 3.5.

To run python 3.6, type:

python3.6

In python 3.6, the pyvenv command for making virtualenvs was deprecated. The new command is:

python3.6 -m venv full-path-to-venv-dir

When I run that command, I get an error:

returned non-zero exit status 1

although the virtualenv folder is created along with a bunch of folders and files. The failing step is the one that bootstraps pip into the new environment, so running the command like this works:

python3.6 -m venv full-path-to-venv-dir --without-pip

The virtual env can be activated with:

source path-to-env/bin/activate

After you activate the virtual env, you can install pip inside it by downloading get-pip.py and running it with the virtual env version of python:

curl -O https://bootstrap.pypa.io/get-pip.py
path-to-env/bin/python get-pip.py

Headless Chrome with Python and Selenium

In version 59, Google Chrome acquired the option to run headlessly! Here is what works for me (Ubuntu: 14.04, Chrome version: 60.0.3112.78, chromedriver: 2.31) :

from selenium import webdriver

def get_page_html(url, headless=False, screen_shot_path=None):
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')  # needed by early headless Chrome builds
    selenium = webdriver.Chrome("/usr/local/share/chromedriver", chrome_options=options)
    selenium.set_window_size(1800, 1000)  # width, height
    selenium.get(url)
    page_html = selenium.page_source
    if screen_shot_path:
        selenium.get_screenshot_as_file(screen_shot_path)
    selenium.quit()
    return page_html

Make sure your version of chromedriver is up-to-date. Also, notice that options.add_argument() expects its arguments to be formatted as if you were running google-chrome from the command line (e.g. prefixed with --).

Executing Multiple SSH Commands Using Python

There are many ways to do this. If you only want to run a few commands, then the Python subprocess module might be best. If you are only working in Python 2.X, then Fabric might be best. Since I wanted 2.X or 3.X compatibility and I wanted to run lots of commands, I went with Paramiko. Here is the solution:

import paramiko

IDENTITY = 'path to private key'
REMOTE = 'url of remote'

commands = ['hostname', 'uptime']  # the commands to run remotely

k = paramiko.RSAKey.from_private_key_file(IDENTITY)
with paramiko.SSHClient() as ssh:
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # or load known hosts
    ssh.connect(REMOTE, username='vagrant', pkey=k)
    for command in commands:
        stdin, stdout, stderr = ssh.exec_command(command)
        print(stdout.read().decode())
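For comparison, the subprocess route mentioned above amounts to shelling out to the ssh binary once per command. A sketch of that pattern (the host name and key path are placeholders; the demo call at the bottom uses a local command so the snippet runs without a remote):

```python
import subprocess

def run_over_ssh(host, identity, command):
    # One ssh process per remote command; fine for a handful of commands.
    return subprocess.check_output(
        ['ssh', '-i', identity, host, command], universal_newlines=True)

# Demo of the same capture pattern, using a local command instead of ssh:
out = subprocess.check_output(['echo', 'hello from remote'],
                              universal_newlines=True)
print(out.strip())
```

For lots of commands this spawns a process and negotiates a new connection each time, which is why a persistent Paramiko session wins.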


Running from __main__ in PyCharm

This applies to PyCharm 2016.3.2.

Sometimes it’s useful to run a module that is part of a bigger project directly from PyCharm. To prevent that code from running when the module is imported from elsewhere, you can use something like this:

def my_module(x, y):
    return 'hello world'

if __name__ == '__main__':
    results = my_module(3, 7)
    print(results)

Sometimes when you do that, you will get import errors. Not a problem, just make sure the dir that contains the package that contains the module is on the sys path. To check that, put this at the top of the module:

import sys
for p in sys.path:
    print(p)
Here’s where PyCharm gets a little tricky. You would think you could just go into the Run/Debug Configurations and set the Working directory. Unfortunately, that does not work. Instead, you need to add a PYTHONPATH environment variable in the Run/Debug Configuration, pointing at the directory that contains your package.
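You can check what PYTHONPATH actually does by launching a fresh interpreter with the variable set (the project path below is made up): directories listed there are prepended to the child interpreter's sys.path, which is exactly what PyCharm's environment-variable setting relies on.

```python
import os
import subprocess
import sys

# Launch a child interpreter with PYTHONPATH set, as PyCharm would.
env = dict(os.environ, PYTHONPATH='/home/me/projects/myproject')
out = subprocess.check_output(
    [sys.executable, '-c', 'import sys; print(sys.path)'],
    env=env, universal_newlines=True)

found = '/home/me/projects/myproject' in out
print(found)  # True -- the directory is on the child's sys.path
```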



Python Dedupe Errors

ValueError: No records have been blocked together

I got this error in Python Dedupe 1.6.1 while using Gazetteer to check whether a single new record was a duplicate. I successively replaced values in the new record dict (the keys stayed the same), and eventually Dedupe indicated a potential match. This shows that a sufficiently distinct new record will produce this ValueError.

Python Dedupe

Finding duplicate records in a file is one of those things that seems easy… until you dive in. Here are some of the issues. First, if you have N records and you want to compare all pairs, then there are (N x (N-1))/2 pairs. If you have 10,000 records, that’s about 50 million pairs.
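A quick sanity check of that arithmetic:

```python
def n_pairs(n):
    # Unordered pairs among n records: n choose 2.
    return n * (n - 1) // 2

print(n_pairs(10000))  # 49,995,000 -- about 50 million
```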

Second, let's say a record has several fields. It seems reasonable to start out by comparing each field individually, with something like fuzzywuzzy, and then combining the scores into a single score to compare against some threshold. But when you think about it carefully, it's not obvious how to combine the scores.

For example, I am deduping a list of summer camp directors. Each summer camp can have more than one director. The fields include:

  • first name
  • last name
  • camp name
  • email
  • phone number

Sometimes the directors are relatives. In that case, first name and part of the email address are critical for determining that it’s not a duplicate. For example:

  • Carl Smith, Camp Big Lake,, 111-222-3333
  • Cathy Smith, Camp Big Lake,, 111-222-3333

Comparison of the fields might give scores something like 0.6, 1.0, 1.0, 0.8, 1.0, giving an average of 0.88. Based on that score, it seems like a duplicate. Yet to a human, who knows about relatives and email addresses, it's clearly not a duplicate.
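A sketch of that naive combination, using the made-up per-field scores above for the two Smith records (the point being that a plain average buries the one decisive field):

```python
# Similarity scores for: first name, last name, camp name, email, phone.
scores = [0.6, 1.0, 1.0, 0.8, 1.0]

average = sum(scores) / len(scores)
print(round(average, 2))  # 0.88 -- looks like a duplicate

# Weighting the discriminating fields more heavily (weights invented
# for illustration) pulls the score down:
weights = [3.0, 1.0, 1.0, 3.0, 1.0]
weighted = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
print(round(weighted, 2))  # 0.8
```

Even this is ad hoc: picking the weights by hand is exactly the guesswork that a trained model like Python Dedupe's replaces.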

Third, there may be triplicates and higher in the data.

Finally, after resolving all the above issues, you might realize that what you want to do is not deduping!

Maybe rolling your own is not the best answer. Python Dedupe to the rescue.


To get started, go to this page and read it all carefully. Maybe even look over the examples. Wow. Aren’t you glad you are not rolling your own? If you need more convincing, check out the Dedupe class – that’s right, it uses parallel processing on multi-core machines.

Getting Started with the CSV example

Since my problem was a lot like the CSV example, I used that example as a basis for my problem. The first thing I was not sure about was:

deduper.sample(data_d, sample_size=15000)

15000 samples? What does that mean? Do I need to manually rate 15000 samples? That does not make any sense. The test data set only contains 3500 records. I am guessing 15000 is the number of pairs that are used for training. And no, you do not have to manually rate them all.

Training starts by showing you several pairs and asking you to indicate whether each is a duplicate or not. After each pair, you will see something like:

3/10 positive, 5/10 negative

When does it stop? I am not sure. I got to 12/10 positive and 13/10 negative and it kept going. So I clicked (f) for finish and the code completed. I looked over the resulting deduped file. It was pretty impressive. I could not find any misclassifications.

One thing that made training somewhat confusing is that the docs said the manual samples would be pairs for which the correct answer was not obvious. However, several of the pairs I was shown were very clearly not duplicates; almost none of the fields were similar. I am not sure how the algorithm was improved by those samples. Maybe the code was testing me to see if I was paying attention?


My data set had 15,000 records. It had already been manually deduped. I was concerned that there would not be enough duplicates to train the model. For my first attempt at training, I set:

deduper.sample(data_d, sample_size=200)

I made the sample size small in the hope that it would speed things up; I would fail fast and move on. As expected, I did not get any duplicates during training. The results were so-so (not terrible).

My plan was to write a function to add duplicates to the training data. This must be a common problem given that Python Dedupe has a function for it (trainingDataDedupe). I did not plan on using that function because I wanted the duplicates to have a certain structure.

In the meantime, I was dealing with some other issues in the code and set:

deduper.sample(data_d, sample_size=15000)

Now I was seeing duplicates during training. Evidently, with such a large sample size, there were duplicates, and the training algorithm was able to find them and present them for manual review. After reviewing about 50 samples, I had more than 10 positives and 10 negatives. Once trained, the algorithm found about 100 duplicates. Scanning through the resulting CSV showed that the algorithm worked well.

What About Blocking?

The docs make it clear that blocking is important for speeding things up. I had some ideas for the fields to use for blocking. How do I implement them? It turns out that the default is for Python Dedupe to automatically do blocking. In my case, the logging messages indicated that Dedupe automatically implemented the blocking I was thinking would be good. Nice!

You might wonder why I was de-duping data that was already de-duped. The reason was to check new records before they were added.

What About the Threshold?

Python Dedupe has a function to help you set the threshold. From the CSV example:

threshold = deduper.threshold(data_d, recall_weight=1)

Of course, you read the docs, so I do not need to go into details about what the keyword recall_weight does. But to refresh your memory:

recall_weight — Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
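My mental model of recall_weight is the beta parameter of a weighted F-score (this is my reading of the docs, not a claim about dedupe's exact internals): the threshold is chosen to maximize a quantity like this, so a larger recall_weight favors operating points with higher recall:

```python
def weighted_f_score(precision, recall, recall_weight=1.0):
    # F-beta style combination; recall_weight plays the role of beta.
    b2 = recall_weight ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With recall_weight=2, a high-recall operating point scores better
# than a high-precision one with the same numbers swapped:
favor_recall = weighted_f_score(precision=0.7, recall=0.95, recall_weight=2)
favor_precision = weighted_f_score(precision=0.95, recall=0.7, recall_weight=2)
print(favor_recall > favor_precision)  # True
```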

Oops, I Really Do Not Want Deduping

Since I thought my list was already deduped, what I really wanted to do was prevent new records from being dupes! That’s kind of like deduping. It still involves comparing records and coming up with some sort of similarity score. But one big difference is that only N comparisons are needed; just the new record against the list of deduped records.

Luckily, the authors of Python Dedupe have that functionality. It's called Gazetteer. From the docs:

Gazetteer matching is for matching a messy data set against a ‘canonical dataset’, i.e. one that does not have any duplicates. This class is useful for such tasks as matching messy addresses against a clean list.

In my case, I needed to run dedupe first to create the canonical dataset. Is there a way to reuse the dedupe training for the Gazetteer? If there is, I did not find it. But I found something close to that: markPairs.

Dedupe produced a file with clusters of duplicates. Singletons were clusters with one record. I generated the canonical dataset by selecting one record from each cluster. I made the messy data set from the other records in each cluster. The details are here.

Speeding things up

The keyword "index_predicates" has a huge impact on performance, especially for the gazetteer. If you are only checking whether a few samples are in a long canonical list, it's not clear why you would want to index the data first. In my use case, setting "index_predicates" to False gave about a factor of 20 improvement in speed!

Scoring One Pair

Python Dedupe is not trivial to use. In my case, I had a new record that was clearly a duplicate, yet Python Dedupe was not finding the other record. It seemed useful to try to get the score for that single pair. Here is how I did it:

import os

import dedupe
from dedupe.core import scoreDuplicates

settings_file = 'my_settings_file'  # written by an earlier training run

if os.path.exists(settings_file):
    with open(settings_file, 'rb') as f:
        linker = dedupe.StaticGazetteer(f)

# Each record is a (record_id, record_dict, set_of_ids) tuple;
# a candidate is a pair of such records.
candidate_records = [[(1, {'first_name': 'jim',
                           'last_name': 'conroy',
                           'account_name': None,
                           'email1': '',
                           'phone_work': None,
                           'phone_mobile': None,
                           'primary_address_state': 'wi'}, {10}),
                      (2, {'first_name': 'jim',
                           'last_name': 'conroy',
                           'account_name': 'camp lakes inc',
                           'email1': '',
                           'phone_work': None,
                           'phone_mobile': None,
                           'primary_address_state': 'wi'}, {11})]]

matches = scoreDuplicates(candidate_records, linker.data_model, linker.classifier)

Note that each record has three elements. The last one is a set of ids. I am not sure what it is for, but if the set for each record is disjoint, then the scoring function will work as expected.

Error: Is the data you are trying to match like the data you trained on?

This error is somewhat frustrating. It is possible to have a new record that is sufficiently unique that you get this error. My initial approach to this was to put it in a “try” block and return [] when the error occurred.

However, it is surprisingly easy to have a new record that is a duplicate and still get this error. In my case, I was taking data from a web-form and looking for it in a CRM. I trained on records that were already in the CRM. One of the fields was "organization name". This is a required field in the CRM, but not required on the web-form. This caused the above error any time the "organization name" was blank on the web-form, even if all the other fields were perfect matches. I was able to solve this by using markPairs to manually add some matches that did not have organization names.

Error: Affinegap ZeroDivisionError

File "affinegap/affinegap.pyx", line 115, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1788)
File "affinegap/affinegap.pyx", line 134, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1615)
ZeroDivisionError: float division

This occurs when two empty strings are compared. In my case, I was adding some marked pairs and accidentally had empty strings for values instead of None, as the docs specify.