Executing Multiple SSH Commands Using Python

There are many ways to do this. If you only need to run a few commands, the Python subprocess module might be best. If you are only working in Python 2.X, Fabric might be best. Since I wanted 2.X or 3.X support and I needed to run lots of commands, I went with Paramiko. Here is the solution:

import paramiko

IDENTITY = 'path to private key'
REMOTE = 'url of remote'

k = paramiko.RSAKey.from_private_key_file(IDENTITY)
commands = ['uptime', 'whoami']  # the commands to run on the remote

with paramiko.SSHClient() as ssh:
    ssh.connect(REMOTE, username='vagrant', pkey=k)
    for command in commands:
        stdin, stdout, stderr = ssh.exec_command(command)
        print(stdout.read().decode())


Running from __main__ in PyCharm

This applies to PyCharm 2016.3.2.

Sometimes it’s useful to run a module, that is part of a bigger project, directly from PyCharm. To prevent that code from running when the module is imported from elsewhere, you can use something like this:

def my_module(x, y):
    return 'hello world'

if __name__ == '__main__':
    results = my_module(3, 7)

Sometimes when you do that, you will get import errors. Not a problem; just make sure the directory that contains the package that contains the module is on sys.path. To check that, put this at the top of the module:

import sys

for p in sys.path:
    print(p)

Here’s where PyCharm gets a little tricky. You would think you could just go into the Run/Debug Configurations and set the Working directory. Unfortunately, that does not work. Instead, you need to add an Environment variable named PYTHONPATH. Like this:
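As a sketch, assuming the package lives under /path/to/project (a made-up path), the environment variable entry in the Run/Debug Configuration would look something like:

```shell
PYTHONPATH=/path/to/project
```

The directory you point at should be the one that contains the package, not the package itself.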



Python Dedupe Errors

ValueError: No records have been blocked together

I got this error in Python Dedupe 1.6.1 while using the Gazetteer to check whether a single new record was a duplicate. I successively replaced values in the new record dict (the keys stayed the same) until Dedupe eventually indicated a potential match. This shows that a sufficiently distinct new record will give this ValueError.

Python Dedupe

Finding duplicate records in a file is one of those things that seems easy… until you dive in. Here are some of the issues. First, if you have N records and you want to compare all pairs, then there are (N x (N-1))/2 pairs. If you have 10,000 records, that’s about 50 million pairs.
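To make the growth concrete, the pair count can be sketched as:

```python
def pair_count(n):
    # all unordered pairs among n records: n * (n - 1) / 2
    return n * (n - 1) // 2

print(pair_count(10000))  # 49995000
```

Nearly 50 million comparisons for only 10,000 records, so brute force gets out of hand quickly.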

Second, let's say a record has several fields. It seems reasonable to start out by comparing each field individually, with something like FuzzyWuzzy, and then combining the scores into a single score to compare against some threshold. But when you think about it carefully, it is not obvious how to combine the scores.

For example, I am deduping a list of summer camp directors. Each summer camp can have more than one director. The fields include:

  • first name
  • last name
  • camp name
  • email
  • phone number

Sometimes the directors are relatives. In that case, first name and part of the email address are critical for determining that it’s not a duplicate. For example:

  • Carl Smith, Camp Big Lake, carl@campbiglake.com, 111-222-3333
  • Cathy Smith, Camp Big Lake, cathy@campbiglake.com, 111-222-3333

Comparing the fields might give scores something like 0.6, 1.0, 1.0, 0.8, 1.0, giving an average of 0.88. Based on that score, it seems like a duplicate. Yet to a human, who knows about relatives and email addresses, it is clearly not a duplicate.
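The naive averaging described above can be sketched as follows (the scores are the illustrative values from the example, not output from any real comparison):

```python
# Field scores for: first name, last name, camp name, email, phone
scores = [0.6, 1.0, 1.0, 0.8, 1.0]
average = sum(scores) / len(scores)
print(round(average, 2))  # 0.88
```

A flat average has no way to express that a mismatch in one critical field (here, first name and the email local part) should veto the match.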

Third, there may be triplicates and higher in the data.

Finally, after resolving all the above issues, you might realize that what you want to do is not deduping!

Maybe rolling your own is not the best answer. Python Dedupe to the rescue.

Python Dedupe

To get started, go to this page and read it all carefully. Maybe even look over the examples. Wow. Aren’t you glad you are not rolling your own? If you need more convincing, check out the Dedupe class – that’s right, it uses parallel processing on multi-core machines.

Getting Started with the CSV example

Since my problem was a lot like the CSV example, I used that example as the basis for my solution. The first thing I was not sure about was:

deduper.sample(data_d, sample_size=15000)

15,000 samples? What does that mean? Do I need to manually rate 15,000 samples? That does not make any sense; the test data set only contains 3,500 records. My guess is that 15,000 is the number of pairs used for training. And no, you do not have to manually rate them all.

Running csv_example.py starts by showing you several pairs and asking you to indicate if it’s a duplicate or not. After each pair, you will see something like:

3/10 positive, 5/10 negative

When does it stop? I am not sure. I got to 12/10 positive and 13/10 negative and it kept going, so I pressed (f) for finish and the code completed. I looked over the resulting deduped file. It was pretty impressive; I could not find any misclassifications.

One thing that made training somewhat confusing: the docs said the manually labeled samples would be pairs for which the correct answer was not obvious. However, several of the pairs I was shown were very clearly not duplicates; almost none of the fields were similar. I am not sure how the algorithm was improved by those samples. Maybe the code was testing me to see if I was paying attention?


My data set had 15,000 records. It had already been manually deduped. I was concerned that there would not be enough duplicates to train the model. For my first attempt at training, I set:

deduper.sample(data_d, sample_size=200)

I made the sample size small in the hope that it would speed things up; I would fail fast and move on. As expected, I did not get any duplicates during training. The results were so-so (not terrible).

My plan was to write a function to add duplicates to the training data. This must be a common problem given that Python Dedupe has a function for it (trainingDataDedupe). I did not plan on using that function because I wanted the duplicates to have a certain structure.

In the meantime, I was dealing with some other issues in the code and set:

deduper.sample(data_d, sample_size=15000)

Now I was seeing duplicates during training. Evidently, with such a large sample size, there were duplicate pairs, and the training algorithm was able to find them and present them for manual review. After reviewing about 50 samples, I had more than 10 positives and 10 negatives. Once trained, the algorithm found about 100 duplicates. Scanning through the resulting CSV showed that the algorithm worked well.

What About Blocking?

The docs make it clear that blocking is important for speeding things up. I had some ideas for the fields to use for blocking. How do I implement them? It turns out that the default is for Python Dedupe to automatically do blocking. In my case, the logging messages indicated that Dedupe automatically implemented the blocking I was thinking would be good. Nice!

You might wonder why I was de-duping data that was already de-duped. The reason was to check new records before they were added.

What About the Threshold?

Python Dedupe has a function to help you set the threshold. From the CSV example:

threshold = deduper.threshold(data_d, recall_weight=1)

Of course, you read the docs, so I do not need to go into detail about what the keyword “recall_weight” does. But to refresh your memory:

recall_weight — Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
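One common way to formalize that tradeoff (an illustration of the idea, not necessarily dedupe’s exact internal formula) is the F-beta score, with recall_weight playing the role of beta:

```python
def f_beta(precision, recall, beta):
    # Weighted harmonic mean of precision and recall;
    # beta > 1 favors recall, beta < 1 favors precision.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.9, 0.6, 2), 3))  # 0.643
print(round(f_beta(0.9, 0.6, 1), 3))  # 0.72
```

With beta = 2, the weak recall drags the score down harder than it does with beta = 1, which pushes the chosen threshold toward catching more duplicates.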

Oops, I Really Do Not Want Deduping

Since I thought my list was already deduped, what I really wanted was to prevent new records from being dupes! That's kind of like deduping: it still involves comparing records and coming up with some sort of similarity score. But one big difference is that only N comparisons are needed; just the new record against each record in the deduped list.

Luckily, the authors of Python Dedupe provide that functionality. It's called Gazetteer. From the docs:

Gazetteer matching is for matching a messy data set against a ‘canonical dataset’, i.e. one that does not have any duplicates. This class is useful for such tasks as matching messy addresses against a clean list.

In my case, I needed to run dedupe first to create the canonical dataset. Is there a way to reuse the dedupe training for the Gazetteer? If there is, I did not find it. But I found something close to it: markPairs.

Dedupe produced a file with clusters of duplicates. Singletons were clusters with one record. I generated the canonical dataset by selecting one record from each cluster. I made the messy data set from the other records in each cluster. The details are here.
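A minimal sketch of that split, assuming the clusters have already been read into a dict of cluster id to records (the variable names and sample data are made up for illustration):

```python
# One entry per cluster from dedupe's output; singletons have one record.
clusters = {
    0: [{'name': 'carl smith'}, {'name': 'carl smyth'}],
    1: [{'name': 'cathy smith'}],
}

canonical, messy = {}, {}
for cluster_id, records in clusters.items():
    canonical[cluster_id] = records[0]       # one representative per cluster
    for i, record in enumerate(records[1:], start=1):
        messy[(cluster_id, i)] = record      # the rest become the messy set

print(len(canonical), len(messy))  # 2 1
```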

Speeding things up

The keyword “index_predicates” has a huge impact on performance, especially for the Gazetteer. If you are only checking whether a few samples are in a long canonical list, it is not clear why you would want to index the data first. In my use case, setting “index_predicates” to False gave about a factor of 20 improvement in speed!

Scoring One Pair

Python Dedupe is not trivial to use. In my case, I had a new record that was clearly a duplicate, yet Python Dedupe was not finding the other record. It seemed useful to try to get the score for that single pair. Here is how I did it:

import os

import dedupe
from dedupe.core import scoreDuplicates

# settings_file points to the file written by an earlier training run
settings_file = 'my_learned_settings'

if os.path.exists(settings_file):
    with open(settings_file, 'rb') as f:
        linker = dedupe.StaticGazetteer(f)

candidate_records = [[(1, {
    'first_name': 'jim',
    'last_name': 'conroy',
    'account_name': None,
    'email1': 'jim@camplakes.org',
    'phone_work': None,
    'phone_mobile': None,
    'primary_address_state': 'wi'}, {10}),

    (2, {
        'first_name': 'jim',
        'last_name': 'conroy',
        'account_name': 'camp lakes inc',
        'email1': 'jim@camplakes.org',
        'phone_work': None,
        'phone_mobile': None,
        'primary_address_state': 'wi'}, {11})]]

matches = scoreDuplicates(candidate_records, linker.data_model, linker.classifier)

Note, each record has three elements. The last one is a set of ids. I am not sure what it is for, but if the sets for the records are disjoint, then the scoring function will work as expected.

Error: Is the data you are trying to match like the data you trained on?

This error is somewhat frustrating. It is possible to have a new record that is sufficiently unique that you get this error. My initial approach to this was to put it in a “try” block and return [] when the error occurred.
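A sketch of that approach; `safe_match` and the stand-in matcher below are names I made up, and in real code `match_fn` would be the Gazetteer's matching call:

```python
def safe_match(match_fn, record):
    # Return matches, treating the blocking ValueError as "no matches"
    # instead of letting it crash the caller.
    try:
        return match_fn(record)
    except ValueError:
        return []

def distinct_record_matcher(record):
    # Stand-in for the real matcher: simulates the error a
    # sufficiently distinct record triggers.
    raise ValueError('Is the data you are trying to match like the data you trained on?')

print(safe_match(distinct_record_matcher, {'first_name': 'zzz'}))  # []
```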

However, it is surprisingly easy to have a new record that is a duplicate and still get this error. In my case, I was taking data from a web form and looking for it in a CRM. I trained on records that were already in the CRM. One of the fields was “organization name”. It is a required field in the CRM, but not in the web form. This caused the above error any time the “organization name” was blank on the web form, even if all the other fields were perfect matches. I was able to solve this by using markPairs to manually add some matches that did not have organization names.

Error: Affinegap ZeroDivisionError

  File "affinegap/affinegap.pyx", line 115, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1788)
  File "affinegap/affinegap.pyx", line 134, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1615)
ZeroDivisionError: float division

This occurs when two empty strings are compared. In my case, I was adding some marked pairs and I accidentally used empty strings for missing values instead of None, as per the docs.
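A small sketch of the fix I ended up needing: normalize empty strings to None before handing records over (the helper name is mine):

```python
def clean_record(record):
    # dedupe expects missing values to be None, not ''; two empty
    # strings being compared is what triggers the ZeroDivisionError.
    return {k: (None if v == '' else v) for k, v in record.items()}

print(clean_record({'first_name': 'carl', 'phone_work': ''}))
# {'first_name': 'carl', 'phone_work': None}
```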



Blender for Python Programmers

So you want to use Python to do stuff in Blender? Here are my notes as I go down the rabbit hole.

In Blender, Python is a tool for automating stuff you could do manually. The best way to get started is to learn how to do things manually and then once you have it figured out, make the Python script. Here is a great tutorial to get you started.

Official Docs


Make sure to check out the “Gotchas” section.

Naming Stuff

From the “Gotchas” section, note the limitations on naming stuff: https://www.blender.org/api/blender_python_api_2_78a_release/info_gotcha.html#naming-limitations

Using PyCharm

So you want to use the PyCharm debugger on your Blender code? Here is how:


Running Python Blender Script Non-interactively

From the command line, enter:

blender --background --python myscript.py

Writing Modular Code

The first step is to configure the Python paths. Pro tip: when playing with Python paths, it is sometimes helpful to print them:

import sys

for p in sys.path:
    print(p)

This page shows the layout of the default paths. If you choose to put some *.py files in ./scripts/modules, do not forget to put an __init__.py file in there too.


Then your scripts can import the Blender API as usual:

import bpy



Rotations

This is not so much a Blender thing, but more of a 3D modelling thing. There are lots of ways of specifying rotations, each with its own pros and cons. One important way is with quaternions. I have studied a lot of math in my day, but somehow quaternions never came up. To make matters worse, the math behind them is non-trivial. Here is the best free, online tutorial I could find.

General Tutorials

The Proving Ground


Logging a Single Module in Python

The problem is that sometimes 3rd party modules have logging turned on. For example, I run into that when using Selenium for testing.

If you do not want to see any logger output, you can put this at the top of your module:

import logging

logging.disable(logging.CRITICAL)

That works fine until you want to turn on logging for the module you are working on. Here is code that turns off all other loggers. The last line sends a logger message for this module to the console.

import logging

logging.basicConfig(level=logging.DEBUG)

for logger_name in logging.Logger.manager.loggerDict.keys():
    logging.getLogger(logger_name).setLevel(logging.CRITICAL)

logging.debug('hello from my module')

Python six.moves

The Python six.moves module handles modules whose names changed in Python 3. It is not that difficult to use, but in the version of PyCharm that I am using, the code is flagged as an error. Since I do not use it that often, I wrote these notes.

Expanding on the six docs example, the HTML parser:

  • It’s called HTMLParser in Python 2
  • It’s called html.parser in Python 3

Your Python 2 code:

import HTMLParser

class MyHTMLParser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag

    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag

    def handle_data(self, data):
        print "Encountered some data  :", data


Converted to use six:

from six.moves import html_parser

class MyHTMLParser(html_parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag

    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag

    def handle_data(self, data):
        print "Encountered some data  :", data

Note this is a two step process:

  1. change the import statement
  2. scan your code for “HTMLParser” and replace it with “html_parser” (in the case of Python 2)

Things get a little more complicated where several packages in Python 2 have been merged into one package in Python 3. For example, the urllib, urllib2, and urlparse modules have been combined into the urllib package in Python 3. My original Python 2 code:

from urlparse import urlparse

query_string = urlparse(full_url).query


Converted to use six:

# To import just the urlparse module
from six.moves.urllib.parse import urlparse
query_string = urlparse(full_url).query

# Or to import the entire package.
import six.moves.urllib.parse as urlparse
query_string = urlparse.urlparse(full_url).query


To use six.moves, use the name in column 1 of the table in the docs for your imports. Then, if your code is Python 2, find each occurrence of the column 2 value and replace it with the column 1 value. If your code is Python 3, find each occurrence of the column 3 value and replace it with the column 1 value.