Grouping Choices in a Django Select

To make an ordinary Django ChoiceField you can do something like this:

choices = [[1, 'Apples'], [2, 'Oranges'], [3, 'Carrots'], [4, 'Beans']]
my_choices = forms.ChoiceField(choices=choices)

But what if you want to group the choices using the HTML OPTGROUP tag? My first thought was to override some of the methods in the Django Select widget. But when I inspected the code, I found it already supports this (Django 1.8). Here is how to do it:

choices = [
    ['Fruit', [[1, 'Apples'], [2, 'Oranges']]], 
    ['Veggies', [[3, 'Carrots'], [4, 'Beans']]]
]
my_choices = forms.ChoiceField(choices=choices)

Running from __main__ in PyCharm

This applies to PyCharm 2016.3.2.

Sometimes it’s useful to run a module that is part of a bigger project directly from PyCharm. To prevent that code from running when the module is imported from elsewhere, you can use something like this:

def my_module(x, y):
    return 'hello world'

if __name__ == '__main__':
    # Runs only when this file is executed directly, not when it is imported.
    results = my_module(3, 7)

Sometimes when you do that, you will get import errors. Not a problem: just make sure the directory that contains the package that contains the module is on the sys path. To check that, put this at the top of the module:

import sys
for p in sys.path:
    print(p)

Here’s where PyCharm gets a little tricky. You would think you could just go into the Run/Debug Configurations and set the Working directory. Unfortunately that does not work. Instead, you need to add a PYTHONPATH environment variable, like this:

[Screenshot: Run/Debug Configurations with the PYTHONPATH environment variable set]


Python Dedupe Errors

ValueError: No records have been blocked together

I got this error in Python Dedupe 1.6.1 while using Gazetteer to check whether a single new record was a duplicate. I successively replaced values in the new record dict (the keys stayed the same), and eventually Dedupe indicated a potential match. In other words, a sufficiently distinct new record will give this ValueError.

Python Dedupe

Finding duplicate records in a file is one of those things that seems easy… until you dive in. Here are some of the issues. First, if you have N records and you want to compare all pairs, then there are (N x (N-1))/2 pairs. If you have 10,000 records, that’s about 50 million pairs.

Second, let’s say a record has several fields. It seems reasonable to start out by comparing each field individually, with something like fuzzywuzzy, and then combining the scores into a single score to compare against some threshold. But when you think about it carefully, it is not obvious how to combine the scores.

For example, I am deduping a list of summer camp directors. Each summer camp can have more than one director. The fields include:

  • first name
  • last name
  • camp name
  • email
  • phone number

Sometimes the directors are relatives. In that case, first name and part of the email address are critical for determining that it’s not a duplicate. For example:

  • Carl Smith, Camp Big Lake, carl@campbiglake.com, 111-222-3333
  • Cathy Smith, Camp Big Lake, cathy@campbiglake.com, 111-222-3333

Comparing the fields might give scores something like 0.6, 1.0, 1.0, 0.8, 1.0, for an average of 0.88. Based on that score, it looks like a duplicate. Yet to a human, who knows about relatives and email addresses, it is clearly not a duplicate.
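
To see the problem concretely, here is a minimal sketch of the naive per-field approach using fuzzywuzzy. The field names and the resulting number are illustrative, not from a real project.

from fuzzywuzzy import fuzz

# The Carl/Cathy example from above: same camp, same phone, relatives.
record_a = {'first_name': 'Carl', 'last_name': 'Smith',
            'camp_name': 'Camp Big Lake', 'email': 'carl@campbiglake.com',
            'phone': '111-222-3333'}
record_b = {'first_name': 'Cathy', 'last_name': 'Smith',
            'camp_name': 'Camp Big Lake', 'email': 'cathy@campbiglake.com',
            'phone': '111-222-3333'}

# Score each field (0.0 to 1.0) and average the scores.
scores = [fuzz.ratio(record_a[f], record_b[f]) / 100.0 for f in record_a]
print(sum(scores) / len(scores))  # high score, yet a human says "not a duplicate"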

Third, there may be triplicates and higher in the data.

Finally, after resolving all the above issues, you might realize that what you want to do is not deduping!

Maybe rolling your own is not the best answer. Python Dedupe to the rescue.

Python Dedupe

To get started, go to this page and read it all carefully. Maybe even look over the examples. Wow. Aren’t you glad you are not rolling your own? If you need more convincing, check out the Dedupe class – that’s right, it uses parallel processing on multi-core machines.

Getting Started with the CSV example

Since my problem was a lot like the CSV example, I used that example as the basis for my own code. The first thing I was not sure about was:

deduper.sample(data_d, sample_size=15000)

15000 samples? What does that mean? Do I need to manually rate 15000 samples? That would not make any sense; the test data set only contains 3500 records. I am guessing 15000 is the number of record pairs sampled for training. And no, you do not have to manually rate them all.

Running csv_example.py starts by showing you pairs of records and asking you to indicate whether each one is a duplicate. After each pair, you will see something like:

3/10 positive, 5/10 negative

When does it stop? I am not sure. I got to 12/10 positive and 13/10 negative and it kept going, so I pressed (f) for finished and the code completed. I looked over the resulting deduped file. It was pretty impressive; I could not find any misclassifications.
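
For reference, the training portion of csv_example.py boils down to roughly this (dedupe 1.x API; the field definitions below are hypothetical stand-ins, not the example’s own):

import dedupe

# Hypothetical field definitions; csv_example.py defines its own.
fields = [{'field': 'first_name', 'type': 'String'},
          {'field': 'last_name', 'type': 'String'},
          {'field': 'email', 'type': 'String'}]

deduper = dedupe.Dedupe(fields)
deduper.sample(data_d, sample_size=15000)  # sample candidate pairs for training
dedupe.consoleLabel(deduper)               # the interactive yes/no/unsure/finished loop
deduper.train()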

One thing that made training somewhat confusing is that the docs said the manually labeled samples would be pairs for which the correct answer was not obvious. However, several of the pairs I was shown were very clearly not duplicates; almost none of the fields were similar. I am not sure how the algorithm was improved by those samples. Maybe the code was testing me to see if I was paying attention?

Training

My data set had 15,000 records. It had already been manually deduped. I was concerned that there would not be enough duplicates to train the model. For my first attempt at training, I set:

deduper.sample(data_d, sample_size=200)

I made the sample size small in the hope that it would speed things up; I would fail fast and move on. As expected, I did not get any duplicates during training. The results were so-so (not terrible).

My plan was to write a function to add duplicates to the training data. This must be a common problem given that Python Dedupe has a function for it (trainingDataDedupe). I did not plan on using that function because I wanted the duplicates to have a certain structure.

In the meantime, I was dealing with some other issues in the code and set:

deduper.sample(data_d, sample_size=15000)

Now I was seeing duplicates during training. Evidently, with such a large sample size, there were duplicates and the training algorithm was able to find them and present them for manual review. After reviewing about 50 samples, I had more than 10 positives and 10 negatives. Once trained, the algorithm found about 100 duplicates. Scanning through the resulting CSV showed that the algorithm worked well.

What About Blocking?

The docs make it clear that blocking is important for speeding things up. I had some ideas about which fields to use for blocking, but how do I implement them? It turns out that by default Python Dedupe does the blocking automatically. In my case, the logging messages indicated that Dedupe automatically implemented the blocking I thought would be good. Nice!

You might wonder why I was de-duping data that was already de-duped. The reason was to check new records before they were added.

What About the Threshold?

Python Dedupe has a function to help you set the threshold. From the CSV example:

threshold = deduper.threshold(data_d, recall_weight=1)

Of course, you read the docs, so I do not need to go into details about what the keyword “recall_weight” does. But to refresh your memory:

recall_weight — Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
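
So, for example, to care twice as much about recall as precision and then cluster with the suggested threshold (a sketch, using the same deduper and data_d as above):

# Favor recall twice as much as precision.
threshold = deduper.threshold(data_d, recall_weight=2)

# Cluster the records using that threshold.
clustered_dupes = deduper.match(data_d, threshold)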

Oops, I Really Do Not Want Deduping

Since I thought my list was already deduped, what I really wanted to do was prevent new records from being dupes! That’s kind of like deduping. It still involves comparing records and coming up with some sort of similarity score. But one big difference is that only N comparisons are needed; just the new record against the list of deduped records.

Luckily, the authors of Python Dedupe provide that functionality. It’s called Gazetteer. From the docs:

Gazetteer matching is for matching a messy data set against a ‘canonical dataset’, i.e. one that does not have any duplicates. This class is useful for such tasks as matching messy addresses against a clean list.

In my case, I needed to run dedupe first to create the canonical dataset. Is there a way to reuse the dedupe training for the Gazetteer? If there is, I did not find it. But I found something close to that: markPairs.

Dedupe produced a file with clusters of duplicates. Singletons were clusters with one record. I generated the canonical dataset by selecting one record from each cluster and made the messy dataset from the other records in each cluster. The details are here.
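
Here is a rough sketch of that step as I understand the dedupe 1.x API. canonical_d, messy_d, and the labeled pair lists are placeholders built from the dedupe clusters; check the gazetteer example for the exact argument order to sample().

import dedupe

# Hypothetical field definitions matching the records used elsewhere in this post.
fields = [{'field': 'first_name', 'type': 'String'},
          {'field': 'last_name', 'type': 'String'},
          {'field': 'account_name', 'type': 'String'},
          {'field': 'email1', 'type': 'String'}]

gazetteer = dedupe.Gazetteer(fields)
gazetteer.sample(messy_d, canonical_d, 15000)

# Reuse pairs drawn from the dedupe clusters as labeled training examples.
gazetteer.markPairs({'match': known_match_pairs,        # lists of (record_1, record_2) tuples
                     'distinct': known_distinct_pairs})
gazetteer.train()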

Speeding things up

The keyword “index_predicates” has a huge impact on performance, especially for the gazetteer. If you are only looking to see whether a few samples are in a long canonical list, it is not clear why you would want to index the data first. In my use case, setting “index_predicates” to False gave about a 20x speedup!
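
If I remember correctly, index_predicates is a keyword argument to train(); treat the exact call as an assumption for your version of dedupe:

# Skip index predicates -- a big win when matching a handful of new records
# against a long canonical list.
gazetteer.train(index_predicates=False)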

Scoring One Pair

Python Dedupe is not trivial to use. In my case, I had a new record that was clearly a duplicate, yet Python Dedupe was not finding the other record. It seemed useful to try to get the score for that single pair. Here is how I did it:

import os

import dedupe
from dedupe.core import scoreDuplicates

# settings_file points at the settings written out after training the gazetteer.
if os.path.exists(settings_file):
    with open(settings_file, 'rb') as f:
        linker = dedupe.StaticGazetteer(f)

candidate_records = [[(1, {
    'first_name': 'jim',
    'last_name': 'conroy',
    'account_name': None,
    'email1': 'jim@camplakes.org',
    'phone_work': None,
    'phone_mobile': None,
    'primary_address_state': 'wi'}, {10}),

    (2, {
        'first_name': 'jim',
        'last_name': 'conroy',
        'account_name': 'camp lakes inc',
        'email1': 'jim@camplakes.org',
        'phone_work': None,
        'phone_mobile': None,
        'primary_address_state': 'wi'}, {11}),
]]
matches = scoreDuplicates(candidate_records, linker.data_model, linker.classifier)
print(matches)

Note, each record has three elements. The last one is a set of ids. I am not sure what that is for, but as long as the sets for the two records are disjoint, the scoring function will work as expected.

Error: Is the data you are trying to match like the data you trained on?

This error is somewhat frustrating. It is possible for a new record to be sufficiently unique that you get this error. My initial approach was to put the match call in a “try” block and return [] when the error occurred.
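
A sketch of that workaround, assuming the error surfaces as a ValueError and that gazetteer and new_record are set up as in the earlier snippets (with gazetteer.index() already called on the canonical data):

try:
    matches = gazetteer.match({'new_1': new_record}, n_matches=1)
except ValueError:
    # "Is the data you are trying to match like the data you trained on?"
    matches = []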

However, it is surprisingly easy to have a new record that is a duplicate and still get this error. In my case, I was taking data from a web form and looking for it in a CRM. I trained on records that were already in the CRM. One of the fields was “organization name”. This is a required field in the CRM, but not in the web form. That caused the above error any time “organization name” was blank on the web form, even if all the other fields were perfect matches. I was able to solve this by using markPairs to manually add some matches that did not have organization names.

Error: Affinegap ZeroDivisionError

File "affinegap/affinegap.pyx", line 115, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1788)
 File "affinegap/affinegap.pyx", line 134, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1615)
ZeroDivisionError: float division

This occurs when two empty strings are compared. In my case, I was adding some marked pairs and accidentally used empty strings for missing values instead of None, as the docs specify.
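
For example (illustrative records, not real data):

# Wrong: two empty strings make the affine gap comparator divide by zero.
bad_pair = ({'account_name': ''}, {'account_name': ''})

# Right: missing values should be None.
good_pair = ({'account_name': None}, {'account_name': None})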


Blender for Python Programmers

So you want to use Python to do stuff in Blender? Here are my notes as I go down the rabbit hole.

In Blender, Python is a tool for automating things you could do manually. The best way to get started is to learn how to do things manually, and then, once you have that figured out, write the Python script. Here is a great tutorial to get you started.

Official Docs

https://www.blender.org/api/blender_python_api_current/

Make sure to check out the “Gotchas” section.

Naming Stuff

From the “Gotchas” section: don’t name stuff: https://www.blender.org/api/blender_python_api_2_78a_release/info_gotcha.html#naming-limitations

Using PyCharm

So you want to use the PyCharm debugger on your Blender code? Here is how:

https://code.blender.org/2015/10/debugging-python-code-with-pycharm/

Running a Blender Python Script Non-interactively

From the command line, enter:

blender --background --python myscript.py
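
For example, myscript.py can be as small as this sketch, which just adds a cube to the default scene and saves the file (the output path is arbitrary):

# myscript.py -- run with: blender --background --python myscript.py
import bpy

# Add a cube at the origin and save the result.
bpy.ops.mesh.primitive_cube_add(location=(0.0, 0.0, 0.0))
bpy.ops.wm.save_as_mainfile(filepath='/tmp/cube.blend')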

Writing Modular Code

The first step is to configure the Python paths. Pro tip: when playing with Python paths, it is sometimes helpful to print them:


import sys

print(sys.path)

This page shows the layout of the default paths. If you choose to put some *.py files in the ./scripts/modules directory, do not forget to put an __init__.py file in there too.

import sys
sys.path.append('/home/chuck/.config/blender/scripts/modules')

import bpy
bpy.__path__  # in the Blender Python console, this shows where bpy lives

Rotations

This is not so much a Blender thing as a 3D modelling thing. There are lots of ways of specifying rotations, each with its own pros and cons. One important way is with quaternions. I have studied a lot of math in my day, but somehow quaternions never came up. To make matters worse, the math behind them is non-trivial. Here is the best free, online tutorial I could find.
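
For what it’s worth, applying a quaternion rotation to the active object looks roughly like this (a sketch; it assumes an object is selected):

import math

import bpy
from mathutils import Quaternion

obj = bpy.context.object            # the active object
obj.rotation_mode = 'QUATERNION'    # the default is 'XYZ' Euler angles
# Build a quaternion from an axis and an angle: 90 degrees about Z.
obj.rotation_quaternion = Quaternion((0.0, 0.0, 1.0), math.radians(90))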

General Tutorials

The Proving Ground

http://danielhnyk.cz/creating-animation-blender-using-python/

Django, Nginx, Invalid HTTP_HOST header

From time to time, I get an email from my Django server complaining about “Invalid HTTP_HOST” and looking like some hacker is trying to mess with my gunicorn sock. Here is the relevant part of the email:

Invalid HTTP_HOST header: u'/webapps/MY_SITE/run/gunicorn.sock:'. The domain name provided is not valid according to RFC 1034/1035.

Request repr():
<WSGIRequest
path:/arlo/,
GET:<QueryDict: {}>,
POST:<QueryDict: {}>,
COOKIES:{},
META:{'HTTP_ACCEPT': '*/*',
 'HTTP_CONNECTION': 'close',
 'HTTP_USER_AGENT': 'masscan/1.0 (https://github.com/robertdavidgraham/masscan)',
 'HTTP_X_FORWARDED_FOR': '120.88.152.6',
 'PATH_INFO': u'/arlo/',
 'QUERY_STRING': '',
 'RAW_URI': '/arlo/',
 'REMOTE_ADDR': '',
 'REQUEST_METHOD': 'GET',
 'SCRIPT_NAME': u'',
 'SERVER_NAME': '/webapps/MY_SITE/run/gunicorn.sock',
 'SERVER_PORT': '',
 'SERVER_PROTOCOL': 'HTTP/1.0',
 'SERVER_SOFTWARE': 'gunicorn/19.6.0',

How is this hacker getting the path to my gunicorn sock?

It turns out this is happening because I am using $http_host in my nginx config file:

upstream wsgi_server {
 server unix:/webapps/MY_SITE/run/gunicorn.sock fail_timeout=0;
}

server {
 listen 80;
 server_name MY_DOMAIN_NAME;

 location / {
     proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
     proxy_set_header Host $http_host;
     proxy_redirect off;
     if (!-f $request_filename) {
        proxy_pass http://wsgi_server;
        break;
     }
 }
}

When a request is made to the server and the HTTP Host is empty, nginx sets the HTTP Host to the gunicorn sock.

I can generate this error using curl:

curl -H "HOST:" MY_DOMAIN_NAME -0 -v

This sends a request without an HTTP Host header. The -0 causes curl to use HTTP version 1.0. If you do not set this, the request will use HTTP version 1.1, which requires a Host header, so the request will be rejected immediately and will not generate the error.

The solution is to replace $http_host with $host (as pointed out on Stack Overflow). When the HTTP Host is missing, $host takes on the value of the “server_name” directive. This is a valid domain name and is the one that should be used.
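
So the proxy header line in the config above becomes:

proxy_set_header Host $host;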