Django Long Running Processes

The most commonly suggested solution for long running processes is to use Celery. I suspect that if you need scalability or high volume, etc… Celery is the best solution. That said, I have been down the Celery rabbit hole more than once. It has never been pleasant. Since my needs are more modest, maybe there is a better alternative?

My needs involve a process that might run for 15 minutes or so. The process might run a dozen times a day and be launched by as many users. The process must be launch-able from the website by authorized users.

I have solved this problem by going to the polar opposite of Celery – cron. Every minute, cron would launch a custom Django management command. That command would look in a database table for tasks and get its input data from the database. When a task was completed, that fact was written back to the database table. Honestly, this approach has worked well. Nevertheless, I always wonder if there is a stable, simple, robust solution that lies somewhere between cron and Celery.
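
For reference, here is a minimal sketch of that cron-driven approach (the Task model and run_task function are hypothetical stand-ins, not my actual code):

# myapp/management/commands/process_tasks.py
from django.core.management.base import BaseCommand

from myapp.models import Task       # assumed model with a status field
from myapp.tasks import run_task    # assumed function that does the real work

class Command(BaseCommand):
    help = 'Run any pending tasks found in the database (launched by cron).'

    def handle(self, *args, **options):
        for task in Task.objects.filter(status='pending'):
            task.status = 'running'
            task.save()
            try:
                run_task(task)      # the part that can take 15 minutes
                task.status = 'done'
            except Exception:
                task.status = 'failed'
            task.save()

Cron then calls python manage.py process_tasks once a minute.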

Maybe Redis, RQ, and django-rq? These are my notes so that a year from now, when this issue comes up again, I can get up to speed quickly.

Step 1: Install Redis and Start Redis Server

These instructions are pretty good.
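
(For what it's worth, on Ubuntu/Debian the install is usually just the following; adjust for your platform:)

sudo apt-get install redis-server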

Step 2: Is Redis Server Working?

Maybe you installed Redis Server a long time ago and you want to see if it’s still working? Go here.

Or you could type:

$ redis-cli ping
PONG

Step 3: Install RQ

pip install rq

Step 4: Install and Configure django-rq

Go here.
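
From memory, the gist of the configuration is: pip install django-rq, add it to INSTALLED_APPS, define your queues in settings, and wire up the URLs for the statistics page. Roughly (check the django-rq README for the current syntax):

# settings.py
INSTALLED_APPS = [
    # ... the usual apps ...
    'django_rq',
]

RQ_QUEUES = {
    'high':    {'HOST': 'localhost', 'PORT': 6379, 'DB': 0},
    'default': {'HOST': 'localhost', 'PORT': 6379, 'DB': 0},
    'low':     {'HOST': 'localhost', 'PORT': 6379, 'DB': 0},
}

# urls.py - exposes the queue statistics page mentioned below
from django.conf.urls import include, url
urlpatterns += [url(r'^django-rq/', include('django_rq.urls'))]

The high/default/low queue names match the rqworker command in the supervisor config below.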

Step 5: Read the RQ Docs

Seriously – read the RQ docs. They are brief and to-the-point.
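
As a teaser, queuing a job from a view looks roughly like this (my_long_task is a hypothetical function; get_queue and enqueue are the actual API):

# views.py - sketch of launching a long-running job from the website
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse
import django_rq

from myapp.tasks import my_long_task   # hypothetical 15-minute function

@login_required
def launch_task(request):
    queue = django_rq.get_queue('default')
    queue.enqueue(my_long_task, request.user.id)   # returns immediately; a worker picks it up
    return HttpResponse('Task queued.')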

Step 6: Daemonize the RQ Workers

If you use supervisord, here is the Ansible template I use to do that:

[program:django_rq]
command={{ virtualenv_path }}/bin/python manage.py rqworker high default low
stdout_logfile=/var/log/redis/redis_6379.log
numprocs=1
directory={{ django_manage_path }}
environment=DJANGO_SETTINGS_MODULE="{{ django_settings_import }}",PATH="{{ virtualenv_path }}/bin"
user=vagrant
stopsignal=TERM
autostart=true
autorestart=true

[program:rqscheduler]
command={{ virtualenv_path }}/bin/python manage.py rqscheduler
stdout_logfile=/var/log/redis/rq_scheduler.log
numprocs=1
directory={{ django_manage_path }}
environment=DJANGO_SETTINGS_MODULE="{{ django_settings_import }}",PATH="{{ virtualenv_path }}/bin"
user=vagrant
stopsignal=TERM
autostart=true
autorestart=true

Reflections

I do not recall all the problems I had with Celery. After reviewing the RQ solution above, it is clear that one of its advantages is that the documentation is really good. Or at least it clearly and directly addressed what I was trying to do.

Additionally, I wish I had implemented this a long time ago. It is so easy to use. And it's so freeing to be able to run long processes.

Reflections Part Deux: Troubleshooting and Gotchas

It’s coming back to me. The supervisor config in the original post started the daemon OK. But it turns out there was an error in the config that caused the queued processes to fail. Finding and fixing that bug was a pain in the ass. Maybe my troubles with Celery were really troubles with supervisor? Down the rabbit hole we go!

It turns out that the Django RQ Queue Statistics are helpful for debugging. They show failed tasks along with a Python traceback! Very nice. In my case, I was getting the error:

ImportError: No module named XXXX

Clearly one of my paths in the supervisor conf file was wrong. Time to start hacking:

  1. Edit conf file
  2. Run supervisorctl stop django_rq
  3. Run supervisorctl start django_rq
  4. Queue a task
  5. It failed again? How is that possible? Back to step 1

GOTCHA! After a while you notice the changes you are making are not having any effect. And then you recall that to reload the config file you must run:

service supervisor restart

Now my config file works. All I have to do is figure out which of the ever-kludgier hacks I made can be removed. The config file above has been updated.

Son of Reflections Part Deux – Adding PATH to the Supervisor Config

I thought I had it working. Then, when I added a slightly more complex task that interacted with the database, it failed with an ImportError. After flailing around for a while, I found that adding a PATH entry to the supervisor environment setting solved the problem.

During my flailing, I found this blog post. Lots of great ideas.

Still Falling Down the Rabbit Hole – Logging to the Rescue

Everything was working almost everywhere… except with the daemonized workers on the server. Luckily, Django-RQ now comes with logging. I implemented the logging in the Django settings files as per the docs, restarted the dev server, and… no logging. It turns out you have to restart the workers.

Also, although the docs show the use of a logging handler class made for RQ (rq.utils.ColorizingStreamHandler), it turns out you can use logging.FileHandler, which is what you want when debugging code run from a daemonized worker.
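
A sketch of what that might look like in the Django settings (the log file path is a placeholder; 'rq.worker' is the logger name the RQ workers use):

# settings.py
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'rq_file': {
            'level': 'DEBUG',
            'class': 'logging.FileHandler',
            'filename': '/var/log/redis/rq_worker_debug.log',   # placeholder path
        },
    },
    'loggers': {
        'rq.worker': {
            'handlers': ['rq_file'],
            'level': 'DEBUG',
        },
    },
}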

For what it’s worth, it turns out the problem was with the Python locale module. The docs say something about it not being thread safe. The function locale.getlocale() returned a value when the workers were run via the dev server, but it returned None when run from a daemonized worker.

Logging a Single Module in Python

The problem is that sometimes 3rd party modules have logging turned on. For example, I run into that when using Selenium for testing.

If you do not want to see any logger output, you can put this at the top of your module:

import logging
logging.disable(logging.CRITICAL)

That works fine, until you want to turn on logging for the module you are working on. Here is code that silences all the other loggers while leaving the root logger at DEBUG. The last line sends a logger message for this module to the console.

import logging

# Configure the root logger first so DEBUG messages reach the console
logging.basicConfig(level=logging.DEBUG)

# Silence every logger that other modules have already registered
for logger_name in logging.Logger.manager.loggerDict.keys():
    logging.getLogger(logger_name).setLevel(logging.CRITICAL)

logging.debug('hello from my module')

Celery Logging

When you start Celery, there is a flag for setting the logging level:

celery worker -A my_app -E -l WARNING --concurrency=1

Simple enough. There is a similar flag for Celery beat. The problem is that even if you set the level fairly high and you are running Celery beat, you will still get DEBUG- and INFO-level logging every time Celery beat runs. This can create a huge log file of useless information.

It seems there is a bug somewhere in the logging system. This post suggests that django-debug-toolbar can be the culprit. Unfortunately for me, I am not running django-debug-toolbar. Apparently some other 3rd party app is causing the problem.

My solution is simple. I created a cron job to clear the file every hour, using the command:

cat /dev/null > /path/to/logfile
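
The crontab entry is something like this (top of every hour):

0 * * * * cat /dev/null > /path/to/logfile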

Pretty Print Django POST Data in Python Logger

Due to customer requirements, I have a webpage with several forms and dynamic formsets that are built on the client side using jQuery formset. Kind of reminds me of the Simpsons episode where Homer designs a car. But here at ROI Technologies LLC, we are not girly men. We laugh in the face of complexity. And every once in a while complexity bites us in the ass.

Obviously, the best solution is to write extensive tests. Unless, of course, you include cost in your calculation of “best”. An alternative to automated testing is using the Python logger to gather the info you will need to quickly fix a bug if one is encountered. That’s what this post is about.

Back to my Homer page: as you can imagine, the POST data is extensive. To output it in a readable format in the logs, use json, like this:

import json
import logging

logger = logging.getLogger(__name__)

try:
    is_valid = my_form.is_valid()
except Exception:
    # Grab the data to fix the bug
    logger.exception('Got exception while validating form:')
    logger.debug('Here is the post data:\n %s', json.dumps(request.POST, indent=4, sort_keys=True))
    raise  # back to crashing

The first logger line in the exception handler prints that message along with the traceback! The second logger line uses json to pretty print the Django POST data. Happy debugging.

Python Logging and Django

This post dives into some of the tricky aspects of the Python logging system. What makes them tricky is that they are either not documented or hard to find in the docs. This exploration into the bowels of logging was inspired by this code from the Django docs:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': True,
    'formatters': {
        'verbose': {
            'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s'
        },
        'simple': {
            'format': '%(levelname)s %(message)s'
        },
    },
    'filters': {
        'special': {
            '()': 'project.logging.SpecialFilter',
            'foo': 'bar',
        }
    },
    'handlers': {
        'null': {
            'level': 'DEBUG',
            'class': 'django.utils.log.NullHandler',
        },
        'console':{
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
            'formatter': 'simple'
        },
        'mail_admins': {
            'level': 'ERROR',
            'class': 'django.utils.log.AdminEmailHandler',
            'filters': ['special']
        }
    },
    'loggers': {
        'django': {
            'handlers': ['null'],
            'propagate': True,
            'level': 'INFO',
        },
        'django.request': {
            'handlers': ['mail_admins'],
            'level': 'ERROR',
            'propagate': False,
        },
        'myproject': {
            'handlers': ['console', 'mail_admins'],
            'level': 'INFO',
            'filters': ['special']
        }
    }
}

At this point it’s useful to review the Python logger hierarchy and its default behavior (for more info see this excellent post). The Python logging system consists of a dot-separated tree of loggers. In the example above, django.request is a child of django. If you send a logging message to django.request, it will handle the message and, since propagate is False, that will be the end of it. If propagate were True, the message would be handled and then passed up to the django logger. Cool. But how does the “myproject” logger relate to the django loggers? It turns out there is a top-level logger called the root logger. The django loggers and myproject pass their messages up to the root logger.

In one of your Python modules you might see something like this:

import logging
logging.debug('my message')

This statement sends the logging message directly to the root logger.

Cool, but here comes the tricky part. What does the root logger do with this message, since we have not defined a root logger? (To define a root logger, add a logger to the dict above whose name is the empty string '' – see the sketch after this list.) Here is my best guess at the defaults:

  • level = WARNING
  • handler = StreamHandler
  • format = %(levelname)s:%(name)s:%(message)s
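
For completeness, explicitly defining that root logger in the dict above would look something like this (a sketch; the handler choice is yours):

# The empty-string name is what makes this the root logger
LOGGING['loggers'][''] = {
    'handlers': ['console'],
    'level': 'WARNING',
}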

Time for more magic. What happens if you run these commands?

import logging
logger = logging.getLogger(__name__)
logger.debug('another message')

The getLogger call creates (or returns, if it already exists) a logger attached to the root logger, with a name equal to the module that contains the code. Note, you can name that logger anything you want. Giving it a name equal to the module name is especially helpful if you are aggregating the logging messages from multiple modules in one place (in this case, they all end up going to a StreamHandler).

FileHandler

With FileHandler, you specify the filename. But what directory does that file end up in? Not surprisingly, it’s the current working directory. However, this can cause some confusion if you run your code from different directories, for example from the root dir using cron and from the project dir at other times. This post shows how to specify the dir.
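
One way to avoid the surprise is to build an absolute path in the Django settings (a sketch; BASE_DIR is the variable from the default settings template, and the logs directory must already exist):

# settings.py
import os

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'class': 'logging.FileHandler',
            # absolute path, so it no longer depends on the working directory
            'filename': os.path.join(BASE_DIR, 'logs', 'debug.log'),
        },
    },
    'loggers': {
        '': {'handlers': ['file'], 'level': 'DEBUG'},
    },
}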