Headless Chrome with Python and Selenium

In version 59, Google Chrome acquired the option to run headlessly! Here is what works for me (Ubuntu: 14.04, Chrome version: 60.0.3112.78, chromedriver: 2.31) :

from selenium import webdriver

def get_page_html(url, headless=False, screen_shot_path=None):
    if headless:
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--window-size=1200x2900')
        options.add_argument('--disable-gpu')
        selenium = webdriver.Chrome("/usr/local/share/chromedriver", chrome_options=options)
    else:
        selenium = webdriver.Chrome("/usr/local/share/chromedriver")
        selenium.set_window_size(1500, 900)

    selenium.get(url)
    page_html = selenium.page_source
    if screen_shot_path:
        selenium.save_screenshot(screen_shot_path)

    selenium.quit()
    return page_html

Make sure your version of chromedriver is up-to-date. Also, notice that the options.add_argument() expects the arguments to be as if you were running google-chrome from the command line (e.g. prefix with –).

Advertisements

Daemonizing Django-RQ using Supervisor

I was trying to set up a task to be run from Django-RQ. The task involved scrapping a webpage using Selenium and Google Chrome. It worked great in development, but not in production. The error message indicated that there were problems starting Chrome.

One big difference between dev and production was in production I was daemonizing Django-RQ using Supervisor. Some queued tasks would run. Just not the ones involving Selenium. The clue came when I stopped Django-RQ using supervisorctl and then started it from the command line. Now the Selenium tasks worked.

I solved the problem by adding this snippet to the top of the task that used Selenium:


import os
import json

json.dump(os.environ['PATH'].split(':'), open('debug_file.json', 'wb'))

This revealed that the environment PATH when running from the command line was much different that that when running Django-RQ from Supervisor. Adding some of those paths to the Supervisor config solved the problem.

[program:django_rq]
command= {{ virtualenv_path }}/bin/python manage.py rqworker high default low
stdout_logfile = /var/log/redis/redis_6379.log

numprocs=1

directory={{ django_manage_path }}
environment = DJANGO_SETTINGS_MODULE="{{ django_settings_import }}",PATH="{{ virtualenv_path }}/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
user = vagrant
stopsignal=TERM

autostart=true
autorestart=true