crawl: The Chrome Web Store Crawler

tasks

Tasks for Celery workers.

Beat Tasks

Beat tasks run periodically, according to the configuration in celeryconfig.py or any cron jobs set up in the Ansible playbooks. Beat tasks only initiate the workflow by creating the jobs; they don’t do the work for each task.
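For illustration, a periodic entry in celeryconfig.py might look roughly like the sketch below. The entry name, the task path crawl.tasks.start_list_download, and the six-hour interval are all hypothetical; the project's real schedule is defined in its own configuration and may use the older uppercase setting name instead.

```python
from datetime import timedelta

# Hypothetical beat schedule. The entry name, task path, and interval are
# assumptions made for this example, not values taken from the project.
beat_schedule = {
    "start-list-download": {
        "task": "crawl.tasks.start_list_download",  # assumed task name
        "schedule": timedelta(hours=6),  # Celery also accepts seconds or a crontab
    },
}
```

An entry like this only enqueues the beat task; the per-CRX work is still done by the entry points and worker functions described in the following sections.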

Entry Points

Entry points are where an actual worker begins its work. A single task corresponds to a specific CRX file. The task function dictates what operations are performed on the CRX. Each operation is represented by a specific worker function (as described below).

Worker Functions

Worker functions each represent a discrete action to be taken on a CRX file.
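The relationship between an entry point and its worker functions can be sketched in plain Python, without Celery. Every name here (download, extract, process_crx) is hypothetical and stands in for whatever the real tasks module defines; the point is only that the entry point decides which discrete actions run on one CRX, and in what order.

```python
from typing import Callable, Dict, List

CrxInfo = Dict[str, object]


def download(crx: CrxInfo) -> CrxInfo:
    """Hypothetical worker function: pretend to download the CRX."""
    return {**crx, "downloaded": True}


def extract(crx: CrxInfo) -> CrxInfo:
    """Hypothetical worker function: pretend to unpack the CRX."""
    return {**crx, "extracted": True}


def process_crx(crx: CrxInfo, steps: List[Callable[[CrxInfo], CrxInfo]]) -> CrxInfo:
    """Hypothetical entry point: run each worker function on one CRX, in order."""
    for step in steps:
        crx = step(crx)
    return crx


result = process_crx({"id": "a" * 32}, [download, extract])
```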

Helper Tasks and Functions

These functions provide additional functionality that doesn’t fit in any of the above categories.

db_iface

webstore_iface

Chrome Web Store interface for dbling.

exception crawl.webstore_iface.ListDownloadFailedError(*args, **kwargs)

Raised when the list download fails.

Initialize RequestException with request and response objects.

exception crawl.webstore_iface.ExtensionUnavailable

Raised when an extension isn’t downloadable.

exception crawl.webstore_iface.BadDownloadURL

Raised when the ID is valid but we can’t download the extension.

exception crawl.webstore_iface.VersionExtractError

Raised when extracting the version number from the URL fails.

class crawl.webstore_iface.DownloadCRXList(ext_url, *, return_count=False, session=None)

Generate list of extension IDs downloaded from Google.

As a generator, this is designed to be used in a for loop. For example:

>>> crx_list = DownloadCRXList(download_url)
>>> for crx_id in crx_list:
...     print(crx_id)

The list of CRXs is downloaded just before the first item is generated. In other words, instantiating this class doesn’t start the download; iterating over the instance does. This matters because downloading the list is quite time-consuming.

Parameters:
  • ext_url (str) – Specially crafted URL that will let us download the list of extensions.
  • return_count (bool) – When True, each item generated is a tuple of the form (crx_id, job_number), where job_number is the index of the ID plus 1. This way, the job number of the last ID generated equals len(DownloadCRXList).
  • session (requests.Session) – Session object to use when downloading the list. If None, a new requests.Session object is created.
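The deferred download and the return_count behavior can be illustrated with a small stand-in class. This is a sketch, not the DownloadCRXList implementation: LazyIDList, its fetch callable, the sorted ordering, and the choice to let len() trigger the fetch are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, Union


class LazyIDList:
    """Stand-in for illustration only; not the DownloadCRXList implementation."""

    def __init__(self, fetch: Callable[[], Iterable[str]], *, return_count: bool = False):
        self._fetch = fetch  # called once, when iteration first needs the IDs
        self._return_count = return_count
        self._ids: List[str] = []
        self._downloaded = False

    def _ensure_downloaded(self) -> None:
        if not self._downloaded:
            self._ids = sorted(set(self._fetch()))  # ordering is an assumption
            self._downloaded = True

    def __iter__(self) -> Iterator[Union[str, Tuple[str, int]]]:
        # Generator body: nothing runs until the first item is requested.
        self._ensure_downloaded()
        for job_number, crx_id in enumerate(self._ids, start=1):
            yield (crx_id, job_number) if self._return_count else crx_id

    def __len__(self) -> int:
        # Assumption: in this stand-in, len() fetches the list if needed.
        self._ensure_downloaded()
        return len(self._ids)


calls: List[str] = []


def fake_fetch() -> List[str]:
    """Fake download that records when it was called."""
    calls.append("fetched")
    return ["bbb", "aaa", "ccc"]


crx_list = LazyIDList(fake_fetch, return_count=True)
fetched_before_iteration = bool(calls)  # instantiating did not fetch anything
pairs = list(crx_list)  # iterating triggers the single fetch
```

With return_count enabled, the job number of the last pair matches len(crx_list), which is what makes the count useful for progress reporting.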
download_ids()

Starting point for downloading all CRX IDs.

This function actually creates an event loop and starts the downloads asynchronously.

Return type: None

_async_download_lists()

Download, loop through the list of lists, combine IDs from each.

Return type: None

_dl_parse_id_list(list_url)

Download the extension list at the given URL, return set of IDs.

Parameters: list_url (str) – URL of an individual extension list.
Returns: Set of CRX IDs.
Return type: set
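The three methods above follow a common pattern: a synchronous starting point creates an event loop, a coroutine fetches every list concurrently, and the per-list ID sets are combined. The sketch below shows that pattern with fake in-memory data instead of network calls. The function names, the placeholder URLs, and the fact that the starting point returns the combined set (the documented download_ids returns None) are assumptions made for the example.

```python
import asyncio
from typing import Dict, List, Set

# Fake data standing in for the downloaded lists; the URLs are placeholders.
FAKE_LISTS: Dict[str, List[str]] = {
    "https://example.invalid/list-1": ["aaa", "bbb"],
    "https://example.invalid/list-2": ["bbb", "ccc"],
}


async def parse_id_list(list_url: str) -> Set[str]:
    """Pretend to download one list and return its IDs as a set."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return set(FAKE_LISTS[list_url])


async def download_lists(list_urls: List[str]) -> Set[str]:
    """Fetch every list concurrently and combine the IDs from each."""
    id_sets = await asyncio.gather(*(parse_id_list(url) for url in list_urls))
    return set().union(*id_sets)


def download_ids(list_urls: List[str]) -> Set[str]:
    """Synchronous starting point: run the coroutine on a fresh event loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(download_lists(list_urls))
    finally:
        loop.close()


all_ids = download_ids(list(FAKE_LISTS))
```

Returning sets from each per-list call means IDs that appear in more than one list are deduplicated when the sets are combined.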
crawl.webstore_iface.save_crx(crx_obj, download_url, save_path=None, session=None)

Download the CRX, save in the save_path directory.

The saved file will have the format: <extension ID>_<version>.crx

If save_path isn’t given, this will default to a directory called “downloads” in the CWD.

Adds the following keys to crx_obj:

  • version: Version number of the extension, as obtained from the final URL of the download. This may differ from the version listed in the extension’s manifest.
  • filename: The basename of the CRX file (not the full path)
  • full_path: The location (full path) of the downloaded CRX file
Parameters:
  • crx_obj (munch.Munch) – Previously collected information about the extension.
  • download_url (str) – The URL template that already contains the correct Chrome version information and {} where the ID goes.
  • save_path (str or None) – Directory where the CRX should be saved.
  • session (requests.Session or None) – Optional Session object to use for HTTP requests.
Returns: Updated version of crx_obj with version, filename, and full_path information added. If the download wasn’t successful, some of these keys may be missing, depending on when it failed.

Return type: munch.Munch
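How the three added keys relate to each other can be sketched as follows. This is not the save_crx implementation: describe_download is a hypothetical helper, the final URL is assumed to end in something like extension_1_2_3_4.crx, and converting the underscores to dots is likewise an assumption. A plain dict is used in place of munch.Munch so the example needs only the standard library.

```python
import os
import re
from typing import Optional

# Assumed shape of the final download URL: ".../extension_1_2_3_4.crx".
VERSION_RE = re.compile(r"extension_([\d_]+)\.crx")


def describe_download(crx_id: str, final_url: str, save_path: Optional[str] = None) -> dict:
    """Build the version, filename, and full_path values for one download."""
    match = VERSION_RE.search(final_url)
    if match is None:
        # The module defines VersionExtractError for this case;
        # ValueError is a stand-in here.
        raise ValueError("could not extract version from URL")
    version = match.group(1).replace("_", ".")  # dotted form is an assumption
    filename = "{}_{}.crx".format(crx_id, version)  # <extension ID>_<version>.crx
    # Default to a "downloads" directory in the CWD, as documented above.
    directory = save_path if save_path is not None else os.path.join(os.getcwd(), "downloads")
    return {
        "version": version,
        "filename": filename,
        "full_path": os.path.join(directory, filename),
    }


info = describe_download(
    "a" * 32,
    "https://example.invalid/crx/extension_1_2_3_4.crx",  # placeholder URL
    save_path="/tmp/crx",
)
```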