crawl: The Chrome Web Store Crawler
tasks
Tasks for Celery workers.
Beat Tasks
Beat tasks are those that run on a periodic basis, depending on the configuration in celeryconfig.py or any cron jobs set up in the Ansible playbooks. Beat tasks only initiate the workflow by creating the jobs; they don’t actually do the work for each task.
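As an illustration only, a beat entry in celeryconfig.py might look something like the sketch below; the task name and the schedule are assumptions, not dbling’s actual configuration:

    # celeryconfig.py (sketch; names and interval are hypothetical)
    from celery.schedules import crontab

    beat_schedule = {
        'refresh-crx-list': {
            # Assumed entry task that only creates jobs for the workers
            'task': 'crawl.tasks.start_list_download',
            # Assumed period: once a day at 02:00
            'schedule': crontab(hour=2, minute=0),
        },
    }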
Entry Points
Entry points are where an actual worker begins its work. A single task corresponds to a specific CRX file. The task function dictates what operations are performed on the CRX. Each operation is represented by a specific worker function (as described below).
Worker Functions
Worker functions each represent a discrete action to be taken on a CRX file.
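To make the division of labor concrete, here is a sketch of an entry point calling worker functions; process_crx and the sequence of steps are hypothetical, and only save_crx is documented on this page:

    from celery import Celery
    from munch import Munch

    from crawl.webstore_iface import save_crx

    app = Celery('crawl')

    @app.task
    def process_crx(crx_id, download_url):
        """Hypothetical entry point: run each operation for one CRX file."""
        crx_obj = Munch(id=crx_id)
        # Each worker function performs one discrete action on the CRX and
        # returns the updated crx_obj (download, unpack, profile, ...).
        crx_obj = save_crx(crx_obj, download_url)
        return crx_obj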
Helper Tasks and Functions
These functions provide additional functionality that doesn’t fit in any of the above categories.
db_iface
webstore_iface
Chrome Web Store interface for dbling.
exception crawl.webstore_iface.ListDownloadFailedError(*args, **kwargs)
    Raised when the list download fails.
    Initialize RequestException with request and response objects.

    Raised when an extension isn’t downloadable.
exception crawl.webstore_iface.BadDownloadURL
    Raised when the ID is valid but we can’t download the extension.
exception crawl.webstore_iface.VersionExtractError
    Raised when extracting the version number from the URL fails.
class crawl.webstore_iface.DownloadCRXList(ext_url, *, return_count=False, session=None)
    Generate list of extension IDs downloaded from Google.

    As a generator, this is designed to be used in a for loop. For example:

    >>> crx_list = DownloadCRXList(download_url)
    >>> for crx_id in crx_list:
    ...     print(crx_id)

    The list of CRXs is downloaded just before the first item is generated. In other
    words, instantiating this class doesn’t start the download; iterating over the
    instance does. This matters because downloading the list is quite time consuming.

    Parameters:
    - ext_url (str) – Specially crafted URL that will let us download the list of
      extensions.
    - return_count (bool) – When True, yields tuples of the form (crx_id, job_number),
      where job_number is the index of the ID plus 1. This way, the job number of the
      last ID yielded will be the same as len(DownloadCRXList).
    - session (requests.Session) – Session object to use when downloading the list.
      If None, a new requests.Session object is created.
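    As a usage sketch of return_count (assuming download_url is the specially crafted
    list URL):

    >>> crx_list = DownloadCRXList(download_url, return_count=True)
    >>> for crx_id, job_number in crx_list:
    ...     print(job_number, crx_id)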
    download_ids()
        Starting point for downloading all CRX IDs. This function creates an event
        loop and starts the downloads asynchronously.

        Return type: None
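        The internals aren’t shown here, but a minimal sketch of the pattern described
        (create an event loop, then run the downloads asynchronously) could look like
        the following; the _fetch_page helper, the use of aiohttp, and the page-URL
        handling are all assumptions:

            import asyncio

            import aiohttp  # assumed HTTP client; any async client would do

            async def _fetch_page(session, url):
                # Hypothetical helper: fetch one page of the extension list.
                async with session.get(url) as resp:
                    return await resp.text()

            async def _fetch_all(urls):
                async with aiohttp.ClientSession() as session:
                    return await asyncio.gather(
                        *(_fetch_page(session, u) for u in urls))

            def download_all(urls):
                # Create an event loop and run the downloads asynchronously.
                loop = asyncio.new_event_loop()
                try:
                    return loop.run_until_complete(_fetch_all(urls))
                finally:
                    loop.close()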
crawl.webstore_iface.save_crx(crx_obj, download_url, save_path=None, session=None)
    Download the CRX and save it in the save_path directory.

    The saved file will have the format: <extension ID>_<version>.crx

    If save_path isn’t given, this defaults to a directory called “downloads” in the
    CWD.

    Adds the following keys to crx_obj:
    - version: Version number of the extension, as obtained from the final URL of the
      download. This may differ from the version listed in the extension’s manifest.
    - filename: The basename of the CRX file (not the full path).
    - full_path: The location (full path) of the downloaded CRX file.

    Parameters:
    - crx_obj (munch.Munch) – Previously collected information about the extension.
    - download_url (str) – The URL template that already contains the correct Chrome
      version information and {} where the ID goes.
    - save_path (str or None) – Directory where the CRX should be saved.
    - session (requests.Session or None) – Optional Session object to use for HTTP
      requests.

    Returns: Updated version of crx_obj with version, filename, and full_path
    information added. If the download wasn’t successful, not all of these may have
    been added, depending on when it failed.

    Return type: munch.Munch
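    A usage sketch tying save_crx to the exceptions above; the ID and URL template are
    placeholders, and whether save_crx raises these exceptions itself (rather than
    handling them internally) is an assumption:

        from munch import Munch

        from crawl.webstore_iface import (BadDownloadURL, VersionExtractError,
                                          save_crx)

        # Placeholder values; a real template already embeds the Chrome version
        # and has {} where the extension ID goes.
        crx_obj = Munch(id='<32-char extension ID>')
        download_url = '<download URL template with {} for the ID>'

        try:
            crx_obj = save_crx(crx_obj, download_url, save_path='downloads')
            print(crx_obj.version, crx_obj.filename, crx_obj.full_path)
        except BadDownloadURL:
            pass  # valid ID, but the extension couldn't be downloaded
        except VersionExtractError:
            pass  # downloaded, but the version couldn't be parsed from the URL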