
developer guide

Felix Hamborg edited this page Dec 18, 2016 · 4 revisions

#news-please developer guide

This guide explains the inner workings of news-please and is aimed at developers and advanced users. The following sections describe the program flow and the architecture of the project.

##Program flow

Here is an overview of the program flow. The diagram is not a UML diagram or anything similar; it only serves to clarify how the program works:

[Figure: program flow diagram]

###Starting the crawler

After news-please is started, it launches a number of crawlers as subprocesses. The number of processes depends on the input (the number of targets) and is limited by the configuration.

Each subprocess runs single_crawler.py, which loads the settings defined for its crawler.
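The startup step can be sketched roughly as follows. This is an illustration only, not the project's actual code: the function name `start_crawlers`, the `max_processes` parameter and the arguments passed to each child are assumptions, and a stand-in child command replaces the real call to single_crawler.py so that the sketch runs on its own. How targets are distributed when there are more targets than processes is not covered here.

```python
import subprocess
import sys


def start_crawlers(targets, max_processes):
    """Start one subprocess per target, but never more than max_processes."""
    count = min(len(targets), max_processes)
    processes = []
    for index in range(count):
        # Stand-in child command (assumption). The real program would run
        # single_crawler.py here; its argument list is not documented above.
        command = [sys.executable, "-c", "import sys; print(sys.argv[1])", str(index)]
        processes.append(
            subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
        )
    return processes


if __name__ == "__main__":
    targets = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    for process in start_crawlers(targets, max_processes=2):
        output, _ = process.communicate()  # wait for the child to finish
        print("crawler", output.strip(), "exited with", process.returncode)
```

With three targets and a limit of two, only two child processes are started.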

###Scrapy - Spiders

As mentioned before, this project relies heavily on Scrapy 1.1, an easily modifiable crawler framework. The crawlers are implemented as Scrapy spiders located in the spider directory (./newscrawler/crawler/spiders/).

Several crawlers are currently implemented. For further information on how each spider works, read here.

###Heuristics

We use multiple heuristics to detect whether a site is an article or not. All heuristics are located in ./newscrawler/helpers/heuristics.py; if you add your own, they should go there as well.

Heuristics can be enabled or disabled per site, and the way a heuristic works can also be adjusted per site.

Heuristics must return a boolean, a string, an int or a float. For each heuristic, a value can be configured that the heuristic's return value must match. More background information about the heuristics can be found here.

For further information, read the [Heuristics]-Section of the Configuration page.
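The matching idea described above can be sketched as follows. The heuristic names (`og_type`, `headline_count`), the page dictionary and the exact-equality comparison are assumptions made for illustration; the real heuristics and their configuration format are described on the Configuration page.

```python
def og_type(page):
    """Hypothetical heuristic: True if the page declares itself an article."""
    return page.get("og_type") == "article"


def headline_count(page):
    """Hypothetical heuristic: number of headlines found on the page."""
    return len(page.get("headlines", []))


# Registry of available heuristics (names are assumptions).
HEURISTICS = {"og_type": og_type, "headline_count": headline_count}


def is_article(page, enabled):
    """Return True if every enabled heuristic returns its configured value."""
    for name, expected in enabled.items():
        actual = HEURISTICS[name](page)
        if actual != expected:
            return False
    return True


# Per-site configuration: only the listed heuristics are enabled for this site.
site_config = {"og_type": True, "headline_count": 1}
page = {"og_type": "article", "headlines": ["Breaking news"]}
print(is_article(page, site_config))  # True
```

Leaving a heuristic out of `site_config` disables it for that site, which mirrors the per-site enabling and disabling mentioned above.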

###Scrapy - Pipelines

Sites that pass the heuristics (from now on called articles) are handed to the pipelines. Which pipelines are enabled, and in which order they run, can be set in the [Scrapy] section of newscrawler.cfg.

news-please offers several pipeline modules to filter, edit and save scraped articles. If you're interested in developing your own, make sure to add them to pipelines.py.
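A custom pipeline might look like the minimal sketch below. In Scrapy, a pipeline is a plain class with a `process_item(self, item, spider)` method, and the `ITEM_PIPELINES` setting maps class paths to an integer that determines the order. The class name `TitleCleanup`, the `title` field and the module path are hypothetical, and how the setting is written in newscrawler.cfg is an assumption here.

```python
class TitleCleanup:
    """Hypothetical pipeline that normalizes whitespace in an article title."""

    def process_item(self, item, spider):
        # Scrapy calls this for every item; returning the item passes it
        # on to the next pipeline in the configured order.
        item["title"] = " ".join(item.get("title", "").split())
        return item


# Scrapy's ITEM_PIPELINES setting: class path -> order (lower runs first).
# The module path below is an assumption based on the file layout.
ITEM_PIPELINES = {
    "newscrawler.pipeline.pipelines.TitleCleanup": 100,
}

if __name__ == "__main__":
    article = {"title": "  Some   headline \n"}
    print(TitleCleanup().process_item(article, spider=None))
```

A pipeline that should reject an article would raise Scrapy's `DropItem` exception instead of returning the item; that is left out here to keep the sketch free of dependencies.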

##Files

The project uses a simple file hierarchy. Classes should only rely on classes that are stored in the same directory or in child directories.

  • __init__.py (empty [1])

  • .gitignore

  • init-db.sql (Setup script for the optional MySQL database)

  • README.md

  • LICENSE.txt

  • requirements.txt (simple Python requirements.txt)

  • single_crawler.py (A single crawler-manager)

  • __main__.py (Entry point, manages all crawlers)

  • config/

  • newscrawler/

    • __init__.py (empty [1])

    • config.py (Reading and parsing the config files (default: sitelist.json and config.cfg))

    • helper.py (Helper class, containing objects of classes in helper_classes/ for passing to the crawler-spiders)

    • crawler/ (containing mostly basic crawler-logic and scrapy-functionality)

    • helper_classes/

      • __init__.py (empty [1])

      • heuristics.py (heuristics used to detect articles)

      • parse_crawler.py (helper class for the crawlers' parse method)

      • savepath_parser.py (helper-class for saving files)

      • url_extractor.py (URL-Extraction-helper)

      • sub_classes/

        • __init__.py (empty [1])
        • heuristics_manager.py (class used in heuristics.py for easier configuration later on)
    • pipeline/

      • __init__.py (empty [1])
      • pipelines.py (Scrapy's pipeline functionality, handling database inserts, local storage, wrong HTTP codes ...)
      • extractor/ (Additional resources needed for the ArticleMasterExtractor)

[1]: These files are empty but required, because otherwise Python would not recognize these directories as packages.
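As listed above, config.py reads and parses two files, by default sitelist.json and config.cfg. A rough sketch of that kind of loading with the standard library is shown below. The function name `load_config`, the section and key names and the JSON layout are all hypothetical and only serve as placeholders; the real files define their own structure.

```python
import configparser
import json

# Hypothetical file contents, inlined so the sketch runs without any files.
CFG_TEXT = """
[Crawler]
workers = 2
"""
SITELIST_TEXT = '{"base_urls": [{"url": "https://example.com"}]}'


def load_config(cfg_text, sitelist_text):
    """Parse INI-style settings and a JSON site list into plain Python objects."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # configparser returns all values as strings; conversion is left to the caller.
    settings = {section: dict(parser[section]) for section in parser.sections()}
    sites = json.loads(sitelist_text)
    return settings, sites


if __name__ == "__main__":
    settings, sites = load_config(CFG_TEXT, SITELIST_TEXT)
    print(settings)
    print(sites)
```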

##Classes

[Figure: class diagram]
