downdrag

Web scraping done thoroughly. A Python web scraping tool that uses lxml and is configured by Confuse.

model

The harvested information is first structured into these main fields:

itemindex
source
index
name
description
extrainfo
link
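
As an illustration of the model, the fields above can be pictured as one record per harvested item. This is only a sketch: the class name `Item`, the field types and the empty-string defaults are assumptions, not taken from the downdrag source.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Item:
    """One harvested record. Field names follow the model; types are assumed."""

    itemindex: int
    source: str
    index: int
    name: str
    description: str = ""
    extrainfo: str = ""
    link: str = ""


# Hypothetical record, shaped like an item from the "warehouse" profile.
item = Item(itemindex=1, source="warehouse", index=0, name="Blue widget 3x4")
row = asdict(item)  # plain dict, convenient for any of the outputs
field_names = [f.name for f in fields(Item)]
```

Keeping the record as a flat structure makes it straightforward to append the configured details as extra columns later on.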

configuration

These are the root keys of the configuration:

querier
outputs
details
profiles

example

outputs:
  csv:
    filename: scraping.csv
  mysql:
    connectioninfos:
      host: localhost
      port: 3306
      user: root
      password: 12345
      database: db
    tablename: scraping
  html:
    filename: scraping.html
    title: Scraping Table Export
    scripts:
    - default.js
    styles:
    - default.css
details:
  color:
    type: string
    conversion:
      process: value
      pattern: '^(red|green|blue)'
  size:
    default: 8
    type: int
    conversion:
      process: calculate
      pattern: '\b(\d+)x(\d+)\b'
      formula: '%s*%s'
  lot:
    type: float
    conversion:
      process: layer
      formula: 'size*23.5'
  delivery:
    conversion:
      process: schedule
      pattern: '\b[012]\d:[0-5]\d\b'
    source: extrainfo
profiles:
  warehouse:
    url: https://warehouse.com/
    items: //div[@class="item-box-info"]
    name: //div[@class="item-box-name"]/text()
    features: //div[@class="item-box-features"]/text()
    evaluator: ^\s*(.+)\s*$
    pathfinder:
      target: external
      link: https://warehouse.com/deliveries/
      type: fulltext
      pattern: '%D'
      indexer: startwith
      value: //div[@id="daily-schedules"]/descendant::text()
  library:
    url: https://library.com/
    items: //div[@class="item-book-info"]
    infos: //a[@class="item-details"]
    name: //div[@class="item-book-name"]/text()
    features: //div[@class="item-book-features"]/text()
    evaluator: ^\s*\w+: (.+)\s*$
    pathfinder:
      target: current
      type: showcase
      value: //div[@id="daily-signings"]/div[@id="%s"]/descendant::text()

querier (optional, defaults to plain)

How the data is queried:

mode: strategy of the querier
- plain: (default)
- secure: uses Tor; requires the torpy package
- dynamic: uses a Selenium web driver
driver: dynamic mode only; one of Firefox, Chrome, Ie or WebKitGTK
argsline: dynamic mode only; command-line arguments for the engine
cached: whether the querier is cached
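
The rules above can be sketched as a small validation step. The helper `normalize_querier` is hypothetical and not part of downdrag; it also assumes that `cached` defaults to false and that `driver` is mandatory in dynamic mode, neither of which is stated here.

```python
VALID_MODES = {"plain", "secure", "dynamic"}
VALID_DRIVERS = {"Firefox", "Chrome", "Ie", "WebKitGTK"}


def normalize_querier(section=None):
    """Return a copy of the querier section with defaults filled in.

    Hypothetical helper: the real tool may validate differently.
    """
    section = dict(section or {})
    mode = section.get("mode", "plain")  # the section is optional, plain by default
    if mode not in VALID_MODES:
        raise ValueError(f"unknown querier mode: {mode!r}")
    if mode == "dynamic":
        # Assumption: a driver must be named explicitly in dynamic mode.
        if section.get("driver") not in VALID_DRIVERS:
            raise ValueError("dynamic mode needs one of: " + ", ".join(sorted(VALID_DRIVERS)))
    elif "driver" in section or "argsline" in section:
        raise ValueError("driver and argsline only apply to dynamic mode")
    section["mode"] = mode
    section["cached"] = bool(section.get("cached", False))  # assumed default
    return section


default_querier = normalize_querier()
dynamic_querier = normalize_querier(
    {"mode": "dynamic", "driver": "Firefox", "argsline": "--headless"}
)
```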

outputs

Three types of output are currently available:

csv
- filename
html
- filename
- title: optional
- scripts: optional list of JS filenames
- styles: optional list of CSS filenames
mysql (requires the mysql-connector-python package)
- connectioninfos: dict of connection arguments
- tablename
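
To make the csv output concrete, here is a minimal sketch that writes records with the model fields as columns, followed by any configured details. The function `write_csv` and the column order are assumptions for illustration; downdrag's own writer may lay the file out differently.

```python
import csv
import io

# Column order assumed to follow the model fields.
FIELDS = ["itemindex", "source", "index", "name", "description", "extrainfo", "link"]


def write_csv(rows, stream, extra_fields=()):
    """Write the rows to the stream, leaving missing columns empty."""
    writer = csv.DictWriter(stream, fieldnames=[*FIELDS, *extra_fields], restval="")
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for the configured filename (scraping.csv).
buffer = io.StringIO()
write_csv(
    [{"itemindex": 1, "source": "warehouse", "index": 0,
      "name": "Blue widget", "link": "https://warehouse.com/"}],
    buffer,
    extra_fields=["color"],
)
csv_lines = buffer.getvalue().splitlines()
```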

details (optional)

Each piece of gathered information can be modelled with:

type: choice of save method
- string (default)
- int
- float
default: fallback value; otherwise the type's default is used
conversion: collection configuration
- process: choice of harvesting method
  - value: harvest directly
  - calculate: math formula from regex groups
  - layer: formula from previous fields
  - schedule: pair of time values on two fields
- pattern: for every process except layer
- formula: for calculate and layer processes
- case: pattern of current datetime for schedule process (optional)
- threshold: time of day which usually splits whole days (optional)
source: a different model field to harvest from
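
The value, calculate and layer processes can be sketched with the patterns and formulas from the example configuration. All function names below are hypothetical, and the small arithmetic evaluator is an assumption chosen to avoid `eval()`; it is not necessarily how downdrag evaluates formulas. The schedule process is left out.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(formula, names=None):
    """Evaluate +, -, *, / over numbers and known names, without eval()."""
    names = names or {}

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in names:
            return names[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {formula!r}")

    return walk(ast.parse(formula, mode="eval"))


def convert_value(text, pattern):
    """value: keep the matched text as it is."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


def convert_calculate(text, pattern, formula):
    """calculate: substitute the regex groups into the formula, then compute."""
    match = re.search(pattern, text)
    return evaluate(formula % match.groups()) if match else None


def convert_layer(formula, previous):
    """layer: compute from details that were already harvested."""
    return evaluate(formula, previous)


feature = "blue crate 3x4"  # made-up feature text
details = {}
details["color"] = convert_value(feature, r"^(red|green|blue)")
details["size"] = int(convert_calculate(feature, r"\b(\d+)x(\d+)\b", "%s*%s"))
details["lot"] = float(convert_layer("size*23.5", details))
```

Because layer reads earlier results by name, the order in which details are declared matters in this sketch: `size` has to be computed before `lot`.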

profiles

Multiple websites can be scraped.

url: index page of the items
pagers: XPath of the element leading to the next page, or a configuration for dynamic pagers
- action: event to trigger
- value: XPath of the element leading to the dynamic next page
items: XPath list of elements
infos: XPath of link to the element's details (defaults to the first link within the element)
name: XPath text value
features: XPath list of sub-elements
evaluator: regex to clean up the value
pathfinder: additional infos
- target: choice of source for the infos
  - current
  - external
  - index: items list page
- link: for external target
- type: choice of format for the infos, except for index
  - fulltext: simple text
  - showcase: html presentation
- pattern: except for the index target or the showcase type
- format: choice for structure of the infos if fulltext
  - now: current date and time
  - list: enumeration of items
- indexer: string method for matching the line if fulltext
- value: XPath text value, parametrized with the item name for the index target or the showcase type
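
The following sketch shows how the items, infos, name, features and evaluator keys could work together on one page. To stay dependency-free it uses the standard library's `xml.etree.ElementTree` instead of lxml, so the XPath expressions are limited to the ElementTree subset (a leading `.` and no `/text()`). The sample markup and the `scrape` and `clean` helpers are invented for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Invented markup shaped like the "warehouse" profile.
PAGE = """
<html><body>
  <div class="item-box-info">
    <a href="/items/1">details</a>
    <div class="item-box-name">  Blue widget</div>
    <div class="item-box-features">  3x4</div>
    <div class="item-box-features">  blue</div>
  </div>
  <div class="item-box-info">
    <a href="/items/2">details</a>
    <div class="item-box-name">Red crate</div>
    <div class="item-box-features">10x2</div>
  </div>
</body></html>
"""

PROFILE = {
    "items": './/div[@class="item-box-info"]',
    "infos": ".//a",  # mirrors the default: first link within the element
    "name": './/div[@class="item-box-name"]',
    "features": './/div[@class="item-box-features"]',
    "evaluator": r"^\s*(.+)\s*$",
}


def clean(text, evaluator):
    """Apply the evaluator regex and keep its first group when it matches."""
    match = re.match(evaluator, text or "")
    return match.group(1) if match else text


def scrape(page, profile):
    """Return one dict per item element found on the page."""
    root = ET.fromstring(page.strip())
    records = []
    for index, item in enumerate(root.findall(profile["items"])):
        link = item.find(profile["infos"])
        records.append({
            "index": index,
            "name": clean(item.findtext(profile["name"]), profile["evaluator"]),
            "features": [clean(node.text, profile["evaluator"])
                         for node in item.findall(profile["features"])],
            "link": link.get("href") if link is not None else None,
        })
    return records


records = scrape(PAGE, PROFILE)
```

With lxml, the full expressions from the example configuration, including `/text()`, can be passed to its `xpath()` method instead.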
