A scraper for the KSL classifieds, written in Python with Scrapy.
There is a spider for each category:
computers: scrapes all computers for sale.
outdoors: scrapes all outdoors items.
There is also a single large spider, ksl, that covers all the categories at once.
After installing Scrapy, run the following command in the project directory:
scrapy crawl ksl
to generate a CSV file with the data, or, for a specific category:
scrapy crawl computers
to generate computers.csv with data from the computers category.
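If you would rather launch several crawls from a script than type each command, a minimal sketch along these lines may help. It assumes the `scrapy` executable is on the PATH and that the script is run from the project directory; the helper names `build_crawl_command` and `run_spiders` are hypothetical and not part of this project.

```python
import subprocess


def build_crawl_command(spider_name):
    """Return the argument list for `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run_spiders(spider_names, runner=subprocess.run):
    """Run each spider in turn and return {spider_name: exit_code}.

    `runner` is injectable so the function can be tried without Scrapy installed.
    """
    results = {}
    for name in spider_names:
        completed = runner(build_crawl_command(name))
        results[name] = completed.returncode
    return results
```

Calling `run_spiders(["ksl", "computers"])` would run the two commands shown above one after the other; a non-zero exit code in the result points at a crawl that failed to start or finish.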
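To have Scrapy Cloud install the project's dependencies, point to a requirements file in scrapinghub.yml:
projects:
  default: 12345
  prod: 33333
requirements:
  file: requirements.txt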
In case you use pipenv you may also specify a Pipfile:
projects:
  default: 12345
  prod: 33333
requirements:
  file: Pipfile
To deploy a Scrapy project to Scrapy Cloud, navigate into the project’s folder and run:
shub deploy [TARGET]
where [TARGET] is either a project name defined in scrapinghub.yml or a numerical Scrapinghub project ID. If you have configured a default target in your scrapinghub.yml, you can leave out the parameter completely:
$ shub deploy
Packing version 3af023e-master
Deploying to Scrapy Cloud project "12345"
{"status": "ok", "project": 12345, "version": "3af023e-master", "spiders": 1}
Run your spiders at:
You can set the proxy list in proxy.txt. Without working proxies, the site will block you. The proxies provided worked as of January 10, 2018; free proxy lists can also be found online.
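How the spiders consume proxy.txt is not shown here, so the following is only a sketch of one common approach: read the file, pick a proxy at random, and hand it to Scrapy through `request.meta["proxy"]`, which Scrapy's built-in HttpProxyMiddleware honors. It assumes one proxy per line, either as `host:port` or as a full URL; `load_proxies` and `pick_proxy` are hypothetical helper names.

```python
import random


def load_proxies(path="proxy.txt"):
    """Read proxies from a text file, one per line (assumed format).

    Blank lines and lines starting with '#' are skipped. Entries without a
    scheme get 'http://' prepended so they can be used as proxy URLs.
    """
    proxies = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip()
            if not entry or entry.startswith("#"):
                continue
            if "://" not in entry:
                entry = "http://" + entry
            proxies.append(entry)
    return proxies


def pick_proxy(proxies, rng=random):
    """Return a randomly chosen proxy URL, or None if the list is empty."""
    return rng.choice(proxies) if proxies else None
```

Inside a spider, the chosen value could then be attached to a request, for example `scrapy.Request(url, meta={"proxy": pick_proxy(proxies)})`. Picking a fresh proxy per request spreads the traffic, which is the usual reason for keeping a list rather than a single address.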
The scraped data contains the following fields:
cell_phone = scrapy.Field()
home_phone = scrapy.Field()
category = scrapy.Field()
sub_category = scrapy.Field()
city = scrapy.Field()
state = scrapy.Field()
zip = scrapy.Field()
member_id = scrapy.Field()
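Once a crawl has produced a CSV file, these fields can be read back with the standard library. The sketch below assumes the CSV header row uses the item field names listed above and that the file is called `computers.csv`; `count_by_city` is a hypothetical helper, not part of the project.

```python
import csv
from collections import Counter


def count_by_city(csv_path):
    """Count listings per (city, state) pair in a scraped CSV file."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Missing or empty columns fall back to an empty string.
            city = (row.get("city") or "").strip()
            state = (row.get("state") or "").strip()
            counts[(city, state)] += 1
    return counts
```

For example, `count_by_city("computers.csv").most_common(5)` would list the five locations with the most listings.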
To enable MongoDB storage, fill in the MongoDB settings and uncomment the corresponding lines in pipelines.
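The project's pipeline code is not reproduced here, so the block below is only a sketch of what a MongoDB item pipeline typically looks like, modeled on the example in the Scrapy documentation. It assumes the third-party `pymongo` package and two settings named `MONGO_URI` and `MONGO_DATABASE`; those setting names, the class name, and the collection name are assumptions and may differ from what this project actually uses.

```python
class MongoPipeline:
    """Store each scraped item as a document in a MongoDB collection."""

    collection_name = "ksl_items"  # hypothetical collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed setting names.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "ksl"),
        )

    def open_spider(self, spider):
        # Imported here so the module loads even if pymongo is not installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Scrapy items convert to plain dicts for insertion.
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

In a stock Scrapy project a class like this is switched on through the `ITEM_PIPELINES` setting; returning the item from `process_item` lets any later pipeline, such as a CSV export step, still see it.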
Aslan Varoqua, Skylines Digital USA, Denver, Colorado, USA