The WebCrawler class exploits the WebBrowser lib (which uses Selenium and requests) to crawl URLs in parallel. This lib is "callback oriented".
git clone https://github.com/hayj/WebCrawler.git
pip install ./WebCrawler/wm-dist/*.tar.gz
- Uses concurrency to request web pages in parallel with multiple Selenium instances or HTTP requests (through the requests lib).
- Easy usage through callback functions (for events such as when a page is loaded or when a url fails...).
- Automatic page checking (takes care of page duplicates, with DomainDuplicate, across pages of the same domain to detect "captcha pages" or invalid pages). It exploits all features of the WebBrowser library.
- Exploits Selenium (instead of requests, used by most crawler libraries like Scrapy). This makes it slower, but you can execute JavaScript and use all Selenium features. Chrome and PhantomJS are both compatible.
- Also exploits requests, but normalizes the generated data with the WebBrowser library.
- Allows the use of several proxies.
- Real-time proxy scoring and selection.
- Retries failed URLs a number of times.
- Randomizes the browser header and window size according to the proxy used.
- Uses a multi-armed bandit to automatically adjust the settings.
- Automatically kills dead Selenium processes on Linux (Selenium is sometimes unstable).
- You can use piped browsers (beta) to connect to a web site and keep the session open, or to search in a specific field on the web page...
Some libraries (e.g. Scrapy) do crawling well but don't exploit Selenium (or you need to request pages twice). You cannot easily exploit multiple proxies. Moreover, they are most often not fully "callback" oriented: instead you need to inherit from classes, which makes it hard to separate your scraping / indexing code from the crawler itself.
>>> from webcrawler.crawler import *
>>> crawler = Crawler(["https://github.com/hayj/WebCrawler"], crawlingCallback=crawlingCallback)
>>> crawler.start()
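For instance, a complete minimal script could look like the following sketch (here crawlingCallback just prints each result; data["url"] and data["status"] are described in the callbacks section below):
from webcrawler.crawler import *

def crawlingCallback(data, browser=None):
    # data["url"] and data["status"] are detailed in the callbacks section below:
    print(data["url"], data["status"])

crawler = Crawler(["https://github.com/hayj/WebCrawler"], crawlingCallback=crawlingCallback)
crawler.start()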
- startUrls: The first param is mandatory; it is a list of urls, or a generator (i.e. a function which yields urls) / iterator. Duplicate urls will automatically be skipped.
- useHTTPBrowser: Indicates the usage of a hjwebbrowser.httpbrowser.HTTPBrowser. If set as False (default), the crawler will use a hjwebbrowser.browser.Browser with Selenium.
- browserParams: A dict of args for hjwebbrowser.browser.Browser.
- httpBrowserParams: A dict of args for hjwebbrowser.httpbrowser.HTTPBrowser.
- maxRetryFailed: The maximum number of retries the crawler will do for failed urls.
- banditRoundDuration: The duration of a round in seconds for the bandit. At each round, the bandit will choose new params (explore or exploit).
- proxies: A list of proxies. See WebBrowser for more information, in the Proxies section.
- browsersDriverType: Default is DRIVER_TYPE.chrome, see WebBrowser for more information (don't forget to add the Chrome and PhantomJS drivers to the PATH env var). Prefer PhantomJS to crawl with a lot of parallel browsers, it takes less CPU, but it is deprecated (it works with selenium==3.8.0 and phantomjs==2.1.1).
- browsersHeadless: Boolean, choose headless or not for the Chrome driver.
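For example, an init combining several of these parameters might look like the following sketch (myProxies is a placeholder list, see the Proxies section of WebBrowser for the expected format; crawlingCallback is defined as in the example below):
crawler = Crawler(
    ["https://github.com/hayj/WebCrawler"],   # startUrls
    crawlingCallback=crawlingCallback,
    useHTTPBrowser=False,                       # use Selenium browsers
    maxRetryFailed=3,                           # retry each failed url up to 3 times
    proxies=myProxies,                          # placeholder list of proxies
    browsersHeadless=True,                      # headless Chrome
)
crawler.start()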
Each callback function has to be passed in the init parameters as we saw above.
crawlingCallback: The main callback. You receive the crawled data and the browser (which can be a hjwebbrowser.browser.Browser or a hjwebbrowser.httpbrowser.HTTPBrowser) at the end of the Ajax sleep after a page was loaded. See WebBrowser for more information about REQUEST_STATUS and data:
def crawlingCallback(data, browser=None):
    global logger
    try:
        url = data["url"]
        status = data["status"]
        # If the request succeeded, we store it in a database:
        if status in [REQUEST_STATUS.success, REQUEST_STATUS.timeoutWithContent]:
            addToTheDatabase(data) # Your own function
        # Else if the request went through but we got a 404:
        elif status in [REQUEST_STATUS.error404]:
            logError(url + " is a 404...", logger)
    except Exception as e:
        logException(e, logger, location="crawlingCallback")
You can access the Selenium driver through browser.driver. If you scroll down (or do other actions), you can reload the html using browser.driver.page_source. But it's better to use afterAjaxSleepCallback, which is called before crawlingCallback. See the "Others callbacks" section for more information. After using afterAjaxSleepCallback, you get fresh html data in crawlingCallback.
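For example, to scroll the page so that lazy-loaded content appears before crawlingCallback receives the html, a sketch could look like this (assumption: afterAjaxSleepCallback receives the browser, check hjwebbrowser.browser.Browser for the exact signature):
import time

def afterAjaxSleepCallback(browser):
    # Assumption on the signature: the callback receives the browser,
    # browser.driver being the underlying Selenium driver (see above).
    try:
        browser.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # let lazy-loaded elements appear
    except Exception as e:
        logException(e, logger, location="afterAjaxSleepCallback")

crawler = Crawler(startUrls, crawlingCallback=crawlingCallback,
                  afterAjaxSleepCallback=afterAjaxSleepCallback)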
If you store data in a Mongo database, you can use databasetools.mongo.dictToMongoStorable:
from databasetools.mongo import *
data = dictToMongoStorable(data)
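For instance, the addToTheDatabase function used in the crawlingCallback example above could be a simple pymongo insert (a sketch assuming a local MongoDB; the database and collection names are placeholders):
from pymongo import MongoClient
from databasetools.mongo import dictToMongoStorable

pagesCollection = MongoClient("localhost")["crawling"]["pages"]  # placeholder db / collection

def addToTheDatabase(data):
    # Make the dict storable by MongoDB, then insert it:
    data = dictToMongoStorable(data)
    pagesCollection.insert_one(data)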
failedCallback: In this callback, you receive data from failed urls (REQUEST_STATUS.<timeout|refused|duplicate|invalid|exception>). In this example, we use a SerializableDict from DataStructureTools to retain all failed urls so you can synchronize failed urls across several servers via MongoDB (see alreadyFailedFunct below):
# SerializableDict comes from the DataStructureTools package:
failsSD = None
def failedCallback(data):
    global failsSD
    global logger
    # We init the SD with a MongoDB on the localhost (set user, password and host):
    if failsSD is None:
        failsSD = SerializableDict("crawlfails", useMongodb=True)
    try:
        # We get the url:
        url = data["url"]
        if failsSD.has(url):
            failsSD[url] += 1
        else:
            failsSD[url] = 1
    except Exception as e:
        logException(e, logger, location="failedCallback")
alreadyFailedFunct: This function takes a CrawlingElement (the url) and returns True if the url failed enough (i.e. you don't want to retry it):
def alreadyFailedFunct(crawlingElement):
    global logger
    try:
        url = crawlingElement.data
        if failsSD.has(url) and failsSD[url] > crawler.maxRetryFailed:
            log("Failed enough in failsSD: " + url, logger)
            return True
    except Exception as e:
        logException(e, logger, location="alreadyFailedFunct")
    return False
alreadyCrawledFunct: This function works the same way, but you have to return True in case you already downloaded this url. For example, you can check whether the url exists in the database:
def alreadyCrawledFunct(crawlingElement):
    global logger
    try:
        url = crawlingElement.data
        if crawlingDatabaseHas(url): # Your own function
            log("The data of this url is already in the database: " + url, logger)
            return True
    except Exception as e:
        logException(e, logger, location="alreadyCrawledFunct")
    return False
terminatedCallback: In this function you receive the urls that failed enough in urlsFailedEnough and those which didn't fail enough in urlsFailedNotEnough, in case you stopped the crawler:
def terminatedCallback(urlsFailedNotEnough, urlsFailedEnough):
    global logger
    for text, currentFailed in [("urlsFailedNotEnough", urlsFailedNotEnough),
                                ("urlsFailedEnough", urlsFailedEnough)]:
        currentFailedText = ""
        for current in currentFailed:
            currentFailedText += str(current.data) + "\n"
        logInfo(text + ":\n" + currentFailedText, logger)
Use the put method from Crawler:
crawler.put("http://...")
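For example, you can feed newly discovered links to the crawler from inside crawlingCallback (a sketch; it assumes the crawler object is accessible globally, as in the alreadyFailedFunct example, and that you use the Selenium browser so that browser.driver is available):
def crawlingCallback(data, browser=None):
    # Enqueue the links found on the crawled page:
    driver = getattr(browser, "driver", None)
    if driver is not None:
        for a in driver.find_elements_by_tag_name("a"):
            href = a.get_attribute("href")
            if href is not None:
                crawler.put(href)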
You have to set the multi-armed bandit configuration through 3 init parameters so the bandit can optimize them for the crawler. It will explore random parameters and exploit the best parameters in most cases. Each param has to be a list of values (for the bandit to choose from):
# Default bandit parameters:
alpha=[0.99],
parallelRequests=[5, 20, 50],
browserCount=[50, 100],
- alpha: A number between 0.0 and 1.0 which determines the proxy allocation pattern. If near 0.0, it will allocate only the best proxies to all browsers; if near 1.0, it will allocate all proxies uniformly to all browsers. You will see the allocation in the logs at each multi-armed bandit round.
- parallelRequests: The number of parallel requests for the bandit to choose from.
- browserCount: The number of browsers for the bandit to choose from (must be greater than parallelRequests).
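For example, you can let the bandit pick between a few candidate values (a sketch, the values are only illustrative):
crawler = Crawler(
    startUrls,
    crawlingCallback=crawlingCallback,
    alpha=[0.9, 0.99],               # how uniformly proxies are allocated
    parallelRequests=[10, 20, 40],   # candidate numbers of parallel requests
    browserCount=[50, 100],          # candidate numbers of browsers (>= parallelRequests)
)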
You can adjust the duration of certain events, all in seconds:
- ajaxSleep: The duration to sleep when the page is loaded. It allows waiting for Ajax elements to appear.
- pageLoadTimeout: The timeout duration for each page load.
- firstRequestSleepMin and firstRequestSleepMax: The sleep duration of the first request of a Browser (a random value between min and max). It is useful to avoid sending a lot of requests in parallel when the crawler starts.
- allRequestsSleepMin and allRequestsSleepMax: The same but for all requests.
- browserMaxDuplicatePerDomain: Int, see DomainDuplicate for more information.
- allowRestartTor: If set as True, the crawler is authorized to restart Tor services for hjwebbrowser.httpbrowser.HTTPBrowser.
- sameBrowsersParallelRequestsCount: If set as True, the crawler is not authorized to set a different number of parallel requests vs parallel browsers in activity. You can use this if you set useHTTPBrowser as True. See the "Ajax sleep is not a bottleneck" section for more information.
- loadImages: Boolean.
- useProxies: Default is True. Set it as False if you don't want to use proxies even if you gave some in proxies.
- stopCrawlerAfterSeconds: To stop the crawler after n seconds if it didn't receive other urls from the url generator.
- queueMinSize: The number of urls from startUrls to keep in the queue (default 1000).
- banditExploreRate: The bandit exploration rate (default 0.2). For the first 30 requests, the bandit will explore.
- browserUseFastError404Detection: Set it as True if you don't want to use 404Detector, to speed up treatment.
- logger: A logger from systemtools.logger (see SystemTools).
- verbose: To set the verbosity (True or False).
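A sketch of an init tweaking some of these values (the numbers are only illustrative, not recommended defaults):
crawler = Crawler(
    startUrls,
    crawlingCallback=crawlingCallback,
    ajaxSleep=5,                # wait 5 seconds for Ajax content after each page load
    pageLoadTimeout=30,         # give up a page load after 30 seconds
    firstRequestSleepMin=1,     # first request of each browser sleeps between 1 and 10 seconds
    firstRequestSleepMax=10,
    queueMinSize=1000,
    verbose=True,
)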
We consider 2 bottlenecks:
- The proxy bandwidth: so we automatically select the best proxies at each multi-armed bandit round.
- Your computer's bandwidth: so the multi-armed bandit will adjust the number of parallel requests and other parameters.
But we consider that an Ajax sleep (when we wait for Ajax loading after a page was loaded) is not a bottleneck. The main advantage of this hypothesis (which is right in most cases, because you sometimes have no Ajax load, or it consumes a lot less bandwidth) is that an Ajax sleep does not reduce the total amount of actual parallel requests (the total amount of parallel requests != the total amount of parallel browsers in activity).
This class handles elements to crawl, basically urls, but also piped messages (beta). To access the url, you have to get ceInstance.data as in the alreadyFailedFunct example.
- beforeAjaxSleepCallback and afterAjaxSleepCallback: See hjwebbrowser.browser.Browser from WebBrowser for more information.
- pipCallback: This callback is useful for using piped browsers when you return values in crawlingCallback. It allows you to keep a browser across multiple pages. It is actually in beta but works in most cases. When using piped browsers you can also set notUniqueUrlsGetter to give non-unique urls (for example in case you start on a "search page" and you want to continue browsing, or when you want to log in...).
- beforeGetCallback: Receives the browser; do something with it before the get of the url.
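For example, a beforeGetCallback could clear cookies before each page load (a sketch, assuming the callback only receives the browser as described above, with browser.driver being the Selenium driver):
def beforeGetCallback(browser):
    # Clear cookies before the url is requested:
    driver = getattr(browser, "driver", None)
    if driver is not None:
        driver.delete_all_cookies()

crawler = Crawler(startUrls, crawlingCallback=crawlingCallback,
                  beforeGetCallback=beforeGetCallback)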
In the Crawler init you can set DomainDuplicate init parameters for the singleton used in hjwebbrowser.browser.Browser or hjwebbrowser.httpbrowser.HTTPBrowser using browserParams or httpBrowserParams. For example, if you want to set user, password and host for the Mongo database of the DomainDuplicate singleton in hjwebbrowser.browser.Browser:
>>> crawler = Crawler(browserParams={"domainDuplicateParams": {"user": None, "password": None, "host": "localhost"}})
- You can integrate HoneypotDetector in your crawler.