cheerio-crawler

Website crawler that visits URLs recursively, starting from one initial URL and following links in HTML responses, and invokes your callback function for each page it visits.

Example

var crawl = Crawler(function (url, $) {
    var title = $('title').text();
    console.log(title, '---', url);
});

// ...

crawl('http://www.resource.com/', function (err) {
    if (err) {
        console.error('unable to complete crawl:', err.message);
    }
    else {
        console.log('finished');
    }
});

Example: Cancel

To cancel a crawl at any time, such as when exceeding a threshold, call the "cancel" method:

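var MAX_VISITS = 100; // example threshold
var visits = 0;

var crawl = Crawler(function(url, $) {
    // cancel once the number of handled pages exceeds MAX_VISITS
    if (++visits > MAX_VISITS) {
        crawl.cancel();
    }
});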

Example: Skip Recursion

To skip recursion on a URL, return false from the handler function:

var crawl = Crawler(function(url, $) {
    // assuming DO_NOT_CRAWL_LIST is an array of URL strings
    if (DO_NOT_CRAWL_LIST.indexOf(url) !== -1) {
        return false;
    }
});
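
As a variation on the example above, the skip decision can be based on the URL path rather than an exact list of URLs. This is only a sketch: `SKIP_PREFIXES`, `shouldSkip`, and `handler` are hypothetical names chosen for illustration and are not part of this package. It relies on the standard `URL` class that Node.js provides globally.

```javascript
// Hypothetical path prefixes the crawl should not descend into.
var SKIP_PREFIXES = ['/login', '/admin/'];

// True when the URL's path starts with one of the prefixes above.
function shouldSkip(url) {
    var path = new URL(url).pathname;
    return SKIP_PREFIXES.some(function (prefix) {
        return path.indexOf(prefix) === 0;
    });
}

// Handler in the same shape as the functions passed to Crawler above.
function handler(url, $) {
    if (shouldSkip(url)) {
        return false; // skip recursion on this URL
    }
    // ... otherwise process the page with $ ...
}
```

Under these assumptions the function would be passed in as `Crawler(handler)`, the same way the inline handlers are passed in the examples above.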

Functionality

stays on the origin of the initial URL
gets URLs from a@href
makes one request at a time
requests each URL only once (de-duplicates requests)
removes URL fragments (hash)
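
The link-handling rules listed above can be illustrated with a small standalone sketch. It is not this package's actual implementation: `normalize` and `selectLinks` are hypothetical helpers, and the sketch assumes the href values have already been collected from the page as an array of strings. It uses only the standard `URL` class and `Set`.

```javascript
// Hypothetical helpers for illustration; not part of cheerio-crawler's API.

// Resolve an href against the page it was found on and drop the fragment.
function normalize(href, pageUrl) {
    try {
        var resolved = new URL(href, pageUrl);
        resolved.hash = '';
        return resolved.href;
    } catch (e) {
        return null; // href could not be parsed as a URL
    }
}

// Return the links that still need a request: same origin as the
// initial URL, fragment removed, and not seen before.
function selectLinks(hrefs, pageUrl, initialUrl, seen) {
    var origin = new URL(initialUrl).origin;
    var result = [];
    hrefs.forEach(function (href) {
        var url = normalize(href, pageUrl);
        if (url === null) return;                   // unparseable
        if (new URL(url).origin !== origin) return; // stay on the origin
        if (seen.has(url)) return;                  // request each URL once
        seen.add(url);
        result.push(url);
    });
    return result;
}
```

Here `seen` is a `Set` shared across the whole crawl, so a link that appears on several pages, or that differs only by its fragment, is queued once.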
