Yet another Node 'web Scraper', converting HTML, XML and JSON structures to proper JSON objects.
Scrapa processes requests in 2 phases, scrape and parse. The phases are independent: parse can be used without scrape, and vice versa.
- Scrape: makes the HTTP request, fetches the page, and can also select a specific part of the page, for example a JSON string found with a RegExp.
The scrape phase has 2 types for processing requests; both yield the plain response as a string.
  - get: simple fetch
  - headless: headless browser, for dynamic applications such as React that rely on complex JavaScript rendering.
You can also use this library to get remote content without the parse phase.
import { scrape } from 'scrapa';
let body = await scrape({ url: '', type: 'headless' });
- Parse: handles 3 input types (HTML, XML, JSON) and converts them to a unified, easily consumed format.
All requests are currently sent with a basic hard-coded user agent: Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Mobile/15E148 Safari/604.1
The headless browser sends the same user agent.
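As a rough illustration of what a simple get with a fixed user agent could look like, here is a minimal sketch. The names buildGetOptions and simpleGet are hypothetical and are not part of scrapa; the sketch only assumes the global fetch that ships with Node 18 and later.

```javascript
// Illustrative sketch; buildGetOptions and simpleGet are hypothetical names.
const USER_AGENT =
  'Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X) AppleWebKit/605.1.15 ' +
  '(KHTML, like Gecko) Version/15.4 Mobile/15E148 Safari/604.1';

// Request options carrying the fixed user agent.
function buildGetOptions() {
  return { method: 'GET', headers: { 'User-Agent': USER_AGENT } };
}

// Fetches a page and resolves with the raw response body as a string.
// Relies on the global fetch available in Node 18 and later.
async function simpleGet(url) {
  const response = await fetch(url, buildGetOptions());
  return response.text();
}
```

Keeping the options in a separate function makes the header easy to inspect without performing a network request.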
# npm
npm install --save scrapa
Extract the title of the Yahoo website, located in <title>Yahoo News</title>
import { scrape, parse } from 'scrapa';
let body = await scrape({url: ''});
let parsed = await parse({body, fields: {
  title_now_is: 'head > title'
}});
// parsed:
// {
//   total: 1,
//   fields: [
//     { title_now_is: 'Yahoo News - Latest News & Headlines' }
//   ]
// }
Extract top 3 items from Yahoo News
import { scrape, parse } from 'scrapa';
let body = await scrape({url: ''});
let parsed = await parse({body, fields: {
  article_title: '.js-stream-content ul li div'
}});
// parsed:
// {
//   total: 3,
//   fields: [
//     { article_title: 'COVID-affected tenants face eviction despite CDC ban' },
//     { article_title: 'Cayman Islands jails U.S. student in COVID case' },
//     { article_title: "Fla. scientist vows to speak COVID-19 'truth to power'" }
//   ]
// }
Extracting links from Yahoo, finding the JSON part (root.App.main), and using it instead of HTML parsing.
import { scrape, parse } from 'scrapa';
let body = await scrape({
  url: '',
  // Keep only the JSON assigned to root.App.main
  regExp: [/root\.App\.main = (.*?);\n.*\}\(this\)\);/gm],
});
let parsed = await parse({
  body,
  type: 'json',
  fields: { href: 'context.dispatcher.stores.PageStore.pageData.links.{Iterator}.href'},
  options: {
    limit: 4 // assumed value: the result below shows 4 of 23 links
  }
});
// parsed:
// {
//   total: 23,
//   fields: [
//     { href: '//' },
//     { href: '//' },
//     { href: '' },
//     { href: '' }
//   ]
// }
Simple Scraper
- url (String): The page URL or request options.
- type (String): The type of scrape required. Two options:
  1. get: simple GET operation via 'fetch'
  2. headless: a full browser, for JavaScript-heavy applications that render in the browser after load.
- regExp (Array): Array of RegExp instances to clean the output; useful before passing it to parse.
- Promise resolving with:
  (String): Scraped raw body
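To make the role of the regExp array more concrete, here is a minimal sketch of one way such a cleaning step could work. The helper applyRegExps is hypothetical and the narrowing rule (first capture group, else whole match) is an assumption; scrapa's actual behavior may differ.

```javascript
// Hypothetical helper, not part of scrapa's documented API.
// Applies each RegExp in order; when a pattern matches, the body is
// narrowed to the first capture group (or the whole match if there is none).
function applyRegExps(body, regExps = []) {
  let result = body;
  for (const re of regExps) {
    // Use a fresh non-global copy so lastIndex state from the 'g' flag
    // cannot leak between calls.
    const flags = re.flags.replace('g', '');
    const match = new RegExp(re.source, flags).exec(result);
    if (match) {
      result = match[1] !== undefined ? match[1] : match[0];
    }
  }
  return result;
}

const page = '<script>root.App.main = {"links":[{"href":"/a"}]};\n</script>';
const json = applyRegExps(page, [/root\.App\.main = (.*?);\n/]);
console.log(JSON.parse(json).links[0].href); // prints "/a"
```

A pattern that does not match leaves the body unchanged in this sketch, so a stale RegExp degrades to passing the raw page through rather than returning nothing.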
Parse finds the fields and extracts the data, formatted in the output under the same field name.
parse uses 3 input types:
- html: uses Cheerio as a query selector. Fields should contain CSS-style selectors to find the data. All Cheerio CSS selectors are valid. Example usage: {fields: {page_title: 'head > title'}} populates the output field page_title with the page's title.
Currently it takes the .innerHTML of every element matched by the selectors and populates it as output.
import { parse } from 'scrapa';
let parsed = await parse({body, type: 'html', fields: {
  page_title: 'head > title'
}});
- json: fields are mapped as you would regularly read from JSON, with DOT notation (store.books.0.title).
Arrays are accessed via DOT too, instead of [], as in the example.
For arrays containing many rows, use the special operator {Iterator} in place of the index to get all objects. The placeholder is replaced at runtime so that every item in the array is processed.
Other than these, paths behave like regular JSON property access.
import { parse } from 'scrapa';
let parsed = await parse({body, type: 'json', fields: {
  books_title: '',
  books_price: '{Iterator}.title',
}});
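The DOT notation and the {Iterator} operator described above can be sketched roughly as follows. resolvePath is a hypothetical function written for illustration, and the sample data is invented; scrapa's internal implementation may well differ.

```javascript
// Hypothetical resolver, written for illustration; scrapa's internals may differ.
// Walks a DOT-notation path and fans out over every array element at {Iterator}.
function resolvePath(data, path) {
  const walk = (node, keys) => {
    if (node === null || node === undefined) return [];
    if (keys.length === 0) return [node];
    const [key, ...rest] = keys;
    if (key === '{Iterator}') {
      // Expand to all items; non-arrays yield nothing.
      return Array.isArray(node) ? node.flatMap((item) => walk(item, rest)) : [];
    }
    // Numeric keys index arrays the same way string keys index objects.
    return walk(node[key], rest);
  };
  return walk(data, path.split('.'));
}

const sample = { store: { books: [{ title: 'A', price: 1 }, { title: 'B', price: 2 }] } };
console.log(resolvePath(sample, 'store.books.0.title'));          // [ 'A' ]
console.log(resolvePath(sample, 'store.books.{Iterator}.title')); // [ 'A', 'B' ]
```

Returning an array in both cases keeps the single-index and {Iterator} forms uniform, at the cost of wrapping a single value in a one-element array.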
- xml: converts XML input to JSON. The field syntax is the same as for JSON.
💡 More: More examples can be found in the unit tests folder.
- body (String/Object): Input body to parse.
- type (String): The type of body: 'html', 'json' or 'xml'.
- fields (Object): Key/value pairs of parsing properties according to the type. See the examples above.
- options (Object): Settings to configure the parsing process and alter the output.
  - limit (Number): Truncates the output to the desired number of items.
  - reverse (Boolean): Reverses the output.
- Promise resolving with:
  - total (Number): Number of elements found.
  - fields (Array): Array of the input fields, according to the keys passed to parse.
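As a sketch of how the limit and reverse options could shape the result, consider the following. applyOptions is a hypothetical function, and two points are assumptions rather than documented behavior: that reverse is applied before limit, and that total reports the count before limiting.

```javascript
// Hypothetical post-processing step; option names mirror the ones documented above.
// Assumptions: reverse is applied before limit, and total is the count before limiting.
function applyOptions(rows, { limit, reverse = false } = {}) {
  let fields = reverse ? [...rows].reverse() : [...rows];
  if (Number.isInteger(limit) && limit >= 0) {
    fields = fields.slice(0, limit);
  }
  return { total: rows.length, fields };
}

const rows = [{ href: '/a' }, { href: '/b' }, { href: '/c' }];
console.log(applyOptions(rows, { limit: 2 }));
// { total: 3, fields: [ { href: '/a' }, { href: '/b' } ] }
console.log(applyOptions(rows, { limit: 2, reverse: true }));
// { total: 3, fields: [ { href: '/c' }, { href: '/b' } ] }
```

The input array is copied before reversing, so the caller's data is left untouched.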
- Debug output
- Cover scrape with tests
- Add E2E
- Parse transformation, for example parse date str to