Yet another Node 'web scraper', converting HTML, XML, and JSON structures to proper JSON objects.
Scrapa processes requests in two phases. The phases are separate and can be used independently: `parse` can be used without `scrape`, and vice versa.
`scrape`
- Makes the HTTP request, fetches the page, and can also select a specific part of it, for example by finding a JSON string with a RegExp.
The scrape phase has two types for processing requests; both yield the plain response as a string.
- `get` - a simple fetch.
- `headless` - a headless browser, for dynamic applications (such as React) with complex JavaScript rendering.
You can also use this library to fetch remote content without the parse phase:
```js
import { scrape } from 'scrapa';

let body = await scrape({ url: 'https://news.yahoo.com', type: 'headless' });
console.debug(body);
```
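A `get` request works the same way; it also appears to be the default when `type` is omitted, as in the later examples:

```js
import { scrape } from 'scrapa';

// Same fetch, but via a plain HTTP request instead of a headless browser.
let body = await scrape({ url: 'https://news.yahoo.com', type: 'get' });
console.debug(body);
```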
`parse`
- Handles three input types (HTML, XML, and JSON) and converts them into a unified, easily consumed format.
All requests are currently sent with a basic, hardcoded user agent: `Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Mobile/15E148 Safari/604.1`. The headless browser sends the same user agent.
```sh
# npm
npm install --save scrapa
```
Extract the title of the Yahoo website, located in `<title>Yahoo News</title>`:
```js
import { scrape, parse } from 'scrapa';

let body = await scrape({ url: 'https://news.yahoo.com' });
let parsed = await parse({
  body,
  fields: {
    title_now_is: 'head > title'
  }
});
console.info(parsed);
```
```js
{
  total: 1,
  fields: [
    { title_now_is: 'Yahoo News - Latest News & Headlines' }
  ]
}
```
Extract the top 3 items from Yahoo News:
```js
import { scrape, parse } from 'scrapa';

let body = await scrape({ url: 'https://news.yahoo.com' });
let parsed = await parse({
  body,
  fields: {
    article_title: '.js-stream-content ul li div'
  }
});
console.info(parsed);
```
```js
{
  total: 3,
  fields: [
    { article_title: 'COVID-affected tenants face eviction despite CDC ban' },
    { article_title: 'Cayman Islands jails U.S. student in COVID case' },
    { article_title: "Fla. scientist vows to speak COVID-19 'truth to power'" }
  ]
}
```
Extract links from Yahoo by finding the embedded JSON (`root.App.main`) and parsing it instead of the HTML:
```js
import { scrape, parse } from 'scrapa';

let body = await scrape({
  url: 'https://news.yahoo.com',
  regExp: [new RegExp('root\\.App\\.main = (.*?);\n.*\\}\\(this\\)\\);', 'gm')],
});
let parsed = await parse({
  body,
  type: 'json',
  fields: { href: 'context.dispatcher.stores.PageStore.pageData.links.{Iterator}.href' },
});
console.info(parsed);
```
```js
{
  total: 23,
  fields: [
    { href: '//s.yimg.com' },
    { href: '//mbp.yimg.com' },
    ...
    { href: 'https://s.yimg.com/cv/apiv2/favicon_y19_32x32.ico' },
    { href: 'https://news.yahoo.com/' }
  ]
}
```
Simple Scraper
- String `url`: The page URL or request options.
- String `type`: The type of scrape required, two options:
  1. `get` - a simple GET operation via `fetch`.
  2. `headless` - a full browser, for JavaScript-heavy applications that render client-side.
- Array `regExp`: Array of RegExp instances to clean the output, useful before passing to `parse` (see the sketch after this list).
- Promise resolving with:
  - `body` (String): Scraped raw body.
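Putting the parameters together, a minimal sketch (the URL and the RegExp pattern below are placeholders, not part of the library):

```js
import { scrape } from 'scrapa';

let body = await scrape({
  url: 'https://example.com',              // page URL or request options
  type: 'get',                             // 'get' or 'headless'
  regExp: [/window\.__DATA__ = (.*?);/gm], // optional patterns to clean the raw body
});
console.debug(body); // the raw body, as a string
```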
Parse finds the `fields` and extracts the data, formatting it in the output under the same field names.
`parse` handles 3 input types:

`html`
- Uses Cheerio as the query engine. Fields should contain CSS-style selectors to find the data; all Cheerio CSS selectors are valid. Example usage: `{fields: {page_title: 'head > title'}}` populates the field `page_title` in the output with the page's title.
Currently it takes the `.innerHTML` of every selector match and populates it in the output.
```js
import { parse } from 'scrapa';

let parsed = await parse({
  body,
  type: 'html',
  fields: {
    page_title: 'head > title'
  }
});
console.debug(parsed);
```
`json`
- Fields should be mapped as you would normally read JSON, using dot notation (`store.books.0.title`).
Arrays are accessed with dots too, instead of `[]`, as in the example.
For objects containing many rows, there is a special operator to get all of them: use `{Iterator}` in place of the index number. It is replaced at runtime so that every item in the array is processed.
Other than that, properties behave like regular JSON addresses.
```js
import { parse } from 'scrapa';

let parsed = await parse({
  body,
  type: 'json',
  fields: {
    books_title: 'catalog.book.0.title',
    books_price: 'catalog.book.{Iterator}.price',
  }
});
console.debug(parsed);
```
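To make `{Iterator}` concrete, here is a sketch of an input the example above could run against; the catalog data is made up for illustration:

```js
import { parse } from 'scrapa';

// Made-up catalog matching the field paths used above.
let body = JSON.stringify({
  catalog: {
    book: [
      { title: 'First Book', price: 9.99 },
      { title: 'Second Book', price: 14.5 }
    ]
  }
});

let parsed = await parse({
  body,
  type: 'json',
  fields: { books_title: 'catalog.book.{Iterator}.title' }
});
// Expected shape:
// { total: 2, fields: [ { books_title: 'First Book' }, { books_title: 'Second Book' } ] }
console.debug(parsed);
```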
`xml`
- Converts the XML input to JSON; the field syntax is then the same as for the JSON type.
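A minimal sketch, assuming the conversion maps repeated elements to arrays (the catalog XML below is made up):

```js
import { parse } from 'scrapa';

// Made-up XML; after conversion, fields use the same dot notation as the JSON type.
let body = `
  <catalog>
    <book><title>First Book</title></book>
    <book><title>Second Book</title></book>
  </catalog>`;

let parsed = await parse({
  body,
  type: 'xml',
  fields: { books_title: 'catalog.book.{Iterator}.title' }
});
console.debug(parsed);
```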
💡 More examples can be found in the unit tests folder.
- String/Object `body`: Input body to parse.
- String `type`: The type of `body` - 'html', 'json', or 'xml'.
- Object `fields`: Key/value pairs of parsing properties according to the `type`; see the examples above.
- Object `options`: Settings to configure the parsing process and alter the output (see the sketch after this list):
  - `limit` (Number): Splices the output down to the desired amount.
  - `reverse` (Boolean): Reverses the output.
- Promise resolving with:
  - `total` (Number): Number of elements found.
  - `fields` (Array): Array of result objects, keyed by the field names passed to `parse`.
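A sketch combining `options` with the earlier HTML example (the selector and values are illustrative):

```js
import { parse } from 'scrapa';

let parsed = await parse({
  body,
  type: 'html',
  fields: { article_title: '.js-stream-content ul li div' },
  options: {
    limit: 5,      // keep at most 5 results
    reverse: true  // reverse the output order
  }
});
console.info(parsed);
```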
- Debug output
- Cover scrape with tests
- Add E2E tests
- Parse transformation, for example parsing a date string into a `Date` object (see the sketch below).
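Until that lands, a transformation can be applied manually to the parsed output. A sketch, assuming a hypothetical `published` field holding a date string:

```js
import { parse } from 'scrapa';

let parsed = await parse({
  body,
  type: 'json',
  fields: { published: 'catalog.book.{Iterator}.published' } // hypothetical field
});

// Convert each date string into a Date object by hand.
let withDates = parsed.fields.map((row) => ({
  ...row,
  published: new Date(row.published),
}));
console.debug(withDates);
```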