Crawls a given website and extracts bioschemas.org/schema.org JSON-LD and Microdata from it. The extracted information is stored in a JSON file and can optionally be sent to a local Elasticsearch service.
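As a rough illustration of the JSON-LD half of that extraction, the sketch below collects the contents of script tags of type application/ld+json from an HTML string, using only the Python standard library. It is not the crawler's own implementation: the names JsonLdExtractor and extract_jsonld are hypothetical, malformed blocks are simply skipped, and Microdata is not covered.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # An attribute without a value is reported as None, hence the "or".
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # assumption: malformed JSON-LD blocks are ignored


def extract_jsonld(html):
    """Return a list with one parsed object per JSON-LD block found in html."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling extract_jsonld on the text of a downloaded page returns a list of dictionaries (or lists), one per JSON-LD block, which can then be filtered by their "@type" value.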
Usage examples:
./bioschemas-gocrawlit_mac_64 -p -u "https://www.ebi.ac.uk/biosamples/samples"
./bioschemas-gocrawlit_mac_64 -q -u https://tess.elixir-europe.org/sitemaps/events.xml
./bioschemas-gocrawlit_mac_64 -u http://159.149.160.88/pscan_chip_dev/
A folder named "bioschemas_gocrawlit_cache" is created in the directory where the program is run. It caches crawled pages so that they are not downloaded more than once. It is safe to delete this folder.
Scraped data is stored in a JSON file named <website_host>_schema.json in the current program folder.
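To make the naming pattern concrete, here is a minimal sketch of how a file name of the form <website_host>_schema.json could be derived from the start URL. The function name output_filename is hypothetical, and the exact host handling is an assumption: the text above does not say whether the tool keeps a port number or a "www." prefix.

```python
from urllib.parse import urlparse


def output_filename(start_url):
    """Build '<website_host>_schema.json' from the URL passed with -u."""
    # Assumption: the host is used as-is, lowercased and without any port.
    host = urlparse(start_url).hostname
    if not host:
        raise ValueError(f"no host found in URL: {start_url!r}")
    return f"{host}_schema.json"
```

Under these assumptions, the start URL https://www.ebi.ac.uk/biosamples/samples would map to www.ebi.ac.uk_schema.json.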
- -p: Stay on the current path. Use this when crawling a page like https://www.ebi.ac.uk/biosamples/samples and you do not want the crawler to follow links into the rest of the website, e.g. https://www.ebi.ac.uk.
- -m: Maximum recursion depth for visited URLs. The default is unlimited depth. (The crawler does not revisit URLs.)
- -e: Adds crawled data to an Elasticsearch (v6) service at http://127.0.0.1:9200.
- -u: URL of the page where crawling starts.
- -q: Remove the query section from the URLs of the links found.
- --query: Use with -q to follow only links that contain the given query word, e.g.,
./bioschemas-gocrawlit_mac_64 -u https://tess.elixir-europe.org/events -q --query page
- -h: Print Help and exit.
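The behaviour described for -q and -p can be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual logic: strip_query and stays_on_path are hypothetical names, and treating "stay on current path" as a same-host, path-prefix check is a guess based on the option description.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query section of a link, as described for -q."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def stays_on_path(start_url, candidate_url):
    """Return True if candidate_url lies at or below the path of start_url.

    Assumption: -p means same host plus a path-prefix match.
    """
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)
    if candidate.netloc != start.netloc:
        return False
    prefix = start.path.rstrip("/")
    return candidate.path == prefix or candidate.path.startswith(prefix + "/")
```

With the start page https://www.ebi.ac.uk/biosamples/samples, a link to a page below /biosamples/samples would be kept by stays_on_path, while a link to another section of https://www.ebi.ac.uk would be skipped.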
To create a binary for your current OS, use:
make build
To create binaries for Windows, macOS and Linux, use:
make build-all
The binaries are placed under the build/ directory.
Elasticsearch quick setup with Docker
Steps for starting dockerized Elasticsearch and Kibana locally. This requires Docker.
docker network create elastic-stack
docker run -it --network=elastic-stack -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch docker.elastic.co/elasticsearch/elasticsearch:6.2.4
Avoid changing the container's name, since the Kibana Docker image points by default to http://elasticsearch:9200.
docker run --network=elastic-stack --rm -it -p 5601:5601 --name kibana docker.elastic.co/kibana/kibana:6.2.4
Remember that the --rm flag deletes the container once it is stopped.
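Before running the crawler with -e, it can help to confirm that Elasticsearch is reachable on the published port. The sketch below is one possible check, assuming the service answers at http://127.0.0.1:9200 and that its root endpoint returns a JSON document containing a version number, as Elasticsearch does. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Address taken from the -e option description.
ES_URL = "http://127.0.0.1:9200"


def parse_version(body):
    """Extract the version string from the JSON returned by the root endpoint."""
    return json.loads(body)["version"]["number"]


def elasticsearch_version(url=ES_URL, timeout=5):
    """Query a running Elasticsearch instance and return its version.

    Raises urllib.error.URLError if the service is not reachable.
    """
    with urlopen(url, timeout=timeout) as response:
        return parse_version(response.read())


# Example (needs the Elasticsearch container to be running):
#   print(elasticsearch_version())
```

With the image tag used above, the reported version should be 6.2.4; a connection error usually means the container is not running or port 9200 is not published.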
- Crawl website
- URL by command line parameters
- JSON-LD Extraction
- Microdata extraction
- Better file output
- Sitemap.xml Crawl option
- Pagination option
- Connecting to a flexible storage
- RDFa extraction support
- Writing the file as it scrapes