BIOSCHEMAS.ORG GO CRAWL IT!

Crawls a given website and extracts bioschemas.org/schema.org JSON-LD and Microdata from it. The extracted information is stored in a JSON file and can optionally be sent to a local Elasticsearch service.
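To make the extraction step more concrete, here is a minimal sketch of pulling JSON-LD blocks out of a page. It is an illustration only and is not taken from the tool's source: the function name extractJSONLD and the regular expression are assumptions, and a real crawler would more likely use an HTML parser. Microdata extraction is not covered.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// jsonLDScript matches <script type="application/ld+json"> ... </script>.
// Using a regular expression on HTML is a simplification for this sketch.
var jsonLDScript = regexp.MustCompile(`(?is)<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>`)

// extractJSONLD (hypothetical helper) returns every JSON-LD block in page
// that parses as a single JSON object.
func extractJSONLD(page string) []map[string]interface{} {
	var items []map[string]interface{}
	for _, m := range jsonLDScript.FindAllStringSubmatch(page, -1) {
		var item map[string]interface{}
		if err := json.Unmarshal([]byte(m[1]), &item); err != nil {
			continue // skip blocks that are not a single valid JSON object
		}
		items = append(items, item)
	}
	return items
}

func main() {
	page := `<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Event", "name": "Example course"}
</script>
</head><body></body></html>`

	for _, item := range extractJSONLD(page) {
		fmt.Println(item["@type"], item["name"])
	}
}
```

Top-level JSON arrays are skipped here to keep the sketch short; handling them would need a second decoding attempt into a slice.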

How to use it:

Usage examples:

./bioschemas-gocrawlit_mac_64 -p -u "https://www.ebi.ac.uk/biosamples/samples"
./bioschemas-gocrawlit_mac_64 -q -u https://tess.elixir-europe.org/sitemaps/events.xml
./bioschemas-gocrawlit_mac_64 -u http://159.149.160.88/pscan_chip_dev/

A folder named "bioschemas_gocrawlit_cache" is created in the directory where the command is run. It caches crawled pages so that they are not downloaded more than once. It is safe to delete this folder.

Output

Scraped data is stored in a JSON file named <website_host>_schema.json in the current program folder.
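As an illustration of the naming scheme, the sketch below derives the file name from a start URL. The helper outputFileName is hypothetical, and it assumes the host is used without its port; the tool itself may treat ports or other edge cases differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// outputFileName (hypothetical helper) builds "<website_host>_schema.json"
// from a start URL. Assumption: the host is taken without the port.
func outputFileName(startURL string) (string, error) {
	u, err := url.Parse(startURL)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", fmt.Errorf("no host in URL %q", startURL)
	}
	return u.Hostname() + "_schema.json", nil
}

func main() {
	name, err := outputFileName("https://www.ebi.ac.uk/biosamples/samples")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name) // www.ebi.ac.uk_schema.json
}
```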

Available commands

-p: Stay on the current path. Use it when crawling a page such as https://www.ebi.ac.uk/biosamples/samples and you do not want the crawler to cover the whole website, e.g. https://www.ebi.ac.uk.
-m: Maximum recursion depth for visited URLs. Default: unlimited. (The crawler does not revisit URLs.)
-e: Adds crawled data to an Elasticsearch (v6) service at http://127.0.0.1:9200.
-u: URL of the page where crawling starts.
-q: Remove the query section from the link URLs found.
--query: Use with -q to follow only links that contain the given query word, e.g., ./bioschemas-gocrawlit_mac_64 -u https://tess.elixir-europe.org/events -q --query page
-h: Print Help and exit.

Building binaries

To create a binary for your current OS use:

make build

To create binaries for Windows, macOS and Linux use:

make build-all

The binaries are placed under the build/ path.

Elasticsearch quick setup DOCKER

Steps for starting dockerized Elasticsearch and Kibana locally. This requires Docker.

Create a custom network for your elastic-stack:

docker network create elastic-stack

Pull and run an Elasticsearch image:

docker run -it --network=elastic-stack -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch docker.elastic.co/elasticsearch/elasticsearch:6.2.4

Avoid changing the container's name, since the Kibana Docker image points to http://elasticsearch:9200 by default.

Pull and run a Kibana image:

docker run --network=elastic-stack --rm -it -p 5601:5601 --name kibana docker.elastic.co/kibana/kibana:6.2.4

Remember that the --rm flag deletes the container once it is stopped.
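Before running the crawler with -e, it can help to confirm that Elasticsearch answers at http://127.0.0.1:9200. The sketch below requests the root endpoint and prints the cluster name and version number found in its JSON response. The names rootInfo, parseRootInfo and fetchRootInfo are made up for this example and are not part of the crawler.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// rootInfo holds the fields of interest from the Elasticsearch root response.
type rootInfo struct {
	ClusterName string `json:"cluster_name"`
	Version     struct {
		Number string `json:"number"`
	} `json:"version"`
}

// parseRootInfo decodes the JSON body returned by GET /.
func parseRootInfo(body []byte) (rootInfo, error) {
	var info rootInfo
	err := json.Unmarshal(body, &info)
	return info, err
}

// fetchRootInfo requests the root endpoint of the service at baseURL.
func fetchRootInfo(baseURL string) (rootInfo, error) {
	client := http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL)
	if err != nil {
		return rootInfo{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return rootInfo{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return rootInfo{}, err
	}
	return parseRootInfo(body)
}

func main() {
	info, err := fetchRootInfo("http://127.0.0.1:9200")
	if err != nil {
		fmt.Println("Elasticsearch is not reachable:", err)
		return
	}
	fmt.Printf("cluster %q, version %s\n", info.ClusterName, info.Version.Number)
}
```

With the image tag used above, the reported version should be 6.2.4, which matches the v6 service the -e option expects.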

ricardoaat/bioschemas-gocrawlit
