#How to configure and run the parser
##Required modules
The following Python modules need to be installed:
- RDFLib (https://github.com/RDFLib/rdflib),
- PDFMiner (http://www.unixuser.org/~euske/python/pdfminer/),
- Grab (http://grablib.org/),
- PyPDF2 (https://github.com/mstamy2/PyPDF2).
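
Before running the parser, it can help to confirm that the modules above are importable. The following is a minimal sketch, not part of the project: the helper `missing_modules` is hypothetical, the import names (`rdflib`, `pdfminer`, `grab`, `PyPDF2`) are assumed from how these packages are usually imported, and `importlib.util.find_spec` requires Python 3.4 or later.

```python
import importlib.util

# Import names are an assumption; they are not listed in this README.
REQUIRED_MODULES = ["rdflib", "pdfminer", "grab", "PyPDF2"]


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are installed.")
```
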
##Configuration
All configuration settings should be in the `config.py` file, which should be created by renaming `config.py.example`.
###Input URLs
The input URLs are set as a Python list in the `input_urls` variable.
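
As an illustration, the corresponding part of `config.py` might look like the sketch below. The two URLs are placeholders assumed for this example and are not taken from the project; replace them with the pages you want to parse.

```python
# Hypothetical excerpt of config.py.
# The URLs below are placeholders, not values shipped with the project.
input_urls = [
    "http://ceur-ws.org/Vol-1/",
    "http://ceur-ws.org/Vol-2/",
]
```
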
###DBpedia dataset (with countries and universities)
The parser uses DBpedia to extract the names of countries and universities, together with their DBpedia URIs.
There are three options:
- use the original dataset. This is the default, so nothing needs to be configured,
- use OpenLink's mirror. In this case `sparqlstore['dbpedia_url']` should be changed to `http://lod.openlinksw.com/sparql`,
- use a local dump. This is the preferred option because it should be much faster and more stable. `sparqlstore['dbpedia_url']` should be set to the local SPARQL endpoint, and the RDF files `dumps/dbpedia_country.xml` and `dumps/dbpedia_universities.xml` should be uploaded to it. See the wiki for the steps to generate the DBpedia dumps.
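
The second and third options come down to one assignment in `config.py`. The sketch below assumes that `sparqlstore` is a plain dictionary, as the subscript syntax suggests; only the `dbpedia_url` key is named in this README, and the local endpoint address is a placeholder. For the first option, leave the value from `config.py.example` untouched.

```python
# Hypothetical excerpt of config.py. Only the 'dbpedia_url' key is named in
# this README; any other keys of the real sparqlstore dict are not shown here.
sparqlstore = {}

# Option 2: OpenLink's mirror.
sparqlstore['dbpedia_url'] = 'http://lod.openlinksw.com/sparql'

# Option 3: a local SPARQL endpoint (placeholder address, adjust to your setup).
# sparqlstore['dbpedia_url'] = 'http://localhost:8890/sparql'
```
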
###Run
Once you have finished the configuration, execute the following script:

    python CeurWsParser/spider.py

The resulting dataset will be written to the `rdfdb.ttl` file.
#Queries
The SPARQL queries for Task 1 were created by translating the human-readable queries into SPARQL using our data model. The queries are in the wiki.
#Contacts
- Maxim Kolchin (kolchinmax@gmail.com)
- Fedor Kozlov (kozlovfedor@gmail.com)