This project implements FitLayout-based applications and tools for automatic information extraction from the CEUR-WS.org workshop proceedings pages. The tools were created as a proposed solution to Task 1 of the Semantic Publishing Challenge 2015, co-located with the Extended Semantic Web Conference 2015.
The whole package is built using Maven. Use mvn package to create the runnable SemPub2015Extractor.jar.
For convenience, the compiled program is also included in the repository at /target/SemPub2015Extractor.jar.
Run the extraction tool using
java -jar SemPub2015Extractor.jar
This will start a FitLayout JavaScript console. Use the help() command to obtain more information.
Option 1. To accomplish SemPub2015 Task 1, use the following commands:
processEvaluationSet();
transformToDomain();
Option 2. To process all the workshops located at CEUR-WS.org, use the following commands:
processAllData();
transformToDomain();
Option 3. To process a single volume, such as http://ceur-ws.org/Vol-1/, use the following commands:
processPage('http://ceur-ws.org/Vol-1/');
transformToDomain();
The program stores the generated data in Blazegraph; for detailed information see About_Blazegraph. It assumes the Blazegraph storage to be running at http://localhost:9999/blazegraph; you can use storage.connect() to connect to another repository. You can get the latest Blazegraph release from https://www.blazegraph.com/download/.
After this, the storage should contain the complete extracted data.
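As a quick sanity check, you can query the repository directly over the standard SPARQL protocol. The following is a minimal sketch, assuming the default Blazegraph SPARQL endpoint at http://localhost:9999/blazegraph/sparql (adjust the URL if your installation differs) and the Python requests package:

```python
# Minimal sanity check: count the triples stored by the extractor.
# The endpoint URL below is an assumption (Blazegraph's usual default);
# adjust it to match your installation.
import requests

ENDPOINT = "http://localhost:9999/blazegraph/sparql"
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

resp = requests.post(
    ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
print("Triples stored:", resp.json()["results"]["bindings"][0]["n"]["value"])
```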
- You can add the license information of the workshops through the Update tab of the Blazegraph web interface; the license file is license.ttl.
- The transformed data contains a lot of non-relevant information, such as HTML element details. You can use the provided Python script serializer.py to obtain a clean dataset (see the sketch after this list).
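The provided serializer.py is the intended way to produce the clean dataset. Purely as an illustration of the idea, the sketch below uses a SPARQL CONSTRUCT query to export only triples whose predicates start with a chosen namespace and saves the result as Turtle; the endpoint URL and the namespace filter are placeholder assumptions, not the actual behaviour of serializer.py:

```python
# Illustrative sketch only -- use the provided serializer.py for the real cleanup.
# Exports triples whose predicate IRI starts with a placeholder namespace
# and writes them to a Turtle file.
import requests

ENDPOINT = "http://localhost:9999/blazegraph/sparql"  # assumed default endpoint
QUERY = """
CONSTRUCT { ?s ?p ?o }
WHERE {
  ?s ?p ?o .
  FILTER(STRSTARTS(STR(?p), "http://ceur-ws.org/"))  # placeholder namespace filter
}
"""

resp = requests.post(
    ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "text/turtle"},
)
resp.raise_for_status()
with open("clean-dataset.ttl", "w", encoding="utf-8") as f:
    f.write(resp.text)
```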
The SPARQL queries corresponding to the individual SemPub2015 queries are located in sparql/ESWC2015-queries.txt.
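To run one of these queries against the local repository, the standard SPARQL protocol can be used again. The sketch below assumes the default endpoint URL and that you paste a single query copied from sparql/ESWC2015-queries.txt into the QUERY string; the query shown is only a placeholder:

```python
# Run a single SemPub2015 query against the local Blazegraph repository.
# Replace the placeholder QUERY with one query copied from
# sparql/ESWC2015-queries.txt.
import requests

ENDPOINT = "http://localhost:9999/blazegraph/sparql"  # assumed default endpoint
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"  # placeholder query

resp = requests.post(
    ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print({var: binding["value"] for var, binding in row.items()})
```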
The transformation query from the domain-independent logical model to the domain-dependent CEUR workshop ontology is located in logicalTree2domain.sparql. The transformation itself is included in the transformToDomain() call, so it is not necessary to execute this query manually.
The related publication is the following:
MILIČKA Martin and BURGET Radek. Information Extraction from Web Sources based on Multi-aspect Content Analysis. In: Semantic Web Evaluation Challenges, SemWebEval 2015 at ESWC 2015. Portorož: Springer International Publishing, 2015, pp. 81-92. ISBN 978-3-319-25517-0. ISSN 1865-0929.
Detailed license information is contained in the LICENSE file.
This work was supported by the BUT FIT grant FIT-S-14-2299 and the IT4Innovations Centre of Excellence CZ.1.05/1.1.00/02.0070.