WARCProcessor

Usage

java -jar WarcProject-4.X.X-X.jar [--help] [--nogui] [--config ]

--help: Show the startup options
--nogui: Executes WARCProcessor without gui
--config : Starts WARCProcessor using the specified configuration file

Updates

Version 4

Modifications in the view to include links to the directory of the datasource and to the output dir. A keyboard control is included to move through the menu tree and there is a link from the main menu to create datasources
Datasources can be disables through gui
A corpus can be created from CSV and multiple WARC or WARC.gz files
Executing without gui (using GetOpt)
Filtering data using idiom
A simple executable that can be executed through java -jar
Unified property names in CSV and ARFF datasources.
The SpamValue is now included in ARFF file
Save the config with a own extension for filenames

Version 3

Graphical interface added
Added DataSource for Wark directories (WarcDS). In contrast to CorpusDS, this datasource search warc files in a directory and let user to indicate whether the warc files included in the directory are ham or spam
Solved issue Cod. 2.01. When the dept is greater than 0 in the crawler, the searched links are the ones that belongs to the original websitees.
Solved issue Cod. 2.02. In the datasource we should be able to stablish the parameter "spamCol" with the value of a CSV field indicating whether the URL of this row belongs to a ham or spam website. Additionally, in the parameter "spamColValue" we should speficy the value for the field when the ROW is spam (distinct values will be considered as ham)
Solved issue Cod. 2.03. In the datasource CSVDS we have included the parameter "fieldSeparator" where we can configure the filed used to keep columns separated.

Version 2

Added support for new input formats (datasources): Arff, CSV, Corpus Warc
Modified the configuration file of the application to adjust the Datasources

Third party files

Icons: www.aha-soft.com - Creative Commons Attribution-Share Alike 3.0 License Thanks for the work.

Dependencies

# WarcProject dependencies
commons-beanutils-1.9.2.jar
commons-validator-1.4.0.jar

# crawler4j dependencies
httpcore-4.2.2.jar
commons-logging-1.1.1.jar
httpclient-4.2.3.jar
commons-codec-1.6.jar
je-4.0.92.jar
tika-parsers-1.0.jar
boilerpipe-1.1.0.jar
tagsoup-1.2.1.jar
metadata-extractor-2.4.0-beta-1.jar
asm-3.1.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
commons-compress-1.3.jar
apache-mime4j-dom-0.7.jar
apache-mime4j-core-0.7.jar
tika-core-1.0.jar
log4j-1.2.14.jar

# heritrix dependencies
# http://sourceforge.net/projects/archive-crawler/files/heritrix3
heritrix-commons-3.1.0.jar
fastutil-5.0.7.jar
commons-io-1.4.jar

# Kryo dependencies
# https://code.google.com/p/kryo/
# Used to serialization for Warc reader
kryo-1.04-all.jar

# Weka dependencies
# http://weka.wikispaces.com/
weka.jar

# Lucene dependencies
luceneTrigramsLanguageGuesser.jar

# Jsoup dependencies
# http://jsoup.org/
# Used to extract text from html
jsoup-1.8.1.jar

# GetOpt dependencies
# www.urbanophile.com/arenn/hacking/download.html
# Used to get options from command line
java-getopt-1.0.13.jar

License

See LICENSE

Authors

Miguel Cayon Álvarez
Ivan Ferrant
Rosalía Laza Fidalgo
Reyes Pavón Rial
José Ramón Méndez Reboredo
David Ruano Ordás
José Ramón Méndez Reboredo
Florentino Fernández Riverola

Name		Name	Last commit message	Last commit date
Latest commit History 94 Commits
.settings		.settings
in		in
lib/lucene/language_guesser/1.1		lib/lucene/language_guesser/1.1
out		out
src		src
.gitignore		.gitignore
.project		.project
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WARCProcessor

Usage

Updates

Version 4

Version 3

Version 2

Third party files

Dependencies

License

Authors

About

Releases 1

Packages

Contributors 5

Languages

License

sing-group/WARCProcessor

Folders and files

Latest commit

History

Repository files navigation

WARCProcessor

Usage

Updates

Version 4

Version 3

Version 2

Third party files

Dependencies

License

Authors

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 5

Languages

Packages