From 1f443e89c9c99f41a08d7c3eb63a73ba260c83d0 Mon Sep 17 00:00:00 2001 From: nruest Date: Sat, 6 Jun 2020 20:53:10 -0400 Subject: [PATCH 1/2] [Skip Travis] Trim README down given aut.docs.archivesunleashed.org --- README.md | 235 +++--------------------------------------------------- 1 file changed, 12 insertions(+), 223 deletions(-) diff --git a/README.md b/README.md index 10124f36..7ab83a89 100644 --- a/README.md +++ b/README.md @@ -8,51 +8,22 @@ [![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat)](https://www.apache.org/licenses/LICENSE-2.0) [![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md) -The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing. This toolkit is part of the [Archives Unleashed Project](http://archivesunleashed.org/). +The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing. The Toolkit is part of the [Archives Unleashed Project](http://archivesunleashed.org/). The following two articles provide an overview of the project: -+ Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cultural Heritage_, 10(4), Article 22, 2017. -+ Nick Ruest, Jimmy Lin, Ian Milligan, Samantha Fritz. [The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives](https://arxiv.org/abs/2001.05399). 2020. ++ Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. [The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives](https://yorkspace.library.yorku.ca/xmlui/handle/10315/37506). Proceedings of the 2020 IEEE/ACM Joint Conference on Digital Libraries (JCDL 2020), Wuhan, China. ++ Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cul tural Heritage_, 10(4), Article 22, 2017. ## Dependencies -### Java +- Java 8 +- Python 3.6+ (PySpark) +- Apache Spark 2.4+ -The Archives Unleashed Toolkit requires Java 8. + More information on setting up dependencies can be found [here](https://aut.docs.archivesunleashed.org/docs/dependencies). -For macOS: You can find information on Java [here](https://java.com/en/download/help/mac_install.xml). We recommend [OpenJDK](https://adoptopenjdk.net/). The easiest way is to install with [homebrew](https://brew.sh) and then: - -```bash -brew cask install adoptopenjdk/openjdk/adoptopenjdk8 -``` - -If you run into difficulties with homebrew, installation instructions can be found [here](https://adoptopenjdk.net/). - -On Debian based system you can install Java using `apt`: - -```bash -apt install openjdk-8-jdk -``` - -Before `spark-shell` can launch, `JAVA_HOME` must be set. If you receive an error that `JAVA_HOME` is not set, you need to point it to where Java is installed. On Linux, this might be `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64` or on macOS it might be `export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home`. 
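For example, a minimal sketch of checking and setting `JAVA_HOME` (the Linux path below is the one given in the paragraph above; on macOS the `java_home` helper can resolve the install location for you):

```bash
# Confirm that Java 8 is installed and on the PATH
java -version

# Linux: point JAVA_HOME at the OpenJDK 8 install
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# macOS: let java_home locate the requested JDK version
# export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
```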
-
-### Python
-
-If you would like to use the Archives Unleashed Toolkit with PySpark and Jupyter Notebooks, you'll need to have a modern version of Python installed. We recommend using the [Anaconda Distribution](https://www.anaconda.com/distribution). This _should_ install Jupyter Notebook, as well as the PySpark bindings. If it doesn't, you can install either with `conda install` or `pip install`.
-
-### Apache Spark
-
-Download and unzip [Apache Spark](https://spark.apache.org) to a location of your choice.
-
-```bash
-curl -L "https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz" > spark-2.4.5-bin-hadoop2.7.tgz
-tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
-```
-
-## Getting Started
-
-### Building Locally
+## Building

Clone the repo:

@@ -66,198 +37,16 @@ You can then build The Archives Unleashed Toolkit.
mvn clean install
```

-### Archives Unleashed Toolkit with Spark Submit
-
-The Toolkit offers a variety of extraction jobs with [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html). These extraction jobs have a few configuration options.
-
-The extraction jobs have a basic outline of:
-
-```shell
-spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR --extractor EXTRACTOR --input INPUT DIRECTORY --output OUTPUT DIRECTORY
-```
-
-Additional flags include:
-
-* `--output-format FORMAT` (`csv` (default) or `parquet`. `DomainGraphExtractor`
-  has two additional output options, `graphml` or `gexf`.)
-* `--split` (The extractor will put results for each input file in its own
-  directory. Each directory name will be the name of the ARC/WARC file parsed.)
-* `--partition N` (The extractor will partition the RDD or DataFrame according to N
-  before writing results. This is useful for combining all of the results into a
-  single file.)
-
-Available extraction jobs:
-
-- `AudioInformationExtractor`
-- `DomainFrequencyExtractor`
-- `DomainGraphExtractor`
-- `ImageGraphExtractor`
-- `ImageInformationExtractor`
-- `PDFInformationExtractor`
-- `PlainTextExtractor`
-- `PresentationProgramInformationExtractor`
-- `SpreadsheetInformationExtractor`
-- `VideoInformationExtractor`
-- `WebGraphExtractor`
-- `WebPagesExtractor`
-- `WordProcessorInformationExtractor`
-
-More documentation on using the Toolkit with `spark-submit` can be found [here](https://github.com/archivesunleashed/aut-docs/blob/master/current/aut-spark-submit-app.md).
-
-### Archives Unleashed Toolkit with Spark Shell
-
-There are two options for loading the Archives Unleashed Toolkit. The advantages and disadvantages of each option will depend on your setup (single machine vs. cluster):
-
-```shell
-spark-shell --help
-
-  --jars JARS                 Comma-separated list of jars to include on the driver
-                              and executor classpaths.
-  --packages                  Comma-separated list of maven coordinates of jars to include
-                              on the driver and executor classpaths. Will search the local
-                              maven repo, then maven central and any additional remote
-                              repositories given by --repositories. The format for the
-                              coordinates should be groupId:artifactId:version.
-```
-
-#### As a package
-
-Release version:
-
-```shell
-spark-shell --packages "io.archivesunleashed:aut:0.80.0"
-```
-
-HEAD (built locally):
-
-```shell
-spark-shell --packages "io.archivesunleashed:aut:0.80.1-SNAPSHOT"
-```
-
-#### With an UberJar
-
-Release version:
-
-```shell
-spark-shell --jars /path/to/aut-0.80.0-fatjar.jar
-```
-
-HEAD (built locally):
-
-```shell
-spark-shell --jars /path/to/aut/target/aut-0.80.1-SNAPSHOT-fatjar.jar
-```
-
-### Archives Unleashed Toolkit with PySpark
-
-To run PySpark with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala package can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.80.0/aut-0.80.0.zip), or [built locally](#building-locally) (the zip file will be found in the `target` directory).
-
-In each of the examples below, `/path/to/python` is a placeholder. If you are unsure where your Python is, it can be found with `which python`.
-
-#### As a package
-
-Release version:
-
-```shell
-export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.80.0.zip --packages "io.archivesunleashed:aut:0.80.0"
-```
-
-HEAD (built locally):
-
-```shell
-export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.80.1-SNAPSHOT"
-```
-
-#### With an UberJar
-
-Release version:
-
-```shell
-export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.80.0.zip --jars /path/to/aut-0.80.0-fatjar.jar
-```
-
-HEAD (built locally):
-
-```shell
-export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.80.1-SNAPSHOT-fatjar.jar
-```
-
-### Archives Unleashed Toolkit with Jupyter
-
-To run a [Jupyter Notebook](https://jupyter.org/install) with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala package can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.80.0/aut-0.80.0.zip), or [built locally](#building-locally) (the zip file will be found in the `target` directory).
-
-#### As a package
-
-Release version:
-
-```shell
-export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.80.0.zip --packages "io.archivesunleashed:aut:0.80.0"
-```
-
-HEAD (built locally):
-
-```shell
-export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.80.1-SNAPSHOT"
-```
-
-#### With an UberJar
-
-Release version:
-
-```shell
-export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.80.0.zip --jars /path/to/aut-0.80.0-fatjar.jar
-```
-
-HEAD (built locally):

-```shell
-export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.80.1-SNAPSHOT-fatjar.jar
-```
-
-A Jupyter Notebook _should_ automatically load in your browser at `http://localhost:8888`. You may be asked for a token upon first launch, which just offers a bit of security. The token is available in the load screen and will look something like this:
-
-```
-[I 19:18:30.893 NotebookApp] Writing notebook server cookie secret to /run/user/1001/jupyter/notebook_cookie_secret
-[I 19:18:31.111 NotebookApp] JupyterLab extension loaded from /home/nruest/bin/anaconda3/lib/python3.7/site-packages/jupyterlab
-[I 19:18:31.111 NotebookApp] JupyterLab application directory is /home/nruest/bin/anaconda3/share/jupyter/lab
-[I 19:18:31.112 NotebookApp] Serving notebooks from local directory: /home/nruest/Projects/au/aut
-[I 19:18:31.112 NotebookApp] The Jupyter Notebook is running at:
-[I 19:18:31.112 NotebookApp] http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
-[I 19:18:31.112 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
-[C 19:18:31.140 NotebookApp]
-
-    To access the notebook, open this file in a browser:
-        file:///run/user/1001/jupyter/nbserver-9702-open.html
-    Or copy and paste one of these URLs:
-        http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
-```
-
-Create a new notebook by clicking “New” (near the top right of the Jupyter homepage) and selecting “Python 3” from the drop-down list.
-
-The notebook will open in a new window. In the first cell enter:
-
-```python
-from aut import *
-
-archive = WebArchive(sc, sqlContext, "src/test/resources/warc/")
-
-webpages = archive.webpages()
-webpages.printSchema()
-```
-
-Then hit Shift+Enter, or press the play button.
+## Usage

-If you receive no errors, and see the following, you are ready to begin working with your web archives!
+The Toolkit can be used to submit a variety of extraction jobs with `spark-submit`, as well as used as a library via `spark-shell`, `pyspark`, or in your own application. More information on using the Toolkit can be found [here](https://aut.docs.archivesunleashed.org/docs/usage).

-![](https://user-images.githubusercontent.com/218561/63203995-42684080-c061-11e9-9361-f5e6177705ff.png)

-# License
+## License

Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).

-# Acknowledgments
+## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/).
Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), and the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/). From e9dc3f65503299e48f9d68b11cd7f0f23126921e Mon Sep 17 00:00:00 2001 From: nruest Date: Mon, 8 Jun 2020 06:19:15 -0400 Subject: [PATCH 2/2] review --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 7ab83a89..6eb9b09c 100644 --- a/README.md +++ b/README.md @@ -10,10 +10,12 @@ The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing. The Toolkit is part of the [Archives Unleashed Project](http://archivesunleashed.org/). +Learn more about the Toolkit and how to use it by visiting our [comprehensive documentation](https://aut.docs.archivesunleashed.org/). + The following two articles provide an overview of the project: + Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. [The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives](https://yorkspace.library.yorku.ca/xmlui/handle/10315/37506). Proceedings of the 2020 IEEE/ACM Joint Conference on Digital Libraries (JCDL 2020), Wuhan, China. -+ Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cul tural Heritage_, 10(4), Article 22, 2017. ++ Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cultural Heritage_, 10(4), Article 22, 2017. ## Dependencies
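For orientation, a minimal sketch of a single extraction job following the `spark-submit` outline described in the Usage section: the jar path and input/output directories below are placeholders, and `DomainFrequencyExtractor` is one of the available extraction jobs listed earlier; any other extractor from that list can be substituted.

```shell
spark-submit \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  /path/to/aut-0.80.0-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input /path/to/warcs \
  --output /path/to/output
```

The optional `--output-format`, `--split`, and `--partition` flags described above can be appended to the same invocation.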