This repository pools code and information for a project using Hadoop MapReduce in a graduate Big Data course.
The instructions for the project are in Assignment2.pdf. The project uses Cloudera Hadoop running on VMware; the Eclipse IDE included with the Cloudera Quickstart VM was used for development.
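For context, a Hadoop MapReduce driver class for a job like this typically looks roughly as follows. This is a minimal sketch only: the mapper/reducer names are placeholders, and the real AssignmentDriver is defined by the assignment (see Assignment2.pdf) and may configure multiple jobs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical skeleton of a MapReduce driver; not the actual assignment code.
public class AssignmentDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "assignment2");
        job.setJarByClass(AssignmentDriver.class);
        // Placeholder mapper/reducer classes -- substitute the real ones:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths come from the command line,
        // e.g. /user/cloudera/input/ and /user/cloudera/output/
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the output path must not already exist in HDFS; Hadoop refuses to overwrite an existing output directory.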
To run this as a JAR file in Cloudera, follow these steps:
- Export the project as a JAR file (deselect 20-newsgroups from the export list) into the cloudera folder.
- Make sure a copy of the 20-newsgroups dataset is on the Desktop. If you haven't made this copy yet, do so now.
- Ensure that the HDFS and YARN services are running, and that the NameNode has left safe mode.
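You can check these preconditions from the terminal. The service names below are the standard ones on the Cloudera Quickstart VM; adjust them if your setup differs.

```shell
# Check that the HDFS and YARN daemons are running
sudo service hadoop-hdfs-namenode status
sudo service hadoop-yarn-resourcemanager status

# Confirm the NameNode has left safe mode (should print "Safe mode is OFF")
hdfs dfsadmin -safemode get
```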
- In the terminal, run the following command to make a directory to house the input:
hdfs dfs -mkdir /user/cloudera/input/
- Next, load the data from 20-newsgroups into the input directory:
hdfs dfs -copyFromLocal ~/Desktop/20-newsgroups/ /user/cloudera/input/
The terminal will likely print a stream of warnings during the copy (InterruptedException warnings, in my experience), but don't be alarmed: the copy is proceeding normally. Let it finish; it may take a while.
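Once the copy finishes, you can sanity-check that the data landed in HDFS (the paths below match the commands above):

```shell
# List the newsgroup folders that were copied into HDFS
hdfs dfs -ls /user/cloudera/input/20-newsgroups/

# Report directory, file, and byte counts to compare against the local copy
hdfs dfs -count /user/cloudera/input/20-newsgroups/
```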
- To run the JAR file, enter the following in the terminal:
hadoop jar Assignment2.jar AssignmentDriver /user/cloudera/input/ /user/cloudera/output/
If all goes well, this will produce the desired outputs in /user/cloudera/output/. Check the results with the hdfs dfs -ls command.
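For example, to list and inspect the job's output (part-r-* is the standard name Hadoop gives reducer output files; the exact file names depend on the job):

```shell
# List the output directory; a _SUCCESS marker indicates the job completed
hdfs dfs -ls /user/cloudera/output/

# Print the contents of the first reducer's output file
hdfs dfs -cat /user/cloudera/output/part-r-00000
```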