- Project Description
- Installing & Configuring Hadoop
- Running K-Means on Hadoop
- Results
- Team
- External Resources
The aim of this project is to implement the k-means clustering algorithm on Hadoop using synthetic data as a sample. The project was implemented in the context of the course "Big Data Management Systems" taught by Prof. Damianos Chatziantoniou. A detailed description of the assignment can be found here.
1. We assume that Python3 is already installed on the system.
2. Install Hadoop on Ubuntu following the guide "How to install Hadoop on Ubuntu 18.04 Bionic Beaver Linux".
3. Install the necessary requirements:
$ pip install -r requirements.txt
1. Clone this repository:
$ git clone https://github.com/ChryssaNab/BDMS-AUEB.git
$ cd BDMS-AUEB/kmeans_mapreduce/src/
2. Run generateDataset.py to create the input data points:
$ python3 generateDataset.py
- In this example, the initial cluster centers are (-100000, -100000), (1, 1), and (100000, 100000).
- The remaining data points are generated around these centers following a normal distribution with a standard deviation of 5.0.
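The generation step above can be sketched as follows. The three centers and the standard deviation come from the description; the number of points per center, the random seed, and the output filename are illustrative assumptions, not necessarily what generateDataset.py uses:

```python
import csv
import numpy as np

# Centers and standard deviation as described in this README.
CENTERS = [(-100000, -100000), (1, 1), (100000, 100000)]
STD_DEV = 5.0
POINTS_PER_CENTER = 1000  # hypothetical count, for illustration

rng = np.random.default_rng(42)  # fixed seed for reproducibility (assumption)
points = []
for cx, cy in CENTERS:
    # Draw 2-D points from a normal distribution around each center.
    cluster = rng.normal(loc=(cx, cy), scale=STD_DEV, size=(POINTS_PER_CENTER, 2))
    points.extend(cluster.tolist())

with open("data-points.csv", "w", newline="") as f:
    csv.writer(f).writerows(points)
```

Each row of the resulting CSV is one `x,y` point; the tight spread (std 5.0 versus centers 100000 apart) makes the three clusters trivially separable, which is convenient for verifying the MapReduce output.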
3. Upload the data to HDFS:
$ hdfs dfs -mkdir /kmeans
$ hdfs dfs -put $HADOOP_HOME/localFilePath/data-points.csv /kmeans
4. Run kMeansRunner.py to deploy k-means on Hadoop:
$ python3 kMeansRunner.py
The output of the MapReduce process is also stored on HDFS, in a file named part-00000.
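Conceptually, each k-means iteration on Hadoop maps every point to its nearest center and reduces by averaging the points assigned to each center. The sketch below is a simplified in-memory illustration of that map/reduce logic under those assumptions — it is not the repository's actual kMeansRunner.py implementation, and the function names and sample data are hypothetical:

```python
import numpy as np

def mapper(point, centers):
    """Map step: emit (index of nearest center, point)."""
    dists = [np.linalg.norm(np.asarray(point) - np.asarray(c)) for c in centers]
    return int(np.argmin(dists)), point

def reducer(center_idx, assigned_points):
    """Reduce step: the new center is the mean of its assigned points."""
    return center_idx, np.mean(assigned_points, axis=0).tolist()

# Tiny in-memory simulation of one iteration (illustrative data).
centers = [(-100000, -100000), (1, 1), (100000, 100000)]
data = [(-99998, -100001), (2, 0), (1, 2), (99997, 100003)]

# Shuffle phase stand-in: group mapper output by key.
groups = {}
for p in data:
    idx, pt = mapper(p, centers)
    groups.setdefault(idx, []).append(pt)

new_centers = dict(reducer(i, pts) for i, pts in groups.items())
```

On Hadoop this loop is driven from outside the job: after each iteration the updated centers are fed back in, and the process repeats until the centers stop moving (or a maximum iteration count is reached).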