- Project Description
- Installing & Configuring Hadoop
- Running K-Means on Hadoop
- Results
- Team
- External Resources
The aim of this project is to implement the k-means clustering algorithm on Hadoop using synthetic data as a sample. The project was implemented in the context of the course "Big Data Management Systems" taught by Prof. Damianos Chatziantoniou. A detailed description of the assignment can be found here.
1. We assume that Python3 is already installed on the system.
2. Install Hadoop on Ubuntu following the guide "How to install Hadoop on Ubuntu 18.04 Bionic Beaver Linux".
3. Install the necessary requirements:
$ pip install -r requirements.txt
1. Clone this repository:
$ git clone https://github.com/ChryssaNab/BDMS-AUEB.git
$ cd BDMS-AUEB/kmeans_mapreduce/src/
2. Run generateDataset.py to create the input data points:
$ python3 generateDataset.py
- In this example, the initial cluster centers are (-100000, -100000), (1, 1), and (100000, 100000).
- The remaining data points are generated around these centers following a normal distribution with a standard deviation of 5.0.
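The generation step above can be sketched as follows. The three centers and the standard deviation come from the description; the number of points per center, the random seed, and the output filename are illustrative assumptions, not necessarily what generateDataset.py uses:

```python
import csv
import numpy as np

# Centers and standard deviation as described in this README.
CENTERS = [(-100000, -100000), (1, 1), (100000, 100000)]
STD_DEV = 5.0
POINTS_PER_CENTER = 1000  # hypothetical count, for illustration

rng = np.random.default_rng(42)  # fixed seed for reproducibility (assumption)
points = []
for cx, cy in CENTERS:
    # Draw 2-D points from a normal distribution around each center.
    cluster = rng.normal(loc=(cx, cy), scale=STD_DEV, size=(POINTS_PER_CENTER, 2))
    points.extend(cluster.tolist())

with open("data-points.csv", "w", newline="") as f:
    csv.writer(f).writerows(points)
```

Each row of the resulting CSV is one `x,y` point; the tight spread (std 5.0 versus centers 100000 apart) makes the three clusters trivially separable, which is convenient for verifying the MapReduce output.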
3. Upload the data to HDFS:
$ hdfs dfs -mkdir /kmeans
$ hdfs dfs -put $HADOOP_HOME/localFilePath/data-points.csv /kmeans
4. Run kMeansRunner.py to deploy k-means on Hadoop:
$ python3 kMeansRunner.py
The output of the MapReduce process is also stored on HDFS, in a file named part-00000.
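Conceptually, each k-means iteration on Hadoop maps every point to its nearest center and reduces by averaging the points assigned to each center. The sketch below is a simplified in-memory illustration of that map/reduce logic under those assumptions — it is not the repository's actual kMeansRunner.py implementation, and the function names and sample data are hypothetical:

```python
import numpy as np

def mapper(point, centers):
    """Map step: emit (index of nearest center, point)."""
    dists = [np.linalg.norm(np.asarray(point) - np.asarray(c)) for c in centers]
    return int(np.argmin(dists)), point

def reducer(center_idx, assigned_points):
    """Reduce step: the new center is the mean of its assigned points."""
    return center_idx, np.mean(assigned_points, axis=0).tolist()

# Tiny in-memory simulation of one iteration (illustrative data).
centers = [(-100000, -100000), (1, 1), (100000, 100000)]
data = [(-99998, -100001), (2, 0), (1, 2), (99997, 100003)]

# Shuffle phase stand-in: group mapper output by key.
groups = {}
for p in data:
    idx, pt = mapper(p, centers)
    groups.setdefault(idx, []).append(pt)

new_centers = dict(reducer(i, pts) for i, pts in groups.items())
```

On Hadoop this loop is driven from outside the job: after each iteration the updated centers are fed back in, and the process repeats until the centers stop moving (or a maximum iteration count is reached).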