This project provides a containerized Hadoop ecosystem with integrated Big Data processing tools, designed for learning, development, and small-scale testing.
| Component | Version |
|-----------|---------|
| Hadoop    | 2.7.4   |
| Spark     | 2.4.5   |
| Hive      | 2.3.2   |
| Pig       | 0.17.0  |
| Tez       | 0.9.2   |
| Zeppelin  | 0.9.0   |
- Stand up a ready-to-use distributed Big Data environment
- Perform analytical processing with Hive, Pig, Spark, and Tez
- Compare performance across different processing engines
- Docker and Docker Compose installed
- A system with at least 8 GB of RAM is recommended
.
├── config-pig/ # Pig configuration files
├── config-tez/ # Tez configuration files
├── datasets/ # Sample CSV datasets
├── pig/ # Pig Dockerfile
├── tez/ # Tez Dockerfile
├── spark-jobs/ # Pre-written Spark scripts
├── docker-compose.yaml # Services definition
└── hadoop.env # Hadoop environment variables
git clone https://github.com/AbderrahmaneOd/hadoop-spark-tez-docker.git
cd hadoop-spark-tez-docker
docker compose up -d
docker compose ps -a
| Service | URL |
|---------|-----|
| HDFS Namenode | http://localhost:50070 |
| YARN ResourceManager | http://localhost:8088 |
| Spark Master | http://localhost:8080 |
| Zeppelin | http://localhost:8085 |
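Once `docker compose ps -a` shows the containers running, the web UIs above can be smoke-tested from the host. A minimal sketch in Python 3, standard library only, assuming the default port mappings from docker-compose.yaml:

```python
import urllib.request

# Probe each web UI listed in the table above.
services = {
    "HDFS Namenode": "http://localhost:50070",
    "YARN ResourceManager": "http://localhost:8088",
    "Spark Master": "http://localhost:8080",
    "Zeppelin": "http://localhost:8085",
}

for name, url in services.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name:<22} OK (HTTP {resp.status})")
    except OSError as exc:
        print(f"{name:<22} unreachable: {exc}")
```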
docker compose exec namenode hdfs dfs -mkdir -p /input/customers-data
docker compose exec namenode hdfs dfs -put datasets/*.csv /input/customers-data
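To confirm the upload without opening a shell in the container, the Namenode's WebHDFS REST API (enabled by default in Hadoop 2.7) can be queried from the host on the Namenode UI port. A sketch:

```python
import json
import urllib.request

# LISTSTATUS returns the directory listing as JSON via WebHDFS.
url = "http://localhost:50070/webhdfs/v1/input/customers-data?op=LISTSTATUS"
with urllib.request.urlopen(url, timeout=10) as resp:
    listing = json.load(resp)

for entry in listing["FileStatuses"]["FileStatus"]:
    print(f'{entry["pathSuffix"]}: {entry["length"]} bytes')
```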
- Enter the Hive server container and start the Hive CLI:
docker compose exec -it hive-server bash
hive
- Create database and load data:
CREATE DATABASE customer_db;
USE customer_db;
CREATE TABLE users (
user_id INT,
user_name STRING,
email STRING,
age INT,
country STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');  -- the sample CSVs include a header row
LOAD DATA INPATH '/input/customers-data/users.csv' INTO TABLE users;
-- Note: LOAD DATA INPATH moves the file into Hive's warehouse directory;
-- re-run the hdfs dfs -put above for users.csv before the Pig step below.
SELECT * FROM users LIMIT 5;
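The same tables are also reachable programmatically through HiveServer2. A sketch using PyHive, assuming `pyhive` is installed on the host (`pip install 'pyhive[hive]'`) and the compose file publishes HiveServer2's port 10000:

```python
from pyhive import hive

# Connect to HiveServer2 and run a quick aggregate over the users table.
conn = hive.Connection(host="localhost", port=10000, database="customer_db")
cursor = conn.cursor()
cursor.execute("SELECT country, COUNT(*) FROM users GROUP BY country")
for country, n in cursor.fetchall():
    print(country, n)
conn.close()
```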
- Open the Pig container and start the Grunt shell:
docker compose exec -it pig bash
pig
- Run a Pig script (the schema must list all five CSV columns, since PigStorage maps fields by position):
users = LOAD '/input/customers-data/users.csv'
    USING PigStorage(',')
    AS (user_id:int, user_name:chararray, email:chararray, age:int, country:chararray);
-- Drop the header row, whose user_id field casts to null
users = FILTER users BY user_id IS NOT NULL;
DUMP users;
- Enter Spark container:
docker compose exec -it spark bash
- Example PySpark job:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum

spark = SparkSession.builder \
    .master("spark://spark:7077") \
    .appName("RevenueByProduct") \
    .getOrCreate()

orders = spark.read.csv(
    'hdfs://namenode:8020/input/customers-data/orders.csv',
    header=True,
    inferSchema=True
)
products = spark.read.csv(
    'hdfs://namenode:8020/input/customers-data/products.csv',
    header=True,
    inferSchema=True
)

# Join orders to products, derive a revenue column, then aggregate per product.
joined = orders.join(products, "product_id")
revenue = joined \
    .withColumn("revenue", col("quantity") * col("price")) \
    .groupBy("product_name") \
    .agg(spark_sum("revenue").alias("total_revenue")) \
    .orderBy("total_revenue", ascending=False)

revenue.show()
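For non-interactive runs, the scripts under spark-jobs/ can be submitted with spark-submit from inside the spark container. The same cluster is also reachable from Zeppelin (port 8085 above); a sketch of a notebook paragraph, assuming the stock %pyspark interpreter, which pre-binds the SparkSession as `spark`:

```python
%pyspark
# Zeppelin injects `spark`; aggregate users per country straight from HDFS.
users = spark.read.csv("hdfs://namenode:8020/input/customers-data/users.csv",
                       header=True, inferSchema=True)
users.groupBy("country").count().orderBy("count", ascending=False).show()
```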
- Apache Hadoop: Distributed storage and resource management (HDFS, YARN)
- Apache Spark: Fast in-memory data processing
- Apache Hive: SQL-like querying over data in HDFS
- Apache Pig: Script-based data transformations in Pig Latin
- Apache Tez: DAG-based execution engine that accelerates Hive and Pig jobs on YARN