This repository contains various exercises and examples demonstrating the power of Apache Spark using Python. It is based on a course that teaches how to use PySpark for large-scale data analysis and machine learning tasks. The provided notebooks and datasets walk through real-world scenarios, from processing movie ratings to building machine learning models.
The repository is organized into multiple folders, each corresponding to a different concept or hands-on activity covered in the course. Here's a breakdown of the contents:
Contains the MovieLens dataset used to analyze and compute statistics such as the most popular movie, movie ratings, and recommendations.
- Files:
ml-100k/
: Contains the MovieLens dataset files (u.data
,u.item
, etc.)moverating.ipynb
: Jupyter notebook analyzing movie ratings.
Contains the example dataset fakefriends.csv
to demonstrate filtering and transformations on Resilient Distributed Datasets (RDDs).
- Files:
fakefriends.csv
: The fake friends dataset.friends-by-age.py
: Python script for analyzing the data.notebook.ipynb
: Jupyter notebook implementation.
Contains an example to demonstrate filtering RDDs, showing how to compute the minimum temperatures by location.
- Files:
1800.csv
: Dataset for minimum temperature analysis.filteringRDD.ipynb
: Jupyter notebook for temperature analysis.
Demonstrates mapping and flat-mapping techniques on text data.
- Files:
Book.txt
: Sample text file for word count operations.Map-flatmap.ipynb
: Notebook illustrating the differences betweenmap()
andflatMap()
in Spark.
Analyzes customer orders to compute the total amount spent by each customer.
- Files:
customer-orders.csv
: Dataset containing customer orders.analysis.ipynb
: Jupyter notebook for customer spending analysis.
Shows how to use SparkSQL for data processing with structured datasets.
- Files:
1800.csv
,fakefriends.csv
: Datasets used for SQL-style data analysis.- Notebooks for querying data using SparkSQL.
Advanced Spark examples including:
- Using broadcast variables.
- Computing the most popular superhero.
- Recommending movies based on similarity.
- Files:
Marvel+Graph
,Marvel+Names
: Datasets for the superhero analysis.notebooks
: Jupyter notebooks with advanced Spark implementations.
- Clone the repository:
git clone https://github.com/Abhigyan-RA/Apache-Spark-basics.git
- Install the required dependencies:
Ensure you have Python and Jupyter notebooks installed, along with Apache Spark. Install any other necessary libraries usingpip
:pip install -r requirements.txt