Welcome to my Big Data Computing course repository! This collection showcases the assignments completed for INP7079233 Big Data Computing during the 2023-2024 academic year, under the guidance of Professors Pietracaprina and Silvestri, at the University of Padova.
The course covers techniques for processing and analyzing massive datasets, with a focus on Apache Spark, distributed algorithms, and streaming data.
- Mastery of Apache Spark for large-scale data processing
- Implementation of distributed algorithms
- Real-time data stream analysis
- Practical experience with cloud computing platforms
Objective: Implement and compare exact and approximate outlier detection algorithms using Spark.
Key Components:
- Exact algorithm implementation (sequential)
- Approximate algorithm using Spark RDDs
- Performance and accuracy analysis
🔗 Detailed Assignment Description
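As a rough illustration of the exact (sequential) component, here is a minimal sketch. It is not the repository's code, and the outlier definition is an assumption: a point counts as an outlier when at most `m` points (itself included) lie within Euclidean distance `d` of it. The class name `ExactOutliersSketch`, the method `findOutliers`, and the parameters `d` and `m` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExactOutliersSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // ASSUMED definition: a point is an outlier if at most m points
    // (the point itself included) lie within distance d of it.
    // Checks all pairs, so the running time is quadratic in the input size.
    static List<double[]> findOutliers(List<double[]> points, double d, int m) {
        int n = points.size();
        double d2 = d * d;
        int[] neighbors = new int[n];
        Arrays.fill(neighbors, 1); // every point is within distance d of itself
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (squaredDistance(points.get(i), points.get(j)) <= d2) {
                    neighbors[i]++;
                    neighbors[j]++;
                }
            }
        }
        List<double[]> outliers = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (neighbors[i] <= m) {
                outliers.add(points.get(i));
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.0, 0.0});
        points.add(new double[] {0.5, 0.0});
        points.add(new double[] {0.0, 0.5});
        points.add(new double[] {10.0, 10.0});
        for (double[] p : findOutliers(points, 1.0, 1)) {
            System.out.println(Arrays.toString(p));
        }
    }
}
```

The all-pairs check is what makes the exact approach impractical on large inputs, which is the usual motivation for comparing it against an approximate, distributed version.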
Objective: Enhance outlier detection by integrating k-center clustering techniques.
Key Tasks:
- Refine MRApproxOutliers from HW1
- Implement Farthest-First Traversal (FFT) algorithm
- Develop MapReduce FFT (MRFFT)
- Run experiments on the CloudVeneto cluster
🔗 Detailed Assignment Description
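The Farthest-First Traversal task can be sketched as follows. This is a minimal sequential version under stated assumptions, not the repository's implementation: the names `FarthestFirstTraversal` and `selectCenters` are hypothetical, points are assumed to be `double[]` arrays compared by Euclidean distance, and the first center is simply the first input point (any starting point works for the algorithm).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FarthestFirstTraversal {

    // Squared Euclidean distance; sufficient for comparisons and avoids a sqrt.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Picks k centers: start from the first point, then repeatedly add the
    // point whose distance to its closest chosen center is largest.
    // Keeping one running minimum per point gives O(n * k) distance computations.
    static List<double[]> selectCenters(List<double[]> points, int k) {
        int n = points.size();
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        double[] minDist = new double[n];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        List<double[]> centers = new ArrayList<>(k);
        int next = 0; // ASSUMPTION: arbitrary first center
        for (int c = 0; c < k; c++) {
            double[] center = points.get(next);
            centers.add(center);
            double best = -1.0;
            for (int i = 0; i < n; i++) {
                double dist = squaredDistance(points.get(i), center);
                if (dist < minDist[i]) {
                    minDist[i] = dist; // distance to the closest center so far
                }
                if (minDist[i] > best) {
                    best = minDist[i];
                    next = i; // current farthest point becomes the next center
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.0, 0.0});
        points.add(new double[] {1.0, 0.0});
        points.add(new double[] {10.0, 0.0});
        points.add(new double[] {5.0, 0.0});
        for (double[] center : selectCenters(points, 3)) {
            System.out.println(Arrays.toString(center));
        }
    }
}
```

A MapReduce variant typically runs this routine on each partition and then once more on the union of the per-partition centers; whether MRFFT in this repository follows exactly that scheme is not stated here.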
Objective: Use the Spark Streaming API to identify frequent items in real-time data streams.
Key Features:
- Reservoir sampling implementation
- Sticky sampling method
- Real-time stream processing
- Comparative analysis of sampling methods
🔗 Detailed Assignment Description
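Reservoir sampling keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The sketch below shows the classic one-pass scheme in plain Java, outside of Spark Streaming. It is an illustration under stated assumptions, not the repository's code: the class `ReservoirSampler`, its methods, and the seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class ReservoirSampler<T> {
    private final int capacity;
    private final Random random;
    private final List<T> reservoir;
    private long seen = 0;

    // The seed is taken explicitly so that runs are reproducible.
    ReservoirSampler(int capacity, long seed) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.random = new Random(seed);
        this.reservoir = new ArrayList<>(capacity);
    }

    // Processes one stream item. The first `capacity` items are always kept;
    // the t-th item afterwards replaces a random slot with probability capacity / t.
    void offer(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);
            return;
        }
        long slot = (long) (random.nextDouble() * seen); // roughly uniform in [0, seen)
        if (slot < capacity) {
            reservoir.set((int) slot, item);
        }
    }

    // Returns a copy of the current sample.
    List<T> sample() {
        return new ArrayList<>(reservoir);
    }

    long itemsSeen() {
        return seen;
    }

    public static void main(String[] args) {
        ReservoirSampler<Integer> sampler = new ReservoirSampler<>(5, 42L);
        for (int i = 0; i < 1000; i++) {
            sampler.offer(i);
        }
        System.out.println(sampler.sample());
    }
}
```

To estimate frequent items from such a sample, one would count how often each item appears in the reservoir and report the most common ones; the exact reporting rule used in the assignment is not specified here.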
- Apache Spark & Spark Streaming
- Java
- CloudVeneto Cluster
The course provided:
- Hands-on experience with industry-standard big data tools
- Deep understanding of distributed computing principles
- Practical skills in real-time data analysis and processing
This project is licensed under the MIT License with a Non-Commercial Clause; see the LICENSE file for details.
💡 Feel free to explore the code and documentation!