Assignment repository for the Big Data Computing course at the University of Padova for the academic year 2023-2024.

francesco-biscaccia-carrara/BigData_Projects

 
 

☁️ Big Data Computing Course Assignments

Welcome to my Big Data Computing course repository! It collects the assignments I completed for INP7079233 Big Data Computing at the University of Padova during the 2023-2024 academic year, taught by Professors Pietracaprina and Silvestri.

📚 Course Overview

This course covers techniques for processing and analyzing massive datasets, with a focus on distributed computation in Apache Spark and on the analysis of data streams.

🧠 Key Learning Outcomes

  • Mastery of Apache Spark for large-scale data processing
  • Implementation of distributed algorithms
  • Real-time data stream analysis
  • Practical experience with cloud computing platforms

🛠️ Homework Assignments

Homework 1: Outlier Detection in Large Datasets

Objective: Implement and compare exact and approximate outlier detection algorithms using Spark.

Key Components:

  • Exact algorithm implementation (sequential)
  • Approximate algorithm using Spark RDDs
  • Performance and accuracy analysis
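
To illustrate the sequential exact approach listed above, here is a minimal sketch in plain Java. It is not the repository's code. It assumes a distance-based definition that this README does not spell out: a point counts as an outlier when the closed ball of radius `d` around it contains at most `m` input points, the point itself included. The class name, method names, and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a brute-force exact outlier detector. */
class ExactOutliersSketch {

    /**
     * Returns the points whose closed ball of radius d holds at most m input
     * points, counting the point itself. This definition is an assumption.
     * Runs in O(n^2) time, so it only suits small inputs.
     */
    static List<double[]> exactOutliers(List<double[]> points, double d, int m) {
        int n = points.size();
        int[] neighbors = new int[n];
        Arrays.fill(neighbors, 1); // every point lies in its own ball
        double d2 = d * d;         // compare squared distances, no sqrt needed
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (squaredDistance(points.get(i), points.get(j)) <= d2) {
                    neighbors[i]++;
                    neighbors[j]++;
                }
            }
        }
        List<double[]> outliers = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (neighbors[i] <= m) {
                outliers.add(points.get(i));
            }
        }
        return outliers;
    }

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<double[]> points = List.of(
                new double[] {0.0, 0.0},
                new double[] {0.1, 0.0},
                new double[] {0.0, 0.1},
                new double[] {5.0, 5.0});
        // Only (5.0, 5.0) has no other point within distance 1.0.
        for (double[] p : exactOutliers(points, 1.0, 1)) {
            System.out.println(Arrays.toString(p));
        }
    }
}
```

The quadratic cost of this pairwise scan is what makes an approximate, distributed variant attractive on large inputs.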

🔗 Detailed Assignment Description

Homework 2: K-Center Clustering for Outlier Detection

Objective: Enhance outlier detection by integrating k-center clustering techniques.

Key Tasks:

  • Refine MRApproxOutliers from HW1
  • Implement Farthest-First Traversal (FFT) algorithm
  • Develop MapReduce FFT (MRFFT)
  • Execute experiments on CloudVeneto cluster
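
As a rough illustration of the Farthest-First Traversal step named above, the following plain-Java sketch picks `k` centers greedily: start from an arbitrary point, then repeatedly add the point farthest from the centers chosen so far. It is not the repository's implementation. The class and method names are hypothetical, and taking the first input point as the initial center is an arbitrary choice made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of sequential Farthest-First Traversal. */
class FarthestFirstSketch {

    /**
     * Returns up to k centers chosen greedily. Keeps, for every point, its
     * squared distance to the closest center so far, giving O(n * k) time.
     */
    static List<double[]> farthestFirst(List<double[]> points, int k) {
        List<double[]> centers = new ArrayList<>();
        int n = points.size();
        if (n == 0 || k <= 0) {
            return centers;
        }
        double[] distToCenters = new double[n];
        Arrays.fill(distToCenters, Double.POSITIVE_INFINITY);
        int next = 0; // assumption: the first point serves as the first center
        int rounds = Math.min(k, n);
        for (int c = 0; c < rounds; c++) {
            double[] center = points.get(next);
            centers.add(center);
            double farthest = -1.0;
            for (int i = 0; i < n; i++) {
                // Update the distance from point i to its closest center.
                double dist = squaredDistance(points.get(i), center);
                if (dist < distToCenters[i]) {
                    distToCenters[i] = dist;
                }
                // Track the point that is farthest from all current centers.
                if (distToCenters[i] > farthest) {
                    farthest = distToCenters[i];
                    next = i;
                }
            }
        }
        return centers;
    }

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<double[]> points = List.of(
                new double[] {0.0}, new double[] {1.0},
                new double[] {10.0}, new double[] {11.0});
        // Starting from 0.0, the farthest point is 11.0.
        for (double[] center : farthestFirst(points, 2)) {
            System.out.println(Arrays.toString(center));
        }
    }
}
```

A MapReduce variant would typically run a routine like this on each partition and then again on the union of the partial results. That composition is not shown here.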

🔗 Detailed Assignment Description

Homework 3: Frequent Item Detection in Data Streams

Objective: Use the Spark Streaming API to identify frequent items in real-time data streams.

Highlight Features:

  • Reservoir sampling implementation
  • Sticky sampling method
  • Real-time stream processing
  • Comparative analysis of sampling methods
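
The reservoir sampling idea listed above can be sketched in plain Java, independent of Spark. The sketch keeps a uniform random sample of fixed size over a stream of unknown length: the first `capacity` items are stored directly, and the t-th item afterwards replaces a random stored item with probability `capacity / t`. The class name, the `offer` method, and the use of `long` items are assumptions made for illustration. This is not the repository's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of reservoir sampling over a stream of long items. */
class ReservoirSketch {
    private final int capacity;
    private final List<Long> sample;
    private final Random rng;
    private long seen = 0; // number of stream items processed so far

    ReservoirSketch(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.sample = new ArrayList<>(capacity);
        this.rng = new Random(seed); // seeded so that runs are repeatable
    }

    /** Processes one stream item. */
    void offer(long item) {
        seen++;
        if (sample.size() < capacity) {
            sample.add(item); // the reservoir is not full yet
            return;
        }
        // Keep the new item with probability capacity / seen ...
        if (rng.nextDouble() < (double) capacity / seen) {
            // ... and evict a stored item chosen uniformly at random.
            sample.set(rng.nextInt(capacity), item);
        }
    }

    /** Returns a copy of the current sample. */
    List<Long> sample() {
        return new ArrayList<>(sample);
    }

    long seen() {
        return seen;
    }

    public static void main(String[] args) {
        ReservoirSketch reservoir = new ReservoirSketch(10, 42L);
        for (long item = 0; item < 1000; item++) {
            reservoir.offer(item);
        }
        System.out.println(reservoir.sample());
    }
}
```

In a streaming job, `offer` would be called for every item of each incoming batch. One common way to use the result for frequent-item detection is to report the distinct items in the sample as candidates, though whether the assignment does exactly this is an assumption.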

🔗 Detailed Assignment Description

🛠️ Technologies & Tools

  • Apache Spark & Spark Streaming
  • Java
  • CloudVeneto Cluster

🌟 Key Takeaways

The course provided:

  • Hands-on experience with industry-standard big data tools
  • Deep understanding of distributed computing principles
  • Practical skills in real-time data analysis and processing

📄 License

This project is licensed under the MIT License with a Non-Commercial Clause. See the LICENSE file for details.

💡 Feel free to explore the code and documentation!

Languages

  • Java 100.0%