This project aimed to analyze and understand worldwide sales data through the use of Zeppelin and HDFS. The primary objective was to utilize Spark's basic Scala commands and SQL to query and manipulate the data, providing valuable insights and findings for the customer.

Worldwide Sales Data Analysis and Exploration using Zeppelin HDFS and Spark

Introduction

In this project, the objective was to analyze a worldwide sales dataset using Zeppelin and HDFS. The data was ingested into Zeppelin and queried using Spark's Scala API and Spark SQL. The focus of the analysis was to provide a comprehensive breakdown of the sales data and uncover key insights into sales patterns and trends.

The first step in the analysis was to load the data from HDFS into a Spark dataframe in Zeppelin. The data was then displayed in tabular form, allowing for a visual inspection. The next step was to print the schema of the dataframe, providing a structured overview of the variables and data types in the dataset.

To further refine the analysis, the dataframe was filtered to show only those observations where the number of units sold was greater than 8000 and the unit cost was greater than 500. This helped to focus on the most significant sales data and identify the top-performing products and regions.

Finally, the data was grouped by region and the count of observations in each region was calculated. This provided a summary of the sales data by region, allowing the customer to determine the areas where sales were strongest and where they should focus their marketing and sales efforts.

In conclusion, this project demonstrates the power of Zeppelin and HDFS in performing data analysis and uncovering valuable insights from sales data. Using Spark's Scala API and SQL, the data was queried and analyzed with ease, providing valuable information for the customer to make informed business decisions.

Tools Used

  • Zeppelin
  • Spark
  • HDFS
  • Scala
  • SQL

Result

(Screenshot of the final Zeppelin notebook results.)

Process

  • Load data into a Spark dataframe

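In Zeppelin the load is a single call against HDFS. The sketch below shows that call as a comment and then runs the equivalent parsing over an in-memory CSV sample in plain Scala, so it works without a cluster; the HDFS path and column names are assumptions for illustration, not taken from the repository.

```scala
// In the Zeppelin note (path and options are assumptions):
//   val df = spark.read.option("header", "true").option("inferSchema", "true")
//     .csv("hdfs:///data/worldwide_sales.csv")
//   df.show()
//   df.printSchema()
// Equivalent parsing over an in-memory sample:
case class Sale(region: String, unitsSold: Long, unitCost: Double, totalProfit: Double)

val csv =
  """Region,Units Sold,Unit Cost,Total Profit
    |Europe,9000,520.5,125000.0
    |Asia,7500,610.0,98000.0""".stripMargin

val sales: Seq[Sale] = csv.linesIterator.drop(1).map { line =>
  val f = line.split(",")
  Sale(f(0), f(1).toLong, f(2).toDouble, f(3).toDouble)
}.toSeq

println(sales.head)  // Sale(Europe,9000,520.5,125000.0)
```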

  • Print the dataframe schema


  • Filter the dataframe to show units sold greater than 8000 and unit cost greater than 500 (the "&&" operator combines multiple AND conditions)

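In the DataFrame API this is a `filter` over column expressions combined with `&&`. A plain-Scala sketch of the same predicate follows, so the logic is runnable here; the column and field names are assumptions about the dataset's header.

```scala
// DataFrame form (column names assumed):
//   val filtered = df.filter($"Units Sold" > 8000 && $"Unit Cost" > 500)
// Same predicate over an in-memory collection:
case class Sale(region: String, unitsSold: Long, unitCost: Double)

val sales = Seq(
  Sale("Europe", 9000, 520.5),
  Sale("Asia",   7500, 610.0),
  Sale("Africa", 8200, 480.0)
)

val filtered = sales.filter(s => s.unitsSold > 8000 && s.unitCost > 500)
println(filtered)  // List(Sale(Europe,9000,520.5))
```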

  • Aggregate the dataframe by grouping on "Region" and counting the rows in each group

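The Spark call is `df.groupBy("Region").count()`. The same group-and-count over a plain Scala collection, as a runnable sketch:

```scala
// DataFrame form: df.groupBy("Region").count()
val regions = Seq("Europe", "Asia", "Europe", "Africa")

// Group identical region names together, then take the size of each group.
val counts: Map[String, Int] =
  regions.groupBy(identity).map { case (region, rows) => region -> rows.size }

println(counts)
```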

  • Save the filtered subset dataframe to HDFS as a CSV file

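The Spark write is shown as a comment (the output path is an assumption); the runnable part below writes an equivalent small CSV with the standard library so the round trip can be checked locally.

```scala
// In Zeppelin (output path assumed):
//   filtered.coalesce(1).write.option("header", "true").mode("overwrite")
//     .csv("hdfs:///output/filtered_sales")
import java.nio.file.{Files, Paths}

val rows = Seq("Region,count", "Europe,2", "Asia,1", "Africa,1")
val path = Paths.get("region_counts.csv")
Files.write(path, rows.mkString("\n").getBytes("UTF-8"))

val readBack = new String(Files.readAllBytes(path), "UTF-8")
println(readBack)
```

Note that Spark writes a directory of part files rather than a single CSV; `coalesce(1)` reduces the output to one part file.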

  • Using SQL, select all rows from the "Regionview" view and display them as a line graph


  • Using SQL, select the region and the sum of units sold from the "Salesview" view, grouped by region

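The SQL form is shown as a comment (the exact column name is an assumption); the runnable sketch computes the same grouped sum over an in-memory collection.

```scala
// In Zeppelin, after df.createOrReplaceTempView("Salesview"):
//   %sql
//   SELECT Region, SUM(`Units Sold`) AS units_sold FROM Salesview GROUP BY Region
val sales = Seq(("Europe", 9000L), ("Asia", 7500L), ("Europe", 3000L))

// Group by region, then sum the units-sold column within each group.
val unitsByRegion: Map[String, Long] =
  sales.groupBy(_._1).map { case (region, rows) => region -> rows.map(_._2).sum }

println(unitsByRegion)
```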

  • Using SQL, select the region and the sum of total_profit from the "Salesview" view, grouped by region, and display the result as a bar chart


  • Using SQL, select the total profit as profit, the total revenue as revenue, and the total cost as cost from the "Salesview" view, grouped by region

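The three-way aggregation maps onto one GROUP BY with three SUMs, shown as a comment (column names assumed); the runnable sketch does the same per-region sums over sample rows.

```scala
//   %sql
//   SELECT Region, SUM(Total_Profit) AS profit, SUM(Total_Revenue) AS revenue,
//          SUM(Total_Cost) AS cost
//   FROM Salesview GROUP BY Region
case class Row(region: String, profit: Double, revenue: Double, cost: Double)

val rows = Seq(
  Row("Europe", 100.0, 400.0, 300.0),
  Row("Europe",  50.0, 200.0, 150.0),
  Row("Asia",    80.0, 260.0, 180.0)
)

// For each region, sum profit, revenue, and cost separately.
val summary: Map[String, (Double, Double, Double)] =
  rows.groupBy(_.region).map { case (region, g) =>
    region -> (g.map(_.profit).sum, g.map(_.revenue).sum, g.map(_.cost).sum)
  }

println(summary)
```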

  • The client is opening a new store and evaluating the best location: show the average profit of each region as a percentage of all regions, displayed as a pie chart

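Zeppelin's pie chart normalizes the per-region averages into shares of the whole. The SQL form is a comment (column name assumed); the runnable sketch computes the same percentage split from sample averages.

```scala
//   %sql
//   SELECT Region, AVG(Total_Profit) AS avg_profit FROM Salesview GROUP BY Region
// then choose the pie-chart display in Zeppelin.
val avgProfit = Map("Europe" -> 150.0, "Asia" -> 80.0, "Africa" -> 70.0)

// Each region's share of the summed averages, as a percentage.
val total = avgProfit.values.sum
val sharePct: Map[String, Double] =
  avgProfit.map { case (region, avg) => region -> 100.0 * avg / total }

println(sharePct)
```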
