Worldwide Sales Data Analysis and Exploration using Zeppelin HDFS and Spark

Introduction

This project analyzes a worldwide sales dataset stored in HDFS, using Zeppelin as the notebook environment. The data was loaded into Zeppelin and queried with basic Spark Scala commands and SQL. The analysis breaks the sales data down by region and looks for patterns and trends in units sold, revenue, cost, and profit.

The first step was to load the data from HDFS into a Spark dataframe in Zeppelin and display it for a visual inspection. The next step was to print the schema of the dataframe, which lists each column and its data type.

To narrow the analysis, the dataframe was filtered to keep only the records where the number of units sold was greater than 8000 and the unit cost was greater than 500. This focuses on high-volume sales of high-cost items and helps identify the top-performing products and regions.

The data was then grouped by region and the number of records in each region was counted. SQL queries against the sales view added per-region totals for units sold, profit, revenue, and cost. Together these give a summary of sales by region, so the customer can see where sales are strongest and where to focus marketing and sales efforts.

In conclusion, this project shows how Zeppelin, HDFS, and Spark can be combined to explore a sales dataset. A small set of basic Spark Scala commands and SQL queries was enough to produce the regional summaries the customer needs to make informed business decisions.

Tools Used

Zeppelin
Spark
HDFS
Scala
SQL

Result

Process

Load data into a Spark dataframe

Print the dataframe schema
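The two steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names are assumed, and a small temporary CSV stands in for the real file so the snippet runs without a cluster. In Zeppelin the `spark` session already exists, and the path would point at HDFS instead.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In a Zeppelin %spark paragraph `spark` is predefined; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("sales-load").master("local[*]").getOrCreate()

// Stand-in for the real dataset: a tiny CSV with assumed column names.
val sample =
  """Region,Units Sold,Unit Cost,Total Profit
    |Europe,9000,524.96,100000.0
    |Asia,1200,35.84,5000.0
    |""".stripMargin
val samplePath = Files.createTempFile("sales", ".csv")
Files.write(samplePath, sample.getBytes("UTF-8"))

// On a cluster this would be an HDFS location instead,
// e.g. "hdfs:///some/dir/sales.csv" (hypothetical path).
val salesDf = spark.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // let Spark guess numeric types
  .csv(samplePath.toUri.toString)

salesDf.show()        // display the rows
salesDf.printSchema() // list columns and their inferred types
```

Without `inferSchema`, every column is read as a string, which would make the numeric comparisons in the later steps behave unexpectedly.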

Filter the dataframe to rows where units sold is greater than 8000 and unit cost is greater than 500 (the "&&" operator combines multiple "AND" conditions)
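A possible form of this filter is shown below. The column names `Units Sold` and `Unit Cost` and the sample rows are assumptions made for illustration; only the thresholds come from the step above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In Zeppelin `spark` is predefined; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("sales-filter").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows with assumed column names.
val salesDf = Seq(
  ("Europe", 9000, 524.96), // passes both conditions
  ("Asia", 9500, 35.84),    // enough units, but unit cost too low
  ("Europe", 1200, 524.96)  // unit cost high enough, but too few units
).toDF("Region", "Units Sold", "Unit Cost")

// && combines the two Column conditions into a single AND predicate.
val highValueDf = salesDf.filter(col("Units Sold") > 8000 && col("Unit Cost") > 500)
highValueDf.show()
```

Only the first sample row satisfies both conditions, so it is the only row left after the filter.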

Aggregate the dataframe by grouping on “Region” and counting the rows in each group

Save this new dataframe as a CSV file in HDFS
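The grouping and save steps might look like the sketch below. It assumes the dataframe being saved is the grouped one, uses made-up rows, and writes to a local temporary directory so it runs anywhere. On a cluster the output path would be an HDFS location.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In Zeppelin `spark` is predefined; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("sales-group").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows with assumed column names.
val salesDf = Seq(
  ("Europe", 9000),
  ("Asia", 9500),
  ("Europe", 1200)
).toDF("Region", "Units Sold")

// One output row per region, with a `count` column holding the row count.
val regionCounts = salesDf.groupBy("Region").count()
regionCounts.show()

// Local stand-in for an HDFS path such as "hdfs:///some/dir/region_counts" (hypothetical).
val outDir = Files.createTempDirectory("region_counts").resolve("csv")
regionCounts.write
  .mode("overwrite")
  .option("header", "true")
  .csv(outDir.toUri.toString)
```

Spark writes a directory of part files rather than a single CSV file. Reading the directory back with `spark.read.csv` returns the full result.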

Using SQL, select all rows from the “Regionview” view and display them in a line graph

Using SQL, from the “Salesview” view, select the region and the sum of units sold, grouped by region

Using SQL, from the “Salesview” view, select the region and the sum of total_profit, grouped by region, and display the result in a bar chart

Using SQL, from the “Salesview” view, select the total profit as profit, the total revenue as revenue, and the total cost as cost, grouped by region
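The per-region SQL aggregations above can be combined into one query, sketched here under assumptions: the snake_case column names other than `total_profit` are guesses, the rows are made up, and the view is registered from an in-memory dataframe rather than the real dataset.

```scala
import org.apache.spark.sql.SparkSession

// In Zeppelin `spark` is predefined; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("sales-sql").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows; column names other than total_profit are assumed.
val salesDf = Seq(
  ("Europe", 9000, 300.0, 1000.0, 700.0),
  ("Asia", 9500, 100.0, 400.0, 300.0),
  ("Europe", 1200, 50.0, 200.0, 150.0)
).toDF("region", "units_sold", "total_profit", "total_revenue", "total_cost")

// Register the dataframe so SQL can refer to it by name.
salesDf.createOrReplaceTempView("Salesview")

val byRegion = spark.sql(
  """SELECT region,
    |       SUM(units_sold)    AS total_units,
    |       SUM(total_profit)  AS profit,
    |       SUM(total_revenue) AS revenue,
    |       SUM(total_cost)    AS cost
    |FROM Salesview
    |GROUP BY region
    |ORDER BY region""".stripMargin)
byRegion.show()
```

In Zeppelin the same statement can be run in a `%sql` paragraph, where the result table can be switched to a bar or line chart from the paragraph toolbar.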

The client is opening a new store and is looking for the best location. They need to see the average profit in each region as a percentage (pie chart) compared with the other regions
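One way to read this requirement is to express each region's average profit as a share of the sum of all regional averages. The sketch below follows that interpretation. The column names and rows are assumed, and the original notebook may simply have plotted the averages in a pie chart and let Zeppelin render the proportions.

```scala
import org.apache.spark.sql.SparkSession

// In Zeppelin `spark` is predefined; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("sales-pct").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows with assumed column names.
Seq(
  ("Europe", 300.0),
  ("Europe", 100.0),
  ("Asia", 600.0)
).toDF("region", "total_profit").createOrReplaceTempView("Salesview")

// Step 1 (CTE): average profit per region.
// Step 2: each average divided by the sum of all averages, as a percentage.
val profitShare = spark.sql(
  """WITH avg_by_region AS (
    |  SELECT region, AVG(total_profit) AS avg_profit
    |  FROM Salesview
    |  GROUP BY region
    |)
    |SELECT region,
    |       avg_profit,
    |       ROUND(100 * avg_profit / SUM(avg_profit) OVER (), 2) AS pct_of_total
    |FROM avg_by_region
    |ORDER BY region""".stripMargin)
profitShare.show()
```

The empty `OVER ()` window spans every row of the intermediate result, which is small here because it holds one row per region.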
