The objective of this project was to analyze a worldwide sales dataset stored in HDFS using a Zeppelin notebook. The data was loaded into Zeppelin and queried with basic Spark commands in Scala and with Spark SQL. The analysis breaks the sales data down by region and highlights patterns in units sold, cost, revenue, and profit.
The first step was to load the data from HDFS into a Spark dataframe in Zeppelin. The dataframe was then displayed so the rows could be inspected visually. Next, the schema of the dataframe was printed to give a structured overview of the columns and their data types.
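The load-and-inspect step could look roughly like the sketch below. The column names and the temporary CSV file are assumptions made so the example runs on its own; in the notebook the path would point at the dataset in HDFS, and Zeppelin already provides the `spark` session.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Zeppelin already provides `spark`; getOrCreate() reuses it when it exists.
val spark = SparkSession.builder().appName("sales-load").master("local[*]").getOrCreate()

// Hypothetical stand-in for the HDFS file: a tiny CSV written to a temp file.
// The column names are assumptions, not taken from the real dataset.
val samplePath = Files.createTempFile("sales", ".csv")
Files.write(samplePath,
  """Region,units_sold,unit_cost,total_profit
    |Europe,9000,520.5,100000.0
    |Asia,1200,35.8,4000.0
    |""".stripMargin.getBytes("UTF-8"))

// Read the CSV with a header row and let Spark infer the column types.
val salesDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(samplePath.toUri.toString) // in the notebook: an hdfs:// path

salesDf.show()        // display the rows as a table
salesDf.printSchema() // list column names and inferred types
```

Schema inference costs an extra pass over the file; for a large dataset an explicit schema may be preferable.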
To refine the analysis, the dataframe was filtered to keep only the rows where the number of units sold was greater than 8000 and the unit cost was greater than 500. This narrowed the data to high-volume sales of high-cost items, making it easier to identify the top-performing products and regions.
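A minimal sketch of this filter, assuming columns named `units_sold` and `unit_cost` and a few made-up rows in place of the real data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("sales-filter").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows standing in for the sales dataset (column names are assumptions).
val salesDf = Seq(
  ("Europe", 9000, 520.50),
  ("Asia",   9500, 120.00),
  ("Europe", 3000, 650.00)
).toDF("Region", "units_sold", "unit_cost")

// `&&` combines both conditions, so a row must satisfy each one to be kept.
val highValueDf = salesDf.filter(col("units_sold") > 8000 && col("unit_cost") > 500)

highValueDf.show()
```

Only the first row passes both conditions: the second fails the unit cost threshold and the third fails the units sold threshold.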
Finally, the data was grouped by region and the number of rows in each region was counted. This gives a summary of sales records per region, which shows the customer where sales activity is concentrated and where marketing and sales efforts might be focused.
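The grouping step can be sketched as follows. The sample rows are invented for illustration, and the sort by count is an addition for readability rather than something the original analysis is known to have done.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("sales-groupby").master("local[*]").getOrCreate()
import spark.implicits._

// Invented rows standing in for the sales dataset.
val salesDf = Seq(
  ("Europe", 9000),
  ("Europe", 3000),
  ("Asia",   9500)
).toDF("Region", "units_sold")

// Count the rows per region; count() adds a column named "count".
val regionCounts = salesDf
  .groupBy("Region")
  .count()
  .orderBy(col("count").desc)

regionCounts.show()
```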
In conclusion, this project shows how Zeppelin, Spark, and HDFS can be combined to analyze sales data. Basic Spark commands in Scala and Spark SQL were enough to load, filter, aggregate, and chart the data, giving the customer regional figures to support business decisions.
Technologies used:
- Zeppelin
- Spark
- HDFS
- Scala
- SQL
Tasks performed:
- Load data into a Spark dataframe
- Print the dataframe schema
- Filter the dataframe to show rows with units sold greater than 8000 and unit cost greater than 500 (the "&&" operator combines multiple "AND" conditions)
- Aggregate the dataframe by grouping on “Region” and counting the rows
- Save the resulting subset dataframe as a CSV file in HDFS
- Using SQL, select all rows from the “Regionview” view and show them in a line graph
- Using SQL, select the region and the sum of units sold from the “Salesview” view, grouped by region
- Using SQL, select the region and the sum of total_profit from the “Salesview” view, grouped by region, and display the result in a bar chart
- Using SQL, select the total profit as profit, the total revenue as revenue, and the total cost as cost from the “Salesview” view, grouped by region
- The client is opening a new store and is looking for the best location: show the average profit in each region as a percentage compared to the other regions, in a pie chart
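The SQL tasks above can be sketched from Scala as shown below. The view name `Salesview` comes from the task list; the column names other than `Region` and `total_profit`, the sample rows, and the temporary output directory are assumptions. Computing each region's share as its average profit divided by the sum of all regional averages is one possible reading of the pie chart task.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sales-sql").master("local[*]").getOrCreate()
import spark.implicits._

// Invented rows standing in for the sales dataset.
val salesDf = Seq(
  ("Europe", 9000, 100.0, 500.0, 400.0),
  ("Europe", 3000, 300.0, 900.0, 600.0),
  ("Asia",   9500, 600.0, 1000.0, 400.0)
).toDF("Region", "units_sold", "total_profit", "total_revenue", "total_cost")

// Register the dataframe so it can be queried with SQL.
salesDf.createOrReplaceTempView("Salesview")

// Sum of units sold per region.
val unitsByRegion = spark.sql(
  "SELECT Region, SUM(units_sold) AS total_units FROM Salesview GROUP BY Region")

// Profit, revenue, and cost totals per region.
val totalsByRegion = spark.sql(
  """SELECT Region,
    |       SUM(total_profit)  AS profit,
    |       SUM(total_revenue) AS revenue,
    |       SUM(total_cost)    AS cost
    |FROM Salesview
    |GROUP BY Region""".stripMargin)

// Average profit per region as a percentage of the sum of regional averages.
val profitShare = spark.sql(
  """WITH region_avg AS (
    |  SELECT Region, AVG(total_profit) AS avg_profit
    |  FROM Salesview
    |  GROUP BY Region
    |)
    |SELECT Region,
    |       avg_profit,
    |       100 * avg_profit / SUM(avg_profit) OVER () AS pct_of_total
    |FROM region_avg""".stripMargin)

unitsByRegion.show()
totalsByRegion.show()
profitShare.show()

// Save a result as CSV; a temp directory stands in for an HDFS location here.
val outDir = Files.createTempDirectory("sales-out").toUri.toString + "totals"
totalsByRegion.write.mode("overwrite").option("header", "true").csv(outDir)
```

In a Zeppelin notebook the same queries can be run in `%sql` paragraphs, where the built-in chart selector switches the result between table, bar, line, and pie views.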