Use your knowledge of SparkSQL to determine key metrics about home sales data. Then you'll use Spark to create temporary views, partition the data, cache and uncache a temporary table, and verify that the table has been uncached.
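The queries below assume a temporary view named `home_sales_temp` already exists. A minimal setup sketch, assuming the dataset lives in a local `home_sales_revised.csv` file (the file name and path are assumptions; point them at your copy of the data):

```python
from pyspark.sql import SparkSession

# Start a local Spark session (sketch; cluster configuration omitted).
spark = SparkSession.builder.appName("HomeSales").getOrCreate()

# Read the home sales CSV; the file name is an assumption -- adjust as needed.
home_sales_data = spark.read.csv("home_sales_revised.csv", header=True, inferSchema=True)

# Register the temporary view that the SparkSQL queries below run against.
home_sales_data.createOrReplaceTempView("home_sales_temp")
```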
Answer the following questions using SparkSQL:
- What is the average price of a four-bedroom house sold in each year? Round off your answer to two decimal places.
- What is the average price of a home for each year it was built that has three bedrooms and three bathrooms? Round off your answer to two decimal places.
- What is the average price of a home for each year it was built that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet? Round off your answer to two decimal places.
- What is the "view" rating for homes costing more than or equal to $350,000? Determine the run time for this query, and round off your answer to two decimal places.
1. What is the average price of a four-bedroom house sold in each year, rounded to two decimal places?
spark.sql("""select round(avg(price), 2) as price, bedrooms, date_built from home_sales_temp where bedrooms==4 group by 2, 3 order by date_built desc""").show()
2. What is the average price of a home for each year it was built that has three bedrooms and three bathrooms? Round off your answer to two decimal places.
spark.sql("""select round(avg(price), 2) as price, bedrooms, date_built from home_sales_temp where bedrooms==3 and bathrooms==3 group by 2, 3 order by date_built desc""").show()
3. What is the average price of a home for each year it was built that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet? Round off your answer to two decimal places.
spark.sql("""select round(avg(price), 2) as price, bedrooms, date_built from home_sales_temp where bedrooms==3 and bathrooms==3 and floors==2 and sqft_living>=2000 group by 2, 3 order by date_built desc""").show()
4. What is the "view" rating for homes costing more than or equal to $350,000? Determine the run time for this query, and round off your answer to two decimal places.
import time

start_time = time.time()

# Average price per "view" rating, keeping only ratings whose average
# price is at least $350,000; timed to establish an uncached baseline.
spark.sql("""
    SELECT view, ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_temp
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print("--- %s seconds ---" % (time.time() - start_time))
- Cache the temporary table home_sales_temp.
spark.sql("cache table home_sales_temp")
- Check if the table is cached.
spark.catalog.isCached('home_sales_temp')
- Using the cached data, run the query that filters out the view ratings with an average price greater than or equal to $350,000. Determine the runtime and compare it to the uncached runtime.
start_time = time.time()

# Same query as the uncached baseline, now served from the cached table;
# this run should be noticeably faster.
spark.sql("""
    SELECT view, ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_temp
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print("--- %s seconds ---" % (time.time() - start_time))
- Partition by the "date_built" field on the formatted parquet home sales data
# Write the data as Parquet with one subdirectory per date_built value.
home_sales_data.write.partitionBy('date_built').parquet('p_home_sales_temp', mode='overwrite')
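The partitioning is visible directly on disk: Spark writes one subdirectory per distinct date_built value. A quick way to confirm the layout, sketched with the standard library against the path written above:

```python
import os

# Expect one directory per year built, e.g. ['date_built=2010', 'date_built=2011', ...]
print(sorted(d for d in os.listdir('p_home_sales_temp') if d.startswith('date_built=')))
```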
- Read the formatted parquet data.
parq_homes_df = spark.read.parquet('p_home_sales_temp')
- Create a temporary table for the parquet data.
parq_homes_df.createOrReplaceTempView('parq_homes_df_temp')
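In the DataFrame read back from Parquet, date_built comes back as a partition column; a quick sanity check:

```python
# Confirm the schema survived the round trip, including date_built.
parq_homes_df.printSchema()
```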
- Run the query that filters out the view ratings with an average price greater than or equal to $350,000 against the parquet DataFrame. Round your average to two decimal places. Determine the runtime and compare it to the cached version.
start_time = time.time()

# Same query again, this time against the partitioned Parquet view. Note
# the table name: querying home_sales_temp here would hit the cached
# table instead of the Parquet data.
spark.sql("""
    SELECT view, ROUND(AVG(price), 2) AS avg_price
    FROM parq_homes_df_temp
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print("--- %s seconds ---" % (time.time() - start_time))
- Remember to uncache the temporary table and check that it is no longer cached.
spark.sql("uncache table home_sales_temp")
spark.catalog.isCached("home_sales_temp")