Use your knowledge of SparkSQL to determine key metrics about home sales data. Then you'll use Spark to create temporary views, partition the data, cache and uncache a temporary table, and verify that the table has been uncached.
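The queries below assume a temporary view named `home_sales_temp` already exists. A minimal setup sketch, assuming the dataset lives in a local `home_sales_revised.csv` file (the file name and path are assumptions; point them at your copy of the data):

```python
from pyspark.sql import SparkSession

# Start a local Spark session (sketch; cluster configuration omitted).
spark = SparkSession.builder.appName("HomeSales").getOrCreate()

# Read the home sales CSV; the file name is an assumption -- adjust as needed.
home_sales_data = spark.read.csv("home_sales_revised.csv", header=True, inferSchema=True)

# Register the temporary view that the SparkSQL queries below run against.
home_sales_data.createOrReplaceTempView("home_sales_temp")
```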
Answer the following questions using SparkSQL:
- What is the average price of a four-bedroom house sold in each year? Round off your answer to two decimal places.
- What is the average price of a home for each year it was built that has three bedrooms and three bathrooms? Round off your answer to two decimal places.
- What is the average price of a home for each year it was built that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet? Round off your answer to two decimal places.
- What is the "view" rating for homes costing more than or equal to $350,000? Determine the run time for this query, and round off your answer to two decimal places.
1. What is the average price of a four-bedroom house sold in each year, rounded to two decimal places?
spark.sql("""select round(avg(price), 2) as price, bedrooms, date_built from home_sales_temp where bedrooms==4 group by 2, 3 order by date_built desc""").show()
2. What is the average price of a home for each year it was built that has three bedrooms and three bathrooms? Round off your answer to two decimal places.
spark.sql("""select round(avg(price), 2) as price, bedrooms, date_built from home_sales_temp where bedrooms==3 and bathrooms==3 group by 2, 3 order by date_built desc""").show()
3. What is the average price of a home for each year it was built that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet? Round off your answer to two decimal places.
spark.sql("""select round(avg(price), 2) as price, bedrooms, date_built from home_sales_temp where bedrooms==3 and bathrooms==3 and floors==2 and sqft_living>=2000 group by 2, 3 order by date_built desc""").show()
4. What is the "view" rating for homes costing more than or equal to $350,000? Determine the run time for this query, and round off your answer to two decimal places.
import time

start_time = time.time()

# Average price per "view" rating, keeping only ratings whose average
# price is at least $350,000; timed to establish an uncached baseline.
spark.sql("""
    SELECT view, ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_temp
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print("--- %s seconds ---" % (time.time() - start_time))
- Cache the temporary table home_sales_temp.
spark.sql("cache table home_sales_temp")
- Check if the table is cached.
spark.catalog.isCached('home_sales_temp')
- Using the cached data, run the query that filters out the view ratings with an average price greater than or equal to $350,000. Determine the runtime and compare it to the uncached runtime.
start_time = time.time()

# Same query as the uncached baseline, now served from the cached table;
# this run should be noticeably faster.
spark.sql("""
    SELECT view, ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_temp
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print("--- %s seconds ---" % (time.time() - start_time))
- Partition by the "date_built" field on the formatted parquet home sales data
# Write the data as Parquet with one subdirectory per date_built value.
home_sales_data.write.partitionBy('date_built').parquet('p_home_sales_temp', mode='overwrite')
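The partitioning is visible directly on disk: Spark writes one subdirectory per distinct date_built value. A quick way to confirm the layout, sketched with the standard library against the path written above:

```python
import os

# Expect one directory per year built, e.g. ['date_built=2010', 'date_built=2011', ...]
print(sorted(d for d in os.listdir('p_home_sales_temp') if d.startswith('date_built=')))
```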
- Read the formatted parquet data.
parq_homes_df = spark.read.parquet('p_home_sales_temp')
- Create a temporary table for the parquet data.
parq_homes_df.createOrReplaceTempView('parq_homes_df_temp')
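In the DataFrame read back from Parquet, date_built comes back as a partition column; a quick sanity check:

```python
# Confirm the schema survived the round trip, including date_built.
parq_homes_df.printSchema()
```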
- Run the query that filters out the view ratings with an average price greater than or equal to $350,000 against the parquet DataFrame. Round your average to two decimal places. Determine the runtime and compare it to the cached version.
start_time = time.time()

# Same query again, this time against the partitioned Parquet view. Note
# the table name: querying home_sales_temp here would hit the cached
# table instead of the Parquet data.
spark.sql("""
    SELECT view, ROUND(AVG(price), 2) AS avg_price
    FROM parq_homes_df_temp
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()

print("--- %s seconds ---" % (time.time() - start_time))
- Remember to uncache the temporary table and check that it is no longer cached.
spark.sql("uncache table home_sales_temp")
spark.catalog.isCached("home_sales_temp")