Skip to content

Build MongoDB schema to house real-estate data, data pipeline to streamline data, and regression model to predict home price

Notifications You must be signed in to change notification settings

tomvdo29usc/NoSQL_HomePricePrediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

About The Project

The project contains the designed MongoDB database schema with optimized collections to store housing data. Extracted, transformed, loaded (ETL) raw home information to build accurate document-oriented data models for advanced analytics.

Tools: Python (NumPy, Pandas, statsmodels), MongoDB NoSQL

Skills: Data Engineering, Applied Statistics, ETL Pipeline, Database Management

1. Exploratory Data Analysis

image image

2. Data Strategies

2.1 Envision Data and Business Process

Our company currently has the data above stored in an Excel sheet. However, the company goal is to to establish a central database system where stakeholders can collaborate to support their business functions. Therefore, below is our data schema diagram:

image image

2.2 Data Schema Technicals

After interviewing 3 stakeholders and understanding their data requirements, we have developed schema for the 3 data collections in MongoDB with validation rules according to the business logics and inputs from stakeholders.

image

image

image

3. Analytics and Insights

3.1 Query Data from the MongoDB Database

image

3.2 Home Price Predictive Model Development

Obtaining the dataset from MongoDB query, we want to built a statistical model (Linear Regression) to understand how factors influence the home price and try to predict the home price accurately. Our approach is to split the dataset into training set (70%) and testing set (30%). We train the model on the training set and test the model performance on unseen data on the testing set. Below is the process and result of the train-test-split process:

image

3.3 Insights from the Model

As the Linear Regression model can yield very high performance in the testing set. We train the model using the entire dataset to understand the interactions. Below is the regression results:

image

Note

  • The expected average land price is $195,476 with no property, ceteris paribus
  • If the size of the home increases by 1 SQFT, the home price will increase by on average of $267.70, ceteris paribus. However,
  • If the total bed rooms in the home increases by 1 unit, the home price will decrease by $24,006.00, ceteris paribus. This might be that we sacrifice other living spaces or bedrooms for another new bedroom within a fixed amount of home size.
  • If the total floors in the home increases by 1 unit, the home price will decrease by $63,866.00, ceteris paribus. This might be that we sacrifice other floors space for another floor within a fixed amount of home size.
  • If the age of the home increases by 1 unit, the home price will decrease by $1,438.2, ceteris paribus. This might be because older home infrastructure is depreciating over time, making the home value to go down.

3.4 Business Decisions

Using the model, we can estimate the fair value of a home on the market. If the home price is currently below our estimate price, we should consider buy and invest in flipping the property to make profit.

image

THANK YOU!

About

Build MongoDB schema to house real-estate data, data pipeline to streamline data, and regression model to predict home price

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published