Zillow Housing Price Prediction
I built this Zillow Housing Price prediction project as the graduation project for the Udacity Machine Learning Engineer Nanodegree program. The project originated from a Kaggle competition: in 2017, Zillow launched a $1 million competition on Kaggle to improve its home-valuation algorithm, "Zestimate". I found the dataset and problem statement quite interesting.
After discussing it with my Udacity program mentor, I decided to use it as my graduation project. I chose two algorithms I had never studied or used before. On one hand, I wanted to test how well I could master new algorithms in a final project; on the other hand, I was genuinely interested in learning about XGBoost and LightGBM, which have been winning algorithms in multiple Kaggle competitions. In the end, I chose to ensemble them to produce an even stronger model for this project.
Zillow created "Zestimate", which gives customers a wealth of information about homes and housing markets at no cost, using publicly available data.
Zillow creates and improves "Zestimate" with 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. This work reduced the median margin of error from 14% to 5%, and Zillow announced the Kaggle competition to improve the accuracy of "Zestimate" even further.
The Zillow competition has two rounds. The first round is to build a model that predicts Zillow's residual error; the final round is to build a home-valuation algorithm from the ground up using external data. My project focuses on the first round: the goal of the capstone project is to build a model that improves on Zillow's residual error.
This is a typical supervised machine learning problem: supervised learning algorithms learn from labeled training data and produce a function that predicts the output for new inputs. Zillow provided datasets of the log error between the Zestimate price and the actual sale price for both 2016 and 2017 (the labels) and asked for predictions of that log error. Similar machine learning tasks include weather apps that predict the temperature at a given time and spam filters that classify email based on previously labeled spam.
First, exploratory data visualization was performed on several aspects of the problem: the input dataset, missing values, non-numerical data, the distribution of the target values, and the correlation of each feature with the target.
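The exploratory steps above can be sketched in pandas. The actual notebook code isn't shown in this write-up, and the tiny synthetic frame below (with a few Zillow-style column names and the `logerror` target) is only a stand-in for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Zillow properties table; the real dataset
# has dozens of feature columns plus the 'logerror' target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "bathroomcnt": rng.choice([1.0, 2.0, 3.0, np.nan], size=500),
    "bedroomcnt": rng.choice([2.0, 3.0, 4.0], size=500),
    "taxvaluedollarcnt": rng.normal(400_000, 150_000, size=500),
    "logerror": rng.normal(0.0, 0.16, size=500),
})

# Fraction of missing values per column -- features that are mostly missing
# are candidates for dropping during preprocessing.
missing_ratio = df.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Distribution of the target: logerror is roughly centered on zero.
print(df["logerror"].describe())

# Correlation of each numeric feature with the target.
corr_with_target = df.corr(numeric_only=True)["logerror"].drop("logerror")
print(corr_with_target)
```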
After understanding the dataset, data preprocessing was carried out to make sure the training dataset was well prepared and valid.
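A minimal preprocessing sketch for data of this shape might impute missing numeric values and encode the non-numeric columns; the column names below are hypothetical and the exact steps the project used aren't detailed here:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the merged training table; real column names differ.
df = pd.DataFrame({
    "taxvaluedollarcnt": [350_000.0, np.nan, 510_000.0, 420_000.0],
    "propertycountylandusecode": ["0100", "0101", None, "0100"],
    "logerror": [0.02, -0.05, 0.11, 0.00],
})

# Impute missing numeric feature values with the column median.
num_cols = df.select_dtypes(include="number").columns.drop("logerror")
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Encode non-numeric columns as integer category codes (missing -> -1),
# a representation that tree-based models such as LightGBM handle well.
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes

print(df.dtypes)
```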
Then the benchmark model was implemented with reasonable default parameters and cross-validation, and evaluated on the test data against the performance metric, mean absolute error (MAE).
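The benchmark step can be sketched with scikit-learn. The write-up doesn't name the benchmark algorithm, so a random forest on synthetic data stands in here purely to show the cross-validation-plus-MAE workflow:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hypothetical benchmark model with reasonable defaults.
benchmark = RandomForestRegressor(n_estimators=100, random_state=0)

# 5-fold cross-validation scored by MAE (scikit-learn negates error metrics).
cv_mae = -cross_val_score(benchmark, X_train, y_train, cv=5,
                          scoring="neg_mean_absolute_error")
print("CV MAE: %.3f +/- %.3f" % (cv_mae.mean(), cv_mae.std()))

# Final check on the held-out test split.
benchmark.fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, benchmark.predict(X_test))
print("Test MAE: %.3f" % test_mae)
```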
After the benchmark implementation, cross-validation was refined and parameters were tuned to produce the second model, a LightGBM model. The LightGBM model was trained and evaluated against MAE and achieved a better score. An XGBoost model was implemented and tuned after that.
Eventually, a meta-model combining the LightGBM and XGBoost models was built. The final combined model was evaluated against three different test datasets to check its consistency. It achieved the best MAE among all the models and performed consistently across the test sets.
The MAE for the final model is 0.06939.
You can check out the project at my GitHub repo: