Machine Learning
I've built multiple machine learning and deep learning models across several projects, including Titanic survival prediction, Boston housing price prediction, customer segmentation, finding donors for charities, training a smart driverless cab, and image recognition.
I learned a lot about machine learning algorithms while working on those projects: how to identify underfitting (high bias) and overfitting (high variance); why we should use cross-validation, grid search, and hyper-parameter tuning; which measurements (such as mean absolute error, accuracy, precision, recall, and AUC) are best for judging whether a model is good; how to tune a model; how a machine learning pipeline works; and when to use supervised learning, unsupervised learning, and reinforcement learning.
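As a quick illustration, scikit-learn exposes all of these measurements directly (a minimal sketch; `y_true`, `y_pred`, and `y_score` are stand-ins for a model's true labels, predicted labels, and predicted probabilities, not values from my projects):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score,
                             mean_absolute_error)

# y_true: true labels, y_pred: predicted labels,
# y_score: predicted probabilities for the positive class
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))

# For regression models, mean absolute error plays the same role:
# mean_absolute_error(y_true_regression, y_pred_regression)
```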
Most importantly, I have learned that machine learning is not just about fun technologies, engineering principles, or big data. Machine learning is more about finding the most interesting relationships in a sea of data, then capturing and processing those relationships mathematically and statistically to solve mysteries that interest you. In my opinion, we are only in the first phase of machine learning. For most algorithms, humans are still teaching, training, and tuning machines based on our knowledge, opinions, and perceptions. Gradually, I believe, most machines will learn and teach themselves without human intervention. That's why we need to take the AI takeover scenario very seriously. I am an advocate for building good, ethical AI! Once I start talking about AI takeover, I just can't stop, so we can discuss AI in another post. Let's talk about my machine learning projects now!
The first machine learning project I worked on predicts whether someone survived the Titanic tragedy. In 1912, the RMS Titanic struck an iceberg and sank, and most of the passengers died. Working from a subset of the Titanic passenger dataset, I extracted the features that best predict survival and used them to build a decision tree manually. I trained and tested the model on the project's dataset and got a pretty good accuracy rate from this hand-built decision tree.
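To give a flavor of what a manually built decision tree looks like, here is an illustrative sketch, not the exact tree from my repo; the split rules and the feature names `Sex`, `Age`, and `Pclass` are assumptions based on the standard Titanic dataset:

```python
def predict_survival(passenger):
    """Hand-built decision tree: returns 1 (survived) or 0 (did not)."""
    # Women survived at a much higher rate than men
    if passenger["Sex"] == "female":
        return 1
    # Young boys in first or second class also tended to survive
    if passenger["Age"] < 10 and passenger["Pclass"] in (1, 2):
        return 1
    return 0

# Example usage
print(predict_survival({"Sex": "male", "Age": 8, "Pclass": 2}))   # 1
print(predict_survival({"Sex": "male", "Age": 30, "Pclass": 3}))  # 0
```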
You can check the project out here at my GitHub repo:
Another project I worked on is Boston housing price prediction. The housing data was collected in 1978 and multiplicatively scaled to account for 35 years of market inflation; it comes from the UCI Machine Learning Repository. The goal of the project was to build a machine learning model that estimates Boston housing prices from the available housing data. This is the project where I started using a performance metric, the R² score, to measure a model's performance, along with cross-validation, grid search, and DecisionTreeRegressor from the sklearn library, adjusting the depth of the tree to tune and optimize the model. Finally, I picked some of the most important features, such as the total number of rooms in the home, the neighborhood poverty level, and the student-teacher ratio of nearby schools, as input to the DecisionTreeRegressor algorithm I chose. It gave good predictions on test data from this dataset. However, the model has significant limitations for predicting in today's housing market because of the limitations of the dataset.
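The core of that workflow looks roughly like this (a hedged sketch, assuming `features` and `prices` have already been loaded from the housing dataset; the parameter ranges are illustrative, not the exact values from my repo):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import make_scorer, r2_score

# Assume `features` and `prices` are loaded from the housing dataset
X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=42)

# Cross-validated grid search over tree depth, scored by R^2
params = {"max_depth": range(1, 11)}
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    params,
    scoring=make_scorer(r2_score),
    cv=10,
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
print("Best max_depth:", grid.best_params_["max_depth"])
print("Test R^2:", r2_score(y_test, best_model.predict(X_test)))
```

Grid search with cross-validation is what keeps the depth choice honest: a depth picked to maximize training R² alone would overfit.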
You can check the project out here at my GitHub repo:
One interesting project I worked on is finding donors for charity using machine learning. The goal is to predict whether an individual makes more than $50,000 a year, which can help non-profit organizations that survive on donations make better decisions about who to reach out to. In this project, I learned how to clean up raw data, including analyzing skewed distributions, dealing with outliers, and preprocessing (one-hot encoding, i.e., converting categorical data into binary features). I tried three algorithms: decision trees, support vector machines, and an ensemble method (gradient boosting). Based on their performance, I picked gradient boosting as the final algorithm for this model, and it produced good results.
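The preprocessing and modeling steps fit in a few lines (a minimal sketch, assuming `data` is the census DataFrame; the column name `income` and the label `>50K` are assumptions based on the standard census income dataset):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score

# Assume `data` is the raw census DataFrame with an `income` column
income = (data["income"] == ">50K").astype(int)          # binary target
features = pd.get_dummies(data.drop(columns="income"))   # one-hot encoding

X_train, X_test, y_train, y_test = train_test_split(
    features, income, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# F0.5 weights precision over recall -- useful when each outreach is costly
print("F0.5:", fbeta_score(y_test, clf.predict(X_test), beta=0.5))
```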
You can check out the project here at my GitHub repo:
This is the project where I used an unsupervised learning algorithm. The goal was to identify the variation among different types of customers (customer segmentation) so that a wholesale distributor could better structure its service around target customer segments and optimize the customer experience. After cleaning the data, I used principal component analysis (PCA) to reduce its dimensionality. For this project, I chose the Gaussian Mixture Model over the K-Means clustering algorithm because GMM performs soft clustering, doesn't bias the cluster sizes, and works well on small datasets. In addition, I designed an A/B testing experiment at the end of the project to help the distributor figure out which customer segment would react positively to a change in delivery service.
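The PCA-then-GMM pipeline looks roughly like this (an illustrative sketch, assuming `spending` is a NumPy array of preprocessed customer spending features; the component counts are placeholders, not the values from my analysis):

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Assume `spending` holds preprocessed customer spending features
pca = PCA(n_components=2)
reduced = pca.fit_transform(spending)

# Soft clustering: each customer gets a probability per segment,
# not just a hard label as in K-Means
gmm = GaussianMixture(n_components=2, random_state=42).fit(reduced)
labels = gmm.predict(reduced)
probs = gmm.predict_proba(reduced)   # membership probabilities

print("Silhouette score:", silhouette_score(reduced, labels))
```

Those membership probabilities are exactly what makes GMM "soft": a customer near a segment boundary is flagged as ambiguous rather than forced into one cluster.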
You can check out the project here at my GitHub repo:
My first reinforcement learning project was training a smart driverless cab. In this interesting project, I trained a driverless car in a simulated, game-like environment where I set up roads, rules, rewards, and penalties. I used an optimized Q-Learning algorithm to teach a driving agent to drive in the simulation. Whenever the agent violates a rule, it is penalized (its score is deducted); when it follows the rules correctly, it is rewarded (its score increases). Two scores measure the agent's performance: safety and reliability. After thousands of trials and errors, the agent finally learned to drive safely and reliably.
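At the heart of that training loop is the tabular Q-learning update (a minimal sketch, not the optimized version from my repo; the hyper-parameter values and the helper names are illustrative):

```python
import random
from collections import defaultdict

Q = defaultdict(float)           # Q[(state, action)] -> expected return
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def choose_action(state, actions):
    # Epsilon-greedy: explore occasionally, otherwise act greedily
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    # Q-learning: nudge Q toward reward + discounted best future value.
    # A rule violation arrives as a negative reward, a correct move
    # as a positive one, so the table encodes the scoring scheme.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])
```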
You can check out the project here at my GitHub repo:
One of the most fun deep learning projects I worked on predicts a dog's breed from an image. If a dog is detected in the image, the model estimates its breed; if it detects a human face instead, it tells you which dog breed that face most resembles. These are the image recognition architectures I used for this project: VGG-19, ResNet-50, Inception, and Xception. Let's see the results! The model did misclassify one of them; can you find it?
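For a sense of how these pretrained networks are used, here is a sketch of inference with ResNet-50 in Keras (the file path `dog.jpg` is a placeholder; the real project fine-tunes on dog-breed labels rather than using raw ImageNet classes):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Pretrained ResNet-50 with ImageNet weights
model = ResNet50(weights="imagenet")

# "dog.jpg" is a placeholder path for the input image
img = image.load_img("dog.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
for _, label, prob in decode_predictions(preds, top=3)[0]:
    print(f"{label}: {prob:.2%}")
```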
You can check out the project here at my GitHub repo: