My first Kaggle competition
I started my first machine learning course on Udemy in May 2020. I quickly realized two things. First, simply following along with these courses is not the best way for me to learn, because it doesn't challenge me to think about a solution myself. Second, I was missing some basic knowledge on how to get started with a project. It is great to know how to perform a regression analysis, but preparing your data is a huge part of data science, and this topic is not always covered in these courses.
I was looking for a way to deepen my knowledge and came across the Machine Learning Scientist career track on Datacamp. This track consists of 23 courses on a wide range of topics, such as processing your data, feature engineering, training a wide range of models, and hyperparameter tuning. This time my approach was going to be different: I would take the courses and practice the things I learned on my own datasets, which I found on Kaggle.
I definitely challenged myself with this approach. The track can be finished in about 90 hours, but it took me at least double that amount because I ran into all sorts of problems. I couldn't install certain libraries on my laptop, I got value errors because missing values had been imputed with strings, and I had to spend time looking for the right values to tune my hyperparameters and figuring out why my R-squared was negative. At times working on these issues was challenging and frustrating, but it was also very exciting when I was able to solve a problem with a little help from Google, Stack Overflow, and the Kaggle community.
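That negative R-squared puzzled me at first, so a short illustration may help: R-squared compares a model's squared errors to those of a baseline that always predicts the mean. If the model does worse than that baseline, the score goes negative. The numbers below are made up for illustration:

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed from scratch."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices (in $1000s) and two sets of predictions
actual = [100, 150, 200, 250]
decent = [110, 140, 210, 240]   # close to the truth
awful  = [250, 100, 300, 120]   # systematically wrong

print(r_squared(actual, decent))  # positive: beats predicting the mean
print(r_squared(actual, awful))   # negative: worse than predicting the mean
```

So a negative score is not a bug in the metric: it simply means the model is worse than having no model at all.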
After finishing the Datacamp track, it was time to start my first Kaggle competition. Kaggle is a data science platform where you can find datasets, participate in machine learning competitions, and engage with other data scientists around the world. I picked the housing prices competition, where the goal is to predict the sale price of houses using 79 explanatory variables.
Despite all the knowledge I had built up from the courses, I struggled to get started with such a large number of features. I began exploring the features one by one, plotting the number of missing values and the relation to the sale price for every single feature. After performing some basic feature engineering and imputation, it was time for my first model. But which model do you start with? And which features should you include in your first model? Many of the courses I had followed warned about the risk of overfitting when including too many features, and advised starting with a simple model and gradually increasing its complexity. I started with a simple regression model with two features: overall quality and total surface area. I was pretty pleased to see an R-squared of 0.788, but my score on the public leaderboard was rather poor. Next I tried different combinations of features and different models. No matter what I did, I kept ending up in the middle of the pack, nowhere near my target of a top 10% score.
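That two-feature baseline can be sketched in a few lines with scikit-learn. The data below is synthetic, generated just for illustration; in the competition these would be the overall-quality and total-surface-area columns from the training set:

```python
# A minimal two-feature baseline, assuming scikit-learn is installed.
# The features and prices here are synthetic stand-ins, not competition data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n)        # overall quality on a 1-10 scale
area = rng.uniform(500, 4000, size=n)        # total surface area in sq ft
# Hypothetical sale price driven by both features plus noise
price = 20_000 * quality + 60 * area + rng.normal(0, 15_000, size=n)

X = np.column_stack([quality, area])         # shape (n_samples, 2)
model = LinearRegression().fit(X, price)
print(f"R-squared on training data: {model.score(X, price):.3f}")
```

The catch, as the leaderboard showed, is that a good R-squared on the training data says nothing about how well the model generalizes to the held-out test houses.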
I realized I had to go back to the drawing board. I looked at the code of other participants and realized I had spent too much time on modelling and hyperparameter tuning and too little on cleaning the data and feature engineering. I decided to start over from scratch (not an easy decision after already spending a week on the competition). This new approach paid off immediately: I jumped about 10,000 places to a top 10% score. After a bit more tweaking, I landed a top 9% score in a competition with over 45,000 participants.
It took me about two weeks and a lot of frustration and effort, but I reached my goal! Time for my next competition!
You can check out my competition code on my GitHub page here.