Dmitry Ulyanov and Marios Michailidis are instructors of How to Win a Data Science Competition: Learn from Top Kagglers, part of the Advanced Machine Learning Specialization. Dmitry received his Master’s degree at Moscow State University with a major in machine learning and mathematical methods of forecasting. Marios Michailidis is a research data scientist at H2O.ai and received his PhD in machine learning at UCL with a focus on ensemble modeling. In his spare time, he loves competing in data science competitions and was ranked 1st out of 500,000 members on Kaggle. Read on to discover what you could learn from taking their course:
What is the backstory on how this course came together?
Dmitry: Learning data science was a wonderful process as there were so many things to discover. I learned by trial and error, and it was always fascinating to read solution reports from the competition winners. I compared my solution to theirs and tried to find what I had missed or done wrong. However, the competition reports only gave me information about the high-level ideas that worked out, and they usually did not provide any details on the competition-solving process. I was participating in (and later organizing) meetups where we discussed past competitions in depth. It was a great chance to ask a knowledgeable person almost any question and “steal” their knowledge.
My co-instructors and I were united by competitive data science, and each of us shares almost the same learning story. We each gained a lot from the community and felt that we had something to give back. We had been nurturing the idea of creating a video course for a long time, and felt that the invitation to be a part of the Advanced Machine Learning Specialization was a sign for us to finally do it.
Can you tell learners about your background and passion for Data Science?
Marios: At the end of my last semester at the University of Southampton, where I was doing my Master’s in Risk Management, I was trying to find what to do next with the skills I had acquired. I started attending some entrepreneurship talks where various business professionals described how they succeeded and what they did after finishing their studies. I was lucky to attend the talk of a person (his name escapes me right now) who started going to horse races after graduating. He was not gambling, but he was collecting data, and after 2 years he built a model to predict the winner and started making money. The whole concept seemed like a superpower to me (i.e. the ability to predict the future), and in general I was impressed with how he created value and income from nothing with a very well-structured approach of gathering data and teaching himself classification using logistic regression.
After this, I started learning data techniques used in prediction. I started learning tools such as SAS and SPSS, but I wanted to have more autonomy over what I create, and therefore I had to teach myself a programming language. I started learning Java from a book and then I started implementing many techniques (such as decision trees, neural networks, regressions, K nearest neighbors, etc.). Once I had implemented a few of them, I made them available through a software package named KazAnova. This was named after my mother’s last name (Kazani) and statistics (Anova) and encompasses the love for family and science.
What did it take to become the top Kaggler in the world?
- Understand the problem well. Understand the metric you are being tested on as well as the dynamics of your training and test data. Is your test data in the future? Is it time series? Does your test data contain new entities (e.g. new customers, products)? The answers to all these questions define how you need to validate your solutions internally in order to get reliable and accurate results.
- Be disciplined. When defining that internal reliable validation environment, exploit it within reason. Never try something that you cannot actually replicate for the test data. You need to treat your validation data like test data. This helps to avoid leakage.
- Try problem-specific things. For instance in image classification you need CNNs and for text data you might need tf-idf, stemming, spell checking etc. You need to know what works best for each problem.
- To generalize the previous point, you need to know the tools, programming languages, libraries, and techniques. You also need to make certain you update your arsenal constantly with new tools.
- Good hardware to try many things. Image classification competitions need GPUs, while for tabular datasets CPUs with multiple cores would also do.
- Collaborating with other people and forming teams. This works well for various reasons. People tend to see the problem from different angles, ultimately uncovering more information about the target variable. At the same time you could sub-divide tasks among team members to cover more ground.
- Ensembling, by means of combining many different (ideally diverse) approaches together in order to get a better result.
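To make the first point of the list concrete, here is a minimal sketch of a validation split for the case where the test data lies in the future and may contain new entities. The field names (`date`, `customer`, `target`), the toy rows, and the helper functions are hypothetical and chosen only for illustration; a real competition would need a split that mirrors its own train/test design.

```python
from datetime import date


def time_based_split(rows, cutoff):
    """Split rows so that every validation row is dated on or after the cutoff
    and every training row is dated strictly before it."""
    train = [r for r in rows if r["date"] < cutoff]
    valid = [r for r in rows if r["date"] >= cutoff]
    return train, valid


def unseen_entities(train, valid, key):
    """Return the entity ids that occur in validation but never in training."""
    seen = {r[key] for r in train}
    return {r[key] for r in valid} - seen


# Toy data (hypothetical): one row per customer event.
rows = [
    {"date": date(2017, 1, 5), "customer": "a", "target": 1},
    {"date": date(2017, 2, 9), "customer": "b", "target": 0},
    {"date": date(2017, 3, 1), "customer": "a", "target": 0},
    {"date": date(2017, 4, 2), "customer": "c", "target": 1},
]

train, valid = time_based_split(rows, cutoff=date(2017, 3, 1))
new_customers = unseen_entities(train, valid, key="customer")
```

If the real test set contains customers that never appear in training, the validation set should as well; checking `new_customers` is one simple way to see whether the split reproduces that property.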
What are the most important skills for learners to master for a career in Data Science?
- Ability to explain complicated concepts to people outside the field. This can really help more data science methods get used, because businesses become less afraid to implement black-box approaches.
- Be good at PowerPoint presentations and improve your skills in data visualization too. It can really make your life easier.
- Ability to adjust complexity/accuracy of any data science solution to meet the business needs. I certainly love big ensembles of models, but realistically these are rarely used in practice as they are too costly for the uplift they yield. While sometimes it is worth exploiting complexity to achieve more accuracy, quite often a simple model may be able to do the trick based on the resources available.
Why should learners take this course?
Marios: In the course you learn how to solve data science challenges competitively. You can experience similar improvements in your data science careers from doing the same. You will learn:
- How to solve predictive modelling competitions efficiently, and which of the skills obtained are applicable to real-world tasks.
- How to preprocess the data and generate new features from various sources such as text and images.
- Advanced feature engineering techniques such as generating mean encodings, using aggregated statistical measures, or finding nearest neighbors as a means to improve your predictions.
- How to form reliable cross-validation methodologies that help you benchmark your solutions and avoid overfitting or underfitting when tested with unobserved test data.
- How to analyse and interpret the data. You will become aware of inconsistencies, high noise levels, errors and other data-related issues such as leakages, and you will learn how to overcome them.
- Different algorithms, and how to tune their hyperparameters efficiently to achieve top performance.
- The art of combining different machine learning models, and how to ensemble.
- How to read past (winning) solutions and code.
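As an illustration of the mean-encoding and cross-validation points in the list, here is a minimal sketch of out-of-fold mean encoding in plain Python. It is not taken from the course material: the function name, the toy data, and the simple interleaved fold assignment (row index modulo the number of folds) are assumptions made for brevity, and in practice shuffled or time-aware folds are more common.

```python
from collections import defaultdict


def oof_mean_encode(categories, targets, n_folds=2):
    """Encode each row's category with the mean target of that category,
    computed only on rows from the other folds, so a row's own target
    never leaks into its feature."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate target statistics on every row outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode the held-out rows; categories unseen elsewhere get the global mean.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded


# Toy data (hypothetical): category "z" occurs only once.
categories = ["x", "x", "x", "y", "y", "z"]
targets = [1, 0, 1, 1, 0, 1]
encoded = oof_mean_encode(categories, targets, n_folds=2)
```

Because the single `"z"` row has no other rows to learn from, it falls back to the global target mean, which is the kind of case where a naive mean encoding over all rows would leak the target directly into the feature.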
Anything else you would like to highlight about your course?
Marios: There is some criticism about Kaggle competitions and similar challenges for not being exactly like ‘real-life problems’, which is true. Participating in Kaggle competitions is like participating in the Olympics of data science, and in order for it to work on a large scale you need to define some metrics and impose certain constraints to make it viable and easy for many people to participate. This does not mean that it is not valuable. If you think about it, being able to run really fast (as in the Olympics) isn’t a very useful real-life skill on its own (unless someone steals your bag and you need to catch him), but within the context of what is the theoretical best you can get, given certain constraints, it is a very valuable and impressive skill to have. Luckily, from participating in Kaggle competitions you can obtain much more useful (data science) skills than ‘just running really fast’. However, in order to get the whole data science package, some external experience is needed too. Ultimately, the course will make you ‘expert runners’ and can definitely help you enter or strengthen your presence in the data science space.
Sign up for How to Win a Data Science Competition: Learn from Top Kagglers, part of the Advanced Machine Learning Specialization.