I spent several weeks at the end of 2016 learning about and experimenting with machine learning. It has been a hot topic in recent years and is behind many of the innovations we are benefitting from today. Plus, the possibility that in the near future we could be able to model human cognition is pretty exciting.
Here I want to give a recap of what I learned and provide a little bit of guidance for those who want to get started with machine learning but don't know where to begin. It can seem daunting at first, but once you get past all of the unfamiliar terminology and learn about the great libraries around, harnessing this tool is quite doable.
Supervised Learning (Tell The Machine What's What)
Supervised learning entails processing a labeled dataset in order to build a model. What do I mean by labeled? For example, an apple [label] is red [attribute 1] and weighs 150g [attribute 2]. What do I mean by model? The model is the collection of knowledge that the algorithm uses to make decisions.
When the trained model is given new data, it compares the features (attributes) of each sample (a data item) against what it has learned. If the features match those of a known label, the sample is assigned that label; if they do not, it is not. For example, if a sample is encountered that is orange and weighs 50g, it is probably not an apple. This entire process is called Classification.
Common supervised algorithms include Logistic Regression, Support Vector Machines, and Decision Trees.
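To make the apple example concrete, here is a minimal sketch using a scikit-learn Decision Tree. The tiny dataset, the numeric encoding of color (1 for red, 0 for orange), and the "not apple" label are my own assumptions for illustration, not real measurements.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled dataset. Each sample is [is_red, weight_in_grams].
X_train = [
    [1, 150], [1, 160], [1, 140],   # samples labeled "apple"
    [0, 50], [0, 55], [0, 45],      # samples labeled "not apple"
]
y_train = ["apple", "apple", "apple", "not apple", "not apple", "not apple"]

# Build the model from the labeled data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Classify two new samples: a red 155g one and an orange 50g one.
predictions = clf.predict([[1, 155], [0, 50]])
print(predictions)
```

Swapping in another classifier (for example, sklearn.svm.SVC) should only require changing the line that creates clf, since scikit-learn estimators share the same fit/predict interface.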
Unsupervised Learning (Let The Machine Figure It Out)
On the other hand, unsupervised learning is where you have unlabeled data and rely on the algorithm to distinguish the different classes based on inspecting the attributes of each sample in the dataset and noting similarities.
This is commonly accomplished with the straightforward K-means Clustering algorithm. This algorithm works by assigning each sample to the cluster whose center (the mean of the feature values of its members) is nearest to the sample's own feature values, and then recomputing each center. This process repeats until no regroupings can be made and you are left with clusters of samples. There are your classes.
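Here is a rough sketch of clustering with scikit-learn's KMeans, which assigns each sample to its nearest cluster center and repeats until the groupings settle. The six samples and their two features (weight and diameter) are made up for illustration, and I am assuming we already know we want two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled samples. Each row is [weight_in_grams, diameter_in_cm].
X = np.array([
    [150, 7.5], [160, 8.0], [145, 7.2],
    [50, 4.0], [55, 4.3], [48, 3.9],
])

# Ask for two clusters; n_init=10 reruns the algorithm from ten starting points
# and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

cluster_ids = kmeans.labels_          # one cluster number per sample
centers = kmeans.cluster_centers_     # mean feature values of each cluster
print(cluster_ids)
print(centers)
```

The cluster numbers themselves are arbitrary; what matters is which samples end up sharing a number.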
The most popular library is the Python module scikit-learn. There are others, but this is the perfect place to start and will give you a lot of runway.
Part and parcel of the machine learning toolkit, Matplotlib allows you to make plots of your datasets and visualize the results of your learning models.
Prepare The Data
Having a good dataset is critical. You need to remove any samples that are missing feature values, or exclude those features. Also, eliminate any features that are not necessary for distinguishing meaningful classifications in your data. Fewer features to analyze means less processing is required.
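As a small sketch of this kind of cleanup, the snippet below drops samples with missing values and then removes a feature that never varies. The data and feature names are hypothetical, and using VarianceThreshold to spot an unnecessary feature is just one simple heuristic I am assuming here; deciding which features matter usually takes some knowledge of your data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical samples. Each row is [weight_in_grams, is_red, is_fruit].
# np.nan marks a missing feature value.
X = np.array([
    [150.0, 1.0, 1.0],
    [np.nan, 1.0, 1.0],
    [50.0, 0.0, 1.0],
    [160.0, 1.0, 1.0],
])

# Keep only the samples (rows) that have no missing values.
complete_rows = ~np.isnan(X).any(axis=1)
X_complete = X[complete_rows]

# Drop features (columns) that have the same value in every sample,
# since they cannot help tell classes apart.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X_complete)

print(X_complete.shape)
print(X_reduced.shape)
```

If you also have an array of labels, filter it with the same complete_rows mask so the samples and labels stay aligned.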
You have some decisions to make based on your data. If your data is labeled and you have well-defined classes, use a supervised algorithm. If your data is unlabeled, choose an unsupervised approach.
And then there are the questions of performance and accuracy. Some algorithms are inherently slower on large datasets. Tuning algorithm parameters can also give you some gains in performance and accuracy.
Train, Test, Tune, Repeat
Whatever algorithm you choose, the process is pretty much the same.
- Divide your dataset into training and test datasets.
- Train your model on your training dataset.
- Check your accuracy on your test dataset.
- Tune algorithm parameters.
- Re-train your model until you reach the accuracy you require.
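The steps above can be sketched in a few lines of scikit-learn. This is only an illustration under my own assumptions: it uses the iris dataset that ships with scikit-learn, a Decision Tree as the algorithm, and max_depth as the one parameter being tuned.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Divide the dataset into training (70%) and test (30%) datasets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

best_depth, best_accuracy = None, 0.0
for depth in (1, 2, 3, 4):
    # Tune a parameter, then re-train the model on the training dataset.
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)

    # Check the accuracy on the test dataset.
    accuracy = accuracy_score(y_test, model.predict(X_test))
    if accuracy > best_accuracy:
        best_depth, best_accuracy = depth, accuracy

print(best_depth, best_accuracy)
```

One caveat: if you tune repeatedly against the same test dataset, the final accuracy can look better than it really is. A common remedy is to hold out a separate validation set or use cross-validation (for example, sklearn.model_selection.GridSearchCV) for the tuning step.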
Before You Begin
A lot of machine learning courses list statistics and linear algebra courses as prerequisites. I found that this was not really necessary. If you want to understand the formulas behind certain algorithms, you had better have calculus still fresh in mind (I skimmed over these definitions).
However, it will be easier to get started if you at least know:
- What's a vector?
- What's a matrix?
- What do linear/quadratic/logarithmic graphs look like?
Khan Academy is a good place to brush up on these.
What I Did
I looked at a few different courses, but ultimately decided on Google's Udacity course Introduction to Machine Learning. I felt it moved at a good pace and provided a broad enough introduction without getting too far into the weeds.
I paired the Udacity course with the excellent book, Python Machine Learning by Sebastian Raschka. Sebastian's book starts from the earliest building block of machine learning (the Perceptron), and continues all the way to building a neural network. There is more depth here than the Udacity course and the math behind the algorithms is explained if you are interested.
Once I felt comfortable with the concepts thus far, I worked through some tutorials on Kaggle. Kaggle is a community for machine learning, providing exercises and hosting competitions. I found their tutorials on deep learning easy to follow.
The Stanford course is quite popular, but to me it seemed very algorithm-focused and covered a lot in a short amount of time. If you have the time (and money), this can be a good option.
Udacity also hosts the machine learning courses corresponding to Georgia Tech's OMSCS program. These courses are more drawn out than the Intro course, spanning three different sessions for the same breadth.
What struck me most was how little code was required to make some conclusion about a dataset. The truth is that all of the hard work is encapsulated in the algorithms provided by libraries. Developing a machine learning solution is more a matter of ensuring you have good data and selecting an appropriate algorithm to use.
Of course, if you want to innovate in this area or seek performance gains for a machine learning system, you need intimate knowledge of the algorithms and significant background in calculus and statistics.
I'm more interested in the less math-y aspects of computer science, but it will be interesting to follow the continuing research in deep learning and the potential unleashed by advancements therein.