By Lucas | September 3, 2014
The eighth course in the Johns Hopkins Data Science Specialization on Coursera is Practical Machine Learning. This is the third and final course in the sequence taught by Jeff Leek.
Probably more than any other course in the JHU series, this is the one that feels like it brings the whole sequence together. Students of Practical Machine Learning need the skills developed throughout the rest of the sequence to be successful here, from basic R Programming (course 2) through Regression Models (course 7).
Like most of the courses in the JHU Data Science sequence, this course moves very quickly. Fortunately, some of the concepts were touched on briefly in earlier courses in the specialization, so this is a chance to explore them in more depth. Dr. Leek begins the course with a fairly detailed exploration of the concept of cross validation, explaining the importance of training and testing sets while offering alternatives such as separate training, testing, and validation sets, k-fold cross validation, and leave-one-out cross validation.
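As a minimal sketch of how those resampling schemes look in caret (illustrative only; using R's built-in iris data rather than the course's data sets):

```r
# Illustrative sketch of cross validation with caret,
# on the built-in iris data (not the course's data).
library(caret)

set.seed(123)

# 10-fold cross validation: the training data is split into 10 folds,
# and each fold takes a turn as the held-out set.
ctrl <- trainControl(method = "cv", number = 10)

# Fit a simple model under that resampling scheme.
fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Leave-one-out cross validation is the same idea with a single
# observation held out at a time:
loo_ctrl <- trainControl(method = "LOOCV")
```

The `trainControl()` object is just passed to `train()`, so swapping between k-fold and leave-one-out is a one-line change.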
A great deal of time is also spent on certain methods for creating predictive models. In particular, random forests and boosting are hit pretty hard by Leek, as he explains that they are very accurate predictors and among the most consistent methods for winning Kaggle competitions. We also spent some time looking at other methods such as regularized regression and blended models. The caret package in R is used with all of these predictive methods. In fact, caret was used for just about everything we did in this class, including creating our training and testing sets.
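The basic caret workflow described above can be sketched roughly like this (again on the built-in iris data, not the course's data; the random forest and boosting methods are selected purely by the `method` string):

```r
# Illustrative sketch of the caret workflow: stratified split,
# model fitting, and prediction (iris data, not the course's data).
library(caret)

set.seed(42)

# createDataPartition builds a stratified training/testing split.
in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# train() gives a uniform interface to many model types, e.g.:
#   method = "rf"  -> random forest
#   method = "gbm" -> boosting
rf_fit <- train(Species ~ ., data = training, method = "rf")

# Predict on the held-out test set and summarize accuracy.
preds <- predict(rf_fit, newdata = testing)
confusionMatrix(preds, testing$Species)
```

Because every model goes through the same `train()`/`predict()` interface, trying boosting instead of a random forest is mostly a matter of changing the `method` argument.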
There were other topics covered in Practical Machine Learning that didn’t seem to get the same level of attention on assessments, such as pre-processing of data, exploratory data analysis, and measures of error such as root mean squared error (RMSE).
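For reference, RMSE is simple to compute by hand, and caret also reports it for numeric predictions. A tiny sketch with made-up numbers:

```r
# Root mean squared error on hypothetical numeric predictions.
obs  <- c(3.1, 2.8, 4.0)   # made-up observed values
pred <- c(3.0, 3.0, 3.8)   # made-up predictions

rmse <- sqrt(mean((pred - obs)^2))   # about 0.173 here

# Equivalent via caret, which also reports R-squared:
# postResample(pred, obs)
```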
When I took the course in August of 2014, there were 4 quizzes and a final project. For the final project, students were required to use the caret package to make 20 predictions about a data set that, fortunately, had pretty strong predictors. I got all 20 predictions right on the first try using my method of choice.
This class was critical to giving me some insight into the methods that are used in modern predictive algorithms. That said, I feel like the greatest weakness in this class is that I came away from it without a strong sense of when to use which model (i.e., what features inherent in a data set indicate that a random forest is a better initial method to try than another method). Feature selection also continues to be challenging, just as it was with Regression Models.
For me, that doesn’t detract from the fact that, along with Developing Data Products, this was one of the two most interesting courses in the sequence. I also feel like I am ready to dive into what Kaggle has to offer now without feeling like it would be too intimidating.
Click here to register for the Johns Hopkins Data Science Specialization on Coursera. (Affiliate link, thanks for your support!)