By Lucas | January 20, 2015
Overview of the Data Science Capstone Project and Approach
The Johns Hopkins Data Science Capstone project concluded around Christmas last month. It was an interesting experience, and very different from the other classes. The project, a partnership with smartphone app maker SwiftKey, required students to create a predictive text web app that worked much like a smartphone keyboard.
I spent much of the project's nearly two months getting up to speed on the basic terminology and approaches of Natural Language Processing, the field dedicated to the interaction between computers and human languages. My basic approach was to build very large sparse matrices in R of unigrams, bigrams, and trigrams (one-, two-, and three-word phrases), with the n-grams as rows and the possible next words, drawn from the corpus of text provided by SwiftKey, as columns. This is much like a Markov transition matrix, except that I stored frequencies rather than probabilities for storage reasons and didn't normalize until making a prediction. You can see my app and more on my approach here.
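For illustration, here is a minimal sketch of this kind of frequency-table-with-backoff approach. It is written in Python rather than R, uses plain dictionaries of counts instead of sparse matrices, and all function names are my own invention, not the actual capstone code; the key ideas it shows are storing raw next-word frequencies per n-gram context and normalizing only at prediction time.

```python
from collections import defaultdict, Counter

def build_ngram_tables(corpus, max_n=3):
    """Count next-word frequencies for every context of length 1..max_n.

    tables[n] maps an n-word context (tuple) to a Counter of the words
    that follow it in the corpus -- raw frequencies, not probabilities.
    """
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    tokens = corpus.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            next_word = tokens[i + n]
            tables[n][context][next_word] += 1
    return tables

def predict_next(tables, phrase, max_n=3):
    """Back off from the longest matching context to shorter ones.

    Counts are normalized into a probability only here, at query time.
    Returns (word, probability), or (None, 0.0) if no context matches.
    """
    tokens = phrase.lower().split()
    for n in range(min(max_n, len(tokens)), 0, -1):
        counts = tables[n].get(tuple(tokens[-n:]))
        if counts:
            total = sum(counts.values())
            word, freq = counts.most_common(1)[0]
            return word, freq / total
    return None, 0.0

# Toy usage on a tiny corpus:
tables = build_ngram_tables("the cat sat on the mat the cat sat on the rug")
print(predict_next(tables, "the cat"))  # -> ('sat', 1.0)
```

Keeping raw counts and dividing by the row total only at prediction time is what lets the stored tables stay integer-valued and compact, at the cost of a small amount of arithmetic per query.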
Strengths of the Capstone
The problem was extremely open ended. I was reminded of Andrew Wiles' description of solving a mathematics problem as starting in a dark room: one fumbles around for a light switch, finds it and gains some understanding, then moves on to another dark room to search for more. Much can be learned through this kind of process.
The capstone project did a nice job of bringing together the various tools used throughout the data science specialization. Successfully completing the project required RStudio Presenter, Shiny, knitr, and the various graphical tools covered in the earlier courses.
The time frame, 7 weeks, was longer than the previous courses. A single month would simply not have been enough to do this type of project, even with significant time to work on it in the evenings. I needed a full week at the beginning just to gain an understanding of the problem.
Weaknesses of the Capstone
The predictive modeling methods I employed felt disconnected from the rest of the course sequence. From the forums, I got the impression that only a few students found a way to apply machine learning methods to this problem, something many of us were hoping to do.
The benchmarks were extremely low. My model predicted the next word correctly out of sample about 15% of the time, consistent with what most other students reported. SwiftKey themselves reported that their app predicts correctly a little less than 30% of the time. This isn't particularly shocking if you think about the problem, but I'm not sure it's the right kind of problem to send students after as one of their first "real world" problems. It's hard to feel like you are making progress at that level of accuracy.
Bottom Line Data Science Specialization Capstone Review
While I can certainly say I learned during the capstone project, I didn't get exactly what I was hoping for out of it. I know that a project like this will look quite different each time it is offered, depending on who the project partner is and what problem is selected. My hope for future iterations of this class is to see it emphasize machine learning concepts more. Machine learning is a critical skill for data scientists, and since it is one of the last concepts taught in the JHU sequence of data science classes on Coursera, there is little chance for it to "sink in." By emphasizing it more in future offerings of the capstone, the JHU team would make the sequence feel more complete. We were told that about 500 of the 800 students who enrolled in the capstone successfully completed it this time; you'd expect that number to grow with future offerings.