By Lucas | September 8, 2014
A process that began 4 months ago, the sequence of 9 Johns Hopkins Data Science Specialization courses on Coursera, wrapped up for me late last week with my last quiz in course 9, Developing Data Products. While I haven’t truly finished the specialization yet (the first ever capstone project doesn’t launch until late October), I still feel a sense of accomplishment.
According to our JHU professors, as of early August, over 800,000 students have attempted at least one course in the sequence. 14,000 students have succeeded in completing at least a single course, and 266 have completed all 9 courses. It feels good to add my name to that list. Since I’ve been writing about the courses individually, now seemed as good a time as any to share my collective thoughts on the Data Science Specialization through 9 courses.
A Steep Learning Curve
There are two main hurdles to success in the JHU sequence: programming in R and statistics. While I’m sure there are some students of the sequence that find both very natural, in my experience on the course forums, just about everybody finds one or the other a struggle initially. I teach AP Statistics, and while the JHU sequence goes beyond that level of statistics, that gave me an advantage and comfort level with the statistics challenges that many others struggled with.
R, on the other hand, did not come naturally, at least initially. It had been around 17 years since my undergraduate programming courses in Scheme and Fortran. I muddled through the R Programming class and more or less got the programs to do what I needed them to do, but it was a struggle. Through that initial struggle, I tried to trust that the lectures and activities our JHU professors had lined up for us would bring me through to the other side with a deeper understanding if I persisted with their plan.
I found that it was somewhere around the 6th or 7th week into working with R that things started to snap into place for me and it felt more like I was working with a tool than fighting with a machine. Now in week 18 or so, I still have plenty to learn, but it’s hard to look back without chuckling at the things that were a struggle just 3 months ago.
Well Conceived and Executed
I have only started dabbling in other MOOCs since I finished the JHU courses, but I’ve already started noticing some strengths of this sequence compared to other MOOCs.
- The R programming is not just “in the browser.” I’d like to write more about the value of learning to code in browser-based solutions, which seem to be very popular right now, at a later date. I think it’s great for some things. However, my gut tells me it isn’t the best way to learn to tackle big challenges, and the fact that we did significantly large projects using RStudio forced me to grow my R coding skills rapidly.
- The sequence teaches many tools. GitHub, RStudio, knitr, Shiny, RPubs, Yhat, Slidify, and more. It seemed like every course introduced a couple of new weapons to our data arsenal. I cannot believe how many tools are built on top of R alone. Discovering the R community reminds me of starting this blog and discovering WordPress. Does that make the folks at RStudio the equivalent of Automattic? Is Hadley Wickham R’s Matt Mullenweg?
- The short course lengths are great. Each course being only a month long allows so much flexibility in completing the sequence over a long time or quickly like I did. It also keeps motivation up when topics are frequently changing. Other MOOC instructors should learn from the JHU instructors and split their courses up.
- Similarly, the weekly deadlines keep motivation high. I’ve already joined a new MOOC where nothing is due until the final day of the course. Nothing. No homework, quiz, midterm, or final is due until the course end date. It’s great to have that flexibility, but is that going to motivate students to stay on pace? All I can do is speculate, but based on my observations of human nature in the classroom over the last decade plus, I’d say no.
Order of Classes Matters for Accelerated Students
I’m very motivated to start a job search late this year or early next year. I also wanted to take advantage of summer break. Consequently, I decided to cram the whole sequence of classes into 4 months. I know some people have done it in 3 months, something I considered, but decided against when I found out the capstone wasn’t happening until October (right decision in retrospect).
If you are going to cram like I did, and you don’t already know a lot of the material, be careful about the order in which you take the classes. JHU provides a course dependency chart that I followed to the letter. Here was my sequence, following those dependencies:
Month 1: Data Scientist’s Toolbox and R Programming
Month 2: Reproducible Research, Exploratory Data Analysis, and Statistical Inference
Month 3: Getting & Cleaning Data and Regression Models
Month 4: Practical Machine Learning and Developing Data Products
Here’s the thing: I wouldn’t recommend this sequence to someone trying to complete the courses in 4 months. Although Getting & Cleaning Data isn’t listed as a dependency for Reproducible Research or Exploratory Data Analysis, I think it should be listed as a “soft dependency.” I found myself having all kinds of issues, particularly in EDA, with topics that I later realized were covered in Getting & Cleaning Data. I was able to figure it out, but a simple change to the following would have made for a less stressful month 2:
Month 1: Data Scientist’s Toolbox and R Programming
Month 2: Getting & Cleaning Data, Reproducible Research, and Statistical Inference
Month 3: Exploratory Data Analysis and Regression Models
Month 4: Practical Machine Learning and Developing Data Products
Saving the Best for Last
For me, the best two classes in the sequence are two very different classes, Practical Machine Learning (review) and Developing Data Products (review). In Practical Machine Learning we finally had the chance to learn how to use machine learning to build predictive models, which felt like a very powerful use of the knowledge that had been accumulated in previous classes. Developing Data Products afforded students the opportunity to see how our R knowledge could be put to use in beautiful, interactive data visualizations on the web that could be deployed for the common man. It’s good to know that there’s a prize at the end in these two gems, but it’d be very tough for the average student to be successful in these classes without having taken the earlier classes in the sequence.
Suggested Time Investment Was Unrealistic for Me
I can only speak for myself, but I found the suggested time investment of 3-5 hours per week for each class wasn’t realistic for very many of them. The only classes I truly invested 3-5 hours a week in were the introductory Data Scientist’s Toolbox, Statistical Inference (no project at the time I took it, and I had significant prior knowledge of the material), and possibly Developing Data Products. Heck, there were classes where I invested 5 hours just getting through the videos and taking the quiz before I even started a project that took that much time again. I’m not complaining about the time I invested. It was well spent, but if I’m any indication (and maybe I’m not), the course descriptions probably need to be revised.
Instructors Seem to Care
Throughout the sequence, I have heard students comparing the JHU courses to other MOOCs. Oftentimes, they have been complimentary, occasionally critical. Some of the criticism has been of the lecture style in certain courses being too theoretical, not practical (example driven) enough, or difficult to understand. I think some of that criticism is valid. There are occasions where lectures are not perfect. However, they are good way more often than not. Additionally, the instructors have shown a willingness to accept feedback, restructuring questions, rerecording many videos, etc., in response to student feedback.
I’m not plugged into academia enough to understand the motivations of professors running MOOCs. My guesses? On one end of the spectrum, some seem to use them as a vehicle to promote books. Others may be using them as a way to “upsell” their full on-campus or paid online graduate programs. It’s probably an ego boost to some to teach to tens or even hundreds of thousands of students. I would definitely hope that the other end of the spectrum is a significant majority of those teaching MOOCs doing so for altruistic reasons, a sense of wanting to provide knowledge to those that don’t currently have the time or financial means to invest in a full degree program.
While I’ve spent a lot of time learning from them, I obviously don’t know Roger Peng, Jeff Leek, or Brian Caffo, but in their press release announcing the sequence, Jeff Leek said, “By delivering it through a MOOC, we hope to dramatically expand the pool of qualified data scientists.” These three gentlemen have clearly spent an inordinate amount of time developing a detailed sequence for students, most of whom they’ll never meet. In Dr. Caffo’s virtual office hours, he explained the time they all spent meeting with industry partners to seek input about what should be included in the sequence, so that students who completed the program would be better prepared to work in research and industry. The expense of the program is so minimal it’s almost laughable ($500 for all 9 courses and the capstone) compared to a master’s program. It’s hard not to take these guys at their word, that their primary goal is to pass along their knowledge in an area that most experts say will experience a serious shortfall of qualified workers over the coming decade.
Waiting for the Capstone
So now I’m left waiting for the big event, the very first capstone project, which is a partnership with Android keyboard maker SwiftKey. In the meantime, I’ll be a Community TA for Course 6, Statistical Inference. I’m also going to be active on DataCamp, try to pick up more Python data analysis understanding in a Coursera class that starts in a couple of weeks, and try my hand at a Kaggle competition or two.
No rest for the weary.