By Lucas | August 15, 2014
The third course in the Johns Hopkins Data Science Specialization on Coursera is Getting and Cleaning Data. The purpose of this class is to get students familiar with the process of creating a “tidy” data set from a variety of different sources. Like The Data Scientist’s Toolbox, this class is taught by Jeff Leek.
The breadth of material covered in this course was spectacular. Dr. Leek spent the majority of the first two weeks of the course explaining how to read a variety of data sources into R, some of which I was pretty familiar with, but others I was learning about for the first time. Among the included data sources were Excel files, XML, JSON, HDF5, and MySQL. I wish we had spent a little more time on SQL since this seems to be about the only place the Data Science Specialization touches on it, but this class had so many topics that it was impossible to spend too much time on any one.
I also found the subject of reading data from APIs into R fascinating. I’ve heard the term API thrown around so much over the last 5 years or so, but beyond knowing that it stood for Application Programming Interface, all I really knew was that it allowed programmers to tap into existing web services somehow to create their own apps. We practiced accessing Twitter from R using the Twitter API, and it was so interesting that I later spent some of my own time experimenting with the LinkedIn API in R.
There was time spent learning how to subset and sort data as well, much of it a review of material from the R Programming class, but my guess is that most people will need that review unless they took the classes out of sequence like me. There’s also significant time in week 4 devoted to editing text variables, working with dates, and plain-text search functions such as “grep.”
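For readers who have not seen these operations, here is a minimal base R sketch of the kind of subsetting, sorting, text editing, pattern searching, and date handling described above. The data frame and all of its names and values are made up for illustration and are not taken from the course materials.

```r
# Hypothetical toy data; none of these names or values come from the course.
survey <- data.frame(
  name   = c("Ann Lee", "bob STONE", "Cara Diaz", "Dev Patel"),
  city   = c("Baltimore", "Boston", "Baltimore", "Austin"),
  joined = c("2014-03-01", "2013-11-15", "2014-07-04", "2012-01-20"),
  stringsAsFactors = FALSE
)

# Subset: keep only the rows where city is Baltimore.
baltimore <- survey[survey$city == "Baltimore", ]

# Dates: convert the character column to Date so it sorts chronologically.
survey$joined <- as.Date(survey$joined, format = "%Y-%m-%d")

# Sort: order the rows by join date, oldest first.
by_date <- survey[order(survey$joined), ]

# Edit text variables: lower-case the names and replace spaces with underscores.
clean_names <- gsub(" ", "_", tolower(survey$name))

# Plain-text search: grep() returns the indices of matching elements,
# while grepl() returns one TRUE/FALSE value per element.
b_rows <- grep("^B", survey$city)
is_b   <- grepl("^B", survey$city)

# Dates again: pull the year out of each Date as a character string.
join_year <- format(survey$joined, "%Y")
```

The difference between grep() and grepl() is worth remembering: the index form is convenient for picking out rows, while the logical form is easier to combine with other conditions when subsetting.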
If I have one beef with this class, it is not the class itself, but the course dependency chart. On the official chart, there are no courses that list Getting and Cleaning Data as even a soft dependency. Consequently, I took courses 4 and 5 (Exploratory Data Analysis and Reproducible Research) prior to taking this course. Cleaning the data sets for analysis was probably the most challenging part of those courses for me. In retrospect, I can see it is because I had not yet taken Getting and Cleaning Data. The material from week 3 and especially week 4 does come in handy in those later classes, so I would not recommend taking the route I did for future students of the Data Science Specialization.
Like many other classes in the sequence, this one had 4 quizzes and a peer-graded project due at the end of week 3. The project involved preparing a tidy data set, and it did require information from week 4 of the class, so if you are thinking of taking this class, plan ahead for the project.
Some students on the class forums have called it among the most challenging in the Data Science Specialization. Although I learned a lot and was challenged, I didn’t find it to be the toughest, but that could be because I had already taken 5 classes by the time I got to this one and had gotten through some of the hardest initial part of the learning curve.