Last week, I started an online course at Udacity to introduce myself to data science and try to understand what it takes to analyse big datasets.
About Udacity: I have successfully (and unsuccessfully, while I was writing my Ph.D. dissertation) taken courses at Coursera, and I stumbled upon Udacity through an iPad app I downloaded to view my Coursera lectures. Coursera offers courses designed and taught by university professors, whereas at Udacity many of the courses on programming and data science are taught by people working in the tech industry. So I was curious to see how differently they teach a class. I chose to pay for the course (although you can also take it for free), which provides a coach (i.e., a real person that Google Hangouts with you!), follow-up on your progress, feedback on the final project, and a verified certificate for my resume.
Starting this course, I had just taken an introduction to data analysis using R at Coursera.
About the course: Intro to Data Science is aimed at people who have some basics in statistics and programming. The programming language is Python, and we also use Pandas, a data analysis library that feels like a combination of R and Python. The course is divided into 5 main lessons, each accompanied by a project. In the final assignment, we will present, as a blog post, the results of the project developed during the course using datasets of the NYC subway and NYC weather.
Project #1: Intro to the Titanic survival dataset
After a few lecture videos introducing what data science is, I started to work on the first project. At first, we don’t start on the main project but use a dataset and problem from a Kaggle project. Although most Kaggle competitions are really intimidating, this project was created for people starting in data science. Using a dataset of passengers from the Titanic tragedy, we had to write a small program that predicts the survival of half of the passenger list. The survival outcome of the first half is provided, along with information about each passenger (age, gender, social class, how much they paid, etc.). With this information (and also intuition), it is possible to estimate which types of passengers were more likely to survive and then write a script that predicts the survival outcome of the second half of the list. Kaggle provides tutorials to solve this problem with Excel (I learned some neat stuff there), Python, Random Forest and R. In Udacity however, you have to use Python and the power of Pandas. Project #1 includes three exercises that walk you from a simple prediction script, based on one variable, to a more complex and customized approach that reflects how you want to think about the problem. So, do you think that a woman with 2 children in 3rd class of the Titanic had a chance to survive?
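To give an idea of what a one-variable prediction script can look like, here is a minimal sketch in Pandas. It is not the course's solution: the function names are my own, the rule (every woman survives, every man does not) is just one possible starting heuristic, and the four rows are made-up toy data that only borrow the column names of the Kaggle file (PassengerId, Sex, Pclass, Survived).

```python
import pandas as pd


def predict_by_gender(passengers: pd.DataFrame) -> pd.Series:
    """One-variable heuristic: predict 1 (survived) for women, 0 otherwise."""
    return (passengers["Sex"] == "female").astype(int)


def accuracy(predictions: pd.Series, passengers: pd.DataFrame) -> float:
    """Fraction of passengers whose predicted outcome matches the known one."""
    return float((predictions == passengers["Survived"]).mean())


# Toy rows invented for illustration; they are NOT taken from the real dataset.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Survived": [0, 1, 0, 0],
})

predictions = predict_by_gender(toy)
score = accuracy(predictions, toy)
print(score)
```

On these toy rows the heuristic is wrong for one passenger out of four (the woman in 3rd class), which is exactly the kind of case where adding a second variable such as Pclass might help.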
What I liked: Watching the videos of lesson 1, I really liked that some of them introduced real data scientists, who explained what their definition of data science was and why they decided to follow such a career path. In project #1, using a Kaggle competition dataset and problem was really nice, as it is not limited to the scope of the course. I learned ways to work with Excel, and the competition motivated me to try to reach the highest prediction accuracy possible.
What I liked less: For the same reason, the downside was that after doing the tutorials in Kaggle, I tried to obtain the best prediction right in the first exercise in Udacity, which made the next two exercises useless.