Why should biologists use GitHub?

[Screenshot from the GitHub repository]

Today, biologists use computers daily to produce and analyse data, and many of us now have to learn at least some programming.

I am not a computer scientist or engineer, and I am far from a bioinformatician (yet), but I have started to discipline myself to make my published research as reproducible as possible: not only by depositing DNA sequence data at NCBI, but also by making analysis pipelines and command lines publicly available on GitHub.

GitHub is a great public hosting service for publishing source code, but it can also be used to document your analysis pipeline and code, and even to create tutorials on software or pipelines for others.

For example, the Trinity tutorial from Brian Haas was a life saver for a biologist like me who had never touched any next-generation sequencing data.

As a biologist, to support my recent publication on scale insect phylogenetics, I created a GitHub repository that details all the steps and command lines of the MrBayes and R analyses. This provides transparency to the reader and makes the analyses quicker to reproduce.

Tip: If you are worried about your analysis pipeline or data being online during the manuscript review process, academic researchers can apply for five free private repositories. For more information, check here.

Nowadays, many biologists work in multidisciplinary environments, which implies learning new skills. In bioinformatics in particular, workshops are available, but the internet is a great resource for learning on our own, and GitHub can help both with learning how a piece of software works and with making the details of informatics methods available to other biologists who are also learning to use these tools (from command lines for de novo assembly with Trinity to making a simple plot in R).

Would you have survived on the Titanic? (Udacity Intro to Data Science)

Last week, I started an online course at Udacity to introduce myself to data science and try to understand what it takes to analyse big datasets.

About Udacity: I have successfully (and, while I was writing my Ph.D. dissertation, unsuccessfully) taken courses at Coursera, and I stumbled upon Udacity through an iPad app I had downloaded to view my Coursera lectures. Coursera offers courses designed and taught by university professors, whereas at Udacity many of the courses on programming and data science are taught by people working in the tech industry. So I was curious to see how differently they teach a class. I chose to pay for the course (although you can also take it for free), which provides a coach (i.e., a real person who Google Hangouts with you!), follow-up on your progress, feedback on the final project, and a verified certificate for my resume.

When I started this course, I had just taken an introduction to data analysis using R at Coursera.

About the course: Intro to Data Science is aimed at people who have some basics in statistics and programming. The programming language is Python, and we also use pandas, a data analysis library that feels like a combination of R and Python. The course is divided into five main lessons, each accompanied by a project. For the final assignment, we will communicate, as a blog post, the results of a project developed throughout the course using datasets of the NYC subway and NYC weather.
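To give a flavour of what "R-like" means here, this is roughly what a first pandas session looks like (a minimal sketch; the file name and column names are hypothetical placeholders, not taken from the course materials):

```python
import pandas as pd

# Load a CSV file into a DataFrame, pandas' central data structure,
# similar in spirit to an R data frame
df = pd.read_csv('subway_weather.csv')  # hypothetical file name

# Quick look at the data, much like head() and summary() in R
print(df.head())
print(df.describe())

# Filter rows and compute a grouped mean in one line
# ('rain', 'station', and 'entries' are hypothetical column names)
rainy_ridership = df[df['rain'] == 1].groupby('station')['entries'].mean()
print(rainy_ridership)
```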

Project #1: Intro to the Titanic survival dataset
After a few lecture videos introducing what data science is, I started to work on the first project. At first, we don't start on the main project but use a dataset and problem from a Kaggle competition. Although most Kaggle competitions are really intimidating, this one was created for people starting out in data science. Using a dataset of passengers from the Titanic tragedy, we had to write a small program that predicts the survival of half of the passenger list. The survival outcome of the first half is provided, along with information about each passenger (age, gender, social class, fare paid, etc.). With this information (and also intuition), it is possible to estimate which types of passengers were more likely to survive, and then write a script that predicts the survival outcome of the second half of the list. Kaggle provides tutorials for solving this problem with Excel (I learned some neat tricks there), Python, random forests, and R. At Udacity, however, you have to use Python and the power of pandas. Project #1 includes three exercises that walk you through making a prediction script, from a simple one based on a single variable to a more complex, customized model reflecting how you want to think about the problem. So, do you think that a woman with 2 children in 3rd class of the Titanic had a chance to survive?
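To make the single-variable starting point concrete, here is a minimal sketch of that kind of heuristic script (my own illustration, not the course's solution; it assumes the standard Kaggle Titanic file train.csv with its 'Sex' and 'Survived' columns):

```python
import pandas as pd

# Load the half of the passenger list whose outcome is known
df = pd.read_csv('train.csv')

# Simplest one-variable heuristic: predict that women survived
# and men did not ("women and children first")
predictions = (df['Sex'] == 'female').astype(int)

# Since this half of the list includes the true outcome,
# we can check how often the heuristic gets it right
accuracy = (predictions == df['Survived']).mean()
print('Accuracy of the gender-only heuristic: {:.1%}'.format(accuracy))
```

Refining the heuristic with further variables such as passenger class or age is essentially what the later, more customized exercises build towards.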

What I liked: Watching the videos of lesson 1, I really liked that some of them introduced real data scientists, their definitions of data science, and why they decided to follow such a career path. In project #1, using a Kaggle competition dataset and problem was really nice, as it is not limited to the scope of the course. I learned new ways to work with Excel, and the competition motivated me to chase the highest prediction accuracy possible.

What I liked less: For the same reason, the downside was that, after doing the Kaggle tutorials, I tried to obtain the best prediction right away in the first Udacity exercise, which made the next two exercises feel redundant.