Getting Started in Data Science: R vs Python

  Image – Wikimedia Commons

Introduction

Everyone who starts working with data in any serious capacity eventually asks “what language should I learn for data science?” The choice is often difficult, and there are a lot of strong opinions about which tools data science requires. In this post, we will discuss why you cannot just use Excel (the need for programming), compare Python and R, and explain how to choose which to learn.

Why Can’t I Just Use Excel?

As mentioned in the introduction, doing anything serious with data requires more than Microsoft Excel. Many “data analysts” use nothing but Excel. Normally, these analysts are not doing anything statistically meaningful: they may be making some graphs, calculating some summary statistics or visually scanning the data for patterns. Others do more interesting things with Excel (we often embed ML models in Excel workbooks), but Excel is not a very deep tool as far as analysis goes.

With Excel, you lack reproducibility, flexibility and more advanced techniques. Reproducibility refers to being able to rerun an analysis to confirm its results. In a workbook, the steps behind a result (manual edits, copied cells, filters) are usually not recorded, so it is hard for anyone else to recreate what you did. Recording the steps may be possible with VBA, but most people do not know VBA. In some business settings this may be OK, but it is a huge red flag in academic or trade settings where peer review matters.
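As a minimal sketch of what reproducibility looks like in code, the Python script below fixes a random seed, generates a small synthetic data set and computes summary statistics. The data, the seed value and the function names make_sample and summarize are assumptions made for illustration; the point is only that anyone who reruns the script gets the same numbers.

```python
import random
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def make_sample(seed, size=100):
    """Generate a synthetic sample; the fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return [rng.gauss(50, 10) for _ in range(size)]


# Hypothetical analysis: the seed and sample size are illustrative choices.
sample = make_sample(seed=42)
summary = summarize(sample)
print(summary)
```

Because every step lives in the script, a reviewer can rerun it, change one input and see exactly what changes in the result.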

When it comes to flexibility, you are stuck with what Excel offers in the form of functions and structure. Writing your own functions in Excel is possible, but most people do not know how to do it. Excel’s VBA offers quite a bit of flexibility, but it is another programming language that most people do not know, and it is not exactly straightforward. Excel can also make data visualizations, but handcrafting a graph to look the way you want is easier said than done, and the process then has to be repeated for every additional graph. Excel also slows down, if it works at all, with really large amounts of data; a single worksheet is limited to 1,048,576 rows.
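To illustrate the flexibility point, here is a small Python sketch of a user-defined function applied to several columns in one pass, something that would need VBA or repeated manual steps in a workbook. The column names, the values and the flag_outliers helper are hypothetical.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]


# Hypothetical data: each key plays the role of a spreadsheet column.
columns = {
    "revenue": [100, 102, 98, 101, 99, 103, 97, 250],
    "units": [10, 11, 9, 10, 12, 10, 11, 10],
}

# The same custom logic is reused for every column.
outliers = {name: flag_outliers(values) for name, values in columns.items()}
print(outliers)
```

Adding a new column to the check means adding one entry to the dictionary, not rebuilding a formula or a chart by hand.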

Last, Excel has many data analysis functions built in for hypothesis tests, ANOVA, Fourier analysis and so on, but it lacks many advanced data science techniques such as machine learning. When it comes down to it, Excel is not meant for really sophisticated data analysis. It is a spreadsheet tool for getting quick insights or answers out of data, not deep insights via pattern recognition. We need a way to analyze data with sophisticated methods in a reproducible and flexible way, and that means we need to program.

This is often the hardest pill to swallow for analysts who do not consider themselves programmers. Programming is a skill that anyone can learn. It is simply passing instructions to the computer, much as you would with a mouse or a keyboard. It does take time to master, but it increases the scope of what is possible. In data science, the two most important programming languages are R and Python.

Python

Python was first released in 1991 by Guido van Rossum and is named after the British comedy group Monty Python. It is an object-oriented, general-purpose programming language used in data science as well as in gaming, web development and application development. You can do almost anything with Python (RealPython). Many companies use it heavily for data science, and most people who end up using Python for data science come from the computer science school of thought. Python is an easy language to learn and can be rather efficient. The standard interpreter runs Python code on one core at a time, but many data science packages rely on compiled code and can use multiple CPUs, which makes them fast. There are multiple packages to support data science work: pandas is used for data manipulation and structuring, scikit-learn for machine learning, Keras and TensorFlow for deep learning, statsmodels for statistical modelling, and packages such as seaborn and Dash for data visualization and dashboards. Data scientists often use IDEs such as Spyder, Rodeo and PyCharm. Python is also open source, so cost is not an issue.

Python does have its disadvantages. Some of its packages are not as mathematically rigorous as packages developed in R, since many are developed by programmers, not statisticians, and that carries some risk. Finding the right package and assessing its functionality can also be frustrating, because almost everything in Python is done through an add-on package. For some data science work, simple functionality becomes a lot more difficult: creating a plot with matplotlib, for example, can take numerous lines of code to get the plot to an acceptable state. Understanding the nuances between lists, dictionaries, tuples and dataframes can also be frustrating when reshaping data. Python packages can at times have weak or non-existent documentation, whereas R has much more stringent standards for package documentation.
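The nuances between lists, dictionaries and tuples mentioned above are easier to see in a small example. The sketch below uses only built-in types and hypothetical data to convert row-oriented records (a list of tuples) into a list of dictionaries and then into a column-oriented dictionary of lists, which is one of the structures a pandas DataFrame can be built from.

```python
# Hypothetical records: each tuple is one row (name, language, years of experience).
rows = [
    ("Ana", "R", 3),
    ("Ben", "Python", 5),
    ("Cy", "Python", 1),
]
fields = ("name", "language", "years")

# List of tuples -> list of dictionaries (one dict per row, keyed by field name).
records = [dict(zip(fields, row)) for row in rows]

# List of dictionaries -> dictionary of lists (one list per column).
columns = {field: [record[field] for record in records] for field in fields}

# Column-oriented data makes per-column calculations straightforward.
average_years = sum(columns["years"]) / len(columns["years"])

print(records[0])
print(columns["language"])
print(average_years)
```

Tuples suit fixed, ordered rows, dictionaries give fields a name, and the dictionary of lists groups values by column; much of the reshaping frustration comes from moving between these layouts.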

R

R was developed as a successor to S, a language created by John Chambers at Bell Laboratories for statistical processing (Rproject). R was built by statisticians, for statisticians, which is both its strength and its weakness. R is popular in academic settings and settings closely bound to academia. Unlike Python, R offers a lot of strong functionality in its base packages, such as statistical and linear modelling, plotting, statistical computations, hypothesis testing and even some classification out of the box. Beyond that, R packages are vetted before being published on CRAN. Packages such as caret provide general-purpose machine learning, nnet and h2o provide neural networks and deep learning, ggplot2 is the gold standard of plotting, dplyr and data.table are amazing data manipulation packages, and shiny and plotly are used for interactive apps and plots. RStudio is the most widely used IDE for R and is one of the best IDEs to work in for data analysis. Like Python, R is open source and free, which means anyone can extend and add to it.

R, like Python, has its disadvantages. Since it was written by statisticians, programmers often find R’s syntax and paradigms difficult to learn and understand, although many people who are not programmers find it more intuitive. R runs on a single CPU core out of the box, which can make it slow on large workloads. Some packages let a user utilize more CPUs, and some, such as data.table, do so out of the box. R is also more difficult, though not impossible, to implement in an application or website flow; Python is much more easily integrated into a software architecture.

Should I Choose R or Python?

The answer is that it depends. For most people, I would recommend both. R has strong statistical theory behind it and makes statistical concepts easy to implement, while Python has the advantage of speed and integration capability with most software stacks. I often do data analysis in R and then build general-purpose programs for things like web scraping in Python. Deep learning is also better in Python, where libraries such as Keras and TensorFlow are primarily developed. If you do not want to learn both, learn R if academia or research is the goal, and learn Python if a job in a business setting is the goal. Either way, the methods and skills gained by learning either language cannot be replaced.

In conclusion, both languages have their pros and cons, and both are very good languages. Your decision should be driven by the arc you envision for your career. Both are open source and free, and both have a lot of resources to help someone get started and to support users. Remember, we offer training in both R and Python, which can be requested at Courses.