R as a second language: What I wish I'd known

When I first started learning R, I hated it. My first programming language was Python, so R’s lack of list comprehensions and f-strings seemed like major shortcomings. Having enjoyed learning Python, I had assumed learning my second language would be similar. Instead, I had three frustrating failed attempts to get started with R over several years before I finally got it.

I now enjoy programming in R, but it’s been a bumpy road. I’m writing this post for people who already know Python and would like to add R to their repertoire, in the hopes of preventing some of the mistakes I made.

One assumption of this post is that you are mostly interested in using R for data analysis. I imagine this is true of the majority of people using both R and Python. I’m also assuming that Python was your first language. This means that, like me, you may have formed generalisations about how programming languages work that will set you up for disappointment and confusion when you start learning R. If you have broader experience with different programming languages, this is less likely to be the case, but you may still find some of my advice useful.

The goal of this post is not to teach you R. Instead, I intend it as a supplement to the many excellent resources available (some are listed at the bottom of the post), in the hope that it helps others avoid some of the traps I fell into. While there are many resources available for learning R, many of them assume that R is your first programming language and so don’t highlight possible points of confusion for those coming from other languages.

I’ll begin by highlighting some reasons for learning R, and then give some broad advice for people who already know Python. I’ll then talk more about some specific differences between R and Python that can easily trip you up.

Why learn R?

You may ask, why learn R at all? Is there anything I can do in R that I can’t do in Python? Maybe not, but there are cases where R is the better tool for a specific task (and of course there are also cases where Python is the better choice). R was written with data analysis in mind, so tools for it are built into the language in a way that they aren’t in Python. This makes sense, since Python is used much more broadly.

For example, R has data frames as a basic type, whereas Python has nothing analogous. Python libraries like NumPy and Pandas of course give you similar functionality, but it isn’t a feature of Python itself. The same is true for statistics; you can do a linear regression in base R, whereas in Python you would first need to import a library such as scikit-learn. Maybe importing a package isn’t a big deal, but using a language that isn’t designed specifically for statistics or used as widely by statisticians can cause problems. For example, by default when you do logistic regression in scikit-learn, L2 regularisation is applied (here is a relevant Twitter thread). This is clear in the documentation, but it’s nonetheless a weird choice for the default that has very likely led a lot of people astray. The advantage of using a language written by statisticians and for statisticians is that you can trust the stats a bit more (though please do also read the documentation!).

If you’ve ever found Pandas, NumPy, or Matplotlib unintuitive, you may like the R alternatives better. The rough equivalent of Matplotlib in R is ggplot2, and it’s an amazing tool for visualising data. This isn’t to say ggplot2 is inherently easier or better than Matplotlib, but if Matplotlib isn’t working for you it’s something to try.

Some other things I particularly like about R:

RStudio: an incredibly good IDE that almost everyone programming in R uses
R Markdown: if you like Jupyter notebooks, you’ll love R Markdown
Shiny: allows you to easily create interactive web apps

How to approach learning R

Don’t try to write Python code in R

This will be obvious to anyone who already knows multiple programming languages, but may not be if R is your second language. Python and R are very different languages. If you take Python code and try to literally translate it into R, you will have a bad time. Some people even say that you shouldn’t use for loops in R, which may be taking things a bit too far. That said, I’ve certainly noticed that every time I’ve found myself writing nested for loops in R I’ve been doing something stupid. When you find yourself getting tangled in a web of curly braces, take this as R’s way of telling you you’re in trouble.

One thing that was very surprising to me when I started using R is that I hardly use some of the types and constructs that I use all the time in Python. For example, you may find that you rarely write functions. You may never write classes. This is fine - R is not Python. It’s intended to be used in a very different way.

Start with the tidyverse

The differences between R and Python have implications for how I think you should approach learning R. When you learn Python, you generally learn all the different types and what they are for, different kinds of control structures, and how to write functions and classes. When you’re comfortable with this, you then move on to libraries that are useful for data science like Pandas and Matplotlib. This is how I tried to approach learning R the first couple of tries, and each time I gave up. Although I learned how to write while loops and classes, I didn’t actually know how to do the kinds of data analysis tasks I wanted to do. It was only when I focused on learning so-called tidyverse R instead of just base R that I made progress.

The tidyverse is a collection of packages, largely authored by Hadley Wickham (that’s definitely a name you’ll want to know) and his collaborators, that make R more readable and accessible and minimise the amount of base R you need to deal with. The most useful package to begin with is dplyr, as this allows you to manipulate data in a very intuitive way. It may seem odd to bypass large parts of base R and skip straight to packages, but this is actually a very common approach. Base R will still be there when you need it. Hadley Wickham’s book R for Data Science (co-written with Garrett Grolemund) will get you started and is available for free online. It covers the bits of base R that are unavoidable, but moves quickly into how to do things like wrangle data and make plots.

Others have also suggested beginners start with tidyverse R (this is a great post by David Robinson on the subject), but I think this is particularly true for those coming from Python. Although the syntax is not similar, in the tidyverse readability is prioritised in a way that I think Python users will find appealing. Another reason to get comfortable with the tidyverse is that a lot of other packages are based on the tidy paradigm, so if you’re not familiar with the tidyverse you’ll be making things harder for yourself later on.

Learn to recognise problems other people have already solved

As a beginner, it’s very unlikely you’ll need to do something that no one else has ever done before. So before doing something like writing a function to determine the mean of a vector, ask yourself: “Is it likely someone else has done this before?” If so, it’s extremely likely that it’s either a part of base R or someone has written a package that handles it. It’s one thing to write your own functions as an exercise to understand how things are implemented or to practise writing functions, but you also don’t want to waste a lot of time. There’s really no glory in writing worse versions of functions that better programmers than you have already written. Focus your efforts and get better at identifying when a problem you have may already have been solved. In some form, this will almost always be the case and the hard part will be figuring out what to search for on Stack Overflow.
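
The same habit applies in Python, so here is a small sketch of it in a language you already know. The names values and my_mean are made up for this illustration; statistics.mean is part of Python’s standard library, and in R the base function mean() plays the same role.

```python
import statistics

values = [1, 2, 3, 4, 5]


def my_mean(numbers):
    """Hand-rolled mean (hypothetical helper): fine as an exercise, little else."""
    return sum(numbers) / len(numbers)


# The standard library already solves this problem.
builtin_mean = statistics.mean(values)

print(my_mean(values))
print(builtin_mean)
```

Both calls give the same value here; the difference is that you don’t have to write, test, or maintain the second one.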

Important differences between R and Python

I’m now going to give some specific examples of things that are quite different in R and Python and may cause you problems. This is by no means an exhaustive list, but may help you avoid at least some traps. The Tidynomicon also has a lot of useful comparisons between R and Python and is a great resource for learning R (especially for those coming from Python).

R is often used interactively

One of the most obvious differences between R and Python is that R is designed to be used interactively. That is, you can (and often will) write and run your code line by line. In Python it’s more common to write a script and then run it or to run a cell at a time in a Jupyter notebook. This means if you break something in the script or cell, the whole thing won’t work.

Running your code line by line makes a lot of sense for the things you’ll likely be doing in R. You might import a csv, take a quick look at the resulting data frame, notice that the date column is formatted weirdly (it always is), fix it, check it again, rearrange the data frame a bit, join it to another data frame, look at it, and then plot it. You’re unlikely to be able to anticipate all the things you’ll need to do before you’ve even looked at the file and seen its structure. It makes sense to do this step by step.

Where things can go wrong is if you change or delete something upstream in a file and don’t rerun it, as R will keep using whatever’s still in the global environment. Your code will appear to work until you restart R and the global environment is cleared, at which point it breaks or gives different results. If you keep getting unexpected results and suspect this is the issue, you can always restart R and run your code again from the top. I also suggest changing your settings so that your workspace is not saved when you close RStudio.

Vectors are everything

Whatever resource you use to learn R, it will almost certainly talk about vectors. Vectors are one of the basic types in R. When I first started, I understood vectors to be essentially an inferior version of Python lists. This was a huge mistake. Vectors are used very differently than lists are in Python and if you treat them similarly you will end up doing things inefficiently. I’ll show an example to demonstrate.

In Python, if you have two lists of numbers, how would you multiply them together to create a new list? Maybe something like this:

list_a = [1, 2, 3, 4, 5]
list_b = [6, 7, 8, 9, 10]

list_c = []
for i in range(len(list_a)):
    list_c.append(list_a[i] * list_b[i])

print(list_c)

## [6, 14, 24, 36, 50]

This works fine in Python, but this kind of approach is generally not a good idea in R. Growing a vector or list one element at a time can be very slow, because R may have to allocate new memory and copy the existing elements each time the vector grows. Instead, you can do this:

vector_a <- c(1, 2, 3, 4, 5)
vector_b <- c(6, 7, 8, 9, 10)

vector_c <- vector_a * vector_b

vector_c

## [1]  6 14 24 36 50

In general, vectorise where you can. Not only does this make things faster, but it also makes your code easier to read. The R Inferno has an excellent chapter on vectorising with lots of examples (and to balance it out, another chapter on the dangers of over-vectorising).

Becoming comfortable with vectors and vectorisation will actually help with your Python programming too. Sometimes a for loop is the right tool for the job, but learning to look for opportunities to vectorise is a good habit to have. In particular, vectorisation in NumPy and Pandas can greatly improve efficiency.
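
As a rough sketch of what that habit looks like back in Python, here is the earlier multiplication example rewritten without index bookkeeping or append(). This is plain Python, so the loop still runs inside the comprehension and it is not true vectorisation. Assuming NumPy is installed, np.array(list_a) * np.array(list_b) would give the elementwise product in one step, much like the R version.

```python
list_a = [1, 2, 3, 4, 5]
list_b = [6, 7, 8, 9, 10]

# Pair the elements up with zip() and multiply each pair.
list_c = [a * b for a, b in zip(list_a, list_b)]

print(list_c)  # [6, 14, 24, 36, 50]
```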

Functions with the same names in R and Python can do different things

In natural languages, a ‘false friend’ is a word that sounds similar to a word in your language but actually has a very different meaning (like ‘embarazada’ in Spanish, which doesn’t mean ‘embarrassed’ but ‘pregnant’). R has a lot of false friends, as there are some functions that look a lot like functions you already know from Python but actually behave differently. Save yourself the confusion and don’t assume things will be the same.

One example is range(), which in Python takes the arguments start, stop, and step to return a sequence of numbers spanning a given interval. In R, range() instead returns the minimum and maximum of the arguments.
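
To make the difference concrete, here is a short Python sketch (the variable names are only for illustration). The first part is Python’s own range(); the second mimics what R’s range() reports for the same kind of input. If you want a Python-style sequence in R, the closest match is seq(), for example seq(2, 10, by = 2), which, unlike Python, includes the end value.

```python
values = [3, 9, 1, 7]

# Python's range(): start, stop (exclusive), step.
evens = list(range(2, 11, 2))

# What R's range(values) would report: the smallest and largest value.
r_style_range = (min(values), max(values))

print(evens)          # [2, 4, 6, 8, 10]
print(r_style_range)  # (1, 9)
```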

Most of the other false friends I’ve encountered in R relate to indexing. Most notably, R uses one-based numbering whereas Python uses zero-based numbering.
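
One way to keep the two conventions straight is to translate between them explicitly. The sketch below uses a hypothetical helper, r_slice, that I’ve made up for this post: it takes R-style positions (counting from one, with both ends included) and returns the corresponding Python slice.

```python
def r_slice(items, first, last):
    """Return items first..last using R-style positions (hypothetical helper).

    R counts from 1 and includes both ends, so x[2:4] in R
    corresponds to x[1:4] in Python.
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return items[first - 1:last]


chars = ["a", "b", "c", "d", "e"]

first_char = chars[0]                 # R: chars[1]
middle_chars = r_slice(chars, 2, 4)   # R: chars[2:4]

print(first_char)    # a
print(middle_chars)  # ['b', 'c', 'd']
```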

Here’s another example: if you pull an element out of a list by index in Python, what you get is that element itself (hence type str and not list).

my_list = ['a', 'b', 'c', 'd']
print(type(my_list[0]))

## <class 'str'>

But in R:

my_list2 <- list("a", "b", "c", "d")
class(my_list2[1])

## [1] "list"

my_list2[1]

## [[1]]
## [1] "a"

It’s a list! What R is doing here is creating a list of length one that contains the element. This can create problems when you index a list as you would in Python and expect to get a string but instead get a list. When things aren’t working, one of the first things I check is that everything is the type I’m expecting it to be. Very often, something is unexpectedly a list.

If you want to get just the string “a”, you need to use double square brackets:

class(my_list2[[1]])

## [1] "character"

my_list2[[1]]

## [1] "a"
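
If it helps, there is a loose Python analogy for this: R’s single brackets behave a bit like a Python slice, which always hands back a list, while double brackets behave like ordinary Python indexing. The sketch below only illustrates that mental model and is not a claim about how R works internally.

```python
my_list = ["a", "b", "c", "d"]

element = my_list[0]     # like R's my_list2[[1]]: the element itself
sub_list = my_list[0:1]  # like R's my_list2[1]: a list of length one

print(type(element).__name__)   # str
print(type(sub_list).__name__)  # list
print(sub_list)                 # ['a']
```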

Here’s another false friend, again related to indexing. In Python, you can get the last element in a list using the index -1:

my_list3 = [1, 2, 3, 4, 5]
print(my_list3[-1])

## 5

But doing the same thing in R gives you the whole list except the first element:

my_list4 <- list(1, 2, 3, 4, 5)
my_list4[-1]

## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 5
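
In Python terms, a negative index in R means “drop this position” rather than “count from the end”. The sketch below shows the Python equivalents; drop_position is a hypothetical helper written for this post. Going the other way, one way to get the last element of a list in R is my_list4[[length(my_list4)]].

```python
my_list3 = [1, 2, 3, 4, 5]

last_element = my_list3[-1]   # Python: count from the end
without_first = my_list3[1:]  # what R's my_list4[-1] does: drop the first


def drop_position(items, position):
    """Return a copy without the element at one-based `position` (hypothetical helper)."""
    return items[:position - 1] + items[position:]


without_third = drop_position(my_list3, 3)  # R: my_list4[-3]

print(last_element)   # 5
print(without_first)  # [2, 3, 4, 5]
print(without_third)  # [1, 2, 4, 5]
```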

White space isn’t syntactically important, but is still important for readability and sanity

Unlike in Python, white space isn’t syntactically important in R. This isn’t a reason for your code to be a mess. It may never be quite as aesthetically pleasing as your Python code, but you should still try to make it easy for others (and your future self) to read. There are a number of style guides available (such as this one).

Now go and learn R!

Now that I’ve given you some tips on how to approach learning R, it’s time for you to go and actually learn. Like most things in data science, the key is to learn by doing.

A lot of books about R and R packages are available online for free. This makes learning new tools easier and is a big perk of R since you’re not relying solely on official documentation and whatever blog posts you can find to figure something out.

Here are some resources that I highly recommend:

R for Data Science by Hadley Wickham and Garrett Grolemund: This book will get you started wrangling and visualising data very quickly. It covers ‘just enough’ base R and focuses mostly on tidyverse packages. It’s available for free online, though I also like having a physical copy.
The Tidynomicon by Dhavide Aruliah and Greg Wilson: Aside from the excellent name, I recommend this book because it includes many comparisons between R and Python that are very helpful for those who are familiar with Python. With section names like “How do I filter data?”, it’s also very easy to find answers to specific questions.
aRrgh: a newcomer’s (angry) guide to R by Tim Smith (with Kevin Ushey): This is a brief guide to R that covers basic syntax and types. It’s written by another frustrated R learner and highlights some common gotchas.
This post about teaching R by Roger Peng: This goes into some of the history of R and the motivations for developing it. This has helped me enormously to understand why R is the way it is.
There’s a great R community on Twitter that is more geared towards data science than the broader Python community is, which I find really nice. People are very helpful and supportive. Try #rstats, #r4ds, and #TidyTuesday for a start.