Starting data analysis with R: Things I wish I'd been told

scishop · on Nov 24, 2014

Things that are likely to trip up new R programmers:

* Function arguments are always passed by value. Objects are copied if they are modified in a function.

* Function arguments are lazily evaluated.

* Watch out for automatic factor conversion when importing data. R will display your string data as text, but behind the scenes it will store it as integer codes.

* R is slow. Really, really slow. All your intensive calculations should be handled by libraries written in C, Fortran or some other compiled language. Your R code should be mostly for glueing things together.
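A minimal base R sketch of the first three points, in case it helps to see them run. The function and variable names are made up for illustration. Note also that since R 4.0.0 `read.csv()` and `data.frame()` no longer convert strings to factors by default, so the factor surprise mostly bites on older versions or when `stringsAsFactors = TRUE` is set.

```r
# 1. Arguments behave as if passed by value: modifying x inside the
#    function changes a local copy, not the caller's vector.
set_first <- function(x) {
  x[1] <- 99
  x
}
v <- c(1, 2, 3)
w <- set_first(v)   # v is unchanged, w holds the modified copy

# 2. Arguments are evaluated lazily: b is never used in the body,
#    so the stop() call passed for it is never evaluated.
ignore_second <- function(a, b) a * 2
lazy_result <- ignore_second(21, stop("never evaluated"))

# 3. Factors print as text but are stored as integer codes.
f <- factor(c("10", "20", "10"))
codes  <- as.integer(f)                 # level codes, not the numbers
values <- as.numeric(as.character(f))   # the numbers actually intended
```

The last two lines are the classic trap: `as.integer()` on a factor returns the level codes, so going through `as.character()` first is the usual way to get the original numbers back.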

bachmeier · on Nov 24, 2014

In my experience teaching R, the single biggest issue is that it is a dynamic language that is in love with silent casting [or really returning objects with unexpected types]. Although other bugs might be more common, none will frustrate the user more than the type of a variable changing without warning, and most of the time the error message refers to something else.

A couple of examples are stripping the time series attributes of a ts object and default conversion of a row of a matrix to a vector. These are the cases where they come to my office and say they have no idea what's going on. After using R for a decade, I know the language well enough that these are about the only errors I get.
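A small base R sketch of those two cases, with made-up data, assuming nothing beyond the `stats` package that ships with R:

```r
# Selecting one row of a matrix silently returns a plain vector
m <- matrix(1:6, nrow = 2)
row_vec <- m[1, ]                 # no dim attribute any more
row_mat <- m[1, , drop = FALSE]   # stays a 1 x 3 matrix

# Subsetting a ts object with "[" strips the time series attributes
x <- ts(c(5, 7, 9, 11), start = 2010)
x_sub <- x[2:3]                                # plain numeric vector
x_win <- window(x, start = 2011, end = 2012)   # still a ts object
```

Passing `drop = FALSE` and using `window()` are the usual ways to keep the type you started with.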

jghn · on Nov 24, 2014

1) can always use environments! :)

Okay, I'll stop at 1).

Having used R for over 15 years now, the one "feature" which still makes me shake my head is the partial matching of function argument names. If function foo has an argument 'blah' and you say foo(bl=5), it'll autocomplete that to 'blah' if there's nothing better to use.

The number of bugs I've found due to that ....
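A minimal sketch of the behaviour, using a hypothetical `foo` with a second argument added for illustration:

```r
foo <- function(blah = 1, other = 2) blah + other

full    <- foo(blah = 5)
partial <- foo(bl = 5)   # "bl" is silently completed to "blah"

# The same partial matching applies to `$` on lists
cfg <- list(threshold = 0.5)
short <- cfg$thr         # picks up cfg$threshold
```

Setting `options(warnPartialMatchArgs = TRUE, warnPartialMatchDollar = TRUE)` makes R warn when this happens, which can help when hunting this kind of bug.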

stdbrouw · on Nov 24, 2014

Also, read "R language for programmers" http://www.johndcook.com/blog/r_language_for_programmers/

It explains a lot of the quirks.

hadley · on Nov 24, 2014

Or try "Advanced R", http://adv-r.had.co.nz. It's my attempt to explain the language in such a way that programmers from other languages can appreciate the beauty and elegance behind the quirkiness.

earino · on Nov 24, 2014

As a programmer, I can enthusiastically endorse this book. R "as a language" is hard to grok when you come to it with the mentality of "a programmer" due to its history as an interactive environment for data analysis, and an interface to algorithms first, and a programming environment second.

Hadley's book is a solid introduction when you are trying to map concepts from other, more traditional, programming environments to the R way of thinking.

jghn · on Nov 24, 2014

R "as a language" is much easier if you don't come in with the standard C-ish language blinders on.

earino · on Nov 24, 2014

There was a great keynote at this last useR! conference by John Chambers about how R's predecessor, S, was designed as an interface language (https://www.youtube.com/watch?v=_hcpuRB5nGs). He discusses how, as part of its fundamental cognitive model, it was there to provide a facile interactive "interface" to best-of-breed algorithms. Very few other programming languages have that pedigree; they were mostly designed to architect systems.

spo81rty · on Nov 24, 2014

I just recently started playing with R via Azure's new Machine Learning service. From what I have seen I am really impressed. It helps with the input and outputs of the data and makes it easy to turn your process into a web service that can be easily consumed. I had planned on setting up some Linux boxes to host R, but now I can just use Azure ML and not have to jack with it.

leeber · on Nov 23, 2014

He forgot the last one: "to use python instead"

That was my attempt at a joke. But seriously, I love python when it comes to manipulating data and doing anything statistical. You've got numpy, scipy, scikit-learn, etc.

Aqwis · on Nov 23, 2014

As a student of statistics, I'm kind of split on R. On one hand, it's just not a very well-designed language. The fact that it has three (!) independent object systems is a testament to this. On the other hand, as vegabook also mentions, working with vectors and matrices is just a lot more natural in R than in general-purpose languages like Python, because R's syntax has been built from the ground up to work with the kind of structures you usually work with in science.
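For anyone who hasn't run into them, the three object systems that ship with R are S3, S4 and reference classes. A rough sketch of the same toy "circle" in each; all class and function names here are invented for illustration:

```r
library(methods)

# S3: informal classes; a generic dispatches on the class attribute
describe <- function(x, ...) UseMethod("describe")
describe.circle <- function(x, ...) paste("S3 circle, radius", x$r)
s3_obj <- structure(list(r = 1), class = "circle")

# S4: formal classes with declared slots and registered methods
setClass("Circle4", representation(r = "numeric"))
setGeneric("describe4", function(x) standardGeneric("describe4"))
setMethod("describe4", "Circle4", function(x) paste("S4 circle, radius", x@r))
s4_obj <- new("Circle4", r = 2)

# Reference classes: mutable objects whose methods live on the object
CircleRC <- setRefClass("CircleRC",
  fields = list(r = "numeric"),
  methods = list(
    describe = function() paste("RC circle, radius", r),
    grow = function(by) {
      r <<- r + by        # modifies the object in place
      invisible(.self)
    }
  ))
rc_obj <- CircleRC$new(r = 3)
rc_obj$grow(1)
```

Only the reference class object is modified in place; S3 and S4 objects follow the usual copy-on-modify semantics.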

I'm hoping Julia might become a good alternative to R and Python, but I can't see it catching on in the statistical community anytime soon given how many people are still using relics like SAS and Stata. The raw fact is that statisticians (considered as a group) just aren't very good at programming (and many older statisticians can't program at all), which means that a well-designed programming language may not necessarily be easy to use for a member of the statistical community used to point-and-click statistics suites.

hadley · on Nov 24, 2014

I think R is a well-designed language. It definitely has its quirks (what language doesn't?), but by-and-large they are problems with the standard library, not the language. This is admittedly a subtle distinction, but it's much easier to fix problems with the standard library than it is with the language.

Three aspects of the language that make R particularly well suited for statistical programming are:

1) Missing values built in at a fundamental level.

2) Metaprogramming capabilities. The best way to solve many categories of data analysis problems is to design a domain specific language which allows you to easily combine independent pieces. R's incredible flexibility is great for this.

3) Fundamentally vectorised and functional. This allows you to elegantly express many common data analysis tasks.
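A tiny base R sketch of each of the three points, with made-up data, just to show the flavour rather than anything definitive:

```r
# 1) Missing values are part of every atomic type and propagate by default
x <- c(2, NA, 6)
m_na <- mean(x)                 # NA, because one value is missing
m_ok <- mean(x, na.rm = TRUE)   # mean of the observed values

# 2) Metaprogramming: capture an expression, then evaluate it
#    using a data frame's columns as variables
df   <- data.frame(a = 1:3, b = c(10, 20, 30))
expr <- quote(a * b)
res  <- eval(expr, df)

# 3) Vectorised and functional: no explicit loops
sq    <- (1:5)^2
evens <- Filter(function(v) v %% 2 == 0, sq)
```

The `quote()` / `eval()` pair in step 2 is the basic mechanism behind formula interfaces and data-frame DSLs.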

jph00 · on Nov 24, 2014

Could you describe what facilities of R help with metaprogramming and make it good for designing DSLs?

hadley · on Nov 24, 2014

http://adv-r.had.co.nz/dsl.html and the two prior chapters

Gatsky · on Nov 23, 2014

How do you feel about reproducible computing in python? R is very well set up to A) get it running on any platform easily B) report the crucial parts of the environment. I know that if I grab someone else's (published) code written in R, I'm pretty confident I can make it work. Part of this is the great package management through CRAN or Bioconductor, and also because often important reference data for bioinformatics is actually available through the package manager.

I haven't done much with Python, but I don't quite get the same feeling (happy to be told that the reality is otherwise!). For example, the opening line of the installation guide for Pandas doesn't inspire great confidence in me: "The easiest way for the majority of users to install pandas is to install it as part of the Anaconda distribution, a cross platform distribution for data analysis and scientific computing."[1] Do I really need to install the HDF5 package so I can split a concatenated variable into two columns??

[1] http://pandas.pydata.org/pandas-docs/stable/install.html

jghn · on Nov 24, 2014

The thing w/ reproducible research (I was an early BioC core member and have worked directly w/ its RR advocates) is that it requires having an exact set of R and packages. I know that BioC tries to do this (I wrote the original BioC package download script) but weird things can still happen. A few years ago I was tracing down a bug in some computational biologist's code that really traced down to some wacky version of a particular package which might be downloaded in the right circumstances.

In a previous life what we did was for every project you'd download a snapshot of an R environment, including all packages. That, and only that, was used for all computation for everything involving that project from start to finish. If Docker had been around at the time, that's what we'd have used.

Gatsky · on Nov 24, 2014

Thanks for your work with BioC, it's fantastic. I use it a lot in my cancer genomics research. Part of that involves providing a service to patients living with cancer, so your work is definitely out there having an impact!

jghn · on Nov 24, 2014

Thanks but I haven't been a contributor for a decade, I just had a hand in the early days. I agree though that it's a phenomenal suite for the bioinformatics world and an exemplar of proper R techniques

vegabook · on Nov 24, 2014

Python's pip is pretty good though not quite as polished as CRAN. I have had few problems running complex code from third party sources, though one always has to be aware of the Python 2 v 3 "problem" (though it is diminishing now with most things available on 3). If you get pip up and running on a new Python installation you can avoid Anaconda/Canopy if you want a clean installation, and I have installed fairly complex Python setups in multiple locations without too much trouble. Let's be fair, R can also be tough if it calls a lot of third party libraries. Just try to get rJava working properly for example if the local R and Java installations are not both 32 or 64 bit. It can be a complete mess to disentangle this sort of stuff in R. Or for example running code that uses Cairo, on a mac. My experience is that Python's poor package management reputation is not really deserved anymore. Python's virtualenv also allows you hermetically to seal away an entire python environment, including its libraries, so that it will not conflict with other python environments that might have different versions of the interpreter and/or libraries. I am not aware of anything this robust in R.

Reproducible computing? The ipython notebook is awesome, though I am not sure if there is anything as good as knitr if your workflow is LaTeX oriented.

R "hands" will usually find Python a backward step when it comes to vectorized data manipulation, but it's a forward leap if your data becomes too big or if you have to step out of the comfy environment of exploratory analysis into any form of (even trivial) production settings.

And no you definitely do not need HDF5 to effectively use Pandas.

hadley · on Nov 24, 2014

The closest equivalent to virtualenv for R is packrat: http://rstudio.github.io/packrat/. It doesn't (yet) support different R versions for different projects, but that's on the roadmap.

mrgordon · on Nov 24, 2014

Yeah packrat is great! It is a really important package which has greatly increased my willingness to use R in production.

Gatsky · on Nov 24, 2014

Ok that's good to know. Sure, R breaks inexplicably sometimes due to dependencies, no doubt about that.

virtualenv sounds useful. Is it used much when python code is published in a paper?

About HDF5: I was just making the point that the Pandas docs recommend I install Anaconda to get Pandas, thus also installing HDF5. I am sure there are other ways, but the way the documentation is phrased suggests that these other ways are overly difficult.

eevilspock · on Nov 24, 2014

I'm just learning Python to do some data and graph analysis experiments. Should I go with Python 2 or 3?

vegabook · on Nov 24, 2014

You are strongly encouraged by the Python powers that be to move to 3, and I have only in the past few months begun to agree with them, and that is because some serious standard libraries like asyncio are now only available on 3. It's (finally) the future. However a big caveat is that if you're learning Python, most of the sample code you will find on the web will be 2-based and will not work well under 3. It's not so much the print statement, but range() works subtly differently too now (it returns a lazy range object, not a list - too subtle for beginners to properly understand in my view) and unicode strings can break older code too. Just be aware of these things and move to 3 is my (51/49) advice, but this is a controversial point and others will have differing points of view.

deskamess · on Nov 24, 2014

I find knitr easy to use. The way it generates graphs and can output to pdf/html is really useful and is reproducible and easily shared. While essentially just markdown + R code, the code can point to data sets instead of having them embedded. It has a good set of graphing libraries (ggplot2, etc) too. I can see how this could be the killer app that gets social science research papers written and produced in knitr. I always thought IPython would take this crown but R/knitr is looking good. Have not used Shiny yet.

Edit: knitr, not rdoc

undergrowth54 · on Nov 24, 2014

You don't have to install the entirety of anaconda. You can install miniconda (from here: http://conda.pydata.org/miniconda.html) and then do `conda install $package_name` or, if like me you like to create separate environments for separate projects... `conda create --name $environment_name python; source activate $environment_name; conda install $package_name`

disclosure: I work on miniconda. I'm currently working on improving our developer experience. Complaints are welcome.

vegabook · on Nov 23, 2014

Yes, I have moved (back) to Python mainly because R is too slow when we get beyond a certain data size and the language is not powerful enough when data starts having to be moved around at scale. I have a 5-10 times speed improvement in native Python and another 30x more if I can vectorize things in Numpy. However a huge caveat is that R is much more succinct when it comes to exploratory analysis during what I call the "data rotation" phase because its vectorized nature is so much more efficient at selecting, reducing, cleaning and rotating data, than even Pandas can manage. It's irritating having to write list comprehensions constantly for what would often have been a ridiculously direct and efficient vectorized command in R. Moreover R's graphics leave matplotlib in the dust, though this advantage is eroding with the JS libraries taking over.
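To give a flavour of the vectorized style being described, here is a small base R sketch with made-up data. It only illustrates the R side and makes no claim about how long the equivalent Python would be:

```r
scores <- c(3, NA, 8, 5, 10, 1)
group  <- c("a", "a", "b", "b", "b", "a")

scores[is.na(scores)] <- 0                 # clean: replace missing values
high     <- scores[scores > 4]             # select: logical-mask indexing
by_group <- tapply(scores, group, mean)    # reduce: mean per group
```

Each step operates on the whole vector at once, with no explicit loop or comprehension.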

The other area where Python crushes R is if your data is live streaming. Here you inevitably need a full fledged programming language with proper asynchronous io capabilities and multithreading / multiprocessing that is not batch oriented.

infinite8s · on Nov 24, 2014

Can you give an example of a somewhat complicated vectorized command in R that would require lots of list comps in Python?

consz · on Nov 24, 2014

Totally agreed. I do model analysis on data sets with 200k-5m rows and anywhere from 500 to 20k columns. I originally started doing my work in R, but about two years ago, python started improving rapidly for heavy data analysis, and at the moment I'd say it's a clear winner.

IndianAstronaut · on Nov 24, 2014

For that kind of data or larger, I would avoid R and Python and move to writing my own algorithms or try out something for more heavy duty analysis such as Mahout or Spark. R and Python are still one box and memory constrained.

tacos · on Nov 23, 2014

I know we don't reward snarky humor 'round these parts, but I was about to say the same thing. Python seems to own this space and the ecosystem around Python and math/stats/analysis is exceptionally healthy. If there's a specific place where R kicks ass please speak up -- it's fallen off my radar.

hadley · on Nov 24, 2014

There are three areas where I think R is the clear winner:

1) An IDE for data analysis/programming: RStudio

2) Easy way to turn your analyses into reports: knitr

3) Easy way to turn your analyses into interactive webapps: shiny

(I also think R wins on visualisation and data manipulation, but I'm biased ;)

peatmoss · on Nov 24, 2014

R absolutely wins on visualization and data manipulation. I'll spare you the immodesty :-)

dthal · on Nov 24, 2014

I use both Python and R a fair bit. As a language, absolutely I prefer Python to R. However, I think there are two areas where R is better than Python and together, I think they add up to a durable advantage, at least for stats people. 1) Package support. Yes, Pandas and scikit-learn are good, but R still has a definite edge here. Here are three things I've needed lately where R has hands-down better code available: forecasting, frequent itemset mining, and network community detection. 2) Non-programming uses. There are a lot of tasks where you need a computer, but just to do one thing, a plot, calculate a statistic, ... stuff like that. R is better in that use case.

stdbrouw · on Nov 23, 2014

R is in some ways more forgiving to newcomers. Sure, there's all sorts of weirdness around how vectors and matrices work, and don't get me started on the cryptic function naming, but (1) almost all batteries are included -- hardly ever a need to hunt around for packages, (2) RStudio is really nice, with graphics, a shell, a text editor, documentation etc. all in one place, (3) it's mature and well-tested.

I prefer Python myself, but after spending a couple of months with R I do understand why people like it.

(OTOH I'll be a happy person if I never ever have to work with SAS ever again.)

srean · on Nov 24, 2014

> R is in some ways more forgiving to newcomers.

Oops! sorry sorry,... really sorry, apologies for snorting coffee over you, but given multiple years of experience TA'ing for machine learning / data mining courses I couldn't disagree more. R had them in absolute knots, and yeah they were asked to use RStudio if that helped. They struggled with simple things such as writing a naive Bayes classifier. Most of their mistakes were because of R's weird and silent inconsistencies: scalar or vector, copy or reference.

It is possible that all these 30 odd students every year were stupid but chances are fairly low.

EDIT:

The course has since switched to Java (Knime) and Python and that has gone a whole lot smoother.

Neither Java nor Python is my favorite language, but I have to concede that Python is massively more consistent than R, so a student has to remember fewer special cases, and the whipping boy of dearth of packages seemed less real at least in the context of the course. At least in the academic setting enthought / canopy / anaconda does a marvelous job of it.

stdbrouw · on Nov 24, 2014

I said more forgiving. It's certainly not a forgiving language or ecosystem in absolute terms, you're right on the mark there. But ultimately you have to pick your poison. Do you want to struggle with all of the various quirks of R or do you want to struggle with all of the various quirks of (data analysis in) Python?

sytelus · on Nov 24, 2014

R is very much an acquired taste - or rather you struggle through until you can memorize all the gotchas. For people who are not doing data analysis as a full time and perhaps only job, remembering the intricacies of R is going to be a deal breaker. Personally I would shift towards Python unless there are exclusive packages in R that you have to have - and that's becoming rarer by the day. Combined with iPython Notebook and the recent wave of migrations of packages to Python, I see little need to deal with R on a regular basis.

micheljansen · on Nov 24, 2014

Thanks for introducing me to iPython Notebook. I had never heard of this, but it looks very promising. The browser-based interface does not look as polished as R's studio application and I'm a bit confused why this functionality is not in the Qt console, but I'll check it out nonetheless.

undergrowth54 · on Nov 24, 2014

Much of Python's power for data science comes from the scipy/numpy libraries and the tools built on top of them. Unfortunately, installing these requires first installing a bunch of Fortran libraries. Fortunately, there is a free distribution of Python that comes with them and comes with a package manager you can use to install Python libraries that have C extensions. You can also install R packages and use them from iPython notebook. https://store.continuum.io/cshop/anaconda/

greenleafjacob · on Nov 24, 2014

Something I've found useful is the sqldf package [0], which lets you use data frames as tables in SQL with all the power of joins and so on.

[0] https://code.google.com/p/sqldf/

hadley · on Nov 24, 2014

NB: sqldf works by copying your data frame into a temporary SQLite database, running the query and then copying the data back. So it's not that fast.

Instead, learn a native R package like dplyr or data.table that supports all the power of SQL, and is v. v. fast.

ggrothendieck · on Nov 24, 2014

(1) sqldf works with H2, MySQL and PostgreSQL too - it's not limited to SQLite. (2) Most alternatives to SQL available in R don't work well with complex multi-way joins. You wind up materializing the full outer join first which can be quite problematic. (3) sqldf is fast enough and R is slow enough that quite often sqldf is faster than base R so if R is fast enough for you then it's likely that sqldf is too. (4) The speed is often the last thing you need to worry about. The ability to express a query in a familiar way (if you know SQL) is often a more important consideration than having to learn a new system.

Gatsky · on Nov 24, 2014

+1 for dplyr. Really helpful, responsive community around it as well.

psychometry · on Nov 24, 2014

Strange that someone who knows about plyr (and is clearly not a beginning R user) is still using "for" loops.

houshuang · on Nov 24, 2014

There are lots of ways of doing things in R, and I am always learning. I always use vectorized functions when I do something to every row in a data.frame, but when I want to do something to every column, like here, I sometimes use for. How would you use something like plyr here, and would it be as easy to understand?

  library(plyr)  # provides revalue()

  likertcat <- c("1"="Not at all", "2"="To a small extent", "3"="To some extent",
    "4"="To a moderate extent", "5"="To a large extent")
  
  # Relabel each Likert column, then make it an ordered factor
  for(e in names(db)[9:44]) {
    db[[e]] <- revalue(db[[e]], likertcat)
    db[[e]] <- ordered(db[[e]], levels = unname(likertcat))
  }

nograpes · on Nov 24, 2014

Like this:

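  db <- data.frame(replicate(44, 1:5))
  # Set the levels explicitly; as.ordered() alone would sort them alphabetically
  labels <- c("Not at all", "To a small extent", "To some extent",
              "To a moderate extent", "To a large extent")
  likertcat <- factor(labels, levels = labels, ordered = TRUE)
  # Index the ordered factor by each column's integer codes (1-5)
  db[, 9:44] <- lapply(db[, 9:44], function(x) likertcat[x])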
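One thing worth checking with any ordered factor built from labels is the level order, because `as.ordered()` on a character vector sorts the levels alphabetically rather than keeping the order they were written in. A quick base R sketch using the same labels:

```r
labels <- c("Not at all", "To a small extent", "To some extent",
            "To a moderate extent", "To a large extent")

alpha    <- as.ordered(labels)                               # levels sorted alphabetically
explicit <- factor(labels, levels = labels, ordered = TRUE)  # levels in scale order

levels(alpha)
levels(explicit)

# With explicit levels, comparisons follow the intended scale
small_lt_large <- explicit[2] < explicit[5]
```

Comparing `levels(alpha)` with `levels(explicit)` shows the difference; only the second makes "To a small extent" rank below "To a large extent".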

houshuang · on Nov 24, 2014

Thank you. As said, there are always many ways of doing things in R.

SixSigma · on Nov 24, 2014

As an aside, that is not a valid Likert set. It is not balanced around a central noncommittal value.