This alone is reason to upgrade -> "R now uses a stringsAsFactors = FALSE defaul...

amrrs · on April 24, 2020

This is one primary reason I started using read_csv from readr rather than read.csv from base R. Every time you teach someone to read a CSV using base R, this comes up somewhere in the middle of an analysis because those strings were read as factors.

crispyambulance · on April 24, 2020

Yes, factors absolutely can be a pain in the ass when they're instituted too early in an analysis. It's better to keep them as strings for as long as possible and only convert to factor when you've cleaned up your data. Otherwise, you have to deal with annoying and confusing factor manipulation.

The drawback is that character takes up so much space, but these days memory is so bountiful it usually doesn't matter.
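A minimal sketch of the "clean first, convert later" workflow described above, using only base R. The `city` values are made up for illustration:

```r
# Hypothetical messy input: same cities with inconsistent case and whitespace.
city <- c("Boston", "boston ", "Chicago", "chicago")

# Converting too early bakes the inconsistencies in as separate levels.
nlevels(factor(city))  # 4

# Clean while the data is still character, then convert once.
city_clean <- tolower(trimws(city))
city_factor <- factor(city_clean)
levels(city_factor)  # "boston" "chicago"
```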

dash2 · on April 24, 2020

Nowadays strings are stored in a "string pool" anyway, so if you have a string that can be turned into a factor (i.e. with few unique variants), you probably don't need to.

s1t5 · on April 24, 2020

But readr gives you a tibble instead of a plain data.frame and that adds a bunch of other headaches.

crispyambulance · on April 24, 2020

How so? I can't think of any drawbacks of tibbles vs data.frames?

_fnhr · on April 24, 2020

Using tibbles outside of tidyverse can be dangerous.

    library(tibble)
    df_iris <- iris
    tb_iris <- tibble(iris)

    nunique <- function(x, colname) length(unique(x[,colname]))

    nunique(df_iris, "Species")
    > 3

    nunique(tb_iris, "Species")
    > 1

Imagine now using some complex function from a repository (i.e. BioConductor) that works on data.frames and passing a tibble to it.

notafraudster · on April 24, 2020

Post edit: I'm preserving the post below, which I think highlights some issues with the parent's example, but it turns out he is correct on another issue -- narrowly, in his example, the `[` subset operator dispatches differently on tibbles than on data.frames, so you can produce weird behaviour. So to anyone reading, please consider upvoting the parent and reading the rest of the thread.

Original post follows:

Right off the bat, the problem is not "using" tibbles, it's that you've incorrectly constructed one by passing the data through the tibble() constructor rather than using as_tibble(). The tibble constructor -- for pretty good reasons in other circumstances that seem crazy to you here because of your intent -- infers that you want the entire data frame to be a single column inside the tibble, called "iris". It does this because it evaluates the variable name passed to the tibble constructor as both the intended column name and the data to be placed inside the column. This demonstrates nesting, which is one of the great features of tibbles and otherwise used for a bunch of stuff.

If you had done `tb_iris <- as_tibble(iris)`, it would have worked fine. `as_tibble()` is the function to convert an existing data structure to a tibble. R is obviously not "type safe" in any way, but you can engage in defensive programming, and one way you can do that is being hyper-aware of the steps you take during type conversions. If you check the documentation for `tibble()`, it tells you explicitly to "Use as_tibble() to turn an existing object into a tibble." Is there a reason you didn't? Imagine this related example:

  my_string <- "10"
  numeric(my_string)
  as.numeric(my_string)

Would we conclude that "using the numeric type can be dangerous" because the constructor interpreted the argument differently than the conversion helper?

Second, I suspect you must be using extremely old versions of things, because on more recent versions, your nunique function would fail, not produce 1. I correctly get "Error: Can't find column `Species` in `.data`." This error message is maybe a little confusing if you don't check the structure `str(tb_iris)` of tb_iris to see what I mentioned above, but is the correct error to output in light of it. You'd also be able to flag this by just checking `colnames(tb_iris)` or `View(tb_iris)` if you're working in RStudio or using the embedded environment pane or really any other way of looking at the data.

But your broader point is also false. Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes. The only thing that makes a tibble different than a data.frame is that it has an additional class label. All dispatches that work on data.frame objects work on tibbles because of how multiple classing works in R. This has been a goal since the beginning of tibble. The one exception I'm aware of is external functions that incorrectly check `if(class(obj) == "data.frame")` instead of using `is.data.frame()` or `if("data.frame" %in% class(obj))`. The former is and always has been incorrect because of how multiple dispatch is designed to work in R and should generate an error with multi-classed objects because the if statement evaluates to a vector of logicals instead of a logical.
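The class-check point can be sketched in base R without tibble at all. The class name `my_tbl` below is a hypothetical stand-in for any extra S3 class layered on top of `data.frame`:

```r
# A data frame with an additional (hypothetical) class label in front.
obj <- data.frame(x = 1:3)
class(obj) <- c("my_tbl", "data.frame")

# Comparing class() directly yields one logical per class label,
# which is not a usable condition for if().
class(obj) == "data.frame"   # FALSE TRUE

# These checks look at the whole class vector and return a single logical.
inherits(obj, "data.frame")  # TRUE
is.data.frame(obj)           # TRUE
```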

One way you can tell that tibbles and data frames are identical, save the above caveat, is to run the following code:

  library(tibble)
  df_iris <- iris
  tb_iris <- as_tibble(iris)
  identical(df_iris, tb_iris)
  class(tb_iris) <- "data.frame"
  identical(df_iris, tb_iris)

Note that you are not "downconverting" a tibble into a data.frame in this code (but that would work too) -- you are taking the tibble exactly as is and hacking its class label to look like a data frame. It's identical because a tibble was always a data frame.

_fnhr · on April 24, 2020

I think everything you wrote here is false, so I am not sure how to reply. Will try to keep it respectful and short:

First, about the as_tibble - it returns the same thing as tibble:

    tb_iris <- as_tibble(iris)
    length(unique(tb_iris[,"Species"]))
    > 1

Second, about the incorrect version:

    > packageVersion("tibble")
    [1] ‘3.0.1’

Which is also the current version on CRAN.

Third, about the classes:

You say:

> Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes.

This is not the case. You can add any class to any object in R S3 system. So people behind tibble can call their tibble a data.frame but it gives no guarantee that it will behave like one.

More about this problem here (and you can also find replies from tidyverse authors) https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...

notafraudster · on April 24, 2020

Actually your reply was very helpful because it surfaced ways in which you were partially right and I was partially wrong.

I highlighted the nesting issue in constructing versus coercing (which is correct and does have implications for what you're trying to do) but actually in your example the distinction is broken because of a different edge case

Which is to say the following:

  ncol(iris) # 5
  ncol(as_tibble(iris)) # 5
  ncol(tibble(iris)) # 1

  iris$Species # Works
  as_tibble(iris)$Species # Works
  tibble(iris)$Species # Errors because of nesting

  iris[, "Species"] # Works
  tibble(iris)[, "Species"] # Doesn't work
  as_tibble(iris)[, "Species"] # Works

However, you're correct that because the subset operator for tibble doesn't drop dimensions, length gets you the number of columns rather than the number of observations. This does speak to the fact that length is a pretty shitty function to begin with, but I concede you're partially correct there.

You are also correct that because class labels are not contractual, there is no guarantee that having the data.frame fallback label means stuff behaves identically (for instance, you could add the data.frame label to any data structure and the data.frame dispatch stuff would not work properly). My point was that in the case of a tibble, a tibble is literally a data frame with an additional class label. If you remove that class label, it's exactly identical.

But your example and linked discussion does highlight a way in which I'm wrong; the subset function is overridden for something with a tibble class label. That's true and could produce edge cases I hadn't considered.

Apologies for any hostility in my original reply.

balnaphone · on April 24, 2020

I'm sorry to report that this analysis is completely wrong, and demonstrates a lack of understanding of the R object model. The class that is provided by tibble does not implement all of data.frame, and the OP is correct.

notafraudster · on April 24, 2020

(S3 -- see footnote) Classes don't "implement" anything in R the way they would in other languages. They are labels that tell dispatch functions how to deal with an object. A tibble is internally a data frame. The last example in my post makes this exactly clear.

The other OO systems in R do act closer to traditional classes, but all the tidyverse stuff is S3.

(But the OP was correct in another sense related to the example narrowly!)

stewbrew · on April 25, 2020

So you're ignoring that the [-function by design works differently for tibbles than for data frames. This isn't really a problem with tibble but with sloppiness in programming allowed by dynamic languages.

I personally think it's a good thing that the drop-argument defaults to FALSE for tibbles, since data frame's default drop = TRUE is a source of frequent bugs. The change of the default for this parameter is the source of your observation.
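The effect of the `drop` argument can be reproduced with a plain data.frame, no tibble required. This is only a sketch: `nunique_safe` is a hypothetical rewrite of the helper from earlier in the thread that uses `[[` instead of `[`, which should not depend on the `drop` default.

```r
df <- iris

# Base default (drop = TRUE): a single column collapses to a vector.
class(df[, "Species"])                 # "factor"

# With drop = FALSE the result stays a one-column data frame,
# which is what tibble's `[` does by default.
class(df[, "Species", drop = FALSE])   # "data.frame"

# length() of a data frame counts columns, hence the surprising 1.
length(unique(df[, "Species", drop = FALSE]))  # 1

# Hypothetical safer helper: `[[` always extracts the column itself.
nunique_safe <- function(x, colname) length(unique(x[[colname]]))
nunique_safe(df, "Species")            # 3
```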

_fnhr · on April 25, 2020

I am not ignoring it, I am _highlighting_ it. The question of the comment above was "why would one prefer data.frame over tibble". I merely answered that question.

stewbrew · on April 25, 2020

Yes, but the problem isn't tibble since what you're highlighting is a design choice and an argument in favor of tibble. The problem only arises when you're not aware of this design choice, which is facilitated by sloppiness and dynamically typed languages.

One might ask whether it was a good idea that tibble enlists data.frame as an inherited class. Since a tibble obviously doesn't behave like a data frame, one could also argue that this is a mistake on the part of the tibble developers, but this is a different discussion.

_fnhr · on April 25, 2020

All I am saying is that there are perfectly good reasons for not using tibbles if you do any kind of work outside of tidyverse. And you seem to agree?

As for whether or not tibbles should be data.frames - I posted a link to this exact discussion on R-dev mailing list within this thread, as an answer to a different poster. Here it is: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...

stewbrew · on April 25, 2020

Ok, now I understand where you're coming from.

kgwgk · on April 24, 2020

Just put an as.data.frame() around it. That's what I do with readxl::read_excel :-)

FiReaNG3L · on April 24, 2020

One of the big reasons why I quit R 10 years ago and never looked back - Python wasn't secretly converting anything, AND failing silently when it's not the expected type.

kyllo · on April 24, 2020

R's come a long way in the last decade. The tibble and data.table packages both address this issue. data.table (https://github.com/Rdatatable/data.table) is the more strongly-typed of the two, by default it fails loudly when it encounters data that doesn't conform to the column type. It's also quite fast--binds to C code that parallelizes with OpenMP. It has very terse and expressive syntax, I find it so much more intuitive and easy to work with than pandas.

If you're happy with Python, by all means keep using it. I use both languages. Just suggesting that if you gave up on R that long ago, you might be pleasantly surprised by how much better it's gotten since then.

Tarq0n · on April 24, 2020

vctrs [0] is the latest effort by the RStudio developers to help people write type-stable code. The R standard library has a lot of issues with silently casting types, but the wonderful thing about it being so scheme-like is that many of these things can be evolved through libraries.

[0] https://github.com/r-lib/vctrs

closed · on April 24, 2020

By python, are you also including pandas? Because it is definitely doing a lot of that!

Breza · on May 2, 2020

Exactly! TensorFlow is the only Python data analysis package I've used that doesn't automatically convert things in the background. I was helping a friend with STATA the other day, which doesn't automatically convert, and I realized I've gotten so used to that behavior in R and Python.

lmg643 · on April 24, 2020

I stopped using R a number of years ago because it is not useful for very large datasets (in the tens to hundreds of millions) and now use kdb almost exclusively.

stewbrew · on April 25, 2020

R has quite a few specialized libs to deal with large datasets (out of memory). Nothing keeps you from hosting the data in a DBMS and using SQL (or dplyr) to pull the data in an appropriate format.

Breza · on May 2, 2020

I run a data science department at a corporation and this is exactly how we handle our massive amounts of data. It's rare that we're using a billion+ data points in one model so we use SQL to get the data we need in the format we need and move forward from there in R.

ggrothendieck · on April 24, 2020

The new C++ derived syntax for string literals seems to me to be the top new feature. It will make it possible to support markdown, latex, R code and Windows path literals without munging them first.

ryndbfsrw · on April 24, 2020

If I understand, does this mean on windows when I've copied a file location I would no longer have to replace backslash in the file path with double back slash or forward slashes? Or am I off-piste?

ggrothendieck · on April 24, 2020

Correct. One can write r"(c:\Windows\System32)" . Check out additional examples at the end of the help file: https://github.com/wch/r-source/blob/trunk/src/library/base/...
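A quick check of that example, assuming R 4.0.0 or later (earlier versions will not parse the raw string syntax):

```r
# Raw string literal: backslashes are taken literally, no escaping needed.
path_raw <- r"(c:\Windows\System32)"

# The same path written the old way, with doubled backslashes.
path_escaped <- "c:\\Windows\\System32"

identical(path_raw, path_escaped)  # TRUE
```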

stilisstuk · on April 24, 2020

Awesome

ryndbfsrw · on April 24, 2020

I'm so used to using data.table to import data I forgot this was still a thing

minimaxir · on April 24, 2020

One of the selling points of tidyverse’s readr was “no stringsAsFactors = FALSE necessary.” That’s how annoying it is.

kgwgk · on April 24, 2020

You could add to your .Rprofile

  options(stringsAsFactors=FALSE)

thom · on April 24, 2020

This is likely to break things when you share or publish your code.

kgwgk · on April 24, 2020

Upgrading will also break things. And code which depends on defaults will be broken either in new releases or in older releases.

thom · on April 24, 2020

Yes, I mean, we could list things all day that make R a horrible environment for reproducible analysis or production deployment, but it is what it is.

kgwgk · on April 24, 2020

You’re right. My defensive reply was unwarranted.

s1t5 · on April 24, 2020

Upgrading makes the changes explicit, tinkering with your environment variables doesn't.

kgwgk · on April 24, 2020

Upgrading is not different than changing environment variables as far as breaking existing code is concerned.

You can always run R --no-init-file to be sure that you have the default settings. Now you have to know what default settings the code that you want to run expects.

truculent · on April 24, 2020

And make your code potentially non-reproducible?

kgwgk · on April 24, 2020

If you care about that don't rely on defaults. This upgrade makes old code non-reproducible, should everyone abstain from upgrading?
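Not relying on the default could look like the sketch below: the argument is passed explicitly, so the result does not depend on the R version or on anything set in .Rprofile. The inline CSV text is made up for illustration.

```r
# Hypothetical inline CSV so the example needs no file on disk.
csv_text <- "name,score\nalice,1\nbob,2"

# Explicit argument: same behaviour before and after R 4.0.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(df$name)  # "character"

# Opting in to factors is just as explicit.
df_f <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(df_f$name)  # "factor"
```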

baldfat · on April 24, 2020

Well, that is why there is 4.0. Hadley Wickham has had a HUGE influence on R, and now we have a lot of new things in base R that make it reproducible.

paultopia · on April 24, 2020

This makes me want to sing and dance. Ding-dong, the stringsAsFactors witch is dead!

topheroo · on April 24, 2020

I thought exactly the same thing! Been a long time coming…

this_is_not_you · on April 24, 2020

My thoughts exactly when I saw that bullet point.