This alone is reason to upgrade -> "R now uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table()."
This is one primary reason I started using read_csv from readr rather than read.csv from base R. Every time you teach someone to read a CSV using base R, this comes up somewhere in the middle of an analysis because those strings were read as factors.
Yes, factors absolutely can be a pain in the ass when they're instituted too early in an analysis. It's better to keep them as strings for as long as possible and only convert to factor when you've cleaned up your data. Otherwise, you have to deal with annoying and confusing factor manipulation.
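A minimal sketch of that "strings first, factors last" workflow, using only base R. The CSV text and the column names are made up for illustration, and `stringsAsFactors = FALSE` is passed explicitly so the behaviour is the same before and after R 4.0:

```r
# Hypothetical messy input: stray whitespace and inconsistent capitalisation.
csv_text <- "id,group\n1, Control\n2,treated\n3,control \n"

# Read the labels as plain character strings, not factors.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Clean while the column is still character: ordinary string functions apply.
raw$group <- tolower(trimws(raw$group))

# Convert to a factor only once the values are tidy.
raw$group <- factor(raw$group, levels = c("control", "treated"))

levels(raw$group)
```

Had the column been read as a factor straight away, the three spellings of "control" would have become separate levels that need merging afterwards.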
The drawback is that character takes up so much space, but these days memory is so bountiful it usually doesn't matter.
Nowadays strings are stored in a "string pool" anyway, so if you have a string that can be turned into a factor (i.e. with few unique variants), you probably don't need to.
Post edit: I'm preserving the post below, which I think highlights some of the issues in the parent's example, but it turns out he is correct on another issue -- narrowly, in his example, the `[` subset operator dispatches differently on tibbles than on data.frames, so you can produce weird behaviour. So to anyone reading, please consider upvoting the parent and reading the rest of the thread.
Original post follows:
Right off the bat, the problem is not "using" tibbles, it's that you've incorrectly constructed one by passing the data through the tibble() constructor rather than using as_tibble(). The tibble constructor -- for pretty good reasons in other circumstances that seem crazy to you here because of your intent -- infers that you want the entire data frame to be a single column inside the tibble, called "iris". It does this because it evaluates the variable name passed to the tibble constructor as both the intended column name and the data to be placed inside the column. This demonstrates nesting, which is one of the great features of tibbles and otherwise used for a bunch of stuff.
If you had done `tb_iris <- as_tibble(iris)`, it would have worked fine. `as_tibble()` is the function to convert an existing data structure to a tibble. R is obviously not "type safe" in any way, but you can engage in defensive programming, and one way you can do that is being hyper-aware of the steps you take during type conversions. If you check the documentation for `tibble()`, it tells you explicitly to "Use as_tibble() to turn an existing object into a tibble." Is there a reason you didn't? Imagine this related example:
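A minimal sketch of the kind of contrast meant here, assuming the comparison is between base R's `numeric()` constructor and the `as.numeric()` conversion helper (this pairing is an illustration, not necessarily the original example):

```r
# The constructor reads its argument as a length and builds a new vector.
constructed <- numeric(5)

# The conversion helper reads its argument as the value to convert.
converted <- as.numeric(5)

length(constructed)  # 5
length(converted)    # 1
```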
Would we conclude that "using the numeric type can be dangerous" because the constructor interpreted the argument differently than the conversion helper?
Second, I suspect you must be using extremely old versions of things, because on more recent versions, your nunique function would fail, not produce 1. I correctly get "Error: Can't find column `Species` in `.data`." This error message is maybe a little confusing if you don't check the structure `str(tb_iris)` of tb_iris to see what I mentioned above, but is the correct error to output in light of it. You'd also be able to flag this by just checking `colnames(tb_iris)` or `View(tb_iris)` if you're working in RStudio or using the embedded environment pane or really any other way of looking at the data.
But your broader point is also false. Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes. The only thing that makes a tibble different than a data.frame is that it has an additional class label. All dispatches that work on data.frame objects work on tibbles because of how multiple classing works in R. This has been a goal since the beginning of tibble. The one exception I'm aware of is external functions that incorrectly check `if(class(obj) == "data.frame")` instead of using `is.data.frame()` or `if("data.frame" %in% class(obj))`. The former is and always has been incorrect because of how S3 dispatch on multi-classed objects is designed to work in R, and it should generate an error with multi-classed objects because the condition evaluates to a vector of logicals instead of a single logical.
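A short sketch of why the `class(obj) == "data.frame"` check is fragile, assuming the tibble package is installed (the variable name `tb` is illustrative):

```r
library(tibble)

tb <- as_tibble(iris)

# A tibble carries three class labels, with "data.frame" last.
class(tb)

# Comparing the whole class vector yields one logical per label,
# which is not a valid single condition for if().
fragile <- class(tb) == "data.frame"
length(fragile)  # 3

# These checks return a single TRUE for any object inheriting from data.frame.
is.data.frame(tb)
inherits(tb, "data.frame")
"data.frame" %in% class(tb)
```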
One way you can tell that tibbles and data frames are identical save the above caveat is to run the following code:
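A minimal sketch of such a check, assuming the tibble package is installed; the variable names are illustrative rather than the poster's own:

```r
library(tibble)

tb_iris <- as_tibble(iris)

# Overwrite the class attribute directly; no conversion function is called.
hacked <- tb_iris
class(hacked) <- "data.frame"

class(hacked)

# Compare the relabelled object against the original data frame.
all.equal(hacked, iris)
```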
Note that you are not "downconverting" a tibble into a data.frame in this code (but that would work too) -- you are taking the tibble exactly as is and hacking its class label to look like a data frame. It's identical because a tibble was always a data frame.
> Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes.
This is not the case. You can add any class to any object in R's S3 system. So the people behind tibble can call their tibble a data.frame, but that gives no guarantee that it will behave like one.
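To illustrate the point that S3 class labels are not contractual, here is a small base R sketch. The object `fake` is a deliberately malformed, hypothetical example:

```r
# A ragged list: columns of unequal length, no row names.
fake <- list(a = 1:3, b = c("x", "y"))

# S3 lets us attach any label without validation.
class(fake) <- "data.frame"

# The label alone is enough to pass the usual check...
is.data.frame(fake)

# ...even though the underlying structure is not a valid data frame.
lengths(unclass(fake))
```

Methods written for data.frame will dispatch on `fake`, but nothing guarantees they will behave sensibly on it.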
Actually your reply was very helpful because it surfaced ways in which you were partially right and I was partially wrong.
I highlighted the nesting issue in constructing versus coercing (which is correct and does have implications for what you're trying to do), but actually, in your example, the distinction is broken because of a different edge case.
Which is to say the following:
ncol(iris) # 5
ncol(as_tibble(iris)) # 5
ncol(tibble(iris)) # 1
iris$Species # Works
as_tibble(iris)$Species # Works
tibble(iris)$Species # Errors because of nesting
iris[, "Species"] # Works
tibble(iris)[, "Species"] # Doesn't work
as_tibble(iris)[, "Species"] # Works
However, you're correct that because the subset operator for tibble doesn't drop dimensions, length gets you the number of columns rather than the number of observations. This does speak to the fact that length is a pretty shitty function to begin with, but I concede you're partially correct there.
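A small sketch of the `length()` difference, assuming the tibble package is installed (variable names are illustrative):

```r
library(tibble)

df_col <- iris[, "Species"]             # data.frame drops to a plain vector
tb_col <- as_tibble(iris)[, "Species"]  # tibble keeps a one-column tibble

length(df_col)  # 150: number of observations
length(tb_col)  # 1: number of columns
nrow(tb_col)    # 150

# [[ extracts the column as a vector from either kind of object.
length(as_tibble(iris)[["Species"]])  # 150
```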
You are also correct that because class labels are not contractual, there is no guarantee that having the data.frame fallback label means stuff behaves identically (for instance, you could add the data.frame label to any data structure and the data.frame dispatch stuff would not work properly). My point was that in the case of a tibble, a tibble is literally a data frame with an additional class label. If you remove that class label, it's exactly identical.
But your example and linked discussion does highlight a way in which I'm wrong; the subset function is overridden for something with a tibble class label. That's true and could produce edge cases I hadn't considered.
I'm sorry to report that this analysis is completely wrong, and demonstrates a lack of understanding of the R object model. The class that is provided by tibble does not implement all of data.frame, and the OP is correct.
(S3 -- see footnote) Classes don't "implement" anything in R the way they would in other languages. They are labels that tell dispatch functions how to deal with an object. A tibble is internally a data frame. The last example in my post makes this exactly clear.
The other OO systems in R do act closer to traditional classes, but all the tidyverse stuff is S3.
(But the OP was correct in another sense related to the example narrowly!)
So you're ignoring that the `[` function by design works differently for tibbles than for data frames. This isn't really a problem with tibble but with sloppiness in programming allowed by dynamic languages.
I personally think it's a good thing that the drop-argument defaults to FALSE for tibbles, since data frame's default drop = TRUE is a source of frequent bugs. The change of the default for this parameter is the source of your observation.
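For plain data frames, the same defensive behaviour is available by passing `drop = FALSE` explicitly. A base R sketch, where `cols` stands in for a column selection that happens to have length one:

```r
cols <- "Species"  # e.g. a selection computed elsewhere that ended up with one name

dropped <- iris[, cols]                # default drop = TRUE: result is a vector
kept    <- iris[, cols, drop = FALSE]  # result stays a data frame

is.data.frame(dropped)  # FALSE
is.data.frame(kept)     # TRUE
```

Code that later calls `ncol()` or `names()` on the result only works in the second case, which is the kind of bug the default makes easy to write.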
I am not ignoring it, I am _highlighting_ it. The question of the comment above was "why would one prefer data.frame over tibble". I merely answered that question.
Yes, but the problem isn't tibble, since what you're highlighting is a design choice and an argument in favor of tibble. The problem only arises when you're not aware of this design choice, which is facilitated by sloppiness and dynamically typed languages.
One might ask whether it was a good idea for tibble to list data.frame as an inherited class. Since a tibble obviously doesn't behave like a data frame, one could also argue that this is a mistake on the part of the tibble developers, but this is a different discussion.
One of the big reasons why I quit R 10 years ago and never looked back - Python wasn't secretly converting anything, AND failing silently when it's not the expected type.
R's come a long way in the last decade. The tibble and data.table packages both address this issue. data.table (https://github.com/Rdatatable/data.table) is the more strongly-typed of the two, by default it fails loudly when it encounters data that doesn't conform to the column type. It's also quite fast--binds to C code that parallelizes with OpenMP. It has very terse and expressive syntax, I find it so much more intuitive and easy to work with than pandas.
If you're happy with Python, by all means keep using it. I use both languages. Just suggesting that if you gave up on R that long ago, you might be pleasantly surprised by how much better it's gotten since then.
vctrs [0] is the latest effort by the RStudio developers to help people write type-stable code. The R standard library has a lot of issues with silently casting types, but the wonderful thing about it being so Scheme-like is that many of these things can be evolved through libraries.
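A brief sketch of the difference, assuming the vctrs package is installed. It contrasts base `c()` with `vctrs::vec_c()` on a double and a character input:

```r
library(vctrs)

# Base R silently coerces the number to a string.
base_result <- c(1, "a")
class(base_result)  # "character"

# vec_c() refuses to combine double and character and signals an error instead.
vctrs_result <- tryCatch(
  vec_c(1, "a"),
  error = function(e) "incompatible types"
)
vctrs_result
```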
Exactly! TensorFlow is the only Python data analysis package I've used that doesn't automatically convert things in the background. I was helping a friend with STATA the other day, which doesn't automatically convert, and I realized I've gotten so used to that behavior in R and Python.
I stopped using R a number of years ago because it is not useful for very large datasets (in the tens to hundreds of millions) and now use kdb almost exclusively.
R has quite a few specialized libs to deal with large datasets (out of memory). Nothing keeps you from hosting the data in a DBMS and using SQL (or dplyr) to pull the data in an appropriate format.
I run a data science department at a corporation and this is exactly how we handle our massive amounts of data. It's rare that we're using a billion+ data points in one model so we use SQL to get the data we need in the format we need and move forward from there in R.
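A rough sketch of that pattern, assuming the DBI and RSQLite packages are installed. An in-memory SQLite database and the built-in `iris` data stand in for a real DBMS and a large table:

```r
library(DBI)

# Stand-in for a production database connection.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)

# Let the database do the aggregation; only the small result comes into R.
agg <- dbGetQuery(
  con,
  'SELECT Species, AVG("Sepal.Length") AS mean_sepal_length
   FROM iris
   GROUP BY Species'
)

dbDisconnect(con)

agg
```

With a real server only the connection call would change; the idea is that R receives the summarised rows rather than the full table.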
The new C++-derived syntax for string literals seems to me to be the top new feature. It will make it possible to support Markdown, LaTeX, R code and Windows path literals without munging them first.
If I understand correctly, does this mean that on Windows, when I've copied a file location, I would no longer have to replace each backslash in the file path with a double backslash or a forward slash? Or am I off-piste?
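A minimal sketch of the raw string syntax, assuming R 4.0 or later; the path itself is hypothetical:

```r
# Raw string literal: backslashes are taken literally between r"( and )".
raw_path <- r"(C:\Users\me\Documents\data.csv)"

# The same path written as an ordinary string needs every backslash doubled.
escaped_path <- "C:\\Users\\me\\Documents\\data.csv"

identical(raw_path, escaped_path)  # TRUE
```

So a pasted Windows path can be wrapped in `r"(...)"` as-is, as long as the path does not itself contain the closing `)"` sequence.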
Upgrading is not different than changing environment variables as far as breaking existing code is concerned.
You can always run R --no-init-file to be sure that you have the default settings. Now you have to know what default settings the code that you want to run expects.
Well, that is why there is a 4.0. Hadley Wickham has had a HUGE influence on R, and we now have a lot of new things we can use in base R that make it more reproducible.