R at Microsoft (revolutionanalytics.com)
106 points by vladiim on June 29, 2015 | 42 comments


Microsoft buying Revolution Analytics is another piece of evidence that shows Microsoft has its A-game back. R is massively popular in the statistical community; it is basically the PHP of analytics (huge standard library with no namespaces, but very easy to get started with and powerful for advanced users).

Additionally R is one of the few languages where you can be productive in a non-unix environment. At my old Windows workplace, R was the only open source language with working network NTLM authentication out of the box. Everything can be self-contained in RStudio, an interpreter and gui package manager to stop you jumping to the command line, and a help system to stop you using man.

By becoming closely associated with R, Microsoft is more likely to be a component in large analytics investments, i.e. "let's get Microsoft's R distro, oh and why not their big data platform too?", and ensures its services and databases are well integrated with this up-and-coming language.


"PHP of analytics"

I started laughing, then I realized you meant that as a compliment.

R has a great community and a ton of unique packages, though many of those great packages are thin wrappers over Fortran/C libraries and are available in a host of other languages (Python, Julia) that, in my opinion, have better features when it comes to testing and group-based development.

In my experience, R degenerates into a rat's nest more easily than the other candidates. I've been hired twice for this explicit reason.

BTW, I think your comparison with PHP is apt. R has many of the same strengths and flaws.


Slight typo: compliment is correct here, rather than complement. Excellent discussion.


The write once run anywhere concept is intriguing. R has major scaling issues and a lot of the algorithms and packages in the CRAN just don't scale to any reasonable amounts of data. It really was designed in an era of controlled experiments with minimal data.


R is a funny language because there are a lot of things in it that are just objectively bad, yet I still love it. Probably because of the community and CRAN. Of course, R is really just some syntax wrapped over C++, and it is very extendable using C and C++ when needed, so I don't think you can really say, in absolute, that R doesn't scale. Someone has written a package to scale whatever problem in R, you just need to find it.


> an era of controlled experiments with minimal data

This world hasn't disappeared, y'know. Many of the specialized packages on CRAN deal with things like missing data, removing selection bias from longitudinal studies and distinguishing mediated from direct effects. These concerns are by and large not even applicable to the data the average data scientist deals with, so why even bother making them scalable?

On the ML front, on the other hand, R has never really been a forerunner.


I don't know why this comment is downvoted. It is true. The vast majority of businesses will never deal with machine learning. ML has very limited applications in the "real world", as opposed to regression analysis with some small samples.


Your point that simple regressions with small(ish) sample sets are important is valid. This:

> "The vast majority of businesses will never deal with machine learning."

However, is myopic. It plausibly will not hold for much longer in most areas, and is already not true for a significant number of industries.


I find ML and "3D printing" very similar in the expectations around them and in the extent to which businesses will adopt them.


On what basis? They are not similar in any respect that I can think of.

3D printing has had limited impact for prototyping in the many decades it has been around, and in the expected quarters. That's pretty much what every analysis I've seen except a very few boosters have predicted, for what it's worth.

On the other hand, ML has fundamentally changed document processing, search engines, remote sensing, retail sales, e-commerce, etc. It has also strongly impacted areas of logistics and shipping, manufacturing (e.g. computer vision & robotics), medicine, law enforcement, finance, defense, etc.

It's also poised to find market advantages in many non-obvious industries as larger scale data becomes available... unclear how far that will extend.

Where's the parallel you see?


Many of the companies I have interviewed with, and all the ones I have worked for, have set up ML and analytics teams in the last 5 years. It is a growing area and people are starting to understand its utility.


Poorly understood, high hopes, far fewer practical applications. Predictions are not as applicable as one might think at first glance.

I hope you do not confuse machine learning with statistics and data mining. They are very different.


No, I do not make such a confusion. I hope you do not make the converse one, of re-categorizing techniques "out" of machine learning as soon as we understand how they work.

At any rate, your experience in industry seems to be quite different than mine.


The work going on in Spark[1] is pretty interesting from this point of view.

Using R functions via closures against Spark RDDs in a cluster has the potential to deliver pretty good performance against large scale data.

[1] https://amplab-extras.github.io/SparkR-pkg/


I am very excited about the SparkR package. I have it set up on my desktop box and look forward to seeing what models and learning algorithms will work well with it. There really is no runner up to R in terms of the variety of statistical models, now what it needs is to scale up.


Does anyone know how this works with regard to the GPL? I have heard that they have a complete reimplementation of R. Really? What about all of the GPL'ed packages at CRAN, do they just not ship those?

I also know that the R developers don't really enforce the GPL. Is that what's going on here?


If they ship their own from-scratch-not-using-any-third-party-code implementation, they are fine

If they are shipping a GPL version of R, then this is a legal grey area, with different opinions from differing lawyers, mostly on whether it's a derivative work covered by the GPL or not.

It's honestly not worth getting into the whole discussion, because there are no lights at the end of the tunnel, only opinions on all sides that are usually supported by reasonable arguments.


I'm surprised that no one in this thread (or on the broader Internet, as far as I can tell from a cursory web search) has any information on the status of this "Revolution R" package that Microsoft recently acquired.

Is it a clean-room reimplementation of R? (like the relationship between .NET and Mono) Or is it a "distribution" of R? (like the relationship between Debian and Ubuntu?)

Whether the R implementation here is non-GPL, or whether it actually is running in a "fork-and-exec" separate process, I'm sure that Microsoft has their bases covered. They know more than a thing or two about software licensing, and certainly wouldn't take any risks of subjecting their flagship enterprise database to the GPL.

However, I'm completely uncertain as to what the legal status of Microsoft's R implementation means in terms of libraries that one can use from CRAN (aren't many of those GPL'ed as well?).

For me, I work in the real world, where you're not allowed to touch the GPL with a 10-foot pole, so this is of idle curiosity only. I'm not sure if Microsoft is trying to appeal to academia here, or if it's just a P.R. move in general... but if they expect to sell this to business users, then they're going to have to put a LOT more effort into clarifying its legal status.


> I work in the real world

I must work in the imaginary world, since we are allowed to use GPL'ed stuff at work. In fact, I got hired at my current job to improve GPL'ed stuff. And I don't work in academia.

Whenever I hear stories about how big companies can't touch the GPL, I always call "bullshit". Of course they can; and in fact many large companies do. Some, sadly, are just full of inefficient bureaucracy that feeds their employees big fat lies about how the GPL will destroy us all.


According to [1] R "will be run inside a sandbox process within SQL Server itself". If that's correct then there's not much to it. MS includes a link somewhere saying "Go here for the R source code."

As far as CRAN goes, the licenses of the packages there have nothing to do with the license of the R implementations they run on. Even if MS has their own implementation of R, it should be able to download and run packages from CRAN without MS having to worry about it.

[1] http://blog.revolutionanalytics.com/2015/05/r-in-sql-server....


" If that's correct then there's not much to it."

The FSF would beg to differ the last time I looked, depending on the situation: http://www.gnu.org/licenses/gpl-faq.en.html#GPLPluginsInNF

is the closest to their position on this.

Basically, if you can function call into R from SQL/etc, and it's not just a completely independent subprocess, their view is that you'd have to GPL the main process.


My interpretation was that SQL Server would be more or less forking an R process and passing it the script as an argument. It seems hard to believe that would be forbidden by the GPL.
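
For what it's worth, the "fork a process and pass it the script" model is easy to sketch. Below is a minimal Python illustration of that idea, not a description of how SQL Server actually does it: the host launches a separate interpreter, hands it the script text as an argument, and only sees what comes back on stdout. `sys.executable` is a stand-in for an R launcher, and `run_script_in_child` is a made-up name.

```python
import subprocess
import sys


def run_script_in_child(script_text, interpreter=sys.executable):
    """Run script_text in a separate interpreter process and return its stdout.

    Assumption: `interpreter` accepts `-c <code>`. The Python interpreter is
    used here purely as a stand-in for an R launcher.
    """
    completed = subprocess.run(
        [interpreter, "-c", script_text],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    # Host and script share no memory; they exchange only text.
    print(run_script_in_child("print(sum(range(10)))"))
```

The point of the sketch is the boundary: the host never links against the child's code, it only starts a process and reads its output. Whether that boundary is enough for licensing purposes is exactly what the rest of this subthread argues about.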


That's a very strict position to take. So using a GPL'd library in my code means my code is now GPL as well? I understand now why many companies blanket ban GPL software.


Is it? If I go and download the unreal engine to build a game, I got to pay 5% of gross revenue for using that in a commercially sold game. If I go to Flickr and download some pictures, I often got to pay the photographer for including it in commercial products.

The GPL, like any other software license, defines the terms under which authorization is granted, and instead of asking for money it only requests that you share and share alike the source code. Some people find that to be a fair concept, while others prefer to spend money and time to create their own implementation. Depending on the level of market competition, business models, and revenue streams, reimplementing existing software can be valid, but most of the time it's not.


It's not very strict; it's the raison d'être of the GPL. The LGPL was written precisely to allow libraries without copyleft. A big chunk of CRAN is library-based and depends on a binary interface to R, and is GPL'ed.

The FSF also does not think that whether the GPL'ed program is a separate process or not determines whether the bundle as a whole is GPL'ed or not. In their opinion, if you bundle the whole thing and it acts as a whole, and the communication between the GPL'ed components and the non-free parts is "intimate enough"[1], then the whole bundle is copylefted. It seems to me that bundling R with SQL Server and tightly coupling the two, even across different processes, could well qualify.

Of course, if the R developers do not want to enforce their copyleft, none of this matters.

[1] https://www.gnu.org/licenses/gpl-faq.html#MereAggregation


> So using a GPL'd library in my code means my code is now GPL as well?

Yes, that's always been the whole point of the GPL. (The LGPL exists if you want to release your library without that requirement)


I don't know a lot about R, but apart from the fact that it's made for stats, isn't it a great language because it's also aimed at handling embarrassingly parallel programs, and no other language does it as well as R?

I guess R is used in machine learning fields.

Is R OpenCL-enabled, or is it planned to be?


R doesn't have built-in parallelism features that I know of. There are a lot of packages, though: http://cran.r-project.org/web/views/HighPerformanceComputing...


The 'parallel' package has been included in recent versions of R and has RNGs for parallel execution and variations of the `apply` functions. For "embarrassingly parallel" calculations these work as drop-in replacements. It's built-in in the same way that MASS and lattice are.


Most of the things can be transformed through lapply family (and d/plyr equivalents). d/plyr has simple parallel=TRUE. For other problems use mclapply().

That's basically all you need on a day-to-day basis.
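
To make the "drop-in replacement" idea concrete for people who don't write R, here is a rough analogue in Python rather than R itself. It is only a sketch: `simulate` is a made-up task, and a thread pool is used to keep the example portable, which means it shows the shape of the swap rather than a real CPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed):
    """Hypothetical per-item task: a small deterministic computation."""
    return sum((seed * k) % 7 for k in range(1, 1001))


def run_serial(inputs):
    # Plain map over the inputs: the analogue of lapply(inputs, simulate).
    return list(map(simulate, inputs))


def run_parallel(inputs, workers=4):
    # Executor.map returns results in input order, so it can replace the
    # serial map above without touching the calling code.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, inputs))


if __name__ == "__main__":
    # Each item is independent of the others, i.e. embarrassingly parallel.
    print(run_parallel(range(8)) == run_serial(range(8)))
```

For CPU-bound work you would likely swap in `ProcessPoolExecutor`, which offers the same `map` interface; that is closer in spirit to what a forking parallel apply does.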


You forgot Python, which has packages for this (similar in that R also needs packages for parallelism): http://matthewrocklin.com/blog/work/2015/06/26/Complex-Graph...


Julia

http://julialang.org/

Btw. almost down-vote for "embarrassingly"



Anyone able to read the slides? I have difficulty accessing the slides.


They worked for me (Chrome, Linux). Perhaps you can try the direct link:

http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft...


I have to say I am very disappointed in the information, after the title got me so excited.

No announcements about Excel? I really hope R lands in Excel, it will open big opportunities for developers and businesses.

Also there was a guy from MS, who created python tools for VS and asked if anyone would want R tools for VS. After receiving a very warm response, never heard from him again. What's up with that? Parallel to that RA has an IDE, which uses the VS shell. Will we have more streamlined R workflow in VS? This is important for a lot of people...

What about R.NET?


hi there - it took a while to make our 1st hire, but i'm happy to report that work is underway as part of the Azure ML group. as soon as the project has a pulse i'll post on HN. current plan is to have it be free & open source like Python Tools for Visual Studio.

i'm also hiring for the project, so if you're into dev tools (editors, debuggers, profilers, languages, ...) and looking, pls drop me a line. looking for devs and a technical PM.

re R.Net - no effort that i know of.


Glad to hear you are making progress then!

Can't help you with development, but I will be more than happy to beta test. I make analytics applications, encapsulated in ASP.NET, and additionally use R for market research.


If R lands in excel we can see even more proliferation of unstable excel "applications" with unmaintainable VBA, hard (if not impossible) to follow logic all wrapped up in a program where the next release might flat out break everything.

Unless microsoft address some of the core issues with using excel as a "tools" platform - I'd prefer they didn't encourage it any more than they already do.


You're exaggerating, for sure.


I'm probably underselling it actually.

Since excel 2013 came out several spreadsheets which used to function in 2010 no longer do, and anything over ~50mb has become significantly less stable.

I've spent literally days debugging excel tools with VBA split all around between sheets, modules, classes etc with no documentation. Not to mention that additional logic lives in the various formulae inside worksheets (which may or may not be dynamically updated multiple times during code execution and then trigger a worksheet_changed macro).

The value of source control is pretty limited for an excel sheet, and I've been in a situation whereby a source controlled sheet could no longer be saved without excel crashing. The end solution was to copy all of the modules/sheets/extraneous code into a new workbook. A highly manual process. It appeared to work - but good luck proving I didn't screw it up.

Links to other workbooks are significantly more unstable in 2013 (they're probably a terrible idea anyway, but the company considers it the best way to affirm the data being used for calculations is right).

Excel 2013 launches differently to 2010 breaking every excel tool which relied on launch args.

I can probably find a bunch of other reasons that excel is something that literally gives me nightmares. I'm not exaggerating, and based on a few conversations with people in various finance companies this is far from isolated to myself and my company.


>and based on a few conversations with people in various finance companies this is far from isolated to myself and my company.

Exactly my point. You just like to pride yourself in how much you know and the 'proper way to do things'. If you dig deep enough to the core of the Earth, you will find a win98 machine, running Excel. How bad can it be, if the Earth runs on it?



