Hacker News

In principle this should allow readers to completely understand and reproduce the processing of the raw data, rather than relying on a few paragraphs summarizing the processing done by the authors. Using a closed source tool makes that kind of inspection much harder.

But consider http://www.bbc.co.uk/news/science-environment-39054778

"Science is facing a "reproducibility crisis" where more than two-thirds of researchers have tried and failed to reproduce another scientist's experiments"

I don't think that can be hand-waved away as "OMG closed source software!", especially since all the scientists in a given field will have access to the same software anyway. Give them open source and the issue will persist, and we both know it, because the root cause has nothing to do with the license of the software.



Reducing the barriers for doing reproduction studies will likely increase how often they happen.

Or just consider peer review. How often does a reviewer today actually review the code and data used in a study? As far as I know this is essentially never done. That would involve seeing the code, running it, maybe messing around with it a bit.


Open source software, in general, tends to make things more reproducible. Sure, software licensing might not be the singular root cause, but why does that mean we shouldn't capitalise on the improvements available?


Open source software, in general, tends to make things more reproducible

Citation very much needed for that. Because you can very easily find that 6 months or a year later you update your dependencies and everything is now broken. I recently came back to a Python project after a year, updated my packages then realized: I simply cannot be bothered to unpick the mess that resulted just to add one trivial feature. Whereas the poster child for backwards compatibility is closed-source and proprietary.

Science is not reproducible because there are no incentives for it to be so, despite everyone paying lip service to it. It's extra work and helps those who are competing with you for grants, after all. That's a social problem, not a software one. The software is irrelevant.


Open source software is important for reproducibility for a couple of reasons. Firstly, if you record that you've done your analysis with Python 3.6.3 and Numpy 1.14.2, and it later breaks on some newer version, it's relatively easy to get the same versions you were using. Commercial software vendors are usually not keen on you downloading and running a version of their product which was superseded four years ago.
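One common way to make those versions recoverable later is to pin them, e.g. in a requirements file (the versions shown are just the ones mentioned above, as an illustration):

```
# requirements.txt -- exact versions used for the analysis
numpy==1.14.2
# (The Python interpreter version, 3.6.3 here, is recorded separately,
#  e.g. via pyenv, a Dockerfile, or a note in the paper's repository.)
```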

Secondly, of course, open source means that if you're not sure why two versions/functions/libraries are giving you different answers, you can go and find out. I accept that a lot of people may not have time for that, but I don't think you can fix that problem unless it's possible to dig down and follow the working.

Finally, 'reproducible by anyone with a computer' is a lot better than 'reproducible by people who buy a license for the tool I used to do it'.


If you record that you've done your analysis with Python 3.6.3 and Numpy 1.14.2, and it later breaks on some newer version, it's relatively easy to get the same versions you were using

Better record which compiler you used too, and what flags, and every version of every library and everything else. It’s not as simple as you make out and it’s far from guaranteed that all those packages will still be available or compile on your OS.
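A partial mitigation for "I don't even know my configuration myself" is to snapshot the interpreter, platform, and installed package versions alongside the results. This is only a sketch, and it deliberately won't capture compilers, build flags, or OS libraries; the output file name environment.json is made up for illustration:

```python
# Sketch: record interpreter, platform, and installed-package versions
# next to the analysis output. "environment.json" is a hypothetical name.
import json
import platform
import sys
from importlib import metadata  # stdlib since Python 3.8

snapshot = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {
        # .get() guards against rare distributions with broken metadata
        dist.metadata.get("Name", "unknown"): dist.version
        for dist in metadata.distributions()
    },
}

with open("environment.json", "w") as f:
    json.dump(snapshot, f, indent=2, sort_keys=True)
```

It won't make the run bit-for-bit reproducible, but it turns "which versions did I use?" from guesswork into a lookup.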

Commercial software vendors are usually not keen on you downloading and running a version of their product which was superseded four years ago.

I guess you must not deal with vendors much because generally they are fine with this. It’s part of the support agreement usually, just another service. Getting an “obsolete” version for whatever reason has never been a problem for me.

By “anyone with a computer” you mean “anyone who can exactly reproduce my configuration which I don’t even know myself for certain”


Because the reproducibility crisis has very little to do with the difficulty in re-running the same code. These are almost orthogonal concerns.

The typical problem paper has a small data set, on which the authors tried 20 different things, one of which achieved p<0.05 and got published. The result tells you nothing meaningful about the world (or more often, tells you something about how its authors wish the world worked). But re-running their code on their data set will not reveal the problem.
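To put a number on that "tried 20 different things" pattern: even if every hypothesis is false, the chance that at least one of 20 independent tests clears p < 0.05 is substantial. A quick back-of-the-envelope calculation:

```python
# Probability that at least one of n independent tests on pure noise
# reaches significance at level alpha (the familywise error rate).
alpha = 0.05
n_tests = 20
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"{p_at_least_one:.1%}")  # about 64%
```

Which is exactly why re-running the authors' code faithfully reproduces the same misleading p-value.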




