That's all true, but I think what you meant to say is that successfully detecting fabricated data is rare. I've been in the weeds on this, and after a while it's tempting to believe you can always spot scientific fraud by reading papers, but of course it's not so. A paper is just a bag of words and pictures; it can say anything it wants. As long as a data fabricator doesn't make basic errors, you won't detect the fraud.
Unfortunately, when people do find ways to automatically detect inconsistent numbers in a paper, like SPRITE or GRIM, they find maybe 50% of all checkable papers fail the audit and the authors almost invariably refuse to share their raw data for further checks despite having previously agreed to do so. So it seems likely that fake data is actually not rare at all but very widespread.
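For context on what these checks actually do: GRIM simply asks whether a reported mean is arithmetically possible given the sample size, assuming the underlying responses are integers (Likert items, counts and so on). Here's a rough Python sketch of the idea; it's my own toy version, not the published procedure, which handles rounding conventions and granularity more carefully.

    def grim_consistent(reported_mean, n, decimals=2):
        """Can ANY sum of n integer-valued responses round to reported_mean?"""
        tol = 0.5 / 10 ** decimals               # half a unit in the last reported digit
        lo = int((reported_mean - tol) * n) - 1  # candidate integer totals near mean * n
        hi = int((reported_mean + tol) * n) + 1
        return any(abs(round(total / n, decimals) - reported_mean) < 1e-9
                   for total in range(lo, hi + 1))

    print(grim_consistent(3.48, 20))  # False: no 20 integer responses can average to 3.48
    print(grim_consistent(3.45, 20))  # True: 69 / 20 = 3.45

Note that a check like this needs nothing beyond the summary statistics printed in the paper itself, which is why so many published papers are checkable in the first place.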
Caveat: like I say in every comment on scientific fraud on HN, all this varies dramatically by field. Some fields are much more corrupt than others. That 50% number is from psychology. Computer science isn't that bad relative to others. But for example, in climatology it's taken as axiomatic that it's OK to fabricate data and then present it as "observations". Anyone who tries to call them out on this behavior gets attacked, censored or even sued, yet if you did the exact same thing in physics nobody would hesitate to call it fraud. There isn't even a consistent definition in academia of what data fabrication means.
So no, sadly we don't really know how prevalent data fabrication is. The only way to detect scientific fraud to the level of robustness we'd accept for any other kind of fraud is regular, randomized lab audits with jail time or ruinous damages for perpetrators, backed by rigorous field-independent written standards for what evidence must be provided, all checked by outside organizations. Think financial audits but for scientists.
Obviously not only is academia not doing this, it's culturally nowhere close to considering the possibility of even thinking about starting. Academia has spent so many years engaging in such deep ideological purges that everyone who remains is at minimum sympathetic to ideas like "defund the police". The concept of policing themselves to the level accountancy does will never come up, not even during discussions of reform.
* If there's no clear way to get the raw data
* If it's unclear how the data was gathered
-> the results are most likely fabricated
One strange psychological effect I've seen, especially in medicine, is that some people spend a lot of time rigging data to get their own PhD, yet in the end absolutely trust anything published in the field. "The science nowadays is so good - medical scandals today are completely impossible"
There are many detailed discussions out there. Go to any skepticism website and search for terms like TOB adjustment, homogenization, "the blip", "the pause". I'd give you links but without fail HN commenters just engage in ad hominem attacks on any source given without bothering to read them. If you want more details and can't find any given this info, email me and I'll send you starter links.
tl;dr where does temperature data come from? Climatologists gather data from various sources and aggregate it into time series, which are then used for later research. That would be fine, except that along the way they do things that aren't considered legitimate in other fields, like:
• reporting "observations" in their "raw data" from weather stations that haven't existed for decades. These are actually software-generated guesses based on other stations, which themselves may not exist or may not sit in comparable surroundings (see the sketch after this list). This isn't rare; in some countries most of the data is attributed to stations that don't physically exist or have moved large distances, yet still show up in databases with readings at their old coordinates.
• adjusting their "raw data" in various ways and for various reasons. That is, what gets presented as observational data in climatology - like graphs of temperature - isn't actually raw thermometer readings, although it's usually presented as if it were. Modern temperature time series have dozens of adjustments, including many extremely questionable ones like homogenization (a form of spatial averaging).
• dropping confidence intervals, or even reporting temperatures that are simply the top of the CI around the real reading. This matters because QA on weather stations is frequently non-existent. In the UK over 80% of weather stations are WMO class 4 or 5, meaning they are junk tier with uncertainties of 2C and 5C respectively. Since the claimed warming is around 0.1C/decade, data this noisy is of no use for detecting it, but it gets used anyway. This is a global problem, not a UK-specific one.
• regularly rewriting temperature time series in ways that invalidate all prior published papers and claims based on that data, without retracting any of those papers.
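To make a couple of these mechanisms concrete, here is a toy sketch of (a) infilling a missing station's value from its neighbours by inverse-distance weighting and (b) a crude breakpoint shift of the kind homogenization performs. To be clear, this is my own illustration of the general techniques, not the actual algorithm of any real dataset; production pipelines (pairwise homogenization and so on) are far more elaborate. It just shows how a number in the "observations" column can be the output of software rather than a thermometer.

    from math import hypot

    def idw_infill(target_xy, neighbors):
        """Estimate a missing station's reading from neighbours that did report.
        neighbors: list of ((x, y), temperature) pairs."""
        num = den = 0.0
        for (x, y), temp in neighbors:
            w = 1.0 / max(hypot(x - target_xy[0], y - target_xy[1]), 1e-9)
            num += w * temp
            den += w
        return num / den                     # a weighted guess, not an observation

    def shift_after_breakpoint(series, break_idx):
        """Crude 'homogenization': shift the later segment so its mean matches the
        earlier one, erasing an apparent step change (e.g. from a station move)."""
        before, after = series[:break_idx], series[break_idx:]
        offset = sum(before) / len(before) - sum(after) / len(after)
        return before + [t + offset for t in after]

    print(idw_infill((0, 0), [((1, 0), 14.2), ((0, 2), 13.1), ((3, 3), 15.0)]))
    print(shift_after_breakpoint([10.1, 10.3, 10.2, 11.9, 12.1, 12.0], 3))

Whether any particular adjustment is justified is a separate argument; the point is simply that the published series is downstream of choices like these.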
Editing data points, making up readings from non-existent instruments, using garbage-tier instruments, ignoring confidence intervals, playing with the data until it comes good, and retroactively deciding your data was wrong without retracting the papers built on it are all behaviors that would be decried in most other fields.
Until academia can settle on universal standards for what it calls science, people's trust in it will continue to fall, because they won't be able to make any assumptions about what kind of methods the word science really means :(
> I'd give you links but without fail HN commenters just engage in ad hominem attacks on any source given without bothering to read them.
That's a red flag. Reads as evasive to me. Why are you bothered by internet randos disagreeing with you? If you have reputable sources, link them instead of wasting time. If the responses are truly ad hominem, that should only strengthen your argument in the eyes of anyone making an attempt to be objective.
I apologize if it seems uncharitable, but my default assumption in this sort of scenario is that you've been the subject of well-reasoned rebuttals and don't want to confront that reality.
> universal standards for what it calls science
> what kind of methods the word science really means
That's nonsensical. "Science" is a vague, very high level methodology. There's no single correct way to go about things other than to recursively apply evidence-driven approaches.
Some fields have broad applicability of course. For example something like the foundations of statistical methods will presumably remain the same no matter what task you apply them to. But what the results of such methods "mean" is going to vary widely.
No, I've never seen a well-reasoned rebuttal. The type of reply you get here on HN is always exactly like yours: an insistence on arguing the who instead of the what. Hence your requirement that any given sources be "reputable", without specifying what that means. Based on experience, it's guaranteed to exclude 100% of the sources that discuss the issue honestly. That game has played out here 1000 times and is boring. If you disagree, check a fact or two. Go find sources you think are reputable instead of expecting others to guess, read a few, and poke holes in them if you can. That would be interesting.
> "Science" is a vague, very high level methodology. There's no single correct way to go about things
It's really not. But this is the kind of semantic dispute I am warning of. If your response to discovering that academics engage in data fabrication is to say that's ok because science can mean anything, then science doesn't mean anything. And eventually the general public will learn that, and vote to defund it.
> the foundations of statistical methods will presumably remain the same no matter what task you apply them to.
Overfitted models are not only common in many fields; academics often try to justify them as OK to use. So I wouldn't assume statistical methods remain the same no matter what task you apply them to. After all, science is really vague, right? Who is to say what the foundations of statistics are?
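For anyone unsure what overfitting looks like in practice, here's a quick illustrative sketch (the data and model choices are mine, purely for illustration, not taken from any particular field): a high-degree polynomial drives training error down while typically doing much worse on held-out points.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 40)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)  # noisy signal

    train = np.arange(x.size) % 2 == 0   # fit on the even-indexed points
    test = ~train                        # hold out the odd-indexed points

    for degree in (3, 14):
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x)
        print(f"degree {degree:2d}: "
              f"train MSE {np.mean((pred[train] - y[train]) ** 2):.3f}, "
              f"held-out MSE {np.mean((pred[test] - y[test]) ** 2):.3f}")

A model that only looks good on the data it was fitted to isn't evidence of anything, which is the point.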