This is probably the most annoying problem in my daily life (yeah I know, first-world problems). I have daily conversations with biologists where I've analysed some data and associated (say) a posterior probability with each condition in the model. They insist I give them p-values, or something they can present as though they were p-values. At the beginning of my PhD I complied. Then they'd throw out all the nuance in the data, put one asterisk for p < 0.05, two asterisks for p < 0.01, etc., and (this is the horrifying part) _believe_ that an asterisk indicates that something is true. They put stupid asterisks all over my beautiful plots and then think their arbitrary cutoffs mean something biologically meaningful. I die a little inside every time I see an asterisk on a plot.
Now I refuse to use p-values and deliberately construct analyses that are incompatible with Fisherian statistics. And rather than giving people raw numbers, I produce a massive document of interpretation. It takes a huge amount of time, but I'm hoping it will mean my publication record will contain significantly (ha!) fewer false results than most biologists'.
My favorite term for that behavior is "stargazing," because they're staring at the little asterisks instead of thinking about the meaning of the results.
I have been working on a guide for practicing scientists which attacks common statistical misconceptions:
It's on its way to publication, so you'll be able to hit your biologists upside the head with it in due time. (Although it'll be slim, so it won't hurt too much.)
"Once the two hypotheses have been defined, the first hypothesis is scarcely mentioned again -- attention focuses solely on the null hypothesis. It makes me laugh to write this, but it's true! The null hypothesis is accepted or rejected purely on the basis of how unexpected the data were to H0, not on how much better H1 predicted the data."
The story is so much simpler than that. People not familiar with science don't trust scientists because they want to stick to their own known story. Change is scary, and people generally don't like it unless it significantly improves their living conditions.
It doesn't matter how rigorous scientists are with their methodology, my 'new age' mother isn't going to stop believing in fairies in the trees because scientists say there's no proof. Or when she talks about higher and lower vibrations, she is puzzled when I ask her "vibrations of what? what is it that's vibrating?". This isn't important to her story, and is merely a detail that makes life less enjoyable, so she ignores it. A less obnoxious version of ICP's "scientists are ruining life by explaining things" during their 'how do magnets work' phase.
Saying that people don't trust scientists because of p values is a bit like saying people don't trust plumbers when they use blue plumber's tape instead of gray. People don't like plumbers because they don't show up on time and charge through the nose - things that change your own story.
That's a good point too, but not what I was trying to get at. I realize now I wasn't clear enough. I was talking more about people who recognize science as a good thing and tend to follow what is presented as scientific fact, but aren't specifically trained in it. The problem I see is that some scientists take data that points toward a conclusion, even when it isn't well supported or rigorous, and talk as if the conclusion is absolutely certain and proven. This degrades people's trust in scientists, and likely science in general. It's also a problem with the media, since they're the ones promoting the scientists who act this way.
As a biologist I would argue that we also cannot dwell too much on p values. The true test is to design follow-up experiments to test causal predictions from many of these borderline p value phenomena, for example: analyzing microarray data to find interesting GO categories. Whether it's p=.001 or p=.051 is less important than whether or not it provides a good clue to the underlying mechanistic biology, and there is no way to address that without designing specific follow-up experiments. That's why experimentalists don't dwell too much upon p values. They're, for the most part, a clue.
In my experience, experimentalists tend to dwell on p-values and especially on cutoffs in their molecular experimental data as much as they do in high throughput data.
Publishing is easy. It's just a bit easier for the people who are (perhaps unconsciously) cheating. I'm pretty confident my loudly-refuse-to-cheat strategy will prevent me perishing before my time.
Actually I have an ever increasing queue of collaboration requests. My reputation is (I am told) as someone who can rigorously get the story from the data, and who can solve problems. The people who don't want the rigour are the people I reject, not the other way around.
I was recently trying to explain Bayesian logic to a friend, and came up with the following analogy. I would be interested to hear feedback on it.
---
Imagine everyone in the USA gets sudden amnesia. We want to find out who the President is, but no one can remember.
A scientist comes up with a test to determine if someone is the President.
If they are the President, there is a 100% chance the test will say they are the President and a 0% chance the test will say they are not the President.
If they are not the President, there is a 99.999% chance the test will say they are not the President, and a 0.001% chance the test will falsely say they are the President.
Giving the test to the person sitting in the big chair in the Oval Office is useful, because it's already quite likely this person is the President. If the test is positive for Presidency, it's extremely likely that person is the president.
Giving the test to the 10 people nearest the oval office is useful, because it's fairly likely the President is one of these people. A positive result will indicate strongly that that person is the President, and if no-one in that group is actually the President, there's a 99.99% chance the test will say so.
Giving the test to the 1000 people in the White House is pretty useful, because it's pretty likely the President is in the White House, and if none of these people are the president, there's still a 99% chance the test will be correct. A positive result for any one person will indicate quite strongly that that person is the President.
But giving the test to everyone in America is not very useful at all, because it's very unlikely that any particular person is the President, and we can expect the test will give a positive result for around 3200 people. For any particular person in this group, it's much more likely they're not the President than they are.
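To make the arithmetic concrete, here's a minimal sketch of the Bayes calculation behind those numbers (assuming a population of ~320 million, and that within each pool the President is equally likely to be anyone; both assumptions are mine):

    # P(president | positive test), by Bayes' theorem.
    # Sensitivity 1.0 and false-positive rate 0.001% are taken from the test above.
    def posterior_president(prior, sens=1.0, fpr=0.00001):
        return (sens * prior) / (sens * prior + fpr * (1 - prior))

    pools = [(1, "person in the big chair"), (10, "nearest the Oval Office"),
             (1000, "in the White House"), (320_000_000, "everyone in the USA")]
    for size, label in pools:
        print(label, posterior_president(1 / size))
    # -> 1.0, 0.99991, 0.990, 0.0003 (about 1 in 3200)

    print(320_000_000 * 0.00001)  # expected false positives nationwide: 3200.0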
---
Is this a broadly correct, if non-rigorous, analogy? I realize most HNers will be much more familiar with this stuff than I am, I'm interested chiefly in whether or not I misled my friend.
Right. And the frequentist version is also useful.
If you test one person and the test is positive, then that person is the president (p=0.00001).
If you test a thousand people, and the test is positive for one of them, then that person is the president (p=0.01).
So you don't really need Bayesian logic to reason that you should test fewer people if you want a more significant result. (Note I'm not saying you don't need Bayes' Theorem, which everyone uses.)
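For anyone wondering where those p-values come from, a quick sketch of the arithmetic (mine, using the 0.001% false-positive rate from the analogy), under the null hypothesis that nobody tested is the President:

    fpr = 0.00001  # false-positive rate of the test

    # one person tested, one positive result:
    print(fpr)  # 1e-05

    # 1000 people tested, at least one positive result:
    print(1 - (1 - fpr) ** 1000)  # ~0.00995, i.e. roughly 0.01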
Edit: I think most people on HN get their knowledge of frequentist and bayesian statistics from XKCD #1132. That's sad.
> So you don't really need Bayesian logic to reason that you should test fewer people if you want a more significant result.
That's misleading by the use of the word "significant", which apparently means something different in frequentism than it does in normal speech. I certainly wouldn't use "significant" in that way as a non-frequentist; I would instead rephrase what you said as:
> So you don't really need Bayesian logic to reason that you should test fewer people if you want more confirmation bias in your result.
And that's a statement I can definitely get behind!
It can definitely be used misleadingly, but it's not too out of line with normal scientific usage of the term. The "significant" in "significant figures" is the same: if a number has "11 significant figures", it doesn't mean the 11th digit is significant in the sense of being important or having a big impact, just that the 11th digit is within the measurement precision (as propagated through any subsequent calculations).
That's a good point. Conversely, when you run a test and get a result which is "not statistically significant", that doesn't mean that the measured effect is not significant. For example, if I test a drug and say "it does not cause cancer (p=0.17)", that's wrong. Yes, the correlation with cancer is not statistically significant, but it is significant in the normal sense of the word.
The point is not the number of people that are tested, but the prior probability that the person tested is the president. The person wandering around the white house already has a decent probability of being president. Some guy found in the middle of the country wearing every day clothes, not so much.
>If you test one person and the the test is positive, then that person is the president (p=0.00001).
>If you test a thousand people, and the test is positive for one of them, then that person is the president (p=0.01).
I am frequently a frequentist, but this particular test doesn't make sense to me.
Without any prior belief, the person you tested is still no more likely to be the president than any of the other 3199 who would test positive.
If you test one person at random, the probability that they are the President given a positive result equals the probability of a true positive divided by the probability that the test gives a positive result (i.e. the sum of the probabilities of true and false positives). The probability of a true positive is the prior probability that the subject is the President times the conditional probability that the subject tests positive given that they are the President. If the subject is chosen at random from the population of the US, then the prior probability of this person being the President is about 1/314M, times a conditional probability of testing positive of 1. The probability of a false positive is the prior probability of a random person not being the President, times the conditional probability of a non-President testing positive; by hypothesis, this is (1 - 1/314M) * 0.00001. So the probability that a randomly chosen subject who tests positive is the President must be approximately (1/314M) / ((1/314M) + (1 - 1/314M) * 0.00001), which simplifies to 1/(1 + (314M - 1) * 0.00001), or about 1/3141.
The ~99.97% chance that the test result is a false positive for a subject chosen at random is a consequence of the prior probability of that subject being the President in the first place being over 3000 times lower than the probability of a positive test result. In other words, though the sensitivity of the test is perfect, the specificity of the test is insufficient to isolate a condition as rare as the Presidency. It has nothing to do with testing one person being inherently more decisive than testing more than one.
Only narrowing the pool of candidates in a way which will increase the prior probability that the subject is the President will improve the situation, not random decimation to a single test subject. The GP gave several correct examples such as limiting the test to those who were much more likely to be the President in the first place (eg. a person randomly found sitting in the President's chair at the Oval Office might have a prior probability of being the President of greater than 0.01, millions of times greater than that of a randomly selected person in the US).
It seems awkward that p-value does not account for the setting in which the test is performed.
That is, if I test 1000 people at a high school football game in Peoria, Illinois, and one of them comes out positive (p=0.01), there is a very different likelihood that I have found the true president than if I test 1000 people at an official dinner in the White House and find a match (also p=0.01). In fact, I think I'd be willing to wager more on the chance that the individual tested in the White House is actually the president than the random football fan in Peoria, even if only a single person in Peoria was tested (p=0.00001).
Of course, this is only an issue if p-value is being used to express relative confidence in a conclusion, which it shouldn't (?) be. Still, how does a frequentist account for a choice of venue like this?
The frequentist approach would be to design an experiment, determine rules for interpreting the data, and then consider the probability that the experiment leads to the correct conclusion.
In this case, the experiment is really "who is president?" and not "is person X president?" The "who is president?" question does not really have a null hypothesis, so it makes no sense to talk about p-values. Instead, we can talk about two different results: identified the president correctly, and identified the president incorrectly.
We then have to design the experiment in such a way that probabilities can be calculated. But this is not possible if you say the repeated experiment is testing a bunch of people to see if one is president--because the probability depends on who actually is president, and that's not known.
So you consider the experiment to be "the president goes about their business and ends up in some random place, we get amnesia, and then run the test." We can simulate this experiment because we can come up with some probability distribution for where the president is. We can't come up with a probability for who the president is, because that's unknown and either 0% or 100%--only Bayesians would let you do that.
Then you can choose the order of people to test to maximize the probability that the president will be identified correctly.
And then you say things like,
* The test had a 92% chance of succeeding.
* We only had to test 14 people, and in such cases, the test had a 98% chance of succeeding.
P-values are not comparable across different null hypotheses, they are a property of a particular sample given a null hypothesis. You can compare p-values only by making very strong assumptions that the samples come from the same data-generating process. In almost all cases this is unlikely.
A frequentist usually builds a more descriptive model by adding more variables (accounting for obvious factors that contribute to differences in observed frequencies) and increasing sample size to increase statistical power (reduce false negatives). This is not applicable in this case because there is only one president and the tests have low power. The test is bogus because you can't 'sample' and determine the probability of X == president effectively.
I think you're mistaken about what frequentist p-values mean. The definition you're using is the one we would like to have - the probability that X is president given the data. But what a p-value actually gives you is the chance of seeing the given data given that X is president. And if X is the president, we have p=0, since there are no false negatives. So a p-value is useless in this example, although better frequentist measures can handle it.
Less wrong, but still off. The p-value is the probability of committing a type-1 error (false positive: we declare X to be president when he really isn't) given a sample and a null hypothesis. P-values are not the 'chances' of seeing the given data; it is a test statistic that describes how reliable your probability estimates are for a given sample and null.
Actually I think we're still a little off.
The p-value is the probability of seeing a test statistic at least as extreme as the observed sample statistic, under the assumption that the null hypothesis is true. This can be restated in terms of tests and errors.
The p-value is the probability of committing a type-1 error for a statistical test, where the threshold of that test is chosen to be equal to the sample statistic. In more intuitive terms, it's the probability of a type-1 error under the null hypothesis for the most extreme test that the sample data is able to pass.
In your explanation, where you said "given a sample" you should say "given a sample size", to make it clear that the type-1 error probability is not conditional on the sample data. If you condition on the observed data, then the probability of passing a statistical test is going to be 1 or 0 depending on whether or not the data passes the test. It is the test itself, not the error probability, that depends on the sample data. Which is also something you should have specified.
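A minimal simulation of that definition might help (the setup -- 20 draws from a standard normal under H0, the sample mean as the statistic, and the observed value -- is my own invention):

    import random

    random.seed(1)
    n, reps = 20, 100_000
    observed = 0.45  # a hypothetical observed sample mean

    # Simulate the distribution of the test statistic under the null.
    null_means = [sum(random.gauss(0, 1) for _ in range(n)) / n
                  for _ in range(reps)]

    # One-sided p-value: P(statistic >= observed | H0 true).
    p_value = sum(m >= observed for m in null_means) / reps
    print(p_value)  # ~0.02: the most extreme test the observed data would pass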
This is the clearest answer I've read so far. The p-value is the minimum probability of ending up with a test statistic that lies in the rejection region (for a given null hypothesis and critical value).
Please re-read onetwofiveten's explanation. All I've done is paraphrase it. Alpha is just an arbitrary cutoff p-value corresponding to a threshold t-value for a given null hypothesis.
Furthermore, practicing biomedical researchers rarely, if ever, take a single study as "proof" of a hypothesis, no matter what the P-value is.
Replication and plausibility in the context of other studies are taken into account, however these additional parameters are difficult to reduce to numbers.
A recent blog post (link below) describes the actual practice of (good) biomedical research fairly well.
A big problem with Bayesian statistics (as I see it) is that it is not always possible to have any sense of what the prior probability is.
Say you are looking for genes that might influence the rate of occurrence of a particular disease. There might be genes that influence this rate, or there might not; it could be entirely environmental, or it could be entirely genetic, or something in between. In any case, you do genome-wide studies, and find that certain gene variants occur more often in your diseased population than in your control population. You apply frequentist statistics, using some corrections for multiple hypothesis testing, and get some kind of "significant" result. This gets published in Nature (you lucky thing!).
Are your conclusions correct? Do the genes you identified really modify the course of the disease you studied? Bayesian statistics won't give you the answer.
The only way to get the answer is to do experimental science, i.e. deliberately modify the gene(s) in question and show that your modifications change the occurrence or course of the disease.
Unfortunately, that is not always feasible, for either technical or ethical reasons, so we have to fall back on the poor cousin of experimental science that is population statistics.
Both frequentist and Bayesian decision making ultimately rest on arbitrary prior assumptions. In Bayesian statistics it is explicit. In frequentist stats it is less acknowledged, but still there. Where do your cutoff values for alpha and beta (p-value and power) come from? Same place the Bayesians get their priors.
If you are hypothesizing that genetics influences the incidence of a disease, without any prior knowledge as to whether there is any influence of genetics on a disease, how can you have a prior probability that there is a genetic influence on the disease in question?
- Most trivially, your prior distribution can assign equal probabilities to different outcomes, if you have no reason to do otherwise. Beta(1, 1) for modeling a coin's bias, for example, if you have no prior information about its bias (a minimal sketch follows this list).
- There are more advanced tools in Bayesian analysis such as Jeffreys prior (known as uninformative priors, look it up).
- As was mentioned in other responses the same "big problems" exist in every other statistical and mathematical modeling approach, namely that you have to make assumptions and your results are going to be crap if your assumptions are crap.
- Generally, Bayesian stats got a late start due to high computational resource costs, not some theoretical limitations. The issue with priors that gets repeated by philosophers and some statisticians does not stop the huge, monumental progress Bayesian statistics has had in a ton of applied fields, from computer science / machine learning all the way to economics and political science.
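Here's that first point as a tiny sketch, since conjugate updating makes it a one-liner (the 8-heads-out-of-10 data is hypothetical):

    # A Beta(1, 1) (uniform) prior on a coin's bias updates to
    # Beta(1 + heads, 1 + tails) after observing the flips.
    heads, tails = 8, 2
    a, b = 1 + heads, 1 + tails

    posterior_mean = a / (a + b)
    print(posterior_mean)  # 0.75 -- pulled toward 0.5 relative to the raw 8/10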
You said "A big problem with Bayesian statistics (as I see it) it that it not always possible to have any sense of what the prior probability is." This is a common complaint about Bayesian stats -- the choice of priors seems arbitrary. That's what I was responding to.
So, based on this data, having the gene increases your probability of having the disease some 11% (0.109/0.0979) or 1.1 percentage points (0.109 - 0.0979).
Or you might want to calculate some other figures from these. If you just want the ratio, then P(sick) does not have a bearing on it. You could perform some sensitivity analysis etc...
I am not a statistician either, apart from a few courses, but your argument starts with:
P(gene|healthy) = 0.010 and P(no_gene|healthy) = 0.990
P(gene|sick) = 0.011 and P(no_gene|sick) = 0.989
My point is that in many cases there is no justification for placing any number whatsoever on a prior hypothesis. You can't simply say that the prior probability of a particular gene being involved in a disease is 1%, or 0.00001% or 10%, or whatever.
Edit: I'm not saying that Bayesian statistics is without uses, it is very useful in epidemiology, for example. However it is not appropriate for determining molecular and genetic mechanisms.
P(gene|healthy), P(gene) and P(healthy) are something we can often measure.
P(healthy|gene) is something we can calculate with the Bayes formula from the above values.
More generic version:
We want to know P(model|data). That's what we always want. What is the probability of some model, based on this data that we measured.
But we only have P(model), P(data) and P(data|model). So we use the Bayes formula to get the answer to the interesting conditional probability. It needs all three inputs. We must estimate if we don't have some.
Just presenting P(data|model) and not constraining P(model) or P(data) in any way means that we can't say anything about P(model|data).
It's like if we start with x + a + b = 5.
If we don't know or estimate a and b, it's impossible to say anything about x. Bayes' formula is like saying then, solve it like this: x = 5 - a - b.
So, if you say there is no justification for placing any constraint on a or b, then that's just saying we clearly do not have enough data to say anything about x either. There is no way out of it.
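To make it concrete, a two-line sketch (the numbers are hypothetical, loosely echoing the gene example upthread):

    # Bayes' formula: P(model | data) = P(data | model) * P(model) / P(data).
    # Without estimates for P(model) and P(data), the left side is undetermined,
    # just like x in x = 5 - a - b.
    def posterior(p_data_given_model, p_model, p_data):
        return p_data_given_model * p_model / p_data

    print(posterior(0.011, 0.10, 0.0101))  # ~0.109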
> it is not always possible to have any sense of what the prior probability is
And in those cases, frequentist statistics is no better off, because you have no sense of what the sample space is.
However, there are cases where there is no well-defined sample space, but you can still assign reasonable priors; so Bayesian statistics covers a range of cases that is a superset of the range of cases that frequentist statistics covers. E. T. Jaynes goes into this in some detail in his book Probability Theory: The Logic of Science.
As I understand it, using Bayesian statistics, you can report the posterior probability given various reasonable priors. It seems useful to know how sensitive the conclusion is to the choice of prior. (With a strong result, it should make relatively little difference.)
>A scientist comes up with a test to determine if someone is the President.
It's a poor analogy, because it's not clear to people why such a test is "natural." It's not clear how your specific test could be broken in the peculiar way that it would have 99.999% chance to confirm that someone who isn't the president is indeed not the president.
And people would get caught up with what you mean by "If they are" and "If they are not," since it's not clear how you would know the error of your test without a real president around to identify.
False positives or false negatives are not at all intuitive to people who have never done experimental design. Most people would get stuck at percentages anyhow.
So just to get a handle on this stuff, the two problems you have if you do a test and only look at a low p are:
1) Unusual things do occur. If a million people do the same test, it's obvious they'll come up with some wrong values. It's less obvious that a similar number of wrong values will come if a million people do a million different tests with a similar small chance of bogus results.
2) The pattern of results may indeed be unusual, but not necessarily in the fashion you think it is. The data may possess a non-random pattern that has nothing to do with your particular hypothesis, yet a "this is not random" result may seem to say your hypothesis explains the data.
This has probably been addressed millions of times, but just to give a response to why this is a misleading comic:
If you were equally mocking of both Bayesian and Frequentist, the Bayesian would arguably come up with basically the same conclusion. In this scenario where the Frequentist ignores previous data based on human history and our understanding of the lifecycle of the sun based on physics, the Bayesian should do the same if we are fair. Then his prior distribution would likely be 50% belief that the sun would explode (Why favor either outcome? Thus the prior distribution is split down the middle between both outcomes.). With the new information given from the detector, his posterior would be 35/36 probability that the sun has exploded.
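For anyone who wants to check that 35/36 (a sketch; in the comic the detector lies only on a double six, i.e. with probability 1/36):

    p_lie = 1 / 36
    prior = 0.5  # the deliberately naive 50% prior discussed above

    # P("yes" | exploded) = 1 - p_lie; P("yes" | not exploded) = p_lie
    posterior = ((1 - p_lie) * prior
                 / ((1 - p_lie) * prior + p_lie * (1 - prior)))
    print(posterior, 35 / 36)  # both ~0.9722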
Maybe the Bayesian approach provides a more explicit way of incorporating previous information, and without that, it's cause for some misuse with a Frequentist approach, but that doesn't mean Frequentists need be fundamentally ignorant in their approach.
The comic's good for a laugh. Just don't take it too seriously as a criticism.
That's the most famous comparison between frequentist and bayesian statistics, and it's a shame because the frequentist interpretation depicted in the comic is really a straw man argument. See: http://stats.stackexchange.com/questions/43339/whats-wrong-w...
I'm not sure it's a straw man, there seems to be plenty of examples of people publishing results based on p < 0.05 findings even when it wasn't justified... just like the person mentioned in the Nature article almost did.
In a world where all statisticians were Bayesian, you'd still see results like this published. It would simply read differently: "We saw something that seems pretty unlikely, care to confirm?"
Perhaps, or perhaps the scientist would try to replicate the results themselves first (as the person in the article did).
But even if they published them with the verbiage you mentioned, that would be a better world because reporters would be less likely to turn it into a story with a headline like "Scientists say ...". The prevalence of articles like this erode trust in science, and fill the public's head with misconceptions.
I bet the p-values on that site would be very high if properly calculated. Generally, if you are reporting correlations between a large number of variables, the p-values shoot through the roof.
Of course, TONS of people forget this and publish a p-value as if those two variables are the only ones under consideration. Which is just sad.
I think we're talking about different experiments. I'm not talking about, "here are the correlation statistics for 100 variables", I'm talking about, "I tested 100 variables, and these are the five pairs which are most strongly correlated."
Woah, why didn't I know about this site? Awesome! Would be great if you'd include links to your original sources. Even though you don't include P-values in the graphics, I would still love to see them on the detail page, along with the Chi-square. Your tagline should be "correlation does not imply causation" ;-)
P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a "sterile intellectual rake" who ravishes science but leaves it with no progeny. One researcher suggested rechristening the methodology "statistical hypothesis inference testing", presumably for the acronym it would yield.
Statistics are descriptive, not predictive--period.
I'm continually surprised at how many people either don't know, or don't internalize, that. Look at how often "risk factors"--which are a descriptive concept--are converted to advice--which is predictive.
Doing so in the absence of a causal hypothesis is a basic violation of "correlation does not equal causation."
If you want to construct a scientific theory you must be able to articulate some predictive tests, and that means you must hypothesize a causal mechanism.
Yet the only reason we are interested in statistics is for the predictive aspect. This is exactly where hypothesis testing goes wrong: as much as you may claim that you're doing something purely descriptive, the whole point of the exercise is to make decisions. Hypothesis tests simply don't give us the right information to base a decision on, but in practice people still do make decisions based on them in a fundamentally incorrect way.
That's because your short statement is not really representative of good statistical practice. For example, people spend a lot of time researching http://en.wikipedia.org/wiki/Generalization_error, and models like http://en.wikipedia.org/wiki/Probably_approximately_correct_... worry a lot about things like VC dimension exactly because they characterize the behavior of statistical models to unseen data. Or maybe you don't think of prediction as "behavior under unseen data"?
Which is why the alt text on that particular comic is so key.
Correlation suggests that if you're looking for causation, it might be somewhere over here. It doesn't insist that the two are the same, but if you're looking for clues, it's a hell of a dowsing rod.
Reliable correlation is all you need to make predictions. If you understand causation, you can be more confident about the reliability (or lack thereof) of the correlation under changing conditions.
The root of this problem is that most data sets in psychology, anthropology, and epidemiology are not as large in terms of sample size as what computer scientists and electrical engineers encounter. p-values are a surrogate for explicitly describing the data using probability distributions or as random processes. In essence, you sacrifice granularity for simplicity. If you look at the original works of Fisher, etc. and their widespread utility, a large part of early statistics was intended for 'practical statisticians' who seldom encounter data sets that are large in terms of sample size. As someone who works in electrical engineering/computer science, I've never used the p-value because:
1. The field, in general, demands far more mathematical rigor when dealing with statistics.
2. The demand for mathematical rigor is justified because most data sets we deal with are many orders of magnitude larger than what psychologists and others encounter. So predictions based on limit theorems, etc. are often testable.
A little late in coming (blame the Southeast U.S.'s snowstorm), but I work in a very stats heavy field and interact with a number of CS types because I work on computational models.
I've gotten a fair amount of "I just need a p-value" requests, and some assumptions that everything can be hit with a t-test or an ANOVA and it'll all work out fine.
I'd love to see a comprehensive article that shows what a research paper's analysis would look like using Bayesian methods. I've seen plenty of general hints about Bayesian methods, discussion of priors, and similar, but I haven't found any specific guide on how to apply those methods to the types of research papers that would traditionally use a null hypothesis significance test with a p value.
The latter will also give you more details on how to approach classical, frequentist tests and summary statistics with their Bayesian equivalents.
Honestly I would say get both books as they're cheap and provide different insights. You only need to read a few chapters of each to see how you approach basic experiments from a Bayesian perspective.
There's a number of tutorials scattered throughout the International Journal of Epidemiology, Epidemiology and the American Journal of Epidemiology, as well as good "exemplar" articles either using Bayesian methods, or using both approaches.
As the article reports, "Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. 'P-hacking,' says Simonsohn, 'is trying multiple things until you get the desired result' — even unconsciously."
Simonsohn has a whole website about "p-hacking" and how to detect it.
He and his colleagues are concerned about making scientific papers more reliable. You can use the p-curve software on that site for your own investigations into p values found in published research.
Many of the interesting issues brought up by the comments on the article kindly submitted here become much more clear after reading Simonsohn's various articles on evaluating replication results, with more specific tips on that issue.
"Abstract: "When does a replication attempt fail? The most common standard is: when it obtains p>.05. I begin here by evaluating this standard in the context of three published replication attempts, involving investigations of the embodiment of morality, the endowment effect, and weather effects on life satisfaction, concluding the standard has unacceptable problems. I then describe similarly unacceptable problems associated with standards that rely on effect-size comparisons between original and replication results. Finally, I propose a new standard: Replication attempts fail when their results indicate that the effect, if it exists at all, is too small to have been detected by the original study. This new standard (1) circumvents the problems associated with existing standards, (2) arrives at intuitively compelling interpretations of existing replication results, and (3) suggests a simple sample size requirement for replication attempts: 2.5 times the original sample."
I don't understand how they got to the number 71% for a 0.05 p-value.
A 0.05 p-value means that there is a 5% probability that (for a t-test, as an example) a difference in the averages of two sequences (the statistic) arose by chance and not because of a difference in the means of their underlying normal distributions.
I assume that the "toss-up" means that there is no difference in the means in reality (so the null hypothesis is true). Am I understanding it correctly? Shouldn't the probability of getting a p-value < 0.05 in this case be in fact less than 5%, and not 29%?
No, you've got it wrong. A p-value of .05 means that, given a true null hypothesis, you have a 5% chance of getting this result anyway. That is, if there is no difference in the population means, then you still have a 5% chance of finding a sample that contains this difference anyway.
Do you see the difference? The p-value doesn't tell you anything about the population, it only gives you information about your sample -- a confusion that this article was pinpointing.
So what the 29% is telling you is that the chance of the population factor actually being µ = 0, even though you found a statistically significant result (p = 0.05), can be 29% or more. Whereas given a population factor of µ = 0, the chance of getting a significant result (p = 0.05) is 5%.
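If I remember the article right, figures like 29% (and hence the 71%) come from the Sellke-Berger lower bound on the Bayes factor for the null, combined with "toss-up" (1:1) prior odds. A sketch of that calculation (my reconstruction, not something spelled out in the article):

    import math

    for p in (0.05, 0.01):
        bf = -math.e * p * math.log(p)  # minimum Bayes factor for the null
        posterior_null = bf / (1 + bf)  # posterior P(null) with 1:1 prior odds
        print(p, round(posterior_null, 2))
    # 0.05 -> 0.29 (so at most a 71% chance the effect is real)
    # 0.01 -> 0.11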
Most of the objections here and in the article are not inherent problems with frequentist p-values.
First, the reported p-value might be wrong. E.g. basing it on assumptions of normality when the data is non-normal. However modern non-parametric approaches like the bootstrap can avoid this issue.
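For instance, a minimal percentile-bootstrap sketch (toy skewed data, my own example):

    import random

    random.seed(0)
    data = [random.expovariate(1.0) for _ in range(50)]  # decidedly non-normal

    def mean(xs):
        return sum(xs) / len(xs)

    # Resample with replacement, collect the statistic, keep the central 95%.
    boot = sorted(mean(random.choices(data, k=len(data))) for _ in range(10_000))
    print(mean(data), (boot[249], boot[9749]))  # estimate and 95% CI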
Second, testing multiple hypotheses. If you test 10 hypotheses then you cannot reject the null (that all 10 null hypotheses hold) simply because one single hypothesis is rejected in isolation. But this is well known, and failing to account for it is an issue with the researcher, not with frequentist statistics. I actually think that the main practical difference between Bayesian and Frequentist statistics is whether accounting for the issue of multiple hypotheses is done formally or informally.
The article doesn't bash the p-value as a statistical test specifically, more its use and interpretation by scientists over the years.
You're absolutely correct about using non-parametric tests, and more scientists should be using them. The normality assumption is flat out laughable when using real-world data most of the time.
You're also correct about multiple hypothesis testing. Accounting for familywise error (e.g., Holm adjustments) can help keep your p-value reporting honest.
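The Holm step-down adjustment is simple enough to sketch in a few lines (the example p-values are made up):

    # Compare the i-th smallest p-value against alpha / (m - i); stop rejecting
    # at the first failure.
    def holm_reject(pvals, alpha=0.05):
        m = len(pvals)
        order = sorted(range(m), key=lambda i: pvals[i])
        reject = [False] * m
        for rank, i in enumerate(order):
            if pvals[i] > alpha / (m - rank):
                break
            reject[i] = True
        return reject

    print(holm_reject([0.001, 0.01, 0.04, 0.30]))  # [True, True, False, False]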
That doesn't negate the underlying problem, though. A p-value is simply an indication, nothing more. The p-value never promised to be more than that. The issue isn't in the p-value's construction, the issue lies in its misuse and how easily it can be abused in statistical reporting (see: p-hacking).
The p-value as a test statistic is perfectly honest in my opinion. But like many other statistical methods, it comes with its own set of baggage that I feel gets conveniently glossed over more often than it should.
I fully agree with the criticisms of p-values, but what are the best alternatives for analyzing and comparing data? Most of the time, scientists have to compare the outcome of treatment 1 versus treatment 2; how should they do it "properly"?
Effect measures. Don't just report your p-value, report the actual effect measure, and a measure of uncertainty around it, be it a frequentist confidence interval, Bayesian posterior distribution, etc.
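Something like this, for instance (toy numbers of my own; with samples this small you'd use a t quantile rather than 1.96, but it shows the shape of the report):

    import statistics as st

    treat = [5.1, 6.0, 5.8, 6.3, 5.5, 6.1, 5.9, 6.4]
    ctrl = [4.8, 5.2, 5.0, 5.6, 4.9, 5.3, 5.1, 5.4]

    # Report the effect itself, with uncertainty around it.
    diff = st.mean(treat) - st.mean(ctrl)
    se = (st.variance(treat) / len(treat) + st.variance(ctrl) / len(ctrl)) ** 0.5
    print(f"effect: {diff:.2f}, 95% CI: ({diff - 1.96*se:.2f}, {diff + 1.96*se:.2f})")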
Agreed. A good reference is Geoff Cumming's "Understanding the new statistics," although its focus on analysis in Excel may put off the typical HN audience.
Effect sizes and confidence intervals are much more useful than a p value or two.
> In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.
That's what is supposed to happen, though, right? You publish your findings. Others try to reproduce. They publish THEIR findings, etc. etc. If most published findings are false, it sounds like the process is working as designed.
Yes and no. Yes, the media sucks, but no, most scientific papers are not replicated -- or if they are, it's by a hapless grad student using the results in their research, only to find they don't hold up. The hapless grad student usually doesn't get to publish this, because negative results are boring and not usually published in prestigious journals.
Most scientists have better things to do than replicate previous findings, unless that previous finding directly bears on their own work.
Unfortunately it's not just the nightly news one needs to be skeptical of, it's also (amongst many others) your doctor—who probably isn't particularly well educated w.r.t. statistics, but who will nonetheless give advice and prescribe treatments based on them.
This is a great -- and scary -- point. These statistical fallacies and misunderstandings are so deeply ingrained into our scientific and medical systems that it's hard to see how and when they will be removed. I can attest from personal experience that many scientists 1) don't understand these statistical tests, 2) don't care to find out, and 3) don't think there's a problem.
a p-value of 0.05 or even 0.01 is stupidly high. it only takes a little thought experiment about what that means in reality to realise how permissive it is and you can find demonstrations of this without going particularly far, looking very hard or being especially well educated...
consider the wikipedia example with heads vs. tails.
the idea that 5 coin tosses can produce a p-value < 0.05 that 'demonstrates' that the coin is biased towards heads is intuitively 'obviously wrong'. even if we take it to 10 coin tosses (the p-value you get is 0.001 - which looks really strong if we accept that 0.01 is acceptable) it clashes with my own ideals for what statistical significance should mean. this is in a loose way a proof by contradiction that p-values of 0.05 or 0.01 do not have utility (at least for these kinds of small n).
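to make those numbers concrete (a quick sketch):

    # one-tailed test of "biased towards heads": p = P(all heads | fair coin)
    print(0.5 ** 5)   # 0.03125 -- already "significant" at 0.05
    print(0.5 ** 10)  # ~0.00098 -- the ~0.001 figure above

    # two-tailed test of "biased at all" doubles it (all tails counts too):
    print(2 * 0.5 ** 5)  # 0.0625 -- no longer significant at 0.05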
aside from that, consider running the experiment 5 times or 20 times. what is the expected number of false positives? is that significant?
it also bothers me how connected the value itself is to the formulation of the problem. if we analyse the same situation with an identical test but a different formulation of the problem, the values differ.
why is five heads in a row less significant as a result when the test is whether a coin is biased at all, rather than a test that it is biased towards heads only? sure, i understand the probability involved - that we have all these potential coins biased towards tails that mean nothing in the first case - but there is something very deeply wrong with that.
shouldn't this be the other way around? if 5 consecutive heads is good evidence that a coin is biased towards heads, isn't it equally good evidence that it is biased at all? classical logic says that it is because being biased towards heads is a subset of being biased in either direction. the truth is that it really is equally good evidence - i challenge someone to explain why it is not! ( actually i kinda want to be wrong about that because i might learn something new then :) )
probability is counter-intuitive and useless for the kinds of small n usually used in experiments - the intuition about it recovers when we deal with sensible n - numbers like 1000 or 10000 - but these are still small n really if you need to scale up, or be confident that your result is correct. even at 100 samples it's obvious that our idealisation of percentage and what happens in reality do not marry up neatly...
to make a very crude software analogy what about those 1 in 10,000 bugs? they are still a very real problem if you have millions of customers...
or - IMO even 10,000 is an exceedingly small n to try and draw robust conclusions from.
0.05 == 5% == 1/20. If you flip a coin 5 times and get heads every time, do you intuitively feel that there is more than 1-in-20 odds that the coin is fair?
You should really get used to the idea that stating a different problem will give you a different answer. You need to be very careful when asking a question, or your answer might not mean what you think it means.
Bayes to the rescue! In this case, most people's prior, that is, the unconditional probability that a given coin is biased, is probably much less than 0.5 - after all, almost all coins are supposed to be nonbiased, and biased coins must be specially manufactured. In light of this prior, after five heads in a row, the posterior probability that a randomly chosen coin is biased is much less than 0.95. On the other hand if your prior really is 0.5, for instance if you have two coins and know one is biased but not which one, then, after five heads, you really should deduce that the probability of you having picked the biased one is high (it depends on how biased exactly the coin is!)
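A sketch of the two-coin version (the 0.9 bias is my arbitrary choice; as noted, the answer depends on it):

    # One fair coin, one biased coin (P(heads) = 0.9), prior 0.5 of holding either.
    def p_biased(heads_in_a_row, bias=0.9, prior=0.5):
        lb = bias ** heads_in_a_row  # likelihood under "biased"
        lf = 0.5 ** heads_in_a_row   # likelihood under "fair"
        return lb * prior / (lb * prior + lf * (1 - prior))

    print(p_biased(5))  # ~0.95: five heads makes "biased" quite likely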
maybe my presentation was unclear. i seem to have given the impression that i am really quite dim...
> You should really get used to the idea that stating a different problem will give you a different answer
this has been entirely normal and expected for as long as i can remember... what i'm saying is that you can analyse the same data in two different ways and reach differing conclusions because of the nature of the p-value (vs. the nature of, well... nature)
what i was mainly trying to get across is that a coin being biased towards heads /logically implies that/ it is biased. so the idea that 5 heads in a row is less evidence of a coin being biased than it is of a coin being biased towards heads is not only counter-intuitive but in disagreement with a much stronger and more intuitive form of reasoning.
the fact that the p-values are different in these cases leads me to expect that p-values on their own are not a good indicator of strength of evidence without a lot more context - and /really/ understanding what that context is and means - in which case why use the value at all? nobody else is likely to interpret it correctly unless you lay it out that way which then negates the supposed utility of the p-value...
and yes, i intuitively consider 5 heads in a row to be unspectacular for a fair coin certainly not a 19/20 chance that it is biased (maybe i am very, very wrong though).
Well, try it a few times and see if you can convince yourself :)
The evidence is different because there are different outcomes. For the two-tailed test the possible outcomes are: biased toward heads or tails, or not biased at all. For the one-tailed experiment, the outcomes are: biased toward heads, or not. Getting 5 tails in a row would be evidence in favor of the coin being biased, but not of it being biased toward heads.
Think of it this way: the two-tailed test is running 2 experiments at the same time (one for heads and one for tails) with the option of picking the one that gives you better results. So obviously the standard for significance has to be higher, because you're cherry-picking results. https://xkcd.com/882/