Hacker News

The ads started on January 2 and took 6 days to hit 98% statistical confidence.

No! You have to pick the duration first, wait that amount of time, and only then analyze how much confidence you achieved. With random variation, you can get a run of positive or negative results of any length, so you could just wait until you get the result you want. Sorry, but tons of people make this mistake with A/B tests.
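To see why this matters, here's a quick simulation sketch (pure Python, all numbers picked for illustration): run an A/A test where both arms have the identical true conversion rate, so any "significant" difference is a false positive by construction, then compare a fixed-horizon analysis against "stop as soon as any peek crosses 95%":

```python
import random

random.seed(42)

def z_stat(c1, n1, c2, n2):
    # Pooled two-proportion z statistic.
    p = (c1 + c2) / (n1 + n2)
    if p in (0.0, 1.0):
        return 0.0
    se = (p * (1 - p) * (1 / n1 + 1 / n2)) ** 0.5
    return (c1 / n1 - c2 / n2) / se

def aa_trial(p=0.05, n=4000, peek_every=100):
    # A/A test: both arms share the same true rate p, so a
    # "significant" result at |z| > 1.96 is a false positive.
    c1 = c2 = 0
    hit_on_a_peek = False
    for i in range(1, n + 1):
        c1 += random.random() < p
        c2 += random.random() < p
        if i % peek_every == 0 and abs(z_stat(c1, i, c2, i)) > 1.96:
            hit_on_a_peek = True
    return abs(z_stat(c1, n, c2, n)) > 1.96, hit_on_a_peek

results = [aa_trial() for _ in range(200)]
fixed_fp = sum(end for end, _ in results) / len(results)
peek_fp = sum(peek for _, peek in results) / len(results)
print("fixed-horizon false positive rate:", fixed_fp)
print("stop-at-first-significant-peek rate:", peek_fp)
```

The fixed-horizon rate stays near the nominal 5%, while the stop-when-it-looks-good rule fires several times more often, even though there is nothing to find.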



So since Optimizely basically lets the test run until it "reaches statistical significance" with their Chance to Beat Baseline number, would you say they are doing it wrong?

Methodology here: https://help.optimizely.com/hc/en-us/articles/200039895-Chan...

I've often run tests that seem to be in the 90%+ chance to beat baseline, but the graphs just didn't look like they were finished or had enough data, so I'd let it run a bit longer. Sure enough, big changes in percent lift, and also big changes in confidence dropping down below 90% (sometimes MUCH lower), and occasionally coming back up.


Weird... I always thought that if you are running a split test in parallel (all branches at the same time), then you can figure out the number of samples needed to compare the branches with statistical confidence. It makes sense to me: as the number of samples in each branch increases, the distribution of the observed conversion rate shifts from binomial to Gaussian by the central limit theorem, and that happens around 1000 samples with a reasonable conversion rate. Then you're just comparing Gaussians, centered around the mean conversion value, with a width that shrinks like the square root of the number of samples. Taking the difference between two Gaussians will give you the "chance to be different". Standard practice is to wait until one branch has a 95% chance to be better and then declare it the winner. This guards against false positives, which is usually what you are concerned about. False negatives don't matter that much when it comes to things like picking a name.
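The "difference of two Gaussians" calculation above can be sketched in a few lines; `chance_to_beat` is a made-up name, and this is the plain normal approximation, not any vendor's exact method:

```python
import math

def chance_to_beat(conv_a, n_a, conv_b, n_b):
    # Treat each observed rate as Gaussian with variance p(1-p)/n.
    # The difference of two Gaussians is Gaussian, so the "chance to
    # beat" is the mass of that difference distribution above zero.
    pa, pb = conv_a / n_a, conv_b / n_b
    var = pa * (1 - pa) / n_a + pb * (1 - pb) / n_b
    z = (pb - pa) / math.sqrt(var)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

# e.g. 100/1000 conversions on A vs 120/1000 on B
print(round(chance_to_beat(100, 1000, 120, 1000), 3))
```

Note this gives a valid confidence level only if you committed to the sample size up front; checking it repeatedly as data arrives is exactly the peeking problem discussed above.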



Thanks for the link to the blog post. It raised an important point worthy of inspection. I ran some numbers and "peeking" after the first 1000 trials does change the outcome. The chance that the outcome will reverse from declaring branch A the winner with 95% confidence to declaring branch B the winner with 95% confidence is rather small, less than 10%. However, if you lower your requirements to 80% confidence then the chance of the winner swapping increases to over 50%! For reference, I used the Wilson approximation for binomial distributions. I'm sure the Wald approximation fares worse.
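For anyone curious about the Wilson-vs-Wald distinction: here's a sketch of both intervals for a binomial proportion (standard textbook formulas, function names mine). The Wald interval is the naive one and can produce nonsense bounds at low counts, which is one reason it fares worse here:

```python
import math

def wald_interval(successes, n, z=1.96):
    # Wald: symmetric around the raw proportion; can escape [0, 1]
    # for small n or rates near 0 or 1.
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(successes, n, z=1.96):
    # Wilson score: recenters toward 1/2 and always stays in [0, 1].
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

print(wald_interval(2, 50))    # lower bound dips below zero
print(wilson_interval(2, 50))  # stays positive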


Interesting, this is a weakness of significance testing, in particular of its fixed-horizon parametric model. Using Bayesian inference you would be able to look early without messing up your results.
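A sketch of the Bayesian version the parent is describing, under a flat Beta(1, 1) prior (whether this fully immunizes you against peeking is itself debated, but this is the standard computation):

```python
import random

random.seed(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    # With a Beta(1, 1) prior, the posterior for each arm's true rate
    # is Beta(conversions + 1, misses + 1). Estimate P(rate_B > rate_A)
    # by drawing from both posteriors and counting wins.
    wins = 0
    for _ in range(draws):
        a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
        b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += b > a
    return wins / draws

print(prob_b_beats_a(100, 1000, 120, 1000))
```

The result is a direct posterior probability that B is better, rather than a p-value tied to a stopping rule.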


A classic mistake. Even Gregor Mendel may have committed it.


Interesting. As a thought experiment, what if I had run a sample size estimator beforehand, concluded that I needed 4,000 data points, and estimated it would take 6 days?

In this alternate reality, I began the experiment on January 2 and ended it on January 8. Would that render the invalid experiment valid?
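A rough version of that up-front estimate, for reference (the standard two-proportion power calculation; the 5% baseline and 20% lift are made-up inputs):

```python
import math

def sample_size_per_arm(base_rate, min_rel_lift, z_alpha=1.96, z_beta=0.84):
    # Normal-approximation sample size per arm to detect a relative
    # lift of min_rel_lift over base_rate, at two-sided alpha = 0.05
    # (z = 1.96) with 80% power (z = 0.84).
    p1 = base_rate
    p2 = base_rate * (1 + min_rel_lift)
    pbar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# e.g. 5% baseline conversion, hoping to detect a 20% relative lift
print(sample_size_per_arm(0.05, 0.20))
```

The key point of the thought experiment: once this number is fixed in advance, you stop when you hit it, regardless of what the interim numbers look like.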


Yeah, actually. The difference is that you're supposed to take a random sample, not a sample that looks nice :)



