A/B Testing Rigorously (without losing your job)

btilly · on Jan 24, 2013

Note, I expect this to be the first in a series. It probably won't wind up laid out exactly as I expect, but here is my tentative plan:

1. Rigorous (this one) - How to use frequentist techniques to define a very rigorous statistical procedure that - no matter how small the bias in the test - can give correct answers with very high probability (and, of course, support multiple looks at your statistics).

2. Fixed data - Your business has an upper limit on how much data it realistically can collect. How to design a statistical procedure that makes reasonably good decisions while remaining within that limit.

3. Simple test - How to use Bayes' Theorem to design very simple and straightforward test criteria, where every A/B test has a known cost up front.

4. Bayesian statistics - How a Bayesian approach compares to the previous ones. Where the sticking points are, what its advantages are.

5. Bayesian optimization - How a Bayesian approach can let you turn the problem of A/B testing into a pure optimization problem where your decision to stop is based on whether the cost of continuing the test exceeds your expected return from continuing it over some time horizon.

We'll see how far I get. As it goes on, they become tricker to write at the level that I'm aiming for. And it may be that people will be interested in something else. But I thought that the first two were important. If for no other reason than to give me something definitive to point people at who read Evan Miller's article and then were telling people to run A/B tests in a way that made no business sense.

woah · on Jan 25, 2013

I'm a designer, and I value AB tests, but the math required to make sure I can trust my results is not what I want to focus on. There should be a startup that offers a some sort of framework, instruction, and automation for this stuff. I would use it.

btilly · on Jan 25, 2013

There are a number of startups that do exactly that. However there are some complex trade-offs that need to be made under the hood, and businesses whose goal is to make things simple, naturally tend to shy away from educating their customers on those issues.

loup-vaillant · on Jan 24, 2013

From your article:

> We will follow Evan's lead and use a frequentist approach.

I though: "how about using the correct approach instead?", followed by a helpful comment about how you are a Bad Person¹. Hem.

Then I noticed that the way you talk about p-value doesn't sound very frequenty. And then your comment. I eagerly await for the rest of the sequence –err, series².

[1]: https://xkcd.com/386/

[2]: Come to think of it, this does looks like LessWrong material to me.

btilly · on Jan 25, 2013

I suspect that you may like me substantially less when I get to discussing the trade-offs that are inherent in Bayesian statistics.

I know how wonderful it can be to feel that you have realized The Truth, but the frequentist school of statistics does not remain in existence just because statisticians are too stupid to recognize the obvious superiority of Bayesian methods.

loup-vaillant · on Jan 30, 2013

Having read chapter 2 of Probability Theory: the Logic of Science by E.T. Jaynes, my probability for the stupidity hypothesis went way, way up. There's probably a heavy status-quo bias at work too.

I mean, many frequentist methods directly contradict Cox theorems! That should convince anyone they belong to History, not Science, doesn't it?

loup-vaillant · on Jan 31, 2013

I know the above comment sounds extremist, but bear in mind that its truth value has nothing to do with that. I mean for instance, the only reason "2 + 2 = 4, and anyone who believe otherwise lacks either a brain or some crucial information" does not sound extremist is because everyone actually agrees with it.

So. Who in her right mind would reject the three assumptions behind Cox's theorem, and for what reason? Assuming we don't reject those axioms, why should we not reject Frequentism at once?

https://en.wikipedia.org/wiki/Cox%27s_theorem

_dps · on Jan 25, 2013

> Then I noticed that the way you talk about p-value doesn't sound very frequenty.

I think a lot of people improperly understand the Frequentist/Bayesian divide; both look at likelihood functions but the Frequentists assert, as a matter of experimental interpretation, that the likelihood function should depend only on the observed data while the Bayesians are willing to admit "out-of-experiment" modifications to the likelihoods implied by the data (i.e. the "prior").

Frequentist arguments correctly point out that the prior is, from the isolated standpoint of the experimental observations, purely subjective and non-falsifiable. Bayesian arguments correctly point out that experiments are designed by humans in context, and that the context may imply a prior distribution that is not evident within the observations themselves. Neither is incorrect, but they answer different questions. Which question is the right one to answer is context-dependent.

As a concrete example, consider an A/B test where A and B perform very similarly. Merely seeing A's performance against B does not carry the contextual information that A has consecutively beaten one hundred previous non-B competitors that sort-of-look-like B. The Frequentist will correctly argue that, if you don't explicitly model the A/B test generation process (as opposed to just the A/B test itself), then all the data can tell you based purely on the laws of probability is that A and B are likely to be pretty similar. The trouble faced by the frequentist is that it's pretty easy to model an individual A/B test for, say, binary outcomes, but it's very hard to explicitly model the generation of A/B test competitors. So the Frequentist says "all I can tell you and still be strictly objective is that A and B look similar".

The Bayesian, rather than explicitly modeling the generation of tests as the Frequentist would like, wraps up that contextual knowledge into a subjective prior distribution that heavily weights A over B simply because A has won so many times before. The Bayesian's advantage is that the previously-blocking part of the probability modeling problem is punted into a subjective, experimenter-designed, input. This is a tradeoff, not an increase of correctness! It will "do the right thing" assuming the new B looks a lot like the old Bs, but will do the wrong thing if the new B is dramatically different and maybe looks a lot more like A with small tweaks.

Bayesianism is not a universal solution to probability modeling problems, and Frequentism is not an obsolete unhip method.

darkxanthos · on Jan 25, 2013

Thank you for adding an objective comment about this whole controversy. I love the ideas present in a Bayesian approach but there seems to be a lot of value in Frequentist statistics as well. I'm glad to see I'm not crazy for thinking the two can find a balanced existence.

jessaustin · on Jan 25, 2013

I'm very encouraged by the provided prospectus, since it implies btilly is one of the chosen few who not only know both frequentist and Bayesian techniques but also know when to use each and when not to e.g. give a Bayesian interpretation to a frequentist technique.

btilly · on Jan 25, 2013

Thanks for the kind words, but I'm winging this more than I'd like to admit. (Though winging it as carefully as I can.)

But if I finish, I'll probably have learned a lot about when to use each. Hopefully because I figured it out correctly, and not because I said something egregiously wrong and got corrected by someone who knew better!

pjungwir · on Jan 25, 2013

> And it may be that people will be interested in something else.

Personally I'm eagerly awaiting #5: treating the question of whether to continue a test as an optimization problem.

mwexler · on Jan 25, 2013

Will 5.1 or 6 discuss some aspects of the Multi Armed Bandit approach, ala http://analytics.blogspot.com/2013/01/multi-armed-bandit-exp... ? Looking forward to the whole series!

btilly · on Jan 25, 2013

I was not planning to go into MAB approaches because that quickly gets into material that I don't know how to easily explain at the level that I am trying to reach here. And if I did go into that material, it would be an extremely idiosyncratic take. But if you're interested in an overview of my general thoughts on MAB approaches, http://bentilly.blogspot.com/2012/09/ab-testing-vs-mab-algor... may be of interest.

tel · on Jan 25, 2013

I'm excited! I've been looking forward to this for a while from your comments. This first post is written at a great level and answers a very practical question that few people realize even has answers to.

I'll be waiting eagerly to read the rest.

Cheers!

gburt · on Jan 25, 2013

Thank you for this. You are awesome.

gburt · on Jan 25, 2013

I am strongly of the opinion that even if you got a non-significant result doing your A/B test, assuming we're testing something of no cost to change (like a web design that has already been deployed for the sake of the test), your point estimate for the effect is the "best data you have" and you should act on it.

To provide an example, you run a test and find Page 1 outperformed Page 2 by a factor of 1.1, but that was non-significant at your desired significance level (say, p=0.05 and power = 0.8). You should deploy page 1 instead of page 2, assuming there are no other costs associated with deploying page 1 instead of page 2, because your BEST GUESS RIGHT NOW is that page 1 is 1.1x better than page 2.

grzaks · on Jan 25, 2013

But if your 1.1x result is not confident in reality it could perform 1.1x worst. In my opinion in this situation you're just guessing.

rieter · on Jan 29, 2013

Yes, it is a guess that might be wrong, but it is a best guess you have. Staying on the old page is also a guess, which might be wrong, possibly with a larger likelihood.

aaronjg · on Jan 24, 2013

That seems to be a pretty sound approach, compared to some of the stuff about multiarmed bandits that shows up here some times. And I certainly expect Noel Welsh to chime in as well.

There are two schools of thought about the approaches to sequential testing, the Bayesian approach lead by Anscombe, and the frequentist by Armitage. I talked a bit about this and outlined Anscombe's approach here [1]. And it is great to see such a nice write up of the frequentist approach and the tables of the stopping criteria

[1] http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-te...

btilly · on Jan 24, 2013

Thanks.

If all goes perfectly, I will discuss more ways to think about the problem than just those two, and try to show some connections that may surprise people. But, judging by your nice article, I doubt that I'll prove to have anything to say that you don't already know. :-)

mwexler · on Jan 25, 2013

You imply that the MAB approach has problems, or at least is not as "sound". Care to state what those problems or issues are?

aaronjg · on Jan 25, 2013

I am mostly critical of claims like '20 lines of code that will beat A/B Testing Every Time.' Multi armed bandits are also not as useful for inference as the frequentist methods that Ben presents in his posts.

mistercow · on Jan 24, 2013

If you do an A/B test and get a result that leans one way but is statistically insignificant, then it seems to me you might as well just go with that answer. No, you don't have good reason to believe that it's better than the alternative, but you do have good reason to believe that the choice is harmless.

tessro · on Jan 25, 2013

> you do have good reason to believe that the choice is harmless.

The issue you will run into here is that 95% confidence means that you will only have a false positive 5% of the time. It does not mean a neutral finding is 95% likely to be neutral. The lever that controls that is statistical power, which is oft-ignored in conversations about A/B testing. Most statisticians use 80% power, which means a full 20% of neutral findings were false negatives.

mistercow · on Jan 25, 2013

That is very true. However, you do have much better reason to believe that the option is neutral than you have to believe that the option is beneficial. In an example like the one given in the article, you also likely have enough statistical power to be reasonably confident that the option is close to neutral, so if you're making a negative decision, it's probably not strongly negative.

tessro · on Jan 25, 2013

Right. And you can also ratchet up the statistical power you want, at the cost of increased sample size requirements.

gburt · on Jan 25, 2013

Which is still better than 50%. After a split test, your point estimate, whether significant or not, is still your best guess of the effect.

darkxanthos · on Jan 25, 2013

This doesn't seem correct to me. The null hypothesis is that both treatments are the same. If you can't find evidence to reject the null hypothesis why would you change it?

From experience I've learned this puts you back in the gut-based guessing game those employing these methods are trying to escape.

btilly · on Jan 24, 2013

My planned second article will be entirely on that theme. :-)

ArnoVanLumig · on Jan 25, 2013

One article I've found helpful with A/B testing is [1] which also describes some common statistical problems (e.g. Simpson's paradox as it applies to A/B testing) based on case studies. It's written as a research paper by some folks at Microsoft, but it's definitely very readable.

[1]: http://www.exp-platform.com/Documents/2009-ExPpitfalls.pdf

majormajor · on Jan 24, 2013

My stats knowledge is fairly rusty, so I got a bit lost midway through, but here's a side question I've been wondering about, especially given that the linked Miller article talks about the null hypothesis being that the two are equal: how does that null hypothesis fit into these tests? I.e., you have to choose a button, so does it even make sense when the hypothetical employee in that example at the start of the article says "we didn't get an answer"? A .18 p-value isn't great, but I imagine he'd still recommend the green one -- there's much less reason to think the other button was better.

Why not do a one-sided test where the null hypothesis is "A is at least as good as B" and the test hypothesis is "B is better than A"? It seems like you'd be gaining some more power to detect B being better, without losing much since A's safe to choose regardless of if they're about the same or if A's actually better?

btilly · on Jan 24, 2013

The reason is that statistics is done with numbers. I do not know how to take a statement like "A is at least as good as B" and tell you the probability that after 15 coin flips, A is 3 ahead of B. I do know how to do that calculation under the assumption that they are exactly equal. Or under the assumption that A comes up 51% of the time, B 49% of the time.

The point of the null hypothesis is that it is an absolute worst case. So whatever the real difference, it can't be harder to detect than the null hypothesis. Therefore if we make very few mistakes on the null hypothesis, then we're confident that we'll make very few mistakes, no matter what.

And finally you're right that going with the green button is better than just tossing up your hands. But how to decide how much better is a surprisingly complicated question, and there is no simple answer about how to quantify it. (Remember, statistics has to be done with numbers, so quantifying the answer matters.)

majormajor · on Jan 24, 2013

Wouldn't it just be a normal one-sided hypothesis test? The only thing that changes under H_0: A > B vs H_0: A = B is that you're only interested in the area under one side of the curve instead of both. The test statistic remains the same, you just get a different p value.

It should let you get away with a lower sample size at the expense of possibly only being able to conclude "A's better" or "A's not better" instead of "A's better" or "A and B could be the same" or "B's better." (It's unclear to me what would be the problem with looking at the P values for both the > test and the != test, for a single test, and only "falling back" to the > test if you happen to be in a range where you can say "we can't say for sure that B's better, but we can be pretty sure that A isn't better," like how the different P values are presented here[1].)

It still wouldn't be valid to do something like "oh the two-sided one was inconclusive but leaned in favor of A, so let's do a one-sided one to check if A's better after all," though.

[1] http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests...

btilly · on Jan 25, 2013

In this article I was going for the highest possible standard. The ins and outs of one-sided versus two-sided was not a topic that I wanted to explore.

In the next article I plan to relax things a lot farther than just one-sided versus two-sided tests!

majormajor · on Jan 25, 2013

Gotcha, thanks. And one- vs two-sided alone would hardly be a huge win, I was just curious since I didn't remember seeing it brought up at all in any of the prior articles I'd read.

will_critchlow · on Jan 24, 2013

How do you approach this issue? http://www.distilled.net/blog/conversion-rate-optimization/w...

I haven't come up with a decent answer yet especially for smaller sites that can't run tests on very specific segments...

btilly · on Jan 25, 2013

Based on your article and code, I believe that you have misunderstood what the statistical test is supposed to be telling you.

When you make decisions at 95% confidence, your guarantee is that at most 5% of the time are you going to wrongly conclude that one is better when it isn't. However you have absolutely no guarantees about having correctly called the direction of the test if you do hit that significance level. (Indeed if the null hypothesis is correct, every time you call the test, you're wrong!)

What you did is simulated the test many times, ignored the cases where you were told there was no answer (thereby throwing away a large part of your guarantee), and found that you could be wrong a large portion of the time. Furthermore if there was a large, discoverable, random factor, you found that could be correlated with a lot of the mistakes. Unfortunately, in addition to the discoverable factor, there are lots of unknown random factors that also get randomly correlated. And even if there aren't, there is always the possibility of experiencing bad luck.

The only way to solve this is to throw enough traffic at the problem that the underlying bias can be reliably detected statistically. There is a complicated relationship between the size of bias you're willing to get wrong, and the amount of data that you need to collect to reliably detect it. I hope to explore that relationship in the next two articles.

For larger sites, that solution is perfectly workable. For smaller ones it is not, and the best that I can suggest is that they rely heavily on design principles that have been validated through A/B testing on larger sites, and hope they are not going too far wrong.

will_critchlow · on Jan 25, 2013

I'm pretty sure we should just ignore my code for the purposes of this discussion - my math(s) may be fuzzy, but my code has never been a strong point.

I was mobile when I wrote the original question - perhaps a better way of phrasing it would have been something like:

Don't many (most? all?) of these theoretical approaches assume that the sequence of results for each page (call them a_i and b_i for i=1,2,3... where each x_n is 0 [no conversion] or 1 [conversion]) are sequences of iid random variables with underlying conversion probabilities p_a and p_b? In reality, these sequences are much more complex and, if the scale of the variation in conversion probability within the sequence is greater than the difference between p_a and p_b won't the test be much weaker than we originally thought?

To use an example that is simplified vs. reality but hopefully indicates what I mean, imagine that we have two traffic sources - one with a conversion probability twice that of the other (on each page variant - so p_1_a = 2p_2_a and p_1_b = 2p_2_b). We randomly send traffic from both sources (1 and 2) to each variant (a and b). Do the standard tests work even though our sequence of conversions are not iid?

btilly · on Jan 25, 2013

No.

One good way to see why is this. The theoretical approaches that you're talking about do not care about the internal details of your random number generator. You could have p_a be the result of a single random decision (convert or not), or be the result of first finding out that we had a random source, then having conversion probabilities that depend on the source. Either way, as long as in the end you get a stable p_a of converting from noticing you have a new visitor to an actual conversion, probability theory says that the exact same statistical statements will be true, and you'll have entirely equivalent results.

(Note, the way I set up my particular approach means that actual conversion rates can shift over the test without invalidating the results. I didn't call that out too strongly, but I value that detail.)

However there is a gotcha lurking. The gotcha is that if conversion depends on your source, then A/B testing will optimize for your current traffic mix, but its result may become wrong if that mix shifts. My usual approach is to simply assume that the world is not going to be malicious in this way, unless I have specific reason to suspect it is.

Thus, for instance, I would not think twice about source vs button color. But if I had a landing page with random testimonials, I'd expect that a testimonial from pg would convert better than a testimonial from Phil Ivey. And conversely for traffic from a gambling site.

monkeyfacebag · on Jan 24, 2013

Can someone who understood this post all the way through tell me what m refers to? It first shows up in this sentence

> Therefore we're faced with a series of decisions, at each number of conversions n, m more turned up A than B (m can, of course, be negative).

which I am not able to parse.

btilly · on Jan 24, 2013

m is the number of A conversions minus the number of B conversions.

bobbles · on Jan 25, 2013

"m more often resulted in result A than result B"

I think would be another way of saying it. I could be completely wrong though, it is a bit confusing.

btilly · on Jan 25, 2013

Did I manage to clarify the wording in my latest revision?

If so, how would you like to be acknowledged?

carlsednaoui · on Jan 26, 2013

Loving this, thank you for taking the time to make these. Is there any way we could get notified by email when the rest of the series comes out? My email is user @ gggmail.com just in case.

sycren · on Jan 25, 2013

I have done many split tests on my company website using Optimizely, How does this service and others measure up in accuracy of the results?

grzaks · on Jan 25, 2013

btbilly: I read your article but before I digg into the math and try to understand it - can you please take a look at the tool I use now http://mystatscalc.com and see 1) how it might work 2) do you think it's giving correct results?

Waiting for rest from the series!

Grzegorz

btilly · on Jan 25, 2013

That tool is giving you a standard frequentist p-value. Which means that if you insist on 95% confidence, and take multiple looks, then you'll have more than a 5% chance of eventually getting to confidence randomly.

If you stop your tests on the p-value cutoffs that I gave in my third graph, you'll get very strong a priori guarantees that the decisions that you make will be right. The downside is that you won't have any guarantee of making decisions in reasonable time. But that is the subject of the next article.