So I spent a chunk of yesterday doing research for the split-testing section of the Step By Step Guide.
I was looking into the maths of split-testing landing pages and offers to make sure that I was right before I wrote up the usual way to do it.
But I ended up finding some new research on the subject - and what I found was pretty mind-blowing.
It turns out that a) the way we usually do split-testing is very wrong, and b) there's a far better, more precise way to do it that uses less data!
Here's how.
(Note: if you want to skip the "why should we do this?" part and go straight to the neat tool, click here.)
What's Wrong With How We're Doing It Now?
Usually, we split-test landing pages using calculators like this one. That gives us the math to say whether our split-test is statistically significant or not, meaning that we don't make decisions on crappy data.
So far so good. But there are a bunch of problems with standard statistical significance calculation for split-testing.
Large, unknown sample sizes
The way we're doing it at the moment, split-tests need a lot of data to be significant. We tend to recommend 100 clicks on an LP as an absolute minimum, and realistically, we're usually underestimating that: 300 - 500 is more like it.
And what makes it worse is that if you don't see a significant difference, landing page tests just drag on and on. There's no clear point to stop them.
That means that usually, we end up checking our statistical significance every 100 clicks or so until we see a difference.
But it turns out that doing that is a really bad idea.
Repeated testing produces big errors
It's a bad idea to check whether a split-test is significant more than once - ever.
Significance calculations always have a chance of error. And if you're only stopping the test when you have a significant result, and you check multiple times, you bias the calculation in favour of detecting significance even when it isn't there.
This excellent post explains more, and also provides some pretty hair-raising figures.
In the worst case, what you think is a 5% chance of error could be closer to a 30% chance of error with repeated tests!
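If you'd like to see the effect for yourself, here's a rough simulation sketch. It runs lots of "A/A" tests, where both landers convert at the same rate, and counts how often a standard two-proportion z-test calls a winner anyway. The 10% conversion rate, the 1,000-click cap and the function names are all assumptions for illustration, not figures from the post linked above.

```python
import math
import random

def z_test_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test; True if the gap looks 'significant' at about 5%."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return False  # no variation at all, nothing to test
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return abs(conv_a / n_a - conv_b / n_b) / se > z_crit

def false_positive_rate(peek_every, max_clicks, rate=0.10, runs=1000, seed=1):
    """Share of A/A tests (identical landers) that ever get declared significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        for n in range(1, max_clicks + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # Stop the test the first time a check comes back significant.
            if n % peek_every == 0 and z_test_significant(conv_a, n, conv_b, n):
                hits += 1
                break
    return hits / runs

peeking = false_positive_rate(peek_every=100, max_clicks=1000)
single_look = false_positive_rate(peek_every=1000, max_clicks=1000)
print(f"checking every 100 clicks: {peeking:.1%} false positives")
print(f"checking once at the end:  {single_look:.1%} false positives")
```

Since neither lander is actually better, every "significant" result here is an error. The stop-as-soon-as-it's-significant version should come out noticeably worse than the single check at the end.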
The "approved" way to fix this problem is to choose a sample size in advance, run the test to that sample size, then test. But if you use the correct math to calculate your sample size, tests done this way end up being very expensive.
If you have a 10% conversion rate and you want to detect a 3% or more bump in conversions 90% of the time, for example, you'll need 2,206 clicks to test that - per landing page you're testing. Here's a calculator that you can use to play around with the sample sizes - it's pretty alarming stuff.
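For the curious, here's a minimal sketch of one textbook sample-size formula for comparing two conversion rates. I'm assuming the "3% bump" means going from 10% to 13%, a two-sided test at the 5% level, and 90% power. Calculators differ in exactly these assumptions, so don't expect it to reproduce the 2,206 figure to the click. The function name is made up for this example.

```python
import math
from statistics import NormalDist

def clicks_per_variant(base_rate, lifted_rate, alpha=0.05, power=0.90):
    """Approximate clicks needed per lander for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # chance of catching a real lift
    avg = (base_rate + lifted_rate) / 2
    null_sd = math.sqrt(2 * avg * (1 - avg))
    alt_sd = math.sqrt(
        base_rate * (1 - base_rate) + lifted_rate * (1 - lifted_rate)
    )
    n = ((z_alpha * null_sd + z_power * alt_sd) / (lifted_rate - base_rate)) ** 2
    return math.ceil(n)

needed = clicks_per_variant(0.10, 0.13)  # assumed reading: 10% -> 13%
print(f"about {needed} clicks per landing page")
```

Under those assumptions the answer lands in the same couple-of-thousand range as the calculator. Try shrinking the gap between the two rates to see how quickly the requirement blows up.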
Introducing: Bayesian Calculators
So how do we avoid spending a fortune to test our LPs?
Well, fortunately, required sample size for experimentation is a problem that has been around for much longer than affiliate marketing has been. In fact, medical research has a far bigger problem with it than we do.
It's not easy or cheap to run trials of new medicines, and there's a lot of reasons to attempt to make your sample size as small as possible. Like, y'know, people dying.
So since the '80s, a lot of medical trials have used much smarter statistical math than the significance calculations most of the Web world use.
By using a totally different branch of statistics called Bayesian Inference, it's possible to get more useful data faster. Bayesian calculations combined with a program that runs thousands of simulations on the data we have - called a Monte Carlo simulation - can give valid probability information from small sample sizes, give much more precise data than just "yes" or "no", and don't get less accurate the more you use them!
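To make that a bit less abstract, here's a minimal sketch of the general idea, not the calculator's actual code. It assumes a flat Beta(1, 1) prior for each lander, which may not be what any particular tool uses, draws thousands of plausible conversion rates from each lander's Beta posterior (the Monte Carlo part), and counts how often each lander comes out on top. The click and conversion numbers are hypothetical.

```python
import random

def probability_best(results, draws=20000, seed=42):
    """results: list of (trials, successes). Returns the chance each one is best."""
    rng = random.Random(seed)
    wins = [0] * len(results)
    for _ in range(draws):
        # One plausible "true" conversion rate per lander, given its data.
        sampled = [
            rng.betavariate(1 + successes, 1 + trials - successes)
            for trials, successes in results
        ]
        wins[sampled.index(max(sampled))] += 1
    return [w / draws for w in wins]

landers = [(200, 18), (200, 29)]  # hypothetical (clicks, conversions)
chances = probability_best(landers)
for name, chance in zip("AB", chances):
    print(f"Lander {name}: {chance:.1%} chance of being the best")
```

Because the output is a probability rather than a yes/no verdict, re-running it as more clicks arrive simply updates the estimate.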
Unfortunately, the math involved in Bayesian calculation is pretty darn terrifying. So most people have disregarded it up until now.
But it turns out - and this was my big discovery - that there's actually an accessible Bayesian calculator out there!
Read on to find out about it...
So...
To use it, just tick the boxes for as many landers or offers as you've been testing (up to 4). Then for each lander or offer, input the number of clicks it has received as "trials", and the number of conversions it's had as "successes". Then hit "Calculate".
You'll get some useful statistics out, as well as a very pretty - and helpful - graph.
What The Numbers Mean

The most immediately valuable bits are the probability percentages: they'll show you exactly how likely each offer or lander is to be the winning choice.
Note that the calculator will give you results even on very small sample sizes. This is a purely mathematical model, so isn't taking into account the fact that you're dealing with the real world here.
Since your ads aren't being shown to perfectly spherical, frictionless visitors of uniform mass, it's best to get to at least 50 clicks and a few conversions before taking the data seriously, ideally over a couple of days.
Remember, also, that a lander with a 63% chance of being the best one isn't a very good choice yet! That gives you a 37% chance of having picked dramatically the wrong lander - which if you're planning to spend $x,xxx on the campaign after you choose it, ain't great odds. Personally, I'd aim for 90% certainty minimum.
The ranges of possible conversion rates shown are also very useful - you can compare these with the minimum conversion rate you need to reach your target ROI. They're particularly useful alongside the graphs.
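Here's a rough sketch of how a range like that can be produced and compared with a break-even rate. It assumes a flat Beta(1, 1) prior and a 95% range; the calculator may use different settings. The click count, conversion count and break-even figure are all hypothetical.

```python
import random

def conversion_rate_range(trials, successes, mass=0.95, draws=20000, seed=7):
    """Central range holding `mass` of the plausible conversion rates."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + successes, 1 + trials - successes)
        for _ in range(draws)
    )
    tail = (1 - mass) / 2
    low = samples[int(tail * draws)]
    high = samples[int((1 - tail) * draws) - 1]
    return low, high

low, high = conversion_rate_range(trials=300, successes=36)
break_even = 0.09  # hypothetical minimum conversion rate for your target ROI
print(f"plausible conversion rate: {low:.1%} to {high:.1%}")
if low > break_even:
    print("even the pessimistic end of the range clears break-even")
else:
    print("break-even is still inside the range - keep collecting data")
```

If the whole range sits above your break-even rate, the lander is very likely to hit your ROI target; if break-even falls inside the range, you can't tell yet.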
What The Graphs Mean
The graphs show the probability of each lander's conversion rates. The higher the graph above a particular percentage point, the more likely that the conversion rate is that number.
This gives you an excellent tool to tell when you should stop the test. You're looking for high, narrow peaks: if they're clearly separated, your numbers will almost certainly be showing that one lander is very likely to be best. If you have very high, sharp peaks that are overlapping, the chances are that you're not going to see significant results from this test, and all your landers or offers are performing about the same - stop the test and try something else.
Here's an example of a "stop the test, nothing's going to change" graph:

If you're seeing more wide, rounded curves, the data's not certain yet - give it some more time and clicks. Here's an example of that:

Remember, unlike a conventional split-test, you can check these stats as much as you like - so if the data isn't looking convincing, give it some more clicks then re-test.
Can We Trust This?
Any time someone comes up with a hot new way to do something well-understood, it's worth asking if it's a bunch of bullshit.
I've done fairly thorough due diligence on this approach, and I think that it's sound.
The concept of using Bayesian models for exactly this sort of problem is well-understood: as I mentioned, it dates back to the '80s and it's increasingly becoming the go-to method for medical trials around the world. The method clearly works. I've also found a lot of Web data analysis experts discussing it as the future of split-testing.
Side Note: Incidentally, there's further that we could go with this approach. Medical trials use a calculation of "minimum regret" - rather than percentage of probability, a weighted recommendation based on just how bad it would be to be wrong. Using that concept (which isn't used here) it would be possible to design a calculator which will tell you exactly when to stop a test given your expectations of future spend on the campaign.
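As a very rough illustration of that idea, and definitely not the procedure medical trials actually follow, here's a sketch of one common way to put a number on "how bad would it be to be wrong": the expected loss. For each lander it averages how much conversion rate you'd be giving up if you picked it and another lander was really better. The flat Beta(1, 1) prior, the planned click volume and the payout are all assumptions.

```python
import random

def expected_loss(results, draws=20000, seed=3):
    """Average conversion rate given up by committing to each lander.

    results: list of (trials, successes), one pair per lander.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(results)
    for _ in range(draws):
        sampled = [rng.betavariate(1 + s, 1 + t - s) for t, s in results]
        best = max(sampled)
        for i, rate in enumerate(sampled):
            totals[i] += best - rate  # zero when this lander is the best draw
    return [total / draws for total in totals]

landers = [(200, 18), (200, 29)]  # hypothetical (clicks, conversions)
planned_clicks = 10_000           # hypothetical future volume on the winner
payout = 2.50                     # hypothetical payout per conversion, USD
losses = expected_loss(landers)
for name, loss in zip("AB", losses):
    cost = loss * planned_clicks * payout
    print(f"Lander {name}: expected cost of being wrong is about ${cost:,.0f}")
```

A stopping rule built on this would end the test once the best lander's expected cost drops below whatever you're willing to risk, which ties the decision to your planned spend rather than to a fixed percentage.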
The calculator itself is designed by a guy with some reasonably serious credentials, and it appears (to my limited math eyes) to use the same maths that RichRelevance discuss in this post on Bayesian testing. RichRelevance are the guys who run testing for WalMart amongst other people - they're extremely heavy hitters in the testing world.
I'm convinced enough that I'm going to be using this calculator for my own A/B tests in future, but I'd be very interested to hear what you think, and particularly interested to hear from people with more math chops than me!
So there you go! Any questions, comments, discussion, or debate, please do comment below!
awesome post.
what does all this mean in english? :-)
OK, the executive summary:
The way you decide which contender in a split-test is best has a much higher chance of error than you think it does.
Use this tool instead. It gives you better information more quickly with no hidden errors.
Awesome info Caurmen. I love this stuff. Keep em coming
Good post, will definitely check it out. I had meant to talk to you about a theory/question of statistical significance I've had floating around some time on Skype but I guess I will post it below for everyone to see.
Here's something I have been contemplating for some time: is it truly possible to get statistically significant data if your traffic is too broad/untargeted (RON)? If there is no common binding identifier, can you really test things out even with a massive budget?
An easy example of split-testing is when you buy Google search traffic, and say you are targeting "East Village New York Emergency Plumber". You know everyone who is searching that and clicking on your engaging ad that promises a 24/7 emergency plumber has many things in common. They most likely live in the East Village in NYC, they probably need a plumber, and it's probably an emergency etc etc. You can tailor your landing page for them super razor-sharp and any small change like even font color etc may result in statistically significant CVR boost.
Now let's take the extreme opposite angle: if you have the whole world's population clicking to your lander in a RON style stream. Every single language, country, etc. All mixed up. A huge majority of people are going to speak Chinese and NO English (people from China), many will be from India, etc etc. English speakers also make up a huge portion of the population as well.
Let's pretend before this test, you ran 100 million clicks more or less evenly distributed between English Speakers and non-English Speakers to an English LP and ended up with an 11.37% CTR from LP to offer, and you wanted to split test a new font color for the body text.
What randomly happens in this test, though, is that out of a sample of 55 million users reaching your landing page, you potentially get a random chunk of 50 million people from China, with only 3 million English-speaking people and 2 million who speak neither language going through your funnel.
You check your stats, and you see that the new font you split-tested with this 55 million burst of traffic is absolutely TANKING your results. You're trying to figure out what's going on, and since in this example -- we are omniscient and know everything, we know that the variability in the type of demographic/people going through your funnel made more of a difference than the actual difference in the font on your LP.
In the real world, we aren't omniscient, and we are left wondering what many inconsistent split testing results mean when you run less targeted streams of traffic. Any solutions you've found to this?
Hmm, that's a very interesting question!
My initial thought - and I'm going to give this some more thought over the weekend - is that this is a question of understanding the variables and the "chunk size" of your traffic source.
Let's imagine that your traffic source is an elevator with a gate at the end, and the gate opens up every fraction of a second to allow a new visitor to your lander.
Now, if every time the gate opens up one visitor comes through, and that visitor is randomly chosen based on a percentage distribution of population across the globe, out of 55 million visitors, you'll get a distribution that's very, very close to the population distribution of the world. Whilst there is a chance that there's a huge disproportion in your visitors just based on random luck, the chance is tiny: something in the order of 0.3% to the power of 15 million, which is getting toward the same likelihood as that of your computer spontaneously dropping through the floor thanks to quantum creep. In other words, it's sufficiently unlikely as to be effectively impossible.
Bearing that in mind, if you do see huge bursts of Chinese traffic, there's got to be another explanation. And assuming you're reasonably certain that the average makeup of your traffic source is equivalent to the average makeup of the globe (or at least the portion of the globe that has Internet access), the obvious explanation is that the gate isn't admitting just one person at once.
If the gate randomly chooses a country and then admits a million people from that country to your lander, rather than just one, the odds of getting a massive burst of Chinese traffic on a particular day go up massively - rather than 55 million decisions, you're only seeing the results of 55. And in that case, it won't be obvious what's going on unless you know how your traffic source works.
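To put rough numbers on the difference between 55 million decisions and 55, here's a small sketch using the standard deviation of a binomial proportion. The 20% share for Chinese traffic is a made-up figure purely for illustration, and the function name is hypothetical.

```python
import math

def share_spread(share, decisions):
    """Typical wobble (standard deviation) in an observed traffic share
    when it results from `decisions` independent random picks."""
    return math.sqrt(share * (1 - share) / decisions)

china_share = 0.20  # hypothetical long-run share of the traffic source
per_visitor = share_spread(china_share, 55_000_000)  # gate admits one visitor
per_block = share_spread(china_share, 55)            # gate admits a million

print(f"one visitor per decision: share wobbles by about {per_visitor:.4%}")
print(f"a million per decision:   share wobbles by about {per_block:.1%}")
```

Fewer independent decisions means a much wider wobble: the million-visitor blocks make the observed share about a thousand times noisier than picking visitors one at a time.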
So my answer would be, from this rather roundabout discussion, that you really need to understand if the distribution you're getting is truly random per visitor.
If it is, and you know the percentages that the random choice is coming from, then over a sufficiently large sample size the noise of randomness will cancel out to give you a very reliable dataset - arguably an even better one than your East Village plumbing emergencies. This is the principle behind everything from poker bots to Pixar's raytracers to meta-analyses of medical data.
However, if there's a variable you're not aware of, then it can screw everything up - and when you're buying that many clicks, that's not good!
At that sort of volume, I'd probably recommend having a conversation with the engineers behind the traffic source you're using. You'd want to know how they distribute your clicks and what random/pseudorandom algorithm they're using, if they already have a frequency and volume model of visitors, if they've got data you can analyse to make sure that it's random and that their stats are accurate. Obviously, you'd also want to know how the traffic varies by day or by week so you can rule those out as potential spoilers.
You'd probably also want to hire a probability expert to check all this stuff. Fortunately this is a very similar problem to ones that the financial markets face pretty frequently, so there are a lot of very highly skilled analysts out there. (If this becomes more than a theoretical example, I can introduce you to a couple of them - they're extremely expensive, but for this sort of problem they're also worth it.)
And finally, of course you'd want to have some incredibly robust tracking in place that's tracking all the data you can think of. In the event that the supremely weird did happen, you'd be able to track it down pretty quickly by looking at - in this case - your country stats. If your stats show that you've had a burst of traffic with a probability less than a meteor wiping out life on earth, you probably would then want to have some Serious Words with your traffic source.
The great advantage of this sort of volume is that you can use some incredibly robust reporting mechanisms. For example, you could send 0.5% of your visitors to a language-specific survey, and ask them if, for example, this day's unusual for them, if there's something happening that's capturing their attention, if there's something about your offer that puts them off - all stuff that will let you track down any funnel problems.
Haha well... those numbers I was using were purely theoretical. Even if I was slinging some straight fantasy stuff, if I was getting 55 million clicks a day, EVEN at 1 cent a click, that'd be 550k of adspend per day, and spread out through a whole year would be about 200 million USD at 1 cent clicks.
I think your theory of large gated chunks of traffic still applies though. That seems to be the only explanation. Unfortunately, for this massive traffic source I have in mind for this problem, their ad delivery platform is pretty much a black box, and any attempts to talk to them about it won't result in a nice conversation (despite spending a very significant shit-ton of money with them), it will simply result in a template email back.
Aha, now, that ends up being a very interesting problem.
I've spent a good chunk of my career reverse-engineering black boxes - it's not easy but it can be both fun and profitable...
If you have the budget to spend on this, my recommendation would be to hire a shit-hot mathematician / data analysis person and talk to them about reverse-engineering the traffic algo the traffic source is using. I know of a few people who might fit the bill - contact me if you want some recommendations.
Alternatively, get one of their engineers drunk.
Having said that, there's every chance that the result you'd end up with would essentially be "yup, the way they do this is spectacularly stupid and hard to make money from." Depends on the potential upside, I guess - although if they're gating like this, you might be able to figure out how to capture only the high-quality traffic...
Hmm I kind of derailed this thread -- to get it back on track:
Caurmen, what do you think about using this tool when you have variability in payouts? For example, I'm always split testing multiple LP's with multiple offers, so an LP may get a super high CTR, more clicks, more conversions, but if I were testing SOI vs DOI, then obviously using a standard "conversion" as a metric isn't exactly fair, since the ultimate total conversion is skewed with multiple offers and payouts. Is it doable to use total number of clicks for Trials and total Revenue for the Successes?
That's a very good question! Andyvon was asking a similar question over here.
I don't actually have a solid answer on that right now - I know the math gets a bit more complicated. Let me look into it, and I shall get back to you.
Brilliant stuff like usual!
AMAZING post. Fascinating.
Anybody else getting a virus warning/blank page for the calculator?
Can't get anything showing up.
Looking forward to applying this to our current tests and seeing what happens.
-D
I'm seeing some nasty injected ads in there - I think it may have been hacked. Fuck.
Gonna get in touch with him and see what's what.
In the meantime, taking links temporarily offline.
Looks fine now, no Ads or malware showing up.
Caurmen, this is totally badass!
If you see an excel version around let me know, it seems this could be hard to do - but I use excel extensively to save time so will see if it can be implemented somehow.
Some more info here:
http://devblog.songkick.com/2012/10/...g-split-tests/
@weekendwarrior - Nice share! I'm saving that one for later - breaking my head on statistics theory is an active project right now 
The certainty of a lander winning is mostly based on the number of conversions. Try increasing it to 1000 trials, 100 and 150 successes - you'll see that the numbers are a lot more certain then.
I am math retarded in inverse proportion to Excel skills; is there any way to build this into an Excel workbook rather than using an online tool? I know you can perform Monte Carlo simulations in Excel, and this seems similar...
For most purposes, using a multi-armed bandit model will get you faster results than standard A/B testing. When done properly, this incorporates Bayesian probabilities (via Thompson sampling or Randomised Probability Matching) to dynamically determine the relative proportions of traffic to send to each test, so that you don't waste too much money on the losers while ensuring that you have enough statistical robustness to make these decisions.
Indeed, this is how Google Analytics does its experiments.
https://support.google.com/analytics.../2844870?hl=en
Multi-armed bandit experiments
By Steven L. Scott, PhD, Sr. Economic Analyst @ Google
Benefits
"Experiments based on multi-armed bandits are typically much more efficient than "classical" A-B experiments based on statistical-hypothesis testing. They’re just as statistically valid, and in many circumstances they can produce answers far more quickly.
They’re more efficient because they move traffic towards winning variations gradually, instead of forcing you to wait for a "final answer" at the end of an experiment. They’re faster because samples that would have gone to obviously inferior variations can be assigned to potential winners. The extra data collected on the high-performing variations can help separate the "good" arms from the "best" ones more quickly.
Basically, bandits make experiments more efficient, so you can try more of them. You can also allocate a larger fraction of your traffic to your experiments, because traffic will be automatically steered to better performing pages.
Example
A simple A/B test
Suppose you’ve got a conversion rate of 4% on your site. You experiment with a new version of the site that actually generates conversions 5% of the time. You don’t know the true conversion rates of course, which is why you’re experimenting, but let’s suppose you’d like your experiment to be able to detect a 5% conversion rate as statistically significant with 95% probability. A standard power calculation tells you that you need 22,330 observations (11,165 in each arm) to have a 95% chance of detecting a .04 to .05 shift in conversion rates. Suppose you get 100 visits per day to the experiment, so the experiment will take 223 days to complete. In a standard experiment you wait 223 days, run the hypothesis test, and get your answer.
Now let’s manage the 100 visits each day through the multi-armed bandit. On the first day about 50 visits are assigned to each arm, and we look at the results. We use Bayes' theorem to compute the probability that the variation is better than the original. One minus this number is the probability that the original is better. Let’s suppose the original got really lucky on the first day, and it appears to have a 70% chance of being superior. Then we assign it 70% of the traffic on the second day, and the variation gets 30%. At the end of the second day we accumulate all the traffic we’ve seen so far (over both days), and recompute the probability that each arm is best. That gives us the serving weights for day 3. We repeat this process until a set of stopping rules has been satisfied.
Figure 1 shows a simulation of what can happen with this setup. In it, you can see the serving weights for the original (the black line) and the variation (the red dotted line), essentially alternating back and forth until the variation eventually crosses the line of 95% confidence. (The two percentages must add to 100%, so when one goes up the other goes down). The experiment finished in 66 days, so it saved you 157 days of testing.
Figure 1. A simulation of the optimal arm probabilities for a simple two-armed experiment. These weights give the fraction of the traffic allocated to each arm on each day.
Of course this is just one example. We re-ran the simulation 500 times to see how well the bandit fares in repeated sampling. The distribution of results is shown in Figure 2. On average the test ended 175 days sooner than the classical test based on the power calculation. The average savings was 97.5 conversions.
Figure 2. The distributions of the amount of time saved and the number of conversions saved vs. a classical experiment planned by a power calculation. Assumes an original with 4% CvR and a variation with 5% CvR.
But what about statistical validity? If we’re using less data, doesn’t that mean we’re increasing the error rate? Not really. Out of the 500 experiments shown above, the bandit found the correct arm in 482 of them. That’s 96.4%, which is about the same error rate as the classical test. There were a few experiments where the bandit actually took longer than the power analysis suggested, but only in about 1% of the cases (5 out of 500).
We also ran the opposite experiment, where the original had a 5% success rate and the variation had 4%. The results were essentially symmetric. Again the bandit found the correct arm 482 times out of 500. The average time saved relative to the classical experiment was 171.8 days, and the average number of conversions saved was 98.7.
Stopping the experiment
By default, we force the bandit to run for at least two weeks. After that, we keep track of two metrics.
The first is the probability that each variation beats the original. If we’re 95% sure that a variation beats the original then Google Analytics declares that a winner has been found. Both the two-week minimum duration and the 95% confidence level can be adjusted by the user.
The second metric that we monitor is the "potential value remaining in the experiment", which is particularly useful when there are multiple arms. At any point in the experiment there is a "champion" arm believed to be the best. If the experiment ended "now", the champion is the arm you would choose. The "value remaining" in an experiment is the amount of increased conversion rate you could get by switching away from the champion. The whole point of experimenting is to search for this value. If you’re 100% sure that the champion is the best arm, then there is no value remaining in the experiment, and thus no point in experimenting. But if you’re only 70% sure that an arm is optimal, then there is a 30% chance that another arm is better, and we can use Bayes’ rule to work out the distribution of how much better it is. (See the appendix for computational details).
Google Analytics ends the experiment when there’s at least a 95% probability that the value remaining in the experiment is less than 1% of the champion’s conversion rate. That’s a 1% improvement, not a one percentage point improvement. So if the best arm has a conversion rate of 4%, then we end the experiment if the value remaining in the experiment is less than .04 percentage points of CvR.
Ending an experiment based on the potential value remaining is nice because it handles ties well. For example, in an experiment with many arms, it can happen that two or more arms perform about the same, so it does not matter which is chosen. You wouldn’t want to run the experiment until you found the optimal arm (because there are two optimal arms). You just want to run the experiment until you’re sure that switching arms won’t help you very much."
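The daily reweighting and the "value remaining" stopping rule described in the quote can be sketched roughly as follows. This is my own minimal reading of the description, not Google's code: it assumes a flat Beta(1, 1) prior per arm, and the arm data and function names are hypothetical.

```python
import random

def posterior_draws(arms, draws=10000, seed=11):
    """arms: list of (visits, conversions). One row of plausible rates per draw."""
    rng = random.Random(seed)
    return [
        [rng.betavariate(1 + s, 1 + t - s) for t, s in arms]
        for _ in range(draws)
    ]

def serving_weights(samples):
    """Chance each arm is best, used as its share of the next day's traffic."""
    wins = [0] * len(samples[0])
    for row in samples:
        wins[row.index(max(row))] += 1
    return [w / len(samples) for w in wins]

def value_remaining(samples, quantile=0.95):
    """95th percentile of the relative lift still available over the champion."""
    weights = serving_weights(samples)
    champion = weights.index(max(weights))
    lifts = sorted(
        (max(row) - row[champion]) / row[champion] for row in samples
    )
    return lifts[int(quantile * len(lifts)) - 1]

arms = [(500, 20), (500, 25)]  # hypothetical totals so far: original, variation
samples = posterior_draws(arms)
weights = serving_weights(samples)
remaining = value_remaining(samples)
print("tomorrow's traffic split:", [f"{w:.0%}" for w in weights])
print(f"value remaining (95th percentile): {remaining:.1%}")
print("stop the test" if remaining < 0.01 else "keep going")
```

Each day you'd add the new visits and conversions to the totals, recompute, and route the next day's traffic by the new weights. The test ends when the value remaining drops under 1% of the champion's conversion rate.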
more here
https://support.google.com/analytics/answer/2846882
Damn, that's some fascinating stuff, thanks a lot for posting! I'll try and incorporate it into the next couple tests, finally I can make use of the weighting function of CPV Labs! ;-)
Interesting!
I've read some articles skeptical of Multi-Armed Bandit approaches in the past, hence why I've not tried them - but if they're good enough for Google, I'll have to look into their methodology and do some testing!
So in the comments of the Bayesian calculator the following article was posted: http://engineering.richrelevance.com...-normal-model/
It talks about using the Bayesian approach for non-binomials, e.g. different payouts. Being a mathematical retard, this article goes way over my head... any chance that one of the STM geniuses could give us a TLDR version on how to use this approach?
@andyvon - I've had that article in my to-do list for a while. It's ... well, so far I've confirmed that it's definitely about maths
However, I shall add it to the list to ask my Friendly Local Mathematicians about.
I use this calculator on every campaign. I hope this guy never takes his website down or I will really be screwed.
Hire someone to make your own calculator, or use my hacked-together version 
http://huskysteals.com/ABTest/
If it ever does get taken down we'll come up with a replacement (or just start linking to Jennatalia's one).
Should we be using this tool for ad testing instead of the 'Exact Binomial and Poisson Confidence Intervals' test?
@jaydenuk - No, they serve two different purposes.
This tool will tell you which of a group of things are the best. However, what it doesn't do is tell you whether the winner will be profitable. Hence, if you're in a battle-to-the-death test where you know you'll only want to end up with one thing (like lander or offer optimisation) this is a great tool.
The confidence interval calculator will tell you whether it's probable that a thing, given an amount of data, will hit a specific threshold of events. That's what you want for ads.
Even if ad A is the best, if it's not profitable, you don't want to run it. And you almost never want to run only one ad.
So the two tools serve different functions.
Does that make sense?
Yes thanks.
Hello @caurmen
Do you think this calc: https://www.peakconversion.com/2012/...al-calculator/
works for Facebook ads?
Facebook gives more impressions to some ads than others, so maybe this behavior does not meet the requirements of the calculator.
Thanks for your great guides.
Best Regards
PS:
Can this always be used in any online marketing test?
What do you think are the limitations of this calculator?
The Bayesian calculator doesn't require that all tests have the same impression count, so will work fine.
However, beware of FB's algorithm when you're doing split-tests on the platform. Make sure you're following best practises for FB ads to avoid the entire thing being skewed by FB's machine learning deciding that Ad A is the best ad and all others aren't.
Limitations of the calculator: it doesn't work well at very small sample sizes, as mentioned. Beyond that, it's a good statistical tool. It doesn't work with non-binomial situations (where the outcome isn't "Yes" or "No"), but fortunately advertising online tends toward a lot of binomial results.
If you really want to dig into the math you can look up the massive and ongoing frequentist vs Bayesian debate, but that's a deep, deep rabbit-hole with limited ROI.