A/B Test Results Validation – 3 Simple Rules for Non-statisticians

A/B testing is the most effective way of proving whether changes to digital content work or don’t work. However, it is a statistical, scientific method and is therefore only as good as the statistical interpretation of the results. In very simple terms, it answers the question: ‘what is the probability that the difference between the 2 conversion rates is not the result of chance/accident?’

Yet it is fairly alarming how frequently you will find people running A/B tests and then making judgements purely on the basis of the difference in the conversion rate between the variations, as if those numbers are a given fact. What follows is the absolute basic guide for how not to do this. You don’t need to understand stats, you just need to follow a few basic rules.

First, some terminology:

Whether you have seen A/B test results before or not, it helps to clarify the key elements I am talking about and the labels used for them:

AB test results

  • A = Baseline conversion rate, or the conversion rate of the control version 
  • B = Minimum Detectable Effect, the ‘lift’, or the % change between the control and the variation
  • C = The volumes of visitors entering each variation
  • D = Confidence level

For a variation in a test to be statistically significant and to be declared a winner it must satisfy 3 rules:

Rule 1 : The sample of traffic entering each variation must be large enough

Use this tool to calculate how big the visitor volumes need to be in each variation, based on the % increase in the conversion rate. You will need to enter (A) and (B) to determine whether each of the values in (C) is big enough.

These tools are designed to calculate, in advance, the sample size you need to reach for the lift you want to detect. However, it is easier (albeit maybe not as statistically pure) to use one as a check on whether your current test results already contain appropriate sample sizes. If you do this regularly on a live test, using the lift as it exists at that time, it will tell you whether the test has reached the right size or what kind of volume you still need to get to, based on that lift.
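For anyone curious about what such a calculator does, here is a minimal sketch in Python. It uses a common textbook approximation for comparing two proportions, and it assumes a two-sided test and 80% statistical power, neither of which is stated above. The function name and parameters are mine for illustration, and a given online tool may use a slightly different formula, so expect small differences in the numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift,
                              confidence=0.95, power=0.80):
    """Approximate visitors needed in EACH variation.

    baseline_rate: (A), e.g. 0.05 for a 5% conversion rate.
    relative_lift: (B), e.g. 0.20 for a 20% relative increase.
    power=0.80 is an assumption of this sketch, not a value from the article.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must be between 0 and 1 and must differ")

    # Critical values of the standard normal distribution (two-sided test).
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)

    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    needed = sample_size_per_variation(0.05, 0.20)
    print(f"Visitors needed per variation: {needed}")
```

Compare the returned figure with each of the values in (C). Note how quickly the requirement grows as the lift shrinks: halving the lift roughly quadruples the traffic you need.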

Rule 2 : The confidence level should be 95% or higher

The confidence level (D) should always be at least 95%. However, if a test has been running for a long time (see below) and the variation has consistently been either above or below the control in a neat pattern, then I might, under duress, be tempted to take 90% as a minimum.

Tools like Adobe Target and Optimizely do this calculation for you, and both are relatively sophisticated. However, I also like to check using alternative tools. This is probably the best one.

N.B. If a conversion rate lower than the control would count as a successful result, for example if you are testing whether visitors do something you don’t want them to do, then you need to use 2-sided test results validation. Many testing tools don’t provide this; the tool link provided does.
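If you want to sanity-check a tool's figure yourself, the sketch below runs a 2-sided two-proportion z-test in Python. It assumes that "confidence level" means 1 minus the two-sided p-value, which is how many calculators present it but is my assumption here. Adobe Target, Optimizely and others may use different methods, so this is a cross-check and not a reproduction of their numbers. The function name and arguments are illustrative.

```python
import math
from statistics import NormalDist


def two_sided_confidence(control_visitors, control_conversions,
                         variation_visitors, variation_conversions):
    """Return the confidence (0 to 1) that the two rates really differ.

    Uses a pooled two-proportion z-test. Because the test is two-sided,
    a variation that is worse than the control scores just as highly as
    one that is better by the same amount.
    """
    rate_control = control_conversions / control_visitors
    rate_variation = variation_conversions / variation_visitors

    # Pooled rate under the assumption that there is no real difference.
    pooled = ((control_conversions + variation_conversions)
              / (control_visitors + variation_visitors))
    std_error = math.sqrt(pooled * (1 - pooled)
                          * (1 / control_visitors + 1 / variation_visitors))
    if std_error == 0:
        return 0.0  # no conversions (or all conversions) in both groups

    z = (rate_variation - rate_control) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


if __name__ == "__main__":
    confidence = two_sided_confidence(10_000, 500, 10_000, 600)
    print(f"Confidence: {confidence:.1%}")
    print("Passes Rule 2" if confidence >= 0.95 else "Fails Rule 2")
```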

Rule 3 : The test should have run for a valid period of time

This is the rule most commonly overlooked. The result of an experiment is really a statement that the change made will have the observed effect continuously over time, i.e. changing this piece of copy is going to improve conversion by x forever (or until something else changes). This is how we calculate the longer term benefit of the changes we make.

However, claiming that the result seen in a test will persist is not so straightforward, because there are a myriad of invisible forces at work during the time of the test. Imagine that we ran a test on a single day. What you would not see is the makeup of the different segments of visitors on that day. On a different, later day that makeup could have been different:

                                                                 Day 1    Day 6
Visits from segment A, which converts in the variation at 1%     5,000      500
Visits from segment B, which converts in the variation at 5%       500    5,000
Overall conversion for the variation on that day (approx.)        1.4%     4.6%

So a test run at one time may not get the same result as the same test run at another time if the makeup of the audience is significantly different.
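The arithmetic behind that example can be checked with a few lines of Python. This is just a sketch using the visit counts and segment conversion rates from the table; the function name is mine. The blended rates come out at about 1.4% on Day 1 and about 4.6% on Day 6, even though neither segment's own conversion rate has changed.

```python
def blended_rate(segments):
    """Overall conversion rate for a list of (visits, conversion_rate) pairs."""
    total_visits = sum(visits for visits, _ in segments)
    total_conversions = sum(visits * rate for visits, rate in segments)
    return total_conversions / total_visits


# (visits, conversion rate in the variation) for segments A and B.
day_1 = [(5000, 0.01), (500, 0.05)]
day_6 = [(500, 0.01), (5000, 0.05)]

if __name__ == "__main__":
    print(f"Day 1 overall conversion: {blended_rate(day_1):.2%}")
    print(f"Day 6 overall conversion: {blended_rate(day_6):.2%}")
```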

The way to mitigate this is to ensure that the test runs for long enough to capture a representative cross-section of the audience. There is no single time frame that works for all tests, and often it is about a business cycle. For example, if I run tests in the billing area of a site then ideally these run for a month, because customers are billed monthly and therefore there is a cycle of a month in which things will shift and change. As a rule of thumb, though, 2 weeks should be the absolute minimum unless you are prepared to relax your confidence in the longer term benefit.
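One way to turn this into a planning number is sketched below in Python. It takes the sample size you need per variation and your typical daily traffic per variation, applies the 2 week minimum, and then rounds up to whole business cycles. The weekly default cycle and the rounding rule are my own assumptions for illustration, not something the rule above prescribes, and the function name is hypothetical.

```python
import math


def planned_test_days(required_per_variation, daily_visitors_per_variation,
                      cycle_days=7, minimum_days=14):
    """Suggest how many days to run a test.

    cycle_days: length of the business cycle (7 is an assumed default;
                use 30 or so for something like a monthly billing cycle).
    minimum_days: the 2 week rule-of-thumb floor.
    """
    if daily_visitors_per_variation <= 0:
        raise ValueError("daily traffic must be positive")

    # Days needed purely to collect enough visitors (Rule 1).
    days_for_sample = math.ceil(required_per_variation
                                / daily_visitors_per_variation)

    # Never go below the minimum, then round up to complete cycles.
    days = max(days_for_sample, minimum_days)
    return math.ceil(days / cycle_days) * cycle_days


if __name__ == "__main__":
    print(planned_test_days(8000, 1000))                 # high-traffic page
    print(planned_test_days(8000, 300))                  # lower-traffic page
    print(planned_test_days(8000, 1000, cycle_days=30))  # monthly cycle
```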

Deeper understanding

Whilst you do not need to be a statistician to run and interpret A/B tests, it helps if you familiarise yourself with the basics and the different challenges. Here are some more resources:

Other tools & calculators:

http://www.evanmiller.org/ab-testing/

http://getdatadriven.com/ab-significance-test

https://vwo.com/ab-split-test-significance-calculator/

Learning more about the statistics:

http://conversionxl.com/ab-testing-statistics/

https://blog.kissmetrics.com/how-ab-testing-works/

https://vwo.com/blog/what-you-really-need-to-know-about-mathematics-of-ab-split-testing/
