We recently introduced a new statistical calculation engine, called Advanced Stats, based on a Bayesian approach to statistical testing. This approach offers you greater flexibility when analysing results and produces better-quantified information than is possible with fixed-horizon tests such as Chi².
This new statistics engine means changes to reporting as you know it. From now on, we display additional indicators, as shown on the next screen. The aim of this article is to explain how to read this new data and interpret the results of your tests.
The statistical test is designed to answer one simple question: which is the best variation? To do so, it relies on just the following information for each variation: the number of visitors and the number of conversions.
A naive answer to this question is to calculate the conversion rate of each variation and take the variation with the best rate. That would be valid if you had an infinite number of visitors, because your conversion rate measurements would then have infinite accuracy. Since infinite accuracy is beyond our reach, we must handle the uncertainty of our conversion rate measurements to best inform the decision process.
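To make this concrete, here is a minimal sketch (not the engine's actual code) of the naive comparison and of why it is unreliable with finite samples: two samples with the same observed rate can carry very different amounts of certainty. The classical standard error of a proportion is used purely as an illustration.

```python
import math

def conversion_rate(conversions: int, visitors: int) -> float:
    """The naive measurement: conversions divided by visitors."""
    return conversions / visitors

def standard_error(conversions: int, visitors: int) -> float:
    """Classical standard error of a proportion: sqrt(p * (1 - p) / n)."""
    p = conversions / visitors
    return math.sqrt(p * (1 - p) / visitors)

# Same observed rate (2%), very different certainty:
for conv, vis in [(2, 100), (200, 10_000)]:
    p = conversion_rate(conv, vis)
    se = standard_error(conv, vis)
    print(f"{vis} visitors: rate = {p:.2%} +/- {se:.2%}")
```

With 100 visitors the uncertainty is ten times larger than with 10 000, even though the measured rate is identical; this is the uncertainty the Bayesian engine quantifies explicitly.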
Let us take as an example the report given above, and interpret the results.

                                            Variation A (original)   Variation B
  Data
    Number of visitors                      2503                     1239
    Number of conversions                   49                       38

  Conventional "fixed horizon" measurements
    Conversion rate                         ~1.96 %                  ~3.07 %
Conclusion: a first reading would suggest that we gain about one conversion point by implementing Variation B, at a 97% confidence level. We'll see later on that this isn't quite true.

  Advanced Stats, Bayesian measurements
    95% confidence intervals                [1.49 % ; 2.57 %]        [2.25 % ; 4.18 %]
Confidence level of there being a difference between A and B = 98% 
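The intervals above can be approximated with a short simulation. This is a hedged sketch, not the engine's implementation: we assume a uniform Beta(1, 1) prior on each conversion rate (an assumption the article does not state), which gives a Beta(conversions + 1, visitors − conversions + 1) posterior, and we read the 95% credible interval off Monte Carlo samples.

```python
import random

def credible_interval(conversions, visitors, level=0.95, draws=100_000, seed=0):
    """Monte Carlo credible interval for a conversion rate.

    Assumes a uniform Beta(1, 1) prior, hence a
    Beta(conversions + 1, visitors - conversions + 1) posterior.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(conversions + 1, visitors - conversions + 1)
        for _ in range(draws)
    )
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi

print(credible_interval(49, 2503))  # Variation A: close to [1.49 %, 2.57 %]
print(credible_interval(38, 1239))  # Variation B: close to [2.25 %, 4.18 %]
```

The exact bounds depend on the prior and the sampling, so the values only approximate the ones reported above.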

We now have the information above, which we can use to answer the question "which is the best variation?" In practice, however, Variation A has already been implemented, so the real question is rather: "what gain (or loss) will I see if I go from A to B?" To answer it, we must weigh together the uncertainty of what we give up with Variation A and the uncertainty of what we get with Variation B. To do this, we produce one final measurement: the gain.

95% confidence interval of the relative gain from changing A to B:
[2.82947 % ; 136.23 %], median = 57.8286 %

Conclusion: we now have a more nuanced view of the situation. It may be that the variation offers very little (the left bound of the gain interval is close to 0), but it is also possible that the relative gain reaches as high as 136%. The remaining scenarios (gain < 2.8 % or gain > 136 %) have only a 5% chance of occurring. The median value represents the "neutral" scenario, i.e. one which is neither optimistic nor pessimistic: the true gain is equally likely to be above or below 57.8%. In practice:
If the cost of implementing Variation B is low, it may be worthwhile implementing Variation B immediately. In the worst case, there isn't much gain, and in the best case, you gain “a lot”, but above all, the gain is immediate! 
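The gain measurement itself can be sketched the same way: sample both posteriors jointly and look at the distribution of the relative gain (B − A) / A. As before, the Beta(1, 1) prior is our assumption, not necessarily the engine's.

```python
import random

def gain_summary(conv_a, vis_a, conv_b, vis_b, draws=100_000, seed=0):
    """2.5%, 50% and 97.5% quantiles of the relative gain (B - A) / A.

    Each posterior is Beta(conversions + 1, visitors - conversions + 1),
    i.e. a uniform prior is assumed.
    """
    rng = random.Random(seed)
    gains = sorted(
        (b - a) / a
        for a, b in (
            (rng.betavariate(conv_a + 1, vis_a - conv_a + 1),
             rng.betavariate(conv_b + 1, vis_b - conv_b + 1))
            for _ in range(draws)
        )
    )
    return gains[int(0.025 * draws)], gains[draws // 2], gains[int(0.975 * draws) - 1]

lo, med, hi = gain_summary(49, 2503, 38, 1239)
print(f"95% gain interval: [{lo:.1%} ; {hi:.1%}], median = {med:.1%}")
```

With the report's figures, this reproduces (up to sampling noise and the choice of prior) an interval in the region of the [2.8 %, 136 %] band and a median near 58% quoted above.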
Here are some example interpretations of other scenarios.
Example 1 (A ~= B)
Conversion rates for variations
The conversion rates for A (1.84%) and B (1.76%) are very close. The confidence intervals for these measurements, [1.45, 2.28] and [1.38, 2.19] respectively, have considerable overlap, indicating a very high probability that the difference in the measurement is simply due to chance.
Gain measurement
Even though the observed gain is negative, the confidence interval [-31.2156 %, 31.9887 %] is wide and straddles zero, indicating that no difference in performance between A and B can be established.
Conclusion
A and B show more or less the same performance.
Example 2 (A < B)
Conversion rates for variations
As percentages, the confidence intervals for the performances of A and B are [1.97, 3.1] and [2.75, 4.13] respectively. These intervals partially overlap, suggesting a non-zero probability that the two variations are equivalent. Going on this information alone, we might be tempted to wait until we had collected more data before reaching a decision.
Gain measurement
Nonetheless, the gain interval is strictly positive: there is a 95% chance that the gain lies within the interval [0.143955, 84.2233], and the median (36.1243%) gives a neutral estimate of the gain, neither optimistic nor pessimistic.
Conclusion:
B is better than A, and probably a lot better. This example demonstrates that the most indicative and significant measurement is the gain. Performance measurements for A and B are not always enough to analyse the situation: a superficial analysis based on information about A and B only (without the gain) could lead to the conclusion that A might still outperform B, and hence that we need to wait for more visitors before taking a risk-free decision. Analysing the gain interval shows that this probability is actually very low (2.5%). We may therefore conclude straight away that B is better than A.
Example 3 (A ~< B)
Conversion rates for variations
The median conversion rate for A is 2%, with values in the range [1.63, 2.4], whereas B's is 2.41%, with an interval of [2.01, 2.86]. While Variation B appears better, the fact that the intervals partially overlap (over [2.01, 2.4]) still leaves room for doubt.
Gain measurement
Unlike example 2, there is considerable uncertainty over the gain measurement: the interval [-7.08888, 57.3778] is wide and includes negative values. This is why the confidence level is lower than the commonly accepted threshold of 95%.
Conclusion
In all likelihood there is a difference in performance between these variations, but it is wiser to wait and gather more information. This will reduce the confidence interval for the measured gain. Our hope is, of course, that this interval will move out of the negative range, in which case we will be able to make a decision.
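"Gather more information" has a quantifiable effect: with the same observed rates but more traffic, the credible interval narrows. A minimal sketch with hypothetical numbers (and, again, an assumed uniform Beta prior):

```python
import random

def interval_width(conversions, visitors, draws=50_000, seed=0):
    """Width of the 95% credible interval, assuming a Beta(1, 1) prior."""
    rng = random.Random(seed)
    s = sorted(rng.betavariate(conversions + 1, visitors - conversions + 1)
               for _ in range(draws))
    return s[int(0.975 * draws) - 1] - s[int(0.025 * draws)]

w1 = interval_width(20, 1000)   # ~2% conversion rate, 1 000 visitors
w4 = interval_width(80, 4000)   # same rate, four times the visitors
print(f"width at n: {w1:.2%}, width at 4n: {w4:.2%}")
```

Quadrupling the sample roughly halves the interval width, which is why waiting can move a gain interval clear of the negative range (or show that the effect was noise).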
Example 4 (A > B)
Conversion rates for variations
As in the previous example, the intervals for A and B partially overlap: [2.16, 3.15] % compared to [1.38, 2.66] %. However, analysing the gain interval shows that this overlap scenario (the two variations being equivalent) is highly unlikely.
Gain measurement
The gain confidence interval is entirely negative. There is only a 2.5% chance that the gain is above -7.36%, and likewise a 2.5% chance that it is below -49.7%.
Conclusion
Variation B performs less well than the original. The high confidence level (0.94) confirms an actual difference between the variations, whatever that difference may be (high, low or negative).
Overall conclusion
- The confidence level enables us to identify tests that point to a "result" (positive or negative). This result may be statistically significant, but may not represent a real gain in practice (a low gain compared to high implementation costs, or a corresponding financial gain that is actually low). You therefore cannot decide based on this measurement alone.
- Conversion rate medians give a "reasonable" order of magnitude for the performance of each variation. Confidence intervals, on the other hand, give an indication of the uncertainty of these measurements: the narrower the intervals, the less the uncertainty.
- The confidence interval for the gain gives us an indication of what outcome we can actually expect by replacing A with B. This is therefore the measurement that should guide your decision making.
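The examples above can be distilled into a simple, hypothetical decision helper based on the 95% gain interval alone. The classification rule is illustrative, not part of the product, and the thresholds assume gains expressed as fractions:

```python
def recommend(gain_low: float, gain_high: float) -> str:
    """Recommendation from a 95% credible interval on the gain (as fractions)."""
    if gain_low > 0:
        return "implement B (gain interval strictly positive)"
    if gain_high < 0:
        return "keep A (gain interval strictly negative)"
    return "inconclusive: gather more data (gain interval straddles zero)"

print(recommend(0.0014, 0.842))    # Example 2: implement B
print(recommend(-0.497, -0.0736))  # Example 4: keep A
print(recommend(-0.312, 0.320))    # Example 1: inconclusive
```

Remember from the overall conclusion that even a "strictly positive" interval should be weighed against implementation cost before acting.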
Best practices
While our new statistics engine provides you with greater flexibility when analysing results and enables you to make decisions more quickly, it does not spare you from a few caveats. For example, the duration of your tests must cover at least one purchasing cycle, and ideally two. Web users do not purchase straight away once they discover your site. They gather information, compare prices… so that between the time of being included in one of your tests and the time of actual conversion, there could be a one-, two- or even three-week lag. If your purchasing cycle is three weeks and you only test over a single week, your sample will not be representative.
Similarly, your sample must include all of your traffic sources (email campaigns, sponsored links, social networks, etc.) and you must ensure that none of these sources is overrepresented. It is therefore important to be aware of your acquisition campaigns and, if possible, avoid testing over such periods, when there are atypical sources of traffic to your site.