Statistics in the reporting

Statistical indicators characterize the observed eligible metrics for each variation, as well as the differences between variations for the same metrics. They allow you to make informed decisions for the future based on a proven Bayesian statistical tool. 

When you observe a raw growth of X%, the only certainty is that this observation took place in the past, in a context (time of year, current events, specific visitors, …) that won’t happen again in the exact same way in the future.

By using statistical indicators to reframe these metrics and associated growth, you get a much clearer picture of the risk you are taking when modifying a page after an A/B test.

Statistical indicators are displayed with the following metrics: 

  • All “action tracking” growth metrics (click rate, scroll tracking, dwell time tracking, visible element tracking)
  • Pageviews growth metrics
  • Transaction growth metrics (except average product quantity, price, and revenue)
  • Bounce rate growth 
  • Revisit rate growth

Statistical indicators are not displayed with the following metrics:

  • Transaction growth metrics for average product quantity, price, and revenue
  • Number of viewed pages growth

Lastly, statistical indicators are only displayed on visitor metrics, not on session metrics. The former are generally the focus of optimizations; as a consequence, our statistical tool was designed with them in mind and is not compatible with session data.

These indicators are displayed on all variations, except the one used as the baseline.

Definitions and use cases

Confidence interval based on Bayesian tests

The confidence interval indicator is based on the Bayesian test. The Bayesian statistical tool calculates the confidence interval of a gain (or growth), as well as its median value. They enable you to understand the extent of the potential risk related to putting a variation into production following a test.

Where to find the confidence interval

In the reporting, the confidence interval is visible in the "statistics" metric view:

How to read and interpret the confidence interval

Our Bayesian test stems from the calculation method developed by mathematician Thomas Bayes. It is based on known events, such as the number of conversions on an objective in relation to the number of visitors who had the opportunity to reach it, and provides, as we have seen above, a confidence interval on the gain as well as its median value. Bayesian tests enable sound decision-making thanks to nuanced indicators that provide a more complete picture of the expected outcome than a single metric would.

In addition to the raw growth, we provide a 95% confidence interval.

“95%” simply means that we are 95% confident that the true value of the gain is situated between the two values at each end of the interval.

👉 Why not 100%?

In simple terms, it would lead to a confidence interval of infinite width, as there will always be a risk, however minimal.

“95%” is a common statistical compromise between precision and the timeliness of the result.

The remaining 5% is the error, equally divided below and above the lower and higher bounds of the interval, respectively. Please note that, of that 5%, only 2.5% would lead to a worse outcome than expected. This is the actual business risk.

👉 As seen previously, the confidence interval is composed of three values: the lower and higher bounds of the interval, and the median.

Median growth vs Average growth:

These values can often be very close to one another, while not matching exactly. This is normal and shouldn’t be cause for concern.

In the following example, you can see that the variation has a better growth than the original: 5.34%.

Zooming in on the confidence interval visualization, we see the following indicators:

  • Median growth: 5.38%
  • Lower-bound growth: 0.13%
  • Higher-bound growth: 10.84%

An important note is that every value in the interval has a different likelihood (or chance) of actually being the real-world growth if the variation were put into production:

  • The median value has the highest chance
  • The lower-bound and higher-bound values have a low chance

👉 Summarizing:

  • Getting a value between 0.13% and 10.84% in the future has a 95% chance of happening
  • Getting a value below 0.13% has a 2.5% chance of happening
  • Getting a value above 10.84% has a 2.5% chance of happening

👉 Going further, this means that:

  • If the lower-bound value is above 0%: your chances to win in the future are maximized, and the associated risk is low;
  • If the higher-bound value is under 0%: your chances to win in the future are minimized, and the associated risk is high;
  • If the lower-bound value is under 0% and the higher-bound value is above 0%: your risk is uncertain. You will have to judge whether the impact of a potential future loss is worth the risk, whether waiting for more data could remove the uncertainty, or whether another metric in the campaign report can be used to make the decision.
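
To illustrate these three rules, here is a minimal Python sketch. The helper function and its messages are hypothetical, written for this article only, and are not part of any AB Tasty API.

```python
def read_interval(lower_bound: float, upper_bound: float) -> str:
    """Classify the risk of pushing a variation live from the bounds
    of its 95% confidence interval on the gain (in %).
    Hypothetical helper, for illustration only."""
    if lower_bound > 0:
        return "low risk: the whole interval is positive"
    if upper_bound < 0:
        return "high risk: the whole interval is negative"
    return "uncertain: wait for more data or look at other metrics"


# With the interval from the example above:
print(read_interval(0.13, 10.84))  # low risk: the whole interval is positive
```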

The smaller the interval, the lower the level of uncertainty: at the beginning of your campaign, the intervals will probably be wide. Over time, they will tighten until they stabilize.

In any case, AB Tasty provides these Bayesian tests and statistical metrics to help you make an informed decision, but cannot be held responsible for a bad decision. The risk is never zero: even if the chance of losing is very low, it doesn’t mean it can’t happen at all.

Chances to win

This metric is another angle on the confidence interval. It answers the question, “What are my chances of getting a strictly positive growth in the future with the variation I’m looking at?” (or a strictly negative growth in the case of the bounce rate, which should be as low as possible).

The chance to win enables fast result analysis for non-experts. The variation with the biggest improvement is shown in green, which simplifies the decision-making process.

The chance to win indicator enables you to ascertain the odds of a strictly positive gain on a variation compared to the original version. It is expressed as a percentage.
When the chance to win is higher than 95%, the progress bar turns green.

Like any percentage of chances displayed in betting, it focuses on the positive part of the confidence interval.

Like the confidence interval, the chance to win metric is based on the Bayesian test.

This metric is always displayed on all variations, except the one used as the baseline.

Where to find the chance to win

In the reporting, the chance to win is visible in the "statistics" metric view:

How to read and interpret the chance to win

This index assists with the decision-making process, but we recommend reading the chance to win in addition to the confidence intervals, which may display positive or negative values.

The chance to win can take values between 0% and 100% and is rounded to the nearest hundredth.

  • If the chance to win is equal to or greater than 95%, this means the collected statistics are reliable and the variation can be implemented with what is considered to be low risk (5% or less). 
  • If the chance to win is equal to or lower than 5%, this means the collected statistics are reliable and the variation shouldn’t be implemented: the risk of it underperforming is considered high (95% or more).
  • If the chance to win is close to 50%, it means that the results seem “neutral”: AB Tasty can’t identify a clear trend from the collected data to let you make a decision.
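
These thresholds can be sketched in a few lines of Python. The helper below is purely illustrative (a hypothetical function, not part of any AB Tasty API); the 95%, 5%, and around-50% thresholds are the ones described above.

```python
def read_chance_to_win(ctw: float) -> str:
    """Interpret a chance-to-win value expressed in % (0-100).
    Hypothetical helper, for illustration only."""
    if ctw >= 95:
        return "reliable: the variation can be implemented (risk of 5% or less)"
    if ctw <= 5:
        return "reliable: the variation should not be implemented"
    if 45 <= ctw <= 55:
        return "neutral: no clear trend in the collected data"
    return "inconclusive: keep collecting data and check the confidence interval"


print(read_chance_to_win(98.23))  # reliable: the variation can be implemented ...
```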

👉 What does this mean?

  • The closer the value is to 0%, the higher the odds of it underperforming compared to the original version, and the higher the odds of having confidence intervals with negative values.
  • At 50%, the test is considered “neutral”, meaning that the difference is below what can be measured with the available data. There is as much chance of the variation underperforming compared to the original version as there is of it outperforming the original version. The confidence intervals can take negative or positive values. The test is either neutral or does not have enough data.
  • The closer the value is to 100%, the higher the odds of recording a gain compared to the original version. The confidence intervals are more likely to take on positive values.

If the chance to win displays 0% or 100% in the reporting tool, these figures are rounded (up or down). A statistical probability can never equal exactly 100% or 0%. It is, therefore, preferable to display 100% rather than 99.999999% to facilitate report reading for users.

Statistics computation

There are two kinds of statistical tools depending on the type of data analyzed:

  • For conversion data, corresponding to the notion of success and failure rates, we use a Bayesian framework. Typical data is the act of purchasing, reaching a given page, or consenting to subscribe to a newsletter. This framework gives us a chance to win index and a confidence interval for the estimated gain.
  • For transaction data, like the cart value, we use the Mann-Whitney U test, which is robust to "extreme" values.
    This test does not provide a confidence interval, so it only tells whether the average cart value goes up or down; no information is given about the estimated gain.

Conversion data

For click data, we use a Bayesian framework where clicks are represented as binomial distributions, whose parameters are the number of tries and a success rate. In the digital experimentation field, the number of tries is the number of visitors and the success rate is the click or transaction rate. It is important to note that the rates we are dealing with are only estimates based on a limited number of visitors. To model this limited accuracy, we use beta distributions (which are the conjugate prior of binomial distributions).

These distributions model the likelihood of a success rate measured on a limited number of trials.

Let’s take an example:

  • 1,000 visitors on A with 100 successes
  • 1,000 visitors on B with 130 successes

We build the model

Ma = beta(1 + success_a, 1 + failures_a)

Where success_a = 100, and failures_a = visitors_a – success_a = 900.

(Note: the 1+ comes from the fact that this distribution can also have another shape and then model a different type of process.)

For the three following graphs, the horizontal axis is the click rate while the vertical axis is the likelihood of that rate knowing that we had an experiment with 100 successes in 1,000 trials.

[Graph 1: likelihood of each click rate for model Ma (100 successes out of 1,000 trials)]

We observe that 10% is the most likely rate, 5% or 15% are very unlikely, and 11% is half as likely as 10%.

The model Mb is built the same way with data from experiment B:

Mb = beta(1 + 130, 1 + 870)

[Graph 2: likelihood of each click rate for model Mb (130 successes out of 1,000 trials)]

For B, the most likely rate is 13%, and the width of the curve is close to that of the previous one.
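
The two models above can be reproduced in a few lines of Python. This is a sketch of the construction just described, using SciPy's beta distribution, not AB Tasty's production code:

```python
from scipy.stats import beta

# A: 1,000 visitors, 100 successes -> Ma = beta(1 + 100, 1 + 900)
m_a = beta(1 + 100, 1 + 900)
# B: 1,000 visitors, 130 successes -> Mb = beta(1 + 130, 1 + 870)
m_b = beta(1 + 130, 1 + 870)

# The mode of beta(a, b) is (a - 1) / (a + b - 2), i.e. the most likely rate:
mode_a = (101 - 1) / (101 + 901 - 2)  # 0.10 -> 10% for A
mode_b = (131 - 1) / (131 + 871 - 2)  # 0.13 -> 13% for B

# .pdf(x) draws the likelihood curves shown in the graphs above:
print(m_a.pdf(0.11) / m_a.pdf(0.10))  # ~0.6: 11% is roughly half as likely as 10%
print(m_b.pdf(0.13))                  # the peak of B's curve
```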

Then we compare A and B rate distributions.

[Graph 3: the A and B rate distributions compared]

We see an overlapping area around a 12% conversion rate, where both models have a similar likelihood.

To estimate the overlapping region, we need to sample from both models to compare them.

We draw samples from distributions A and B:

  • s_a[i] is the i-th sample from A
  • s_b[i] is the i-th sample from B

Then we apply a comparison function to these samples:

  • The relative gain: g[i] = 100 * (s_b[i] – s_a[i]) / s_a[i] for all i.

It is the difference between the possible rates for B and A, relative to A (multiplied by 100 for readability, in %).
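
A NumPy sketch of this sampling and comparison step (again illustrative, not the production implementation):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 1_000_000

# Draw samples from the two posterior models built above.
s_a = rng.beta(1 + 100, 1 + 900, size=n_samples)
s_b = rng.beta(1 + 130, 1 + 870, size=n_samples)

# Relative gain of B over A, in % (the comparison function above).
g = 100 * (s_b - s_a) / s_a
```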

We can now analyze the samples g[i] with a histogram:

[Graph 4: histogram of the relative gain samples g]

We see that the most likely value of the gain is around 30%.

The yellow line shows where the gain is 0, meaning no difference between A and B. Samples that are below this line correspond to cases where A > B, and samples on the other side are cases where A < B.

We then define the gain chances to win as:

CW = (number of samples > 0) / total number of samples

With 1,000,000 (10^6) samples for g, we have 982,296 samples that are > 0, making B > A ~98% probable.

We call this the “chances to win” or the “gain probability” (the probability that you will win something).

Using the same sampling method, we can compute classic analysis metrics like the mean, median, percentiles, etc.

Looking back at the previous chart, the vertical red lines indicate where most of the blue area lies: intuitively, which gain values are the most likely.

We have chosen to expose a best and worst-case scenario with a 95% confidence interval. It excludes the 2.5% most extreme best cases and the 2.5% most extreme worst cases, leaving out a total of 5% of what we consider rare events. This interval is delimited by the red lines on the graph. We consider that the real gain (as if we had an infinite number of visitors to measure it) lies somewhere in this interval 95% of the time.

In our example, this interval is [1.80%; 29.79%; 66.15%], meaning that it is quite unlikely that the real gain is below 1.80%, and it is also quite unlikely that it is above 66.15%. And there is an equal chance that the real gain is above or under the median, 29.79%.
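
Continuing the NumPy sketch (same sampling setup as before), both indicators are one line each. Exact figures will vary slightly between runs because of sampling noise:

```python
import numpy as np

# Same sampling setup as the previous sketch.
rng = np.random.default_rng(seed=0)
s_a = rng.beta(1 + 100, 1 + 900, size=1_000_000)
s_b = rng.beta(1 + 130, 1 + 870, size=1_000_000)
g = 100 * (s_b - s_a) / s_a

# Chances to win: proportion of samples where B beats A.
print((g > 0).mean())                     # ~0.982, i.e. B > A is ~98% probable

# 95% interval and median: 2.5th, 50th and 97.5th percentiles of g.
print(np.percentile(g, [2.5, 50, 97.5]))  # roughly [1.8, 29.8, 66.2], as above
```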

It is important to note that, in this case, a 1.80% relative gain is quite small and maybe not worth implementing, at least not yet, even if the best-case scenario is very appealing (66%). This is why, in practice, we suggest waiting for at least 5,000 visitors per variation before calling a test "ready", in order to obtain a smaller confidence interval.

Transaction data

For data like transaction values, we use the Mann-Whitney U test for its robustness to extreme values.

A few customers placing very large orders can raise a variation’s average order value while representing only a small number of people. Imagine that an A/B test contains 10 extreme values (say, 10 customers who each spend 20 times the average order value). Since the assignment is purely random, the chance that these 10 visitors are not evenly split between A and B is quite high. This will create a noticeable difference between the average order values of A and B, but that difference may not be statistically significant because of the small number of visitors involved.

So it is important to trust the chances to win provided by this statistical test. It’s not uncommon to see the observed average order value go up while the statistic says that the chance to win is below 50%, showing the opposite trend. The reverse may also happen: an observed negative trend for the average cart value can be a winner if the chances to win are above 95%.
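
Below is a SciPy sketch of this behavior on made-up cart values. The data and numbers are invented for illustration; only the use of the Mann-Whitney U test itself reflects the method described above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=1)

# Hypothetical cart values: both groups share the same typical order value,
# but B happens to receive 10 extreme orders by pure chance.
carts_a = rng.lognormal(mean=4.0, sigma=0.5, size=500)
carts_b = np.concatenate([rng.lognormal(mean=4.0, sigma=0.5, size=490),
                          np.full(10, 1_000.0)])

print(carts_b.mean() - carts_a.mean())  # the averages differ noticeably...
stat, p_value = mannwhitneyu(carts_a, carts_b, alternative="two-sided")
print(p_value)  # ...but the rank-based test is barely moved by the outliers
```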

Limitations

The Bonferroni correction is a method that takes into account the risk linked to the presence of several comparisons (variations).

In the case of an A/B Test, if there are only two variations (the original and Variation 1), it is estimated that the winning variation may be implemented if the chance to win is equal to or higher than 95%. In other words, the risk incurred does not exceed 5%.

In the case of an A/B test with two or more variations (the original version, Variation 1, Variation 2, and Variation 3, for instance), if one of the variations (let’s say Variation 1) performs better than the others and you decide to implement it, this means you are favoring this variation over the original version, as well as over Variation 2 and Variation 3. In this case, the risk of loss is multiplied by three (5% multiplied by the number of “abandoned” variations).

A correction is therefore automatically applied to tests featuring two or more variations. Indeed, the displayed chance to win takes into account the risk related to abandoning the other variations. This enables the user to make an informed decision with full knowledge of the risks related to implementing a variation.
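
The principle can be sketched as follows. This is an illustrative formula consistent with the explanation above; the exact correction AB Tasty applies internally is not documented here.

```python
def bonferroni_corrected_risk(raw_risk: float, abandoned_variations: int) -> float:
    """Bonferroni-style correction: the raw risk of a wrong decision is
    multiplied by the number of abandoned variations. Illustrative sketch."""
    return min(1.0, raw_risk * abandoned_variations)


# Picking one winner out of the original + 3 variations abandons 3 alternatives:
print(bonferroni_corrected_risk(0.05, 3))  # 0.15 -> a 5% raw risk becomes 15%
```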

When the Bonferroni correction is applied, there may be inconsistencies between the chance to win and the interval displayed in the confidence interval tab. This is because the Bonferroni correction does not apply to confidence intervals.

Use cases

Case #1: High chance to win

In this example, the chosen goal is the revisit rate in the visitor view. The A/B Test includes three variations.

The conversion rate of Variation 2 is 38.8%, compared to 20.34% for the original version. Therefore, the increase in conversion rate compared to the original equals 18.46%.

The chance to win displays 98.23% for Variation 2 (the Bonferroni correction is applied automatically because the test includes three variations). This means that Variation 2 has a 98.23% chance of triggering a positive gain, and therefore of performing better than the original version. The chance of this variation performing worse than the original equals 1.8%, which is a low risk.

Because the chance to win is higher than 95%, Variation 2 may be implemented without incurring a high risk.

However, to find out the gain interval and reduce the risk percentage even more, we would also need to analyze the advanced statistics based on the Bayesian test.

Case #2: Neutral chance to win

If the test displays a chance to win around 50% (between 45% and 55%), this can be due to several factors:

  • Either traffic is insufficient (in other words, there haven't been enough visits to the website and the visitor statistics do not enable us to establish reliable values)
    • In this case, we recommend waiting until each variation has clocked 5,000 visitors and a minimum of 500 conversions.
  • Or the test is neutral because the variations haven't shown an increase or a decrease compared to the original version: this means that the tested hypotheses have no effect on the conversion rate.
    • In this case, we recommend referring to the confidence interval tab. This will provide you with the confidence interval values.
      If the confidence interval does not enable you to ascertain a clear gain, the decision will have to be made independently from the test, based on external factors (such as implementation cost, development time, etc.).

Case #3: Low chance to win

In this example, the chosen goal is the CTA click rate in visitor view. The A/B Test is made up of a single variation.

The conversion rate of Variation 1 is 14.76%, compared to 15.66% for the original version. Therefore, the conversion rate of Variation 1 is 5.75% lower than the original version.

The chance to win displays 34.6% for Variation 1. This means that Variation 1 has a 34.6% chance of triggering a positive gain, and therefore of performing better than the original version. The chance of this variation performing worse than the original equals 65.4%, which is a very high risk.

Because the chance to win is lower than 95%, Variation 1 should not be implemented: the risk would be too high.

  • In this case, you can view the advanced statistics to make sure the confidence interval values are mostly negative.
