Avoiding Pitfalls in AOV Analysis for Conversion Rate Optimization

When optimizing your conversion rate (CRO), it's easy to rely on Average Order Value (AOV) as a key metric. However, AOV is easily misinterpreted and often leads to incorrect conclusions. Let's explore why this happens and how to use AOV effectively.

What Is AOV?

Average Order Value (AOV) is calculated by dividing the total value of orders by the number of orders. It’s a convenient metric to track how much customers are spending on average. Many CRO professionals use it to compare the performance of different A/B test variations. Unfortunately, focusing on AOV alone can lead to inaccurate insights.
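As a quick illustration, here is the calculation in Python, using a handful of hypothetical order values:

```python
# Minimal sketch: AOV is total order value divided by the number of orders.
# The order values below are hypothetical.
order_values = [50.0, 51.0, 30.0, 75.0, 77.0]  # in €

aov = sum(order_values) / len(order_values)
print(f"AOV: €{aov:.2f}")  # €56.60
```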

Common Mistake: Outliers in A/B Tests

Imagine running an A/B test to compare two versions of your website. If just one customer makes an unusually large purchase, the AOV for that variation could spike significantly. This single outlier can make it seem like the variation is outperforming the other, even though it might not be true in a broader context. This issue is exacerbated in small sample sizes, but even large datasets are not immune.

Example: The Impact of a High-Value Order

Consider an A/B test where two subgroups are being compared:

  • Variation A has an AOV of €56.60, with 5 orders of €50, €51, €30, €75, and €77.
  • Variation B has an AOV of €56.80, with 5 orders of €52, €50, €40, €62, and €80.

At first glance, this small difference seems insignificant. Now, imagine that a new customer places an order worth €202 in Variation B. This dramatically increases the AOV of Variation B to €81, with 6 orders of €52, €50, €40, €62, €80, and €202.
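A minimal Python sketch reproduces these figures and shows how much a single order moves the average:

```python
# Reproducing the example above: one high-value order shifts Variation B's AOV.
def aov(orders):
    return sum(orders) / len(orders)

variation_a = [50, 51, 30, 75, 77]
variation_b = [52, 50, 40, 62, 80]
print(f"A: €{aov(variation_a):.2f}, B: €{aov(variation_b):.2f}")  # €56.60 vs €56.80

variation_b.append(202)  # a single outlier order
print(f"B with outlier: €{aov(variation_b):.2f}")  # €81.00
```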

This spike could falsely suggest that Variation B is significantly better, even though the lift comes from a single high-value purchase. The opposite can also happen: an outlier can make a variation appear to underperform when there is no real effect, or mask a genuine winner, making it look like a loser. These misleading results are more likely with small samples, but larger datasets are not immune, which is why careful analysis is needed in both directions.

Why AOV Can Be Misleading

We will break down why relying on Average Order Value (AOV) can be misleading in conversion rate optimization (CRO). There are two key factors that impact the reliability of AOV as a metric:

  1. The number of customers: This factor works in favor of reliability. Generally, as the number of customers increases, metrics become more stable.
  2. The magnitude of the maximum order value: This factor works against precision. As the number of customers increases, the highest values in the dataset are also likely to increase, which introduces instability into the metric.

The "law of large numbers" is often mentioned in this context as a reason to trust AOV when the number of customers is large enough. However, it is crucial to understand that increasing the number of customers does not always solve the problem. In fact, as the customer base grows, the maximum values in order size are also likely to grow, which means that these two factors effectively counterbalance each other. As a result, the AOV can remain unstable even with a larger dataset.

To illustrate this, let’s look at data collected from an e-commerce site. In the graph showing basket values:

  • The horizontal axis represents the number of basket values collected.
  • The vertical axis represents the maximum value found in the list of collected basket values.

From the graph, we see that as more data is collected, the maximum values tend to increase. This behavior can affect the stability of AOV. Next, let’s consider the evolution of the average value as we collect more data:

  • The horizontal axis represents the number of baskets collected.
  • The vertical axis represents the average value of the baskets collected.

At first glance, it may seem that the average value stabilizes quickly. On closer inspection, however, it takes about 20,000 customers to reach what looks like stabilization. Reaching 20,000 customers requires far more visitors because of conversion rates: at a typical conversion rate of 5%, you need 400,000 visitors (400,000 × 0.05 = 20,000). Since an A/B test splits traffic between two variations, each needing its own 20,000 orders, you would require a total traffic volume of around 800,000 visitors.
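The same arithmetic as a small helper, using the figures from the paragraph above:

```python
# Sketch: visitors needed to collect a target number of orders per variation.
def visitors_needed(orders_per_variation, conversion_rate, n_variations=2):
    return n_variations * orders_per_variation / conversion_rate

print(f"{visitors_needed(20_000, 0.05):,.0f} visitors")  # 800,000
```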

Even when it appears that AOV is stabilizing, this stabilization is often deceptive. If we zoom in on the data beyond 20,000 customers, we notice fluctuations of -€1 to +€1 as more data is collected. 

This means that, even with a large number of data points, each variation's AOV still drifts within a €2-wide range (±€1). In an A/B test, the difference between two variations' AOVs can therefore reach ±€2 even when no real effect exists, suggesting a difference that isn't there.

In other words, an A/A test, where there is no real difference between the variations, could still show an AOV difference of up to ±€2 purely due to these fluctuations. This makes it impossible to confidently measure any effect smaller than ±€2 just by observing AOV.
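A quick A/A simulation illustrates this. Both groups are drawn from the same assumed lognormal distribution (a stand-in for real basket data), yet their AOVs still land measurably apart by chance alone:

```python
# Sketch of repeated A/A tests: both groups share one (assumed) distribution,
# so any AOV difference is pure noise.
import numpy as np

rng = np.random.default_rng(7)
diffs = []
for _ in range(1_000):
    a = rng.lognormal(mean=4.0, sigma=0.8, size=20_000)
    b = rng.lognormal(mean=4.0, sigma=0.8, size=20_000)
    diffs.append(a.mean() - b.mean())

# 95th percentile of the absolute AOV gap observed with no real effect
print(f"chance AOV gap (95th pct): €{np.percentile(np.abs(diffs), 95):.2f}")
```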

Impact on Related Metrics

This problem extends beyond AOV:

  • Revenue per Visitor (RPV): Since it divides the total order value by the number of visitors, RPV can also show large, misleading jumps driven by big spenders (see the sketch after this list).
  • Total Revenue: Summing all cart values can be misleading even without explicitly averaging. Comparing these large totals produces the same effects as comparing averages, particularly when each variation receives a similar number of visitors: rare, high-value orders are unlikely to be evenly distributed across variations, creating an illusion of significant differences.
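A minimal sketch of the RPV case, with hypothetical figures; one outlier order moves RPV just as it moves AOV:

```python
# Sketch: RPV = total order value / total visitors (figures are hypothetical).
orders = [52, 50, 40, 62, 80]
visitors = 1_000

print(f"RPV: €{sum(orders) / visitors:.3f}")  # €0.284

orders.append(202)  # one big spender
print(f"RPV: €{sum(orders) / visitors:.3f}")  # €0.486, a 71% jump from one order
```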

How to Analyze AOV Correctly

To make informed decisions, use statistical tools like the Mann-Whitney U test, which is well suited to skewed data such as order values. Rather than comparing raw AOV figures, which outliers can distort, the test compares the distributions of order values in the two groups: it evaluates whether the overall ranking of values differs significantly between them. This makes it far more robust to variations caused by extreme values.

It's not uncommon for the Mann-Whitney result to contradict the raw AOV comparison, for example indicating a loss even when the AOV appears to show a gain. This is why statistical validation is crucial.
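As a minimal sketch, here is the test applied with SciPy's mannwhitneyu to the example orders from earlier, including the €202 outlier:

```python
# Sketch: Mann-Whitney U test on the earlier example orders.
from scipy.stats import mannwhitneyu

variation_a = [50, 51, 30, 75, 77]       # AOV ≈ €56.60
variation_b = [52, 50, 40, 62, 80, 202]  # AOV = €81.00 due to the outlier

stat, p_value = mannwhitneyu(variation_a, variation_b, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
# The high p-value shows no evidence of a real difference, despite the large AOV gap.
```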

Conclusion

While AOV is a useful metric, relying solely on it in A/B testing can lead to erroneous conclusions due to its susceptibility to outliers and inherent fluctuations. Always back AOV analysis with appropriate statistical testing to ensure your CRO decisions are based on stable, meaningful data.
