What is Sequential Testing and how does it work?
In a classic fixed-horizon test, you shouldn’t check your experiment before the end, because every peek is another chance to make the wrong decision.
Sequential Testing addresses this problem. It is still a fixed-horizon test, but a correction is applied to account for the fact that you will take several opportunities to decide whether to continue or stop the experiment. The only radical decision it will ever make — stopping the experiment — happens because a loser has been detected.
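As a sketch of the idea, here is a minimal interim check in Python. Conversion rates are compared with a pooled z-test, and the significance threshold is tightened by splitting the error budget across the planned number of looks (a simple Bonferroni-style correction, used here as an illustrative assumption — it is not claimed to be the product’s actual algorithm). Only a significantly negative result, i.e. a loser, stops the test:

```python
import math
from statistics import NormalDist

def sequential_loser_check(conv_a, n_a, conv_b, n_b, looks_planned, alpha=0.05):
    """One interim look: flag variation B only if it is significantly
    BELOW the reference A. The alpha budget is split across the number
    of planned looks so repeated peeking doesn't inflate the error rate.
    (Hypothetical helper -- a sketch, not the vendor's algorithm.)"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (p_b - p_a) / se
    # One-sided corrected threshold: only a significantly *negative* z
    # (B behind A) triggers a stop; a positive z never stops the test.
    z_crit = NormalDist().inv_cdf(alpha / looks_planned)  # negative value
    return z < z_crit
```

For example, a variation at 3.5% against a 5% reference on 10,000 visitors each is flagged, while an even variation is not.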
How does it not go against AB Testing rules?
Rule °1: In AB Testing, you should sample each day of the week equally. People do not behave the same way depending on the day: if Friday is your biggest business day, your sample must include Fridays in the same proportion as the other days. You want your data to be representative of your overall business, so you need to sample every day of the week the same way.
Obviously this rule is broken if you use sequential testing, since you may stop the test mid-week. The difference here is that you only stop to declare a loser, never a winner. If the B variation is significantly behind the reference, there is no chance it will both catch up with the reference and climb significantly above it in just a few days. So you are not leaving a winner on the table; in the worst-case scenario you declare B a loser when in fact B is equivalent to A, and sampling all days equally would only have told you that later.
So this “worst-case scenario” is in fact a good thing: you saved visitors for another test.
Rule °2: In AB Testing, you should wait 2 business cycles before considering the data significant and making a decision. True — when you’re looking for a winner. But here we’re looking for a losing experiment. So when the statistical test declares a loser, at worst it is telling you that your variation is even with the reference. In practice that’s not a costly mistake, since your goal is to find a clear winner, not to spend visitors on a neutral test.
Depending on the sensitivity level, how many false alarms would be triggered?
If we assume that all experiments are neutral, then 5%, 3% or 1% of experiments will trigger a false alarm, depending on the sensitivity level selected (High, Balanced, Low).
In practice that figure is an overestimate, because not all experiments are neutral: real winners and losers, which are not counted here, reduce the share of alarms that are false. For example, with the Low sensitivity level (1%), it’s unlikely that you’ll ever receive a false alarm.
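The peeking effect behind these numbers can be checked with a quick Monte Carlo sketch. The assumptions here (daily peeks, a pooled z-test, a Bonferroni-style correction) are illustrative only — none of this is claimed to be the product’s exact algorithm:

```python
import math
import random
from statistics import NormalDist

def simulate_false_alarms(n_experiments=300, days=14, daily_n=200,
                          rate=0.05, alpha=0.05, corrected=True, seed=0):
    """Run A/A (neutral) experiments and count how often a 'loser'
    alert fires. With a per-look correction the false alarm rate stays
    near alpha; with naive daily peeking at full alpha it inflates.
    (Illustrative sketch, not the vendor's exact procedure.)"""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf((alpha / days) if corrected else alpha)
    alarms = 0
    for _ in range(n_experiments):
        ca = cb = na = nb = 0
        for _ in range(days):
            # Both arms draw from the same conversion rate (A/A test).
            ca += sum(rng.random() < rate for _ in range(daily_n))
            cb += sum(rng.random() < rate for _ in range(daily_n))
            na += daily_n
            nb += daily_n
            pooled = (ca + cb) / (na + nb)
            se = math.sqrt(pooled * (1 - pooled) * (1/na + 1/nb))
            if se and (cb/nb - ca/na) / se < z_crit:
                alarms += 1
                break  # the experiment was stopped: one false alarm
    return alarms / n_experiments
```

Running it with `corrected=False` shows a clearly inflated false alarm rate compared to the corrected version — which is exactly why the correction exists.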
What is the behavior with an A/B/C test?
The more variations you have, the more comparisons are made.
Here, the sequential testing algorithm compares A vs B and A vs C. If B or C underperforms, the alerting system is triggered.
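The behaviour can be sketched as a loop over the variations, each one compared to the reference independently. The function and callback names below are hypothetical, chosen only for illustration:

```python
def check_all_variations(reference, variations, is_underperforming):
    """Sketch of the A/B/C behaviour: the reference is compared to each
    variation independently (A vs B, A vs C, ...), and every losing
    variation triggers its own alert. `is_underperforming` stands in
    for the statistical test comparing one variation to the reference."""
    return [name for name, data in variations.items()
            if is_underperforming(reference, data)]

# Usage with a toy rule ("more than 20% behind the reference loses"):
losers = check_all_variations(
    {"conv": 500},
    {"B": {"conv": 300}, "C": {"conv": 520}},
    lambda ref, var: var["conv"] < ref["conv"] * 0.8,
)
```

Here only B is reported, since C is not behind the reference.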
In case of an experiment with more than one variation, what happens when it’s underperforming?
It depends on your setup. If one of the variations is underperforming and you only selected to receive an alert, then you’ll be warned that the variation is losing. If you chose to stop the whole experiment, the experiment will be stopped.
If you have only selected to be alerted and 3 of your experiment’s 4 variations are underperforming, you’ll receive an alert for each of them.
Can I receive an alert after having been already alerted for a variation?
The alerting system is limited to one alert per variation per day, so if you don’t take any action you may receive one alert per day for each underperforming variation.
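The one-alert-per-variation-per-day rule amounts to a small throttle, which can be sketched like this (hypothetical helper, not the actual implementation):

```python
from datetime import date

class AlertThrottle:
    """Allow at most one alert per variation per day, as described
    above. (Illustrative sketch only.)"""

    def __init__(self):
        self._last_sent = {}  # variation name -> date of last alert

    def should_send(self, variation, today):
        if self._last_sent.get(variation) == today:
            return False  # already alerted for this variation today
        self._last_sent[variation] = today
        return True
```

A second underperformance signal for the same variation on the same day is silently dropped; the next day, the alert fires again if no action was taken.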
How does the alerting system work?
There are 4 parameters that drive the decision:
- The volume of traffic
- The difference between variations
- Time elapsed since the beginning of the experiment
- The sensitivity level set: the more sensitive, the sooner the alert
Depending on the combination of those parameters, an alert is triggered or not.
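The four parameters above can be combined in a single decision sketch. The mapping from sensitivity to an error budget reuses the 5%/3%/1% figures mentioned earlier; the alpha-spending scheme itself is an assumption made for illustration, not the published algorithm:

```python
import math
from statistics import NormalDist

# Error budget per sensitivity level (the 5% / 3% / 1% figures above).
SENSITIVITY_ALPHA = {"High": 0.05, "Balanced": 0.03, "Low": 0.01}

def should_alert(conv_ref, n_ref, conv_var, n_var,
                 days_elapsed, planned_days, sensitivity="Balanced"):
    """Combine the four inputs -- traffic volume (n_ref, n_var), the
    difference between variations, elapsed time, and sensitivity --
    into one alert decision. A sketch only."""
    alpha = SENSITIVITY_ALPHA[sensitivity]
    # Spread the alpha budget over the remaining daily looks, so early
    # looks (little data, many looks left) need stronger evidence.
    remaining_looks = max(planned_days - days_elapsed + 1, 1)
    z_crit = NormalDist().inv_cdf(alpha / remaining_looks)
    pooled = (conv_ref + conv_var) / (n_ref + n_var)
    se = math.sqrt(pooled * (1 - pooled) * (1/n_ref + 1/n_var))
    if se == 0:
        return False
    z = (conv_var / n_var - conv_ref / n_ref) / se
    return z < z_crit  # only a significant *loss* triggers an alert
```

With more traffic or a larger gap behind the reference, z becomes more negative and the alert fires sooner; a higher sensitivity loosens the threshold.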
How does the alerting system react in case my primary KPI doesn’t apply to my variation?
Since the sequential testing algorithm runs on the Primary KPI, a variation with 0 conversions on that KPI will almost certainly trigger an alert to warn you.
Note that here it’s because the Primary KPI doesn’t apply to your variation, but the same alert might also reveal an implementation issue, and catching it early could save your experiment.