Impact of Imprecise A/B Test Triggering on Test Duration

Introduction to Triggering

A/B testing has become a key part of the product development toolkit. As a result, having enough time and traffic to reach valid conclusions from A/B tests has become one of the major constraints for many teams. In this post I’ll look into one of the lesser-known practices of A/B testing that can help with this constraint, and evaluate if and when it makes sense to invest in it.

At FindHotel we A/B test most of the changes we make to our website, both to make sure they have a positive impact on the user experience and to increase our knowledge of how our users behave. As more teams across the company make regular use of A/B testing and the number & complexity of the tests we run increases, we need to develop more advanced experimentation capabilities. One of these is triggering (also called targeting): the selection of which users to include in an experiment. While a core requirement of experimentation is that users are randomly distributed between the variations being tested, that doesn’t mean every user should be included in every experiment. Only users who are affected by the change being tested should be included in the analysis of the experiment.

Here are some examples going from simpler to more complicated:

  • When testing a change to the mobile experience, only trigger the test for users on a mobile device. Don’t include desktop & tablet users.
  • When testing a change to the calendar, only trigger the test when a user expands the calendar (assuming that the change is only visible after expanding the calendar).
  • When testing the addition of a “Price Drop” tag to hotels on a search results page, only trigger the test when a user sees a hotel with this tag (the city they search for might not have any hotels with the tag, or the tagged hotels might appear further down the results than the user scrolled).
  • When testing a new API endpoint that serves more accurate prices on the checkout, only trigger the test when a price served by the new endpoint differs from the one that would have been served by the old one.

While the first two examples are obvious and simple to implement accurate triggering for in most off-the-shelf or custom-built experimentation tools, the last two are less obvious and require more advanced experimentation capabilities and engineering resources, which many teams might not have.

But why is triggering the test only for the affected users such a big deal? Why can’t we just include all users and measure any improvements across the whole user base? We could actually do this and still run technically valid experiments. But doing so would dilute the experiment with non-affected users and add noise to the results, which increases the traffic & time required to reach statistical significance, or reduces our ability to detect smaller changes. It would therefore be harder to identify winning or losing experiments, and we’d end up with more inconclusive experiments or be forced to run experiments longer. As both conclusive results & experimentation time are scarce resources, we can appreciate the potential importance of triggering accurately.

Quantifying the impact of imprecise triggering


But how bad is the actual impact of not triggering the experiment accurately? We need to know this so that we can decide how much to invest in tackling it. To start building an understanding, let’s first play around with an A/B testing sample size calculator like Evan Miller's.

Let’s run through an example, using the calendar experiment mentioned in the introduction:

  • Case 1 (perfect triggering, no dilution): We want to run an experiment to improve the calendar on our search results page. Let’s say that the conversion rate of users who expand the calendar is 5%, and we’d like to be able to detect a relative change of at least 10% to this conversion rate. Based on the calculator, this would require 30,244 users per variation (link). If we had 10,000 users per day who expanded the calendar (and we triggered the experiment only for these users) and we ran an experiment with two variations (A/B test), we would have 5,000 users per day for each variation. So we would need to run the test for around 6 days.
  • Case 2 (imprecise triggering, with 50% dilution): What if, instead of triggering the experiment only when users expand the calendar, we trigger it for all users on the search results page? Let’s assume that half of the users expand the calendar and that we have a total of 20,000 users per day on the search results page. Let’s also assume that the conversion rate of users who do not expand the calendar is the same as that of those who do (an important but tricky assumption, more on this later). We now have a ‘dilution ratio’ of 50%: half of the users targeted are not actually affected by the experiment since they never see the calendar change. We cannot celebrate having twice as many users in the experiment, because we now need to detect a smaller change. In Case 1 we were measuring a 10% relative change; now we need to measure a 5% relative change, because if the conversion rate of the affected half of users increases by 10% on the b-side of the experiment, the overall conversion rate of all users on the b-side only increases by 5%. How does this affect the sample size required? Using the same calculator, detecting a 5% relative change on a base conversion rate of 5% requires 120,146 users per variation (link). Since we now have 10,000 users per day for each variation, we would need to run the test for around 12 days. (A code sketch of both cases follows this list.)
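For those who prefer code to calculators, here is a minimal Python sketch of the same two cases, using the standard normal-approximation sample size formula for comparing two proportions. The numbers come out slightly different from Evan Miller's calculator (which uses a somewhat different approximation), but the roughly 2x difference in duration between the two cases is the same.

```python
# A sketch of Case 1 vs Case 2 above, assuming unaffected users convert at the same
# rate as affected users (the "same conversion rate" assumption discussed later).
from math import sqrt, ceil
from scipy.stats import norm

def sample_size_per_variation(base_rate, relative_mde, alpha=0.05, power=0.80):
    """Users needed per variation to detect a relative lift on a conversion rate."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * pooled * (1 - pooled))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Case 1: trigger only the 10,000 users/day who expand the calendar (5,000 per variation).
n1 = sample_size_per_variation(base_rate=0.05, relative_mde=0.10)
print(f"Case 1: {n1:,} users per variation, ~{n1 / 5_000:.1f} days")

# Case 2: trigger all 20,000 users/day on the results page (10,000 per variation).
# With 50% dilution, the detectable relative lift on the whole population halves.
n2 = sample_size_per_variation(base_rate=0.05, relative_mde=0.05)
print(f"Case 2: {n2:,} users per variation, ~{n2 / 10_000:.1f} days")
```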

So we figured out, using an off-the-shelf calculator, that diluting the traffic by half would double the time we’d need to run the experiment to detect a 10% relative change on the affected users with a base conversion rate of 5%. But what about other situations, with different combinations of base conversion rate, minimum detectable effect (MDE) and dilution rate? Which of these variables affects the outcome most? When would investing in more precise triggering make sense, and when would it not be worth the effort? We’d need to manually run through many such examples to get a somewhat complete picture. So instead, let’s be lazy and turn to Google to see if someone has calculated all of this before and shared it with the world.

Google “research”: While triggering is not mentioned in most of the popular A/B testing “best practices” and “common mistakes” resources (one of the good ones is this one from the CXL Institute), it is mentioned in some of the more in-depth or advanced sources.

So we have validation of the importance of triggering from reliable sources, and one or two data points about its actual impact on experiment duration, but no source that maps out the impact of triggering across a range of situations to help decide on the ROI of investing in each of them. Time to get our hands dirty in a spreadsheet and run some calculations, then.

Spreadsheet modeling: So I created this spreadsheet that calculates, similarly to how the sample size calculators work, how much longer a test would need to run (to be able to detect the same effect on the affected users) at varying levels of dilution. As a reminder, I define dilution as the share of users in an experiment that do not experience the change being tested. So if 5% of users are affected by an experiment and the test only targets these users, there is 0% dilution. If the test instead targets all users, there is 95% dilution.

Following the assumption that dilution does not affect the baseline conversion rate, the only effect of dilution is that it decreases the Minimum Detectable Effect (MDE). Based on this decrease in MDE we can calculate how many additional users would need to be included in the experiment to detect this level of change. The original MDE targeted, the base conversion rate, the confidence level and the statistical power have almost no impact on this increase, so we can generalize a global relationship between dilution and the increase in the time required to run a test, which should hold for all tests (assuming the conversion rate is the same in the diluted vs non-diluted groups, again more on this below). This is what that chart looks like:

Not the kind of chart where we’d like to see a hockey stick! Since the increase is so steep, putting the y-axis on a logarithmic scale makes it more readable. Here’s how that looks:

How to read these charts and the spreadsheet:

  • 10% dilution means we'd need to run the test 11% longer
  • 25% dilution means we'd need to run the test 33% longer
  • 50% dilution means we'd need to run the test 99% longer (~2 times longer)
  • 75% dilution means we'd need to run the test 298% longer (~4 times longer)
  • 95% dilution means we'd need to run the test 1888% longer (~20 times longer)

So it looks like the test would need to run roughly 1/(1 - dilution) times as long. Probably something we could have calculated analytically as well, but I leave that to another time.
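As a quick numerical check of this relationship, here is a small sketch that reproduces the spreadsheet’s calculation, reusing the sample_size_per_variation helper from the earlier sketch. The numbers land close to, though not exactly on, the figures above, since the spreadsheet and this formula use slightly different approximations.

```python
# For a given dilution d, the detectable relative lift on the whole population shrinks
# by a factor of (1 - d), while the daily eligible traffic grows by 1 / (1 - d).
# The net effect on test duration is the ratio of the two.
base_rate, relative_mde = 0.05, 0.10          # same example as the calendar test

n_precise = sample_size_per_variation(base_rate, relative_mde)

for d in (0.10, 0.25, 0.50, 0.75, 0.95):
    n_diluted = sample_size_per_variation(base_rate, relative_mde * (1 - d))
    duration_multiplier = (n_diluted / n_precise) * (1 - d)   # more users needed, but more traffic too
    print(f"{d:.0%} dilution -> ~{duration_multiplier - 1:.0%} longer "
          f"(approximation: 1/(1-d) - 1 = {1 / (1 - d) - 1:.0%})")
```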

Conclusion on when investing in precise triggering makes sense: The impact of dilution increases sharply from around 80% dilution. Even moderate amounts of dilution, around 50%, can be problematic for teams that already struggle to reach sufficient sample sizes and statistical significance, while high amounts of dilution (over 80%) make changes practically untestable. So triggering experiments only for the relevant users is very important to get right in these situations, and is likely worth the extra engineering effort. A dilution of 25% or less, where most of the users in the experiment are affected by the change, causes only a minor increase in test duration, so in some cases it might not be worth the extra effort to get perfectly precise triggering.

Advanced considerations, more examples, solutions to & risks of triggering

About the “same conversion rate” assumption: I mentioned a few times that one of the key assumptions in the above calculations is that the conversion rate is the same for users affected and not affected by the experiment. There are many cases in experimentation where this does not hold. In particular, if most experimentation is focused on features & components that are on the user’s critical conversion path (which it should be), then the users interacting with these features & components will often have a higher conversion rate than the users who don’t interact with them. In these cases the negative impact of imprecise triggering on test duration will be milder than the above calculations suggest.

Let’s look at some of the examples we had used in the introduction:

  • Testing a change to the calendar: Many of our customer journeys require using the calendar on the way to making a booking, so users who expand the calendar are likely to have a higher conversion rate than those who don’t. This means the value of precise triggering is lower here, but it is still important, as many users do convert without using the calendar.
  • Testing a “Price Drop” tag: Users who see hotels that would have been assigned a price drop tag are not likely to convert very differently from those who don’t. So in this case the impact of dilution will be similar to what the calculations above show.

So this is a warning not to take the above numbers too literally. For the sake of brevity I won’t go through the exact impact of dilution when the conversion rates differ, but you can check it for yourself by playing with the sample size calculators or adjusting the assumptions in my spreadsheet.
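To make the adjustment concrete, here is a hypothetical sketch (reusing the sample_size_per_variation helper from earlier) for the calendar example, where I’ve made up a 2% conversion rate for users who never expand the calendar against 5% for those who do.

```python
# Affected users (calendar expanders) convert at 5%, unaffected users at a made-up 2%,
# with 50% dilution and a 10% relative lift on the affected users only.
p_affected, p_unaffected, dilution, mde = 0.05, 0.02, 0.50, 0.10

n_precise = sample_size_per_variation(p_affected, mde)

# Only the affected share of users moves, so the blended baseline sits between the two
# groups' rates and the blended lift is diluted by (1 - dilution).
blended_control = (1 - dilution) * p_affected + dilution * p_unaffected
blended_treatment = (1 - dilution) * p_affected * (1 + mde) + dilution * p_unaffected
blended_relative_lift = blended_treatment / blended_control - 1
n_diluted = sample_size_per_variation(blended_control, blended_relative_lift)

duration_multiplier = (n_diluted / n_precise) * (1 - dilution)
print(f"~{duration_multiplier:.1f}x the duration, vs ~2x when both groups convert at 5%")
```

The diluting users’ lower conversion rate pulls the blended baseline (and hence its variance) down, which is why the penalty comes out smaller than under the same-rate assumption.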

How can we have more precise triggering?

So now that we have seen the importance of precise triggering, what can we do about it? There are two high-level approaches to achieving accurate triggering:

  1. Trigger the test only for the users who see or interact with the change being tested as they are using the product (requires being able to identify that view/interaction and trigger the experiment on it in real time).
  2. Trigger the test for all users, but afterwards limit the analysis to only the users who saw or interacted with the change being tested (doesn’t require real-time identification or triggering; identification can be done retrospectively).

Both achieve the end goal of making decisions about an experiment based only on the data from users who were really exposed to the test. The main difference is that, in some cases, the second option of identifying the relevant users after everyone has been exposed to the test might be technically simpler, as identifying affected users in real time and triggering accordingly can be tricky from an engineering point of view. Either way, both approaches require investing in tracking capabilities, such as firing events when a user interacts with certain features or components (like the calendar), so that these events can be used (in real time or retrospectively) to identify which users should be triggered or analysed for the experiment. Triggering experiments based on page views alone would not be enough in many cases.
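As an illustration of the second (retrospective) approach, here is a hypothetical sketch in Python: the table layout and the calendar_expanded event name are made up, but the idea is simply to join the experiment assignments with the tracking events before analysing.

```python
# Retrospective triggering: bucket everyone, but only analyse users who fired the
# trigger event. Data shapes below are illustrative, not our actual schema.
import pandas as pd

assignments = pd.DataFrame({
    "user_id":   [1, 2, 3, 4],
    "variation": ["A", "B", "A", "B"],
    "converted": [0, 1, 1, 0],
})
events = pd.DataFrame({
    "user_id": [2, 3],
    "event":   ["calendar_expanded", "calendar_expanded"],
})

# Keep only users who actually expanded the calendar, then analyse as usual.
triggered_users = events.loc[events["event"] == "calendar_expanded", "user_id"].unique()
triggered = assignments[assignments["user_id"].isin(triggered_users)]
print(triggered.groupby("variation")["converted"].agg(["count", "mean"]))
```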

There are cases where simply firing events to see which users saw or experienced the component being tested would not be enough either. Here are two of the examples from the introduction and how they can be more complicated than they might initially seem:

  • The experiment where we tested adding a “Price Drop” tag to some hotels: For some users the results might include multiple hotels with this tag, while for others it might not include any. So how do we determine whether a user actually saw the tag? We might need to implement an event in our front-end tracking that fires whenever this tag appears in the user’s browser viewport (so it might also require investing in scroll-depth tracking). Using this event we can know which users saw the tag and only include them in the experiment or its analysis. But this is not enough! As the tag exists only on the b-side of the experiment, the event would never fire on the a-side. We also need to know which users on the a-side would have seen the tag had they been on the b-side (the counterfactual). It’s getting more complicated... So we would also need to run the algorithm that determines which hotels get the tag on the a-side of the experiment, without showing it. Doing all this adds engineering complexity and requires more time & resources, which might not have been in the original scope of the experiment.
  • And the experiment where we were testing a new API endpoint to serve more accurate prices to users in the checkout: The new endpoint is supposed to be more accurate, but the problem is that most of the prices from the new endpoint will be the same as those from the old endpoint; only a small share of prices will change. If we include all users in the experiment, dilution might be very high, and we might not notice whether the conversion rate of users who are getting different prices changed or not. But how can we identify which users experienced different results due to the new endpoint? We’d have to query both API endpoints from both variations of the test to be able to compare them. This is unlikely to have been part of the original solution, where the simplest approach would have been to query the old API on the a-side and the new one on the b-side. Achieving this kind of implementation, especially for the first time, could take more time & effort than the rest of the design & engineering of the experiment combined. (See the sketch after this list.)
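Here is a hypothetical sketch of what counterfactual triggering could look like for the pricing-endpoint example. The function and event names are made up; the point is that both endpoints are called on both sides of the test, and the experiment is only triggered (or an exposure event logged) when they disagree.

```python
# Counterfactual triggering sketch: old_price_api and new_price_api stand in for the
# two endpoints, track_event for our analytics tracking. All names are illustrative.
def get_checkout_price(user, variation, old_price_api, new_price_api, track_event):
    old_price = old_price_api(user)   # extra call, needed only to detect a difference
    new_price = new_price_api(user)

    if old_price != new_price:
        # This user is genuinely affected by the change: trigger the experiment here,
        # or log an exposure event to filter on retrospectively in the analysis.
        track_event(user, "price_endpoint_experiment_triggered", variation)

    return new_price if variation == "B" else old_price

# Toy usage with stand-in implementations:
price = get_checkout_price(
    user="user-123",
    variation="B",
    old_price_api=lambda u: 100.0,
    new_price_api=lambda u: 97.0,
    track_event=lambda u, name, var: print(f"{name} fired for {u} ({var})"),
)
```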

Long story short, setting up precise triggering can be both more complicated and require more engineering work. And beyond the extra time required to set up the triggering, which might delay the test launch significantly, there are other potential risks to narrowing the targeting down to only the relevant users, such as:

  • The extra code & logic required for the triggering increases the chance of bugs, unexpected consequences or a degraded user experience (querying two APIs might slow response times or cause timeouts more frequently).
  • The targeting might become too narrow (the change being tested might only be visible on mobile, but some users who experience the test on mobile might afterwards go on to complete their purchase on desktop).

So it’s potentially quite a lot of work to get this right for every experiment, and there are risks. That doesn’t mean it is always the right decision to make this investment in triggering. While for many tests it’s critical to implement precise triggering, for others it might not be worth it, especially if the level of dilution & the differences in conversion rate mean that the test duration would not increase significantly without precise triggering.

Conclusion

Calling both API endpoints from both sides of the test? Implementing scroll-depth tracking and matching it to the locations of hotels on the page? Implementing these kinds of tracking and experimentation, especially for the first time, can be complicated, take time and come with their own risks. They might even take more time to set up than all the other work needed to launch these tests. Are they worth it just to be able to get the triggering of the experiment right?

Based on the statistics above, for tests with low dilution (<30-50%) precise triggering is not absolutely necessary, and in some cases it might be more efficient to skip it, especially if the effort required is high. But for tests with dilution of more than 50-70%, accurate triggering is a must-have if we want to be able to detect meaningful changes in the experiment within a useful period of time.

Another thing to consider is that we likely won’t be running only one test that benefits from each of these tracking & triggering setups. Teams that are very experimentation-driven, like FindHotel, will likely be able to reuse their investment in triggering across many tests over a longer period of time. So it can make sense to invest in precise triggering as early as possible.

Triggered about working at FindHotel after reading this blog? Check our current vacancies on careers.findhotel.net