All other things being equal — 4 steps to run a rigorous A/B test

A/B tests are formidable decision-making tools in product management. Discover how to tap their full potential.

Clément Caillol
8 min read · Oct 12, 2021

This article is part of a series tackling the topic of A/B testing in the context of product management. If you are curious about the role A/B tests play in product development, and when to use them and when not to, follow me to be notified of upcoming publications!

The current article aims to provide readers with a simple four-step process for setting up rigorous A/B tests that actually bring answers. It builds on an earlier post, A/B testing: 10 common mistakes we all make, by Romain Ayres.

Back in university my girlfriend, who was majoring in Public Policy, picked a class with the weirdest of names: Randomized Experiments. As the semester went on I remember her telling me about the methodologies involved in evaluating public actions and measuring the effects of policy decisions, ceteris paribus.

Fast forward ten years and I am now working in Product Management, rambling on about the perils of testing ideas and features without rigor, and actually conducting randomized experiments on a daily basis.

As they reached critical mass, technology companies looked to the scientific world (public policy evaluation, pharmaceutical research) for ways to root their decision-making in solid evidence rather than in gut feeling. Randomized experiments became A/B tests, which then became standard. Bringing cold, hard evidence to the table was always going to be a hit in an industry where uncertainty is everywhere and users number in the billions.

Technology, however, is not (only) science, and some readers may have experienced a certain frustration when it comes to building a rigorous A/B testing methodology. I know I have! That is, before I had the chance to work with world-class data scientists such as Romain Ayres, Yohan Grember, Jacques Peeters and Armelle Patault back at ManoMano.


#1 — Start with the hypothesis

A/B tests are a fantastic tool for making decisions, and they are useful for answering business questions. The first step to building a rigorous A/B test is to formulate a hypothesis. It should reflect your inclination: the solution to which you would intuitively give the highest probability of success.

In practice
Let’s say you run an e-commerce website and are considering adding a floating add-to-cart button to the product page. You need to consider this change in terms of user behavior. Start by formalizing your hypothesis:

h = Users who see a floating add-to-cart button on the product page have a higher conversion rate than if they don’t see it.

test KPI = Conversion rate from the product page

Formalizing the hypothesis forces you to state out loud the expected outcomes of your test, and it serves as the basis on which you will evaluate results. Sometimes results will validate the hypothesis, other times they will contradict it, and other times still the test will fail to either confirm or contradict it (meaning perhaps the question was irrelevant).

Formalizing the hypothesis also lets you identify your test KPI. If your test is going to have an impact, you definitely have to define which metric will indicate that it does.
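To make this concrete, here is a minimal Python sketch of what writing the plan down might look like; the TestPlan structure and its field names are purely illustrative, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestPlan:
    """Hypothetical structure for a written-down A/B test plan."""
    hypothesis: str  # expected change in user behavior
    test_kpi: str    # the single metric that will judge the test

plan = TestPlan(
    hypothesis=("Users who see a floating add-to-cart button on the "
                "product page have a higher conversion rate than if "
                "they don’t see it."),
    test_kpi="conversion rate from the product page",
)
```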

CHECKLIST

☑ Formalize the hypothesis in terms of user behavior.
☑ Identify the test KPI you are going to follow.

#2 — Find the baseline

As we will discuss in a later post, A/B tests are most useful during the exploitation phase of product management. At heart, they are a great tool for improving on an existing basis, fine-tuning parameters in search of the optimum.

In order to determine whether your test achieved incremental improvement, you first need to identify what the current state of things is: what is your baseline?

In practice
In our earlier example we picked “conversion rate from the product page” as our test KPI. Let’s find out its current level, over a long enough period to smooth out seasonality or tracking errors:

Over the last 30 days, out of 100 users who visited the product page, 8 were found on the order confirmation page up to 2 hours later, meaning an 8% baseline conversion rate.

Baseline KPI = 8%

At this stage we are armed with our hypothesis, our test KPI and its baseline. Finding the baseline also forced us to define the conversion path and the conversion window: in our example we only measure users from the moment they display the product page to the moment they exit the funnel on the order confirmation page, and we only give them 2 hours to do so.
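As an illustration, here is a minimal Python sketch of that measurement. The input shape is hypothetical (dictionaries mapping each user to their first timestamp on each page); only the 2-hour window mirrors the definition above:

```python
from datetime import timedelta

CONVERSION_WINDOW = timedelta(hours=2)

def baseline_conversion_rate(product_page_views, order_confirmations):
    """Share of product-page visitors seen on the order confirmation
    page within the conversion window. Inputs are hypothetical dicts
    mapping user_id -> first timestamp on each page."""
    converted = sum(
        1
        for user, seen_at in product_page_views.items()
        if user in order_confirmations
        and timedelta(0) <= order_confirmations[user] - seen_at <= CONVERSION_WINDOW
    )
    return converted / len(product_page_views)
```

Fed with the article’s numbers (8 converters out of 100 visitors within the window), it returns 0.08.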

CHECKLIST

☑ Estimate your test KPI’s current state (aka the Baseline)
☑ Define the conversion path and conversion window

#3 — Place your bet

This is the most critical part of building a rigorous A/B test. You see, for statistical reasons I could not explain any better than Evan Miller has, you cannot simply let a test run its course until it reaches critical mass and then conclude on it. Doing so would have you commit the sin of stopping an A/B test too early or too late (more on this in the next post), basically stopping the test when it tells you what you want to hear.

Before even launching the test, you need to place your bet by deciding on a minimum detectable effect. This is crucial for determining the test’s duration, so make sure you are placing a sincere bet. It is a way of making a commitment before your test even starts.

In practice
In our example, we have a rough idea of the size of the effect we want to measure. A single add-to-cart button already leads 8% of users to convert, so we can expect a second one to have a sizeable impact as well: not as big as the first, but certainly bigger than changing a color.

minimum detectable effect = +/- 10%

We are talking about a relative change of 10% on our 8% of converting users (meaning the conversion rate we aim to detect lies between 7.2% and 8.8%, NOT 18% 😉).

This step serves to estimate the magnitude of the change you expect in user behavior. Much like planning poker, it will feel strange at first because you have no reference point, but try asking yourself: will this change affect user behavior in a small way (+/- 1 to 5%), a big way (+/- 10 to 20%) or dramatically (+/- 50%)? The sketch below spells out the arithmetic.
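Since the relative-versus-absolute distinction trips people up, here is the arithmetic from the example above, spelled out in Python:

```python
baseline = 0.08  # 8% conversion rate from the product page
mde_rel = 0.10   # +/- 10% *relative* minimum detectable effect

lower = baseline * (1 - mde_rel)  # 0.072 -> 7.2%
upper = baseline * (1 + mde_rel)  # 0.088 -> 8.8%
mde_abs = baseline * mde_rel      # 0.008 -> 0.8 percentage points, not 10
```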

CHECKLIST

☑ Determine the minimum detectable effect
☑ [optional] Use past tests to fine tune your estimation

#4 — Determine test duration

As we hinted above, A/B tests must have a strict, pre-defined duration to escape the risk of introducing biases into our conclusions. Determining the right duration for your test also involves some statistical concepts that I’m not qualified to explain properly, but just remember the coin story:


The coin story
Let’s imagine I give you a two-sided coin (heads and tails) and ask you to determine whether I corrupted the coin so that it always lands on heads. To find out, you throw it in the air 3 times, and it lands on heads 3 times. Is this really enough to conclude that the coin is corrupted?

This story, which I borrowed from former ManoMano Data Analyst Charles Goddet, illustrates the importance of sample size when drawing conclusions about probabilities. The probability that a fair coin lands on heads 3 times in a row is actually pretty high (1 in 8), compared with the probability that it always lands on heads if I throw it a million times.
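You can check the intuition in two lines of Python:

```python
p_three_heads = 0.5 ** 3            # 0.125: a fair coin does this 1 time in 8
p_million_heads = 0.5 ** 1_000_000  # so small it underflows to 0.0
```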

Pragmatically speaking, that means: if I only collect 3 observations my test is not robust enough for me to conclude, and if I collect 1 million observations I have wasted a hell of a lot of time just watching a coin land on heads. The right answer is somewhere in between, and it can be estimated very easily from your baseline and minimum detectable effect:

I use Evan Miller’s sample size calculator, which works wonders for estimating sample size.

Once you have estimated the number of observations needed to conclude on your test, you can estimate its duration by measuring how many days it will take on average to reach this figure. During the A/B test each version will only see half the traffic, so be sure to double the size of your sample.

In practice
s = Sample size (per variation) = 18,296
S = Sample size (total) = 2s = 36,592
d = Average daily # of users displaying the product page over 30 days = 5,000
Test duration = S / d = 7.3
My test should run for 8 days (rounding up).
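If you want to sanity-check those figures in code, here is a back-of-the-envelope sketch using the standard normal-approximation formula for comparing two proportions, at 5% significance and 80% power. It lands close to, but not exactly on, Evan Miller’s figure, because his calculator uses a slightly different formulation:

```python
import math
from scipy.stats import norm

def sample_size_per_variation(baseline, mde_rel, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-proportion test.
    A rough estimate only; use a dedicated calculator for real tests."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.ppf(power)           # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

s = sample_size_per_variation(0.08, 0.10)  # ~18,900, vs 18,296 from the calculator
days = math.ceil(2 * s / 5_000)            # both variations share traffic -> 8 days
```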

CHECKLIST

☑ Determine how many observations you need (per population).
☑ Determine how many users enter your funnel every day on average.
☑ Estimate your test duration (amount of time needed to gather the correct number of observations).

If you have followed the previous four steps rigorously, the last thing to do is simply to launch your A/B test and refrain from concluding on its results before the end of the test duration. That is, if you don’t want your test falling victim to one of the 5 ways to ruin an A/B test (🆕 post to be published in the following weeks). After the test has run, you will need to analyze its results, and that will be yet another publication 😉.

The publication of this story would not have been possible without the help of Romain Ayres, Jacques Peeters and Yohan Grember.
