The problem

Assume you have a website or a mobile app for your product or service. You have in mind a change to the user journey that you hope will positively affect an important metric, such as the conversion rate or the click-through rate. How do you know whether said change should be incorporated?

Performing an A/B test of the change allows us to determine whether incorporating it is worthwhile. Here, we show a simple illustration of this process, where we:

- work out the minimum sample size required for the experiment,
- run the experiment & collect the data,
- compare conversion rates between the control & test groups, &
- check whether we can rely on the results.

In reality, the bulk of the complexity is mainly down to the business aspects of A/B testing, which require coordination between the development, app design, & marketing teams. This is because a lot of the technical details of A/B testing are taken care of by testing suites, which run the A/A & A/B tests, as well as monitor any spillover effects on the guard-rail metrics that we may have defined earlier.

The business discussions might involve pinning down the following:

- the effect size we expect to see in the chosen metric,
- the alpha & power of the test,
- the split in proportion between the control & test groups,
- how long the experiment should run, &
- the guard-rail metrics to monitor for spillover effects.


What’s the min sample size required for our experiment?


This A/B test attempts to determine whether changing the position of a “Reserve Slot” button to a different, more strategic location on the screen could bump up the conversion rate.

We start by fixing all of the parameters prior to the start of the experiment, either by relying on our experience or after discussions with the necessary stakeholders. In this case, we determine that we’re likely to see an uptick of roughly 21 percentage points in the conversion rate, from 17% to 38% (our effect size). Additionally, the alpha & power are set to their typical levels, & we opt for an equal split between the control & test groups.

The resulting calculation indicates that we need a minimum of 131 samples to conduct the test.
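As a rough sketch, a calculation along these lines can be done with statsmodels’ power analysis for two proportions. The rates, alpha of 0.05 & power of 0.8 below are assumptions standing in for whatever the stakeholders agreed on, & the exact figure will depend on the method your testing suite uses:

```python
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed baseline & target conversion rates from the discussion above
p_control, p_test = 0.17, 0.38

# Cohen's h, the standardised effect size for a difference between two proportions
effect_size = proportion_effectsize(p_test, p_control)

# Minimum number of samples per group for a two-sided test
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # significance level
    power=0.8,         # 1 - probability of a type II error
    ratio=1.0,         # equal split between control & test groups
    alternative="two-sided",
)
print(math.ceil(n_per_group))
```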

A/B testing experiment data


We run the test by randomly assigning users of the app to either the control or the test group, with the button moved to its new location only for the test group. The experiment runs for a period of 1 month, covering the whole of January.
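A minimal sketch of how the random assignment might look, with made-up user IDs standing in for the app’s real user table:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical user IDs; in practice these would come from the app's user table
user_ids = np.arange(1, 589)

# 50/50 random assignment into control & test, as decided before the experiment
assignment = rng.choice(["control", "test"], size=len(user_ids))
groups = dict(zip(user_ids, assignment))
```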

Upon retrieving the data, we see that we’ve captured data for 588 users, well in excess of the minimum 131 samples required.

Perhaps we should revisit our assumptions about the duration of the experiment, as we could have stopped it sooner. Or perhaps this is down to a sudden influx of new sign-ups. In either case, we need to remember to investigate this after our analysis of the test, as it’ll have a bearing on follow-up experiments.

What are our conversion rates across each day of the week?


To account for changes in user behavior across the week, we plot the conversion rates for the two groups by day of the week.

Encouragingly, we see a clear difference in conversion rates between the two groups, with the test group showing a consistently higher conversion rate than the control group.
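One way this plot could be produced, assuming the experiment data sits in a hypothetical ab_test_january.csv with one row per user (their group, a 0/1 converted flag & the date they entered the experiment):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ab_test_january.csv", parse_dates=["date"])
df["day_of_week"] = df["date"].dt.day_name()

# The mean of the 0/1 converted flag is the conversion rate per weekday & group
daily_rates = (
    df.groupby(["day_of_week", "group"])["converted"]
      .mean()
      .unstack("group")
      .reindex(["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"])
)

daily_rates.plot(kind="bar", ylabel="Conversion rate",
                 title="Conversion rate by day of the week")
plt.tight_layout()
plt.show()
```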

What is the difference in conversion rates between the control & test groups?


The conversion rate of the test group averages about 38%, as opposed to the control group’s average of 17%. This is a good sign.
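These group averages amount to a simple group-by on the same hypothetical data:

```python
import pandas as pd

# Hypothetical experiment data, as in the earlier sketch
df = pd.read_csv("ab_test_january.csv")

# Overall conversion rate per group & the raw difference between them
group_rates = df.groupby("group")["converted"].mean()
uplift = group_rates["test"] - group_rates["control"]
print(group_rates)
print(f"Observed uplift: {uplift:.1%}")
```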

Can we rely on these results?


We can rely on the results of this experiment & conclude that moving the button is a worthwhile change: running a logistic regression on the experiment data shows that the effect of being in the test group is statistically significant, with its p-value falling below our chosen alpha.
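A sketch of that regression with statsmodels, again on the hypothetical data from the earlier snippets; the coefficient & p-value of interest are those on the test group term:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical experiment data, as in the earlier sketches
df = pd.read_csv("ab_test_january.csv")

# Logistic regression of the 0/1 conversion flag on the group assignment,
# with the control group as the reference level
model = smf.logit(
    "converted ~ C(group, Treatment(reference='control'))", data=df
).fit()
print(model.summary())
```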
