10 lessons from A/B testing

24 Aug

INTRODUCTION

We implemented A/B testing into our product 6 months ago. During that time we conducted a variety of A/B tests to generate insights about our user’s behaviour. We learnt a lot about our specific product. More generally, we learnt about how to run valuable A/B tests.

Below is a Buzzfeed-esque TOP 10 LESSONS I learnt RUNNING A/B TESTS. It’s tips & tricks – plus things to avoid doing. It’s written from a product/BA perspective.

Experimentation

 

10 LESSONS

Lesson 1: A/B vs MVT testing

Lesson 1

A/B and MVT testing are very similar. Infact the terms are sometimes used interchangeably.

A/B and MVT tests both serve up different experiences to the audience and measure which experience performs the best. They are both run with the same 3rd party tools (e.g. Optimizely, Maxymiser) and have a similar experiment lifecycles.

The key difference between A/B and MVT tests is how many elements they vary to the audience.

A/B test

This is where you change one element of a page (e.g. the colour of a button). You might compare a blue button (challenger) against a red button (control) and examine what effect the button’s colour has on user behaviour. For example:

Button Colour | Variant Name |

| Blue                  | Challenger       |

| Red                   | Control            |

Pros: Simple to build, faster results, easier to interpret results

Cons: limited to one element of a user experience (e.g. button colour)

As a note – A/B tests aren’t limited to 2 variants. You could show a blue button, red button, purple button etc; as long as you change only one element of an experience (button colour) it’s an A/B test.

MVT test

This is where you change a combination of elements. You might compare changing the button colour and its text label. You would test all combinations of those changes and see what effect it has on user behaviour. For example:

| Button Colour | Text Copy | Variant Name |

| Blue                  | Click here  | Challenger 1   |

| Blue                  | Click          | Challenger 2   |

| Red                  | Click here   | Challenger 3   |

| Red                  | Click           | Control           |

Pros: Greater insights, identifies the optimal user experience, more control

Cons: Longer to get results, more complex, requires more traffic

Which one to pick?

This depends on what you want to test & your testable hypothesis. In the early stages of running experiments you might start with A/B tests and then move onto MVT tests. This is because A/B tests are simpler to create & interpret. MVT tests are slightly more complex but provide greater product insight.

As an example: we ran an MVT experiment where we changed the promotional copy on a page and a CTA label. We thought both elements would impact the click-through rate. The result was that the winning promotional copy was emotive copy. The best CTA was “Get started“. However the optimal variant was descriptive copy with “Get started“.Why? Perhaps because the tone between the two elements was more aligned. If we had run this as 2 A/B tests then we wouldn’t have identified the optimal combination.

 

Lesson 2: Have a clear hypothesis

Huty1913428

An experiment is designed to test a hypothesis. The purpose of an experiment is to make a change and analyse the effect. Tests need to have a clear reason and a measurable outcome.

When creating an A/B test its crucial to create a clear hypothesisWhat is the problem you’re trying to solve? What are the success metrics? Why do you think this change will have an effect?

We use a variation of the Thoughtworks format to write testable hypotheses:

We predict that <change>

Will significantly impact <KPI/user behaviour>

We will know this to be true when <measurable outcome>

By having clearly defined hypotheses we can:

  1. Compare the merits of different hypotheses and select the most valuable one first. For example if hypothesis 1 predicts a 5% uplift in a KPI and hypothesis 2 predicts a 50% uplift in the same KPI, then we would test hypothesis 2 first.
  2. Agree the success metric upfront before starting development. For example if changing the mobile navigation is the test, what are the success metrics: more users clicking on the menu button, more items in the menu being clicked, increased usage and retention of brand new users? Having clear success metrics/goals is key when trying to identify the winning variant later on.
  3. Ensure the test is focussed on solving a user problem or improving a KPI that matters to the product. We don’t want to run tests simply because we can – they need to solve problems and offer benefits. The above format aligns each test with business KPIs/user problems.
  4. Make it incredibly easy for anyone to generate a hypothesis. The Thoughtworks format means that anyone in our team can generate a hypothesis. Some of the best ideas we’ve had are from “non-creatives” such as QA.

Note – we often put a “background” section with research in the testable hypothesis (e.g. how many people currently use a feature, industry average, user feedback etc).

 

Lesson 3: Forecast sample size

Lesson 3

When designing an A/B experiment it’s crucial to calculate the sample size. You will need to forecast the sample size required to detect the MDE (Minimal Detectable Effect). This forecast will inform:

  1. Whether you can run the experiment (do you have enough users?)
  2. The maximum number of variants you can create
  3. What proportion of the audience will need to be in the experiment
  4. Potentially the experiment duration (e.g. it will take 2 weeks to get that many users)

There’s several tools online to help you forecast e.g. https://www.optimizely.com/resources/sample-size-calculator/. Without upfront forecasting you run the risk of creating an experiment that will never reach an outcome.

For example: imagine your product has 100k weekly users. You plug in the numbers and forecast that each variant requires 22k users to detect a 0.05 statistical effect size. That means you should build no more than 4 variants, otherwise you won’t detect a significant result. At least 44% of users need to be in the experiment (22% see a variant, 22% see a control). If the change is radical, based on these numbers you may only want to create one variant; this is because you don’t want to show the experiment/significant UX changes to a large proportion of the audience.

 

Lesson 4: More variants the better

Optimizely ran an analysis of their customers successful A/B tests. What they found was interesting. The more variants run in an experiment (up to a limit), the more likely you are to find an effect. Why?

One reason is that if you ask UX to create 2 variants they may create two similar visuals. If you ask them to create 8 there might be greater differences between them. It’s likely with 2 variants you’re playing it safe. The Optimizely results suggests running about 5 variants in a test:

Lesson 4

 

Lesson 5: Implement a health metric

The purpose of a health metric is to ensure that an experiment doesn’t maximise one KPI (the experiment’s primary goal) at the detriment of other KPIs. Popular health metrics include: average weekly visits, content consumption, session duration etc. Essentially health metrics are key business KPIs you don’t want to see go downduring an experiment. If the health metric fails, then you pull the experiment early, or do not release the winning variant.

For example: imagine you have 3 variants of a sign-in prompt. One variant of the prompt is non-dismissible. If your primary goal is to maximise sign-ins then this variant will win. However the variant could be so annoying that it reduces overall user engagement with the product. Your health metric ensures you don’t maximise sign-ins at the detriment of core product KPIs (e.g. average weekly sessions).

In our case – the BA worked with stakeholders/the product owner to identify & track the health metrics. The health metrics will vary depending on the product.

heart-865226_1280

 

Lesson 6: Get management buy in

Based on experience, I recommend getting management buy-in early on. A/B testing is a significant culture change. It challenges the idea that a Product Owner/UX/Managers know what the best user experience is. It replaces gut decisions with data based decisions. Essentially A/B testing can transition a team from a HIPPO culture (HIghest Paid Persons Opinion) to a data driven culture.

Lesson 6

To get management buy in for A/B testing there’s a variety of tactics:

  1. Ensure the 1st A/B test you run offers real business value. Don’t run a minor/arbitrary change as your 1st test. Try to solve an important problem or turn the dial on a key business KPI. Even better if the result might challenge existing beliefs.
  2. Reiterate the benefits of A/B testing. These include:
    • Increasing collaboration by empowering the team to generate their own hypotheses, which can be delivered as “small bets”
    • Increasing openness by encouraging a data-driven culture to decision making, rather than a HIPPO culture
    • Increasing innovation by learning more about user behaviour and adapting the product
    • Increasing innovation because delivering changes to a sub-set of the live audience means you can experiment more and take more risks
    • Challenging assumptions and decisions to create a more valuable product. Gut feelings can be wrong
    • Small bets are better than big bets. They are less risky & can have significant user benefits
    • Empowering the team to improve the quality of solutions
  3. Create experiments in collaboration with the entire team so that it’s not seen as a threat to the PO/UX
  4. Create a fun testing environment. Get people to place bets on the winner.

 

Lesson 7: Assumptions can be wrong

lesson 7

We’ve had several examples of where our assumptions about user behaviour were wrong.

Our 1st A/B test was a prompt. We thought it would increase usage of a new service. We were so confident about it as an in-app notification that we were going to make it a re-usable component. We actually had 3 more prompts on the roadmap.

What did we find out with an A/B test? The prompt significantly reduced general usage of the app. It was a dramatic drop in usage. The results challenged our assumptions and changed our roadmap.

By having a control group that we could compare against & by serving the experiment to a sub-set of the audience we were able to challenge our assumptions early & with a relatively small subset of users.
We never put the prompt live. Test your assumptions.

 

Lesson 8:  Broadly it’s a 6 step process

This is a slight simplification – below is the typical lifecycle of an experiment.

lesson 8

STEP 1 – Business goals

Identify the business goals (KPIs) and significant user problems for your product.

STEP 2 – Generate hypotheses

Generate testable hypotheses to solve these goals/problems. Prioritise the most valuable tests.

STEP 3 – Create the test

  • Work with UX & developers to create n number of variants
  • Forecast the number of users required for the MDE
  • Decide on traffic allocation (e.g. 50% see A, 50% see B)
  • Identify target conditions (e.g. only signed in users, only 10% of users)
  • Implement conversion goals (one primary and optional secondary goals)
  • Implement the health check
  • Set the statistical significance level

STEP 4 – Run the experiment

  • Run the experiment for at least 1 business cycle
  • Actively monitor it
  • Potentially ramp up number of users

STEP 5 – Analyse results

  • Review the performance of variants
  • Analyse the health check
  • Identify winner

STEP 6 – Promote the winner

  • Promote the winner to 100% of the audience
  • Learn the lessons
  • Archive the experiment

 

Lesson 9: Make testing part of the process

lesson 9

When we started A/B tested we committed to run 3 tests in the first quarter. It was a realistic target. It meant we were either developing a test, or analysing the results of a test (tests typically ran for 2 weeks). The more tests we ran, the easier they were to create.

Getting into a regular cycle is important in the early stages. For any feature or change you should ask “Could we A/B test that?”

I have seen several teams “implement A/B testing” and only run 1-2 tests. The key to getting value from A/B testing is to make it part of the product development lifecycle.

 

Lesson 10: There’s a community out there …

There’s a huge number of resources out there:

lesson 10

ACKNOWLEDGEMENTS

I learnt a huge amount from Olivier Tatard, Sibbs Singh, Sam Brown, Toby Urff and the folks at Optimizely. Big thanks also to the rest of the app team, we all went on the journey together.

If you made it down this far then you get 10 bonus points.

 

Advertisements

2 Responses to “10 lessons from A/B testing”

  1. Georgi Georgiev June 26, 2017 at 1:46 pm #

    Hey,

    Some good advice here, but an important things seem to be omitted in some of these points: cost!

    In both the A/B vs MVT point and the “More variants the better” point, it is worth mentioning that:

    #1 Running a test with more variants increases the likelihood of detecting a nominally statistically significant winner. That is, a winner which is a result of the variation of the data, not a manifestation of a true phenomena. This can be accounted for, of course, and accounting for it leads to

    #2 Significantly increased sample size, required to run the test with the same uncertainty and sensitivity measures (statistical significance and statistical power, respectively). So, a cost must be associated with the need to run the test for longer.

    #3 Developing, deploying and QA-ing variants that have a chance to beat the original, especially if this is a 3-th or 4-th iteration over the same page/element/process etc. can be hard, expensive and slow.

    Also, when you say “Forecast sample size”, it should actually be: fix your sample size. It’s a forecast only before you commit to a design, then it becomes fixed, or you’d fall prey to the bane of A/B testing, which is “waiting for statistical significance”. Alternatively, you can opt in to use a sequential testing framework such as the AGILE A/B testing method. It usually results in much faster tests, and can have other benefits such as futility stopping rules, etc.

    Best,
    Georgi

    • rthewitt June 30, 2017 at 3:25 pm #

      Thanks Georgi. I think you also raise some interesting points!! Nice blog/website too 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: