You’re doing A/B tests wrong. Do this instead.

Taking A/B test results at face value goes against the fundamentals of experimentation.

A few weeks ago, I was in a meeting with a very senior product manager with about a decade and a half of experience in product management. He proudly mentioned that the company runs hundreds of A/B tests and experiments every day. They take the results of those experiments and use them to tweak the product.

They never ask why.

If users tend to click on blue buttons more, let’s change the main buttons to blue. If users tend to click on Book Now when they see rooms are running out, then let’s show them how many rooms are left. Maybe reduce the availability a bit to drive urgency.

Never asking why.

There are two massive ways we are doing this wrong.

Causality drives decision-making, not the other way round.

I’m a firm believer in the Five Whys. Understanding the root causes that drive user behaviour is as important as observing the behaviour in the first place. If an A/B test gives you one result at the beginning of the year and a contradictory one six months later, did you really learn anything useful from the experiment?

The Five Whys technique helps you identify the root cause of user behaviour by simply asking “why” five times in succession. The immense appeal of this technique is that it works. Eric Ries explains it far better than I ever could in The Lean Startup.

Another trick is to take initial findings from A/B tests and then validate those findings and assumptions with real-life customers. An unhealthy number of product managers tend to look only at data – that is only half the story. Testing on 50,000 customers is important, but if talking to an additional 50 can amplify the impact of your tests, then why not?

Unavoidable selection bias can kill the validity of your tests.

The next problem I have with blind A/B tests is the inherent selection bias that marketers and product managers tend to ignore. Unless you are absolutely sure that the users coming to your website on a given day are representative of the entire user base, you are working from a skewed sample. Look out for the impact of these biases:

  • Impact of daily sales. If you are an e-commerce company, daily promotions and sales change who lands on your home page. If you’re a horizontal player selling products across a range of categories, your home page will look very different on Valentine’s Day versus when you’re promoting a new Xiaomi launch. Any experiment you run on those days will reflect a different kind of audience. This applies to every day, to varying degrees. To avoid this, remove the impact of outliers on the user base and choose a day when your audience is more representative of it. Say no to hundreds of tests every day.
  • Difference in on-site behaviour. A repeat user who spends fifteen minutes on the website every session is clearly not the same as one who arrived for the first time and bailed in 10 seconds. If a change keeps your repeat users engaged but drives away new users, your user base will shrink over time. To avoid this, limit your testing to the average: remove users who spend too little or too much time, and users who buy too frequently or bounce too quickly (see the sketch after this list).
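
Here is a minimal sketch of the second point: trimming behavioural outliers before reading a test. It assumes a pandas DataFrame of experiment data with hypothetical columns (user_id, variant, session_seconds, purchases_90d, converted); the column names, thresholds, and 5th/95th percentile cut-offs are illustrative, not a prescription.

```python
import pandas as pd

def trimmed_results(df: pd.DataFrame,
                    lower: float = 0.05,
                    upper: float = 0.95) -> pd.DataFrame:
    """Drop users outside the middle 90% of session time and purchase
    frequency, then report conversion per variant on the trimmed sample."""
    # Percentile cut-offs for session length and purchase frequency.
    s_lo, s_hi = df["session_seconds"].quantile([lower, upper])
    p_lo, p_hi = df["purchases_90d"].quantile([lower, upper])

    # Keep only the "average" users the bullet above describes.
    trimmed = df[
        df["session_seconds"].between(s_lo, s_hi)
        & df["purchases_90d"].between(p_lo, p_hi)
    ]

    # Conversion rate per variant on the trimmed population.
    return (
        trimmed.groupby("variant")["converted"]
        .agg(users="count", conversion_rate="mean")
        .reset_index()
    )

# Usage (hypothetical file name):
# results = trimmed_results(pd.read_csv("experiment_log.csv"))
```

A similar quantile check on daily traffic or order volume can flag promotion-heavy days (the first point) so you can exclude them or analyse them separately.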

--

In summary:

  • A/B and split tests are powerful tools for understanding users. But a half-baked job of gathering user feedback can do more harm than good. Don’t experiment for the heck of it.
  • Establish causation. Take the extra step to get first-hand user feedback, through quantitative as well as qualitative channels.
  • Avoid inherent selection bias. Look at what might be causing selection bias in your industry, and protect against that.