A/B Testing

Example

The first change Audacity wants to try is on their homepage. They hope that this new, more engaging design will increase the number of users that explore their courses, that is, move on to the second stage of the funnel.

The metric we will use is the click through rate for the Explore Courses button on the home page. Click through rate (CTR) is often defined as the number of clicks divided by the number of views. Since Audacity uses cookies, we can identify unique users and make sure we don't count the same one multiple times. For this experiment, we'll define our click through rate as:

CTR = (# clicks by unique users) / (# views by unique users)
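To make the definition concrete, here is a minimal sketch of how this metric could be computed from raw event data. It assumes a hypothetical pandas DataFrame with columns id (the cookie-based user id), group ('control' or 'experiment'), and action ('view' or 'click'); the column names are illustrative, not Audacity's actual schema.

```python
import pandas as pd

def click_through_rate(df, group_name):
    """CTR for one experiment group: unique clickers / unique viewers."""
    group = df[df['group'] == group_name]
    unique_viewers = group.loc[group['action'] == 'view', 'id'].nunique()
    unique_clickers = group.loc[group['action'] == 'click', 'id'].nunique()
    return unique_clickers / unique_viewers

# Observed difference between the new (experiment) and old (control) homepage:
# obs_diff = click_through_rate(df, 'experiment') - click_through_rate(df, 'control')
```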

Now that we have our metric, let's set up our null and alternative hypotheses:

Our alternative hypothesis is what we want to prove to be true: in this case, that the new homepage design has a higher click through rate than the old homepage design. The null hypothesis is what we assume to be true before analyzing the data: that the new homepage design has a click through rate less than or equal to that of the old design. As you've seen before, we can rearrange our hypotheses in terms of the difference in click through rates:

H0: CTR_new - CTR_old <= 0
H1: CTR_new - CTR_old > 0

Example of calculating the significance of a metric in an A/B test
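One common way to assess significance is by simulation: bootstrap the difference in CTRs to estimate its variability, build a null distribution centered at zero with that spread, and compute the p-value as the proportion of null draws at least as large as the observed difference. The code below is a sketch using the same hypothetical id/group/action schema as above, not Audacity's exact analysis.

```python
import numpy as np
import pandas as pd

def ctr(group):
    """Unique clickers divided by unique viewers."""
    viewers = group.loc[group['action'] == 'view', 'id'].nunique()
    clickers = group.loc[group['action'] == 'click', 'id'].nunique()
    return clickers / viewers

def simulate_p_value(df, n_boot=10_000, seed=42):
    """One-sided p-value for H1: the experiment (new) CTR exceeds the control (old) CTR."""
    np.random.seed(seed)
    obs_diff = ctr(df[df['group'] == 'experiment']) - ctr(df[df['group'] == 'control'])

    # Bootstrap the difference in CTRs to estimate its sampling variability
    diffs = []
    for _ in range(n_boot):
        sample = df.sample(len(df), replace=True)
        diffs.append(ctr(sample[sample['group'] == 'experiment'])
                     - ctr(sample[sample['group'] == 'control']))
    diffs = np.array(diffs)

    # Null distribution: same spread as the bootstrap, centered at zero
    null_vals = np.random.normal(0, diffs.std(), n_boot)

    # p-value: how often the null produces a difference at least as large as observed
    return (null_vals > obs_diff).mean()
```

If the p-value falls below the chosen alpha (say 0.05), we reject the null and conclude that the new homepage has a higher click through rate.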

Handling the evaluation of multiple metrics

The more metrics you evaluate, the more likely you are to observe significant differences just by chance, much like the multiple-testing situations you saw in previous lessons. Luckily, this multiple comparisons problem can be handled in several ways.
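To see how quickly this compounds, here is a quick back-of-the-envelope check, assuming (purely for illustration) twenty independent metrics each tested at alpha = 0.05:

```python
alpha, n_metrics = 0.05, 20

# Probability that at least one metric looks significant purely by chance
p_any_false_positive = 1 - (1 - alpha) ** n_metrics
print(round(p_any_false_positive, 2))   # ~0.64
```

In other words, with enough metrics a "significant" result somewhere is more likely than not, even when nothing actually changed.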

The Bonferroni correction is one way we could handle experiments with multiple tests, or metrics in this case. To compute the Bonferroni-corrected alpha value, we divide the original alpha value by the number of tests.
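In code, the correction is a one-liner. The p-values below are hypothetical, one per metric:

```python
alpha = 0.05
p_values = [0.012, 0.041, 0.003, 0.26]    # hypothetical p-values, one per metric

bonferroni_alpha = alpha / len(p_values)   # 0.05 / 4 = 0.0125
significant = [p <= bonferroni_alpha for p in p_values]
print(bonferroni_alpha, significant)       # 0.0125 [True, False, True, False]
```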

Since the Bonferroni method is too conservative when we expect correlation among metrics, we can better approach this problem with more sophisticated methods, such as the closed testing procedure, Boole-Bonferroni bound, and the Holm-Bonferroni method. These are less conservative and take this correlation into account.
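As one example, the Holm-Bonferroni method is a step-down procedure: sort the p-values from smallest to largest, compare the i-th smallest against alpha / (m - i + 1), and stop at the first one that fails. The sketch below uses the same hypothetical p-values as above; statsmodels also provides an off-the-shelf implementation (statsmodels.stats.multitest.multipletests with method='holm').

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Return a boolean array: True where the corresponding hypothesis is rejected."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                 # indices of p-values from smallest to largest
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):    # rank 0 = smallest p-value
        if p[idx] <= alpha / (m - rank):  # thresholds: alpha/m, alpha/(m-1), ..., alpha
            reject[idx] = True
        else:
            break                         # first failure: retain this and all larger p-values
    return reject

print(holm_bonferroni([0.012, 0.041, 0.003, 0.26]))   # [ True False  True False]
```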

If you do choose to use a less conservative method, make sure its assumptions are truly met in your situation and that you're not just shopping for a significant p-value. Choosing a poorly suited test just to get significant results will only lead to misguided decisions that harm your company's performance in the long run.

Difficulties in A/B Testing

There are many factors to consider when designing an A/B test and drawing conclusions from its results. Here are some of the most common ones.

  • Novelty effect and change aversion when existing users first experience a change

  • Sufficient traffic and conversions to have significant and repeatable results

  • Best metric choice for making the ultimate decision (e.g., measuring revenue vs. clicks)

  • Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events

  • Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion)

  • Consistency among test subjects in the control and experiment group (imbalance in the population represented in each group can lead to situations like Simpson's Paradox)
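To make the last point concrete, here is a small made-up example of Simpson's Paradox. The counts are invented for illustration: the experiment wins within each user segment, yet loses in the aggregate because the two groups contain very different mixes of new and existing users.

```python
import pandas as pd

# Hypothetical counts chosen to produce the reversal
data = pd.DataFrame({
    'group':   ['control', 'control', 'experiment', 'experiment'],
    'segment': ['new', 'existing', 'new', 'existing'],
    'clicks':  [8, 280, 92, 30],
    'views':   [100, 1000, 1000, 100],
})

# Within each segment the experiment has the higher CTR ...
print(data.assign(ctr=data.clicks / data.views))

# ... but aggregated over segments the control wins, because the experiment
# group happens to be dominated by new users, who click less in general.
overall = data.groupby('group')[['clicks', 'views']].sum()
print(overall.clicks / overall.views)
```

Randomly and consistently assigning users to groups keeps the segment mix balanced and avoids this trap.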

Scenario #1

  • EXPERIMENT: Audacity tests a new layout in the classroom to see if it would help engage students. After running an A/B test for two weeks, they find that average classroom times and completion rates decrease with the new layout, and decide against launching the change.

  • REALITY: What they don't know is that classroom times and completion rates actually increase significantly for new students using the new layout. In the long run, the layout would help existing students too, but they are currently experiencing change aversion.

What contributed?

  • The experiment included existing users who would bias results in a short time frame.

  • The experiment wasn't run long enough to allow existing users to adjust to the change.

Scenario #2

  • EXPERIMENT: Audacity tests a new description for a difficult course that gets very few enrollments. They hope this description is more exciting and motivates students to take it. After running an A/B test for five weeks, they find that the enrollment rate increases with the new description, and decide to launch the change.

  • REALITY: What they don't know is that although the enrollment rate appears to increase with the new description, the results of this A/B test are unreliable and largely due to chance: fewer than 40 of the thousands of visitors enrolled during the experiment, so even one additional enrollment can substantially shift the results and potentially even the conclusion.

What contributed?

  • The course page had too little traffic and conversions to produce significant and repeatable results in the time frame.
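One way to catch this before launching the experiment is a rough sample-size calculation. The sketch below uses statsmodels' power utilities to ask how many visitors per group would be needed to reliably detect, say, a lift in enrollment rate from 1% to 1.5%; the baseline and lift are invented numbers for illustration.

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Hypothetical baseline enrollment rate (1%) and smallest lift worth detecting (to 1.5%)
effect = proportion_effectsize(0.015, 0.010)

# Visitors needed per group for 80% power at a one-sided alpha of 0.05
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative='larger'
)
print(round(n_per_group))   # several thousand visitors per group for a lift this small
```

If the page cannot realistically reach that traffic within the planned run time, the experiment is unlikely to produce trustworthy results.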
