Online A/B Test Introduction

A-B testing has a much longer history than web services and the Internet. Its use in clinical trials predates the PC age. The A-B tests that companies such as Google, Amazon, and eBay apply to online services (web search, Internet ads, e-commerce websites) still have the same basic structure and follow the same statistical theory as the clinical trials conducted more than half a century ago. In this series of articles, I will walk through the basic theory behind A-B tests and the challenges that are specific to A-B tests on modern online services.

Structure of A-B Tests

What is an A-B test?
  1. What question does an A-B test answer?
Given a population of IID variables and a treatment on those variables, what is the effect of the treatment on the population, specifically in terms of the population mean?
IID stands for "independent and identically distributed". I leave its detailed explanation for a later post in this series. Instead, here is an example:
Suppose Google conducts an A-B test on a new search ranking algorithm to tell whether the new algorithm improves the click-thru rate of search results. The variable is defined as a user's click-thru rate on their search queries. Such an A-B test is designed to find out whether the treatment improves the expected click-thru rate of all users, in other words, the mean of all users' click-thru rates.
Note that there is more than one way to define the "variables" in an A-B test. For instance, the variable could also be the click-thru rate of each search query rather than that of each user. Defining the variables is an important part of A-B test design, which we will cover in detail in a later post.
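To make the choice of variable concrete, here is a minimal sketch that computes the mean click-thru rate under both definitions. The click log, its field layout, and the function names are made up for illustration; they are not taken from any real system.

```python
from collections import defaultdict

# Hypothetical click log: (user_id, query_id, clicked) rows, invented for illustration.
LOG = [
    ("u1", "q1", 1), ("u1", "q2", 0), ("u1", "q3", 0), ("u1", "q4", 0),
    ("u2", "q5", 1),
    ("u3", "q6", 0), ("u3", "q7", 1),
]

def per_user_ctr(log):
    """One variable per user: that user's clicks divided by that user's queries."""
    clicks = defaultdict(int)
    queries = defaultdict(int)
    for user, _query, clicked in log:
        clicks[user] += clicked
        queries[user] += 1
    return {user: clicks[user] / queries[user] for user in queries}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Variable = user: average the per-user click-thru rates.
user_level_mean = mean(per_user_ctr(LOG).values())
# Variable = query: average the click indicator over all queries.
query_level_mean = mean(clicked for _user, _query, clicked in LOG)
```

On this toy log the two means differ, because a heavy user such as u1 contributes one variable under the per-user definition but four under the per-query definition.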
  2. How does an A-B test answer such a question?
The problem is structured as a test of a hypothesis about the change that the treatment causes in the mean of the population (of the variables). Let lift = treatedMean / controlMean - 1, where treatedMean is the mean of the population when treated and controlMean is the mean of the population when left untreated. In statistical hypothesis testing, we construct a null hypothesis H0 and collect evidence to reject it, for example:
  • H0: lift = 0. This is usually referred to as a two-tailed test. It tests for both superiority (lift > 0) and inferiority (lift < 0).
  • H0: lift <= 0. This is a one-tailed test for superiority.
  • H0: lift <= -errorMargin. This is called a non-inferiority test, where we want to reject the H0 that treated is worse than control by at least errorMargin.
The logical complement of H0 is Ha (the alternative hypothesis). It is the hypothesis that the test helps us decide whether to accept, and it usually represents the business question we care about, for example, whether we will see a higher click-thru rate if all users are treated.
The next question is which technique to use for hypothesis testing.
Draw two disjoint sample sets A and B from the population without introducing sampling bias or selection bias. Apply the treatment to set B. Then analyze the statistical properties of A and B, as samples of the population, to test the statistical hypothesis about the population.
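As a rough sketch of the analysis step, the function below computes the observed lift and a p-value for each of the three null hypotheses listed earlier. It uses a large-sample normal approximation, which is a simplification I chose to keep the example standard-library only and is not necessarily the exact test used in practice. The function name, the default error_margin, and the rewriting of the non-inferiority H0 as treatedMean - (1 - errorMargin) * controlMean <= 0 (which assumes controlMean > 0) are all assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance

def lift_tests(control, treated, error_margin=0.02):
    """Return the observed lift and approximate p-values for the three H0 forms.

    control: observed variables of set A (untreated).
    treated: observed variables of set B (treated).
    Assumes large samples and a positive control mean.
    """
    mean_c, mean_t = fmean(control), fmean(treated)
    # Squared standard error of each sample mean.
    se2_c = variance(control) / len(control)
    se2_t = variance(treated) / len(treated)
    norm = NormalDist()

    # H0: lift = 0 and H0: lift <= 0 share the statistic for treatedMean - controlMean.
    z = (mean_t - mean_c) / sqrt(se2_t + se2_c)

    # H0: lift <= -m is equivalent to treatedMean - (1 - m) * controlMean <= 0.
    k = 1.0 - error_margin
    z_ni = (mean_t - k * mean_c) / sqrt(se2_t + k * k * se2_c)

    return {
        "lift": mean_t / mean_c - 1.0,
        "p_two_tailed": 2.0 * (1.0 - norm.cdf(abs(z))),
        "p_superiority": 1.0 - norm.cdf(z),
        "p_non_inferiority": 1.0 - norm.cdf(z_ni),
    }
```

A small p-value is evidence against the corresponding H0. For example, calling lift_tests(ctr_a, ctr_b) on two lists of per-user click-thru rates (hypothetical names) and finding p_superiority below the chosen significance level would lead us to reject H0: lift <= 0.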
Sampling bias means that the sample set does not properly represent the population. For example, a sample of people that consists mostly of men is not an unbiased sample of the general human population. Sampling bias is context sensitive and needs to be evaluated within the context of the A-B test. In the Google search example, sampling bias may be evaluated on multiple dimensions of users, such as gender, geo-location, and IP address. A truly random sample of users, one that gives every user an equal chance of being selected into the sample set, should do the trick.
Selection bias means that the two sample sets are not similar to each other. With a large enough sample size, true random sampling should avoid selection bias, but a poor implementation of random selection may introduce it. For example, assigning each variable (user) to A or B by hashing the first 3 bytes of the user's IP address may leave the two sample sets containing people from different zip codes, and hence not similar to each other on the geo dimension.
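One common way to approximate random assignment is to hash a full, stable user identifier together with an experiment-specific salt, instead of a coarse key such as an IP prefix. The sketch below is only an illustration under assumptions: the function name, the salt string, the bucket count, and the bucket ranges are all hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib

def assign_bucket(user_id, experiment_salt="ranking-exp-1", num_buckets=1000,
                  a_range=(0, 50), b_range=(50, 100)):
    """Deterministically map a user to 'A', 'B', or None (not in the experiment).

    Hashing salt + user_id spreads users roughly uniformly over the buckets,
    so with the defaults about 5% of users land in A and about 5% in B.
    """
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes of the digest as an integer bucket index.
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    if a_range[0] <= bucket < a_range[1]:
        return "A"
    if b_range[0] <= bucket < b_range[1]:
        return "B"
    return None
```

Because the assignment depends only on the user ID and the salt, a user always sees the same variant within one experiment, while a different salt reshuffles users for the next experiment.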
When both sampling bias and selection bias are avoided, set A, set B, and the population they are sampled from are identically distributed before the treatment is applied. With the two sets of observed variables A and B, we can then make the leap to hypothesis testing on the population. The statistical techniques we need are t-tests for the hypotheses and power analysis. The following posts will go into their details, the theory behind them, and their practice specifically in A-B tests of online services. I will cover the following topics:
  • A/B Tests: Applied Statistics (posted on slideshare)
  • A-B test: offline data processing and computations
  • A-B test: implementation in distributed applications
