Data Min Knowl Disc (2009) 18:140–181
DOI 10.1007/s10618-008-0114-1

Controlled experiments on the web: survey and practical guide

Ron Kohavi · Roger Longbotham · Dan Sommerfield · Randal M. Henne

Received: 14 February 2008 / Accepted: 30 June 2008 / Published online: 30 July 2008
© Springer Science+Business Media, LLC 2008

Responsible editor: R. Bayardo.

R. Kohavi (B) · R. Longbotham · D. Sommerfield · R. M. Henne
Microsoft, One Microsoft Way, Redmond, WA 98052, USA
e-mail: email@example.com
R. Longbotham, e-mail: firstname.lastname@example.org
D. Sommerfield, e-mail: email@example.com
R. M. Henne, e-mail: firstname.lastname@example.org

Abstract  The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person's Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled
experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.

Keywords  Controlled experiments · A/B testing · e-commerce · Website optimization · MultiVariable Testing · MVT

1 Introduction

One accurate measurement is worth more than a thousand expert opinions
— Admiral Grace Hopper

In the 1700s, a British ship's captain observed the lack of scurvy among sailors serving on the naval ships of Mediterranean countries, where citrus fruit was part of their rations. He then gave half his crew limes (the Treatment group) while the other half (the Control group) continued with their regular diet. Despite much grumbling among the crew in the Treatment group, the experiment was a success, showing that consuming limes prevented scurvy. While the captain did not realize that scurvy is a consequence of vitamin C deficiency, and that limes are rich in vitamin C, the intervention worked. British sailors eventually were compelled to consume citrus fruit regularly, a practice that gave rise to the still-popular label limeys (Rossi et al. 2003; Marks 2000).

Some 300 years later, Greg Linden at Amazon created a prototype to show personalized recommendations based on items in the shopping cart (Linden 2006a, b). You add an item, recommendations show up; add another item, different recommendations show up.
Linden notes that while the prototype looked promising, "a marketing senior vice-president was dead set against it," claiming it will distract people from checking out. Greg was "forbidden to work on this any further." Nonetheless, Greg ran a controlled experiment, and the "feature won by such a wide margin that not having it live was costing Amazon a noticeable chunk of change. With new urgency, shopping cart recommendations launched." Since then, multiple sites have copied cart recommendations.

The authors of this paper were involved in many experiments at Amazon, Microsoft, Dupont, and NASA. The culture of experimentation at Amazon, where data trumps intuition (Kohavi et al. 2004), and a system that made running experiments easy, allowed Amazon to innovate quickly and effectively. At Microsoft, there are multiple systems for running controlled experiments. We describe several architectures in this paper with their advantages and disadvantages. A unifying theme is that controlled experiments have great return-on-investment (ROI) and that building the appropriate infrastructure can accelerate innovation. Stefan Thomke's book title is well suited here: Experimentation Matters (Thomke 2003).
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment, and parallel flights. In the simplest manifestation of such experiments, live users are randomly assigned to one of two variants: (i) the Control, which is commonly the "existing" version, and (ii) the Treatment, which is usually a new version being evaluated. Metrics of interest, ranging from runtime performance to implicit and explicit user behaviors and survey data, are collected. Statistical tests are then conducted on the collected data to evaluate whether there is a statistically significant difference between the two variants on metrics of interest, thus permitting us to retain or reject the (null) hypothesis that there is no difference between the versions. In many cases, drilling down to segments of users using manual (e.g., OLAP) or machine learning and data mining techniques allows us to understand which subpopulations show significant differences, thus helping improve our understanding and progress forward with an idea.

Controlled experiments provide a methodology to reliably evaluate ideas. Unlike other methodologies, such as post-hoc analysis or interrupted time series (quasi experimentation) (Charles and Melvin 2004), this experimental design methodology tests for causal relationships (Keppel et al. 1992, pp. 5–6). Most organizations have many ideas, but the return-on-investment (ROI) for many may be unclear and the evaluation itself may be expensive. As shown in the next section, even minor changes can make a big difference, and often in unexpected ways. A live experiment goes a long way in providing guidance as to the value of the idea.

Our contributions include the following.

• In Sect.
3 we review controlled experiments in a web environment and provide a rich set of references, including an important review of statistical power and sample size, which are often missing in primers. We then look at techniques for reducing variance that we found useful in practice. We also discuss extensions and limitations so that practitioners can avoid pitfalls.
• In Sect. 4, we present several alternatives to MultiVariable Tests (MVTs) in an online setting. In the software world, there are sometimes good reasons to prefer concurrent uni-variate tests over traditional MVTs.
• In Sect. 5, we present generalized architectures that unify multiple experimentation systems we have seen, and we discuss their pros and cons. We show that some randomization and hashing schemes fail conditional independence tests required for statistical validity.
• In Sect. 6 we provide important practical lessons.

When a company builds a system for experimentation, the cost of testing and experimental failure becomes small, thus encouraging innovation through experimentation. Failing fast and knowing that an idea is not as great as was previously thought helps provide necessary course adjustments so that other more successful ideas can be proposed and implemented.
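To make the statistical power and sample size topic concrete, the following sketch estimates the users needed per variant to detect a given absolute change in a conversion rate, using the common 16σ²/Δ² rule of thumb for roughly 95% confidence and 80% power. The function name and the example numbers are illustrative, not taken from the paper.

```python
from math import ceil

def sample_size_per_variant(p: float, delta: float) -> int:
    """Approximate users per variant to detect an absolute change
    `delta` in a baseline conversion rate `p`, at ~95% confidence
    and ~80% power (rule of thumb: n = 16 * sigma^2 / delta^2)."""
    variance = p * (1 - p)  # variance of a Bernoulli (convert / not) metric
    return ceil(16 * variance / delta ** 2)

# Detecting a 1-point absolute lift on a 5% conversion rate
# requires on the order of 7,600 users in each variant.
print(sample_size_per_variant(0.05, 0.01))
```

The quadratic dependence on `delta` is the practical takeaway: halving the detectable effect quadruples the required traffic, which is why small sites struggle to detect subtle changes.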
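As a preview of the randomization discussion in Sect. 5, here is a minimal sketch of deterministic hash-based variant assignment. The function and experiment names are hypothetical, and, as the paper cautions, a production system must verify that its hashing scheme actually produces uniform, independent assignments across concurrent experiments.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   treatment_pct: float = 0.5) -> str:
    """Deterministically assign a user to Control or Treatment.

    Hashing the (experiment, user_id) pair gives each user a stable
    bucket in 0..9999, so repeat visits see a consistent variant and
    different experiments get (ideally) independent splits.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000
    return "Treatment" if bucket < treatment_pct * 10000 else "Control"

# Same user, same experiment: always the same variant.
variant = assign_variant("user-42", "cart-recs")
```

Seeding the hash with the experiment name matters: hashing the user id alone would put the same users in Treatment for every experiment, correlating results across tests.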
2 Motivating examples

The fewer the facts, the stronger the opinion
— Arnold Glasow

The following examples present surprising results in multiple areas. The first two deal with small UI changes that result in dramatic differences. The third example shows how controlled experiments can be used to make a tradeoff between short-term revenue from ads and the degradation in the user experience. The fourth example shows the use of controlled experiments in backend algorithms, in this case search at Amazon.

2.1 Checkout page at Doctor FootCare

The conversion rate of an e-commerce site is the percentage of visits to the website that include a purchase. The following example comes from Bryan Eisenberg's articles (Eisenberg 2003a, b). Can you guess which one has a higher conversion rate and whether the difference is significant?

There are nine differences between the two variants of the Doctor FootCare checkout page shown in Fig. 1. If a designer showed you these and asked which one should be deployed, could you tell which one results in a higher conversion rate? Could you estimate what the difference is between the conversion rates and whether that difference is significant?

We encourage you, the reader, to think about this experiment before reading the answer. Can you estimate which variant is better and by how much? It is very humbling to see how hard it is to correctly predict the answer. Please, challenge yourself!

Fig. 1 Variant A on left, Variant B on right
Variant A in Fig. 1 outperformed variant B by an order of magnitude. In reality, the site "upgraded" from A to B and lost 90% of their revenue! Most of the changes in the upgrade were positive, but the coupon code was the critical one: people started to think twice about whether they were paying too much because there are discount coupons out there that they do not have. By removing the discount code from the new version (B), conversion-rate increased 6.5% relative to the old version (A) in Fig. 1.

2.2 Ratings of Microsoft Office help articles

Fig. 2 Microsoft help ratings widget. The original widget is shown above. When users click on Yes/No, the dialogue continues asking for free-text input (two-phase)

Users of Microsoft Office who request help (or go through the Office Online website at http://office.microsoft.com) are given an opportunity to rate the articles they read. The initial implementation presented users with a Yes/No widget. The team then modified the widget and offered a 5-star rating. The motivations for the change were the following:

1. The 5-star widget provides finer-grained feedback, which might help better evaluate content writers.
2. The 5-star widget improves usability by exposing users to a single feedback box as opposed to two separate pop-ups (one for Yes/No and another for Why).

Can you estimate which widget had a higher response rate, where response is any interaction with the widget?

The surprise here was that the number of ratings plummeted by about 90%, thus significantly missing on goal #2 above. Based on additional tests, it turned out that the two-stage model helps in increasing the response rate. Specifically, a controlled experiment showed that the widget shown in Fig. 3, which was a two-stage model and also clarified the 5-star direction as "Not helpful" to "Very helpful", outperformed the one in Fig. 4 by a factor of 2.2, i.e., the response rate was 2.2 times higher.
Even goal #1 was somewhat of a disappointment, as most people chose the extremes (one or five stars). When faced with a problem for which you need help, the article either helps you solve the problem or it does not! The team finally settled on a yes/no/I-don't-know option, which had a slightly lower response rate than just yes/no, but the additional information was considered useful.
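Whether differences like those in these examples are statistically significant can be checked with a standard two-proportion z-test on the Control and Treatment conversion (or response) counts. A stdlib-only sketch with made-up counts; the function name is illustrative:

```python
from math import sqrt, erf

def conversion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test: is the difference between the
    conversion rates of variants A and B statistically significant?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))       # standard normal CDF
    p_value = 2 * (1 - phi)
    return z, p_value

# 12% vs. 9% conversion on 1,000 visits each: z ≈ 2.19, p ≈ 0.029,
# so the difference is significant at the conventional 0.05 level.
z, p = conversion_z_test(120, 1000, 90, 1000)
```

With the much smaller deltas typical of real sites, the same test often comes back non-significant, which is exactly why the power and sample-size material in Sect. 3 matters.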