New Book: “Trustworthy Online Controlled Experiments — A Practical Guide to A/B Testing”

The recommender-system (research) community heavily relies on offline evaluations. While I have always advocated the use of large-scale online studies and A/B tests, conducting them is arguably difficult, time-intensive, and sometimes simply impossible if a researcher has no access to (many) real users.

For those who do have access to a large number of users, this new book may be an interesting read:

Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Accelerate innovation using trustworthy online controlled experiments by listening to the customers and making data-driven decisions

I haven’t read the book myself, but judging by the authors, it looks very promising.


Ron Kohavi is a Vice President and Technical Fellow at Airbnb. He was previously a Technical Fellow and Corporate VP at Microsoft. Prior to Microsoft, he was the director of data mining and personalization at  He has a PhD in Computer Science from Stanford University.

His papers have over 40,000 citations and three of his papers are in the top 1,000 most-cited papers in Computer Science.

Diane Tang is a Google Fellow, with expertise in large-scale data analysis and infrastructure, online controlled experiments, and ads systems. She has an AB from Harvard and MS/PhD from Stanford, and has patents and publications in mobile networking, information visualization, experiment methodology, data infrastructure, and data mining / large data.

Ya Xu heads Data Science and Experimentation at LinkedIn. She has led LinkedIn to become one of the most well-regarded companies when it comes to A/B testing. Before LinkedIn, she worked at Microsoft and received a PhD in Statistics from Stanford University. She is widely regarded as one of the premier scientists, practitioners and thought leaders in the domain of experimentation, with several filed patents and publications. She is also a frequent speaker at top conferences, universities and companies across the country.

Excerpt from the book

In 2012, an employee working on Bing, Microsoft’s search engine, suggested changing how ad headlines display (Kohavi and Thomke 2017). The idea was to lengthen the title line of ads by combining it with the text from the first line below the title, as shown in Figure 1.1. Nobody thought this simple change, among the hundreds suggested, would be the best revenue-generating idea in Bing’s history!

The feature was prioritized low and languished in the backlog for more than six months until a software developer decided to try the change, given how easy it was to code. He implemented the idea and began evaluating the idea on real users, randomly showing some of them the new title layout and others the old one. User interactions with the website were recorded, including ad clicks and the revenue generated from them. This is an example of an A/B test, the simplest type of controlled experiment that compares two variants: A and B, or a Control and a Treatment.
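A side note that is not part of the book excerpt: the mechanics described above (random assignment, recording revenue, comparing two variants) can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions. The function names `assign_variant` and `welch_t`, the experiment label, the hash-based 50/50 split, and the simulated revenue figures are all hypothetical and say nothing about how Bing actually ran the experiment.

```python
import hashlib
import math
import random
from statistics import mean, variance


def assign_variant(user_id: str, experiment: str = "long-ad-titles") -> str:
    """Deterministically map a user to 'control' or 'treatment' (roughly 50/50)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for the difference in means, mean(b) - mean(a)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


# Simulated revenue per user; all numbers are made up for illustration.
rng = random.Random(0)
revenue = {"control": [], "treatment": []}
for i in range(20_000):
    variant = assign_variant(f"user-{i}")
    base = rng.expovariate(1.0)  # hypothetical revenue per user
    lift = 1.12 if variant == "treatment" else 1.0  # assumed true effect
    revenue[variant].append(base * lift)

t_stat = welch_t(revenue["control"], revenue["treatment"])
relative_lift = mean(revenue["treatment"]) / mean(revenue["control"]) - 1
print(f"relative lift: {relative_lift:+.1%}, Welch t: {t_stat:.2f}")
```

Hashing the user ID together with an experiment label keeps each user in the same variant across visits, which is one common way to get a stable random split. A real platform would also need logging, multiple metrics, and proper significance testing.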

A few hours after starting the test, a revenue-too-high alert triggered, indicating that something was wrong with the experiment. The Treatment, that is, the new title layout, was generating too much money from ads. Such “too good to be true” alerts are very useful, as they usually indicate a serious bug, such as cases where revenue was logged twice (double billing) or where only ads displayed, and the rest of the web page was broken. For this experiment, however, the revenue increase was valid. Bing’s revenue increased by a whopping 12%, which at the time translated to over $100M annually in the US alone, without significantly hurting key user-experience metrics. The experiment was replicated multiple times over a long period
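Again not from the book: a “too good to be true” alert like the one mentioned in the excerpt could, in its simplest form, be a threshold check on the relative revenue lift. The sketch below is my own assumption of what such a guard might look like. The function name `revenue_alert` and the 5% default threshold are hypothetical, and the excerpt does not say how Bing's alert was actually defined.

```python
def revenue_alert(control_revenue: float, treatment_revenue: float,
                  max_plausible_lift: float = 0.05) -> bool:
    """Return True when the Treatment's relative revenue lift looks implausibly large."""
    if control_revenue <= 0:
        raise ValueError("control revenue must be positive")
    lift = treatment_revenue / control_revenue - 1
    return lift > max_plausible_lift


# Hypothetical average revenue per user for Control and Treatment.
print(revenue_alert(1.00, 1.12))  # a 12% lift exceeds the assumed 5% threshold
print(revenue_alert(1.00, 1.01))  # a 1% lift stays below it
```

An alert like this does not decide whether a result is real. It only prompts someone to check for bugs such as double logging before the numbers are trusted.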
