A Deep Dive into A/B Testing

To some extent, we’re all familiar with A/B testing. The fact that you can A/B test email headers, CTA buttons or landing pages probably won’t surprise you. But have you considered how personalization engines take A/B testing to the next level? Furthermore, did you know that personalization engines themselves can be compared to each other through A/B testing? Or that you can A/B test versions of a website or app with and without personalization?

With an unbeaten track record in A/B testing against competing personalization solutions, Yusp is a true veteran of the genre. Let us share our know-how: if you want to put Yusp and/or any other personalization engine to the test, here’s how it’s done.

A/B Testing: Simple, Controlled, Randomized

You’ve most likely come across A/B testing in a marketing context, where it’s used to compare the performance of digital assets in order to reach specific business goals. In fact, the concept originates from the rolling hills of England where, in the 1920s, geneticist and statistician Ronald Fisher laid down the framework of controlled, randomized experiments – which include A/B testing – in order to compare crop yields under various conditions. Later, the method was adopted by scientists conducting clinical trials, and even later by marketers wishing to assess how their direct response campaigns were doing.

Throughout the decades and the various use cases, the design created by Fisher has remained consistent: A/B testing is a simple, controlled, randomized experiment.

Simple, because it’s basically comparing two (or more) versions of something to see which one works better. For instance, you could A/B test two different incentives in a “Sign up to our newsletter” pop-up to find out which one is more attractive.

Controlled, because the compared variations are exactly the same except for one, well-defined difference. This is crucial in order to make sure that the resulting change in performance is not a mere coincidence. For example, if you want to A/B test top banner headlines, all other elements of the website should be identical on the two (or more) tested variations. This way, it’s clear that any uplift in the winning version is because of the headline – the only element that has changed.

Randomized, because the tested variations are shown to users at random, to prevent possible patterns of behavior in various user groups from influencing the result. In the simplest case, comparing two versions of a website means half the traffic is directed to version A and the other half to version B at random, so the audiences of the two versions are most likely to have the same composition. Allocating traffic along any dividing line other than chance could skew the outcome. First-time vs returning visitors; mobile vs desktop users; weekday vs weekend traffic; heavy buyers vs average consumers – practically any categorization is bound to interfere with fair A/B testing.
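To make the idea of a random split more tangible, here is a minimal sketch of one common way to do it: hashing a user ID together with an experiment name. This is an illustration under our own assumptions, not a description of how any particular engine allocates traffic; the function name `assign_variant` and the experiment label `headline-test` are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Map a user to a variant in a way that is stable and looks random.

    Hashing the experiment name together with the user ID means the same
    user always sees the same variant, while the split does not follow
    device, visit time, spending, or any other user attribute.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Hypothetical traffic: 10,000 made-up user IDs.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "headline-test")] += 1

print(counts)  # the two counts should be close to 5,000 each
```

The hash is only a stand-in for a coin flip that can be repeated: a returning visitor lands in the same group on every visit, without the assignment having to be stored anywhere.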

For instance, if the KPI of an A/B test is the value of conversions generated by recommendations in the compared user groups, then it’s especially important that the users are split evenly in terms of spending.

Let’s consider an e-commerce website that serves individual customers as well as wholesalers or businesses. Suppose their individual buyers outnumber commercial buyers in a 90-10 percent ratio. Suppose also that their commercial buyers typically spend ten times as much as the individual customers.

So, assuming a total of 20 customers split into two groups of 10, there will be 2 commercial and 18 individual buyers. If both commercial buyers end up in the same group, that group will account for 2×10+8=28 units of spending (taking the average individual customer’s spend as one unit), in stark contrast to the other group’s mere 10 units. This vast difference biases any test based on average spending. (The extent of the difference is caused by the small number of buyers in this example. Still, with much lower probability, a similar phenomenon could occur even if the number of buyers is in the thousands or millions.)
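The arithmetic of this example can be checked in a few lines. The sketch below assumes the 20 customers are split into two groups of exactly 10, and that an individual customer spends one unit; both are our simplifying assumptions. It also works out how likely it is that a purely random split puts both commercial buyers in the same group.

```python
from fractions import Fraction
from math import comb

INDIVIDUAL_SPEND = 1   # one unit per individual buyer (assumed)
COMMERCIAL_SPEND = 10  # ten times the individual spend, as in the example

n_individual, n_commercial = 18, 2
group_size = (n_individual + n_commercial) // 2  # two groups of 10

# Unlucky split: both commercial buyers land in group A.
group_a = n_commercial * COMMERCIAL_SPEND + (group_size - n_commercial) * INDIVIDUAL_SPEND
group_b = group_size * INDIVIDUAL_SPEND
print(group_a, group_b)  # 28 10

# Probability of that unlucky split: pick the group (2 ways), then fill its
# remaining 8 seats from the 18 individual buyers.
same_group = Fraction(
    2 * comb(n_individual, group_size - n_commercial),
    comb(n_individual + n_commercial, group_size),
)
print(same_group)  # 9/19
```

With only 20 customers, the lopsided split is far from a rare accident: it happens in 9 out of 19 random splits, or roughly 47 percent of the time.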

Therefore, it’s crucial to ensure a randomized user split in order to prevent bias. In fact, separate tests, called A/A tests, are often run to check if there is any significant difference between the user groups.
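As an illustration of what an A/A check might look like, here is a minimal sketch using a permutation test on average spending. The permutation test is our choice for the example because it needs nothing beyond the Python standard library; it is not necessarily the test used in practice, and the spending figures are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns the share of random re-splits whose mean difference is at
    least as large as the observed one. A small value suggests the two
    groups differ by more than chance alone would explain.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical spending per user in two groups that received the SAME version.
group_a = [1, 1, 1, 1, 1, 1, 1, 1, 10, 10]  # both big spenders landed here
group_b = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(permutation_p_value(group_a, group_b))
```

In an A/A test both groups see the same version, so a small p-value points to a problem with the split itself rather than to a real effect.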

Classic A/B Testing, Step by Step

The process of a standard A/B test is pretty straightforward. Let’s take website optimization as the most typical use case.

It all starts with collecting data. Looking at a website’s analytics helps pinpoint areas or functions in need of improvement. You might discover, for instance, that first-time visitors to your webstore are reluctant to purchase anything, or that an alarmingly high percentage of prospective buyers abandon their carts before checkout.
Once you’ve identified the problem, the next step is formulating your goal and finding the right metrics, or KPIs, to measure the uplift. Suppose you choose to tackle the trend of abandoned carts: in this case, you’ll want to increase conversion at the checkout page.
In order to reach your goal, you need to come up with possible solutions. In other words, you construct a hypothesis, such as: “By adding a ‘FOMO nudge’ to cart summaries, indicating if a product is in short supply and also in others’ carts, we can increase conversion at the checkout by 5 percent.”
Next, you create variations of the page or feature you want to test: besides the original, also known as the control or baseline version, you need at least one variation with a specific difference. In the case mentioned above, it makes sense to test the checkout page in its current state, and a variation where a red blinking sign says, “Pounce! 3 more people have this item in their carts!”
To kick off the experiment, split traffic to the different variations evenly and at random.
Measure users’ reactions to the compared variations. Once the A/B test has run its course, analyze your results to see which variation worked best.
Now all you need to do is pick the winner and direct your entire traffic to it, canceling the version(s) that it outperformed. It might turn out that the “FOMO nudge” failed to boost conversion at checkout by the expected rate, as it didn’t prevent people from abandoning their carts. In which case you’re back to square one: the challenge is to find a new hypothesis and start A/B testing all over again.
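The analysis step above can be sketched as a two-proportion z-test on checkout conversion. This is one standard way to compare two conversion rates, offered here as an assumption rather than as the method any particular tool uses; the visitor and conversion counts are invented for the example.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and variation (B).

    Returns (z, two-sided p-value, relative uplift of B over A).
    Assumes both groups are non-empty, A has at least one conversion,
    and the pooled rate is strictly between 0 and 1.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Standard normal CDF expressed through the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    relative_uplift = (p_b - p_a) / p_a
    return z, p_value, relative_uplift


# Hypothetical result: control checkout page vs the page with the "FOMO nudge".
z, p_value, uplift = two_proportion_z_test(conv_a=400, n_a=10_000, conv_b=460, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}, relative uplift = {uplift:.1%}")
```

The relative uplift is the figure to hold against the hypothesis (5 percent in the example above), while the p-value indicates whether the observed difference could plausibly be down to chance.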

A/B Testing in Personalization

Personalization engines rely heavily on A/B testing. They take the multi-step A/B testing process detailed above, and put it on a continuous, high-speed loop. But instead of comparing variations of digital assets like web pages or emails, they A/B test personalized recommendations, or rather, the algorithms behind them.

They serve personalized recommendations generated by two (or more) algorithms, or variants of these, to two (or more) different sets of users. Then, using statistical models, they determine which version is more effective, that is to say, which recommendation converts better.

This type of A/B testing is essential in improving the overall performance of personalization engines. However, this improvement has its limits: through A/B testing, the efficiency of the personalization solution will increase and then level out at one point. From this point onwards, the goal is to maintain this efficiency.

The Ultimate A/B Test: Assessing or Comparing Personalization Solutions

As a marketer considering a personalization strategy for your enterprise, you may want to use A/B testing to establish the added value of personalization. Or, if you’ve already implemented some kind of recommendation system, A/B testing it against another or several other personalization solution(s) can help to decide which one works best for your purposes.

In fact, Yusp has been A/B tested against some of the leading retail and content personalization vendors, and it has remained unbeaten in all tests. That’s 42 and counting.

In our experience, employing multiple personalization vendors for the evaluation period doesn’t require a significantly higher investment from clients. Therefore, it’s always a good idea to choose your long-term personalization provider based on the evidence of an A/B test.

Here’s how to go about setting up the experiment.

The key requirement of an A/B test is to ensure that the same conditions apply to all competing solutions. To this end, users should be divided into two (or more) groups that are roughly the same in terms of size and past purchase value.
All participating personalization solutions should serve the same types of recommendations, and these should look identical.
The evaluation period (that is, the duration of the A/B test) should be long enough to produce reliable results, but not so long as to be too costly. We usually suggest two months for comparing personalization solutions, while an A/B test gauging the added value of personalization can run for a shorter period.
Setting the scope of the personalization features to be compared depends on the purpose of the A/B test. For deciding which personalization engine performs best, a relatively small scope is enough. However, if your goal is to assess whether it’s worth investing in personalization at all, a more comprehensive scope is required. This is because the larger the surface of personalized elements on a site, and the more personalized touchpoints a user encounters on their journey, the more significant the impact of personalization will be.
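The first requirement above, groups of roughly the same size and past purchase value, can be sketched as a stratified random split: rank users by past purchase value, then deal each block of neighbors out to the groups in random order. This is one possible approach shown under our own assumptions, not a description of Yusp's procedure; `stratified_split` and the sample data are hypothetical.

```python
import random


def stratified_split(past_purchase_value, n_groups=2, seed=0):
    """Split users into groups balanced in size and past purchase value.

    past_purchase_value maps user ID -> historical spend. Users are ranked
    by spend and cut into blocks of n_groups neighbors; within each block,
    chance decides who goes to which group.
    """
    rng = random.Random(seed)
    ranked = sorted(past_purchase_value, key=past_purchase_value.get, reverse=True)
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        order = list(range(n_groups))
        rng.shuffle(order)  # random assignment within the block
        for user, group_index in zip(block, order):
            groups[group_index].append(user)
    return groups


# The earlier example: 2 commercial buyers (10 units) and 18 individuals (1 unit).
users = {f"business-{i}": 10 for i in range(2)}
users.update({f"person-{i}": 1 for i in range(18)})

for group in stratified_split(users):
    print(len(group), sum(users[u] for u in group))  # 10 users, 19 units each
```

Because the two commercial buyers sit in the same block, they can never end up in the same group, which removes the 28-versus-10 imbalance described earlier while still leaving the assignment to chance.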

Here are two of our case studies to illustrate this point.

May the Best Win: Yusp Beats Competition in A/B Test

Leading Turkish e-retailer N11.com already had a personalization solution in place when they decided to test it against Yusp. The objective of the A/B test was to see if Yusp performed significantly better than their current provider, and to measure this difference. Ultimately, N11 hoped to increase sales and improve user experience on their website through A/B testing.

Based on the experiment design detailed earlier, eight different personalized placements were tested across N11 platforms. This scope amounted to 15 percent of all available placements. However, because these were the most frequented ones, their high traffic made them ideal for measuring the difference between the personalization engines compared in the A/B test.

In each of the tested placements, Yusp proved to be the more efficient solution. It outperformed the competing personalization engine in all pre-set KPIs: the value of conversions generated by recommendations (gross merchandise value, or GMV); GMV per 1000 recommendations, which helps measure the return on investment; and click-through rate (CTR), which indicates the accuracy of the product recommendations.
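For readers who want to see how the three KPIs relate to each other, here is a minimal sketch. The exact definitions used in the N11 test are not spelled out here, so the formulas below are assumptions: in particular, CTR is taken as clicks divided by recommendations shown. The class name `PlacementStats` and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlacementStats:
    recommendations_shown: int
    clicks: int
    gmv: float  # value of conversions generated by the recommendations


def kpis(stats: PlacementStats) -> dict:
    """Compute the three KPIs for one placement (assumed definitions)."""
    return {
        "gmv": stats.gmv,
        "gmv_per_1000_recs": 1000 * stats.gmv / stats.recommendations_shown,
        "ctr": stats.clicks / stats.recommendations_shown,
    }


# Made-up numbers for one placement served by two competing engines.
engine_a = PlacementStats(recommendations_shown=200_000, clicks=5_000, gmv=12_000.0)
engine_b = PlacementStats(recommendations_shown=200_000, clicks=4_000, gmv=9_000.0)

print(kpis(engine_a))  # {'gmv': 12000.0, 'gmv_per_1000_recs': 60.0, 'ctr': 0.025}
print(kpis(engine_b))  # {'gmv': 9000.0, 'gmv_per_1000_recs': 45.0, 'ctr': 0.02}
```

Normalizing GMV per 1000 recommendations keeps the comparison fair even when the engines end up serving slightly different numbers of recommendations.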

This A/B testing success has led to a long-term collaboration between N11 and Yusp. Today, N11 leverages the personalization potential of Yusp on its web and mobile platforms in over 60 placements. These placements generate a significant share of the e-retailer’s total revenue.

Better With Than Without: Yusp Justifies Investment in Personalization

Retail chain Cora Romania wanted to A/B test Yusp against their default, non-personalized setup to see if personalization could make their online store more profitable. So we integrated Yusp into the cora.ro website on several placements, including personalized

recommendations on the main page, category pages, and product pages;
auto-complete for the search field;
search result ordering;
category list ordering;
automatic out-of-stock product replacement;
recipe ingredient recommendation.

As you can see, that’s a fairly broad scope for an A/B test.

To measure the added value generated by the recommendations, we turned off all the personalization features for 3 weeks for a randomly selected customer group. We measured the difference between the two groups during the 3 weeks, and Yusp proved its ability to increase the conversion rate by 10 percent and the total revenue by 7 percent.

Your Guide to Successful A/B Testing

As we have demonstrated, A/B testing is a versatile tool suitable for a variety of purposes. However, its simplicity is deceptive: though the basics summarized here seem straightforward enough, when it comes down to the nitty-gritty of actual A/B testing, the average person typically lacks the expertise or the resources to conduct conclusive tests. In other words: Don’t try this at home. A/B testing is best left to professionals.

As the vendor of a personalization engine that runs A/B tests, and as the provider of the winning solution in multiple comparisons, the Yusp team has amassed considerable knowledge about A/B testing. All you’ve read about here is just the tip of the iceberg: don’t get us started on statistical significance, confidence intervals, A/A testing, or the Mann–Whitney U test… Do get in touch, though, if this piece has got you thinking about measuring or comparing the various ways your business can leverage personalization. We’re up for the next test.
