A Deep Dive into A/B Testing

Author: Gabriella Vas

To some extent, we’re all familiar with A/B testing. The fact that you can A/B test email headers, CTA buttons or landing pages probably won’t surprise you. But have you considered how personalization engines take A/B testing to the next level? Furthermore, did you know that personalization engines themselves can be compared to each other through A/B testing? Or that you can A/B test versions of a website or app with and without personalization? 

With an unbeaten track record in A/B testing against competing personalization solutions, Yusp is a true veteran of the genre. Let us share our know-how: if you want to put Yusp and/or any other personalization engine to the test, here’s how it’s done. 

A/B Testing: Simple, Controlled, Randomized

You’ve most likely come across A/B testing in a marketing context, where it’s used to compare the performance of digital assets in order to reach specific business goals. In fact, the concept originates from the rolling hills of England where, in the 1920s, geneticist and statistician Ronald Fisher laid down the framework of controlled, randomized experiments – which include A/B testing – in order to compare crop yields under various conditions. Later, the method was adopted by scientists conducting clinical trials, and even later by marketers wishing to assess how their direct response campaigns were doing. 

Throughout the decades and the various use cases, the design created by Fisher has remained consistent: A/B testing is a simple, controlled, randomized experiment. 

Simple, because it’s basically comparing two (or more) versions of something to see which one works better. For instance, you could A/B test two different incentives in a “Sign up to our newsletter” pop-up to find out which one is more attractive.

Controlled, because the compared variations are exactly the same except for one well-defined difference. This is crucial in order to make sure that any change in performance can be attributed to that difference and not to something else. For example, if you want to A/B test top banner headlines, all other elements of the website should be identical on the two (or more) tested variations. This way, it’s clear that any uplift in the winning version is because of the headline, the only element that has changed.

Randomized, because the tested variations are shown to users at random, to prevent possible patterns of behavior in various user groups from influencing the result. In the simplest case, comparing two versions of a website means half the traffic is directed to version A and the other half to version B at random, so the composition of the two audiences is most likely to be the same. Splitting the traffic along any non-random line could skew the outcome. First-time vs returning visitors, mobile vs desktop users, weekday vs weekend traffic, heavy buyers vs average consumers: practically any such categorization is bound to interfere with fair A/B testing.
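One common way to implement a random split like this is to hash each user ID into a bucket. The sketch below is only an illustration and not a description of how Yusp or any other vendor allocates traffic; the function name assign_variant, the experiment label and the user ID format are assumptions made up for this example.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Map a user to a variant: stable per user, roughly uniform overall."""
    # Hash the experiment name together with the user ID so that the same
    # user can land in different groups in different experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as a big integer and pick a bucket from it.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

if __name__ == "__main__":
    # Hypothetical user IDs, used only to show that the split is near 50/50.
    counts = {"A": 0, "B": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}", "banner-headline")] += 1
    print(counts)
```

Because the assignment depends only on the user ID and the experiment label, a returning visitor always sees the same version, while nothing about the visitor's device, history or spending influences which group they join.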

For instance, if the KPI of an A/B test is the value of conversions generated by recommendations in the compared user groups, then it’s especially important that the users are split evenly in terms of spending. 

Let’s consider an e-commerce website that serves individual customers as well as wholesalers or businesses. Suppose their individual buyers outnumber commercial buyers by a ratio of 90 to 10. Suppose also that their commercial buyers typically spend ten times as much as the individual customers.

So, assuming a total of 20 customers, there will be 2 commercial and 18 individual buyers. If the customers are split into two groups of 10 and both commercial buyers end up in the same group, then that group will represent 2×10 + 8×1 = 28 units of spending (taking an individual customer’s average spending as one unit), in stark contrast to the other group’s mere 10 units. This vast difference biases any test that relies on average spending. (The extent of the difference is due to the small number of buyers in this example. Still, a similar imbalance could occur, with much lower probability, even if the number of buyers is in the thousands or millions.)
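The arithmetic of this example can be checked in a few lines. The sketch below assumes an even split into two groups of 10 and treats an individual buyer's average spending as one unit; the function and constant names are invented for illustration.

```python
from fractions import Fraction

INDIVIDUAL_SPEND = 1    # one unit = an individual buyer's average spending
COMMERCIAL_SPEND = 10   # commercial buyers spend ten times as much

def group_spending(commercial: int, individual: int) -> int:
    """Total spending of a group, in units of individual average spending."""
    return commercial * COMMERCIAL_SPEND + individual * INDIVIDUAL_SPEND

def prob_both_in_same_group(total_buyers: int) -> Fraction:
    """Chance that two specific buyers share a group in an even random split."""
    half = total_buyers // 2
    # Fix the first buyer's group. The second buyer takes one of the remaining
    # (total_buyers - 1) places, of which (half - 1) are in the same group.
    return Fraction(half - 1, total_buyers - 1)

lopsided = group_spending(commercial=2, individual=8)   # both commercial buyers
other = group_spending(commercial=0, individual=10)     # individuals only
balanced = group_spending(commercial=1, individual=9)   # one commercial buyer each

if __name__ == "__main__":
    print(lopsided, other, balanced)
    print(prob_both_in_same_group(20))
```

With only 20 customers, the lopsided split is no rare accident: the two commercial buyers share a group in 9 out of 19 random splits, or roughly 47 percent of the time.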

Therefore, it’s crucial to ensure a randomized user split in order to prevent bias. In fact, separate tests, called A/A tests, are often run to check if there is any significant difference between the user groups. 
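As a rough illustration of an A/A check, the sketch below runs a permutation test on the average spending of two groups that were shown the identical experience. The permutation test is simply a convenient choice for this example, since it needs nothing beyond the Python standard library; it is not necessarily what any given vendor uses. The function name and the spending figures are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reshuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)

if __name__ == "__main__":
    # Made-up spending figures for two groups that saw the same website.
    group_a = [12, 0, 35, 8, 0, 50, 20, 0, 15, 9]
    group_b = [10, 0, 40, 5, 0, 45, 25, 0, 18, 7]
    p = permutation_p_value(group_a, group_b)
    print(f"A/A p-value: {p:.3f}")
```

A small p-value in an A/A test is a warning sign: the groups already differ before any variation has been introduced, so the split itself deserves a second look.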

Classic A/B Testing, Step by Step

The process of a standard A/B test is pretty straightforward. Let’s take website optimization as the most typical use case. 

A/B Testing in Personalization

Personalization engines rely heavily on A/B testing. They take the multi-step A/B testing process detailed above, and put it on a continuous, high-speed loop. But instead of comparing variations of digital assets like web pages or emails, they A/B test personalized recommendations, or rather, the algorithms behind them. 

They serve recommendations generated by two (or more) algorithms, or variants of these, to two (or more) different sets of users. Then, using statistical models, they determine which version is more effective, that is to say, which recommendation converts better.
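One textbook way to decide whether one version really converts better is a two-proportion z-test on the conversion counts. It is shown here purely as an illustration and not as the model any particular personalization engine uses; the function name and the example counts are assumptions.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("conversion rate is 0 or 1 in both groups; test undefined")
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

if __name__ == "__main__":
    # Hypothetical counts: conversions and users for algorithms A and B.
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A positive z means version B converts better in the sample; the p-value indicates how surprising a gap of that size would be if the two algorithms were in fact equally effective.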

This type of A/B testing is essential to improving the overall performance of personalization engines. However, this improvement has its limits: through A/B testing, the efficiency of the personalization solution will increase and then level out at some point. From this point onwards, the goal is to maintain this efficiency.

The Ultimate A/B Test: Assessing or Comparing Personalization Solutions

As a marketer considering a personalization strategy for your enterprise, you may want to use A/B testing to establish the added value of personalization. Or, if you’ve already implemented some kind of recommendation system, A/B testing it against one or more other personalization solutions can help you decide which one works best for your purposes.

In fact, Yusp has been A/B tested against some of the leading retail and content personalization vendors, and it has remained unbeaten in all tests. That’s 42 and counting. 

In our experience, employing multiple personalization vendors for the evaluation period doesn’t require a significantly higher investment from clients. Therefore, it’s always a good idea to choose your long-term personalization provider based on the evidence of an A/B test. 

Here’s how to go about setting up the experiment. 

Here are two of our case studies to illustrate this point. 

May the Best Win: Yusp Beats Competition in A/B Test

Leading Turkish e-retailer N11 already had a personalization solution in place when they decided to test it against Yusp. The objective of the A/B test was to see if Yusp performed significantly better than their current provider, and to measure this difference. Ultimately, N11 hoped to increase sales and improve user experience on their website through A/B testing.

Based on the experiment design detailed earlier, eight different personalized placements were tested across N11 platforms. This scope amounted to 15 percent of all available placements. However, because these were the most frequented ones, their high traffic made them ideal for measuring the difference between the personalization engines compared in the A/B test. 

In each of the tested placements, Yusp proved to be the more efficient solution. It outperformed the competing personalization engine in all pre-set KPIs: the value of conversions generated by recommendations (gross merchandise value, or GMV); GMV per 1000 recommendations, which helps measure the return on investment; and click-through rate (CTR), which indicates the accuracy of the product recommendations. 
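For readers who prefer to see these KPIs as formulas, here is a minimal sketch of how they could be computed from raw counts. It assumes that CTR means clicks divided by recommendations shown; the class, the field names and the figures are invented for illustration and are not N11 data.

```python
from dataclasses import dataclass

@dataclass
class PlacementStats:
    recommendations: int  # number of recommendations shown
    clicks: int           # clicks on those recommendations
    gmv: float            # value of conversions generated by recommendations

def ctr(stats: PlacementStats) -> float:
    """Click-through rate: share of shown recommendations that were clicked."""
    return stats.clicks / stats.recommendations

def gmv_per_1000(stats: PlacementStats) -> float:
    """GMV normalized to 1000 recommendations, for comparing unequal traffic."""
    return 1000 * stats.gmv / stats.recommendations

if __name__ == "__main__":
    # Hypothetical figures for two engines on the same placement.
    engine_a = PlacementStats(recommendations=20_000, clicks=500, gmv=3000.0)
    engine_b = PlacementStats(recommendations=18_000, clicks=540, gmv=3150.0)
    for name, stats in (("A", engine_a), ("B", engine_b)):
        print(name, f"CTR={ctr(stats):.3f}", f"GMV/1000={gmv_per_1000(stats):.1f}")
```

Normalizing GMV per 1000 recommendations matters because the compared engines rarely serve exactly the same number of recommendations during a test.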

This A/B testing success has led to a long-term collaboration between N11 and Yusp. Today, N11 leverages the personalization potential of Yusp on its web and mobile platforms in over 60 placements. These placements generate a significant part of the e-retailer’s total revenue.

Better With Than Without: Yusp Justifies Investment in Personalization

Retail chain Cora Romania wanted to A/B test Yusp against their default, non-personalized setup to see if personalization could make their online store more profitable. So we integrated Yusp into the website on several placements, including personalized

As you can see, that’s a fairly broad scope for an A/B test.

To measure the added value generated by the recommendations, we turned off all the personalization features for 3 weeks for a randomly selected customer group. We measured the difference between the two groups during the 3 weeks, and Yusp proved its ability to increase the conversion rate by 10 percent and the total revenue by 7 percent.

Your Guide to Successful A/B Testing

As we have demonstrated, A/B testing is a versatile tool suitable for a variety of purposes. However, its simplicity is deceptive: though the basics summarized here seem straightforward enough, when it comes down to the nitty-gritty of actual A/B testing, the average person typically lacks the expertise or the resources to conduct conclusive tests. In other words: Don’t try this at home. A/B testing is best left to professionals. 

As the vendor of a personalization engine that runs A/B tests, and as the provider of the winning solution in multiple comparisons, the Yusp team has amassed considerable knowledge about A/B testing. All you’ve read about here is just the tip of the iceberg (don’t get us started on statistical significance, confidence intervals, A/A testing, or the Mann–Whitney U test…). Do get in touch, though, if this piece has got you thinking about measuring or comparing the various ways your business can leverage personalization. We’re up for the next test.