The Yusp personalization engine includes numerous recommendation algorithms to solve personalization tasks efficiently, in various industry segments. One of Yusp’s recent algorithmic developments is the deep learning recommendation framework that tackles the session-based recommendation problem.
Here, the challenge is to provide relevant recommendations based explicitly on the customer’s in-session activity (browsing, search, etc.) and to predict which products or content to show the user next. It’s a typical setup in many use cases, yet it has received little attention from practitioners and the research community so far.
Online user experience can be significantly improved if we present relevant content based on the user’s actual intent, captured real-time in their browsing session.
The algorithms are the invention of the research team of Yusp (Gravity R&D) and come with a highly optimized implementation in the Yusp personalization engine. The framework was evaluated in different real-world personalization use cases and showed a significant (5–15%) uplift in different business KPIs (see details in Case studies) compared to other state-of-the-art methods in comparative A/B tests.
An additional advantage is the out-of-the-box compliance with GDPR and similar regulations, since being session-based means that the setup doesn’t necessarily need to store customer data beyond the session.
In this material, we explain what deep learning is and how we use it for personalization. Then we highlight some of our use cases and cherry pick some other personalization related applications of deep learning technology. After a short discussion on the pros and cons of deep learning in personalization, we provide some hints on potential future research directions that our team finds interesting.
What is deep learning?
Deep learning is a subclass of machine learning that uses complex models that consist of a cascade of non-linear processing layers (also called deep neural networks) to learn different representations of the data.
Deep learning has achieved significant improvements over previous approaches in complex domains such as image recognition, speech recognition, natural language processing, and more, starting from the early 2010s.
The uptake of deep learning for recommendations started in late 2015 and it was facilitated by a handful of pioneering research groups. Gravity R&D, the company behind Yusp, has been one of these pioneers and has been considered one of the leaders of this field.
Why is session-based recommendation an essential task for personalization engines?
The case of session-based recommendations has long been an understudied problem in the machine learning community and the personalization industry, yet, it has multiple important use cases.
Tracking users across sessions is not straightforward at all.
User identification technologies that provide some level of user recognition – such as cookies and browser fingerprinting – are often not reliable enough, especially across different devices and platforms. These technologies can’t recognize the users who opt-out from cookie tracking, and they may raise privacy concerns for users who opt-in.
Many eCommerce and content websites typically can’t track users who visit infrequently, because the cookies expire between visits. Moreover, even if user tracking is possible, users may only have one or two sessions on the website within the tracking cookie’s validity period. Besides, in specific domains (e.g. classified sites), users’ behavior often shows session-based traits.
Even if user tracking works appropriately, user intent may change from session to session, so it should be captured in real time and reflected in the personalized recommendations. To prepare the algorithms for such use cases, the sessions of the same user should be handled independently.
This is termed the session-based recommendation problem.
Previously, personalization engines typically handled the session-based recommendation problem with relatively simple methods that take only the last user click into account and ignore the earlier clicks within the session.
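To make the limitation concrete, here is a minimal sketch of such a last-click method. It assumes a simple item co-occurrence count as the "simple method"; the function names and the toy sessions are hypothetical and not taken from any production engine.

```python
from collections import Counter, defaultdict


def build_cooccurrence(sessions):
    """Count, for every item, how often each other item appears in the same session."""
    co = defaultdict(Counter)
    for session in sessions:
        unique_items = set(session)
        for a in unique_items:
            for b in unique_items:
                if a != b:
                    co[a][b] += 1
    return co


def recommend_from_last_click(co, session, n=3):
    """Recommend the n items that co-occur most with the LAST click only."""
    if not session:
        return []
    neighbours = co.get(session[-1], {})
    # Sort by descending count, then by item ID so ties are deterministic.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


# Toy training sessions (invented for illustration).
training_sessions = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "d"]]
co = build_cooccurrence(training_sessions)

# Two different sessions that end with the same click get identical lists,
# because everything before the last click is ignored.
print(recommend_from_last_click(co, ["c", "a"]))
print(recommend_from_last_click(co, ["d", "a"]))
```

Both calls return the same list, which is exactly the information loss that a sequence model over the whole session is meant to avoid.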
The early history of deep learning in personalization
Before our work on session-based recommendation started, deep learning had only a few applications to personalization-related tasks.
One notable early method was the use of Restricted Boltzmann Machines (RBM) for collaborative filtering. An RBM is a two-layer neural network, used here to model user-item interactions and produce recommendations. It was applied to the Netflix Prize problem and became one of the best-performing collaborative filtering models.
Deep models have also been used to extract features from unstructured content (music or images) that are then used together with more conventional collaborative filtering models. For example, a convolutional deep network is used to extract features from music files that are then used in a factor model. In a more generic approach, a deep network is used to extract generic content features from any type of item. These features are then incorporated in a standard collaborative filtering model to enhance the recommendation performance.
In other machine learning fields, various deep models were applied very successfully. The application domains range from image, audio, and speech recognition to natural language processing, machine translation, bioinformatics, and drug design, to name a few.
Achievements of the Yusp research team
Our choice of deep neural network model for the session-based recommendation task was the recurrent neural network (RNN). The RNN is the go-to deep learning solution for processing sequences; its successful applications range from text translation to image captioning.
However, the RNN concept has to be adapted to the recommendation setup. First, the Yusp team opted for a more advanced version of the RNN that uses gated recurrent units (GRU). Then, the Yusp team introduced the following modifications to the GRU-based RNN model for the recommendation domain:
- session-parallel mini-batches,
- mini-batch based customized negative sampling, and
- custom loss function tailored for ranking recommendable items.
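The three modifications can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the production implementation: `session_parallel_batches` and `bpr_loss_in_batch` are hypothetical names, the mini-batch based negative sampling is assumed to mean that the target items of the other sessions in the batch act as negatives, and a BPR-style pairwise loss stands in for the custom ranking loss, whose exact form is not given here.

```python
import math


def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset) triples.

    Each of the batch_size slots walks through one session, one click per step.
    When a session runs out, the next unused session takes over the slot and
    reset[i] is True, signalling that slot i's hidden state should be cleared.
    """
    sessions = [s for s in sessions if len(s) >= 2]
    if len(sessions) < batch_size:
        raise ValueError("need at least batch_size sessions with 2+ events")
    slot_session = list(range(batch_size))  # session index read by each slot
    slot_pos = [0] * batch_size             # position inside that session
    next_session = batch_size
    reset = [True] * batch_size
    while True:
        inputs = [sessions[s][p] for s, p in zip(slot_session, slot_pos)]
        targets = [sessions[s][p + 1] for s, p in zip(slot_session, slot_pos)]
        yield inputs, targets, list(reset)
        reset = [False] * batch_size
        for i in range(batch_size):
            slot_pos[i] += 1
            if slot_pos[i] + 1 >= len(sessions[slot_session[i]]):
                if next_session >= len(sessions):
                    return  # not enough sessions left to keep the batch full
                slot_session[i] = next_session
                slot_pos[i] = 0
                next_session += 1
                reset[i] = True


def bpr_loss_in_batch(scores):
    """BPR-style ranking loss with in-batch negatives (needs batch size >= 2).

    scores[i][j] is the score that example i assigns to the target item of
    example j. The diagonal holds the positive scores; every other entry in
    row i is treated as a negative sample for example i.
    """
    n = len(scores)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = scores[i][i] - scores[i][j]
                total += math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
                count += 1
    return total / count


toy_sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11]]  # invented item IDs
for inputs, targets, reset in session_parallel_batches(toy_sessions, batch_size=2):
    print(inputs, targets, reset)
```

In a real model, the `scores` matrix would come from the network output for the current mini-batch; reusing the other sessions' targets as negatives avoids a separate sampling step.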
The algorithm, coined GRU4Rec, is therefore designed with session-based recommendations in mind, i.e., it recommends items that are suitable for continuing the user’s current session.
Alternatively, it’s possible to use the recent user history instead of the actual session on domains where session-based behavior isn’t commonplace. The Yusp team describes GRU4Rec in detail in our ICLR 2016 (pdf) and CIKM 2018 (pdf) papers.
From 2017, the Yusp team integrated an optimized version of GRU4Rec into the Yusp personalization engine – by Gravity R&D – and the solution was evaluated in different real-world recommendation setups and showed significant (5-15%) uplift compared to other state-of-the-art methods in comparative A/B-tests.
In the first case study, we introduce a personalization case for an OTT video site similar to YouTube. It’s a typical example of session-based recommendations because visitors’ interests frequently change from session to session.
For example, a user may click on a video that is completely different from their long-term history, because in-session intent is driven by different factors. Users often come to the site from external referring pages, such as social networks, video links embedded in other websites, or instant messaging applications. Two such visits of the same user, even from the same social network, often can’t be connected, especially if the user blocks cookies, since the path of the visits can’t be tracked.
In most cases, there are many first-time visitors, visitors whose cookies have expired since their last visit to the platform, and they’re considered practically first-time visitors. These first-time visits create an essential opportunity for the video portal to capture their interest and convert them into more frequent visitors.
To evaluate the efficiency of Yusp’s deep learning module and GRU4Rec, the team behind Yusp conducted a comparative study, where we randomly divided visitors into four separate groups. Visitors in each group received recommendations generated by the selected algorithms. The Yusp team tested four algorithms for 2.5 months:
- Baseline algorithm – used to serve all requests of the OTT site by Yusp
- RNN – an early version of GRU4Rec described in the CIKM 2018 paper
- RNN seq – we used the same model as in the previous point to recommend a sequence of N videos, instead of recommending the N best guesses for the next video
- CF method – an item-2-item collaborative filtering algorithm trained on the same data as the two RNN versions.
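The difference between the RNN and RNN seq variants can be illustrated with a toy example. The sketch below is a simplification: `scores` is an invented table of next-item scores that depends only on the current item, whereas a real RNN conditions on the whole session through its hidden state. All names and numbers are hypothetical.

```python
def rank(next_scores):
    """Return item IDs sorted by descending score (ties broken by item ID)."""
    ordered = sorted(next_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered]


def top_n_next(scores, current, n):
    """'RNN' style: the n best guesses for the single next item."""
    return rank(scores.get(current, {}))[:n]


def greedy_sequence(scores, current, n):
    """'RNN seq' style: predict one item, feed it back in, and repeat n times."""
    sequence, seen = [], {current}
    for _ in range(n):
        candidates = [i for i in rank(scores.get(current, {})) if i not in seen]
        if not candidates:
            break
        current = candidates[0]
        sequence.append(current)
        seen.add(current)
    return sequence


# Invented next-item scores: scores[x][y] = how likely y follows x.
scores = {
    "a": {"b": 0.6, "c": 0.3, "d": 0.1},
    "b": {"d": 0.7, "c": 0.2, "a": 0.1},
    "d": {"c": 0.9, "a": 0.1},
    "c": {"a": 1.0},
}

print(top_n_next(scores, "a", 3))       # candidates for the very next video
print(greedy_sequence(scores, "a", 3))  # a watch path starting after "a"
```

The two lists contain the same items in a different order here; with a larger catalog they can also differ in content, because the sequence variant follows a likely viewing path rather than ranking alternatives for one step.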
It was the first production test of the algorithm. Therefore, only a few slots of the recommendation box were dedicated to recommendations from RNN and RNN seq for their respective user groups, and the baseline algorithm filled the rest of the box. Consequently, this test setup may understate the difference between groups.
We examined the following metrics in this experiment. Note that these metrics have a direct impact on revenue: in the OTT video industry, more videos clicked, played, and watched means more ads displayed, which translates into more ad revenue.
- WatchTime / Rec: The video-watching time of users (in seconds) normalized by the number of recommendations
- Played / Rec: The number of PLAYED events relative to the number of recommendations
- View / Rec: The number of VIEW events relative to the number of recommendations
- Click / Rec: The number of clicks on recommendations divided by the number of recommendations, similar to click-through rate
The Yusp team calculated all the metrics for each user group. The difference between PLAYED and VIEW events is that PLAYED requires the user to watch at least a certain proportion of the video, while there is no such restriction for VIEW events. Therefore, we assume that a PLAYED event signals a stronger user preference than a VIEW event.
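As a minimal sketch, the four metrics and their comparison against the baseline group (shown as 100%) could be computed as follows. The function names and the event counts are invented for illustration; only the metric definitions come from the list above.

```python
def per_recommendation_metrics(n_recommendations, watch_seconds, played, views, clicks):
    """Normalize a group's raw event totals by its number of recommendations."""
    if n_recommendations <= 0:
        raise ValueError("n_recommendations must be positive")
    return {
        "WatchTime / Rec": watch_seconds / n_recommendations,
        "Played / Rec": played / n_recommendations,
        "View / Rec": views / n_recommendations,
        "Click / Rec": clicks / n_recommendations,
    }


def relative_to_baseline(group, baseline):
    """Express each metric as a percentage of the baseline group's value."""
    return {name: 100.0 * group[name] / baseline[name] for name in baseline}


# Invented totals for two user groups.
baseline = per_recommendation_metrics(1000, 50000.0, 100, 200, 50)
variant = per_recommendation_metrics(2000, 105000.0, 210, 410, 101)

for name, pct in relative_to_baseline(variant, baseline).items():
    print(f"{name}: {pct:.1f}% of baseline")
```

Normalizing by the number of recommendations makes groups of different sizes comparable, which matters when traffic is not split perfectly evenly.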
We summarized the results in the following graph, where we compared the performance of the algorithms RNN, RNN seq, and CF to the Yusp baseline algorithm (group B), indicated as 100%. One can conclude that:
- Even with limited deployment, RNN and RNN seq are significantly better than the baseline algorithm in all metrics:
  - The added watch time of RNN and RNN seq over the baseline is 779 and 1674 seconds per 1000 recommendation requests, respectively.
  - The number of PLAYED events also increases by 26.9 and 31.3 per 1000 recommendation requests.
  - The number of VIEW events increases by 52.5 and 47 per 1000 recommendation requests.
- The added value over CF is lower, but still significant:
  - The added watch time of RNN and RNN seq over CF is 371 and 1266 seconds per 1000 recommendation requests, respectively.
  - The number of PLAYED events increases by 5.5 and 9.8 per 1000 recommendation requests.
  - The number of VIEW events increases by 32.8 and 38.4 per 1000 recommendation requests.
- CF is also better than the baseline logic because it’s trained on a pre-filtered set of data.
- RNN seq is significantly better than the general RNN in WatchTime / Rec and Played / Rec:
  - The sequential prediction results in 895 more seconds of watch time per 1000 recommendation requests than the general RNN (1674 - 779 over the baseline, or 1266 - 371 over CF).
  - The number of PLAYED events also increases by 4.4 per 1000 recommendation requests. Still, the number of VIEW events hasn’t changed significantly, and the number of clicks decreased by 3.2 per 1000 recommendation requests.
We can summarize that there is an up to 5% increase in the metrics evaluated. Assuming ad revenue grows linearly with the display inventory, this 5% uplift can be translated to the same level of ad revenue increase at our video site client.
The second case study covers an online marketplace site. Here, too, there are quite a few so-called recommendation scenarios, also known as placements, where session-based strategies can model customers’ behavior well, because visitors’ interests may change from session to session when they look for a new product to buy.
We picked two typical placements:
- Product page: recommendation of similar products to the highlighted one that may be also interesting for the customer
- Main page: personalized recommendation based on customer’s earlier clicks.
It’s relatively straightforward to apply the session-based recommendation logic for the product page placement that predicts the best next product to display for each customer. But instead of considering only the highlighted product on the current page, the last few item views of the user are also considered to determine the recommendations.
As we’ll show next, a session-based strategy can also significantly improve the KPIs on the main page placement. When the main page is the entry point of a session, the GRU4Rec algorithm may utilize the customer’s last session data, while we use the in-session data when the customer returns to the main page within a session.
For both placements, the Yusp team used the baseline recommendation logic as a fallback when GRU4Rec was unable to predict a sufficient number of products due to missing data. The fallback ratio was about 10-15% for the product page and around 25-30% for the main page. The latter could be due to the impact of first-time and infrequent (no cookie available) visitors.
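The fallback logic can be pictured as filling the open slots of a recommendation box. The sketch below is an assumption about how such padding might work, not the engine's actual code; `fill_with_fallback` and the item IDs are hypothetical, and the source may define the fallback ratio per request rather than per item.

```python
def fill_with_fallback(primary, fallback, n):
    """Return up to n unique items, preferring the primary list.

    Items from `fallback` are used only for slots the primary model could not
    fill. The second return value is the number of slots filled by fallback.
    """
    result, seen = [], set()
    from_fallback = 0
    for is_fallback, items in ((False, primary), (True, fallback)):
        for item in items:
            if len(result) == n:
                return result, from_fallback
            if item not in seen:
                seen.add(item)
                result.append(item)
                if is_fallback:
                    from_fallback += 1
    return result, from_fallback


# The session model found only two products; the baseline fills the other two.
box, n_fallback = fill_with_fallback(["p1", "p2"], ["p2", "p7", "p9", "p4"], n=4)
print(box, f"fallback share: {n_fallback / len(box):.0%}")
```

Counting the fallback items per box makes it possible to track how often the primary model runs short, for example for first-time visitors with no usable history.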
In a long-term study lasting about 10 months, we compared the efficiency of Yusp’s deep learning module and GRU4Rec against a well-performing, personalized, collaborative filtering baseline algorithm trained on the same data as the RNN-based deep learning algorithm.
In this experiment, we could directly measure the business impact of the GRU4Rec algorithm through revenue and, in particular, the revenue gain of the new recommendation strategy. As the primary raw metric, the Yusp team measured the revenue through recommendations for both placements. On the graphs below, we show the cumulative daily average revenue gain percentages with a confidence interval at p=0.05.
We omitted the first 30 days of the product page test and the first 60 days of the main page test from the graphs, because in the initial period the cumulative metric has high variance and a wide confidence interval. The drop in the first graph can also be attributed to this effect.
We don’t share exact numbers for privacy reasons as far as the revenue is concerned, but it represents a significant business value for the marketplace site.
On each graph, the dark green line represents the total cumulative revenue gain percentage, and the light green area represents its confidence interval at p=0.05.
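How such a curve behaves can be shown with a small sketch. The exact method behind the graphs is not described here, so the code below assumes a running mean of daily gain percentages with a normal-approximation 95% interval (z = 1.96); the function name and the daily values are invented.

```python
import math


def cumulative_gain_with_ci(daily_gain_pct, z=1.96):
    """Return (day, running mean, lower bound, upper bound) for each day.

    Assumes daily gains are roughly independent, so the interval half-width
    is z * sample standard deviation / sqrt(number of days).
    """
    rows = []
    total = total_sq = 0.0
    for day, gain in enumerate(daily_gain_pct, start=1):
        total += gain
        total_sq += gain * gain
        mean = total / day
        if day > 1:
            variance = max((total_sq - day * mean * mean) / (day - 1), 0.0)
            half_width = z * math.sqrt(variance / day)
        else:
            half_width = float("inf")  # one day alone gives no variance estimate
        rows.append((day, mean, mean - half_width, mean + half_width))
    return rows


# Invented daily revenue gain percentages.
for day, mean, low, high in cumulative_gain_with_ci([20.0, 0.0, 10.0, 10.0]):
    print(f"day {day}: mean {mean:.1f}%, interval [{low:.1f}, {high:.1f}]")
```

The interval is very wide in the first days and narrows as more days accumulate, which is why the early part of a long-running test is hard to interpret.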
One can conclude that the session-based recommendations can greatly improve the generated revenue from these two placements. The averaged daily revenue gain is
- 14.6% on the main page, and
- 10.6% on the product page.
Note that as a side effect of the long-term experiment, the baseline collaborative filtering algorithm also learns from the successful recommendations of GRU4Rec, meaning that the overall performance of the system also increases, even though measuring the extent of this effect is impossible.
Summarizing the results, using the GRU4Rec session-based algorithm yields a revenue gain from recommendations of more than 10% on the product page and almost 15% on the main page of a marketplace site.
Other use cases of deep learning in the personalization industry
There are a growing number of applications of deep learning technology in the area of recommendation engines. Therefore we only select a few notable and interesting ones here.
CTR prediction in display advertising
Display advertising is a form of online advertising where the advertisers’ ads are placed on publisher websites. In the online advertising ecosystem, it’s crucial for each party to target the placed ads accurately, which means that estimating users’ propensity to click on ads is vital.
This task is called CTR prediction, aiming to estimate the ratio of clicks to impressions of displayed ads. Accurately predicting the CTR of ads helps all the online advertising ecosystem stakeholders to spend their budget on display advertising efficiently. With the multi-billion-dollar value of display advertising, CTR prediction is indeed a crucial task.
CTR prediction can be cast as a machine learning problem. Namely, user (and ad) attributes are the training features, while the output is the predicted probability of a click on an advertisement.
Numerous learning models have been proposed for CTR prediction and feature representation. Beyond the traditional shallow models, like logistic regression and factorization machines, recently deep models were also proposed.
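For readers unfamiliar with the setup, here is a minimal sketch of the shallow end of that spectrum: logistic regression over binary user and ad features, trained with plain stochastic gradient descent. The feature names, the impressions, and the hyperparameters are invented for illustration, and the code is not taken from any of the cited models.

```python
import math


def sigmoid(z):
    z = max(-35.0, min(35.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict_ctr(weights, bias, features):
    """Predicted click probability for one impression.

    `features` is a list of active binary features, e.g. "segment=sports".
    """
    return sigmoid(bias + sum(weights.get(f, 0.0) for f in features))


def train(impressions, epochs=200, learning_rate=0.1):
    """Fit weights by SGD on the log loss; impressions are (features, clicked) pairs."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, clicked in impressions:
            error = predict_ctr(weights, bias, features) - clicked
            bias -= learning_rate * error
            for f in features:
                weights[f] = weights.get(f, 0.0) - learning_rate * error
    return weights, bias


# Invented impression log: (active features, 1 if clicked else 0).
impressions = [
    (["segment=sports", "ad=sneakers"], 1),
    (["segment=sports", "ad=sneakers"], 1),
    (["segment=sports", "ad=sneakers"], 0),
    (["segment=sports", "ad=insurance"], 0),
    (["segment=finance", "ad=sneakers"], 0),
    (["segment=finance", "ad=insurance"], 0),
]
weights, bias = train(impressions)
print(predict_ctr(weights, bias, ["segment=sports", "ad=sneakers"]))
print(predict_ctr(weights, bias, ["segment=finance", "ad=insurance"]))
```

A model like this can only add up per-feature effects. Capturing interactions, such as a particular segment responding to a particular ad, is what factorization machines and deep models add.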
Deep models, with their powerful representation capabilities, can improve on the expressive ability of shallow models, but at the price of lower interpretability and significantly higher computational demand. Notable algorithms include DeepFM, FiBiNET, Wide & Deep, and Deep & Cross. There are also freely available GitHub repositories that contain most of the recent deep models for practitioners.
Deep learning was first applied for recommender systems in the music industry. Sander Dieleman and his colleagues, an early pioneering group in this topic (see their first report published at NIPS 2013), realized that standard collaborative-filtering-based music recommendation suffers from the cold-start problem. Consequently, new and unpopular songs are missing from recommendations, which both hinders potentially interested users from discovering these songs and puts users into a filter bubble of known and popular songs.
To overcome the cold-start effect, they used deep convolutional networks to derive a representation of a song from its audio signal when there wasn’t sufficient usage data. They used this representation as a proxy in their traditional latent factor model for such songs, resulting in sensible recommendations.
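The proxy idea can be sketched without a neural network. In the code below, a nearest-neighbor average over content features stands in for the convolutional network, which is a deliberate simplification and not the authors' method. The feature vectors, latent factors, and function names are all invented.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def proxy_factors(new_content, known_content, known_factors, k=2):
    """Estimate latent factors for a song with no usage data.

    Takes the k known songs whose content features are closest to the new
    song and averages their latent factor vectors.
    """
    nearest = sorted(
        known_content,
        key=lambda song: (squared_distance(new_content, known_content[song]), song),
    )[:k]
    dim = len(known_factors[nearest[0]])
    return [sum(known_factors[s][d] for s in nearest) / len(nearest) for d in range(dim)]


def score(user_factors, item_factors):
    """Latent factor model prediction: dot product of user and item vectors."""
    return sum(u * i for u, i in zip(user_factors, item_factors))


# Invented content features (e.g. two audio descriptors) and learned factors.
known_content = {
    "rock_1": [0.9, 0.8], "rock_2": [0.8, 0.9],
    "ambient_1": [0.1, 0.2], "ambient_2": [0.2, 0.1],
}
known_factors = {
    "rock_1": [1.0, 0.0], "rock_2": [0.8, 0.2],
    "ambient_1": [0.0, 1.0], "ambient_2": [0.1, 0.9],
}

new_song = proxy_factors([0.85, 0.85], known_content, known_factors)
print(score([1.0, 0.0], new_song))  # a listener who prefers rock
print(score([0.0, 1.0], new_song))  # a listener who prefers ambient
```

Once the new song has a factor vector, the existing latent factor model can rank it like any other item, so it can be recommended before anyone has played it.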
This work has been applied in the recommender system of the popular music streaming platform, Spotify. The neural network layers represented a group of filters of the songs’ musical features, like vocal thirds or bass drums. A visual representation of a layer (source: Dataconomy) is shown below, where “negative values are red, positive values are blue and white is zero. Note that each filter is only four frames wide. The individual filters are separated by dark red vertical lines”, as explained by Dieleman.
They could also create playlists associated with filters in this representation to demonstrate the similarities of the songs. Note that the naming of such filters isn’t automatic; it has to be done by humans.
Here is a sample playlist of the ambient filter. One of the interesting features of this project is that the deep neural networks could classify songs into subgenres such as gospel, Chinese pop, and deep house. But even more importantly, it could create smarter playlists and help people uncover obscure, unpopular, yet relevant songs they never knew existed.
Deep learning at scale
Deep learning models are computationally expensive and often require specific hardware to run efficiently (GPUs instead of CPUs). Therefore it’s vital for our production system, the Yusp personalization engine, to use computational resources effectively.
Gravity’s R&D team has been continuously working on the optimization of the deep learning module of Yusp and delivered significant achievements in this regard. These achievements are vital to increasing Yusp’s deep learning technology’s profitability, which is crucial for our customers since they evaluate the technology based on cost-benefit analysis. The ROI of our effective implementation is at least 10-15x at the appropriate scale. We performed all optimization steps without jeopardizing the recommendation quality.
For productization, we evaluated deep learning frameworks such as TensorFlow, PyTorch, and Theano. Although the first two frameworks are more popular nowadays, we selected Theano for a number of reasons:
- For a long time, Theano was the only close-to-production-level deep learning framework, and thanks to its straightforward Python API and ease of installation, it’s convenient to work with.
- Over the years, our team accumulated a lot of knowledge on how Theano works, how to optimize Theano code, and how to extend its capabilities. We developed several custom operators to further speed up various parts of the training process. We also shared some of our improvements with the Theano community as a contribution.
- The operators of Theano are generally well optimized for fast execution on GPU. The training process is still 3.5-4 times faster than in other frameworks when executed on a single GPU, because Theano compiles the whole computational graph into monolithic C++/CUDA code before execution. Utilizing a single GPU to its fullest potential, instead of achieving the same speed by distributing computations between several GPUs, also aligns with our cost-effectiveness philosophy.
We also carried out the following optimization steps:
- With a smart preprocessing layer, GRU4Rec can now serve recommendations from CPU with low response times, and GPUs are devoted only to training. Beyond the significant cost decrease, it also resulted in a 33% throughput increase.
- Significant training time reduction. By optimizing data loading and preprocessing, we shortened the training time.
- Improvements in the service. We optimized the memory efficiency of the prediction service, enabling the system to handle use cases with much larger catalog sizes.
- A new cold-start algorithm with deep learning, coined GRU4RecCS. This version can recommend similar products for a cold-start product (hence the CS ending in its name) based on its metadata. The solution required us to (1) implement custom Theano operators that process sparse matrices efficiently; we implemented them in CUDA, NVIDIA’s parallel computing platform for GPUs, because the standard operators in deep learning frameworks are usually orders of magnitude slower than custom ones; (2) build an optimized metadata cache for model training; and (3) build another one for the online service.
We can conclude that the deep learning module of Yusp delivered significant achievements in terms of added performance and business benefits. At the appropriate scale, we can achieve at least 10-15x ROI of the hardware cost with various GRU4Rec family algorithms.