RecSys 2021 – Impressions & Summary

Balázs Hidasi
26 min read | November 18, 2021

After the fully virtual conference of 2020, RecSys 2021 was organized as a hybrid conference. The physical part was located in Amsterdam; those who couldn’t be there in person could join online. Last year I basically missed the conference due to distractions preventing me from attending the online sessions live. So, I was excited that I could sort out traveling requirements in the end and was able to attend in person. I was also a little worried about how the hybrid event would fare: whether remote presentations would feel like pre-recorded talks, or how much time would be wasted on sorting out technical issues. Fortunately, the large majority of the presentations went swimmingly; the organizers, the presenters, and the audiovisual team really outdid themselves to provide a very pleasant conference experience.

There were even a few surprising upsides of the hybrid event. For example, the venue wasn’t crowded at all (~300 of the ~1100 total participants were present in person), so the crowd was easy to manage. In order to accommodate as many time zones as possible, regular sessions started after lunch, so waking up early wasn’t necessary. And of course, researchers who wouldn’t have been able to commit the money or the time towards the conference could still follow it online. Maybe keeping this format even after the pandemic is over wouldn’t be a bad idea. The obvious downside of the hybrid approach is that connecting with colleagues and partners was way harder, as it required conscious effort instead of just running into them in the crowd.

The core program was split into 12 sessions over four days. The conference was single-track; industry talks and reproducibility papers were part of the regular sessions. I think that both of these were good decisions. Some people don’t like single-track conferences, because not every session is interesting to everyone, but I often find that multi-track conferences tend to schedule the sessions I’m interested in in parallel, while still leaving time slots where I can’t find a session I’m excited about. Also, one might find hidden gems in sessions that didn’t seem exciting at first glance. Sprinkling in industry talks and reproducibility papers made sessions more varied, and the driving theme behind sessions was not research versus application, but rather a (broad) topic. Since most of the attendees and presenters of RecSys already come from industry, and some of the research papers are basically application papers, putting industry talks into separate sessions doesn’t really make sense.

In this blogpost I summarize my personal takeaways from the conference, the trends that I see in recommender systems research, and I highlight the papers that I found interesting.

The Beurs van Berlage – originally designed as a commodity exchange – provided the venue for the conference in the city center of Amsterdam.

Trends in Recommender System Research

Reinforcement learning is becoming a popular approach. This makes sense because reinforcement learning really lends itself to some (more advanced) aspects of recommendations. With traditional models it is fairly straightforward to optimize for short-term rewards, like CTR, where every reward (click) is both immediate and can be attributed to a single recommendation. But CTR is just a proxy KPI in most cases. The more interesting performance indicators (customer loyalty, revenue, etc.) entail rewards that don’t have these properties. 

Traditionally, it has been hard to apply reinforcement learning for recommendations, due to the large state-action space and the sparse data. With tools lifted from deep reinforcement learning, now we can model relations between different actions or states using latent representations. Neural networks can also be used to learn and approximate the appropriate functions (e.g. value function) from sparse data without having to design this function explicitly. 

Full-on reinforcement learning is still mostly used by larger companies, as they are the ones that have enough data and computational power, as well as production systems where they can test these models. It is hard to evaluate a reinforcement learning system offline, so having your own production system is key. This might change as simulators become better and better, but I don’t think that we are there yet. Of course, reinforcement learning for recommenders has been around for several years in a simpler form: multi-armed bandit algorithms.

Evaluation is still the biggest problem of the field in my opinion. It has been well known among practitioners for a long time that offline evaluation metrics are unreliable, but we have stuck with them because there was no better alternative. At RecSys 2021, I heard several times that we should move on from these metrics. 

There might be a change coming in offline evaluation, just like how the community finally ditched rating prediction errors (RMSE, MAE) a few years ago (even though you can still encounter a few rating prediction papers outside of RecSys). However, I don’t think it will be that easy this time around. Moving on from rating prediction took more than 5 years, even after it was clear that it was uninformative about ranking performance and alternative metrics were already available. Now, the whole evaluation methodology is being put into question, and the potential alternatives also have significant drawbacks.

Current offline evaluation basically tries to predict the organic feedback (or behaviour) of users from past data, i.e. which items they will consume (view, purchase, watch, etc.) based on their historical behaviour. This prediction task is not aligned with the goal of the recommender, because the recommender should help the users find what they need (or are interested in) by suggesting relevant items. Instead, the evaluation rewards mimicking future behaviour. 

The reason why this approach can still work to some extent comes down to several factors, the most important being that models generalize across many users and items, and thus recommenders focusing on this metric can still create value for both the users and the service. In fact, frequently used metrics, such as recall, MRR, etc., correlate with CTR (click-through rate), the most commonly used online KPI. But there is no direct relation between the two: a 5% increase in recall won’t result in a 5% increase in CTR; most likely it will give a very similar online performance. However, the larger the difference in offline accuracy metrics, the more likely it is that the online difference will also be significant, unless other factors of the algorithm (such as the lack of generalization and exploration in instance-based methods) counterbalance this effect.
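
To make the offline side of this comparison concrete, here is a minimal sketch of two of the metrics mentioned above, recall@K and MRR, computed for a single ranked list; the item ids are of course made up:

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    top_k = set(ranked_items[:k])
    return len(top_k & set(relevant_items)) / len(relevant_items)

def mrr(ranked_items, relevant_items):
    """Reciprocal rank of the first relevant item (0 if none is ranked)."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0

ranking = ["a", "b", "c", "d", "e"]
print(recall_at_k(ranking, ["c", "x"], k=3))  # 0.5: one of two relevant items in the top-3
print(mrr(ranking, ["c", "x"]))               # 0.333...: first hit at rank 3
```

In a real evaluation these are averaged over many test users; the point is that both reward mimicking observed future behaviour, not recommendation quality per se.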

So, it works to some extent, but it is not too reliable, and one clearly can’t judge most scientific papers based on these metrics alone, because the same 5-10% improvement sometimes translates to nothing online, and at other times to significant performance gains. This is further compounded by the various evaluation setups used in different papers, many of which are really far from real life, while some are outright flawed. I’m baffled that papers with unrealistic setups (e.g. negative item sampling, data leaking through time, etc.) keep being accepted to conferences, which in turn propagates these bad evaluation setups.

What are the alternatives to organic feedback prediction? One approach is to focus on the interaction feedback and to do counterfactual learning and off-policy evaluation. There was a really good tutorial on this topic this year; what follows is a simplified summary.

Recommendation requests are logged, and items that were recommended by an online recommender as well as the users’ feedback on them are recorded. The task is to approximate how many clicks a new algorithm would have gathered on the same requests. Due to the interactive nature of recommenders, one can’t tell what would have happened if a different recommender had been used. One also needs to be careful not to optimize the new algorithm to mimic the algorithm that is in production. Counterfactual learning focuses on estimating the reward (clicks) in this setup. 
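
As a toy illustration of this setup, here is the classic inverse propensity scoring (IPS) estimator, the simplest off-policy estimator; the logging probabilities and click rewards below are made up for the example:

```python
def ips_estimate(logs, new_policy_prob):
    """
    Inverse propensity scoring (IPS) estimate of the reward a new policy
    would have collected on logged recommendation requests.

    logs: list of (context, shown_item, reward, logging_prob) tuples, where
          logging_prob is the probability the production policy showed the item.
    new_policy_prob: function (context, item) -> probability under the new policy.
    """
    total = 0.0
    for context, item, reward, p_log in logs:
        # Reweight each logged reward by how much more (or less) often the
        # new policy would have shown the same item in the same context.
        total += reward * new_policy_prob(context, item) / p_log
    return total / len(logs)

# Toy example: a uniform logging policy over 4 items, and a new policy that
# always picks item 0. Only logged clicks on item 0 contribute to the estimate.
logs = [(None, 0, 1.0, 0.25), (None, 1, 0.0, 0.25),
        (None, 2, 1.0, 0.25), (None, 3, 0.0, 0.25)]
always_item_0 = lambda ctx, item: 1.0 if item == 0 else 0.0
print(ips_estimate(logs, always_item_0))  # 1.0
```

The estimator is unbiased, but the ratio in the loop is exactly where the variance problem of large item catalogues comes from: rarely logged items produce huge weights.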

Unfortunately, most of the approaches don’t scale well to the large action space (item catalogue) of recommenders, due to the large variance of the estimates. A few methods have been proposed to mitigate this, and certain metrics – such as recap – correlate well with online A/B tests, but it is still a work in progress.

Another alternative is the use of simulators. The idea here is to measure online without having to deploy a new algorithm in the production system. Modern simulators are capable of learning from logged behaviour data (organic feedback) and simulating user actions. This might be a viable direction, but it is unclear as yet how accurate these simulators really are. 

The gold standard for most researchers is the online A/B test. The better-known problem with A/B tests is that they are expensive and slow. They are also not fit for scientific reproducibility, because production systems are proprietary and the exact conditions of the test are unique.

I also note that some online KPIs can be measured more easily than others. The less directly a metric is connected to a single recommendation, the harder it gets to measure its effect. Unfortunately, these loosely connected metrics are the most interesting ones from a business point of view. So, even if the community manages to move to an evaluation methodology that approximates A/B tests on CTR really well, we will still face very similar problems to those we struggle with today.

Responsible recommendation is becoming more and more emphasized as time passes. This broad area encompasses a variety of topics, such as the social effects of recommenders, recommenders for social good, filter bubbles, misinformation filtering, and so on. The largest social media platforms do influence society and can amplify or inhibit certain ideas or processes, and they can be easily misused by malicious actors. The main issue here is that there is no exact definition of what we want to achieve and how; there are a lot of grey areas, and the line is drawn at a different place for everybody. Most of this is not something that can be solved by an algorithm, at least not until we know what we want to solve. Even then, machine learning algorithms will never give perfect results. Despite all of this, I still think that it is worth discussing this topic, as this is how we will move forward.

Fairness is a subtopic of responsible recommendations. A recommender mustn’t discriminate between its users or between items. Of course, it can still personalize, and it can recommend items with varying frequency, as some items will always be more relevant than others. Item-side fairness might require some explanation. For example, YouTube pushing or burying certain content creators can make or break their careers; or an unfair recommender on a marketplace can negatively affect small businesses. In my opinion, the recommender shouldn’t use any kind of sensitive or even personal data; then it won’t be able to discriminate based on that data. Fortunately, most of the well-performing algorithms don’t need this information. The event data itself can still be biased towards certain items, but more often than not this comes down to the feedback loop: recommenders usually derive relevancy from interactions, but interactions also depend on exposure, which can easily result in a “rich get richer” situation. This should be countered with exploration methods. Fairness is a complex topic and some aspects are influenced by personal convictions. However, I see progress in the field moving from softer, subjective goals towards more well-defined ones.

Deep learning has become a widespread tool. It is used in most of the algorithmic papers. However, recently some skepticism has also cropped up around its usefulness in recommender systems. This is natural, it happens with any kind of approach that radically changes a field. This skepticism is different from the pushback of the early days of deep learning based recommenders and some of the points skeptics make are valid. 

One of these is that making models deeper – without major architectural changes – doesn’t increase performance if the hyperparameters of both the deep and the shallow models are carefully optimized. We have also experienced something similar: modeling sessions of items often works best with one or two recurrent layers and a low BPTT window. Anything more than that usually decreases performance. The same goes for deep matrix factorization, autoencoders, deep factorization machines, etc.

This might be counterintuitive, because human behaviour is complex and therefore deeper networks should be able to model it better. My assumption is that while human behaviour is complex, the data we have isn’t. This is due to how online services are built, and to how the data we have is generated: mostly through items that are presented to the users (through search, recommendation, category listings, static banners), which will ultimately result in somewhat simple patterns, especially within shorter time frames.

Another valid point of the skeptics is that there are many papers where the use of deep learning techniques is not justified. There are papers circulating at various conferences that use very complex architectures without any justification or underlying ideas, for tasks that don’t benefit from them, and then show 1-2% improvement under uninformative offline evaluation setups. I understand that people get tired after reviewing tens of these papers; I do too. Unfortunately, this is not specific to deep learning. It has been present for a long time now and is the result of how research is funded (publication pressure). The only difference is that nowadays it is easier to throw together a complex algorithm in user-friendly open-source deep learning frameworks than it was ten years ago.

Despite all of this, I still think that deep learning has pushed the field forward by a lot in a short time and is still an essential tool in this field. Maybe it is unfortunate that the latest generation of neural networks is called deep learning, because it makes people focus on the models being deep, rather than on them being “neural” models. In my opinion, the three main ways deep learning benefits recommenders are:

  1. There are certain recommendation tasks on which traditional models didn’t work well, especially in online settings. One of these is the sequential/session-based task, which is very common in practice, but wasn’t really working well until recurrent networks (and later: CNNs, attention networks, transformers, etc.) were proposed for the task.
  2. The modularity of deep learning enables us to include heterogeneous information relatively easily in our models. The range of these information sources has also been extended to unstructured data – such as text, image, audio, video – without the need for manual feature engineering for every single use case.
  3. Neural models also provide researchers with function approximators that allow replacing intractable parts in our models. For example, reinforcement learning wouldn’t be viable for the large state-action space of recommender systems without using neural networks to approximate the state-action value or the advantage function.
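
As a minimal illustration of the third point, here is a sketch of replacing an intractable state-action value table with a small neural approximator; the dimensions, the two-layer architecture, and the random weights are arbitrary choices for the example (a real model would be trained, e.g. with temporal-difference learning):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a session/state embedding plus per-item embeddings replace
# a |states| x |items| value table that would never fit in memory.
STATE_DIM, ITEM_DIM, HIDDEN = 16, 8, 32
N_ITEMS = 1000

item_emb = rng.normal(size=(N_ITEMS, ITEM_DIM)) * 0.1
W1 = rng.normal(size=(STATE_DIM + ITEM_DIM, HIDDEN)) * 0.1
w2 = rng.normal(size=HIDDEN) * 0.1

def q_values(state):
    """Approximate Q(state, item) for every item with a tiny 2-layer MLP."""
    x = np.concatenate(
        [np.broadcast_to(state, (N_ITEMS, STATE_DIM)), item_emb], axis=1)
    h = np.maximum(x @ W1, 0.0)   # ReLU hidden layer
    return h @ w2                 # one scalar value per item

state = rng.normal(size=STATE_DIM)
q = q_values(state)
print(q.shape, int(np.argmax(q)))  # one value per item, and the greedy action
```

Because similar items share nearby embeddings, values generalize across the action space instead of being estimated independently per item, which is exactly what makes RL feasible here.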

So, while I think that skepticism here is natural, useful, and valid in some cases, deep learning has substantially improved the field and is here to stay. It is a tool that can be used for many things, and it is up to us as researchers to find the best use for it.

Paper Highlights

RecSys is known for having fluctuating program quality, and in some years it is just uninspiring. I was delighted that this year’s program was well above average. There wasn’t a single breakthrough that I could point out, but there were several papers that I marked for later reading during the presentations. I do this every year, but in 2021 this resulted in a personal record-breaking number of papers to read, which is one of the reasons for publishing this summary late. Most of these papers were interesting to me because of some of the ideas they contained, even if I have criticism toward the overall execution or even the message. But overall, I was happy reading through the majority of the conference papers, as I found several inspiring ideas.

My Top 5 Papers

My favourite paper from the conference is Values of User Exploration in Recommender Systems by Minmin Chen et al. from Google. The exploration-exploitation trade-off is a well-known topic in reinforcement learning. It focuses on the question of whether the system should exploit its current knowledge and act according to it to maximize rewards, or experiment with actions whose outcome is unknown or uncertain. The paper points out that in recommender systems this so-called system exploration is just one aspect of the topic, and the trade-off also affects the users’ exploration of new items/interests.

The authors incorporate three methods into a reinforcement learning based recommender to control the balance between accurate predictions and user exploration: (1) entropy regularization; (2) reward shaping to reward serendipitous recommendations; (3) actionable representations to inform the model about unexpected but relevant finds. These methods are then examined w.r.t. accuracy, diversity, novelty and serendipity. (A slight tangent: The hard part of defining metrics for serendipity is defining which items are unexpected. I really like how this paper defines clusters over the feature vectors – which themselves are based on co-occurrence patterns – to define this, rather than going for the same category approach.) Online tests show that, contrary to previous research, novelty and diversity are not really important for enhancing user experience, but serendipity is key. This makes sense, because novelty and diversity don’t take relevancy into account, but serendipity does. The best results come from combining (2) and (3), because these methods can directly incorporate serendipity. 

I only have minor criticisms of this paper. Some of the important details are left out, such as how the user enjoyment metric is defined, but hey, it’s a Google paper, so that much is to be expected. Also, the state transition matrix of user engagement contains only one significant value (from casual to core), which to me signals that there is no systematic change. It seems rather strange to me that the method only affects casual users and pushes them up to core ones, while there is no casual-emerging or emerging-core transition. But apart from these minor issues, this paper is excellent and is clearly the best paper of the conference, at least in my opinion. Minmin Chen also gave an industry talk that complemented this paper really well. It was titled Exploration in Recommender Systems and discussed system, user, and online exploration in recommender systems.

One of my other favourites is Cold Start Similar Artists Ranking with Gravity-Inspired Graph Autoencoders by Guillaume Salha-Galvan et al. from Deezer Research and LIX École Polytechnique. While the paper highlights similar artist ranking in the title, the proposed method should work for the general item cold-start problem. In my opinion, it might also help in learning better quality representations for warm items. Cold start is common in practice, and the performance of neural approaches greatly depends on the quality of the item embeddings they use. I usually like papers that describe methods that can then be used with many different algorithms, and I think that this one has that potential.

The goal set by the paper is to learn high quality item representations (or embeddings) for cold-start items. A directed graph is defined over the items: a directed edge points from item A to item B if B is among the top K similar items of A. Each item is represented as a (learnable) latent vector, which is produced by a two-layer graph convolutional network from the item features (metadata) and the adjacency matrix of the directed graph. The task is then treated as a link prediction task in the directed graph. The probability of an edge from A to B depends on the influence of B and the similarity of their representations. The influence is also a learned parameter, but there is some connection between influence and popularity. The learning is done using an autoencoder (or its variational version): the input is the metadata of the items and the similarity graph, which is encoded into item embeddings, and a decoder reconstructs the similarity graph using the method described above.
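
A sketch of the decoder side, assuming the gravity-inspired form of a learned influence term minus a log-distance penalty (the exact parametrization in the paper may differ; the vectors and influence values below are made up):

```python
import numpy as np

def edge_prob(z_a, z_b, influence_b, lam=1.0):
    """
    Probability of a directed edge A -> B in the similarity graph: the learned
    'influence' (mass) of the target item minus a penalty on the squared
    distance between the latent representations, squashed by a sigmoid.
    Note the asymmetry: edge_prob(a, b) != edge_prob(b, a) in general,
    because each direction uses the influence of its own target.
    """
    sq_dist = np.sum((z_a - z_b) ** 2)
    logit = influence_b - lam * np.log(sq_dist + 1e-12)
    return 1.0 / (1.0 + np.exp(-logit))

z_popular = np.array([0.0, 0.0])
z_niche = np.array([0.1, 0.0])
# At the same distance, a high-influence target attracts edges far more often,
# which is how the model decouples popularity from topical similarity.
p_to_popular = edge_prob(z_niche, z_popular, influence_b=2.0)
p_to_niche = edge_prob(z_popular, z_niche, influence_b=-2.0)
print(p_to_popular > p_to_niche)  # True
```

Training then amounts to maximizing these probabilities on observed edges and minimizing them on non-edges, with the encoder producing the `z` vectors from metadata.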

I really like the similarity graph idea and that it is taking the asymmetric nature of similarity into account. The gravity-inspired approach also seems useful to me as it enables the algorithm to decouple popularity from topical similarity. Overall, there are several interesting ideas in the paper, I’ll definitely try them out in my own algorithms in the future.

The next one on my list is cDLRM: Look Ahead Caching for Scalable Training of Recommendation Models by Keshav Balasubramanian, Murali Annavaram et al. from the University of Southern California. This is also a method that can be used with many different algorithms, and while it is more of an engineering project than research in my opinion, it is a very useful one.

Almost all deep learning based recommenders use some kind of learnable item embedding. While in some cases this can be computed from embeddings of other entities (e.g. metadata tokens), in most of the cases each item has its own independent embedding vector. The number of items in a real-life recommender is often in the millions, sometimes even in the billions. Even if the embeddings are relatively small, storing all of them in the GPU memory can be challenging. Constantly moving the data from the RAM to the GPU memory has a significant overhead that will result in very slow training and underutilized GPUs. 

The paper proposes a caching solution for this problem, by keeping only a small subset of the embeddings in the GPU memory at any given time while mitigating the overhead of the data transfer by looking ahead in the training data to determine which embeddings will be needed and loading them beforehand. It is based on three processes: the prefetching process looks ahead and finds the unique item indices used in the coming batches; the caching+training process figures out which embeddings are needed in addition to the ones already loaded and does a forward and a backward pass; the eviction process writes back the updated embeddings into memory when they are removed from the cache. 
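
A single-process toy sketch of this idea follows; the real system runs the three roles as separate processes on an actual GPU, and the toy "training" update below merely stands in for backpropagation:

```python
import numpy as np

class LookaheadEmbeddingCache:
    """
    Simplified sketch of look-ahead caching: keep only the embedding rows
    needed by the upcoming batches on the (simulated) device, and write
    updated rows back to host memory when they are evicted.
    """
    def __init__(self, host_table):
        self.host = host_table   # full embedding table in "RAM"
        self.device = {}         # item_id -> embedding row on the "GPU"

    def prefetch(self, upcoming_batches):
        """Look ahead: load missing rows, evict rows no longer needed."""
        needed = set(i for batch in upcoming_batches for i in batch)
        for i in needed - self.device.keys():
            self.device[i] = self.host[i].copy()     # host -> device transfer
        for i in list(self.device.keys() - needed):  # evict and write back
            self.host[i] = self.device.pop(i)

    def train_step(self, batch, lr=0.1):
        for i in batch:          # toy in-place update instead of real backprop
            self.device[i] -= lr * self.device[i]

table = np.ones((6, 4))
cache = LookaheadEmbeddingCache(table)
batches = [[0, 1], [1, 2]]
cache.prefetch(batches)      # load everything the next two batches need
for b in batches:
    cache.train_step(b)
cache.prefetch([[5]])        # evicts rows 0-2 and writes the updates back
print(table[1, 0])           # row 1 was updated twice, then written back
```

The subtleties the paper handles (concurrent prefetch/eviction, missed updates) arise precisely because these steps overlap in time rather than running sequentially as they do here.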

The caching system is very well put together, and the authors considered and handled the potential race conditions. Using the cache introduces some small overhead, and it can’t be 100% accurate (a few updates on certain embeddings can be missed if an embedding is prefetched again while it is still marked for eviction in the cache), but both issues are well within an acceptable range.

I both like and dislike Negative Interactions for Improved Collaborative Filtering: Don’t Go Deeper, Go Higher by Harald Steck and Dawen Liang from Netflix. I like it because it presents a simple, useful idea in a sensible way. The main idea to me shows that there is nothing new under the sun: we have basically come full circle, from limiting frequent itemsets and association rules to size two due to the sparseness of event data, just to extend them again to model higher order interactions. But there is a twist that turns this from a positive thing into a negative. I dislike it because it unnecessarily jumps on the deep learning skepticism bandwagon to gain more attention. 

While neural networks are universal approximators, the theory also states that this requires large networks, sometimes infinite ones. There are some well-known tasks where neural networks suffer in practice, and modeling higher order interactions is one of these. For example, it is not a coincidence that contextual modeling requires different architectures; feeding the context alone is not enough. Since we have never expected neural networks to model higher order interactions well, it appears rather misleading to focus on them in the title and in the introduction of the paper. 

In fact, the main idea of the paper is perfectly compatible with neural models, which is no surprise, since the authors also use an approach that can be represented as a neural network. The idea is simply to use frequently observed item pairs along with the single items. The model here is a simple weighted model without any hidden layers: basically, the items (and frequent pairs) are each assigned a feature vector whose length is equal to the number of items. The sum of the feature vectors belonging to the items (and pairs) in the user’s history gives the predicted scores over the items.
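
A sketch of this scoring scheme, with a made-up catalogue, random weights, and a single mined pair (how the frequent pairs are selected, and how the weights are fit, are separate steps skipped here):

```python
from itertools import combinations

import numpy as np

N_ITEMS = 5
frequent_pairs = [(0, 1)]  # pairs mined from co-occurrence counts (assumed given)

# One weight row per feature: first the single items, then the frequent pairs.
features = {i: i for i in range(N_ITEMS)}
for idx, pair in enumerate(frequent_pairs, start=N_ITEMS):
    features[pair] = idx

rng = np.random.default_rng(0)
W = rng.normal(size=(len(features), N_ITEMS)) * 0.1

def score(history):
    """Sum the weight rows of all active features: every item in the history,
    plus every frequent pair fully contained in it (the higher-order term)."""
    s = np.zeros(N_ITEMS)
    for i in history:
        s += W[features[i]]
    for pair in combinations(sorted(history), 2):
        if pair in features:
            s += W[features[pair]]  # a pair feature models a 3-way interaction
    return s

print(score({0, 1, 3}))  # scores over all items for this user history
```

The pair rows are exactly where the two-antecedents-one-consequence interactions live, without ever mining itemsets larger than two.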

The proposed model with the LS solver is not scalable (it scales cubically with the number of items), but this could be overcome by using different optimizers or by adding a hidden layer. The offline results are also not that impressive (~5% increase on one of the three datasets). So why have I listed this paper among my top five? The reason is that it was inspiring for me. The idea could be used with any model that is item based and uses some form of learned item representation. Also, I think that the hidden genius of this idea is that we can easily model interactions between three items (two antecedents and one consequence) by using frequent itemsets of size two (which are easy to find) and learnable representations.

So, going back to limiting and extending association rules for a minute: even if it is unfeasible to find association rules of three items, we can model them by finding itemsets of size two and using embeddings. I think that this should have been the message of the paper. I also found the observation that higher order interactions have a negative, corrective effect on the score interesting. Overall, this paper was a mixed bag for me, but I can’t deny that it was very inspiring, even if this kind of interpretation was never intended.

I also added a poster paper (late breaking result) to this list: Optimizing the Selection of Recommendation Carousels with Quantum Computing by Maurizio Ferrari Dacrema et al. from the Politecnico di Milano. This admittedly preliminary work explores the use of quantum processors for recommendations. The advantages of current-generation quantum processors can be best demonstrated on integer programming tasks; therefore, they would be optimal for something like carousel selection (i.e. ordering recommendation lists produced by different algorithms based on accuracy while also accounting for duplicates). 

The paper describes how to transform this problem into a quadratic optimization problem that can be solved on quantum processors. The results are somewhere in between the individual greedy and the incremental greedy approach: better than just ordering the lists based on their precision scores, but not as good as always selecting the best list given the already selected ones. However, the big difference is that the incremental greedy approach requires an hour to run, while its quantum counterpart is executed within a fraction of a second. So, in theory it could adapt to user actions and changing recommendation lists in real time. 
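
For the curious, here is roughly what such a transformation looks like; the accuracy scores, overlap counts, and penalty weights are made up, and exhaustive enumeration stands in for the quantum annealer (which accepts exactly this x^T Q x form):

```python
import itertools

import numpy as np

# Hypothetical setup: 4 candidate carousels with accuracy scores and pairwise
# item overlaps (duplicates); we want to select K = 2 of them.
acc = np.array([0.9, 0.8, 0.7, 0.6])
overlap = np.array([[0, 5, 0, 0],
                    [5, 0, 0, 4],
                    [0, 0, 0, 0],
                    [0, 4, 0, 0]], dtype=float)
K, lam, mu = 2, 0.05, 1.0

# Objective: maximize sum(acc * x) - lam * duplicates - mu * (sum(x) - K)^2.
# Using x_i^2 = x_i for binary variables, this becomes min x^T Q x + const.
n = len(acc)
Q = lam * overlap / 2 + mu * np.ones((n, n))
np.fill_diagonal(Q, -acc + mu * (1 - 2 * K))

def qubo_value(x):
    return x @ Q @ x

# Brute force over all 2^n bitstrings stands in for the quantum solver.
best = min((np.array(bits) for bits in itertools.product([0, 1], repeat=n)),
           key=qubo_value)
print(best)  # [1 0 1 0]: carousels 0 and 2, avoiding the heavy 0-1 overlap
```

The appeal is that once the problem is in this matrix form, the solver (classical or quantum) is interchangeable, and the quantum one returns answers in milliseconds.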

Of course, using quantum computations is a very new direction. This work is its first tentative exploration (to my knowledge). I don’t expect it to become industry standard in the next 5 years. But works like this show that it has potential for certain tasks in recommender systems. It is possible that it will significantly change the field in the future, similarly to the way deep learning did. Once quantum processors have more qubits and software libraries that are easier to use, maybe future recommenders will use them for certain computations, just like the way they utilize GPUs today.

Other Papers of Merit

As I mentioned earlier, I marked a record number of papers this year, too many to go through all of them in detail. I will highlight some of the papers/talks that either gained much attention at the conference or contain ideas that I like.

The winner of the conference’s Best Paper Award was An Audit of Misinformation Filter Bubbles on YouTube: Bubble Bursting and Recent Behavior Changes by Matúš Tomlein et al. from the Kempelen Institute of Intelligent Technologies. It paints a mixed picture of YouTube’s recommender. Misinformation is recommended with the same frequency as it was in 2019. However, one can’t really get stuck in a misinformation filter bubble, even if one exclusively watches videos propagating conspiracy theories, and the frequency of videos spreading misinformation drops if the user watches debunking videos.

Audits like these are very important, because large social media and content platforms do have an impact on society. These kinds of papers point out problems with the recommenders of such platforms. Unfortunately, solving these problems is not trivial, and better system design or algorithms alone can’t solve the issue.

The conference’s Best Student Paper Award went to Pessimistic Reward Models for Off-Policy Learning in Recommendation by Olivier Jeunen and Bart Goethals from the University of Antwerp. The off-policy estimation of the reward (e.g. clicks) is often significantly greater than the actual reward during online deployment. The paper attributes this to the varying uncertainty of the individual reward estimations over items and the argmax recommendation selection. It proposes to use the lower confidence bound of the estimate as the score of the item (instead of the mean) to get a more reliable estimate. 
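
The core idea can be sketched with a generic lower-confidence-bound score; the paper derives its bounds from closed-form posteriors, so the simple 1/sqrt(n) interval width below is just an illustrative stand-in:

```python
import numpy as np

def pessimistic_scores(mean_reward, n_shown, alpha=1.0):
    """
    Score each item by a lower confidence bound on its estimated reward
    instead of the mean. Items the logging policy showed rarely (small
    n_shown) have wide intervals, so their scores are shrunk the most.
    """
    width = alpha * np.sqrt(1.0 / np.maximum(n_shown, 1))
    return mean_reward - width

mean = np.array([0.30, 0.35])    # item 1 looks slightly better on average...
shown = np.array([10_000, 10])   # ...but was almost never recommended
print(np.argmax(mean))                             # 1: the plain argmax picks the uncertain item
print(np.argmax(pessimistic_scores(mean, shown)))  # 0: pessimism prefers the well-explored item
```

The argmax over pessimistic scores is what tames the overestimation: an item only wins if its reward estimate is both high and trustworthy.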

While this sounds good at first, I think there are significant potential downsides. The lower confidence bound of the score of an item (for a given recommendation request) is high if the estimated reward is high and the confidence interval is narrow. The former means that the number of expected clicks on the item is high, while the latter entails high recommendation frequency (of the given item) by the logging policy (online algorithm). Therefore, this method will heavily prefer algorithms that exploit rather than explore and mimic the logging policy. 

I liked the same authors’ other paper more: Top-K Contextual Bandits with Equity of Exposure by Olivier Jeunen and Bart Goethals from the University of Antwerp. The goal is to disentangle relevancy from exposure to achieve item-side fairness. The probability of an item receiving clicks depends on both its relevancy and its exposure. I really like this objective definition of fairness: items with the same relevancy should get the same number of clicks. This can be achieved by randomizing the top recommendations.

The randomization of top N lists is a common practice and its beneficial effects are well-known within the industry. The interesting part to me is that the authors suggest a method to do this in a personalized way. Some users prefer their recommendations to be focused on a narrow goal, while others prefer some variety. The method in the paper can determine how much of the top N list to randomize for each individual user. The authors claim that it is fast enough for smaller lists to be used in practice.
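
One common way to implement such randomization (not necessarily the paper's exact method) is Plackett-Luce sampling of the ranking via the Gumbel-top-k trick; the relevance scores below are made up:

```python
import numpy as np

def sample_ranking(relevance, rng, temperature=1.0):
    """
    Sample a ranking from a Plackett-Luce distribution: items are effectively
    drawn without replacement with probability proportional to
    exp(relevance / temperature). Gumbel-top-k trick: perturbing the scores
    with Gumbel noise and sorting once yields exactly this distribution.
    A temperature near 0 recovers the deterministic argsort ranking.
    """
    scores = np.asarray(relevance, dtype=float) / temperature
    gumbel = -np.log(-np.log(rng.random(len(scores))))
    return np.argsort(-(scores + gumbel))

rng = np.random.default_rng(0)
relevance = [2.0, 2.0, 0.1]
# Items 0 and 1 are equally relevant, so over many draws each should take the
# top position (and its exposure) about equally often.
firsts = [sample_ranking(relevance, rng)[0] for _ in range(2000)]
print(firsts.count(0), firsts.count(1), firsts.count(2))
```

A personalized variant, as I read the paper, would adjust how much randomness (here: the temperature) each individual user's list receives.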

Accordion: A Trainable Simulator for Long-Term Interactive Systems by James McInerney et al. from Netflix and Spotify. As I have already mentioned, simulators are potential candidates to solve the evaluation problem of recommender systems. The authors of this paper bring up another really good point against offline evaluation: if a recommender infrastructure is complex – for example, it uses many models in parallel, or models use each other’s output – even recreating the recommendation setup can be challenging offline. Accordion itself is a trainable simulator that goes beyond modeling single user sessions: it uses inhomogeneous Poisson processes to model user visits, and samples events within these visits. The paper shows that there is some correlation between A/B tests and simulated A/B tests.
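
To make the visit model concrete: an inhomogeneous Poisson process can be sampled with the standard thinning algorithm; the daily intensity profile below is made up (in Accordion the intensity is learned from logged data):

```python
import numpy as np

def sample_visits(rate_fn, rate_max, horizon, rng):
    """
    Sample visit times from an inhomogeneous Poisson process on [0, horizon]
    by thinning: draw candidate arrivals from a homogeneous process with rate
    rate_max, and keep each candidate with probability rate_fn(t) / rate_max.
    Requires rate_fn(t) <= rate_max everywhere.
    """
    visits, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate_max)  # next candidate arrival
        if t > horizon:
            return visits
        if rng.random() < rate_fn(t) / rate_max:
            visits.append(t)

rng = np.random.default_rng(0)
# Toy intensity (visits/hour): users visit mostly in the evening of each 24h day.
rate = lambda t: 0.2 + 0.8 * np.exp(-((t % 24) - 20) ** 2 / 8)
visits = sample_visits(rate, rate_max=1.0, horizon=24 * 7, rng=rng)
print(len(visits))  # a week's worth of visit timestamps, clustered around hour 20
```

Events within each sampled visit are then simulated by a separate model, which is what lets Accordion replay whole weeks of interaction rather than isolated sessions.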

Transformers4Rec: Bridging the Gap between NLP and Sequential / Session-Based Recommendation by Gabriel de Souza P. Moreira et al. from NVIDIA and Jeong-min Lee from Facebook AI. One of the seminal papers of deep learning-based recommenders was written by us: the GRU4Rec paper from 2015 (and its follow-up from 2017) adapted recurrent neural networks to the session-based recommendation task and revitalized a field where traditional models were struggling. Since then, several papers have followed suit and proposed porting other sequential models, such as 1D convolutions, attention networks, BERT, etc. This paper follows the same idea and ports the current go-to sequential models of deep learning, the transformers. This is also motivated by the fact that methods from (a subfield of) NLP are usually successful on (a subset of) recommender systems, and transformers are currently deemed to be the best performing NLP models.

Having a choice between different models is always good. My only concern is that due to the low quality of real-life session data, there is no real difference between the performance of different sequential models in practice, at least in our experience. The differences reported in various papers are mostly due to offline evaluation setups favoring one or the other, and to hyperparameter optimization. In my opinion, the way forward is to focus more on the inputs of these networks and to move on from simply injecting the item ID, similarly to the internal cold-start version of GRU4Rec. Fortunately, Transformers4Rec also supports multiple input features.
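
To illustrate what moving on from the item ID means on the input side, here is a toy sketch where each session event is represented by the sum of its item-ID embedding and a side-feature embedding. Summation is just one common fusion scheme, and the table sizes, dimensions, and feature names are made up for the example.

```python
import random

random.seed(0)
DIM = 8

def embed_table(n, dim=DIM):
    """A toy embedding table: one random vector per ID."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

item_emb = embed_table(100)  # item-ID embeddings
cat_emb = embed_table(10)    # item-category embeddings (side feature)

def event_input(item_id, category_id):
    """Input vector for one session event: element-wise sum of the item-ID
    embedding and the side-feature embedding (one possible fusion scheme)."""
    return [a + b for a, b in zip(item_emb[item_id], cat_emb[category_id])]

session = [(12, 3), (47, 3), (5, 1)]              # (item_id, category_id) events
inputs = [event_input(i, c) for i, c in session]  # sequence fed to the network
print(len(inputs), len(inputs[0]))
```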

ProtoCF: Prototypical Collaborative Filtering for Few-Shot Recommendation by Aravind Sankar, Junting Wang et al. from the University of Illinois. Most collaborative filtering approaches (including neural networks) have uneven performance over the items. Their accuracy is usually higher on popular items and dwindles on long-tail items. This paper aims to correct this by training a separate model for long-tail items, using knowledge transfer from the base model. The long-tail model is trained on subsets of the data. Each subset uses only a limited number of items, and samples events so that the support of each item is intentionally low. The knowledge transfer comes in through the embeddings, forcing the item similarities in the long-tail model to be similar to those in the base model.
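
The episodic sampling idea can be sketched roughly as follows; the sampler below is my own simplification (item count and support cap are arbitrary), not the paper's exact procedure.

```python
import random
from collections import defaultdict

def sample_few_shot_episode(events, n_items, max_support, rng):
    """Sample a training episode for the long-tail model: restrict the data
    to a few items and cap each item's support, so every item looks scarce.
    (Sketch of the episodic idea, not the paper's exact sampler.)"""
    items = rng.sample(sorted({i for _, i in events}), n_items)
    by_item = defaultdict(list)
    for user, item in events:
        if item in items:
            by_item[item].append((user, item))
    episode = []
    for item in items:
        pool = by_item[item]
        episode.extend(rng.sample(pool, min(max_support, len(pool))))
    return episode

# Synthetic (user, item) interactions for the demonstration.
events = [(u, i) for u in range(50) for i in range(20) if (u * i) % 7 == 0]
episode = sample_few_shot_episode(events, n_items=5, max_support=3, rng=random.Random(1))
print(len(episode))
```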

Local Factor Models for Large-Scale Inductive Recommendation by Longqi Yang et al. from Microsoft. (Co-)clustering models are proposed from time to time for recommendation, but most of the time they are not scalable. For example, GLSLIM, a clustering-based extension of SLIM and the Best Paper Award winner of RecSys 2016, scales with the product of the number of users and items, making it unusable in real-life setups. This paper rectifies the main problem of co-clustering recommenders by making the clustering latent as well, using the attention mechanism. The score of a user-item pair is computed from their affinities to the clusters, using the minimal affinity (between the user and the item) for each cluster.
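
A minimal sketch of the min-affinity scoring as I understood it; the affinities and per-cluster scores below are made-up toy values.

```python
def score(user_affinity, item_affinity, cluster_scores):
    """Score of a user-item pair from latent cluster affinities: each cluster
    contributes its local score weighted by the *minimal* of the user's and
    the item's affinity to that cluster, so a cluster only matters if both
    sides belong to it. (A sketch of the idea, not the paper's exact model.)"""
    return sum(min(u, i) * s
               for u, i, s in zip(user_affinity, item_affinity, cluster_scores))

user = [0.9, 0.1, 0.0]   # user mostly in cluster 0
item = [0.2, 0.8, 0.0]   # item mostly in cluster 1
local = [1.0, 1.0, 1.0]  # per-cluster local model scores
print(score(user, item, local))  # only the overlap (min) counts
```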

Hierarchical Latent Relation Modeling for Collaborative Metric Learning by Viet-Anh Tran et al. from Deezer Research. Metric learning is another way to generate recommendations. The end result is similar to other latent methods – the users and the items are embedded into a latent space and items similar to the user are recommended – but the surrounding framework is different. One of the problems of metric learning (and matrix factorization as well) is that when a user interacts with different item groups, their feature vector can't be similar to all of them if these groups lie in different parts of the latent space. One way to solve this is to use a delta user representation, which depends on the item, when computing recommendations, but learning this delta is not trivial. This paper assumes that every relation between item pairs can be computed from a fixed set of base relations. These base relation vectors can be learned from item co-occurrences and then used to produce the delta representation for the user feature vector.
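
A toy sketch of the delta idea: the item-dependent shift of the user vector is a weighted combination of a small set of base-relation vectors. In the paper the weights would come from the model; here they are fixed for illustration, and all vectors are made up.

```python
def delta_user(base_relations, weights):
    """Item-dependent shift of the user vector: a weighted combination of a
    fixed set of learned base-relation vectors. The weights would be produced
    per (user, item) pair by the model; fixed here for illustration."""
    dim = len(base_relations[0])
    return [sum(w * r[d] for w, r in zip(weights, base_relations))
            for d in range(dim)]

user = [1.0, 0.0]
base = [[0.0, 1.0], [-1.0, 0.0]]  # two toy base-relation vectors
weights = [0.5, 0.25]             # weights for this (user, item) pair
shifted = [u + d for u, d in zip(user, delta_user(base, weights))]
print(shifted)  # -> [0.75, 0.5]
```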

Drug Discovery as a Recommendation Problem: Challenges and Complexities in Biological Decisions by Anna Gogleva et al. from AstraZeneca. This industry talk covered a very special use case of recommenders, one might even call it a borderline case. It was interesting to see how recommenders can be used in a domain where the items are complex, data is really scarce and ground truths are rare but the users are well-trained experts. This requires a solution that keeps the user more in the loop than classical recommenders, and one that relies mostly on unsupervised or semi-supervised methods.

See You in 2022 (Hopefully)

Overall, RecSys 2021 was a great experience. The organization was stellar and the program was solid. Frankly, I also enjoyed that in-person attendance was lower, so the venue wasn't crowded at all, yet everyone could follow the conference online. I missed meeting some of my colleagues and research partners who couldn't attend due to travel restrictions. If international travel is back to normal by 2022, I'll see you in Seattle at RecSys 2022.
