10 New ACM TORS Papers on News, Music, Metrics, and Responsible Recommendation
Ever wondered how ChatGPT would write a summary of newly published articles at ACM TORS? Here is an example (what do you think? Let us know in the comments):
Over the past few months, ACM Transactions on Recommender Systems has been quietly stacking up a set of ten freshly accepted papers that feel unusually “of the moment.” Recommendation is no longer just a ranking problem on a quiet server. It is part of the public conversation about media power, labor, language, bias, and what happens when we outsource taste to machines. If you want a single homepage that captures that steady drip of new work, the place to watch is http://tors.acm.org.
What struck me [ChatGPT] while reading these ten articles is how often they take the recommender as a system embedded in the world, not as a metric factory. Several papers go after robustness and responsibility, especially as large language models push their way into ranking pipelines. Others zoom in on human behavior: how people click, how they interpret explanations, and how “small” UI decisions can invert long-held assumptions like “top position always wins.” And then there is a strong thread on representation and justice, where the question is not only whether an algorithm is fair, but whether our design process is even pointed at the right notion of impact.
Below are short, abstract-like introductions to each accepted paper, with key takeaways and the headline numbers that stood out.

A Reward-Informed Semi-Personalized Bandit Approach for Enhancing Accuracy and Serendipity in Online Slate Recommendations
Lukas De Kerpel and Dries F. Benoit
De Kerpel and Benoit propose a semi-personalized contextual bandit for slate recommendation that learns decision trees to segment users and then runs Thompson Sampling policies per segment, aiming to reduce overspecialization while staying interpretable. In experiments spanning three online domains, they report lower average regret than fully personalized baselines while improving serendipity when interactions are sparse. In a representative slate-size setting (K = 5), the segment-based approach reduces regret versus a personalized LinTS (Linear Thompson Sampling) baseline across all three domains: for example, on MovieLens-20M the average regret drops from 0.490 to 0.434, on FinancialNews from 0.385 to 0.375, and on ZOZOTOWN from 0.401 to 0.398 (means reported with standard deviations in the paper). The key idea is simple but powerful: explore at the level where behavior is meaningfully shared, not at the level where data are too thin to support it.
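For readers who like to see the shape of the idea in code, here is a minimal sketch of that two-step recipe, assuming Bernoulli rewards and tree-leaf segments; the paper's reward-informed trees and LinTS-style per-segment policies are richer than this toy version.
```python
# Illustrative sketch (not the authors' code): segment users with a decision
# tree fitted on logged rewards, then run one Bernoulli Thompson Sampling
# policy per leaf to assemble a slate of K items.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_users, n_features, n_items, K = 5000, 8, 50, 5

# Hypothetical logged data: user features and an observed click signal.
X_log = rng.normal(size=(n_users, n_features))
clicks = (X_log[:, 0] + rng.normal(size=n_users) > 0).astype(float)

# Step 1: learn segments -- each leaf of the tree is one user segment.
segmenter = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=200)
segmenter.fit(X_log, clicks)
n_nodes = segmenter.tree_.node_count  # safe upper bound on leaf ids

# Step 2: one Beta posterior per (segment, item) for Bernoulli rewards.
alpha = np.ones((n_nodes, n_items))
beta = np.ones((n_nodes, n_items))

def recommend_slate(user_features):
    seg = segmenter.apply(user_features.reshape(1, -1))[0]
    theta = rng.beta(alpha[seg], beta[seg])   # Thompson sample per item
    return seg, np.argsort(-theta)[:K]        # top-K slate for this segment

def update(seg, item, reward):
    alpha[seg, item] += reward
    beta[seg, item] += 1.0 - reward

seg, slate = recommend_slate(rng.normal(size=n_features))
update(seg, slate[0], reward=1.0)  # pretend the first slot was clicked
```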
From Good Intentions to Meaningful Impact: A Design Justice Approach for News Recommendation
Laura Linda Laugwitz and Nadja Schaetz
Laugwitz and Schaetz bring design justice into news recommendation, arguing that “fairness and diversity” tweaks inside the model are not enough if the surrounding design choices keep reproducing inequities. Their framework analyzes the news recommendation process at three levels: micro (values), meso (networks of expertise), and macro (problem scopes). It then translates that analysis into actionable strategies for researchers and practitioners. The contribution is intentionally conceptual rather than benchmark-driven: the “results” here are the three-level lens and a set of practice-facing interventions that reframe what counts as success in news recommenders, especially when risks and benefits are distributed unevenly across communities. In a year when news platforms are constantly re-litigating their legitimacy, this paper is a reminder that the most important objective function may be the one your organization never wrote down.
Efficient and Responsible Adaptation of Large Language Models for Robust Top-k Recommendations
Kirandeep Kaur, Vinayak Gupta, Manya Chadha, and Chirag Shah
Kaur and colleagues tackle a practical dilemma: LLMs can rank well, but they are expensive, and naïvely testing them on small random user samples tells us little about who actually benefits. They propose a hybrid task allocation framework that identifies weak or inactive users (those underserved by conventional recommenders) and targets LLM-based in-context ranking at those cases rather than at everyone. Across an evaluation that combines eight recommendation algorithms, three datasets, and three LLMs (open and closed), their approach reduces the population of “weak users” by about 12% while keeping costs bounded through selective LLM use. The framing is important: robustness is treated as a distributional problem over users, not a single average score, and “responsible adaptation” is operationalized as targeted deployment rather than blanket enthusiasm.
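As a minimal sketch of what that targeted deployment could look like in practice (with hypothetical function names standing in for the conventional recommender and the LLM ranker, not the paper's API):
```python
# Hypothetical routing layer: serve most users with the conventional
# recommender and spend a bounded LLM budget only on "weak" users whose
# offline per-user quality falls below a threshold.
from typing import Callable, Dict, List

def allocate(
    user_ids: List[str],
    per_user_ndcg: Dict[str, float],                    # offline quality of the base recommender
    base_rank: Callable[[str], List[str]],              # conventional recommender
    llm_rerank: Callable[[str, List[str]], List[str]],  # LLM in-context ranker
    weak_threshold: float = 0.3,
    llm_budget: int = 1000,
) -> Dict[str, List[str]]:
    # Weakest users first, so the limited LLM budget goes where it helps most.
    weak = sorted(
        (u for u in user_ids if per_user_ndcg.get(u, 0.0) < weak_threshold),
        key=lambda u: per_user_ndcg.get(u, 0.0),
    )
    use_llm = set(weak[:llm_budget])

    results = {}
    for u in user_ids:
        candidates = base_rank(u)
        results[u] = llm_rerank(u, candidates) if u in use_llm else candidates
    return results
```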
The Impact of Position and Contrast Effects in Recommender Systems on Consumer Behavior: A Field Experiment
Markus Lill and Martin Spann
Lill and Spann run a randomized field experiment to test how position bias and contrast effects shape clicks in recommender lists, using 1,217,064 recommendation sets. They compare lists containing only relevant items against lists where one relevant “focal” item is surrounded by non-relevant contrasting items, while also varying the focal item’s position. The surprising headline is that the first position is not always best: in their fixed-effects logit analysis, the second and third positions significantly increase click-through for the focal item relative to the first (β = 0.33 and β = 0.36, both p < 0.001), and adding contrasting non-relevant items further boosts focal-item CTR (β = 0.69, p < 0.001) even though overall CTR drops (β = −0.20, p < 0.001). It is an elegant demonstration of a design trade-off many teams feel but rarely quantify: optimizing “list quality” and optimizing “attention steering” can be different objectives, and the UI can push them apart.
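Assuming those coefficients are the usual log-odds from a logit model, exponentiating them makes the effect sizes easier to read:
```python
# Back-of-the-envelope reading of the reported coefficients (assuming they are
# log-odds): exp(beta) is the multiplicative change in click odds.
import math

for label, b in [("focal item in position 2", 0.33),
                 ("focal item in position 3", 0.36),
                 ("contrasting items around focal item", 0.69),
                 ("overall list CTR", -0.20)]:
    print(f"{label}: odds ratio ~ {math.exp(b):.2f}")
# Roughly 1.39x and 1.43x higher odds for positions 2 and 3, about 2x for the
# contrast effect on the focal item, and about 0.82x for the list overall.
```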
PISA: A Human-Inspired Model of Repetition for Music Recommendation
Gregoire Heitz, Joséphine Séguier, Sandra Bringay, and Pascal Poncelet
Heitz and coauthors argue that repetition is not noise but a behavioral signal with structure, especially in music where people loop songs and return to artists in patterned ways. They introduce PISA, a Transformer-based model augmented with ACT-R-inspired mechanisms, explicitly modeling repetition at two granularities: the song level and the artist level. The paper validates the approach on two datasets, including Last.fm and a proprietary Deezer dataset, and positions the method as a more human-aligned way to capture “I want it again” without collapsing into trivial popularity. What I like here is the ambition to treat repetition as cognition rather than mere frequency; what I’d still want to see (and suspect others will chase next) is how this repetition modeling interacts with exploration objectives, since “repeat what I love” and “show me something new” are perpetually in negotiation.
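For context on what “ACT-R-inspired” usually means in practice: the classic ingredient is base-level activation, where a memory trace is stronger the more often and the more recently it was used. Here is that textbook equation in a few lines; PISA's actual layers may parameterize it differently.
```python
# ACT-R base-level activation: B = ln( sum_j (now - t_j)^(-d) ) over past
# exposures t_j, with decay d (0.5 is the conventional default).
import math

def base_level_activation(past_listen_times, now, decay=0.5):
    return math.log(sum((now - t) ** (-decay) for t in past_listen_times))

# A song looped three times very recently beats one heard as often long ago.
print(base_level_activation([95.0, 97.0, 99.0], now=100.0))  # ~ 0.71
print(base_level_activation([10.0, 20.0, 30.0], now=100.0))  # ~ -1.09
```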
Multilinguality in MIND: Advancing Cross-lingual News Recommendation with a Multilingual Dataset
Gábor Szabó, Shuo Zhang, Srishti Gupta, Romain Bey, Benjamin Hoover, Vera Liao, Maarten de Rijke, and Pablo Castells
This work extends the well-known MIND news recommendation dataset into a multilingual resource (often referred to as xMIND), translating content into 14 languages and yielding 130,379 unique news articles. The authors use the dataset to study zero-shot cross-lingual transfer (ZS-XLT) and find a consistent performance gap when models trained in one language are evaluated in another. For example, averaged over the 14 target languages, zero-shot transfer costs −8.63% in nDCG@10 for MANNeR, −4.83% for NRMS, and −3.82% for NAML, with larger degradation in some transfer settings (the paper reports cases like −10.39% and −14.07% in specific configurations). Beyond the dataset itself, a practical message lands: multilingual recommendation is not just “translate and go.” The gaps are systematic enough that they should be treated as first-class evaluation targets, not as unfortunate footnotes.
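If you want to reproduce the arithmetic behind such gaps, the usual convention is a relative change against the in-language score (an assumption about the exact convention; the figures below are made up for illustration, not taken from the paper):
```python
# Relative nDCG@10 change when a model trained on a source language is
# evaluated zero-shot on a target language. Illustrative numbers only.
def relative_drop(ndcg_source: float, ndcg_target: float) -> float:
    return 100.0 * (ndcg_target - ndcg_source) / ndcg_source

print(f"{relative_drop(0.42, 0.38):.2f}%")  # a 0.42 -> 0.38 fall is -9.52%
```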
Understanding Sensitive Attribute Association Bias in Recommendation Embedding Algorithms
Lex Beattie, Isabel Corpus, Lucy Lin, and Praveen Ravichandran
Beattie and colleagues focus on representation bias inside embedding models, defining attribute association bias (AAB) as the semantic capture of sensitive attributes in latent spaces, which can later propagate into downstream hybrid systems. They introduce a three-step ESA framework—existence, significance, and amplification—and demonstrate it through two case studies: a production podcast recommendation embedding model and published MovieLens-1M embeddings. In the podcast case study, they start from an embedding training set of 31,181 English podcasts, then construct an evaluation slice with 1,446 true crime podcasts and 4,892 sports podcasts to probe gender-stereotype associations. A key qualitative result is unsettling but familiar: significant user-gender AAB can remain even when user gender is removed as a training feature, suggesting that “feature removal” can be a cosmetic fix when the signal leaks through correlated behaviors. The paper’s contribution is a practical auditing workflow that treats embedding geometry itself as an object worth testing, not just the final ranked list.
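To make the “existence” and “significance” steps concrete, here is an illustrative probe in the spirit of that workflow, not the authors' exact statistical procedure: project item embeddings onto a direction defined by user-group centroids and permutation-test the gap between two item categories.
```python
# Illustrative attribute-association probe on synthetic embeddings.
import numpy as np

rng = np.random.default_rng(0)

def group_direction(user_emb, in_group_a):
    """Unit vector from one user group's centroid toward the other's."""
    d = user_emb[in_group_a].mean(axis=0) - user_emb[~in_group_a].mean(axis=0)
    return d / np.linalg.norm(d)

def association_gap(cat1, cat2, direction):
    """Difference in mean projection of two item categories onto the direction."""
    return float((cat1 @ direction).mean() - (cat2 @ direction).mean())

def permutation_p_value(cat1, cat2, direction, n_perm=2000):
    """Chance of seeing a gap this large if category labels were shuffled."""
    observed = abs(association_gap(cat1, cat2, direction))
    pooled = np.vstack([cat1, cat2])
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        fake1, fake2 = pooled[idx[: len(cat1)]], pooled[idx[len(cat1):]]
        if abs(association_gap(fake1, fake2, direction)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic example: 64-dim embeddings, with an association deliberately injected.
users = rng.normal(size=(200, 64))
in_group_a = rng.random(200) < 0.5
direction = group_direction(users, in_group_a)
true_crime = rng.normal(size=(80, 64)) + 0.3 * direction
sports = rng.normal(size=(120, 64))
print(association_gap(true_crime, sports, direction),
      permutation_p_value(true_crime, sports, direction))
```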
Navigating the Metrics Jungle: A Qualitative Analysis of Recommender Systems Evaluation Practices in Industry
Hanne Vandenbroucke, Lien Michiels, and Annelien Smets
Vandenbroucke, Michiels, and Smets investigate how evaluation happens in real organizations, where metrics live alongside deadlines, politics, and infrastructure constraints. Their study is based on 22 interviews across four organizations, and it documents how teams select, interpret, and sometimes “inherit” metrics over time. One concrete historical detail is telling: these organizations adopted standard recommender metrics around 2018–2019, and by 2022 they had shifted toward in-house evaluation approaches tailored to their products and decision processes. The paper’s value is not in proposing yet another metric, but in making visible the socio-technical reality that metrics are negotiated artifacts. If you have ever watched a dashboard quietly become policy, this study will feel uncomfortably accurate.
Accelerating Workflows in Video Game Translation with Recommender Systems
Nourhene Mukande, Youness Mansar, Azeddine Saaidi, Stefan de Kok, Mathieu van Lierop, and Maarten de Rijke
Mukande and coauthors present a recommender system for video game translation, targeting a very specific and very costly bottleneck: the repetitive workflow of translating similar strings across large game projects. They evaluate on an industrial dataset of 292,000 translation workflows and report dramatic operational gains: workflow time drops from 7–75 hours to 5–65 minutes, corresponding to time savings above 90% and cost reductions up to 76%. On the recommendation side, they report strong ranking quality as well, with nDCG@10 reaching 0.841 and MRR@10 reaching 0.879 (with other reported metrics including nDCG@5 = 0.874 and hit-rate@10 = 0.935). It is a nice counterweight to the usual “movies, music, and shopping” canon: recommendation as industrial productivity tech, where the ROI is measured in labor hours reclaimed.
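For readers less familiar with the quoted ranking metrics, these are the standard binary-relevance definitions of nDCG@k and MRR@k; the authors' evaluation harness may differ in details such as gain definition or tie handling.
```python
# Standard binary-relevance nDCG@k and MRR@k, for reference.
import math
from typing import List, Set

def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    for i, item in enumerate(ranked[:k]):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0

print(ndcg_at_k(["a", "b", "c"], {"b"}, 10))  # 0.63: single hit at rank 2
print(mrr_at_k(["a", "b", "c"], {"b"}, 10))   # 0.5
```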
How to Evaluate What Goal an Explanation Metric Reflects
Rafael G. Pardo, Mateusz Dubiel, Babis (Themistoklis) Spiliopoulos, and Martin Strohmaier
Pardo and colleagues confront a deceptively simple question: when we compute an explanation metric offline, what human goal is that metric actually aligned with? They run an offline evaluation across three explanation approaches and six recommenders, then validate alignment in an online user study with 55 participants. The punchline is that offline metrics can correlate with user-perceived transparency and trust, but fail to reflect satisfaction with the recommender itself—so teams can end up optimizing explanation scores that do not move the user outcomes they care about. The paper also illustrates the subtlety of evaluation choice: even when a metric “works,” it may be measuring a different psychological target than the one your product narrative implies. In a world where explanation is increasingly mandated, this is exactly the sort of methodological sanity check we need.
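The alignment check behind that punchline is conceptually simple: correlate the offline explanation score per condition with what users actually reported. A hedged sketch with made-up numbers (not the paper's data):
```python
# Rank-correlate an offline explanation metric with mean user ratings per
# explanation-by-recommender condition. All numbers below are invented.
from scipy.stats import spearmanr

offline_score = [0.61, 0.72, 0.55, 0.80, 0.66, 0.48]
transparency  = [3.9, 4.3, 3.6, 4.5, 4.0, 3.4]   # mean 1-5 user ratings
satisfaction  = [3.8, 3.7, 3.9, 3.6, 3.8, 3.9]

for name, ratings in [("transparency", transparency), ("satisfaction", satisfaction)]:
    rho, p = spearmanr(offline_score, ratings)
    print(f"offline metric vs {name}: rho={rho:.2f}, p={p:.3f}")
# A strong rho for transparency with a flat or negative rho for satisfaction is
# exactly the mismatch the paper warns about.
```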
Bibliography
Lukas De Kerpel; Dries F. Benoit. 2025. A Reward-Informed Semi-Personalized Bandit Approach for Enhancing Accuracy and Serendipity in Online Slate Recommendations. ACM Transactions on Recommender Systems.
Laura Linda Laugwitz; Nadja Schaetz. 2025. From Good Intentions to Meaningful Impact: A Design Justice Approach for News Recommendation. ACM Transactions on Recommender Systems.
Kirandeep Kaur; Vinayak Gupta; Manya Chadha; Chirag Shah. 2025. Efficient and Responsible Adaptation of Large Language Models for Robust Top-k Recommendations. ACM Transactions on Recommender Systems.
Markus Lill; Martin Spann. 2025. The Impact of Position and Contrast Effects in Recommender Systems on Consumer Behavior: A Field Experiment. ACM Transactions on Recommender Systems.
Gregoire Heitz; Joséphine Séguier; Sandra Bringay; Pascal Poncelet. 2025. PISA: A Human-Inspired Model of Repetition for Music Recommendation. ACM Transactions on Recommender Systems.
Gábor Szabó; Shuo Zhang; Srishti Gupta; Romain Bey; Benjamin Hoover; Vera Liao; Maarten de Rijke; Pablo Castells. 2025. Multilinguality in MIND: Advancing Cross-lingual News Recommendation with a Multilingual Dataset. ACM Transactions on Recommender Systems.
Lex Beattie; Isabel Corpus; Lucy Lin; Praveen Ravichandran. 2025. Understanding Sensitive Attribute Association Bias in Recommendation Embedding Algorithms. ACM Transactions on Recommender Systems.
Hanne Vandenbroucke; Lien Michiels; Annelien Smets. 2025. Navigating the Metrics Jungle: A Qualitative Analysis of Recommender Systems Evaluation Practices in Industry. ACM Transactions on Recommender Systems.
Nourhene Mukande; Youness Mansar; Azeddine Saaidi; Stefan de Kok; Mathieu van Lierop; Maarten de Rijke. 2025. Accelerating Workflows in Video Game Translation with Recommender Systems. ACM Transactions on Recommender Systems.
Rafael G. Pardo; Mateusz Dubiel; Babis (Themistoklis) Spiliopoulos; Martin Strohmaier. 2025. How to Evaluate What Goal an Explanation Metric Reflects. ACM Transactions on Recommender Systems.
