Yambda-5B, a promising dataset for recommender systems

June 4, 2025

The Yambda-5B dataset on Hugging Face, introduced in the paper “Yambda-5B — A Large-Scale Multi-modal Dataset for Ranking and Retrieval”, offers a comprehensive resource for evaluating recommender systems. Sourced from Yandex Music, it comprises 4.79 billion user-item interactions from 1 million users across 9.39 million tracks. The dataset includes both implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes, and undislikes). A distinguishing feature is the ‘is_organic’ flag, which separates organic user actions from those driven by recommendations.

Key Features:

User-Item Interactions: 4.79 billion events, including listens, likes, dislikes, unlikes, and undislikes.
User Base: 1 million anonymized users.
Item Base: 9.39 million tracks.
Feedback Types: Implicit (listens) and explicit (likes, dislikes, unlikes, undislikes).
Audio Embeddings: Precomputed embeddings for 7.72 million tracks, generated using a convolutional neural network trained on audio spectrograms.
Temporal Data: Timestamps with global temporal ordering.
Interaction Flags: ‘is_organic’ flag indicating whether the interaction was organic or recommendation-driven.
Dataset Scales: Three sizes are available: Yambda-50M, Yambda-500M, and Yambda-5B.

Data Format:

The dataset is provided in Parquet format, with the following files:

listens.parquet: User listening events with playback details.
likes.parquet: User like actions.
dislikes.parquet: User dislike actions.
undislikes.parquet: User undislike actions (reverting dislikes).
unlikes.parquet: User unlike actions (reverting likes).
embeddings.parquet: Track audio embeddings.
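Because unlikes revert likes, the set of tracks a user currently likes has to be derived by replaying both event streams in time order. The following is a minimal sketch of that idea, not the authors' code. It assumes each event is a (uid, item_id, timestamp) tuple; the function name current_likes and the sample values are hypothetical.

```python
def current_likes(likes, unlikes):
    """Return the (uid, item_id) pairs still liked after replaying all events.

    Both arguments are lists of (uid, item_id, timestamp) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    # Tag each event as a like (True) or an unlike (False).
    events = [(ts, uid, item, True) for uid, item, ts in likes]
    events += [(ts, uid, item, False) for uid, item, ts in unlikes]
    # Sort by timestamp only. Python's sort is stable, so on a tie the
    # like (listed first) is replayed before the unlike.
    events.sort(key=lambda e: e[0])

    liked = set()
    for _, uid, item, is_like in events:
        if is_like:
            liked.add((uid, item))
        else:
            liked.discard((uid, item))
    return liked


# Made-up sample events: user 1 likes track 10, then later unlikes it.
likes = [(1, 10, 100), (1, 11, 120), (2, 10, 130)]
unlikes = [(1, 10, 150)]
still_liked = current_likes(likes, unlikes)
```

The same replay applies to dislikes and undislikes.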

Each interaction file (everything except embeddings.parquet) contains the following columns:

uid: Unique user identifier.
item_id: Unique track identifier.
timestamp: Delta times, binned into 5-second units.
is_organic: Boolean flag indicating if the interaction was organic (1) or recommendation-driven (0).
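To make the column semantics concrete, here is a small sketch that works on in-memory rows shaped like the columns listed above. It assumes that timestamp is an integer count of 5-second bins and that is_organic is stored as 1 or 0. The helper names to_seconds and organic_share and the sample rows are hypothetical, not part of the dataset tooling.

```python
SECONDS_PER_BIN = 5  # timestamps are described as binned into 5-second units


def to_seconds(timestamp_bins):
    """Convert a binned timestamp to seconds (assumes an integer bin count)."""
    return timestamp_bins * SECONDS_PER_BIN


def organic_share(rows):
    """Fraction of interactions flagged as organic (is_organic == 1)."""
    if not rows:
        return 0.0
    organic = sum(1 for row in rows if row["is_organic"] == 1)
    return organic / len(rows)


# Made-up rows that mimic the documented columns.
rows = [
    {"uid": 1, "item_id": 10, "timestamp": 0, "is_organic": 1},
    {"uid": 1, "item_id": 11, "timestamp": 12, "is_organic": 0},
    {"uid": 2, "item_id": 10, "timestamp": 720, "is_organic": 1},
    {"uid": 2, "item_id": 12, "timestamp": 721, "is_organic": 0},
]

share = organic_share(rows)           # half of the sample rows are organic
one_hour = to_seconds(720)            # 720 bins of 5 seconds each
```

Splitting on is_organic this way allows models to be trained or evaluated on organic behaviour separately from behaviour that an existing recommender already influenced.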

Evaluation Protocol:

To facilitate rigorous benchmarking, the authors introduce an evaluation protocol based on a Global Temporal Split. This approach allows recommendation algorithms to be assessed in conditions that closely mirror real-world use. Benchmark results are provided for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using various evaluation metrics.
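The idea behind a global temporal split can be sketched in a few lines: a single cutoff timestamp is applied to all users, so everything before it is training data and everything from it onward is held out. This is an illustration under assumptions, not the paper's exact protocol. The function name global_temporal_split, the cutoff value, and the sample events are hypothetical.

```python
def global_temporal_split(events, cutoff):
    """Split events with one cutoff shared by all users.

    Events with timestamp < cutoff go to train; the rest go to test.
    Unlike a per-user leave-last-out split, no training event can be
    later in time than any test event.
    """
    train = [e for e in events if e["timestamp"] < cutoff]
    test = [e for e in events if e["timestamp"] >= cutoff]
    return train, test


# Made-up events; timestamps are in arbitrary units.
events = [
    {"uid": 1, "item_id": 10, "timestamp": 5},
    {"uid": 2, "item_id": 11, "timestamp": 8},
    {"uid": 1, "item_id": 12, "timestamp": 20},
    {"uid": 3, "item_id": 10, "timestamp": 25},
]

train, test = global_temporal_split(events, cutoff=20)
```

In this sample, user 3 appears only in the test portion. A real evaluation has to decide how to treat such users, and this sketch does not address that.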

Conclusion:

Yambda-5B serves as a substantial resource for advancing research in recommender systems, particularly in the music domain. Its scale, diversity of interactions, and inclusion of audio embeddings make it suitable for developing and evaluating a wide range of recommendation algorithms. By providing this dataset, the authors aim to promote reproducible research and innovation in the field.

Acknowledgements: Thanks to the X post at https://x.com/AixinSG/status/1928805909981016389, through which we became aware of the dataset.

Tags: Yambda, Yandex

About The Author

Joeran Beel

I am the founder of Recommender-Systems.com and head of the Intelligent Systems Group (ISG) at the University of Siegen, Germany (https://isg.beel.org). We conduct research in recommender systems (RecSys), personalization, and information retrieval (IR), as well as automated machine learning (AutoML), meta-learning, and algorithm selection. Domains we are particularly interested in include smart places, eHealth, manufacturing (Industry 4.0), mobility, visual computing, and digital libraries. We founded or maintain, among others, LensKit-Auto, Darwin & Goliath, Mr. DLib, and Docear, each with thousands of users; we contributed to TensorFlow, JabRef, and others; and we developed the first prototypes of automated recommender systems (AutoSurprise and Auto-CaseRec) and Federated Meta Learning (FMLearn Server and Client).
