Yambda-5B, a promising dataset for recommender systems
The Yambda-5B dataset on Hugging Face, introduced in the paper “Yambda-5B — A Large-Scale Multi-modal Dataset for Ranking and Retrieval”, offers a comprehensive resource for evaluating recommender systems. Sourced from Yandex Music, it contains 4.79 billion user-item interactions across 9.39 million tracks from 1 million users. The dataset includes both implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes, and undislikes). A distinguishing feature is the ‘is_organic’ flag, which separates organic user actions from those driven by recommendations.
Key Features:
- User-Item Interactions: 4.79 billion events, including listens, likes, dislikes, unlikes, and undislikes.
- User Base: 1 million anonymized users.
- Item Base: 9.39 million tracks.
- Feedback Types: Implicit (listens) and explicit (likes, dislikes, unlikes, undislikes).
- Audio Embeddings: Precomputed embeddings for 7.72 million tracks, generated using a convolutional neural network trained on audio spectrograms.
- Temporal Data: Timestamps with global temporal ordering.
- Interaction Flags: ‘is_organic’ flag indicating whether the interaction was organic or recommendation-driven.
- Dataset Scales: Available at three sizes: Yambda-50M, Yambda-500M, and Yambda-5B.
Data Format:
The dataset is provided in Parquet format, with the following files:
- listens.parquet: User listening events with playback details.
- likes.parquet: User like actions.
- dislikes.parquet: User dislike actions.
- undislikes.parquet: User undislike actions (reverting dislikes).
- unlikes.parquet: User unlike actions (reverting likes).
- embeddings.parquet: Track audio embeddings.
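Because likes and unlikes are stored as separate event streams, the set of tracks a user currently likes has to be derived by replaying both in time order. The following is a minimal sketch of that idea in plain Python, not code from the dataset authors. The `Event` tuple and the `current_likes` helper are hypothetical names, and processing a like before an unlike when both fall in the same timestamp bin is an arbitrary assumption.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Event(NamedTuple):
    """Hypothetical in-memory row: one like or unlike event."""
    uid: int
    item_id: int
    timestamp: int  # binned time, as stored in the dataset


def current_likes(likes: Iterable[Event],
                  unlikes: Iterable[Event]) -> Dict[int, Set[int]]:
    """Replay like/unlike events in time order and return {uid: liked item_ids}."""
    # Tag likes with 0 and unlikes with 1 so that, within the same timestamp,
    # a like sorts before an unlike (assumed tie-breaking rule).
    tagged = [(e.timestamp, 0, e.uid, e.item_id) for e in likes]
    tagged += [(e.timestamp, 1, e.uid, e.item_id) for e in unlikes]

    liked: Dict[int, Set[int]] = {}
    for _, is_unlike, uid, item_id in sorted(tagged):
        items = liked.setdefault(uid, set())
        if is_unlike:
            items.discard(item_id)  # an unlike reverts an earlier like
        else:
            items.add(item_id)
    return liked
```

The same replay works for dislikes and undislikes by passing those two event streams instead.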

Each interaction file contains the following columns:
- uid: Unique user identifier.
- item_id: Unique track identifier.
- timestamp: Delta times, binned into 5-second units.
- is_organic: Boolean flag indicating if the interaction was organic (1) or recommendation-driven (0).
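As a small illustration of how these columns might be used, the sketch below converts binned timestamps back to seconds and computes the share of organic interactions. It assumes rows are available as plain dictionaries keyed by the column names above; `BIN_SECONDS`, `bins_to_seconds`, and `organic_share` are hypothetical names chosen for this example.

```python
from typing import Iterable, Mapping

BIN_SECONDS = 5  # timestamps are binned into 5-second units


def bins_to_seconds(timestamp_bins: int) -> int:
    """Convert a binned timestamp delta into seconds."""
    return timestamp_bins * BIN_SECONDS


def organic_share(rows: Iterable[Mapping[str, int]]) -> float:
    """Fraction of rows whose is_organic flag is set (1 = organic)."""
    rows = list(rows)
    if not rows:
        return 0.0
    organic = sum(1 for row in rows if row["is_organic"])
    return organic / len(rows)
```

For example, `bins_to_seconds(12)` gives 60, meaning a stored value of 12 corresponds to one minute.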
Evaluation Protocol:
To facilitate rigorous benchmarking, the authors introduce an evaluation protocol based on a Global Temporal Split. This approach allows recommendation algorithms to be assessed in conditions that closely mirror real-world use. Benchmark results are provided for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using various evaluation metrics.
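The general idea of a global temporal split can be sketched as follows: a single cutoff time is applied to all users, so that the model trains only on interactions before the cutoff and is evaluated on those at or after it. This is a simplified illustration, not the authors' exact protocol. The function name and the `split_timestamp` parameter are hypothetical, and details such as the length of the test window are not covered here.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]  # assumed row shape: uid, item_id, timestamp


def global_temporal_split(events: List[Row],
                          split_timestamp: int) -> Tuple[List[Row], List[Row]]:
    """Split events at one global cutoff shared by every user."""
    # Everything strictly before the cutoff is training data.
    train = [e for e in events if e["timestamp"] < split_timestamp]
    # Everything at or after the cutoff is held out for evaluation.
    test = [e for e in events if e["timestamp"] >= split_timestamp]
    return train, test
```

Using one cutoff for all users prevents the model from seeing any interaction that happened after the test period began, which a per-user split does not guarantee.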
Conclusion:
Yambda-5B serves as a substantial resource for advancing research in recommender systems, particularly in the music domain. Its scale, diversity of interactions, and inclusion of audio embeddings make it suitable for developing and evaluating a wide range of recommendation algorithms. By providing this dataset, the authors aim to promote reproducible research and innovation in the field.
Acknowledgements: Thanks to the author of the X post at https://x.com/AixinSG/status/1928805909981016389, through which we became aware of the dataset.