Yambda-5B, a promising dataset for recommender systems

The Yambda-5B dataset on Hugging Face, introduced in the paper “Yambda-5B — A Large-Scale Multi-modal Dataset for Ranking and Retrieval”, is a large-scale resource for evaluating recommender systems. Sourced from Yandex Music, it contains 4.79 billion user-item interactions from 1 million users across 9.39 million tracks. The dataset includes both implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes, and undislikes). A distinguishing feature is the ‘is_organic’ flag, which separates organic user actions from those driven by recommendations.

Key Features:

  • User-Item Interactions: 4.79 billion events, including listens, likes, dislikes, unlikes, and undislikes.
  • User Base: 1 million anonymized users.
  • Item Base: 9.39 million tracks.
  • Feedback Types: Implicit (listens) and explicit (likes, dislikes, unlikes, undislikes).
  • Audio Embeddings: Precomputed embeddings for 7.72 million tracks, generated using a convolutional neural network trained on audio spectrograms.
  • Temporal Data: Timestamps with global temporal ordering.
  • Interaction Flags: ‘is_organic’ flag indicating whether the interaction was organic or recommendation-driven.
  • Dataset Scales: Available in multiple scales: Yambda-50M, Yambda-500M, and Yambda-5B.

Data Format:

The dataset is provided in Parquet format, with the following files:

  • listens.parquet: User listening events with playback details.
  • likes.parquet: User like actions.
  • dislikes.parquet: User dislike actions.
  • undislikes.parquet: User undislike actions (reverting dislikes).
  • unlikes.parquet: User unlike actions (reverting likes).
  • embeddings.parquet: Track audio embeddings.

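Each interaction file (that is, every file except embeddings.parquet, which describes tracks rather than user events) contains the following columns: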

  • uid: Unique user identifier.
  • item_id: Unique track identifier.
  • timestamp: Time of the event, stored as a time delta binned into 5-second units.
  • is_organic: Boolean flag indicating if the interaction was organic (1) or recommendation-driven (0).
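As a small illustration of how these columns can be used, the sketch below computes each user's share of organic interactions. It works on plain in-memory records rather than reading the Parquet files; the `Interaction` class, the `organic_share_per_user` helper, and the sample rows are assumptions made for this example and are not part of the dataset or its tooling.

```python
from collections import defaultdict
from typing import NamedTuple


class Interaction(NamedTuple):
    """One row of an interaction file (hypothetical in-memory form)."""
    uid: int
    item_id: int
    timestamp: int
    is_organic: int  # 1 = organic, 0 = recommendation-driven


def organic_share_per_user(events):
    """Return {uid: fraction of that user's events flagged as organic}."""
    totals = defaultdict(int)
    organic = defaultdict(int)
    for event in events:
        totals[event.uid] += 1
        organic[event.uid] += event.is_organic
    return {uid: organic[uid] / totals[uid] for uid in totals}


# Made-up sample rows, for illustration only.
sample = [
    Interaction(uid=1, item_id=10, timestamp=0, is_organic=1),
    Interaction(uid=1, item_id=11, timestamp=5, is_organic=0),
    Interaction(uid=2, item_id=10, timestamp=10, is_organic=1),
]

shares = organic_share_per_user(sample)
print(shares)
```

On the real data, the same aggregation would more likely be done with a DataFrame or Parquet library after loading the files; the logic stays the same.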

Evaluation Protocol:

To facilitate rigorous benchmarking, the authors introduce an evaluation protocol based on a Global Temporal Split. This approach allows recommendation algorithms to be assessed in conditions that closely mirror real-world use. Benchmark results are provided for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using various evaluation metrics.
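The idea behind a global temporal split can be sketched in a few lines: a single cut-off time is applied to all users, so everything before it is used for training and everything from it onward for testing. This is a minimal sketch, not the authors' implementation; the `Event` class, the `global_temporal_split` function, and the `split_timestamp` value are hypothetical, and the actual cut-off and any additional details of the paper's protocol are not reproduced here.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """A single user-item interaction (hypothetical in-memory form)."""
    uid: int
    item_id: int
    timestamp: int


def global_temporal_split(events, split_timestamp):
    """Split events with one cut-off shared by all users.

    Events strictly before split_timestamp go to train; events at or
    after it go to test. The cut-off value is chosen by the caller.
    """
    train = [e for e in events if e.timestamp < split_timestamp]
    test = [e for e in events if e.timestamp >= split_timestamp]
    return train, test


# Made-up events, for illustration only.
events = [
    Event(uid=1, item_id=10, timestamp=0),
    Event(uid=2, item_id=11, timestamp=50),
    Event(uid=1, item_id=12, timestamp=100),
    Event(uid=3, item_id=10, timestamp=150),
]

train, test = global_temporal_split(events, split_timestamp=100)
print(len(train), len(test))
```

Unlike a per-user split that holds out each user's last interaction, a single global cut-off keeps every test event later than every training event, which avoids leaking future information into training. Note that a user such as uid 3 in the sample can then appear only in the test set.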

Conclusion:

Yambda-5B serves as a substantial resource for advancing research in recommender systems, particularly in the music domain. Its scale, diversity of interactions, and inclusion of audio embeddings make it suitable for developing and evaluating a wide range of recommendation algorithms. By providing this dataset, the authors aim to promote reproducible research and innovation in the field.

Acknowledgements: Thanks to the X post at https://x.com/AixinSG/status/1928805909981016389, which made us aware of the dataset.
