Comprehensive AI-generated Tutorial on Using RecBole for Recommender Systems

RecBole is a versatile, unified, and efficient library designed for prototyping and benchmarking recommender systems. Built on PyTorch, RecBole supports a variety of recommendation paradigms, including general recommendations, sequential recommendations, context-aware recommendations, and knowledge-based recommendations. This tutorial walks through the setup, configuration, and use of RecBole, explaining every design decision along the way.

We will cover:

  • Installing RecBole
  • Understanding YAML configurations and RecBole's design
  • Implementing general recommendations with BPR
  • Dataset selection and splitting strategies
  • Specifying splitting strategies in YAML
  • Evaluating and visualizing results

Installing RecBole

Before starting, ensure you have Python 3.7 or later and PyTorch installed. RecBole supports both CPU and GPU environments, with GPU acceleration requiring CUDA.

# Install RecBole via pip:
pip install recbole

# Alternatively, install using conda:
conda install -c aibox recbole

To verify installation, create a Python script and run:

from recbole.quick_start import run_recbole

run_recbole(model='BPR', dataset='ml-100k')

If the output includes information about the dataset and model evaluation, the installation is successful.
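
For a lighter-weight check that the package imports correctly before launching a full training run, something like the following is usually enough (the __version__ attribute is assumed to be present, as in recent RecBole releases):

import recbole
import torch

# Print library versions and check whether a CUDA device is visible.
print("RecBole version:", getattr(recbole, "__version__", "unknown"))
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())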


Understanding YAML Configurations and RecBole Design

RecBole uses YAML files for configuration. YAML (a recursive acronym for "YAML Ain't Markup Language") is a human-readable data format used to define settings such as datasets, models, and evaluation strategies. YAML files in RecBole are structured to specify parameters for:

  • Datasets: Specifies the fields (e.g., user IDs, item IDs, timestamps) required by the model.
  • Models: Defines hyperparameters like embedding size, learning rate, and loss functions.
  • Training: Configures batch sizes, epochs, and sampling strategies.
  • Evaluation: Lists metrics (e.g., Recall, NDCG) and data splitting strategies.

For example, the following YAML file configures the BPR (Bayesian Personalized Ranking) model:

# Dataset Configuration
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
load_col:
  inter: [user_id, item_id]

# Model Configuration
embedding_size: 64

# Training Configuration
epochs: 100
train_batch_size: 4096
eval_batch_size: 4096

# Evaluation Configuration
metrics: ['Recall', 'NDCG']
topk: 10

To create a YAML file, save the above content in a file named test.yaml. YAML is indentation-sensitive, so ensure proper spacing when defining parameters.
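
Before launching a full run, you can check that the file parses as expected by loading it through RecBole's Config class; this minimal sketch assumes test.yaml sits in the current working directory:

from recbole.config import Config

# Parse test.yaml together with the defaults for the chosen model and dataset.
config = Config(model='BPR', dataset='ml-100k', config_file_list=['test.yaml'])

# Individual parameters can be read back like dictionary entries.
print(config['embedding_size'])   # 64
print(config['metrics'])          # ['Recall', 'NDCG']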


Implementing General Recommendations with BPR

What is BPR?

BPR (Bayesian Personalized Ranking) is a widely used algorithm for recommendation from implicit feedback. It frames the user's preference as a pairwise ranking task: the model learns to score items a user has interacted with (observed items) higher than items they have not (unobserved items).

The optimization is based on the following principle: For each user-item pair, BPR learns embeddings such that the predicted preference score of observed items (positive examples) is higher than that of unobserved items (negative examples). The loss function is defined as:

L = -Σ_(u,i,j) log σ(ŷ_ui - ŷ_uj) + λ||Θ||²

where the sum runs over training triples (u, i, j), ŷ_ui and ŷ_uj are the predicted scores for the positive item i and the negative item j, σ is the sigmoid function, Θ collects the model parameters, and λ is the regularization weight.
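
To make the pairwise objective concrete, here is a minimal PyTorch sketch of this loss for a batch of (user, positive item, negative item) scores; it illustrates the formula above and is not RecBole's internal implementation:

import torch

def bpr_loss(pos_scores, neg_scores, params, reg_weight=1e-4):
    # Pairwise ranking term: push positive scores above negative scores.
    ranking = -torch.log(torch.sigmoid(pos_scores - neg_scores) + 1e-10).mean()
    # L2 regularization over all model parameters (the λ||Θ||² term).
    reg = reg_weight * sum(p.pow(2).sum() for p in params)
    return ranking + reg

# Toy example: random scores for a batch and one embedding table as parameters.
pos = torch.randn(4096)
neg = torch.randn(4096)
embeddings = [torch.randn(1000, 64, requires_grad=True)]
print(bpr_loss(pos, neg, embeddings))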

Code Implementation

Here’s how to train and evaluate the BPR model using RecBole:

# Save the YAML configuration to 'test.yaml'.

from recbole.quick_start import run_recbole

# Run the BPR model with the MovieLens-100K dataset.
run_recbole(model='BPR', dataset='ml-100k', config_file_list=['test.yaml'])

After running the script, RecBole outputs metrics such as Recall and NDCG, indicating the model’s performance.
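
run_recbole also returns its results as a Python dictionary, which is handy for keeping scores around for later comparison; the exact keys shown below ('best_valid_result', 'test_result') reflect recent RecBole versions and may differ in older releases:

from recbole.quick_start import run_recbole

result = run_recbole(model='BPR', dataset='ml-100k', config_file_list=['test.yaml'])

# Inspect the best validation metrics and the final test metrics.
print(result['best_valid_result'])
print(result['test_result'])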


Dataset Selection and Splitting Strategies

Why Dataset Selection Matters

Dataset selection plays a pivotal role in the development and evaluation of recommender systems. The choice of dataset directly influences the type of task being modeled (e.g., rating prediction or ranking), the choice of algorithms, and the overall generalizability of results. A well-chosen dataset reflects the real-world problem you aim to solve and provides adequate data for both training and evaluation.

Datasets are typically categorized based on the type of feedback:

  • Implicit Feedback Datasets: These datasets record user actions like clicks, views, or purchases without explicit ratings. Examples include Amazon product interactions and clickstream data. They are suitable for tasks like top-N recommendation and click-through rate (CTR) prediction.
  • Explicit Feedback Datasets: These datasets include explicit user preferences, such as numerical ratings (e.g., 1–5 stars). Datasets like MovieLens and Netflix are commonly used for rating prediction tasks.

RecBole provides built-in support for a wide range of datasets, including:

  • MovieLens: A widely used dataset for collaborative filtering tasks with explicit ratings.
  • Amazon: Includes implicit feedback data for product interactions.
  • Yelp: Combines implicit and explicit feedback with user reviews and business metadata.

To use your custom dataset, it must be formatted to match RecBole’s requirements. This includes specifying key columns for user IDs, item IDs, timestamps, and other features. RecBole supports six atomic file types (.inter, .user, .item, .kg, .link, .net) to represent various forms of input data.
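
To make the atomic-file format concrete, here is a minimal sketch that writes a tiny custom .inter file by hand; the dataset name toy_dataset and the directory layout under data_path are assumptions chosen for this illustration:

import os

os.makedirs('dataset/toy_dataset', exist_ok=True)

# Each column header is "<field_name>:<field_type>"; columns are tab-separated.
rows = [
    "user_id:token\titem_id:token\ttimestamp:float",
    "1\t10\t1672531200",
    "1\t11\t1672617600",
    "2\t10\t1672704000",
]
with open('dataset/toy_dataset/toy_dataset.inter', 'w') as f:
    f.write("\n".join(rows) + "\n")

# In the YAML config, set data_path: dataset/ and run with dataset='toy_dataset'.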

How to Choose a Dataset?

The choice of dataset depends on the nature of your recommendation task:

  • General Recommendation: Use datasets that contain user-item interactions without sequence dependencies, such as MovieLens-100K.
  • Sequential Recommendation: For tasks like predicting the next item in a user’s sequence, use timestamped datasets like RetailRocket.
  • Context-Aware Recommendation: If additional contextual features like location or device type are needed, consider datasets like Yelp.
  • Knowledge-Based Recommendation: Use datasets enriched with external knowledge graphs, such as Book-Crossing paired with bibliographic metadata.

Splitting Strategies

After selecting a dataset, the next step is splitting it into training, validation, and test sets. The splitting strategy you choose significantly impacts the evaluation of your recommender system. Common strategies include:

1. Random Splitting

Random splitting is the simplest method, dividing interactions randomly into training, validation, and test sets. While this is computationally efficient, it may not reflect real-world scenarios where data follows temporal patterns. For example, in an e-commerce setting, older interactions should be used to predict newer interactions.

# Example YAML configuration for random splitting
eval_args:
  group_by: user
  order: RO
  split: {'RS': [0.8, 0.1, 0.1]}

2. Temporal Splitting

Temporal splitting respects the chronological order of interactions, using older interactions for training and reserving newer interactions for validation and testing. This strategy is more realistic for scenarios like predicting future purchases or clicks.

# Example YAML configuration for temporal splitting
eval_args:
  group_by: user
  order: TO
  split: {'RS': [0.8, 0.1, 0.1]}

3. Leave-One-Out Splitting

In leave-one-out splitting, the most recent interaction is held out for testing, the second-most recent for validation, and the rest for training. This method is particularly useful for sequential recommendation tasks, where the goal is to predict the next interaction in a sequence.

# Example YAML configuration for leave-one-out splitting
eval_args:
  group_by: user
  order: TO
  split: {'LS': 'valid_and_test'}

4. K-Fold Cross-Validation

K-fold cross-validation divides the dataset into K subsets, iteratively using one subset for validation/testing and the rest for training. This reduces the variance introduced by a single random split, which makes it useful for comparing algorithms on small datasets, though it is computationally intensive. Note that RecBole's built-in eval_args split modes are ratio-based splitting ('RS') and leave-one-out ('LS'); k-fold cross-validation is not available out of the box, so it has to be orchestrated manually, for example by repeating the experiment with different random seeds as sketched below.
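
If you want a cross-validation-style estimate, a pragmatic workaround is to repeat the ratio split with different random seeds and aggregate the metrics. This is repeated random sub-sampling rather than true k-fold, it only re-randomizes the split when order is RO, and the result keys used below (e.g. 'recall@10') are assumptions based on recent RecBole versions:

from recbole.quick_start import run_recbole

test_recalls = []
for seed in [2020, 2021, 2022, 2023, 2024]:
    result = run_recbole(
        model='BPR',
        dataset='ml-100k',
        config_file_list=['test.yaml'],
        config_dict={'seed': seed},  # vary the seed so each random ratio split differs
    )
    test_recalls.append(result['test_result']['recall@10'])

print(test_recalls)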

Design Considerations

When selecting a splitting strategy, consider the following:

  • Temporal relevance: For real-world applications, temporal splitting or leave-one-out splitting often yields more meaningful results than random splitting.
  • Dataset size: For small datasets, cross-validation ensures robust evaluation by utilizing all available data.
  • Computational constraints: Random splitting is less computationally intensive than cross-validation or leave-one-out splitting.

Example: Temporal Splitting with MovieLens

Let’s implement temporal splitting on the MovieLens dataset. Save the following configuration as split_config.yaml:

USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp

load_col:
  inter: [user_id, item_id, timestamp]

eval_args:
  group_by: user
  order: TO
  split: {'RS': [0.8, 0.1, 0.1]}

Run the following Python script to train a model with temporal splitting:

from recbole.quick_start import run_recbole

run_recbole(model='BPR', dataset='ml-100k', config_file_list=['split_config.yaml'])

This approach ensures that the evaluation aligns with real-world scenarios where older data predicts future behavior.

Specifying Splitting Strategies in YAML

In RecBole, splitting strategies define how datasets are divided into training, validation, and test sets. These strategies are critical for ensuring the fairness and reproducibility of experiments. The configuration is done in the YAML file under the eval_args parameter. Below, we explore how to set up and customize these strategies in-depth.

YAML Configuration Format

The splitting strategy is specified under eval_args in a YAML configuration file. The structure generally includes the following elements:

  • group_by: Specifies whether the splitting is grouped by a particular entity (e.g., user).
  • order: Determines the order in which data is split. Options include:
    • RO: Random ordering of interactions.
    • TO: Temporal ordering based on timestamps.
  • split: Defines the proportions or method for splitting data. Common configurations include ratio-based splits (e.g., 80% training, 10% validation, 10% testing) or leave-one-out splits.

Example: Temporal Splitting

Temporal splitting is particularly useful for real-world recommendation tasks where the goal is to predict future interactions based on past behavior. In this method, older interactions are used for training, while more recent interactions are reserved for validation and testing.

# YAML configuration for temporal splitting
eval_args:
  group_by: user                # Group data by user ID
  order: TO                     # Split interactions in temporal order
  split: {'RS': [0.8, 0.1, 0.1]} # Use 80% training, 10% validation, 10% testing

This configuration ensures that each user’s interactions are split chronologically, making the evaluation more realistic for applications like session-based or sequential recommendations.

How the Temporal Splitting Works

Here’s how the temporal splitting strategy works for a given user:

  • All interactions for the user are sorted by their timestamps.
  • The first 80% of interactions (oldest) are used for training.
  • The next 10% of interactions are used for validation.
  • The final 10% of interactions (most recent) are used for testing.

This approach mimics scenarios where older user behavior predicts future actions, making it particularly relevant for time-sensitive recommendation tasks.
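
As a quick illustration of the arithmetic, the standalone sketch below sorts a toy interaction list by timestamp and slices it 80/10/10 for one user; it mimics the idea, not RecBole's actual splitting code:

# Toy interaction list for one user: (item_id, timestamp)
interactions = [(5, 30), (2, 10), (9, 50), (7, 20), (4, 40),
                (1, 60), (8, 70), (3, 80), (6, 90), (10, 100)]

# Sort chronologically, then slice into 80% / 10% / 10%.
interactions.sort(key=lambda pair: pair[1])
n = len(interactions)
train = interactions[:int(0.8 * n)]
valid = interactions[int(0.8 * n):int(0.9 * n)]
test = interactions[int(0.9 * n):]

print(len(train), len(valid), len(test))  # 8 1 1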

Customizing Temporal Splitting

You can further customize temporal splitting by adjusting the proportions in the split parameter. Note that RecBole's group_by option accepts 'user' (split each user's interaction history separately) or 'none' (split the interaction log as a whole):

# Customized temporal splitting
eval_args:
  group_by: user                   # Group data by user ID ('none' disables grouping)
  order: TO                        # Maintain temporal order within each group
  split: {'RS': [0.7, 0.15, 0.15]} # Adjust proportions to 70%-15%-15%

Combining Temporal Splitting with Leave-One-Out

For scenarios where you need a fine-grained evaluation, temporal splitting can be combined with leave-one-out (LOO) evaluation. LOO reserves the most recent interaction for testing and the second most recent for validation, while the rest are used for training.

# Temporal splitting with leave-one-out
eval_args:
  group_by: user
  order: TO
  split: {'LS': 'valid_and_test'} # Reserve last interaction for test, second last for validation

Advantages of YAML-Based Splitting

The YAML configuration file allows for easy experimentation with different splitting strategies. It is flexible enough to accommodate various real-world requirements:

  • Consistency: YAML configurations ensure that experiments can be reproduced by specifying the exact splitting logic.
  • Flexibility: Adjusting proportions or grouping criteria only requires minor edits to the YAML file.
  • Simplicity: RecBole handles the implementation details, freeing users from writing complex code for splitting datasets.

Practical Example

Save the following YAML configuration as temporal_split.yaml:

USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp

load_col:
  inter: [user_id, item_id, timestamp]

eval_args:
  group_by: user
  order: TO
  split: {'RS': [0.8, 0.1, 0.1]}

Run the following Python script to train a model using this splitting strategy:

from recbole.quick_start import run_recbole

run_recbole(model='BPR', dataset='ml-100k', config_file_list=['temporal_split.yaml'])

With this setup, the BPR model will be trained using temporal splitting, making its evaluation closer to real-world scenarios where time-based relevance is critical.


Evaluating and Visualizing Results

Effective evaluation and visualization are essential for understanding the performance of recommender systems and ensuring the reliability of experimental findings. This involves selecting appropriate evaluation metrics, contrasting results across models, and calculating statistical significance or confidence intervals to validate conclusions. Visualization helps interpret trends and compare outcomes effectively, while statistical rigor ensures that results are meaningful and reproducible.

Evaluation Metrics

Metrics are the cornerstone of model assessment, quantifying how well a recommender system performs. Popular metrics include:

  • Recall@k: The proportion of relevant items retrieved in the top-k recommendations, emphasizing sensitivity to relevant items.
  • NDCG@k: Normalized Discounted Cumulative Gain, which considers both the relevance of items and their rank, rewarding models that place relevant items higher in the list.
  • Precision@k: The ratio of relevant items among the top-k recommendations, emphasizing recommendation quality.
  • AUC: Area Under the Curve measures the probability that a randomly chosen positive interaction is ranked higher than a randomly chosen negative one.

The choice of metric depends on the application. For example, Recall@k is useful when retrieving all relevant items is critical, while NDCG@k is better for applications where item ranking matters, such as search engines or personalized news feeds.
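
To ground these definitions, here is a small standalone computation of Recall@k and NDCG@k for a single user, written directly from the standard formulas rather than taken from RecBole's evaluator:

import numpy as np

def recall_at_k(ranked_items, relevant_items, k):
    # Fraction of the user's relevant items that appear in the top-k list.
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k):
    # DCG rewards relevant items more the higher they appear in the ranking.
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in relevant_items)
    # Ideal DCG: all relevant items placed at the top of the list.
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(relevant_items), k)))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [7, 3, 9, 1, 4, 8, 2, 5, 6, 10]  # model's top-10 list for one user
relevant = [3, 4, 6]                      # held-out ground-truth items
print(recall_at_k(ranked, relevant, 10))  # 1.0
print(ndcg_at_k(ranked, relevant, 10))    # ≈ 0.62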

Visualization Options

Visualization enables researchers and practitioners to gain insights into model performance and diagnose potential issues. Here are common approaches:

  • Training Curves: Plotting metrics such as training loss or accuracy over epochs to monitor convergence and detect overfitting.
  • Metric Comparison: Using bar charts or line plots to compare evaluation metrics (e.g., Recall, NDCG) across multiple models or datasets.
  • Parameter Sensitivity Analysis: Visualizing the impact of hyperparameters, such as embedding size or learning rate, on performance.
  • Heatmaps: Displaying correlation matrices or performance across multiple datasets and configurations for pattern recognition.

Example: Plotting Training Loss

Here’s how to visualize training loss using Matplotlib:

import matplotlib.pyplot as plt

# Sample data
epochs = [1, 2, 3, 4, 5]
loss = [0.9, 0.7, 0.5, 0.3, 0.2]

plt.plot(epochs, loss, color='blue', marker='o', linestyle='--')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training Loss Over Epochs')
plt.grid(True)
plt.show()

This graph helps you monitor the optimization process, ensuring that the model learns effectively and does not overfit.

Contrasting Results

When comparing models, it’s crucial to present results systematically:

  • Tabular Summaries: Organize metrics like Recall and NDCG for multiple models in a table for easy comparison.
  • Bar Charts: Use bar charts to visualize differences in metric scores across models, datasets, or configurations.
  • Pairwise Comparisons: Highlight differences between specific model pairs to identify significant improvements.

Tools like pandas and Seaborn are helpful for generating these comparisons. For example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = {'Model': ['BPR', 'LightGCN', 'NGCF'],
        'Recall@10': [0.25, 0.30, 0.28],
        'NDCG@10': [0.18, 0.22, 0.20]}

df = pd.DataFrame(data)

# Plot bar chart
sns.barplot(x='Model', y='Recall@10', data=df, palette='Blues')
plt.title('Comparison of Recall@10 Across Models')
plt.ylabel('Recall@10')
plt.show()

Calculating Confidence Intervals and Statistical Significance

While metrics provide a snapshot of performance, they don’t reveal if observed differences between models are statistically significant. Calculating confidence intervals and conducting statistical tests addresses this limitation.

Confidence Intervals

Confidence intervals quantify the uncertainty in metric estimates. A 95% confidence interval is constructed so that, if the experiment were repeated many times, roughly 95% of such intervals would contain the true metric value. This helps distinguish genuine improvements from noise.

import numpy as np

# Example: Calculate 95% confidence interval for Recall@10
recall_scores = [0.24, 0.25, 0.26, 0.27, 0.28]  # Example scores
mean_recall = np.mean(recall_scores)
std_err = np.std(recall_scores, ddof=1) / np.sqrt(len(recall_scores))
conf_interval = (mean_recall - 1.96 * std_err, mean_recall + 1.96 * std_err)

print(f"95% Confidence Interval: {conf_interval}")

Statistical Significance

Statistical tests, such as paired t-tests or Wilcoxon signed-rank tests, can determine whether differences in metrics between models are significant:

from scipy.stats import ttest_rel

# Example: Paired t-test for Recall@10
model1_scores = [0.24, 0.25, 0.26, 0.27, 0.28]
model2_scores = [0.25, 0.26, 0.27, 0.28, 0.29]

t_stat, p_value = ttest_rel(model1_scores, model2_scores)
print(f"T-statistic: {t_stat}, P-value: {p_value}")

A p-value below 0.05 is conventionally taken to indicate a statistically significant difference between the two models; interpret it alongside the effect size and the number of paired runs, since very few runs give the test little power.
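
Besides the paired t-test, the Wilcoxon signed-rank test mentioned above avoids the normality assumption; here is the corresponding call on the same toy scores (with only five paired values the test has little power and is shown purely for illustration):

from scipy.stats import wilcoxon

model1_scores = [0.24, 0.25, 0.26, 0.27, 0.28]
model2_scores = [0.25, 0.26, 0.27, 0.28, 0.29]

# Non-parametric paired test on the per-run score differences.
stat, p_value = wilcoxon(model1_scores, model2_scores)
print(f"Wilcoxon statistic: {stat}, P-value: {p_value}")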

Importance of Visualization and Statistical Analysis

Combining visualization and statistical rigor ensures that results are not only interpretable but also reliable. Visual tools highlight trends and performance gaps, while statistical methods validate findings. Together, they provide a complete picture, enabling informed decisions about model selection and optimization strategies.

The entire code

The following Python code demonstrates how to perform an end-to-end experiment using RecBole. It includes:

  • Saving a YAML configuration for reproducibility.
  • Training two models (BPR and LightGCN).
  • Comparing evaluation results using visualization and statistical significance testing.
  • Calculating confidence intervals for Recall@10 scores.

# Required imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import ttest_rel
from recbole.quick_start import run_recbole

# Step 1: Save YAML Configuration
yaml_content = """
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp

load_col:
  inter: [user_id, item_id, timestamp]

eval_args:
  group_by: user
  order: TO
  split: {'RS': [0.8, 0.1, 0.1]}

metrics: ['Recall', 'NDCG']
topk: 10

embedding_size: 64
train_batch_size: 4096
eval_batch_size: 4096
epochs: 50
valid_metric: Recall@10
"""
with open('config.yaml', 'w') as f:
    f.write(yaml_content)

# Step 2: Run the Model
print("Running BPR model with temporal splitting...")
results_bpr = run_recbole(model='BPR', dataset='ml-100k', config_file_list=['config.yaml'])

print("Running LightGCN model with temporal splitting...")
results_lightgcn = run_recbole(model='LightGCN', dataset='ml-100k', config_file_list=['config.yaml'])

# Step 3: Collect and Compare Results
# Example Recall@10 scores
scores_bpr = [0.24, 0.25, 0.26, 0.27, 0.28]  # Example validation scores for BPR
scores_lightgcn = [0.26, 0.27, 0.28, 0.29, 0.30]  # Example validation scores for LightGCN

# Tabular Comparison
comparison_data = {
    'Model': ['BPR', 'LightGCN'],
    'Mean Recall@10': [np.mean(scores_bpr), np.mean(scores_lightgcn)],
    'Mean NDCG@10': [0.20, 0.22]  # Replace with actual metrics from RecBole output
}
df_comparison = pd.DataFrame(comparison_data)

print("\nComparison Table:")
print(df_comparison)

# Visualization: Bar Chart of Results
sns.barplot(x='Model', y='Mean Recall@10', data=df_comparison, palette='coolwarm')
plt.title('Comparison of Recall@10 Across Models')
plt.ylabel('Mean Recall@10')
plt.show()

# Step 4: Calculate Confidence Intervals
def calculate_confidence_interval(scores, confidence=0.95):
    mean_score = np.mean(scores)
    std_err = np.std(scores, ddof=1) / np.sqrt(len(scores))
    margin = std_err * 1.96  # for 95% confidence
    return mean_score - margin, mean_score + margin

conf_bpr = calculate_confidence_interval(scores_bpr)
conf_lightgcn = calculate_confidence_interval(scores_lightgcn)

print(f"\nConfidence Interval for BPR (Recall@10): {conf_bpr}")
print(f"Confidence Interval for LightGCN (Recall@10): {conf_lightgcn}")

# Step 5: Statistical Significance Testing
t_stat, p_value = ttest_rel(scores_bpr, scores_lightgcn)
print(f"\nPaired t-test Results:")
print(f"T-statistic: {t_stat}, P-value: {p_value}")

if p_value < 0.05:
    print("The difference between the models is statistically significant.")
else:
    print("The difference between the models is not statistically significant.")

Copy and paste this code into a Python environment to run the experiments. Ensure you have RecBole, Matplotlib, Seaborn, pandas, NumPy, and SciPy installed in your Python environment.

A final word

Did you like this tutorial? Or did you spot anything strange? Well, ChatGPT wrote this tutorial entirely, based in part on the RecBole documentation. Tell us what you think in the comments. Does the code run? If not, what errors do you get? Did you understand the main concepts? Were there any major mistakes?
