Why isn’t your recommender system training faster on GPU? [Even Oldridge @NVIDIA]

December 9, 2020

Even Oldridge from NVIDIA Merlin has written a blog post titled “Why isn’t your recommender system training faster on GPU? (And what can you do about it?)”. It’s a nice article that outlines the differences between Computer Vision and NLP, two areas where Deep Learning and GPUs work excellently, and recommender systems, where Deep Learning and GPUs have had only mediocre success. He writes:

Part of the reason that GPUs have been so successful at Vision and NLP tasks is that models in those domains are large and complex. In vision, compute makes up the entire workload, and the parallelism of the GPU shines. For NLP the case is slightly more complex; BERT-base has 12 layers, with a 768-hidden width and 12 attention heads for a total of 110M parameters. Its vocabulary size is 30K x 1024, meaning that the embeddings, which store the representations learned by the model, make up ~30% of its total size. This is a significant proportion, but compute is still the dominant factor in throughput.
Compare that with recent session-based recommender system architectures such as BST, which follow the same general transformer architecture but are configured very differently. BST has a single transformer layer followed by a 1024–512–256 width MLP and 8 attention heads. Its equivalent ‘vocabulary’, on the other hand, is a whopping 300M users and 12M items in the example Taobao dataset, which are embedded with a width of 64. So the amount of compute is several orders of magnitude lower than we see in NLP, and the embeddings, which are IO bound, not compute bound, make up well over 95% of the model and are hard to fit on a single GPU (stay tuned for a future article on that). What that means is that in order to keep the GPU running efficiently we need to make sure other aspects of the workflow are well tuned.
https://medium.com/nvidia-merlin/why-isnt-your-recommender-system-training-faster-on-gpu-and-what-can-you-do-about-it-6cb44a711ad4
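
To make the numbers in the quote concrete, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted above (a 30K x 1024 vocabulary table and 110M total parameters for BERT-base; 300M users and 12M items at width 64 for BST). The helper `embedding_params` and the choice of float32 storage (4 bytes per parameter) are my own assumptions for illustration and do not come from the article.

```python
def embedding_params(num_rows: int, width: int) -> int:
    """Number of parameters in an embedding table with num_rows rows."""
    return num_rows * width


# BERT-base, using the figures from the quote: 30K x 1024 table, 110M parameters in total.
bert_embedding = embedding_params(30_000, 1024)
bert_share = bert_embedding / 110_000_000

# BST on the Taobao example from the quote: 300M users and 12M items, width 64.
bst_embedding = embedding_params(300_000_000, 64) + embedding_params(12_000_000, 64)

# Assumption (not from the article): parameters are stored as float32, 4 bytes each.
BYTES_PER_PARAM = 4
bst_embedding_gb = bst_embedding * BYTES_PER_PARAM / 1e9

print(f"BERT-base embedding share: {bert_share:.0%}")
print(f"BST embedding parameters:  {bst_embedding / 1e9:.2f}B")
print(f"BST embedding table size:  {bst_embedding_gb:.1f} GB")
```

Under these assumptions, the BERT-base embeddings come out at roughly 28% of the model, in line with the ~30% in the quote, while the BST embedding tables alone hold about 20 billion parameters, or roughly 80 GB in float32. That is before any optimizer state is counted, which illustrates why the quote calls them hard to fit on a single GPU.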


Tags: Deep Recommender Systems, Even Oldridge, GPU, NVIDIA Merlin

About The Author

Joeran Beel

I am the founder of Recommender-Systems.com and head of the Intelligent Systems Group (ISG) at the University of Siegen, Germany (https://isg.beel.org). We conduct research in recommender systems (RecSys), personalization and information retrieval (IR), as well as in automated machine learning (AutoML), meta-learning and algorithm selection. Domains we are particularly interested in include smart places, eHealth, manufacturing (industry 4.0), mobility, visual computing, and digital libraries. We founded or maintain, among others, LensKit-Auto, Darwin & Goliath, Mr. DLib, and Docear, each with thousands of users; we contributed to TensorFlow, JabRef and others; and we developed the first prototypes of automated recommender systems (AutoSurprise and Auto-CaseRec) and Federated Meta Learning (FMLearn Server and Client).
