Why isn’t your recommender system training faster on GPU? [Even Oldridge @NVIDIA]

Even Oldridge from NVIDIA Merlin has written a blog post, "Why isn't your recommender system training faster on GPU? (And what can you do about it?)". It's a nice article that outlines the differences between Computer Vision and NLP, two areas where Deep Learning and GPUs work excellently, and recommender systems, where Deep Learning and GPUs have seen only mediocre success.

Part of the reason that GPUs have been so successful at Vision and NLP tasks is that models in those domains are large and complex. In vision, compute makes up essentially the entire workload, and the parallelism of the GPU shines. For NLP the case is slightly more complex: BERT-base has 12 layers, a hidden width of 768, and 12 attention heads, for a total of 110M parameters. Its vocabulary of roughly 30K tokens, embedded at that 768 width, means the embeddings which store the representations learned by the model make up roughly 20% of its total size. That is a significant proportion, but compute is still the dominant factor in throughput.
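A quick back-of-the-envelope sketch makes the proportion concrete. The figures below are the standard BERT-base configuration (30,522-token WordPiece vocabulary, 768 hidden width, ~110M total parameters); the calculation counts only the token-embedding table:

```python
# What fraction of BERT-base's parameters live in its embedding table?
vocab_size = 30_522      # WordPiece vocabulary of BERT-base
hidden_width = 768       # hidden size of BERT-base
total_params = 110e6     # commonly cited total parameter count

embedding_params = vocab_size * hidden_width   # token embeddings only
fraction = embedding_params / total_params

print(f"embedding parameters: {embedding_params / 1e6:.1f}M")
print(f"share of total model: {fraction:.0%}")
```

So the embedding table is around 23M parameters, roughly a fifth of the model; the remaining four fifths sit in transformer layers that keep the GPU's compute units busy.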

Compare that with recent session-based recommender system architectures such as BST (Behavior Sequence Transformer), which follow the same general transformer architecture but are configured very differently. BST has a single transformer layer with 8 attention heads, followed by a 1024-512-256 width MLP. Its equivalent 'vocabulary', on the other hand, is a whopping 300M users and 12M items in the example Taobao dataset, embedded at a width of 64. So the amount of compute is several orders of magnitude lower than we see in NLP, and the embeddings, which are IO-bound rather than compute-bound, make up well over 95% of the model and are hard to fit on a single GPU (stay tuned for a future article on that). What that means is that in order to keep the GPU running efficiently we need to make sure other aspects of the workflow are well tuned.
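The same back-of-the-envelope arithmetic shows just how lopsided the BST case is. The user/item counts and embedding width below come from the text; the dense-side parameter count is an illustrative assumption (one transformer layer plus the MLP is on the order of a few million parameters), not a figure from the paper:

```python
# Embedding table vs. compute layers for a BST-like model on the
# Taobao example (300M users, 12M items, embedding width 64).
n_users, n_items = 300_000_000, 12_000_000
embed_width = 64

embedding_params = (n_users + n_items) * embed_width   # ~20B parameters

# Assumption: one transformer layer + a 1024-512-256 MLP is on the
# order of a few million dense parameters.
dense_params = 5_000_000

share = embedding_params / (embedding_params + dense_params)
print(f"embedding parameters: {embedding_params / 1e9:.1f}B")
print(f"embedding share of model: {share:.2%}")
```

Even if the dense-side estimate is off by an order of magnitude in either direction, the embedding table still accounts for well over 95% of the parameters, which is why the workload is IO-bound rather than compute-bound.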

