Part of the reason that GPUs have been so successful at vision and NLP tasks is that models in those domains are large and complex. In vision, compute makes up essentially the entire workload, and the parallelism of the GPU shines. For NLP the case is slightly more complex: BERT-base has 12 layers, a hidden width of 768, and 12 attention heads, for a total of 110M parameters. Its vocabulary of roughly 30K tokens is embedded at width 768, meaning that the embeddings, which store representations learned by the model, make up roughly 20% of its total size. This is a significant proportion, but compute is still the dominant factor in throughput.
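As a back-of-envelope check on those figures (token embeddings only, ignoring positional and segment embeddings):

```python
# Rough parameter accounting for BERT-base, using the figures above:
# ~30K vocabulary, 768 hidden width, ~110M total parameters.
VOCAB_SIZE = 30_000
HIDDEN_WIDTH = 768
TOTAL_PARAMS = 110_000_000

embedding_params = VOCAB_SIZE * HIDDEN_WIDTH  # size of the token-embedding table
embedding_fraction = embedding_params / TOTAL_PARAMS

print(f"embedding params:  {embedding_params:,}")       # 23,040,000
print(f"fraction of model: {embedding_fraction:.0%}")   # ~21%
```

So roughly a fifth of BERT-base is embedding lookups; the remaining ~80% is dense compute where the GPU earns its keep.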
Compare that with recent session-based recommender-system architectures such as BST, which follow the same general transformer architecture but are configured very differently. BST has a single transformer layer with 8 attention heads, followed by a 1024–512–256 MLP. Its equivalent 'vocabulary', on the other hand, is a whopping 300M users and 12M items in the example Taobao dataset, each embedded at a width of 64. The amount of compute is therefore several orders of magnitude lower than we see in NLP, and the embeddings, which are IO bound rather than compute bound, make up well over 95% of the model and are hard to fit on a single GPU (stay tuned for a future article on that). What that means is that in order to keep the GPU running efficiently we need to make sure other aspects of the workflow are well tuned.
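The same back-of-envelope accounting makes the imbalance stark. The sketch below uses only the figures from the text (300M users, 12M items, embedding width 64, 1024–512–256 MLP) and, for simplicity, counts just the MLP weights on the dense side; the single transformer layer would add a similarly small amount:

```python
# Back-of-envelope size accounting for a BST-like model on the Taobao
# figures quoted above. Only the MLP weight matrices are counted on the
# dense side; biases and the single transformer layer are omitted since
# they are negligible at this scale.
USERS = 300_000_000
ITEMS = 12_000_000
EMB_DIM = 64

embedding_params = (USERS + ITEMS) * EMB_DIM  # user + item embedding tables
mlp_params = 1024 * 512 + 512 * 256           # dense weights, ~0.66M

total = embedding_params + mlp_params
print(f"embedding params: {embedding_params:,}")              # ~20 billion
print(f"embedding share:  {embedding_params / total:.4%}")    # >99.99%
print(f"fp32 table size:  {embedding_params * 4 / 2**30:.0f} GiB")
```

At fp32 the embedding tables alone come to roughly 74 GiB, which is why they do not fit in a single GPU's memory, while the dense part of the model is vanishingly small by comparison.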