HetSeq: Training BERT on a random assortment of GPUs [Yifan Ding et al.]

BERT has brought huge changes to how NLP is done, and it has also had a notable impact on recommender systems (though not always a positive one*). However, training BERT can take weeks, if not months. Yifan Ding, Nicholas Botzer, and Tim Weninger propose a solution for those who, for example, work at universities with heterogeneous GPU infrastructure.

Unfortunately, most organizations not named Google or Microsoft do not have a thousand identical GPUs. Instead, small and medium organizations take a piecemeal approach to purchasing computer systems, resulting in heterogeneous infrastructure that cannot easily be adapted to train large models. Under these circumstances, training even moderately sized models can take weeks or even months to complete.

To help remedy this situation, we recently released a software package called HetSeq, which is adapted from the popular PyTorch package and provides the capability to train large neural network models on heterogeneous infrastructure.
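The core mechanism behind systems like HetSeq is synchronous data-parallel training: each device computes gradients on its own shard of the data, the gradients are averaged across devices (an all-reduce), and every device applies the same averaged update. The sketch below simulates this in pure Python with a toy one-parameter model and unevenly sized shards standing in for heterogeneous workers; the function names are illustrative and are not HetSeq's actual API.

```python
# Pure-Python simulation of synchronous data-parallel training,
# the scheme HetSeq-style systems implement over real GPUs.
# All names here are illustrative, not HetSeq's actual API.

def local_gradient(weights, batch):
    """Gradient of mean squared error for a 1-parameter linear model y = w*x."""
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return [g]

def all_reduce_mean(grads_per_worker):
    """Average gradients across workers, as an all-reduce would across GPUs."""
    n = len(grads_per_worker)
    return [sum(g[i] for g in grads_per_worker) / n
            for i in range(len(grads_per_worker[0]))]

def train_step(weights, shards, lr=0.01):
    """One synchronous step: every worker computes on its shard, then updates identically."""
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, avg)]

# Data for y = 3x, split unevenly across three "heterogeneous" workers.
shards = [[(1, 3), (2, 6)], [(3, 9)], [(4, 12), (5, 15), (6, 18)]]
w = [0.0]
for _ in range(200):
    w = train_step(w, shards)
print(round(w[0], 3))  # converges to 3.0
```

Note that because every worker applies the same averaged gradient, the result is identical to training on the pooled data with a larger batch, which is what makes the scheme correct even when the shards (and the machines holding them) differ in size and speed.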

Experiments, details of which can be found in an article (available on arXiv) published at the 2021 AAAI/IAAI Conference, show that base-BERT can be trained in about a day over 8 different GPU systems, most of which we had to “borrow” from idle labs across Notre Dame.


* In recent work we showed that language models like BERT do not always perform well for research-paper recommendation:

Hassan, Hebatallah A. Mohamed, Giuseppe Sansonetti, Fabio Gasparetti, Alessandro Micarelli, and Joeran Beel. “BERT, ELMo, USE and InferSent Sentence Encoders: The Panacea for Research-Paper Recommendation?” In 13th ACM Conference on Recommender Systems (RecSys), 2019.
