DistilBERT is a distilled version of the popular BERT model that is smaller, faster, and cheaper to train than its original counterpart. Compared to BERT-base-uncased, it has 40% fewer parameters and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark. Through knowledge distillation during pretraining, the authors reduce the size of a BERT model without sacrificing its language understanding capabilities. Their triple loss, combining language modeling, distillation, and cosine-distance losses, lets the smaller student model leverage the inductive biases learned by the larger teacher during pretraining. Because the resulting model is more cost-effective in both training time and inference compute, it is well suited to deployment on edge devices and across a wide range of applications.
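To make the triple objective concrete, here is a minimal pure-Python sketch of how its three terms could be computed for a single masked position. The loss weights (`alpha`, `beta`, `gamma`) and temperature `T` are illustrative assumptions, not the values used to train DistilBERT, and the function names are hypothetical:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soft-target loss: cross-entropy of the student's distribution
    # against the teacher's temperature-softened distribution.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

def mlm_loss(student_logits, target_index):
    # Standard masked-language-modeling cross-entropy on the hard label.
    p_student = softmax(student_logits)
    return -math.log(p_student[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    # Cosine-distance term aligning the student's hidden state
    # with the teacher's: 1 - cosine similarity.
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                alpha=0.5, beta=0.3, gamma=0.2, T=2.0):
    # Weighted sum of the three terms; alpha/beta/gamma are
    # placeholder weights chosen for illustration only.
    return (alpha * distillation_loss(student_logits, teacher_logits, T)
            + beta * mlm_loss(student_logits, target_index)
            + gamma * cosine_loss(student_hidden, teacher_hidden))
```

In a real training loop these terms would be computed with tensor operations over whole batches, but the structure is the same: the distillation term transfers the teacher's full output distribution (its "dark knowledge"), the language-modeling term keeps the student anchored to the ground-truth tokens, and the cosine term aligns the two models' hidden representations.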
