SqueezeBERT architecture (#7083)

* configuration_squeezebert.py

  thin wrapper around bert tokenizer
  fix typos
  wip sb model code
  wip modeling_squeezebert.py. Next step is to get the multi-layer-output interface working
  set up squeezebert to use BertModelOutput when returning results.
  squeezebert documentation
  formatting
  allow head mask that is an array of [None, ..., None]
  docs
  docs cont'd
  path to vocab
  docs and pointers to cloud files (WIP)
  line length and indentation
  squeezebert model cards
  formatting of model cards
  untrack modeling_squeezebert_scratchpad.py
  update aws paths to vocab and config files
  get rid of stub of NSP code, and advise users to pretrain with mlm only
  fix rebase issues
  redo rebase of modeling_auto.py
  fix issues with code formatting
  more code format auto-fixes
  move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert
  tests for squeezebert modeling and tokenization
  fix typo
  move squeezebert before bert in modeling_auto.py to fix inheritance problem
  disable test_head_masking, since squeezebert doesn't yet implement head masking
  fix issues exposed by the test_modeling_squeezebert.py
  fix an issue exposed by test_tokenization_squeezebert.py
  fix issue exposed by test_modeling_squeezebert.py
  auto generated code style improvement
  issue that we inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() calls self.cls(), but there is no self.cls, and I think the goal was actually to call self.lm_head()
  update copyright
  resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask
  docs
  add integration test. rename squeezebert-mnli --> squeezebert/squeezebert-mnli
  autogenerated formatting tweaks
  integrate feedback from patrickvonplaten and sgugger to programming style and documentation strings

* tiny change to order of imports
2020-10-05 01:25:43 -07:00
parent e2c935f561
commit 02ef825be2
18 changed files with 1950 additions and 14 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -148,7 +148,10 @@ conversion utilities for the following models:
 29. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized
    Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang, Zihang
    Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
-30. `Other community models <https://huggingface.co/models>`_, contributed by the `community
+30. SqueezeBERT (from UC Berkeley) released with the paper
+    `SqueezeBERT: What can computer vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`_
+    by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
+31. `Other community models <https://huggingface.co/models>`_, contributed by the `community
    <https://huggingface.co/users>`_.

 .. toctree::
@@ -241,6 +244,7 @@ conversion utilities for the following models:
    model_doc/reformer
    model_doc/retribert
    model_doc/roberta
+    model_doc/squeezebert
    model_doc/t5
    model_doc/transformerxl
    model_doc/xlm
--- a/docs/source/model_doc/squeezebert.rst
+++ b/docs/source/model_doc/squeezebert.rst
@@ -0,0 +1,103 @@
+SqueezeBERT
+----------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~
+
+The SqueezeBERT model was proposed in
+`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
+<https://arxiv.org/abs/2006.11316>`__
+by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
+It's a bidirectional transformer similar to the BERT model.
+The key difference between the BERT architecture and the SqueezeBERT architecture
+is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
+instead of fully-connected layers for the Q, K, V and FFN layers.
+
+The abstract from the paper is the following:
+
+*Humans read and write hundreds of billions of messages every day. Further, due to the availability of
+large datasets, large computing systems, and better neural network models, natural language processing (NLP)
+technology has made significant strides in understanding, proofreading, and organizing these messages.
+Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users,
+social networks, and businesses. In particular, we consider smartphones and other mobile devices as
+crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network
+models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds
+to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped
+convolutions have yielded significant speedups for computer vision networks, but many of these techniques
+have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
+self-attention layers with grouped convolutions, and we use this technique in a novel network architecture
+called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive
+accuracy on the GLUE test set. The SqueezeBERT code will be released.*
+
+Tips:
+
+- SqueezeBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on
+  the right rather than the left.
+- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
+  It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
+  text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
+- For best results when finetuning on sequence classification tasks, it is recommended to start with the
+  ``squeezebert/squeezebert-mnli-headless`` checkpoint.
+
+SqueezeBertConfig
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertConfig
+    :members:
+
+
+SqueezeBertTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+
+SqueezeBertTokenizerFast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertTokenizerFast
+    :members:
+
+
+SqueezeBertModel
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertModel
+    :members:
+
+
+SqueezeBertForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForMaskedLM
+    :members:
+
+
+SqueezeBertForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForSequenceClassification
+    :members:
+
+
+SqueezeBertForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForMultipleChoice
+    :members:
+
+
+SqueezeBertForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForTokenClassification
+    :members:
+
+
+SqueezeBertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForQuestionAnswering
+    :members:
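
The overview in ``squeezebert.rst`` above says that SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers. The following is a minimal pure-Python sketch of that idea, not the code added by this commit: it applies a kernel-size-1 grouped convolution at every sequence position and counts the weights involved. The helper names ``grouped_pointwise_conv`` and ``weight_count`` are hypothetical, and the group count of 4 in the usage lines is an illustrative assumption.

```python
def grouped_pointwise_conv(x, weights, groups):
    """Apply a kernel-size-1 grouped convolution at every sequence position.

    x:       list of positions, each a list of c_in floats
    weights: list of `groups` matrices; each matrix has
             (c_out // groups) rows of (c_in // groups) weights
    """
    c_in = len(x[0])
    in_per_group = c_in // groups
    out = []
    for position in x:
        row = []
        for g in range(groups):
            # Each group only sees its own slice of the input channels.
            chunk = position[g * in_per_group:(g + 1) * in_per_group]
            for w_row in weights[g]:
                row.append(sum(w * v for w, v in zip(w_row, chunk)))
        out.append(row)
    return out


def weight_count(c_in, c_out, groups):
    # Kernel size 1, no bias: each group maps c_in/groups -> c_out/groups channels.
    return groups * (c_in // groups) * (c_out // groups)


# One position with 4 channels, split into 2 groups of 2 channels each.
x = [[1.0, 2.0, 3.0, 4.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # group 0: identity on channels 0-1
    [[2.0, 0.0], [0.0, 2.0]],  # group 1: doubles channels 2-3
]
y = grouped_pointwise_conv(x, weights, groups=2)

dense_weights = weight_count(768, 768, groups=1)    # same as a fully-connected layer
grouped_weights = weight_count(768, 768, groups=4)  # assumed group count
```

With ``groups=1`` the operation reduces to an ordinary position-wise fully-connected layer. With ``groups=g`` the weight count (and the multiply-accumulate count per position) drops by a factor of ``g``, at the cost of channels in different groups no longer mixing within that layer.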
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -426,4 +426,13 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
 +--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-
+| SqueezeBERT        | ``squeezebert/squeezebert-uncased``                        | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
+|                    |                                                            | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks.          |
+|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                    | ``squeezebert/squeezebert-mnli``                           | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
+|                    |                                                            | | This is the squeezebert-uncased model finetuned on the MNLI sentence pair classification task with distillation from electra-base.  |
+|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                    | ``squeezebert/squeezebert-mnli-headless``                  | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
+|                    |                                                            | | This is the squeezebert-uncased model finetuned on the MNLI sentence pair classification task with distillation from electra-base.  |
+|                    |                                                            | | The final classification layer is removed, so when you finetune, the final layer will be reinitialized.                             |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+