SqueezeBERT architecture (#7083)
* configuration_squeezebert.py
  thin wrapper around bert tokenizer
  fix typos
  wip sb model code
  wip modeling_squeezebert.py. Next step is to get the multi-layer-output interface working
  set up squeezebert to use BertModelOutput when returning results.
  squeezebert documentation
  formatting
  allow head mask that is an array of [None, ..., None]
  docs
  docs cont'd
  path to vocab
  docs and pointers to cloud files (WIP)
  line length and indentation
  squeezebert model cards
  formatting of model cards
  untrack modeling_squeezebert_scratchpad.py
  update aws paths to vocab and config files
  get rid of stub of NSP code, and advise users to pretrain with mlm only
  fix rebase issues
  redo rebase of modeling_auto.py
  fix issues with code formatting
  more code format auto-fixes
  move squeezebert before bert in tokenization_auto.py and modeling_auto.py because squeezebert inherits from bert
  tests for squeezebert modeling and tokenization
  fix typo
  move squeezebert before bert in modeling_auto.py to fix inheritance problem
  disable test_head_masking, since squeezebert doesn't yet implement head masking
  fix issues exposed by the test_modeling_squeezebert.py
  fix an issue exposed by test_tokenization_squeezebert.py
  fix issue exposed by test_modeling_squeezebert.py
  auto generated code style improvement
  issue that we inherited from modeling_xxx.py: SqueezeBertForMaskedLM.forward() calls self.cls(), but there is no self.cls, and I think the goal was actually to call self.lm_head()
  update copyright
  resolve failing 'test_hidden_states_output' and remove unused encoder_hidden_states and encoder_attention_mask
  docs
  add integration test. rename squeezebert-mnli --> squeezebert/squeezebert-mnli
  autogenerated formatting tweaks
  integrate feedback from patrickvonplaten and sgugger to programming style and documentation strings
* tiny change to order of imports
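
The ordering note in the commit message (squeezebert is moved before bert in the auto mappings because it inherits from bert) can be illustrated with a minimal sketch. It assumes the lookup walks an ordered mapping and returns the first entry whose class matches via isinstance. The class names and the `resolve` helper are hypothetical stand-ins, not the library's code.

```python
from collections import OrderedDict


class BertLikeConfig:
    """Hypothetical stand-in for a parent configuration class."""


class SqueezeLikeConfig(BertLikeConfig):
    """Hypothetical stand-in for a configuration that inherits from the parent."""


def resolve(config, mapping):
    """Return the value of the first mapping entry whose key class matches config."""
    for config_cls, model_name in mapping.items():
        # isinstance() is also true for subclasses, so the order of entries matters.
        if isinstance(config, config_cls):
            return model_name
    raise ValueError(f"Unrecognized configuration class: {type(config).__name__}")


# Parent listed first: the subclass instance is captured by the parent entry.
parent_first = OrderedDict(
    [(BertLikeConfig, "bert-like model"), (SqueezeLikeConfig, "squeeze-like model")]
)

# Subclass listed first: each configuration resolves to its own entry.
child_first = OrderedDict(
    [(SqueezeLikeConfig, "squeeze-like model"), (BertLikeConfig, "bert-like model")]
)

print(resolve(SqueezeLikeConfig(), parent_first))  # resolves to the parent entry
print(resolve(SqueezeLikeConfig(), child_first))   # resolves to the subclass entry
```

Under that assumption, listing the subclass first is the simplest fix, because a plain parent instance never matches the subclass entry and still falls through to its own.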
@@ -148,7 +148,10 @@ conversion utilities for the following models:
29. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized
    Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang, Zihang
    Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
30. `Other community models <https://huggingface.co/models>`_, contributed by the `community
30. SqueezeBERT (from UC Berkeley) released with the paper
    `SqueezeBERT: What can computer vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`_
    by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
31. `Other community models <https://huggingface.co/models>`_, contributed by the `community
    <https://huggingface.co/users>`_.

.. toctree::
@@ -241,6 +244,7 @@ conversion utilities for the following models:
    model_doc/reformer
    model_doc/retribert
    model_doc/roberta
    model_doc/squeezebert
    model_doc/t5
    model_doc/transformerxl
    model_doc/xlm
103
docs/source/model_doc/squeezebert.rst
Normal file
@@ -0,0 +1,103 @@
SqueezeBERT
----------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~

The SqueezeBERT model was proposed in
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__
by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
It is a bidirectional transformer similar to the BERT model.
The key difference between the BERT architecture and the SqueezeBERT architecture
is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
instead of fully-connected layers for the Q, K, V and FFN layers.
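
As a rough illustration of what replacing a fully-connected layer with a grouped convolution means, here is a minimal pure-Python sketch. It assumes a kernel size of 1 (a position-wise layer) and no bias. The function names and the choice of 4 groups are illustrative assumptions, not values taken from the SqueezeBERT code.

```python
def dense_params(in_features, out_features):
    """Weight count of a fully-connected layer (bias ignored)."""
    return in_features * out_features


def grouped_conv1d_params(in_channels, out_channels, groups, kernel_size=1):
    """Weight count of a grouped 1D convolution (bias ignored)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees the input channels of its own group.
    return (in_channels // groups) * out_channels * kernel_size


def grouped_pointwise_conv(x, weight, groups):
    """Apply a kernel-size-1 grouped convolution to one sequence position.

    x      : list of C_in input channel values
    weight : list of C_out rows, each holding C_in // groups weights
    """
    in_per_group = len(x) // groups
    out_per_group = len(weight) // groups
    out = []
    for o, row in enumerate(weight):
        g = o // out_per_group  # group that output channel o belongs to
        group_inputs = x[g * in_per_group:(g + 1) * in_per_group]
        out.append(sum(w * v for w, v in zip(row, group_inputs)))
    return out


# Hidden size 768 as in the model table; groups=4 is an assumed example value.
print(dense_params(768, 768))              # 589824
print(grouped_conv1d_params(768, 768, 4))  # 147456

# Tiny example: 4 input channels, 4 output channels, 2 groups.
print(grouped_pointwise_conv([1, 2, 3, 4], [[1, 0], [0, 1], [1, 1], [2, 0]], 2))
# [1, 2, 7, 6]
```

With ``groups=1`` the grouped layer reduces to the fully-connected case, and with ``g`` groups the weight count (and the multiply-accumulate count per position) drops by a factor of ``g``.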

The abstract from the paper is the following:

*Humans read and write hundreds of billions of messages every day. Further, due to the availability of
large datasets, large computing systems, and better neural network models, natural language processing (NLP)
technology has made significant strides in understanding, proofreading, and organizing these messages.
Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users,
social networks, and businesses. In particular, we consider smartphones and other mobile devices as
crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network
models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds
to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped
convolutions have yielded significant speedups for computer vision networks, but many of these techniques
have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
self-attention layers with grouped convolutions, and we use this technique in a novel network architecture
called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive
accuracy on the GLUE test set. The SqueezeBERT code will be released.*

Tips:

- SqueezeBERT is a model with absolute position embeddings, so it is usually advised to pad the inputs on
  the right rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
  It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
  text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
  ``squeezebert/squeezebert-mnli-headless`` checkpoint.
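
The right-padding advice in the first tip can be sketched in plain Python. This is an illustration only: the ``pad_right`` helper, the pad id of 0, and the token ids are assumptions made for the example, and in practice the tokenizer performs this step.

```python
def pad_right(sequences, pad_id=0):
    """Pad every sequence on the right to the length of the longest one.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep positions 0..len(seq)-1; padding goes after them.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids, used only to show the shapes.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
ids, mask = pad_right(batch)
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

With padding on the right, the first real token always sits at position 0, so each token receives the same absolute position embedding it would receive in an unpadded input.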

SqueezeBertConfig
~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertConfig
    :members:


SqueezeBertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary


SqueezeBertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertTokenizerFast
    :members:


SqueezeBertModel
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertModel
    :members:


SqueezeBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForMaskedLM
    :members:


SqueezeBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForSequenceClassification
    :members:


SqueezeBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForMultipleChoice
    :members:


SqueezeBertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForTokenClassification
    :members:


SqueezeBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SqueezeBertForQuestionAnswering
    :members:
@@ -426,4 +426,13 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | |
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| SqueezeBERT | ``squeezebert/squeezebert-uncased`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``squeezebert/squeezebert-mnli`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``squeezebert/squeezebert-mnli-headless`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
| | | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base. |
| | | | The final classification layer is removed, so when you finetune, the final layer will be reinitialized. |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+