Integrate DeBERTa v2(the 1.5B model surpassed human performance on Su… (#10018)
* Integrate DeBERTa v2(the 1.5B model surpassed human performance on SuperGLUE); Add DeBERTa v2 900M,1.5B models;
* DeBERTa-v2
* Fix v2 model loading issue (#10129)
* Doc members
* Update src/transformers/models/deberta/modeling_deberta.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address Sylvain's comments
* Address Patrick's comments

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@@ -201,6 +201,7 @@ Current number of checkpoints: ** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
|
1. **[ConvBERT](https://huggingface.co/transformers/model_doc/convbert.html)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
|
||||||
1. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
1. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||||
1. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
|
1. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
|
||||||
|
1. **[DeBERTa-v2](https://huggingface.co/transformers/model_doc/deberta_v2.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
|
||||||
1. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
|
1. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
|
||||||
1. **[DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/master/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
|
1. **[DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/master/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
|
||||||
1. **[DPR](https://huggingface.co/transformers/model_doc/dpr.html)** (from Facebook) released with the paper [Dense Passage Retrieval
|
1. **[DPR](https://huggingface.co/transformers/model_doc/dpr.html)** (from Facebook) released with the paper [Dense Passage Retrieval
|
||||||
|
|||||||
@@ -117,95 +117,98 @@ and conversion utilities for the following models:
|
|||||||
12. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft Research) released with the paper `DeBERTa: Decoding-enhanced
|
12. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft Research) released with the paper `DeBERTa: Decoding-enhanced
|
||||||
BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
|
BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
|
||||||
Weizhu Chen.
|
Weizhu Chen.
|
||||||
13. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
|
13. :doc:`DeBERTa-v2 <model_doc/deberta_v2>` (from Microsoft Research) released with the paper `DeBERTa:
|
||||||
|
Decoding-enhanced BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong
|
||||||
|
Liu, Jianfeng Gao, Weizhu Chen.
|
||||||
|
14. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
|
||||||
Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
|
Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
|
||||||
Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
|
Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
|
||||||
14. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
|
15. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
|
||||||
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
|
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
|
||||||
Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
|
Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
|
||||||
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
|
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
|
||||||
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
|
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
|
||||||
`DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
|
`DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
|
||||||
version of DistilBERT.
|
version of DistilBERT.
|
||||||
15. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
|
16. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
|
||||||
Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
|
Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
|
||||||
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
|
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
|
||||||
16. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
|
17. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
|
||||||
Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
|
Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
|
||||||
Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
|
Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
|
||||||
17. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
|
18. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
|
||||||
Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
|
Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
|
||||||
Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
|
Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
|
||||||
18. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
|
19. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
|
||||||
Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
|
Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
|
||||||
Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
||||||
19. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
|
20. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
|
||||||
Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
|
Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
|
||||||
and Ilya Sutskever.
|
and Ilya Sutskever.
|
||||||
20. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
|
21. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
|
||||||
Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
|
Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
|
||||||
Luan, Dario Amodei** and Ilya Sutskever**.
|
Luan, Dario Amodei** and Ilya Sutskever**.
|
||||||
21. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
|
22. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
|
||||||
of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
|
of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
|
||||||
Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
|
Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
|
||||||
22. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
|
23. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
|
||||||
<https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
<https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
||||||
23. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
|
24. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
|
||||||
Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
||||||
24. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
|
25. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
|
||||||
Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
|
Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
|
||||||
by Hao Tan and Mohit Bansal.
|
by Hao Tan and Mohit Bansal.
|
||||||
25. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
|
26. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
|
||||||
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
|
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
|
||||||
Translator Team.
|
Translator Team.
|
||||||
26. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
|
27. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
|
||||||
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
|
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
|
||||||
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
||||||
27. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
|
28. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
|
||||||
Multilingual Pretraining and Finetuning <https://arxiv.org/abs/2008.00401>`__ by Yuqing Tang, Chau Tran, Xian Li,
|
Multilingual Pretraining and Finetuning <https://arxiv.org/abs/2008.00401>`__ by Yuqing Tang, Chau Tran, Xian Li,
|
||||||
Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
|
Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
|
||||||
28. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
|
29. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
|
||||||
Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
|
Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
|
||||||
Jianfeng Lu, Tie-Yan Liu.
|
Jianfeng Lu, Tie-Yan Liu.
|
||||||
29. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
|
30. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
|
||||||
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
|
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
|
||||||
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
|
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
|
||||||
30. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
|
31. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
|
||||||
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__> by Jingqing Zhang, Yao Zhao,
|
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__> by Jingqing Zhang, Yao Zhao,
|
||||||
Mohammad Saleh and Peter J. Liu.
|
Mohammad Saleh and Peter J. Liu.
|
||||||
31. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
|
32. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
|
||||||
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
|
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
|
||||||
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||||
32. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
|
33. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
|
||||||
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
|
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
|
||||||
33. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
|
34. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
|
||||||
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
|
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
|
||||||
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||||
34. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
|
35. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
|
||||||
about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
|
about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
|
||||||
Krishna, and Kurt W. Keutzer.
|
Krishna, and Kurt W. Keutzer.
|
||||||
35. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
36. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
||||||
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
|
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
|
||||||
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
|
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
|
||||||
36. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
37. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
||||||
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
|
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
|
||||||
Francesco Piccinno and Julian Martin Eisenschlos.
|
Francesco Piccinno and Julian Martin Eisenschlos.
|
||||||
37. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
38. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
||||||
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
|
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
|
||||||
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
||||||
38. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
39. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||||
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
|
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
|
||||||
Zhou, Abdelrahman Mohamed, Michael Auli.
|
Zhou, Abdelrahman Mohamed, Michael Auli.
|
||||||
39. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
40. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||||
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
|
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
|
||||||
40. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
41. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||||
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
|
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
|
||||||
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||||
41. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
42. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||||
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
|
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
|
||||||
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
|
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
|
||||||
Zettlemoyer and Veselin Stoyanov.
|
Zettlemoyer and Veselin Stoyanov.
|
||||||
42. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
43. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||||
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
|
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
|
||||||
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||||
|
|
||||||
@@ -246,6 +249,8 @@ TensorFlow and/or Flax.
|
|||||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
| DeBERTa | ✅ | ❌ | ✅ | ❌ | ❌ |
|
| DeBERTa | ✅ | ❌ | ✅ | ❌ | ❌ |
|
||||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
|
| DeBERTa-v2 | ✅ | ❌ | ✅ | ❌ | ❌ |
|
||||||
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
|
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
| ELECTRA | ✅ | ✅ | ✅ | ✅ | ❌ |
|
| ELECTRA | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||||
@@ -389,6 +394,7 @@ TensorFlow and/or Flax.
|
|||||||
model_doc/convbert
|
model_doc/convbert
|
||||||
model_doc/ctrl
|
model_doc/ctrl
|
||||||
model_doc/deberta
|
model_doc/deberta
|
||||||
|
model_doc/deberta_v2
|
||||||
model_doc/dialogpt
|
model_doc/dialogpt
|
||||||
model_doc/distilbert
|
model_doc/distilbert
|
||||||
model_doc/dpr
|
model_doc/dpr
|
||||||
|
|||||||
@@ -60,7 +60,7 @@ DebertaModel
|
|||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaModel
|
.. autoclass:: transformers.DebertaModel
|
||||||
:members:
|
:members: forward
|
||||||
|
|
||||||
|
|
||||||
DebertaPreTrainedModel
|
DebertaPreTrainedModel
|
||||||
@@ -74,25 +74,25 @@ DebertaForMaskedLM
|
|||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForMaskedLM
|
.. autoclass:: transformers.DebertaForMaskedLM
|
||||||
:members:
|
:members: forward
|
||||||
|
|
||||||
|
|
||||||
DebertaForSequenceClassification
|
DebertaForSequenceClassification
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForSequenceClassification
|
.. autoclass:: transformers.DebertaForSequenceClassification
|
||||||
:members:
|
:members: forward
|
||||||
|
|
||||||
|
|
||||||
DebertaForTokenClassification
|
DebertaForTokenClassification
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForTokenClassification
|
.. autoclass:: transformers.DebertaForTokenClassification
|
||||||
:members:
|
:members: forward
|
||||||
|
|
||||||
|
|
||||||
DebertaForQuestionAnswering
|
DebertaForQuestionAnswering
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForQuestionAnswering
|
.. autoclass:: transformers.DebertaForQuestionAnswering
|
||||||
:members:
|
:members: forward
|
||||||
|
|||||||
118
docs/source/model_doc/deberta_v2.rst
Normal file
@@ -0,0 +1,118 @@
|
|||||||
|
..
|
||||||
|
Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
|
||||||
|
DeBERTa-v2
|
||||||
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
Overview
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
||||||
|
<https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
|
||||||
|
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
|
||||||
|
|
||||||
|
It builds on RoBERTa with disentangled attention and an enhanced mask decoder, trained with half of the data used in
|
||||||
|
RoBERTa.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:

*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
position, respectively, and the attention weights among words are computed using disentangled matrices on their
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*


The following information is visible directly on the `original implementation repository
<https://github.com/microsoft/DeBERTa>`__. DeBERTa v2 is the second version of the DeBERTa model. It includes the 1.5B
model used for the SuperGLUE single-model submission, which scored 89.9 versus the human baseline of 89.8. You can find
more details about this submission in the authors' `blog
<https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/>`__.

New in v2:

- **Vocabulary** In v2 the tokenizer is changed to use a new vocabulary of size 128K built from the training data.
  Instead of a GPT2-based tokenizer, the tokenizer is now a
  `sentencepiece-based <https://github.com/google/sentencepiece>`__ tokenizer.
- **nGiE (nGram Induced Input Encoding)** The DeBERTa-v2 model uses an additional convolution layer alongside the first
  transformer layer to better learn the local dependency of input tokens.
- **Sharing position projection matrix with content projection matrix in attention layer** Based on previous
  experiments, this can save parameters without affecting the performance.
- **Apply bucket to encode relative positions** The DeBERTa-v2 model uses log bucket to encode relative positions
  similar to T5.
- **900M model & 1.5B model** Two additional model sizes are available: 900M and 1.5B, which significantly improve the
  performance of downstream tasks.

The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
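
The log-bucket item in the list above can be made concrete with a small sketch. This is an illustration only, not the code from the DeBERTa repository: the function name ``log_bucket`` and the default ``bucket_size`` and ``max_position`` values are assumptions chosen for the example. Distances smaller than half the bucket size keep their exact offset, while larger distances are compressed logarithmically, in the spirit of T5's relative position buckets.

```python
import math


def log_bucket(relative_pos: int, bucket_size: int = 256, max_position: int = 512) -> int:
    """Map a signed relative distance to a signed bucket index (illustrative sketch)."""
    mid = bucket_size // 2
    # Short distances are kept exact: one bucket per offset.
    if -mid < relative_pos < mid:
        return relative_pos
    sign = 1 if relative_pos > 0 else -1
    # Distances beyond the maximum position share the outermost bucket.
    abs_pos = min(abs(relative_pos), max_position - 1)
    # Position of abs_pos on a log scale between `mid` and `max_position - 1`, in [0, 1].
    scale = math.log(abs_pos / mid) / math.log((max_position - 1) / mid)
    return sign * (math.ceil(scale * (mid - 1)) + mid)


if __name__ == "__main__":
    for distance in (0, 5, -5, 128, 200, 511, 1000):
        print(distance, "->", log_bucket(distance))
```

With these assumed defaults, every distance falls into one of roughly ``bucket_size`` buckets regardless of sequence length, which keeps the relative position embedding table small.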


DebertaV2Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2Config
    :members:


DebertaV2Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2Tokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary


DebertaV2Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2Model
    :members: forward


DebertaV2PreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2PreTrainedModel
    :members: forward


DebertaV2ForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2ForMaskedLM
    :members: forward


DebertaV2ForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2ForSequenceClassification
    :members: forward


DebertaV2ForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2ForTokenClassification
    :members: forward


DebertaV2ForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DebertaV2ForQuestionAnswering
    :members: forward
@@ -443,15 +443,30 @@ For the full list, refer to `https://huggingface.co/models <https://huggingface.
|
|||||||
| | | |
|
| | | |
|
||||||
| | | (see `details <https://github.com/microsoft/unilm/tree/master/layoutlm>`__) |
|
| | | (see `details <https://github.com/microsoft/unilm/tree/master/layoutlm>`__) |
|
||||||
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| DeBERTa | ``microsoft/deberta-base`` | | 12-layer, 768-hidden, 12-heads, ~125M parameters |
|
| DeBERTa | ``microsoft/deberta-base`` | | 12-layer, 768-hidden, 12-heads, ~140M parameters |
|
||||||
| | | | DeBERTa using the BERT-base architecture |
|
| | | | DeBERTa using the BERT-base architecture |
|
||||||
| | | |
|
| | | |
|
||||||
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
|
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
|
||||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| | ``microsoft/deberta-large`` | | 24-layer, 1024-hidden, 16-heads, ~390M parameters |
|
| | ``microsoft/deberta-large`` | | 24-layer, 1024-hidden, 16-heads, ~400M parameters |
|
||||||
| | | | DeBERTa using the BERT-large architecture |
|
| | | | DeBERTa using the BERT-large architecture |
|
||||||
| | | |
|
| | | |
|
||||||
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
|
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``microsoft/deberta-xlarge`` | | 48-layer, 1024-hidden, 16-heads, ~750M parameters |
|
||||||
|
| | | | DeBERTa XLarge with similar BERT architecture |
|
||||||
|
| | | |
|
||||||
|
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``microsoft/deberta-xlarge-v2`` | | 24-layer, 1536-hidden, 24-heads, ~900M parameters |
|
||||||
|
| | | | DeBERTa XLarge V2 with similar BERT architecture |
|
||||||
|
| | | |
|
||||||
|
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``microsoft/deberta-xxlarge-v2`` | | 48-layer, 1536-hidden, 24-heads, ~1.5B parameters |
|
||||||
|
| | | | DeBERTa XXLarge V2 with similar BERT architecture |
|
||||||
|
| | | |
|
||||||
|
| | | (see `details <https://github.com/microsoft/DeBERTa>`__) |
|
||||||
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| SqueezeBERT | ``squeezebert/squeezebert-uncased`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
|
| SqueezeBERT | ``squeezebert/squeezebert-uncased`` | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. |
|
||||||
| | | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks. |
|
| | | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks. |
|
||||||
|
|||||||
@@ -157,6 +157,7 @@ _import_structure = {
|
|||||||
"models.camembert": ["CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CamembertConfig"],
|
"models.camembert": ["CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CamembertConfig"],
|
||||||
"models.ctrl": ["CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP", "CTRLConfig", "CTRLTokenizer"],
|
"models.ctrl": ["CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP", "CTRLConfig", "CTRLTokenizer"],
|
||||||
"models.deberta": ["DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaConfig", "DebertaTokenizer"],
|
"models.deberta": ["DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaConfig", "DebertaTokenizer"],
|
||||||
|
"models.deberta_v2": ["DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaV2Config", "DebertaV2Tokenizer"],
|
||||||
"models.distilbert": ["DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DistilBertConfig", "DistilBertTokenizer"],
|
"models.distilbert": ["DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DistilBertConfig", "DistilBertTokenizer"],
|
||||||
"models.dpr": [
|
"models.dpr": [
|
||||||
"DPR_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
"DPR_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||||
@@ -515,6 +516,17 @@ if is_torch_available():
|
|||||||
"DebertaForQuestionAnswering",
|
"DebertaForQuestionAnswering",
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
_import_structure["models.deberta_v2"].extend(
|
||||||
|
[
|
||||||
|
"DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||||
|
"DebertaV2ForSequenceClassification",
|
||||||
|
"DebertaV2Model",
|
||||||
|
"DebertaV2ForMaskedLM",
|
||||||
|
"DebertaV2PreTrainedModel",
|
||||||
|
"DebertaV2ForTokenClassification",
|
||||||
|
"DebertaV2ForQuestionAnswering",
|
||||||
|
]
|
||||||
|
)
|
||||||
_import_structure["models.distilbert"].extend(
|
_import_structure["models.distilbert"].extend(
|
||||||
[
|
[
|
||||||
"DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
|
"DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||||
@@ -1287,6 +1299,7 @@ if TYPE_CHECKING:
|
|||||||
from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer
|
from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer
|
||||||
from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer
|
from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer
|
||||||
from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer
|
from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer
|
||||||
|
from .models.deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config, DebertaV2Tokenizer
|
||||||
from .models.distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig, DistilBertTokenizer
|
from .models.distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig, DistilBertTokenizer
|
||||||
from .models.dpr import (
|
from .models.dpr import (
|
||||||
DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
@@ -1604,6 +1617,15 @@ if TYPE_CHECKING:
|
|||||||
DebertaModel,
|
DebertaModel,
|
||||||
DebertaPreTrainedModel,
|
DebertaPreTrainedModel,
|
||||||
)
|
)
|
||||||
|
from .models.deberta_v2 import (
|
||||||
|
DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||||
|
DebertaV2ForMaskedLM,
|
||||||
|
DebertaV2ForQuestionAnswering,
|
||||||
|
DebertaV2ForSequenceClassification,
|
||||||
|
DebertaV2ForTokenClassification,
|
||||||
|
DebertaV2Model,
|
||||||
|
DebertaV2PreTrainedModel,
|
||||||
|
)
|
||||||
from .models.distilbert import (
|
from .models.distilbert import (
|
||||||
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
|
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||||
DistilBertForMaskedLM,
|
DistilBertForMaskedLM,
|
||||||
|
|||||||
@@ -31,6 +31,7 @@ from ..camembert.configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCH
|
|||||||
from ..convbert.configuration_convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig
|
from ..convbert.configuration_convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig
|
||||||
from ..ctrl.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
|
from ..ctrl.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
|
||||||
from ..deberta.configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig
|
from ..deberta.configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig
|
||||||
|
from ..deberta_v2.configuration_deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config
|
||||||
from ..distilbert.configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
|
from ..distilbert.configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
|
||||||
from ..dpr.configuration_dpr import DPR_PRETRAINED_CONFIG_ARCHIVE_MAP, DPRConfig
|
from ..dpr.configuration_dpr import DPR_PRETRAINED_CONFIG_ARCHIVE_MAP, DPRConfig
|
||||||
from ..electra.configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig
|
from ..electra.configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig
|
||||||
@@ -103,6 +104,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
|
|||||||
LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
|
DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
@@ -138,6 +140,7 @@ CONFIG_MAPPING = OrderedDict(
|
|||||||
("reformer", ReformerConfig),
|
("reformer", ReformerConfig),
|
||||||
("longformer", LongformerConfig),
|
("longformer", LongformerConfig),
|
||||||
("roberta", RobertaConfig),
|
("roberta", RobertaConfig),
|
||||||
|
("deberta-v2", DebertaV2Config),
|
||||||
("deberta", DebertaConfig),
|
("deberta", DebertaConfig),
|
||||||
("flaubert", FlaubertConfig),
|
("flaubert", FlaubertConfig),
|
||||||
("fsmt", FSMTConfig),
|
("fsmt", FSMTConfig),
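
Note on the hunk above: `"deberta-v2"` is registered ahead of `"deberta"`. One plausible reason is order-sensitive matching. The sketch below assumes a fallback that picks the first mapping key found as a substring of the checkpoint name when `config.json` carries no `model_type`; that fallback is not shown in this diff. The function `resolve_model_type`, the checkpoint name `some-org/deberta-v2-demo`, and the string values standing in for config classes are all hypothetical.

```python
from collections import OrderedDict

# Strings stand in for the real config classes; insertion order matters.
SPECIFIC_FIRST = OrderedDict(
    [
        ("roberta", "RobertaConfig"),
        ("deberta-v2", "DebertaV2Config"),
        ("deberta", "DebertaConfig"),
    ]
)

# Same entries, but with the shorter key listed before the longer one.
GENERIC_FIRST = OrderedDict(
    [
        ("roberta", "RobertaConfig"),
        ("deberta", "DebertaConfig"),
        ("deberta-v2", "DebertaV2Config"),
    ]
)


def resolve_model_type(name, mapping):
    """Return the value of the first key that occurs as a substring of `name`."""
    for pattern, config_name in mapping.items():
        if pattern in name:
            return config_name
    return None


if __name__ == "__main__":
    name = "some-org/deberta-v2-demo"  # hypothetical checkpoint name
    print(resolve_model_type(name, SPECIFIC_FIRST))
    print(resolve_model_type(name, GENERIC_FIRST))
```

Because `"deberta"` is itself a substring of `"deberta-v2"`, listing the shorter key first would shadow the longer one under this kind of matching. The same reasoning would apply to the `MODEL_NAMES_MAPPING` hunk if it is consumed in order.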
|
||||||
@@ -199,6 +202,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
|||||||
("encoder-decoder", "Encoder decoder"),
|
("encoder-decoder", "Encoder decoder"),
|
||||||
("funnel", "Funnel Transformer"),
|
("funnel", "Funnel Transformer"),
|
||||||
("lxmert", "LXMERT"),
|
("lxmert", "LXMERT"),
|
||||||
|
("deberta-v2", "DeBERTa-v2"),
|
||||||
("deberta", "DeBERTa"),
|
("deberta", "DeBERTa"),
|
||||||
("layoutlm", "LayoutLM"),
|
("layoutlm", "LayoutLM"),
|
||||||
("dpr", "DPR"),
|
("dpr", "DPR"),
|
||||||
@@ -366,7 +370,6 @@ class AutoConfig:
|
|||||||
{'foo': False}
|
{'foo': False}
|
||||||
"""
|
"""
|
||||||
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
if "model_type" in config_dict:
|
if "model_type" in config_dict:
|
||||||
config_class = CONFIG_MAPPING[config_dict["model_type"]]
|
config_class = CONFIG_MAPPING[config_dict["model_type"]]
|
||||||
return config_class.from_dict(config_dict, **kwargs)
|
return config_class.from_dict(config_dict, **kwargs)
|
||||||
|
|||||||
@@ -84,6 +84,13 @@ from ..deberta.modeling_deberta import (
|
|||||||
DebertaForTokenClassification,
|
DebertaForTokenClassification,
|
||||||
DebertaModel,
|
DebertaModel,
|
||||||
)
|
)
|
||||||
|
from ..deberta_v2.modeling_deberta_v2 import (
|
||||||
|
DebertaV2ForMaskedLM,
|
||||||
|
DebertaV2ForQuestionAnswering,
|
||||||
|
DebertaV2ForSequenceClassification,
|
||||||
|
DebertaV2ForTokenClassification,
|
||||||
|
DebertaV2Model,
|
||||||
|
)
|
||||||
from ..distilbert.modeling_distilbert import (
|
from ..distilbert.modeling_distilbert import (
|
||||||
DistilBertForMaskedLM,
|
DistilBertForMaskedLM,
|
||||||
DistilBertForMultipleChoice,
|
DistilBertForMultipleChoice,
|
||||||
@@ -254,6 +261,7 @@ from .configuration_auto import (
|
|||||||
ConvBertConfig,
|
ConvBertConfig,
|
||||||
CTRLConfig,
|
CTRLConfig,
|
||||||
DebertaConfig,
|
DebertaConfig,
|
||||||
|
DebertaV2Config,
|
||||||
DistilBertConfig,
|
DistilBertConfig,
|
||||||
DPRConfig,
|
DPRConfig,
|
||||||
ElectraConfig,
|
ElectraConfig,
|
||||||
@@ -332,6 +340,7 @@ MODEL_MAPPING = OrderedDict(
|
|||||||
(LxmertConfig, LxmertModel),
|
(LxmertConfig, LxmertModel),
|
||||||
(BertGenerationConfig, BertGenerationEncoder),
|
(BertGenerationConfig, BertGenerationEncoder),
|
||||||
(DebertaConfig, DebertaModel),
|
(DebertaConfig, DebertaModel),
|
||||||
|
(DebertaV2Config, DebertaV2Model),
|
||||||
(DPRConfig, DPRQuestionEncoder),
|
(DPRConfig, DPRQuestionEncoder),
|
||||||
(XLMProphetNetConfig, XLMProphetNetModel),
|
(XLMProphetNetConfig, XLMProphetNetModel),
|
||||||
(ProphetNetConfig, ProphetNetModel),
|
(ProphetNetConfig, ProphetNetModel),
|
||||||
@@ -408,6 +417,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
|
|||||||
(MPNetConfig, MPNetForMaskedLM),
|
(MPNetConfig, MPNetForMaskedLM),
|
||||||
(TapasConfig, TapasForMaskedLM),
|
(TapasConfig, TapasForMaskedLM),
|
||||||
(DebertaConfig, DebertaForMaskedLM),
|
(DebertaConfig, DebertaForMaskedLM),
|
||||||
|
(DebertaV2Config, DebertaV2ForMaskedLM),
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -465,6 +475,7 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
|
|||||||
(MPNetConfig, MPNetForMaskedLM),
|
(MPNetConfig, MPNetForMaskedLM),
|
||||||
(TapasConfig, TapasForMaskedLM),
|
(TapasConfig, TapasForMaskedLM),
|
||||||
(DebertaConfig, DebertaForMaskedLM),
|
(DebertaConfig, DebertaForMaskedLM),
|
||||||
|
(DebertaV2Config, DebertaV2ForMaskedLM),
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -510,6 +521,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
|
|||||||
(ElectraConfig, ElectraForSequenceClassification),
|
(ElectraConfig, ElectraForSequenceClassification),
|
||||||
(FunnelConfig, FunnelForSequenceClassification),
|
(FunnelConfig, FunnelForSequenceClassification),
|
||||||
(DebertaConfig, DebertaForSequenceClassification),
|
(DebertaConfig, DebertaForSequenceClassification),
|
||||||
|
(DebertaV2Config, DebertaV2ForSequenceClassification),
|
||||||
(GPT2Config, GPT2ForSequenceClassification),
|
(GPT2Config, GPT2ForSequenceClassification),
|
||||||
(OpenAIGPTConfig, OpenAIGPTForSequenceClassification),
|
(OpenAIGPTConfig, OpenAIGPTForSequenceClassification),
|
||||||
(ReformerConfig, ReformerForSequenceClassification),
|
(ReformerConfig, ReformerForSequenceClassification),
|
||||||
@@ -545,6 +557,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
|
|||||||
(LxmertConfig, LxmertForQuestionAnswering),
|
(LxmertConfig, LxmertForQuestionAnswering),
|
||||||
(MPNetConfig, MPNetForQuestionAnswering),
|
(MPNetConfig, MPNetForQuestionAnswering),
|
||||||
(DebertaConfig, DebertaForQuestionAnswering),
|
(DebertaConfig, DebertaForQuestionAnswering),
|
||||||
|
(DebertaV2Config, DebertaV2ForQuestionAnswering),
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -577,6 +590,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
|
|||||||
(FunnelConfig, FunnelForTokenClassification),
|
(FunnelConfig, FunnelForTokenClassification),
|
||||||
(MPNetConfig, MPNetForTokenClassification),
|
(MPNetConfig, MPNetForTokenClassification),
|
||||||
(DebertaConfig, DebertaForTokenClassification),
|
(DebertaConfig, DebertaForTokenClassification),
|
||||||
|
(DebertaV2Config, DebertaV2ForTokenClassification),
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -66,6 +66,7 @@ from .configuration_auto import (
|
|||||||
ConvBertConfig,
|
ConvBertConfig,
|
||||||
CTRLConfig,
|
CTRLConfig,
|
||||||
DebertaConfig,
|
DebertaConfig,
|
||||||
|
DebertaV2Config,
|
||||||
DistilBertConfig,
|
DistilBertConfig,
|
||||||
DPRConfig,
|
DPRConfig,
|
||||||
ElectraConfig,
|
ElectraConfig,
|
||||||
@@ -108,6 +109,7 @@ if is_sentencepiece_available():
|
|||||||
from ..barthez.tokenization_barthez import BarthezTokenizer
|
from ..barthez.tokenization_barthez import BarthezTokenizer
|
||||||
from ..bert_generation.tokenization_bert_generation import BertGenerationTokenizer
|
from ..bert_generation.tokenization_bert_generation import BertGenerationTokenizer
|
||||||
from ..camembert.tokenization_camembert import CamembertTokenizer
|
from ..camembert.tokenization_camembert import CamembertTokenizer
|
||||||
|
from ..deberta_v2.tokenization_deberta_v2 import DebertaV2Tokenizer
|
||||||
from ..marian.tokenization_marian import MarianTokenizer
|
from ..marian.tokenization_marian import MarianTokenizer
|
||||||
from ..mbart.tokenization_mbart import MBartTokenizer
|
from ..mbart.tokenization_mbart import MBartTokenizer
|
||||||
from ..mt5 import MT5Tokenizer
|
from ..mt5 import MT5Tokenizer
|
||||||
@@ -122,6 +124,7 @@ else:
|
|||||||
BarthezTokenizer = None
|
BarthezTokenizer = None
|
||||||
BertGenerationTokenizer = None
|
BertGenerationTokenizer = None
|
||||||
CamembertTokenizer = None
|
CamembertTokenizer = None
|
||||||
|
DebertaV2Tokenizer = None
|
||||||
MarianTokenizer = None
|
MarianTokenizer = None
|
||||||
MBartTokenizer = None
|
MBartTokenizer = None
|
||||||
MT5Tokenizer = None
|
MT5Tokenizer = None
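
Note on the two hunks above: the sentencepiece-backed `DebertaV2Tokenizer` is imported only when sentencepiece is available and is otherwise bound to `None`. A minimal standalone sketch of that optional-dependency pattern follows. It assumes `importlib.util.find_spec` as the availability check (the library's own `is_sentencepiece_available` may be implemented differently), and `is_package_available` and `optional_import` are hypothetical helper names.

```python
import importlib
import importlib.util


def is_package_available(name: str) -> bool:
    """Report whether a top-level package can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def optional_import(name: str):
    """Import and return the module if it is installed, otherwise return None."""
    if is_package_available(name):
        return importlib.import_module(name)
    return None


# Callers check for None before using the optional dependency.
sentencepiece = optional_import("sentencepiece")

if __name__ == "__main__":
    if sentencepiece is None:
        print("sentencepiece is not installed; the tokenizer would be unavailable")
    else:
        print("sentencepiece is installed")
```

Binding the name to `None` keeps later lookups, such as a tokenizer mapping table, importable even when the optional package is missing.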
|
||||||
@@ -233,6 +236,7 @@ TOKENIZER_MAPPING = OrderedDict(
|
|||||||
(FSMTConfig, (FSMTTokenizer, None)),
|
(FSMTConfig, (FSMTTokenizer, None)),
|
||||||
(BertGenerationConfig, (BertGenerationTokenizer, None)),
|
(BertGenerationConfig, (BertGenerationTokenizer, None)),
|
||||||
(DebertaConfig, (DebertaTokenizer, None)),
|
(DebertaConfig, (DebertaTokenizer, None)),
|
||||||
|
(DebertaV2Config, (DebertaV2Tokenizer, None)),
|
||||||
(RagConfig, (RagTokenizer, None)),
|
(RagConfig, (RagTokenizer, None)),
|
||||||
(XLMProphetNetConfig, (XLMProphetNetTokenizer, None)),
|
(XLMProphetNetConfig, (XLMProphetNetTokenizer, None)),
|
||||||
(ProphetNetConfig, (ProphetNetTokenizer, None)),
|
(ProphetNetConfig, (ProphetNetTokenizer, None)),
|
||||||
|
|||||||
@@ -23,6 +23,10 @@ logger = logging.get_logger(__name__)
|
|||||||
DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||||
"microsoft/deberta-base": "https://huggingface.co/microsoft/deberta-base/resolve/main/config.json",
|
"microsoft/deberta-base": "https://huggingface.co/microsoft/deberta-base/resolve/main/config.json",
|
||||||
"microsoft/deberta-large": "https://huggingface.co/microsoft/deberta-large/resolve/main/config.json",
|
"microsoft/deberta-large": "https://huggingface.co/microsoft/deberta-large/resolve/main/config.json",
|
||||||
|
"microsoft/deberta-xlarge": "https://huggingface.co/microsoft/deberta-xlarge/resolve/main/config.json",
|
||||||
|
"microsoft/deberta-base-mnli": "https://huggingface.co/microsoft/deberta-base-mnli/resolve/main/config.json",
|
||||||
|
"microsoft/deberta-large-mnli": "https://huggingface.co/microsoft/deberta-large-mnli/resolve/main/config.json",
|
||||||
|
"microsoft/deberta-xlarge-mnli": "https://huggingface.co/microsoft/deberta-xlarge-mnli/resolve/main/config.json",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -18,7 +18,6 @@ import math
|
|||||||
from collections.abc import Sequence
|
from collections.abc import Sequence
|
||||||
|
|
||||||
import torch
|
import torch
|
||||||
from packaging import version
|
|
||||||
from torch import _softmax_backward_data, nn
|
from torch import _softmax_backward_data, nn
|
||||||
from torch.nn import CrossEntropyLoss
|
from torch.nn import CrossEntropyLoss
|
||||||
|
|
||||||
@@ -40,10 +39,15 @@ logger = logging.get_logger(__name__)
|
|||||||
|
|
||||||
_CONFIG_FOR_DOC = "DebertaConfig"
|
_CONFIG_FOR_DOC = "DebertaConfig"
|
||||||
_TOKENIZER_FOR_DOC = "DebertaTokenizer"
|
_TOKENIZER_FOR_DOC = "DebertaTokenizer"
|
||||||
|
_CHECKPOINT_FOR_DOC = "microsoft/deberta-base"
|
||||||
|
|
||||||
DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [
|
DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [
|
||||||
"microsoft/deberta-base",
|
"microsoft/deberta-base",
|
||||||
"microsoft/deberta-large",
|
"microsoft/deberta-large",
|
||||||
|
"microsoft/deberta-xlarge",
|
||||||
|
"microsoft/deberta-base-mnli",
|
||||||
|
"microsoft/deberta-large-mnli",
|
||||||
|
"microsoft/deberta-xlarge-mnli",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
@@ -54,7 +58,7 @@ class ContextPooler(nn.Module):
|
|||||||
self.dropout = StableDropout(config.pooler_dropout)
|
self.dropout = StableDropout(config.pooler_dropout)
|
||||||
self.config = config
|
self.config = config
|
||||||
|
|
||||||
def forward(self, hidden_states, mask=None):
|
def forward(self, hidden_states):
|
||||||
# We "pool" the model by simply taking the hidden state corresponding
|
# We "pool" the model by simply taking the hidden state corresponding
|
||||||
# to the first token.
|
# to the first token.
|
||||||
|
|
||||||
@@ -79,22 +83,23 @@ class XSoftmax(torch.autograd.Function):
|
|||||||
dim (int): The dimension that will apply softmax
|
dim (int): The dimension that will apply softmax
|
||||||
|
|
||||||
Example::
|
Example::
|
||||||
import torch
|
|
||||||
from transformers.models.deberta import XSoftmax
|
>>> import torch
|
||||||
# Make a tensor
|
>>> from transformers.models.deberta.modeling_deberta import XSoftmax
|
||||||
x = torch.randn([4,20,100])
|
|
||||||
# Create a mask
|
>>> # Make a tensor
|
||||||
mask = (x>0).int()
|
>>> x = torch.randn([4,20,100])
|
||||||
y = XSoftmax.apply(x, mask, dim=-1)
|
|
||||||
|
>>> # Create a mask
|
||||||
|
>>> mask = (x>0).int()
|
||||||
|
|
||||||
|
>>> y = XSoftmax.apply(x, mask, dim=-1)
|
||||||
"""
|
"""
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def forward(self, input, mask, dim):
|
def forward(self, input, mask, dim):
|
||||||
self.dim = dim
|
self.dim = dim
|
||||||
if version.Version(torch.__version__) >= version.Version("1.2.0a"):
|
|
||||||
rmask = ~(mask.bool())
|
rmask = ~(mask.bool())
|
||||||
else:
|
|
||||||
rmask = (1 - mask).byte() # This line is not supported by Onnx tracing.
|
|
||||||
|
|
||||||
output = input.masked_fill(rmask, float("-inf"))
|
output = input.masked_fill(rmask, float("-inf"))
|
||||||
output = torch.softmax(output, self.dim)
|
output = torch.softmax(output, self.dim)
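
Note on the hunk above: `XSoftmax.forward` inverts the mask, fills masked positions with `-inf`, and then applies softmax, so masked positions receive zero weight. The same idea in plain Python, as a sketch rather than the library code: `masked_softmax` is a hypothetical name, it handles one non-empty row, and returning all zeros for a fully masked row is a choice made for this example.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, ignoring positions where `mask` is 0 or False."""
    # Masked positions become -inf, so exp() maps them to exactly 0.0.
    filled = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(filled)
    if peak == float("-inf"):
        # Every position is masked: return zeros instead of dividing by zero.
        return [0.0] * len(filled)
    # Subtracting the peak keeps exp() numerically stable.
    exps = [math.exp(value - peak) for value in filled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(masked_softmax([1.0, 2.0, 3.0], [1, 0, 1]))
```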
|
||||||
@@ -127,10 +132,7 @@ def get_mask(input, local_context):
|
|||||||
mask = local_context.mask if local_context.reuse_mask else None
|
mask = local_context.mask if local_context.reuse_mask else None
|
||||||
|
|
||||||
if dropout > 0 and mask is None:
|
if dropout > 0 and mask is None:
|
||||||
if version.Version(torch.__version__) >= version.Version("1.2.0a"):
|
|
||||||
mask = (1 - torch.empty_like(input).bernoulli_(1 - dropout)).bool()
|
mask = (1 - torch.empty_like(input).bernoulli_(1 - dropout)).bool()
|
||||||
else:
|
|
||||||
mask = (1 - torch.empty_like(input).bernoulli_(1 - dropout)).byte()
|
|
||||||
|
|
||||||
if isinstance(local_context, DropoutContext):
|
if isinstance(local_context, DropoutContext):
|
||||||
if local_context.mask is None:
|
if local_context.mask is None:
|
||||||
@@ -166,9 +168,7 @@ class StableDropout(torch.nn.Module):
|
|||||||
Optimized dropout module for stabilizing the training
|
Optimized dropout module for stabilizing the training
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
|
|
||||||
drop_prob (float): the dropout probabilities
|
drop_prob (float): the dropout probabilities
|
||||||
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(self, drop_prob):
|
def __init__(self, drop_prob):
|
||||||
@@ -183,8 +183,6 @@ class StableDropout(torch.nn.Module):
|
|||||||
|
|
||||||
Args:
|
Args:
|
||||||
x (:obj:`torch.tensor`): The input tensor to apply dropout
|
x (:obj:`torch.tensor`): The input tensor to apply dropout
|
||||||
|
|
||||||
|
|
||||||
"""
|
"""
|
||||||
if self.training and self.drop_prob > 0:
|
if self.training and self.drop_prob > 0:
|
||||||
return XDropout.apply(x, self.get_context())
|
return XDropout.apply(x, self.get_context())
|
||||||
@@ -302,7 +300,7 @@ class DebertaIntermediate(nn.Module):
|
|||||||
|
|
||||||
class DebertaOutput(nn.Module):
|
class DebertaOutput(nn.Module):
|
||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
super(DebertaOutput, self).__init__()
|
super().__init__()
|
||||||
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
|
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
|
||||||
self.LayerNorm = DebertaLayerNorm(config.hidden_size, config.layer_norm_eps)
|
self.LayerNorm = DebertaLayerNorm(config.hidden_size, config.layer_norm_eps)
|
||||||
self.dropout = StableDropout(config.hidden_dropout_prob)
|
self.dropout = StableDropout(config.hidden_dropout_prob)
|
||||||
@@ -317,7 +315,7 @@ class DebertaOutput(nn.Module):
|
|||||||
|
|
||||||
class DebertaLayer(nn.Module):
|
class DebertaLayer(nn.Module):
|
||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
super(DebertaLayer, self).__init__()
|
super().__init__()
|
||||||
self.attention = DebertaAttention(config)
|
self.attention = DebertaAttention(config)
|
||||||
self.intermediate = DebertaIntermediate(config)
|
self.intermediate = DebertaIntermediate(config)
|
||||||
self.output = DebertaOutput(config)
|
self.output = DebertaOutput(config)
|
||||||
@@ -701,7 +699,6 @@ class DebertaEmbeddings(nn.Module):
|
|||||||
self.embed_proj = nn.Linear(self.embedding_size, config.hidden_size, bias=False)
|
self.embed_proj = nn.Linear(self.embedding_size, config.hidden_size, bias=False)
|
||||||
self.LayerNorm = DebertaLayerNorm(config.hidden_size, config.layer_norm_eps)
|
self.LayerNorm = DebertaLayerNorm(config.hidden_size, config.layer_norm_eps)
|
||||||
self.dropout = StableDropout(config.hidden_dropout_prob)
|
self.dropout = StableDropout(config.hidden_dropout_prob)
|
||||||
self.output_to_half = False
|
|
||||||
self.config = config
|
self.config = config
|
||||||
|
|
||||||
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
|
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
|
||||||
@@ -763,6 +760,11 @@ class DebertaPreTrainedModel(PreTrainedModel):
|
|||||||
config_class = DebertaConfig
|
config_class = DebertaConfig
|
||||||
base_model_prefix = "deberta"
|
base_model_prefix = "deberta"
|
||||||
_keys_to_ignore_on_load_missing = ["position_ids"]
|
_keys_to_ignore_on_load_missing = ["position_ids"]
|
||||||
|
_keys_to_ignore_on_load_unexpected = ["position_embeddings"]
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__(config)
|
||||||
|
self._register_load_state_dict_pre_hook(self._pre_load_hook)
|
||||||
|
|
||||||
def _init_weights(self, module):
|
def _init_weights(self, module):
|
||||||
""" Initialize the weights """
|
""" Initialize the weights """
|
||||||
@@ -773,6 +775,25 @@ class DebertaPreTrainedModel(PreTrainedModel):
|
|||||||
if isinstance(module, nn.Linear) and module.bias is not None:
|
if isinstance(module, nn.Linear) and module.bias is not None:
|
||||||
module.bias.data.zero_()
|
module.bias.data.zero_()
|
||||||
|
|
||||||
|
def _pre_load_hook(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
|
||||||
|
"""
|
||||||
|
Removes the classifier if it doesn't have the correct number of labels.
|
||||||
|
"""
|
||||||
|
self_state = self.state_dict()
|
||||||
|
if (
|
||||||
|
("classifier.weight" in self_state)
|
||||||
|
and ("classifier.weight" in state_dict)
|
||||||
|
and self_state["classifier.weight"].size() != state_dict["classifier.weight"].size()
|
||||||
|
):
|
||||||
|
logger.warning(
|
||||||
|
f"The checkpoint classifier head has a shape {state_dict['classifier.weight'].size()} and this model "
|
||||||
|
f"classifier head has a shape {self_state['classifier.weight'].size()}. Ignoring the checkpoint "
|
||||||
|
f"weights. You should train your model on new data."
|
||||||
|
)
|
||||||
|
del state_dict["classifier.weight"]
|
||||||
|
if "classifier.bias" in state_dict:
|
||||||
|
del state_dict["classifier.bias"]
|
||||||
|
|
||||||
|
|
||||||
DEBERTA_START_DOCSTRING = r"""
|
DEBERTA_START_DOCSTRING = r"""
|
||||||
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
||||||
@@ -867,7 +888,7 @@ class DebertaModel(DebertaPreTrainedModel):
|
|||||||
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
||||||
@add_code_sample_docstrings(
|
@add_code_sample_docstrings(
|
||||||
tokenizer_class=_TOKENIZER_FOR_DOC,
|
tokenizer_class=_TOKENIZER_FOR_DOC,
|
||||||
checkpoint="microsoft/deberta-base",
|
checkpoint=_CHECKPOINT_FOR_DOC,
|
||||||
output_type=SequenceClassifierOutput,
|
output_type=SequenceClassifierOutput,
|
||||||
config_class=_CONFIG_FOR_DOC,
|
config_class=_CONFIG_FOR_DOC,
|
||||||
)
|
)
|
||||||
@@ -953,7 +974,6 @@ class DebertaModel(DebertaPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings("""DeBERTa Model with a `language modeling` head on top. """, DEBERTA_START_DOCSTRING)
|
@add_start_docstrings("""DeBERTa Model with a `language modeling` head on top. """, DEBERTA_START_DOCSTRING)
|
||||||
class DebertaForMaskedLM(DebertaPreTrainedModel):
|
class DebertaForMaskedLM(DebertaPreTrainedModel):
|
||||||
|
|
||||||
_keys_to_ignore_on_load_unexpected = [r"pooler"]
|
_keys_to_ignore_on_load_unexpected = [r"pooler"]
|
||||||
_keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
|
_keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
|
||||||
|
|
||||||
@@ -974,7 +994,7 @@ class DebertaForMaskedLM(DebertaPreTrainedModel):
|
|||||||
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
||||||
@add_code_sample_docstrings(
|
@add_code_sample_docstrings(
|
||||||
tokenizer_class=_TOKENIZER_FOR_DOC,
|
tokenizer_class=_TOKENIZER_FOR_DOC,
|
||||||
checkpoint="microsoft/deberta-base",
|
checkpoint=_CHECKPOINT_FOR_DOC,
|
||||||
output_type=MaskedLMOutput,
|
output_type=MaskedLMOutput,
|
||||||
config_class=_CONFIG_FOR_DOC,
|
config_class=_CONFIG_FOR_DOC,
|
||||||
)
|
)
|
||||||
@@ -1114,7 +1134,7 @@ class DebertaForSequenceClassification(DebertaPreTrainedModel):
|
|||||||
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
||||||
@add_code_sample_docstrings(
|
@add_code_sample_docstrings(
|
||||||
tokenizer_class=_TOKENIZER_FOR_DOC,
|
tokenizer_class=_TOKENIZER_FOR_DOC,
|
||||||
checkpoint="microsoft/deberta-base",
|
checkpoint=_CHECKPOINT_FOR_DOC,
|
||||||
output_type=SequenceClassifierOutput,
|
output_type=SequenceClassifierOutput,
|
||||||
config_class=_CONFIG_FOR_DOC,
|
config_class=_CONFIG_FOR_DOC,
|
||||||
)
|
)
|
||||||
@@ -1194,7 +1214,6 @@ class DebertaForSequenceClassification(DebertaPreTrainedModel):
|
|||||||
DEBERTA_START_DOCSTRING,
|
DEBERTA_START_DOCSTRING,
|
||||||
)
|
)
|
||||||
class DebertaForTokenClassification(DebertaPreTrainedModel):
|
class DebertaForTokenClassification(DebertaPreTrainedModel):
|
||||||
|
|
||||||
_keys_to_ignore_on_load_unexpected = [r"pooler"]
|
_keys_to_ignore_on_load_unexpected = [r"pooler"]
|
||||||
|
|
||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
@@ -1210,7 +1229,7 @@ class DebertaForTokenClassification(DebertaPreTrainedModel):
|
|||||||
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
||||||
@add_code_sample_docstrings(
|
@add_code_sample_docstrings(
|
||||||
tokenizer_class=_TOKENIZER_FOR_DOC,
|
tokenizer_class=_TOKENIZER_FOR_DOC,
|
||||||
checkpoint="microsoft/deberta-base",
|
checkpoint=_CHECKPOINT_FOR_DOC,
|
||||||
output_type=TokenClassifierOutput,
|
output_type=TokenClassifierOutput,
|
||||||
config_class=_CONFIG_FOR_DOC,
|
config_class=_CONFIG_FOR_DOC,
|
||||||
)
|
)
|
||||||
@@ -1283,7 +1302,6 @@ class DebertaForTokenClassification(DebertaPreTrainedModel):
|
|||||||
DEBERTA_START_DOCSTRING,
|
DEBERTA_START_DOCSTRING,
|
||||||
)
|
)
|
||||||
class DebertaForQuestionAnswering(DebertaPreTrainedModel):
|
class DebertaForQuestionAnswering(DebertaPreTrainedModel):
|
||||||
|
|
||||||
_keys_to_ignore_on_load_unexpected = [r"pooler"]
|
_keys_to_ignore_on_load_unexpected = [r"pooler"]
|
||||||
|
|
||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
@@ -1298,7 +1316,7 @@ class DebertaForQuestionAnswering(DebertaPreTrainedModel):
|
|||||||
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
@add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
||||||
@add_code_sample_docstrings(
|
@add_code_sample_docstrings(
|
||||||
tokenizer_class=_TOKENIZER_FOR_DOC,
|
tokenizer_class=_TOKENIZER_FOR_DOC,
|
||||||
checkpoint="microsoft/deberta-base",
|
checkpoint=_CHECKPOINT_FOR_DOC,
|
||||||
output_type=QuestionAnsweringModelOutput,
|
output_type=QuestionAnsweringModelOutput,
|
||||||
config_class=_CONFIG_FOR_DOC,
|
config_class=_CONFIG_FOR_DOC,
|
||||||
)
|
)
|
||||||
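Reviewer sketch, not part of the commit: the `_pre_load_hook` added to `DebertaPreTrainedModel` above drops the checkpoint's classifier weights when their shape differs from the model's classifier head. The snippet below illustrates that rule with plain dictionaries. The function name `drop_mismatched_classifier` and the use of shape tuples in place of tensors are assumptions made for illustration only.

```python
def drop_mismatched_classifier(model_shapes, checkpoint_shapes):
    """Delete classifier entries from the checkpoint when the head shapes differ.

    Both arguments are hypothetical stand-ins that map parameter names to shape
    tuples. Returns True when the checkpoint mapping was modified.
    """
    key = "classifier.weight"
    if (
        key in model_shapes
        and key in checkpoint_shapes
        and model_shapes[key] != checkpoint_shapes[key]
    ):
        del checkpoint_shapes[key]
        # The bias belongs to the same head, so it is dropped with the weight.
        checkpoint_shapes.pop("classifier.bias", None)
        return True
    return False


# A 2-label model loading a 3-label checkpoint: the head is discarded.
model = {"classifier.weight": (2, 768), "classifier.bias": (2,)}
checkpoint = {
    "encoder.weight": (768, 768),
    "classifier.weight": (3, 768),
    "classifier.bias": (3,),
}
changed = drop_mismatched_classifier(model, checkpoint)
```

After the call, only the encoder entry remains in `checkpoint`, which mirrors why the hook's warning tells the user to train the model on new data.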
|
|||||||
@@ -44,12 +44,20 @@ PRETRAINED_VOCAB_FILES_MAP = {
|
|||||||
"vocab_file": {
|
"vocab_file": {
|
||||||
"microsoft/deberta-base": "https://huggingface.co/microsoft/deberta-base/resolve/main/bpe_encoder.bin",
|
"microsoft/deberta-base": "https://huggingface.co/microsoft/deberta-base/resolve/main/bpe_encoder.bin",
|
||||||
"microsoft/deberta-large": "https://huggingface.co/microsoft/deberta-large/resolve/main/bpe_encoder.bin",
|
"microsoft/deberta-large": "https://huggingface.co/microsoft/deberta-large/resolve/main/bpe_encoder.bin",
|
||||||
|
"microsoft/deberta-xlarge": "https://huggingface.co/microsoft/deberta-xlarge/resolve/main/bpe_encoder.bin",
|
||||||
|
"microsoft/deberta-base-mnli": "https://huggingface.co/microsoft/deberta-base-mnli/resolve/main/bpe_encoder.bin",
|
||||||
|
"microsoft/deberta-large-mnli": "https://huggingface.co/microsoft/deberta-large-mnli/resolve/main/bpe_encoder.bin",
|
||||||
|
"microsoft/deberta-xlarge-mnli": "https://huggingface.co/microsoft/deberta-xlarge-mnli/resolve/main/bpe_encoder.bin",
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||||
"microsoft/deberta-base": 512,
|
"microsoft/deberta-base": 512,
|
||||||
"microsoft/deberta-large": 512,
|
"microsoft/deberta-large": 512,
|
||||||
|
"microsoft/deberta-xlarge": 512,
|
||||||
|
"microsoft/deberta-base-mnli": 512,
|
||||||
|
"microsoft/deberta-large-mnli": 512,
|
||||||
|
"microsoft/deberta-xlarge-mnli": 512,
|
||||||
}
|
}
|
||||||
|
|
||||||
PRETRAINED_INIT_CONFIGURATION = {
|
PRETRAINED_INIT_CONFIGURATION = {
|
||||||
|
|||||||
72
src/transformers/models/deberta_v2/__init__.py
Normal file
@@ -0,0 +1,72 @@
|
|||||||
|
# flake8: noqa
|
||||||
|
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||||
|
# module, but to preserve other warnings. So, don't check this module at all.
|
||||||
|
|
||||||
|
# Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
from typing import TYPE_CHECKING
|
||||||
|
|
||||||
|
from ...file_utils import _BaseLazyModule, is_torch_available
|
||||||
|
|
||||||
|
|
||||||
|
_import_structure = {
|
||||||
|
"configuration_deberta_v2": ["DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaV2Config"],
|
||||||
|
"tokenization_deberta_v2": ["DebertaV2Tokenizer"],
|
||||||
|
}
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
_import_structure["modeling_deberta_v2"] = [
|
||||||
|
"DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||||
|
"DebertaV2ForSequenceClassification",
|
||||||
|
"DebertaV2Model",
|
||||||
|
"DebertaV2ForMaskedLM",
|
||||||
|
"DebertaV2PreTrainedModel",
|
||||||
|
"DebertaV2ForTokenClassification",
|
||||||
|
"DebertaV2ForQuestionAnswering",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .configuration_deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config
|
||||||
|
from .tokenization_deberta_v2 import DebertaV2Tokenizer
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
from .modeling_deberta_v2 import (
|
||||||
|
DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||||
|
DebertaV2ForMaskedLM,
|
||||||
|
DebertaV2ForQuestionAnswering,
|
||||||
|
DebertaV2ForSequenceClassification,
|
||||||
|
DebertaV2ForTokenClassification,
|
||||||
|
DebertaV2Model,
|
||||||
|
DebertaV2PreTrainedModel,
|
||||||
|
)
|
||||||
|
|
||||||
|
else:
|
||||||
|
import importlib
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
|
||||||
|
class _LazyModule(_BaseLazyModule):
|
||||||
|
"""
|
||||||
|
Module class that surfaces all objects but only performs associated imports when the objects are requested.
|
||||||
|
"""
|
||||||
|
|
||||||
|
__file__ = globals()["__file__"]
|
||||||
|
__path__ = [os.path.dirname(__file__)]
|
||||||
|
|
||||||
|
def _get_module(self, module_name: str):
|
||||||
|
return importlib.import_module("." + module_name, self.__name__)
|
||||||
|
|
||||||
|
sys.modules[__name__] = _LazyModule(__name__, _import_structure)
|
||||||
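Reviewer sketch, not part of the commit: the `__init__.py` above defers the real imports until an object is first requested. The snippet below shows the general idea with the standard library only. The class `LazyModule` and its attribute `_object_to_module` are hypothetical names, and this is not the `_BaseLazyModule` implementation from `file_utils`, which is not shown in this diff.

```python
import importlib
from types import ModuleType


class LazyModule(ModuleType):
    """Hypothetical stand-in: imports happen on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported object name to the module that defines it.
        self._object_to_module = {
            obj: mod for mod, objs in import_structure.items() for obj in objs
        }

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. before the object is cached.
        if name not in self._object_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {name!r}")
        module = importlib.import_module(self._object_to_module[name])
        value = getattr(module, name)
        setattr(self, name, value)  # cache so later lookups skip __getattr__
        return value


# Stdlib modules play the role of configuration_deberta_v2 / modeling_deberta_v2.
lazy = LazyModule("lazy_demo", {"json": ["dumps"], "fractions": ["Fraction"]})
```

Accessing `lazy.Fraction` triggers the import of `fractions` and caches the result on the module object, so the cost is paid once and only for the names that are actually used.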
138
src/transformers/models/deberta_v2/configuration_deberta_v2.py
Normal file
@@ -0,0 +1,138 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2020, Microsoft and the HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" DeBERTa-v2 model configuration """
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
from ...utils import logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||||
|
"microsoft/deberta-v2-xlarge": "https://huggingface.co/microsoft/deberta-v2-xlarge/resolve/main/config.json",
|
||||||
|
"microsoft/deberta-v2-xxlarge": "https://huggingface.co/microsoft/deberta-v2-xxlarge/resolve/main/config.json",
|
||||||
|
"microsoft/deberta-v2-xlarge-mnli": "https://huggingface.co/microsoft/deberta-v2-xlarge-mnli/resolve/main/config.json",
|
||||||
|
"microsoft/deberta-v2-xxlarge-mnli": "https://huggingface.co/microsoft/deberta-v2-xxlarge-mnli/resolve/main/config.json",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2Config(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a :class:`~transformers.DebertaV2Model`. It is used
|
||||||
|
to instantiate a DeBERTa-v2 model according to the specified arguments, defining the model architecture.
|
||||||
|
Instantiating a configuration with the defaults will yield a similar configuration to that of the DeBERTa
|
||||||
|
`microsoft/deberta-v2-xlarge <https://huggingface.co/microsoft/deberta-v2-xlarge>`__ architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
|
||||||
|
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
|
||||||
|
|
||||||
|
Arguments:
|
||||||
|
vocab_size (:obj:`int`, `optional`, defaults to 128100):
|
||||||
|
Vocabulary size of the DeBERTa-v2 model. Defines the number of different tokens that can be represented by
|
||||||
|
the :obj:`input_ids` passed when calling :class:`~transformers.DebertaV2Model`.
|
||||||
|
hidden_size (:obj:`int`, `optional`, defaults to 1536):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (:obj:`int`, `optional`, defaults to 24):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
intermediate_size (:obj:`int`, `optional`, defaults to 6144):
|
||||||
|
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
|
||||||
|
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string,
|
||||||
|
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"`, :obj:`"tanh"`, :obj:`"gelu_fast"`,
|
||||||
|
:obj:`"mish"`, :obj:`"linear"`, :obj:`"sigmoid"` and :obj:`"gelu_new"` are supported.
|
||||||
|
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
|
||||||
|
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||||
|
just in case (e.g., 512 or 1024 or 2048).
|
||||||
|
type_vocab_size (:obj:`int`, `optional`, defaults to 0):
|
||||||
|
The vocabulary size of the :obj:`token_type_ids` passed when calling
|
||||||
|
:class:`~transformers.DebertaV2Model`.
|
||||||
|
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-7):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
relative_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
|
Whether to use relative position encoding.
|
||||||
|
max_relative_positions (:obj:`int`, `optional`, defaults to -1):
|
||||||
|
The range of relative positions :obj:`[-max_position_embeddings, max_position_embeddings]`. Use the same
|
||||||
|
value as :obj:`max_position_embeddings`.
|
||||||
|
pad_token_id (:obj:`int`, `optional`, defaults to 0):
|
||||||
|
The value used to pad input_ids.
|
||||||
|
position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether to add absolute position embeddings to the content embeddings.
|
||||||
|
pos_att_type (:obj:`List[str]`, `optional`):
|
||||||
|
The type of relative position attention, it can be a combination of :obj:`["p2c", "c2p", "p2p"]`, e.g.
|
||||||
|
:obj:`["p2c"]`, :obj:`["p2c", "c2p"]`, :obj:`["p2c", "c2p", "p2p"]`.
|
||||||
|
"""
|
||||||
|
model_type = "deberta-v2"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_size=128100,
|
||||||
|
hidden_size=1536,
|
||||||
|
num_hidden_layers=24,
|
||||||
|
num_attention_heads=24,
|
||||||
|
intermediate_size=6144,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=0,
|
||||||
|
initializer_range=0.02,
|
||||||
|
layer_norm_eps=1e-7,
|
||||||
|
relative_attention=False,
|
||||||
|
max_relative_positions=-1,
|
||||||
|
pad_token_id=0,
|
||||||
|
position_biased_input=True,
|
||||||
|
pos_att_type=None,
|
||||||
|
pooler_dropout=0,
|
||||||
|
pooler_hidden_act="gelu",
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.relative_attention = relative_attention
|
||||||
|
self.max_relative_positions = max_relative_positions
|
||||||
|
self.pad_token_id = pad_token_id
|
||||||
|
self.position_biased_input = position_biased_input
|
||||||
|
|
||||||
|
# Backwards compatibility
|
||||||
|
if isinstance(pos_att_type, str):
|
||||||
|
pos_att_type = [x.strip() for x in pos_att_type.lower().split("|")]
|
||||||
|
|
||||||
|
self.pos_att_type = pos_att_type
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
|
||||||
|
self.pooler_hidden_size = kwargs.get("pooler_hidden_size", hidden_size)
|
||||||
|
self.pooler_dropout = pooler_dropout
|
||||||
|
self.pooler_hidden_act = pooler_hidden_act
|
||||||
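Reviewer sketch, not part of the commit: the "Backwards compatibility" branch in `DebertaV2Config.__init__` above accepts `pos_att_type` either as a list or as a single `|`-separated string. The helper below isolates that normalization so its effect is easy to see. The name `normalize_pos_att_type` is hypothetical and does not exist in the library.

```python
def normalize_pos_att_type(pos_att_type):
    """Return pos_att_type as a list, splitting legacy '|'-separated strings.

    Hypothetical helper that mirrors the normalization shown in the config
    constructor; lists and None pass through unchanged.
    """
    if isinstance(pos_att_type, str):
        # Lowercase first, then split on '|' and trim whitespace around each part.
        return [x.strip() for x in pos_att_type.lower().split("|")]
    return pos_att_type


legacy = normalize_pos_att_type("P2C | c2p")
modern = normalize_pos_att_type(["p2c", "c2p"])
unset = normalize_pos_att_type(None)
```

Both `legacy` and `modern` end up as the same list, so downstream code only has to handle the list form (or `None` when the option is left unset).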
1516
src/transformers/models/deberta_v2/modeling_deberta_v2.py
Normal file
File diff suppressed because it is too large
491
src/transformers/models/deberta_v2/tokenization_deberta_v2.py
Normal file
@@ -0,0 +1,491 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2020 Microsoft and the HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Tokenization class for model DeBERTa-v2."""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import unicodedata
|
||||||
|
from typing import Optional, Tuple
|
||||||
|
|
||||||
|
import sentencepiece as sp
|
||||||
|
import six
|
||||||
|
|
||||||
|
from ...tokenization_utils import PreTrainedTokenizer
|
||||||
|
|
||||||
|
|
||||||
|
PRETRAINED_VOCAB_FILES_MAP = {
|
||||||
|
"vocab_file": {
|
||||||
|
"microsoft/deberta-v2-xlarge": "https://huggingface.co/microsoft/deberta-v2-xlarge/resolve/main/spm.model",
|
||||||
|
"microsoft/deberta-v2-xxlarge": "https://huggingface.co/microsoft/deberta-v2-xxlarge/resolve/main/spm.model",
|
||||||
|
"microsoft/deberta-v2-xlarge-mnli": "https://huggingface.co/microsoft/deberta-v2-xlarge-mnli/resolve/main/spm.model",
|
||||||
|
"microsoft/deberta-v2-xxlarge-mnli": "https://huggingface.co/microsoft/deberta-v2-xxlarge-mnli/resolve/main/spm.model",
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||||
|
"microsoft/deberta-v2-xlarge": 512,
|
||||||
|
"microsoft/deberta-v2-xxlarge": 512,
|
||||||
|
"microsoft/deberta-v2-xlarge-mnli": 512,
|
||||||
|
"microsoft/deberta-v2-xxlarge-mnli": 512,
|
||||||
|
}
|
||||||
|
|
||||||
|
PRETRAINED_INIT_CONFIGURATION = {
|
||||||
|
"microsoft/deberta-v2-xlarge": {"do_lower_case": False},
|
||||||
|
"microsoft/deberta-v2-xxlarge": {"do_lower_case": False},
|
||||||
|
"microsoft/deberta-v2-xlarge-mnli": {"do_lower_case": False},
|
||||||
|
"microsoft/deberta-v2-xxlarge-mnli": {"do_lower_case": False},
|
||||||
|
}
|
||||||
|
|
||||||
|
VOCAB_FILES_NAMES = {"vocab_file": "spm.model"}
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2Tokenizer(PreTrainedTokenizer):
|
||||||
|
r"""
|
||||||
|
Constructs a DeBERTa-v2 tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_file (:obj:`str`):
|
||||||
|
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
|
||||||
|
contains the vocabulary necessary to instantiate a tokenizer.
|
||||||
|
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
|
Whether or not to lowercase the input when tokenizing.
|
||||||
|
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
|
||||||
|
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
|
||||||
|
token instead.
|
||||||
|
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
|
||||||
|
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
|
||||||
|
sequence classification or for a text and a question for question answering. It is also used as the last
|
||||||
|
token of a sequence built with special tokens.
|
||||||
|
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
|
||||||
|
The token used for padding, for example when batching sequences of different lengths.
|
||||||
|
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
|
||||||
|
The classifier token which is used when doing sequence classification (classification of the whole sequence
|
||||||
|
instead of per-token classification). It is the first token of the sequence when built with special tokens.
|
||||||
|
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
|
||||||
|
The token used for masking values. This is the token used when training this model with masked language
|
||||||
|
modeling. This is the token which the model will try to predict.
|
||||||
|
"""
|
||||||
|
|
||||||
|
vocab_files_names = VOCAB_FILES_NAMES
|
||||||
|
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||||
|
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
|
||||||
|
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_file,
|
||||||
|
do_lower_case=False,
|
||||||
|
split_by_punct=False,
|
||||||
|
unk_token="[UNK]",
|
||||||
|
sep_token="[SEP]",
|
||||||
|
pad_token="[PAD]",
|
||||||
|
cls_token="[CLS]",
|
||||||
|
mask_token="[MASK]",
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(
|
||||||
|
do_lower_case=do_lower_case,
|
||||||
|
unk_token=unk_token,
|
||||||
|
sep_token=sep_token,
|
||||||
|
pad_token=pad_token,
|
||||||
|
cls_token=cls_token,
|
||||||
|
mask_token=mask_token,
|
||||||
|
split_by_punct=split_by_punct,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not os.path.isfile(vocab_file):
|
||||||
|
raise ValueError(
|
||||||
|
"Can't find a vocabulary file at path '{}'. To load the vocabulary from a pretrained "
|
||||||
|
"model use `tokenizer = DebertaV2Tokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)
|
||||||
|
)
|
||||||
|
self.do_lower_case = do_lower_case
|
||||||
|
self.split_by_punct = split_by_punct
|
||||||
|
self._tokenizer = SPMTokenizer(vocab_file, split_by_punct=split_by_punct)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def vocab_size(self):
|
||||||
|
return len(self.vocab)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def vocab(self):
|
||||||
|
return self._tokenizer.vocab
|
||||||
|
|
||||||
|
def get_vocab(self):
|
||||||
|
vocab = self.vocab.copy()
|
||||||
|
vocab.update(self.get_added_vocab())
|
||||||
|
return vocab
|
||||||
|
|
||||||
|
def _tokenize(self, text):
|
||||||
|
"""Take as input a string and return a list of strings (tokens) for words/sub-words"""
|
||||||
|
if self.do_lower_case:
|
||||||
|
text = text.lower()
|
||||||
|
return self._tokenizer.tokenize(text)
|
||||||
|
|
||||||
|
def _convert_token_to_id(self, token):
|
||||||
|
""" Converts a token (str) to an id using the vocab. """
|
||||||
|
return self._tokenizer.spm.PieceToId(token)
|
||||||
|
|
||||||
|
def _convert_id_to_token(self, index):
|
||||||
|
"""Converts an index (integer) to a token (str) using the vocab."""
|
||||||
|
return self._tokenizer.spm.IdToPiece(index) if index < self.vocab_size else self.unk_token
|
||||||
|
|
||||||
|
def convert_tokens_to_string(self, tokens):
|
||||||
|
""" Converts a sequence of tokens (strings) into a single string. """
|
||||||
|
return self._tokenizer.decode(tokens)
|
||||||
|
|
||||||
|
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||||
|
"""
|
||||||
|
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
|
||||||
|
adding special tokens. A DeBERTa sequence has the following format:
|
||||||
|
|
||||||
|
- single sequence: [CLS] X [SEP]
|
||||||
|
- pair of sequences: [CLS] A [SEP] B [SEP]
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0 (:obj:`List[int]`):
|
||||||
|
List of IDs to which the special tokens will be added.
|
||||||
|
token_ids_1 (:obj:`List[int]`, `optional`):
|
||||||
|
Optional second list of IDs for sequence pairs.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
||||||
|
cls = [self.cls_token_id]
|
||||||
|
sep = [self.sep_token_id]
|
||||||
|
return cls + token_ids_0 + sep + token_ids_1 + sep
|
||||||
|
|
||||||
|
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||||
|
"""
|
||||||
|
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||||
|
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0 (:obj:`List[int]`):
|
||||||
|
List of IDs.
|
||||||
|
token_ids_1 (:obj:`List[int]`, `optional`):
|
||||||
|
Optional second list of IDs for sequence pairs.
|
||||||
|
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
|
Whether or not the token list is already formatted with special tokens for the model.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if already_has_special_tokens:
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
raise ValueError(
|
||||||
|
"You should not supply a second sequence if the provided sequence of "
|
||||||
|
"ids is already formatted with special tokens for the model."
|
||||||
|
)
|
||||||
|
return list(
|
||||||
|
map(
|
||||||
|
lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0,
|
||||||
|
token_ids_0,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||||
|
|
||||||
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
|
"""
|
||||||
|
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A DeBERTa
|
||||||
|
sequence pair mask has the following format:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
|
||||||
|
| first sequence | second sequence |
|
||||||
|
|
||||||
|
If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0 (:obj:`List[int]`):
|
||||||
|
List of IDs.
|
||||||
|
token_ids_1 (:obj:`List[int]`, `optional`):
|
||||||
|
Optional second list of IDs for sequence pairs.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
|
||||||
|
sequence(s).
|
||||||
|
"""
|
||||||
|
sep = [self.sep_token_id]
|
||||||
|
cls = [self.cls_token_id]
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return len(cls + token_ids_0 + sep) * [0]
|
||||||
|
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||||
|
|
||||||
|
def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
|
||||||
|
add_prefix_space = kwargs.pop("add_prefix_space", False)
|
||||||
|
if is_split_into_words or add_prefix_space:
|
||||||
|
text = " " + text
|
||||||
|
return (text, kwargs)
|
||||||
|
|
||||||
|
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
|
||||||
|
return self._tokenizer.save_pretrained(save_directory, filename_prefix=filename_prefix)
|
||||||
|
|
||||||
|
|
||||||
|
class SPMTokenizer:
|
||||||
|
def __init__(self, vocab_file, split_by_punct=False):
|
||||||
|
self.split_by_punct = split_by_punct
|
||||||
|
self.vocab_file = vocab_file
|
||||||
|
spm = sp.SentencePieceProcessor()
|
||||||
|
assert os.path.exists(vocab_file)
|
||||||
|
spm.load(vocab_file)
|
||||||
|
bpe_vocab_size = spm.GetPieceSize()
|
||||||
|
# Token map
|
||||||
|
# <unk> 0+1
|
||||||
|
# <s> 1+1
|
||||||
|
# </s> 2+1
|
||||||
|
self.vocab = {spm.IdToPiece(i): i for i in range(bpe_vocab_size)}
|
||||||
|
self.id_to_tokens = [spm.IdToPiece(i) for i in range(bpe_vocab_size)]
|
||||||
|
# self.vocab['[PAD]'] = 0
|
||||||
|
# self.vocab['[CLS]'] = 1
|
||||||
|
# self.vocab['[SEP]'] = 2
|
||||||
|
# self.vocab['[UNK]'] = 3
|
||||||
|
|
||||||
|
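# Needed by add_special_token() and part_of_whole_word(), which read this list.
self.special_tokens = []
self.spm = spm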
|
||||||
|
|
||||||
|
def __getstate__(self):
|
||||||
|
state = self.__dict__.copy()
|
||||||
|
state["spm"] = None
|
||||||
|
return state
|
||||||
|
|
||||||
|
def __setstate__(self, d):
|
||||||
|
self.__dict__ = d
|
||||||
|
self.spm = sp.SentencePieceProcessor()
|
||||||
|
self.spm.Load(self.vocab_file)
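The `__getstate__` / `__setstate__` pair above drops the native SentencePiece handle before pickling and rebuilds it from `vocab_file` afterwards. A minimal sketch of the same pattern follows; `FakeProcessor` and `Wrapper` are hypothetical stand-ins so the example runs without SentencePiece installed.

```python
import pickle


class FakeProcessor:
    """Hypothetical stand-in for a native handle that cannot be pickled."""

    def __init__(self, path):
        self.path = path

    def __reduce__(self):
        raise TypeError("cannot pickle FakeProcessor")


class Wrapper:
    def __init__(self, vocab_file):
        self.vocab_file = vocab_file
        self.proc = FakeProcessor(vocab_file)

    def __getstate__(self):
        # Copy the attributes and blank out the unpicklable handle.
        state = self.__dict__.copy()
        state["proc"] = None
        return state

    def __setstate__(self, state):
        # Restore the attributes, then rebuild the handle from the stored path.
        self.__dict__ = state
        self.proc = FakeProcessor(self.vocab_file)


restored = pickle.loads(pickle.dumps(Wrapper("spm.model")))
print(type(restored.proc).__name__, restored.proc.path)
```

One consequence of this design is that the vocabulary file must still exist at the same path when the object is unpickled.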
|
||||||
|
|
||||||
|
def tokenize(self, text):
|
||||||
|
pieces = self._encode_as_pieces(text)
|
||||||
|
|
||||||
|
def _norm(x):
|
||||||
|
if x not in self.vocab or x == "<unk>":
|
||||||
|
return "[UNK]"
|
||||||
|
else:
|
||||||
|
return x
|
||||||
|
|
||||||
|
pieces = [_norm(p) for p in pieces]
|
||||||
|
return pieces
|
||||||
|
|
||||||
|
def convert_ids_to_tokens(self, ids):
|
||||||
|
tokens = []
|
||||||
|
for i in ids:
|
||||||
|
tokens.append(self.id_to_tokens[i])
|
||||||
|
return tokens
|
||||||
|
|
||||||
|
def decode(self, tokens, start=-1, end=-1, raw_text=None):
|
||||||
|
if raw_text is None:
|
||||||
|
return self.spm.decode_pieces([t for t in tokens])
|
||||||
|
else:
|
||||||
|
words = self.split_to_words(raw_text)
|
||||||
|
word_tokens = [self.tokenize(w) for w in words]
|
||||||
|
token2words = [0] * len(tokens)
|
||||||
|
tid = 0
|
||||||
|
for i, w in enumerate(word_tokens):
|
||||||
|
for k, t in enumerate(w):
|
||||||
|
token2words[tid] = i
|
||||||
|
tid += 1
|
||||||
|
word_start = token2words[start]
|
||||||
|
word_end = token2words[end] if end < len(tokens) else len(words)
|
||||||
|
text = "".join(words[word_start:word_end])
|
||||||
|
return text
|
||||||
|
|
||||||
|
def add_special_token(self, token):
|
||||||
|
if token not in self.special_tokens:
|
||||||
|
self.special_tokens.append(token)
|
||||||
|
if token not in self.vocab:
|
||||||
|
self.vocab[token] = len(self.vocab)
|
||||||
|
self.id_to_tokens.append(token)
|
||||||
|
return self.id(token)
|
||||||
|
|
||||||
|
def part_of_whole_word(self, token, is_bos=False):
|
||||||
|
if is_bos:
|
||||||
|
return True
|
||||||
|
if (
|
||||||
|
len(token) == 1
|
||||||
|
and (_is_whitespace(list(token)[0]) or _is_control(list(token)[0]) or _is_punctuation(list(token)[0]))
|
||||||
|
) or token in self.special_tokens:
|
||||||
|
return False
|
||||||
|
|
||||||
|
word_start = b"\xe2\x96\x81".decode("utf-8")
|
||||||
|
return not token.startswith(word_start)
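The byte string `b"\xe2\x96\x81"` used above is the UTF-8 encoding of U+2581, the marker SentencePiece puts at the start of a word. The sketch below shows how that marker can be used to group pieces into words; `group_into_words` is a hypothetical helper written for illustration and is not part of the diff.

```python
WORD_START = b"\xe2\x96\x81".decode("utf-8")  # U+2581, the word-start marker


def group_into_words(pieces):
    """Start a new group at each marked piece; unmarked pieces continue the current word."""
    words = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            words.append([piece])
        else:
            words[-1].append(piece)
    return words


pieces = ["\u2581fal", "s", "\u2581is", "\u2581a", "\u2581test"]
print(group_into_words(pieces))
```

Here `"s"` has no marker, so it is attached to the preceding `"\u2581fal"` group.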
|
||||||
|
|
||||||
|
def pad(self):
|
||||||
|
return "[PAD]"
|
||||||
|
|
||||||
|
def bos(self):
|
||||||
|
return "[CLS]"
|
||||||
|
|
||||||
|
def eos(self):
|
||||||
|
return "[SEP]"
|
||||||
|
|
||||||
|
def unk(self):
|
||||||
|
return "[UNK]"
|
||||||
|
|
||||||
|
def mask(self):
|
||||||
|
return "[MASK]"
|
||||||
|
|
||||||
|
def sym(self, id):
|
||||||
|
return self.id_to_tokens[id]
|
||||||
|
|
||||||
|
def id(self, sym):
|
||||||
|
return self.vocab[sym] if sym in self.vocab else 1
|
||||||
|
|
||||||
|
def _encode_as_pieces(self, text):
|
||||||
|
text = convert_to_unicode(text)
|
||||||
|
if self.split_by_punct:
|
||||||
|
words = self._run_split_on_punc(text)
|
||||||
|
pieces = [self.spm.encode_as_pieces(w) for w in words]
|
||||||
|
return [p for w in pieces for p in w]
|
||||||
|
else:
|
||||||
|
return self.spm.encode_as_pieces(text)
|
||||||
|
|
||||||
|
def split_to_words(self, text):
|
||||||
|
pieces = self._encode_as_pieces(text)
|
||||||
|
word_start = b"\xe2\x96\x81".decode("utf-8")
|
||||||
|
words = []
|
||||||
|
offset = 0
|
||||||
|
prev_end = 0
|
||||||
|
for i, p in enumerate(pieces):
|
||||||
|
if p.startswith(word_start):
|
||||||
|
if offset > prev_end:
|
||||||
|
words.append(text[prev_end:offset])
|
||||||
|
prev_end = offset
|
||||||
|
w = p.replace(word_start, "")
|
||||||
|
else:
|
||||||
|
w = p
|
||||||
|
try:
|
||||||
|
s = text.index(w, offset)
|
||||||
|
pn = ""
|
||||||
|
k = i + 1
|
||||||
|
while k < len(pieces):
|
||||||
|
pn = pieces[k].replace(word_start, "")
|
||||||
|
if len(pn) > 0:
|
||||||
|
break
|
||||||
|
k += 1
|
||||||
|
|
||||||
|
if len(pn) > 0 and pn in text[offset:s]:
|
||||||
|
offset = offset + 1
|
||||||
|
else:
|
||||||
|
offset = s + len(w)
|
||||||
|
except Exception:
|
||||||
|
offset = offset + 1
|
||||||
|
|
||||||
|
if prev_end < offset:
|
||||||
|
words.append(text[prev_end:offset])
|
||||||
|
|
||||||
|
return words
|
||||||
|
|
||||||
|
def _run_strip_accents(self, text):
|
||||||
|
"""Strips accents from a piece of text."""
|
||||||
|
text = unicodedata.normalize("NFD", text)
|
||||||
|
output = []
|
||||||
|
for char in text:
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat == "Mn":
|
||||||
|
continue
|
||||||
|
output.append(char)
|
||||||
|
return "".join(output)
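The accent stripping above relies on NFD normalization, which splits an accented letter into a base letter plus a combining mark of category `Mn`. A standalone sketch using only the standard library (the name `strip_accents` is chosen for illustration):

```python
import unicodedata


def strip_accents(text):
    # NFD turns "é" into "e" followed by U+0301 (category Mn), which is then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("falsé"))  # false
```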
|
||||||
|
|
||||||
|
def _run_split_on_punc(self, text):
|
||||||
|
"""Splits punctuation on a piece of text."""
|
||||||
|
chars = list(text)
|
||||||
|
i = 0
|
||||||
|
start_new_word = True
|
||||||
|
output = []
|
||||||
|
while i < len(chars):
|
||||||
|
char = chars[i]
|
||||||
|
if _is_punctuation(char):
|
||||||
|
output.append([char])
|
||||||
|
start_new_word = True
|
||||||
|
else:
|
||||||
|
if start_new_word:
|
||||||
|
output.append([])
|
||||||
|
start_new_word = False
|
||||||
|
output[-1].append(char)
|
||||||
|
i += 1
|
||||||
|
|
||||||
|
return ["".join(x) for x in output]
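The splitting loop above can be exercised on its own. The sketch below simplifies one thing, which is an assumption of this example: punctuation is tested against `string.punctuation` (ASCII only) instead of the `_is_punctuation` helper defined later in the file.

```python
import string


def split_on_punc(text):
    """Each punctuation character becomes its own chunk; other runs stay together."""
    output = []
    start_new_word = True
    for char in text:
        if char in string.punctuation:  # ASCII-only simplification
            output.append([char])
            start_new_word = True
        else:
            if start_new_word:
                output.append([])
                start_new_word = False
            output[-1].append(char)
    return ["".join(chunk) for chunk in output]


print(split_on_punc("Hello, world!"))  # ['Hello', ',', ' world', '!']
```

Whitespace is not treated as a boundary here, so `" world"` keeps its leading space; the SentencePiece model handles spaces in the next step.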
|
||||||
|
|
||||||
|
def save_pretrained(self, path: str, filename_prefix: str = None):
|
||||||
|
filename = VOCAB_FILES_NAMES[list(VOCAB_FILES_NAMES.keys())[0]]
|
||||||
|
if filename_prefix is not None:
|
||||||
|
filename = filename_prefix + "-" + filename
|
||||||
|
full_path = os.path.join(path, filename)
|
||||||
|
with open(full_path, "wb") as fs:
|
||||||
|
fs.write(self.spm.serialized_model_proto())
|
||||||
|
return (full_path,)
|
||||||
|
|
||||||
|
|
||||||
|
def _is_whitespace(char):
"""Checks whether `char` is a whitespace character."""
|
||||||
|
# \t, \n, and \r are technically control characters but we treat them
|
||||||
|
# as whitespace since they are generally considered as such.
|
||||||
|
if char == " " or char == "\t" or char == "\n" or char == "\r":
|
||||||
|
return True
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat == "Zs":
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def _is_control(char):
"""Checks whether `char` is a control character."""
|
||||||
|
# These are technically control characters but we count them as whitespace
|
||||||
|
# characters.
|
||||||
|
if char == "\t" or char == "\n" or char == "\r":
|
||||||
|
return False
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat.startswith("C"):
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def _is_punctuation(char):
"""Checks whether `char` is a punctuation character."""
|
||||||
|
cp = ord(char)
|
||||||
|
# We treat all non-letter/number ASCII as punctuation.
|
||||||
|
# Characters such as "^", "$", and "`" are not in the Unicode
|
||||||
|
# Punctuation class but we treat them as punctuation anyways, for
|
||||||
|
# consistency.
|
||||||
|
if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):
|
||||||
|
return True
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat.startswith("P"):
|
||||||
|
return True
|
||||||
|
return False
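The comment above notes that characters such as "^", "$" and "`" fall outside the Unicode punctuation classes. The sketch below checks this with `unicodedata` and shows why the explicit ASCII ranges are needed; `is_punctuation` is a standalone copy of the same logic written for illustration.

```python
import unicodedata


def is_punctuation(char):
    cp = ord(char)
    # All non-alphanumeric printable ASCII counts as punctuation.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    # Otherwise fall back to the Unicode "P*" categories.
    return unicodedata.category(char).startswith("P")


for ch in "^$`":
    # These are symbol categories (S*), not punctuation (P*), yet they still match.
    print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
```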
|
||||||
|
|
||||||
|
|
||||||
|
def convert_to_unicode(text):
|
||||||
|
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
|
||||||
|
if six.PY3:
|
||||||
|
if isinstance(text, str):
|
||||||
|
return text
|
||||||
|
elif isinstance(text, bytes):
|
||||||
|
return text.decode("utf-8", "ignore")
|
||||||
|
else:
|
||||||
|
raise ValueError("Unsupported string type: %s" % (type(text)))
|
||||||
|
elif six.PY2:
|
||||||
|
if isinstance(text, str):
|
||||||
|
return text.decode("utf-8", "ignore")
|
||||||
|
else:
|
||||||
|
raise ValueError("Unsupported string type: %s" % (type(text)))
|
||||||
|
else:
|
||||||
|
raise ValueError("Not running on Python2 or Python 3?")
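On Python 3 only the first branch of the function above is ever taken. A sketch of that branch in isolation, without the `six` dependency (the name `to_text` is chosen for illustration):

```python
def to_text(value):
    """Return `value` as str, decoding bytes as UTF-8 and dropping invalid bytes."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode("utf-8", "ignore")
    raise ValueError("Unsupported string type: %s" % type(value))


print(to_text(b"fals\xc3\xa9"))  # falsé
print(to_text(b"ab\xff"))        # ab
```

With the `"ignore"` error handler, malformed input is silently shortened rather than rejected, which is convenient for tokenization but can hide encoding problems upstream.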
|
||||||
@@ -883,6 +883,63 @@ class DebertaPreTrainedModel:
|
|||||||
requires_pytorch(self)
|
||||||
|
|
||||||
|
|
||||||
|
DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2ForMaskedLM:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2ForQuestionAnswering:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2ForSequenceClassification:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2ForTokenClassification:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2Model:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
|
||||||
|
class DebertaV2PreTrainedModel:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(self, *args, **kwargs):
|
||||||
|
requires_pytorch(self)
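The dummy classes above exist so that the names can be imported without PyTorch, with the failure deferred to first use. The sketch below imitates that pattern under stated assumptions: `requires_backend`, `DummyModel` and the module name are hypothetical, and the real `requires_pytorch` helper may behave differently.

```python
import importlib.util

MISSING_BACKEND = "a_backend_module_that_is_not_installed"  # hypothetical module name


def requires_backend(obj, module_name):
    """Raise ImportError naming the class if `module_name` cannot be found."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(f"{name} requires the {module_name} library, which was not found.")


class DummyModel:
    def __init__(self, *args, **kwargs):
        requires_backend(self, MISSING_BACKEND)

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backend(cls, MISSING_BACKEND)


try:
    DummyModel.from_pretrained("some-checkpoint")
except ImportError as err:
    print(err)
```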
|
||||||
|
|
||||||
|
|
||||||
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
290
tests/test_modeling_deberta_v2.py
Normal file
@@ -0,0 +1,290 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Microsoft Authors and the HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
import random
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from transformers import is_torch_available
|
||||||
|
from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
|
||||||
|
|
||||||
|
from .test_configuration_common import ConfigTester
|
||||||
|
from .test_modeling_common import ModelTesterMixin, ids_tensor
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import (
|
||||||
|
DebertaV2Config,
|
||||||
|
DebertaV2ForMaskedLM,
|
||||||
|
DebertaV2ForQuestionAnswering,
|
||||||
|
DebertaV2ForSequenceClassification,
|
||||||
|
DebertaV2ForTokenClassification,
|
||||||
|
DebertaV2Model,
|
||||||
|
)
|
||||||
|
from transformers.models.deberta_v2.modeling_deberta_v2 import DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class DebertaV2ModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
|
||||||
|
all_model_classes = (
|
||||||
|
(
|
||||||
|
DebertaV2Model,
|
||||||
|
DebertaV2ForMaskedLM,
|
||||||
|
DebertaV2ForSequenceClassification,
|
||||||
|
DebertaV2ForTokenClassification,
|
||||||
|
DebertaV2ForQuestionAnswering,
|
||||||
|
)
|
||||||
|
if is_torch_available()
|
||||||
|
else ()
|
||||||
|
)
|
||||||
|
|
||||||
|
test_torchscript = False
|
||||||
|
test_pruning = False
|
||||||
|
test_head_masking = False
|
||||||
|
is_encoder_decoder = False
|
||||||
|
|
||||||
|
class DebertaV2ModelTester(object):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_input_mask=True,
|
||||||
|
use_token_type_ids=True,
|
||||||
|
use_labels=True,
|
||||||
|
vocab_size=99,
|
||||||
|
hidden_size=32,
|
||||||
|
num_hidden_layers=5,
|
||||||
|
num_attention_heads=4,
|
||||||
|
intermediate_size=37,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=16,
|
||||||
|
type_sequence_label_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
relative_attention=False,
|
||||||
|
position_biased_input=True,
|
||||||
|
pos_att_type="None",
|
||||||
|
num_labels=3,
|
||||||
|
num_choices=4,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.use_token_type_ids = use_token_type_ids
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.type_sequence_label_size = type_sequence_label_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.num_choices = num_choices
|
||||||
|
self.relative_attention = relative_attention
|
||||||
|
self.position_biased_input = position_biased_input
|
||||||
|
self.pos_att_type = pos_att_type
|
||||||
|
self.scope = scope
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
input_mask = None
|
||||||
|
if self.use_input_mask:
|
||||||
|
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||||
|
|
||||||
|
token_type_ids = None
|
||||||
|
if self.use_token_type_ids:
|
||||||
|
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||||
|
|
||||||
|
sequence_labels = None
|
||||||
|
token_labels = None
|
||||||
|
choice_labels = None
|
||||||
|
if self.use_labels:
|
||||||
|
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||||
|
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||||
|
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||||
|
|
||||||
|
config = DebertaV2Config(
|
||||||
|
vocab_size=self.vocab_size,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
hidden_act=self.hidden_act,
|
||||||
|
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||||
|
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||||
|
max_position_embeddings=self.max_position_embeddings,
|
||||||
|
type_vocab_size=self.type_vocab_size,
|
||||||
|
initializer_range=self.initializer_range,
|
||||||
|
relative_attention=self.relative_attention,
|
||||||
|
position_biased_input=self.position_biased_input,
|
||||||
|
pos_att_type=self.pos_att_type,
|
||||||
|
)
|
||||||
|
|
||||||
|
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
|
||||||
|
def check_loss_output(self, result):
|
||||||
|
self.parent.assertListEqual(list(result.loss.size()), [])
|
||||||
|
|
||||||
|
def create_and_check_deberta_model(
|
||||||
|
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
):
|
||||||
|
model = DebertaV2Model(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
sequence_output = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)[0]
|
||||||
|
sequence_output = model(input_ids, token_type_ids=token_type_ids)[0]
|
||||||
|
sequence_output = model(input_ids)[0]
|
||||||
|
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(sequence_output.size()), [self.batch_size, self.seq_length, self.hidden_size]
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_deberta_for_masked_lm(
|
||||||
|
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
):
|
||||||
|
model = DebertaV2ForMaskedLM(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels)
|
||||||
|
|
||||||
|
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||||
|
|
||||||
|
def create_and_check_deberta_for_sequence_classification(
|
||||||
|
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
):
|
||||||
|
config.num_labels = self.num_labels
|
||||||
|
model = DebertaV2ForSequenceClassification(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=sequence_labels)
|
||||||
|
self.parent.assertListEqual(list(result.logits.size()), [self.batch_size, self.num_labels])
|
||||||
|
self.check_loss_output(result)
|
||||||
|
|
||||||
|
def create_and_check_deberta_for_token_classification(
|
||||||
|
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
):
|
||||||
|
config.num_labels = self.num_labels
|
||||||
|
model = DebertaV2ForTokenClassification(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels)
|
||||||
|
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.num_labels))
|
||||||
|
|
||||||
|
def create_and_check_deberta_for_question_answering(
|
||||||
|
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
):
|
||||||
|
model = DebertaV2ForQuestionAnswering(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(
|
||||||
|
input_ids,
|
||||||
|
attention_mask=input_mask,
|
||||||
|
token_type_ids=token_type_ids,
|
||||||
|
start_positions=sequence_labels,
|
||||||
|
end_positions=sequence_labels,
|
||||||
|
)
|
||||||
|
self.parent.assertEqual(result.start_logits.shape, (self.batch_size, self.seq_length))
|
||||||
|
self.parent.assertEqual(result.end_logits.shape, (self.batch_size, self.seq_length))
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
(
|
||||||
|
config,
|
||||||
|
input_ids,
|
||||||
|
token_type_ids,
|
||||||
|
input_mask,
|
||||||
|
sequence_labels,
|
||||||
|
token_labels,
|
||||||
|
choice_labels,
|
||||||
|
) = config_and_inputs
|
||||||
|
inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = DebertaV2ModelTest.DebertaV2ModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=DebertaV2Config, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_deberta_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_deberta_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_sequence_classification(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_deberta_for_sequence_classification(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_masked_lm(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_deberta_for_masked_lm(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_question_answering(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_deberta_for_question_answering(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_token_classification(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_deberta_for_token_classification(*config_and_inputs)
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
for model_name in DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||||
|
model = DebertaV2Model.from_pretrained(model_name)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_sentencepiece
|
||||||
|
@require_tokenizers
|
||||||
|
class DebertaV2ModelIntegrationTest(unittest.TestCase):
|
||||||
|
@unittest.skip(reason="Model not available yet")
|
||||||
|
def test_inference_masked_lm(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_inference_no_head(self):
|
||||||
|
random.seed(0)
|
||||||
|
np.random.seed(0)
|
||||||
|
torch.manual_seed(0)
|
||||||
|
torch.cuda.manual_seed_all(0)
|
||||||
|
model = DebertaV2Model.from_pretrained("microsoft/deberta-v2-xlarge")
|
||||||
|
|
||||||
|
input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
|
||||||
|
output = model(input_ids)[0]
|
||||||
|
# compare the actual values for a slice.
|
||||||
|
expected_slice = torch.tensor(
|
||||||
|
[[[-0.2913, 0.2647, 0.5627], [-0.4318, 0.1389, 0.3881], [-0.2929, -0.2489, 0.3452]]]
|
||||||
|
)
|
||||||
|
self.assertTrue(torch.allclose(output[:, :3, :3], expected_slice, atol=1e-4), f"{output[:, :3, :3]}")
|
||||||
162
tests/test_tokenization_deberta_v2.py
Normal file
@@ -0,0 +1,162 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2019 Hugging Face inc.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
import os
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from transformers import DebertaV2Tokenizer
|
||||||
|
from transformers.testing_utils import require_sentencepiece, require_tokenizers
|
||||||
|
|
||||||
|
from .test_tokenization_common import TokenizerTesterMixin
|
||||||
|
|
||||||
|
|
||||||
|
SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/spiece.model")
|
||||||
|
|
||||||
|
|
||||||
|
@require_sentencepiece
|
||||||
|
@require_tokenizers
|
||||||
|
class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
|
||||||
|
|
||||||
|
tokenizer_class = DebertaV2Tokenizer
|
||||||
|
rust_tokenizer_class = None
|
||||||
|
test_rust_tokenizer = False
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
super().setUp()
|
||||||
|
|
||||||
|
# We have a SentencePiece fixture for testing
|
||||||
|
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB)
|
||||||
|
tokenizer.save_pretrained(self.tmpdirname)
|
||||||
|
|
||||||
|
def get_input_output_texts(self, tokenizer):
|
||||||
|
input_text = "this is a test"
|
||||||
|
output_text = "this is a test"
|
||||||
|
return input_text, output_text
|
||||||
|
|
||||||
|
def test_rust_and_python_full_tokenizers(self):
|
||||||
|
if not self.test_rust_tokenizer:
|
||||||
|
return
|
||||||
|
|
||||||
|
tokenizer = self.get_tokenizer()
|
||||||
|
rust_tokenizer = self.get_rust_tokenizer()
|
||||||
|
|
||||||
|
sequence = "I was born in 92000, and this is falsé."
|
||||||
|
|
||||||
|
tokens = tokenizer.tokenize(sequence)
|
||||||
|
rust_tokens = rust_tokenizer.tokenize(sequence)
|
||||||
|
self.assertListEqual(tokens, rust_tokens)
|
||||||
|
|
||||||
|
ids = tokenizer.encode(sequence, add_special_tokens=False)
|
||||||
|
rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False)
|
||||||
|
self.assertListEqual(ids, rust_ids)
|
||||||
|
|
||||||
|
rust_tokenizer = self.get_rust_tokenizer()
|
||||||
|
ids = tokenizer.encode(sequence)
|
||||||
|
rust_ids = rust_tokenizer.encode(sequence)
|
||||||
|
self.assertListEqual(ids, rust_ids)
|
||||||
|
|
||||||
|
def test_full_tokenizer(self):
|
||||||
|
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, keep_accents=True)
|
||||||
|
|
||||||
|
tokens = tokenizer.tokenize("This is a test")
|
||||||
|
self.assertListEqual(tokens, ["▁", "[UNK]", "his", "▁is", "▁a", "▁test"])
|
||||||
|
|
||||||
|
self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [13, 1, 4398, 25, 21, 1289])
|
||||||
|
|
||||||
|
tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.")
|
||||||
|
# fmt: off
|
||||||
|
self.assertListEqual(
|
||||||
|
tokens,
|
||||||
|
["▁", "[UNK]", "▁was", "▁born", "▁in", "▁9", "2000", ",", "▁and", "▁this", "▁is", "▁fal", "s", "[UNK]", "."],
|
||||||
|
)
|
||||||
|
ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
self.assertListEqual(ids, [13, 1, 23, 386, 19, 561, 3050, 15, 17, 48, 25, 8256, 18, 1, 9])
|
||||||
|
|
||||||
|
back_tokens = tokenizer.convert_ids_to_tokens(ids)
|
||||||
|
self.assertListEqual(
|
||||||
|
back_tokens,
|
||||||
|
["▁", "<unk>", "▁was", "▁born", "▁in", "▁9", "2000", ",", "▁and", "▁this", "▁is", "▁fal", "s", "<unk>", "."],
|
||||||
|
)
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
def test_sequence_builders(self):
|
||||||
|
tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB)
|
||||||
|
|
||||||
|
text = tokenizer.encode("sequence builders")
|
||||||
|
text_2 = tokenizer.encode("multi-sequence build")
|
||||||
|
|
||||||
|
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||||
|
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||||
|
|
||||||
|
assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
|
||||||
|
assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + text_2 + [
|
||||||
|
tokenizer.sep_token_id
|
||||||
|
]
|
||||||
|
|
||||||
|
def test_tokenizer_integration(self):
|
||||||
|
tokenizer_classes = [self.tokenizer_class]
|
||||||
|
if self.test_rust_tokenizer:
|
||||||
|
tokenizer_classes.append(self.rust_tokenizer_class)
|
||||||
|
|
||||||
|
for tokenizer_class in tokenizer_classes:
|
||||||
|
tokenizer = tokenizer_class.from_pretrained("microsoft/deberta-v2-xlarge")
|
||||||
|
|
||||||
|
sequences = [
|
||||||
|
[
|
||||||
|
"DeBERTa: Decoding-enhanced BERT with Disentangled Attention",
|
||||||
|
"DeBERTa: Decoding-enhanced BERT with Disentangled Attention",
|
||||||
|
],
|
||||||
|
[
|
||||||
|
"Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks.",
|
||||||
|
"DeBERTa: Decoding-enhanced BERT with Disentangled Attention",
|
||||||
|
],
|
||||||
|
[
|
||||||
|
"In this paper we propose a new model architecture DeBERTa",
|
||||||
|
"DeBERTa: Decoding-enhanced BERT with Disentangled Attention",
|
||||||
|
],
|
||||||
|
]
|
||||||
|
|
||||||
|
encoding = tokenizer(sequences, padding=True)
|
||||||
|
decoded_sequences = [tokenizer.decode(seq, skip_special_tokens=True) for seq in encoding["input_ids"]]
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
expected_encoding = {
|
||||||
|
'input_ids': [
|
||||||
|
[1, 1804, 69418, 191, 43, 117056, 18, 44596, 448, 37132, 19, 8655, 10625, 69860, 21149, 2, 1804, 69418, 191, 43, 117056, 18, 44596, 448, 37132, 19, 8655, 10625, 69860, 21149, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||||
|
[1, 9755, 1944, 11, 1053, 18, 16899, 12730, 1072, 1506, 45, 2497, 2510, 5, 610, 9, 127, 699, 1072, 2101, 36, 99388, 53, 2930, 4, 2, 1804, 69418, 191, 43, 117056, 18, 44596, 448, 37132, 19, 8655, 10625, 69860, 21149, 2],
|
||||||
|
[1, 84, 32, 778, 42, 9441, 10, 94, 735, 3372, 1804, 69418, 191, 2, 1804, 69418, 191, 43, 117056, 18, 44596, 448, 37132, 19, 8655, 10625, 69860, 21149, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
|
||||||
|
'token_type_ids': [
|
||||||
|
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||||
|
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||||
|
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
|
||||||
|
'attention_mask': [
|
||||||
|
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||||
|
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||||
|
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
expected_decoded_sequences = [
|
||||||
|
'DeBERTa: Decoding-enhanced BERT with Disentangled Attention DeBERTa: Decoding-enhanced BERT with Disentangled Attention',
|
||||||
|
'Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. DeBERTa: Decoding-enhanced BERT with Disentangled Attention',
|
||||||
|
'In this paper we propose a new model architecture DeBERTa DeBERTa: Decoding-enhanced BERT with Disentangled Attention'
|
||||||
|
]
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
self.assertDictEqual(encoding.data, expected_encoding)
|
||||||
|
|
||||||
|
for expected, decoded in zip(expected_decoded_sequences, decoded_sequences):
|
||||||
|
self.assertEqual(expected, decoded)
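The expected encoding in this test shows the effect of `padding=True`: shorter rows are filled with 0 up to the longest row, and `attention_mask` marks real tokens with 1. A minimal sketch of that behaviour follows, assuming a pad id of 0 as seen in the expected `input_ids`; `pad_batch` is a hypothetical helper, not the tokenizer's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    width = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_batch([[1, 84, 32, 2], [1, 9755, 2]])
print(encoded["input_ids"])       # [[1, 84, 32, 2], [1, 9755, 2, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```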
|
||||||