Add gpt-sw3 model to transformers (#20209)
* Add templates for gpt-sw3
* Add templates for gpt-sw3
* Added sentencepiece tokenizer
* intermediate commit with many changes
* fixed conflicts
* Init commit for tokenization port
* Tokenization progress
* Remove fast tokenizer
* Clean up and rename spm.model -> spiece.model
* Remove TF -> PT conversion script template, Clean up Megatron -> PT script
* Optimize encode & decode performance
* added new attention
* added new attention
* attention for gpt-sw3 working
* attention good
* Cache is now working
* fixed attention mask so that it works with causal attention
* fixed badbmm bug for cpu and caching
* updated config with correct parameters
* Refactor and leave optimizations as separate functions to avoid breaking expected functionality
* Fix special tokens mapping for both tokenizers
* cleaning up of code and comments
* HF compatible attention outputs
* Tokenizer now passing tests, add documentation
* Update documentation
* reverted back to base implementation after checking that it is identical to pretrained model
* updated gpt-sw3 config
* updated conversion script
* aligned parameters with gpt-sw3 config
* changed default scale_attn_by_inverse_layer_idx to true
* removed flag from conversion script
* added temporary model path
* reverted back to functioning convert script
* small changes to default config
* updated tests for gpt-sw3
* make style, make quality, minor cleanup
* Change local paths to testing online repository
* Change name: GptSw3 -> GPTSw3
* Remove GPTSw3TokenizerFast references
* Use official model repository and add more model sizes
* Added reference to 6.7b model
* Add GPTSw3DoubleHeadsModel to IGNORE_NON_AUTO_CONFIGURED, like GPT2DoubleHeadsModel
* Remove pointers to non-existing TFGPTSw3
* Add GPTSw3 to docs/_toctree.yml
* Remove TF artifacts from GPTSw3 in __init__ files
* Update README:s with 'make fix-copies'
* Add 20b model to archive list
* Add documentation for GPT-Sw3
* Fix typo in documentation for GPT-Sw3
* Do 'make fix-copies' again after having updated docs
* Fix some typos in docs
* Update src/transformers/models/gpt_sw3/configuration_gpt_sw3.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/configuration_gpt_sw3.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/__init__.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/__init__.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/convert_megatron_to_pytorch.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/modeling_gpt_sw3.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update tests/models/gpt_sw3/test_tokenization_gpt_sw3.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/modeling_gpt_sw3.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/modeling_gpt_sw3.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Resolve comments from PR feedback
* Resolve more comments from PR feedback, also set use_cache=True in convert script
* Add '# Copied from' comments for GPTSw3 modeling
* Set 'is_parallelizable = False'
* Remove '# Copied from' where code was modified and add 'with x->y' when appropriate
* Remove parallelize in mdx
* make style, make quality
* Update GPTSw3Config default values and corresponding documentation
* Update src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/__init__.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Clean up and protect GPTSw3Tokenizer imports with is_sentencepiece_available
* Make style, make quality
* Add dummy object for GPTSw3Tokenizer via 'make fix-copies'
* make fix-copies
* Remove GPTSw3 modeling classes
* make style, make quality
* Add GPTSw3 auto-mappings for other GPT2 heads
* Update docs/source/en/model_doc/gpt-sw3.mdx Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/convert_megatron_to_pytorch.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Remove old TODO-comment
* Add example usage to GPTSw3Tokenizer docstring
* make style, make quality
* Add implementation details and example usage to gpt-sw3.mdx

Co-authored-by: JoeyOhman <joeyoh@kth.se>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -322,6 +322,7 @@ Current number of checkpoints:
1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -322,6 +322,7 @@ Número actual de puntos de control:
1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -295,6 +295,7 @@ conda install -c huggingface transformers
1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (अबेजा के जरिए) शिन्या ओटानी, ताकायोशी मकाबे, अनुज अरोड़ा, क्यो हटोरी द्वारा।
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (ओपनएआई से) साथ में पेपर [लैंग्वेज मॉडल्स अनसुपरवाइज्ड मल्टीटास्क लर्नर्स हैं](https://blog.openai.com/better-language-models/) एलेक रैडफोर्ड*, जेफरी वू*, रेवन चाइल्ड, डेविड लुआन, डारियो एमोडी* द्वारा * और इल्या सुत्सकेवर** ने पोस्ट किया।
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI से) साथ वाला पेपर [kingoflolz/mesh-transformer-jax](https://github. com/kingoflolz/mesh-transformer-jax/) बेन वांग और अरन कोमात्सुजाकी द्वारा।
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv .org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा।
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा।
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा।
@@ -357,6 +357,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -272,6 +272,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -296,6 +296,7 @@ conda install -c huggingface transformers
1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (来自 ABEJA) 由 Shinya Otani, Takayoshi Makabe, Anuj Arora, Kyo Hattori。
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 由 Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 发布。
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (来自 EleutherAI) 伴随论文 [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) 由 Ben Wang and Aran Komatsuzaki 发布。
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (来自 UCSD, NVIDIA) 伴随论文 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 由 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 发布。
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。
@@ -308,6 +308,7 @@ conda install -c huggingface transformers
1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released with the paper [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -275,6 +275,8 @@
        title: GPT-J
      - local: model_doc/gpt2
        title: GPT2
      - local: model_doc/gpt-sw3
        title: GPTSw3
      - local: model_doc/herbert
        title: HerBERT
      - local: model_doc/ibert
@@ -109,6 +109,7 @@ The documentation is organized into five sections:
1. **[GPT NeoX Japanese](model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori.
1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -276,6 +277,7 @@ Flax), PyTorch, and/or TensorFlow.
| GPT NeoX | ❌ | ✅ | ✅ | ❌ | ❌ |
| GPT NeoX Japanese | ✅ | ❌ | ✅ | ❌ | ❌ |
| GPT-J | ❌ | ❌ | ✅ | ✅ | ✅ |
| GPT-Sw3 | ✅ | ✅ | ✅ | ✅ | ✅ |
| GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ |
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
54
docs/source/en/model_doc/gpt-sw3.mdx
Normal file
@@ -0,0 +1,54 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# GPT-Sw3

## Overview

The GPT-Sw3 model was first proposed in
[Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf)
by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman,
Fredrik Carlsson, Magnus Sahlgren.

Since that first paper, the authors have extended their work and trained new models on their new 1.2TB corpus, The Nordic Pile.

GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden
in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing
320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a
causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.

This model was contributed by [AI Sweden](https://huggingface.co/AI-Sweden).

The implementation uses the [GPT2Model](https://huggingface.co/docs/transformers/model_doc/gpt2) coupled
with our `GPTSw3Tokenizer`. This means that `AutoTokenizer` and `AutoModelForCausalLM` map to our tokenizer
implementation and the corresponding GPT2 model implementation respectively.
*Note that sentencepiece is required to use our tokenizer and can be installed with:* `pip install transformers[sentencepiece]` or `pip install sentencepiece`

Example usage:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("AI-Sweden/gpt-sw3-356m")
>>> model = AutoModelForCausalLM.from_pretrained("AI-Sweden/gpt-sw3-356m")

>>> input_ids = tokenizer("Träd är fina för att", return_tensors="pt")["input_ids"]

>>> generated_token_ids = model.generate(inputs=input_ids, max_new_tokens=10, do_sample=True)[0]

>>> print(tokenizer.decode(generated_token_ids))
Träd är fina för att de är färgstarka. Men ibland är det fint
```
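
As a rough illustration of the Auto-class resolution described above, the following self-contained sketch mimics the lookup with plain dictionaries. It does not import `transformers`: the `gpt-sw3` and `gpt2` entries mirror names that appear in this commit's auto-mappings, while the lowercase table names and the `resolve_classes` helper are hypothetical and exist only for this example.

```python
from collections import OrderedDict

# Hypothetical miniature copies of the auto-mapping tables (illustration only).
# The "gpt-sw3" entries mirror the ones added in this commit.
config_names = OrderedDict([("gpt-sw3", "GPT2Config"), ("gpt2", "GPT2Config")])
causal_lm_names = OrderedDict([("gpt-sw3", "GPT2LMHeadModel"), ("gpt2", "GPT2LMHeadModel")])
slow_tokenizer_names = OrderedDict([("gpt-sw3", "GPTSw3Tokenizer"), ("gpt2", "GPT2Tokenizer")])


def resolve_classes(model_type):
    """Return the class names that a given model type key maps to."""
    if model_type not in config_names:
        raise KeyError(f"unknown model type: {model_type!r}")
    return {
        "config": config_names[model_type],
        "causal_lm": causal_lm_names[model_type],
        "tokenizer": slow_tokenizer_names[model_type],
    }


if __name__ == "__main__":
    # GPT-Sw3 reuses the GPT2 config and model classes but has its own tokenizer.
    print(resolve_classes("gpt-sw3"))
```

The point of the sketch is that only the tokenizer entry differs between `gpt-sw3` and `gpt2`; the config and model entries are shared.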

## GPTSw3Tokenizer

[[autodoc]] GPTSw3Tokenizer
    - save_vocabulary
@@ -81,6 +81,7 @@ Ready-made configurations include the following architectures:
- FlauBERT
- GPT Neo
- GPT-J
- GPT-Sw3
- GroupViT
- I-BERT
- ImageGPT
@@ -253,6 +253,7 @@ _import_structure = {
    "models.gpt_neo": ["GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoConfig"],
    "models.gpt_neox": ["GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXConfig"],
    "models.gpt_neox_japanese": ["GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXJapaneseConfig"],
    "models.gpt_sw3": [],
    "models.gptj": ["GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTJConfig"],
    "models.groupvit": [
        "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP",
@@ -580,6 +581,7 @@ else:
    _import_structure["models.cpm"].append("CpmTokenizer")
    _import_structure["models.deberta_v2"].append("DebertaV2Tokenizer")
    _import_structure["models.fnet"].append("FNetTokenizer")
    _import_structure["models.gpt_sw3"].append("GPTSw3Tokenizer")
    _import_structure["models.layoutxlm"].append("LayoutXLMTokenizer")
    _import_structure["models.m2m_100"].append("M2M100Tokenizer")
    _import_structure["models.marian"].append("MarianTokenizer")
@@ -3815,6 +3817,7 @@ if TYPE_CHECKING:
        from .models.cpm import CpmTokenizer
        from .models.deberta_v2 import DebertaV2Tokenizer
        from .models.fnet import FNetTokenizer
        from .models.gpt_sw3 import GPTSw3Tokenizer
        from .models.layoutxlm import LayoutXLMTokenizer
        from .models.m2m_100 import M2M100Tokenizer
        from .models.marian import MarianTokenizer
@@ -77,6 +77,7 @@ from . import (
    gpt_neo,
    gpt_neox,
    gpt_neox_japanese,
    gpt_sw3,
    gptj,
    groupvit,
    herbert,
@@ -77,6 +77,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("fsmt", "FSMTConfig"),
        ("funnel", "FunnelConfig"),
        ("glpn", "GLPNConfig"),
        ("gpt-sw3", "GPT2Config"),
        ("gpt2", "GPT2Config"),
        ("gpt_neo", "GPTNeoConfig"),
        ("gpt_neox", "GPTNeoXConfig"),
@@ -383,6 +384,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("fsmt", "FairSeq Machine-Translation"),
        ("funnel", "Funnel Transformer"),
        ("glpn", "GLPN"),
        ("gpt-sw3", "GPT-Sw3"),
        ("gpt2", "OpenAI GPT-2"),
        ("gpt_neo", "GPT Neo"),
        ("gpt_neox", "GPT NeoX"),
@@ -76,6 +76,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("fsmt", "FSMTModel"),
        ("funnel", ("FunnelModel", "FunnelBaseModel")),
        ("glpn", "GLPNModel"),
        ("gpt-sw3", "GPT2Model"),
        ("gpt2", "GPT2Model"),
        ("gpt_neo", "GPTNeoModel"),
        ("gpt_neox", "GPTNeoXModel"),
@@ -197,6 +198,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
        ("fnet", "FNetForPreTraining"),
        ("fsmt", "FSMTForConditionalGeneration"),
        ("funnel", "FunnelForPreTraining"),
        ("gpt-sw3", "GPT2LMHeadModel"),
        ("gpt2", "GPT2LMHeadModel"),
        ("ibert", "IBertForMaskedLM"),
        ("layoutlm", "LayoutLMForMaskedLM"),
@@ -258,6 +260,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
        ("fnet", "FNetForMaskedLM"),
        ("fsmt", "FSMTForConditionalGeneration"),
        ("funnel", "FunnelForMaskedLM"),
        ("gpt-sw3", "GPT2LMHeadModel"),
        ("gpt2", "GPT2LMHeadModel"),
        ("gpt_neo", "GPTNeoForCausalLM"),
        ("gpt_neox", "GPTNeoXForCausalLM"),
@@ -321,6 +324,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("data2vec-text", "Data2VecTextForCausalLM"),
        ("electra", "ElectraForCausalLM"),
        ("ernie", "ErnieForCausalLM"),
        ("gpt-sw3", "GPT2LMHeadModel"),
        ("gpt2", "GPT2LMHeadModel"),
        ("gpt_neo", "GPTNeoForCausalLM"),
        ("gpt_neox", "GPTNeoXForCausalLM"),
@@ -577,6 +581,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
        ("flaubert", "FlaubertForSequenceClassification"),
        ("fnet", "FNetForSequenceClassification"),
        ("funnel", "FunnelForSequenceClassification"),
        ("gpt-sw3", "GPT2ForSequenceClassification"),
        ("gpt2", "GPT2ForSequenceClassification"),
        ("gpt_neo", "GPTNeoForSequenceClassification"),
        ("gptj", "GPTJForSequenceClassification"),
@@ -713,6 +718,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
        ("flaubert", "FlaubertForTokenClassification"),
        ("fnet", "FNetForTokenClassification"),
        ("funnel", "FunnelForTokenClassification"),
        ("gpt-sw3", "GPT2ForTokenClassification"),
        ("gpt2", "GPT2ForTokenClassification"),
        ("ibert", "IBertForTokenClassification"),
        ("layoutlm", "LayoutLMForTokenClassification"),
@@ -38,6 +38,7 @@ FLAX_MODEL_MAPPING_NAMES = OrderedDict(
        ("clip", "FlaxCLIPModel"),
        ("distilbert", "FlaxDistilBertModel"),
        ("electra", "FlaxElectraModel"),
        ("gpt-sw3", "FlaxGPT2Model"),
        ("gpt2", "FlaxGPT2Model"),
        ("gpt_neo", "FlaxGPTNeoModel"),
        ("gptj", "FlaxGPTJModel"),
@@ -130,6 +131,7 @@ FLAX_MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("bert", "FlaxBertForCausalLM"),
        ("big_bird", "FlaxBigBirdForCausalLM"),
        ("electra", "FlaxElectraForCausalLM"),
        ("gpt-sw3", "FlaxGPT2LMHeadModel"),
        ("gpt2", "FlaxGPT2LMHeadModel"),
        ("gpt_neo", "FlaxGPTNeoForCausalLM"),
        ("gptj", "FlaxGPTJForCausalLM"),
@@ -50,6 +50,7 @@ TF_MODEL_MAPPING_NAMES = OrderedDict(
        ("esm", "TFEsmModel"),
        ("flaubert", "TFFlaubertModel"),
        ("funnel", ("TFFunnelModel", "TFFunnelBaseModel")),
        ("gpt-sw3", "TFGPT2Model"),
        ("gpt2", "TFGPT2Model"),
        ("gptj", "TFGPTJModel"),
        ("groupvit", "TFGroupViTModel"),
@@ -102,6 +103,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
        ("electra", "TFElectraForPreTraining"),
        ("flaubert", "TFFlaubertWithLMHeadModel"),
        ("funnel", "TFFunnelForPreTraining"),
        ("gpt-sw3", "TFGPT2LMHeadModel"),
        ("gpt2", "TFGPT2LMHeadModel"),
        ("layoutlm", "TFLayoutLMForMaskedLM"),
        ("lxmert", "TFLxmertForPreTraining"),
@@ -133,6 +135,7 @@ TF_MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
|
||||
("esm", "TFEsmForMaskedLM"),
|
||||
("flaubert", "TFFlaubertWithLMHeadModel"),
|
||||
("funnel", "TFFunnelForMaskedLM"),
|
||||
("gpt-sw3", "TFGPT2LMHeadModel"),
|
||||
("gpt2", "TFGPT2LMHeadModel"),
|
||||
("gptj", "TFGPTJForCausalLM"),
|
||||
("layoutlm", "TFLayoutLMForMaskedLM"),
|
||||
@@ -162,6 +165,7 @@ TF_MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
|
||||
("bert", "TFBertLMHeadModel"),
|
||||
("camembert", "TFCamembertForCausalLM"),
|
||||
("ctrl", "TFCTRLLMHeadModel"),
|
||||
("gpt-sw3", "TFGPT2LMHeadModel"),
|
||||
("gpt2", "TFGPT2LMHeadModel"),
|
||||
("gptj", "TFGPTJForCausalLM"),
|
||||
("openai-gpt", "TFOpenAIGPTLMHeadModel"),
|
||||
@@ -280,6 +284,7 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
|
||||
("esm", "TFEsmForSequenceClassification"),
|
||||
("flaubert", "TFFlaubertForSequenceClassification"),
|
||||
("funnel", "TFFunnelForSequenceClassification"),
|
||||
("gpt-sw3", "TFGPT2ForSequenceClassification"),
|
||||
("gpt2", "TFGPT2ForSequenceClassification"),
|
||||
("gptj", "TFGPTJForSequenceClassification"),
|
||||
("layoutlm", "TFLayoutLMForSequenceClassification"),
|
||||
|
||||
@@ -136,6 +136,7 @@ else:
|
||||
("fnet", ("FNetTokenizer", "FNetTokenizerFast" if is_tokenizers_available() else None)),
|
||||
("fsmt", ("FSMTTokenizer", None)),
|
||||
("funnel", ("FunnelTokenizer", "FunnelTokenizerFast" if is_tokenizers_available() else None)),
|
||||
("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)),
|
||||
("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
||||
("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
||||
("gpt_neox", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
|
||||
|
||||
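The hunks above register `gpt-sw3` as a model type that reuses the existing GPT-2 classes, and the tokenizer entry provides only the slow SentencePiece tokenizer (the fast slot is `None`). A minimal sketch of how such a name mapping resolves is shown here with plain dictionaries instead of the library's auto classes. The helper `resolve` is hypothetical, and the `is_sentencepiece_available()` / `is_tokenizers_available()` conditions from the diff are left out for brevity.

```python
from collections import OrderedDict

# Excerpt of entries added in this commit; values are class names, not classes.
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [
        ("gpt-sw3", "GPT2ForSequenceClassification"),
        ("gpt2", "GPT2ForSequenceClassification"),
    ]
)

# (slow tokenizer, fast tokenizer); None means "no such tokenizer".
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("gpt-sw3", ("GPTSw3Tokenizer", None)),
        ("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast")),
    ]
)


def resolve(model_type, mapping):
    """Hypothetical helper: return the class name(s) registered for a model type."""
    if model_type not in mapping:
        raise KeyError(f"Unrecognized model type: {model_type}")
    return mapping[model_type]


slow_name, fast_name = resolve("gpt-sw3", TOKENIZER_MAPPING_NAMES)
print(slow_name, fast_name)
```

Because the model entries point at GPT-2 class names, no new modeling code is needed for `gpt-sw3`; only the tokenizer is new.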
48
src/transformers/models/gpt_sw3/__init__.py
Normal file
@@ -0,0 +1,48 @@
|
||||
# flake8: noqa
|
||||
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||
# module, but to preserve other warnings. So, don't check this module at all.
|
||||
|
||||
# Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available
|
||||
|
||||
|
||||
_import_structure = {}
|
||||
|
||||
try:
    if not is_sentencepiece_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["tokenization_gpt_sw3"] = ["GPTSw3Tokenizer"]


if TYPE_CHECKING:

    try:
        if not is_sentencepiece_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .tokenization_gpt_sw3 import GPTSw3Tokenizer

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
|
||||
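The `__init__.py` above registers the tokenizer only when `sentencepiece` is installed, so importing the package does not fail when the optional dependency is missing. The following is a minimal standard-library sketch of that guard, not the library's implementation: the `OptionalDependencyNotAvailable` class is a local stand-in, and `is_package_available` and `build_import_structure` are hypothetical helpers.

```python
import importlib.util


class OptionalDependencyNotAvailable(Exception):
    """Local stand-in for the exception of the same name in transformers.utils."""


def is_package_available(name):
    # True when the top-level package can be located, without importing it.
    return importlib.util.find_spec(name) is not None


def build_import_structure(sentencepiece_available):
    """Return the lazy-import table, adding the tokenizer only if its backend exists."""
    import_structure = {}
    try:
        if not sentencepiece_available:
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass  # leave the tokenizer unregistered
    else:
        import_structure["tokenization_gpt_sw3"] = ["GPTSw3Tokenizer"]
    return import_structure


print(build_import_structure(is_package_available("sentencepiece")))
```

The try/except/else shape mirrors the original so that further optional backends can be added as additional guarded blocks.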
197
src/transformers/models/gpt_sw3/convert_megatron_to_pytorch.py
Normal file
@@ -0,0 +1,197 @@
|
||||
# Copyright 2022 The HuggingFace Inc. team and the AI-Sweden team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Convert GPT-SW3 megatron checkpoints to pytorch"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
from os.path import isfile
|
||||
|
||||
import torch
|
||||
|
||||
from transformers import GPT2Config
|
||||
|
||||
|
||||
def recursive_print(name, val, spaces=0):
    # Format the message.
    if name is None:
        msg = None
    else:
        fmt = "." * max(0, spaces - 2) + "# {:" + str(50 - spaces) + "s}"
        msg = fmt.format(name)

    # Print and recurse (if needed).
    if isinstance(val, dict):
        if msg is not None:
            print(msg)
        for k in val.keys():
            recursive_print(k, val[k], spaces + 2)
    elif isinstance(val, torch.Tensor):
        print(msg, ":", val.size())
    else:
        print(msg, ":", val)
|
||||
|
||||
|
||||
def fix_query_key_value_ordering(param, num_splits, num_heads, hidden_size):
    # Permutes layout of param tensor to [num_splits * num_heads * hidden_size, :]
    # for compatibility with later versions of NVIDIA Megatron-LM.
    # The inverse operation is performed inside Megatron-LM to read checkpoints:
    # https://github.com/NVIDIA/Megatron-LM/blob/v2.4/megatron/checkpointing.py#L209
    # If param is the weight tensor of the self-attention block, the returned tensor
    # will have to be transposed one more time to be read by HuggingFace GPT2.
    input_shape = param.size()
    # other versions store [num_heads * num_splits * hidden_size, :]
    saved_shape = (num_heads, num_splits, hidden_size) + input_shape[1:]
    param = param.view(*saved_shape)
    param = param.transpose(0, 1).contiguous()
    param = param.view(*input_shape)
    return param
|
||||
|
||||
|
||||
def convert_megatron_checkpoint(sd_megatron, config):
    """
    Converts a Megatron checkpoint to a HuggingFace GPT-SW3 checkpoint.
    """
    n_positions = config.n_positions
    layers = config.n_layer
    vocab_size = config.vocab_size
    heads = config.n_head
    hidden_size_per_head = config.n_embd // config.n_head

    word_embeddings = sd_megatron["model.language_model.embedding.word_embeddings.weight"][:vocab_size, :]
    sd_hf = {
        "transformer.wte.weight": word_embeddings,
        "transformer.wpe.weight": sd_megatron["model.language_model.embedding.position_embeddings.weight"],
        "transformer.ln_f.weight": sd_megatron["model.language_model.encoder.final_layernorm.weight"],
        "transformer.ln_f.bias": sd_megatron["model.language_model.encoder.final_layernorm.bias"],
    }

    pf = "model.language_model.encoder.layers."
    for i in range(layers):
        causal_mask = torch.tril(torch.ones((n_positions, n_positions), dtype=torch.uint8))
        causal_mask = causal_mask.view(1, 1, n_positions, n_positions)
        sd_hf[f"transformer.h.{i}.attn.bias"] = causal_mask
        sd_hf[f"transformer.h.{i}.attn.masked_bias"] = torch.tensor(-1e4, dtype=torch.bfloat16)

        sd_hf[f"transformer.h.{i}.ln_1.weight"] = sd_megatron[f"{pf}{i}.input_layernorm.weight"]
        sd_hf[f"transformer.h.{i}.ln_1.bias"] = sd_megatron[f"{pf}{i}.input_layernorm.bias"]

        val1 = sd_megatron[f"{pf}{i}.self_attention.query_key_value.weight"]
        val1 = fix_query_key_value_ordering(val1, 3, heads, hidden_size_per_head)
        sd_hf[f"transformer.h.{i}.attn.c_attn.weight"] = val1.transpose(0, 1).contiguous()

        val2 = sd_megatron[f"{pf}{i}.self_attention.query_key_value.bias"]
        val2 = fix_query_key_value_ordering(val2, 3, heads, hidden_size_per_head)
        sd_hf[f"transformer.h.{i}.attn.c_attn.bias"] = val2

        sd_hf[f"transformer.h.{i}.attn.c_proj.weight"] = sd_megatron[f"{pf}{i}.self_attention.dense.weight"].transpose(
            0, 1
        )
        sd_hf[f"transformer.h.{i}.attn.c_proj.bias"] = sd_megatron[f"{pf}{i}.self_attention.dense.bias"]
        sd_hf[f"transformer.h.{i}.ln_2.weight"] = sd_megatron[f"{pf}{i}.post_attention_layernorm.weight"]
        sd_hf[f"transformer.h.{i}.ln_2.bias"] = sd_megatron[f"{pf}{i}.post_attention_layernorm.bias"]
        sd_hf[f"transformer.h.{i}.mlp.c_fc.weight"] = sd_megatron[f"{pf}{i}.mlp.dense_h_to_4h.weight"].transpose(0, 1)
        sd_hf[f"transformer.h.{i}.mlp.c_fc.bias"] = sd_megatron[f"{pf}{i}.mlp.dense_h_to_4h.bias"]
        sd_hf[f"transformer.h.{i}.mlp.c_proj.weight"] = sd_megatron[f"{pf}{i}.mlp.dense_4h_to_h.weight"].transpose(
            0, 1
        )
        sd_hf[f"transformer.h.{i}.mlp.c_proj.bias"] = sd_megatron[f"{pf}{i}.mlp.dense_4h_to_h.bias"]

    # The LM head reuses the word embedding matrix as its weight.
    sd_hf["lm_head.weight"] = word_embeddings

    return sd_hf
|
||||
|
||||
|
||||
def copy_config(config_hf, config_megatron):
    """Copy the config from Megatron to hf."""
    config_hf.vocab_size = 64000
    config_hf.n_positions = config_megatron["encoder_seq_length"]
    config_hf.n_embd = config_megatron["hidden_size"]
    config_hf.n_layer = config_megatron["num_layers"]
    config_hf.n_head = config_megatron["num_attention_heads"]
    config_hf.n_inner = config_megatron["ffn_hidden_size"]
    config_hf.activation_function = "gelu"
    config_hf.resid_pdrop = 0.1
    config_hf.embd_pdrop = 0.1
    config_hf.attn_pdrop = 0.1
    config_hf.layer_norm_epsilon = config_megatron["layernorm_epsilon"]  # 1e-5
    config_hf.initializer_range = config_megatron["init_method_std"]  # 0.02
    config_hf.apply_query_key_layer_scaling = config_megatron["apply_query_key_layer_scaling"]  # True
    config_hf.normalize_attention_scores = True
    config_hf.use_cache = True

    # This identifies the 6.7B (7B) model which uses a different tokenizer
    if config_megatron["hidden_size"] == 4096:
        config_hf.bos_token_id = 1  # <|endoftext|>
        config_hf.eos_token_id = 1  # <|endoftext|>
        config_hf.pad_token_id = 0  # <unk>
    else:
        config_hf.bos_token_id = 2  # <s>
        config_hf.eos_token_id = 3  # <|endoftext|>
        config_hf.pad_token_id = 0  # <pad>

    return config_hf
|
||||
|
||||
|
||||
def main(args):
    print(args)

    checkpoint_path = args.checkpoint_path
    save_path = args.save_path
    # Fail early when the checkpoint file does not exist.
    if not isfile(checkpoint_path):
        raise FileNotFoundError(f"ERROR! could not find file {checkpoint_path}")

    # Load the model.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")

    # Load the config.
    config_megatron = checkpoint["hyper_parameters"]["cfg"]
    config_hf = GPT2Config()
    config_hf = copy_config(config_hf=config_hf, config_megatron=config_megatron)
    config_hf.architectures = ["GPT2LMHeadModel"]

    sd_megatron = checkpoint["state_dict"]

    # Convert.
    print("Converting")
    sd_hf = convert_megatron_checkpoint(sd_megatron, config_hf)

    # Print the structure of converted state dict.
    if args.print_checkpoint_structure:
        recursive_print(None, sd_hf)

    config_hf.tokenizer_class = "GPTSw3Tokenizer"

    # Store the config to file.
    print("Saving config")
    config_hf.save_pretrained(save_path)

    # Store the state_dict to file.
    output_checkpoint_file = os.path.join(save_path, "pytorch_model.bin")
    print(f'Saving checkpoint to "{output_checkpoint_file}"')
    torch.save(sd_hf, output_checkpoint_file)
|
||||
|
||||
|
||||
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--checkpoint_path",
        type=str,
        required=True,
        help="e.g. megatron_gpt--val_loss=2.42-step=38000-consumed_samples=54720000",
    )
    parser.add_argument("--save_path", type=str, required=True, help="e.g. /home/user/gpt-sw3/hf")
    parser.add_argument("--print-checkpoint-structure", action="store_true")
    _args = parser.parse_args()
    main(_args)
|
||||
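The row permutation done by `fix_query_key_value_ordering` can be hard to picture from the `view`/`transpose` calls alone. The sketch below reproduces the same index arithmetic on a plain Python list, one entry per row, so it runs without PyTorch. The function name `reorder_qkv_rows` and the row labels are illustrative assumptions, including the reading of the three splits as query, key and value in that order.

```python
def reorder_qkv_rows(rows, num_splits, num_heads, hidden_size):
    """Reorder rows stored as [heads, splits, hidden] into [splits, heads, hidden]."""
    if len(rows) != num_splits * num_heads * hidden_size:
        raise ValueError("row count does not match num_splits * num_heads * hidden_size")
    out = []
    for s in range(num_splits):
        for h in range(num_heads):
            for d in range(hidden_size):
                # Source index in the saved (head-major) layout.
                out.append(rows[(h * num_splits + s) * hidden_size + d])
    return out


# Two heads, three splits (assumed to be q, k, v), one row per head and split.
saved = ["h0.q", "h0.k", "h0.v", "h1.q", "h1.k", "h1.v"]
print(reorder_qkv_rows(saved, num_splits=3, num_heads=2, hidden_size=1))
# ['h0.q', 'h1.q', 'h0.k', 'h1.k', 'h0.v', 'h1.v']
```

After the reordering, all rows of one split are contiguous. This matches the target layout `[num_splits * num_heads * hidden_size, :]` named in the function's comment.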
314
src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py
Normal file
@@ -0,0 +1,314 @@
|
||||
import os
|
||||
import re
|
||||
import unicodedata
|
||||
|
||||
from ... import is_torch_available
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from shutil import copyfile
|
||||
from typing import Any, Dict, List, Optional, Tuple, Union
|
||||
|
||||
import sentencepiece as spm
|
||||
|
||||
from ...tokenization_utils import PreTrainedTokenizer
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}
|
||||
|
||||
PRETRAINED_VOCAB_FILES_MAP = {
|
||||
"vocab_file": {
|
||||
"AI-Sweden/gpt-sw3-126m": "https://huggingface.co/AI-Sweden/gpt-sw3-126m/resolve/main/spiece.model",
|
||||
"AI-Sweden/gpt-sw3-350m": "https://huggingface.co/AI-Sweden/gpt-sw3-350m/resolve/main/spiece.model",
|
||||
"AI-Sweden/gpt-sw3-1.6b": "https://huggingface.co/AI-Sweden/gpt-sw3-1.6b/resolve/main/spiece.model",
|
||||
"AI-Sweden/gpt-sw3-6.7b": "https://huggingface.co/AI-Sweden/gpt-sw3-6.7b/resolve/main/spiece.model",
|
||||
"AI-Sweden/gpt-sw3-20b": "https://huggingface.co/AI-Sweden/gpt-sw3-20b/resolve/main/spiece.model",
|
||||
}
|
||||
}
|
||||
|
||||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||
"AI-Sweden/gpt-sw3-126m": 2048,
|
||||
"AI-Sweden/gpt-sw3-350m": 2048,
|
||||
"AI-Sweden/gpt-sw3-1.6b": 2048,
|
||||
"AI-Sweden/gpt-sw3-6.7b": 2048,
|
||||
"AI-Sweden/gpt-sw3-20b": 2048,
|
||||
}
|
||||
|
||||
|
||||
class GPTSw3Tokenizer(PreTrainedTokenizer):
|
||||
"""
|
||||
Construct a GPTSw3 tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
|
||||
|
||||
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
|
||||
this superclass for more information regarding those methods.
|
||||
|
||||
Example usage:
|
||||
```
|
||||
>>> from transformers import GPTSw3Tokenizer
|
||||
>>> tokenizer = GPTSw3Tokenizer.from_pretrained("AI-Sweden/gpt-sw3-126m")
|
||||
>>> tokenizer("Svenska är kul!")['input_ids']
|
||||
[1814, 377, 3617, 63504]
|
||||
```
|
||||
|
||||
Args:
|
||||
vocab_file (`str`):
|
||||
[SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
|
||||
contains the vocabulary necessary to instantiate a tokenizer.
|
||||
do_lower_case (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to lowercase the input when tokenizing.
|
||||
remove_space (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
|
||||
keep_accents (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to keep accents when tokenizing.
|
||||
bos_token (`str`, *optional*):
|
||||
The beginning-of-sequence token, which can be used for downstream tasks; it was not seen during pretraining. If
not provided, will default to '<s>' or '<|endoftext|>', depending on model size.
|
||||
eos_token (`str`, *optional*):
|
||||
The end-of-sequence token seen during pretraining. If not provided, will default to '<|endoftext|>'.
|
||||
unk_token (`str`, *optional*):
|
||||
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
|
||||
token instead. If not provided, will default to '<unk>'.
|
||||
pad_token (`str`, *optional*):
|
||||
The token used for padding, for example when batching sequences of different lengths. If not provided, will
|
||||
default to '<pad>' or '<unk>' depending on model size.
|
||||
sp_model_kwargs (`dict`, *optional*):
|
||||
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
|
||||
SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
|
||||
to set:
|
||||
|
||||
- `enable_sampling`: Enable subword regularization.
|
||||
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
|
||||
|
||||
- `nbest_size = {0,1}`: No sampling is performed.
|
||||
- `nbest_size > 1`: samples from the nbest_size results.
|
||||
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
|
||||
using forward-filtering-and-backward-sampling algorithm.
|
||||
|
||||
- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
|
||||
BPE-dropout.
|
||||
|
||||
Attributes:
|
||||
sp_model (`SentencePieceProcessor`):
|
||||
The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
|
||||
whitespaces (`set`):
|
||||
The whitespaces that are replaced in the whitespace normalization in preprocessing.
|
||||
non_printing_characters_re (`Pattern`):
|
||||
The compiled regular expression to remove non-printing characters in preprocessing.
|
||||
"""
|
||||
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab_file,
|
||||
do_lower_case=False,
|
||||
remove_space=False,
|
||||
keep_accents=False,
|
||||
pad_token=None,
|
||||
unk_token=None,
|
||||
eos_token=None,
|
||||
bos_token=None,
|
||||
sp_model_kwargs: Optional[Dict[str, Any]] = None,
|
||||
**kwargs
|
||||
) -> None:
|
||||
|
||||
self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
|
||||
|
||||
name_or_path = kwargs.get("name_or_path")
|
||||
if name_or_path is None:
|
||||
logger.warning(
"name_or_path not provided, will work for all GPTSw3 models except gpt-sw3-7b."
" If you are testing the model, this can safely be ignored."
)
|
||||
name_or_path = "None"
|
||||
|
||||
# Default definitions for our 2 tokenizer versions, with None-checks to enable proper testing
|
||||
eos_token = "<|endoftext|>" if eos_token is None else eos_token
|
||||
unk_token = "<unk>" if unk_token is None else unk_token
|
||||
if "gpt-sw3-7b" in name_or_path:
|
||||
pad_token = unk_token if pad_token is None else pad_token
|
||||
bos_token = eos_token if bos_token is None else bos_token
|
||||
else:
|
||||
pad_token = "<pad>" if pad_token is None else pad_token
|
||||
bos_token = "<s>" if bos_token is None else bos_token
|
||||
|
||||
super().__init__(
|
||||
do_lower_case=do_lower_case,
|
||||
remove_space=remove_space,
|
||||
keep_accents=keep_accents,
|
||||
bos_token=bos_token,
|
||||
eos_token=eos_token,
|
||||
unk_token=unk_token,
|
||||
pad_token=pad_token,
|
||||
sp_model_kwargs=self.sp_model_kwargs,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self.do_lower_case = do_lower_case
|
||||
self.remove_space = remove_space
|
||||
self.keep_accents = keep_accents
|
||||
self.vocab_file = vocab_file
|
||||
|
||||
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
|
||||
self.sp_model.Load(vocab_file)
|
||||
|
||||
# Used for whitespace normalization in input texts
|
||||
# fmt : off
|
||||
self.whitespaces = {" ", " ", " ", " ", " ", " ", " ", " ", " ", " ", "", ""}
|
||||
# fmt : on
|
||||
|
||||
# Regular expression to remove non-printing characters (e.g. some unicode control chars) in preprocessing
|
||||
self.non_printing_characters_re = re.compile(
|
||||
f"[{''.join(map(chr, list(range(0, 9)) + list(range(11, 32)) + list(range(127, 160)) + [160, 173, 8203]))}]"
|
||||
)
|
||||
|
||||
# Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.__getstate__
|
||||
def __getstate__(self):
|
||||
state = self.__dict__.copy()
|
||||
state["sp_model"] = None
|
||||
return state
|
||||
|
||||
# Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.__setstate__
|
||||
def __setstate__(self, d):
|
||||
self.__dict__ = d
|
||||
|
||||
# for backward compatibility
|
||||
if not hasattr(self, "sp_model_kwargs"):
|
||||
self.sp_model_kwargs = {}
|
||||
|
||||
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
|
||||
self.sp_model.Load(self.vocab_file)
|
||||
|
||||
@property
|
||||
# Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.vocab_size
|
||||
def vocab_size(self) -> int:
|
||||
return len(self.sp_model)
|
||||
|
||||
def preprocess_text(self, text: str) -> str:
|
||||
"""
|
||||
Returns the preprocessed text. This procedure is identical to what was used when training the tokenizer.
|
||||
"""
|
||||
|
||||
# Remove non-printing characters
|
||||
text = self.non_printing_characters_re.sub("", text)
|
||||
|
||||
# Normalize whitespaces
|
||||
text = "".join([char if char not in self.whitespaces else " " for char in text])
|
||||
|
||||
# NFC Unicode normalization
|
||||
text = unicodedata.normalize("NFC", text)
|
||||
return text
|
||||
|
||||
def _tokenize(self, text: str, **kwargs) -> List[str]:
|
||||
text = self.preprocess_text(text)
|
||||
return self.sp_model.encode(text, out_type=str)
|
||||
|
||||
def _convert_token_to_id(self, token: str) -> int:
|
||||
"""Converts a token (str) to an id (int) using the vocab."""
|
||||
return self.sp_model.PieceToId(token)
|
||||
|
||||
def _convert_id_to_token(self, index: int) -> str:
|
||||
"""Converts an index (int) to a token (str) using the vocab."""
|
||||
return self.sp_model.IdToPiece(index)
|
||||
|
||||
@staticmethod
|
||||
def clean_up_tokenization(out_string: str) -> str:
|
||||
"""Returns the input string, this function is overridden to remove the default clean up."""
|
||||
return out_string
|
||||
|
||||
def convert_tokens_to_string(self, tokens: List[str]) -> str:
|
||||
"""Converts a sequence of tokens (strings) to a single string. Special tokens remain intact."""
|
||||
current_sub_tokens = []
|
||||
out_string = ""
|
||||
prev_is_special = False
|
||||
for token in tokens:
|
||||
# make sure that special tokens are not decoded using sentencepiece model
|
||||
if token in self.all_special_tokens:
|
||||
if not prev_is_special:
|
||||
out_string += " "
|
||||
out_string += self.sp_model.decode(current_sub_tokens) + token
|
||||
prev_is_special = True
|
||||
current_sub_tokens = []
|
||||
else:
|
||||
current_sub_tokens.append(token)
|
||||
prev_is_special = False
|
||||
out_string += self.sp_model.decode(current_sub_tokens)
|
||||
|
||||
return out_string
|
||||
|
||||
# Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.get_vocab
|
||||
def get_vocab(self) -> Dict[str, int]:
|
||||
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
|
||||
vocab.update(self.added_tokens_encoder)
|
||||
return vocab
|
||||
|
||||
# Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.save_vocabulary
|
||||
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
|
||||
if not os.path.isdir(save_directory):
|
||||
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
|
||||
return
|
||||
out_vocab_file = os.path.join(
|
||||
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
|
||||
)
|
||||
|
||||
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
|
||||
copyfile(self.vocab_file, out_vocab_file)
|
||||
elif not os.path.isfile(self.vocab_file):
|
||||
with open(out_vocab_file, "wb") as fi:
|
||||
content_spiece_model = self.sp_model.serialized_model_proto()
|
||||
fi.write(content_spiece_model)
|
||||
|
||||
return (out_vocab_file,)
|
||||
|
||||
def encode_fast(
|
||||
self, text: Union[str, List[str]], return_tensors: Union[str, bool] = False
|
||||
) -> Union[List[int], List[List[int]], "torch.Tensor"]:
|
||||
"""
|
||||
Encodes a text or batch of texts to token ids using preprocessing and the raw SP tokenizer. This has reduced
|
||||
functionality but is often much faster.
|
||||
|
||||
Does NOT handle special tokens correctly, these can manually be added as ids afterwards.
|
||||
|
||||
Does NOT support padding, these can manually be added as ids afterwards.
|
||||
|
||||
Use default HuggingFace tokenization methods for full functionality.
|
||||
|
||||
Args:
|
||||
text (`str` or `List[str]`): One or several text(s) to convert to token ids.
|
||||
return_tensors (`str` or `bool`): Returns PyTorch tensors if set to True or "pt"
|
||||
|
||||
Returns:
|
||||
`List[int]`, `List[List[int]]`, or `torch.Tensor`: The encoded text(s) as token ids.
|
||||
"""
|
||||
|
||||
if isinstance(text, str):
|
||||
text = self.preprocess_text(text)
|
||||
token_ids = self.sp_model.encode(text)
|
||||
else:
|
||||
text = [self.preprocess_text(t) for t in text]
|
||||
token_ids = self.sp_model.encode(text)
|
||||
|
||||
if return_tensors is True or return_tensors == "pt":
|
||||
token_ids = torch.tensor(token_ids)
|
||||
|
||||
return token_ids
|
||||
|
||||
def decode_fast(self, token_ids: Union[int, List[int]]) -> str:
|
||||
"""
|
||||
Decodes a token id or a list of token ids to text using the raw SP tokenizer. This has reduced
functionality but is often much faster.
|
||||
|
||||
Args:
|
||||
token_ids (`int` or `List[int]`): Encoded token or text as token id(s).
|
||||
|
||||
Returns:
|
||||
`str`: Decoded text
|
||||
"""
|
||||
|
||||
return self.sp_model.decode(token_ids)
|
||||
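`preprocess_text` in the tokenizer above applies three steps in order: strip non-printing characters, map a set of unusual whitespace characters to a plain space, and apply NFC normalization. A standalone sketch of that pipeline follows, using only the standard library. The regular expression is copied from the diff, while `WHITESPACES` is an assumed illustrative subset, since the exact characters in the tokenizer's set are not reproduced here.

```python
import re
import unicodedata

# Same code points as the pattern compiled in the tokenizer's __init__.
NON_PRINTING_RE = re.compile(
    f"[{''.join(map(chr, list(range(0, 9)) + list(range(11, 32)) + list(range(127, 160)) + [160, 173, 8203]))}]"
)

# Assumption: a small sample of Unicode spaces (en space, thin space, ideographic space).
WHITESPACES = {"\u2002", "\u2009", "\u3000"}


def preprocess_text(text):
    # 1. Remove non-printing characters (tab and newline are kept).
    text = NON_PRINTING_RE.sub("", text)
    # 2. Replace unusual whitespace characters with a plain space.
    text = "".join(" " if ch in WHITESPACES else ch for ch in text)
    # 3. NFC normalization composes base characters with combining marks.
    return unicodedata.normalize("NFC", text)


print(repr(preprocess_text("Ha\u0308j\u200b\u2009da\u030a!")))
```

Note that the zero-width space (U+200B) and the non-breaking space (U+00A0) are in the removal pattern, so they are dropped rather than turned into spaces.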
@@ -66,6 +66,13 @@ class FNetTokenizer(metaclass=DummyObject):
|
||||
requires_backends(self, ["sentencepiece"])
|
||||
|
||||
|
||||
class GPTSw3Tokenizer(metaclass=DummyObject):
|
||||
_backends = ["sentencepiece"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["sentencepiece"])
|
||||
|
||||
|
||||
class LayoutXLMTokenizer(metaclass=DummyObject):
|
||||
_backends = ["sentencepiece"]
|
||||
|
||||
|
||||
0
tests/models/gpt_sw3/__init__.py
Normal file
130
tests/models/gpt_sw3/test_tokenization_gpt_sw3.py
Normal file
@@ -0,0 +1,130 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 Hugging Face inc.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers import GPTSw3Tokenizer
|
||||
from transformers.testing_utils import get_tests_dir, require_sentencepiece, require_tokenizers, slow
|
||||
|
||||
from ...test_tokenization_common import TokenizerTesterMixin
|
||||
|
||||
|
||||
SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece_with_bytefallback.model")
|
||||
|
||||
|
||||
@require_sentencepiece
|
||||
@require_tokenizers
|
||||
class GPTSw3TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
|
||||
tokenizer_class = GPTSw3Tokenizer
|
||||
test_rust_tokenizer = False
|
||||
test_sentencepiece = True
|
||||
test_sentencepiece_ignore_case = False
|
||||
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
|
||||
# We have a SentencePiece fixture for testing
|
||||
tokenizer = GPTSw3Tokenizer(SAMPLE_VOCAB, eos_token="<unk>", bos_token="<unk>", pad_token="<unk>")
|
||||
|
||||
tokenizer.save_pretrained(self.tmpdirname)
|
||||
|
||||
def get_input_output_texts(self, tokenizer):
|
||||
input_text = "This is a test"
|
||||
output_text = "This is a test"
|
||||
return input_text, output_text
|
||||
|
||||
def test_convert_token_and_id(self):
|
||||
"""Test ``_convert_token_to_id`` and ``_convert_id_to_token``."""
|
||||
token = "<s>"
|
||||
token_id = 1
|
||||
|
||||
self.assertEqual(self.get_tokenizer()._convert_token_to_id(token), token_id)
|
||||
self.assertEqual(self.get_tokenizer()._convert_id_to_token(token_id), token)
|
||||
|
||||
def test_get_vocab(self):
|
||||
vocab_keys = list(self.get_tokenizer().get_vocab().keys())
|
||||
|
||||
self.assertEqual(vocab_keys[0], "<unk>")
|
||||
self.assertEqual(vocab_keys[1], "<s>")
|
||||
self.assertEqual(vocab_keys[-1], "j")
|
||||
self.assertEqual(len(vocab_keys), 2_000)
|
||||
|
||||
def test_vocab_size(self):
|
||||
self.assertEqual(self.get_tokenizer().vocab_size, 2_000)
|
||||
|
||||
def test_full_tokenizer(self):
|
||||
tokenizer = GPTSw3Tokenizer(SAMPLE_VOCAB)
|
||||
|
||||
tokens = tokenizer.tokenize("This is a test")
|
||||
self.assertListEqual(tokens, ["▁This", "▁is", "▁a", "▁t", "est"])
|
||||
|
||||
self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [465, 287, 265, 631, 842])
|
||||
|
||||
tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.")
|
||||
# fmt: off
|
||||
self.assertListEqual(
|
||||
tokens,
|
||||
["▁I", "▁was", "▁bor", "n", "▁in", "▁", "<0x39>", "2", "0", "0", "0", ",", "▁and", "▁this", "▁is", "▁f", "al", "s", "<0xC3>", "<0xA9>", "."],
|
||||
)
|
||||
# fmt: on
|
||||
|
||||
ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||
self.assertListEqual(
|
||||
ids,
|
||||
[262, 272, 1525, 286, 271, 268, 60, 916, 633, 633, 633, 259, 266, 301, 287, 384, 367, 263, 198, 172, 260],
|
||||
)
|
||||
|
||||
back_tokens = tokenizer.convert_ids_to_tokens(ids)
|
||||
# fmt: off
|
||||
self.assertListEqual(
|
||||
back_tokens,
|
||||
["▁I", "▁was", "▁bor", "n", "▁in", "▁", "<0x39>", "2", "0", "0", "0", ",", "▁and", "▁this", "▁is", "▁f", "al", "s", "<0xC3>", "<0xA9>", "."]
|
||||
)
|
||||
# fmt: on
|
||||
|
||||
def test_fast_encode_decode(self):
|
||||
tokenizer = GPTSw3Tokenizer(SAMPLE_VOCAB)
|
||||
texts = ["This is a test", "I was born in 92000, and this is falsé."]
|
||||
expected_ids_list = [
|
||||
[465, 287, 265, 631, 842],
|
||||
[262, 272, 1525, 286, 271, 268, 60, 916, 633, 633, 633, 259, 266, 301, 287, 384, 367, 263, 198, 172, 260],
|
||||
]
|
||||
|
||||
# Test that encode_fast returns the same as tokenize + convert_tokens_to_ids
|
||||
for text, expected_ids in zip(texts, expected_ids_list):
|
||||
self.assertListEqual(tokenizer.encode_fast(text), expected_ids)
|
||||
|
||||
# Test that decode_fast returns the input text
|
||||
for text, token_ids in zip(texts, expected_ids_list):
|
||||
self.assertEqual(tokenizer.decode_fast(token_ids), text)
|
||||
|
||||
@slow
|
||||
def test_tokenizer_integration(self):
|
||||
sequences = [
|
||||
"<|python|>def fibonacci(n)\n if n < 0:\n print('Incorrect input')",
|
||||
"Hey there, how are you doing this fine day?",
|
||||
"This is a text with a trailing spaces followed by a dot .",
|
||||
"Häj sväjs lillebrör! =)",
|
||||
"Det är inget fel på Mr. Cool",
|
||||
]
|
||||
|
||||
# fmt: off
|
||||
expected_encoding = {"input_ids": [[63423, 5, 6811, 14954, 282, 816, 3821, 63466, 63425, 63462, 18, 63978, 678, 301, 1320, 63423, 63455, 63458, 18, 63982, 4246, 3940, 1901, 47789, 5547, 18994], [19630, 1100, 63446, 1342, 633, 544, 4488, 593, 5102, 2416, 63495, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1652, 428, 268, 1936, 515, 268, 58593, 22413, 9106, 546, 268, 33213, 63979, 698, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [55130, 63450, 924, 63449, 2249, 4062, 1558, 318, 63504, 21498, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [509, 377, 2827, 2559, 332, 6575, 63443, 26801, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], "token_type_ids": [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], "attention_mask": [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
|
||||
# fmt: on
|
||||
self.tokenizer_integration_test_util(
|
||||
expected_encoding=expected_encoding,
|
||||
model_name="AI-Sweden/gpt-sw3-126m",
|
||||
sequences=sequences,
|
||||
)
|
||||
@@ -202,6 +202,7 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
|
||||
"FlavaImageModel",
|
||||
"FlavaMultimodalModel",
|
||||
"GPT2DoubleHeadsModel",
|
||||
"GPTSw3DoubleHeadsModel",
|
||||
"LayoutLMForQuestionAnswering",
|
||||
"LukeForMaskedLM",
|
||||
"LukeForEntityClassification",
|
||||
|
||||