diff --git a/README.md b/README.md index 4b56f24920..b3685fa357 100644 --- a/README.md +++ b/README.md @@ -1,19 +1,19 @@ # 👾 PyTorch-Transformers -[![CircleCI](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT.svg?style=svg)](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT) +[![CircleCI](https://circleci.com/gh/huggingface/pytorch-transformers.svg?style=svg)](https://circleci.com/gh/huggingface/pytorch-transformers) PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP). The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models: -- **[Google's BERT model](https://github.com/google-research/bert)** released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. -- **[OpenAI's GPT model](https://github.com/openai/finetune-transformer-lm)** released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. -- **[OpenAI's GPT-2 model](https://blog.openai.com/better-language-models/)** released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. -- **[Google/CMU's Transformer-XL model](https://github.com/kimiyoung/transformer-xl)** released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 
-- **[Google/CMU's XLNet model](https://github.com/zihangdai/xlnet/)** released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. -- **[Facebook's XLM model](https://github.com/facebookresearch/XLM/)** released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. +1. **[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. +2. **[GPT](https://github.com/openai/finetune-transformer-lm)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +3. **[GPT-2](https://blog.openai.com/better-language-models/)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. +5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +6. 
**[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. -These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](#documentation). +These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html). 
| Section | Description | |-|-| @@ -21,7 +21,7 @@ These implementations have been tested on several datasets (see the example scri | [Quick tour: Usage](#quick-tour-usage) | Tokenizers & models usage: Bert and GPT-2 | | [Quick tour: Fine-tuning/usage scripts](#quick-tour-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation | | [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers | -| [Documentation](#documentation) | Full API documentation and more | +| [Documentation](https://huggingface.co/pytorch-transformers/) | Full API documentation and more | ## Installation @@ -202,13 +202,14 @@ Examples for each model class of each model architecture (Bert, GPT, GPT-2, Tran The library comprises several example scripts with SOTA performances for NLU and NLG tasks: -- fine-tuning Bert/XLNet/XLM with a *sequence-level classifier* on nine different GLUE tasks, -- fine-tuning Bert/XLNet/XLM with a *token-level classifier* on the question answering dataset SQuAD 2.0, and -- using GPT/GPT-2/Transformer-XL and XLNet for conditional language generation. +- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*) +- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*) +- `run_generation.py`: an example using GPT, GPT-2, Transformer-XL and XLNet for conditional language generation +- other model-specific examples (see the documentation). 
Here are three quick usage examples for these scripts: -### Fine-tuning for sequence classification: GLUE tasks examples +### `run_glue.py`: Fine-tuning on GLUE tasks for sequence classification The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems. @@ -302,7 +303,7 @@ Training with these hyper-parameters gave us the following results: loss = 0.07231863956341798 ``` -### Fine-tuning for question-answering: SQuAD example +### `run_squad.py`: Fine-tuning on SQuAD for question-answering This example code fine-tunes BERT on the SQuAD dataset using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD: @@ -333,7 +334,7 @@ python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncase This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`. -### Conditional generation: Text generation with GPT, GPT-2, Transformer-XL and XLNet +### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet A conditional generation script is also included to generate text from a prompt. The generation script include the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer). @@ -347,10 +348,6 @@ python ./examples/run_glue.py \ --model_name_or_path=gpt2 \ ``` -## Documentation - -The full documentation is available at https://huggingface.co/pytorch-transformers/. 
- ## Migrating from pytorch-pretrained-bert to pytorch-transformers Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` diff --git a/docs/source/converting_tensorflow_models.rst b/docs/source/converting_tensorflow_models.rst index afcacc00a0..932037c268 100644 --- a/docs/source/converting_tensorflow_models.rst +++ b/docs/source/converting_tensorflow_models.rst @@ -1,4 +1,4 @@ -Converting Tensorflow Models +Converting Tensorflow Checkpoints ================================================ A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch dump of the ``BertForPreTraining`` class (for BERT) or NumPy checkpoint in a PyTorch dump of the ``OpenAIGPTModel`` class (for OpenAI GPT). diff --git a/docs/source/index.rst b/docs/source/index.rst index aedb231163..be8cfc2a39 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,14 +1,24 @@ Pytorch-Transformers ================================================================================================================================================ +PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP). + +The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models: + +1. `BERT `_ (from Google) released with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding `_ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. +2. `GPT `_ (from OpenAI) released with the paper `Improving Language Understanding by Generative Pre-Training `_ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +3. `GPT-2 `_ (from OpenAI) released with the paper `Language Models are Unsupervised Multitask Learners `_ by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +4. 
`Transformer-XL `_ (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context `_ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. +5. `XLNet `_ (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive Pretraining for Language Understanding `_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +6. `XLM `_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining `_ by Guillaume Lample and Alexis Conneau. .. toctree:: :maxdepth: 2 :caption: Notes installation - philosophy - usage + quickstart + pretrained_models examples notebooks converting_tensorflow_models @@ -28,263 +38,3 @@ Pytorch-Transformers model_doc/gpt2 model_doc/xlm model_doc/xlnet - - -.. image:: https://circleci.com/gh/huggingface/pytorch-pretrained-BERT.svg?style=svg - :target: https://circleci.com/gh/huggingface/pytorch-pretrained-BERT - :alt: CircleCI - - -This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for: - - -* `Google's BERT model `__\ , -* `OpenAI's GPT model `__\ , -* `Google/CMU's Transformer-XL model `__\ , and -* `OpenAI's GPT-2 model `__. - -These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <./examples.html>`__ section. - -Here are some information on these models: - -**BERT** was released together with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding `__ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 
-This PyTorch implementation of BERT is provided with `Google's pre-trained models `__\ , examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided. - -**OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training `__ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. -This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace `__ and is provided with `OpenAI's pre-trained model `__ and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch. - -**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context `__ by Zihang Dai\*, Zhilin Yang\* , Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. -This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation `__ which has been slightly modified to match the performances of the TensorFlow implementation and allow to re-use the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints in PyTorch models. - -**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners `__ by Alec Radford\*, Jeffrey Wu\* , Rewon Child, David Luan, Dario Amodei\*\* and Ilya Sutskever\*\*. -This PyTorch implementation of OpenAI GPT-2 is an adaptation of the `OpenAI's implementation `__ and is provided with `OpenAI's pre-trained model `__ and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch. - -**Facebook Research's XLM** was released together with the paper `Cross-lingual Language Model Pretraining `__ by Guillaume Lample and Alexis Conneau. -This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation `__. 
- -**Google's XLNet** was released together with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding `__ by Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le. -This PyTorch implementation of XLM is an adaptation of the `Tensorflow implementation `__. - - -Content -------- - -.. list-table:: - :header-rows: 1 - - * - Section - - Description - * - `Installation <./installation.html>`__ - - How to install the package - * - `Philosphy <./philosophy.html>`__ - - The philosophy behind this package - * - `Usage <./usage.html>`__ - - Quickstart examples - * - `Examples <./examples.html>`__ - - Detailed examples on how to fine-tune Bert - * - `Notebooks <./notebooks.html>`__ - - Introduction on the provided Jupyter Notebooks - * - `TPU <./tpu.html>`__ - - Notes on TPU support and pretraining scripts - * - `Command-line interface <./cli.html>`__ - - Convert a TensorFlow checkpoint in a PyTorch dump - * - `Migration <./migration.html>`__ - - Migrating from ``pytorch_pretrained_BERT`` (v0.6) to ``pytorch_transformers`` (v1.0) - * - `Bertology <./bertology.html>`__ - - Exploring the internals of the pretrained models. - * - `TorchScript <./torchscript.html>`__ - - Convert a model to TorchScript for use in other programming languages - -.. 
list-table:: - :header-rows: 1 - - * - Section - - Description - * - `Overview <./model_doc/overview.html>`__ - - Overview of the package - * - `BERT <./model_doc/bert.html>`__ - - BERT Models, Tokenizers and optimizers - * - `OpenAI GPT <./model_doc/gpt.html>`__ - - GPT Models, Tokenizers and optimizers - * - `TransformerXL <./model_doc/transformerxl.html>`__ - - TransformerXL Models, Tokenizers and optimizers - * - `OpenAI GPT2 <./model_doc/gpt2.html>`__ - - GPT2 Models, Tokenizers and optimizers - * - `XLM <./model_doc/xlm.html>`__ - - XLM Models, Tokenizers and optimizers - * - `XLNet <./model_doc/xlnet.html>`__ - - XLNet Models, Tokenizers and optimizers - -Overview --------- - -This package comprises the following classes that can be imported in Python and are detailed in the `documentation <./model_doc/overview.html>`__ section of this package: - - -* - Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`__ file): - - - * `BertModel <./model_doc/bert.html#pytorch_transformers.BertModel>`__ - raw BERT Transformer model (\ **fully pre-trained**\ ), - * `BertForMaskedLM <./model_doc/bert.html#pytorch_transformers.BertForMaskedLM>`__ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ), - * `BertForNextSentencePrediction <./model_doc/bert.html#pytorch_transformers.BertForNextSentencePrediction>`__ - BERT Transformer with the pre-trained next sentence prediction classifier on top (\ **fully pre-trained**\ ), - * `BertForPreTraining <./model_doc/bert.html#pytorch_transformers.BertForPreTraining>`__ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ), - * `BertForSequenceClassification <./model_doc/bert.html#pytorch_transformers.BertForSequenceClassification>`__ - BERT Transformer with a sequence classification head on top (BERT 
Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ), - * `BertForMultipleChoice <./model_doc/bert.html#pytorch_transformers.BertForMultipleChoice>`__ - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ), - * `BertForTokenClassification <./model_doc/bert.html#pytorch_transformers.BertForTokenClassification>`__ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ), - * `BertForQuestionAnswering <./model_doc/bert.html#pytorch_transformers.BertForQuestionAnswering>`__ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ). - -* - Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`__ file): - - - * `OpenAIGPTModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTModel>`__ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ), - * `OpenAIGPTLMHeadModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTLMHeadModel>`__ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ), - * `OpenAIGPTDoubleHeadsModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTDoubleHeadsModel>`__ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ), - -* - Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py 
<./_modules/pytorch_transformers/modeling_transfo_xl.html>`__ file): - - - * `TransfoXLModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLModel>`__ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ), - * `TransfoXLLMHeadModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLLMHeadModel>`__ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ), - -* - Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <./_modules/pytorch_transformers/modeling_gpt2.html>`__ file): - - - * `GPT2Model <./model_doc/gpt2.html#pytorch_transformers.GPT2Model>`__ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ), - * `GPT2LMHeadModel <./model_doc/gpt2.html#pytorch_transformers.GPT2LMHeadModel>`__ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ), - * `GPT2DoubleHeadsModel <./model_doc/gpt2.html#pytorch_transformers.GPT2DoubleHeadsModel>`__ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ), - -* - Four **XLM** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`__ file): - - - * `XLMModel <./model_doc/xlm.html#pytorch_transformers.XLMModel>`__ - raw XLM Transformer model (\ **fully pre-trained**\ ), - * `XLMWithLMHeadModel <./model_doc/xlm.html#pytorch_transformers.XLMWithLMHeadModel>`__ - XLM Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ), - * `XLMForSequenceClassification <./model_doc/xlm.html#pytorch_transformers.XLMForSequenceClassification>`__ - XLM 
Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ), - * `XLMForQuestionAnswering <./model_doc/xlm.html#pytorch_transformers.XLMForQuestionAnswering>`__ - XLM Transformer with a token classification head on top (XLM Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ) - -* - Four **XLNet** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`__ file): - - - * `XLNetModel <./model_doc/xlnet.html#pytorch_transformers.XLNetModel>`__ - raw XLNet Transformer model (\ **fully pre-trained**\ ), - * `XLNetLMHeadModel <./model_doc/xlnet.html#pytorch_transformers.XLNetLMHeadModel>`__ - XLNet Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ), - * `XLNetForSequenceClassification <./model_doc/xlnet.html#pytorch_transformers.XLNetForSequenceClassification>`__ - XLNet Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ), - * `XLNetForQuestionAnswering <./model_doc/xlnet.html#pytorch_transformers.XLNetForQuestionAnswering>`__ - XLNet Transformer with a token classification head on top (XLNet Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ) - - -TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers because I don't know the current philosophy behind them. 
- -* - Tokenizers for **BERT** (using word-piece) (in the `tokenization_bert.py <./_modules/pytorch_transformers/tokenization_bert.html>`__ file): - - * ``BasicTokenizer`` - basic tokenization (punctuation splitting, lower casing, etc.), - * ``WordpieceTokenizer`` - WordPiece tokenization, - * ``BertTokenizer`` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization. - - -* - Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <./_modules/pytorch_transformers/tokenization_openai.html>`__ file): - - * ``OpenAIGPTTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization. - - -* - Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <./_modules/pytorch_transformers/tokenization_gpt2.html>`__ file): - - * ``GPT2Tokenizer`` - perform byte-level Byte-Pair-Encoding (BPE) tokenization. - - -* - Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <./_modules/pytorch_transformers/tokenization_transfo_xl.html>`__ file): - - * ``OpenAIGPTTokenizer`` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax. - - -* - Tokenizer for **XLNet** (SentencePiece based tokenizer) (in the `tokenization_xlnet.py <./_modules/pytorch_transformers/tokenization_xlnet.html>`__ file): - - * ``XLNetTokenizer`` - perform SentencePiece tokenization. - - -* - Tokenizer for **XLM** (using Byte-Pair-Encoding) (in the `tokenization_xlm.py <./_modules/pytorch_transformers/tokenization_xlm.html>`__ file): - - * ``GPT2Tokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization. - - -* - Optimizer (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`__ file): - - - * ``AdamW`` - Version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate. 
- - -* - Configuration classes for BERT, OpenAI GPT, Transformer-XL, XLM and XLNet (in the respective \ - `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`__\ , \ - `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`__\ , \ - `modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`__, \ - `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`__, \ - `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`__ \ - files): - - - * ``BertConfig`` - Configuration class to store the configuration of a ``BertModel`` with utilities to read and write from JSON configuration files. - * ``OpenAIGPTConfig`` - Configuration class to store the configuration of a ``OpenAIGPTModel`` with utilities to read and write from JSON configuration files. - * ``GPT2Config`` - Configuration class to store the configuration of a ``GPT2Model`` with utilities to read and write from JSON configuration files. - * ``TransfoXLConfig`` - Configuration class to store the configuration of a ``TransfoXLModel`` with utilities to read and write from JSON configuration files. - * ``XLMConfig`` - Configuration class to store the configuration of a ``XLMModel`` with utilities to read and write from JSON configuration files. - * ``XLNetConfig`` - Configuration class to store the configuration of a ``XLNetModel`` with utilities to read and write from JSON configuration files. - -The repository further comprises: - - -* - Five examples on how to use **BERT** (in the `examples folder `__\ ): - - - * `run_bert_extract_features.py `__ - Show how to extract hidden states from an instance of ``BertModel``\ , - * `run_bert_classifier.py `__ - Show how to fine-tune an instance of ``BertForSequenceClassification`` on GLUE's MRPC task, - * `run_bert_squad.py `__ - Show how to fine-tune an instance of ``BertForQuestionAnswering`` on SQuAD v1.0 and SQuAD v2.0 tasks. 
- * `run_swag.py `__ - Show how to fine-tune an instance of ``BertForMultipleChoice`` on Swag task. - * `simple_lm_finetuning.py `__ - Show how to fine-tune an instance of ``BertForPretraining`` on a target text corpus. - -* - One example on how to use **OpenAI GPT** (in the `examples folder `__\ ): - - - * `run_openai_gpt.py `__ - Show how to fine-tune an instance of ``OpenGPTDoubleHeadsModel`` on the RocStories task. - -* - One example on how to use **Transformer-XL** (in the `examples folder `__\ ): - - - * `run_transfo_xl.py `__ - Show how to load and evaluate a pre-trained model of ``TransfoXLLMHeadModel`` on WikiText 103. - -* - One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the `examples folder `__\ ): - - - * `run_gpt2.py `__ - Show how to use OpenAI GPT-2 an instance of ``GPT2LMHeadModel`` to generate text (same as the original OpenAI GPT-2 examples). - - These examples are detailed in the `Examples <#examples>`__ section of this readme. - -* - Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the `notebooks folder `__\ ): - - - * `Comparing-TF-and-PT-models.ipynb `__ - Compare the hidden states predicted by ``BertModel``\ , - * `Comparing-TF-and-PT-models-SQuAD.ipynb `__ - Compare the spans predicted by ``BertForQuestionAnswering`` instances, - * `Comparing-TF-and-PT-models-MLM-NSP.ipynb `__ - Compare the predictions of the ``BertForPretraining`` instances. - - These notebooks are detailed in the `Notebooks <#notebooks>`__ section of this readme. - - -* - A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoint (OpenAI) in a PyTorch save of the associated PyTorch model: - - This CLI is detailed in the `Command-line interface <#Command-line-interface>`__ section of this readme. 
diff --git a/docs/source/installation.rst b/docs/source/installation.rst index 054d7a1323..f8beb9f1c8 100644 --- a/docs/source/installation.rst +++ b/docs/source/installation.rst @@ -6,11 +6,41 @@ This repo was tested on Python 2.7 and 3.5+ (examples are tested only on python With pip ^^^^^^^^ -PyTorch pretrained bert can be installed by pip as follows: +PyTorch-Transformers can be installed with pip as follows: .. code-block:: bash - pip install pytorch-pretrained-bert + pip install pytorch-transformers + +From source +^^^^^^^^^^^ + +Clone the repository and install locally: + +.. code-block:: bash + + git clone https://github.com/huggingface/pytorch-transformers.git + cd pytorch-transformers + pip install [--editable] . + + +Tests +^^^^^ + +An extensive test suite is included for the library and the example scripts. Library tests can be found in the `tests folder `_ and examples tests in the `examples folder `_. + +These tests can be run using ``pytest`` (install pytest if needed with ``pip install pytest``). + +You can run the tests from the root of the cloned repository with the commands: + +.. code-block:: bash + + python -m pytest -sv ./pytorch_transformers/tests/ + python -m pytest -sv ./examples/ + + +OpenAI GPT original tokenization workflow +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy`` : @@ -20,29 +50,3 @@ If you want to reproduce the original tokenization process of the ``OpenAI GPT`` python -m spacy download en If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry). - -From source -^^^^^^^^^^^ - -Clone the repository and run: - -.. code-block:: bash - - pip install [--editable] . 
- -Here also, if you want to reproduce the original tokenization process of the ``OpenAI GPT`` model, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy`` : - -.. code-block:: bash - - pip install spacy ftfy==4.4.3 - python -m spacy download en - -Again, if you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage). - -A series of tests is included in the `tests folder `_ and can be run using ``pytest`` (install pytest if needed: ``pip install pytest``\ ). - -You can run the tests with the command: - -.. code-block:: bash - - python -m pytest -sv tests/ diff --git a/docs/source/migration.md b/docs/source/migration.md index 9165365fa8..440766e42e 100644 --- a/docs/source/migration.md +++ b/docs/source/migration.md @@ -1 +1,96 @@ -# Migration \ No newline at end of file +# Migrating from pytorch-pretrained-bert + + +Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `pytorch-transformers`. + +### Models always output `tuples` + +The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the models' forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters. + +The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/). + +In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`. 
+ +Here is a `pytorch-pretrained-bert` to `pytorch-transformers` conversion example for a `BertForSequenceClassification` classification model: + +```python +# Let's load our model +model = BertForSequenceClassification.from_pretrained('bert-base-uncased') + +# If you used to have this line in pytorch-pretrained-bert: +loss = model(input_ids, labels=labels) + +# Now just use this line in pytorch-transformers to extract the loss from the output tuple: +outputs = model(input_ids, labels=labels) +loss = outputs[0] + +# In pytorch-transformers you can also have access to the logits: +loss, logits = outputs[:2] + +# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation) +model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True) +outputs = model(input_ids, labels=labels) +loss, logits, attentions = outputs +``` + +### Serialization + +While not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before. 
+ +Here is an example: + +```python +### Let's load a model and tokenizer +model = BertForSequenceClassification.from_pretrained('bert-base-uncased') +tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + +### Do some stuff to our model and tokenizer +# Ex: add new tokens to the vocabulary and embeddings of our model +tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]']) +model.resize_token_embeddings(len(tokenizer)) +# Train our model +train(model) + +### Now let's save our model and tokenizer to a directory +model.save_pretrained('./my_saved_model_directory/') +tokenizer.save_pretrained('./my_saved_model_directory/') + +### Reload the model and the tokenizer +model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/') +tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/') +``` + +### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules + +The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer. +The new optimizer `AdamW` matches PyTorch `Adam` optimizer API. + +The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore. 
+ +Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` and the same schedule: + +```python +# Parameters: +lr = 1e-3 +num_total_steps = 1000 +num_warmup_steps = 100 +warmup_proportion = float(num_warmup_steps) / float(num_total_steps) # 0.1 + +### Previously BertAdam optimizer was instantiated like this: +optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_total_steps) +### and used like this: +for batch in train_data: + loss = model(batch) + loss.backward() + optimizer.step() + +### In PyTorch-Transformers, optimizer and schedules are split and instantiated like this: +optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False) # To reproduce BertAdam specific behavior set correct_bias=False +scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps) # PyTorch scheduler +### and used like this: +for batch in train_data: + loss = model(batch) + loss.backward() + scheduler.step() + optimizer.step() +``` diff --git a/docs/source/philosophy.md b/docs/source/philosophy.md deleted file mode 100644 index 78c4f0309f..0000000000 --- a/docs/source/philosophy.md +++ /dev/null @@ -1 +0,0 @@ -# Philosophy \ No newline at end of file diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst new file mode 100644 index 0000000000..2d72977951 --- /dev/null +++ b/docs/source/pretrained_models.rst @@ -0,0 +1,59 @@ +Pretrained models +================================================ + +Here is the full list of the currently provided pretrained models together with a short presentation of each model. 
+ ++===============+============================================================+===========================+ +| Architecture | Shortcut name | Details of the model | ++===============+============================================================+===========================+ +| | ``bert-base-uncased`` | 12-layer, 768-hidden, 12-heads, 110M parameters +| | | Trained on lower-cased English text | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-large-uncased`` | 24-layer, 1024-hidden, 16-heads, 340M parameters +| | | Trained on lower-cased English text | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-base-cased`` | 12-layer, 768-hidden, 12-heads, 110M parameters +| | | Trained on cased English text | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-large-cased`` | 24-layer, 1024-hidden, 16-heads, 340M parameters | +| | | Trained on cased English text | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-base-multilingual-uncased`` | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters +| | | Trained on lower-cased text in the top 102 languages with the largest Wikipedias +| | | (see `details `_) | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-base-multilingual-cased`` | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters | +| | | Trained on cased text in the top 104 languages with the largest Wikipedias +| | | (see `details `_) | +| +------------------------------------------------------------+---------------------------+ +| BERT | ``bert-base-chinese`` | 12-layer, 768-hidden, 12-heads, 110M parameters | +| | | Trained on cased Chinese Simplified and Traditional text | +| 
+------------------------------------------------------------+---------------------------+ +| | ``bert-base-german-cased`` | 12-layer, 768-hidden, 12-heads, 110M parameters | +| | | Trained on cased German text by Deepset.ai | +| | | (see `details on deepset.ai website `_) | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-large-uncased-whole-word-masking`` | 24-layer, 1024-hidden, 16-heads, 340M parameters | +| | | Trained on lower-cased English text using Whole-Word-Masking | +| | | (see `details `_) | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-large-cased-whole-word-masking`` | 24-layer, 1024-hidden, 16-heads, 340M parameters | +| | | Trained on cased English text using Whole-Word-Masking | +| | | (see `details `_) | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-large-uncased-whole-word-masking-finetuned-squad`` | 24-layer, 1024-hidden, 16-heads, 340M parameters | +| | | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD | +| | | (see details of fine-tuning in the `example section`_) | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-large-cased-whole-word-masking-finetuned-squad`` | 24-layer, 1024-hidden, 16-heads, 340M parameters | +| | | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD | +| | | (see `details of fine-tuning in the example section `_) | +| +------------------------------------------------------------+---------------------------+ +| | ``bert-base-cased-finetuned-mrpc`` | 12-layer, 768-hidden, 12-heads, 110M parameters | +| | | The ``bert-base-cased`` model fine-tuned on MRPC | +| | | (see `details of fine-tuning in the example section `_) | ++---------------+------------------------------------------------------------+---------------------------+ +| GPT | Cells may span 
columns. | ++---------------+----------------------------------------------------------------------------------------+ + +.. `_ \ No newline at end of file diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md new file mode 100644 index 0000000000..7414ef48c1 --- /dev/null +++ b/docs/source/quickstart.md @@ -0,0 +1,146 @@ +# Quickstart + +## Main concepts + + +## Quick tour: Usage + +Here are two quick-start examples showcasing a few `Bert` and `GPT2` classes and pre-trained models. + +See the package reference for examples for each model class. + +### BERT example + +First let's prepare a tokenized input from a text string using `BertTokenizer` + +```python +import torch +from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM + +# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows +import logging +logging.basicConfig(level=logging.INFO) + +# Load pre-trained model tokenizer (vocabulary) +tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + +# Tokenize input +text = "[CLS] Who was Jim Henson ? 
[SEP] Jim Henson was a puppeteer [SEP]" +tokenized_text = tokenizer.tokenize(text) + +# Mask a token that we will try to predict back with `BertForMaskedLM` +masked_index = 8 +tokenized_text[masked_index] = '[MASK]' +assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]'] + +# Convert tokens to vocabulary indices +indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text) +# Define sentence A and B indices associated to 1st and 2nd sentences (see paper) +segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1] + +# Convert inputs to PyTorch tensors +tokens_tensor = torch.tensor([indexed_tokens]) +segments_tensors = torch.tensor([segments_ids]) +``` + +Let's see how we can use `BertModel` to encode our inputs in hidden-states: + +```python +# Load pre-trained model (weights) +model = BertModel.from_pretrained('bert-base-uncased') + +# Set the model in evaluation mode to deactivate the DropOut modules +# This is IMPORTANT to have reproducible results during evaluation! +model.eval() + +# If you have a GPU, put everything on cuda +tokens_tensor = tokens_tensor.to('cuda') +segments_tensors = segments_tensors.to('cuda') +model.to('cuda') + +# Predict hidden states features for each layer +with torch.no_grad(): + # See the models docstrings for the detail of the inputs + outputs = model(tokens_tensor, token_type_ids=segments_tensors) + # PyTorch-Transformers models always output tuples. 
+ # See the models docstrings for the detail of all the outputs + # In our case, the first element is the hidden state of the last layer of the Bert model + encoded_layers = outputs[0] +# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension) +assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size) +``` + +And how to use `BertForMaskedLM` to predict a masked token: + +```python +# Load pre-trained model (weights) +model = BertForMaskedLM.from_pretrained('bert-base-uncased') +model.eval() + +# If you have a GPU, put everything on cuda +tokens_tensor = tokens_tensor.to('cuda') +segments_tensors = segments_tensors.to('cuda') +model.to('cuda') + +# Predict all tokens +with torch.no_grad(): + outputs = model(tokens_tensor, token_type_ids=segments_tensors) + predictions = outputs[0] + +# confirm we were able to predict 'henson' +predicted_index = torch.argmax(predictions[0, masked_index]).item() +predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0] +assert predicted_token == 'henson' +``` + +### OpenAI GPT-2 + +Here is a quick-start example using the `GPT2Tokenizer` and `GPT2LMHeadModel` classes with OpenAI's pre-trained model to predict the next token from a text prompt. + +First let's prepare a tokenized input from our text string using `GPT2Tokenizer` + +```python +import torch +from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel + +# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows +import logging +logging.basicConfig(level=logging.INFO) + +# Load pre-trained model tokenizer (vocabulary) +tokenizer = GPT2Tokenizer.from_pretrained('gpt2') + +# Encode a text input +text = "Who was Jim Henson ? 
Jim Henson was a" +indexed_tokens = tokenizer.encode(text) + +# Convert indexed tokens into a PyTorch tensor +tokens_tensor = torch.tensor([indexed_tokens]) +``` + +Let's see how to use `GPT2LMHeadModel` to generate the next token following our text: + +```python +# Load pre-trained model (weights) +model = GPT2LMHeadModel.from_pretrained('gpt2') + +# Set the model in evaluation mode to deactivate the DropOut modules +# This is IMPORTANT to have reproducible results during evaluation! +model.eval() + +# If you have a GPU, put everything on cuda +tokens_tensor = tokens_tensor.to('cuda') +model.to('cuda') + +# Predict all tokens +with torch.no_grad(): + outputs = model(tokens_tensor) + predictions = outputs[0] + +# get the predicted next sub-word (in our case, the word 'man') +predicted_index = torch.argmax(predictions[0, -1, :]).item() +predicted_text = tokenizer.decode(indexed_tokens + [predicted_index]) +assert predicted_text == 'Who was Jim Henson? Jim Henson was a man' +``` + +Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation). diff --git a/docs/source/usage.rst b/docs/source/usage.rst deleted file mode 100644 index 9956f3ac84..0000000000 --- a/docs/source/usage.rst +++ /dev/null @@ -1,339 +0,0 @@ -Usage -================================================ - -BERT -^^^^ - -Here is a quick-start example using ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` class with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <./model_doc/overview.html>`_ below for all the details on these classes. - -First let's prepare a tokenized input with ``BertTokenizer`` - -.. 
code-block:: python - - import torch - from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM - - # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows - import logging - logging.basicConfig(level=logging.INFO) - - # Load pre-trained model tokenizer (vocabulary) - tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') - - # Tokenized input - text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]" - tokenized_text = tokenizer.tokenize(text) - - # Mask a token that we will try to predict back with `BertForMaskedLM` - masked_index = 8 - tokenized_text[masked_index] = '[MASK]' - assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]'] - - # Convert token to vocabulary indices - indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text) - # Define sentence A and B indices associated to 1st and 2nd sentences (see paper) - segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1] - - # Convert inputs to PyTorch tensors - tokens_tensor = torch.tensor([indexed_tokens]) - segments_tensors = torch.tensor([segments_ids]) - -Let's see how to use ``BertModel`` to get hidden states - -.. code-block:: python - - # Load pre-trained model (weights) - model = BertModel.from_pretrained('bert-base-uncased') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor = tokens_tensor.to('cuda') - segments_tensors = segments_tensors.to('cuda') - model.to('cuda') - - # Predict hidden states features for each layer - with torch.no_grad(): - encoded_layers, _ = model(tokens_tensor, segments_tensors) - # We have a hidden states for each of the 12 layers in model bert-base-uncased - assert len(encoded_layers) == 12 - -And how to use ``BertForMaskedLM`` - -.. 
code-block:: python - - # Load pre-trained model (weights) - model = BertForMaskedLM.from_pretrained('bert-base-uncased') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor = tokens_tensor.to('cuda') - segments_tensors = segments_tensors.to('cuda') - model.to('cuda') - - # Predict all tokens - with torch.no_grad(): - predictions = model(tokens_tensor, segments_tensors) - - # confirm we were able to predict 'henson' - predicted_index = torch.argmax(predictions[0, masked_index]).item() - predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0] - assert predicted_token == 'henson' - -OpenAI GPT -^^^^^^^^^^ - -Here is a quick-start example using ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes. - -First let's prepare a tokenized input with ``OpenAIGPTTokenizer`` - -.. code-block:: python - - import torch - from pytorch_transformers import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel - - # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows - import logging - logging.basicConfig(level=logging.INFO) - - # Load pre-trained model tokenizer (vocabulary) - tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') - - # Tokenized input - text = "Who was Jim Henson ? Jim Henson was a puppeteer" - tokenized_text = tokenizer.tokenize(text) - - # Convert token to vocabulary indices - indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text) - - # Convert inputs to PyTorch tensors - tokens_tensor = torch.tensor([indexed_tokens]) - -Let's see how to use ``OpenAIGPTModel`` to get hidden states - -.. 
code-block:: python - - # Load pre-trained model (weights) - model = OpenAIGPTModel.from_pretrained('openai-gpt') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor = tokens_tensor.to('cuda') - model.to('cuda') - - # Predict hidden states features for each layer - with torch.no_grad(): - hidden_states = model(tokens_tensor) - -And how to use ``OpenAIGPTLMHeadModel`` - -.. code-block:: python - - # Load pre-trained model (weights) - model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor = tokens_tensor.to('cuda') - model.to('cuda') - - # Predict all tokens - with torch.no_grad(): - predictions = model(tokens_tensor) - - # get the predicted last token - predicted_index = torch.argmax(predictions[0, -1, :]).item() - predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0] - assert predicted_token == '.' - -And how to use ``OpenAIGPTDoubleHeadsModel`` - -.. code-block:: python - - # Load pre-trained model (weights) - model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt') - model.eval() - - # Prepare tokenized input - text1 = "Who was Jim Henson ? Jim Henson was a puppeteer" - text2 = "Who was Jim Henson ? 
Jim Henson was a mysterious young man" - tokenized_text1 = tokenizer.tokenize(text1) - tokenized_text2 = tokenizer.tokenize(text2) - indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1) - indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2) - tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]]) - mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]]) - - # Predict hidden states features for each layer - with torch.no_grad(): - lm_logits, multiple_choice_logits = model(tokens_tensor, mc_token_ids) - -Transformer-XL -^^^^^^^^^^^^^^ - -Here is a quick-start example using ``TransfoXLTokenizer``\ , ``TransfoXLModel`` and ``TransfoXLModelLMHeadModel`` class with the Transformer-XL model pre-trained on WikiText-103. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes. - -First let's prepare a tokenized input with ``TransfoXLTokenizer`` - -.. code-block:: python - - import torch - from pytorch_transformers import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel - - # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows - import logging - logging.basicConfig(level=logging.INFO) - - # Load pre-trained model tokenizer (vocabulary from wikitext 103) - tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103') - - # Tokenized input - text_1 = "Who was Jim Henson ?" - text_2 = "Jim Henson was a puppeteer" - tokenized_text_1 = tokenizer.tokenize(text_1) - tokenized_text_2 = tokenizer.tokenize(text_2) - - # Convert token to vocabulary indices - indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1) - indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2) - - # Convert inputs to PyTorch tensors - tokens_tensor_1 = torch.tensor([indexed_tokens_1]) - tokens_tensor_2 = torch.tensor([indexed_tokens_2]) - -Let's see how to use ``TransfoXLModel`` to get hidden states - -.. 
code-block:: python - - # Load pre-trained model (weights) - model = TransfoXLModel.from_pretrained('transfo-xl-wt103') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor_1 = tokens_tensor_1.to('cuda') - tokens_tensor_2 = tokens_tensor_2.to('cuda') - model.to('cuda') - - with torch.no_grad(): - # Predict hidden states features for each layer - hidden_states_1, mems_1 = model(tokens_tensor_1) - # We can re-use the memory cells in a subsequent call to attend a longer context - hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1) - -And how to use ``TransfoXLLMHeadModel`` - -.. code-block:: python - - # Load pre-trained model (weights) - model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor_1 = tokens_tensor_1.to('cuda') - tokens_tensor_2 = tokens_tensor_2.to('cuda') - model.to('cuda') - - with torch.no_grad(): - # Predict all tokens - predictions_1, mems_1 = model(tokens_tensor_1) - # We can re-use the memory cells in a subsequent call to attend a longer context - predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1) - - # get the predicted last token - predicted_index = torch.argmax(predictions_2[0, -1, :]).item() - predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0] - assert predicted_token == 'who' - -OpenAI GPT-2 -^^^^^^^^^^^^ - -Here is a quick-start example using ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes. - -First let's prepare a tokenized input with ``GPT2Tokenizer`` - -.. 
code-block:: python - - import torch - from pytorch_transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel - - # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows - import logging - logging.basicConfig(level=logging.INFO) - - # Load pre-trained model tokenizer (vocabulary) - tokenizer = GPT2Tokenizer.from_pretrained('gpt2') - - # Encode some inputs - text_1 = "Who was Jim Henson ?" - text_2 = "Jim Henson was a puppeteer" - indexed_tokens_1 = tokenizer.encode(text_1) - indexed_tokens_2 = tokenizer.encode(text_2) - - # Convert inputs to PyTorch tensors - tokens_tensor_1 = torch.tensor([indexed_tokens_1]) - tokens_tensor_2 = torch.tensor([indexed_tokens_2]) - -Let's see how to use ``GPT2Model`` to get hidden states - -.. code-block:: python - - # Load pre-trained model (weights) - model = GPT2Model.from_pretrained('gpt2') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor_1 = tokens_tensor_1.to('cuda') - tokens_tensor_2 = tokens_tensor_2.to('cuda') - model.to('cuda') - - # Predict hidden states features for each layer - with torch.no_grad(): - hidden_states_1, past = model(tokens_tensor_1) - # past can be used to reuse precomputed hidden state in a subsequent predictions - # (see beam-search examples in the run_gpt2.py example). - hidden_states_2, past = model(tokens_tensor_2, past=past) - -And how to use ``GPT2LMHeadModel`` - -.. code-block:: python - - # Load pre-trained model (weights) - model = GPT2LMHeadModel.from_pretrained('gpt2') - model.eval() - - # If you have a GPU, put everything on cuda - tokens_tensor_1 = tokens_tensor_1.to('cuda') - tokens_tensor_2 = tokens_tensor_2.to('cuda') - model.to('cuda') - - # Predict all tokens - with torch.no_grad(): - predictions_1, past = model(tokens_tensor_1) - # past can be used to reuse precomputed hidden state in a subsequent predictions - # (see beam-search examples in the run_gpt2.py example). 
- predictions_2, past = model(tokens_tensor_2, past=past) - - # get the predicted last token - predicted_index = torch.argmax(predictions_2[0, -1, :]).item() - predicted_token = tokenizer.decode([predicted_index]) - -And how to use ``GPT2DoubleHeadsModel`` - -.. code-block:: python - - # Load pre-trained model (weights) - model = GPT2DoubleHeadsModel.from_pretrained('gpt2') - model.eval() - - # Prepare tokenized input - text1 = "Who was Jim Henson ? Jim Henson was a puppeteer" - text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man" - tokenized_text1 = tokenizer.tokenize(text1) - tokenized_text2 = tokenizer.tokenize(text2) - indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1) - indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2) - tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]]) - mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]]) - - # Predict hidden states features for each layer - with torch.no_grad(): - lm_logits, multiple_choice_logits, past = model(tokens_tensor, mc_token_ids)
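
The migration notes in `docs/source/migration.md` describe moving from `BertAdam`'s built-in `warmup_linear` schedule to a separate scheduler object with `warmup_steps` and `t_total`. As a rough illustration of what a "linear warmup and decay" schedule computes, here is a minimal pure-Python sketch. The helper name `linear_warmup_decay` is hypothetical and is not part of the library; it only shows the assumed shape of the learning-rate multiplier (ramp from 0 to 1 over the warmup steps, then a linear decay back to 0 at `t_total`), and is not a copy of the `WarmupLinearSchedule` implementation.

```python
def linear_warmup_decay(step, warmup_steps, t_total):
    """Hypothetical helper (not a library function).

    Returns the multiplier applied to the base learning rate at `step`,
    assuming a linear ramp up during warmup and a linear decay afterwards.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay phase: go linearly from 1 (end of warmup) to 0 (at t_total),
    # and stay at 0 if training continues past t_total.
    return max(0.0, (t_total - step) / max(1, t_total - warmup_steps))


# Same parameter values as the migration example.
lr = 1e-3
num_total_steps = 1000
num_warmup_steps = 100

for step in (0, 50, 100, 550, 1000):
    multiplier = linear_warmup_decay(step, num_warmup_steps, num_total_steps)
    print(step, lr * multiplier)
```

With these parameters, the multiplier is 0.5 halfway through warmup (step 50), reaches 1.0 at step 100, falls back to 0.5 at step 550 and is 0.0 at step 1000. In a real training loop the scheduler object applies an equivalent multiplier to the optimizer's learning rate each time `scheduler.step()` is called.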