Compare commits

...

54 Commits

Author SHA1 Message Date
Lysandre
7cb203fae4 Release: v2.9.1
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-05-13 17:38:50 -04:00
Sam Shleifer
9a687ebb77 [Marian Fixes] prevent predicting pad_token_id before softmax, support language codes, name multilingual models (#4290) 2020-05-13 17:29:41 -04:00
Patrick von Platen
839bfaedb2 [Docs, Notebook] Include generation pipeline (#4295)
* add first text for generation

* add generation pipeline to usage

* Created using Colaboratory

* correct docstring

* finish
2020-05-13 14:24:08 -04:00
Elyes Manai
2d184cb553 wrong variable name used (#4328) 2020-05-13 10:22:03 -04:00
Julien Plu
ca13618681 Question Answering for TF trainer (#4320)
* Add QA trainer example for TF

* Make data_dir optional

* Fix parameter logic

* Fix feature convert

* Update the READMEs to add the question-answering task

* Apply style

* Change 'sequence-classification' to 'text-classification' and prefix with 'eval' all the metric names

* Apply style

* Apply style
2020-05-13 09:22:31 -04:00
Denis
1e51bb717c Fix for #3865. PretrainedTokenizer mapped " do not" into " don't" when .decode(...) is called. Removed the " do not" --> " don't" mapping from clean_up_tokenization(...). (#4024) 2020-05-13 14:32:57 +02:00
Julien Chaumond
241759101e (v2) Improvements to the wandb integration (#4324)
* Improvements to the wandb integration

* small reorg + no global necessary

* feat(trainer): log epoch and final metrics

* Simplify logging a bit

* Fixup

* Fix crash when just running eval

Co-authored-by: Chris Van Pelt <vanpelt@gmail.com>
Co-authored-by: Boris Dayma <boris.dayma@gmail.com>
2020-05-12 21:52:01 -04:00
Funtowicz Morgan
7d7fe4997f Allow BatchEncoding to be initialized empty. (#4316)
* Allow BatchEncoding to be initialized empty.

This is required by recent changes introduced in TF 2.2.

* Attempt to unpin Tensorflow to 2.2 with the previous commit.
2020-05-12 15:02:46 -04:00
Savaş Yıldırım
0a97f6312a Update README.md (#4313) 2020-05-12 15:01:45 -04:00
Savaş Yıldırım
15a121fec5 Update README.md (#4315) 2020-05-12 15:01:34 -04:00
Stefan Schweter
15d45211f7 [model_cards]: 🇹🇷 Add new ELECTRA small and base models for Turkish (#4318) 2020-05-12 15:01:17 -04:00
Viktor Alm
8a017cbb5a Add modelcard with acknowledgements (#4321) 2020-05-12 15:00:56 -04:00
Julien Chaumond
4bf5042240 Fix BART tests on GPU (#4298) 2020-05-12 09:11:50 -04:00
Viktor Alm
e4512aab3b Add MultipleChoice to TFTrainer [WIP] (#4270)
* catch gpu len 1 set to gpu0

* Add mpc to trainer

* Add MPC for TF

* fix TF automodel for MPC and add Albert

* Apply style

* Fix import

* Note to self: double check

* Make shape None, None for datasetgenerator output shapes

* Add from_pt bool which doesnt seem to work

* Original checkpoint dir

* Fix docstrings for automodel

* Update readme and apply style

* Colab should probably not be from users

* Colabs should probably not be from users

* Add colab

* Update README.md

* Update README.md

* Cleanup __intit__

* Cleanup flake8 trailing comma

* Update src/transformers/training_args_tf.py

* Update src/transformers/modeling_tf_auto.py

Co-authored-by: Viktor Alm <viktoralm@pop-os.localdomain>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-12 08:48:48 -04:00
Levent Serinol
65be574aec fixed missing torch module import (#4305)
fixed missing torch module import in example usage code
2020-05-12 08:34:17 -04:00
Jangwon Park
31e67dd19f Remove hard-coded pad token id in distilbert and albert (#3965) 2020-05-12 08:32:44 -04:00
Lysandre Debut
30e343862f pin TF to 2.1 (#4297)
* pin TF to 2.1

* Pin flake8 as well
2020-05-11 21:03:30 -04:00
Julien Chaumond
56e8ef632f [ci] Restrict GPU tests to actual code commits 2020-05-11 20:40:41 -04:00
Julien Chaumond
ba6f6e44a8 [ci] Re-enable torch GPU tests 2020-05-12 00:05:36 +00:00
Lysandre Debut
9524956819 Documentation specification (#4294) 2020-05-11 16:43:57 -04:00
Bram Vanroy
61d22f9cc7 Simplify cache vars and allow for TRANSFORMERS_CACHE env (#4226)
* simplify cache vars and allow for TRANSFORMERS_CACHE env

As it currently stands, "TRANSFORMERS_CACHE" is not an accepted variable. It seems that the these variables were not updated when moving from version pytorch_transformers to transformers. In addition, the fallback procedure could be improved. and simplified. Pathlib seems redundant here.

* Update file_utils.py
2020-05-11 15:24:02 -04:00
Lysandre Debut
cd40cb8879 Fix special token doc (#4292) 2020-05-11 15:05:36 -04:00
Tianlei Wu
82601f4c1a Allow gpt2 to be exported to valid ONNX (#4244)
* allow gpt2 to be exported to valid ONNX model

* cast size from int to float explictly
2020-05-11 14:55:55 -04:00
Guo, Quan
39994051e4 Add migrating from pytorch-transformers (#4273)
"Migrating from pytorch-transformers to transformers" is missing in the main document. It is available in the main `readme` thought. Just move it to the document.
2020-05-11 13:35:13 -04:00
Lysandre Debut
051dcb2a07 CamemBERT does not make use of Token Type IDs (#4289) 2020-05-11 13:31:03 -04:00
fgaim
41e8291217 Add ALBERT to the Tensorflow to Pytorch model conversion cli (#3933)
* Add ALBERT to convert command of transformers-cli

* Document ALBERT tf to pytorch model conversion
2020-05-11 13:10:00 -04:00
Stefan Schweter
3f42eb979f Documentation: fix links to NER examples (#4279)
* docs: fix link to token classification (NER) example

* examples: fix links to NER scripts
2020-05-11 12:48:21 -04:00
Funtowicz Morgan
8fdb7997c6 Align sentiment-analysis' tokenizer (currently uncased) to the model (uncased). (#4264) 2020-05-11 12:45:53 -04:00
Sam Shleifer
4658896ee1 [Marian] Fix typo in docstring (#4284) 2020-05-11 11:47:51 -04:00
Levent Serinol
bf64b8cf09 Model card for bert-turkish-question-answering question-answering model (#4281)
* Create README.md

* Update model_cards/lserinol/bert-turkish-question-answering/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-11 11:32:25 -04:00
Julien Plu
94b57bf796 [TF 2.2 compat] use tf.VariableAggregation.ONLY_FIRST_REPLICA (#4283)
* Fix the issue to properly run the accumulator with TF 2.2

* Apply style

* Fix training_args_tf for TF 2.2

* Fix the TF training args when only one GPU is available

* Remove the fixed version of TF in setup.py
2020-05-11 11:28:37 -04:00
Savaş Yıldırım
cffbb3d8ed Update README.md (#4276) 2020-05-11 11:24:41 -04:00
Julien Plu
5f50d619dd Fix XTREME link + add number of eval documents + fix usage code (#4280) 2020-05-11 11:24:10 -04:00
theblackcat102
7751be7cee fix reformer apex scaling issue (#4242) 2020-05-11 16:53:42 +02:00
Patrick von Platen
ac7d5f67a2 [Reformer] Add Enwiki8 Reformer Model - Adapt convert script (#4282)
* adapt convert script

* update convert script

* finish

* fix marian pretrained docs
2020-05-11 16:38:07 +02:00
Patrick von Platen
336116d960 Reformer enwik8 - Model card (#4286) 2020-05-11 16:22:08 +02:00
flozi00
b290c32e16 [docs] fix typo (#4249) 2020-05-10 14:07:08 -04:00
Sam Shleifer
3487be75ef [Marian] documentation and AutoModel support (#4152)
- MarianSentencepieceTokenizer - > MarianTokenizer
- Start using unk token.
- add docs page
- add better generation params to MarianConfig
- more conversion utilities
2020-05-10 13:54:57 -04:00
Girishkumar
9d2f467bfb [README] Corrected some grammatical mistakes (#4199) 2020-05-10 09:02:36 -04:00
Julien Chaumond
7b75aa9fa5 [TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223)
* [TPU] Doc, fix xla_spawn.py, only preprocess dataset once

* Update examples/README.md

* [xla_spawn] Add `_mp_fn` to other Trainer scripts

* [TPU] Fix: eval dataloader was None
2020-05-08 14:10:05 -04:00
Julien Chaumond
274d850d34 Fix #4098 2020-05-08 12:39:46 -04:00
Lorenzo De Mattei
26dad0a9fa example updated to use generation pipeline (#4230)
* example updated to use generation pipeline

* Update model_cards/LorenzoDeMattei/GePpeTto/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-08 09:45:10 -04:00
rmroczkowski
9ebb5b2a54 Model card for allegro/herbert-klej-cased-tokenizer-v1 (#4184) 2020-05-08 09:42:43 -04:00
rmroczkowski
9e54efd004 Model card for allegro/herbert-klej-cased-v1 (#4183) 2020-05-08 09:42:28 -04:00
Manuel Romero
a8b798e6c4 Model card for spanish electra small (#4196) 2020-05-08 09:30:15 -04:00
Savaş Yıldırım
242005d762 Create README.md (#4132)
* Create README.md

* Adding code fence around code block
2020-05-08 09:27:29 -04:00
Manuel Romero
5940c73bbb Create README.md (#4179)
model card for my De Novo Drug discovery model using MLM
2020-05-08 09:25:36 -04:00
Patrick von Platen
cf08830c28 [Pipeline, Generation] tf generation pipeline bug (#4217)
* fix PR

* move tests to correct place
2020-05-08 08:30:05 -04:00
Jared T Nielsen
8bf7312654 Add AlbertForPreTraining and TFAlbertForPreTraining models. (#4057)
* Add AlbertForPreTraining and TFAlbertForPreTraining models.

* PyTorch conversion

* TensorFlow conversion

* style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-05-07 19:44:51 -04:00
Julien Chaumond
c99fe0386b [doc] Fix broken links + remove crazy big notebook 2020-05-07 18:44:18 -04:00
Savaş Yıldırım
66113bd626 Create README.md (#4202) 2020-05-07 18:31:22 -04:00
Julien Chaumond
6669915b65 [examples] Add column for pytorch-lightning support 2020-05-07 15:26:58 -04:00
Julien Chaumond
612fa1b10b Examples readme.md (#4215)
* README

* Update README.md
2020-05-07 15:00:06 -04:00
Lysandre
2e57824374 Pin isort and tf <= 2.1.0 2020-05-07 14:42:00 -04:00
120 changed files with 3810 additions and 550 deletions

View File

@@ -1,9 +1,13 @@
name: Self-hosted runner (push)
on:
# push:
# branches:
# - master
push:
branches:
- master
paths:
- "src/**"
- "tests/**"
- ".github/**"
# pull_request:
repository_dispatch:
@@ -31,8 +35,8 @@ jobs:
- name: Install dependencies
run: |
source .env/bin/activate
pip install .[sklearn,tf,torch,testing]
pip uninstall -y tensorflow
pip install torch==1.4.0
pip install .[sklearn,testing]
- name: Are GPUs recognized by our DL frameworks
run: |

View File

@@ -164,8 +164,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
17. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
20. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
21. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
21. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
22. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
@@ -414,7 +415,7 @@ Training with these hyper-parameters gave us the following results:
This example code fine-tunes BERT on the SQuAD dataset using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
@@ -447,7 +448,7 @@ The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-g
Here is how to run the script with the small version of OpenAI GPT-2 model:
```shell
python ./examples/run_generation.py \
python ./examples/text-generation/run_generation.py \
--model_type=gpt2 \
--length=20 \
--model_name_or_path=gpt2 \
@@ -455,7 +456,7 @@ python ./examples/run_generation.py \
and from the Salesforce CTRL model:
```shell
python ./examples/run_generation.py \
python ./examples/text-generation/run_generation.py \
--model_type=ctrl \
--length=20 \
--model_name_or_path=ctrl \

View File

@@ -67,3 +67,131 @@ It should build the static app that will be available under `/docs/_build/html`
Accepted files are reStructuredText (.rst) and Markdown (.md). Create a file with its extension and put it
in the source directory. You can then link it to the toc-tree by putting the filename without the extension.
## Writing Documentation - Specification
The `huggingface/transformers` documentation follows the
[Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style. It is
mostly written in ReStructuredText
([Sphinx simple documentation](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html),
[Sourceforge complete documentation](https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html))
### Adding a new section
A section is a page held in the `Notes` toc-tree on the documentation. Adding a new section is done in two steps:
- Add a new file under `./source`. This file can either be ReStructuredText (.rst) or Markdown (.md).
- Link that file in `./source/index.rst` on the correct toc-tree.
### Adding a new model
When adding a new model:
- Create a file `xxx.rst` under `./source/model_doc`.
- Link that file in `./source/index.rst` on the `model_doc` toc-tree.
- Write a short overview of the model:
- Overview with paper & authors
- Paper abstract
- Tips and tricks and how to use it best
- Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and
every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow.
The order is generally:
- Configuration,
- Tokenizer
- PyTorch base model
- PyTorch head models
- TensorFlow base model
- TensorFlow head models
These classes should be added using the RST syntax. Usually as follows:
```
XXXConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XXXConfig
:members:
```
This will include every public method of the configuration. If for some reason you wish for a method not to be displayed
in the documentation, you can do so by specifying which methods should be in the docs:
```
XXXTokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XXXTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
```
### Writing source documentation
Values that should be put in `code` should either be surrounded by double backticks: \`\`like so\`\` or be written as an object
using the :obj: syntax: :obj:\`like so\`.
When mentionning a class, it is recommended to use the :class: syntax as the mentioned class will be automatically
linked by Sphinx: :class:\`transformers.XXXClass\`
When mentioning a function, it is recommended to use the :func: syntax as the mentioned method will be automatically
linked by Sphinx: :func:\`transformers.XXXClass.method\`
Links should be done as so (note the double underscore at the end): \`text for the link <./local-link-or-global-link#loc>\`__
#### Defining arguments in a method
Arguments should be defined with the `Args:` prefix, followed by a line return and an indentation.
The argument should be followed by its type, with its shape if it is a tensor, and a line return.
Another indentation is necessary before writing the description of the argument.
Here's an example showcasing everything so far:
```
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.AlbertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
`What are input IDs? <../glossary.html#input-ids>`__
```
#### Writing a multi-line code block
Multi-line code blocks can be useful for displaying examples. They are done like so:
```
Example::
# first line of code
# second line
# etc
```
The `Example` string at the beginning can be replaced by anything as long as there are two semicolons following it.
#### Writing a return block
Arguments should be defined with the `Args:` prefix, followed by a line return and an indentation.
The first line should be the type of the return, followed by a line return. No need to indent further for the elements
building the return.
Here's an example for tuple return, comprising several objects:
```
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
```
Here's an example for a single value return:
```
Returns:
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
```

View File

@@ -15,4 +15,4 @@ In order to help this new field develop, we have included a few additional featu
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/run_bertology.py>`_ while extract information and prune a model pre-trained on GLUE.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract information and prune a model pre-trained on GLUE.

View File

@@ -26,7 +26,7 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'2.9.0'
release = u'2.9.1'
# -- General configuration ---------------------------------------------------

View File

@@ -12,7 +12,7 @@ A command-line interface is provided to convert original Bert/GPT/GPT-2/Transfor
BERT
^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/transformers/convert_tf_checkpoint_to_pytorch.py>`_ script.
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).
@@ -33,6 +33,26 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
ALBERT
^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed.
Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
.. code-block:: shell
export ALBERT_BASE_DIR=/path/to/albert/albert_base
transformers-cli convert --model_type albert \
--tf_checkpoint $ALBERT_BASE_DIR/model.ckpt-best \
--config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
OpenAI GPT
^^^^^^^^^^

View File

@@ -23,13 +23,13 @@ pip install -r ./examples/requirements.txt
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## TensorFlow 2.0 Bert models on GLUE
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
@@ -93,7 +93,7 @@ python run_glue_tpu.py \
## Language model training
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
@@ -155,7 +155,7 @@ python run_language_modeling.py \
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
@@ -364,7 +364,7 @@ Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
```bash
#training on 4 tesla V100(16GB) GPUS
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
python ./examples/multiple-choice/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
@@ -388,7 +388,7 @@ eval_loss = 0.44457291918821606
## SQuAD
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
#### Fine-tuning BERT on SQuAD1.0
@@ -437,7 +437,7 @@ exact_match = 81.22
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
@@ -548,7 +548,7 @@ Larger batch size may improve the performance while costing more memory.
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

View File

@@ -108,3 +108,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/electra
model_doc/dialogpt
model_doc/reformer
model_doc/marian

View File

@@ -74,7 +74,7 @@ This library hosts the processor to load the XNLI data:
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_xnli.py>`__ script.
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD
@@ -150,4 +150,4 @@ Example::
Another example using these processors is given in the
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/run_squad.py>`__ script.
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.

View File

@@ -1,5 +1,18 @@
# Migrating from pytorch-pretrained-bert
# Migrating from previous packages
## Migrating from pytorch-transformers to transformers
Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
### Positional order of some models' keywords inputs (`attention_mask`, `token_type_ids`...) changed
To be able to use Torchscript (see #1010, #1204 and #1195) the specific order of some models **keywords inputs** (`attention_mask`, `token_type_ids`...) has been changed.
If you used to call the models with keyword names for keyword arguments, e.g. `model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any change.
If you used to call the models with positional inputs for keyword arguments, e.g. `model(inputs_ids, attention_mask, token_type_ids)`, you may have to double check the exact order of input arguments.
## Migrating from pytorch-pretrained-bert
Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `transformers`

View File

@@ -1,6 +1,6 @@
Bart
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
**DISCLAIMER:** If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer

View File

@@ -0,0 +1,105 @@
MarianMT
----------------------------------------------------
**DISCLAIMER:** If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer. Translations should be similar, but not identical to, output in the test set linked to in each model card.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~
- each model is about 298 MB on disk, there are 1,000+ models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- The 1,000+ models were originally trained by `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian <https://marian-nmt.github.io/>`_ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in a model card.
- the 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as ``BartForConditionalGeneration`` with a few minor modifications:
- static (sinusoid) positional embeddings (``MarianConfig.static_position_embeddings=True``)
- a new final_logits_bias (``MarianConfig.add_bias_logits=True``)
- no layernorm_embedding (``MarianConfig.normalize_embedding=False``)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix. (Bart uses <s/>)
- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``
Naming
~~~~~~
- All model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_, three digit codes require googling "language code {code}".
- Codes formatted like ``es_AR`` are usually ``code_{region}``. That one is spanish documents from Argentina.
Multilingual Models
~~~~~~~~~~~~~~~~~~~~
All model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``:
- if ``src`` is in all caps, the model supports multiple input languages, you can figure out which ones by looking at the model card, or the Group Members `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
- if ``tgt`` is in all caps, the model can output multiple languages, and you should specify a language code by prepending the desired output language to the src_text
- You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``
Example of translating english to many romance languages, using language codes:
.. code-block:: python
from transformers import MarianMTModel, MarianTokenizer
src_text = [
'>>fr<< this is a sentence in english that we want to translate to french',
'>>pt<< This should go to portuguese',
'>>es<< And this to Spanish'
]
model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_translation_batch(src_text))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
# ["c'est une phrase en anglais que nous voulons traduire en français",
# 'Isto deve ir para o português.',
# 'Y esto al español']
Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a separator for src or tgt, as in ``'Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi'``. These still require language codes.
There are many supported regional language codes, like ``>>es_ES<<`` (Spain) and ``>>es_AR<<`` (Argentina), that do not seem to change translations. I have not found these to provide different results than just using ``>>es<<``.
For Example:
- ``Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU``: translates from all NORTH_EU languages (see `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special language code like ``>>de<<`` to specify output language.
- ``Helsinki-NLP/opus-mt-ROMANCE-en``: translates from many romance languages to english, no codes needed since there is only 1 tgt language.
.. code-block:: python
GROUP_MEMBERS = {
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
Code to see available pretrained models:
.. code-block:: python
from transformers.hf_api import HfApi
model_list = HfApi().model_list()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split('/')[1] for x in model_ids]
multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
MarianMTModel
~~~~~~~~~~~~~
Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
Model API is identical to BartForConditionalGeneration.
Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__
This class inherits all functionality from ``BartForConditionalGeneration``, see that page for method signatures.
.. autoclass:: transformers.MarianMTModel
:members:
MarianTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianTokenizer
:members: prepare_translation_batch

View File

@@ -29,7 +29,7 @@ Tips:
XLNet is pretrained using only a sub-set of the output tokens as target which are selected
with the `target_mapping` input.
- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
`target_mapping` inputs to control the attention span and outputs (see examples in `examples/run_generation.py`)
`target_mapping` inputs to control the attention span and outputs (see examples in `examples/text-generation/run_generation.py`)
- XLNet is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/zihangdai/xlnet/>`_.

View File

@@ -80,7 +80,7 @@ You can then feed it all as input to your model:
outputs = model(input_ids, langs=langs)
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/run_generation.py>`__
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__
can generate text using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings

View File

@@ -275,7 +275,7 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | | FlauBERT large architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``bart-large`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters |
| Bart | ``bart-large`` | | 24-layer, 1024-hidden, 16-heads, 406M parameters |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bart-large-mnli`` | | Adds a 2 layer classification head with 1 million parameters |
@@ -296,6 +296,12 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | ``DialoGPT-large`` | | 36-layer, 1280-hidden, 20-heads, 774M parameters |
| | | | Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Reformer | ``reformer-crime-and-punishment`` | | 6-layer, 256-hidden, 2-heads, 3M parameters |
| | | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky |
| Reformer | ``reformer-enwik8`` | | 12-layer, 1024-hidden, 8-heads, 149M parameters |
| | | | Trained on English Wikipedia data - enwik8. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``reformer-crime-and-punishment`` | | 6-layer, 256-hidden, 2-heads, 3M parameters |
| | | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky. |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| MarianMT | ``Helsinki-NLP/opus-mt-{src}-{tgt}`` | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size. |
| | | | (see `model list <https://huggingface.co/Helsinki-NLP>`_) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+

View File

@@ -8,7 +8,7 @@ The library was designed with two strong goals in mind:
- be as easy and fast to use as possible:
- we strongly limited the number of user-facing abstractions to learn, in fact there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
- we strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
- all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method which will take care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
- as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
@@ -31,27 +31,27 @@ A few other goals:
## Main concepts
The library is build around three type of classes for each models:
The library is build around three types of classes for each model:
- **model classes** which are PyTorch models (`torch.nn.Modules`) of the 8 models architectures currently provided in the library, e.g. `BertModel`
- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these your-self, in particular if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in list of token embeddings indices to be fed to a model, e.g. `BertTokenizer`
- **model classes** e.g., `BertModel` which are 20+ PyTorch models (`torch.nn.Modules`) that work with the pretrained weights provided in the library. In TF2, these are `tf.keras.Model`.
- **configuration classes** which store all the parameters required to build a model, e.g., `BertConfig`. You don't always need to instantiate these your-self. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in a list of token embeddings indices to be fed to a model, e.g., `BertTokenizer`
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
- `save_pretrained()` let you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:
We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized into two parts:
- the **MAIN CLASSES** section details the common functionalities/method/attributes of the three main type of classes (configuration, model, tokenizer) plus some optimization related classes provided as utilities for training,
- the **PACKAGE REFERENCE** section details all the variants of each class for each model architectures and in particular the input/output that you should expect when calling each of them.
- the **PACKAGE REFERENCE** section details all the variants of each class for each model architectures and, in particular, the input/output that you should expect when calling each of them.
## Quick tour: Usage
Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
See full API reference for examples for each model class.
See the full API reference for examples of each model class.
### BERT example
@@ -191,7 +191,7 @@ Examples for each model class of each model architecture (Bert, GPT, GPT-2, Tran
#### Using the past
GPT-2 as well as some other models (GPT, XLNet, Transfo-XL, CTRL) make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.
GPT-2, as well as some other models (GPT, XLNet, Transfo-XL, CTRL), make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.
Here is a fully-working example using the `past` with `GPT2LMHeadModel` and argmax decoding (which should only be used as an example, as argmax decoding introduces a lot of repetition):

View File

@@ -45,7 +45,7 @@ Sequence classification is the task of classifying sequences according to a give
of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
a model on a GLUE sequence classification task, you may leverage the
`run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`_ or
`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_tf_glue.py>`_ scripts.
`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_tf_glue.py>`_ scripts.
Here is an example using the pipelines do to sentiment analysis: identifying if a sequence is positive or negative.
It leverages a fine-tuned model on sst2, which is a GLUE task.
@@ -404,48 +404,150 @@ Causal language modeling is the task of predicting the token following a sequenc
model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
for generation tasks.
There is currently no pipeline to do causal language modeling/generation.
Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.
Here is an example using the tokenizer and model. leveraging the :func:`~transformers.PreTrainedModel.generate` method
to generate the tokens following the initial sequence in PyTorch, and creating a simple loop in TensorFlow.
Here is an example using the tokenizer and model and leveraging the :func:`~transformers.PreTrainedModel.top_k_top_p_filtering` method to sample the next token following an input sequence of tokens.
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# get logits of last hidden state
next_token_logits = model(input_ids)[0][:, -1, :]
# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer, tf_top_k_top_p_filtering
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelWithLMHead.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="tf")
# get logits of last hidden state
next_token_logits = model(input_ids)[0][:, -1, :]
# filter
filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample
next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)
generated = tf.concat([input_ids, next_token], axis=1)
resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
print(resulting_string)
This outputs a (hopefully) coherent next token following the original sequence, which is in our case is the word *has*:
::
Hugging Face is based in DUMBO, New York City, and has
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
Text Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In text generation (*a.k.a* *open-ended text generation*) the goal is to create a coherent portion of text that is a continuation from the given context. As an example, is it shown how *GPT-2* can be used in pipelines to generate text. As a default all models apply *Top-K* sampling when used in pipelines as configured in their respective configurations (see `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`_ for example).
::
from transformers import pipeline
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50))
Here the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am concerned, I will"*.
The default arguments of ``PreTrainedModel.generate()`` can directly be overriden in the pipeline as is shown above for the argument ``max_length``.
Here is an example for text generation using XLNet and its tokenzier.
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
sequence = f"Hugging Face is based in DUMBO, New York City, and is"
# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
input = tokenizer.encode(sequence, return_tensors="pt")
generated = model.generate(input, max_length=50, do_sample=True)
prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")
prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
print(generated)
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelWithLMHead.from_pretrained("gpt2")
model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
sequence = f"Hugging Face is based in DUMBO, New York City, and is"
input = tokenizer.encode(sequence, return_tensors="tf")
generated = model.generate(input, max_length=50, do_sample=True)
# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")
prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
This outputs a (hopefully) coherent string from the original sequence, as the
:func:`~transformers.PreTrainedModel.generate` samples from a top_p/tok_k distribution:
print(generated)
::
Text generation is currently possible with *GPT-2*, *OpenAi-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in PyTorch and for most models in Tensorflow as well. As can be seen in the example above *XLNet* and *Transfo-xl* often need to be padded to work well.
GPT-2 is usually a good choice for *open-ended text generation* because it was trained on millions on webpages with a causal language modeling objective.
Hugging Face is based in DUMBO, New York City, and is a live-action TV series based on the novel by John
Carpenter, and its producers, David Kustlin and Steve Pichar. The film is directed by!
For more information on how to apply different decoding strategies for text generation, please also refer to our generation blog post `here <https://huggingface.co/blog/how-to-generate>`_.
Named Entity Recognition

View File

@@ -1,10 +1,48 @@
# Examples
In this section a few examples are put together. All of these examples work for several models, making use of the very
similar API between the different models.
Version 2.9 of `transformers` introduces a new `Trainer` class for PyTorch, and its equivalent `TFTrainer` for TF 2.
Here is the list of all our examples:
- **grouped by task** (all official examples work for multiple models)
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might just lack some features),
- whether they also include examples for **`pytorch-lightning`**, which is a great fully-featured, general-purpose training library for PyTorch,
- links to **Colab notebooks** to walk through the scripts and run them easily,
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
This is still a work-in-progress in particular documentation is still sparse so please **contribute improvements/pull requests.**
## Tasks built on Trainer
| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab | One-click Deploy to Azure (wip) |
|---|---|:---:|:---:|:---:|:---:|:---:|
| [`language-modeling`](./language-modeling) | Raw text | ✅ | - | - | - | - |
| [`text-classification`](./text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb) | [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json) |
| [`token-classification`](./token-classification) | CoNLL NER | ✅ | ✅ | ✅ | - | - |
| [`multiple-choice`](./multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) | - |
| [`question-answering`](./question-answering) | SQuAD | - | ✅ | - | - | - |
## Other examples and how-to's
| Section | Description |
|---|---|
| [TensorFlow 2.0 models on GLUE](./text-classification) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](./language-modeling) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](./text-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](./text-classification) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](./question-answering) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](./multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](./token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](./text-classification) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](./adversarial) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## Important note
**Important**
To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
To make sure you can successfully run the latest versions of the example scripts, you have to install the library from source and install some example-specific requirements.
Execute the following steps in a new virtual environment:
```bash
@@ -14,16 +52,30 @@ pip install .
pip install -r ./examples/requirements.txt
```
| Section | Description |
|----------------------------|-----------------------------------------------------
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## Running on TPUs
When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to setup your TPU environment refer to Google's documentation and to the
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
In this repo, we provide a very simple launcher script named [xla_spawn.py](./xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
For example for `run_glue`:
```bash
python examples/xla_spawn.py --num_cores 8 \
examples/text-classification/run_glue.py
--model_name_or_path bert-base-cased \
--task_name mnli \
--data_dir ./data/glue_data/MNLI \
--output_dir ./models/tpu \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 1 \
--save_steps 20000
```
Feedback and more use cases and benchmarks involving TPUs are welcome, please share with the community.

View File

@@ -404,7 +404,7 @@ def main():
logger.info("Training/evaluation parameters %s", args)
# Prepare dataset for the GLUE task
eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True, local_rank=args.local_rank)
eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True)
if args.data_subset > 0:
eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)

View File

@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" This is the exact same script as `examples/run_squad.py` (as of 2020, January 8th) with an additional and optional step of distillation."""
""" This is the exact same script as `examples/question-answering/run_squad.py` (as of 2020, January 8th) with an additional and optional step of distillation."""
import argparse
import glob

View File

@@ -1,7 +1,7 @@
## Language model training
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa

View File

@@ -265,7 +265,7 @@ def main():
eval_output = trainer.evaluate()
perplexity = math.exp(eval_output["loss"])
perplexity = math.exp(eval_output["eval_loss"])
result = {"perplexity": perplexity}
output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
@@ -280,5 +280,10 @@ def main():
return results
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()

View File

@@ -8,7 +8,7 @@ Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
```bash
#training on 4 tesla V100(16GB) GPUS
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
python ./examples/multiple-choice/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
@@ -29,3 +29,28 @@ Training with the defined hyper-parameters yields the following results:
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
## Tensorflow
```bash
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/multiple-choice/run_tf_multiple_choice.py \
--task_name swag \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--logging-dir logs \
--gradient_accumulation_steps 2 \
--overwrite_output
```
# Run it in colab
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)

View File

@@ -221,5 +221,10 @@ def main():
return results
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,211 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for multiple choice (Bert, Roberta, XLNet)."""
import logging
import os
from dataclasses import dataclass, field
from typing import Dict, Optional
import numpy as np
from transformers import (
AutoConfig,
AutoTokenizer,
EvalPrediction,
HfArgumentParser,
TFAutoModelForMultipleChoice,
TFTrainer,
TFTrainingArguments,
set_seed,
)
from utils_multiple_choice import Split, TFMultipleChoiceDataset, processors
logger = logging.getLogger(__name__)
def simple_accuracy(preds, labels):
return (preds == labels).mean()
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
)
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
task_name: str = field(metadata={"help": "The name of the task to train on: " + ", ".join(processors.keys())})
data_dir: str = field(metadata={"help": "Should contain the data files for the task."})
max_seq_length: int = field(
default=128,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
logger.warning(
"device: %s, n_gpu: %s, 16-bits training: %s", training_args.device, training_args.n_gpu, training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)
# Set seed
set_seed(training_args.seed)
try:
processor = processors[data_args.task_name]()
label_list = processor.get_labels()
num_labels = len(label_list)
except KeyError:
raise ValueError("Task not found: %s" % (data_args.task_name))
# Load pretrained model and tokenizer
#
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config = AutoConfig.from_pretrained(
model_args.config_name if model_args.config_name else model_args.model_name_or_path,
num_labels=num_labels,
finetuning_task=data_args.task_name,
cache_dir=model_args.cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
)
with training_args.strategy.scope():
model = TFAutoModelForMultipleChoice.from_pretrained(
model_args.model_name_or_path,
from_pt=bool(".bin" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
)
# Get datasets
train_dataset = (
TFMultipleChoiceDataset(
data_dir=data_args.data_dir,
tokenizer=tokenizer,
task=data_args.task_name,
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.train,
)
if training_args.do_train
else None
)
eval_dataset = (
TFMultipleChoiceDataset(
data_dir=data_args.data_dir,
tokenizer=tokenizer,
task=data_args.task_name,
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.dev,
)
if training_args.do_eval
else None
)
def compute_metrics(p: EvalPrediction) -> Dict:
preds = np.argmax(p.predictions, axis=1)
return {"acc": simple_accuracy(preds, p.label_ids)}
# Initialize our Trainer
trainer = TFTrainer(
model=model,
args=training_args,
train_dataset=train_dataset.get_dataset() if train_dataset else None,
eval_dataset=eval_dataset.get_dataset() if eval_dataset else None,
compute_metrics=compute_metrics,
)
# Training
if training_args.do_train:
trainer.train()
trainer.save_model()
tokenizer.save_pretrained(training_args.output_dir)
# Evaluation
results = {}
if training_args.do_eval:
logger.info("*** Evaluate ***")
result = trainer.evaluate()
output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
results.update(result)
return results
if __name__ == "__main__":
main()

View File

@@ -25,11 +25,9 @@ from dataclasses import dataclass
from enum import Enum
from typing import List, Optional
import torch
import tqdm
from torch.utils.data.dataset import Dataset
from transformers import PreTrainedTokenizer, torch_distributed_zero_first
from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available
logger = logging.getLogger(__name__)
@@ -76,66 +74,160 @@ class Split(Enum):
test = "test"
class MultipleChoiceDataset(Dataset):
"""
This will be superseded by a framework-agnostic approach
soon.
"""
if is_torch_available():
import torch
from torch.utils.data.dataset import Dataset
from transformers import torch_distributed_zero_first
features: List[InputFeatures]
class MultipleChoiceDataset(Dataset):
"""
This will be superseded by a framework-agnostic approach
soon.
"""
def __init__(
self,
data_dir: str,
tokenizer: PreTrainedTokenizer,
task: str,
max_seq_length: Optional[int] = None,
overwrite_cache=False,
mode: Split = Split.train,
local_rank=-1,
):
processor = processors[task]()
features: List[InputFeatures]
cached_features_file = os.path.join(
data_dir,
"cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
)
with torch_distributed_zero_first(local_rank):
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
def __init__(
self,
data_dir: str,
tokenizer: PreTrainedTokenizer,
task: str,
max_seq_length: Optional[int] = None,
overwrite_cache=False,
mode: Split = Split.train,
local_rank=-1,
):
processor = processors[task]()
if os.path.exists(cached_features_file) and not overwrite_cache:
logger.info(f"Loading features from cached file {cached_features_file}")
self.features = torch.load(cached_features_file)
else:
logger.info(f"Creating features from dataset file at {data_dir}")
label_list = processor.get_labels()
if mode == Split.dev:
examples = processor.get_dev_examples(data_dir)
elif mode == Split.test:
examples = processor.get_test_examples(data_dir)
cached_features_file = os.path.join(
data_dir,
"cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
)
with torch_distributed_zero_first(local_rank):
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
if os.path.exists(cached_features_file) and not overwrite_cache:
logger.info(f"Loading features from cached file {cached_features_file}")
self.features = torch.load(cached_features_file)
else:
examples = processor.get_train_examples(data_dir)
logger.info("Training examples: %s", len(examples))
# TODO clean up all this to leverage built-in features of tokenizers
self.features = convert_examples_to_features(
examples,
label_list,
max_seq_length,
tokenizer,
pad_on_left=bool(tokenizer.padding_side == "left"),
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(self.features, cached_features_file)
logger.info(f"Creating features from dataset file at {data_dir}")
label_list = processor.get_labels()
if mode == Split.dev:
examples = processor.get_dev_examples(data_dir)
elif mode == Split.test:
examples = processor.get_test_examples(data_dir)
else:
examples = processor.get_train_examples(data_dir)
logger.info("Training examples: %s", len(examples))
# TODO clean up all this to leverage built-in features of tokenizers
self.features = convert_examples_to_features(
examples,
label_list,
max_seq_length,
tokenizer,
pad_on_left=bool(tokenizer.padding_side == "left"),
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(self.features, cached_features_file)
def __len__(self):
return len(self.features)
def __len__(self):
return len(self.features)
def __getitem__(self, i) -> InputFeatures:
return self.features[i]
def __getitem__(self, i) -> InputFeatures:
return self.features[i]
if is_tf_available():
import tensorflow as tf
class TFMultipleChoiceDataset:
"""
This will be superseded by a framework-agnostic approach
soon.
"""
features: List[InputFeatures]
def __init__(
self,
data_dir: str,
tokenizer: PreTrainedTokenizer,
task: str,
max_seq_length: Optional[int] = 128,
overwrite_cache=False,
mode: Split = Split.train,
):
processor = processors[task]()
logger.info(f"Creating features from dataset file at {data_dir}")
label_list = processor.get_labels()
if mode == Split.dev:
examples = processor.get_dev_examples(data_dir)
elif mode == Split.test:
examples = processor.get_test_examples(data_dir)
else:
examples = processor.get_train_examples(data_dir)
logger.info("Training examples: %s", len(examples))
# TODO clean up all this to leverage built-in features of tokenizers
self.features = convert_examples_to_features(
examples,
label_list,
max_seq_length,
tokenizer,
pad_on_left=bool(tokenizer.padding_side == "left"),
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
def gen():
for (ex_index, ex) in tqdm.tqdm(enumerate(self.features), desc="convert examples to features"):
if ex_index % 10000 == 0:
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
yield (
{
"example_id": 0,
"input_ids": ex.input_ids,
"attention_mask": ex.attention_mask,
"token_type_ids": ex.token_type_ids,
},
ex.label,
)
self.dataset = tf.data.Dataset.from_generator(
gen,
(
{
"example_id": tf.int32,
"input_ids": tf.int32,
"attention_mask": tf.int32,
"token_type_ids": tf.int32,
},
tf.int64,
),
(
{
"example_id": tf.TensorShape([]),
"input_ids": tf.TensorShape([None, None]),
"attention_mask": tf.TensorShape([None, None]),
"token_type_ids": tf.TensorShape([None, None]),
},
tf.TensorShape([]),
),
)
def get_dataset(self):
return self.dataset
def __len__(self):
return len(self.features)
def __getitem__(self, i) -> InputFeatures:
return self.features[i]
class DataProcessor:
@@ -225,6 +317,52 @@ class RaceProcessor(DataProcessor):
return examples
class SynonymProcessor(DataProcessor):
"""Processor for the Synonym data set."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} train".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "mctrain.csv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "mchp.csv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "mctest.csv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1", "2", "3", "4"]
def _read_csv(self, input_file):
with open(input_file, "r", encoding="utf-8") as f:
return list(csv.reader(f))
def _create_examples(self, lines: List[List[str]], type: str):
"""Creates examples for the training and dev sets."""
examples = [
InputExample(
example_id=line[0],
question="", # in the swag dataset, the
# common beginning of each
# choice is stored in "sent2".
contexts=[line[1], line[1], line[1], line[1], line[1]],
endings=[line[2], line[3], line[4], line[5], line[6]],
label=line[7],
)
for line in lines # we skip the line with the column names
]
return examples
class SwagProcessor(DataProcessor):
"""Processor for the SWAG data set."""
@@ -435,7 +573,5 @@ def convert_examples_to_features(
return features
processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor}
MULTIPLE_CHOICE_TASKS_NUM_LABELS = {"race", 4, "swag", 4, "arc", 4}
processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}
MULTIPLE_CHOICE_TASKS_NUM_LABELS = {"race", 4, "swag", 4, "arc", 4, "syn", 5}

View File

@@ -2,7 +2,7 @@
## SQuAD
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
#### Fine-tuning BERT on SQuAD1.0
@@ -51,7 +51,7 @@ exact_match = 81.22
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
@@ -157,3 +157,23 @@ Larger batch size may improve the performance while costing more memory.
}
```
## SQuAD with the Tensorflow Trainer
```bash
python run_tf_squad.py \
--model_name_or_path bert-base-uncased \
--output_dir model \
--max-seq-length 384 \
--num_train_epochs 2 \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 16 \
--do_train \
--logging_dir logs \
--mode question-answering \
--logging_steps 10 \
--learning_rate 3e-5 \
--doc_stride 128 \
--optimizer_name adamw
```
For the moment the evaluation is not available in the Tensorflow Trainer only the training.

View File

@@ -0,0 +1,237 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Fine-tuning the library models for question-answering."""
import logging
import os
from dataclasses import dataclass, field
from typing import Optional
from transformers import (
AutoConfig,
AutoTokenizer,
HfArgumentParser,
TFAutoModelForQuestionAnswering,
TFTrainer,
TFTrainingArguments,
squad_convert_examples_to_features,
)
from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor
logger = logging.getLogger(__name__)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."})
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
)
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
data_dir: Optional[str] = field(
default=None, metadata={"help": "The input data dir. Should contain the .json files for the SQuAD task."}
)
max_seq_length: int = field(
default=128,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
doc_stride: int = field(
default=128,
metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."},
)
max_query_length: int = field(
default=64,
metadata={
"help": "The maximum number of tokens for the question. Questions longer than this will "
"be truncated to this length."
},
)
max_answer_length: int = field(
default=30,
metadata={
"help": "The maximum length of an answer that can be generated. This is needed because the start "
"and end predictions are not conditioned on one another."
},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
version_2_with_negative: bool = field(
default=False, metadata={"help": "If true, the SQuAD examples contain some that do not have an answer."}
)
null_score_diff_threshold: float = field(
default=0.0, metadata={"help": "If null_score - best_non_null is greater than the threshold predict null."}
)
n_best_size: int = field(
default=20, metadata={"help": "If null_score - best_non_null is greater than the threshold predict null."}
)
lang_id: int = field(
default=0,
metadata={
"help": "language id of input for language-specific xlm models (see tokenization_xlm.PRETRAINED_INIT_CONFIGURATION)"
},
)
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
logger.info(
"n_gpu: %s, distributed training: %s, 16-bits training: %s",
training_args.n_gpu,
bool(training_args.n_gpu > 1),
training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)
# Prepare Question-Answering task
# Load pretrained model and tokenizer
#
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config = AutoConfig.from_pretrained(
model_args.config_name if model_args.config_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
use_fast=model_args.use_fast,
)
with training_args.strategy.scope():
model = TFAutoModelForQuestionAnswering.from_pretrained(
model_args.model_name_or_path,
from_pt=bool(".bin" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
)
# Get datasets
if not data_args.data_dir:
if data_args.version_2_with_negative:
logger.warn("tensorflow_datasets does not handle version 2 of SQuAD. Switch to version 1 automatically")
try:
import tensorflow_datasets as tfds
except ImportError:
raise ImportError("If not data_dir is specified, tensorflow_datasets needs to be installed.")
tfds_examples = tfds.load("squad")
train_examples = (
SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=False)
if training_args.do_train
else None
)
eval_examples = (
SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=True)
if training_args.do_eval
else None
)
else:
processor = SquadV2Processor() if data_args.version_2_with_negative else SquadV1Processor()
train_examples = processor.get_train_examples(data_args.data_dir) if training_args.do_train else None
eval_examples = processor.get_dev_examples(data_args.data_dir) if training_args.do_eval else None
train_dataset = (
squad_convert_examples_to_features(
examples=train_examples,
tokenizer=tokenizer,
max_seq_length=data_args.max_seq_length,
doc_stride=data_args.doc_stride,
max_query_length=data_args.max_query_length,
is_training=True,
return_dataset="tf",
)
if training_args.do_train
else None
)
eval_dataset = (
squad_convert_examples_to_features(
examples=eval_examples,
tokenizer=tokenizer,
max_seq_length=data_args.max_seq_length,
doc_stride=data_args.doc_stride,
max_query_length=data_args.max_query_length,
is_training=False,
return_dataset="tf",
)
if training_args.do_eval
else None
)
# Initialize our Trainer
trainer = TFTrainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset,)
# Training
if training_args.do_train:
trainer.train()
trainer.save_model()
tokenizer.save_pretrained(training_args.output_dir)
if __name__ == "__main__":
main()

View File

@@ -72,7 +72,7 @@ class ExamplesTests(unittest.TestCase):
""".split()
with patch.object(sys, "argv", testargs):
result = run_glue.main()
del result["loss"]
del result["eval_loss"]
for value in result.values():
self.assertGreaterEqual(value, 0.75)

View File

@@ -2,7 +2,7 @@
# Run TensorFlow 2.0 version
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
@@ -85,10 +85,12 @@ CoLA, SST-2. The following section provides details on how to run half-precision
said, there shouldnt be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class DataProcessor.
## Running on TPUs
## Running on TPUs in PyTorch
You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
[README](https://github.com/pytorch/xla/blob/master/README.md).
**Update**: read the more up-to-date [Running on TPUs](../README.md#running-on-tpus) in the main README.md instead.
Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on how to setup your TPU environment refer to the
[pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
identical to your normal GPU + Huggingface setup.
@@ -101,7 +103,6 @@ export GLUE_DIR=/path/to/glue
export TASK_NAME=MNLI
python run_glue_tpu.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
@@ -115,8 +116,7 @@ python run_glue_tpu.py \
--overwrite_output_dir \
--logging_steps 50 \
--save_steps 200 \
--num_cores=8 \
--only_log_master
--num_cores=8
```
### MRPC
@@ -256,7 +256,7 @@ TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recal
# XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

View File

@@ -134,16 +134,8 @@ def main():
)
# Get datasets
train_dataset = (
GlueDataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
if training_args.do_train
else None
)
eval_dataset = (
GlueDataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
if training_args.do_eval
else None
)
train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
def compute_metrics(p: EvalPrediction) -> Dict:
if output_mode == "classification":
@@ -181,9 +173,7 @@ def main():
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
eval_datasets.append(
GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
)
eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, evaluate=True))
for eval_dataset in eval_datasets:
result = trainer.evaluate(eval_dataset=eval_dataset)

View File

@@ -1,6 +1,6 @@
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you

View File

@@ -17,10 +17,10 @@
"""
Example command with bag of words:
python examples/run_pplm.py -B space --cond_text "The president" --length 100 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.01 --window_length 5 --kl_scale 0.01 --gm_scale 0.95
python run_pplm.py -B space --cond_text "The president" --length 100 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.01 --window_length 5 --kl_scale 0.01 --gm_scale 0.95
Example command with discriminator:
python examples/run_pplm.py -D sentiment --class_label 3 --cond_text "The lake" --length 10 --gamma 1.0 --num_iterations 30 --num_samples 10 --stepsize 0.01 --kl_scale 0.01 --gm_scale 0.95
python run_pplm.py -D sentiment --class_label 3 --cond_text "The lake" --length 10 --gamma 1.0 --num_iterations 30 --num_samples 10 --stepsize 0.01 --kl_scale 0.01 --gm_scale 0.95
"""
import argparse

View File

@@ -1,7 +1,7 @@
## Named Entity Recognition
Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py) for Pytorch and
[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_tf_ner.py) for Tensorflow 2.
Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) for Pytorch and
[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_tf_ner.py) for Tensorflow 2.
This example fine-tune Bert Multilingual on GermEval 2014 (German NER).
Details and results for the fine-tuning provided by @stefan-it.

View File

@@ -292,5 +292,10 @@ def main():
return results
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()

View File

@@ -6,7 +6,7 @@ from unittest.mock import patch
import run_ner
logging.basicConfig(level=logging.DEBUG)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
@@ -30,4 +30,4 @@ class ExamplesTests(unittest.TestCase):
""".split()
with patch.object(sys, "argv", ["run.py"] + testargs):
result = run_ner.main()
self.assertLess(result["loss"], 1.5)
self.assertLess(result["eval_loss"], 1.5)

View File

@@ -12,17 +12,13 @@ Inspired by https://github.com/pytorch/pytorch/blob/master/torch/distributed/lau
import importlib
import os
import sys
from argparse import REMAINDER, ArgumentParser
from pathlib import Path
import torch_xla.distributed.xla_multiprocessing as xmp
def trim_suffix(s: str, suffix: str):
return s if not s.endswith(suffix) or len(suffix) == 0 else s[: -len(suffix)]
def parse_args():
"""
Helper function parsing the command line options
@@ -44,7 +40,7 @@ def parse_args():
"training_script",
type=str,
help=(
"The full module name to the single TPU training "
"The full path to the single TPU training "
"program/script to be launched in parallel, "
"followed by all the arguments for the "
"training script"
@@ -61,7 +57,9 @@ def main():
args = parse_args()
# Import training_script as a module.
mod_name = trim_suffix(os.path.basename(args.training_script), ".py")
script_fpath = Path(args.training_script)
sys.path.append(str(script_fpath.parent.resolve()))
mod_name = script_fpath.stem
mod = importlib.import_module(mod_name)
# Patch sys.argv

View File

@@ -59,56 +59,64 @@ tokenizer = GPT2Tokenizer.from_pretrained(
## Example using GPT2LMHeadModel
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('LorenzoDeMattei/GePpeTto')
model = GPT2LMHeadModel.from_pretrained(
'LorenzoDeMattei/GePpeTto', pad_token_id = tokenizer.eos_token_id
tokenizer = AutoTokenizer.from_pretrained("LorenzoDeMattei/GePpeTto")
model = AutoModelWithLMHead.from_pretrained("LorenzoDeMattei/GePpeTto")
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
prompts = [
"Wikipedia Geppetto",
"Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso"]
samples_outputs = text_generator(
prompts,
do_sample=True,
max_length=50,
top_k=50,
top_p=0.95,
num_return_sequences=3
)
input_ids = tokenizer.encode(
'Wikipedia Geppetto', return_tensors = 'pt'
)
sample_outputs = model.generate(
input_ids,
do_sample = True,
max_length = 50,
top_k = 50,
top_p = 0.95,
num_return_sequences = 3,
)
print('Output:\n' + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
print(
'{}: {}'.format(
i, tokenizer.decode(sample_output, skip_special_tokens = True)
)
)
for i, sample_outputs in enumerate(samples_outputs):
print(100 * '-')
print("Prompt:", prompts[i])
for sample_output in sample_outputs:
print("Sample:", sample_output['generated_text'])
print()
```
Output is,
```text
Output:
```
----------------------------------------------------------------------------------------------------
0: Wikipedia Geppetto
Prompt: Wikipedia Geppetto
Sample: Wikipedia Geppetto rosso (film 1920)
Geppetto è una città degli Stati Uniti d'America, situata nello Stato dell'Iowa, nella Contea di Greene.
Geppetto rosso ("The Smokes in the Black") è un film muto del 1920 diretto da Henry H. Leonard.
Wikipedia The Sax
Il film fu prodotto dalla Selig Poly
The Sax è il primo album discografico
2: Wikipedia Geppetto/Passione
Sample: Wikipedia Geppetto
Geppetto è il primo album in studio dei Saturday Night Live, pubblicato dalla Iron Maiden nel 1974.
Geppetto ("Geppetto" in piemontese) è un comune italiano di 978 abitanti della provincia di Cuneo in Piemonte.
L'album è un lavoro di debutto che lo porta a definire
3: Wikipedia Geppetto
L'abitato, che si trova nel versante valtellinese, si sviluppa nella
Geppetto ("Fenëvëv" in calabrese) è un comune italiano di abitanti della regione Calabria.
Sample: Wikipedia Geppetto di Natale (romanzo)
Zona di particolare pregio storico-artistico, paesaggistico, storico-artistico,
Geppetto di Natale è un romanzo di Mario Caiano, pubblicato nel 2012.
----------------------------------------------------------------------------------------------------
Prompt: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso
Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso. Il burattino riesce a scappare. Dopo aver trovato un prezioso sacchetto si reca
Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso, e l'unico che lo possiede, ma, di fronte a tutte queste prove
Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso: - A voi gli occhi, le guance! A voi il mio pezzo!
```
## Citation

View File

@@ -0,0 +1,19 @@
# Norwegian Electra
Image incoming, im going to have som fun with this one.
Trained on Oscar + wikipedia + opensubtitles + some other data I had with the awesome power of TPUs(V3-8)
Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
# Acknowledgments
### TensorFlow Research Cloud
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
- https://www.tensorflow.org/tfrc
#### OSCAR corpus
- https://oscar-corpus.com/
#### OPUS
- http://opus.nlpl.eu/
- http://www.opensubtitles.org/

View File

@@ -0,0 +1,44 @@
---
language: polish
---
# HerBERT tokenizer
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** tokenizer is a character level byte-pair encoding with
vocabulary size of 50k tokens. The tokenizer was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of
[National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with [fastBPE](https://github.com/glample/fastBPE) library.
Tokenizer utilize `XLMTokenizer` implementation from [transformers](https://github.com/huggingface/transformers).
## Tokenizer usage
Herbert tokenizer should be used together with [HerBERT model](https://huggingface.co/allegro/herbert-klej-cased-v1):
```python
from transformers import XLMTokenizer, RobertaModel
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd to jasne.", return_tensors='pt')
outputs = model(encoded_input)
```
## License
CC BY-SA 4.0
## Citation
If you use this tokenizer, please cite the following paper:
```
@misc{rybak2020klej,
title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
year={2020},
eprint={2005.00630},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Paper is accepted at ACL 2020, as soon as proceedings appear, we will update the BibTeX.
## Authors
Tokenizer was created by **Allegro Machine Learning Research** team.
You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>

View File

@@ -0,0 +1,85 @@
---
language: polish
---
# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish Corpora
using only MLM objective with dynamic masking of whole words. For more details, please refer to:
[KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://arxiv.org/abs/2005.00630).
## Dataset
**HerBERT** training dataset is a combination of several publicly available corpora for Polish language:
| Corpus | Tokens | Texts |
| :------ | ------: | ------: |
| [OSCAR](https://traces1.inria.fr/oscar/)| 6710M | 145M |
| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1084M | 1.1M |
| [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.5M |
| [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |
| [Allegro Articles](https://allegro.pl/artykuly) | 18M | 33k |
## Tokenizer
The training dataset was tokenized into subwords using [HerBERT Tokenizer](https://huggingface.co/allegro/herbert-klej-cased-tokenizer-v1); a character level byte-pair encoding with
a vocabulary size of 50k tokens. The tokenizer itself was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of
[National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with a [fastBPE](https://github.com/glample/fastBPE) library.
Tokenizer utilizes `XLMTokenizer` implementation for that reason, one should load it as `allegro/herbert-klej-cased-tokenizer-v1`.
## HerBERT models summary
| Model | WWM | Cased | Tokenizer | Vocab Size | Batch Size | Train Steps |
| :------ | ------: | ------: | ------: | ------: | ------: | ------: |
| herbert-klej-cased-v1 | YES | YES | BPE | 50K | 570 | 180k |
## Model evaluation
HerBERT was evaluated on the [KLEJ](https://klejbenchmark.com/) benchmark, publicly available set of nine evaluation tasks for the Polish language understanding.
It had the best average performance and obtained the best results for three of them.
| Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN |PolEmo2.0-OUT | DYK | PSC | AR |
| :------ | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: |
| herbert-klej-cased-v1 | **80.5** | 92.7 | 92.5 | 91.9 | **50.3** | **89.2** |**76.3** |52.1 |95.3 | 84.5 |
Full leaderboard is available [online](https://klejbenchmark.com/leaderboard).
## HerBERT usage
Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.0.
Example code:
```python
from transformers import XLMTokenizer, RobertaModel
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd to jasne.", return_tensors='pt')
outputs = model(encoded_input)
```
HerBERT can also be loaded using `AutoTokenizer` and `AutoModel`:
```python
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
```
## License
CC BY-SA 4.0
## Citation
If you use this model, please cite the following paper:
```
@misc{rybak2020klej,
title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
year={2020},
eprint={2005.00630},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Paper is accepted at ACL 2020, as soon as proceedings appear, we will update the BibTeX.
## Authors
Model was trained by **Allegro Machine Learning Research** team.
You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>

View File

@@ -0,0 +1,79 @@
---
language: turkish
license: mit
---
# 🤗 + 📚 dbmdz Turkish ELECTRA model
In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
Library open sources a cased ELECTRA base model for Turkish 🎉
# Turkish ELECTRA model
We release a base ELEC**TR**A model for Turkish, that was trained on the same data as *BERTurk*.
> ELECTRA is a new method for self-supervised language representation learning. It can be used to
> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
> the discriminator of a GAN.
More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
## Stats
The current version of the model is trained on a filtered and sentence
segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),
a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a
special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).
The final training corpus has a size of 35GB and 44,04,976,662 tokens.
Thanks to Google's TensorFlow Research Cloud (TFRC) we could train a cased model
on a TPU v3-8 for 1M steps.
## Model weights
[Transformers](https://github.com/huggingface/transformers)
compatible weights for both PyTorch and TensorFlow are available.
| Model | Downloads
| ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------
| `dbmdz/electra-base-turkish-cased-discriminator` | [`config.json`](https://cdn.huggingface.co/dbmdz/electra-base-turkish-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-turkish-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-turkish-cased-discriminator/vocab.txt)
## Usage
With Transformers >= 2.8 our ELECTRA base cased model can be loaded like:
```python
from transformers import AutoModelWithLMHead, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-base-turkish-cased-discriminator")
model = AutoModelWithLMHead.from_pretrained("dbmdz/electra-base-turkish-cased-discriminator")
```
## Results
For results on PoS tagging or NER tasks, please refer to
[this repository](https://github.com/stefan-it/turkish-bert/electra).
# Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
# Contact (Bugs, Feedback, Contribution and more)
For questions about our ELECTRA models just open an issue
[here](https://github.com/dbmdz/berts/issues/new) 🤗
# Acknowledgments
Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us
additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing
us the Turkish NER dataset for evaluation.
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️
Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗

View File

@@ -0,0 +1,79 @@
---
language: turkish
license: mit
---
# 🤗 + 📚 dbmdz Turkish ELECTRA model
In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
Library open sources a cased ELECTRA small model for Turkish 🎉
# Turkish ELECTRA model
We release a small ELEC**TR**A model for Turkish, that was trained on the same data as *BERTurk*.
> ELECTRA is a new method for self-supervised language representation learning. It can be used to
> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
> the discriminator of a GAN.
More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
## Stats
The current version of the model is trained on a filtered and sentence
segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),
a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a
special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).
The final training corpus has a size of 35GB and 44,04,976,662 tokens.
Thanks to Google's TensorFlow Research Cloud (TFRC) we could train a cased model
on a TPU v3-8 for 1M steps.
## Model weights
[Transformers](https://github.com/huggingface/transformers)
compatible weights for both PyTorch and TensorFlow are available.
| Model | Downloads
| ------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------
| `dbmdz/electra-small-turkish-cased-discriminator` | [`config.json`](https://cdn.huggingface.co/dbmdz/electra-small-turkish-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-small-turkish-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-small-turkish-cased-discriminator/vocab.txt)
## Usage
With Transformers >= 2.8 our ELECTRA small cased model can be loaded like:
```python
from transformers import AutoModelWithLMHead, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-small-turkish-cased-discriminator")
model = AutoModelWithLMHead.from_pretrained("dbmdz/electra-small-turkish-cased-discriminator")
```
## Results
For results on PoS tagging or NER tasks, please refer to
[this repository](https://github.com/stefan-it/turkish-bert/electra).
# Huggingface model hub
All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
# Contact (Bugs, Feedback, Contribution and more)
For questions about our ELECTRA models just open an issue
[here](https://github.com/dbmdz/berts/issues/new) 🤗
# Acknowledgments
Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us
additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing
us the Turkish NER dataset for evaluation.
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️
Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗

View File

@@ -11,7 +11,7 @@ A baseline model for question-answering in french ([CamemBERT](https://camembert
## Training hyperparameters
```shell
python3 ./examples/run_squad.py \
python3 ./examples/question-answering/run_squad.py \
--model_type camembert \
--model_name_or_path camembert-base \
--do_train \

View File

@@ -11,7 +11,7 @@ A baseline model for question-answering in french ([CamemBERT](https://camembert
## Training hyperparameters
```shell
python3 ./examples/run_squad.py \
python3 ./examples/question-answering/run_squad.py \
--model_type camembert \
--model_name_or_path camembert-base \
--do_train \

View File

@@ -11,7 +11,7 @@ A baseline model for question-answering in french ([flaubert](https://github.com
## Training hyperparameters
```shell
python3 ./examples/run_squad.py \
python3 ./examples/question-answering/run_squad.py \
--model_type flaubert \
--model_name_or_path flaubert-base-uncased \
--do_train \

View File

@@ -25,7 +25,7 @@ fill_mask = pipeline(
)
print(
fill_mask(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks.")
fill_mask(f"HuggingFace is creating a {fill_mask.tokenizer.mask_token} that the community uses to solve NLP tasks.")
)
```

View File

@@ -0,0 +1,57 @@
## Reformer Language model on character level and trained on enwik8.
*enwik8* is a dataset based on Wikipedia and is often used to measure the model's ability to *compress* data, *e.g.* in
the scope of the *Hutter prize*: https://en.wikipedia.org/wiki/Hutter_Prize.
`reformer-enwik8` was pretrained on the first 90M chars of *enwik8* whereas the text was chunked into batches of size 65536 chars (=2^16).
The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted
to Hugging Face's PyTorch ReformerLM model `ReformerModelWithLMHead`.
The model is a language model that operates on characters.
Therefore, this model does not need a tokenizer. The following function can instead be used for **encoding** and **decoding**:
```python
import torch
# Encoding
def encode(list_of_strings, pad_to_max_length=True, pad_token_id=0):
max_length = max([len(string) for string in list_of_strings])
# create emtpy tensors
attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)
for idx, string in enumerate(list_of_strings):
# make sure string is in byte format
if not isinstance(string, bytes):
string = str.encode(string)
input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
attention_masks[idx, :len(string)] = 1
return input_ids, attention_masks
# Decoding
def decode(outputs_ids):
decoded_outputs = []
for output_ids in outputs_ids.tolist():
# transform id back to char IDs < 2 are simply transformed to ""
decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
return decoded_outputs
```
Text can be generated as follows:
```python
from transformers import ReformerModelWithLMHead
model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))
# gives:
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro
```
***Note***: Language generation using `ReformerModelWithLMHead` is not optimized yet and is rather slow.

View File

@@ -1,3 +1,4 @@
# XLM-R + NER
This model is a fine-tuned [XLM-Roberta-base](https://arxiv.org/abs/1911.02116) over the 40 languages proposed in [XTREME]([https://github.com/google-research/xtreme](https://github.com/google-research/xtreme)) from [Wikiann](https://aclweb.org/anthology/P17-1178). This is still an on-going work and the results will be updated everytime an improvement is reached.
@@ -12,6 +13,7 @@ O
## Metrics on evaluation set:
### Average over the 40 languages
Number of documents: 262300
```
precision recall f1-score support
@@ -24,6 +26,7 @@ macro avg 0.86 0.87 0.87 333298
```
### Afrikaans
Number of documents: 1000
```
precision recall f1-score support
@@ -36,6 +39,7 @@ macro avg 0.87 0.91 0.89 1469
```
### Arabic
Number of documents: 10000
```
precision recall f1-score support
@@ -48,6 +52,7 @@ macro avg 0.87 0.88 0.88 10754
```
### Basque
Number of documents: 10000
```
precision recall f1-score support
@@ -60,6 +65,7 @@ macro avg 0.89 0.89 0.89 12954
```
### Bengali
Number of documents: 1000
```
precision recall f1-score support
@@ -72,6 +78,7 @@ macro avg 0.91 0.92 0.91 1095
```
### Bulgarian
Number of documents: 1000
```
precision recall f1-score support
@@ -84,6 +91,7 @@ macro avg 0.91 0.92 0.91 14116
```
### Burmese
Number of documents: 100
```
precision recall f1-score support
@@ -96,6 +104,7 @@ macro avg 0.57 0.65 0.60 103
```
### Chinese
Number of documents: 10000
```
precision recall f1-score support
@@ -108,6 +117,7 @@ macro avg 0.76 0.78 0.77 11558
```
### Dutch
Number of documents: 10000
```
precision recall f1-score support
@@ -120,6 +130,7 @@ macro avg 0.91 0.92 0.91 13120
```
### English
Number of documents: 10000
```
precision recall f1-score support
@@ -132,6 +143,7 @@ macro avg 0.82 0.83 0.83 13973
```
### Estonian
Number of documents: 10000
```
precision recall f1-score support
@@ -144,6 +156,7 @@ macro avg 0.90 0.91 0.90 13558
```
### Finnish
Number of documents: 10000
```
precision recall f1-score support
@@ -156,6 +169,7 @@ macro avg 0.89 0.89 0.89 13930
```
### French
Number of documents: 10000
```
precision recall f1-score support
@@ -168,6 +182,7 @@ macro avg 0.89 0.90 0.90 12933
```
### Georgian
Number of documents: 10000
```
precision recall f1-score support
@@ -180,6 +195,7 @@ macro avg 0.84 0.86 0.85 12615
```
### German
Number of documents: 10000
```
precision recall f1-score support
@@ -192,6 +208,7 @@ macro avg 0.86 0.86 0.86 13638
```
### Greek
Number of documents: 10000
```
precision recall f1-score support
@@ -204,6 +221,7 @@ macro avg 0.88 0.90 0.89 12101
```
### Hebrew
Number of documents: 10000
```
precision recall f1-score support
@@ -216,6 +234,7 @@ macro avg 0.82 0.83 0.83 12934
```
### Hindi
Number of documents: 1000
```
precision recall f1-score support
@@ -228,6 +247,7 @@ macro avg 0.84 0.87 0.85 1211
```
### Hungarian
Number of documents: 10000
```
precision recall f1-score support
@@ -240,6 +260,7 @@ macro avg 0.91 0.92 0.91 13879
```
### Indonesian
Number of documents: 10000
```
precision recall f1-score support
@@ -252,6 +273,7 @@ macro avg 0.91 0.92 0.92 11376
```
### Italian
Number of documents: 10000
```
precision recall f1-score support
@@ -264,6 +286,7 @@ macro avg 0.90 0.90 0.90 13412
```
### Japanese
Number of documents: 10000
```
precision recall f1-score support
@@ -276,6 +299,7 @@ macro avg 0.69 0.72 0.70 12277
```
### Javanese
Number of documents: 100
```
precision recall f1-score support
@@ -288,6 +312,7 @@ macro avg 0.78 0.82 0.80 112
```
### Kazakh
Number of documents: 1000
```
precision recall f1-score support
@@ -300,6 +325,7 @@ macro avg 0.81 0.83 0.81 1135
```
### Korean
Number of documents: 10000
```
precision recall f1-score support
@@ -312,6 +338,7 @@ macro avg 0.83 0.83 0.83 13329
```
### Malay
Number of documents: 1000
```
precision recall f1-score support
@@ -324,6 +351,7 @@ macro avg 0.91 0.92 0.91 1088
```
### Malayalam
Number of documents: 1000
```
precision recall f1-score support
@@ -336,6 +364,7 @@ macro avg 0.78 0.80 0.79 1155
```
### Marathi
Number of documents: 1000
```
precision recall f1-score support
@@ -348,6 +377,7 @@ macro avg 0.85 0.86 0.85 1190
```
### Persian
Number of documents: 10000
```
precision recall f1-score support
@@ -360,6 +390,7 @@ macro avg 0.92 0.92 0.92 10494
```
### Portuguese
Number of documents: 10000
```
precision recall f1-score support
@@ -372,6 +403,7 @@ macro avg 0.90 0.91 0.90 12673
```
### Russian
Number of documents: 10000
```
precision recall f1-score support
@@ -384,6 +416,7 @@ macro avg 0.87 0.88 0.88 12051
```
### Spanish
Number of documents: 10000
```
precision recall f1-score support
@@ -396,6 +429,7 @@ macro avg 0.90 0.91 0.90 12153
```
### Swahili
Number of documents: 1000
```
precision recall f1-score support
@@ -408,6 +442,7 @@ macro avg 0.88 0.89 0.88 1202
```
### Tagalog
Number of documents: 1000
```
precision recall f1-score support
@@ -420,6 +455,7 @@ macro avg 0.90 0.92 0.91 1027
```
### Tamil
Number of documents: 1000
```
precision recall f1-score support
@@ -432,6 +468,7 @@ macro avg 0.82 0.83 0.82 1183
```
### Telugu
Number of documents: 1000
```
precision recall f1-score support
@@ -444,6 +481,7 @@ macro avg 0.73 0.77 0.75 1193
```
### Thai
Number of documents: 10000
```
precision recall f1-score support
@@ -456,6 +494,7 @@ macro avg 0.68 0.74 0.71 14722
```
### Turkish
Number of documents: 10000
```
precision recall f1-score support
@@ -468,6 +507,7 @@ macro avg 0.91 0.92 0.91 13360
```
### Urdu
Number of documents: 1000
```
precision recall f1-score support
@@ -480,6 +520,7 @@ macro avg 0.92 0.94 0.93 1011
```
### Vietnamese
Number of documents: 10000
```
precision recall f1-score support
@@ -492,6 +533,7 @@ macro avg 0.89 0.90 0.90 11107
```
### Yoruba
Number of documents: 100
```
precision recall f1-score support
@@ -504,7 +546,7 @@ macro avg 0.63 0.68 0.63 107
```
## Reproduce the results
Download and prepare the dataset from the [[https://github.com/google-research/xtreme#download-the-data](https://github.com/google-research/xtreme#download-the-data)](XTREME repo). Next, from the root of the transformers repo run:
Download and prepare the dataset from the [XTREME repo](https://github.com/google-research/xtreme#download-the-data). Next, from the root of the transformers repo run:
```
cd examples/ner
python run_tf_ner.py \
@@ -533,8 +575,9 @@ nlp_ner = pipeline(
model="jplu/tf-xlm-r-ner-40-lang",
tokenizer=(
'jplu/tf-xlm-r-ner-40-lang',
{"use_fast": True}
))
{"use_fast": True}),
framework="tf"
)
text_fr = "Barack Obama est né à Hawaï."
text_en = "Barack Obama was born in Hawaii."
@@ -553,4 +596,4 @@ nlp_ner(test_zh)
nlp_ner(test_ar)
#Output: [{'word': '▁با', 'score': 0.9903655648231506, 'entity': 'PER'}, {'word': 'راك', 'score': 0.9850614666938782, 'entity': 'PER'}, {'word': '▁أوباما', 'score': 0.9850308299064636, 'entity': 'PER'}, {'word': '▁ها', 'score': 0.9477543234825134, 'entity': 'LOC'}, {'word': 'وا', 'score': 0.9428229928016663, 'entity': 'LOC'}, {'word': 'ي', 'score': 0.9319471716880798, 'entity': 'LOC'}]
```
```

View File

@@ -1,5 +1,5 @@
### Model
**[`albert-xlarge-v2`](https://huggingface.co/albert-xlarge-v2)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
**[`albert-xlarge-v2`](https://huggingface.co/albert-xlarge-v2)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
### Training Parameters
Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb

View File

@@ -1,5 +1,5 @@
### Model
**[`monologg/biobert_v1.1_pubmed`](https://huggingface.co/monologg/biobert_v1.1_pubmed)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
**[`monologg/biobert_v1.1_pubmed`](https://huggingface.co/monologg/biobert_v1.1_pubmed)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
This model is cased.

View File

@@ -1,5 +1,5 @@
### Model
**[`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
**[`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**
### Training Parameters
Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb

View File

@@ -0,0 +1,61 @@
---
language: turkish
---
# bert-turkish-question-answering
## Usage
```python
from transformers import pipeline
nlp = pipeline('question-answering', model='lserinol/bert-turkish-question-answering', tokenizer='lserinol/bert-turkish-question-answering')
nlp({
'question': "Ankara'da kaç ilçe vardır?",
'context': r"""Türkiye'nin başkenti Ankara'dır. Ülkenin en büyük idari birimleri illerdir ve 81 il vardır. Bu iller ilçelere ayrılmıştır, toplamda 973 ilçe mevcuttur."""
})
```
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("lserinol/bert-turkish-question-answering")
model = AutoModelForQuestionAnswering.from_pretrained("lserinol/bert-turkish-question-answering")
text = r"""
Ankara'nın başkent ilan edilmesinin ardından (13 Ekim 1923) şehir hızla gelişmiş ve Türkiye'nin ikinci en kalabalık ili olmuştur.
Türkiye Cumhuriyeti'nin ilk yıllarında ekonomisi tarım ve hayvancılığa dayanan ilin topraklarının yarısı hâlâ tarım amaçlı
kullanılmaktadır. Ekonomik etkinlik büyük oranda ticaret ve sanayiye dayalıdır. Tarım ve hayvancılığın ağırlığı ise giderek
azalmaktadır. Ankara ve civarındaki gerek kamu sektörü gerek özel sektör yatırımları, başka illerden büyük bir nüfus göçünü
teşvik etmiştir. Cumhuriyetin kuruluşundan günümüze, nüfusu ülke nüfusunun iki katı hızda artmıştır. Nüfusun yaklaşık dörtte
üçü hizmet sektörü olarak tanımlanabilecek memuriyet, ulaşım, haberleşme ve ticaret benzeri işlerde, dörtte biri sanayide,
%2'si ise tarım alanında çalışır. Sanayi, özellikle tekstil, gıda ve inşaat sektörlerinde yoğunlaşmıştır. Günümüzde ise en çok
savunma, metal ve motor sektörlerinde yatırım yapılmaktadır. Türkiye'nin en çok sayıda üniversiteye sahip ili olan Ankara'da
ayrıca, üniversite diplomalı kişi oranı ülke ortalamasının iki katıdır. Bu eğitimli nüfus, teknoloji ağırlıklı yatırımların
gereksinim duyduğu iş gücünü oluşturur. Ankara'dan otoyollar, demir yolu ve hava yoluyla Türkiye'nin diğer şehirlerine ulaşılır.
Ankara aynı zamanda başkent olarak Türkiye Büyük Millet Meclisi (TBMM)'ye de ev sahipliği yapmaktadır.
"""
questions = [
"Ankara kaç yılında başkent oldu?",
"Ankara ne zaman başkent oldu?",
"Ankara'dan başka şehirlere nasıl ulaşılır?",
"TBMM neyin kısaltmasıdır?"
]
for question in questions:
inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer_start_scores, answer_end_scores = model(**inputs)
answer_start = torch.argmax(
answer_start_scores
) # Get the most likely beginning of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print(f"Question: {question}")
print(f"Answer: {answer}\n")
```

View File

@@ -40,7 +40,7 @@ python run_language_modeling.py \
## Model in action / Example of usage ✒
You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py)
```bash
python run_generation.py \

View File

@@ -37,7 +37,7 @@ python run_language_modeling.py \
## Model in action / Example of usage: ✒
You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py)
```bash
python run_generation.py \

View File

@@ -19,7 +19,7 @@ I preprocessed the dataset and splitted it as train / dev (80/20)
| Dev | 2.2 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
- Labels covered:

View File

@@ -29,7 +29,7 @@ The model was trained on a Tesla P100 GPU and 25GB of RAM with the following com
```bash
export SQUAD_DIR=path/to/nl_squad
python transformers/examples/run_squad.py \
python transformers/examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
--do_train \

View File

@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:

View File

@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:

View File

@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:

View File

@@ -11,7 +11,7 @@ thumbnail:
- Dataset: [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) 📚
- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py) 🏋️‍♂️
- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) 🏋️‍♂️
## Metrics on test set 📋

View File

@@ -19,7 +19,7 @@ I preprocessed the dataset and splitted it as train / dev (80/20)
| Dev | 2.2 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
- Labels covered:

View File

@@ -11,7 +11,7 @@ This model is a fine-tuned version of the Spanish BERT [(BETO)](https://github.c
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora)
#### [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
#### [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
#### 21 Syntax annotations (Labels) covered:

View File

@@ -19,7 +19,7 @@ I preprocessed the dataset and splitted it as train / dev (80/20)
| Dev | 50 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
- **60** Labels covered:

View File

@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:

View File

@@ -0,0 +1,77 @@
# *De Novo* Drug Design with MLM
## What is it?
An approximation to [Generative Recurrent Networks for De Novo Drug Design](https://onlinelibrary.wiley.com/doi/full/10.1002/minf.201700111) but training a MLM (RoBERTa like) from scratch.
## Why?
As mentioned in the paper:
Generative artificial intelligence models present a fresh approach to chemogenomics and de novo drug design, as they provide researchers with the ability to narrow down their search of the chemical space and focus on regions of interest.
They used a generative *recurrent neural network (RNN)* containing long shortterm memory (LSTM) cell to capture the syntax of molecular representations in terms of SMILES strings.
The learned pattern probabilities can be used for de novo SMILES generation. This molecular design concept **eliminates the need for virtual compound library enumeration** and **enables virtual compound design without requiring secondary or external activity prediction**.
## My Goal 🎯
By training a MLM from scratch on 438552 (cleaned*) SMILES I wanted to build a model that learns this kind of molecular combinations so that given a partial SMILE it can generate plausible combinations so that it can be proposed as new drugs.
By cleaned SMILES I mean that I used their [SMILES cleaning script](https://github.com/topazape/LSTM_Chem/blob/master/cleanup_smiles.py) to remove duplicates, salts, and stereochemical information.
You can see the detailed process of gathering the data, preprocess it and train the LSTM in their [repo](https://github.com/topazape/LSTM_Chem).
## Fast usage with ```pipelines``` 🧪
```python
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model='/mrm8488/chEMBL_smiles_v1',
tokenizer='/mrm8488/chEMBL_smiles_v1'
)
# CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc1 Atazanavir
smile1 = "CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)<mask>"
fill_mask(smile1)
# Output:
'''
[{'score': 0.6040295958518982,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)nc</s>',
'token': 265},
{'score': 0.2185731679201126,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)N</s>',
'token': 50},
{'score': 0.0642734169960022,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc</s>',
'token': 261},
{'score': 0.01932266168296337,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)CCCl</s>',
'token': 452},
{'score': 0.005068355705589056,
'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)C</s>',
'token': 39}]
'''
```
## More
I also created a [second version](https://huggingface.co/mrm8488/chEMBL26_smiles_v2) without applying the cleaning SMILES script mentioned above. You can use it in the same way as this one.
```python
fill_mask = pipeline(
"fill-mask",
model='/mrm8488/chEMBL26_smiles_v2',
tokenizer='/mrm8488/chEMBL26_smiles_v2'
)
```
[Original paper](https://www.ncbi.nlm.nih.gov/pubmed/29095571) Authors:
<details>
Swiss Federal Institute of Technology (ETH), Department of Chemistry and Applied Biosciences, VladimirPrelogWeg 4, 8093, Zurich, Switzerland,
Stanford University, Department of Computer Science, 450 Sierra Mall, Stanford, CA, 94305, USA,
inSili.com GmbH, 8049, Zurich, Switzerland,
Gisbert Schneider, Email: hc.zhte@trebsig.
</details>
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -11,7 +11,7 @@ thumbnail:
- Dataset: [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) 📚 for 15 languages
- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py) 🏋️‍♂️
- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) 🏋️‍♂️
## Metrics on test set 📋

View File

@@ -31,7 +31,7 @@ The model was fine-tuned on a Tesla P100 GPU and 25GB of RAM.
The script is the following:
```python
python transformers/examples/run_squad.py \
python transformers/examples/question-answering/run_squad.py \
--model_type distilbert \
--model_name_or_path distilbert-base-multilingual-cased \
--do_train \

View File

@@ -0,0 +1,67 @@
---
language: spanish
thumbnail: https://i.imgur.com/uxAvBfh.png
---
## ELECTRICIDAD: The Spanish Electra [Imgur](https://imgur.com/uxAvBfh)
**ELECTRICIDAD** is a small Electra like model (discriminator in this case) trained on a + 20 GB of the [OSCAR](https://oscar-corpus.com/) Spanish corpus.
As mentioned in the original [paper](https://openreview.net/pdf?id=r1xMH1BtvB):
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
For a detailed description and experimental results, please refer the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
## Model details ⚙
|Param| # Value|
|-----|--------|
|Layers| 12 |
|Hidden |256 |
|Params| 14M|
## Evaluation metrics (for discriminator) 🧾
|Metric | # Score |
|-------|---------|
|Accuracy| 0.94|
|Precision| 0.76|
|AUC | 0.92|
## Benchmarks 🔨
WIP 🚧
## How to use the discriminator in `transformers`
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch
discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-small-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-small-discriminator")
sentence = "El rápido zorro marrón salta sobre el perro perezoso"
fake_sentence = "El rápido zorro marrón falsea sobre el perro perezoso"
fake_tokens = tokenizer.tokenize(sentence)
fake_inputs = tokenizer.encode(sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
[print("%7s" % token, end="") for token in fake_tokens]
[print("%7s" % prediction, end="") for prediction in predictions.tolist()]
```
## Acknowledgments
I thank [🤗/transformers team](https://github.com/huggingface/transformers) for answering my doubts and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -26,7 +26,7 @@ thumbnail:
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:

View File

@@ -23,7 +23,7 @@ thumbnail:
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results:

View File

@@ -0,0 +1,90 @@
# For Turkish language, here is an easy-to-use NER application.
** Türkçe için kolay bir python NER (Bert + Transfer Learning) (İsim Varlık Tanıma) modeli...
Thanks to @stefan-it, I applied the followings for training
cd tr-data
for file in train.txt dev.txt test.txt labels.txt
do
wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done
cd ..
It will download the pre-processed datasets with training, dev and test splits and put them in a tr-data folder.
Run pre-training
After downloading the dataset, pre-training can be started. Just set the following environment variables:
```
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
```
Then run pre-training:
```
python3 run_ner.py --data_dir ./tr-data3 \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
```
# Usage
```
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner=pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
```
# Some results
Data1: For the data above
Eval Results:
* precision = 0.916400580551524
* recall = 0.9342309684101502
* f1 = 0.9252298787412536
* loss = 0.11335893666411284
Test Results:
* precision = 0.9192058759362955
* recall = 0.9303010230367262
* f1 = 0.9247201697271198
* loss = 0.11182546521618497
Data2:
https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
The performance for the data given by @kemalaraz is as follows
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
* precision = 0.9461980692049029
* recall = 0.959309358847465
* f1 = 0.9527086063783312
* loss = 0.037054269206847804
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
* precision = 0.9458370635631155
* recall = 0.9588201928530913
* f1 = 0.952284378344882
* loss = 0.035431676572445225

View File

@@ -0,0 +1,146 @@
# Bert-base Turkish Sentiment Model
https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
This model is used for Sentiment Analysis, which is based on BERTurk for Turkish Language https://huggingface.co/dbmdz/bert-base-turkish-cased
# Dataset
The dataset is taken from the studies [2] and [3] and merged.
* The study [2] gathered movie and product reviews. The products are book, DVD, electronics, and kitchen.
The movie dataset is taken from a cinema Web page (www.beyazperde.com) with
5331 positive and 5331 negative sentences. Reviews in the Web page are marked in
scale from 0 to 5 by the users who made the reviews. The study considered a review
sentiment positive if the rating is equal to or bigger than 4, and negative if it is less
or equal to 2. They also built Turkish product review dataset from an online retailer
Web page. They constructed benchmark dataset consisting of reviews regarding some
products (book, DVD, etc.). Likewise, reviews are marked in the range from 1 to 5,
and majority class of reviews are 5. Each category has 700 positive and 700 negative
reviews in which average rating of negative reviews is 2.27 and of positive reviews
is 4.5. This dataset is also used the study [1]
* The study[3] collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion.
*Merged Dataset*
| *size* | *data* |
|--------|----|
| 8000 |dev.tsv|
| 8262 |test.tsv|
| 32000 |train.tsv|
| *48290* |*total*|
The dataset is used by following papers
* 1 Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
* 2 Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
Discovery and Opinion Mining (WISDOM 13)
* [3] Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
# Training
```
export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2
python3 run_glue.py \
--model_type bert \
--model_name_or_path dbmdz/bert-base-turkish-uncased\
--task_name "SST-2" \
--do_train \
--do_eval \
--data_dir "./sst-2-newall" \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir "./model"
```
# Results
> 05/10/2020 17:00:43 - INFO - transformers.trainer - ***** Running Evaluation *****
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
>Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
>05/10/2020 17:01:17 - INFO - __main__ - ***** Eval results sst-2 *****
>05/10/2020 17:01:17 - INFO - __main__ - acc = 0.9539942492811602
>05/10/2020 17:01:17 - INFO - __main__ - loss = 0.16348013816401363
Accuracy is about *%95.4*
# Code Usage
```
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
p= sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
#[{'label': 'LABEL_1', 'score': 0.9871089}]
print (p[0]['label']=='LABEL_1')
#True
p= sa("Film çok kötü ve çok sahteydi")
print(p)
#[{'label': 'LABEL_0', 'score': 0.9975505}]
print (p[0]['label']=='LABEL_1')
#False
```
# Test your data
Suppose your file has lots of lines of comment and label (1 or 0) at the end (tab seperated)
> comment1 ... \t label
> comment2 ... \t label
> ...
```
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
f="/path/to/your/file/yourfile.tsv"
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
i,crr=0,0
for line in open(f):
lines=line.strip().split("\t")
if len(lines)==2:
i=i+1
if i%100==0:
print(i)
pred= sa(lines[0])
pred=pred[0]["label"].split("_")[1]
if pred== lines[1]:
crr=crr+1
print(crr, i, crr/i)
```

View File

@@ -30,7 +30,8 @@
},
"colab": {
"name": "03-pipelines.ipynb",
"provenance": []
"provenance": [],
"include_colab_link": true
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
@@ -1504,6 +1505,251 @@
"left": null
}
},
"3c86415352574190b71e1fe5a15d36f1": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
"state": {
"_view_name": "HBoxView",
"_dom_classes": [],
"_model_name": "HBoxModel",
"_view_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_view_count": null,
"_view_module_version": "1.5.0",
"box_style": "",
"layout": "IPY_MODEL_dd2c9dd935754cf2802233053554c21c",
"_model_module": "@jupyter-widgets/controls",
"children": [
"IPY_MODEL_8ae3be32d9c845e59fdb1c47884d48aa",
"IPY_MODEL_4dea0031f3554752ad5aad01fe516a60"
]
}
},
"dd2c9dd935754cf2802233053554c21c": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_view_name": "LayoutView",
"grid_template_rows": null,
"right": null,
"justify_content": null,
"_view_module": "@jupyter-widgets/base",
"overflow": null,
"_model_module_version": "1.2.0",
"_view_count": null,
"flex_flow": null,
"width": null,
"min_width": null,
"border": null,
"align_items": null,
"bottom": null,
"_model_module": "@jupyter-widgets/base",
"top": null,
"grid_column": null,
"overflow_y": null,
"overflow_x": null,
"grid_auto_flow": null,
"grid_area": null,
"grid_template_columns": null,
"flex": null,
"_model_name": "LayoutModel",
"justify_items": null,
"grid_row": null,
"max_height": null,
"align_content": null,
"visibility": null,
"align_self": null,
"height": null,
"min_height": null,
"padding": null,
"grid_auto_rows": null,
"grid_gap": null,
"max_width": null,
"order": null,
"_view_module_version": "1.2.0",
"grid_template_areas": null,
"object_position": null,
"object_fit": null,
"grid_auto_columns": null,
"margin": null,
"display": null,
"left": null
}
},
"8ae3be32d9c845e59fdb1c47884d48aa": {
"model_module": "@jupyter-widgets/controls",
"model_name": "FloatProgressModel",
"state": {
"_view_name": "ProgressView",
"style": "IPY_MODEL_1efb96d931a446de92f1930b973ae846",
"_dom_classes": [],
"description": "Downloading: 100%",
"_model_name": "FloatProgressModel",
"bar_style": "success",
"max": 230,
"_view_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"value": 230,
"_view_count": null,
"_view_module_version": "1.5.0",
"orientation": "horizontal",
"min": 0,
"description_tooltip": null,
"_model_module": "@jupyter-widgets/controls",
"layout": "IPY_MODEL_6a4f5aab5ba949fd860b5a35bba7db9c"
}
},
"4dea0031f3554752ad5aad01fe516a60": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HTMLModel",
"state": {
"_view_name": "HTMLView",
"style": "IPY_MODEL_4b02b2e964ad49af9f7ce7023131ceb8",
"_dom_classes": [],
"description": "",
"_model_name": "HTMLModel",
"placeholder": "",
"_view_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"value": " 230/230 [00:00&lt;00:00, 8.69kB/s]",
"_view_count": null,
"_view_module_version": "1.5.0",
"description_tooltip": null,
"_model_module": "@jupyter-widgets/controls",
"layout": "IPY_MODEL_0ae8a68c3668401da8d8a6d5ec9cac8f"
}
},
"1efb96d931a446de92f1930b973ae846": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ProgressStyleModel",
"state": {
"_view_name": "StyleView",
"_model_name": "ProgressStyleModel",
"description_width": "initial",
"_view_module": "@jupyter-widgets/base",
"_model_module_version": "1.5.0",
"_view_count": null,
"_view_module_version": "1.2.0",
"bar_color": null,
"_model_module": "@jupyter-widgets/controls"
}
},
"6a4f5aab5ba949fd860b5a35bba7db9c": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_view_name": "LayoutView",
"grid_template_rows": null,
"right": null,
"justify_content": null,
"_view_module": "@jupyter-widgets/base",
"overflow": null,
"_model_module_version": "1.2.0",
"_view_count": null,
"flex_flow": null,
"width": null,
"min_width": null,
"border": null,
"align_items": null,
"bottom": null,
"_model_module": "@jupyter-widgets/base",
"top": null,
"grid_column": null,
"overflow_y": null,
"overflow_x": null,
"grid_auto_flow": null,
"grid_area": null,
"grid_template_columns": null,
"flex": null,
"_model_name": "LayoutModel",
"justify_items": null,
"grid_row": null,
"max_height": null,
"align_content": null,
"visibility": null,
"align_self": null,
"height": null,
"min_height": null,
"padding": null,
"grid_auto_rows": null,
"grid_gap": null,
"max_width": null,
"order": null,
"_view_module_version": "1.2.0",
"grid_template_areas": null,
"object_position": null,
"object_fit": null,
"grid_auto_columns": null,
"margin": null,
"display": null,
"left": null
}
},
"4b02b2e964ad49af9f7ce7023131ceb8": {
"model_module": "@jupyter-widgets/controls",
"model_name": "DescriptionStyleModel",
"state": {
"_view_name": "StyleView",
"_model_name": "DescriptionStyleModel",
"description_width": "",
"_view_module": "@jupyter-widgets/base",
"_model_module_version": "1.5.0",
"_view_count": null,
"_view_module_version": "1.2.0",
"_model_module": "@jupyter-widgets/controls"
}
},
"0ae8a68c3668401da8d8a6d5ec9cac8f": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"state": {
"_view_name": "LayoutView",
"grid_template_rows": null,
"right": null,
"justify_content": null,
"_view_module": "@jupyter-widgets/base",
"overflow": null,
"_model_module_version": "1.2.0",
"_view_count": null,
"flex_flow": null,
"width": null,
"min_width": null,
"border": null,
"align_items": null,
"bottom": null,
"_model_module": "@jupyter-widgets/base",
"top": null,
"grid_column": null,
"overflow_y": null,
"overflow_x": null,
"grid_auto_flow": null,
"grid_area": null,
"grid_template_columns": null,
"flex": null,
"_model_name": "LayoutModel",
"justify_items": null,
"grid_row": null,
"max_height": null,
"align_content": null,
"visibility": null,
"align_self": null,
"height": null,
"min_height": null,
"padding": null,
"grid_auto_rows": null,
"grid_gap": null,
"max_width": null,
"order": null,
"_view_module_version": "1.2.0",
"grid_template_areas": null,
"object_position": null,
"object_fit": null,
"grid_auto_columns": null,
"margin": null,
"display": null,
"left": null
}
},
"fd44cf6ab17e4b768b2e1d5cb8ce5af9": {
"model_module": "@jupyter-widgets/controls",
"model_name": "HBoxModel",
@@ -2105,6 +2351,16 @@
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/huggingface/transformers/blob/generation_pipeline_docs/notebooks/03-pipelines.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -2170,13 +2426,29 @@
},
"id": "4maAknWNrl_N",
"colab_type": "code",
"colab": {}
"colab": {
"base_uri": "https://localhost:8080/",
"height": 102
},
"outputId": "467e3cc8-a069-47da-8029-86e4142c7dde"
},
"source": [
"!pip install -q transformers"
],
"execution_count": 0,
"outputs": []
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 645kB 4.4MB/s \n",
"\u001b[K |████████████████████████████████| 3.8MB 11.7MB/s \n",
"\u001b[K |████████████████████████████████| 890kB 51.5MB/s \n",
"\u001b[K |████████████████████████████████| 1.0MB 46.0MB/s \n",
"\u001b[?25h Building wheel for sacremoses (setup.py) ... \u001b[?25l\u001b[?25hdone\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
@@ -2219,6 +2491,7 @@
},
"id": "AMRXHQw9rl_d",
"colab_type": "code",
"outputId": "a7a10851-b71e-4553-9afc-04066120410d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 83,
@@ -2232,14 +2505,13 @@
"ad84da685cf44abb90d17d9d2e023b48",
"a246f9eea2d7440cb979e728741d2e32"
]
},
"outputId": "a7a10851-b71e-4553-9afc-04066120410d"
}
},
"source": [
"nlp_sentence_classif = pipeline('sentiment-analysis')\n",
"nlp_sentence_classif('Such a nice weather outside !')"
],
"execution_count": 3,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2300,6 +2572,7 @@
},
"id": "B3BDRX_Krl_n",
"colab_type": "code",
"outputId": "a6b90b11-a272-4ecb-960d-4c682551b399",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 185,
@@ -2313,14 +2586,13 @@
"405afa5bb8b840d8bc0850e02f593ce4",
"78c718e3d5fa4cb892217260bea6d540"
]
},
"outputId": "a6b90b11-a272-4ecb-960d-4c682551b399"
}
},
"source": [
"nlp_token_class = pipeline('ner')\n",
"nlp_token_class('Hugging Face is a French company based in New-York.')"
],
"execution_count": 4,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2384,6 +2656,7 @@
},
"id": "ND_8LzQKrl_u",
"colab_type": "code",
"outputId": "c59ae695-c465-4de6-fa6e-181d8f1a3992",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 117,
@@ -2397,14 +2670,13 @@
"cd64e3f20b23483daa79712bde6622ea",
"67cbaa1f55d24e62ad6b022af36bca56"
]
},
"outputId": "c59ae695-c465-4de6-fa6e-181d8f1a3992"
}
},
"source": [
"nlp_qa = pipeline('question-answering')\n",
"nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
],
"execution_count": 5,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2470,6 +2742,7 @@
},
"id": "zpJQ2HXNrl_4",
"colab_type": "code",
"outputId": "3fb62e7a-25a6-4b06-ced8-51eb8aa6bf33",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 321,
@@ -2483,14 +2756,13 @@
"a35703cc8ff44e93a8c0eb413caddc40",
"9df7014c99b343f3b178fa020ff56010"
]
},
"outputId": "3fb62e7a-25a6-4b06-ced8-51eb8aa6bf33"
}
},
"source": [
"nlp_fill = pipeline('fill-mask')\n",
"nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)"
],
"execution_count": 6,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2560,11 +2832,11 @@
"metadata": {
"id": "8BaOgzi1u1Yc",
"colab_type": "code",
"outputId": "2168e437-cfba-4247-a38c-07f02f555c6e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 88
},
"outputId": "2168e437-cfba-4247-a38c-07f02f555c6e"
}
},
"source": [
"TEXT_TO_SUMMARIZE = \"\"\" \n",
@@ -2590,7 +2862,7 @@
"summarizer = pipeline('summarization')\n",
"summarizer(TEXT_TO_SUMMARIZE)"
],
"execution_count": 7,
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
@@ -2631,6 +2903,7 @@
"metadata": {
"id": "8FwayP4nwV3Z",
"colab_type": "code",
"outputId": "66956816-c924-4718-fe58-cabef7d51974",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 83,
@@ -2644,15 +2917,14 @@
"ad78042ee71a41fd989e4b4ce9d2e3c1",
"40c8d2617f3d4c84b923b140456fa5da"
]
},
"outputId": "66956816-c924-4718-fe58-cabef7d51974"
}
},
"source": [
"# English to French\n",
"translator = pipeline('translation_en_to_fr')\n",
"translator(\"HuggingFace is a French company that is based in New York City. HuggingFace's mission is to solve NLP one commit at a time\")"
],
"execution_count": 8,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2696,6 +2968,7 @@
"metadata": {
"colab_type": "code",
"id": "ra0-WfznwoIW",
"outputId": "278a3d5f-cc42-40bc-a9db-c92ec5a3a2f0",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 83,
@@ -2709,15 +2982,14 @@
"4486f8a2efc34b9aab3864eb5ad2ba48",
"d6228324f3444aa6bd1323d65ae4ff75"
]
},
"outputId": "278a3d5f-cc42-40bc-a9db-c92ec5a3a2f0"
}
},
"source": [
"# English to German\n",
"translator = pipeline('translation_en_to_de')\n",
"translator(\"The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.\")"
],
"execution_count": 9,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2756,6 +3028,89 @@
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qPUpg0M8hCtB",
"colab_type": "text"
},
"source": [
"## 7. Text Generation\n",
"\n",
"Text generation is currently supported by GPT-2, OpenAi-GPT, TransfoXL, XLNet, CTRL and Reformer."
]
},
{
"cell_type": "code",
"metadata": {
"id": "5pKfxTxohXuZ",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 120,
"referenced_widgets": [
"3c86415352574190b71e1fe5a15d36f1",
"dd2c9dd935754cf2802233053554c21c",
"8ae3be32d9c845e59fdb1c47884d48aa",
"4dea0031f3554752ad5aad01fe516a60",
"1efb96d931a446de92f1930b973ae846",
"6a4f5aab5ba949fd860b5a35bba7db9c",
"4b02b2e964ad49af9f7ce7023131ceb8",
"0ae8a68c3668401da8d8a6d5ec9cac8f"
]
},
"outputId": "8705f6b4-2413-4ac6-f72d-e5ecce160662"
},
"source": [
"text_generator = pipeline(\"text-generation\")\n",
"text_generator(\"Today is a beautiful day and I will\")"
],
"execution_count": 5,
"outputs": [
{
"output_type": "display_data",
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3c86415352574190b71e1fe5a15d36f1",
"version_minor": 0,
"version_major": 2
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
]
},
"metadata": {
"tags": []
}
},
{
"output_type": "stream",
"text": [
"\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[{'generated_text': 'Today is a beautiful day and I will celebrate my birthday!\"\\n\\nThe mother told CNN the two had planned their meal together. After dinner, she added that she and I walked down the street and stopped at a diner near her home. \"He'}]"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -2763,7 +3118,7 @@
"colab_type": "text"
},
"source": [
"## 7. Projection - Features Extraction "
"## 8. Projection - Features Extraction "
]
},
{
@@ -2775,6 +3130,7 @@
},
"id": "O4SjR1QQrl__",
"colab_type": "code",
"outputId": "2ce966d5-7a89-4488-d48f-626d1c2a8222",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 83,
@@ -2788,8 +3144,7 @@
"31d97ecf78fa412c99e6659196d82828",
"c6be5d48ec3c4c799d1445607e5f1ac6"
]
},
"outputId": "2ce966d5-7a89-4488-d48f-626d1c2a8222"
}
},
"source": [
"import numpy as np\n",
@@ -2797,7 +3152,7 @@
"output = nlp_features('Hugging Face is a French company based in Paris')\n",
"np.array(output).shape # (Samples, Tokens, Vector Size)\n"
],
"execution_count": 10,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2861,6 +3216,7 @@
},
"id": "yFlBPQHtrmAH",
"colab_type": "code",
"outputId": "03cc3207-a7e8-49fd-904a-63a7a1d0eb7a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 116,
@@ -2872,8 +3228,7 @@
"62b10ca525cc4ac68f3a006434eb7416",
"211109537fbe4e60b89a238c89db1346"
]
},
"outputId": "03cc3207-a7e8-49fd-904a-63a7a1d0eb7a"
}
},
"source": [
"task = widgets.Dropdown(\n",
@@ -2906,7 +3261,7 @@
"input.on_submit(forward)\n",
"display(task, input)"
],
"execution_count": 11,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
@@ -2958,6 +3313,7 @@
},
"id": "GCoKbBTYrmAN",
"colab_type": "code",
"outputId": "57c3a647-160a-4b3a-e852-e7a1daf1294a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143,
@@ -2969,8 +3325,7 @@
"d305ba1662e3466c93ab5cca7ebf8f33",
"879f7a3747ad455d810c7a29918648ee"
]
},
"outputId": "57c3a647-160a-4b3a-e852-e7a1daf1294a"
}
},
"source": [
"context = widgets.Textarea(\n",
@@ -2995,7 +3350,7 @@
"query.on_submit(forward)\n",
"display(context, query)"
],
"execution_count": 12,
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",

View File

@@ -79,13 +79,13 @@ extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rt
extras["quality"] = [
"black",
"isort",
"flake8",
"flake8==3.7.9",
]
extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]
setup(
name="transformers",
version="2.9.0",
version="2.9.1",
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
author_email="thomas@huggingface.co",
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",

View File

@@ -2,7 +2,7 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
__version__ = "2.9.0"
__version__ = "2.9.1"
# Work around to update TensorFlow's absl.logging threshold which alters the
# default Python logging output behavior when present.
@@ -248,7 +248,7 @@ if is_torch_available():
BART_PRETRAINED_MODEL_ARCHIVE_MAP,
)
from .modeling_marian import MarianMTModel
from .tokenization_marian import MarianSentencePieceTokenizer
from .tokenization_marian import MarianTokenizer
from .modeling_roberta import (
RobertaForMaskedLM,
RobertaModel,
@@ -287,6 +287,7 @@ if is_torch_available():
from .modeling_albert import (
AlbertPreTrainedModel,
AlbertModel,
AlbertForPreTraining,
AlbertForMaskedLM,
AlbertForSequenceClassification,
AlbertForQuestionAnswering,
@@ -358,6 +359,7 @@ if is_tf_available():
from .modeling_tf_auto import (
TFAutoModel,
TFAutoModelForPreTraining,
TFAutoModelForMultipleChoice,
TFAutoModelForSequenceClassification,
TFAutoModelForQuestionAnswering,
TFAutoModelWithLMHead,
@@ -490,7 +492,9 @@ if is_tf_available():
TFAlbertPreTrainedModel,
TFAlbertMainLayer,
TFAlbertModel,
TFAlbertForPreTraining,
TFAlbertForMaskedLM,
TFAlbertForMultipleChoice,
TFAlbertForSequenceClassification,
TFAlbertForQuestionAnswering,
TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,

View File

@@ -26,7 +26,7 @@ def gelu_new(x):
""" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
Also see https://arxiv.org/abs/1606.08415
"""
return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
if torch.__version__ < "1.4.0":
@@ -36,7 +36,7 @@ else:
def gelu_fast(x):
return 0.5 * x * (1 + torch.tanh(x * 0.7978845608 * (1 + 0.044715 * x * x)))
return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))
ACT2FN = {

View File

@@ -62,7 +62,21 @@ class ConvertCommand(BaseTransformersCLICommand):
self._finetuning_task_name = finetuning_task_name
def run(self):
if self._model_type == "bert":
if self._model_type == "albert":
try:
from transformers.convert_albert_original_tf_checkpoint_to_pytorch import (
convert_tf_checkpoint_to_pytorch,
)
except ImportError:
msg = (
"transformers can only be used from the commandline to convert TensorFlow models in PyTorch, "
"In that case, it requires TensorFlow to be installed. Please see "
"https://www.tensorflow.org/install/ for installation instructions."
)
raise ImportError(msg)
convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
elif self._model_type == "bert":
try:
from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (
convert_tf_checkpoint_to_pytorch,

View File

@@ -28,6 +28,7 @@ from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, Electr
from .configuration_encoder_decoder import EncoderDecoderConfig
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
from .configuration_marian import MarianConfig
from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
from .configuration_reformer import ReformerConfig
from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
@@ -73,6 +74,7 @@ CONFIG_MAPPING = OrderedDict(
("albert", AlbertConfig,),
("camembert", CamembertConfig,),
("xlm-roberta", XLMRobertaConfig,),
("marian", MarianConfig,),
("bart", BartConfig,),
("reformer", ReformerConfig,),
("roberta", RobertaConfig,),

View File

@@ -23,4 +23,5 @@ PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MarianConfig(BartConfig):
model_type = "marian"
pretrained_config_archive_map = PRETRAINED_CONFIG_ARCHIVE_MAP

View File

@@ -24,7 +24,8 @@ from .configuration_utils import PretrainedConfig
logger = logging.getLogger(__name__)
REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json"
"google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json",
"google/reformer-enwik8": "https://cdn.huggingface.co/google/reformer-enwik8/config.json",
}

View File

@@ -20,7 +20,7 @@ import logging
import torch
from transformers import AlbertConfig, AlbertForMaskedLM, load_tf_weights_in_albert
from transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert
logging.basicConfig(level=logging.INFO)
@@ -30,7 +30,7 @@ def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, albert_config_file, pyt
# Initialise PyTorch model
config = AlbertConfig.from_json_file(albert_config_file)
print("Building PyTorch model from configuration: {}".format(str(config)))
model = AlbertForMaskedLM(config)
model = AlbertForPreTraining(config)
# Load weights from tf checkpoint
load_tf_weights_in_albert(model, config, tf_checkpoint_path)

View File

@@ -11,7 +11,8 @@ import numpy as np
import torch
from tqdm import tqdm
from transformers import MarianConfig, MarianMTModel, MarianSentencePieceTokenizer
from transformers import MarianConfig, MarianMTModel, MarianTokenizer
from transformers.hf_api import HfApi
def remove_prefix(text: str, prefix: str):
@@ -38,6 +39,19 @@ def load_layers_(layer_lst: torch.nn.ModuleList, opus_state: dict, converter, is
layer.load_state_dict(sd, strict=True)
def find_pretrained_model(src_lang: str, tgt_lang: str) -> List[str]:
"""Find models that can accept src_lang as input and return tgt_lang as output."""
prefix = "Helsinki-NLP/opus-mt-"
api = HfApi()
model_list = api.model_list()
model_ids = [x.modelId for x in model_list if x.modelId.startswith("Helsinki-NLP")]
src_and_targ = [
remove_prefix(m, prefix).lower().split("-") for m in model_ids if "+" not in m
] # + cant be loaded.
matching = [f"{prefix}{a}-{b}" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]
return matching
def add_emb_entries(wemb, final_bias, n_special_tokens=1):
vsize, d_model = wemb.shape
embs_to_add = np.zeros((n_special_tokens, d_model))
@@ -81,7 +95,103 @@ def find_model_file(dest_dir): # this one better
return model_file
def parse_readmes(repo_path):
# Group Names Logic: change long opus model names to something shorter, like opus-mt-en-ROMANCE
ROM_GROUP = "fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la"
GROUPS = [
("cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh", "ZH"),
(ROM_GROUP, "ROMANCE"),
("de+nl+fy+af+da+fo+is+no+nb+nn+sv", "NORTH_EU"),
("da+fo+is+no+nb+nn+sv", "SCANDINAVIA"),
("se+sma+smj+smn+sms", "SAMI"),
("nb_NO+nb+nn_NO+nn+nog+no_nb+no", "NORWAY"),
("ga+cy+br+gd+kw+gv", "CELTIC"), # https://en.wikipedia.org/wiki/Insular_Celtic_languages
]
GROUP_TO_OPUS_NAME = {
"opus-mt-ZH-de": "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de",
"opus-mt-ZH-fi": "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-fi",
"opus-mt-ZH-sv": "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-sv",
"opus-mt-SCANDINAVIA-SCANDINAVIA": "da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv",
"opus-mt-NORTH_EU-NORTH_EU": "de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv",
"opus-mt-de-ZH": "de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh",
"opus-mt-en_el_es_fi-en_el_es_fi": "en+el+es+fi-en+el+es+fi",
"opus-mt-en-ROMANCE": "en-fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO"
"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR"
"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la",
"opus-mt-en-CELTIC": "en-ga+cy+br+gd+kw+gv",
"opus-mt-es-NORWAY": "es-nb_NO+nb+nn_NO+nn+nog+no_nb+no",
"opus-mt-fi_nb_no_nn_ru_sv_en-SAMI": "fi+nb+no+nn+ru+sv+en-se+sma+smj+smn+sms",
"opus-mt-fi-ZH": "fi-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh",
"opus-mt-fi-NORWAY": "fi-nb_NO+nb+nn_NO+nn+nog+no_nb+no",
"opus-mt-ROMANCE-en": "fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO"
"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR"
"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la-en",
"opus-mt-CELTIC-en": "ga+cy+br+gd+kw+gv-en",
"opus-mt-sv-ZH": "sv-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh",
"opus-mt-sv-NORWAY": "sv-nb_NO+nb+nn_NO+nn+nog+no_nb+no",
}
OPUS_GITHUB_URL = "https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/"
ORG_NAME = "Helsinki-NLP/"
def convert_opus_name_to_hf_name(x):
for substr, grp_name in GROUPS:
x = x.replace(substr, grp_name)
return x.replace("+", "_")
def convert_hf_name_to_opus_name(hf_model_name):
"""Relies on the assumption that there are no language codes like pt_br in models that are not in GROUP_TO_OPUS_NAME."""
hf_model_name = remove_prefix(hf_model_name, ORG_NAME)
if hf_model_name in GROUP_TO_OPUS_NAME:
opus_w_prefix = GROUP_TO_OPUS_NAME[hf_model_name]
else:
opus_w_prefix = hf_model_name.replace("_", "+")
return remove_prefix(opus_w_prefix, "opus-mt-")
def write_model_card(
hf_model_name: str,
repo_path="OPUS-MT-train/models/",
dry_run=False,
model_card_dir=Path("marian_converted/model_cards/Helsinki-NLP/"),
) -> str:
"""Copy the most recent model's readme section from opus, and add metadata.
upload command: s3cmd sync --recursive model_card_dir s3://models.huggingface.co/bert/Helsinki-NLP/
"""
hf_model_name = remove_prefix(hf_model_name, ORG_NAME)
opus_name: str = convert_hf_name_to_opus_name(hf_model_name)
opus_src, opus_tgt = [x.split("+") for x in opus_name.split("-")]
readme_url = OPUS_GITHUB_URL + f"{opus_name}/README.md"
s, t = ",".join(opus_src), ",".join(opus_tgt)
extra_markdown = f"### {hf_model_name}\n\n* source languages: {s}\n* target languages: {t}\n* OPUS readme: [{opus_name}]({readme_url})\n"
# combine with opus markdown
opus_readme_path = Path(f"{repo_path}{opus_name}/README.md")
assert opus_readme_path.exists(), opus_readme_path
content = opus_readme_path.open().read()
content = content.split("\n# ")[-1] # Get the lowest level 1 header in the README -- the most recent model.
content = "*".join(content.split("*")[1:])
content = extra_markdown + "\n* " + content.replace("download", "download original weights")
if dry_run:
return content
# Save string to model_cards/hf_model_name/readme.md
model_card_dir.mkdir(exist_ok=True)
sub_dir = model_card_dir / hf_model_name
sub_dir.mkdir(exist_ok=True)
dest = sub_dir / "README.md"
dest.open("w").write(content)
return content
def get_clean_model_id_mapping(multiling_model_ids):
return {x: convert_opus_name_to_hf_name(x) for x in multiling_model_ids}
def make_registry(repo_path="Opus-MT-train/models"):
if not (Path(repo_path) / "fr-en" / "README.md").exists():
raise ValueError(
f"repo_path:{repo_path} does not exist: "
"You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git before calling."
)
results = {}
for p in Path(repo_path).ls():
n_dash = p.name.count("-")
@@ -90,22 +200,48 @@ def parse_readmes(repo_path):
else:
lns = list(open(p / "README.md").readlines())
results[p.name] = _parse_readme(lns)
return results
return [(k, v["pre-processing"], v["download"], v["download"][:-4] + ".test.txt") for k, v in results.items()]
def download_all_sentencepiece_models(repo_path="Opus-MT-train/models"):
def convert_all_sentencepiece_models(model_list=None, repo_path=None):
"""Requires 300GB"""
save_dir = Path("marian_ckpt")
if not Path(repo_path).exists():
raise ValueError("You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git")
results: dict = parse_readmes(repo_path)
for k, v in tqdm(list(results.items())):
if os.path.exists(save_dir / k):
print(f"already have path {k}")
dest_dir = Path("marian_converted")
dest_dir.mkdir(exist_ok=True)
if model_list is None:
model_list: list = make_registry(repo_path=repo_path)
for k, prepro, download, test_set_url in tqdm(model_list):
if "SentencePiece" not in prepro: # dont convert BPE models.
continue
if "SentencePiece" not in v["pre-processing"]:
if not os.path.exists(save_dir / k / "pytorch_model.bin"):
download_and_unzip(download, save_dir / k)
pair_name = convert_opus_name_to_hf_name(k)
convert(save_dir / k, dest_dir / f"opus-mt-{pair_name}")
def lmap(f, x) -> List:
return list(map(f, x))
def fetch_test_set(test_set_url):
import wget
fname = wget.download(test_set_url, f"opus_test.txt")
lns = Path(fname).open().readlines()
src = lmap(str.strip, lns[::4])
gold = lmap(str.strip, lns[1::4])
mar_model = lmap(str.strip, lns[2::4])
assert len(gold) == len(mar_model) == len(src)
os.remove(fname)
return src, mar_model, gold
def convert_whole_dir(path=Path("marian_ckpt/")):
for subdir in tqdm(list(path.ls())):
dest_dir = f"marian_converted/{subdir.name}"
if (dest_dir / "pytorch_model.bin").exists():
continue
download_and_unzip(v["download"], save_dir / k)
convert(source_dir, dest_dir)
def _parse_readme(lns):
@@ -131,7 +267,7 @@ def _parse_readme(lns):
return subres
def write_metadata(dest_dir: Path):
def save_tokenizer_config(dest_dir: Path):
dname = dest_dir.name.split("-")
dct = dict(target_lang=dname[-1], source_lang="-".join(dname[:-1]))
save_json(dct, dest_dir / "tokenizer_config.json")
@@ -148,13 +284,17 @@ def add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]):
return added
def find_vocab_file(model_dir):
return list(model_dir.glob("*vocab.yml"))[0]
def add_special_tokens_to_vocab(model_dir: Path) -> None:
vocab = load_yaml(model_dir / "opus.spm32k-spm32k.vocab.yml")
vocab = load_yaml(find_vocab_file(model_dir))
vocab = {k: int(v) for k, v in vocab.items()}
num_added = add_to_vocab_(vocab, ["<pad>"])
print(f"added {num_added} tokens to vocab")
save_json(vocab, model_dir / "vocab.json")
write_metadata(model_dir)
save_tokenizer_config(model_dir)
def save_tokenizer(self, save_directory):
@@ -251,7 +391,6 @@ class OpusState:
# Process decoder.yml
decoder_yml = cast_marian_config(load_yaml(source_dir / "decoder.yml"))
# TODO: what are normalize and word-penalty?
check_marian_cfg_assumptions(cfg)
self.hf_config = MarianConfig(
vocab_size=cfg["vocab_size"],
@@ -273,6 +412,9 @@ class OpusState:
dropout=0.1, # see opus-mt-train repo/transformer-dropout param.
# default: add_final_layer_norm=False,
num_beams=decoder_yml["beam-size"],
decoder_start_token_id=self.pad_token_id,
bad_words_ids=[[self.pad_token_id]],
max_length=512,
)
def _check_layer_entries(self):
@@ -349,12 +491,12 @@ def download_and_unzip(url, dest_dir):
os.remove(filename)
def main(source_dir, dest_dir):
def convert(source_dir: Path, dest_dir):
dest_dir = Path(dest_dir)
dest_dir.mkdir(exist_ok=True)
add_special_tokens_to_vocab(source_dir)
tokenizer = MarianSentencePieceTokenizer.from_pretrained(str(source_dir))
tokenizer = MarianTokenizer.from_pretrained(str(source_dir))
save_tokenizer(tokenizer, dest_dir)
opus_state = OpusState(source_dir)
@@ -377,7 +519,7 @@ if __name__ == "__main__":
source_dir = Path(args.src)
assert source_dir.exists()
dest_dir = f"converted-{source_dir.name}" if args.dest is None else args.dest
main(source_dir, dest_dir)
convert(source_dir, dest_dir)
def load_yaml(path):

View File

@@ -46,7 +46,7 @@ from transformers import (
OpenAIGPTConfig,
RobertaConfig,
T5Config,
TFAlbertForMaskedLM,
TFAlbertForPreTraining,
TFBertForPreTraining,
TFBertForQuestionAnswering,
TFBertForSequenceClassification,
@@ -109,7 +109,7 @@ if is_torch_available():
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
CTRLLMHeadModel,
CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
AlbertForMaskedLM,
AlbertForPreTraining,
ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
T5ForConditionalGeneration,
T5_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -148,7 +148,7 @@ else:
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
CTRLLMHeadModel,
CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
AlbertForMaskedLM,
AlbertForPreTraining,
ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
T5ForConditionalGeneration,
T5_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -318,8 +318,8 @@ MODEL_CLASSES = {
),
"albert": (
AlbertConfig,
TFAlbertForMaskedLM,
AlbertForMaskedLM,
TFAlbertForPreTraining,
AlbertForPreTraining,
ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
),

View File

@@ -93,7 +93,7 @@ def set_block_weights_in_torch(weights, torch_block, hidden_size):
set_layer_weights_in_torch_local(attn_weights, torch_block.attention, hidden_size)
# intermediate weighs
intermediate_weights = weights[2][0][2][2]
intermediate_weights = weights[2][0][1][2]
# Chunked Feed Forward
if len(intermediate_weights) == 4:
@@ -145,19 +145,16 @@ def set_model_weights_in_torch(weights, torch_model, hidden_size):
position_embeddings.weights[emb_idx] = torch.nn.Parameter(torch.tensor(emb_weights))
trax_layer_weights = weights[5]
assert len(torch_model_reformer.encoder.layers) * 4 + 1 == len(
assert len(torch_model_reformer.encoder.layers) * 4 == len(
trax_layer_weights
), "HF and trax model do not have the same number of layers"
for layer_idx, layer in enumerate(torch_model_reformer.encoder.layers):
block_weights = trax_layer_weights[4 * layer_idx : 4 * (layer_idx + 1)]
set_block_weights_in_torch(block_weights, layer, hidden_size)
# output weights
out_weights = weights[6]
# output layer norm
layer_norm_out_weight = np.asarray(out_weights[0][0])
layer_norm_out_bias = np.asarray(out_weights[0][1])
layer_norm_out_weight = np.asarray(weights[7][0])
layer_norm_out_bias = np.asarray(weights[7][1])
set_param(
torch_model_reformer.encoder.layer_norm,
torch.tensor(layer_norm_out_weight),
@@ -165,8 +162,8 @@ def set_model_weights_in_torch(weights, torch_model, hidden_size):
)
# output embeddings
output_embed_weights = np.asarray(out_weights[2][0])
output_embed_bias = np.asarray(out_weights[2][1])
output_embed_weights = np.asarray(weights[9][0])
output_embed_bias = np.asarray(weights[9][1])
set_param(
torch_model.lm_head.decoder,
torch.tensor(output_embed_weights).transpose(0, 1).contiguous(),

View File

@@ -5,12 +5,12 @@ from dataclasses import dataclass, field
from typing import List, Optional
import torch
from filelock import FileLock
from torch.utils.data.dataset import Dataset
from ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
from ...tokenization_utils import PreTrainedTokenizer
from ...tokenization_xlm_roberta import XLMRobertaTokenizer
from ...trainer import torch_distributed_zero_first
from ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors
from ..processors.utils import InputFeatures
@@ -63,7 +63,6 @@ class GlueDataset(Dataset):
tokenizer: PreTrainedTokenizer,
limit_length: Optional[int] = None,
evaluate=False,
local_rank=-1,
):
self.args = args
processor = glue_processors[args.task_name]()
@@ -75,9 +74,11 @@ class GlueDataset(Dataset):
"dev" if evaluate else "train", tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
),
)
with torch_distributed_zero_first(local_rank):
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
lock_path = cached_features_file + ".lock"
with FileLock(lock_path):
if os.path.exists(cached_features_file) and not args.overwrite_cache:
start = time.time()
@@ -109,13 +110,12 @@ class GlueDataset(Dataset):
label_list=label_list,
output_mode=self.output_mode,
)
if local_rank in [-1, 0]:
start = time.time()
torch.save(self.features, cached_features_file)
# ^ This seems to take a lot of time so I want to investigate why and how we can improve.
logger.info(
f"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
)
start = time.time()
torch.save(self.features, cached_features_file)
# ^ This seems to take a lot of time so I want to investigate why and how we can improve.
logger.info(
f"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
)
def __len__(self):
return len(self.features)

View File

@@ -15,6 +15,7 @@ import tempfile
from contextlib import contextmanager
from functools import partial, wraps
from hashlib import sha256
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
from zipfile import ZipFile, is_zipfile
@@ -68,19 +69,10 @@ except ImportError:
)
default_cache_path = os.path.join(torch_cache_home, "transformers")
try:
from pathlib import Path
PYTORCH_PRETRAINED_BERT_CACHE = Path(
os.getenv("PYTORCH_TRANSFORMERS_CACHE", os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path))
)
except (AttributeError, ImportError):
PYTORCH_PRETRAINED_BERT_CACHE = os.getenv(
"PYTORCH_TRANSFORMERS_CACHE", os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path)
)
PYTORCH_TRANSFORMERS_CACHE = PYTORCH_PRETRAINED_BERT_CACHE # Kept for backward compatibility
TRANSFORMERS_CACHE = PYTORCH_PRETRAINED_BERT_CACHE # Kept for backward compatibility
PYTORCH_PRETRAINED_BERT_CACHE = os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path)
PYTORCH_TRANSFORMERS_CACHE = os.getenv("PYTORCH_TRANSFORMERS_CACHE", PYTORCH_PRETRAINED_BERT_CACHE)
TRANSFORMERS_CACHE = os.getenv("TRANSFORMERS_CACHE", PYTORCH_TRANSFORMERS_CACHE)
WEIGHTS_NAME = "pytorch_model.bin"
TF2_WEIGHTS_NAME = "tf_model.h5"

View File

@@ -111,7 +111,8 @@ def load_tf_weights_in_albert(model, config, tf_checkpoint_path):
# No ALBERT model currently handles the next sentence prediction task
if "seq_relationship" in name:
continue
name = name.replace("seq_relationship/output_", "sop_classifier/classifier/")
name = name.replace("weights", "weight")
name = name.split("/")
@@ -174,7 +175,7 @@ class AlbertEmbeddings(BertEmbeddings):
def __init__(self, config):
super().__init__(config)
self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=0)
self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)
self.LayerNorm = torch.nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)
@@ -568,6 +569,115 @@ class AlbertModel(AlbertPreTrainedModel):
return outputs
@add_start_docstrings(
"""Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and
a `sentence order prediction (classification)` head. """,
ALBERT_START_DOCSTRING,
)
class AlbertForPreTraining(AlbertPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.albert = AlbertModel(config)
self.predictions = AlbertMLMHead(config)
self.sop_classifier = AlbertSOPHead(config)
self.init_weights()
self.tie_weights()
def tie_weights(self):
self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)
def get_output_embeddings(self):
return self.predictions.decoder
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
masked_lm_labels=None,
sentence_order_label=None,
):
r"""
masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
Indices should be in ``[0, 1]``.
``0`` indicates original order (sequence A, then sequence B),
``1`` indicates switched order (sequence B, then sequence A).
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
sop_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False
continuation before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import AlbertTokenizer, AlbertForPreTraining
import torch
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertForPreTraining.from_pretrained('albert-base-v2')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
prediction_scores, sop_scores = outputs[:2]
"""
outputs = self.albert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
)
sequence_output, pooled_output = outputs[:2]
prediction_scores = self.predictions(sequence_output)
sop_scores = self.sop_classifier(pooled_output)
outputs = (prediction_scores, sop_scores,) + outputs[2:] # add hidden states and attention if they are here
if masked_lm_labels is not None and sentence_order_label is not None:
loss_fct = CrossEntropyLoss()
masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
sentence_order_loss = loss_fct(sop_scores.view(-1, 2), sentence_order_label.view(-1))
total_loss = masked_lm_loss + sentence_order_loss
outputs = (total_loss,) + outputs
return outputs # (loss), prediction_scores, sop_scores, (hidden_states), (attentions)
class AlbertMLMHead(nn.Module):
def __init__(self, config):
super().__init__()
@@ -592,6 +702,19 @@ class AlbertMLMHead(nn.Module):
return prediction_scores
class AlbertSOPHead(nn.Module):
def __init__(self, config):
super().__init__()
self.dropout = nn.Dropout(config.classifier_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, pooled_output):
dropout_pooled_output = self.dropout(pooled_output)
logits = self.classifier(dropout_pooled_output)
return logits
@add_start_docstrings(
"Albert Model with a `language modeling` head on top.", ALBERT_START_DOCSTRING,
)
@@ -932,7 +1055,7 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel):
Examples::
# The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the
# examples/run_squad.py example to see how to fine-tune a model to a question answering task.
# examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.
from transformers import AlbertTokenizer, AlbertForQuestionAnswering
import torch

View File

@@ -39,10 +39,12 @@ from .configuration_auto import (
XLMRobertaConfig,
XLNetConfig,
)
from .configuration_marian import MarianConfig
from .configuration_utils import PretrainedConfig
from .modeling_albert import (
ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
AlbertForMaskedLM,
AlbertForPreTraining,
AlbertForQuestionAnswering,
AlbertForSequenceClassification,
AlbertForTokenClassification,
@@ -97,6 +99,7 @@ from .modeling_flaubert import (
FlaubertWithLMHeadModel,
)
from .modeling_gpt2 import GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2LMHeadModel, GPT2Model
from .modeling_marian import MarianMTModel
from .modeling_openai import OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OpenAIGPTLMHeadModel, OpenAIGPTModel
from .modeling_reformer import ReformerModel, ReformerModelWithLMHead
from .modeling_roberta import (
@@ -189,7 +192,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
[
(T5Config, T5ForConditionalGeneration),
(DistilBertConfig, DistilBertForMaskedLM),
(AlbertConfig, AlbertForMaskedLM),
(AlbertConfig, AlbertForPreTraining),
(CamembertConfig, CamembertForMaskedLM),
(XLMRobertaConfig, XLMRobertaForMaskedLM),
(BartConfig, BartForConditionalGeneration),
@@ -213,6 +216,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
(AlbertConfig, AlbertForMaskedLM),
(CamembertConfig, CamembertForMaskedLM),
(XLMRobertaConfig, XLMRobertaForMaskedLM),
(MarianConfig, MarianMTModel),
(BartConfig, BartForConditionalGeneration),
(RobertaConfig, RobertaForMaskedLM),
(BertConfig, BertForMaskedLM),
@@ -902,7 +906,7 @@ class AutoModelForQuestionAnswering:
Examples::
config = BertConfig.from_pretrained('bert-base-uncased') # Download configuration from S3 and cache.
model = AutoModelForSequenceClassification.from_config(config) # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = AutoModelForQuestionAnswering.from_config(config) # E.g. model was saved using `save_pretrained('./test/saved_model/')`
"""
for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():
if isinstance(config, config_class):

View File

@@ -886,7 +886,7 @@ class BartForConditionalGeneration(PretrainedBartModel):
if new_num_tokens <= old_num_tokens:
new_bias = self.final_logits_bias[:, :new_num_tokens]
else:
extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens))
extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)
new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)
self.register_buffer("final_logits_bias", new_bias)
@@ -980,12 +980,12 @@ class BartForConditionalGeneration(PretrainedBartModel):
"use_cache": use_cache, # change this to avoid caching (presumably for debugging)
}
def prepare_scores_for_generation(self, scores, cur_len, max_length):
def prepare_logits_for_generation(self, logits, cur_len, max_length):
if cur_len == 1:
self._force_token_ids_generation(scores, self.config.bos_token_id)
self._force_token_ids_generation(logits, self.config.bos_token_id)
if cur_len == max_length - 1 and self.config.eos_token_id is not None:
self._force_token_ids_generation(scores, self.config.eos_token_id)
return scores
self._force_token_ids_generation(logits, self.config.eos_token_id)
return logits
def _force_token_ids_generation(self, scores, token_ids) -> None:
"""force one of token_ids to be generated by setting prob of all other tokens to 0"""

View File

@@ -61,7 +61,7 @@ def create_sinusoidal_embeddings(n_pos, dim, out):
class Embeddings(nn.Module):
def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=0)
self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)
if config.sinusoidal_pos_embds:
create_sinusoidal_embeddings(

View File

@@ -142,10 +142,10 @@ class Attention(nn.Module):
def _attn(self, q, k, v, attention_mask=None, head_mask=None):
w = torch.matmul(q, k)
if self.scale:
w = w / (v.size(-1) ** 0.5)
w = w / (float(v.size(-1)) ** 0.5)
nd, ns = w.size(-2), w.size(-1)
mask = self.bias[:, :, ns - nd : ns, :ns]
w = torch.where(mask, w, self.masked_bias)
w = torch.where(mask.bool(), w, self.masked_bias.to(w.dtype))
if attention_mask is not None:
# Apply the attention mask

View File

@@ -18,18 +18,33 @@
from transformers.modeling_bart import BartForConditionalGeneration
PRETRAINED_MODEL_ARCHIVE_MAP = {
"opus-mt-en-de": "https://cdn.huggingface.co/Helsinki-NLP/opus-mt-en-de/pytorch_model.bin",
}
class MarianMTModel(BartForConditionalGeneration):
"""Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
Model API is identical to BartForConditionalGeneration"""
r"""
Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
Model API is identical to BartForConditionalGeneration.
Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__
pretrained_model_archive_map = PRETRAINED_MODEL_ARCHIVE_MAP
Examples::
def prepare_scores_for_generation(self, scores, cur_len, max_length):
from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr' # source language
trg = 'en' # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_translation_batch(src_texts=[sample_text]) # don't need tgt_text for inference
gen = model.generate(**batch) # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True) # returns "Where is the the bus stop ?"
"""
pretrained_model_archive_map = {} # see https://huggingface.co/models?search=Helsinki-NLP
def prepare_logits_for_generation(self, logits, cur_len, max_length):
logits[:, self.config.pad_token_id] = float("-inf")
if cur_len == max_length - 1 and self.config.eos_token_id is not None:
self._force_token_ids_generation(scores, self.config.eos_token_id)
return scores
self._force_token_ids_generation(logits, self.config.eos_token_id)
return logits

View File

@@ -36,7 +36,8 @@ from .modeling_utils import PreTrainedModel, apply_chunking_to_forward
logger = logging.getLogger(__name__)
REFORMER_PRETRAINED_MODEL_ARCHIVE_MAP = {
"google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/pytorch_model.bin"
"google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/pytorch_model.bin",
"google/reformer-enwik8": "https://cdn.huggingface.co/google/reformer-enwik8/pytorch_model.bin",
}
@@ -561,8 +562,8 @@ class LSHSelfAttention(nn.Module, EfficientAttentionMixin):
# get correct mask values depending on precision
if query_key_dots.dtype == torch.float16:
self_mask_value = self.self_mask_value_float16
mask_value = self.mask_value_float16
self_mask_value = self.self_mask_value_float16.half()
mask_value = self.mask_value_float16.half()
else:
self_mask_value = self.self_mask_value_float32
mask_value = self.mask_value_float32
@@ -833,7 +834,7 @@ class LocalSelfAttention(nn.Module, EfficientAttentionMixin):
if mask is not None:
# get mask tensor depending on half precision or not
if query_key_dots.dtype == torch.float16:
mask_value = self.mask_value_float16
mask_value = self.mask_value_float16.half()
else:
mask_value = self.mask_value_float32

View File

@@ -643,7 +643,7 @@ class RobertaForQuestionAnswering(BertPreTrainedModel):
Examples::
# The checkpoint roberta-large is not fine-tuned for question answering. Please see the
# examples/run_squad.py example to see how to fine-tune a model to a question answering task.
# examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.
from transformers import RobertaTokenizer, RobertaForQuestionAnswering
import torch

View File

@@ -21,7 +21,7 @@ import logging
import tensorflow as tf
from .configuration_albert import AlbertConfig
from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
from .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable
from .modeling_tf_bert import ACT2FN, TFBertSelfAttention
from .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list
from .tokenization_utils import BatchEncoding
@@ -475,7 +475,6 @@ class TFAlbertMLMHead(tf.keras.layers.Layer):
hidden_states = self.activation(hidden_states)
hidden_states = self.LayerNorm(hidden_states)
hidden_states = self.decoder(hidden_states, mode="linear") + self.decoder_bias
hidden_states = hidden_states + self.bias
return hidden_states
@@ -718,6 +717,73 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
return outputs
@add_start_docstrings(
"""Albert Model with two heads on top for pre-training:
a `masked language modeling` head and a `sentence order prediction` (classification) head. """,
ALBERT_START_DOCSTRING,
)
class TFAlbertForPreTraining(TFAlbertPreTrainedModel):
def __init__(self, config, *inputs, **kwargs):
super().__init__(config, *inputs, **kwargs)
self.num_labels = config.num_labels
self.albert = TFAlbertMainLayer(config, name="albert")
self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name="predictions")
self.sop_classifier = TFAlbertSOPHead(config, name="sop_classifier")
def get_output_embeddings(self):
return self.albert.embeddings
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
def call(self, inputs, **kwargs):
r"""
Return:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
sop_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):
Prediction scores of the sentence order prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import AlbertTokenizer, TFAlbertForPreTraining
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = TFAlbertForPreTraining.from_pretrained('albert-base-v2')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :] # Batch size 1
outputs = model(input_ids)
prediction_scores, sop_scores = outputs[:2]
"""
outputs = self.albert(inputs, **kwargs)
sequence_output, pooled_output = outputs[:2]
prediction_scores = self.predictions(sequence_output)
sop_scores = self.sop_classifier(pooled_output, training=kwargs.get("training", False))
outputs = (prediction_scores, sop_scores) + outputs[2:]
return outputs
class TFAlbertSOPHead(tf.keras.layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)
self.classifier = tf.keras.layers.Dense(
config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier",
)
def call(self, pooled_output, training: bool):
dropout_pooled_output = self.dropout(pooled_output, training=training)
logits = self.classifier(dropout_pooled_output)
return logits
@add_start_docstrings("""Albert Model with a `language modeling` head on top. """, ALBERT_START_DOCSTRING)
class TFAlbertForMaskedLM(TFAlbertPreTrainedModel):
def __init__(self, config, *inputs, **kwargs):
@@ -865,7 +931,7 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):
Examples::
# The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the
# examples/run_squad.py example to see how to fine-tune a model to a question answering task.
# examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.
import tensorflow as tf
from transformers import AlbertTokenizer, TFAlbertForQuestionAnswering
@@ -891,3 +957,127 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):
outputs = (start_logits, end_logits,) + outputs[2:]
return outputs # start_logits, end_logits, (hidden_states), (attentions)
@add_start_docstrings(
"""Albert Model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
ALBERT_START_DOCSTRING,
)
class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel):
def __init__(self, config, *inputs, **kwargs):
super().__init__(config, *inputs, **kwargs)
self.albert = TFAlbertMainLayer(config, name="albert")
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
self.classifier = tf.keras.layers.Dense(
1, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
)
@property
def dummy_inputs(self):
""" Dummy inputs to build the network.
Returns:
tf.Tensor with dummy inputs
"""
return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
def call(
self,
inputs,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
training=False,
):
r"""
Return:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:
`num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).
Classification scores (before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import AlbertTokenizer, TFAlbertForMultipleChoice
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = TFAlbertForMultipleChoice.from_pretrained('albert-base-v2')
example1 = ["This is a context", "Is it a context? Yes"]
example2 = ["This is a context", "Is it a context? No"]
encoding = tokenizer.batch_encode_plus([example1, example2], return_tensors='tf', truncation_strategy="only_first", pad_to_max_length=True, max_length=128)
outputs = model(encoding["input_ids"][None, :])
logits = outputs[0]
"""
if isinstance(inputs, (tuple, list)):
input_ids = inputs[0]
attention_mask = inputs[1] if len(inputs) > 1 else attention_mask
token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids
position_ids = inputs[3] if len(inputs) > 3 else position_ids
head_mask = inputs[4] if len(inputs) > 4 else head_mask
inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
assert len(inputs) <= 6, "Too many inputs."
elif isinstance(inputs, dict):
print("isdict(1)")
input_ids = inputs.get("input_ids")
print(input_ids)
attention_mask = inputs.get("attention_mask", attention_mask)
token_type_ids = inputs.get("token_type_ids", token_type_ids)
position_ids = inputs.get("position_ids", position_ids)
head_mask = inputs.get("head_mask", head_mask)
inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
assert len(inputs) <= 6, "Too many inputs."
else:
input_ids = inputs
if input_ids is not None:
num_choices = shape_list(input_ids)[1]
seq_length = shape_list(input_ids)[2]
else:
num_choices = shape_list(inputs_embeds)[1]
seq_length = shape_list(inputs_embeds)[2]
flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None
flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None
flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None
flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None
flat_inputs = [
flat_input_ids,
flat_attention_mask,
flat_token_type_ids,
flat_position_ids,
head_mask,
inputs_embeds,
]
outputs = self.albert(flat_inputs, training=training)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output, training=training)
logits = self.classifier(pooled_output)
reshaped_logits = tf.reshape(logits, (-1, num_choices))
outputs = (reshaped_logits,) + outputs[2:] # add hidden states and attention if they are here
return outputs # reshaped_logits, (hidden_states), (attentions)

View File

@@ -36,6 +36,8 @@ from .configuration_utils import PretrainedConfig
from .modeling_tf_albert import (
TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
TFAlbertForMaskedLM,
TFAlbertForMultipleChoice,
TFAlbertForPreTraining,
TFAlbertForQuestionAnswering,
TFAlbertForSequenceClassification,
TFAlbertModel,
@@ -43,6 +45,7 @@ from .modeling_tf_albert import (
from .modeling_tf_bert import (
TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
TFBertForMaskedLM,
TFBertForMultipleChoice,
TFBertForPreTraining,
TFBertForQuestionAnswering,
TFBertForSequenceClassification,
@@ -132,7 +135,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
[
(T5Config, TFT5ForConditionalGeneration),
(DistilBertConfig, TFDistilBertForMaskedLM),
(AlbertConfig, TFAlbertForMaskedLM),
(AlbertConfig, TFAlbertForPreTraining),
(RobertaConfig, TFRobertaForMaskedLM),
(BertConfig, TFBertForPreTraining),
(OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),
@@ -171,6 +174,10 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
]
)
TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
[(BertConfig, TFBertForMultipleChoice), (AlbertConfig, TFAlbertForMultipleChoice)]
)
TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
[
(DistilBertConfig, TFDistilBertForQuestionAnswering),
@@ -412,7 +419,7 @@ class TFAutoModelForPreTraining(object):
in the `pretrained_model_name_or_path` string (in the following order):
- contains `t5`: :class:`~transformers.TFT5ModelWithLMHead` (T5 model)
- contains `distilbert`: :class:`~transformers.TFDistilBertForMaskedLM` (DistilBERT model)
- contains `albert`: :class:`~transformers.TFAlbertForMaskedLM` (ALBERT model)
- contains `albert`: :class:`~transformers.TFAlbertForPreTraining` (ALBERT model)
- contains `roberta`: :class:`~transformers.TFRobertaForMaskedLM` (RoBERTa model)
- contains `bert`: :class:`~transformers.TFBertForPreTraining` (Bert model)
- contains `openai-gpt`: :class:`~transformers.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)
@@ -661,6 +668,153 @@ class TFAutoModelWithLMHead(object):
)
class TFAutoModelForMultipleChoice:
r"""
:class:`~transformers.TFAutoModelForMultipleChoice` is a generic model class
that will be instantiated as one of the multiple choice model classes of the library
when created with the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`
class method.
The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string.
The model class to instantiate is selected as the first pattern matching
in the `pretrained_model_name_or_path` string (in the following order):
- contains `albert`: TFAlbertForMultipleChoice (Albert model)
- contains `bert`: TFBertForMultipleChoice (Bert model)
This class cannot be instantiated using `__init__()` (throws an error).
"""
def __init__(self):
raise EnvironmentError(
"TFAutoModelForMultipleChoice is designed to be instantiated "
"using the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or "
"`TFAutoModelForMultipleChoice.from_config(config)` methods."
)
@classmethod
def from_config(cls, config):
r""" Instantiates one of the base model classes of the library
from a configuration.
config: (`optional`) instance of a class derived from :class:`~transformers.PretrainedConfig`:
The model class to instantiate is selected based on the configuration class:
- isInstance of `albert` configuration class: AlbertModel (Albert model)
- isInstance of `bert` configuration class: BertModel (Bert model)
Examples::
config = BertConfig.from_pretrained('bert-base-uncased') # Download configuration from S3 and cache.
model = AutoModelForMulitpleChoice.from_config(config) # E.g. model was saved using `save_pretrained('./test/saved_model/')`
"""
for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():
if isinstance(config, config_class):
return model_class(config)
raise ValueError(
"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\n"
"Model type should be one of {}.".format(
config.__class__,
cls.__name__,
", ".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),
)
)
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
r""" Instantiates one of the multiple choice model classes of the library
from a pre-trained model configuration.
The `from_pretrained()` method takes care of returning the correct model class instance
based on the `model_type` property of the config object, or when it's missing,
falling back to using pattern matching on the `pretrained_model_name_or_path` string.
The model class to instantiate is selected as the first pattern matching
in the `pretrained_model_name_or_path` string (in the following order):
- contains `albert`: TFRobertaForMultiple (Albert model)
- contains `bert`: TFBertForMultipleChoice (Bert model)
The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
To train the model, you should first set it back in training mode with `model.train()`
Params:
pretrained_model_name_or_path: either:
- a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
- a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
- a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.
from_pt: (`Optional`) Boolean
Set to True if the Checkpoint is a PyTorch checkpoint.
model_args: (`optional`) Sequence of positional arguments:
All remaning positional arguments will be passed to the underlying model's ``__init__`` method
config: (`optional`) instance of a class derived from :class:`~transformers.PretrainedConfig`:
Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:
- the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or
- the model was saved using :func:`~transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.
- the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.
state_dict: (`optional`) dict:
an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.
This option can be used if you want to create a model from a pretrained configuration but load your own weights.
In this case though, you should check if using :func:`~transformers.PreTrainedModel.save_pretrained` and :func:`~transformers.PreTrainedModel.from_pretrained` is not a simpler option.
cache_dir: (`optional`) string:
Path to a directory in which a downloaded pre-trained model
configuration should be cached if the standard cache should not be used.
force_download: (`optional`) boolean, default False:
Force to (re-)download the model weights and configuration files and override the cached versions if they exists.
resume_download: (`optional`) boolean, default False:
Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.
proxies: (`optional`) dict, default None:
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
The proxies are used on each request.
output_loading_info: (`optional`) boolean:
Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
kwargs: (`optional`) Remaining dictionary of keyword arguments:
Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:
- If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
- If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
Examples::
model = TFAutoModelFormultipleChoice.from_pretrained('bert-base-uncased') # Download model and configuration from S3 and cache.
model = TFAutoModelFormultipleChoice.from_pretrained('./test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = TFAutoModelFormultipleChoice.from_pretrained('bert-base-uncased', output_attention=True) # Update configuration during loading
assert model.config.output_attention == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
model = TFAutoModelFormultipleChoice.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)
"""
config = kwargs.pop("config", None)
if not isinstance(config, PretrainedConfig):
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():
if isinstance(config, config_class):
return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
raise ValueError(
"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\n"
"Model type should be one of {}.".format(
config.__class__,
cls.__name__,
", ".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),
)
)
class TFAutoModelForSequenceClassification(object):
r"""
:class:`~transformers.TFAutoModelForSequenceClassification` is a generic model class

View File

@@ -481,7 +481,7 @@ class TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel):
Examples::
# The checkpoint roberta-base is not fine-tuned for question answering. Please see the
# examples/run_squad.py example to see how to fine-tune a model to a question answering task.
# examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaForQuestionAnswering

View File

@@ -744,8 +744,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
def prepare_inputs_for_generation(self, input_ids, **kwargs):
return {"input_ids": input_ids}
def prepare_scores_for_generation(self, scores, **kwargs):
return scores
def prepare_logits_for_generation(self, logits, **kwargs):
return logits
def _use_cache(self, outputs, use_cache):
"""During generation, decide whether to pass the `past` variable to the next forward pass."""
@@ -857,7 +857,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
Defaults to `None`.
`What are attention masks? <../glossary.html#attention-mask>`__
`What are attention masks? <../glossary.html#attention-mask>`__
decoder_start_token_id=None: (`optional`) int
If an encoder-decoder model starts decoding with a different token than BOS.
@@ -1342,10 +1342,13 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
if temperature != 1.0:
next_token_logits = next_token_logits / temperature
scores = F.log_softmax(next_token_logits, dim=-1) # (batch_size * num_beams, vocab_size)
if self.config.is_encoder_decoder and do_sample is False:
# TODO (PVP) still a bit hacky here - there might be a better solutino
scores = self.prepare_scores_for_generation(scores, cur_len=cur_len, max_length=max_length)
# TODO (PVP) still a bit hacky here - there might be a better solution
next_token_logits = self.prepare_logits_for_generation(
next_token_logits, cur_len=cur_len, max_length=max_length
)
scores = F.log_softmax(next_token_logits, dim=-1) # (batch_size * num_beams, vocab_size)
# set eos token prob to zero if min_length is not reached
if eos_token_id is not None and cur_len < min_length:

View File

@@ -204,7 +204,10 @@ class GradientAccumulator(object):
"""Number of accumulated steps."""
if self._accum_steps is None:
self._accum_steps = tf.Variable(
tf.constant(0, dtype=tf.int64), trainable=False, synchronization=tf.VariableSynchronization.ON_READ,
tf.constant(0, dtype=tf.int64),
trainable=False,
synchronization=tf.VariableSynchronization.ON_READ,
aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
)
return self._accum_steps.value()
@@ -223,7 +226,10 @@ class GradientAccumulator(object):
self._gradients.extend(
[
tf.Variable(
tf.zeros_like(gradient), trainable=False, synchronization=tf.VariableSynchronization.ON_READ,
tf.zeros_like(gradient),
trainable=False,
synchronization=tf.VariableSynchronization.ON_READ,
aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
)
for gradient in gradients
]

Some files were not shown because too many files have changed in this diff Show More