Files

Thomas Wolf ba8c4d0ac0 [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659 )

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉

* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

2020-10-18 20:51:24 +02:00

adversarial

Black 20 release

2020-08-26 17:20:22 +02:00

benchmarking

[Benchmarks] Change all args to from no_... to their positive form (#7075 )

2020-09-23 13:25:24 -04:00

bert-loses-patience

[logging] remove no longer needed verbosity override (#7100 )

2020-09-15 04:01:14 -04:00

bertology

Black 20 release

2020-08-26 17:20:22 +02:00

contrib

Transformer-XL: Remove unused parameters (#7087 )

2020-09-17 06:10:34 -04:00

deebert

[logging] remove no longer needed verbosity override (#7100 )

2020-09-15 04:01:14 -04:00

distillation

[logging] remove no longer needed verbosity override (#7100 )

2020-09-15 04:01:14 -04:00

language-modeling

Fix code quality

2020-10-12 08:22:27 -04:00

longform-qa

RAG (#6813 )

2020-09-22 18:29:58 +02:00

lxmert

Demoing LXMERT with raw images by incorporating the FRCNN model for roi-pooled extraction and bounding-box predction on the GQA answer set. (#6986 )

2020-09-14 10:07:04 -04:00

movement-pruning

[logging] remove no longer needed verbosity override (#7100 )

2020-09-15 04:01:14 -04:00

multiple-choice

Black 20 release

2020-08-26 17:20:22 +02:00

question-answering

[logging] remove no longer needed verbosity override (#7100 )

2020-09-15 04:01:14 -04:00

rag

Fix missing reference titles in retrieval evaluation of RAG (#7817 )

2020-10-16 10:15:49 +02:00

seq2seq

[s2s testing] turn all to unittests, use auto-delete temp dirs (#7859 )

2020-10-17 14:33:21 -04:00

text-classification

Don't use store_xxx on optional bools (#7786 )

2020-10-14 12:05:02 -04:00

text-generation

feat: allow prefix for any generative model (#5885 )

2020-09-07 03:03:45 -04:00

token-classification

token-classification: update url of GermEval 2014 dataset (#6571 )

2020-09-18 06:18:06 -04:00

conftest.py

[testing] disable FutureWarning in examples tests (#7842 )

2020-10-16 03:35:39 -04:00

lightning_base.py

[cleanup] assign todos, faster bart-cnn test (#7835 )

2020-10-16 03:11:18 -04:00

README.md

[doc] rm Azure buttons as not implemented yet

2020-09-30 17:31:08 -04:00

requirements.txt

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659 )

2020-10-18 20:51:24 +02:00

test_examples.py

[examples] bump pl=0.9.0 (#7053 )

2020-10-11 16:39:38 -04:00

test_xla_examples.py

Set XLA example time to 500s

2020-10-15 12:34:29 +02:00

xla_spawn.py

[TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223 )

2020-05-08 14:10:05 -04:00

README.md

Examples

Version 2.9 of 🤗 Transformers introduces a new Trainer class for PyTorch, and its equivalent TFTrainer for TF 2. Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.2+.

Here is the list of all our examples:

- grouped by task (all official examples work for multiple models),
- with information on whether they are built on top of Trainer/TFTrainer (if not, they still work, they might just lack some features),
- whether they also include examples for pytorch-lightning, which is a great fully-featured, general-purpose training library for PyTorch,
- links to Colab notebooks to walk through the scripts and run them easily,
- links to Cloud deployments to be able to deploy large-scale trainings in the Cloud with little to no setup.

This is still a work-in-progress – in particular documentation is still sparse – so please contribute improvements/pull requests.

The Big Table of Tasks

| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab |
|---|---|---|---|---|---|
| `language-modeling` | Raw text | ✅ | - | - | |
| `text-classification` | GLUE, XNLI | ✅ | ✅ | ✅ | |
| `token-classification` | CoNLL NER | ✅ | ✅ | ✅ | - |
| `multiple-choice` | SWAG, RACE, ARC | ✅ | ✅ | - | |
| `question-answering` | SQuAD | ✅ | ✅ | - | - |
| `text-generation` | - | n/a | n/a | n/a | |
| `distillation` | All | - | - | - | - |
| `summarization` | CNN/Daily Mail | ✅ | - | ✅ | - |
| `translation` | WMT | ✅ | - | ✅ | - |
| `bertology` | - | - | - | - | - |
| `adversarial` | HANS | ✅ | - | - | - |

Important note

Important: To make sure you can successfully run the latest versions of the example scripts, you have to install the library from source and install some example-specific requirements. Execute the following steps in a new virtual environment:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt
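
Once the install finishes, it can help to confirm that the framework in the environment meets the minimum versions this README quotes (PyTorch 1.3.1+ or TensorFlow 2.2+). The following is a minimal sketch, not part of the repository; the helper names `version_tuple`, `meets_minimum` and `check_frameworks` are hypothetical, and the version parsing is deliberately simple (it only looks at the first three numeric components).

```python
import importlib
import re

# Minimum versions quoted at the top of this README.
MINIMUMS = {"torch": "1.3.1", "tensorflow": "2.2"}


def version_tuple(version):
    """Turn a string such as '1.6.0+cu101' into (1, 6, 0).

    Local build tags (after '+') and pre-release suffixes are ignored.
    """
    parts = []
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_frameworks():
    """Return {module_name: True / False / None}; None means not installed."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        report[name] = meets_minimum(module.__version__, minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_frameworks().items():
        status = "not installed" if ok is None else ("ok" if ok else "too old")
        print(f"{name}: {status}")
```

Only one of the two frameworks needs to be present; a "not installed" line for the other is expected.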

One-click Deploy to Cloud (wip)

Coming soon!

Running on TPUs

When using TensorFlow, TPUs are supported out of the box as a tf.distribute.Strategy.

When using PyTorch, we support TPUs thanks to pytorch/xla. For more context and information on how to set up your TPU environment, refer to Google's documentation and to the very detailed pytorch/xla README.

In this repo, we provide a very simple launcher script named xla_spawn.py that lets you run our example scripts on multiple TPU cores without any boilerplate. Just pass a --num_cores flag to this script, followed by your regular training script and its arguments (this is similar to the torch.distributed.launch helper for torch.distributed).

For example, for run_glue:

python examples/xla_spawn.py --num_cores 8 \
	examples/text-classification/run_glue.py \
	--model_name_or_path bert-base-cased \
	--task_name mnli \
	--data_dir ./data/glue_data/MNLI \
	--output_dir ./models/tpu \
	--overwrite_output_dir \
	--do_train \
	--do_eval \
	--num_train_epochs 1 \
	--save_steps 20000
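
The command line above has three parts: the launcher's own --num_cores flag, the path of the training script, and everything after it, which belongs to the training script. As an illustration of that pattern only (this is not the contents of xla_spawn.py, and `split_launcher_args` is a hypothetical name), a launcher can separate the three parts with argparse's `REMAINDER`:

```python
from argparse import REMAINDER, ArgumentParser


def split_launcher_args(argv):
    """Split a launcher command line into (num_cores, script, script_args).

    Everything that follows the training script path is collected untouched,
    so flags meant for the training script are not parsed by the launcher.
    """
    parser = ArgumentParser(description="Hypothetical TPU launcher sketch")
    parser.add_argument("--num_cores", type=int, default=1)
    parser.add_argument("training_script", type=str)
    parser.add_argument("training_script_args", nargs=REMAINDER)
    args = parser.parse_args(argv)
    return args.num_cores, args.training_script, args.training_script_args


if __name__ == "__main__":
    num_cores, script, script_args = split_launcher_args(
        [
            "--num_cores", "8",
            "examples/text-classification/run_glue.py",
            "--model_name_or_path", "bert-base-cased",
            "--do_train",
        ]
    )
    print(num_cores, script, script_args)
```

A real launcher would then start one process per core and run the training script in each; that step depends on pytorch/xla and is left out of this sketch.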

Feedback, additional use cases, and benchmarks involving TPUs are welcome; please share them with the community.

Logging & Experiment tracking

You can easily log and monitor your runs. The following are currently supported:

Weights & Biases

To use Weights & Biases, install the wandb package with:

pip install wandb

Then log in from the command line:

wandb login

If you are in Jupyter or Colab, you should log in with:

import wandb
wandb.login()

Whenever you use Trainer or TFTrainer classes, your losses, evaluation metrics, model topology and gradients (for Trainer only) will automatically be logged.
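
The wandb client reads the `WANDB_PROJECT` environment variable to decide which project a run is filed under. A minimal sketch, assuming the Trainer integration in your installed version starts its run through the wandb client in the same process and therefore picks this variable up; `configure_wandb_project` is a hypothetical helper and "my-glue-experiments" is a placeholder project name:

```python
import os


def configure_wandb_project(project, environ=os.environ):
    """Set WANDB_PROJECT so runs started later in this process use `project`."""
    environ["WANDB_PROJECT"] = project
    return environ["WANDB_PROJECT"]


# Call this before the training run starts; the name below is a placeholder.
configure_wandb_project("my-glue-experiments")
```

Exporting the variable in the shell before launching the training script has the same effect.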

When using 🤗 Transformers with PyTorch Lightning, runs can be tracked through WandbLogger. Refer to the related documentation and examples.

Comet.ml

To use comet_ml, install the Python package with:

pip install comet_ml

or if in a Conda environment:

conda install -c comet_ml -c anaconda -c conda-forge comet_ml
