Commit Graph

20 Commits

Author SHA1 Message Date
Julien Chaumond
d4c2cb402d Kill model archive maps (#4636)
* Kill model archive maps

* Fixup

* Also kill model_archive_map for MaskedBertPreTrainedModel

* Unhook config_archive_map

* Tokenizers: align with model id changes

* make style && make quality

* Fix CI
2020-06-02 09:39:33 -04:00
Iz Beltagy
8f1d047148 Longformer (#4352)
* first commit

* bug fixes

* better examples

* undo padding

* remove wrong VOCAB_FILES_NAMES

* License

* make style

* make isort happy

* unit tests

* integration test

* make `black` happy by undoing `isort` changes!!

* lint

* no need for the padding value

* batch_size not bsz

* remove unused type casting

* seqlen not seq_len

* staticmethod

* `bert` selfattention instead of `n2`

* uint8 instead of bool + lints

* pad inputs_embeds using embeddings not a constant

* black

* unit test with padding

* fix unit tests

* remove redundant unit test

* upload model weights

* resolve todo

* simpler _mask_invalid_locations without lru_cache + backward compatible masked_fill_

* increase unittest coverage
2020-05-19 16:04:43 +02:00
Sam Shleifer
3487be75ef [Marian] documentation and AutoModel support (#4152)
- MarianSentencepieceTokenizer - > MarianTokenizer
- Start using unk token.
- add docs page
- add better generation params to MarianConfig
- more conversion utilities
2020-05-10 13:54:57 -04:00
Patrick von Platen
dca34695d0 Reformer (#3351)
* first copy & past commit from Bert and morgans LSH code

* add easy way to compare to trax original code

* translate most of function

* make trax lsh self attention deterministic with numpy seed + copy paste code

* add same config

* add same config

* make layer init work

* implemented hash_vectors function for lsh attention

* continue reformer translation

* hf LSHSelfAttentionLayer gives same output as trax layer

* refactor code

* refactor code

* refactor code

* refactor

* refactor + add reformer config

* delete bogus file

* split reformer attention layer into two layers

* save intermediate step

* save intermediate step

* make test work

* add complete reformer block layer

* finish reformer layer

* implement causal and self mask

* clean reformer test and refactor code

* fix merge conflicts

* fix merge conflicts

* update init

* fix device for GPU

* fix chunk length init for tests

* include morgans optimization

* improve memory a bit

* improve comment

* factorize num_buckets

* better testing parameters

* make whole model work

* make lm model work

* add t5 copy paste tokenizer

* add chunking feed forward

* clean config

* add improved assert statements

* make tokenizer work

* improve test

* correct typo

* extend config

* add complexer test

* add new axial position embeddings

* add local block attention layer

* clean tests

* refactor

* better testing

* save intermediate progress

* clean test file

* make shorter input length work for model

* allow variable input length

* refactor

* make forward pass for pretrained model work

* add generation possibility

* finish dropout and init

* make style

* refactor

* add first version of RevNet Layers

* make forward pass work and add convert file

* make uploaded model forward pass work

* make uploaded model forward pass work

* refactor code

* add namedtuples and cache buckets

* correct head masks

* refactor

* made reformer more flexible

* make style

* remove set max length

* add attention masks

* fix up tests

* fix lsh attention mask

* make random seed optional for the moment

* improve memory in reformer

* add tests

* make style

* make sure masks work correctly

* detach gradients

* save intermediate

* correct backprob through gather

* make style

* change back num hashes

* rename to labels

* fix rotation shape

* fix detach

* update

* fix trainer

* fix backward dropout

* make reformer more flexible

* fix conflict

* fix

* fix

* add tests for fixed seed in reformer layer

* fix trainer typo

* fix typo in activations

* add fp16 tests

* add fp16 training

* support fp16

* correct gradient bug in reformer

* add fast gelu

* re-add dropout for embedding dropout

* better naming

* better naming

* renaming

* finalize test branch

* finalize tests

* add more tests

* finish tests

* fix

* fix type trainer

* fix fp16 tests

* fix tests

* fix tests

* fix tests

* fix issue with dropout

* fix dropout seeds

* correct random seed on gpu

* finalize random seed for dropout

* finalize random seed for dropout

* remove duplicate line

* correct half precision bug

* make style

* refactor

* refactor

* docstring

* remove sinusoidal position encodings for reformer

* move chunking to modeling_utils

* make style

* clean config

* make style

* fix tests

* fix auto tests

* pretrained models

* fix docstring

* update conversion file

* Update pretrained_models.rst

* fix rst

* fix rst

* update copyright

* fix test path

* fix test path

* fix small issue in test

* include reformer in generation tests

* add docs for axial position encoding

* finish docs

* Update convert_reformer_trax_checkpoint_to_pytorch.py

* remove isort

* include sams comments

* remove wrong comment in utils

* correct typos

* fix typo

* Update reformer.rst

* applied morgans optimization

* make style

* make gpu compatible

* remove bogus file

* big test refactor

* add example for chunking

* fix typo

* add to README
2020-05-07 10:17:01 +02:00
Julien Chaumond
455c639093 CDN urls (#4030)
* [file_utils] use_cdn + documentation

* Move to cdn. urls for weights

* [urls] Hotfix for bert-base-japanese
2020-04-28 20:27:14 -04:00
Lysandre Debut
d5d7d88612 ELECTRA (#3257)
* Electra wip

* helpers

* Electra wip

* Electra v1

* ELECTRA may be saved/loaded

* Generator & Discriminator

* Embedding size instead of halving the hidden size

* ELECTRA Tokenizer

* Revert BERT helpers

* ELECTRA Conversion script

* Archive maps

* PyTorch tests

* Start fixing tests

* Tests pass

* Same configuration for both models

* Compatible with base + large

* Simplification + weight tying

* Archives

* Auto + Renaming to standard names

* ELECTRA is uncased

* Tests

* Slight API changes

* Update tests

* wip

* ElectraForTokenClassification

* temp

* Simpler arch + tests

Removed ElectraForPreTraining which will be in a script

* Conversion script

* Auto model

* Update links to S3

* Split ElectraForPreTraining and ElectraForTokenClassification

* Actually test PreTraining model

* Remove num_labels from configuration

* wip

* wip

* From discriminator and generator to electra

* Slight API changes

* Better naming

* TensorFlow ELECTRA tests

* Accurate conversion script

* Added to conversion script

* Fast ELECTRA tokenizer

* Style

* Add ELECTRA to README

* Modeling Pytorch Doc + Real style

* TF Docs

* Docs

* Correct links

* Correct model intialized

* random fixes

* style

* Addressing Patrick's and Sam's comments

* Correct links in docs
2020-04-03 14:10:54 -04:00
Lysandre Debut
3716c3d8af False by default (#3002) 2020-02-24 18:30:57 -05:00
Sam Shleifer
53ce3854a1 New BartModel (#2745)
* Results same as fairseq
* Wrote a ton of tests
* Struggled with api signatures
* added some docs
2020-02-20 18:11:13 -05:00
Funtowicz Morgan
3f3fa7f7da Integrate fast tokenizers library inside transformers (#2674)
* Implemented fast version of tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumped tokenizers version requirements to latest 0.2.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added matching tests

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching OpenAI GPT tokenization !

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching GPT2 on tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose add_prefix_space as constructor parameter for GPT2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching Roberta tokenization !

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removed fast implementation of CTRL.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Binding TransformerXL tokenizers to Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Updating tests accordingly.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added tokenizers as top-level modules.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black & isort.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rename LookupTable to WordLevel to match Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use "fast" suffix instead of "ru" for rust tokenizers implementations.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Introduce tokenize() method on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus dispatchs to batch_encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* batch_encode_plus now dispatchs to encode if there is only one input element.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind all the encode_plus parameter to the forwarded batch_encode_plus call.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.3.0

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix tokenization_auto with support for new (python, fast) mapping schema.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give correct fixtures path in test_tokenization_fast.py for the CLI.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose max_len_ properties on BertTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* _convert_encoding should keep the batch axis tensor if only one sample in the batch.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add warning message for RobertaTokenizerFast if used for MLM.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added use_fast (bool) parameter on AutoTokenizer.from_pretrained().

This allows to easily enable/disable Rust-based tokenizer instantiation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Let's tokenizers handle all the truncation and padding stuff.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to provide tokenizer arguments during pipeline creation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update test_fill_mask pipeline to not use fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix too much parameters for convert_encoding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* When enabling padding, max_length should be set to None.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Avoid returning nested tensors of length 1 when calling encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure output is padded when return_tensor is not None.

Tensor creation requires the inital list input to be of the exact same size.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Disable transfoxl unittest if pytorch is not available (required to load the model)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus should not remove the leading batch axis if return_tensor is set

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Temporary disable fast tokenizers on QA pipelines.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix formatting issues.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.4.0

* Update style

* Enable truncation + stride unit test on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add unittest ensuring special_tokens set match between Python and Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure special_tokens are correctly set during construction.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give more warning feedback to the user in case of padding without pad_token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* quality & format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added possibility to add a single token as str

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added unittest for add_tokens and add_special_tokens on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix rebase mismatch on pipelines qa default model.

QA requires cased input while the tokenizers would be uncased.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Using offset mapping relative to the original string + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: save_vocabulary requires folder and file name

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Simplify import for Bert.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: truncate_and_pad disables padding according to the same heuristic than the one enabling padding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Remove private member access in tokenize()

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Bump tokenizers dependency to 0.4.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Use named arguments when applicable.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Relax type checking to include tuple and list object.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Document the truncate_and_pad manager behavior.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Raise an exception if return_offsets_mapping is not available with the current tokenizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure padding is set on the tokenizers before setting any padding strategy + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* On pytorch we need to stack tensor to get proper new axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Generalize tests to different framework removing hard written return_tensors="..."

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizer dependency for num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Overflowing tokens in batch_encode_plus are now stacked over the batch axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improved error message for padding strategy without pad token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumping tokenizers dependency to 0.5.0 for release.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Optimizing convert_encoding around 4x improvement. 🚀

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix unittests for overflow_to_sampling_mapping not being returned as tensor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove perfect alignment constraint for Roberta (allowing 1% difference max)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Triggering final CI

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-19 11:35:40 -05:00
Lysandre
1e82cd8457 Flaubert auto tokenizer + tests
cc @julien-c
2020-01-31 14:16:52 -05:00
Julien Chaumond
f9bc3f5771 style tweak 2020-01-30 19:26:59 -05:00
Julien Chaumond
34a3c25a30 Fix for XLMRobertaConfig inherits from RobertaConfig
hat/tip @stefan-it
2020-01-22 17:50:24 -05:00
Julien Chaumond
894812c652 Fixup mapping 2020-01-13 23:34:19 +00:00
Julien Chaumond
b20f11d4ca 🔫 Python35 2020-01-13 23:20:44 +00:00
Julien Chaumond
0304628590 Map configs to models and tokenizers 2020-01-13 23:11:44 +00:00
Julien Chaumond
4d1c98c012 AutoConfig + other Auto classes honor model_type 2020-01-11 02:46:17 +00:00
alberduris
81d6841b4b GPU text generation: mMoved the encoded_prompt to correct device 2020-01-06 15:11:12 +01:00
alberduris
dd4df80f0b Moved the encoded_prompts to correct device 2020-01-06 15:11:12 +01:00
Aymeric Augustin
c824d15aa1 Remove __future__ imports. 2019-12-22 17:47:54 +01:00
Aymeric Augustin
6be7cdda66 Move source code inside a src subdirectory.
This prevents transformers from being importable simply because the CWD
is the root of the git repository, while not being importable from other
directories. That led to inconsistent behavior, especially in examples.

Once you fetch this commit, in your dev environment, you must run:

    $ pip uninstall transformers
    $ pip install -e .
2019-12-22 14:15:13 +01:00