* fix to ensure that tensors returned after tokenization are Long
Co-authored-by: Ashwin Geet Dsa <adsa@grvingt-6.nancy.grid5000.fr>
* add dataset for albert pretrain
* datacollator for albert pretrain
* naming, comprehension, file reading change
* data cleaning is not needed after this modification
* delete prints
* fix a bug
* file structure change
* add tests for albert datacollator
* remove random seed
* add back len and get item function
* sample file for testing and test code added
* format change for black
* more format change
* Style
* var assignment issue resolve
* add back wrongly deleted DataCollatorWithPadding in init file
* Style
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Add cache_dir to save features TextDataset
This is in case the dataset is in a read-only filesystem, which is the case
in tests (GKE TPU tests).
* style
* add datacollator and dataset for next sentence prediction task
* bug fix (number of special tokens & truncating sequences)
* bug fix (+ dict inputs support for data collator)
* add padding for nsp data collator; renamed cached files to avoid conflict.
* add test for nsp data collator
* Style
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Data collator with padding
* Add type annotation
* Support tensors as well
* Add comment
* Fix for labels wrong shape
* Remove changes rendered unnecessary
* Optimized banned token masking
* Avoid duplicate EOS masking if in bad_words_ids
* Updated mask generation to handle empty banned token list
* Addition of unit tests for the updated bad_words_ids masking
* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test
* Updated timeout handling in `test_postprocess_next_token_scores_large_bad_words_list` unit test (timeout does not work on Windows)
* Moving Marian import to the test context to allow TF only environments to run
* Moving imports to torch_available test
* Updated operations device and test
* Added docstring and comment for in-place scores modification
* Moving test to own test_generation_utils, use of lighter models for testing
* removed unneeded imports in test_modeling_common
* revert formatting change for ModelTesterMixin
* Updated caching, simplified eos token id test, removed unnecessary @require_torch
* formatting compliance
* Attempt to fix the way squad_convert_examples_to_features pads the elements for the QA pipeline.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Quality
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Make the code easier to read and avoid testing the same thing multiple times.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* missing enum value on truncation_strategy.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Rethinking for the easiest fix: expose the padding strategy on squad_convert_examples_to_features.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Remove unused imports.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Ensure padding and question cannot have higher probs than context.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Add bart to the list of tokenizers adding two <sep> tokens for squad_convert_example_to_feature
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Addressing @patrickvonplaten comments.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Addressing @patrickvonplaten comments about masking non-context elements when generating the answer.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Addressing @sshleifer comments.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Make sure we mask CLS after handling impossible answers
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Mask in the correct vectors ...
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* Added data collator for XLNet language modeling and related calls
Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.
Resolves: #4739, #2008 (partially)
* Changed name to `DataCollatorForPermutationLanguageModeling`
Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModeling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so it should work out of the box with this script (provided `past` is handled
similarly to `mems` for XLNet).
Changed calls and imports appropriately.
* Added detailed comments, changed variable names
Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain how it works. Also cleaned up variable names and made them more informative.
* Added tests for new data collator
Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.
* Fixed styling issues
* remove references to old API in docstring - update data processors
* style
* fix tests - better type checking error messages
* better type checking
* include awesome fix by @LysandreJik for #5310
* updated doc and examples
* Better None gradients handling
* Apply Style
* Create a loss class per task to compute its respective loss
* Add loss classes to the ALBERT TF models
* Add loss classes to the BERT TF models
* Add question answering and multiple choice to TF Camembert
* Remove prints
* Add multiple choice model to TF DistilBERT + loss computation
* Add question answering model to TF Electra + loss computation
* Add token classification, question answering and multiple choice models to TF Flaubert
* Add multiple choice model to TF Roberta + loss computation
* Add multiple choice model to TF XLM + loss computation
* Add multiple choice and question answering models to TF XLM-Roberta
* Add multiple choice model to TF XLNet + loss computation
* Remove unused parameters
* Add task loss classes
* Reorder TF imports + add new model classes
* Add new model classes
* Bugfix in TF T5 model
* Bugfix for TF T5 tests
* Bugfix in TF T5 model
* Fix TF T5 model tests
* Fix T5 tests + some renaming
* Fix inheritance issue in the AutoX tests
* Add tests for TF Flaubert and TF XLM Roberta
* Remove unused piece of code in the TF trainer
* bugfix and remove unused code
* Bugfix for TF 2.2
* Apply Style
* Divide TFSequenceClassificationAndMultipleChoiceLoss into two classes with their respective names
* Apply style
* Mirror the PT Trainer in the TF one: fp16, optimizers and tb_writer as class parameter and better dataset handling
* Fix TF optimizations tests and apply style
* Remove useless parameter
* Bugfix and apply style
* Fix TF Trainer prediction
* The TF models now return the loss like their PyTorch counterparts
* Apply Style
* Ignore some tests output
* Take into account the SQuAD cls_index, p_mask and is_impossible parameters for the QuestionAnswering task models.
* Fix names for SQuAD data
* Apply Style
* Fix conflicts with 2.11 release
* Fix conflicts with 2.11
* Fix wrong name
* Add better documentation on the new create_optimizer function
* Fix isort
* logging_dir: use same default as PyTorch
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* Glue task cleanup
* Enable writing cache to cache_dir in case dataset lives in read-only
filesystem.
* Differentiate match vs mismatch for MNLI metrics.
* Style
* Fix pytype
* Fix type
* Use cache_dir in mnli mismatch eval dataset
* Small Tweaks
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* Add predict stage for glue tasks, and generate result files which can be submitted to the gluebenchmark.com website.
* Use Split enum + always output the label name
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
* doc
* [tests] Add sample files for a regression task
* [HUGE] Trainer
* Feedback from @sshleifer
* Feedback from @thomwolf + logging tweak
* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes
* [glue] Use default max_seq_length of 128 like before
* [glue] move DataTrainingArguments around
* [ner] Change interface of InputExample, and align run_{tf,pl}
* Re-align the pl scripts a little bit
* ner
* [ner] Add integration test
* Fix language_modeling with API tweak
* [ci] Tweak loss target
* Don't break console output
* amp.initialize: model must be on right device before
* [multiple-choice] update for Trainer
* Re-align to 827d6d6ef0
* Big cleanup of `glue_convert_examples_to_features`
* Use batch_encode_plus
* Cleaner wrapping of glue_convert_examples_to_features for TF
@lysandrejik
* Cleanup syntax, thanks to @mfuntowicz
* Raise explicit error in case of user error
* [ci] Also run test_examples in py37
(will revert at the end of the experiment)
* InputExample: use immutable dataclass
* [deps] Install dataclasses for Py<3.7
* [skip ci] Revert "[ci] Also run test_examples in py37"
This reverts commit d29afd9959786b77759b0b8fa4e6b4335b952015.