[WIP] Tapas v4 (tres) (#9117)

* First commit: adding all files from tapas_v3

* Fix multiple bugs including soft dependency and new structure of the library

* Improve testing by adding torch_device to inputs and adding dependency on scatter

* Use Python 3 inheritance rather than Python 2

* First draft model cards of base sized models

* Remove model cards as they are already on the hub

* Fix multiple bugs with integration tests

* All model integration tests pass

* Remove print statement

* Add test for convert_logits_to_predictions method of TapasTokenizer

* Incorporate suggestions by Google authors

* Fix remaining tests

* Change position embeddings sizes to 512 instead of 1024

* Comment out positional embedding sizes

* Update PRETRAINED_VOCAB_FILES_MAP and PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

* Added more model names

* Fix truncation when no max length is specified

* Disable torchscript test

* Make style & make quality

* Quality

* Address CI needs

* Test the Masked LM model

* Fix the masked LM model

* Truncate when overflowing

* More much needed docs improvements

* Fix some URLs

* Some more docs improvements

* Test PyTorch scatter

* Set to slow + minify

* Calm flake8 down

* Add add_pooling_layer argument to TapasModel

Fix comments by @sgugger and @patrickvonplaten

* Fix issue in docs + fix style and quality

* Clean up conversion script and add task parameter to TapasConfig

* Revert the task parameter of TapasConfig

Some minor fixes

* Improve conversion script and add test for absolute position embeddings

* Fix bug with reset_position_index_per_cell arg of the conversion cli

* Add notebooks to the examples directory and fix style and quality

* Apply suggestions from code review

* Move from `nielsr/` to `google/` namespace

* Apply Sylvain's comments

Co-authored-by: sgugger <sylvain.gugger@gmail.com>

Co-authored-by: Rogge Niels <niels.rogge@howest.be>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: sgugger <sylvain.gugger@gmail.com>

Authored by NielsRogge on 2020-12-15 23:08:49 +01:00, committed by GitHub

parent ad895af98d

commit 1551e2dc6d

22 changed files with 8497 additions and 78 deletions

tests/test_tokenization_common.py (4 changed lines)
				@@ -584,7 +584,7 @@ class TokenizerTesterMixin:

				                # We want to have sequence 0 and sequence 1 are tagged

				                # respectively with 0 and 1 token_ids

				-               # (regardeless of weither the model use token type ids)

				+               # (regardless of whether the model use token type ids)

				                # We use this assumption in the QA pipeline among other place

				                output = tokenizer(seq_0, return_token_type_ids=True)

				                self.assertIn(0, output["token_type_ids"])

				@@ -600,7 +600,7 @@ class TokenizerTesterMixin:

				                # We want to have sequence 0 and sequence 1 are tagged

				                # respectively with 0 and 1 token_ids

				-               # (regardeless of weither the model use token type ids)

				+               # (regardless of whether the model use token type ids)

				                # We use this assumption in the QA pipeline among other place

				                output = tokenizer(seq_0)

				                self.assertIn(0, output.sequence_ids())
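
The comments touched in both hunks describe an assumption: the first sequence is tagged with 0 and the second with 1. A minimal sketch of that tagging follows, assuming a BERT-style layout. `toy_encode` is a hypothetical whitespace tokenizer written for illustration only; it is not part of the library or of this commit.

```python
from typing import Dict, List, Optional


def toy_encode(seq_0: str, seq_1: Optional[str] = None) -> Dict[str, List]:
    """Hypothetical whitespace tokenizer with BERT-style special tokens."""
    # First segment: [CLS] + tokens + [SEP], all tagged with 0.
    tokens = ["[CLS]"] + seq_0.split() + ["[SEP]"]
    token_type_ids = [0] * len(tokens)
    if seq_1 is not None:
        # Second segment: tokens + [SEP], all tagged with 1.
        second = seq_1.split() + ["[SEP]"]
        tokens += second
        token_type_ids += [1] * len(second)
    return {"tokens": tokens, "token_type_ids": token_type_ids}


single = toy_encode("With these inputs.")
pair = toy_encode("With these inputs.", "It should be tagged as sequence 1.")

# A single sequence only ever carries the tag 0.
assert 0 in single["token_type_ids"]
assert 1 not in single["token_type_ids"]

# A pair carries both tags, 0 for the first sequence and 1 for the second.
assert 0 in pair["token_type_ids"]
assert 1 in pair["token_type_ids"]
```

The assertions mirror the shape of the checks in the test above (`assertIn(0, ...)`), applied to the toy output rather than to a real tokenizer.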
