HuggingFace_transformer

Author	SHA1	Message	Date
Michael Watkins	2670b0d682	Fix bug which lowercases special tokens	2019-12-06 16:15:53 -05:00
Aymeric Augustin	35401fe50f	Remove dependency on pytest for running tests (#2055 ) * Switch to plain unittest for skipping slow tests. Add a RUN_SLOW environment variable for running them. * Switch to plain unittest for PyTorch dependency. * Switch to plain unittest for TensorFlow dependency. * Avoid leaking open files in the test suite. This prevents spurious warnings when running tests. * Fix unicode warning on Python 2 when running tests. The warning was: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal * Support running PyTorch tests on a GPU. Reverts `27e015bd`. * Tests no longer require pytest. * Make tests pass on cuda	2019-12-06 13:57:38 -05:00
Thomas Wolf	5afca00b47	Merge pull request #1724 from huggingface/fix_encode_plus Fix encode_plus	2019-11-27 17:14:49 +01:00
Thomas Wolf	21637d4924	Merge branch 'master' into do_lower_case	2019-11-27 17:04:39 +01:00
Lysandre	74d0bcb6ff	Fix special tokens addition in decoder	2019-11-12 15:27:57 -05:00
Michael Watkins	7246d3c2f9	Consider do_lower_case in PreTrainedTokenizer As pointed out in #1545, when using an uncased model, and adding a new uncased token, the tokenizer does not correctly identify this in the case that the input text contains the token in a cased format. For instance, if we load bert-base-uncased into BertTokenizer, and then use .add_tokens() to add "cool-token", we get the expected result for .tokenize('this is a cool-token'). However, we get a possibly unexpected result for .tokenize('this is a cOOl-Token'), which in fact mirrors the result for the former from before the new token was added. This commit adds - functionality to PreTrainedTokenizer to handle this situation in case a tokenizer (currently Bert, DistilBert, and XLNet) has the do_lower_case=True kwarg by: 1) lowercasing tokens added with .add_tokens() 2) lowercasing text at the beginning of .tokenize() - new common test case for tokenizers https://github.com/huggingface/transformers/issues/1545	2019-11-12 13:08:30 +02:00
thomwolf	8d6b9d717c	fix #1532 and encode_plus	2019-11-04 17:07:51 +01:00
Lysandre	7d709e55ed	Remove	2019-10-22 14:12:33 -04:00
thomwolf	78ef1a9930	fixes	2019-10-04 17:59:44 -04:00
thomwolf	6c1d0bc066	update encode_plus - add truncation strategies	2019-10-04 17:38:38 -04:00
LysandreJik	aebd83230f	Update naming + remove f string in run_lm_finetuning example	2019-10-03 11:31:36 -04:00
LysandreJik	651bfb7ad5	always_truncate by default	2019-10-03 11:31:36 -04:00
LysandreJik	cc412edd42	Supports already existing special tokens	2019-10-03 11:31:36 -04:00
LysandreJik	2f259b228e	Sequence IDS	2019-10-03 11:31:36 -04:00
LysandreJik	7c789c337d	Always truncate argument in the encode method	2019-10-03 11:31:36 -04:00
thomwolf	31c23bd5ee	[BIG] pytorch-transformers => transformers	2019-09-26 10:15:53 +02:00

16 Commits