HuggingFace_transformer

Author	SHA1	Message	Date
Lysandre Debut	00c4e39581	Merge branch 'master' into squad-refactor	2019-12-09 10:41:15 -05:00
Michael Watkins	2670b0d682	Fix bug which lowercases special tokens	2019-12-06 16:15:53 -05:00
Aymeric Augustin	35401fe50f	Remove dependency on pytest for running tests (#2055) * Switch to plain unittest for skipping slow tests. Add a RUN_SLOW environment variable for running them. * Switch to plain unittest for PyTorch dependency. * Switch to plain unittest for TensorFlow dependency. * Avoid leaking open files in the test suite. This prevents spurious warnings when running tests. * Fix unicode warning on Python 2 when running tests. The warning was: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal * Support running PyTorch tests on a GPU. Reverts `27e015bd`. * Tests no longer require pytest. * Make tests pass on cuda	2019-12-06 13:57:38 -05:00
LysandreJik	9ecd83dace	Patch evaluation for impossible values + cleanup	2019-12-05 14:44:57 -05:00
LysandreJik	a7ca6d738b	Padding side is tokenizer-dependant	2019-12-04 15:43:34 -05:00
LysandreJik	fbaf05bd92	Remove annoying tokenization message	2019-12-02 18:23:00 -05:00
Thomas Wolf	5afca00b47	Merge pull request #1724 from huggingface/fix_encode_plus Fix encode_plus	2019-11-27 17:14:49 +01:00
Thomas Wolf	5340d1f21f	Merge branch 'master' into resumable_http	2019-11-27 17:10:36 +01:00
Thomas Wolf	21637d4924	Merge branch 'master' into do_lower_case	2019-11-27 17:04:39 +01:00
LysandreJik	a7dafe2f41	Padding strategy (left and right) rather than boolean flag	2019-11-22 16:27:25 -05:00
LysandreJik	9f374c8252	`encode` and `encode_plus` handle attention masks and padding	2019-11-22 16:27:15 -05:00
Lysandre	72e506b22e	wip	2019-11-22 16:26:00 -05:00
Thomas Wolf	5b322a36db	Merge pull request #1811 from huggingface/special-tokens Fix special tokens addition in decoder #1807	2019-11-14 22:17:24 +01:00
Thomas Wolf	1a237d7f42	Merge pull request #1831 from iedmrc/gpt2-tokenization-sum-func-replacement sum() is replaced by itertools.chain.from_iterable()	2019-11-14 22:11:54 +01:00
Lysandre	a67e747889	Reorganized max_len warning	2019-11-14 10:30:22 -05:00
İbrahim Ethem Demirci	7627dde1f8	sum() is the leanest method to flatten a string list, so it's been replaced by itertools.chain.from_iterable()	2019-11-14 17:06:15 +03:00
Lysandre	74d0bcb6ff	Fix special tokens addition in decoder	2019-11-12 15:27:57 -05:00
Michael Watkins	7246d3c2f9	Consider do_lower_case in PreTrainedTokenizer As pointed out in #1545, when using an uncased model, and adding a new uncased token, the tokenizer does not correctly identify this in the case that the input text contains the token in a cased format. For instance, if we load bert-base-uncased into BertTokenizer, and then use .add_tokens() to add "cool-token", we get the expected result for .tokenize('this is a cool-token'). However, we get a possibly unexpected result for .tokenize('this is a cOOl-Token'), which in fact mirrors the result for the former from before the new token was added. This commit adds - functionality to PreTrainedTokenizer to handle this situation in case a tokenizer (currently Bert, DistilBert, and XLNet) has the do_lower_case=True kwarg by: 1) lowercasing tokens added with .add_tokens() 2) lowercasing text at the beginning of .tokenize() - new common test case for tokenizers https://github.com/huggingface/transformers/issues/1545	2019-11-12 13:08:30 +02:00
Lysandre	b5d330d118	Fix #1784	2019-11-11 10:15:14 -05:00
thomwolf	8d6b9d717c	fix #1532 and encode_plus	2019-11-04 17:07:51 +01:00
Sergey Mironov	0e4cc050d6	Add support for resumable downloads for HTTP protocol.	2019-10-31 18:25:34 +03:00
Lysandre	7d709e55ed	Remove	2019-10-22 14:12:33 -04:00
thomwolf	a5997dd81a	better error messages	2019-10-10 11:31:01 +02:00
Lysandre Debut	e84470ef81	Merge pull request #1384 from huggingface/encoding-qol Quality of life enhancements in encoding + patch MLM masking	2019-10-09 11:18:24 -04:00
thomwolf	78ef1a9930	fixes	2019-10-04 17:59:44 -04:00
thomwolf	6c1d0bc066	update encode_plus - add truncation strategies	2019-10-04 17:38:38 -04:00
thomwolf	92c0f2fb90	Merge remote-tracking branch 'origin/julien_multiple-choice' into encoding-qol	2019-10-04 15:48:06 -04:00
LysandreJik	7bddb45a6f	Decode documentaton	2019-10-04 14:27:38 -04:00
LysandreJik	aebd83230f	Update naming + remove f string in run_lm_finetuning example	2019-10-03 11:31:36 -04:00
LysandreJik	651bfb7ad5	always_truncate by default	2019-10-03 11:31:36 -04:00
LysandreJik	cc412edd42	Supports already existing special tokens	2019-10-03 11:31:36 -04:00
LysandreJik	2f259b228e	Sequence IDS	2019-10-03 11:31:36 -04:00
LysandreJik	7c789c337d	Always truncate argument in the encode method	2019-10-03 11:31:36 -04:00
danai-antoniou	a95158518d	Moved duplicate token check	2019-10-02 07:44:15 +01:00
danai-antoniou	d73957899a	Merge branch 'master' of https://github.com/danai-antoniou/pytorch-transformers into add-duplicate-tokens-error	2019-10-02 07:38:50 +01:00
thomwolf	391db836ab	fix #1260 - remove special logic for decoding pairs of sequence	2019-10-01 19:09:13 -04:00
Julien Chaumond	b350662955	overflowing_tokens do not really make sense here, let's just return a number Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>	2019-09-30 16:37:09 -04:00
Julien Chaumond	f5bcde0b2f	[multiple-choice] Simplify and use tokenizer.encode_plus	2019-09-30 16:04:55 -04:00
Julien Chaumond	d8b641c839	6 -> 8 models	2019-09-27 17:22:01 -04:00
thomwolf	31c23bd5ee	[BIG] pytorch-transformers => transformers	2019-09-26 10:15:53 +02:00
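
Commit 7246d3c2f9 above describes two steps for tokenizers with do_lower_case=True: lowercase tokens added with .add_tokens(), and lowercase the text at the beginning of .tokenize(). The sketch below illustrates that behavior with a toy whitespace tokenizer. ToyTokenizer and its hyphen-splitting rule are hypothetical stand-ins, not the PreTrainedTokenizer implementation, and special tokens are ignored.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer, for illustration only."""

    def __init__(self, do_lower_case=False):
        self.do_lower_case = do_lower_case
        self.added_tokens = set()

    def add_tokens(self, tokens):
        for token in tokens:
            # Step 1 from the commit message: lowercase added tokens
            # when the tokenizer is uncased.
            self.added_tokens.add(token.lower() if self.do_lower_case else token)

    def tokenize(self, text):
        # Step 2 from the commit message: lowercase the text first.
        if self.do_lower_case:
            text = text.lower()
        pieces = []
        for word in text.split():
            if word in self.added_tokens:
                pieces.append(word)  # added tokens are kept whole
            else:
                # Assumed stand-in for subword splitting: break on hyphens.
                pieces.extend(word.split("-"))
        return pieces


uncased = ToyTokenizer(do_lower_case=True)
uncased.add_tokens(["cool-token"])
lower_input = uncased.tokenize("this is a cool-token")
mixed_input = uncased.tokenize("this is a cOOl-Token")

cased = ToyTokenizer(do_lower_case=False)
cased.add_tokens(["cool-token"])
cased_result = cased.tokenize("this is a cOOl-Token")
```

With do_lower_case=True, both inputs keep "cool-token" as a single token. Without it, the mixed-case input does not match the added token and is split.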

40 Commits
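
Commits 7627dde1f8 and 1a237d7f42 replace sum() with itertools.chain.from_iterable() for flattening a list of string lists. A minimal sketch of the two approaches follows. The nested list is made-up sample data, and this is not the code from the GPT-2 tokenizer itself.

```python
from itertools import chain

# Hypothetical per-word token lists (illustrative data only).
nested = [["hug", "ging"], ["face"], ["trans", "formers"]]

# sum() with a list start value concatenates pairwise, building a new
# list at every step.
flat_sum = sum(nested, [])

# chain.from_iterable() iterates over the sublists lazily; list()
# materializes the flattened result once.
flat_chain = list(chain.from_iterable(nested))
```

Both expressions produce the same flat list. The chain version avoids the repeated copying that sum() performs as the accumulated list grows.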