LysandreJik
fbaf05bd92
Remove annoying tokenization message
2019-12-02 18:23:00 -05:00
Thomas Wolf
5afca00b47
Merge pull request #1724 from huggingface/fix_encode_plus
Fix encode_plus
2019-11-27 17:14:49 +01:00
Thomas Wolf
5340d1f21f
Merge branch 'master' into resumable_http
2019-11-27 17:10:36 +01:00
Thomas Wolf
21637d4924
Merge branch 'master' into do_lower_case
2019-11-27 17:04:39 +01:00
Thomas Wolf
5b322a36db
Merge pull request #1811 from huggingface/special-tokens
Fix special tokens addition in decoder #1807
2019-11-14 22:17:24 +01:00
Thomas Wolf
1a237d7f42
Merge pull request #1831 from iedmrc/gpt2-tokenization-sum-func-replacement
sum() is replaced by itertools.chain.from_iterable()
2019-11-14 22:11:54 +01:00
Lysandre
a67e747889
Reorganized max_len warning
2019-11-14 10:30:22 -05:00
İbrahim Ethem Demirci
7627dde1f8
sum() is the leanest method to flatten a string list, so it's been replaced by itertools.chain.from_iterable()
2019-11-14 17:06:15 +03:00
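For context on the flattening change above, here is a minimal sketch comparing the two approaches on a list of token lists. The function names and sample data are illustrative assumptions, not code from the repository. Both calls return the same result; the practical difference is that `sum(nested, [])` builds a fresh list at every step, so its cost grows quadratically with the total number of items, while `itertools.chain.from_iterable()` visits each item once.

```python
from itertools import chain


def flatten_with_sum(nested):
    # Repeated list concatenation: each "+" copies everything accumulated so far.
    return sum(nested, [])


def flatten_with_chain(nested):
    # Lazily walks each inner list once, then materializes a single list.
    return list(chain.from_iterable(nested))


# Hypothetical sample: per-word sub-token lists, as a tokenizer might produce.
nested_tokens = [["hug", "ging"], [], ["face"]]
flat = flatten_with_chain(nested_tokens)
```

Both helpers accept any list of lists; the `chain` version also accepts arbitrary iterables of iterables.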
Lysandre
74d0bcb6ff
Fix special tokens addition in decoder
2019-11-12 15:27:57 -05:00
Michael Watkins
7246d3c2f9
Consider do_lower_case in PreTrainedTokenizer
As pointed out in #1545, when using an uncased model, and adding
a new uncased token, the tokenizer does not correctly identify this
in the case that the input text contains the token in a cased format.
For instance, if we load bert-base-uncased into BertTokenizer, and
then use .add_tokens() to add "cool-token", we get the expected
result for .tokenize('this is a cool-token'). However, we get a
possibly unexpected result for .tokenize('this is a cOOl-Token'),
which in fact mirrors the result for the former from before the new
token was added.

This commit adds
- functionality to PreTrainedTokenizer to handle this
  situation in case a tokenizer (currently Bert, DistilBert,
  and XLNet) has the do_lower_case=True kwarg by:
    1) lowercasing tokens added with .add_tokens()
    2) lowercasing text at the beginning of .tokenize()
- new common test case for tokenizers
https://github.com/huggingface/transformers/issues/1545
2019-11-12 13:08:30 +02:00
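The two steps described in the commit message above can be sketched with a toy tokenizer. This is a simplified stand-in written for illustration: `ToyTokenizer`, its whitespace splitting, and its hyphen fallback are assumptions, and none of it is the actual PreTrainedTokenizer implementation.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in used only to illustrate the do_lower_case handling."""

    def __init__(self, do_lower_case=False):
        self.do_lower_case = do_lower_case
        self.added_tokens = []

    def add_tokens(self, new_tokens):
        for token in new_tokens:
            # Step 1: lowercase added tokens when the tokenizer is uncased.
            if self.do_lower_case:
                token = token.lower()
            if token not in self.added_tokens:
                self.added_tokens.append(token)

    def tokenize(self, text):
        # Step 2: lowercase the input text before matching added tokens.
        if self.do_lower_case:
            text = text.lower()
        pieces = []
        for word in text.split():
            if word in self.added_tokens:
                pieces.append(word)  # added tokens are kept whole
            else:
                # Naive fallback (an assumption): split on hyphens, keeping them.
                pieces.extend(p for p in re.split(r"(-)", word) if p)
        return pieces


uncased = ToyTokenizer(do_lower_case=True)
uncased.add_tokens(["cool-token"])
cased_input_result = uncased.tokenize("this is a cOOl-Token")
```

With lowercasing applied on both sides, the cased input matches the added token; without it, the word falls through to the fallback split.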
Lysandre
b5d330d118
Fix #1784
2019-11-11 10:15:14 -05:00
thomwolf
8d6b9d717c
fix #1532 and encode_plus
2019-11-04 17:07:51 +01:00
Sergey Mironov
0e4cc050d6
Add support for resumable downloads for HTTP protocol.
2019-10-31 18:25:34 +03:00
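The commit above does not show how resuming works, so the following is only a sketch of the general HTTP mechanism, not the repository's code. The helper name `resume_request_headers` and the partial-file convention are assumptions. A client that already holds the first N bytes of a file can send a `Range: bytes=N-` request header; a server that supports range requests answers with status 206 and the remaining bytes, while a plain 200 response means the download has to start over.

```python
import os


def resume_request_headers(partial_path):
    """Build request headers that ask the server to continue a partial download.

    Hypothetical helper: returns an empty dict when there is nothing to resume.
    """
    already = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    if already > 0:
        # Ask only for the bytes that are not on disk yet.
        return {"Range": f"bytes={already}-"}
    return {}
```

The returned dictionary can be passed as the headers of whichever HTTP client is in use; the caller still has to check the response status before appending to the partial file.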
Lysandre
7d709e55ed
Remove
2019-10-22 14:12:33 -04:00
thomwolf
a5997dd81a
better error messages
2019-10-10 11:31:01 +02:00
Lysandre Debut
e84470ef81
Merge pull request #1384 from huggingface/encoding-qol
Quality of life enhancements in encoding + patch MLM masking
2019-10-09 11:18:24 -04:00
thomwolf
78ef1a9930
fixes
2019-10-04 17:59:44 -04:00
thomwolf
6c1d0bc066
update encode_plus - add truncation strategies
2019-10-04 17:38:38 -04:00
thomwolf
92c0f2fb90
Merge remote-tracking branch 'origin/julien_multiple-choice' into encoding-qol
2019-10-04 15:48:06 -04:00
LysandreJik
7bddb45a6f
Decode documentaton
2019-10-04 14:27:38 -04:00
LysandreJik
aebd83230f
Update naming + remove f string in run_lm_finetuning example
2019-10-03 11:31:36 -04:00
LysandreJik
651bfb7ad5
always_truncate by default
2019-10-03 11:31:36 -04:00
LysandreJik
cc412edd42
Supports already existing special tokens
2019-10-03 11:31:36 -04:00
LysandreJik
2f259b228e
Sequence IDS
2019-10-03 11:31:36 -04:00
LysandreJik
7c789c337d
Always truncate argument in the encode method
2019-10-03 11:31:36 -04:00
danai-antoniou
a95158518d
Moved duplicate token check
2019-10-02 07:44:15 +01:00
danai-antoniou
d73957899a
Merge branch 'master' of https://github.com/danai-antoniou/pytorch-transformers into add-duplicate-tokens-error
2019-10-02 07:38:50 +01:00
thomwolf
391db836ab
fix #1260 - remove special logic for decoding pairs of sequence
2019-10-01 19:09:13 -04:00
Julien Chaumond
b350662955
overflowing_tokens do not really make sense here, let's just return a number
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
2019-09-30 16:37:09 -04:00
Julien Chaumond
f5bcde0b2f
[multiple-choice] Simplify and use tokenizer.encode_plus
2019-09-30 16:04:55 -04:00
Julien Chaumond
d8b641c839
6 -> 8 models
2019-09-27 17:22:01 -04:00
thomwolf
31c23bd5ee
[BIG] pytorch-transformers => transformers
2019-09-26 10:15:53 +02:00