Make tokenization behave the same as the original XLM preprocessing for most languages, except zh, ja and th; change the API to allow specifying the language in tokenize

Shijie Wu
2019-08-23 14:40:17 -04:00
parent df9d6effae
commit 436ce07218
3 changed files with 135 additions and 20 deletions


@@ -9,4 +9,6 @@ requests
 # For OpenAI GPT
 regex
 # For XLNet
-sentencepiece
+sentencepiece
+# For XLM
+sacremoses