Tokenization behaves the same as the original XLM preprocessing for most languages except zh, ja and th; change API to allow specifying the language in tokenize
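
A minimal sketch of what a language-aware `tokenize` call might look like, to illustrate the API change described above. Everything here is an assumption for illustration: the helper names (`_word_punct_split`, `_char_split`, `_SPECIAL`), the `lang="en"` default, and the character-level fallback for zh, ja and th are hypothetical and are not taken from this commit. The regex splitter is a dependency-free stand-in for the Moses-style tokenization that `sacremoses` would provide.

```python
import re
from typing import Callable, Dict, List


def _word_punct_split(text: str) -> List[str]:
    # Stand-in for a Moses-style tokenizer: split into runs of word
    # characters and single punctuation marks. Not the real XLM pipeline.
    return re.findall(r"\w+|[^\w\s]", text, flags=re.UNICODE)


def _char_split(text: str) -> List[str]:
    # Crude placeholder for languages without whitespace word boundaries.
    # The actual handling of zh, ja and th is NOT shown in this commit.
    return [ch for ch in text if not ch.isspace()]


# Hypothetical routing table: languages that need special handling.
_SPECIAL: Dict[str, Callable[[str], List[str]]] = {
    "zh": _char_split,
    "ja": _char_split,
    "th": _char_split,
}


def tokenize(text: str, lang: str = "en") -> List[str]:
    # The caller names the language; anything not listed in _SPECIAL
    # takes the default word/punctuation path.
    splitter = _SPECIAL.get(lang, _word_punct_split)
    return splitter(text)
```

Passing the language per call, rather than fixing it at construction time, would let one tokenizer instance serve mixed-language batches; whether the commit makes the same choice is not visible from the part of the diff shown here.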

2019-08-23 14:40:17 -04:00
parent df9d6effae
commit 436ce07218
3 changed files with 135 additions and 20 deletions
--- a/requirements.txt
+++ b/requirements.txt
@@ -9,4 +9,6 @@ requests
 # For OpenAI GPT
 regex
 # For XLNet
-sentencepiece
+sentencepiece
+# For XLM
+sacremoses