Add sudachi and jumanpp tokenizers for bert_japanese (#19043)

* add sudachipy and jumanpp tokenizers for bert_japanese
* use ImportError instead of ModuleNotFoundError in SudachiTokenizer and JumanppTokenizer
* put test cases of test_tokenization_bert_japanese in one line
* add require_sudachi and require_jumanpp decorator for testing
* add sudachi and pyknp(jumanpp) to dependencies
* remove sudachi_dict_small and sudachi_dict_full from dependencies
* empty commit for ci
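
For readers unfamiliar with the `require_*` pattern named in the commit message, the following is a minimal sketch of how such a decorator can skip tests when an optional backend is not installed. It is not the code from this commit: `is_package_available` and `require_package` are hypothetical names introduced here for illustration, and the sketch assumes availability is checked by looking up a top-level package name.

```python
import importlib.util
import unittest


def is_package_available(name: str) -> bool:
    """Return True if the top-level package `name` can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str):
    """Build a decorator that skips a test unless `name` is installed (hypothetical helper)."""
    return unittest.skipUnless(is_package_available(name), f"test requires {name}")


class ExampleTest(unittest.TestCase):
    @require_package("json")  # stdlib module, always present, so this test runs
    def test_runs(self):
        self.assertTrue(True)

    @require_package("surely_not_installed_pkg_xyz")  # absent, so this test is skipped
    def test_skipped(self):
        self.fail("should never run")
```

With this shape, a CI job that lacks the optional dependency reports the test as skipped rather than failed, which is why the diff below also installs jumanpp in the `custom_tokenizers` job so the new tests actually run there.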
2022-10-06 00:41:37 +09:00
parent 60db81ff60
commit 2f53ab5745
8 changed files with 373 additions and 7 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -409,6 +409,16 @@ jobs:
                  keys:
                      - v0.5-custom_tokenizers-{{ checksum "setup.py" }}
                      - v0.5-custom_tokenizers-
+            - run: sudo apt-get -y update && sudo apt-get install -y cmake
+            - run:
+                name: install jumanpp
+                command: |
+                    wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz
+                    tar xvf jumanpp-2.0.0-rc3.tar.xz
+                    mkdir jumanpp-2.0.0-rc3/bld
+                    cd jumanpp-2.0.0-rc3/bld
+                    sudo cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local
+                    sudo make install
            - run: pip install --upgrade pip
            - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]
            - run: python -m unidic download