Add sudachi and jumanpp tokenizers for bert_japanese (#19043)

* add sudachipy and jumanpp tokenizers for bert_japanese

* use ImportError instead of ModuleNotFoundError in SudachiTokenizer and JumanppTokenizer

* put test cases of test_tokenization_bert_japanese in one line

* add require_sudachi and require_jumanpp decorators for testing

* add sudachi and pyknp(jumanpp) to dependencies

* remove sudachi_dict_small and sudachi_dict_full from dependencies

* empty commit for ci
r-terada
2022-10-06 00:41:37 +09:00
committed by GitHub
parent 60db81ff60
commit 2f53ab5745
8 changed files with 373 additions and 7 deletions
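
The commit message mentions raising ImportError rather than ModuleNotFoundError and adding require_sudachi / require_jumanpp test decorators. The sketch below illustrates that general pattern only; it is not the code from this commit. The helper names (is_module_available, is_binary_available, require_module, require_binary, load_optional) are hypothetical, and the assumption that the Jumanpp executable is named "jumanpp" on PATH is taken from the install prefix used in the CI step, not from the tokenizer source.

```python
import importlib
import importlib.util
import shutil
import unittest


def is_module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def is_binary_available(name: str) -> bool:
    """Return True if an executable with this name is on PATH."""
    return shutil.which(name) is not None


def require_module(name: str):
    """Hypothetical decorator factory: skip a test unless `name` is importable."""
    return unittest.skipUnless(is_module_available(name), f"test requires {name}")


def require_binary(name: str):
    """Hypothetical decorator factory: skip a test unless `name` is on PATH."""
    return unittest.skipUnless(is_binary_available(name), f"test requires {name}")


def load_optional(name: str):
    """Import an optional dependency, re-raising with an install hint.

    ModuleNotFoundError is a subclass of ImportError, so catching
    ImportError covers both a missing module and a broken install.
    """
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"{name} is required for this tokenizer; try `pip install {name}`."
        ) from err


class ExampleTokenizerTest(unittest.TestCase):
    # Assumed names: the "sudachipy" package and a "jumanpp" executable.
    @require_module("sudachipy")
    def test_needs_sudachi(self):
        self.assertIsNotNone(load_optional("sudachipy"))

    @require_binary("jumanpp")
    def test_needs_jumanpp(self):
        self.assertTrue(is_binary_available("jumanpp"))
```

With decorators of this kind, a test run on a machine without SudachiPy or Jumanpp reports the affected tests as skipped instead of failing, which is why the CI job has to install both before running the custom tokenizer tests.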

@@ -409,6 +409,16 @@ jobs:
          keys:
            - v0.5-custom_tokenizers-{{ checksum "setup.py" }}
            - v0.5-custom_tokenizers-
      - run: sudo apt-get -y update && sudo apt-get install -y cmake
      - run:
          name: install jumanpp
          command: |
            wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz
            tar xvf jumanpp-2.0.0-rc3.tar.xz
            mkdir jumanpp-2.0.0-rc3/bld
            cd jumanpp-2.0.0-rc3/bld
            sudo cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local
            sudo make install
      - run: pip install --upgrade pip
      - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]
      - run: python -m unidic download