Add changes for uroman package to handle non-Roman characters (#32404)

* Add changes for uroman package to handle non-Roman characters

* Update docs for uroman changes

* Modifying error message to warning, for backward compatibility

* Update instruction for user to install uroman

* Update docs for uroman python version dependency and backward compatibility

* Update warning message for python version compatibility with uroman

* Refine docs
This commit is contained in:
Ritik Nandwal
2024-08-26 20:37:01 +05:30
committed by GitHub
parent 72d4a3f9c1
commit a378a54a57
4 changed files with 48 additions and 8 deletions

View File

@@ -93,12 +93,33 @@ from transformers import VitsTokenizer
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
print(tokenizer.is_uroman)
```
If the is_uroman attribute is `True`, the tokenizer will automatically apply the `uroman` package to your text inputs, but you need to install uroman if not already installed using:
```
pip install --upgrade uroman
```
Note: Python version required to use `uroman` as python package should be >= `3.10`.
You can use the tokenizer as usual without any additional preprocessing steps:
```python
import torch
from transformers import VitsTokenizer, VitsModel, set_seed
import os
import subprocess
If required, you should apply the uroman package to your text inputs **prior** to passing them to the `VitsTokenizer`,
since currently the tokenizer does not support performing the pre-processing itself.
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor")
model = VitsModel.from_pretrained("facebook/mms-tts-kor")
text = "이봐 무슨 일이야"
inputs = tokenizer(text=text, return_tensors="pt")
set_seed(555) # make deterministic
with torch.no_grad():
outputs = model(inputs["input_ids"])
waveform = outputs.waveform[0]
```
If you don't want to upgrade to python >= `3.10`, then you can use the `uroman` perl package to pre-process the text inputs to the Roman alphabet.
To do this, first clone the uroman repository to your local machine and set the bash variable `UROMAN` to the local path:
```bash
git clone https://github.com/isi-nlp/uroman.git
cd uroman