From bff6d517cde8b474af3a1a03e4e8432cf1dad90e Mon Sep 17 00:00:00 2001
From: hakan <66843134+hakanozgur@users.noreply.github.com>
Date: Wed, 2 Sep 2020 00:35:24 +0300
Subject: [PATCH] loodos turkish model cards added (#6840)

---
 .../albert-base-turkish-uncased/README.md     | 48 ++++++++++++++++++
 .../bert-base-turkish-uncased/README.md       | 49 +++++++++++++++++++
 .../README.md                                 | 48 ++++++++++++++++++
 .../electra-base-turkish-uncased/README.md    | 49 +++++++++++++++++++
 .../README.md                                 | 27 ++++++++++
 .../README.md                                 | 49 +++++++++++++++++++
 6 files changed, 270 insertions(+)
 create mode 100644 model_cards/loodos/albert-base-turkish-uncased/README.md
 create mode 100644 model_cards/loodos/bert-base-turkish-uncased/README.md
 create mode 100644 model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
 create mode 100644 model_cards/loodos/electra-base-turkish-uncased/README.md
 create mode 100644 model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
 create mode 100644 model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md

diff --git a/model_cards/loodos/albert-base-turkish-uncased/README.md b/model_cards/loodos/albert-base-turkish-uncased/README.md
new file mode 100644
index 0000000000..0f2317c600
--- /dev/null
+++ b/model_cards/loodos/albert-base-turkish-uncased/README.md
@@ -0,0 +1,48 @@
+---
+language: tr
+---
+
+# Turkish Language Models with Huggingface's Transformers
+
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
+
+# Turkish ALBERT-Base (uncased)
+
+This is the ALBERT-Base model, which has 12 repeated encoder layers with a hidden size of 768, trained on an uncased Turkish dataset.
+
+## Usage
+
+Using AutoModel and AutoTokenizer from Transformers, you can load the model as shown below.
+
+```python
+from transformers import AutoModel, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True)
+model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased")
+
+normalizer = TextNormalization()  # provided in the Loodos repo (see the notes below), not part of transformers
+normalized_text = normalizer(text, do_lower_case=True)  # text is your input string
+tokenizer.tokenize(normalized_text)
+```
+
+### Notes on Tokenizers
+Currently, Huggingface's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and the non-ASCII Turkish-specific letters. There are 2 reasons.
+
+1- The vocabulary and SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained (like "şanlıurfa", "öğün", "çocuk" etc.). NFD/NFKD normalization is not suitable for Turkish.
+
+2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
+
+- "I" and "İ" to 'i'
+- 'i' and 'ı' to 'I'
+
+respectively. However, in Turkish, 'I' and 'İ' are two different letters.
+
+We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
+
+# Details and Contact
+
+You can contact us to ask a question, open an issue or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
+
+# Acknowledgments
+
+Many thanks to the TFRC team for providing us with cloud TPUs on TensorFlow Research Cloud to train our models.
+
diff --git a/model_cards/loodos/bert-base-turkish-uncased/README.md b/model_cards/loodos/bert-base-turkish-uncased/README.md
new file mode 100644
index 0000000000..88007b2a00
--- /dev/null
+++ b/model_cards/loodos/bert-base-turkish-uncased/README.md
@@ -0,0 +1,49 @@
+---
+language: tr
+---
+
+# Turkish Language Models with Huggingface's Transformers
+
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
+
+# Turkish BERT-Base (uncased)
+
+This is the BERT-Base model, which has 12 encoder layers with a hidden size of 768, trained on an uncased Turkish dataset.
+
+## Usage
+
+Using AutoModel and AutoTokenizer from Transformers, you can load the model as shown below.
+
+```python
+from transformers import AutoModel, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)
+model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased")
+
+normalizer = TextNormalization()  # provided in the Loodos repo (see the notes below), not part of transformers
+normalized_text = normalizer(text, do_lower_case=True)  # text is your input string
+tokenizer.tokenize(normalized_text)
+```
+
+### Notes on Tokenizers
+Currently, Huggingface's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and the non-ASCII Turkish-specific letters. There are 2 reasons.
+
+1- The vocabulary and SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained (like "şanlıurfa", "öğün", "çocuk" etc.). NFD/NFKD normalization is not suitable for Turkish.
+
+2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
+
+- "I" and "İ" to 'i'
+- 'i' and 'ı' to 'I'
+
+respectively. However, in Turkish, 'I' and 'İ' are two different letters.
+
+We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
+
+
+# Contact
+
+You can contact us to ask a question, open an issue or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
+
+# Acknowledgments
+
+Many thanks to the TFRC team for providing us with cloud TPUs on TensorFlow Research Cloud to train our models.
+
diff --git a/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md b/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
new file mode 100644
index 0000000000..8fc14eef2d
--- /dev/null
+++ b/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
@@ -0,0 +1,48 @@
+---
+language: tr
+---
+
+# Turkish Language Models with Huggingface's Transformers
+
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
+
+# Turkish ELECTRA-Base-discriminator (uncased/64k)
+
+This is the ELECTRA-Base model's discriminator, which has the same structure as BERT-Base, trained on an uncased Turkish dataset. This version has a vocabulary size of 64k, unlike the default 32k.
+
+## Usage
+
+Using AutoModel and AutoTokenizer from Transformers, you can load the model as shown below.
+
+```python
+from transformers import AutoModel, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)
+model = AutoModel.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")
+
+normalizer = TextNormalization()  # provided in the Loodos repo (see the notes below), not part of transformers
+normalized_text = normalizer(text, do_lower_case=True)  # text is your input string
+tokenizer.tokenize(normalized_text)
+```
+
+### Notes on Tokenizers
+Currently, Huggingface's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and the non-ASCII Turkish-specific letters. There are 2 reasons.
+
+1- The vocabulary and SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained (like "şanlıurfa", "öğün", "çocuk" etc.). NFD/NFKD normalization is not suitable for Turkish.
+
+2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
+
+- "I" and "İ" to 'i'
+- 'i' and 'ı' to 'I'
+
+respectively. However, in Turkish, 'I' and 'İ' are two different letters.
+
+We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
+
+# Details and Contact
+
+You can contact us to ask a question, open an issue or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
+
+# Acknowledgments
+
+Many thanks to the TFRC team for providing us with cloud TPUs on TensorFlow Research Cloud to train our models.
+
diff --git a/model_cards/loodos/electra-base-turkish-uncased/README.md b/model_cards/loodos/electra-base-turkish-uncased/README.md
new file mode 100644
index 0000000000..0fab1fe825
--- /dev/null
+++ b/model_cards/loodos/electra-base-turkish-uncased/README.md
@@ -0,0 +1,49 @@
+---
+language: tr
+---
+
+# Turkish Language Models with Huggingface's Transformers
+
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
+
+# Turkish ELECTRA-Base (uncased)
+
+This is the ELECTRA-Base model's discriminator, which has the same structure as BERT-Base, trained on an uncased Turkish dataset.
+
+## Usage
+
+Using AutoModel and AutoTokenizer from Transformers, you can load the model as shown below.
+
+```python
+from transformers import AutoModel, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased", do_lower_case=False)
+model = AutoModel.from_pretrained("loodos/electra-base-turkish-uncased")
+
+normalizer = TextNormalization()  # provided in the Loodos repo (see the notes below), not part of transformers
+normalized_text = normalizer(text, do_lower_case=True)  # text is your input string
+tokenizer.tokenize(normalized_text)
+```
+
+### Notes on Tokenizers
+Currently, Huggingface's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and the non-ASCII Turkish-specific letters. There are 2 reasons.
+
+1- The vocabulary and SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained (like "şanlıurfa", "öğün", "çocuk" etc.). NFD/NFKD normalization is not suitable for Turkish.
+
+2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
+
+- "I" and "İ" to 'i'
+- 'i' and 'ı' to 'I'
+
+respectively. However, in Turkish, 'I' and 'İ' are two different letters.
+
+We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
+
+
+# Details and Contact
+
+You can contact us to ask a question, open an issue or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
+
+# Acknowledgments
+
+Many thanks to the TFRC team for providing us with cloud TPUs on TensorFlow Research Cloud to train our models.
+
diff --git a/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md b/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
new file mode 100644
index 0000000000..1faa581d95
--- /dev/null
+++ b/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
@@ -0,0 +1,27 @@
+---
+language: tr
+---
+
+# Turkish Language Models with Huggingface's Transformers
+
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
+
+# Turkish ELECTRA-Small-discriminator (cased)
+
+This is the ELECTRA-Small model's discriminator, which has 12 encoder layers with a hidden size of 256, trained on a cased Turkish dataset.
+
+## Usage
+
+Using AutoModel and AutoTokenizer from Transformers, you can load the model as shown below.
+
+```python
+from transformers import AutoModel, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
+model = AutoModel.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
+```
+
+# Details and Contact
+
+You can contact us to ask a question, open an issue or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
+
+# Acknowledgments
diff --git a/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md b/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md
new file mode 100644
index 0000000000..0dda2c10e2
--- /dev/null
+++ b/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md
@@ -0,0 +1,49 @@
+---
+language: tr
+---
+
+# Turkish Language Models with Huggingface's Transformers
+
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. The details about the pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models).
+
+# Turkish ELECTRA-Small-discriminator (uncased)
+
+This is the ELECTRA-Small model's discriminator, which has 12 encoder layers with a hidden size of 256, trained on an uncased Turkish dataset. Please refer to
+
+## Usage
+
+Using AutoModel and AutoTokenizer from Transformers, you can load the model as shown below.
+
+```python
+from transformers import AutoModel, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-uncased-discriminator", do_lower_case=False)
+model = AutoModel.from_pretrained("loodos/electra-small-turkish-uncased-discriminator")
+
+normalizer = TextNormalization()  # provided in the Loodos repo (see the notes below), not part of transformers
+normalized_text = normalizer(text, do_lower_case=True)  # text is your input string
+tokenizer.tokenize(normalized_text)
+```
+
+### Notes on Tokenizers
+Currently, Huggingface's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and the non-ASCII Turkish-specific letters. There are 2 reasons.
+
+1- The vocabulary and SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained (like "şanlıurfa", "öğün", "çocuk" etc.). NFD/NFKD normalization is not suitable for Turkish.
+
+2- Python's default ```string.lower()``` and ```string.upper()``` make the conversions
+
+- "I" and "İ" to 'i'
+- 'i' and 'ı' to 'I'
+
+respectively. However, in Turkish, 'I' and 'İ' are two different letters.
+
+We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
+
+
+# Details and Contact
+
+You can contact us to ask a question, open an issue or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
+
+# Acknowledgments
+
+Many thanks to the TFRC team for providing us with cloud TPUs on TensorFlow Research Cloud to train our models.
+
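
The two causes named in the "Notes on Tokenizers" sections of these model cards can be reproduced with Python's standard library alone. The sketch below is only an illustration and is not the `TextNormalization` module from the Loodos repo: `strip_marks`, `turkish_lower` and `turkish_upper` are hypothetical helpers introduced here, and the mark-stripping step is an assumption about how an NFD-based tokenizer ends up losing Turkish diacritics.

```python
import unicodedata


def strip_marks(text):
    """NFD-decompose text and drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def turkish_lower(text):
    """Lowercase using the Turkish pairs I/ı and İ/i (hypothetical helper)."""
    # Map the two capital I's explicitly, then let str.lower() do the rest.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def turkish_upper(text):
    """Uppercase using the Turkish pairs i/İ and ı/I (hypothetical helper)."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


# Reason 1: NFD splits a precomposed letter into base letter + combining mark.
word = "\u00e7ocuk"  # "çocuk" in NFC form, 5 code points
nfd_word = unicodedata.normalize("NFD", word)
print(len(word), len(nfd_word))  # 5 6

# Once the marks are dropped, the Turkish-specific letters are gone.
print(strip_marks(word))  # cocuk
print(strip_marks("\u00f6\u011f\u00fcn"))  # "öğün" becomes ogun

# Reason 2: Python's default case mapping is not Turkish-aware.
print("I".lower(), "\u0131".upper())  # i I
print(len("\u0130".lower()))  # 2, i.e. 'i' followed by U+0307

# Turkish-aware round trip with the hypothetical helpers.
lowered = turkish_lower("ISPARTA \u0130ZM\u0130R")
print(lowered)
print(turkish_upper(lowered))
```

Explicitly replacing the two capital I's before calling `lower()` is one simple way to keep 'ı' and 'i' distinct; a normalizer used for real training data would also have to choose a normalization form (NFC here) and apply it consistently to both the vocabulary and the input text.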