From 653a79ccad7398bfbb6bd93def9f59409c36d55b Mon Sep 17 00:00:00 2001 From: abdullaholuk-loodos <70137509+abdullaholuk-loodos@users.noreply.github.com> Date: Thu, 3 Sep 2020 16:13:43 +0300 Subject: [PATCH] Loodos model cards had errors on "Usage" section. It is fixed. Also "electra-base-turkish-uncased" model removed from s3 and re-uploaded as "electra-base-turkish-uncased-discriminator". Its README added. (#6921) Co-authored-by: Abdullah Oluk --- .../albert-base-turkish-uncased/README.md | 20 +++++++++------- .../bert-base-turkish-uncased/README.md | 15 +++++++----- .../README.md | 22 ++++++++++-------- .../README.md | 23 +++++++++++-------- .../README.md | 19 +++++++++------ .../README.md | 21 +++++++++-------- 6 files changed, 71 insertions(+), 49 deletions(-) rename model_cards/loodos/{electra-base-turkish-uncased => electra-base-turkish-uncased-discriminator}/README.md (73%) diff --git a/model_cards/loodos/albert-base-turkish-uncased/README.md b/model_cards/loodos/albert-base-turkish-uncased/README.md index 0f2317c600..69b1143ada 100644 --- a/model_cards/loodos/albert-base-turkish-uncased/README.md +++ b/model_cards/loodos/albert-base-turkish-uncased/README.md @@ -4,11 +4,11 @@ language: tr # Turkish Language Models with Huggingface's Transformers -As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models) +As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models). # Turkish ALBERT-Base (uncased) -This is ALBERT-Base model which has 12 repeated encore layers with 768 hidden layer size trained on uncased Turkish dataset. 
+This is ALBERT-Base model which has 12 repeated encoder layers with 768 hidden layer size trained on uncased Turkish dataset. ## Usage @@ -16,16 +16,19 @@ Using AutoModel and AutoTokenizer from Transformers, you can import the model as ```python from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True) -model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased") +tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True) + +model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased") + normalizer = TextNormalization() -normalized_text = normalizer(text, do_lower_case=True) +normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True) + tokenizer.tokenize(normalized_text) ``` ### Notes on Tokenizers -Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons. +Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons. 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish. @@ -38,11 +41,12 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters. We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. 
Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models). -# Details and Contact + +## Details and Contact You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models). -# Acknowledgments +## Acknowledgments Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models. diff --git a/model_cards/loodos/bert-base-turkish-uncased/README.md b/model_cards/loodos/bert-base-turkish-uncased/README.md index 88007b2a00..6768b4a172 100644 --- a/model_cards/loodos/bert-base-turkish-uncased/README.md +++ b/model_cards/loodos/bert-base-turkish-uncased/README.md @@ -4,11 +4,11 @@ language: tr # Turkish Language Models with Huggingface's Transformers -As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models) +As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models). # Turkish BERT-Base (uncased) -This is BERT-Base model which has 12 encoder layer with 768 hidden layer size trained on uncased Turkish dataset. +This is BERT-Base model which has 12 encoder layers with 768 hidden layer size trained on uncased Turkish dataset. 
## Usage @@ -16,16 +16,19 @@ Using AutoModel and AutoTokenizer from Transformers, you can import the model as ```python from transformers import AutoModel, AutoTokenizer + tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False) + model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased") normalizer = TextNormalization() -normalized_text = normalizer(text, do_lower_case=True) +normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True) + tokenizer.tokenize(normalized_text) ``` ### Notes on Tokenizers -Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons. +Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons. 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish. @@ -39,11 +42,11 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters. We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models). -# Contact +## Details and Contact You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models). 
-# Acknowledgments +## Acknowledgments Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models. diff --git a/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md b/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md index 8fc14eef2d..e64607acc8 100644 --- a/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md +++ b/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md @@ -4,28 +4,31 @@ language: tr # Turkish Language Models with Huggingface's Transformers -As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models) +As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models). # Turkish ELECTRA-Base-discriminator (uncased/64k) -This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset. This version has a vocab of size 64k different from default, 32k. +This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset. This version has a vocab of size 64k, different from default 32k. ## Usage -Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below. +Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below. 
```python -from transformers import AutoModel, AutoTokenizer +from transformers import AutoModel, AutoModelWithLMHead + tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False) -model = AutoModel.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator") + +model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator") normalizer = TextNormalization() -normalized_text = normalizer(text, do_lower_case=True) +normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True) + tokenizer.tokenize(normalized_text) ``` ### Notes on Tokenizers -Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons. +Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons. 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish. @@ -38,11 +41,12 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters. We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models). 
-# Details and Contact + +## Details and Contact You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models). -# Acknowledgments +## Acknowledgments Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models. diff --git a/model_cards/loodos/electra-base-turkish-uncased/README.md b/model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md similarity index 73% rename from model_cards/loodos/electra-base-turkish-uncased/README.md rename to model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md index 0fab1fe825..fc0b6b01f6 100644 --- a/model_cards/loodos/electra-base-turkish-uncased/README.md +++ b/model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md @@ -4,28 +4,31 @@ language: tr # Turkish Language Models with Huggingface's Transformers -As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models) +As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models). -# Turkish ELECTRA-Base (uncased) +# Turkish ELECTRA-Base-discriminator (uncased) This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset. ## Usage -Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below. +Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below. 
```python -from transformers import AutoModel, AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased", do_lower_case=False) -model = AutoModel.from_pretrained("loodos/electra-base-turkish-uncased") +from transformers import AutoModel, AutoModelWithLMHead + +tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased-discriminator", do_lower_case=False) + +model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-uncased-discriminator") normalizer = TextNormalization() -normalized_text = normalizer(text, do_lower_case=True) +normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True) + tokenizer.tokenize(normalized_text) ``` ### Notes on Tokenizers -Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons. +Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons. 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish. @@ -39,11 +42,11 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters. We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models). 
-# Details and Contact +## Details and Contact You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models). -# Acknowledgments +## Acknowledgments Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models. diff --git a/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md b/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md index 1faa581d95..de7a3dd6e6 100644 --- a/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md +++ b/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md @@ -4,24 +4,29 @@ language: tr # Turkish Language Models with Huggingface's Transformers -As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models) +As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models). # Turkish ELECTRA-Small-discriminator (cased) -This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layer size trained on cased Turkish dataset. +This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layers size trained on cased Turkish dataset. ## Usage -Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below. +Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below. 
```python -from transformers import AutoModel, AutoTokenizer +from transformers import AutoModel, AutoModelWithLMHead + tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-cased-discriminator") -model = AutoModel.from_pretrained("loodos/electra-small-turkish-cased-discriminator") + +model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-cased-discriminator") ``` -# Details and Contact +## Details and Contact You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models). -# Acknowledgments +## Acknowledgments + +Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models. + diff --git a/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md b/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md index 0dda2c10e2..91d9b270bf 100644 --- a/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md +++ b/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md @@ -4,28 +4,31 @@ language: tr # Turkish Language Models with Huggingface's Transformers -As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. The details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models) +As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models). # Turkish ELECTRA-Small-discriminator (uncased) -This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layer size trained on uncased Turkish dataset. 
Please refer to +This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layer size trained on uncased Turkish dataset. ## Usage -Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below. +Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below. ```python -from transformers import AutoModel, AutoTokenizer +from transformers import AutoModel, AutoModelWithLMHead + tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-uncased-discriminator", do_lower_case=False) -model = AutoModel.from_pretrained("loodos/electra-small-turkish-uncased-discriminator") + +model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-uncased-discriminator") normalizer = TextNormalization() -normalized_text = normalizer(text, do_lower_case=True) +normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True) + tokenizer.tokenize(normalized_text) ``` ### Notes on Tokenizers -Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons. +Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are two reasons. 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish. @@ -39,11 +42,11 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters. 
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models). -# Details and Contact +## Details and Contact You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models). -# Acknowledgments +## Acknowledgments Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
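
The two tokenizer problems described under "Notes on Tokenizers" can be reproduced with the Python standard library alone. The sketch below is only an illustration under stated assumptions: `turkish_lower` and `strip_accents_nfd` are hypothetical helper names, not the `TextNormalization` module from the Loodos repo, and the accent stripping only mimics the NFD-based behavior the notes describe.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase text with Turkish rules for the dotted/dotless i pair (hypothetical helper)."""
    # Compose first so that "I" + combining dot above becomes the single letter "İ".
    text = unicodedata.normalize("NFC", text)
    # Map the two problematic capitals explicitly before the generic lower().
    text = text.replace("İ", "i").replace("I", "ı")
    return text.lower()


def strip_accents_nfd(text: str) -> str:
    """Decompose with NFD and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


word = "ŞANLIURFA"
print(turkish_lower(word))               # şanlıurfa (Turkish-aware)
print(word.lower())                      # şanliurfa (generic lowercasing turns 'I' into 'i')
print(strip_accents_nfd("çocuk öğün"))   # cocuk ogun (diacritics are lost)
```

Applying a Turkish-aware lowercasing step before tokenization, and loading the tokenizer with `do_lower_case=False`, keeps words such as "şanlıurfa" intact so that they can match entries in a vocabulary built with NFC/NFKC normalization.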