From 653a79ccad7398bfbb6bd93def9f59409c36d55b Mon Sep 17 00:00:00 2001
From: abdullaholuk-loodos
 <70137509+abdullaholuk-loodos@users.noreply.github.com>
Date: Thu, 3 Sep 2020 16:13:43 +0300
Subject: [PATCH] Loodos model cards had errors on "Usage" section. It is
 fixed. Also "electra-base-turkish-uncased" model removed from s3 and
 re-uploaded as "electra-base-turkish-uncased-discriminator". Its README
 added. (#6921)

Co-authored-by: Abdullah Oluk <abdullaholuk123@gmail.com>
---
 .../albert-base-turkish-uncased/README.md     | 20 +++++++++-------
 .../bert-base-turkish-uncased/README.md       | 15 +++++++-----
 .../README.md                                 | 22 ++++++++++--------
 .../README.md                                 | 23 +++++++++++--------
 .../README.md                                 | 19 +++++++++------
 .../README.md                                 | 21 +++++++++--------
 6 files changed, 71 insertions(+), 49 deletions(-)
 rename model_cards/loodos/{electra-base-turkish-uncased => electra-base-turkish-uncased-discriminator}/README.md (73%)

diff --git a/model_cards/loodos/albert-base-turkish-uncased/README.md b/model_cards/loodos/albert-base-turkish-uncased/README.md
index 0f2317c600..69b1143ada 100644
--- a/model_cards/loodos/albert-base-turkish-uncased/README.md
+++ b/model_cards/loodos/albert-base-turkish-uncased/README.md
@@ -4,11 +4,11 @@ language: tr
 
 # Turkish Language Models with Huggingface's Transformers
 
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models)
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
 
 # Turkish ALBERT-Base (uncased)
 
-This is ALBERT-Base model which has 12 repeated encore layers with 768 hidden layer size trained on uncased Turkish dataset.
+This is an ALBERT-Base model with 12 repeated encoder layers and a hidden size of 768, trained on an uncased Turkish dataset.
 
 ## Usage
 
@@ -16,16 +16,19 @@ Using AutoModel and AutoTokenizer from Transformers, you can import the model as
 
 ```python
 from transformers import AutoModel, AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True)
-model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased") 
 
+tokenizer = AutoTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=False, keep_accents=True)
+
+model = AutoModel.from_pretrained("loodos/albert-base-turkish-uncased")
+ 
 normalizer = TextNormalization()
-normalized_text = normalizer(text, do_lower_case=True)
+normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
+
 tokenizer.tokenize(normalized_text)
 ```
 
 ### Notes on Tokenizers
-Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons.
+Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning the letters "ı, i, I, İ" and other non-ASCII, Turkish-specific letters. There are two reasons.
 
 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
 
@@ -38,11 +41,12 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters.
 
 We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
 
-# Details and Contact
+
+## Details and Contact
 
 You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
 
-# Acknowledgments
+## Acknowledgments
 
 Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
 
diff --git a/model_cards/loodos/bert-base-turkish-uncased/README.md b/model_cards/loodos/bert-base-turkish-uncased/README.md
index 88007b2a00..6768b4a172 100644
--- a/model_cards/loodos/bert-base-turkish-uncased/README.md
+++ b/model_cards/loodos/bert-base-turkish-uncased/README.md
@@ -4,11 +4,11 @@ language: tr
 
 # Turkish Language Models with Huggingface's Transformers
 
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models)
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
 
 # Turkish BERT-Base (uncased)
 
-This is BERT-Base model which has 12 encoder layer with 768 hidden layer size trained on uncased Turkish dataset.
+This is a BERT-Base model with 12 encoder layers and a hidden size of 768, trained on an uncased Turkish dataset.
 
 ## Usage
 
@@ -16,16 +16,19 @@ Using AutoModel and AutoTokenizer from Transformers, you can import the model as
 
 ```python
 from transformers import AutoModel, AutoTokenizer
+
 tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)
+
 model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased")
  
 normalizer = TextNormalization()
-normalized_text = normalizer(text, do_lower_case=True)
+normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
+
 tokenizer.tokenize(normalized_text)
 ```
 
 ### Notes on Tokenizers
-Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons.
+Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning the letters "ı, i, I, İ" and other non-ASCII, Turkish-specific letters. There are two reasons.
 
 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
 
@@ -39,11 +42,11 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters.
 We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
 
 
-# Contact
+## Details and Contact
 
 You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
 
-# Acknowledgments
+## Acknowledgments
 
 Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
 
diff --git a/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md b/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
index 8fc14eef2d..e64607acc8 100644
--- a/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
+++ b/model_cards/loodos/electra-base-turkish-64k-uncased-discriminator/README.md
@@ -4,28 +4,31 @@ language: tr
 
 # Turkish Language Models with Huggingface's Transformers
 
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models)
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
 
 # Turkish ELECTRA-Base-discriminator (uncased/64k)
 
-This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset. This version has a vocab of size 64k different from default, 32k.
+This is the discriminator of an ELECTRA-Base model, which has the same structure as BERT-Base and was trained on an uncased Turkish dataset. This version has a vocabulary size of 64k, unlike the default 32k.
 
 ## Usage
 
-Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
+Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
 
 ```python
-from transformers import AutoModel, AutoTokenizer
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
 tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)
-model = AutoModel.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")
+
+model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")
  
 normalizer = TextNormalization()
-normalized_text = normalizer(text, do_lower_case=True)
+normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
+
 tokenizer.tokenize(normalized_text)
 ```
 
 ### Notes on Tokenizers
-Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons.
+Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning the letters "ı, i, I, İ" and other non-ASCII, Turkish-specific letters. There are two reasons.
 
 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
 
@@ -38,11 +41,12 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters.
 
 We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
 
-# Details and Contact
+
+## Details and Contact
 
 You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
 
-# Acknowledgments
+## Acknowledgments
 
 Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
 
diff --git a/model_cards/loodos/electra-base-turkish-uncased/README.md b/model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md
similarity index 73%
rename from model_cards/loodos/electra-base-turkish-uncased/README.md
rename to model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md
index 0fab1fe825..fc0b6b01f6 100644
--- a/model_cards/loodos/electra-base-turkish-uncased/README.md
+++ b/model_cards/loodos/electra-base-turkish-uncased-discriminator/README.md
@@ -4,28 +4,31 @@ language: tr
 
 # Turkish Language Models with Huggingface's Transformers
 
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models)
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
 
-# Turkish ELECTRA-Base (uncased)
+# Turkish ELECTRA-Base-discriminator (uncased)
 
 This is ELECTRA-Base model's discriminator which has the same structure with BERT-Base trained on uncased Turkish dataset.
 
 ## Usage
 
-Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
+Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
 
 ```python
-from transformers import AutoModel, AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased", do_lower_case=False)
-model = AutoModel.from_pretrained("loodos/electra-base-turkish-uncased")
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-uncased-discriminator", do_lower_case=False)
+
+model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-uncased-discriminator")
  
 normalizer = TextNormalization()
-normalized_text = normalizer(text, do_lower_case=True)
+normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
+
 tokenizer.tokenize(normalized_text)
 ```
 
 ### Notes on Tokenizers
-Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons.
+Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning the letters "ı, i, I, İ" and other non-ASCII, Turkish-specific letters. There are two reasons.
 
 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
 
@@ -39,11 +42,11 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters.
 We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
 
 
-# Details and Contact
+## Details and Contact
 
 You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
 
-# Acknowledgments
+## Acknowledgments
 
 Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
 
diff --git a/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md b/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
index 1faa581d95..de7a3dd6e6 100644
--- a/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
+++ b/model_cards/loodos/electra-small-turkish-cased-discriminator/README.md
@@ -4,24 +4,29 @@ language: tr
 
 # Turkish Language Models with Huggingface's Transformers
 
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. More details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models)
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
 
 # Turkish ELECTRA-Small-discriminator (cased)
 
-This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layer size trained on cased Turkish dataset.
+This is the discriminator of an ELECTRA-Small model, with 12 encoder layers and a hidden size of 256, trained on a cased Turkish dataset.
 
 ## Usage
 
-Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
+Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
 
 ```python
-from transformers import AutoModel, AutoTokenizer
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
 tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
-model = AutoModel.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
+
+model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-cased-discriminator")
 ```
 
-# Details and Contact
+## Details and Contact
 
 You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
 
-# Acknowledgments
+## Acknowledgments
+
+Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.
+
diff --git a/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md b/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md
index 0dda2c10e2..91d9b270bf 100644
--- a/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md
+++ b/model_cards/loodos/electra-small-turkish-uncased-discriminator/README.md
@@ -4,28 +4,31 @@ language: tr
 
 # Turkish Language Models with Huggingface's Transformers
 
-As R&D Team at Loodos, we release cased and uncased versions of most recent language models for Turkish. The details about pretrained models and evaluations on downstream tasks can be found [here](https://github.com/Loodos/turkish-language-models)
+As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).
 
 # Turkish ELECTRA-Small-discriminator (uncased)
 
-This is ELECTRA-Small model's discriminator which has 12 encoder layers with 256 hidden layer size trained on uncased Turkish dataset. Please refer to 
+This is the discriminator of an ELECTRA-Small model, with 12 encoder layers and a hidden size of 256, trained on an uncased Turkish dataset.
 
 ## Usage
 
-Using AutoModel and AutoTokenizer from Transformers, you can import the model as described below.
+Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
 
 ```python
-from transformers import AutoModel, AutoTokenizer
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
 tokenizer = AutoTokenizer.from_pretrained("loodos/electra-small-turkish-uncased-discriminator", do_lower_case=False)
-model = AutoModel.from_pretrained("loodos/electra-small-turkish-uncased-discriminator")
+
+model = AutoModelWithLMHead.from_pretrained("loodos/electra-small-turkish-uncased-discriminator")
  
 normalizer = TextNormalization()
-normalized_text = normalizer(text, do_lower_case=True)
+normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
+
 tokenizer.tokenize(normalized_text)
 ```
 
 ### Notes on Tokenizers
-Currently, huggingface's tokenizers (which were written in Python) have a bug concerning letters "ı, i, I, İ" and non-ASCII Turkish specific letters. There are 2 reasons.
+Currently, Huggingface's tokenizers (which were written in Python) have a bug concerning the letters "ı, i, I, İ" and other non-ASCII, Turkish-specific letters. There are two reasons.
 
 1- Vocabulary and sentence piece model is created with NFC/NFKC normalization but tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained.(like "şanlıurfa", "öğün", "çocuk" etc.) NFD/NFKD normalization is not proper for Turkish.
 
@@ -39,11 +42,11 @@ respectively. However, in Turkish, 'I' and 'İ' are two different letters.
 We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's github repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
 
 
-# Details and Contact
+## Details and Contact
 
 You contact us to ask a question, open an issue or give feedback via our github [repo](https://github.com/Loodos/turkish-language-models).
 
-# Acknowledgments
+## Acknowledgments
 
 Many thanks to TFRC Team for providing us cloud TPUs on Tensorflow Research Cloud to train our models.