Added 12 model cards for Indian Language Models (#8198)

* Create README.md
* added model cards
2020-11-02 10:47:43 +05:30
parent 9bd30f7cf4
commit aa79aa4e7d
12 changed files with 344 additions and 0 deletions
--- a/model_cards/neuralspace-reverie/indic-transformers-bn-bert/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-bn-bert/README.md
@@ -0,0 +1,25 @@
 ---
 language: 
 - bn 
 tags:
 - MaskedLM
 - Bengali
 ---
 # Indic-Transformers Bengali BERT
 ## Model description
 This is a BERT language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
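The card does not say how token embeddings should be combined into one vector per sentence. A minimal sketch of one common choice, masked mean pooling, is shown below in plain Python on toy data. The names `mean_pool`, `toy_hidden`, and `toy_mask` are illustrative and not part of the model's API; with the real model the inputs could come from `out[0].tolist()` and `tokenizer(text, return_tensors='pt')['attention_mask'][0].tolist()`.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy stand-in for one sentence: 3 token positions, hidden size 2
# (the real model produces 768 values per token).
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]  # the last position is padding and must not affect the mean

sentence_vector = mean_pool(toy_hidden, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

The resulting fixed-size vector can then be fed to any lightweight classifier while the language model itself stays frozen.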
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
 text = "আপনি কেমন আছেন?"
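 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 6, 768]): (batch size, sequence length, hidden size)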
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
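The card does not list the exact commands the authors ran. As a hedged sketch, a PyTorch-to-TensorFlow conversion of this kind typically loads the PyTorch weights into a TensorFlow model class with `from_pt=True` and saves the result. The function name `convert_pytorch_checkpoint_to_tf` and the use of `TFAutoModel` are assumptions made for illustration; running it requires both PyTorch and TensorFlow to be installed.

```python
def convert_pytorch_checkpoint_to_tf(checkpoint_dir: str) -> None:
    """Load PyTorch weights from checkpoint_dir into a TF model and save the TF weights there."""
    # Imported lazily so that defining this helper does not require TensorFlow.
    from transformers import TFAutoModel

    # from_pt=True tells from_pretrained to read pytorch_model.bin and convert it.
    tf_model = TFAutoModel.from_pretrained(checkpoint_dir, from_pt=True)
    # Writes the TensorFlow weights file next to the PyTorch checkpoint.
    tf_model.save_pretrained(checkpoint_dir)
```

It would be called with the path to a local copy of the checkpoint, for example a hypothetical directory such as `./indic-transformers-bn-bert`.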
--- a/model_cards/neuralspace-reverie/indic-transformers-bn-distilbert/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-bn-distilbert/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - bn 
 tags:
 - MaskedLM
 - Bengali
 - DistilBERT
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Bengali DistilBERT
 ## Model description
 This is a DistilBERT language model pre-trained on a monolingual training corpus of ~6 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
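Feature-based training means the language model stays frozen and only a small classifier is trained on its embeddings. The sketch below illustrates the idea with a nearest-centroid classifier in plain Python. The 2-dimensional vectors and the labels are made-up placeholders standing in for 768-dimensional sentence embeddings, and `fit_centroids` and `predict` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict
from typing import Dict, List


def fit_centroids(vectors: List[List[float]], labels: List[str]) -> Dict[str, List[float]]:
    """Compute the per-label mean of the (frozen) sentence embeddings."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids: Dict[str, List[float]], vector: List[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], vector))


# Placeholder embeddings; in practice these would be computed once by the model.
train_vectors = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, 0.0]]
train_labels = ["positive", "positive", "negative", "negative"]

centroids = fit_centroids(train_vectors, train_labels)
print(predict(centroids, [0.7, 0.0]))  # positive
```

Any other classifier could replace the centroid step; the point is that the embeddings are computed once and never updated.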
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
 text = "আপনি কেমন আছেন?"
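 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)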
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-bn-roberta/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-bn-roberta/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - bn 
 tags:
 - MaskedLM
 - Bengali
 - RoBERTa
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Bengali RoBERTa
 ## Model description
 This is a RoBERTa language model pre-trained on a monolingual training corpus of ~6 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
 text = "আপনি কেমন আছেন?"
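 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 10, 768]): (batch size, sequence length, hidden size)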
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-bn-xlmroberta/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-bn-xlmroberta/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - bn 
 tags:
 - MaskedLM
 - Bengali
 - XLMRoBERTa
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Bengali XLMRoBERTa
 ## Model description
 This is an XLMRoBERTa language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
 text = "আপনি কেমন আছেন?"
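 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)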
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-hi-bert/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-hi-bert/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - hi 
 tags:
 - MaskedLM
 - Hindi
 - BERT
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Hindi BERT
 ## Model description
 This is a BERT language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
 text = "आपका स्वागत हैं"
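 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)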
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-hi-distilbert/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-hi-distilbert/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - hi 
 tags:
 - MaskedLM
 - Hindi
 - DistilBERT
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Hindi DistilBERT
 ## Model description
 This is a DistilBERT language model pre-trained on a monolingual training corpus of ~10 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
 text = "आपका स्वागत हैं"
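 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)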
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-hi-roberta/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-hi-roberta/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - hi 
 tags:
 - MaskedLM
 - Hindi
 - RoBERTa
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Hindi RoBERTa
 ## Model description
 This is a RoBERTa language model pre-trained on a monolingual training corpus of ~10 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
 text = "आपका स्वागत हैं"
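 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 11, 768]): (batch size, sequence length, hidden size)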
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-hi-xlmroberta/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-hi-xlmroberta/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - hi 
 tags:
 - MaskedLM
 - Hindi
 - XLMRoBERTa
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Hindi XLMRoBERTa
 ## Model description
 This is an XLMRoBERTa language model pre-trained on a monolingual training corpus of ~3 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
 text = "आपका स्वागत हैं"
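 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)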
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-te-bert/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-te-bert/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - te
 tags:
 - MaskedLM
 - Telugu
 - BERT
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Telugu BERT
 ## Model description
 This is a BERT language model pre-trained on a monolingual training corpus of ~1.6 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
 text = "మీరు ఎలా ఉన్నారు"
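 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)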
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-te-distilbert/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-te-distilbert/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - te
 tags:
 - MaskedLM
 - Telugu
 - DistilBERT
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Telugu DistilBERT
 ## Model description
 This is a DistilBERT language model pre-trained on a monolingual training corpus of ~2 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
 text = "మీరు ఎలా ఉన్నారు"
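 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)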
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-te-roberta/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-te-roberta/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - te
 tags:
 - MaskedLM
 - Telugu
 - RoBERTa
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Telugu RoBERTa
 ## Model description
 This is a RoBERTa language model pre-trained on a monolingual training corpus of ~2 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
 text = "మీరు ఎలా ఉన్నారు"
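 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 14, 768]): (batch size, sequence length, hidden size)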
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
--- a/model_cards/neuralspace-reverie/indic-transformers-te-xlmroberta/README.md
+++ b/model_cards/neuralspace-reverie/indic-transformers-te-xlmroberta/README.md
@@ -0,0 +1,29 @@
 ---
 language: 
 - te
 tags:
 - MaskedLM
 - Telugu
 - XLMRoBERTa
 - Question-Answering
 - Token Classification
 - Text Classification
 ---
 # Indic-Transformers Telugu XLMRoBERTa
 ## Model description
 This is an XLMRoBERTa language model pre-trained on a monolingual training corpus of ~1.6 GB. The pre-training data was taken mostly from [OSCAR](https://oscar-corpus.com/).
 This model can be fine-tuned on various downstream tasks such as text classification, POS tagging, and question answering. Embeddings from this model can also be used for feature-based training.
 ## Intended uses & limitations
 #### How to use
 ```
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
 model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
 text = "మీరు ఎలా ఉన్నారు"
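 # Tokenize the sentence and return PyTorch tensors.
 input_ids = tokenizer(text, return_tensors='pt')['input_ids']
 # The first element of the model output is the last hidden state.
 out = model(input_ids)[0]
 print(out.shape)
 # Prints torch.Size([1, 5, 768]): (batch size, sequence length, hidden size)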
 ```
 #### Limitations and bias
 The original language model was trained using PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).