Added 12 model cards for Indian Language Models (#8198)
* Create README.md
* added model cards
@@ -0,0 +1,25 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
---
# Indic-Transformers Bengali BERT
## Model description
This is a BERT language model pre-trained on ~3 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
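As a rough illustration of the feature-based route, the sketch below mean-pools the last hidden state into one fixed-size sentence vector. Mean pooling is an assumption made for this example, not something the card prescribes, and `masked_mean` and `sentence_embedding` are hypothetical helper names.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def sentence_embedding(text, model_name="neuralspace-reverie/indic-transformers-bn-bert"):
    """Return one fixed-size vector for `text` (downloads the model on first use)."""
    # Imported lazily so masked_mean stays usable without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded)[0]  # (1, tokens, hidden size)
    # Pool in plain Python so the arithmetic stays visible.
    return masked_mean(hidden[0].tolist(), encoded["attention_mask"][0].tolist())
```

The vector returned by `sentence_embedding("আপনি কেমন আছেন?")` could then be passed to any lightweight classifier as a fixed feature.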
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-bert')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 6, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali DistilBERT
## Model description
This is a DistilBERT language model pre-trained on ~6 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
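The framework recommendation above could be applied as in this minimal sketch, assuming the repository ships both `pytorch_model.bin` and the h5 weights. `load_backbone` is a hypothetical helper; `AutoModel` and `TFAutoModel` are the usual `transformers` entry points, and `TFAutoModel` is only importable in installations with TensorFlow support.

```python
def load_backbone(model_name, framework="pt"):
    """Load the encoder with PyTorch ("pt", recommended) or TensorFlow ("tf")."""
    if framework == "pt":
        from transformers import AutoModel  # lazy import: needs torch

        return AutoModel.from_pretrained(model_name)  # reads pytorch_model.bin
    if framework == "tf":
        from transformers import TFAutoModel  # lazy import: needs tensorflow

        return TFAutoModel.from_pretrained(model_name)  # reads the h5 weights
    raise ValueError(f"unknown framework: {framework!r}")
```

Calling `load_backbone('neuralspace-reverie/indic-transformers-bn-distilbert')` follows the recommended PyTorch path; pass `framework="tf"` only when a TensorFlow model is required.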
@@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali RoBERTa
## Model description
This is a RoBERTa language model pre-trained on ~6 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-roberta')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 10, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- bn
tags:
- MaskedLM
- Bengali
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Bengali XLMRoBERTa
## Model description
This is an XLMRoBERTa language model pre-trained on ~3 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-xlmroberta')
text = "আপনি কেমন আছেন?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- BERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi BERT
## Model description
This is a BERT language model pre-trained on ~3 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-bert')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi DistilBERT
## Model description
This is a DistilBERT language model pre-trained on ~10 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-distilbert')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi RoBERTa
## Model description
This is a RoBERTa language model pre-trained on ~10 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-roberta')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 11, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
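Since the card is tagged `MaskedLM`, masked-token prediction is a natural quick check. The sketch below assumes the published checkpoint includes the masked-language-modelling head, which the card does not state explicitly. `mask_word` and `predict_masked` are hypothetical helpers; the mask token is read from the tokenizer rather than hard-coded, because it differs between model families.

```python
def mask_word(text, word, mask_token):
    """Replace the first occurrence of `word` in `text` with `mask_token`."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, mask_token, 1)


def predict_masked(text, word,
                   model_name="neuralspace-reverie/indic-transformers-hi-roberta",
                   top_k=5):
    """Return the top candidate tokens for the masked word (downloads the model)."""
    from transformers import pipeline  # lazy import: needs network access

    fill = pipeline("fill-mask", model=model_name)
    prompt = mask_word(text, word, fill.tokenizer.mask_token)
    # Each prediction is a dict; "token_str" holds the proposed token.
    return [candidate["token_str"] for candidate in fill(prompt, top_k=top_k)]
```

For instance, `predict_masked("आपका स्वागत हैं", "स्वागत")` would list the model's candidates for the hidden word.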
@@ -0,0 +1,29 @@
---
language:
- hi
tags:
- MaskedLM
- Hindi
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Hindi XLMRoBERTa
## Model description
This is an XLMRoBERTa language model pre-trained on ~3 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-hi-xlmroberta')
text = "आपका स्वागत हैं"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- BERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu BERT
## Model description
This is a BERT language model pre-trained on ~1.6 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-bert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- DistilBERT
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu DistilBERT
## Model description
This is a DistilBERT language model pre-trained on ~2 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-distilbert')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- RoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu RoBERTa
## Model description
This is a RoBERTa language model pre-trained on ~2 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-roberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 14, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
@@ -0,0 +1,29 @@
---
language:
- te
tags:
- MaskedLM
- Telugu
- XLMRoBERTa
- Question-Answering
- Token Classification
- Text Classification
---
# Indic-Transformers Telugu XLMRoBERTa
## Model description
This is an XLMRoBERTa language model pre-trained on ~1.6 GB of monolingual text. Most of the pre-training data comes from [OSCAR](https://oscar-corpus.com/).
The model can be fine-tuned on downstream tasks such as text classification, POS tagging, and question answering. Its embeddings can also be used for feature-based training.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-te-xlmroberta')
text = "మీరు ఎలా ఉన్నారు"
input_ids = tokenizer(text, return_tensors='pt')['input_ids']
out = model(input_ids)[0]  # last hidden state: (batch, tokens, hidden size)
print(out.shape)
# torch.Size([1, 5, 768])
```
#### Limitations and bias
The original language model was trained with PyTorch, so the `pytorch_model.bin` weights file is recommended. The h5 file for TensorFlow was generated manually using the commands suggested [here](https://huggingface.co/transformers/model_sharing.html).
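The twelve cards in this commit share one repository naming pattern (three languages, four architectures), so the identifiers can be generated rather than typed out. The sketch below only builds the names seen in the cards; `model_ids` is a hypothetical helper and nothing is downloaded.

```python
from itertools import product

LANGUAGES = ("bn", "hi", "te")  # Bengali, Hindi, Telugu
ARCHITECTURES = ("bert", "distilbert", "roberta", "xlmroberta")


def model_ids(org="neuralspace-reverie"):
    """Return the repository ids used in the cards, language by language."""
    return [
        f"{org}/indic-transformers-{lang}-{arch}"
        for lang, arch in product(LANGUAGES, ARCHITECTURES)
    ]
```

Looping over `model_ids()` and passing each entry to `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained` would reproduce the usage snippets of the individual cards.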