From 4753816e391decc6b97b0ce6a679d62e839aa5da Mon Sep 17 00:00:00 2001
From: Sagor Sarker
Date: Sat, 12 Sep 2020 01:17:25 +0600
Subject: [PATCH] added bangla-bert-base model card and also modified other model cards (#7071)

* added bangla-bert-base

* Apply suggestions from code review

Co-authored-by: Julien Chaumond
---
 .../sagorsarker/bangla-bert-base/README.md    | 101 ++++++++++++++++++
 .../codeswitch-hineng-lid-lince/README.md     |   2 +-
 .../codeswitch-hineng-ner-lince/README.md     |   2 +-
 .../codeswitch-hineng-pos-lince/README.md     |   2 +-
 .../codeswitch-nepeng-lid-lince/README.md     |   2 +-
 .../codeswitch-spaeng-lid-lince/README.md     |   2 +-
 .../codeswitch-spaeng-ner-lince/README.md     |   2 +-
 .../codeswitch-spaeng-pos-lince/README.md     |   2 +-
 .../README.md                                 |   2 +-
 9 files changed, 109 insertions(+), 8 deletions(-)
 create mode 100644 model_cards/sagorsarker/bangla-bert-base/README.md

diff --git a/model_cards/sagorsarker/bangla-bert-base/README.md b/model_cards/sagorsarker/bangla-bert-base/README.md
new file mode 100644
index 0000000000..de7bb7287b
--- /dev/null
+++ b/model_cards/sagorsarker/bangla-bert-base/README.md
@@ -0,0 +1,101 @@
+---
+language: bn
+tags:
+- bert
+- bengali
+- bengali-lm
+- bangla
+license: MIT
+datasets:
+- common_crawl
+- wikipedia
+- oscar
+---
+
+
+# Bangla BERT Base
+It has been a long journey. Here is our **Bangla-Bert**! It is now available on the Hugging Face model hub.
+
+[Bangla-Bert-Base](https://github.com/sagorbrur/bangla-bert) is a pretrained language model for Bengali, trained with the masked language modeling objective described in [BERT](https://arxiv.org/abs/1810.04805) and its GitHub [repository](https://github.com/google-research/bert).
+
+
+
+## Pretrain Corpus Details
+The corpus was downloaded from two main sources:
+
+* Bengali Common Crawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
+* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)
+
+After downloading these corpora, we preprocessed them into BERT format: one sentence per line, with an extra blank line between documents.
+
+```
+sentence 1
+sentence 2
+
+sentence 1
+sentence 2
+
+```
+
+## Building Vocab
+We used the [BNLP](https://github.com/sagorbrur/bnlp) package to train a Bengali SentencePiece model with a vocabulary size of 102025. We then converted the output vocab file to BERT format.
+Our final vocab file is available at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert) and also on the [huggingface](https://huggingface.co/sagorsarker/bangla-bert-base) model hub.
+
+## Training Details
+* Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)
+* The currently released model follows the bert-base-uncased architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
+* Total Training Steps: 1 Million
+* The model was trained on a single Google Cloud TPU
+
+## Evaluation Results
+
+After training for 1 million steps, the evaluation results are:
+
+```
+global_step = 1000000
+loss = 2.2406516
+masked_lm_accuracy = 0.60641736
+masked_lm_loss = 2.201459
+next_sentence_accuracy = 0.98625
+next_sentence_loss = 0.040997364
+perplexity = numpy.exp(2.2406516) = 9.393331287442784
+Loss for final step: 2.426227
+
+
+```
+
+**NB: If you use this model for any NLP task, please share your evaluation results with us. We will add them here.**
+
+
+## How to Use
+You can use this model directly with a pipeline for masked language modeling:
+
+```py
+from transformers import BertForMaskedLM, BertTokenizer, pipeline
+
+model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
+tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
+nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
+    print(pred)
+
+# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
+
+```
+
+
+## Author
+[Sagor Sarker](https://github.com/sagorbrur)
+
+## Acknowledgements
+
+* Thanks to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you!
+* Thanks to all the people around us who are always helping us build something for Bengali.
+
+## Reference
+* https://github.com/google-research/bert
+
+
+
+
+
diff --git a/model_cards/sagorsarker/codeswitch-hineng-lid-lince/README.md b/model_cards/sagorsarker/codeswitch-hineng-lid-lince/README.md
index 490826d3ef..f31169005b 100644
--- a/model_cards/sagorsarker/codeswitch-hineng-lid-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-hineng-lid-lince/README.md
@@ -3,7 +3,7 @@ language:
 - hi
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching
diff --git a/model_cards/sagorsarker/codeswitch-hineng-ner-lince/README.md b/model_cards/sagorsarker/codeswitch-hineng-ner-lince/README.md
index 161a23547d..4fdda8412f 100644
--- a/model_cards/sagorsarker/codeswitch-hineng-ner-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-hineng-ner-lince/README.md
@@ -3,7 +3,7 @@ language:
 - hi
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching
diff --git a/model_cards/sagorsarker/codeswitch-hineng-pos-lince/README.md b/model_cards/sagorsarker/codeswitch-hineng-pos-lince/README.md
index 1cf4845966..bea61d9b84 100644
--- a/model_cards/sagorsarker/codeswitch-hineng-pos-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-hineng-pos-lince/README.md
@@ -3,7 +3,7 @@ language:
 - hi
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching
diff --git a/model_cards/sagorsarker/codeswitch-nepeng-lid-lince/README.md b/model_cards/sagorsarker/codeswitch-nepeng-lid-lince/README.md
index 0c9e2840b6..36e1c96e53 100644
--- a/model_cards/sagorsarker/codeswitch-nepeng-lid-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-nepeng-lid-lince/README.md
@@ -3,7 +3,7 @@ language:
 - ne
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching
diff --git a/model_cards/sagorsarker/codeswitch-spaeng-lid-lince/README.md b/model_cards/sagorsarker/codeswitch-spaeng-lid-lince/README.md
index 4930ea6411..c91e2c800a 100644
--- a/model_cards/sagorsarker/codeswitch-spaeng-lid-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-spaeng-lid-lince/README.md
@@ -3,7 +3,7 @@ language:
 - es
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching
diff --git a/model_cards/sagorsarker/codeswitch-spaeng-ner-lince/README.md b/model_cards/sagorsarker/codeswitch-spaeng-ner-lince/README.md
index 8b2a979565..9ad3c58900 100644
--- a/model_cards/sagorsarker/codeswitch-spaeng-ner-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-spaeng-ner-lince/README.md
@@ -3,7 +3,7 @@ language:
 - es
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching
diff --git a/model_cards/sagorsarker/codeswitch-spaeng-pos-lince/README.md b/model_cards/sagorsarker/codeswitch-spaeng-pos-lince/README.md
index 73ca2f139d..f335e92ef3 100644
--- a/model_cards/sagorsarker/codeswitch-spaeng-pos-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-spaeng-pos-lince/README.md
@@ -3,7 +3,7 @@ language:
 - es
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching
diff --git a/model_cards/sagorsarker/codeswitch-spaeng-sentiment-analysis-lince/README.md b/model_cards/sagorsarker/codeswitch-spaeng-sentiment-analysis-lince/README.md
index e797232390..6fab7b606e 100644
--- a/model_cards/sagorsarker/codeswitch-spaeng-sentiment-analysis-lince/README.md
+++ b/model_cards/sagorsarker/codeswitch-spaeng-sentiment-analysis-lince/README.md
@@ -3,7 +3,7 @@ language:
 - es
 - en
 datasets:
-- LinCE
+- lince
 license: "MIT"
 tags:
 - codeswitching