From 518f291eef9f246e26a3932971cb3970b7d0b11a Mon Sep 17 00:00:00 2001
From: Nick Doiron
Date: Sun, 26 Apr 2020 17:30:48 -0400
Subject: [PATCH] add model card for Hindi-BERT

---
 model_cards/monsoon-nlp/hindi-bert/README.md | 68 ++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 model_cards/monsoon-nlp/hindi-bert/README.md

diff --git a/model_cards/monsoon-nlp/hindi-bert/README.md b/model_cards/monsoon-nlp/hindi-bert/README.md
new file mode 100644
index 0000000000..4070cbcf36
--- /dev/null
+++ b/model_cards/monsoon-nlp/hindi-bert/README.md
@@ -0,0 +1,68 @@
+---
+language: Hindi
+---
+
+# Hindi-BERT (Discriminator)
+
+This is a first run of a Hindi language model trained with Google Research's [ELECTRA](https://github.com/google-research/electra). **I don't modify ELECTRA until we get into finetuning**
+
+Tokenization and training CoLab: https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_
+
+Blog post: https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81
+
+Greatly influenced by: https://huggingface.co/blog/how-to-train
+
+## Corpus
+
+Download: https://drive.google.com/drive/u/1/folders/1WikYHHMI72hjZoCQkLPr45LDV8zm9P7p
+
+The corpus is two files:
+- Hindi CommonCrawl deduped by OSCAR https://traces1.inria.fr/oscar/
+- latest Hindi Wikipedia ( https://dumps.wikimedia.org/hiwiki/20200420/ ) + WikiExtractor to txt
+
+Bonus notes:
+- Adding English wiki text or parallel corpus could help with cross-lingual tasks and training
+
+## Vocabulary
+
+https://drive.google.com/file/d/1-02Um-8ogD4vjn4t-wD2EwCE-GtBjnzh/view?usp=sharing
+
+Bonus notes:
+- Created with HuggingFace Tokenizers; could be longer or shorter, review ELECTRA vocab_size param
+
+## Pretrain TF Records
+
+[build_pretraining_dataset.py](https://github.com/google-research/electra/blob/master/build_pretraining_dataset.py) splits the corpus into training documents
+
+Set the ELECTRA model size and whether to split the corpus by newlines. This process can take hours on its own.
+
+https://drive.google.com/drive/u/1/folders/1--wBjSH59HSFOVkYi4X-z5bigLnD32R5
+
+Bonus notes:
+- I am not sure of the meaning of the corpus newline split (what is the alternative?) and given this corpus, which creates the better training docs
+
+## Training
+
+Structure your files, with data-dir named "trainer" here
+
+```
+trainer
+- vocab.txt
+- pretrain_tfrecords
+-- (all .tfrecord... files)
+- models
+-- modelname
+--- checkpoint
+--- graph.pbtxt
+--- model.*
+```
+
+CoLab notebook gives examples of GPU vs. TPU setup
+
+[configure_pretraining.py](https://github.com/google-research/electra/blob/master/configure_pretraining.py)
+
+Model https://drive.google.com/drive/folders/1cwQlWryLE4nlke4OixXA7NK8hzlmUR0c?usp=sharing
+
+## Using this model with Transformers
+
+Sample movie reviews classifier: https://colab.research.google.com/drive/1mSeeSfVSOT7e-dVhPlmSsQRvpn6xC05w
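
For the "Using this model with Transformers" section, a minimal sketch of loading the discriminator directly might look like the following. It is not taken from the linked CoLab. The Hub ID `monsoon-nlp/hindi-bert` is an assumption inferred from the model card path, the helper names `mark_replaced` and `flag_replaced_tokens` are hypothetical, and the `.logits` attribute assumes a recent Transformers release (v4 or later) with PyTorch weights available for the model.

```python
MODEL_ID = "monsoon-nlp/hindi-bert"  # assumption: Hub ID inferred from the model card path


def mark_replaced(tokens, logits):
    """Pair each token with True when its logit is positive (predicted 'replaced')."""
    return [(tok, float(score) > 0.0) for tok, score in zip(tokens, logits)]


def flag_replaced_tokens(text, model_id=MODEL_ID):
    """Run the ELECTRA discriminator head over `text` (hypothetical helper)."""
    # Heavy imports stay inside the function so importing this file is cheap.
    import torch
    from transformers import AutoTokenizer, ElectraForPreTraining

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = ElectraForPreTraining.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # one logit per input token

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return mark_replaced(tokens, logits)


# Example call (downloads the tokenizer and weights on first use):
# flag_replaced_tokens("भारत एक विशाल देश है")
```

For a downstream task such as the movie review classifier, `ElectraForSequenceClassification` would likely be the class to load instead, since the discriminator head only predicts whether each token was replaced.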