From b1c06140f4b2ac188d625fe9d34bc32ec2e36078 Mon Sep 17 00:00:00 2001
From: Bobby Donchev
Date: Wed, 7 Oct 2020 22:46:03 +0200
Subject: [PATCH] Create README.md for IsRoBERTa language model (#7640)

* Create README.md

* Update README.md

* Apply suggestions from code review

Co-authored-by: Julien Chaumond
---
 model_cards/neurocode/IsRoBERTa/README.md | 74 +++++++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 model_cards/neurocode/IsRoBERTa/README.md

diff --git a/model_cards/neurocode/IsRoBERTa/README.md b/model_cards/neurocode/IsRoBERTa/README.md
new file mode 100644
index 0000000000..b56c9296f6
--- /dev/null
+++ b/model_cards/neurocode/IsRoBERTa/README.md
@@ -0,0 +1,74 @@
+---
+language: is
+datasets:
+- Icelandic portion of the OSCAR corpus from INRIA
+- oscar
+---
+
+# IsRoBERTa: a RoBERTa-like masked language model
+
+Probably the first Icelandic transformer language model!
+
+## Overview
+**Language:** Icelandic
+**Downstream-task:** masked-lm
+**Training data:** OSCAR corpus
+**Code:** See [here](https://github.com/neurocode-io/icelandic-language-model)
+**Infrastructure**: 1x Nvidia K80
+
+## Hyperparameters
+
+```
+per_device_train_batch_size = 48
+n_epochs = 1
+vocab_size = 52000
+max_position_embeddings = 514
+num_attention_heads = 12
+num_hidden_layers = 6
+type_vocab_size = 1
+learning_rate = 0.00005
+```
+
+
+## Usage
+
+### In Transformers
+```python
+from transformers import (
+    pipeline,
+    AutoTokenizer,
+    AutoModelWithLMHead
+)
+
+model_name = "neurocode/IsRoBERTa"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelWithLMHead.from_pretrained(model_name)
+>>> fill_mask = pipeline(
+...     "fill-mask",
+...     model=model,
+...     tokenizer=tokenizer
+... )
+>>> result = fill_mask("Hann fór út að <mask>.")
+>>> result
+[
+  {'sequence': 'Hann fór út að nýju.', 'score': 0.03395755589008331, 'token': 2219, 'token_str': 'Ġnýju'},
+  {'sequence': 'Hann fór út að undanförnu.', 'score': 0.029087543487548828, 'token': 7590, 'token_str': 'Ġundanförnu'},
+  {'sequence': 'Hann fór út að lokum.', 'score': 0.024420788511633873, 'token': 4384, 'token_str': 'Ġlokum'},
+  {'sequence': 'Hann fór út að þessu.', 'score': 0.021231256425380707, 'token': 921, 'token_str': 'Ġþessu'},
+  {'sequence': 'Hann fór út að honum.', 'score': 0.0205782949924469, 'token': 1136, 'token_str': 'Ġhonum'}
+]
+```
+
+
+## Authors
+Bobby Donchev: `contact [at] donchev.is`
+Elena Cramer: `elena.cramer [at] neurocode.io`
+
+## About us
+
+We bring AI software live for our customers.
+Our focus: AI software development
+
+Get in touch:
+[LinkedIn](https://de.linkedin.com/company/neurocodeio) | [Website](https://neurocode.io)