From e3c55ceb8d78412919c1c2c0dc511a7c492e907d Mon Sep 17 00:00:00 2001
From: David Mark Nemeskey
Date: Wed, 2 Sep 2020 10:50:10 +0200
Subject: [PATCH] Model card for huBERT (#6893)

* Create README.md

Model card for huBERT.

* Update README.md

lowercase h

* Update model_cards/SZTAKI-HLT/hubert-base-cc/README.md

Co-authored-by: Julien Chaumond
---
 .../SZTAKI-HLT/hubert-base-cc/README.md | 43 +++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 model_cards/SZTAKI-HLT/hubert-base-cc/README.md

diff --git a/model_cards/SZTAKI-HLT/hubert-base-cc/README.md b/model_cards/SZTAKI-HLT/hubert-base-cc/README.md
new file mode 100644
index 0000000000..8ecfd1fc95
--- /dev/null
+++ b/model_cards/SZTAKI-HLT/hubert-base-cc/README.md
@@ -0,0 +1,43 @@
+---
+language: hu
+license: apache-2.0
+datasets:
+- common_crawl
+- wikipedia
+---
+
+# huBERT base model (cased)
+
+## Model description
+
+Cased BERT model for Hungarian, trained on the (filtered, deduplicated) Hungarian subset of the Common Crawl and a snapshot of the Hungarian Wikipedia.
+
+## Intended uses & limitations
+
+The model can be used like any other (cased) BERT model. It has been tested on chunking and
+named entity recognition, and set a new state of the art on the former.
+
+## Training
+
+Details of the training data and procedure can be found in the PhD thesis cited below. Note that the thesis only contains preliminary results
+based on the Wikipedia subcorpus; evaluation of the full model will appear in a future paper.
+
+## Eval results
+
+When fine-tuned (via `BertForTokenClassification`) on chunking and NER, the model outperforms multilingual BERT, achieves state-of-the-art results on the
+former task and comes within 0.5% F1 of the SotA on the latter. The exact scores are:
+
+| NER | Minimal NP | Maximal NP |
+|-----|------------|------------|
+| 97.62% | **97.14%** | **96.97%** |
+
+### BibTeX entry and citation info
+
+```bibtex
+@PhDThesis{ Nemeskey:2020,
+  author = {Nemeskey, Dávid Márk},
+  title = {Natural Language Processing Methods for Language Modeling},
+  year = {2020},
+  school = {E\"otv\"os Lor\'and University}
+}
+```
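
The card states that the model can be used like any other cased BERT model and that it was fine-tuned via `BertForTokenClassification`. The following is a minimal loading sketch under stated assumptions, not the author's training setup: the Hub identifier `SZTAKI-HLT/hubert-base-cc` is inferred from the model card path in this patch, the label set `CHUNK_LABELS` is hypothetical (the card does not give the tag set), and `build_label_maps` and `load_for_token_classification` are illustrative helper names.

```python
from typing import Dict, List, Tuple

# Assumption: Hub identifier inferred from the model card path in this patch.
MODEL_ID = "SZTAKI-HLT/hubert-base-cc"

# Hypothetical BIO label set for NP chunking; the card does not list the real tags.
CHUNK_LABELS = ["O", "B-NP", "I-NP"]


def build_label_maps(labels: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Return (label2id, id2label) mappings for a token classification head."""
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def load_for_token_classification(labels: List[str]):
    """Load the tokenizer and the model with a fresh, untrained classification head."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import BertForTokenClassification, BertTokenizer

    label2id, id2label = build_label_maps(labels)
    # The model is cased, so lowercasing is disabled explicitly.
    tokenizer = BertTokenizer.from_pretrained(MODEL_ID, do_lower_case=False)
    model = BertForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_token_classification(CHUNK_LABELS)` requires the `transformers` package and network access to download the weights. The classification head it returns is randomly initialized, so the scores in the table are only reachable after fine-tuning on labeled data.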