Model card: Literary German BERT (#2843)
* feat: create model card
* chore: add description
* feat: stats plot
* Delete prosa-jahre.svg
* feat: years plot (again)
* chore: add more details
* fix: typos
* feat: kfold plot
* feat: kfold plot
* Rename model_cards/severinsimmler/literary-german-bert.md to model_cards/severinsimmler/literary-german-bert/README.md
* Support for linked images + add tags

cc @severinsimmler

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
51
model_cards/severinsimmler/literary-german-bert/README.md
Normal file
@@ -0,0 +1,51 @@
---
language: german
thumbnail: kfold.png
---

# German BERT for literary texts

This German BERT is based on `bert-base-german-dbmdz-cased` and has been adapted to the domain of literary texts by fine-tuning on the language modeling task with the [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1). Afterwards, the model was fine-tuned for named entity recognition on the [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) corpus, so you can use it to recognize protagonists in German novels.
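
A minimal usage sketch with the `transformers` token-classification pipeline might look like the following. The Hub identifier `severinsimmler/literary-german-bert` is an assumption inferred from the model card path, and the helper names `load_tagger` and `person_tokens` are hypothetical. The filter assumes the pipeline reports the labels exactly as `B-PER` and `I-PER`.

```python
from typing import Dict, List

# Assumption: Hub identifier inferred from the model card path.
MODEL_NAME = "severinsimmler/literary-german-bert"


def load_tagger(model_name: str = MODEL_NAME):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    return pipeline("token-classification", model=model_name)


def person_tokens(predictions: List[Dict]) -> List[str]:
    """Keep the tokens whose predicted label is B-PER or I-PER."""
    return [p["word"] for p in predictions if p["entity"] in {"B-PER", "I-PER"}]
```

With the model available, `person_tokens(load_tagger()("Effi Briest ging in den Garten."))` would return the tokens tagged as person references. Without aggregation, the pipeline returns word pieces, so longer names may come back split into several entries.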

# Stats

## Language modeling

The [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1) consists of 3,194 documents with 203,516,988 tokens and 1,520,855 types. The publication years of the texts range from the 18th to the 20th century:

![Publication years of the texts in the corpus](prosa-jahre.png)

### Results

After one epoch:

| Model            | Perplexity |
| ---------------- | ---------- |
| Vanilla BERT     | 6.82       |
| Fine-tuned BERT  | 4.98       |
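
The card does not say how perplexity was computed. A common convention for BERT-style models, assumed here, is the exponential of the mean cross-entropy loss (natural log) over the masked tokens of the evaluation set. Under that assumption, perplexities of 6.82 and 4.98 would correspond to mean losses of roughly 1.92 and 1.61. The function name `perplexity` is illustrative.

```python
import math
from typing import Iterable


def perplexity(token_losses: Iterable[float]) -> float:
    """Exponentiate the mean per-token cross-entropy (natural log)."""
    losses = list(token_losses)
    if not losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(losses) / len(losses))
```

Because perplexity is a monotonic function of the mean loss, a lower value after fine-tuning means the adapted model assigns higher probability to the held-out literary text.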

## Named entity recognition

The model was then fine-tuned for two epochs on 10,799 training sentences, validated on 547 sentences, and tested on 1,845 sentences, using three labels: `B-PER`, `I-PER` and `O`.
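
To illustrate the three labels, here is a small sketch that groups per-token labels into person mentions. It assumes the usual BIO reading (`B-PER` opens a mention, `I-PER` continues it, `O` is outside). The function name `person_spans` and the example sentence are hypothetical.

```python
from typing import List, Tuple


def person_spans(tokens: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Group B-PER/I-PER labels into (start, end, text) spans; end is exclusive."""
    spans = []
    start = None  # index of the first token of the mention being built, if any
    for i, label in enumerate(labels):
        if label == "B-PER":
            # A new mention starts; close the previous one first.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif label == "I-PER":
            # Continue the open mention, or tolerate an I-PER without a B-PER.
            if start is None:
                start = i
        else:
            # "O" closes any open mention.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:
        spans.append((start, len(labels), " ".join(tokens[start:])))
    return spans
```

For the tokens `["Effi", "Briest", "sah", "Innstetten", "."]` with labels `["B-PER", "I-PER", "O", "B-PER", "O"]`, this yields the two mentions "Effi Briest" and "Innstetten".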

### Results

| Dataset | Precision | Recall | F1   |
| ------- | --------- | ------ | ---- |
| Dev     | 96.4      | 87.3   | 91.6 |
| Test    | 92.8      | 94.9   | 93.8 |
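
As a quick consistency check, the F1 column can be recomputed from precision and recall as their harmonic mean. This is a sketch; the function name `f1_score` is illustrative and the values are copied from the table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


dev_f1 = f1_score(96.4, 87.3)
test_f1 = f1_score(92.8, 94.9)
print(round(dev_f1, 1), round(test_f1, 1))  # prints: 91.6 93.8
```

Both rounded values match the reported F1 scores.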

The model has also been evaluated using 10-fold cross-validation and compared with a classic conditional random field (CRF) baseline described in [Jannidis et al.](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf) (2015):

![10-fold cross-validation results compared with the CRF baseline](kfold.png)
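
The card does not describe how the ten folds were built (for example, whether data was split by sentence or by document, or whether it was shuffled). As a generic illustration of k-fold cross-validation only, the sketch below partitions item indices into contiguous folds without shuffling; the function name `k_fold_indices` is hypothetical.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each item is in exactly one test fold."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # The first n_items % k folds get one extra item.
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Each model is trained on the `train` indices of one fold and scored on its `test` indices, and the ten scores are then aggregated.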

# References

Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, Fotis Jannidis, [Description of a Corpus of Character References in German Novels](http://webdoc.sub.gwdg.de/pub/mon/dariah-de/dwp-2018-27.pdf), 2018.

Fotis Jannidis, Isabella Reger, Lukas Weimer, Markus Krug, Martin Toepfer, Frank Puppe, [Automatische Erkennung von Figuren in deutschsprachigen Romanen](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf), 2015.
BIN
model_cards/severinsimmler/literary-german-bert/kfold.png
Normal file
Binary file not shown.
Size: 5.6 KiB
BIN
model_cards/severinsimmler/literary-german-bert/prosa-jahre.png
Normal file
Binary file not shown.
Size: 21 KiB