From 30a09f382726b667f7ab7334d5b24452c72ffadb Mon Sep 17 00:00:00 2001
From: Timo Moeller
Date: Wed, 20 May 2020 22:08:29 +0200
Subject: [PATCH] Adjust german bert model card, add new model card (#4488)

---
 model_cards/bert-base-german-cased-README.md  |  5 +++-
 .../bert-base-german-cased-oldvocab/README.md | 28 +++++++++++++++++++
 2 files changed, 32 insertions(+), 1 deletion(-)
 create mode 100644 model_cards/deepset/bert-base-german-cased-oldvocab/README.md

diff --git a/model_cards/bert-base-german-cased-README.md b/model_cards/bert-base-german-cased-README.md
index d719842421..324093de33 100644
--- a/model_cards/bert-base-german-cased-README.md
+++ b/model_cards/bert-base-german-cased-README.md
@@ -18,13 +18,16 @@ tags:
 **Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)
 **Infrastructure**: 1x TPU v2
 **Published**: Jun 14th, 2019
+
+**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens.
+For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a ["deepset/bert-base-german-cased-oldvocab"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.
 
 ## Details
 - We trained using Google's Tensorflow code on a single cloud TPU v2 with standard settings.
 - We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
 - As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
 - We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
-- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation.
+
 
 See https://deepset.ai/german-bert for more details
 
diff --git a/model_cards/deepset/bert-base-german-cased-oldvocab/README.md b/model_cards/deepset/bert-base-german-cased-oldvocab/README.md
new file mode 100644
index 0000000000..6abaec0473
--- /dev/null
+++ b/model_cards/deepset/bert-base-german-cased-oldvocab/README.md
@@ -0,0 +1,28 @@
+---
+language: german
+thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png
+tags:
+- exbert
+---
+
+
+
+
+
+# German BERT with old vocabulary
+For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60).
+
+
+## About us
+![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)
+
+We bring NLP to the industry via open source!
+Our focus: Industry specific language models & large scale QA systems.
+
+Some of our work:
+- [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
+- [FARM](https://github.com/deepset-ai/FARM)
+- [Haystack](https://github.com/deepset-ai/haystack/)
+
+Get in touch:
+[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)
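
Note on the two vocabularies: the patch says the updated vocabulary changes how punctuation is tokenized, but it does not show the effect. The following is a minimal sketch of one way to check it, assuming the Hugging Face `transformers` package is installed and that both model names from the patch resolve on the model hub. The helper names `compare_tokenizations`, `load_tokenize`, and `main`, as well as the sample sentences, are hypothetical and not part of the patch.

```python
from typing import Callable, List, Tuple

# A tokenizer is modelled as any function from a string to a list of tokens.
Tokenize = Callable[[str], List[str]]


def compare_tokenizations(
    tokenize_new: Tokenize,
    tokenize_old: Tokenize,
    sentences: List[str],
) -> List[Tuple[str, List[str], List[str]]]:
    """Return (sentence, new_tokens, old_tokens) for each sentence
    that the two tokenizers split differently."""
    differences = []
    for sentence in sentences:
        new_tokens = tokenize_new(sentence)
        old_tokens = tokenize_old(sentence)
        if new_tokens != old_tokens:
            differences.append((sentence, new_tokens, old_tokens))
    return differences


def load_tokenize(model_name: str) -> Tokenize:
    """Load a tokenizer from the model hub and return its tokenize method."""
    # Imported lazily so compare_tokenizations works without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name).tokenize


def main() -> None:
    # Model names are taken from the patch; the sentences are made up.
    new_tok = load_tokenize("bert-base-german-cased")
    old_tok = load_tokenize("deepset/bert-base-german-cased-oldvocab")
    samples = ["Hallo, Welt!", "Das kostet 3,50 Euro (inkl. MwSt.)."]
    for sentence, new_tokens, old_tokens in compare_tokenizations(
        new_tok, old_tok, samples
    ):
        print(sentence)
        print("  new vocab:", new_tokens)
        print("  old vocab:", old_tokens)
```

Calling `main()` downloads both tokenizers and prints only the sentences on which they disagree. Which sentences differ, and how, depends on the published vocabulary files, so no output is shown here.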