From 4ab8ab4f50baf391612cbc78cfa3f09b7ad0c3ac Mon Sep 17 00:00:00 2001
From: Timo Moeller
Date: Fri, 3 Apr 2020 15:44:21 +0200
Subject: [PATCH] Adjust model card to reflect changes to vocabulary

(cherry picked from commit 8e25c4bf2838211378db4d93e7f9722386cc1a04)
---
 model_cards/bert-base-german-cased-README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/model_cards/bert-base-german-cased-README.md b/model_cards/bert-base-german-cased-README.md
index 35677ec9ce..f058bdd956 100644
--- a/model_cards/bert-base-german-cased-README.md
+++ b/model_cards/bert-base-german-cased-README.md
@@ -22,6 +22,7 @@ tags:
 - We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
 - As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
 - We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
+- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation. See https://deepset.ai/german-bert for more details