From 8a2d9bc9ef38452e80ce872505a5ad5623c12657 Mon Sep 17 00:00:00 2001
From: Julien Chaumond
Date: Thu, 5 Mar 2020 09:34:43 -0500
Subject: [PATCH] Add model cards for DeepPavlov models (#3138)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* add empty model cards for every current DeepPavlov model
* fix: replace cyrillic `с` with `c`
* docs: add model cards for current DeepPavlov BERT models
* docs: add links for arXiv preprints
---
 .../bert-base-bg-cs-pl-ru-cased/README.md     | 18 +++++++++++++++
 .../bert-base-cased-conversational/README.md  | 23 +++++++++++++++++++
 .../README.md                                 | 22 ++++++++++++++++++
 .../README.md                                 | 18 +++++++++++++++
 .../rubert-base-cased-sentence/README.md      | 21 +++++++++++++++++
 .../DeepPavlov/rubert-base-cased/README.md    | 14 +++++++++++
 6 files changed, 116 insertions(+)
 create mode 100644 model_cards/DeepPavlov/bert-base-bg-cs-pl-ru-cased/README.md
 create mode 100644 model_cards/DeepPavlov/bert-base-cased-conversational/README.md
 create mode 100644 model_cards/DeepPavlov/bert-base-multilingual-cased-sentence/README.md
 create mode 100644 model_cards/DeepPavlov/rubert-base-cased-conversational/README.md
 create mode 100644 model_cards/DeepPavlov/rubert-base-cased-sentence/README.md
 create mode 100644 model_cards/DeepPavlov/rubert-base-cased/README.md

diff --git a/model_cards/DeepPavlov/bert-base-bg-cs-pl-ru-cased/README.md b/model_cards/DeepPavlov/bert-base-bg-cs-pl-ru-cased/README.md
new file mode 100644
index 0000000000..7e4aa0c461
--- /dev/null
+++ b/model_cards/DeepPavlov/bert-base-bg-cs-pl-ru-cased/README.md
@@ -0,0 +1,18 @@
+---
+language:
+- bulgarian
+- czech
+- polish
+- russian
+---
+
+# bert-base-bg-cs-pl-ru-cased
+
+SlavicBERT\[1\] \(Slavic \(bg, cs, pl, ru\), cased, 12-layer, 768-hidden, 12-heads, 180M parameters\) was trained
+on Russian News and four Wikipedias: Bulgarian, Czech, Polish, and Russian.
+Subtoken vocabulary was built using this data.
Multilingual BERT was used as an initialization for SlavicBERT.
+
+
+\[1\]: Arkhipov M., Trofimova M., Kuratov Y., Sorokin A. \(2019\).
+[Tuning Multilingual Transformers for Language-Specific Named Entity Recognition](https://www.aclweb.org/anthology/W19-3712/).
+ACL anthology W19-3712.
diff --git a/model_cards/DeepPavlov/bert-base-cased-conversational/README.md b/model_cards/DeepPavlov/bert-base-cased-conversational/README.md
new file mode 100644
index 0000000000..357527d232
--- /dev/null
+++ b/model_cards/DeepPavlov/bert-base-cased-conversational/README.md
@@ -0,0 +1,23 @@
+---
+language:
+- english
+---
+
+# bert-base-cased-conversational
+
+Conversational BERT \(English, cased, 12-layer, 768-hidden, 12-heads, 110M parameters\) was trained
+on the English part of Twitter, Reddit, DailyDialogues\[1\], OpenSubtitles\[2\], Debates\[3\], Blogs\[4\],
+Facebook News Comments. We used this training data to build the vocabulary of English subtokens and took
+English cased version of BERT-base as an initialization for English Conversational BERT.
+
+
+\[1\]: Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A Manually Labelled
+Multi-turn Dialogue Dataset. IJCNLP 2017.
+
+\[2\]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
+In Proceedings of the 10th International Conference on Language Resources and Evaluation \(LREC 2016\)
+
+\[3\]: Justine Zhang, Ravi Kumar, Sujith Ravi, Cristian Danescu-Niculescu-Mizil. Proceedings of NAACL, 2016.
+
+\[4\]: J. Schler, M. Koppel, S. Argamon and J. Pennebaker \(2006\). Effects of Age and Gender on Blogging
+in Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.
diff --git a/model_cards/DeepPavlov/bert-base-multilingual-cased-sentence/README.md b/model_cards/DeepPavlov/bert-base-multilingual-cased-sentence/README.md
new file mode 100644
index 0000000000..1e07210e77
--- /dev/null
+++ b/model_cards/DeepPavlov/bert-base-multilingual-cased-sentence/README.md
@@ -0,0 +1,22 @@
+---
+language:
+- multilingual
+---
+
+# bert-base-multilingual-cased-sentence
+
+Sentence Multilingual BERT \(101 languages, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\)
+is a representation-based sentence encoder for 101 languages of Multilingual BERT.
+It is initialized with Multilingual BERT and then fine-tuned on english MultiNLI\[1\] and on dev set
+of multilingual XNLI\[2\].
+Sentence representations are mean pooled token embeddings in the same manner as in Sentence-BERT\[3\].
+
+
+\[1\]: Williams A., Nangia N. & Bowman S. \(2017\) A Broad-Coverage Challenge Corpus for Sentence Understanding
+through Inference. arXiv preprint [arXiv:1704.05426](https://arxiv.org/abs/1704.05426)
+
+\[2\]: Williams A., Bowman S. \(2018\) XNLI: Evaluating Cross-lingual Sentence Representations.
+arXiv preprint [arXiv:1809.05053](https://arxiv.org/abs/1809.05053)
+
+\[3\]: N. Reimers, I. Gurevych \(2019\) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
+arXiv preprint [arXiv:1908.10084](https://arxiv.org/abs/1908.10084)
diff --git a/model_cards/DeepPavlov/rubert-base-cased-conversational/README.md b/model_cards/DeepPavlov/rubert-base-cased-conversational/README.md
new file mode 100644
index 0000000000..4ea20c2cd1
--- /dev/null
+++ b/model_cards/DeepPavlov/rubert-base-cased-conversational/README.md
@@ -0,0 +1,18 @@
+---
+language:
+- russian
+---
+
+# rubert-base-cased-conversational
+
+Conversational RuBERT \(Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\) was trained
+on OpenSubtitles\[1\], [Dirty](https://d3.ru/), [Pikabu](https://pikabu.ru/),
+and a Social Media segment of Taiga corpus\[2\].
We assembled a new vocabulary for Conversational RuBERT model
+on this data and initialized the model with [RuBERT](../rubert-base-cased).
+
+
+\[1\]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
+In Proceedings of the 10th International Conference on Language Resources and Evaluation \(LREC 2016\)
+
+\[2\]: Shavrina T., Shapovalova O. \(2017\) TO THE METHODOLOGY OF CORPUS CONSTRUCTION FOR MACHINE LEARNING:
+«TAIGA» SYNTAX TREE CORPUS AND PARSER. in proc. of “CORPORA2017”, international conference , Saint-Petersbourg, 2017.
diff --git a/model_cards/DeepPavlov/rubert-base-cased-sentence/README.md b/model_cards/DeepPavlov/rubert-base-cased-sentence/README.md
new file mode 100644
index 0000000000..9bac38460f
--- /dev/null
+++ b/model_cards/DeepPavlov/rubert-base-cased-sentence/README.md
@@ -0,0 +1,21 @@
+---
+language:
+- russian
+---
+
+# rubert-base-cased-sentence
+
+Sentence RuBERT \(Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\)
+is a representation-based sentence encoder for Russian. It is initialized with RuBERT and fine-tuned on SNLI\[1\]
+google-translated to russian and on russian part of XNLI dev set\[2\]. Sentence representations are mean pooled
+token embeddings in the same manner as in Sentence-BERT\[3\].
+
+
+\[1\]: S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. \(2015\) A large annotated corpus for learning
+natural language inference. arXiv preprint [arXiv:1508.05326](https://arxiv.org/abs/1508.05326)
+
+\[2\]: Williams A., Bowman S. \(2018\) XNLI: Evaluating Cross-lingual Sentence Representations.
+arXiv preprint [arXiv:1809.05053](https://arxiv.org/abs/1809.05053)
+
+\[3\]: N. Reimers, I. Gurevych \(2019\) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
+arXiv preprint [arXiv:1908.10084](https://arxiv.org/abs/1908.10084)
diff --git a/model_cards/DeepPavlov/rubert-base-cased/README.md b/model_cards/DeepPavlov/rubert-base-cased/README.md
new file mode 100644
index 0000000000..36e12cdeff
--- /dev/null
+++ b/model_cards/DeepPavlov/rubert-base-cased/README.md
@@ -0,0 +1,14 @@
+---
+language:
+- russian
+---
+
+# rubert-base-cased
+
+RuBERT \(Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters\) was trained on the Russian part of Wikipedia
+and news data. We used this training data to build a vocabulary of Russian subtokens and took a multilingual version
+of BERT-base as an initialization for RuBERT\[1\].
+
+
+\[1\]: Kuratov, Y., Arkhipov, M. \(2019\). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language.
+arXiv preprint [arXiv:1905.07213](https://arxiv.org/abs/1905.07213).
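
Reviewer note, not part of the patch: the two sentence-encoder cards (`bert-base-multilingual-cased-sentence` and `rubert-base-cased-sentence`) describe sentence representations as mean pooled token embeddings. A minimal sketch of that pooling step in plain Python follows. The function name `mean_pool` is hypothetical, and skipping padding positions through an attention mask is an assumption that the cards do not state.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the embeddings of unmasked tokens into one sentence vector.

    Hypothetical helper: ignoring padding via the mask is an assumption,
    not something the model cards specify.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    # Keep only the vectors whose mask entry is non-zero (real tokens).
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    # Component-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy 2-dimensional embeddings: two real tokens and one padding position.
sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

In practice the token embeddings would be the final hidden states of the encoder (768 values per token for these models), and the mask would come from the tokenizer.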