From c7d96b60e45deb341615f222eb6b9d233771f357 Mon Sep 17 00:00:00 2001
From: Moseli Motsoehli
Date: Tue, 7 Jul 2020 00:38:15 -1000
Subject: [PATCH] zuBERTa model card (#5536)

* Create README

* Update README.md

Co-authored-by: Kevin Canwen Xu
---
 model_cards/MoseliMotsoehli/zuBERTa/README.md | 56 +++++++++++++++++++
 1 file changed, 56 insertions(+)
 create mode 100644 model_cards/MoseliMotsoehli/zuBERTa/README.md

diff --git a/model_cards/MoseliMotsoehli/zuBERTa/README.md b/model_cards/MoseliMotsoehli/zuBERTa/README.md
new file mode 100644
index 0000000000..48f0295a99
--- /dev/null
+++ b/model_cards/MoseliMotsoehli/zuBERTa/README.md
@@ -0,0 +1,56 @@
+---
+language: zulu
+---
+
+# zuBERTa
+zuBERTa is a RoBERTa-style transformer language model trained on Zulu text.
+
+## Intended uses & limitations
+The model can be used to obtain embeddings for a downstream task such as question answering.
+
+#### How to use
+
+```python
+>>> from transformers import pipeline
+>>> from transformers import AutoTokenizer, AutoModelWithLMHead
+
+>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
+>>> model = AutoModelWithLMHead.from_pretrained("MoseliMotsoehli/zuBERTa")
+>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+>>> unmasker("Abafika eNkandla bafika sebeholwa <mask> uMpongo kaZingelwayo.")
+
+[
+  {
+    "sequence": "Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo.",
+    "score": 0.050459690392017365,
+    "token": 555,
+    "token_str": "Ġkhona"
+  },
+  {
+    "sequence": "Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.",
+    "score": 0.03668094798922539,
+    "token": 2321,
+    "token_str": "Ġinkosi"
+  },
+  {
+    "sequence": "Abafika eNkandla bafika sebeholwa ubukhosi uMpongo kaZingelwayo.",
+    "score": 0.028774697333574295,
+    "token": 5101,
+    "token_str": "Ġubukhosi"
+  }
+]
+```
+
+## Training data
+
+1. 30k sentences from the Zulu 2018 corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download). These were collected from news articles and creative writings.
+2. ~7500 articles of human-generated translations scraped from the Zulu [Wikipedia](https://zu.wikipedia.org/wiki/Special:AllPages).
+
+### BibTeX entry and citation info
+
+```bibtex
+@inproceedings{author = {Moseli Motsoehli},
+  title = {Towards transformation of Southern African language models through transformers.},
+  year = {2020}
+}
+```
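
The card says the model can supply embeddings for a downstream task, but it only demonstrates the fill-mask pipeline. The following is a minimal sketch of one common way to get a sentence embedding: mean pooling over the last-layer hidden states, skipping padding positions. The helper names `mean_pool` and `embed_sentences` are hypothetical (they are not part of the card or of `transformers`), and the pooling strategy is an assumption rather than the author's documented method.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the vectors of real (non-padding) tokens into one sentence vector."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_sentences(sentences: List[str],
                    model_name: str = "MoseliMotsoehli/zuBERTa") -> List[List[float]]:
    """Hypothetical helper: encode sentences with the checkpoint, then mean-pool.

    torch and transformers are imported locally so that mean_pool stays usable
    without them. Calling this function downloads the checkpoint.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder only, no LM head
    model.eval()

    # Assumes the tokenizer defines a padding token, as RoBERTa tokenizers do.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    hidden = outputs[0]  # last-layer hidden states: (batch, seq_len, hidden_size)

    return [
        mean_pool(h.tolist(), m.tolist())
        for h, m in zip(hidden, batch["attention_mask"])
    ]
```

Under these assumptions, `embed_sentences(["Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo."])` would return one vector per input sentence, which could then be fed to a downstream classifier or retrieval component. Mean pooling is only one option; using the first token's vector or fine-tuning the encoder end to end may suit a given task better.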