From 1eec69a90007b8f4a7af10805dab4904ea5dea77 Mon Sep 17 00:00:00 2001
From: Ilias Chalkidis
Date: Fri, 14 Feb 2020 02:16:46 +0200
Subject: [PATCH] Create README.md

---
 .../bert-base-greek-uncased-v1/README.md | 76 +++++++++++++++++++
 1 file changed, 76 insertions(+)
 create mode 100644 model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md

diff --git a/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md b/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
new file mode 100644
index 0000000000..2f153759a9
--- /dev/null
+++ b/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
@@ -0,0 +1,76 @@
+# GreekBERT
+
+A Greek version of the BERT pre-trained language model.
+
+
+
+
+## Pre-training corpora
+
+The pre-training corpora of `bert-base-greek-uncased-v1` include:
+
+* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων),
+* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/), and
+* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleaned version of [Common Crawl](https://commoncrawl.org).
+
+Future releases will also include:
+
+* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr),
+* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
+
+## Requirements
+
+We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository. To use it, install the `transformers` library through pip, along with PyTorch or TensorFlow 2.
+
+```
+pip install transformers
+pip install torch  # or: pip install tensorflow
+```
+
+## Pre-process text (Deaccent - Lower)
+
+To use `bert-base-greek-uncased-v1`, you must pre-process text by lowercasing all letters and removing all Greek diacritics.
+
+```python
+
+import unicodedata
+
+def strip_accents_and_lowercase(s):
+    return ''.join(c for c in unicodedata.normalize('NFD', s)  # NFD splits letters from their combining accents
+                   if unicodedata.category(c) != 'Mn').lower()  # drop nonspacing marks (Mn), then lowercase
+
+accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
+unaccented_string = strip_accents_and_lowercase(accented_string)
+
+print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
+
+```
+
+## Load Pretrained Model
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
+model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
+```
+
+## Author
+
+Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
+
+| GitHub: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |
+
+## About Us
+
+[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.
+
+The group's current research interests include:
+* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
+* natural language generation from databases and ontologies, especially Semantic Web ontologies,
+* text classification, including filtering spam and abusive content,
+* information extraction and opinion mining, including legal text analytics and sentiment analysis,
+* natural language processing tools for Greek, for example parsers and named-entity recognizers,
+* machine learning in natural language processing, especially deep learning.
+
+The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.
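
As a supplement to the "Pre-process text (Deaccent - Lower)" section, the following minimal sketch illustrates why the deaccenting step works: Unicode NFD normalization splits each accented Greek letter into a base letter plus one or more combining marks of category `Mn`, which can then be filtered out. The helper `describe_decomposition` and the sample strings are hypothetical and only serve the illustration; `strip_accents_and_lowercase` repeats the function from the README so that the block runs on its own.

```python
import unicodedata


def strip_accents_and_lowercase(s):
    # Same approach as the README snippet: decompose, drop marks, lowercase.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()


def describe_decomposition(ch):
    # Hypothetical helper: (code point, Unicode category) of each NFD component.
    return [(f"U+{ord(c):04X}", unicodedata.category(c))
            for c in unicodedata.normalize('NFD', ch)]


# Eta with tonos splits into a base letter and one combining mark.
print(describe_decomposition("\u03ae"))
# Iota with dialytika and tonos splits into a base letter and two marks.
print(describe_decomposition("\u0390"))
# Both kinds of diacritics are removed and capitals are lowercased.
print(strip_accents_and_lowercase("Ρολόι, ΑΫΠΝΙΑ"))
```

Because only characters of category `Mn` are dropped, punctuation, spaces, and Latin letters pass through unchanged apart from lowercasing. Any text should go through this function before it is handed to the tokenizer.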