Create README.md
committed by
Julien Chaumond
parent
8744402f1e
commit
1eec69a900
76
model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
Normal file
@@ -0,0 +1,76 @@
# GreekBERT

A Greek version of the BERT pre-trained language model.

<img src="https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png" width="600"/>

## Pre-training corpora

The pre-training corpora of `bert-base-greek-uncased-v1` include:

* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων),
* The Greek part of the [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/), and
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).

Future releases will also include:

* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr), and
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).

## Requirements

We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository. To use it, install the `transformers` library through pip, along with either PyTorch or TensorFlow 2.

```
pip install transformers
pip install torch  # or: pip install tensorflow
```

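After installing, it can help to confirm which of the packages are actually importable in the current environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_packages` is hypothetical and not part of Transformers.

```python
import importlib.util

def installed_packages(candidates=("transformers", "torch", "tensorflow")):
    """Hypothetical helper: return the candidate packages that can be imported."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]

found = installed_packages()
print("Importable packages:", found)
if "transformers" not in found:
    print("The transformers library is missing; run: pip install transformers")
if "torch" not in found and "tensorflow" not in found:
    print("No backend found; install PyTorch or TensorFlow 2.")
```
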
## Pre-process text (Deaccent - Lower)

To use `bert-base-greek-uncased-v1`, you have to pre-process texts by lowercasing all letters and removing all Greek diacritics.

```python
import unicodedata

def strip_accents_and_lowercase(s):
    # Decompose to NFD, drop combining marks (category 'Mn'), then lowercase.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
```

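The pre-processing relies on NFD normalization, which splits each accented letter into a base letter followed by combining marks (Unicode category `Mn`), so filtering out that category removes the tonos and the dialytika alike. The sketch below applies the same steps to a list of strings; the helper name `preprocess_batch` and the sample words are illustrative assumptions.

```python
import unicodedata

def strip_accents_and_lowercase(s):
    # NFD splits e.g. 'ή' into 'η' plus a combining acute accent (category 'Mn').
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn').lower()

def preprocess_batch(texts):
    """Hypothetical helper: pre-process every string in a list."""
    return [strip_accents_and_lowercase(t) for t in texts]

# 'ΐ' carries both dialytika and tonos; both marks are removed.
samples = ["Αθήνα", "Μαΐου", "προϋπολογισμός"]
print(preprocess_batch(samples))  # ['αθηνα', 'μαιου', 'προυπολογισμος']
```

Note that removing the dialytika merges some spellings that differ only by that mark.
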
## Load Pretrained Model

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
```

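Once the tokenizer and model are loaded, a typical next step is to pre-process a sentence and pass it through the model. The following is a minimal sketch, assuming the PyTorch backend; the helper name `encode_sentences` is hypothetical, and the heavy imports sit inside the function so that the pre-processing helper remains usable without PyTorch installed.

```python
import unicodedata

MODEL_NAME = "nlpaueb/bert-base-greek-uncased-v1"

def strip_accents_and_lowercase(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

def encode_sentences(sentences, model_name=MODEL_NAME):
    """Hypothetical helper: return the last hidden states for a batch of sentences."""
    # Imported lazily; calling this function downloads the model on first use.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # The model expects lowercased, deaccented input.
    cleaned = [strip_accents_and_lowercase(s) for s in sentences]
    inputs = tokenizer(cleaned, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Tensor of shape (batch_size, sequence_length, hidden_size).
    return outputs.last_hidden_state
```

For example, `encode_sentences(["Αυτή είναι η Ελληνική έκδοση του BERT."])` would return one contextual vector per token of the pre-processed sentence.
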
## Author

Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)

| GitHub: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |

## About Us

[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:

* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
* natural language generation from databases and ontologies, especially Semantic Web ontologies,
* text classification, including filtering spam and abusive content,
* information extraction and opinion mining, including legal text analytics and sentiment analysis,
* natural language processing tools for Greek, for example parsers and named-entity recognizers, and
* machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.