Create README.md
This commit is contained in:
committed by
Julien Chaumond
parent
8744402f1e
commit
1eec69a900
76
model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
Normal file
@@ -0,0 +1,76 @@
# GreekBERT

A Greek version of the BERT pre-trained language model.

<img src="https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png" width="600"/>

## Pre-training corpora

The pre-training corpora of `bert-base-greek-uncased-v1` include:

* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων),
* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/), and
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).

Future releases will also include:

* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr), and
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).

## Requirements

We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository. To use it, install the `transformers` library through pip, along with either PyTorch or TensorFlow 2.

```
pip install transformers
pip install torch  # or: pip install tensorflow
```
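
To confirm the installation before loading the model, a small check like the one below can help. It is only a sketch: the function name `missing_requirements` and its injectable `find_spec` parameter are illustrative and not part of Transformers. It relies solely on the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def missing_requirements(find_spec=importlib.util.find_spec):
    """Return the names of required packages that cannot be found.

    `find_spec` is injectable only to make the check easy to test
    (hypothetical parameter, not part of any library API).
    """
    missing = []
    if find_spec("transformers") is None:
        missing.append("transformers")
    # Either backend is enough, so report a problem only when both are absent.
    if find_spec("torch") is None and find_spec("tensorflow") is None:
        missing.append("torch or tensorflow")
    return missing


if __name__ == "__main__":
    missing = missing_requirements()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All requirements found.")
```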

## Pre-process text (Deaccent - Lower)

To use `bert-base-greek-uncased-v1`, you must first pre-process your text by lowercasing it and removing all Greek diacritics.

```python
import unicodedata

def strip_accents_and_lowercase(s):
    # NFD splits each accented letter into a base letter plus combining marks;
    # dropping category 'Mn' (nonspacing marks) removes the diacritics.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
```
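
To see why filtering on the `Mn` category removes Greek diacritics, it may help to inspect what NFD normalization produces for a few accented letters. The helper name `decompose` below is illustrative only. The sketch uses just the standard-library `unicodedata` module.

```python
import unicodedata


def decompose(ch):
    """Return (Unicode name, general category) pairs for the NFD form of `ch`."""
    return [(unicodedata.name(c), unicodedata.category(c))
            for c in unicodedata.normalize("NFD", ch)]


# Escapes are used so the inputs are guaranteed to be precomposed characters:
# U+03AE eta with tonos, U+03CA iota with dialytika,
# U+0390 iota with dialytika and tonos.
for ch in "\u03ae\u03ca\u0390":
    print(ch, decompose(ch))
```

Each letter decomposes into a base letter of category `Ll` followed by one or more combining marks of category `Mn`, which are exactly the code points the pre-processing function drops.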

## Load Pretrained Model

```python
from transformers import AutoTokenizer, AutoModel

# Both calls download the files from the Hugging Face model hub on first use.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
```
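
Putting the pre-processing and loading steps together, the following sketch shows one possible way to obtain a sentence vector from the model. It assumes PyTorch as the backend and a Transformers release in which the tokenizer can be called directly on a list of strings. The function name `embed` and the choice of the first-token (`[CLS]`) vector as the sentence representation are illustrative assumptions, not a recommendation from the model authors.

```python
import unicodedata

MODEL_NAME = "nlpaueb/bert-base-greek-uncased-v1"


def strip_accents_and_lowercase(s):
    """Remove diacritics and lowercase, as the model expects."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn").lower()


def embed(sentences):
    """Return one first-token vector per sentence (assumes PyTorch is installed)."""
    # Imported lazily so the helper above stays usable without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()  # disable dropout for inference

    cleaned = [strip_accents_and_lowercase(s) for s in sentences]
    inputs = tokenizer(cleaned, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs[0] is the last hidden state: (batch, sequence_length, hidden_size).
    return outputs[0][:, 0, :]
```

Calling `embed(["Αυτή είναι η Ελληνική έκδοση του BERT."])` downloads the model on first use and returns a tensor with one row per input sentence.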

## Author

Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)

| Github: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |

## About Us

[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:
* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
* natural language generation from databases and ontologies, especially Semantic Web ontologies,
* text classification, including filtering spam and abusive content,
* information extraction and opinion mining, including legal text analytics and sentiment analysis,
* natural language processing tools for Greek, for example parsers and named-entity recognizers,
* machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.