From f3d44301cc958424162795e257ed4f4088eb8832 Mon Sep 17 00:00:00 2001
From: Lorenzo De Mattei <lorenzo.demattei@gmail.com>
Date: Fri, 1 May 2020 17:46:42 +0200
Subject: [PATCH] =?UTF-8?q?GePpeTto=20model=20=F0=9F=87=AE=F0=9F=87=B9=20(?=
 =?UTF-8?q?#4099)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Create GePpeTto.md

* Update model_cards/LorenzoDeMattei/GePpeTto.md

* Update model_cards/LorenzoDeMattei/GePpeTto.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
---
 model_cards/LorenzoDeMattei/GePpeTto.md | 133 ++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 model_cards/LorenzoDeMattei/GePpeTto.md

diff --git a/model_cards/LorenzoDeMattei/GePpeTto.md b/model_cards/LorenzoDeMattei/GePpeTto.md
new file mode 100644
index 0000000000..cdffed246c
--- /dev/null
+++ b/model_cards/LorenzoDeMattei/GePpeTto.md
@@ -0,0 +1,133 @@
+---
+language: italian
+---
+
+# GePpeTto GPT2 Model 🇮🇹
+
+Pretrained GPT2 117M model for Italian.
+
+You can find further details in the paper:
+
+Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, Malvina Nissim, Marco Guerini. "GePpeTto Carves Italian into a Language Model", arXiv preprint. PDF available at: https://arxiv.org/abs/2004.14253
+
+## Pretraining Corpus
+
+The pretraining set comprises two main sources. The first is a dump of Italian Wikipedia (November 2019),
+consisting of 2.8GB of text. The second is the ItWac corpus (Baroni et al., 2009), which amounts to 11GB of web
+texts. This collection provides a mix of standard and less standard Italian over a rather wide chronological span,
+with older texts than the Wikipedia dump (the latter stretches only to the late 2000s).
+
+## Pretraining details
+
+This model was trained using Hugging Face's GPT2 implementation on 4 NVIDIA Tesla T4 GPUs for 620k steps.
+
+Training parameters:
+
+- GPT-2 small configuration
+- Vocabulary size: 30k
+- Batch size: 32
+- Block size: 100
+- Adam optimizer
+- Initial learning rate: 5e-5
+- Warm-up steps: 10k
+
+## Perplexity scores
+
+| Domain | Perplexity |
+|---|---|
+| Wikipedia | 26.1052 |
+| ItWac | 30.3965 |
+| Legal | 37.2197 |
+| News | 45.3859 |
+| Social Media | 84.6408 |
+
+For further details, including qualitative analysis and human evaluation, see the paper: https://arxiv.org/abs/2004.14253
+
+## Load Pretrained Model
+
+To use this model, install the Hugging Face `transformers` library and load the model and tokenizer directly:
+
+```python
+from transformers import GPT2Tokenizer, GPT2Model
+
+model = GPT2Model.from_pretrained('LorenzoDeMattei/GePpeTto')
+tokenizer = GPT2Tokenizer.from_pretrained(
+    'LorenzoDeMattei/GePpeTto',
+)
+```
+
+## Example using GPT2LMHeadModel
+
+```python
+from transformers import GPT2Tokenizer, GPT2LMHeadModel
+
+tokenizer = GPT2Tokenizer.from_pretrained('LorenzoDeMattei/GePpeTto')
+model = GPT2LMHeadModel.from_pretrained(
+    'LorenzoDeMattei/GePpeTto', pad_token_id=tokenizer.eos_token_id
+)
+
+input_ids = tokenizer.encode(
+    'Wikipedia Geppetto', return_tensors='pt'
+)
+sample_outputs = model.generate(
+    input_ids,
+    do_sample=True,
+    max_length=50,
+    top_k=50,
+    top_p=0.95,
+    num_return_sequences=3,
+)
+
+print('Output:\n' + 100 * '-')
+for i, sample_output in enumerate(sample_outputs):
+    print(
+        '{}: {}'.format(
+            i, tokenizer.decode(sample_output, skip_special_tokens=True)
+        )
+    )
+```
+
+Because sampling is enabled, the output varies between runs. An example:
+
+```text
+Output:
+----------------------------------------------------------------------------------------------------
+0: Wikipedia Geppetto
+
+Geppetto è una città degli Stati Uniti d'America, situata nello Stato dell'Iowa, nella Contea di Greene.
+
+Wikipedia The Sax
+
+The Sax è il primo album discografico
+1: Wikipedia Geppetto/Passione
+
+Geppetto è il primo album in studio dei Saturday Night Live, pubblicato dalla Iron Maiden nel 1974.
+
+L'album è un lavoro di debutto che lo porta a definire
+2: Wikipedia Geppetto
+
+Geppetto ("Fenëvëv" in calabrese) è un comune italiano di abitanti della regione Calabria.
+
+Zona di particolare pregio storico-artistico, paesaggistico, storico-artistico,
+```
+
+## Citation
+
+Please cite this work using the following BibTeX entry:
+
+```
+@misc{mattei2020geppetto,
+    title={GePpeTto Carves Italian into a Language Model},
+    author={Lorenzo De Mattei and Michele Cafagna and Felice Dell'Orletta and Malvina Nissim and Marco Guerini},
+    year={2020},
+    eprint={2004.14253},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+
+## References
+
+Marco Baroni, Silvia Bernardini, Adriano Ferraresi,
+and Eros Zanchetta. 2009. The WaCky wide web: a
+collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.