From a9d56a675ae08446ad1e122842c3d3650f7015da Mon Sep 17 00:00:00 2001
From: Gianpaolo Di Pietro
Date: Fri, 17 Jul 2020 19:50:49 +0200
Subject: [PATCH] Added model card for neuraly/bert-base-italian-cased-sentiment (#5845)

* Added model card for neuraly/bert-base-italian-cased-sentiment

* Update model_cards/neuraly/bert-base-italian-cased-sentiment/README.md

Co-authored-by: Julien Chaumond

Co-authored-by: Gianpy15
Co-authored-by: Julien Chaumond
---
 .../README.md | 97 +++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 model_cards/neuraly/bert-base-italian-cased-sentiment/README.md

diff --git a/model_cards/neuraly/bert-base-italian-cased-sentiment/README.md b/model_cards/neuraly/bert-base-italian-cased-sentiment/README.md
new file mode 100644
index 0000000000..4017b38030
--- /dev/null
+++ b/model_cards/neuraly/bert-base-italian-cased-sentiment/README.md
@@ -0,0 +1,97 @@
+---
+language: it
+thumbnail: "https://neuraly.ai/static/assets/images/huggingface/thumbnail.png"
+tags:
+
+- sentiment
+- Italian
+
+license: MIT
+widget:
+- text: "Huggingface è un team fantastico!"
+---
+
+# 🤗 + neuraly - Italian BERT Sentiment model
+
+## Model description
+
+This model performs sentiment analysis on Italian sentences. It starts from an instance of [bert-base-italian-cased](https://huggingface.co/dbmdz/bert-base-italian-cased) and was fine-tuned on an Italian dataset of tweets, reaching 82% accuracy on that dataset's test split.
+
+## Intended uses & limitations
+
+#### How to use
+
+```python
+import torch
+from torch import nn
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+# Load the tokenizer
+tokenizer = AutoTokenizer.from_pretrained("neuraly/bert-base-italian-cased-sentiment")
+# Load the model, use .cuda() to load it on the GPU
+model = AutoModelForSequenceClassification.from_pretrained("neuraly/bert-base-italian-cased-sentiment")
+
+sentence = 'Huggingface è un team fantastico!'
+input_ids = tokenizer.encode(sentence, add_special_tokens=True)
+
+# Create tensor, use .cuda() to transfer the tensor to GPU
+tensor = torch.tensor(input_ids).long()
+# Fake batch dimension
+tensor = tensor.unsqueeze(0)
+
+# Call the model and get the logits (the first element of the model output)
+logits = model(tensor)[0]
+
+# Remove the fake batch dimension
+logits = logits.squeeze(0)
+
+# The model was trained with a combined Log Likelihood + Softmax loss, so a softmax over the logits tensor is needed to obtain probabilities
+proba = nn.functional.softmax(logits, dim=0)
+
+# Unpack the tensor to obtain negative, neutral and positive probabilities
+negative, neutral, positive = proba
+```
+
+#### Limitations and bias
+
+A possible drawback (or bias) of this model is that it was trained on a tweet dataset, with all the limitations that come with it. The training domain is strongly tied to football players and teams, although the model works surprisingly well on other topics too.
+
+## Training data
+
+We trained the model on the combination of the two tweet datasets from [Sentipolc EVALITA 2016](http://www.di.unito.it/~tutreeb/sentipolc-evalita16/data.html). Overall, the dataset consists of 45K pre-processed tweets.
+
+The model weights come from a pre-trained instance of [bert-base-italian-cased](https://huggingface.co/dbmdz/bert-base-italian-cased). A huge "thank you" goes to that team for their brilliant work!
+
+## Training procedure
+
+#### Preprocessing
+
+We tried to preserve as much information as possible, since BERT captures the semantics of complex text sequences extremely well. We removed only **@mentions**, **URLs** and **emails** from every tweet and kept nearly everything else.
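
The card does not publish the cleaning code, so the following is only a minimal sketch of the preprocessing step described above, assuming plain regular expressions are enough. The three patterns and the `clean_tweet` helper are hypothetical names chosen for illustration, not the authors' implementation.

```python
import re

# Hypothetical patterns: the model card does not state the exact rules used.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def clean_tweet(text: str) -> str:
    """Strip URLs, e-mail addresses and @mentions, leaving everything else."""
    text = URL_RE.sub("", text)
    # E-mails are removed before mentions so "user@example.com" goes away whole.
    text = EMAIL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

For example, `clean_tweet("@mario Forza Juve! https://t.co/abc")` returns `"Forza Juve!"`. Hashtags, emoji, punctuation and casing are left untouched, in line with the stated goal of keeping as much of the tweet as possible.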
+
+#### Hardware
+
+- **GPU**: Nvidia GTX 1080 Ti
+- **CPU**: AMD Ryzen 7 3700X 8c/16t
+- **RAM**: 64GB DDR4
+
+#### Hyperparameters
+
+- Optimizer: **AdamW** with learning rate of **2e-5**, epsilon of **1e-8**
+- Max epochs: **5**
+- Batch size: **32**
+- Early Stopping: **enabled** with patience = 1
+
+Early stopping was triggered after 3 epochs.
+
+## Eval results
+
+The model achieves an overall accuracy of 82% on the test set.
+The test set is a 20% split of the whole dataset.
+
+## About us
+[Neuraly](https://neuraly.ai) is a young and dynamic startup committed to designing AI-driven solutions and services through the most advanced Machine Learning and Data Science technologies. You can find out more about who we are and what we do on our [website](https://neuraly.ai).
+
+## Acknowledgments
+
+Thanks to the generous support of the [Hugging Face](https://huggingface.co/) team,
+the model can be downloaded from their S3 storage and tested live through their inference API 🤗.