From ecbb5ee194f4775a89f2225eed7c24797c26feb4 Mon Sep 17 00:00:00 2001
From: Ethan Villarosa <113210015+EthanV431@users.noreply.github.com>
Date: Wed, 30 Jul 2025 08:33:13 -0700
Subject: [PATCH] standardized BARThez model card (#39701)

* standardized barthez model card according to template

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* suggested changes to barthez model card

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/barthez.md | 92 +++++++++++++++++++----------
 1 file changed, 62 insertions(+), 30 deletions(-)

diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md
index 0f8568cc05..fdaf28c8d7 100644
--- a/docs/source/en/model_doc/barthez.md
+++ b/docs/source/en/model_doc/barthez.md
@@ -14,49 +14,81 @@ rendered properly in your Markdown viewer.
 -->
-# BARThez
-
-
-PyTorch
-TensorFlow
-Flax
+
+
+ PyTorch
+ TensorFlow
+ Flax
+
-## Overview
+# BARThez
-The BARThez model was proposed in [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://huggingface.co/papers/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct,
-2020.
+[BARThez](https://huggingface.co/papers/2010.12321) is a [BART](./bart) model designed for French language tasks. Unlike existing French BERT models, BARThez includes a pretrained encoder-decoder, allowing it to generate text as well. This model is also available as a multilingual variant, mBARThez, by continuing pretraining multilingual BART on a French corpus.
-The abstract of the paper:
+You can find all of the original BARThez checkpoints under the [BARThez](https://huggingface.co/collections/dascim/barthez-670920b569a07aa53e3b6887) collection.
+
+> [!TIP]
+> This model was contributed by [moussakam](https://huggingface.co/moussakam).
+> Refer to the [BART](./bart) docs for more usage examples.
-*Inductive transfer learning, enabled by self-supervised learning, have taken the entire Natural Language Processing
-(NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language
-understanding tasks. While there are some notable exceptions, most of the available models and research have been
-conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language
-(to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research
-that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as
-CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also
-its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel
-summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already
-pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez,
-provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.*
+The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
-This model was contributed by [moussakam](https://huggingface.co/moussakam). The Authors' code can be found [here](https://github.com/moussaKam/BARThez).
+
+
-
+```py
+import torch
+from transformers import pipeline
-BARThez implementation is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on
-configuration classes and their parameters. BARThez-specific tokenizers are documented below.
+pipeline = pipeline(
+    task="fill-mask",
+    model="moussaKam/barthez",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.")
+```
-
+
+
-## Resources
+```py
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer
-- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
-  [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md).
+tokenizer = AutoTokenizer.from_pretrained(
+    "moussaKam/barthez",
+)
+model = AutoModelForMaskedLM.from_pretrained(
+    "moussaKam/barthez",
+    torch_dtype=torch.float16,
+    device_map="auto",
+)
+inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to("cuda")
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+
+print(f"The predicted token is: {predicted_token}")
+```
+
+
+
+```bash
+echo -e "Les plantes produisent <mask> grâce à un processus appelé photosynthèse." | transformers run --task fill-mask --model moussaKam/barthez --device 0
+```
+
+
+
 ## BarthezTokenizer