From a646fd55fdd97427ad33c4ee17d41758b7cafa99 Mon Sep 17 00:00:00 2001
From: Muhammad Shaheer Malik <157911864+MShaheerMalik77@users.noreply.github.com>
Date: Fri, 11 Jul 2025 22:59:09 +0500
Subject: [PATCH] Updated CamemBERT model card to new standardized format (#39227)

* Updated CamemBERT model card to new standardized format
* Applied review suggestions for CamemBERT: restored API refs, added examples, badges, and attribution
* Updated CamemBERT usage examples, quantization, badges, and format
* Updated CamemBERT badges
* Fixed CLI Section
---
 docs/source/en/model_doc/camembert.md | 123 +++++++++++++++++++-------
 1 file changed, 89 insertions(+), 34 deletions(-)

diff --git a/docs/source/en/model_doc/camembert.md b/docs/source/en/model_doc/camembert.md
index aad9662de9..efa57e1704 100644
--- a/docs/source/en/model_doc/camembert.md
+++ b/docs/source/en/model_doc/camembert.md
@@ -14,49 +14,105 @@ rendered properly in your Markdown viewer.
 -->
-# CamemBERT
-
-
-PyTorch
-TensorFlow
-SDPA
+
+
+ PyTorch
+ TensorFlow
+ SDPA
+
-## Overview
+# CamemBERT
-The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://huggingface.co/papers/1911.03894) by
-[Louis Martin](https://huggingface.co/louismartin), [Benjamin Muller](https://huggingface.co/benjamin-mlr), [Pedro Javier Ortiz Suárez](https://huggingface.co/pjox), Yoann Dupont, Laurent Romary, Éric Villemonte de la
-Clergerie, [Djamé Seddah](https://huggingface.co/Djame), and [Benoît Sagot](https://huggingface.co/sagot). It is based on Facebook's RoBERTa model released in 2019. It is a model
-trained on 138GB of French text.
+[CamemBERT](https://huggingface.co/papers/1911.03894) is a language model based on [RoBERTa](./roberta), but trained specifically on French text from the OSCAR dataset, making it more effective for French language tasks.
-The abstract from the paper is the following:
+What sets CamemBERT apart is that it learned from a huge, high quality collection of French data, as opposed to mixing lots of languages. This helps it really understand French better than many multilingual models.
-*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
-models have either been trained on English data or on the concatenation of data in multiple languages. This makes
-practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
-we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
-performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
-dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
-for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
-downstream applications for French NLP.*
+Common applications of CamemBERT include masked language modeling (Fill-mask prediction), text classification (sentiment analysis), token classification (entity recognition) and sentence pair classification (entailment tasks).
-This model was contributed by [the ALMAnaCH team (Inria)](https://huggingface.co/almanach). The original code can be found [here](https://camembert-model.fr/).
+You can find all the original CamemBERT checkpoints under the [ALMAnaCH](https://huggingface.co/almanach/models?search=camembert) organization.
-
+> [!TIP]
+> This model was contributed by the [ALMAnaCH (Inria)](https://huggingface.co/almanach) team.
+>
+> Click on the CamemBERT models in the right sidebar for more examples of how to apply CamemBERT to different NLP tasks.
-This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well
-as the information relative to the inputs and outputs.
+The examples below demonstrate how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
-
+
-## Resources
+
-- [Text classification task guide](../tasks/sequence_classification)
-- [Token classification task guide](../tasks/token_classification)
-- [Question answering task guide](../tasks/question_answering)
-- [Causal language modeling task guide](../tasks/language_modeling)
-- [Masked language modeling task guide](../tasks/masked_language_modeling)
-- [Multiple choice task guide](../tasks/multiple_choice)
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline("fill-mask", model="camembert-base", torch_dtype=torch.float16, device=0)
+pipeline("Le camembert est un délicieux fromage <mask>.")
+```
+
+
+
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+tokenizer = AutoTokenizer.from_pretrained("camembert-base")
+model = AutoModelForMaskedLM.from_pretrained("camembert-base", torch_dtype="auto", device_map="auto", attn_implementation="sdpa")
+inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+
+print(f"The predicted token is: {predicted_token}")
+```
+
+
+
+
+```bash
+echo -e "Le camembert est un délicieux fromage <mask>." | transformers run --task fill-mask --model camembert-base --device 0
+```
+
+
+
+
+
+
+Quantization reduces the memory burden of large models by representing weights in lower precision. Refer to the [Quantization](../quantization/overview) overview for available options.
+
+The example below uses [bitsandbytes](../quantization/bitsandbytes) quantization to quantize the weights to 8-bits.
+
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM, BitsAndBytesConfig
+import torch
+
+quant_config = BitsAndBytesConfig(load_in_8bit=True)
+model = AutoModelForMaskedLM.from_pretrained(
+    "almanach/camembert-large",
+    quantization_config=quant_config,
+    device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-large")
+
+inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+
+print(f"The predicted token is: {predicted_token}")
+```
 ## CamembertConfig
@@ -137,5 +193,4 @@ as the information relative to the inputs and outputs.
 [[autodoc]] TFCamembertForQuestionAnswering
-
-
+
\ No newline at end of file