From 9519f0cd63b1b5d895f9f993e3198ec55d1e57fa Mon Sep 17 00:00:00 2001
From: Jeroen Steggink
Date: Tue, 13 Jul 2021 14:40:27 +0200
Subject: [PATCH] Wrong model is used in example, should be character instead of subword model (#12676)

* Wrong model is used, should be character instead of subword

In the original Google repo for CANINE there was a mixup in the model names in the README.md, which was fixed 2 weeks ago. Since this transformer model was created before that fix, it probably resulted in the wrong model being used in this example.

s = subword, c = character

* canine.rst style fix

* Update docs/source/model_doc/canine.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Styling canine.rst

* Added links to model cards.

* Fixed links to model cards.

Co-authored-by: Jeroen Steggink <978411+jsteggink@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
---
 docs/source/model_doc/canine.rst | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/docs/source/model_doc/canine.rst b/docs/source/model_doc/canine.rst
index 80b1e05267..2f868bdae9 100644
--- a/docs/source/model_doc/canine.rst
+++ b/docs/source/model_doc/canine.rst
@@ -48,6 +48,12 @@ Tips:
   (which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
   tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
   details for this can be found in the paper.
+- Models:
+
+  - `google/canine-c `__: Pre-trained with autoregressive character loss,
+    12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
+  - `google/canine-s `__: Pre-trained with subword loss, 12-layer,
+    768-hidden, 12-heads, 121M parameters (size ~500 MB).
 
 This model was contributed by `nielsr `__. The original code can be found `here `__.
@@ -63,7 +69,7 @@ CANINE works on raw characters, so it can be used without a tokenizer:
     from transformers import CanineModel
     import torch
 
-    model = CanineModel.from_pretrained('google/canine-s')  # model pre-trained with autoregressive character loss
+    model = CanineModel.from_pretrained('google/canine-c')  # model pre-trained with autoregressive character loss
 
     text = "hello world"
     # use Python's built-in ord() function to turn each character into its unicode code point id
@@ -81,8 +87,8 @@ sequences to the same length):
 
     from transformers import CanineTokenizer, CanineModel
 
-    model = CanineModel.from_pretrained('google/canine-s')
-    tokenizer = CanineTokenizer.from_pretrained('google/canine-s')
+    model = CanineModel.from_pretrained('google/canine-c')
+    tokenizer = CanineTokenizer.from_pretrained('google/canine-c')
 
     inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
     encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
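
Reviewer note: the tokenizer-free example touched by this patch builds its input with Python's built-in ord(), and the batched example pads sequences to the same length. The sketch below shows only that preprocessing step in plain Python, so it runs without downloading either checkpoint. The names `to_code_points`, `encode_batch`, and `PAD_ID` are hypothetical, and the padding id of 0 is an assumption made for illustration. The sketch also does not add any special tokens that the real CanineTokenizer may insert, so treat it as an approximation rather than the library's behavior.

```python
# Minimal sketch of character-level input preparation (no model or tokenizer needed).
# PAD_ID, to_code_points and encode_batch are hypothetical names for this illustration.

PAD_ID = 0  # assumed padding id; check the real tokenizer before relying on it


def to_code_points(text):
    """Map each character to its Unicode code point, as the patched example does with ord()."""
    return [ord(char) for char in text]


def encode_batch(texts, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build a matching attention mask."""
    ids = [to_code_points(text) for text in texts]
    longest = max(len(seq) for seq in ids)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in ids]
    return input_ids, attention_mask


input_ids, attention_mask = encode_batch(["hello world", "hi"])
print(input_ids[0])       # code points for "hello world"
print(attention_mask[1])  # 1 for real characters, 0 for padding
```

Wrapping `input_ids` in `torch.tensor(...)` would give the same kind of tensor the patched example passes to the model. The choice between `google/canine-c` and `google/canine-s` does not affect this step, since both take code points as input.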