Update all references to canonical models (#29001)

* Script & Manual edition * Update
2024-02-16 08:16:58 +01:00
parent 1e402b957d
commit f497f564bb
561 changed files with 2682 additions and 2687 deletions
--- a/examples/README.md
+++ b/examples/README.md
@@ -118,8 +118,8 @@ pip install runhouse
 # For an on-demand V100 with whichever cloud provider you have configured:
 python run_on_remote.py \
    --example pytorch/text-generation/run_generation.py \
-    --model_type=gpt2 \
-    --model_name_or_path=gpt2 \
+    --model_type=openai-community/gpt2 \
+    --model_name_or_path=openai-community/gpt2 \
    --prompt "I am a language model and"

 # For byo (bring your own) cluster:
--- a/examples/flax/image-captioning/README.md
+++ b/examples/flax/image-captioning/README.md
@@ -34,7 +34,7 @@ Next, we create a [FlaxVisionEncoderDecoderModel](https://huggingface.co/docs/tr
 python3 create_model_from_encoder_decoder_models.py \
    --output_dir model \
    --encoder_model_name_or_path google/vit-base-patch16-224-in21k \
-    --decoder_model_name_or_path gpt2
+    --decoder_model_name_or_path openai-community/gpt2
 ```

 ### Train the model
--- a/examples/flax/language-modeling/README.md
+++ b/examples/flax/language-modeling/README.md
@@ -28,7 +28,7 @@ way which enables simple and efficient model parallelism.
 In the following, we demonstrate how to train a bi-directional transformer model 
 using masked language modeling objective as introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
 More specifically, we demonstrate how JAX/Flax can be leveraged 
-to pre-train [**`roberta-base`**](https://huggingface.co/roberta-base)
+to pre-train [**`FacebookAI/roberta-base`**](https://huggingface.co/FacebookAI/roberta-base)
 in Norwegian on a single TPUv3-8 pod.

 The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
@@ -76,13 +76,13 @@ tokenizer.save("./norwegian-roberta-base/tokenizer.json")
 ### Create configuration

 Next, we create the model's configuration file. This is as simple 
-as loading and storing [`**roberta-base**`](https://huggingface.co/roberta-base)
+as loading and storing [`**FacebookAI/roberta-base**`](https://huggingface.co/FacebookAI/roberta-base)
 in the local model folder:

 ```python
 from transformers import RobertaConfig

-config = RobertaConfig.from_pretrained("roberta-base", vocab_size=50265)
+config = RobertaConfig.from_pretrained("FacebookAI/roberta-base", vocab_size=50265)
 config.save_pretrained("./norwegian-roberta-base")
 ```

@@ -129,8 +129,8 @@ look at [this](https://colab.research.google.com/github/huggingface/notebooks/bl

 In the following, we demonstrate how to train an auto-regressive causal transformer model 
 in JAX/Flax.
-More specifically, we pretrain a randomly initialized [**`gpt2`**](https://huggingface.co/gpt2) model in Norwegian on a single TPUv3-8.
-to pre-train 124M [**`gpt2`**](https://huggingface.co/gpt2)
+More specifically, we pretrain a randomly initialized [**`openai-community/gpt2`**](https://huggingface.co/openai-community/gpt2) model in Norwegian on a single TPUv3-8.
+to pre-train 124M [**`openai-community/gpt2`**](https://huggingface.co/openai-community/gpt2)
 in Norwegian on a single TPUv3-8 pod.

 The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
@@ -179,13 +179,13 @@ tokenizer.save("./norwegian-gpt2/tokenizer.json")
 ### Create configuration

 Next, we create the model's configuration file. This is as simple 
-as loading and storing [`**gpt2**`](https://huggingface.co/gpt2)
+as loading and storing [`**openai-community/gpt2**`](https://huggingface.co/openai-community/gpt2)
 in the local model folder:

 ```python
 from transformers import GPT2Config

-config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, vocab_size=50257)
+config = GPT2Config.from_pretrained("openai-community/gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, vocab_size=50257)
 config.save_pretrained("./norwegian-gpt2")
 ```

@@ -199,7 +199,7 @@ Finally, we can run the example script to pretrain the model:
 ```bash
 python run_clm_flax.py \
    --output_dir="./norwegian-gpt2" \
-    --model_type="gpt2" \
+    --model_type="openai-community/gpt2" \
    --config_name="./norwegian-gpt2" \
    --tokenizer_name="./norwegian-gpt2" \
    --dataset_name="oscar" \
--- a/examples/flax/question-answering/README.md
+++ b/examples/flax/question-answering/README.md
@@ -29,7 +29,7 @@ The following example fine-tunes BERT on SQuAD:

 ```bash
 python run_qa.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --do_train   \
  --do_eval   \
@@ -67,7 +67,7 @@ Here is an example training on 4 TITAN RTX GPUs and Bert Whole Word Masking unca
 ```bash
 export CUDA_VISIBLE_DEVICES=0,1,2,3
 python run_qa.py   \
--model_name_or_path bert-large-uncased-whole-word-masking   \
+--model_name_or_path google-bert/bert-large-uncased-whole-word-masking   \
 --dataset_name squad   \
 --do_train   \
 --do_eval   \
--- a/examples/flax/test_flax_examples.py
+++ b/examples/flax/test_flax_examples.py
@@ -78,7 +78,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_glue.py
-            --model_name_or_path distilbert-base-uncased
+            --model_name_or_path distilbert/distilbert-base-uncased
            --output_dir {tmp_dir}
            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
            --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv
@@ -101,7 +101,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_clm_flax.py
-            --model_name_or_path distilgpt2
+            --model_name_or_path distilbert/distilgpt2
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --do_train
@@ -125,7 +125,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_summarization.py
-            --model_name_or_path t5-small
+            --model_name_or_path google-t5/t5-small
            --train_file tests/fixtures/tests_samples/xsum/sample.json
            --validation_file tests/fixtures/tests_samples/xsum/sample.json
            --test_file tests/fixtures/tests_samples/xsum/sample.json
@@ -155,7 +155,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_mlm.py
-            --model_name_or_path distilroberta-base
+            --model_name_or_path distilbert/distilroberta-base
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --output_dir {tmp_dir}
@@ -179,7 +179,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_t5_mlm_flax.py
-            --model_name_or_path t5-small
+            --model_name_or_path google-t5/t5-small
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --do_train
@@ -206,7 +206,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_flax_ner.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --train_file tests/fixtures/tests_samples/conll/sample.json
            --validation_file tests/fixtures/tests_samples/conll/sample.json
            --output_dir {tmp_dir}
@@ -233,7 +233,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_qa.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --version_2_with_negative
            --train_file tests/fixtures/tests_samples/SQUAD/sample.json
            --validation_file tests/fixtures/tests_samples/SQUAD/sample.json
--- a/examples/flax/text-classification/README.md
+++ b/examples/flax/text-classification/README.md
@@ -31,7 +31,7 @@ GLUE is made up of a total of 9 different tasks. Here is how to run the script o
 export TASK_NAME=mrpc

 python run_flax_glue.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --task_name ${TASK_NAME} \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
--- a/examples/flax/token-classification/README.md
+++ b/examples/flax/token-classification/README.md
@@ -25,7 +25,7 @@ The following example fine-tunes BERT on CoNLL-2003:

 ```bash
 python run_flax_ner.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --dataset_name conll2003 \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
--- a/examples/legacy/benchmarking/README.md
+++ b/examples/legacy/benchmarking/README.md
@@ -22,5 +22,5 @@ If you would like to list benchmark results on your favorite models of the [mode

 | Benchmark description | Results | Environment info |      Author      |
 |:----------|:-------------|:-------------|------:|
-| PyTorch Benchmark on inference for `bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
-| PyTorch Benchmark on inference for `bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
+| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
+| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
--- a/examples/legacy/question-answering/README.md
+++ b/examples/legacy/question-answering/README.md
@@ -1,7 +1,7 @@
 #### Fine-tuning BERT on SQuAD1.0 with relative position embeddings

 The following examples show how to fine-tune BERT models with different relative position embeddings. The BERT model 
-`bert-base-uncased` was pretrained with default absolute position embeddings. We provide the following pretrained 
+`google-bert/bert-base-uncased` was pretrained with default absolute position embeddings. We provide the following pretrained 
 models which were pre-trained on the same training data (BooksCorpus and English Wikipedia) as in the BERT model 
 training, but with different relative position embeddings. 

@@ -10,7 +10,7 @@ Shaw et al., [Self-Attention with Relative Position Representations](https://arx
 * `zhiheng-huang/bert-base-uncased-embedding-relative-key-query`, trained from scratch with relative embedding method 4 
 in Huang et al. [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
 * `zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query`, fine-tuned from model 
-`bert-large-uncased-whole-word-masking` with 3 additional epochs with relative embedding method 4 in Huang et al. 
+`google-bert/bert-large-uncased-whole-word-masking` with 3 additional epochs with relative embedding method 4 in Huang et al. 
 [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)


@@ -61,7 +61,7 @@ torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --gradient_accumulation_steps 3
 ```
 Training with the above command leads to the f1 score of 93.52, which is slightly better than the f1 score of 93.15 for 
-`bert-large-uncased-whole-word-masking`.
+`google-bert/bert-large-uncased-whole-word-masking`.

 #### Distributed training

@@ -69,7 +69,7 @@ Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word

 ```bash
 torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
-    --model_name_or_path bert-large-uncased-whole-word-masking \
+    --model_name_or_path google-bert/bert-large-uncased-whole-word-masking \
    --dataset_name squad \
    --do_train \
    --do_eval \
@@ -90,7 +90,7 @@ exact_match = 86.91
 ```

 This fine-tuned model is available as a checkpoint under the reference
-[`bert-large-uncased-whole-word-masking-finetuned-squad`](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad).
+[`google-bert/bert-large-uncased-whole-word-masking-finetuned-squad`](https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad).

 ## Results

--- a/examples/legacy/run_camembert.py
+++ b/examples/legacy/run_camembert.py
@@ -39,8 +39,8 @@ def fill_mask(masked_input, model, tokenizer, topk=5):
    return topk_filled_outputs


-tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
-model = CamembertForMaskedLM.from_pretrained("camembert-base")
+tokenizer = CamembertTokenizer.from_pretrained("almanach/camembert-base")
+model = CamembertForMaskedLM.from_pretrained("almanach/camembert-base")
 model.eval()

 masked_input = "Le camembert est <mask> :)"
--- a/examples/legacy/run_openai_gpt.py
+++ b/examples/legacy/run_openai_gpt.py
@@ -20,7 +20,7 @@

    This script with default values fine-tunes and evaluate a pretrained OpenAI GPT on the RocStories dataset:
        python run_openai_gpt.py \
-          --model_name openai-gpt \
+          --model_name openai-community/openai-gpt \
          --do_train \
          --do_eval \
          --train_dataset "$ROC_STORIES_DIR/cloze_test_val__spring2016 - cloze_test_ALL_val.csv" \
@@ -104,7 +104,7 @@ def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, d

 def main():
    parser = argparse.ArgumentParser()
-    parser.add_argument("--model_name", type=str, default="openai-gpt", help="pretrained model name")
+    parser.add_argument("--model_name", type=str, default="openai-community/openai-gpt", help="pretrained model name")
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
    parser.add_argument(
--- a/examples/legacy/run_transfo_xl.py
+++ b/examples/legacy/run_transfo_xl.py
@@ -40,7 +40,7 @@ logger = logging.getLogger(__name__)

 def main():
    parser = argparse.ArgumentParser(description="PyTorch Transformer Language Model")
-    parser.add_argument("--model_name", type=str, default="transfo-xl-wt103", help="pretrained model name")
+    parser.add_argument("--model_name", type=str, default="transfo-xl/transfo-xl-wt103", help="pretrained model name")
    parser.add_argument(
        "--split", type=str, default="test", choices=["all", "valid", "test"], help="which split to evaluate"
    )
--- a/examples/legacy/seq2seq/README.md
+++ b/examples/legacy/seq2seq/README.md
@@ -170,7 +170,7 @@ If 'translation' is in your task name, the computed metric will be BLEU. Otherwi
 For t5, you need to specify --task translation_{src}_to_{tgt} as follows:
 ```bash
 export DATA_DIR=wmt_en_ro
-./run_eval.py t5-base \
+./run_eval.py google-t5/t5-base \
    $DATA_DIR/val.source t5_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path enro_bleu.json \
--- a/examples/legacy/seq2seq/old_test_datasets.py
+++ b/examples/legacy/seq2seq/old_test_datasets.py
@@ -28,7 +28,7 @@ from transformers.testing_utils import TestCasePlus, slow
 from utils import FAIRSEQ_AVAILABLE, DistributedSortishSampler, LegacySeq2SeqDataset, Seq2SeqDataset


-BERT_BASE_CASED = "bert-base-cased"
+BERT_BASE_CASED = "google-bert/bert-base-cased"
 PEGASUS_XSUM = "google/pegasus-xsum"
 ARTICLES = [" Sam ate lunch today.", "Sams lunch ingredients."]
 SUMMARIES = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
--- a/examples/legacy/seq2seq/pack_dataset.py
+++ b/examples/legacy/seq2seq/pack_dataset.py
@@ -74,7 +74,7 @@ def pack_data_dir(tok, data_dir: Path, max_tokens, save_path):

 def packer_cli():
    parser = argparse.ArgumentParser()
-    parser.add_argument("--tok_name", type=str, help="like facebook/bart-large-cnn,t5-base, etc.")
+    parser.add_argument("--tok_name", type=str, help="like facebook/bart-large-cnn,google-t5/t5-base, etc.")
    parser.add_argument("--max_seq_len", type=int, default=128)
    parser.add_argument("--data_dir", type=str)
    parser.add_argument("--save_path", type=str)
--- a/examples/legacy/seq2seq/run_distributed_eval.py
+++ b/examples/legacy/seq2seq/run_distributed_eval.py
@@ -124,7 +124,7 @@ def run_generate():
    parser.add_argument(
        "--model_name",
        type=str,
-        help="like facebook/bart-large-cnn,t5-base, etc.",
+        help="like facebook/bart-large-cnn,google-t5/t5-base, etc.",
        default="sshleifer/distilbart-xsum-12-3",
    )
    parser.add_argument("--save_dir", type=str, help="where to save", default="tmp_gen")
--- a/examples/legacy/seq2seq/run_eval.py
+++ b/examples/legacy/seq2seq/run_eval.py
@@ -100,7 +100,7 @@ def run_generate(verbose=True):
    """

    parser = argparse.ArgumentParser()
-    parser.add_argument("model_name", type=str, help="like facebook/bart-large-cnn,t5-base, etc.")
+    parser.add_argument("model_name", type=str, help="like facebook/bart-large-cnn,google-t5/t5-base, etc.")
    parser.add_argument("input_path", type=str, help="like cnn_dm/test.source")
    parser.add_argument("save_path", type=str, help="where to save summaries")
    parser.add_argument("--reference_path", type=str, required=False, help="like cnn_dm/test.target")
--- a/examples/legacy/token-classification/README.md
+++ b/examples/legacy/token-classification/README.md
@@ -34,7 +34,7 @@ Let's define some variables that we need for further pre-processing steps and tr

 ```bash
 export MAX_LENGTH=128
-export BERT_MODEL=bert-base-multilingual-cased
+export BERT_MODEL=google-bert/bert-base-multilingual-cased
 ```

 Run the pre-processing script on training, dev and test datasets:
@@ -92,7 +92,7 @@ Instead of passing all parameters via commandline arguments, the `run_ner.py` sc
 {
    "data_dir": ".",
    "labels": "./labels.txt",
-    "model_name_or_path": "bert-base-multilingual-cased",
+    "model_name_or_path": "google-bert/bert-base-multilingual-cased",
    "output_dir": "germeval-model",
    "max_seq_length": 128,
    "num_train_epochs": 3,
@@ -222,7 +222,7 @@ Let's define some variables that we need for further pre-processing steps:

 ```bash
 export MAX_LENGTH=128
-export BERT_MODEL=bert-large-cased
+export BERT_MODEL=google-bert/bert-large-cased
 ```

 Here we use the English BERT large model for fine-tuning.
@@ -250,7 +250,7 @@ This configuration file looks like:
 {
    "data_dir": "./data_wnut_17",
    "labels": "./data_wnut_17/labels.txt",
-    "model_name_or_path": "bert-large-cased",
+    "model_name_or_path": "google-bert/bert-large-cased",
    "output_dir": "wnut-17-model-1",
    "max_seq_length": 128,
    "num_train_epochs": 3,
--- a/examples/legacy/token-classification/utils_ner.py
+++ b/examples/legacy/token-classification/utils_ner.py
@@ -113,7 +113,7 @@ class TokenClassificationTask:
            for word, label in zip(example.words, example.labels):
                word_tokens = tokenizer.tokenize(word)

-                # bert-base-multilingual-cased sometimes output "nothing ([]) when calling tokenize with just a space.
+                # google-bert/bert-base-multilingual-cased sometimes output "nothing ([]) when calling tokenize with just a space.
                if len(word_tokens) > 0:
                    tokens.extend(word_tokens)
                    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
--- a/examples/pytorch/README.md
+++ b/examples/pytorch/README.md
@@ -109,7 +109,7 @@ classification MNLI task using the `run_glue` script, with 8 GPUs:
 ```bash
 torchrun \
    --nproc_per_node 8 pytorch/text-classification/run_glue.py \
-    --model_name_or_path bert-large-uncased-whole-word-masking \
+    --model_name_or_path google-bert/bert-large-uncased-whole-word-masking \
    --task_name mnli \
    --do_train \
    --do_eval \
@@ -153,7 +153,7 @@ classification MNLI task using the `run_glue` script, with 8 TPUs (from this fol
 ```bash
 python xla_spawn.py --num_cores 8 \
    text-classification/run_glue.py \
-    --model_name_or_path bert-large-uncased-whole-word-masking \
+    --model_name_or_path google-bert/bert-large-uncased-whole-word-masking \
    --task_name mnli \
    --do_train \
    --do_eval \
--- a/examples/pytorch/contrastive-image-text/README.md
+++ b/examples/pytorch/contrastive-image-text/README.md
@@ -64,10 +64,10 @@ from transformers import (
 )

 model = VisionTextDualEncoderModel.from_vision_text_pretrained(
-    "openai/clip-vit-base-patch32", "roberta-base"
+    "openai/clip-vit-base-patch32", "FacebookAI/roberta-base"
 )

-tokenizer = AutoTokenizer.from_pretrained("roberta-base")
+tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
 image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
 processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

--- a/examples/pytorch/language-modeling/README.md
+++ b/examples/pytorch/language-modeling/README.md
@@ -36,7 +36,7 @@ the tokenization). The loss here is that of causal language modeling.

 ```bash
 python run_clm.py \
-    --model_name_or_path gpt2 \
+    --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
@@ -53,7 +53,7 @@ To run on your own training and validation files, use the following command:

 ```bash
 python run_clm.py \
-    --model_name_or_path gpt2 \
+    --model_name_or_path openai-community/gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
@@ -69,7 +69,7 @@ This uses the built in HuggingFace `Trainer` for training. If you want to use a
 python run_clm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
-    --model_name_or_path gpt2 \
+    --model_name_or_path openai-community/gpt2 \
    --output_dir /tmp/test-clm
 ```

@@ -84,7 +84,7 @@ converge slightly slower (over-fitting takes more epochs).

 ```bash
 python run_mlm.py \
-    --model_name_or_path roberta-base \
+    --model_name_or_path FacebookAI/roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
@@ -98,7 +98,7 @@ To run on your own training and validation files, use the following command:

 ```bash
 python run_mlm.py \
-    --model_name_or_path roberta-base \
+    --model_name_or_path FacebookAI/roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
@@ -117,7 +117,7 @@ This uses the built in HuggingFace `Trainer` for training. If you want to use a
 python run_mlm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
-    --model_name_or_path roberta-base \
+    --model_name_or_path FacebookAI/roberta-base \
    --output_dir /tmp/test-mlm
 ```

@@ -144,7 +144,7 @@ Here is how to fine-tune XLNet on wikitext-2:

 ```bash
 python run_plm.py \
-    --model_name_or_path=xlnet-base-cased \
+    --model_name_or_path=xlnet/xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
@@ -158,7 +158,7 @@ To fine-tune it on your own training and validation file, run:

 ```bash
 python run_plm.py \
-    --model_name_or_path=xlnet-base-cased \
+    --model_name_or_path=xlnet/xlnet-base-cased \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
@@ -188,7 +188,7 @@ When training a model from scratch, configuration values may be overridden with


 ```bash
-python run_clm.py --model_type gpt2 --tokenizer_name gpt2 \ --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=102" \
+python run_clm.py --model_type openai-community/gpt2 --tokenizer_name openai-community/gpt2 \ --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=102" \
 [...]
 ```

--- a/examples/pytorch/multiple-choice/README.md
+++ b/examples/pytorch/multiple-choice/README.md
@@ -22,7 +22,7 @@ limitations under the License.

 ```bash
 python examples/multiple-choice/run_swag.py \
--model_name_or_path roberta-base \
+--model_name_or_path FacebookAI/roberta-base \
 --do_train \
 --do_eval \
 --learning_rate 5e-5 \
@@ -62,7 +62,7 @@ then
 export DATASET_NAME=swag

 python run_swag_no_trainer.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --dataset_name $DATASET_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
@@ -89,7 +89,7 @@ that will check everything is ready for training. Finally, you can launch traini
 export DATASET_NAME=swag

 accelerate launch run_swag_no_trainer.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --dataset_name $DATASET_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
--- a/examples/pytorch/old_test_xla_examples.py
+++ b/examples/pytorch/old_test_xla_examples.py
@@ -54,7 +54,7 @@ class TorchXLAExamplesTests(TestCasePlus):
            ./examples/pytorch/text-classification/run_glue.py
            --num_cores=8
            ./examples/pytorch/text-classification/run_glue.py
-            --model_name_or_path distilbert-base-uncased
+            --model_name_or_path distilbert/distilbert-base-uncased
            --output_dir {tmp_dir}
            --overwrite_output_dir
            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
--- a/examples/pytorch/question-answering/README.md
+++ b/examples/pytorch/question-answering/README.md
@@ -40,7 +40,7 @@ on a single tesla V100 16GB.

 ```bash
 python run_qa.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
@@ -67,7 +67,7 @@ The [`run_qa_beam_search.py`](https://github.com/huggingface/transformers/blob/m

 ```bash
 python run_qa_beam_search.py \
-    --model_name_or_path xlnet-large-cased \
+    --model_name_or_path xlnet/xlnet-large-cased \
    --dataset_name squad \
    --do_train \
    --do_eval \
@@ -87,7 +87,7 @@ python run_qa_beam_search.py \
 export SQUAD_DIR=/path/to/SQUAD

 python run_qa_beam_search.py \
-    --model_name_or_path xlnet-large-cased \
+    --model_name_or_path xlnet/xlnet-large-cased \
    --dataset_name squad_v2 \
    --do_train \
    --do_eval \
@@ -111,7 +111,7 @@ This example code fine-tunes T5 on the SQuAD2.0 dataset.

 ```bash
 python run_seq2seq_qa.py \
-  --model_name_or_path t5-small \
+  --model_name_or_path google-t5/t5-small \
  --dataset_name squad_v2 \
  --context_column context \
  --question_column question \
@@ -143,7 +143,7 @@ then

 ```bash
 python run_qa_no_trainer.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --max_seq_length 384 \
  --doc_stride 128 \
@@ -166,7 +166,7 @@ that will check everything is ready for training. Finally, you can launch traini

 ```bash
 accelerate launch run_qa_no_trainer.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --max_seq_length 384 \
  --doc_stride 128 \
--- a/examples/pytorch/summarization/README.md
+++ b/examples/pytorch/summarization/README.md
@@ -41,7 +41,7 @@ and you also will find examples of these below.
 Here is an example on a summarization task:
 ```bash
 python examples/pytorch/summarization/run_summarization.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
@@ -54,9 +54,9 @@ python examples/pytorch/summarization/run_summarization.py \
    --predict_with_generate
 ```

-Only T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "summarize: "`.
+Only T5 models `google-t5/t5-small`, `google-t5/t5-base`, `google-t5/t5-large`, `google-t5/t5-3b` and `google-t5/t5-11b` must use an additional argument: `--source_prefix "summarize: "`.

-We used CNN/DailyMail dataset in this example as `t5-small` was trained on it and one can get good scores even when pre-training with a very small sample.
+We used CNN/DailyMail dataset in this example as `google-t5/t5-small` was trained on it and one can get good scores even when pre-training with a very small sample.

 Extreme Summarization (XSum) Dataset is another commonly used dataset for the task of summarization. To use it replace `--dataset_name cnn_dailymail --dataset_config "3.0.0"` with  `--dataset_name xsum`.

@@ -65,7 +65,7 @@ And here is how you would use it on your own files, after adjusting the values f

 ```bash
 python examples/pytorch/summarization/run_summarization.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --do_train \
    --do_eval \
    --train_file path_to_csv_or_jsonlines_file \
@@ -156,7 +156,7 @@ then

 ```bash
 python run_summarization_no_trainer.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
@@ -179,7 +179,7 @@ that will check everything is ready for training. Finally, you can launch traini

 ```bash
 accelerate launch run_summarization_no_trainer.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
--- a/examples/pytorch/summarization/run_summarization.py
+++ b/examples/pytorch/summarization/run_summarization.py
@@ -368,11 +368,11 @@ def main():
    logger.info(f"Training/evaluation parameters {training_args}")

    if data_args.source_prefix is None and model_args.model_name_or_path in [
-        "t5-small",
-        "t5-base",
-        "t5-large",
-        "t5-3b",
-        "t5-11b",
+        "google-t5/t5-small",
+        "google-t5/t5-base",
+        "google-t5/t5-large",
+        "google-t5/t5-3b",
+        "google-t5/t5-11b",
    ]:
        logger.warning(
            "You're running a t5 model but didn't provide a source prefix, which is the expected, e.g. with "
--- a/examples/pytorch/summarization/run_summarization_no_trainer.py
+++ b/examples/pytorch/summarization/run_summarization_no_trainer.py
@@ -339,11 +339,11 @@ def main():

    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs)
    if args.source_prefix is None and args.model_name_or_path in [
-        "t5-small",
-        "t5-base",
-        "t5-large",
-        "t5-3b",
-        "t5-11b",
+        "google-t5/t5-small",
+        "google-t5/t5-base",
+        "google-t5/t5-large",
+        "google-t5/t5-3b",
+        "google-t5/t5-11b",
    ]:
        logger.warning(
            "You're running a t5 model but didn't provide a source prefix, which is the expected, e.g. with "
--- a/examples/pytorch/test_accelerate_examples.py
+++ b/examples/pytorch/test_accelerate_examples.py
@@ -80,7 +80,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/text-classification/run_glue_no_trainer.py
-            --model_name_or_path distilbert-base-uncased
+            --model_name_or_path distilbert/distilbert-base-uncased
            --output_dir {tmp_dir}
            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
            --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv
@@ -105,7 +105,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/language-modeling/run_clm_no_trainer.py
-            --model_name_or_path distilgpt2
+            --model_name_or_path distilbert/distilgpt2
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --block_size 128
@@ -133,7 +133,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/language-modeling/run_mlm_no_trainer.py
-            --model_name_or_path distilroberta-base
+            --model_name_or_path distilbert/distilroberta-base
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --output_dir {tmp_dir}
@@ -156,7 +156,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/token-classification/run_ner_no_trainer.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --train_file tests/fixtures/tests_samples/conll/sample.json
            --validation_file tests/fixtures/tests_samples/conll/sample.json
            --output_dir {tmp_dir}
@@ -181,7 +181,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/question-answering/run_qa_no_trainer.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --version_2_with_negative
            --train_file tests/fixtures/tests_samples/SQUAD/sample.json
            --validation_file tests/fixtures/tests_samples/SQUAD/sample.json
@@ -209,7 +209,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/multiple-choice/run_swag_no_trainer.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --train_file tests/fixtures/tests_samples/swag/sample.json
            --validation_file tests/fixtures/tests_samples/swag/sample.json
            --output_dir {tmp_dir}
@@ -232,7 +232,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            {self.examples_dir}/pytorch/summarization/run_summarization_no_trainer.py
-            --model_name_or_path t5-small
+            --model_name_or_path google-t5/t5-small
            --train_file tests/fixtures/tests_samples/xsum/sample.json
            --validation_file tests/fixtures/tests_samples/xsum/sample.json
            --output_dir {tmp_dir}
--- a/examples/pytorch/test_pytorch_examples.py
+++ b/examples/pytorch/test_pytorch_examples.py
@@ -99,7 +99,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_glue.py
-            --model_name_or_path distilbert-base-uncased
+            --model_name_or_path distilbert/distilbert-base-uncased
            --output_dir {tmp_dir}
            --overwrite_output_dir
            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
@@ -127,7 +127,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_clm.py
-            --model_name_or_path distilgpt2
+            --model_name_or_path distilbert/distilgpt2
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --do_train
@@ -160,7 +160,7 @@ class ExamplesTests(TestCasePlus):
        testargs = f"""
            run_clm.py
            --model_type gpt2
-            --tokenizer_name gpt2
+            --tokenizer_name openai-community/gpt2
            --train_file ./tests/fixtures/sample_text.txt
            --output_dir {tmp_dir}
            --config_overrides n_embd=10,n_head=2
@@ -181,7 +181,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_mlm.py
-            --model_name_or_path distilroberta-base
+            --model_name_or_path distilbert/distilroberta-base
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --output_dir {tmp_dir}
@@ -207,7 +207,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_ner.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --train_file tests/fixtures/tests_samples/conll/sample.json
            --validation_file tests/fixtures/tests_samples/conll/sample.json
            --output_dir {tmp_dir}
@@ -235,7 +235,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_qa.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --version_2_with_negative
            --train_file tests/fixtures/tests_samples/SQUAD/sample.json
            --validation_file tests/fixtures/tests_samples/SQUAD/sample.json
@@ -260,7 +260,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_seq2seq_qa.py
-            --model_name_or_path t5-small
+            --model_name_or_path google-t5/t5-small
            --context_column context
            --question_column question
            --answer_column answers
@@ -289,7 +289,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_swag.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --train_file tests/fixtures/tests_samples/swag/sample.json
            --validation_file tests/fixtures/tests_samples/swag/sample.json
            --output_dir {tmp_dir}
@@ -327,7 +327,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_summarization.py
-            --model_name_or_path t5-small
+            --model_name_or_path google-t5/t5-small
            --train_file tests/fixtures/tests_samples/xsum/sample.json
            --validation_file tests/fixtures/tests_samples/xsum/sample.json
            --output_dir {tmp_dir}
--- a/examples/pytorch/text-classification/README.md
+++ b/examples/pytorch/text-classification/README.md
@@ -31,7 +31,7 @@ GLUE is made up of a total of 9 different tasks. Here is how to run the script o
 export TASK_NAME=mrpc

 python run_glue.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
@@ -68,7 +68,7 @@ The following example fine-tunes BERT on the `imdb` dataset hosted on our [hub](

 ```bash
 python run_glue.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --dataset_name imdb  \
  --do_train \
  --do_predict \
@@ -90,7 +90,7 @@ We can specify the metric, the label column and aso choose which text columns to
 dataset="amazon_reviews_multi"
 subset="en"
 python run_classification.py \
-    --model_name_or_path  bert-base-uncased \
+    --model_name_or_path  google-bert/bert-base-uncased \
    --dataset_name ${dataset} \
    --dataset_config_name ${subset} \
    --shuffle_train_dataset \
@@ -113,7 +113,7 @@ The following is a multi-label classification example. It fine-tunes BERT on the
 dataset="reuters21578"
 subset="ModApte"
 python run_classification.py \
-    --model_name_or_path bert-base-uncased \
+    --model_name_or_path google-bert/bert-base-uncased \
    --dataset_name ${dataset} \
    --dataset_config_name ${subset} \
    --shuffle_train_dataset \
@@ -175,7 +175,7 @@ then
 export TASK_NAME=mrpc

 python run_glue_no_trainer.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --task_name $TASK_NAME \
  --max_length 128 \
  --per_device_train_batch_size 32 \
@@ -202,7 +202,7 @@ that will check everything is ready for training. Finally, you can launch traini
 export TASK_NAME=mrpc

 accelerate launch run_glue_no_trainer.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --task_name $TASK_NAME \
  --max_length 128 \
  --per_device_train_batch_size 32 \
@@ -232,7 +232,7 @@ This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It

 ```bash
 python run_xnli.py \
-  --model_name_or_path bert-base-multilingual-cased \
+  --model_name_or_path google-bert/bert-base-multilingual-cased \
  --language de \
  --train_language en \
  --do_train \
--- a/examples/pytorch/text-generation/README.md
+++ b/examples/pytorch/text-generation/README.md
@@ -26,6 +26,6 @@ Example usage:

 ```bash
 python run_generation.py \
-    --model_type=gpt2 \
-    --model_name_or_path=gpt2
+    --model_type=openai-community/gpt2 \
+    --model_name_or_path=openai-community/gpt2
 ```
--- a/examples/pytorch/text-generation/run_generation_contrastive_search.py
+++ b/examples/pytorch/text-generation/run_generation_contrastive_search.py
@@ -16,7 +16,7 @@
 """ The examples of running contrastive search on the auto-APIs;

 Running this example:
-python run_generation_contrastive_search.py --model_name_or_path=gpt2-large --penalty_alpha=0.6 --k=4 --length=256
+python run_generation_contrastive_search.py --model_name_or_path=openai-community/gpt2-large --penalty_alpha=0.6 --k=4 --length=256
 """


--- a/examples/pytorch/token-classification/README.md
+++ b/examples/pytorch/token-classification/README.md
@@ -29,7 +29,7 @@ The following example fine-tunes BERT on CoNLL-2003:

 ```bash
 python run_ner.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner \
  --do_train \
@@ -42,7 +42,7 @@ To run on your own training and validation files, use the following command:

 ```bash
 python run_ner.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
@@ -84,7 +84,7 @@ then
 export TASK_NAME=ner

 python run_ner_no_trainer.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --dataset_name conll2003 \
  --task_name $TASK_NAME \
  --max_length 128 \
@@ -112,7 +112,7 @@ that will check everything is ready for training. Finally, you can launch traini
 export TASK_NAME=ner

 accelerate launch run_ner_no_trainer.py \
-  --model_name_or_path bert-base-cased \
+  --model_name_or_path google-bert/bert-base-cased \
  --dataset_name conll2003 \
  --task_name $TASK_NAME \
  --max_length 128 \
--- a/examples/pytorch/translation/README.md
+++ b/examples/pytorch/translation/README.md
@@ -59,11 +59,11 @@ python examples/pytorch/translation/run_translation.py \

 MBart and some T5 models require special handling.

-T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:
+T5 models `google-t5/t5-small`, `google-t5/t5-base`, `google-t5/t5-large`, `google-t5/t5-3b` and `google-t5/t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:

 ```bash
 python examples/pytorch/translation/run_translation.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
@@ -105,7 +105,7 @@ values for the arguments `--train_file`, `--validation_file` to match your setup

 ```bash
 python examples/pytorch/translation/run_translation.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
@@ -134,7 +134,7 @@ If you want to use a pre-processed dataset that leads to high BLEU scores, but f

 ```bash
 python examples/pytorch/translation/run_translation.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
--- a/examples/pytorch/translation/run_translation.py
+++ b/examples/pytorch/translation/run_translation.py
@@ -317,11 +317,11 @@ def main():
    logger.info(f"Training/evaluation parameters {training_args}")

    if data_args.source_prefix is None and model_args.model_name_or_path in [
-        "t5-small",
-        "t5-base",
-        "t5-large",
-        "t5-3b",
-        "t5-11b",
+        "google-t5/t5-small",
+        "google-t5/t5-base",
+        "google-t5/t5-large",
+        "google-t5/t5-3b",
+        "google-t5/t5-11b",
    ]:
        logger.warning(
            "You're running a t5 model but didn't provide a source prefix, which is expected, e.g. with "
--- a/examples/research_projects/bert-loses-patience/README.md
+++ b/examples/research_projects/bert-loses-patience/README.md
@@ -15,7 +15,7 @@ export TASK_NAME=MRPC

 python ./run_glue_with_pabee.py \
  --model_type albert \
-  --model_name_or_path bert-base-uncased/albert-base-v2 \
+  --model_name_or_path google-bert/bert-base-uncased/albert/albert-base-v2 \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
--- a/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py
+++ b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py
@@ -276,8 +276,8 @@ class AlbertForSequenceClassificationWithPabee(AlbertPreTrainedModel):
                from torch import nn
                import torch

-                tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
-                model = AlbertForSequenceClassificationWithPabee.from_pretrained('albert-base-v2')
+                tokenizer = AlbertTokenizer.from_pretrained('albert/albert-base-v2')
+                model = AlbertForSequenceClassificationWithPabee.from_pretrained('albert/albert-base-v2')
                input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
                labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
                outputs = model(input_ids, labels=labels)
--- a/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py
+++ b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py
@@ -300,8 +300,8 @@ class BertForSequenceClassificationWithPabee(BertPreTrainedModel):
            from torch import nn
            import torch

-            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-            model = BertForSequenceClassificationWithPabee.from_pretrained('bert-base-uncased')
+            tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')
+            model = BertForSequenceClassificationWithPabee.from_pretrained('google-bert/bert-base-uncased')

            input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
            labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
--- a/examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py
+++ b/examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py
@@ -29,7 +29,7 @@ class PabeeTests(TestCasePlus):
        testargs = f"""
            run_glue_with_pabee.py
            --model_type albert
-            --model_name_or_path albert-base-v2
+            --model_name_or_path albert/albert-base-v2
            --data_dir ./tests/fixtures/tests_samples/MRPC/
            --output_dir {tmp_dir}
            --overwrite_output_dir
--- a/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py
+++ b/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py
@@ -107,7 +107,7 @@ def convert_bertabs_checkpoints(path_to_checkpoints, dump_path):
    # ----------------------------------

    logging.info("Make sure that the models' outputs are identical")
-    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

    # prepare the model inputs
    encoder_input_ids = tokenizer.encode("This is sample éàalj'-.")
--- a/examples/research_projects/bertabs/modeling_bertabs.py
+++ b/examples/research_projects/bertabs/modeling_bertabs.py
@@ -128,7 +128,7 @@ class Bert(nn.Module):

    def __init__(self):
        super().__init__()
-        config = BertConfig.from_pretrained("bert-base-uncased")
+        config = BertConfig.from_pretrained("google-bert/bert-base-uncased")
        self.model = BertModel(config)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, **kwargs):
--- a/examples/research_projects/bertabs/run_summarization.py
+++ b/examples/research_projects/bertabs/run_summarization.py
@@ -29,7 +29,7 @@ Batch = namedtuple("Batch", ["document_names", "batch_size", "src", "segs", "mas


 def evaluate(args):
-    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
+    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased", do_lower_case=True)
    model = BertAbs.from_pretrained("remi/bertabs-finetuned-extractive-abstractive-summarization")
    model.to(args.device)
    model.eval()
--- a/examples/research_projects/codeparrot/README.md
+++ b/examples/research_projects/codeparrot/README.md
@@ -79,7 +79,7 @@ python scripts/pretokenizing.py \
 Before training a new model for code we create a new tokenizer that is efficient at code tokenization. To train the tokenizer you can run the following command: 
 ```bash
 python scripts/bpe_training.py \
-    --base_tokenizer gpt2 \
+    --base_tokenizer openai-community/gpt2 \
    --dataset_name codeparrot/codeparrot-clean-train
 ```

@@ -90,12 +90,12 @@ The models are randomly initialized and trained from scratch. To initialize a ne

 ```bash
 python scripts/initialize_model.py \
--config_name gpt2-large \
+--config_name openai-community/gpt2-large \
 --tokenizer_name codeparrot/codeparrot \
 --model_name codeparrot \
 --push_to_hub True
 ```
-This will initialize a new model with the architecture and configuration of `gpt2-large` and use the tokenizer to appropriately size the input embeddings. Finally, the initilaized model is pushed the hub.
+This will initialize a new model with the architecture and configuration of `openai-community/gpt2-large` and use the tokenizer to appropriately size the input embeddings. Finally, the initilaized model is pushed the hub.

 We can either pass the name of a text dataset or a pretokenized dataset which speeds up training a bit.
 Now that the tokenizer and model are also ready we can start training the model. The main training script is built with `accelerate` to scale across a wide range of platforms and infrastructure scales. We train two models with [110M](https://huggingface.co/codeparrot/codeparrot-small/) and [1.5B](https://huggingface.co/codeparrot/codeparrot/) parameters for 25-30B tokens on a 16xA100 (40GB) machine which takes 1 day and 1 week, respectively.
--- a/examples/research_projects/codeparrot/scripts/arguments.py
+++ b/examples/research_projects/codeparrot/scripts/arguments.py
@@ -172,7 +172,7 @@ class TokenizerTrainingArguments:
    """

    base_tokenizer: Optional[str] = field(
-        default="gpt2", metadata={"help": "Base tokenizer to build new tokenizer from."}
+        default="openai-community/gpt2", metadata={"help": "Base tokenizer to build new tokenizer from."}
    )
    dataset_name: Optional[str] = field(
        default="transformersbook/codeparrot-train", metadata={"help": "Dataset to train tokenizer on."}
@@ -211,7 +211,7 @@ class InitializationArguments:
    """

    config_name: Optional[str] = field(
-        default="gpt2-large", metadata={"help": "Configuration to use for model initialization."}
+        default="openai-community/gpt2-large", metadata={"help": "Configuration to use for model initialization."}
    )
    tokenizer_name: Optional[str] = field(
        default="codeparrot/codeparrot", metadata={"help": "Tokenizer attached to model."}
--- a/examples/research_projects/deebert/test_glue_deebert.py
+++ b/examples/research_projects/deebert/test_glue_deebert.py
@@ -48,7 +48,7 @@ class DeeBertTests(TestCasePlus):
    def test_glue_deebert_train(self):
        train_args = """
            --model_type roberta
-            --model_name_or_path roberta-base
+            --model_name_or_path FacebookAI/roberta-base
            --task_name MRPC
            --do_train
            --do_eval
@@ -61,7 +61,7 @@ class DeeBertTests(TestCasePlus):
            --num_train_epochs 3
            --overwrite_output_dir
            --seed 42
-            --output_dir ./examples/deebert/saved_models/roberta-base/MRPC/two_stage
+            --output_dir ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage
            --plot_data_dir ./examples/deebert/results/
            --save_steps 0
            --overwrite_cache
@@ -71,12 +71,12 @@ class DeeBertTests(TestCasePlus):

        eval_args = """
            --model_type roberta
-            --model_name_or_path ./examples/deebert/saved_models/roberta-base/MRPC/two_stage
+            --model_name_or_path ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage
            --task_name MRPC
            --do_eval
            --do_lower_case
            --data_dir ./tests/fixtures/tests_samples/MRPC/
-            --output_dir ./examples/deebert/saved_models/roberta-base/MRPC/two_stage
+            --output_dir ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage
            --plot_data_dir ./examples/deebert/results/
            --max_seq_length 128
            --eval_each_highway
@@ -88,12 +88,12 @@ class DeeBertTests(TestCasePlus):

        entropy_eval_args = """
            --model_type roberta
-            --model_name_or_path ./examples/deebert/saved_models/roberta-base/MRPC/two_stage
+            --model_name_or_path ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage
            --task_name MRPC
            --do_eval
            --do_lower_case
            --data_dir ./tests/fixtures/tests_samples/MRPC/
-            --output_dir ./examples/deebert/saved_models/roberta-base/MRPC/two_stage
+            --output_dir ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage
            --plot_data_dir ./examples/deebert/results/
            --max_seq_length 128
            --early_exit_entropy 0.1
--- a/examples/research_projects/information-gain-filtration/README.md
+++ b/examples/research_projects/information-gain-filtration/README.md
@@ -64,7 +64,7 @@ To fine-tune a transformer model with IGF on a language modeling task, use the f
  
 ```python
 python run_clm_igf.py\
--model_name_or_path "gpt2" \
+--model_name_or_path "openai-community/gpt2" \
 --data_file="data/tokenized_stories_train_wikitext103" \
 --igf_data_file="data/IGF_values" \
 --context_len 32 \
--- a/examples/research_projects/information-gain-filtration/igf/igf.py
+++ b/examples/research_projects/information-gain-filtration/igf/igf.py
@@ -69,9 +69,9 @@ def compute_perplexity(model, test_data, context_len):
    return perplexity


-def load_gpt2(model_name="gpt2"):
+def load_gpt2(model_name="openai-community/gpt2"):
    """
-    load original gpt2 and save off for quicker loading
+    load original openai-community/gpt2 and save off for quicker loading

    Args:
        model_name: GPT-2
--- a/examples/research_projects/information-gain-filtration/run_clm_igf.py
+++ b/examples/research_projects/information-gain-filtration/run_clm_igf.py
@@ -84,7 +84,7 @@ def generate_n_pairs(
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # load pretrained model
-    model = load_gpt2("gpt2").to(device)
+    model = load_gpt2("openai-community/gpt2").to(device)
    print("computing perplexity on objective set")
    orig_perp = compute_perplexity(model, objective_set, context_len).item()
    print("perplexity on objective set:", orig_perp)
@@ -121,7 +121,7 @@ def training_secondary_learner(
    set_seed(42)

    # Load pre-trained model
-    model = GPT2LMHeadModel.from_pretrained("gpt2")
+    model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

    # Initialize secondary learner to use embedding weights of model
    secondary_learner = SecondaryLearner(model)
@@ -153,7 +153,7 @@ def finetune(
    recopy_model=recopy_gpt2,
    secondary_learner=None,
    eval_interval=10,
-    finetuned_model_name="gpt2_finetuned.pt",
+    finetuned_model_name="openai-community/gpt2_finetuned.pt",
 ):
    """
    fine-tune with IGF if secondary_learner is not None, else standard fine-tuning
@@ -346,7 +346,10 @@ def main():
    )

    parser.add_argument(
-        "--batch_size", default=16, type=int, help="batch size of training data of language model(gpt2) "
+        "--batch_size",
+        default=16,
+        type=int,
+        help="batch size of training data of language model(openai-community/gpt2) ",
    )

    parser.add_argument(
@@ -383,7 +386,9 @@ def main():
        ),
    )

-    parser.add_argument("--finetuned_model_name", default="gpt2_finetuned.pt", type=str, help="finetuned_model_name")
+    parser.add_argument(
+        "--finetuned_model_name", default="openai-community/gpt2_finetuned.pt", type=str, help="finetuned_model_name"
+    )

    parser.add_argument(
        "--recopy_model",
@@ -416,16 +421,16 @@ def main():
        igf_model_path="igf_model.pt",
    )

-    # load pretrained gpt2 model
-    model = GPT2LMHeadModel.from_pretrained("gpt2")
+    # load pretrained openai-community/gpt2 model
+    model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
    set_seed(42)

-    # Generate train and test data to train and evaluate gpt2 model
+    # Generate train and test data to train and evaluate openai-community/gpt2 model
    train_dataset, test_dataset = generate_datasets(
        context_len=32, file="data/tokenized_stories_train_wikitext103.jbl", number=100, min_len=1026, trim=True
    )

-    # fine-tuning of the gpt2 model using igf (Information Gain Filtration)
+    # fine-tuning of the openai-community/gpt2 model using igf (Information Gain Filtration)
    finetune(
        model,
        train_dataset,
@@ -437,7 +442,7 @@ def main():
        recopy_model=recopy_gpt2,
        secondary_learner=secondary_learner,
        eval_interval=10,
-        finetuned_model_name="gpt2_finetuned.pt",
+        finetuned_model_name="openai-community/gpt2_finetuned.pt",
    )


--- a/examples/research_projects/jax-projects/README.md
+++ b/examples/research_projects/jax-projects/README.md
@@ -159,13 +159,13 @@ to be used, but that everybody in team is on the same page on what type of model
 To give an example, a well-defined project would be the following:

 - task: summarization
- model: [t5-small](https://huggingface.co/t5-small)
+- model: [google-t5/t5-small](https://huggingface.co/google-t5/t5-small)
 - dataset: [CNN/Daily mail](https://huggingface.co/datasets/cnn_dailymail)
 - training script: [run_summarization_flax.py](https://github.com/huggingface/transformers/blob/main/examples/flax/summarization/run_summarization_flax.py)
 - outcome: t5 model that can summarize news
- work flow: adapt `run_summarization_flax.py` to work with `t5-small`.
+- work flow: adapt `run_summarization_flax.py` to work with `google-t5/t5-small`.

-This example is a very easy and not the most interesting project since a `t5-small`
+This example is a very easy and not the most interesting project since a `google-t5/t5-small`
 summarization model exists already for CNN/Daily mail and pretty much no code has to be 
 written. 
 A well-defined project does not need to have the dataset be part of 
@@ -335,7 +335,7 @@ dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', str

 dummy_input = next(iter(dataset))["text"]

-tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
+tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base")
 input_ids = tokenizer(dummy_input, return_tensors="np").input_ids[:, :10]

 model = FlaxRobertaModel.from_pretrained("julien-c/dummy-unknown")
@@ -492,7 +492,7 @@ dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', str

 dummy_input = next(iter(dataset))["text"]

-tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
+tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base")
 input_ids = tokenizer(dummy_input, return_tensors="np").input_ids[:, :10]

 model = FlaxRobertaModel.from_pretrained("julien-c/dummy-unknown")
@@ -518,7 +518,7 @@ be available in a couple of days.
 - [BigBird](https://github.com/huggingface/transformers/blob/main/src/transformers/models/big_bird/modeling_flax_big_bird.py)
 - [CLIP](https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_flax_clip.py)
 - [ELECTRA](https://github.com/huggingface/transformers/blob/main/src/transformers/models/electra/modeling_flax_electra.py)
- [GPT2](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_flax_gpt2.py)
+- [GPT2](https://github.com/huggingface/transformers/blob/main/src/transformers/models/openai-community/gpt2/modeling_flax_gpt2.py)
 - [(TODO) MBART](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mbart/modeling_flax_mbart.py)
 - [RoBERTa](https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_flax_roberta.py)
 - [T5](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_flax_t5.py)
@@ -729,7 +729,7 @@ Let's use the base `FlaxRobertaModel` without any heads as an example.
 from transformers import FlaxRobertaModel, RobertaTokenizerFast
 import jax

-tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
+tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base")
 inputs = tokenizer("JAX/Flax is amazing ", padding="max_length", max_length=128, return_tensors="np")

 model = FlaxRobertaModel.from_pretrained("julien-c/dummy-unknown")
@@ -1011,7 +1011,7 @@ and run the following commands in a Python shell to save a config.
 ```python
 from transformers import RobertaConfig

-config = RobertaConfig.from_pretrained("roberta-base")
+config = RobertaConfig.from_pretrained("FacebookAI/roberta-base")
 config.save_pretrained("./")
 ```

@@ -1193,12 +1193,12 @@ All the widgets are open sourced in the `huggingface_hub` [repo](https://github.
 **NLP**
 * **Conversational:** To have the best conversations!. [Example](https://huggingface.co/microsoft/DialoGPT-large?).
 * **Feature Extraction:** Retrieve the input embeddings. [Example](https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens?text=test).
-* **Fill Mask:** Predict potential words for a mask token. [Example](https://huggingface.co/bert-base-uncased?).
-* **Question Answering:** Given a context and a question, predict the answer. [Example](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad).
+* **Fill Mask:** Predict potential words for a mask token. [Example](https://huggingface.co/google-bert/bert-base-uncased?).
+* **Question Answering:** Given a context and a question, predict the answer. [Example](https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad).
 * **Sentence Simmilarity:** Predict how similar a set of sentences are. Useful for Sentence Transformers.
 * **Summarization:** Given a text, output a summary of it. [Example](https://huggingface.co/sshleifer/distilbart-cnn-12-6).
 * **Table Question Answering:** Given a table and a question, predict the answer. [Example](https://huggingface.co/google/tapas-base-finetuned-wtq).
-* **Text Generation:** Generate text based on a prompt. [Example](https://huggingface.co/gpt2)
+* **Text Generation:** Generate text based on a prompt. [Example](https://huggingface.co/openai-community/gpt2)
 * **Token Classification:** Useful for tasks such as Named Entity Recognition and Part of Speech. [Example](https://huggingface.co/dslim/bert-base-NER).
 * **Zero-Shot Classification:** Too cool to explain with words. Here is an [example](https://huggingface.co/typeform/distilbert-base-uncased-mnli)
 * ([WIP](https://github.com/huggingface/huggingface_hub/issues/99)) **Table to Text Generation**.
--- a/examples/research_projects/jax-projects/dataset-streaming/README.md
+++ b/examples/research_projects/jax-projects/dataset-streaming/README.md
@@ -31,7 +31,7 @@ without ever having to download the full dataset.
 In the following, we demonstrate how to train a bi-directional transformer model 
 using masked language modeling objective as introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
 More specifically, we demonstrate how JAX/Flax and dataset streaming can be leveraged 
-to pre-train [**`roberta-base`**](https://huggingface.co/roberta-base)
+to pre-train [**`FacebookAI/roberta-base`**](https://huggingface.co/FacebookAI/roberta-base)
 in English on a single TPUv3-8 pod for 10000 update steps.

 The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
@@ -80,8 +80,8 @@ from transformers import RobertaTokenizerFast, RobertaConfig

 model_dir = "./english-roberta-base-dummy"

-tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
-config = RobertaConfig.from_pretrained("roberta-base")
+tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base")
+config = RobertaConfig.from_pretrained("FacebookAI/roberta-base")

 tokenizer.save_pretrained(model_dir)
 config.save_pretrained(model_dir)
--- a/examples/research_projects/jax-projects/hybrid_clip/README.md
+++ b/examples/research_projects/jax-projects/hybrid_clip/README.md
@@ -32,7 +32,7 @@ Models written in JAX/Flax are **immutable** and updated in a purely functional
 way which enables simple and efficient model parallelism.

 In this example we will use the vision model from [CLIP](https://huggingface.co/models?filter=clip)
-as the image encoder and [`roberta-base`](https://huggingface.co/roberta-base) as the text encoder.
+as the image encoder and [`FacebookAI/roberta-base`](https://huggingface.co/FacebookAI/roberta-base) as the text encoder.
 Note that one can also use the [ViT](https://huggingface.co/models?filter=vit) model as image encoder and any other BERT or ROBERTa model as text encoder.
 To train the model on languages other than English one should choose a text encoder trained on the desired
 language and a image-text dataset in that language. One such dataset is [WIT](https://github.com/google-research-datasets/wit).	
@@ -76,7 +76,7 @@ Here is an example of how to load the model using pre-trained text and vision mo
 ```python
 from modeling_hybrid_clip import FlaxHybridCLIP

-model = FlaxHybridCLIP.from_text_vision_pretrained("bert-base-uncased", "openai/clip-vit-base-patch32")
+model = FlaxHybridCLIP.from_text_vision_pretrained("google-bert/bert-base-uncased", "openai/clip-vit-base-patch32")

 # save the model
 model.save_pretrained("bert-clip")
@@ -89,7 +89,7 @@ If the checkpoints are in PyTorch then one could pass `text_from_pt=True` and `v
 PyTorch checkpoints convert them to flax and load the model.

 ```python
-model = FlaxHybridCLIP.from_text_vision_pretrained("bert-base-uncased", "openai/clip-vit-base-patch32", text_from_pt=True, vision_from_pt=True)
+model = FlaxHybridCLIP.from_text_vision_pretrained("google-bert/bert-base-uncased", "openai/clip-vit-base-patch32", text_from_pt=True, vision_from_pt=True)
 ```

 This loads both the text and vision encoders using pre-trained weights, the projection layers are randomly
@@ -154,9 +154,9 @@ Next we can run the example script to train the model:
 ```bash
 python run_hybrid_clip.py \
    --output_dir ${MODEL_DIR} \
-    --text_model_name_or_path="roberta-base" \
+    --text_model_name_or_path="FacebookAI/roberta-base" \
    --vision_model_name_or_path="openai/clip-vit-base-patch32" \
-    --tokenizer_name="roberta-base" \
+    --tokenizer_name="FacebookAI/roberta-base" \
    --train_file="coco_dataset/train_dataset.json" \
    --validation_file="coco_dataset/validation_dataset.json" \
    --do_train --do_eval \
--- a/examples/research_projects/jax-projects/hybrid_clip/modeling_hybrid_clip.py
+++ b/examples/research_projects/jax-projects/hybrid_clip/modeling_hybrid_clip.py
@@ -314,8 +314,6 @@ class FlaxHybridCLIP(FlaxPreTrainedModel):
                Information necessary to initiate the text model. Can be either:

                    - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under
-                      a user or organization name, like ``dbmdz/bert-base-german-cased``.
                    - A path to a `directory` containing model weights saved using
                      :func:`~transformers.FlaxPreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
                    - A path or url to a `PyTorch checkpoint folder` (e.g, ``./pt_model``). In
@@ -327,8 +325,6 @@ class FlaxHybridCLIP(FlaxPreTrainedModel):
                Information necessary to initiate the vision model. Can be either:

                    - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under
-                      a user or organization name, like ``dbmdz/bert-base-german-cased``.
                    - A path to a `directory` containing model weights saved using
                      :func:`~transformers.FlaxPreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
                    - A path or url to a `PyTorch checkpoint folder` (e.g, ``./pt_model``). In
@@ -354,7 +350,7 @@ class FlaxHybridCLIP(FlaxPreTrainedModel):
            >>> from transformers import FlaxHybridCLIP
            >>> # initialize a model from pretrained BERT and CLIP models. Note that the projection layers will be randomly initialized.
            >>> # If using CLIP's vision model the vision projection layer will be initialized using pre-trained weights
-            >>> model = FlaxHybridCLIP.from_text_vision_pretrained('bert-base-uncased', 'openai/clip-vit-base-patch32')
+            >>> model = FlaxHybridCLIP.from_text_vision_pretrained('google-bert/bert-base-uncased', 'openai/clip-vit-base-patch32')
            >>> # saving model after fine-tuning
            >>> model.save_pretrained("./bert-clip")
            >>> # load fine-tuned model
--- a/examples/research_projects/jax-projects/model_parallel/README.md
+++ b/examples/research_projects/jax-projects/model_parallel/README.md
@@ -54,7 +54,7 @@ model.save_pretrained("gpt-neo-1.3B")
 ```bash
 python run_clm_mp.py \
    --model_name_or_path gpt-neo-1.3B  \
-    --tokenizer_name gpt2 \
+    --tokenizer_name openai-community/gpt2 \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
    --do_train  --do_eval \
    --block_size 1024 \
--- a/examples/research_projects/longform-qa/eli5_app.py
+++ b/examples/research_projects/longform-qa/eli5_app.py
@@ -36,7 +36,7 @@ def load_models():
        _ = s2s_model.eval()
    else:
        s2s_tokenizer, s2s_model = make_qa_s2s_model(
-            model_name="t5-small", from_file="seq2seq_models/eli5_t5_model_1024_4.pth", device="cuda:0"
+            model_name="google-t5/t5-small", from_file="seq2seq_models/eli5_t5_model_1024_4.pth", device="cuda:0"
        )
    return (qar_tokenizer, qar_model, s2s_tokenizer, s2s_model)

--- a/examples/research_projects/mlm_wwm/README.md
+++ b/examples/research_projects/mlm_wwm/README.md
@@ -32,7 +32,7 @@ to that word). This technique has been refined for Chinese in [this paper](https
 To fine-tune a model using whole word masking, use the following script:
 ```bash
 python run_mlm_wwm.py \
-    --model_name_or_path roberta-base \
+    --model_name_or_path FacebookAI/roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
@@ -83,7 +83,7 @@ export VALIDATION_REF_FILE=/path/to/validation/chinese_ref/file
 export OUTPUT_DIR=/tmp/test-mlm-wwm

 python run_mlm_wwm.py \
-    --model_name_or_path roberta-base \
+    --model_name_or_path FacebookAI/roberta-base \
    --train_file $TRAIN_FILE \
    --validation_file $VALIDATION_FILE \
    --train_ref_file $TRAIN_REF_FILE \
--- a/examples/research_projects/mm-imdb/README.md
+++ b/examples/research_projects/mm-imdb/README.md
@@ -10,7 +10,7 @@ Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformer
 python run_mmimdb.py \
    --data_dir /path/to/mmimdb/dataset/ \
    --model_type bert \
-    --model_name_or_path bert-base-uncased \
+    --model_name_or_path google-bert/bert-base-uncased \
    --output_dir /path/to/save/dir/ \
    --do_train \
    --do_eval \
--- a/examples/research_projects/movement-pruning/README.md
+++ b/examples/research_projects/movement-pruning/README.md
@@ -61,7 +61,7 @@ python examples/movement-pruning/masked_run_squad.py \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
-    --model_name_or_path bert-base-uncased \
+    --model_name_or_path google-bert/bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
@@ -84,7 +84,7 @@ python examples/movement-pruning/masked_run_squad.py \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
-    --model_name_or_path bert-base-uncased \
+    --model_name_or_path google-bert/bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
@@ -104,7 +104,7 @@ python examples/movement-pruning/masked_run_squad.py \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
-    --model_name_or_path bert-base-uncased \
+    --model_name_or_path google-bert/bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
@@ -124,7 +124,7 @@ python examples/movement-pruning/masked_run_squad.py \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
-    --model_name_or_path bert-base-uncased \
+    --model_name_or_path google-bert/bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
--- a/examples/research_projects/performer/README.md
+++ b/examples/research_projects/performer/README.md
@@ -10,8 +10,8 @@ Paper authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyo

 ## Examples

-`sanity_script.sh` will launch performer fine-tuning from the bert-base-cased checkpoint on the Simple Wikipedia dataset (a small, easy-language English Wikipedia) from `datasets`.
-`full_script.sh` will launch performer fine-tuning from the bert-large-cased checkpoint on the English Wikipedia dataset from `datasets`.
+`sanity_script.sh` will launch performer fine-tuning from the google-bert/bert-base-cased checkpoint on the Simple Wikipedia dataset (a small, easy-language English Wikipedia) from `datasets`.
+`full_script.sh` will launch performer fine-tuning from the google-bert/bert-large-cased checkpoint on the English Wikipedia dataset from `datasets`.

 Here are a few key arguments:
 - Remove the `--performer` argument to use a standard Bert model.
--- a/examples/research_projects/pplm/run_pplm.py
+++ b/examples/research_projects/pplm/run_pplm.py
@@ -61,7 +61,7 @@ DISCRIMINATOR_MODELS_PARAMS = {
        "embed_size": 1024,
        "class_vocab": {"non_clickbait": 0, "clickbait": 1},
        "default_class": 1,
-        "pretrained_model": "gpt2-medium",
+        "pretrained_model": "openai-community/gpt2-medium",
    },
    "sentiment": {
        "url": "https://s3.amazonaws.com/models.huggingface.co/bert/pplm/discriminators/SST_classifier_head.pt",
@@ -69,7 +69,7 @@ DISCRIMINATOR_MODELS_PARAMS = {
        "embed_size": 1024,
        "class_vocab": {"very_positive": 2, "very_negative": 3},
        "default_class": 3,
-        "pretrained_model": "gpt2-medium",
+        "pretrained_model": "openai-community/gpt2-medium",
    },
 }

@@ -585,7 +585,7 @@ def set_generic_model_params(discrim_weights, discrim_meta):


 def run_pplm_example(
-    pretrained_model="gpt2-medium",
+    pretrained_model="openai-community/gpt2-medium",
    cond_text="",
    uncond=False,
    num_samples=1,
@@ -738,7 +738,7 @@ if __name__ == "__main__":
        "--pretrained_model",
        "-M",
        type=str,
-        default="gpt2-medium",
+        default="openai-community/gpt2-medium",
        help="pretrained model name or path to local checkpoint",
    )
    parser.add_argument("--cond_text", type=str, default="The lake", help="Prefix texts to condition on")
--- a/examples/research_projects/pplm/run_pplm_discrim_train.py
+++ b/examples/research_projects/pplm/run_pplm_discrim_train.py
@@ -45,7 +45,7 @@ max_length_seq = 100
 class Discriminator(nn.Module):
    """Transformer encoder followed by a Classification Head"""

-    def __init__(self, class_size, pretrained_model="gpt2-medium", cached_mode=False, device="cpu"):
+    def __init__(self, class_size, pretrained_model="openai-community/gpt2-medium", cached_mode=False, device="cpu"):
        super().__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model)
        self.encoder = GPT2LMHeadModel.from_pretrained(pretrained_model)
@@ -218,7 +218,7 @@ def get_cached_data_loader(dataset, batch_size, discriminator, shuffle=False, de
 def train_discriminator(
    dataset,
    dataset_fp=None,
-    pretrained_model="gpt2-medium",
+    pretrained_model="openai-community/gpt2-medium",
    epochs=10,
    batch_size=64,
    log_interval=10,
@@ -502,7 +502,10 @@ if __name__ == "__main__":
        help="File path of the dataset to use. Needed only in case of generic datadset",
    )
    parser.add_argument(
-        "--pretrained_model", type=str, default="gpt2-medium", help="Pretrained model to use as encoder"
+        "--pretrained_model",
+        type=str,
+        default="openai-community/gpt2-medium",
+        help="Pretrained model to use as encoder",
    )
    parser.add_argument("--epochs", type=int, default=10, metavar="N", help="Number of training epochs")
    parser.add_argument(
--- a/examples/research_projects/quantization-qdqbert/README.md
+++ b/examples/research_projects/quantization-qdqbert/README.md
@@ -50,11 +50,11 @@ Calibrate the pretrained model and finetune with quantization awared:

 ```bash
 python3 run_quant_qa.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --max_seq_length 128 \
  --doc_stride 32 \
-  --output_dir calib/bert-base-uncased \
+  --output_dir calib/google-bert/bert-base-uncased \
  --do_calib \
  --calibrator percentile \
  --percentile 99.99
@@ -62,7 +62,7 @@ python3 run_quant_qa.py \

 ```bash
 python3 run_quant_qa.py \
-  --model_name_or_path calib/bert-base-uncased \
+  --model_name_or_path calib/google-bert/bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
@@ -71,8 +71,8 @@ python3 run_quant_qa.py \
  --num_train_epochs 2 \
  --max_seq_length 128 \
  --doc_stride 32 \
-  --output_dir finetuned_int8/bert-base-uncased \
-  --tokenizer_name bert-base-uncased \
+  --output_dir finetuned_int8/google-bert/bert-base-uncased \
+  --tokenizer_name google-bert/bert-base-uncased \
  --save_steps 0
 ```

@@ -82,14 +82,14 @@ To export the QAT model finetuned above:

 ```bash
 python3 run_quant_qa.py \
-  --model_name_or_path finetuned_int8/bert-base-uncased \
+  --model_name_or_path finetuned_int8/google-bert/bert-base-uncased \
  --output_dir ./ \
  --save_onnx \
  --per_device_eval_batch_size 1 \
  --max_seq_length 128 \
  --doc_stride 32 \
  --dataset_name squad \
-  --tokenizer_name bert-base-uncased
+  --tokenizer_name google-bert/bert-base-uncased
 ```

 Use `--recalibrate-weights` to calibrate the weight ranges according to the quantizer axis. Use `--quant-per-tensor` for per tensor quantization (default is per channel).
@@ -117,7 +117,7 @@ python3 evaluate-hf-trt-qa.py \
  --max_seq_length 128 \
  --doc_stride 32 \
  --dataset_name squad \
-  --tokenizer_name bert-base-uncased \
+  --tokenizer_name google-bert/bert-base-uncased \
  --int8 \
  --seed 42
 ```
@@ -128,14 +128,14 @@ Finetune a fp32 precision model with [transformers/examples/pytorch/question-ans

 ```bash
 python3 ../../pytorch/question-answering/run_qa.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 128 \
  --doc_stride 32 \
-  --output_dir ./finetuned_fp32/bert-base-uncased \
+  --output_dir ./finetuned_fp32/google-bert/bert-base-uncased \
  --save_steps 0 \
  --do_train \
  --do_eval
@@ -147,13 +147,13 @@ python3 ../../pytorch/question-answering/run_qa.py \

 ```bash
 python3 run_quant_qa.py \
-  --model_name_or_path ./finetuned_fp32/bert-base-uncased \
+  --model_name_or_path ./finetuned_fp32/google-bert/bert-base-uncased \
  --dataset_name squad \
  --calibrator percentile \
  --percentile 99.99 \
  --max_seq_length 128 \
  --doc_stride 32 \
-  --output_dir ./calib/bert-base-uncased \
+  --output_dir ./calib/google-bert/bert-base-uncased \
  --save_steps 0 \
  --do_calib \
  --do_eval
@@ -163,14 +163,14 @@ python3 run_quant_qa.py \

 ```bash
 python3 run_quant_qa.py \
-  --model_name_or_path ./calib/bert-base-uncased \
+  --model_name_or_path ./calib/google-bert/bert-base-uncased \
  --output_dir ./ \
  --save_onnx \
  --per_device_eval_batch_size 1 \
  --max_seq_length 128 \
  --doc_stride 32 \
  --dataset_name squad \
-  --tokenizer_name bert-base-uncased
+  --tokenizer_name google-bert/bert-base-uncased
 ```

 ### Evaluate the INT8 PTQ ONNX model inference with TensorRT
@@ -183,7 +183,7 @@ python3 evaluate-hf-trt-qa.py \
  --max_seq_length 128 \
  --doc_stride 32 \
  --dataset_name squad \
-  --tokenizer_name bert-base-uncased \
+  --tokenizer_name google-bert/bert-base-uncased \
  --int8 \
  --seed 42
 ```
--- a/examples/tensorflow/benchmarking/README.md
+++ b/examples/tensorflow/benchmarking/README.md
@@ -22,5 +22,5 @@ If you would like to list benchmark results on your favorite models of the [mode

 | Benchmark description | Results | Environment info |      Author      |
 |:----------|:-------------|:-------------|------:|
-| PyTorch Benchmark on inference for `bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
-| PyTorch Benchmark on inference for `bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
+| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
+| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | 
--- a/examples/tensorflow/contrastive-image-text/README.md
+++ b/examples/tensorflow/contrastive-image-text/README.md
@@ -65,7 +65,7 @@ Finally, we can run the example script to train the model:
 python examples/tensorflow/contrastive-image-text/run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --vision_model_name_or_path openai/clip-vit-base-patch32 \
-    --text_model_name_or_path roberta-base \
+    --text_model_name_or_path FacebookAI/roberta-base \
    --data_dir $PWD/data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name=2017 \
--- a/examples/tensorflow/language-modeling-tpu/run_mlm.py
+++ b/examples/tensorflow/language-modeling-tpu/run_mlm.py
@@ -57,7 +57,7 @@ def parse_args():
    parser.add_argument(
        "--pretrained_model_config",
        type=str,
-        default="roberta-base",
+        default="FacebookAI/roberta-base",
        help="The model config to use. Note that we don't copy the model's weights, only the config!",
    )
    parser.add_argument(
--- a/examples/tensorflow/language-modeling/README.md
+++ b/examples/tensorflow/language-modeling/README.md
@@ -43,7 +43,7 @@ This script trains a masked language model.
 ### Example command
 ```bash
 python run_mlm.py \
--model_name_or_path distilbert-base-cased \
+--model_name_or_path distilbert/distilbert-base-cased \
 --output_dir output \
 --dataset_name wikitext \
 --dataset_config_name wikitext-103-raw-v1
@@ -52,7 +52,7 @@ python run_mlm.py \
 When using a custom dataset, the validation file can be separately passed as an input argument. Otherwise some split (customizable) of training data is used as validation.
 ```bash
 python run_mlm.py \
--model_name_or_path distilbert-base-cased \
+--model_name_or_path distilbert/distilbert-base-cased \
 --output_dir output \
 --train_file train_file_path
 ```
@@ -64,7 +64,7 @@ This script trains a causal language model.
 ### Example command
 ```bash
 python run_clm.py \
--model_name_or_path distilgpt2 \
+--model_name_or_path distilbert/distilgpt2 \
 --output_dir output \
 --dataset_name wikitext \
 --dataset_config_name wikitext-103-raw-v1
@@ -74,7 +74,7 @@ When using a custom dataset, the validation file can be separately passed as an

 ```bash
 python run_clm.py \
--model_name_or_path distilgpt2 \
+--model_name_or_path distilbert/distilgpt2 \
 --output_dir output \
 --train_file train_file_path
 ```
--- a/examples/tensorflow/multiple-choice/README.md
+++ b/examples/tensorflow/multiple-choice/README.md
@@ -36,7 +36,7 @@ README, but for more information you can see the 'Input Datasets' section of
 ### Example command
 ```bash
 python run_swag.py \
- --model_name_or_path distilbert-base-cased \
+ --model_name_or_path distilbert/distilbert-base-cased \
 --output_dir output \
 --do_eval \
 --do_train
--- a/examples/tensorflow/question-answering/README.md
+++ b/examples/tensorflow/question-answering/README.md
@@ -47,7 +47,7 @@ README, but for more information you can see the 'Input Datasets' section of
 ### Example command
 ```bash
 python run_qa.py \
--model_name_or_path distilbert-base-cased \
+--model_name_or_path distilbert/distilbert-base-cased \
 --output_dir output \
 --dataset_name squad \
 --do_train \
--- a/examples/tensorflow/summarization/run_summarization.py
+++ b/examples/tensorflow/summarization/run_summarization.py
@@ -334,11 +334,11 @@ def main():

    # region T5 special-casing
    if data_args.source_prefix is None and model_args.model_name_or_path in [
-        "t5-small",
-        "t5-base",
-        "t5-large",
-        "t5-3b",
-        "t5-11b",
+        "google-t5/t5-small",
+        "google-t5/t5-base",
+        "google-t5/t5-large",
+        "google-t5/t5-3b",
+        "google-t5/t5-11b",
    ]:
        logger.warning(
            "You're running a t5 model but didn't provide a source prefix, which is the expected, e.g. with "
--- a/examples/tensorflow/test_tensorflow_examples.py
+++ b/examples/tensorflow/test_tensorflow_examples.py
@@ -107,7 +107,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_text_classification.py
-            --model_name_or_path distilbert-base-uncased
+            --model_name_or_path distilbert/distilbert-base-uncased
            --output_dir {tmp_dir}
            --overwrite_output_dir
            --train_file ./tests/fixtures/tests_samples/MRPC/train.csv
@@ -137,7 +137,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_clm.py
-            --model_name_or_path distilgpt2
+            --model_name_or_path distilbert/distilgpt2
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --do_train
@@ -163,7 +163,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_mlm.py
-            --model_name_or_path distilroberta-base
+            --model_name_or_path distilbert/distilroberta-base
            --train_file ./tests/fixtures/sample_text.txt
            --validation_file ./tests/fixtures/sample_text.txt
            --max_seq_length 64
@@ -188,7 +188,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_ner.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --train_file tests/fixtures/tests_samples/conll/sample.json
            --validation_file tests/fixtures/tests_samples/conll/sample.json
            --output_dir {tmp_dir}
@@ -212,7 +212,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_qa.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --version_2_with_negative
            --train_file tests/fixtures/tests_samples/SQUAD/sample.json
            --validation_file tests/fixtures/tests_samples/SQUAD/sample.json
@@ -237,7 +237,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_swag.py
-            --model_name_or_path bert-base-uncased
+            --model_name_or_path google-bert/bert-base-uncased
            --train_file tests/fixtures/tests_samples/swag/sample.json
            --validation_file tests/fixtures/tests_samples/swag/sample.json
            --output_dir {tmp_dir}
@@ -261,7 +261,7 @@ class ExamplesTests(TestCasePlus):
        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_summarization.py
-            --model_name_or_path t5-small
+            --model_name_or_path google-t5/t5-small
            --train_file tests/fixtures/tests_samples/xsum/sample.json
            --validation_file tests/fixtures/tests_samples/xsum/sample.json
            --output_dir {tmp_dir}
--- a/examples/tensorflow/text-classification/README.md
+++ b/examples/tensorflow/text-classification/README.md
@@ -71,7 +71,7 @@ README, but for more information you can see the 'Input Datasets' section of
 ### Example command
 ```bash
 python run_text_classification.py \
--model_name_or_path distilbert-base-cased \
+--model_name_or_path distilbert/distilbert-base-cased \
 --train_file training_data.json \
 --validation_file validation_data.json \
 --output_dir output/ \
@@ -103,7 +103,7 @@ README, but for more information you can see the 'Input Datasets' section of
 ### Example command
 ```bash
 python run_glue.py \
--model_name_or_path distilbert-base-cased \
+--model_name_or_path distilbert/distilbert-base-cased \
 --task_name mnli \
 --do_train \
 --do_eval \
--- a/examples/tensorflow/token-classification/README.md
+++ b/examples/tensorflow/token-classification/README.md
@@ -27,7 +27,7 @@ The following example fine-tunes BERT on CoNLL-2003:

 ```bash
 python run_ner.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner
 ```
@@ -36,7 +36,7 @@ To run on your own training and validation files, use the following command:

 ```bash
 python run_ner.py \
-  --model_name_or_path bert-base-uncased \
+  --model_name_or_path google-bert/bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner
--- a/examples/tensorflow/translation/README.md
+++ b/examples/tensorflow/translation/README.md
@@ -29,11 +29,11 @@ can also be used by passing the name of the TPU resource with the `--tpu` argume

 MBart and some T5 models require special handling.

-T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:
+T5 models `google-t5/t5-small`, `google-t5/t5-base`, `google-t5/t5-large`, `google-t5/t5-3b` and `google-t5/t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example:

 ```bash
 python run_translation.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path google-t5/t5-small \
    --do_train \
    --do_eval \
    --source_lang en \