diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index e3c7576724..4dadc37af6 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -277,7 +277,9 @@ Follow these steps to start contributing:
    example.
 7. Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos and other non-text files. We prefer to leverage a hf.co hosted `dataset` like
    the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) in which to place these files and reference 
-   them by URL.
+   them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
+   If you are an external contributor, feel free to add the images to your PR and ask a Hugging Face member to migrate them
+   to this dataset.
 
 See more about the checks run on a pull request in our [PR guide](pr_checks)
 
diff --git a/README.md b/README.md
index 9f670f8ece..51e1bf566a 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ limitations under the License.
 
 <p align="center">
     <br>
-    <img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/transformers_logo_name.png" width="400"/>
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_logo_name.png" width="400"/>
     <br>
 <p>
 <p align="center">
@@ -52,7 +52,7 @@ limitations under the License.
 </h3>
 
 <h3 align="center">
-    <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/course_banner.png"></a>
+    <a href="https://hf.co/course"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/course_banner.png"></a>
 </h3>
 
 🤗 Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. 
diff --git a/README_ko.md b/README_ko.md
index 47c3ea033c..c2bb731cd8 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -16,7 +16,7 @@ limitations under the License.
 
 <p align="center">
     <br>
-    <img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/transformers_logo_name.png" width="400"/>
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_logo_name.png" width="400"/>
     <br>
 <p>
 <p align="center">
@@ -52,7 +52,7 @@ limitations under the License.
 </h3>
 
 <h3 align="center">
-    <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/course_banner.png"></a>
+    <a href="https://hf.co/course"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/course_banner.png"></a>
 </h3>
 
 🤗 Transformers는 분류, 정보 추출, 질문 답변, 요약, 번역, 문장 생성 등을 100개 이상의 언어로 수행할 수 있는 수천개의 사전학습된 모델을 제공합니다. 우리의 목표는 모두가 최첨단의 NLP 기술을 쉽게 사용하는 것입니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index 07ca87440c..7aea040acf 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -41,7 +41,7 @@ checkpoint: 检查点
 
 <p align="center">
     <br>
-    <img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/transformers_logo_name.png" width="400"/>
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_logo_name.png" width="400"/>
     <br>
 <p>
 <p align="center">
@@ -77,7 +77,7 @@ checkpoint: 检查点
 </h3>
 
 <h3 align="center">
-    <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/course_banner.png"></a>
+    <a href="https://hf.co/course"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/course_banner.png"></a>
 </h3>
 
 🤗 Transformers 提供了数以千计的预训练模型，支持 100 多种语言的文本分类、信息抽取、问答、摘要、翻译、文本生成。它的宗旨让最先进的 NLP 技术人人易用。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index 81c3a39bc0..52c034d3f8 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -53,7 +53,7 @@ user: 使用者
 
 <p align="center">
     <br>
-    <img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/transformers_logo_name.png" width="400"/>
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_logo_name.png" width="400"/>
     <br>
 <p>
 <p align="center">
@@ -89,7 +89,7 @@ user: 使用者
 </h3>
 
 <h3 align="center">
-    <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/course_banner.png"></a>
+    <a href="https://hf.co/course"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/course_banner.png"></a>
 </h3>
 
 🤗 Transformers 提供了數以千計的預訓練模型，支援 100 多種語言的文本分類、資訊擷取、問答、摘要、翻譯、文本生成。它的宗旨是讓最先進的 NLP 技術人人易用。
diff --git a/docs/README.md b/docs/README.md
index b076c5de9d..36b39f3e67 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -324,3 +324,11 @@ So using this particular example's output -- if your current section's header us
 If you needed to add yet another sub-level, then pick a character that is not used already. That is you must pick a character that is not in the output of that script.
 
 Here is the full list of characters that can be used in this context: `= - ` : ' " ~ ^ _ * + # < >`
+
+#### Adding an image
+
+Because the repository is growing rapidly, it is important not to add files that would significantly weigh it down. This includes images, videos and other non-text files. We prefer to place these files in a hf.co hosted `dataset`, like
+the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing), and reference
+them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
+If you are an external contributor, feel free to add the images to your PR and ask a Hugging Face member to migrate them
+to this dataset.
diff --git a/docs/source/add_new_model.rst b/docs/source/add_new_model.rst
index 4ea8bcf1a8..decd664f6c 100644
--- a/docs/source/add_new_model.rst
+++ b/docs/source/add_new_model.rst
@@ -72,7 +72,7 @@ call the model to be added to 🤗 Transformers ``BrandNewBert``.
 
 Let's take a look:
 
-.. image:: /imgs/transformers_overview.png
+.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_overview.png
 
 As you can see, we do make use of inheritance in 🤗 Transformers, but we keep the level of abstraction to an absolute
 minimum. There are never more than two levels of abstraction for any model in the library. :obj:`BrandNewBertModel`
diff --git a/docs/source/imgs/course_banner.png b/docs/source/imgs/course_banner.png
deleted file mode 100644
index 45773d164c..0000000000
Binary files a/docs/source/imgs/course_banner.png and /dev/null differ
diff --git a/docs/source/imgs/local_attention_mask.png b/docs/source/imgs/local_attention_mask.png
deleted file mode 100644
index 284e728820..0000000000
Binary files a/docs/source/imgs/local_attention_mask.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-deepspeed-3d.png b/docs/source/imgs/parallelism-deepspeed-3d.png
deleted file mode 100644
index 391072b543..0000000000
Binary files a/docs/source/imgs/parallelism-deepspeed-3d.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-flexflow.jpeg b/docs/source/imgs/parallelism-flexflow.jpeg
deleted file mode 100644
index 137015c7eb..0000000000
Binary files a/docs/source/imgs/parallelism-flexflow.jpeg and /dev/null differ
diff --git a/docs/source/imgs/parallelism-gpipe-bubble.png b/docs/source/imgs/parallelism-gpipe-bubble.png
deleted file mode 100644
index 03bda9fae5..0000000000
Binary files a/docs/source/imgs/parallelism-gpipe-bubble.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-sagemaker-interleaved-pipeline.png b/docs/source/imgs/parallelism-sagemaker-interleaved-pipeline.png
deleted file mode 100644
index b1e44b0ea3..0000000000
Binary files a/docs/source/imgs/parallelism-sagemaker-interleaved-pipeline.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-tp-independent-gelu.png b/docs/source/imgs/parallelism-tp-independent-gelu.png
deleted file mode 100644
index 81d289c9ca..0000000000
Binary files a/docs/source/imgs/parallelism-tp-independent-gelu.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-tp-parallel_gemm.png b/docs/source/imgs/parallelism-tp-parallel_gemm.png
deleted file mode 100644
index 06ca09a365..0000000000
Binary files a/docs/source/imgs/parallelism-tp-parallel_gemm.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-tp-parallel_self_attention.png b/docs/source/imgs/parallelism-tp-parallel_self_attention.png
deleted file mode 100644
index 60fae3fcd5..0000000000
Binary files a/docs/source/imgs/parallelism-tp-parallel_self_attention.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-tp-parallel_shard_processing.png b/docs/source/imgs/parallelism-tp-parallel_shard_processing.png
deleted file mode 100644
index 8f7e7e9ab7..0000000000
Binary files a/docs/source/imgs/parallelism-tp-parallel_shard_processing.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-zero-dp-pp.png b/docs/source/imgs/parallelism-zero-dp-pp.png
deleted file mode 100644
index fa09854248..0000000000
Binary files a/docs/source/imgs/parallelism-zero-dp-pp.png and /dev/null differ
diff --git a/docs/source/imgs/parallelism-zero.png b/docs/source/imgs/parallelism-zero.png
deleted file mode 100644
index 0c1afaa4b1..0000000000
Binary files a/docs/source/imgs/parallelism-zero.png and /dev/null differ
diff --git a/docs/source/imgs/perf-moe-transformer.png b/docs/source/imgs/perf-moe-transformer.png
deleted file mode 100644
index 5999f55d2a..0000000000
Binary files a/docs/source/imgs/perf-moe-transformer.png and /dev/null differ
diff --git a/docs/source/imgs/ppl_chunked.gif b/docs/source/imgs/ppl_chunked.gif
deleted file mode 100644
index 2e33736935..0000000000
Binary files a/docs/source/imgs/ppl_chunked.gif and /dev/null differ
diff --git a/docs/source/imgs/ppl_full.gif b/docs/source/imgs/ppl_full.gif
deleted file mode 100644
index 2869208faa..0000000000
Binary files a/docs/source/imgs/ppl_full.gif and /dev/null differ
diff --git a/docs/source/imgs/ppl_sliding.gif b/docs/source/imgs/ppl_sliding.gif
deleted file mode 100644
index d2dc26f55b..0000000000
Binary files a/docs/source/imgs/ppl_sliding.gif and /dev/null differ
diff --git a/docs/source/imgs/tf32-bf16-fp16-fp32.png b/docs/source/imgs/tf32-bf16-fp16-fp32.png
deleted file mode 100644
index aa247bd997..0000000000
Binary files a/docs/source/imgs/tf32-bf16-fp16-fp32.png and /dev/null differ
diff --git a/docs/source/imgs/transformers_logo_name.png b/docs/source/imgs/transformers_logo_name.png
deleted file mode 100644
index 5e4c2dcf57..0000000000
Binary files a/docs/source/imgs/transformers_logo_name.png and /dev/null differ
diff --git a/docs/source/imgs/transformers_overview.png b/docs/source/imgs/transformers_overview.png
deleted file mode 100644
index abb15b3dd7..0000000000
Binary files a/docs/source/imgs/transformers_overview.png and /dev/null differ
diff --git a/docs/source/imgs/warmup_constant_schedule.png b/docs/source/imgs/warmup_constant_schedule.png
deleted file mode 100644
index e2448e9f2c..0000000000
Binary files a/docs/source/imgs/warmup_constant_schedule.png and /dev/null differ
diff --git a/docs/source/imgs/warmup_cosine_hard_restarts_schedule.png b/docs/source/imgs/warmup_cosine_hard_restarts_schedule.png
deleted file mode 100644
index be73605b9c..0000000000
Binary files a/docs/source/imgs/warmup_cosine_hard_restarts_schedule.png and /dev/null differ
diff --git a/docs/source/imgs/warmup_cosine_schedule.png b/docs/source/imgs/warmup_cosine_schedule.png
deleted file mode 100644
index 6d27926ab1..0000000000
Binary files a/docs/source/imgs/warmup_cosine_schedule.png and /dev/null differ
diff --git a/docs/source/imgs/warmup_cosine_warm_restarts_schedule.png b/docs/source/imgs/warmup_cosine_warm_restarts_schedule.png
deleted file mode 100644
index 71b39bffd3..0000000000
Binary files a/docs/source/imgs/warmup_cosine_warm_restarts_schedule.png and /dev/null differ
diff --git a/docs/source/imgs/warmup_linear_schedule.png b/docs/source/imgs/warmup_linear_schedule.png
deleted file mode 100644
index 4e1af31025..0000000000
Binary files a/docs/source/imgs/warmup_linear_schedule.png and /dev/null differ
diff --git a/docs/source/main_classes/optimizer_schedules.rst b/docs/source/main_classes/optimizer_schedules.rst
index 71cf192574..6ac96a8546 100644
--- a/docs/source/main_classes/optimizer_schedules.rst
+++ b/docs/source/main_classes/optimizer_schedules.rst
@@ -52,30 +52,30 @@ Learning Rate Schedules (Pytorch)
 
 .. autofunction:: transformers.get_constant_schedule_with_warmup
 
-.. image:: /imgs/warmup_constant_schedule.png
-    :target: /imgs/warmup_constant_schedule.png
+.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_constant_schedule.png
+    :target: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_constant_schedule.png
     :alt:
 
 
 .. autofunction:: transformers.get_cosine_schedule_with_warmup
 
-.. image:: /imgs/warmup_cosine_schedule.png
-    :target: /imgs/warmup_cosine_schedule.png
+.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_cosine_schedule.png
+    :target: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_cosine_schedule.png
     :alt:
 
 
 .. autofunction:: transformers.get_cosine_with_hard_restarts_schedule_with_warmup
 
-.. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
-    :target: /imgs/warmup_cosine_hard_restarts_schedule.png
+.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_cosine_hard_restarts_schedule.png
+    :target: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_cosine_hard_restarts_schedule.png
     :alt:
 
 
 
 .. autofunction:: transformers.get_linear_schedule_with_warmup
 
-.. image:: /imgs/warmup_linear_schedule.png
-    :target: /imgs/warmup_linear_schedule.png
+.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_linear_schedule.png
+    :target: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/warmup_linear_schedule.png
     :alt:
 
 
diff --git a/docs/source/model_summary.rst b/docs/source/model_summary.rst
index 106beb96aa..0bd42bb819 100644
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -877,7 +877,7 @@ Some preselected input tokens are also given global attention: for those few tok
 all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
 their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
 
-.. image:: /imgs/local_attention_mask.png
+.. image:: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local_attention_mask.png
    :scale: 50 %
    :align: center
 
diff --git a/docs/source/parallelism.md b/docs/source/parallelism.md
index ce8ebaaf26..622838ab80 100644
--- a/docs/source/parallelism.md
+++ b/docs/source/parallelism.md
@@ -46,7 +46,7 @@ Most users with just 2 GPUs already enjoy the increased training speed up thanks
 ## ZeRO Data Parallel
 
 ZeRO-powered data parallelism (ZeRO-DP) is described on the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
-![DeepSpeed-Image-1](/imgs/parallelism-zero.png)
+![DeepSpeed-Image-1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png)
 
 It can be difficult to wrap one's head around it, but in reality the concept is quite simple. This is just the usual DataParallel (DP), except, instead of replicating the full model params, gradients and optimizer states, each GPU stores only a slice of it.  And then at run-time when the full layer params are needed just for the given layer, all GPUs synchronize to give each other parts that they miss - this is it.
 
@@ -150,7 +150,7 @@ Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the GPU
 
 The following illustration from the [GPipe paper](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html) shows the naive MP on the top, and PP on the bottom:
 
-![mp-pp](/imgs/parallelism-gpipe-bubble.png)
+![mp-pp](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-gpipe-bubble.png)
 
 It's easy to see from the bottom diagram how PP has less dead zones, where GPUs are idle. The idle parts are referred to as the "bubble".
 
@@ -203,7 +203,7 @@ Implementations:
 Other approaches:
 
 DeepSpeed, Varuna and SageMaker use the concept of an [Interleaved Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html)
-![interleaved-pipeline-execution](/imgs/parallelism-sagemaker-interleaved-pipeline.png)
+![interleaved-pipeline-execution](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-sagemaker-interleaved-pipeline.png)
 
 Here the bubble (idle time) is further minimized by prioritizing backward passes.
 
@@ -221,16 +221,16 @@ The main building block of any transformer is a fully connected `nn.Linear` foll
 Following the Megatron's paper notation, we can write the dot-product part of it as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix.
 
 If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs:
-![Parallel GEMM](/imgs/parallelism-tp-parallel_gemm.png)
+![Parallel GEMM](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_gemm.png)
 
 If we split the weight matrix `A` column-wise across `N` GPUs and perform matrix multiplications `XA_1` through `XA_n` in parallel, then we will end up with `N` output vectors `Y_1, Y_2, ..., Y_n` which can be fed into `GeLU` independently:
-![independent GeLU](/imgs/parallelism-tp-independent-gelu.png)
+![independent GeLU](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-independent-gelu.png)
 
 Using this principle, we can update an MLP of arbitrary depth, without the need for any synchronization between GPUs until the very end, where we need to reconstruct the output vector from shards. The Megatron-LM paper authors provide a helpful illustration for that:
-![parallel shard processing](/imgs/parallelism-tp-parallel_shard_processing.png)
+![parallel shard processing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_shard_processing.png)
 
 Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!
-![parallel self-attention](/imgs/parallelism-tp-parallel_self_attention.png)
+![parallel self-attention](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_self_attention.png)
 
 Special considerations: TP requires very fast network, and therefore it's not advisable to do TP across more than one node. Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs.
 
@@ -258,7 +258,7 @@ Implementations:
 
 The following diagram from the DeepSpeed [pipeline tutorial](https://www.deepspeed.ai/tutorials/pipeline/) demonstrates how one combines DP with PP.
 
-![dp-pp-2d](/imgs/parallelism-zero-dp-pp.png)
+![dp-pp-2d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero-dp-pp.png)
 
 Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't see GPU3. To DP there is just GPUs 0 and 1 where it feeds data as if there were just 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP. And GPU1 does the same by enlisting GPU3 to its aid.
 
@@ -277,7 +277,7 @@ Implementations:
 
 To get an even more efficient training a 3D parallelism is used where PP is combined with TP and DP. This can be seen in the following diagram.
 
-![dp-pp-tp-3d](/imgs/parallelism-deepspeed-3d.png)
+![dp-pp-tp-3d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-deepspeed-3d.png)
 
 This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter models](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/), which is a good read as well.
 
@@ -342,7 +342,7 @@ We have 10 batches of 512 length. If we parallelize them by attribute dimension
 
 It is similar with tensor model parallelism or naive layer-wise model parallelism.
 
-![flex-flow-soap](/imgs/parallelism-flexflow.jpeg)
+![flex-flow-soap](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-flexflow.jpeg)
 
 The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast-intra-connect/slow-inter-connect and it automatically optimizes all these  algorithmically deciding which parallelisation to use where.
 
diff --git a/docs/source/performance.md b/docs/source/performance.md
index 851da82852..b058adf064 100644
--- a/docs/source/performance.md
+++ b/docs/source/performance.md
@@ -248,7 +248,7 @@ Here are the commonly used floating point data types choice of which impacts bot
 
 Here is a diagram that shows how these data types correlate to each other.
 
-![data types](/imgs/tf32-bf16-fp16-fp32.png)
+![data types](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tf32-bf16-fp16-fp32.png)
 
 (source: [NVIDIA Blog](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/))
 
@@ -524,7 +524,7 @@ Since it has been discovered that more parameters lead to better performance, th
 
 In this approach every other FFN layer is replaced with a MoE Layer which consists of many experts, with a gated function that trains each expert in a balanced way depending on the input token's position in a sequence.
 
-![MoE Transformer 2x block](/imgs/perf-moe-transformer.png)
+![MoE Transformer 2x block](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf-moe-transformer.png)
 
 (source: [GLAM](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html))
 
diff --git a/docs/source/perplexity.mdx b/docs/source/perplexity.mdx
index 98a7bdd95d..f53b565037 100644
--- a/docs/source/perplexity.mdx
+++ b/docs/source/perplexity.mdx
@@ -34,7 +34,7 @@ intuition about perplexity and its relationship to Bits Per Character (BPC) and
 If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
 factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
 
-<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="/imgs/ppl_full.gif"/>
+<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_full.gif"/>
 
 When working with approximate models, however, we typically have a constraint on the number of tokens the model can
 process. The largest version of [GPT-2](model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
@@ -46,7 +46,7 @@ input size is \\(k\\), we then approximate the likelihood of a token \\(x_t\\) b
 sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
 log-likelihoods of each segment independently.
 
-<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="/imgs/ppl_chunked.gif"/>
+<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_chunked.gif"/>
 
 This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
 approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
@@ -55,7 +55,7 @@ have less context at most of the prediction steps.
 Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
 sliding the context window so that the model has more context when making each prediction.
 
-<img width="600" alt="Sliding window PPL taking advantage of all available context" src="/imgs/ppl_sliding.gif"/>
+<img width="600" alt="Sliding window PPL taking advantage of all available context" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_sliding.gif"/>
 
 This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
 favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
diff --git a/templates/adding_a_new_model/ADD_NEW_MODEL_PROPOSAL_TEMPLATE.md b/templates/adding_a_new_model/ADD_NEW_MODEL_PROPOSAL_TEMPLATE.md
index 784314b56d..d87b6a58ec 100644
--- a/templates/adding_a_new_model/ADD_NEW_MODEL_PROPOSAL_TEMPLATE.md
+++ b/templates/adding_a_new_model/ADD_NEW_MODEL_PROPOSAL_TEMPLATE.md
@@ -91,7 +91,7 @@ exemplary purposes, we will call the PyTorch model to be added to 🤗 Transform
 
 Let's take a look:
 
-![image](../../docs/source/imgs/transformers_overview.png)
+![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_overview.png)
 
 As you can see, we do make use of inheritance in 🤗 Transformers, but we
 keep the level of abstraction to an absolute minimum. There are never
diff --git a/templates/adding_a_new_model/open_model_proposals/ADD_BIG_BIRD.md b/templates/adding_a_new_model/open_model_proposals/ADD_BIG_BIRD.md
index fea8376a80..106dcc9542 100644
--- a/templates/adding_a_new_model/open_model_proposals/ADD_BIG_BIRD.md
+++ b/templates/adding_a_new_model/open_model_proposals/ADD_BIG_BIRD.md
@@ -73,7 +73,7 @@ exemplary purposes, we will call the PyTorch model to be added to 🤗 Transform
 
 Let's take a look:
 
-![image](../../../docs/source/imgs/transformers_overview.png)
+![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_overview.png)
 
 As you can see, we do make use of inheritance in 🤗 Transformers, but we
 keep the level of abstraction to an absolute minimum. There are never