From 5b1ad0eb732a07ccc4ea406fb33dd21c590c80be Mon Sep 17 00:00:00 2001
From: Joao Gante <joaofranciscocardosogante@gmail.com>
Date: Tue, 16 May 2023 18:54:34 +0100
Subject: [PATCH] Docs: add link to assisted generation blog post (#23397)

---
 docs/source/en/generation_strategies.mdx | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/docs/source/en/generation_strategies.mdx b/docs/source/en/generation_strategies.mdx
index 2b4f9880cf..b59649bae4 100644
--- a/docs/source/en/generation_strategies.mdx
+++ b/docs/source/en/generation_strategies.mdx
@@ -338,9 +338,8 @@ For the complete list of the available parameters, refer to the [API documentati
 Assisted decoding is a modification of the decoding strategies above that uses an assistant model with the same
 tokenizer (ideally a much smaller model) to greedily generate a few candidate tokens. The main model then validates
 the candidate tokens in a single forward pass, which speeds up the decoding process. Currently, only greedy search
-and sampling are supported with assisted decoding, and doesn't support batched inputs.
-
-<!-- TODO: add link to the blog post about assisted decoding when it exists -->
+and sampling are supported with assisted decoding, and batched inputs are not supported. To learn more about assisted
+decoding, check [this blog post](https://huggingface.co/blog/assisted-generation).
 
 To enable assisted decoding, set the `assistant_model` argument with a model.
 
@@ -364,8 +363,6 @@ To enable assisted decoding, set the `assistant_model` argument with a model.
 When using assisted decoding with sampling methods, you can use the `temperarure` argument to control the randomness
 just like in multinomial sampling. However, in assisted decoding, reducing the temperature will help improving latency.
 
-<!-- TODO: link the blog post again to explain why the tradeoff exists -->
-
 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer