Fix typos (#31819)
* fix typo * fix typo * fix typos * fix typo * fix typos
This commit is contained in:
@@ -147,7 +147,7 @@ Let's call it now for the next experiment.
|
||||
```python
|
||||
flush()
|
||||
```
|
||||
In the recent version of the accelerate library, you can also use an utility method called `release_memory()`
|
||||
In the recent version of the accelerate library, you can also use a utility method called `release_memory()`
|
||||
|
||||
```python
|
||||
from accelerate.utils import release_memory
|
||||
@@ -683,7 +683,7 @@ Assistant: Germany has ca. 81 million inhabitants
|
||||
|
||||
In this chat, the LLM runs auto-regressive decoding twice:
|
||||
1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
|
||||
2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, it's computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
|
||||
2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
|
||||
|
||||
Two things should be noted here:
|
||||
1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
|
||||
|
||||
Reference in New Issue
Block a user