[Docs] Fix spelling and grammar mistakes (#28825)
* Fix typos and grammar mistakes in docs and examples * Fix typos in docstrings and comments * Fix spelling of `tokenizer` in model tests * Remove erroneous spaces in decorators * Remove extra spaces in Markdown link texts
This commit is contained in:
@@ -89,7 +89,7 @@ In a traditional auto-regressive Transformer, attention is written as
|
||||
|
||||
$$O = \hbox{softmax}(QK^{T} / \sqrt{d}) V$$
|
||||
|
||||
with \\(Q\\), \\(K\\) and \\(V\\) are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product \\(QK^{T}\\) then has shape `seq_len x seq_len` and we can take the maxtrix product with \\(V\\) to get the output \\(O\\) of the same shape as the others.
|
||||
with \\(Q\\), \\(K\\) and \\(V\\) are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product \\(QK^{T}\\) then has shape `seq_len x seq_len` and we can take the matrix product with \\(V\\) to get the output \\(O\\) of the same shape as the others.
|
||||
|
||||
Replacing the softmax by its value gives:
|
||||
|
||||
@@ -109,7 +109,7 @@ with \\(u\\) and \\(w\\) learnable parameters called in the code `time_first` an
|
||||
|
||||
$$N_{i} = e^{u + K_{i}} V_{i} + \hat{N}_{i} \hbox{ where } \hat{N}_{i} = e^{K_{i-1}} V_{i-1} + e^{w + K_{i-2}} V_{i-2} \cdots + e^{(i-2)w + K_{1}} V_{1}$$
|
||||
|
||||
so \\(\hat{N}_{i}\\) (called `numerator_state` in the code) satistfies
|
||||
so \\(\hat{N}_{i}\\) (called `numerator_state` in the code) satisfies
|
||||
|
||||
$$\hat{N}_{0} = 0 \hbox{ and } \hat{N}_{j+1} = e^{K_{j}} V_{j} + e^{w} \hat{N}_{j}$$
|
||||
|
||||
@@ -117,7 +117,7 @@ and
|
||||
|
||||
$$D_{i} = e^{u + K_{i}} + \hat{D}_{i} \hbox{ where } \hat{D}_{i} = e^{K_{i-1}} + e^{w + K_{i-2}} \cdots + e^{(i-2)w + K_{1}}$$
|
||||
|
||||
so \\(\hat{D}_{i}\\) (called `denominator_state` in the code) satistfies
|
||||
so \\(\hat{D}_{i}\\) (called `denominator_state` in the code) satisfies
|
||||
|
||||
$$\hat{D}_{0} = 0 \hbox{ and } \hat{D}_{j+1} = e^{K_{j}} + e^{w} \hat{D}_{j}$$
|
||||
|
||||
|
||||
Reference in New Issue
Block a user