docs: Gemma 3n audio encoder (#39087)

Updating Gemma 3n docs and docstrings to clarify the relationship between the newly trained audio encoder used in Gemma 3n and the USM model from the original paper.
2025-06-30 08:10:51 -04:00
parent 4a79bf947d
commit ed9f252608
4 changed files with 12 additions and 12 deletions
--- a/docs/source/en/model_doc/gemma3n.md
+++ b/docs/source/en/model_doc/gemma3n.md
@@ -32,8 +32,8 @@ this model, including [Alternating Updates][altup] (AltUp), [Learned Augmented R
 [MatFormer][matformer], Per-Layer Embeddings (PLE), activation sparsity, and KV cache sharing. The language model uses
 a similar attention pattern to [Gemma 3](./gemma3.md) with alternating 4 local sliding window self-attention layers for
 every global self-attention layer with a maximum context length of 32k tokens. Gemma 3n introduces
-[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a
-[Universal Speech Model][usm] (USM) as the audio encoder.
+[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a newly
+trained audio encoder based on the [Universal Speech Model][usm] (USM) architecture.

 The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.