docs: Gemma 3n audio encoder (#39087)

Updating Gemma 3n docs and docstrings to clarify the relationship
between the newly trained audio encoder used in Gemma 3n and the USM
model from the original paper.
Ryan Mullins
2025-06-30 08:10:51 -04:00
committed by GitHub
parent 4a79bf947d
commit ed9f252608
4 changed files with 12 additions and 12 deletions

@@ -32,8 +32,8 @@ this model, including [Alternating Updates][altup] (AltUp), [Learned Augmented R
[MatFormer][matformer], Per-Layer Embeddings (PLE), activation sparsity, and KV cache sharing. The language model uses
a similar attention pattern to [Gemma 3](./gemma3.md) with alternating 4 local sliding window self-attention layers for
every global self-attention layer with a maximum context length of 32k tokens. Gemma 3n introduces
-[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a
-[Universal Speech Model][usm] (USM) as the audio encoder.
+[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a newly
+trained audio encoder based on the [Universal Speech Model][usm] (USM) architecture.
The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.