docs: Gemma 3n audio encoder (#39087)

Updating Gemma 3n docs and docstrings to clarify the relationship
between the newly trained audio encoder used in Gemma 3n and the USM
model from the original paper.
Ryan Mullins
2025-06-30 08:10:51 -04:00
committed by GitHub
parent 4a79bf947d
commit ed9f252608
4 changed files with 12 additions and 12 deletions

@@ -32,8 +32,8 @@ this model, including [Alternating Updates][altup] (AltUp), [Learned Augmented R
 [MatFormer][matformer], Per-Layer Embeddings (PLE), activation sparsity, and KV cache sharing. The language model uses
 a similar attention pattern to [Gemma 3](./gemma3.md) with alternating 4 local sliding window self-attention layers for
 every global self-attention layer with a maximum context length of 32k tokens. Gemma 3n introduces
-[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a
-[Universal Speech Model][usm] (USM) as the audio encoder.
+[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a newly
+trained audio encoder based on the [Universal Speech Model][usm] (USM) architecture.
 The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.
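
Reviewer note: the context lines of this hunk describe the attention layout as 4 local sliding-window layers for every global layer. A minimal sketch of that 4:1 schedule follows. The function name, the label strings, and the choice to place the global layer last in each group of five are assumptions for illustration, not taken from this commit.

```python
def attention_layer_types(num_layers: int, local_per_global: int = 4) -> list[str]:
    """Return an illustrative per-layer attention schedule.

    Assumption: each group of (local_per_global + 1) layers consists of
    `local_per_global` sliding-window layers followed by one global layer.
    """
    period = local_per_global + 1
    return [
        "full_attention" if (i + 1) % period == 0 else "sliding_attention"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # Print the schedule for a hypothetical 10-layer stack.
    for index, kind in enumerate(attention_layer_types(10)):
        print(index, kind)
```

With this assumed layout, a 10-layer stack has global attention at layer indices 4 and 9 and sliding-window attention everywhere else.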

@@ -301,10 +301,10 @@ class Gemma3nTextConfig(PretrainedConfig):
 class Gemma3nAudioConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a [`Gemma3nAudioEncoder`], based on Gogole's
-    [Universal Speech Model](). It is used to instantiate an Gemma3nAudioEncoder model according to the specified
-    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
-    configuration to that of the Gemma 3n E4B, e.g. [google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B).
+    This is the configuration class to store the configuration of a [`Gemma3nAudioEncoder`]. It is used to instantiate
+    an `Gemma3nAudioEncoder` model according to the specified arguments, defining the model architecture. Instantiating
+    a configuration with the defaults will yield a similar configuration to that of the Gemma 3n E4B, e.g.,
+    [google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B).
     Configuration objects that inherit from [`Gemma3nAudioConfig`] and can be used to control the model outputs. Read
     the documentation from [`Gemma3nAudioConfig`] for more information.

@@ -911,7 +911,7 @@ class Gemma3nAudioConformerBlock(nn.Module):
 class Gemma3nAudioEncoder(PreTrainedModel):
-    """A Universal Speech Encoder -- https://arxiv.org/abs/2303.01037"""
+    """An audio encoder based on the [Universal Speech Model](https://arxiv.org/abs/2303.01037) architecture."""
     config_class = Gemma3nAudioConfig

@@ -313,10 +313,10 @@ class Gemma3nTextConfig(Gemma2Config, PretrainedConfig):
 class Gemma3nAudioConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a [`Gemma3nAudioEncoder`], based on Gogole's
-    [Universal Speech Model](). It is used to instantiate an Gemma3nAudioEncoder model according to the specified
-    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
-    configuration to that of the Gemma 3n E4B, e.g. [google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B).
+    This is the configuration class to store the configuration of a [`Gemma3nAudioEncoder`]. It is used to instantiate
+    an `Gemma3nAudioEncoder` model according to the specified arguments, defining the model architecture. Instantiating
+    a configuration with the defaults will yield a similar configuration to that of the Gemma 3n E4B, e.g.,
+    [google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B).
     Configuration objects that inherit from [`Gemma3nAudioConfig`] and can be used to control the model outputs. Read
     the documentation from [`Gemma3nAudioConfig`] for more information.
@@ -1473,7 +1473,7 @@ class Gemma3nAudioConformerBlock(nn.Module):
 class Gemma3nAudioEncoder(PreTrainedModel):
-    """A Universal Speech Encoder -- https://arxiv.org/abs/2303.01037"""
+    """An audio encoder based on the [Universal Speech Model](https://arxiv.org/abs/2303.01037) architecture."""
     config_class = Gemma3nAudioConfig