Add Nemotron HF Support (#31699)
* Add nemotron support
* fix inference
* add unit test
* add layernorm1p as a class to avoid meta device mismatch
* test fixed
* Add copied_from statements
* remove pretraining_tp args
* remove nemotronlayernorm
* force LN computation done in FP32
* remove nemotrontokenizer and use llamatokenizer
* license update
* add option for kv_channels for minitron8b
* remove assert
* o_proj fixed
* o_proj reshape
* add gated_proj option
* typo
* remove todos
* fix broken test after merging latest main
* remove nezha/nat after merging main
* change default config to 15b model
* add nemo conversion script
* rename conversion script
* remove gate_proj option
* pr comment resolved
* fix unit test
* rename kv_channels to head_dim
* resolve PR issue
* add nemotron md
* fix broken tests
* refactor rope for nemotron
* test fix
* remove linearscaling
* whitespace and import
* fix some copied-from
* code style fix
* reformatted
* add position_embedding to nemotronattention
* rope refactor to only use config, copied-from fix
* format
* Run make fix-copies
* nemotron md with autodoc
* doc fix
* fix order
* pass check_config_docstrings.py
* fix config_attributes
* remove all llama BC related code
* Use PreTrainedTokenizerFast
* ruff check examples
* conversion script update
* add nemotron to toctree
@@ -468,6 +468,8 @@
  title: MT5
- local: model_doc/mvp
  title: MVP
- local: model_doc/nemotron
  title: Nemotron
- local: model_doc/nezha
  title: NEZHA
- local: model_doc/nllb
@@ -222,6 +222,7 @@ Flax), PyTorch, and/or TensorFlow.
| [MusicGen Melody](model_doc/musicgen_melody) | ✅ | ❌ | ❌ |
| [MVP](model_doc/mvp) | ✅ | ❌ | ❌ |
| [NAT](model_doc/nat) | ✅ | ❌ | ❌ |
| [Nemotron](model_doc/nemotron) | ✅ | ❌ | ❌ |
| [Nezha](model_doc/nezha) | ✅ | ❌ | ❌ |
| [NLLB](model_doc/nllb) | ✅ | ❌ | ❌ |
| [NLLB-MOE](model_doc/nllb-moe) | ✅ | ❌ | ❌ |
148
docs/source/en/model_doc/nemotron.md
Normal file
@@ -0,0 +1,148 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-->

# Nemotron

## Nemotron

### License

The use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license).

### Description

Nemotron-4 is a family of enterprise-ready generative text models compatible with the [NVIDIA NeMo Framework](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/).

NVIDIA NeMo is an end-to-end, cloud-native platform for building, customizing, and deploying generative AI models anywhere. It includes training and inference frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI. To get access to the NeMo Framework, sign up at [this link](https://developer.nvidia.com/nemo-framework/join).

### References

[Announcement Blog](https://developer.nvidia.com/blog/nvidia-ai-foundation-models-build-custom-enterprise-chatbots-and-co-pilots-with-production-ready-llms/)

### Model Architecture

**Architecture Type:** Transformer

**Network Architecture:** Transformer Decoder (auto-regressive language model).

## Minitron

### Minitron 4B Base

Minitron is a family of small language models (SLMs) obtained by pruning NVIDIA's [Nemotron-4 15B](https://arxiv.org/abs/2402.16819) model. We prune the model's embedding size, attention heads, and MLP intermediate dimension, and then perform continued training with distillation to arrive at the final models.

Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to **40x fewer training tokens** per model than training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B, and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper](https://arxiv.org/abs/2407.14679) for more details.

Minitron models are for research and development only.

### HuggingFace Quickstart

The following code provides an example of how to load the Minitron-4B model and use it to perform text generation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
model_path = 'nvidia/Minitron-4B-Base'
tokenizer = AutoTokenizer.from_pretrained(model_path)

device = 'cuda'
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype, device_map=device)

# Prepare the input text
prompt = 'Complete the paragraph: our solar system is'
inputs = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

# Generate the output
outputs = model.generate(inputs, max_length=20)

# Decode and print the output
output_text = tokenizer.decode(outputs[0])
print(output_text)
```
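
This commit also lists Nemotron among the architectures supported by FlashAttention-2 and SDPA. The sketch below is not part of the original quickstart; it shows one way the attention backend might be selected through the `attn_implementation` argument of `from_pretrained`. The helper names `build_load_kwargs` and `load_minitron`, and the default values, are illustrative assumptions rather than Transformers APIs.

```python
# Hypothetical helpers for illustration; only `from_pretrained` and its
# `attn_implementation`, `torch_dtype`, and `device_map` arguments come from Transformers.
SUPPORTED_BACKENDS = ("eager", "sdpa", "flash_attention_2")


def build_load_kwargs(attn_implementation="sdpa", device="cuda"):
    """Return keyword arguments to pass to AutoModelForCausalLM.from_pretrained."""
    if attn_implementation not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown attention backend {attn_implementation!r}; "
            f"expected one of {SUPPORTED_BACKENDS}"
        )
    return {"attn_implementation": attn_implementation, "device_map": device}


def load_minitron(model_path="nvidia/Minitron-4B-Base", **overrides):
    """Load the tokenizer and model with the requested attention backend."""
    # Imported lazily so that build_load_kwargs can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,  # same dtype as the quickstart above
        **build_load_kwargs(**overrides),
    )
    return tokenizer, model
```

Calling `load_minitron(attn_implementation="flash_attention_2")` would request FlashAttention-2, which additionally assumes that the `flash-attn` package is installed and that a compatible GPU is available; `"sdpa"` relies only on PyTorch's built-in scaled dot-product attention.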

### License

Minitron is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).

### Evaluation Results

*5-shot performance.* Language Understanding evaluated using [Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300):

| Average |
| :---- |
| 58.6 |

*Zero-shot performance.* Evaluated using select datasets from the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) with additions:

| HellaSwag | Winogrande | GSM8K | ARC-C | XLSum |
| :------------- | :------------- | :------------- | :------------- | :------------- |
| 75.0 | 74.0 | 24.1 | 50.9 | 29.5 |

*Code generation performance.* Evaluated using [HumanEval](https://github.com/openai/human-eval):

| p@1, 0-Shot |
| :------------- |
| 23.3 |

Please refer to our [paper](https://arxiv.org/abs/2407.14679) for the full set of results.

### Citation

If you find our work helpful, please consider citing our paper:
```
@article{minitron2024,
      title={Compact Language Models via Pruning and Knowledge Distillation},
      author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
      journal={arXiv preprint arXiv:2407.14679},
      year={2024},
      url={https://arxiv.org/abs/2407.14679},
}
```

## NemotronConfig

[[autodoc]] NemotronConfig


## NemotronModel

[[autodoc]] NemotronModel
    - forward


## NemotronForCausalLM

[[autodoc]] NemotronForCausalLM
    - forward

## NemotronForSequenceClassification

[[autodoc]] NemotronForSequenceClassification
    - forward


## NemotronForQuestionAnswering

[[autodoc]] NemotronForQuestionAnswering
    - forward


## NemotronForTokenClassification

[[autodoc]] NemotronForTokenClassification
    - forward
@@ -67,6 +67,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel)
* [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel)
* [Nemotron](https://huggingface.co/docs/transformers/model_doc/nemotron)
* [NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)
* [OLMo](https://huggingface.co/docs/transformers/model_doc/olmo#transformers.OlmoModel)
* [OPT](https://huggingface.co/docs/transformers/model_doc/opt#transformers.OPTModel)
@@ -228,6 +229,7 @@ For now, Transformers supports SDPA inference and training for the following arc
* [Qwen2MoE](https://huggingface.co/docs/transformers/model_doc/qwen2_moe#transformers.Qwen2MoeModel)
* [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel)
* [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel)
* [Nemotron](https://huggingface.co/docs/transformers/model_doc/nemotron)
* [ViT](https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTModel)
* [ViTHybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid#transformers.ViTHybridModel)
* [ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae#transformers.ViTMAEModel)