Use HF papers (#38184)

* Use hf papers * Hugging Face papers * doi to hf papers * style
2025-06-13 13:07:09 +02:00
parent 1031ed5166
commit de24fb63ed
811 changed files with 2622 additions and 2617 deletions
--- a/docs/source/en/quantization/aqlm.md
+++ b/docs/source/en/quantization/aqlm.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

 # AQLM

-Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
+Additive Quantization of Language Models ([AQLM](https://huggingface.co/papers/2401.06118)) quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.

 AQLM also supports fine-tuning with [LoRA](https://huggingface.co/docs/peft/package_reference/lora) with the [PEFT](https://huggingface.co/docs/peft) library, and is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training.

--- a/docs/source/en/quantization/bitnet.md
+++ b/docs/source/en/quantization/bitnet.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

 # BitNet

-[BitNet](https://arxiv.org/abs/2402.17764) replaces traditional linear layers in Multi-Head Attention and feed-forward networks with specialized BitLinear layers. The BitLinear layers quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision.
+[BitNet](https://huggingface.co/papers/2402.17764) replaces traditional linear layers in Multi-Head Attention and feed-forward networks with specialized BitLinear layers. The BitLinear layers quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision.

 <figure style="text-align: center;">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/1.58llm_extreme_quantization/bitlinear.png" alt="Alt Text" />
@@ -27,7 +27,7 @@ BitNet models can't be quantized on the fly. They need to be quantized during pr

 1. Compute the average of the absolute values of the weight matrix and use as a scale.
 2. Divide the weights by the scale, round the values, constrain them between -1 and 1, and rescale them to continue in full precision.
-3. Activations are quantized to a specified bit-width (8-bit) using [absmax](https://arxiv.org/pdf/2208.07339) quantization (symmetric per channel quantization). This involves scaling the activations into a range of [−128,127].
+3. Activations are quantized to a specified bit-width (8-bit) using [absmax](https://huggingface.co/papers/2208.07339) quantization (symmetric per channel quantization). This involves scaling the activations into a range of [−128,127].

 Refer to this [PR](https://github.com/huggingface/nanotron/pull/180) to pretrain or fine-tune a 1.58-bit model with [Nanotron](https://github.com/huggingface/nanotron). For fine-tuning, convert a model from the Hugging Face to Nanotron format. Find the conversion steps in this [PR](https://github.com/huggingface/nanotron/pull/174).

--- a/docs/source/en/quantization/higgs.md
+++ b/docs/source/en/quantization/higgs.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

 # HIGGS

-[HIGGS](https://arxiv.org/abs/2411.17525) is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and state-of-the-art performance.
+[HIGGS](https://huggingface.co/papers/2411.17525) is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and state-of-the-art performance.

 Runtime support for HIGGS is implemented through the [FLUTE](https://github.com/HanGuo97/flute) library. Only the 70B and 405B variants of Llama 3 and Llama 3.0, and the 8B and 27B variants of Gemma 2 are currently supported. HIGGS also doesn't support quantized training and backward passes in general at the moment.

--- a/docs/source/en/quantization/vptq.md
+++ b/docs/source/en/quantization/vptq.md
@@ -69,4 +69,4 @@ VPTQ achieves better accuracy and higher throughput with lower quantization over

 See an example demo of VPTQ on the VPTQ Online Demo [Space](https://huggingface.co/spaces/microsoft/VPTQ) or try running the VPTQ inference [notebook](https://colab.research.google.com/github/microsoft/VPTQ/blob/main/notebooks/vptq_example.ipynb).

-For more information, read the VPTQ [paper](https://arxiv.org/pdf/2409.17066).
+For more information, read the VPTQ [paper](https://huggingface.co/papers/2409.17066).