ModernBERT bug fixes (#35404)

* bug fixes * organize imports * wrap cpu warning in reference_compile * Avoid needing repad_logits_with_grad, always repad with grads when training I'm not 100% that the conditional with "or labels is None" makes sense though - not sure what the intention is there. Perhaps we can remove that? * Revert "Avoid needing repad_logits_with_grad, always repad with grads when training" This reverts commit cedcb4e89bcea199a1135a0933e71f534b656239. * Fix grammar: keep -> keeps * Propagate grammar fix with modular_model_converter --------- Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com> Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2025-01-09 14:15:38 -05:00
parent e97d7a5be5
commit 1e3ddcb2d0
5 changed files with 53 additions and 19 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -505,7 +505,7 @@
      - local: model_doc/mobilebert
        title: MobileBERT
      - local: model_doc/modernbert
-        title: ModernBert
+        title: ModernBERT
      - local: model_doc/mpnet
        title: MPNet
      - local: model_doc/mpt
--- a/docs/source/en/model_doc/modernbert.md
+++ b/docs/source/en/model_doc/modernbert.md
@@ -14,7 +14,7 @@ rendered properly in your Markdown viewer.

 -->

-# ModernBert
+# ModernBERT

 <div class="flex flex-wrap space-x-1">
 <a href="https://huggingface.co/models?filter=modernbert">
@@ -27,7 +27,7 @@ rendered properly in your Markdown viewer.

 ## Overview

-The ModernBert model was proposed in [Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference](https://arxiv.org/abs/2412.13663) by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Galalgher, Raja Bisas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Grifin Adams, Jeremy Howard and Iacopo Poli.
+The ModernBERT model was proposed in [Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference](https://arxiv.org/abs/2412.13663) by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Galalgher, Raja Bisas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Grifin Adams, Jeremy Howard and Iacopo Poli.

 It is a refresh of the traditional encoder architecture, as used in previous models such as [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) and [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta).