Doc new front (#14590)

* Convert PretrainedConfig doc to Markdown
* Use syntax
* Add necessary doc files (#14496)
* Doc fixes (#14499)
* Fixes for the new front
* Convert DETR file for table
* Title is needed
* Simplify a bit
* Even simpler
* Remove imports
* Fix typo in toctree (#14516)
* Fix checkpoints badge
* Update versions.yml format (#14517)
* Doc new front github actions (#14512)
* Doc new front github actions
* Fix docstring
* Fix feature extraction utils import (#14515)
* Address Julien's comments
* Push to doc-builder
* Ready for merge
* Remove old build and deploy
* Doc misc fixes (#14583)
* Rm versions.yml from doc
* Fix converting.rst
* Rm pretrained_models from toctree
* Fix index links (#14567)
* Fix links in README
* Localized READMEs
* Fix copy script
* Fix find doc script
* Update README_ko.md

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Adapt build command to new CLI tools (#14578)
* Fix typo
* Fix doc interlinks (#14589)
* Convert PretrainedConfig doc to Markdown
* Use syntax
* Rm pattern <[a-z]+(.html).*>
* Rm huggingface.co/transformers/master
* Rm .html
* Rm .html from index.mdx
* Rm .html from model_summary.rst
* Update index.mdx rm html
* Update remove .html
* Fix inner doc links
* Fix interlink in preprocessing.rst
* Update pr_checks

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Styling

Co-authored-by: Mishig Davaadorj <mishig.davaadorj@coloradocollege.edu>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
2021-12-01 14:13:02 -05:00
parent 14cc50d081
commit 4df7d05a87
43 changed files with 1467 additions and 3185 deletions
--- a/docs/source/_static/css/Calibre-Light.ttf
+++ b/docs/source/_static/css/Calibre-Light.ttf
--- a/docs/source/_static/css/Calibre-Medium.otf
+++ b/docs/source/_static/css/Calibre-Medium.otf
--- a/docs/source/_static/css/Calibre-Regular.otf
+++ b/docs/source/_static/css/Calibre-Regular.otf
--- a/docs/source/_static/css/Calibre-Thin.otf
+++ b/docs/source/_static/css/Calibre-Thin.otf
--- a/docs/source/_static/css/code-snippets.css
+++ b/docs/source/_static/css/code-snippets.css
@@ -1,16 +0,0 @@
-
-.highlight .c1, .highlight .sd{
-    color: #999
-}
-
-.highlight .nn, .highlight .k, .highlight .s1, .highlight .nb, .highlight .bp, .highlight .kc {
-    color: #FB8D68;
-}
-
-.highlight .kn, .highlight .nv, .highlight .s2, .highlight .ow {
-    color: #6670FF;
-}
-
-.highlight .gp {
-    color: #FB8D68;
-}
--- a/docs/source/_static/css/huggingface.css
+++ b/docs/source/_static/css/huggingface.css
@@ -1,350 +0,0 @@
-/* Our DOM objects */
-
-/* Colab dropdown */
-
-table.center-aligned-table td {
-    text-align: center;
-}
-
-table.center-aligned-table th {
-    text-align: center;
-    vertical-align: middle;
-}
-
-.colab-dropdown {
-    position: relative;
-    display: inline-block;
-}
-  
-.colab-dropdown-content {
-    display: none;
-    position: absolute;
-    background-color: #f9f9f9;
-    min-width: 117px;
-    box-shadow: 0px 8px 16px 0px rgba(0,0,0,0.2);
-    z-index: 1;
-}
-  
-.colab-dropdown-content button {
-    color: #6670FF;
-    background-color: #f9f9f9;
-    font-size: 12px;
-    border: none;
-    min-width: 117px;
-    padding: 5px 5px;
-    text-decoration: none;
-    display: block;
-}
-  
-.colab-dropdown-content button:hover {background-color: #eee;}
-  
-.colab-dropdown:hover .colab-dropdown-content {display: block;}
-
-/* Version control */
-
-.version-button {
-    background-color: #6670FF;
-    color: white;
-    border: none;
-    padding: 5px;
-    font-size: 15px;
-    cursor: pointer;
-}
-
-.version-button:hover, .version-button:focus {
-    background-color: #A6B0FF;
-}
- 
-.version-dropdown {
-    display: none;
-    background-color: #6670FF;
-    min-width: 160px;
-    overflow: auto;
-    font-size: 15px;
-}
-  
-.version-dropdown a {
-    color: white;
-    padding: 3px 4px;
-    text-decoration: none;
-    display: block;
-}
-  
-.version-dropdown a:hover {
-    background-color: #A6B0FF;
-}
-  
-.version-show {
-    display: block;
-}
-
-/* Framework selector */
-
-.framework-selector {
-    display: flex;
-    flex-direction: row;
-    justify-content: flex-end;
-    margin-right: 30px;
-}
-
-.framework-selector > button {
-    background-color: white;
-    color: #6670FF;
-    border: 1px solid #6670FF;
-    padding: 5px;
-}
-
-.framework-selector > button.selected{
-    background-color: #6670FF;
-    color: white;
-    border: 1px solid #6670FF;
-    padding: 5px;
-}
-
-/* Copy button */
-
-a.copybtn {
-    margin: 3px;
-}
-
-/* The literal code blocks */
-.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
-    color: #6670FF;
-}
-
-/* To keep the logo centered */
-.wy-side-scroll {
-    width: auto;
-    font-size: 20px;
-}
-
-/* The div that holds the Hugging Face logo */
-.HuggingFaceDiv {
-    width: 100%
-}
-
-/* The research field on top of the toc tree */
-.wy-side-nav-search{
-    padding-top: 0;
-    background-color: #6670FF;
-}
-
-/* The toc tree */
-.wy-nav-side{
-    background-color: #6670FF;
-}
-
-/* The section headers in the toc tree */
-.wy-menu-vertical p.caption{
-    background-color: #4d59ff;
-    line-height: 40px;
-}
-
-/* The selected items in the toc tree */
-.wy-menu-vertical li.current{
-    background-color: #A6B0FF;
-}
-
-/* When a list item that does belong to the selected block from the toc tree is hovered */
-.wy-menu-vertical li.current a:hover{
-    background-color: #B6C0FF;
-}
-
-/* When a list item that does NOT belong to the selected block from the toc tree is hovered. */
-.wy-menu-vertical li a:hover{
-    background-color: #A7AFFB;
-}
-
-/* The text items on the toc tree */
-.wy-menu-vertical a {
-    color: #FFFFDD;
-    font-family: Calibre-Light, sans-serif;
-}
-.wy-menu-vertical header, .wy-menu-vertical p.caption{
-    color: white;
-    font-family: Calibre-Light, sans-serif;
-}
-
-/* The color inside the selected toc tree block */
-.wy-menu-vertical li.toctree-l2 a, .wy-menu-vertical li.toctree-l3 a, .wy-menu-vertical li.toctree-l4 a {
-    color: black;
-}
-
-/* Inside the depth-2 selected toc tree block */
-.wy-menu-vertical li.toctree-l2.current>a {
-    background-color: #B6C0FF
-}
-.wy-menu-vertical li.toctree-l2.current li.toctree-l3>a {
-    background-color: #C6D0FF
-}
-
-/* Inside the depth-3 selected toc tree block */
-.wy-menu-vertical li.toctree-l3.current li.toctree-l4>a{
-    background-color: #D6E0FF
-}
-
-/* Inside code snippets */
-.rst-content dl:not(.docutils) dt{
-    font-size: 15px;
-}
-
-/* Links */
-a {
-    color: #6670FF;
-}
-
-/* Content bars */
-.rst-content dl:not(.docutils) dt {
-    background-color: rgba(251, 141, 104, 0.1);
-    border-right: solid 2px #FB8D68;
-    border-left: solid 2px #FB8D68;
-    color: #FB8D68;
-    font-family: Calibre-Light, sans-serif;
-    border-top: none;
-    font-style: normal !important;
-}
-
-/* Expand button */
-.wy-menu-vertical li.toctree-l2 span.toctree-expand,
-.wy-menu-vertical li.on a span.toctree-expand, .wy-menu-vertical li.current>a span.toctree-expand,
-.wy-menu-vertical li.toctree-l3 span.toctree-expand{
-    color: black;
-}
-
-/* Max window size */
-.wy-nav-content{
-    max-width: 1200px;
-}
-
-/* Mobile header */
-.wy-nav-top{
-    background-color: #6670FF;
-}
-
-
-/* Source spans */
-.rst-content .viewcode-link, .rst-content .viewcode-back{
-    color: #6670FF;
-    font-size: 110%;
-    letter-spacing: 2px;
-    text-transform: uppercase;
-}
-
-/* It would be better for table to be visible without horizontal scrolling */
-.wy-table-responsive table td, .wy-table-responsive table th{
-    white-space: normal;
-}
-
-.footer {
-    margin-top: 20px;
-}
-
-.footer__Social {
-    display: flex;
-    flex-direction: row;
-}
-
-.footer__CustomImage {
-    margin: 2px 5px 0 0;
-}
-
-/* class and method names in doc */
-.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) code.descclassname{
-    font-family: Calibre, sans-serif;
-    font-size: 20px !important;
-}
-
-/* class name in doc*/
-.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname{
-    margin-right: 10px;
-    font-family: Calibre-Medium, sans-serif;
-}
-
-/* Method and class parameters */
-.sig-param{
-    line-height: 23px;
-}
-
-/* Class introduction "class" string at beginning */
-.rst-content dl:not(.docutils) .property{
-    font-size: 18px;
-    color: black;
-}
-
-
-/* FONTS */
-body{
-    font-family: Calibre, sans-serif;
-    font-size: 16px;
-}
-
-h1 {
-    font-family: Calibre-Thin, sans-serif;
-    font-size: 70px;
-}
-
-h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
-    font-family: Calibre-Medium, sans-serif;
-}
-
-@font-face {
-    font-family: Calibre-Medium;
-    src: url(./Calibre-Medium.otf);
-    font-weight:400;
-}
-
-@font-face {
-    font-family: Calibre;
-    src: url(./Calibre-Regular.otf);
-    font-weight:400;
-}
-
-@font-face {
-    font-family: Calibre-Light;
-    src: url(./Calibre-Light.ttf);
-    font-weight:400;
-}
-
-@font-face {
-    font-family: Calibre-Thin;
-    src: url(./Calibre-Thin.otf);
-    font-weight:400;
-}
-
-
-/**
- * Nav Links to other parts of huggingface.co
- */
- div.menu {
-    position: absolute;
-    top: 0;
-    right: 0;
-    padding-top: 20px;
-    padding-right: 20px;
-    z-index: 1000;
-}
-div.menu a {
-    font-size: 14px;
-    letter-spacing: 0.3px;
-    text-transform: uppercase;
-    color: white;
-    -webkit-font-smoothing: antialiased;
-    background: linear-gradient(0deg, #6671ffb8, #9a66ffb8 50%);
-    padding: 10px 16px 6px 16px;
-    border-radius: 3px;
-    margin-left: 12px;
-    position: relative;
-}
-div.menu a:active {
-    top: 1px;
-}
-@media (min-width: 768px) and (max-width: 1750px) {
-    .wy-breadcrumbs {
-        margin-top: 32px;
-    }
-}
-@media (max-width: 768px) {
-    div.menu {
-        display: none;
-    }
-}
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
--- a/docs/source/_static/js/huggingface_logo.svg
+++ b/docs/source/_static/js/huggingface_logo.svg
--- a/docs/source/community.md
+++ b/docs/source/community.md
@@ -6,7 +6,7 @@ This page regroups resources around 🤗 Transformers developed by the community

 | Resource     |      Description      |      Author      |
 |:----------|:-------------|------:|
-| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](https://huggingface.co/transformers/master/glossary.html) that has been put into a form which can be easily learnt/revised using [Anki ](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |
+| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learnt/revised using [Anki ](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |

 ## Community notebooks:

--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -1,227 +0,0 @@
-# -*- coding: utf-8 -*-
-#
-# Configuration file for the Sphinx documentation builder.
-#
-# This file does only contain a selection of the most common options. For a
-# full list see the documentation:
-# http://www.sphinx-doc.org/en/master/config
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-#
-import os
-import sys
-
-sys.path.insert(0, os.path.abspath("../../src"))
-
-
-# -- Project information -----------------------------------------------------
-
-project = "transformers"
-copyright = "2020, The Hugging Face Team, Licenced under the Apache License, Version 2.0"
-author = "huggingface"
-
-# The short X.Y version
-version = ""
-# The full version, including alpha/beta/rc tags
-release = "4.13.0.dev0"
-
-
-
-
-
-
-
-
-
-# Prefix link to point to master, comment this during version release and uncomment below line
-extlinks = {"prefix_link": ("https://github.com/huggingface/transformers/blob/master/%s", "")}
-# Prefix link to always point to corresponding version, uncomment this during version release
-# extlinks = {'prefix_link': ('https://github.com/huggingface/transformers/blob/v'+ release + '/%s', '')}
-
-# -- General configuration ---------------------------------------------------
-
-# If your documentation needs a minimal Sphinx version, state it here.
-#
-# needs_sphinx = '1.0'
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
-    "sphinx.ext.autodoc",
-    "sphinx.ext.extlinks",
-    "sphinx.ext.coverage",
-    "sphinx.ext.napoleon",
-    "recommonmark",
-    "sphinx.ext.viewcode",
-    "sphinx_markdown_tables",
-    "sphinxext.opengraph",
-    "sphinx_copybutton",
-]
-
-# Add any paths that contain templates here, relative to this directory.
-templates_path = ["_templates"]
-
-# The suffix(es) of source filenames.
-# You can specify multiple suffix as a list of string:
-#
-source_suffix = [".rst", ".md"]
-# source_suffix = '.rst'
-
-# The master toctree document.
-master_doc = "index"
-
-# The language for content autogenerated by Sphinx. Refer to documentation
-# for a list of supported languages.
-#
-# This is also used if you do content translation via gettext catalogs.
-# Usually you set "language" from the command line for these cases.
-language = None
-
-# List of patterns, relative to source directory, that match files and
-# directories to ignore when looking for source files.
-# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
-
-# The name of the Pygments (syntax highlighting) style to use.
-pygments_style = None
-
-# Remove the prompt when copying examples
-copybutton_prompt_text = r">>> |\.\.\. "
-copybutton_prompt_is_regexp = True
-
-# -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  See the documentation for
-# a list of builtin themes.
-#
-html_theme = "sphinx_rtd_theme"
-
-# Theme options are theme-specific and customize the look and feel of a theme
-# further.  For a list of options available for each theme, see the
-# documentation.
-#
-html_theme_options = {"analytics_id": "UA-83738774-2", "navigation_with_keys": True}
-
-#  Configuration for OpenGraph and Twitter Card Tags.
-# These are responsible for creating nice shareable social images https://ahrefs.com/blog/open-graph-meta-tags/
-# https://ogp.me/#type_website
-ogp_image = "https://huggingface.co/front/thumbnails/transformers.png"
-ogp_description = "State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0. Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone"
-ogp_description_length = 160
-
-ogp_custom_meta_tags = [
-    f'<meta name="twitter:image" content="{ogp_image}">',
-    f'<meta name="twitter:description" content="{ogp_description}">',
-]
-
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ["_static"]
-
-# Custom sidebar templates, must be a dictionary that maps document names
-# to template names.
-#
-# The default sidebars (for documents that don't match any pattern) are
-# defined by theme itself.  Builtin themes are using these templates by
-# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
-# 'searchbox.html']``.
-#
-# html_sidebars = {}
-
-# This must be the name of an image file (path relative to the configuration
-# directory) that is the favicon of the docs. Modern browsers use this as
-# the icon for tabs, windows and bookmarks. It should be a Windows-style
-# icon file (.ico).
-html_favicon = "favicon.ico"
-
-
-# -- Options for HTMLHelp output ---------------------------------------------
-
-# Output file base name for HTML help builder.
-htmlhelp_basename = "transformersdoc"
-
-
-# -- Options for LaTeX output ------------------------------------------------
-
-latex_elements = {
-    # The paper size ('letterpaper' or 'a4paper').
-    #
-    # 'papersize': 'letterpaper',
-    # The font size ('10pt', '11pt' or '12pt').
-    #
-    # 'pointsize': '10pt',
-    # Additional stuff for the LaTeX preamble.
-    #
-    # 'preamble': '',
-    # Latex figure (float) alignment
-    #
-    # 'figure_align': 'htbp',
-}
-
-# Grouping the document tree into LaTeX files. List of tuples
-# (source start file, target name, title,
-#  author, documentclass [howto, manual, or own class]).
-latex_documents = [
-    (master_doc, "transformers.tex", "transformers Documentation", "huggingface", "manual"),
-]
-
-
-# -- Options for manual page output ------------------------------------------
-
-# One entry per manual page. List of tuples
-# (source start file, name, description, authors, manual section).
-man_pages = [(master_doc, "transformers", "transformers Documentation", [author], 1)]
-
-
-# -- Options for Texinfo output ----------------------------------------------
-
-# Grouping the document tree into Texinfo files. List of tuples
-# (source start file, target name, title, author,
-#  dir menu entry, description, category)
-texinfo_documents = [
-    (
-        master_doc,
-        "transformers",
-        "transformers Documentation",
-        author,
-        "transformers",
-        "One line description of project.",
-        "Miscellaneous",
-    ),
-]
-
-
-# -- Options for Epub output -------------------------------------------------
-
-# Bibliographic Dublin Core info.
-epub_title = project
-
-# The unique identifier of the text. This can be a ISBN number
-# or the project homepage.
-#
-# epub_identifier = ''
-
-# A unique identification for the text.
-#
-# epub_uid = ''
-
-# A list of files that should not be packed into the epub file.
-epub_exclude_files = ["search.html"]
-
-# Localization
-locale_dirs = ['locale/']
-gettext_compact = False
-
-def setup(app):
-    app.add_css_file("css/huggingface.css")
-    app.add_css_file("css/code-snippets.css")
-    app.add_js_file("js/custom.js")
-
-
-# -- Extension configuration -------------------------------------------------
--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
@@ -26,22 +26,22 @@ BERT
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
-<https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the
+<https://github.com/google-research/bert#pre-trained-models>`_) in a PyTorch save file by using the
 :prefix_link:`convert_bert_original_tf_checkpoint_to_pytorch.py
 <src/transformers/models/bert/convert_bert_original_tf_checkpoint_to_pytorch.py>` script.

-This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
-configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
-from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
-can be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , :prefix_link:`run_glue.py
-<examples/pytorch/text-classification/run_glue.py>` \ ).
+This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``) and the associated
+configuration file (``bert_config.json``), and creates a PyTorch model for this configuration, loads the weights from
+the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can
+be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , :prefix_link:`run_glue.py
+<examples/pytorch/text-classification/run_glue.py>` ).

 You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
-checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
-``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
+checkpoint (the three files starting with ``bert_model.ckpt``) but be sure to keep the configuration file (\
+``bert_config.json``) and the vocabulary file (``vocab.txt``) as these are needed for the PyTorch model too.

-To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install
-tensorflow``\ ). The rest of the repository only requires PyTorch.
+To run this specific conversion script you will need to have TensorFlow and PyTorch installed (``pip install
+tensorflow``). The rest of the repository only requires PyTorch.

 Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:

@@ -64,9 +64,9 @@ Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
 :prefix_link:`convert_albert_original_tf_checkpoint_to_pytorch.py
 <src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py>` script.

-The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
-configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
-will need to have TensorFlow and PyTorch installed.
+The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``) and the accompanying
+configuration file (``albert_config.json``), then creates and saves a PyTorch model. To run this conversion you will
+need to have TensorFlow and PyTorch installed.

 Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:

@@ -104,7 +104,7 @@ OpenAI GPT-2
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here
-<https://github.com/openai/gpt-2>`__\ )
+<https://github.com/openai/gpt-2>`__)

 .. code-block:: shell

@@ -120,7 +120,7 @@ Transformer-XL
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here
-<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
+<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__)

 .. code-block:: shell

--- a/docs/source/favicon.ico
+++ b/docs/source/favicon.ico
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -0,0 +1,262 @@
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# 🤗 Transformers
+
+State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow
+
+🤗 Transformers (formerly known as _pytorch-transformers_ and _pytorch-pretrained-bert_) provides general-purpose
+architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural
+Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between Jax,
+PyTorch and TensorFlow.
+
+This is the documentation of our repository [transformers](https://github.com/huggingface/transformers). You can
+also follow our [online course](https://huggingface.co/course) that teaches how to use this library, as well as the
+other libraries developed by Hugging Face and the Hub.
+
+## If you are looking for custom support from the Hugging Face team
+
+<a target="_blank" href="https://huggingface.co/support">
+<img alt="HuggingFace Expert Acceleration Program" src="https://huggingface.co/front/thumbnails/support.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
+</a><br>
+
+## Features
+
+- High performance on NLU and NLG tasks
+- Low barrier to entry for educators and practitioners
+
+State-of-the-art NLP for everyone:
+
+- Deep learning researchers
+- Hands-on practitioners
+- AI/ML/NLP teachers and educators
+
+Lower compute costs, smaller carbon footprint:
+
+- Researchers can share trained models instead of always retraining
+- Practitioners can reduce compute time and production costs
+- 8 architectures with over 30 pretrained models, some in more than 100 languages
+
+Choose the right framework for every part of a model's lifetime:
+
+- Train state-of-the-art models in 3 lines of code
+- Deep interoperability between Jax, PyTorch and TensorFlow models
+- Move a single model between Jax/PyTorch/TensorFlow frameworks at will
+- Seamlessly pick the right framework for training, evaluation, production
+
+Support for Jax is still experimental (only a few models right now); expect to see it grow in the coming months!
+
+[All the model checkpoints](https://huggingface.co/models) are seamlessly integrated from the huggingface.co [model
+hub](https://huggingface.co) where they are uploaded directly by [users](https://huggingface.co/users) and
+[organizations](https://huggingface.co/organizations).
+
+Current number of checkpoints: <img src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen">
+
+## Contents
+
+The documentation is organized in five parts:
+
+- **GET STARTED** contains a quick tour, the installation instructions and some useful information about our philosophy
+  and a glossary.
+- **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
+- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
+- **RESEARCH** focuses on tutorials that have less to do with how to use the library and more to do with general
+  research on transformer models.
+- **API** contains the documentation of each public class and function, grouped in:
+
+  - **MAIN CLASSES** for the main classes exposing the important APIs of the library.
+  - **MODELS** for the classes and functions related to each model implemented in the library.
+  - **INTERNAL HELPERS** for the classes and functions we use internally.
+
+The library currently contains Jax, PyTorch and TensorFlow implementations, pretrained model weights, usage scripts and
+conversion utilities for the following models.
+
+### Supported models
+
+<!--This list is updated automatically from the README with _make fix-copies_. Do not update manually! -->
+
+1. **[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
+1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
+1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
+1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
+1. **[BEiT](model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei.
+1. **[BERT](model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
+1. **[BERTweet](model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen.
+1. **[BERT For Sequence Generation](model_doc/bertgeneration)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+1. **[BigBird-RoBERTa](model_doc/bigbird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
+1. **[BigBird-Pegasus](model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
+1. **[Blenderbot](model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
+1. **[BlenderbotSmall](model_doc/blenderbot_small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
+1. **[BORT](model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
+1. **[ByT5](model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
+1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
+1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
+1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
+1. **[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
+1. **[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
+1. **[DeBERTa](model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[DeBERTa-v2](model_doc/deberta_v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[DeiT](model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
+1. **[DETR](model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
+1. **[DialoGPT](model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
+1. **[DistilBERT](model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/master/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
+1. **[DPR](model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
+1. **[EncoderDecoder](model_doc/encoderdecoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
+1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
+1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
+1. **[GPT](model_doc/gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
+1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
+1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
+1. **[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
+1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
+1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
+1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
+1. **[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
+1. **[LayoutXLM](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
+1. **[LED](model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+1. **[Longformer](model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+1. **[LUKE](model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
+1. **[LXMERT](model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
+1. **[M2M100](model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
+1. **[MarianMT](model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+1. **[MBart](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
+1. **[MBart-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
+1. **[Megatron-BERT](model_doc/megatron_bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[Megatron-GPT2](model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
+1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
+1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
+1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
+1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
+1. **[Reformer](model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
+1. **[RemBERT](model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
+1. **[RoBERTa](model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+1. **[RoFormer](model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
+1. **[SegFormer](model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+1. **[SEW](model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
+1. **[SEW-D](model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
+1. **[SpeechToTextTransformer](model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
+1. **[SpeechToTextTransformer2](model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
+1. **[Splinter](model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
+1. **[SqueezeBert](model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
+1. **[T5](model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[T5v1.1](model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[TAPAS](model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
+1. **[Transformer-XL](model_doc/transformerxl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
+1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
+1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
+1. **[UniSpeechSat](model_doc/unispeech_sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
+1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
+1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
+1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
+1. **[XLM-ProphetNet](model_doc/xlmprophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[XLM-RoBERTa](model_doc/xlmroberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
+1. **[XLNet](model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
+1. **[XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
+
+
+### Supported frameworks
+
+The table below shows the current support in the library for each of those models: whether they have a Python
+tokenizer (called "slow"), a "fast" tokenizer backed by the 🤗 Tokenizers library, and whether they are supported in
+Jax (via Flax), PyTorch, and/or TensorFlow.
+
+<!--This table is updated automatically from the auto modules with _make fix-copies_. Do not update manually!-->
+
+|            Model            | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax support |
+|-----------------------------|----------------|----------------|-----------------|--------------------|--------------|
+|           ALBERT            |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|            BART             |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|            BEiT             |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
+|            BERT             |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|       Bert Generation       |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|           BigBird           |       ✅       |       ✅       |       ✅        |         ❌         |      ✅      |
+|       BigBirdPegasus        |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|         Blenderbot          |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|       BlenderbotSmall       |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|          CamemBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|           Canine            |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|            CLIP             |       ✅       |       ✅       |       ✅        |         ❌         |      ✅      |
+|          ConvBERT           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|            CTRL             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|           DeBERTa           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|         DeBERTa-v2          |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|            DeiT             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|            DETR             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|         DistilBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|             DPR             |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|           ELECTRA           |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|       Encoder decoder       |       ❌       |       ❌       |       ✅        |         ✅         |      ✅      |
+| FairSeq Machine-Translation |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|          FlauBERT           |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|            FNet             |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
+|     Funnel Transformer      |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|           GPT Neo           |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
+|            GPT-J            |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
+|           Hubert            |       ❌       |       ❌       |       ✅        |         ✅         |      ❌      |
+|           I-BERT            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|          ImageGPT           |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|          LayoutLM           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|         LayoutLMv2          |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
+|             LED             |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|         Longformer          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|            LUKE             |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|           LXMERT            |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|           M2M100            |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|           Marian            |       ✅       |       ❌       |       ✅        |         ✅         |      ✅      |
+|            mBART            |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|        MegatronBert         |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|         MobileBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|            MPNet            |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|             mT5             |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|         OpenAI GPT          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|        OpenAI GPT-2         |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|           Pegasus           |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|         ProphetNet          |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|           QDQBert           |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|             RAG             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|          Reformer           |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
+|           RemBERT           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|          RetriBERT          |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
+|           RoBERTa           |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|          RoFormer           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|          SegFormer          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|             SEW             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|            SEW-D            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|   Speech Encoder decoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|         Speech2Text         |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|        Speech2Text2         |       ✅       |       ❌       |       ❌        |         ❌         |      ❌      |
+|          Splinter           |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
+|         SqueezeBERT         |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
+|             T5              |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
+|            TAPAS            |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|       Transformer-XL        |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|            TrOCR            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|          UniSpeech          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|        UniSpeechSat         |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|   Vision Encoder decoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
+|    VisionTextDualEncoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
+|         VisualBert          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|             ViT             |       ❌       |       ❌       |       ✅        |         ✅         |      ✅      |
+|          Wav2Vec2           |       ✅       |       ❌       |       ✅        |         ✅         |      ✅      |
+|             XLM             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|         XLM-RoBERTa         |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+|        XLMProphetNet        |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+|            XLNet            |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+
+<!-- End table-->
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -1,710 +0,0 @@
-Transformers
-=======================================================================================================================
-
-State-of-the-art Natural Language Processing for Jax, Pytorch and TensorFlow
-
-🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides general-purpose
-architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural
-Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between Jax,
-PyTorch and TensorFlow.
-
-This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`__. You can
-also follow our `online course <https://huggingface.co/course>`__ that teaches how to use this library, as well as the
-other libraries developed by Hugging Face and the Hub.
-
-If you are looking for custom support from the Hugging Face team
-----------------------------------------------------------------------------------------------------------------------
-
-.. raw:: html
-
-    <a target="_blank" href="https://huggingface.co/support">
-        <img alt="HuggingFace Expert Acceleration Program" src="https://huggingface.co/front/thumbnails/support.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
-    </a><br>
-
-Features
-----------------------------------------------------------------------------------------------------------------------
-
- High performance on NLU and NLG tasks
- Low barrier to entry for educators and practitioners
-
-State-of-the-art NLP for everyone:
-
- Deep learning researchers
- Hands-on practitioners
- AI/ML/NLP teachers and educators
-
-..
-    Copyright 2020 The HuggingFace Team. All rights reserved.
-
-    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-    the License. You may obtain a copy of the License at
-
-        http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-    specific language governing permissions and limitations under the License.
-
-Lower compute costs, smaller carbon footprint:
-
- Researchers can share trained models instead of always retraining
- Practitioners can reduce compute time and production costs
- 8 architectures with over 30 pretrained models, some in more than 100 languages
-
-Choose the right framework for every part of a model's lifetime:
-
- Train state-of-the-art models in 3 lines of code
- Deep interoperability between Jax, Pytorch and TensorFlow models
- Move a single model between Jax/PyTorch/TensorFlow frameworks at will
- Seamlessly pick the right framework for training, evaluation, production
-
-The support for Jax is still experimental (with a few models right now), expect to see it grow in the coming months!
-
-`All the model checkpoints <https://huggingface.co/models>`__ are seamlessly integrated from the huggingface.co `model
-hub <https://huggingface.co>`__ where they are uploaded directly by `users <https://huggingface.co/users>`__ and
-`organizations <https://huggingface.co/organizations>`__.
-
-Current number of checkpoints: |checkpoints|
-
-.. |checkpoints| image:: https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen
-
-Contents
-----------------------------------------------------------------------------------------------------------------------
-
-The documentation is organized in five parts:
-
- **GET STARTED** contains a quick tour, the installation instructions and some useful information about our philosophy
-  and a glossary.
- **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
-  transformers model
- The three last section contain the documentation of each public class and function, grouped in:
-
-    - **MAIN CLASSES** for the main classes exposing the important APIs of the library.
-    - **MODELS** for the classes and functions related to each model implemented in the library.
-    - **INTERNAL HELPERS** for the classes and functions we use internally.
-
-The library currently contains Jax, PyTorch and Tensorflow implementations, pretrained model weights, usage scripts and
-conversion utilities for the following models.
-
-Supported models
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-..
-    This list is updated automatically from the README with `make fix-copies`. Do not update manually!
-
-1. :doc:`ALBERT <model_doc/albert>` (from Google Research and the Toyota Technological Institute at Chicago) released
-   with the paper `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
-   <https://arxiv.org/abs/1909.11942>`__, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
-   Sharma, Radu Soricut.
-2. :doc:`BART <model_doc/bart>` (from Facebook) released with the paper `BART: Denoising Sequence-to-Sequence
-   Pre-training for Natural Language Generation, Translation, and Comprehension
-   <https://arxiv.org/pdf/1910.13461.pdf>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman
-   Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
-3. :doc:`BARThez <model_doc/barthez>` (from École polytechnique) released with the paper `BARThez: a Skilled Pretrained
-   French Sequence-to-Sequence Model <https://arxiv.org/abs/2010.12321>`__ by Moussa Kamal Eddine, Antoine J.-P.
-   Tixier, Michalis Vazirgiannis.
-4. :doc:`BARTpho <model_doc/bartpho>` (from VinAI Research) released with the paper `BARTpho: Pre-trained
-   Sequence-to-Sequence Models for Vietnamese <https://arxiv.org/abs/2109.09701>`__ by Nguyen Luong Tran, Duong Minh Le
-   and Dat Quoc Nguyen.
-5. :doc:`BEiT <model_doc/beit>` (from Microsoft) released with the paper `BEiT: BERT Pre-Training of Image Transformers
-   <https://arxiv.org/abs/2106.08254>`__ by Hangbo Bao, Li Dong, Furu Wei.
-6. :doc:`BERT <model_doc/bert>` (from Google) released with the paper `BERT: Pre-training of Deep Bidirectional
-   Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ by Jacob Devlin, Ming-Wei Chang,
-   Kenton Lee and Kristina Toutanova.
-7. :doc:`BERTweet <model_doc/bertweet>` (from VinAI Research) released with the paper `BERTweet: A pre-trained language
-   model for English Tweets <https://aclanthology.org/2020.emnlp-demos.2/>`__ by Dat Quoc Nguyen, Thanh Vu and Anh Tuan
-   Nguyen.
-8. :doc:`BERT For Sequence Generation <model_doc/bertgeneration>` (from Google) released with the paper `Leveraging
-   Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi
-   Narayan, Aliaksei Severyn.
-9. :doc:`BigBird-RoBERTa <model_doc/bigbird>` (from Google Research) released with the paper `Big Bird: Transformers
-   for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua
-   Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
-10. :doc:`BigBird-Pegasus <model_doc/bigbird_pegasus>` (from Google Research) released with the paper `Big Bird:
-    Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava
-    Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr
-    Ahmed.
-11. :doc:`Blenderbot <model_doc/blenderbot>` (from Facebook) released with the paper `Recipes for building an
-    open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
-    Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
-12. :doc:`BlenderbotSmall <model_doc/blenderbot_small>` (from Facebook) released with the paper `Recipes for building
-    an open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju,
-    Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
-13. :doc:`BORT <model_doc/bort>` (from Alexa) released with the paper `Optimal Subarchitecture Extraction For BERT
-    <https://arxiv.org/abs/2010.10499>`__ by Adrian de Wynter and Daniel J. Perry.
-14. :doc:`ByT5 <model_doc/byt5>` (from Google Research) released with the paper `ByT5: Towards a token-free future with
-    pre-trained byte-to-byte models <https://arxiv.org/abs/2105.13626>`__ by Linting Xue, Aditya Barua, Noah Constant,
-    Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
-15. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
-    French Language Model <https://arxiv.org/abs/1911.03894>`__ by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz
-    Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
-16. :doc:`CANINE <model_doc/canine>` (from Google Research) released with the paper `CANINE: Pre-training an Efficient
-    Tokenization-Free Encoder for Language Representation <https://arxiv.org/abs/2103.06874>`__ by Jonathan H. Clark,
-    Dan Garrette, Iulia Turc, John Wieting.
-17. :doc:`CLIP <model_doc/clip>` (from OpenAI) released with the paper `Learning Transferable Visual Models From
-    Natural Language Supervision <https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy,
-    Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
-    Krueger, Ilya Sutskever.
-18. :doc:`ConvBERT <model_doc/convbert>` (from YituTech) released with the paper `ConvBERT: Improving BERT with
-    Span-based Dynamic Convolution <https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou,
-    Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
-19. :doc:`CPM <model_doc/cpm>` (from Tsinghua University) released with the paper `CPM: A Large-scale Generative
-    Chinese Pre-trained Language Model <https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei
-    Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng,
-    Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang,
-    Juanzi Li, Xiaoyan Zhu, Maosong Sun.
-20. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
-    Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
-    Lav R. Varshney, Caiming Xiong and Richard Socher.
-21. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT with
-    Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu
-    Chen.
-22. :doc:`DeBERTa-v2 <model_doc/deberta_v2>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT
-    with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
-    Weizhu Chen.
-23. :doc:`DeiT <model_doc/deit>` (from Facebook) released with the paper `Training data-efficient image transformers &
-    distillation through attention <https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs
-    Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
-24. :doc:`DETR <model_doc/detr>` (from Facebook) released with the paper `End-to-End Object Detection with Transformers
-    <https://arxiv.org/abs/2005.12872>`__ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier,
-    Alexander Kirillov, Sergey Zagoruyko.
-25. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
-    Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
-    Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
-26. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
-    distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
-    Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
-    <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
-    <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
-    `DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
-    version of DistilBERT.
-27. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
-    Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
-    Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
-28. :doc:`EncoderDecoder <model_doc/encoderdecoder>` (from Google Research) released with the paper `Leveraging
-    Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi
-    Narayan, Aliaksei Severyn.
-29. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
-    Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
-    Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
-30. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
-    Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
-    Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
-31. :doc:`FNet <model_doc/fnet>` (from Google Research) released with the paper `FNet: Mixing Tokens with Fourier
-    Transforms <https://arxiv.org/abs/2105.03824>`__ by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago
-    Ontanon.
-32. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
-    Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
-    Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
-33. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
-    Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
-    and Ilya Sutskever.
-34. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
-    Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
-    Luan, Dario Amodei** and Ilya Sutskever**.
-35. :doc:`GPT-J <model_doc/gptj>` (from EleutherAI) released in the repository `kingoflolz/mesh-transformer-jax
-    <https://github.com/kingoflolz/mesh-transformer-jax/>`__ by Ben Wang and Aran Komatsuzaki.
-36. :doc:`GPT Neo <model_doc/gpt_neo>` (from EleutherAI) released in the repository `EleutherAI/gpt-neo
-    <https://github.com/EleutherAI/gpt-neo>`__ by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
-37. :doc:`Hubert <model_doc/hubert>` (from Facebook) released with the paper `HuBERT: Self-Supervised Speech
-    Representation Learning by Masked Prediction of Hidden Units <https://arxiv.org/abs/2106.07447>`__ by Wei-Ning Hsu,
-    Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
-38. :doc:`I-BERT <model_doc/ibert>` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization
-    <https://arxiv.org/abs/2101.01321>`__ by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
-39. `ImageGPT <https://huggingface.co/transformers/master/model_doc/imagegpt.html>`__ (from OpenAI) released with the
-    paper `Generative Pretraining from Pixels <https://openai.com/blog/image-gpt/>`__ by Mark Chen, Alec Radford, Rewon
-    Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
-40. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
-    of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
-    Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
-41. :doc:`LayoutLMv2 <model_doc/layoutlmv2>` (from Microsoft Research Asia) released with the paper `LayoutLMv2:
-    Multi-modal Pre-training for Visually-Rich Document Understanding <https://arxiv.org/abs/2012.14740>`__ by Yang Xu,
-    Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min
-    Zhang, Lidong Zhou.
-42. :doc:`LayoutXLM <model_doc/layoutlmv2>` (from Microsoft Research Asia) released with the paper `LayoutXLM:
-    Multimodal Pre-training for Multilingual Visually-rich Document Understanding <https://arxiv.org/abs/2104.08836>`__
-    by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
-43. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
-    <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-44. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
-    Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-45. :doc:`LUKE <model_doc/luke>` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity
-    Representations with Entity-aware Self-attention <https://arxiv.org/abs/2010.01057>`__ by Ikuya Yamada, Akari Asai,
-    Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
-46. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
-    Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
-    by Hao Tan and Mohit Bansal.
-47. :doc:`M2M100 <model_doc/m2m_100>` (from Facebook) released with the paper `Beyond English-Centric Multilingual
-    Machine Translation <https://arxiv.org/abs/2010.11125>`__ by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma,
-    Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal,
-    Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
-48. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
-    Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
-    Translator Team.
-49. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
-    Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
-    Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
-50. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
-    Multilingual Pretraining and Finetuning <https://arxiv.org/abs/2008.00401>`__ by Yuqing Tang, Chau Tran, Xian Li,
-    Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
-51. :doc:`Megatron-BERT <model_doc/megatron_bert>` (from NVIDIA) released with the paper `Megatron-LM: Training
-    Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
-    Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
-52. :doc:`Megatron-GPT2 <model_doc/megatron_gpt2>` (from NVIDIA) released with the paper `Megatron-LM: Training
-    Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
-    Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
-53. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
-    Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
-    Jianfeng Lu, Tie-Yan Liu.
-54. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
-    text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
-    Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
-55. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
-    Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__ by Jingqing Zhang, Yao Zhao,
-    Mohammad Saleh and Peter J. Liu.
-56. :doc:`PhoBERT <model_doc/phobert>` (from VinAI Research) released with the paper `PhoBERT: Pre-trained language
-    models for Vietnamese <https://www.aclweb.org/anthology/2020.findings-emnlp.92/>`__ by Dat Quoc Nguyen and Anh Tuan
-    Nguyen.
-57. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
-    Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
-    Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-58. :doc:`QDQBert <model_doc/qdqbert>` (from NVIDIA) released with the paper `Integer Quantization for Deep Learning
-    Inference: Principles and Empirical Evaluation <https://arxiv.org/abs/2004.09602>`__ by Hao Wu, Patrick Judd,
-    Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
-59. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
-    Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-60. :doc:`RemBERT <model_doc/rembert>` (from Google Research) released with the paper `Rethinking embedding coupling in
-    pre-trained language models <https://arxiv.org/pdf/2010.12821.pdf>`__ by Hyung Won Chung, Thibault Févry, Henry
-    Tsai, M. Johnson, Sebastian Ruder.
-61. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
-    Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
-    Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-62. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
-    Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
-    Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
-63. :doc:`SegFormer <model_doc/segformer>` (from NVIDIA) released with the paper `SegFormer: Simple and Efficient
-    Design for Semantic Segmentation with Transformers <https://arxiv.org/abs/2105.15203>`__ by Enze Xie, Wenhai Wang,
-    Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
-64. :doc:`SEW <model_doc/sew>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in Unsupervised
-    Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu
-    Han, Kilian Q. Weinberger, Yoav Artzi.
-65. :doc:`SEW-D <model_doc/sew_d>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in
-    Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
-    Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
-66. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
-    `fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
-    Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
-67. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
-    `Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
-    Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
-68. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
-    Question Answering by Pretraining Span Selection <https://arxiv.org/abs/2101.00438>`__ by Ori Ram, Yuval Kirstain,
-    Jonathan Berant, Amir Globerson, Omer Levy.
-69. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
-    vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola,
-    Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
-70. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
-    Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
-    Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
-71. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
-    `google-research/text-to-text-transfer-transformer
-    <https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511>`__ by
-    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi
-    Zhou and Wei Li and Peter J. Liu.
-72. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
-    Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
-    Francesco Piccinno and Julian Martin Eisenschlos.
-73. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
-    Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
-    Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-74. :doc:`TrOCR <model_doc/trocr>` (from Microsoft), released together with the paper `TrOCR: Transformer-based Optical
-    Character Recognition with Pre-trained Models <https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei
-    Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
-75. :doc:`UniSpeech <model_doc/unispeech>` (from Microsoft Research) released with the paper `UniSpeech: Unified Speech
-    Representation Learning with Labeled and Unlabeled Data <https://arxiv.org/abs/2101.07597>`__ by Chengyi Wang, Yu
-    Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
-76. :doc:`UniSpeechSat <model_doc/unispeech_sat>` (from Microsoft Research) released with the paper `UNISPEECH-SAT:
-    UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING <https://arxiv.org/abs/2110.05752>`__ by
-    Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li,
-    Xiangzhan Yu.
-77. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
-    Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
-    Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
-    Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
-78. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
-    Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
-    Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
-79. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
-    Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
-    Zhou, Abdelrahman Mohamed, Michael Auli.
-80. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
-    Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
-81. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
-    Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
-    Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-82. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
-    Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
-    Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
-    Zettlemoyer and Veselin Stoyanov.
-83. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
-    Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
-    Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
-84. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
-    Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
-    Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
-
-
-Supported frameworks
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The table below represents the current support in the library for each of those models, whether they have a Python
-tokenizer (called "slow"). A "fast" tokenizer backed by the 🤗 Tokenizers library, whether they have support in Jax (via
-Flax), PyTorch, and/or TensorFlow.
-
-..
-    This table is updated automatically from the auto modules with `make fix-copies`. Do not update manually!
-
-.. rst-class:: center-aligned-table
-
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            Model            | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
-+=============================+================+================+=================+====================+==============+
-|           ALBERT            |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            BART             |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            BEiT             |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            BERT             |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|       Bert Generation       |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           BigBird           |       ✅       |       ✅       |       ✅        |         ❌         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|       BigBirdPegasus        |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         Blenderbot          |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|       BlenderbotSmall       |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          CamemBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           Canine            |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            CLIP             |       ✅       |       ✅       |       ✅        |         ❌         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          ConvBERT           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            CTRL             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           DeBERTa           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         DeBERTa-v2          |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            DeiT             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            DETR             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         DistilBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             DPR             |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           ELECTRA           |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|       Encoder decoder       |       ❌       |       ❌       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-| FairSeq Machine-Translation |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          FlauBERT           |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            FNet             |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|     Funnel Transformer      |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           GPT Neo           |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            GPT-J            |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           Hubert            |       ❌       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           I-BERT            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          ImageGPT           |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          LayoutLM           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         LayoutLMv2          |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             LED             |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         Longformer          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            LUKE             |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           LXMERT            |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           M2M100            |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           Marian            |       ✅       |       ❌       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            mBART            |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|        MegatronBert         |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         MobileBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            MPNet            |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             mT5             |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         OpenAI GPT          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|        OpenAI GPT-2         |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           Pegasus           |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         ProphetNet          |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           QDQBert           |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             RAG             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          Reformer           |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           RemBERT           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          RetriBERT          |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|           RoBERTa           |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          RoFormer           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          SegFormer          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             SEW             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            SEW-D            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|   Speech Encoder decoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         Speech2Text         |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|        Speech2Text2         |       ✅       |       ❌       |       ❌        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          Splinter           |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         SqueezeBERT         |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             T5              |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            TAPAS            |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|       Transformer-XL        |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            TrOCR            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          UniSpeech          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|        UniSpeechSat         |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|   Vision Encoder decoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|    VisionTextDualEncoder    |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         VisualBert          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             ViT             |       ❌       |       ❌       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|          Wav2Vec2           |       ✅       |       ❌       |       ✅        |         ✅         |      ✅      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|             XLM             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|         XLM-RoBERTa         |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|        XLMProphetNet        |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-|            XLNet            |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
-+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
-
-.. toctree::
-    :maxdepth: 2
-    :caption: Get started
-
-    quicktour
-    installation
-    philosophy
-    glossary
-
-.. toctree::
-    :maxdepth: 2
-    :caption: Using 🤗 Transformers
-
-    task_summary
-    model_summary
-    preprocessing
-    training
-    model_sharing
-    tokenizer_summary
-    multilingual
-
-.. toctree::
-    :maxdepth: 2
-    :caption: Advanced guides
-
-    pretrained_models
-    examples
-    troubleshooting
-    custom_datasets
-    notebooks
-    sagemaker
-    community
-    converting_tensorflow_models
-    migration
-    contributing
-    add_new_model
-    add_new_pipeline
-    fast_tokenizers
-    performance
-    parallelism
-    testing
-    debugging
-    serialization
-    pr_checks
-
-.. toctree::
-    :maxdepth: 2
-    :caption: Research
-
-    bertology
-    perplexity
-    benchmarks
-
-.. toctree::
-    :maxdepth: 2
-    :caption: Main Classes
-
-    main_classes/callback
-    main_classes/configuration
-    main_classes/data_collator
-    main_classes/keras_callbacks
-    main_classes/logging
-    main_classes/model
-    main_classes/optimizer_schedules
-    main_classes/output
-    main_classes/pipelines
-    main_classes/processors
-    main_classes/tokenizer
-    main_classes/trainer
-    main_classes/deepspeed
-    main_classes/feature_extractor
-
-.. toctree::
-    :maxdepth: 2
-    :caption: Models
-
-    model_doc/albert
-    model_doc/auto
-    model_doc/bart
-    model_doc/barthez
-    model_doc/bartpho
-    model_doc/beit
-    model_doc/bert
-    model_doc/bertweet
-    model_doc/bertgeneration
-    model_doc/bert_japanese
-    model_doc/bigbird
-    model_doc/bigbird_pegasus
-    model_doc/blenderbot
-    model_doc/blenderbot_small
-    model_doc/bort
-    model_doc/byt5
-    model_doc/camembert
-    model_doc/canine
-    model_doc/clip
-    model_doc/convbert
-    model_doc/cpm
-    model_doc/ctrl
-    model_doc/deberta
-    model_doc/deberta_v2
-    model_doc/deit
-    model_doc/detr
-    model_doc/dialogpt
-    model_doc/distilbert
-    model_doc/dpr
-    model_doc/electra
-    model_doc/encoderdecoder
-    model_doc/flaubert
-    model_doc/fnet
-    model_doc/fsmt
-    model_doc/funnel
-    model_doc/herbert
-    model_doc/ibert
-    model_doc/imagegpt
-    model_doc/layoutlm
-    model_doc/layoutlmv2
-    model_doc/layoutxlm
-    model_doc/led
-    model_doc/longformer
-    model_doc/luke
-    model_doc/lxmert
-    model_doc/marian
-    model_doc/m2m_100
-    model_doc/mbart
-    model_doc/megatron_bert
-    model_doc/megatron_gpt2
-    model_doc/mobilebert
-    model_doc/mpnet
-    model_doc/mt5
-    model_doc/gpt
-    model_doc/gpt2
-    model_doc/gptj
-    model_doc/gpt_neo
-    model_doc/hubert
-    model_doc/pegasus
-    model_doc/phobert
-    model_doc/prophetnet
-    model_doc/qdqbert
-    model_doc/rag
-    model_doc/reformer
-    model_doc/rembert
-    model_doc/retribert
-    model_doc/roberta
-    model_doc/roformer
-    model_doc/segformer
-    model_doc/sew
-    model_doc/sew_d
-    model_doc/speechencoderdecoder
-    model_doc/speech_to_text
-    model_doc/speech_to_text_2
-    model_doc/splinter
-    model_doc/squeezebert
-    model_doc/t5
-    model_doc/t5v1.1
-    model_doc/tapas
-    model_doc/transformerxl
-    model_doc/trocr
-    model_doc/unispeech
-    model_doc/unispeech_sat
-    model_doc/visionencoderdecoder
-    model_doc/vision_text_dual_encoder
-    model_doc/vit
-    model_doc/visual_bert
-    model_doc/wav2vec2
-    model_doc/xlm
-    model_doc/xlmprophetnet
-    model_doc/xlmroberta
-    model_doc/xlnet
-    model_doc/xlsr_wav2vec2
-
-.. toctree::
-    :maxdepth: 2
-    :caption: Internal Helpers
-
-    internal/modeling_utils
-    internal/pipelines_utils
-    internal/tokenization_utils
-    internal/trainer_utils
-    internal/generation_utils
-    internal/file_utils
--- a/docs/source/migration.md
+++ b/docs/source/migration.md
@@ -31,7 +31,7 @@ This introduces two breaking changes:

 ##### How to obtain the same behavior as v3.x in v4.x

- The pipelines now contain additional features out of the box. See the [token-classification pipeline with the `grouped_entities` flag](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=textclassification#tokenclassificationpipeline).
+- The pipelines now contain additional features out of the box. See the [token-classification pipeline with the `grouped_entities` flag](main_classes/pipelines#transformers.TokenClassificationPipeline).
 - The auto-tokenizers now return rust tokenizers. In order to obtain the python tokenizers instead, the user may use the `use_fast` flag by setting it to `False`:

 In version `v3.x`:
@@ -98,7 +98,7 @@ from transformers.models.bert.modeling_bert import BertLayer

 #### 4. Switching the `return_dict` argument to `True` by default

-The [`return_dict` argument](https://huggingface.co/transformers/main_classes/output.html) enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. This object is self-documented as keys can be used to retrieve values, while also behaving as a tuple as users may retrieve objects by index or by slice.
+The [`return_dict` argument](main_classes/output) enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. This object is self-documented as keys can be used to retrieve values, while also behaving as a tuple as users may retrieve objects by index or by slice.

 This is a breaking change as the limitation of that tuple is that it cannot be unpacked: `value0, value1 = outputs` will not work.

--- a/docs/source/model_doc/blenderbot.rst
+++ b/docs/source/model_doc/blenderbot.rst
@@ -47,7 +47,7 @@ Implementation Notes
 - Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
 - This is the `default` Blenderbot model class. However, some smaller checkpoints, such as
  ``facebook/blenderbot_small_90M``, have a different architecture and consequently should be used with
-  `BlenderbotSmall <https://huggingface.co/transformers/master/model_doc/blenderbot_small.html>`__.
+  `BlenderbotSmall <blenderbot_small>`__.


 Usage
--- a/docs/source/model_doc/deit.rst
+++ b/docs/source/model_doc/deit.rst
@@ -25,12 +25,12 @@ Overview

 The DeiT model was proposed in `Training data-efficient image transformers & distillation through attention
 <https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
-Sablayrolles, Hervé Jégou. The `Vision Transformer (ViT) <https://huggingface.co/transformers/model_doc/vit.html>`__
-introduced in `Dosovitskiy et al., 2020 <https://arxiv.org/abs/2010.11929>`__ has shown that one can match or even
-outperform existing convolutional neural networks using a Transformer encoder (BERT-like). However, the ViT models
-introduced in that paper required training on expensive infrastructure for multiple weeks, using external data. DeiT
-(data-efficient image transformers) are more efficiently trained transformers for image classification, requiring far
-less data and far less computing resources compared to the original ViT models.
+Sablayrolles, Hervé Jégou. The `Vision Transformer (ViT) <vit>`__ introduced in `Dosovitskiy et al., 2020
+<https://arxiv.org/abs/2010.11929>`__ has shown that one can match or even outperform existing convolutional neural
+networks using a Transformer encoder (BERT-like). However, the ViT models introduced in that paper required training on
+expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) are more
+efficiently trained transformers for image classification, requiring far less data and far less computing resources
+compared to the original ViT models.

 The abstract from the paper is the following:

--- a/docs/source/model_doc/detr.mdx
+++ b/docs/source/model_doc/detr.mdx
@@ -0,0 +1,169 @@
+<!--Copyright 2021 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# DETR
+
+## Overview
+
+The DETR model was proposed in [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by
+Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko. DETR
+consists of a convolutional backbone followed by an encoder-decoder Transformer which can be trained end-to-end for
+object detection. It removes much of the complexity of models like Faster R-CNN and Mask R-CNN, which rely on
+components such as region proposals, non-maximum suppression and anchor generation. Moreover, DETR can also be
+naturally extended to perform panoptic segmentation, by simply adding a mask head on top of the decoder outputs.
+
+The abstract from the paper is the following:
+
+*We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the
+detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression
+procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the
+new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via
+bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries,
+DETR reasons about the relations of the objects and the global image context to directly output the final set of
+predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many
+other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and
+highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily
+generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive
+baselines.*
+
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/detr).
+
+The quickest way to get started with DETR is by checking the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) (which showcase both inference and
+fine-tuning on custom data).
+
+Here's a TLDR explaining how [`~transformers.DetrForObjectDetection`] works:
+
+First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use
+ResNet-50/ResNet-101). Let's assume we also add a batch dimension. This means that the input to the backbone is a
+tensor of shape `(batch_size, 3, height, width)`, assuming the image has 3 color channels (RGB). The CNN backbone
+outputs a new lower-resolution feature map, typically of shape `(batch_size, 2048, height/32, width/32)`. This is
+then projected to match the hidden dimension of the Transformer of DETR, which is `256` by default, using a
+`nn.Conv2d` layer. So now, we have a tensor of shape `(batch_size, 256, height/32, width/32)`. Next, the
+feature map is flattened and transposed to obtain a tensor of shape `(batch_size, seq_len, d_model)` =
+`(batch_size, width/32*height/32, 256)`. So a difference with NLP models is that the sequence length is actually
+longer than usual, but with a smaller `d_model` (which in NLP is typically 768 or higher).
+
+Next, this is sent through the encoder, outputting `encoder_hidden_states` of the same shape (you can consider
+these as image features). Next, so-called **object queries** are sent through the decoder. This is a tensor of shape
+`(batch_size, num_queries, d_model)`, with `num_queries` typically set to 100 and initialized with zeros.
+These input embeddings are learnt positional encodings that the authors refer to as object queries, and similarly to
+the encoder, they are added to the input of each attention layer. Each object query will look for a particular object
+in the image. The decoder updates these embeddings through multiple self-attention and encoder-decoder attention layers
+to output `decoder_hidden_states` of the same shape: `(batch_size, num_queries, d_model)`. Next, two heads
+are added on top for object detection: a linear layer for classifying each object query into one of the objects or "no
+object", and an MLP to predict bounding boxes for each query.
+
+The model is trained using a **bipartite matching loss**: the predicted classes and bounding boxes of each of the
+N = 100 object queries are compared to the ground truth annotations, padded up to the same length N (so if an image
+only contains 4 objects, the remaining 96 annotations have "no object" as class and "no bounding box" as
+bounding box). The [Hungarian matching algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) is used to find
+an optimal one-to-one mapping of each of the N queries to each of the N annotations. Next, standard cross-entropy (for
+the classes) and a linear combination of the L1 and [generalized IoU loss](https://giou.stanford.edu/) (for the
+bounding boxes) are used to optimize the parameters of the model.
+
+DETR can be naturally extended to perform panoptic segmentation (which unifies semantic segmentation and instance
+segmentation). [`~transformers.DetrForSegmentation`] adds a segmentation mask head on top of
+[`~transformers.DetrForObjectDetection`]. The mask head can be trained either jointly, or in a two-step process,
+where one first trains a [`~transformers.DetrForObjectDetection`] model to detect bounding boxes around both
+"things" (instances) and "stuff" (background things like trees, roads, sky), then freezes all the weights and trains only
+the mask head for 25 epochs. Experimentally, these two approaches give similar results. Note that predicting boxes is
+required for the training to be possible, since the Hungarian matching is computed using distances between boxes.
+
+Tips:
+
+- DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum
+  number of objects that can be detected in a single image, and is set to 100 by default (see parameter
+  `num_queries` of [`~transformers.DetrConfig`]). Note that it's good to have some slack (in COCO, the
+  authors used 100, while the maximum number of objects in a COCO image is ~70).
+- The decoder of DETR updates the query embeddings in parallel. This is different from language models like GPT-2,
+  which use autoregressive decoding instead of parallel. Hence, no causal attention mask is used.
+- DETR adds position embeddings to the hidden states at each self-attention and cross-attention layer before projecting
+  to queries and keys. For the position embeddings of the image, one can choose between fixed sinusoidal or learned
+  absolute position embeddings. By default, the parameter `position_embedding_type` of
+  [`~transformers.DetrConfig`] is set to `"sine"`.
+- During training, the authors of DETR found it helpful to use auxiliary losses in the decoder, especially to help
+  the model output the correct number of objects of each class. If you set the parameter `auxiliary_loss` of
+  [`~transformers.DetrConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses
+  are added after each decoder layer (with the FFNs sharing parameters).
+- If you want to train the model in a distributed environment across multiple nodes, you should update the
+  _num_boxes_ variable in the _DetrLoss_ class of _modeling_detr.py_. When training on multiple nodes, this should be
+  set to the average number of target boxes across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232).
+- [`~transformers.DetrForObjectDetection`] and [`~transformers.DetrForSegmentation`] can be initialized with
+  any convolutional backbone available in the [timm library](https://github.com/rwightman/pytorch-image-models).
+  Initializing with a MobileNet backbone for example can be done by setting the `backbone` attribute of
+  [`~transformers.DetrConfig`] to `"tf_mobilenetv3_small_075"`, and then initializing the model with that
+  config.
+- DETR resizes the input images such that the shortest side is at least a certain number of pixels while the longest is
+  at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at
+  least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use
+  [`~transformers.DetrFeatureExtractor`] to prepare images (and optional annotations in COCO format) for the
+  model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the
+  largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
+  Alternatively, one can also define a custom `collate_fn` in order to batch images together, using
+  [`~transformers.DetrFeatureExtractor.pad_and_create_pixel_mask`].
+- The size of the images will determine the amount of memory being used, and will thus determine the `batch_size`.
+  It is advised to use a batch size of 2 per GPU. See [this GitHub thread](https://github.com/facebookresearch/detr/issues/150) for more info.
+
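The padding and pixel-mask idea from the tips above can be sketched in plain Python. This is not the implementation of `pad_and_create_pixel_mask`: it assumes single-channel images stored as nested lists, zero padding on the bottom and right, and a mask value of 1 for real pixels and 0 for padding. The helper name `pad_and_mask` is hypothetical.

```python
def pad_and_mask(images, pad_value=0):
    """Pad 2D images (lists of rows) to the largest height and width in the batch.

    Returns the padded images and matching masks (1 = real pixel, 0 = padding).
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Extend every row to max_w, then append fully padded rows up to max_h.
        padded.append(
            [row + [pad_value] * (max_w - w) for row in img]
            + [[pad_value] * max_w for _ in range(max_h - h)]
        )
        masks.append(
            [[1] * w + [0] * (max_w - w) for _ in range(h)]
            + [[0] * max_w for _ in range(max_h - h)]
        )
    return padded, masks


# A 2 x 3 image and a 3 x 1 image are both padded to 3 x 3.
batch = [[[5, 5, 5], [5, 5, 5]], [[7], [7], [7]]]
padded, masks = pad_and_mask(batch)
print(padded[1])
print(masks[1])
```

A custom `collate_fn` would apply the same idea per batch, so that each batch is padded only to its own largest image.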
+As a summary, consider the following table:
+
+| Task | Object detection | Instance segmentation | Panoptic segmentation |
+|------|------------------|-----------------------|-----------------------|
+| **Description** | Predicting bounding boxes and class labels around objects in an image | Predicting masks around objects (i.e. instances) in an image | Predicting masks around both objects (i.e. instances) as well as "stuff" (i.e. background things like trees and roads) in an image |
+| **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] |
+| **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic |
+| **Format of annotations to provide to**  [`~transformers.DetrFeatureExtractor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation  | {'image_id': `int`, 'annotations': `List[Dict]`}  (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
+| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrFeatureExtractor.post_process`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`], [`~transformers.DetrFeatureExtractor.post_process_panoptic`] |
+| **Evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"`, `PanopticEvaluator` |
+
+In short, one should prepare the data either in COCO detection or COCO panoptic format, then use
+[`~transformers.DetrFeatureExtractor`] to create `pixel_values`, `pixel_mask` and optional
+`labels`, which can then be used to train (or fine-tune) a model. For evaluation, one should first convert the
+outputs of the model using one of the postprocessing methods of [`~transformers.DetrFeatureExtractor`]. These can
+be provided to either `CocoEvaluator` or `PanopticEvaluator`, which allow you to calculate metrics like
+mean Average Precision (mAP) and Panoptic Quality (PQ). Both evaluators are implemented in the [original repository](https://github.com/facebookresearch/detr). See the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for more info regarding evaluation.
+
+
+## DETR specific outputs
+
+[[autodoc]] models.detr.modeling_detr.DetrModelOutput
+
+[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput
+
+[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput
+
+## DetrConfig
+
+[[autodoc]] DetrConfig
+
+## DetrFeatureExtractor
+
+[[autodoc]] DetrFeatureExtractor
+    - __call__
+    - pad_and_create_pixel_mask
+    - post_process
+    - post_process_segmentation
+    - post_process_panoptic
+
+## DetrModel
+
+[[autodoc]] DetrModel
+    - forward
+
+## DetrForObjectDetection
+
+[[autodoc]] DetrForObjectDetection
+    - forward
+
+## DetrForSegmentation
+
+[[autodoc]] DetrForSegmentation
+    - forward
--- a/docs/source/model_doc/detr.rst
+++ b/docs/source/model_doc/detr.rst
@@ -1,207 +0,0 @@
-.. 
-    Copyright 2021 The HuggingFace Team. All rights reserved.
-
-    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-    the License. You may obtain a copy of the License at
-
-        http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-    specific language governing permissions and limitations under the License.
-
-DETR
-----------------------------------------------------------------------------------------------------------------------
-
-Overview
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The DETR model was proposed in `End-to-End Object Detection with Transformers <https://arxiv.org/abs/2005.12872>`__ by
-Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko. DETR
-consists of a convolutional backbone followed by an encoder-decoder Transformer which can be trained end-to-end for
-object detection. It greatly simplifies a lot of the complexity of models like Faster-R-CNN and Mask-R-CNN, which use
-things like region proposals, non-maximum suppression procedure and anchor generation. Moreover, DETR can also be
-naturally extended to perform panoptic segmentation, by simply adding a mask head on top of the decoder outputs.
-
-The abstract from the paper is the following:
-
-*We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the
-detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression
-procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the
-new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via
-bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries,
-DETR reasons about the relations of the objects and the global image context to directly output the final set of
-predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many
-other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and
-highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily
-generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive
-baselines.*
-
-This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
-<https://github.com/facebookresearch/detr>`__.
-
-The quickest way to get started with DETR is by checking the `example notebooks
-<https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR>`__ (which showcase both inference and
-fine-tuning on custom data).
-
-Here's a TLDR explaining how :class:`~transformers.DetrForObjectDetection` works:
-
-First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use
-ResNet-50/ResNet-101). Let's assume we also add a batch dimension. This means that the input to the backbone is a
-tensor of shape :obj:`(batch_size, 3, height, width)`, assuming the image has 3 color channels (RGB). The CNN backbone
-outputs a new lower-resolution feature map, typically of shape :obj:`(batch_size, 2048, height/32, width/32)`. This is
-then projected to match the hidden dimension of the Transformer of DETR, which is :obj:`256` by default, using a
-:obj:`nn.Conv2D` layer. So now, we have a tensor of shape :obj:`(batch_size, 256, height/32, width/32).` Next, the
-feature map is flattened and transposed to obtain a tensor of shape :obj:`(batch_size, seq_len, d_model)` =
-:obj:`(batch_size, width/32*height/32, 256)`. So a difference with NLP models is that the sequence length is actually
-longer than usual, but with a smaller :obj:`d_model` (which in NLP is typically 768 or higher).
-
-Next, this is sent through the encoder, outputting :obj:`encoder_hidden_states` of the same shape (you can consider
-these as image features). Next, so-called **object queries** are sent through the decoder. This is a tensor of shape
-:obj:`(batch_size, num_queries, d_model)`, with :obj:`num_queries` typically set to 100 and initialized with zeros.
-These input embeddings are learnt positional encodings that the authors refer to as object queries, and similarly to
-the encoder, they are added to the input of each attention layer. Each object query will look for a particular object
-in the image. The decoder updates these embeddings through multiple self-attention and encoder-decoder attention layers
-to output :obj:`decoder_hidden_states` of the same shape: :obj:`(batch_size, num_queries, d_model)`. Next, two heads
-are added on top for object detection: a linear layer for classifying each object query into one of the objects or "no
-object", and a MLP to predict bounding boxes for each query.
-
-The model is trained using a **bipartite matching loss**: so what we actually do is compare the predicted classes +
-bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N
-(so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as
-bounding box). The `Hungarian matching algorithm <https://en.wikipedia.org/wiki/Hungarian_algorithm>`__ is used to find
-an optimal one-to-one mapping of each of the N queries to each of the N annotations. Next, standard cross-entropy (for
-the classes) and a linear combination of the L1 and `generalized IoU loss <https://giou.stanford.edu/>`__ (for the
-bounding boxes) are used to optimize the parameters of the model.
-
-DETR can be naturally extended to perform panoptic segmentation (which unifies semantic segmentation and instance
-segmentation). :class:`~transformers.DetrForSegmentation` adds a segmentation mask head on top of
-:class:`~transformers.DetrForObjectDetection`. The mask head can be trained either jointly, or in a two steps process,
-where one first trains a :class:`~transformers.DetrForObjectDetection` model to detect bounding boxes around both
-"things" (instances) and "stuff" (background things like trees, roads, sky), then freeze all the weights and train only
-the mask head for 25 epochs. Experimentally, these two approaches give similar results. Note that predicting boxes is
-required for the training to be possible, since the Hungarian matching is computed using distances between boxes.
-
-Tips:
-
- DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum
-  number of objects that can be detected in a single image, and is set to 100 by default (see parameter
-  :obj:`num_queries` of :class:`~transformers.DetrConfig`). Note that it's good to have some slack (in COCO, the
-  authors used 100, while the maximum number of objects in a COCO image is ~70).
- The decoder of DETR updates the query embeddings in parallel. This is different from language models like GPT-2,
-  which use autoregressive decoding instead of parallel. Hence, no causal attention mask is used.
- DETR adds position embeddings to the hidden states at each self-attention and cross-attention layer before projecting
-  to queries and keys. For the position embeddings of the image, one can choose between fixed sinusoidal or learned
-  absolute position embeddings. By default, the parameter :obj:`position_embedding_type` of
-  :class:`~transformers.DetrConfig` is set to :obj:`"sine"`.
- During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help
-  the model output the correct number of objects of each class. If you set the parameter :obj:`auxiliary_loss` of
-  :class:`~transformers.DetrConfig` to :obj:`True`, then prediction feedforward neural networks and Hungarian losses
-  are added after each decoder layer (with the FFNs sharing parameters).
- If you want to train the model in a distributed environment across multiple nodes, then one should update the
-  `num_boxes` variable in the `DetrLoss` class of `modeling_detr.py`. When training on multiple nodes, this should be
-  set to the average number of target boxes across all nodes, as can be seen in the original implementation `here
-  <https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232>`__.
- :class:`~transformers.DetrForObjectDetection` and :class:`~transformers.DetrForSegmentation` can be initialized with
-  any convolutional backbone available in the `timm library <https://github.com/rwightman/pytorch-image-models>`__.
-  Initializing with a MobileNet backbone for example can be done by setting the :obj:`backbone` attribute of
-  :class:`~transformers.DetrConfig` to :obj:`"tf_mobilenetv3_small_075"`, and then initializing the model with that
-  config.
- DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is
-  at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at
-  least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use
-  :class:`~transformers.DetrFeatureExtractor` to prepare images (and optional annotations in COCO format) for the
-  model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the
-  largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
-  Alternatively, one can also define a custom :obj:`collate_fn` in order to batch images together, using
-  :meth:`~transformers.DetrFeatureExtractor.pad_and_create_pixel_mask`.
- The size of the images will determine the amount of memory being used, and will thus determine the :obj:`batch_size`.
-  It is advised to use a batch size of 2 per GPU. See `this Github thread
-  <https://github.com/facebookresearch/detr/issues/150>`__ for more info.
-
-As a summary, consider the following table:
-
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-| **Task**                                    | **Object detection**                                    | **Instance segmentation**                                            | **Panoptic segmentation**                                              |
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-| **Description**                             | Predicting bounding boxes and class labels around       | Predicting masks around objects (i.e. instances) in an image         | Predicting masks around both objects (i.e. instances) as well as       |
-|                                             | objects in an image                                     |                                                                      | "stuff" (i.e. background things like trees and roads) in an image      |
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-| **Model**                                   | :class:`~transformers.DetrForObjectDetection`           | :class:`~transformers.DetrForSegmentation`                           | :class:`~transformers.DetrForSegmentation`                             |
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-| **Example dataset**                         | COCO detection                                          | COCO detection,                                                      | COCO panoptic                                                          |
-|                                             |                                                         | COCO panoptic                                                        |                                                                        |
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-| **Format of annotations to provide to**     | {‘image_id’: int,                                       | {‘image_id’: int,                                                    | {‘file_name: str,                                                      |
-| :class:`~transformers.DetrFeatureExtractor` | ‘annotations’: List[Dict]}, each Dict being a COCO      | ‘annotations’: [List[Dict]] } (in case of COCO detection)            | ‘image_id: int,                                                        |
-|                                             | object annotation                                       |                                                                      | ‘segments_info’: List[Dict] }                                          |
-|                                             |                                                         | or                                                                   |                                                                        |
-|                                             |                                                         |                                                                      | and masks_path (path to directory containing PNG files of the masks)   |
-|                                             |                                                         | {‘file_name’: str,                                                   |                                                                        |
-|                                             |                                                         | ‘image_id’: int,                                                     |                                                                        |
-|                                             |                                                         | ‘segments_info’: List[Dict]} (in case of COCO panoptic)              |                                                                        |
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-| **Postprocessing** (i.e. converting the     | :meth:`~transformers.DetrFeatureExtractor.post_process` | :meth:`~transformers.DetrFeatureExtractor.post_process_segmentation` | :meth:`~transformers.DetrFeatureExtractor.post_process_segmentation`,  |
-| output of the model to COCO API)            |                                                         |                                                                      | :meth:`~transformers.DetrFeatureExtractor.post_process_panoptic`       |
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-| **evaluators**                              | :obj:`CocoEvaluator` with iou_types = “bbox”            | :obj:`CocoEvaluator` with iou_types = “bbox”, “segm”                 | :obj:`CocoEvaluator` with iou_tupes = “bbox, “segm”                    |
-|                                             |                                                         |                                                                      |                                                                        |
-|                                             |                                                         |                                                                      | :obj:`PanopticEvaluator`                                               |
-+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
-
-In short, one should prepare the data either in COCO detection or COCO panoptic format, then use
-:class:`~transformers.DetrFeatureExtractor` to create :obj:`pixel_values`, :obj:`pixel_mask` and optional
-:obj:`labels`, which can then be used to train (or fine-tune) a model. For evaluation, one should first convert the
-outputs of the model using one of the postprocessing methods of :class:`~transformers.DetrFeatureExtractor`. These can
-be be provided to either :obj:`CocoEvaluator` or :obj:`PanopticEvaluator`, which allow you to calculate metrics like
-mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the `original repository
-<https://github.com/facebookresearch/detr>`__. See the `example notebooks
-<https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR>`__ for more info regarding evaluation.
-
-
-DETR specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.models.detr.modeling_detr.DetrModelOutput
-    :members:
-
-.. autoclass:: transformers.models.detr.modeling_detr.DetrObjectDetectionOutput
-    :members:
-
-.. autoclass:: transformers.models.detr.modeling_detr.DetrSegmentationOutput
-    :members:
-
-
-DetrConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.DetrConfig
-    :members:
-
-
-DetrFeatureExtractor
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.DetrFeatureExtractor
-    :members: __call__, pad_and_create_pixel_mask, post_process, post_process_segmentation, post_process_panoptic
-
-
-DetrModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.DetrModel
-    :members: forward
-
-
-DetrForObjectDetection
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.DetrForObjectDetection
-    :members: forward
-
-
-DetrForSegmentation
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.DetrForSegmentation
-    :members: forward
--- a/docs/source/model_doc/layoutlmv2.rst
+++ b/docs/source/model_doc/layoutlmv2.rst
@@ -18,9 +18,8 @@ Overview

 The LayoutLMV2 model was proposed in `LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
 <https://arxiv.org/abs/2012.14740>`__ by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu,
-Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves `LayoutLM
-<https://huggingface.co/transformers/model_doc/layoutlm.html>`__ to obtain state-of-the-art results across several
-document image understanding benchmarks:
+Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves `LayoutLM <layoutlm>`__ to obtain
+state-of-the-art results across several document image understanding benchmarks:

 - information extraction from scanned documents: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a
  collection of 199 annotated forms comprising more than 30,000 words), the `CORD <https://github.com/clovaai/cord>`__
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -80,7 +80,7 @@ Original GPT
   <a href="https://huggingface.co/models?filter=openai-gpt">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
   </a>
-   <a href="model_doc/gpt.html">
+   <a href="model_doc/gpt">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
   </a>

@@ -100,7 +100,7 @@ GPT-2
   <a href="https://huggingface.co/models?filter=gpt2">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
   </a>
-   <a href="model_doc/gpt2.html">
+   <a href="model_doc/gpt2">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
   </a>

@@ -122,7 +122,7 @@ CTRL
   <a href="https://huggingface.co/models?filter=ctrl">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
   </a>
-   <a href="model_doc/ctrl.html">
+   <a href="model_doc/ctrl">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
   </a>

@@ -143,7 +143,7 @@ Transformer-XL
   <a href="https://huggingface.co/models?filter=transfo-xl">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
   </a>
-   <a href="model_doc/transformerxl.html">
+   <a href="model_doc/transformerxl">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
   </a>

@@ -174,7 +174,7 @@ Reformer
   <a href="https://huggingface.co/models?filter=reformer">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
   </a>
-   <a href="model_doc/reformer.html">
+   <a href="model_doc/reformer">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
   </a>

@@ -208,7 +208,7 @@ XLNet
   <a href="https://huggingface.co/models?filter=xlnet">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
   </a>
-   <a href="model_doc/xlnet.html">
+   <a href="model_doc/xlnet">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
   </a>

@@ -248,7 +248,7 @@ BERT
   <a href="https://huggingface.co/models?filter=bert">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
   </a>
-   <a href="model_doc/bert.html">
+   <a href="model_doc/bert">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
   </a>

@@ -277,7 +277,7 @@ ALBERT
   <a href="https://huggingface.co/models?filter=albert">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
   </a>
-   <a href="model_doc/albert.html">
+   <a href="model_doc/albert">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
   </a>

@@ -306,7 +306,7 @@ RoBERTa
   <a href="https://huggingface.co/models?filter=roberta">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
   </a>
-   <a href="model_doc/roberta.html">
+   <a href="model_doc/roberta">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
   </a>

@@ -331,7 +331,7 @@ DistilBERT
   <a href="https://huggingface.co/models?filter=distilbert">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
   </a>
-   <a href="model_doc/distilbert.html">
+   <a href="model_doc/distilbert">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
   </a>

@@ -356,7 +356,7 @@ ConvBERT
   <a href="https://huggingface.co/models?filter=convbert">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-convbert-blueviolet">
   </a>
-   <a href="model_doc/convbert.html">
+   <a href="model_doc/convbert">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-convbert-blueviolet">
   </a>

@@ -386,7 +386,7 @@ XLM
   <a href="https://huggingface.co/models?filter=xlm">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
   </a>
-   <a href="model_doc/xlm.html">
+   <a href="model_doc/xlm">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
   </a>

@@ -420,7 +420,7 @@ XLM-RoBERTa
   <a href="https://huggingface.co/models?filter=xlm-roberta">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
   </a>
-   <a href="model_doc/xlmroberta.html">
+   <a href="model_doc/xlmroberta">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
   </a>

@@ -442,7 +442,7 @@ FlauBERT
   <a href="https://huggingface.co/models?filter=flaubert">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
   </a>
-   <a href="model_doc/flaubert.html">
+   <a href="model_doc/flaubert">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
   </a>

@@ -460,7 +460,7 @@ ELECTRA
   <a href="https://huggingface.co/models?filter=electra">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
   </a>
-   <a href="model_doc/electra.html">
+   <a href="model_doc/electra">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
   </a>

@@ -484,7 +484,7 @@ Funnel Transformer
   <a href="https://huggingface.co/models?filter=funnel">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
   </a>
-   <a href="model_doc/funnel.html">
+   <a href="model_doc/funnel">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
   </a>

@@ -518,7 +518,7 @@ Longformer
   <a href="https://huggingface.co/models?filter=longformer">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
   </a>
-   <a href="model_doc/longformer.html">
+   <a href="model_doc/longformer">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
   </a>

@@ -558,7 +558,7 @@ BART
   <a href="https://huggingface.co/models?filter=bart">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
   </a>
-   <a href="model_doc/bart.html">
+   <a href="model_doc/bart">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
   </a>

@@ -585,7 +585,7 @@ Pegasus
   <a href="https://huggingface.co/models?filter=pegasus">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
   </a>
-   <a href="model_doc/pegasus.html">
+   <a href="model_doc/pegasus">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
   </a>

@@ -616,7 +616,7 @@ MarianMT
   <a href="https://huggingface.co/models?filter=marian">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
   </a>
-   <a href="model_doc/marian.html">
+   <a href="model_doc/marian">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
   </a>

@@ -635,7 +635,7 @@ T5
   <a href="https://huggingface.co/models?filter=t5">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
   </a>
-   <a href="model_doc/t5.html">
+   <a href="model_doc/t5">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
   </a>

@@ -668,7 +668,7 @@ MT5
   <a href="https://huggingface.co/models?filter=mt5">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-mt5-blueviolet">
   </a>
-   <a href="model_doc/mt5.html">
+   <a href="model_doc/mt5">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mt5-blueviolet">
   </a>

@@ -689,7 +689,7 @@ MBart
   <a href="https://huggingface.co/models?filter=mbart">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
   </a>
-   <a href="model_doc/mbart.html">
+   <a href="model_doc/mbart">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
   </a>

@@ -718,7 +718,7 @@ ProphetNet
   <a href="https://huggingface.co/models?filter=prophetnet">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-prophetnet-blueviolet">
   </a>
-   <a href="model_doc/prophetnet.html">
+   <a href="model_doc/prophetnet">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-prophetnet-blueviolet">
   </a>

@@ -743,7 +743,7 @@ XLM-ProphetNet
   <a href="https://huggingface.co/models?filter=xprophetnet">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
   </a>
-   <a href="model_doc/xlmprophetnet.html">
+   <a href="model_doc/xlmprophetnet">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xprophetnet-blueviolet">
   </a>

@@ -781,7 +781,7 @@ model know which part of the input vector corresponds to the text and which to t
 The pretrained model only works for classification.

 ..
-    More information in this :doc:`model documentation </model_doc/mmbt.html>`. TODO: write this page
+    More information in this :doc:`model documentation <model_doc/mmbt>`. TODO: write this page

 .. _retrieval-based-models:

@@ -799,7 +799,7 @@ DPR
   <a href="https://huggingface.co/models?filter=dpr">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
   </a>
-   <a href="model_doc/dpr.html">
+   <a href="model_doc/dpr">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
   </a>

@@ -828,7 +828,7 @@ RAG
   <a href="https://huggingface.co/models?filter=rag">
       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
   </a>
-   <a href="model_doc/rag.html">
+   <a href="model_doc/rag">
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
   </a>

--- a/docs/source/parallelism.md
+++ b/docs/source/parallelism.md
@@ -46,7 +46,7 @@ Most users with just 2 GPUs already enjoy the increased training speed up thanks
 ## ZeRO Data Parallel

 ZeRO-powered data parallelism (ZeRO-DP) is described on the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
-![DeepSpeed-Image-1](imgs/parallelism-zero.png)
+![DeepSpeed-Image-1](/transformers/_images/parallelism-zero.png)

 It can be difficult to wrap one's head around it, but in reality the concept is quite simple. This is just the usual DataParallel (DP), except, instead of replicating the full model params, gradients and optimizer states, each GPU stores only a slice of it.  And then at run-time when the full layer params are needed just for the given layer, all GPUs synchronize to give each other parts that they miss - this is it.

@@ -122,7 +122,7 @@ Implementations:

 - [DeepSpeed](https://www.deepspeed.ai/features/#the-zero-redundancy-optimizer) ZeRO-DP stages 1+2+3
 - [Fairscale](https://github.com/facebookresearch/fairscale/#optimizer-state-sharding-zero) ZeRO-DP stages 1+2+3
-- [`transformers` integration](https://huggingface.co/transformers/master/main_classes/trainer.html#trainer-integrations)
+- [`transformers` integration](main_classes/trainer#trainer-integrations)

 ## Naive Model Parallel (Vertical) and Pipeline Parallel

@@ -150,7 +150,7 @@ Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the GPU

 The following illustration from the [GPipe paper](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html) shows the naive MP on the top, and PP on the bottom:

-![mp-pp](imgs/parallelism-gpipe-bubble.png)
+![mp-pp](/transformers/_images/parallelism-gpipe-bubble.png)

 It's easy to see from the bottom diagram how PP has less dead zones, where GPUs are idle. The idle parts are referred to as the "bubble".

@@ -203,7 +203,7 @@ Implementations:
 Other approaches:

 DeepSpeed, Varuna and SageMaker use the concept of an [Interleaved Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html)
-![interleaved-pipeline-execution](imgs/parallelism-sagemaker-interleaved-pipeline.png)
+![interleaved-pipeline-execution](/transformers/_images/parallelism-sagemaker-interleaved-pipeline.png)

 Here the bubble (idle time) is further minimized by prioritizing backward passes.

@@ -221,16 +221,16 @@ The main building block of any transformer is a fully connected `nn.Linear` foll
 Following the Megatron's paper notation, we can write the dot-product part of it as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix.

 If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs:
-![Parallel GEMM](imgs/parallelism-tp-parallel_gemm.png)
+![Parallel GEMM](/transformers/_images/parallelism-tp-parallel_gemm.png)

 If we split the weight matrix `A` column-wise across `N` GPUs and perform matrix multiplications `XA_1` through `XA_n` in parallel, then we will end up with `N` output vectors `Y_1, Y_2, ..., Y_n` which can be fed into `GeLU` independently:
-![independent GeLU](imgs/parallelism-tp-independent-gelu.png)
+![independent GeLU](/transformers/_images/parallelism-tp-independent-gelu.png)

 Using this principle, we can update an MLP of arbitrary depth, without the need for any synchronization between GPUs until the very end, where we need to reconstruct the output vector from shards. The Megatron-LM paper authors provide a helpful illustration for that:
-![parallel shard processing](imgs/parallelism-tp-parallel_shard_processing.png)
+![parallel shard processing](/transformers/_images/parallelism-tp-parallel_shard_processing.png)

 Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!
-![parallel self-attention](imgs/parallelism-tp-parallel_self_attention.png)
+![parallel self-attention](/transformers/_images/parallelism-tp-parallel_self_attention.png)

 Special considerations: TP requires very fast network, and therefore it's not advisable to do TP across more than one node. Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs.

@@ -258,7 +258,7 @@ Implementations:

 The following diagram from the DeepSpeed [pipeline tutorial](https://www.deepspeed.ai/tutorials/pipeline/) demonstrates how one combines DP with PP.

-![dp-pp-2d](imgs/parallelism-zero-dp-pp.png)
+![dp-pp-2d](/transformers/_images/parallelism-zero-dp-pp.png)

 Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't see GPU3. To DP there is just GPUs 0 and 1 where it feeds data as if there were just 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP. And GPU1 does the same by enlisting GPU3 to its aid.

@@ -277,7 +277,7 @@ Implementations:

 To get an even more efficient training a 3D parallelism is used where PP is combined with TP and DP. This can be seen in the following diagram.

-![dp-pp-tp-3d](imgs/parallelism-deepspeed-3d.png)
+![dp-pp-tp-3d](/transformers/_images/parallelism-deepspeed-3d.png)

 This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter models](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/), which is a good read as well.

@@ -342,7 +342,7 @@ We have 10 batches of 512 length. If we parallelize them by attribute dimension

 It is similar with tensor model parallelism or naive layer-wise model parallelism.

-![flex-flow-soap](imgs/parallelism-flexflow.jpeg)
+![flex-flow-soap](/transformers/_images/parallelism-flexflow.jpeg)

 The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast-intra-connect/slow-inter-connect and it automatically optimizes all these  algorithmically deciding which parallelisation to use where.

--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -56,10 +56,9 @@ is its ``__call__``: you just need to feed your sentence to your tokenizer objec
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

-This returns a dictionary string to list of ints. The `input_ids <glossary.html#input-ids>`__ are the indices
-corresponding to each token in our sentence. We will see below what the `attention_mask
-<glossary.html#attention-mask>`__ is used for and in :ref:`the next section <sentence-pairs>` the goal of
-`token_type_ids <glossary.html#token-type-ids>`__.
+This returns a dictionary string to list of ints. The `input_ids <glossary#input-ids>`__ are the indices corresponding
+to each token in our sentence. We will see below what the `attention_mask <glossary#attention-mask>`__ is used for and
+in :ref:`the next section <preprocessing-pairs-of-sentences>` the goal of `token_type_ids <glossary#token-type-ids>`__.

 The tokenizer can decode a list of token ids in a proper sentence:

@@ -132,8 +131,8 @@ You can do all of this by using the following options when feeding your list of
                               [1, 1, 1, 1, 1, 1, 1, 1, 0]])}

 It returns a dictionary with string keys and tensor values. We can now see what the `attention_mask
-<glossary.html#attention-mask>`__ is all about: it points out which tokens the model should pay attention to and which
-ones it should not (because they represent padding in this case).
+<glossary#attention-mask>`__ is all about: it points out which tokens the model should pay attention to and which ones
+it should not (because they represent padding in this case).


 Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
@@ -166,8 +165,8 @@ This will once again return a dict string to list of ints:
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

-This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
-of the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
+This shows us what the `token_type_ids <glossary#token-type-ids>`__ are for: they indicate to the model which part of
+the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
 `token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
 its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
 ``return_input_ids`` or ``return_token_type_ids``.
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -1,492 +0,0 @@
-.. 
-    Copyright 2020 The HuggingFace Team. All rights reserved.
-
-    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-    the License. You may obtain a copy of the License at
-
-        http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-    specific language governing permissions and limitations under the License.
-
-Pretrained models
-=======================================================================================================================
-
-Here is a partial list of some of the available pretrained models together with a short presentation of each model.
-
-For the full list, refer to `https://huggingface.co/models <https://huggingface.co/models>`__.
-
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Architecture       | Model id                                                   | Details of the model                                                                                                                  |
-+====================+============================================================+=======================================================================================================================================+
-| BERT               | ``bert-base-uncased``                                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | Trained on lower-cased English text.                                                                                                |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-uncased``                                     | | 24-layer, 1024-hidden, 16-heads, 336M parameters.                                                                                   |
-|                    |                                                            | | Trained on lower-cased English text.                                                                                                |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-cased``                                        | | 12-layer, 768-hidden, 12-heads, 109M parameters.                                                                                    |
-|                    |                                                            | | Trained on cased English text.                                                                                                      |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-cased``                                       | | 24-layer, 1024-hidden, 16-heads, 335M parameters.                                                                                   |
-|                    |                                                            | | Trained on cased English text.                                                                                                      |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-multilingual-uncased``                         | | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 168M parameters.                                                        |
-|                    |                                                            | | Trained on lower-cased text in the top 102 languages with the largest Wikipedias                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__).                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-multilingual-cased``                           | | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 179M parameters.                                                             |
-|                    |                                                            | | Trained on cased text in the top 104 languages with the largest Wikipedias                                                          |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__).                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-chinese``                                      | | 12-layer, 768-hidden, 12-heads, 103M parameters.                                                                                    |
-|                    |                                                            | | Trained on cased Chinese Simplified and Traditional text.                                                                           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-german-cased``                                 | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | Trained on cased German text by Deepset.ai                                                                                          |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__).                                                             |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-uncased-whole-word-masking``                  | | 24-layer, 1024-hidden, 16-heads, 336M parameters.                                                                                   |
-|                    |                                                            | | Trained on lower-cased English text using Whole-Word-Masking                                                                        |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__).                                                                    |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-cased-whole-word-masking``                    | | 24-layer, 1024-hidden, 16-heads, 335M parameters.                                                                                   |
-|                    |                                                            | | Trained on cased English text using Whole-Word-Masking                                                                              |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__).                                                                    |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | | 24-layer, 1024-hidden, 16-heads, 336M parameters.                                                                                   |
-|                    |                                                            | | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD                                                             |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see details of fine-tuning in the `example section <https://github.com/huggingface/transformers/tree/master/examples>`__).           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-cased-whole-word-masking-finetuned-squad``    | | 24-layer, 1024-hidden, 16-heads, 335M parameters                                                                                    |
-|                    |                                                            | | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                               |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__)                           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-cased-finetuned-mrpc``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | The ``bert-base-cased`` model fine-tuned on MRPC                                                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__)                           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-german-dbmdz-cased``                           | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | Trained on cased German text by DBMDZ                                                                                               |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__).                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-german-dbmdz-uncased``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | Trained on uncased German text by DBMDZ                                                                                             |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__).                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese``                           | | 12-layer, 768-hidden, 12-heads, 111M parameters.                                                                                    |
-|                    |                                                            | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece and this requires some extra dependencies,                     |
-|                    |                                                            | | `fugashi <https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__.               |
-|                    |                                                            | | Use ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install from source) to install them.                  |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese-whole-word-masking``        | | 12-layer, 768-hidden, 12-heads, 111M parameters.                                                                                    |
-|                    |                                                            | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece and this requires some extra dependencies,                     |
-|                    |                                                            | | `fugashi <https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__.               |
-|                    |                                                            | | Use ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install from source) to install them.                  |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese-char``                      | | 12-layer, 768-hidden, 12-heads, 90M parameters.                                                                                     |
-|                    |                                                            | | Trained on Japanese text. Text is tokenized into characters.                                                                        |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese-char-whole-word-masking``   | | 12-layer, 768-hidden, 12-heads, 90M parameters.                                                                                     |
-|                    |                                                            | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters.                                               |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``TurkuNLP/bert-base-finnish-cased-v1``                    | | 12-layer, 768-hidden, 12-heads, 125M parameters.                                                                                    |
-|                    |                                                            | | Trained on cased Finnish text.                                                                                                      |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__).                                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``TurkuNLP/bert-base-finnish-uncased-v1``                  | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | Trained on uncased Finnish text.                                                                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__).                                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``wietsedv/bert-base-dutch-cased``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | Trained on cased Dutch text.                                                                                                        |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details on wietsedv repository <https://github.com/wietsedv/bertje/>`__).                                                       |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| GPT                | ``openai-gpt``                                             | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | OpenAI GPT English model                                                                                                            |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| GPT-2              | ``gpt2``                                                   | | 12-layer, 768-hidden, 12-heads, 117M parameters.                                                                                    |
-|                    |                                                            | | OpenAI GPT-2 English model                                                                                                          |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``gpt2-medium``                                            | | 24-layer, 1024-hidden, 16-heads, 345M parameters.                                                                                   |
-|                    |                                                            | | OpenAI's Medium-sized GPT-2 English model                                                                                           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``gpt2-large``                                             | | 36-layer, 1280-hidden, 20-heads, 774M parameters.                                                                                   |
-|                    |                                                            | | OpenAI's Large-sized GPT-2 English model                                                                                            |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``gpt2-xl``                                                | | 48-layer, 1600-hidden, 25-heads, 1558M parameters.                                                                                  |
-|                    |                                                            | | OpenAI's XL-sized GPT-2 English model                                                                                               |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| GPTNeo             | ``EleutherAI/gpt-neo-1.3B``                                | | 24-layer, 2048-hidden, 16-heads, 1.3B parameters.                                                                                   |
-|                    |                                                            | | EleutherAI's GPT-3 like language model.                                                                                             |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``EleutherAI/gpt-neo-2.7B``                                | | 32-layer, 2560-hidden, 20-heads, 2.7B parameters.                                                                                   |
-|                    |                                                            | | EleutherAI's GPT-3 like language model.                                                                                             |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Transformer-XL     | ``transfo-xl-wt103``                                       | | 18-layer, 1024-hidden, 16-heads, 257M parameters.                                                                                   |
-|                    |                                                            | | English model trained on WikiText-103                                                                                               |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| XLNet              | ``xlnet-base-cased``                                       | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
-|                    |                                                            | | XLNet English model                                                                                                                 |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlnet-large-cased``                                      | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
-|                    |                                                            | | XLNet Large English model                                                                                                           |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| XLM                | ``xlm-mlm-en-2048``                                        | | 12-layer, 2048-hidden, 16-heads                                                                                                     |
-|                    |                                                            | | XLM English model                                                                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-mlm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
-|                    |                                                            | | XLM English-German model trained on the concatenation of English and German Wikipedia                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-mlm-enfr-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
-|                    |                                                            | | XLM English-French model trained on the concatenation of English and French Wikipedia                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-mlm-enro-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
-|                    |                                                            | | XLM English-Romanian Multi-language model                                                                                           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-mlm-xnli15-1024``                                    | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
-|                    |                                                            | | XLM Model pre-trained with MLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                             |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-mlm-tlm-xnli15-1024``                                | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
-|                    |                                                            | | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                       |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-clm-enfr-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
-|                    |                                                            | | XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French Wikipedia           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-clm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
-|                    |                                                            | | XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German Wikipedia           |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-mlm-17-1280``                                        | | 16-layer, 1280-hidden, 16-heads                                                                                                     |
-|                    |                                                            | | XLM model trained with MLM (Masked Language Modeling) on 17 languages.                                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-mlm-100-1280``                                       | | 16-layer, 1280-hidden, 16-heads                                                                                                     |
-|                    |                                                            | | XLM model trained with MLM (Masked Language Modeling) on 100 languages.                                                             |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| RoBERTa            | ``roberta-base``                                           | | 12-layer, 768-hidden, 12-heads, 125M parameters                                                                                     |
-|                    |                                                            | | RoBERTa using the BERT-base architecture                                                                                            |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__)                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``roberta-large``                                          | | 24-layer, 1024-hidden, 16-heads, 355M parameters                                                                                    |
-|                    |                                                            | | RoBERTa using the BERT-large architecture                                                                                           |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__)                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``roberta-large-mnli``                                     | | 24-layer, 1024-hidden, 16-heads, 355M parameters                                                                                    |
-|                    |                                                            | | ``roberta-large`` fine-tuned on `MNLI <http://www.nyu.edu/projects/bowman/multinli/>`__.                                            |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__)                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``distilroberta-base``                                     | | 6-layer, 768-hidden, 12-heads, 82M parameters                                                                                       |
-|                    |                                                            | | The DistilRoBERTa model distilled from the RoBERTa model ``roberta-base`` checkpoint.                                               |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``roberta-base-openai-detector``                           | | 12-layer, 768-hidden, 12-heads, 125M parameters                                                                                     |
-|                    |                                                            | | ``roberta-base`` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.                                             |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/openai/gpt-2-output-dataset/tree/master/detector>`__)                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``roberta-large-openai-detector``                          | | 24-layer, 1024-hidden, 16-heads, 355M parameters                                                                                    |
-|                    |                                                            | | ``roberta-large`` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.                                            |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/openai/gpt-2-output-dataset/tree/master/detector>`__)                                               |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| DistilBERT         | ``distilbert-base-uncased``                                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
-|                    |                                                            | | The DistilBERT model distilled from the BERT model ``bert-base-uncased`` checkpoint                                                 |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``distilbert-base-uncased-distilled-squad``                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
-|                    |                                                            | | The DistilBERT model distilled from the BERT model ``bert-base-uncased`` checkpoint, with an additional linear layer.               |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``distilbert-base-cased``                                  | | 6-layer, 768-hidden, 12-heads, 65M parameters                                                                                       |
-|                    |                                                            | | The DistilBERT model distilled from the BERT model ``bert-base-cased`` checkpoint                                                   |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``distilbert-base-cased-distilled-squad``                  | | 6-layer, 768-hidden, 12-heads, 65M parameters                                                                                       |
-|                    |                                                            | | The DistilBERT model distilled from the BERT model ``bert-base-cased`` checkpoint, with an additional question answering layer.     |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``distilgpt2``                                             | | 6-layer, 768-hidden, 12-heads, 82M parameters                                                                                       |
-|                    |                                                            | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint.                                                               |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``distilbert-base-german-cased``                           | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
-|                    |                                                            | | The German DistilBERT model distilled from the German DBMDZ BERT model `bert-base-german-dbmdz-cased` checkpoint.                   |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``distilbert-base-multilingual-cased``                     | | 6-layer, 768-hidden, 12-heads, 134M parameters                                                                                      |
-|                    |                                                            | | The multilingual DistilBERT model distilled from the Multilingual BERT model `bert-base-multilingual-cased` checkpoint.             |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>`__)                   |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| CTRL               | ``ctrl``                                                   | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters                                                                                    |
-|                    |                                                            | | Salesforce's Large-sized CTRL English model                                                                                         |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| CamemBERT          | ``camembert-base``                                         | | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                                     |
-|                    |                                                            | | CamemBERT using the BERT-base architecture                                                                                          |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/camembert>`__)                                                 |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| ALBERT             | ``albert-base-v1``                                         | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters                                                            |
-|                    |                                                            | | ALBERT base model                                                                                                                   |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``albert-large-v1``                                        | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters                                                           |
-|                    |                                                            | | ALBERT large model                                                                                                                  |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``albert-xlarge-v1``                                       | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters                                                           |
-|                    |                                                            | | ALBERT xlarge model                                                                                                                 |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``albert-xxlarge-v1``                                      | | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters                                                          |
-|                    |                                                            | | ALBERT xxlarge model                                                                                                                |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``albert-base-v2``                                         | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters                                                            |
-|                    |                                                            | | ALBERT base model with no dropout, additional training data and longer training                                                     |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``albert-large-v2``                                        | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters                                                           |
-|                    |                                                            | | ALBERT large model with no dropout, additional training data and longer training                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``albert-xlarge-v2``                                       | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters                                                           |
-|                    |                                                            | | ALBERT xlarge model with no dropout, additional training data and longer training                                                   |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``albert-xxlarge-v2``                                      | | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters                                                          |
-|                    |                                                            | | ALBERT xxlarge model with no dropout, additional training data and longer training                                                  |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/google-research/ALBERT>`__)                                                                         |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| T5                 | ``t5-small``                                               | | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads,                                           |
-|                    |                                                            | | Trained on English text: the Colossal Clean Crawled Corpus (C4)                                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``t5-base``                                                | | ~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads,                                        |
-|                    |                                                            | | Trained on English text: the Colossal Clean Crawled Corpus (C4)                                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``t5-large``                                               | | ~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads,                                       |
-|                    |                                                            | | Trained on English text: the Colossal Clean Crawled Corpus (C4)                                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``t5-3B``                                                  | | ~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads,                                      |
-|                    |                                                            | | Trained on English text: the Colossal Clean Crawled Corpus (C4)                                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``t5-11B``                                                 | | ~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads,                                      |
-|                    |                                                            | | Trained on English text: the Colossal Clean Crawled Corpus (C4)                                                                     |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| XLM-RoBERTa        | ``xlm-roberta-base``                                       | | ~270M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads,                                         |
-|                    |                                                            | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages                                                          |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-roberta-large``                                      | | ~550M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads,                                       |
-|                    |                                                            | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages                                                          |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| FlauBERT           | ``flaubert/flaubert_small_cased``                          | | 6-layer, 512-hidden, 8-heads, 54M parameters                                                                                        |
-|                    |                                                            | | FlauBERT small architecture                                                                                                         |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``flaubert/flaubert_base_uncased``                         | | 12-layer, 768-hidden, 12-heads, 137M parameters                                                                                     |
-|                    |                                                            | | FlauBERT base architecture with uncased vocabulary                                                                                  |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``flaubert/flaubert_base_cased``                           | | 12-layer, 768-hidden, 12-heads, 138M parameters                                                                                     |
-|                    |                                                            | | FlauBERT base architecture with cased vocabulary                                                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``flaubert/flaubert_large_cased``                          | | 24-layer, 1024-hidden, 16-heads, 373M parameters                                                                                    |
-|                    |                                                            | | FlauBERT large architecture                                                                                                         |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Bart               | ``facebook/bart-large``                                    | | 24-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`__)                                                      |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/bart-base``                                     | | 12-layer, 768-hidden, 16-heads, 139M parameters                                                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/bart-large-mnli``                               | | Adds a 2 layer classification head with 1 million parameters                                                                        |
-|                    |                                                            | | bart-large base architecture with a classification head, finetuned on MNLI                                                          |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/bart-large-cnn``                                | | 24-layer, 1024-hidden, 16-heads, 406M parameters (same as large)                                                                    |
-|                    |                                                            | | bart-large base architecture finetuned on CNN summarization task                                                                    |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| BARThez            | ``moussaKam/barthez``                                      | | 12-layer, 768-hidden, 12-heads, 216M parameters                                                                                     |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/moussaKam/BARThez>`__)                                                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``moussaKam/mbarthez``                                     | | 24-layer, 1024-hidden, 16-heads, 561M parameters                                                                                    |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| DialoGPT           | ``DialoGPT-small``                                         | | 12-layer, 768-hidden, 12-heads, 124M parameters                                                                                     |
-|                    |                                                            | | Trained on English text: 147M conversation-like exchanges extracted from Reddit.                                                    |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``DialoGPT-medium``                                        | | 24-layer, 1024-hidden, 16-heads, 355M parameters                                                                                    |
-|                    |                                                            | | Trained on English text: 147M conversation-like exchanges extracted from Reddit.                                                    |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``DialoGPT-large``                                         | | 36-layer, 1280-hidden, 20-heads, 774M parameters                                                                                    |
-|                    |                                                            | | Trained on English text: 147M conversation-like exchanges extracted from Reddit.                                                    |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Reformer           | ``reformer-enwik8``                                        | | 12-layer, 1024-hidden, 8-heads, 149M parameters                                                                                     |
-|                    |                                                            | | Trained on English Wikipedia data - enwik8.                                                                                         |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``reformer-crime-and-punishment``                          | | 6-layer, 256-hidden, 2-heads, 3M parameters                                                                                         |
-|                    |                                                            | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky.                                                          |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| M2M100             | ``facebook/m2m100_418M``                                   | | 24-layer, 1024-hidden, 16-heads, 418M parameters                                                                                    |
-|                    |                                                            | | multilingual machine translation model for 100 languages                                                                            |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/m2m100_1.2B``                                   | | 48-layer, 1024-hidden, 16-heads, 1.2B parameters                                                                                    |
-|                    |                                                            | | multilingual machine translation model for 100 languages                                                                            |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| MarianMT           | ``Helsinki-NLP/opus-mt-{src}-{tgt}``                       | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size.            |
-|                    |                                                            | | (see `model list <https://huggingface.co/Helsinki-NLP>`_)                                                                           |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Pegasus            | ``google/pegasus-{dataset}``                               | | 16-layer, 1024-hidden, 16-heads, ~568M parameter, 2.2 GB for summary. `model list <https://huggingface.co/models?search=pegasus>`__ |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Longformer         | ``allenai/longformer-base-4096``                           | | 12-layer, 768-hidden, 12-heads, ~149M parameters                                                                                    |
-|                    |                                                            | | Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096                                                     |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``allenai/longformer-large-4096``                          | | 24-layer, 1024-hidden, 16-heads, ~435M parameters                                                                                   |
-|                    |                                                            | | Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096                                                    |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| MBart              | ``facebook/mbart-large-cc25``                              | | 24-layer, 1024-hidden, 16-heads, 610M parameters                                                                                    |
-|                    |                                                            | | mBART (bart-large architecture) model trained on 25 languages' monolingual corpus                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/mbart-large-en-ro``                             | | 24-layer, 1024-hidden, 16-heads, 610M parameters                                                                                    |
-|                    |                                                            | | mbart-large-cc25 model finetuned on WMT english romanian translation.                                                               |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/mbart-large-50``                                | | 24-layer, 1024-hidden, 16-heads,                                                                                                    |
-|                    |                                                            | | mBART model trained on 50 languages' monolingual corpus.                                                                            |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/mbart-large-50-one-to-many-mmt``                | | 24-layer, 1024-hidden, 16-heads,                                                                                                    |
-|                    |                                                            | | mbart-50-large model finetuned for one (English) to many multilingual machine translation covering 50 languages.                    |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``facebook/mbart-large-50-many-to-many-mmt``               | | 24-layer, 1024-hidden, 16-heads,                                                                                                    |
-|                    |                                                            | | mbart-50-large model finetuned for many to many multilingual machine translation covering 50 languages.                             |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Lxmert             | ``lxmert-base-uncased``                                    | | 9-language layers, 9-relationship layers, and 12-cross-modality layers                                                              |
-|                    |                                                            | | 768-hidden, 12-heads (for each layer) ~ 228M parameters                                                                             |
-|                    |                                                            | | Starting from lxmert-base checkpoint, trained on over 9 million image-text couplets from COCO, VisualGenome, GQA, VQA               |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Funnel Transformer | ``funnel-transformer/small``                               | | 14 layers: 3 blocks of 4 layers then 2 layers decoder, 768-hidden, 12-heads, 130M parameters                                        |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/small-base``                          | | 12 layers: 3 blocks of 4 layers (no decoder), 768-hidden, 12-heads, 115M parameters                                                 |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/medium``                              | | 14 layers: 3 blocks 6, 3x2, 3x2 layers then 2 layers decoder, 768-hidden, 12-heads, 130M parameters                                 |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/medium-base``                         | | 12 layers: 3 blocks 6, 3x2, 3x2 layers(no decoder), 768-hidden, 12-heads, 115M parameters                                           |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/intermediate``                        | | 20 layers: 3 blocks of 6 layers then 2 layers decoder, 768-hidden, 12-heads, 177M parameters                                        |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/intermediate-base``                   | | 18 layers: 3 blocks of 6 layers (no decoder), 768-hidden, 12-heads, 161M parameters                                                 |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/large``                               | | 26 layers: 3 blocks of 8 layers then 2 layers decoder, 1024-hidden, 12-heads, 386M parameters                                       |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/large-base``                          | | 24 layers: 3 blocks of 8 layers (no decoder), 1024-hidden, 12-heads, 358M parameters                                                |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/xlarge``                              | | 32 layers: 3 blocks of 10 layers then 2 layers decoder, 1024-hidden, 12-heads, 468M parameters                                      |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``funnel-transformer/xlarge-base``                         | | 30 layers: 3 blocks of 10 layers (no decoder), 1024-hidden, 12-heads, 440M parameters                                               |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/laiguokun/Funnel-Transformer>`__)                                                                   |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| LayoutLM           | ``microsoft/layoutlm-base-uncased``                        | | 12 layers, 768-hidden, 12-heads, 113M parameters                                                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/microsoft/unilm/tree/master/layoutlm>`__)                                                           |
-+                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``microsoft/layoutlm-large-uncased``                       | | 24 layers, 1024-hidden, 16-heads, 343M parameters                                                                                   |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/microsoft/unilm/tree/master/layoutlm>`__)                                                           |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| DeBERTa            | ``microsoft/deberta-base``                                 | | 12-layer, 768-hidden, 12-heads, ~140M parameters                                                                                    |
-|                    |                                                            | | DeBERTa using the BERT-base architecture                                                                                            |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``microsoft/deberta-large``                                | | 24-layer, 1024-hidden, 16-heads, ~400M parameters                                                                                   |
-|                    |                                                            | | DeBERTa using the BERT-large architecture                                                                                           |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``microsoft/deberta-xlarge``                               | | 48-layer, 1024-hidden, 16-heads, ~750M parameters                                                                                   |
-|                    |                                                            | | DeBERTa XLarge with similar BERT architecture                                                                                       |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``microsoft/deberta-xlarge-v2``                            | | 24-layer, 1536-hidden, 24-heads, ~900M parameters                                                                                   |
-|                    |                                                            | | DeBERTa XLarge V2 with similar BERT architecture                                                                                    |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``microsoft/deberta-xxlarge-v2``                           | | 48-layer, 1536-hidden, 24-heads, ~1.5B parameters                                                                                   |
-|                    |                                                            | | DeBERTa XXLarge V2 with similar BERT architecture                                                                                   |
-|                    |                                                            |                                                                                                                                       |
-|                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| SqueezeBERT        | ``squeezebert/squeezebert-uncased``                        | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
-|                    |                                                            | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks.          |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``squeezebert/squeezebert-mnli``                           | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
-|                    |                                                            | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base.      |
-|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``squeezebert/squeezebert-mnli-headless``                  | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
-|                    |                                                            | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base.      |
-|                    |                                                            | | The final classification layer is removed, so when you finetune, the final layer will be reinitialized.                             |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -197,10 +197,9 @@ To apply these steps on a given text, we can just feed it to our tokenizer:

    >>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

-This returns a dictionary string to list of ints. It contains the `ids of the tokens <glossary.html#input-ids>`__, as
+This returns a dictionary string to list of ints. It contains the `ids of the tokens <glossary#input-ids>`__, as
 mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
-`attention mask <glossary.html#attention-mask>`__ that the model will use to have a better understanding of the
-sequence:
+`attention mask <glossary#attention-mask>`__ that the model will use to have a better understanding of the sequence:


 .. code-block::
--- a/docs/source/toctree.yml
+++ b/docs/source/toctree.yml
@@ -0,0 +1,304 @@
+- sections:
+  - local: index
+    title: 🤗 Transformers
+  - local: quicktour
+    title: Quick tour
+  - local: installation
+    title: Installation
+  - local: philosophy
+    title: Philosophy
+  - local: glossary
+    title: Glossary
+  title: Get started
+- sections:
+  - local: task_summary
+    title: Summary of the tasks
+  - local: model_summary
+    title: Summary of the models
+  - local: preprocessing
+    title: Preprocessing data
+  - local: training
+    title: Fine-tuning a pretrained model
+  - local: model_sharing
+    title: Model sharing and uploading
+  - local: tokenizer_summary
+    title: Summary of the tokenizers
+  - local: multilingual
+    title: Multi-lingual models
+  title: "Using 🤗 Transformers"
+- sections:
+  - local: examples
+    title: Examples
+  - local: troubleshooting
+    title: Troubleshooting
+  - local: custom_datasets
+    title: Fine-tuning with custom datasets
+  - local: notebooks
+    title: "🤗 Transformers Notebooks"
+  - local: sagemaker
+    title: Run training on Amazon SageMaker
+  - local: community
+    title: Community
+  - local: converting_tensorflow_models
+    title: Converting TensorFlow Checkpoints
+  - local: migration
+    title: Migrating from previous packages
+  - local: contributing
+    title: How to contribute to transformers?
+  - local: add_new_model
+    title: "How to add a model to 🤗 Transformers?"
+  - local: add_new_pipeline
+    title: "How to add a pipeline to 🤗 Transformers?"
+  - local: fast_tokenizers
+    title: "Using tokenizers from 🤗 Tokenizers"
+  - local: performance
+    title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
+  - local: parallelism
+    title: Model Parallelism
+  - local: testing
+    title: Testing
+  - local: debugging
+    title: Debugging
+  - local: serialization
+    title: Exporting transformers models
+  title: Advanced guides
+- sections:
+  - local: bertology
+    title: BERTology
+  - local: perplexity
+    title: Perplexity of fixed-length models
+  - local: benchmarks
+    title: Benchmarks
+  title: Research
+- sections:
+  - sections:
+    - local: main_classes/callback
+      title: Callbacks
+    - local: main_classes/configuration
+      title: Configuration
+    - local: main_classes/data_collator
+      title: Data Collator
+    - local: main_classes/keras_callbacks
+      title: Keras callbacks
+    - local: main_classes/logging
+      title: Logging
+    - local: main_classes/model
+      title: Models
+    - local: main_classes/optimizer_schedules
+      title: Optimization
+    - local: main_classes/output
+      title: Model outputs
+    - local: main_classes/pipelines
+      title: Pipelines
+    - local: main_classes/processors
+      title: Processors
+    - local: main_classes/tokenizer
+      title: Tokenizer
+    - local: main_classes/trainer
+      title: Trainer
+    - local: main_classes/deepspeed
+      title: DeepSpeed Integration
+    - local: main_classes/feature_extractor
+      title: Feature Extractor
+    title: Main Classes
+  - sections:
+    - local: model_doc/albert
+      title: ALBERT
+    - local: model_doc/auto
+      title: Auto Classes
+    - local: model_doc/bart
+      title: BART
+    - local: model_doc/barthez
+      title: BARThez
+    - local: model_doc/bartpho
+      title: BARTpho
+    - local: model_doc/beit
+      title: BEiT
+    - local: model_doc/bert
+      title: BERT
+    - local: model_doc/bertweet
+      title: Bertweet
+    - local: model_doc/bertgeneration
+      title: BertGeneration
+    - local: model_doc/bert_japanese
+      title: BertJapanese
+    - local: model_doc/bigbird
+      title: BigBird
+    - local: model_doc/bigbird_pegasus
+      title: BigBirdPegasus
+    - local: model_doc/blenderbot
+      title: Blenderbot
+    - local: model_doc/blenderbot_small
+      title: Blenderbot Small
+    - local: model_doc/bort
+      title: BORT
+    - local: model_doc/byt5
+      title: ByT5
+    - local: model_doc/camembert
+      title: CamemBERT
+    - local: model_doc/canine
+      title: CANINE
+    - local: model_doc/clip
+      title: CLIP
+    - local: model_doc/convbert
+      title: ConvBERT
+    - local: model_doc/cpm
+      title: CPM
+    - local: model_doc/ctrl
+      title: CTRL
+    - local: model_doc/deberta
+      title: DeBERTa
+    - local: model_doc/deberta_v2
+      title: DeBERTa-v2
+    - local: model_doc/deit
+      title: DeiT
+    - local: model_doc/detr
+      title: DETR
+    - local: model_doc/dialogpt
+      title: DialoGPT
+    - local: model_doc/distilbert
+      title: DistilBERT
+    - local: model_doc/dpr
+      title: DPR
+    - local: model_doc/electra
+      title: ELECTRA
+    - local: model_doc/encoderdecoder
+      title: Encoder Decoder Models
+    - local: model_doc/flaubert
+      title: FlauBERT
+    - local: model_doc/fnet
+      title: FNet
+    - local: model_doc/fsmt
+      title: FSMT
+    - local: model_doc/funnel
+      title: Funnel Transformer
+    - local: model_doc/herbert
+      title: herBERT
+    - local: model_doc/ibert
+      title: I-BERT
+    - local: model_doc/imagegpt
+      title: ImageGPT
+    - local: model_doc/layoutlm
+      title: LayoutLM
+    - local: model_doc/layoutlmv2
+      title: LayoutLMV2
+    - local: model_doc/layoutxlm
+      title: LayoutXLM
+    - local: model_doc/led
+      title: LED
+    - local: model_doc/longformer
+      title: Longformer
+    - local: model_doc/luke
+      title: LUKE
+    - local: model_doc/lxmert
+      title: LXMERT
+    - local: model_doc/marian
+      title: MarianMT
+    - local: model_doc/m2m_100
+      title: M2M100
+    - local: model_doc/mbart
+      title: MBart and MBart-50
+    - local: model_doc/megatron_bert
+      title: MegatronBERT
+    - local: model_doc/megatron_gpt2
+      title: MegatronGPT2
+    - local: model_doc/mobilebert
+      title: MobileBERT
+    - local: model_doc/mpnet
+      title: MPNet
+    - local: model_doc/mt5
+      title: MT5
+    - local: model_doc/gpt
+      title: OpenAI GPT
+    - local: model_doc/gpt2
+      title: OpenAI GPT2
+    - local: model_doc/gptj
+      title: GPT-J
+    - local: model_doc/gpt_neo
+      title: GPT Neo
+    - local: model_doc/hubert
+      title: Hubert
+    - local: model_doc/pegasus
+      title: Pegasus
+    - local: model_doc/phobert
+      title: PhoBERT
+    - local: model_doc/prophetnet
+      title: ProphetNet
+    - local: model_doc/qdqbert
+      title: QDQBert
+    - local: model_doc/rag
+      title: RAG
+    - local: model_doc/reformer
+      title: Reformer
+    - local: model_doc/rembert
+      title: RemBERT
+    - local: model_doc/retribert
+      title: RetriBERT
+    - local: model_doc/roberta
+      title: RoBERTa
+    - local: model_doc/roformer
+      title: RoFormer
+    - local: model_doc/segformer
+      title: SegFormer
+    - local: model_doc/sew
+      title: SEW
+    - local: model_doc/sew_d
+      title: SEW-D
+    - local: model_doc/speechencoderdecoder
+      title: Speech Encoder Decoder Models
+    - local: model_doc/speech_to_text
+      title: Speech2Text
+    - local: model_doc/speech_to_text_2
+      title: Speech2Text2
+    - local: model_doc/splinter
+      title: Splinter
+    - local: model_doc/squeezebert
+      title: SqueezeBERT
+    - local: model_doc/t5
+      title: T5
+    - local: model_doc/t5v1.1
+      title: T5v1.1
+    - local: model_doc/tapas
+      title: TAPAS
+    - local: model_doc/transformerxl
+      title: Transformer XL
+    - local: model_doc/trocr
+      title: TrOCR
+    - local: model_doc/unispeech
+      title: UniSpeech
+    - local: model_doc/unispeech_sat
+      title: UniSpeech-SAT
+    - local: model_doc/visionencoderdecoder
+      title: Vision Encoder Decoder Models
+    - local: model_doc/vit
+      title: Vision Transformer (ViT)
+    - local: model_doc/visual_bert
+      title: VisualBERT
+    - local: model_doc/wav2vec2
+      title: Wav2Vec2
+    - local: model_doc/xlm
+      title: XLM
+    - local: model_doc/xlmprophetnet
+      title: XLM-ProphetNet
+    - local: model_doc/xlmroberta
+      title: XLM-RoBERTa
+    - local: model_doc/xlnet
+      title: XLNet
+    - local: model_doc/xlsr_wav2vec2
+      title: XLSR-Wav2Vec2
+    title: Models
+  - sections:
+    - local: internal/modeling_utils
+      title: Custom Layers and Utilities
+    - local: internal/pipelines_utils
+      title: Utilities for pipelines
+    - local: internal/tokenization_utils
+      title: Utilities for Tokenizers
+    - local: internal/trainer_utils
+      title: Utilities for Trainer
+    - local: internal/generation_utils
+      title: Utilities for Generation
+    - local: internal/file_utils
+      title: General Utilities
+    title: Internal Helpers
+  title: API
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -417,5 +417,5 @@ To look at more fine-tuning examples you can refer to:
 - `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`__ which includes scripts
  to train on all common NLP tasks in PyTorch and TensorFlow.

- `🤗 Transformers Notebooks <notebooks.html>`__ which contains various notebooks and in particular one per task (look
-  for the `how to finetune a model on xxx`).
+- `🤗 Transformers Notebooks <notebooks>`__ which contains various notebooks and in particular one per task (look for
+  the `how to finetune a model on xxx`).
--- a/docs/source/troubleshooting.md
+++ b/docs/source/troubleshooting.md
@@ -27,4 +27,4 @@ ValueError: Connection error, and we cannot find the requested files in the cach
 Please try again or make sure your Internet connection is on.
 ```

-One possible solution in this situation is to use the ["offline-mode"](https://huggingface.co/transformers/installation.html#offline-mode).
+One possible solution in this situation is to use the ["offline-mode"](installation#offline-mode).