New model sharing tutorial (#5323)

2020-06-27 11:10:02 -04:00
parent efae6645e2
commit 1af58c0706
5 changed files with 211 additions and 147 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -139,9 +139,8 @@ conversion utilities for the following models:
    task_summary
    model_summary
    training
    preprocessing
-    serialization
+    training
    model_sharing
    multilingual
--- a/docs/source/model_sharing.md
+++ b/docs/source/model_sharing.md
@@ -1,55 +0,0 @@
 # Model upload and sharing
 Starting with `v2.2.2`, you can now upload and share your fine-tuned models with the community, using the <abbr title="Command-line interface">CLI</abbr> that's built-in to the library.
 **First, create an account on [https://huggingface.co/join](https://huggingface.co/join)**. Optionally, join an existing organization or create a new one. Then:
 ```shell
 transformers-cli login
 # log in using the same credentials as on huggingface.co
 ```
 Upload your model:
 ```shell
 transformers-cli upload ./path/to/pretrained_model/
 # ^^ Upload folder containing weights/tokenizer/config
 # saved via `.save_pretrained()`
 transformers-cli upload ./config.json [--filename folder/foobar.json]
 # ^^ Upload a single file
 # (you can optionally override its filename, which can be nested inside a folder)
 ```
 If you want your model to be namespaced by your organization name rather than your username, add the following flag to any command:
 ```shell
 --organization organization_name
 ```
 Your model will then be accessible through its identifier, a concatenation of your username (or organization name) and the folder name above:
 ```python
 "username/pretrained_model"
 # or if an org:
 "organization_name/pretrained_model"
 ```
 **Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hardware used, hyperparameters), evaluation results, intended uses & limitations, etc.
 Your model now has a page on huggingface.co/models 🔥
 Anyone can load it from code:
 ```python
 tokenizer = AutoTokenizer.from_pretrained("namespace/pretrained_model")
 model = AutoModel.from_pretrained("namespace/pretrained_model")
 ```
 List all your files on S3:
 ```shell
 transformers-cli s3 ls
 ```
 You can also delete unneeded files:
 ```shell
 transformers-cli s3 rm …
 ```
--- a/docs/source/model_sharing.rst
+++ b/docs/source/model_sharing.rst
@@ -0,0 +1,209 @@
 Model sharing and uploading
 ===========================
 On this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
 the `model hub <https://huggingface.co/models>`__.
 .. note::
    You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
    Optionally, you can join an existing organization or create a new one.
 Prepare your model for uploading
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably
 done something similar on your task, either using the model directly in your own training loop or using the
 :class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on
 the `model hub <https://huggingface.co/models>`__.
 Basic steps
 ^^^^^^^^^^^
 .. 
    When #5258 is merged, we can remove the need to create the directory.
 First, pick a directory with the name you want your model to have on the model hub (its full name will then be
 `username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either
 ::
    mkdir path/to/awesome-name-you-picked
 or in Python
 ::
    import os
    os.makedirs("path/to/awesome-name-you-picked")
 then you can save your model and tokenizer with:
 ::
    model.save_pretrained("path/to/awesome-name-you-picked")
    tokenizer.save_pretrained("path/to/awesome-name-you-picked")
 Or, if you're using the Trainer API
 ::
    trainer.save_model("path/to/awesome-name-you-picked")
    tokenizer.save_pretrained("path/to/awesome-name-you-picked")
 Make your model work on all frameworks
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. 
    TODO Sylvain: make this automatic during the upload
 You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
 PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
 your model in another framework, but it will be slower). Don't worry, it's super easy to do (and in a future version,
 it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
 worry about the GPU, so it should be very easy. Check the
 `TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ 
 and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
 First check that your model class exists in the other framework by trying to import the same model class with the TF
 prefix added or removed. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`,
 try to type
 ::
    from transformers import TFDistilBertForSequenceClassification
 and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
 type
 ::
    from transformers import DistilBertForSequenceClassification
 This will raise an error if your model does not exist in the other framework (something that should be pretty rare
 since we're aiming for full parity between the two frameworks). In that case, skip the conversion below and go to the
 next step.
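 The "adding or removing TF" rule above can be sketched as a small helper. This is only an illustration:
 ``other_framework_class_name`` is a hypothetical name, not part of 🤗 Transformers, and it assumes the naming
 convention described above (TensorFlow classes carry a ``TF`` prefix, PyTorch classes do not).

```python
def other_framework_class_name(class_name):
    """Return the class name to try importing in the other framework.

    Assumes TensorFlow model classes are the PyTorch names prefixed with "TF".
    """
    if class_name.startswith("TF"):
        # TensorFlow class: drop the prefix to get the PyTorch name.
        return class_name[2:]
    # PyTorch class: add the prefix to get the TensorFlow name.
    return "TF" + class_name
```

 You would still need to attempt the import yourself to find out whether the resulting class actually exists.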
 Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your
 model class:
 ::
    tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
    tf_model.save_pretrained("path/to/awesome-name-you-picked")
 and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
 model class:
 ::
    pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
    pt_model.save_pretrained("path/to/awesome-name-you-picked")
 That's all there is to it!
 Check the directory before uploading
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Make sure there are no garbage files in the directory you'll upload. It should only have:
 - a `config.json` file, which saves the :doc:`configuration <main_classes/configuration>` of your model;
 - a `pytorch_model.bin` file, which is the PyTorch checkpoint (unless you can't have it for some reason);
 - a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
 - a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
 - a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
 - a `vocab.txt`, which is the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>`
  save;
 - possibly an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
 Other files can safely be deleted.
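 As a rough way to spot leftover files before uploading, the checklist above can be turned into a short script. This is
 a minimal sketch, not a feature of the library: ``EXPECTED_FILES`` and ``find_unexpected_files`` are hypothetical
 names, and the set only contains the file names listed above, so a tokenizer that saves other vocabulary files would
 need them added.

```python
from pathlib import Path

# File names taken from the checklist above (an assumption: tokenizers that
# save other vocabulary files would need their file names added here).
EXPECTED_FILES = {
    "config.json",
    "pytorch_model.bin",
    "tf_model.h5",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.txt",
    "added_tokens.json",
}


def find_unexpected_files(model_dir):
    """Return the sorted names of entries in model_dir that are not on the checklist."""
    return sorted(
        entry.name
        for entry in Path(model_dir).iterdir()
        if entry.name not in EXPECTED_FILES
    )
```

 Calling ``find_unexpected_files("path/to/awesome-name-you-picked")`` returns the names you may want to review or
 delete; an empty list means the directory only holds files from the checklist.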
 Upload your model with the CLI
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Now go to a terminal and run the following command. It should be run in the virtual environment where you installed 🤗
 Transformers, since the :obj:`transformers-cli` command comes from the library.
 ::
    transformers-cli login
 Then log in using the same credentials as on huggingface.co. To upload your model, just type
 ::
    transformers-cli upload path/to/awesome-name-you-picked/
 This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
 If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add),
 just type:
 ::
    transformers-cli upload path/to/awesome-name-you-picked/that-file 
 or
 ::
    transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name
 if you want to change its filename.
 This uploads the model to your personal account. If you want your model to be namespaced by your organization name
 rather than your username, add the following flag to any command:
 ::
    --organization organization_name
 so for instance:
 ::
    transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name
 Your model will then be accessible through its identifier, which is, as we saw above,
 `username/awesome-name-you-picked` or `organization/awesome-name-you-picked`.
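 To make the naming rule concrete, here is a small sketch of how the identifier follows from the folder you uploaded.
 ``hub_identifier`` is a hypothetical helper written for illustration, not part of the library; it assumes the
 identifier is simply the namespace (username or organization name) joined to the folder name, as described above.

```python
from pathlib import PurePosixPath


def hub_identifier(model_dir, namespace):
    """Join the namespace (username or organization) with the uploaded folder name."""
    # PurePosixPath ignores a trailing slash, so "a/b/" and "a/b" both give "b".
    return f"{namespace}/{PurePosixPath(model_dir).name}"
```

 For example, ``hub_identifier("path/to/awesome-name-you-picked/", "username")`` gives the string you would later pass
 to ``from_pretrained``.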
 Add a model card
 ^^^^^^^^^^^^^^^^
 To make sure everyone knows what your model can do and what its limitations, potential biases, and ethical
 considerations are, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should be named
 `awesome-name-you-picked-README.md` and follow `this template <https://github.com/huggingface/model_card>`__.
 If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
 don't forget to link to its model card so that people can fully trace how your model was built.
 If you have never made a pull request to the 🤗 Transformers repo, look at the
 :doc:`contributing guide <contributing>` to see the steps to follow.
 Using your model
 ^^^^^^^^^^^^^^^^
 Your model now has a page on huggingface.co/models 🔥
 Anyone can load it from code:
 ::
    tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
    model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
 Additional commands
 ^^^^^^^^^^^^^^^^^^^
 You can list all the files you uploaded on the hub like this:
 ::
    transformers-cli s3 ls
 You can also delete unneeded files with
 ::
    transformers-cli s3 rm awesome-name-you-picked/filename
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -282,7 +282,7 @@ Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#to
 `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual
 training loop. 🤗 Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if
 you are using TensorFlow) class to help with your training (taking care of things such as distributed training, mixed
-precision, etc.). See the training tutorial (coming soon) for more details.
+precision, etc.). See the :doc:`training tutorial <training>` for more details.
 Once your model is fine-tuned, you can save it with its tokenizer the following way:
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -1,89 +0,0 @@
 Serialization best-practices
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 This section explains how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
 There are three types of files you need to save to be able to reload a fine-tuned model:
 * the model itself which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
 * the configuration file of the model which is saved as a JSON file, and
 * the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
 The *default filenames* of these files are as follows:
 * the model weights file: ``pytorch_model.bin``\ ,
 * the configuration file: ``config.json``\ ,
 * the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
 * for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
 **If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
 Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
 .. code-block:: python
   import os
   import torch
   from transformers import WEIGHTS_NAME, CONFIG_NAME
   output_dir = "./models/"
   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
   # If we have a distributed model, save only the encapsulated model
   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
   model_to_save = model.module if hasattr(model, 'module') else model
   # If we save using the predefined names, we can load using `from_pretrained`
   output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
   output_config_file = os.path.join(output_dir, CONFIG_NAME)
   torch.save(model_to_save.state_dict(), output_model_file)
   model_to_save.config.to_json_file(output_config_file)
   tokenizer.save_pretrained(output_dir)
   # Step 2: Re-load the saved model and vocabulary
   # Example for a Bert model
   model = BertForQuestionAnswering.from_pretrained(output_dir)
   tokenizer = BertTokenizer.from_pretrained(output_dir)  # Add specific options if needed
   # Example for a GPT model
   model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
   tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
 Here is another way you can save and reload the model if you want to use specific paths for each type of files:
 .. code-block:: python
   output_model_file = "./models/my_own_model_file.bin"
   output_config_file = "./models/my_own_config_file.bin"
   output_vocab_file = "./models/my_own_vocab_file.bin"
   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
   # If we have a distributed model, save only the encapsulated model
   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
   model_to_save = model.module if hasattr(model, 'module') else model
   torch.save(model_to_save.state_dict(), output_model_file)
   model_to_save.config.to_json_file(output_config_file)
   tokenizer.save_vocabulary(output_vocab_file)
   # Step 2: Re-load the saved model and vocabulary
   # We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
   # Here is how to do it in this situation:
   # Example for a Bert model
   config = BertConfig.from_json_file(output_config_file)
   model = BertForQuestionAnswering(config)
   state_dict = torch.load(output_model_file)
   model.load_state_dict(state_dict)
   tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
   # Example for a GPT model
   config = OpenAIGPTConfig.from_json_file(output_config_file)
   model = OpenAIGPTDoubleHeadsModel(config)
   state_dict = torch.load(output_model_file)
   model.load_state_dict(state_dict)
   tokenizer = OpenAIGPTTokenizer(output_vocab_file)