From 1af58c07064d8f4580909527a8f18de226b226ee Mon Sep 17 00:00:00 2001 From: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Date: Sat, 27 Jun 2020 11:10:02 -0400 Subject: [PATCH] New model sharing tutorial (#5323) --- docs/source/index.rst | 3 +- docs/source/model_sharing.md | 55 --------- docs/source/model_sharing.rst | 209 ++++++++++++++++++++++++++++++++++ docs/source/quicktour.rst | 2 +- docs/source/serialization.rst | 89 --------------- 5 files changed, 211 insertions(+), 147 deletions(-) delete mode 100644 docs/source/model_sharing.md create mode 100644 docs/source/model_sharing.rst delete mode 100644 docs/source/serialization.rst diff --git a/docs/source/index.rst b/docs/source/index.rst index 269bc33052..bbd841fb85 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -139,9 +139,8 @@ conversion utilities for the following models: task_summary model_summary - training preprocessing - serialization + training model_sharing multilingual diff --git a/docs/source/model_sharing.md b/docs/source/model_sharing.md deleted file mode 100644 index cad003fadc..0000000000 --- a/docs/source/model_sharing.md +++ /dev/null @@ -1,55 +0,0 @@ -# Model upload and sharing - -Starting with `v2.2.2`, you can now upload and share your fine-tuned models with the community, using the CLI that's built-in to the library. - -**First, create an account on [https://huggingface.co/join](https://huggingface.co/join)**. Optionally, join an existing organization or create a new one. 
Then: - -```shell -transformers-cli login -# log in using the same credentials as on huggingface.co -``` -Upload your model: -```shell -transformers-cli upload ./path/to/pretrained_model/ - -# ^^ Upload folder containing weights/tokenizer/config -# saved via `.save_pretrained()` - -transformers-cli upload ./config.json [--filename folder/foobar.json] - -# ^^ Upload a single file -# (you can optionally override its filename, which can be nested inside a folder) -``` - -If you want your model to be namespaced by your organization name rather than your username, add the following flag to any command: -```shell ---organization organization_name -``` - -Your model will then be accessible through its identifier, a concatenation of your username (or organization name) and the folder name above: -```python -"username/pretrained_model" -# or if an org: -"organization_name/pretrained_model" -``` - -**Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hardware used, hyperparameters), evaluation results, intended uses & limitations, etc. - -Your model now has a page on huggingface.co/models 🔥 - -Anyone can load it from code: -```python -tokenizer = AutoTokenizer.from_pretrained("namespace/pretrained_model") -model = AutoModel.from_pretrained("namespace/pretrained_model") -``` - -List all your files on S3: -```shell -transformers-cli s3 ls -``` - -You can also delete unneeded files: - -```shell -transformers-cli s3 rm … -``` diff --git a/docs/source/model_sharing.rst b/docs/source/model_sharing.rst new file mode 100644 index 0000000000..fd0e74d0e6 --- /dev/null +++ b/docs/source/model_sharing.rst @@ -0,0 +1,209 @@ +Model sharing and uploading +=========================== + +In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on +the `model hub `__. + +.. note:: + + You will need to create an account on `huggingface.co `__ for this. 
+ + Optionally, you can join an existing organization or create a new one. + +Prepare your model for uploading +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We have seen in the :doc:`training tutorial ` how to fine-tune a model on a given task. You have probably +done something similar on your task, either using the model directly in your own training loop or using the +:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on +the `model hub `__. + +Basic steps +^^^^^^^^^^^ + +.. + When #5258 is merged, we can remove the need to create the directory. + +First, pick a directory with the name you want your model to have on the model hub (its full name will then be +`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either + +:: + + mkdir path/to/awesome-name-you-picked + +or in Python + +:: + + import os + os.makedirs("path/to/awesome-name-you-picked") + +then you can save your model and tokenizer with: + +:: + + model.save_pretrained("path/to/awesome-name-you-picked") + tokenizer.save_pretrained("path/to/awesome-name-you-picked") + +Or, if you're using the Trainer API + +:: + + trainer.save_model("path/to/awesome-name-you-picked") + tokenizer.save_pretrained("path/to/awesome-name-you-picked") + +Make your model work on all frameworks +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. + TODO Sylvain: make this automatic during the upload + +You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both +PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load +your model in another framework, but it will be slower). Don't worry, it's super easy to do (and in a future version, +it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to +worry about the GPU, so it should be very easy. 
Check the +`TensorFlow installation page `__ +and/or the `PyTorch installation page `__ to see how. + +First check that your model class exists in the other framework, that is try to import the same model by either adding +or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to +type + +:: + + from transformers import TFDistilBertForSequenceClassification + +and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to +type + +:: + + from transformers import DistilBertForSequenceClassification + +This will give back an error if your model does not exist in the other framework (something that should be pretty rare +since we're aiming for full parity between the two frameworks). In this case, skip this and go to the next step. + +Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your +model class: + +:: + + tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True) + tf_model.save_pretrained("path/to/awesome-name-you-picked") + +and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your +model class: + +:: + + pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True) + pt_model.save_pretrained("path/to/awesome-name-you-picked") + +That's all there is to it! + +Check the directory before uploading +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Make sure there are no garbage files in the directory you'll upload. 
It should only have: + +- a `config.json` file, which saves the :doc:`configuration ` of your model ; +- a `pytorch_model.bin` file, which is the PyTorch checkpoint (unless you can't have it for some reason) ; +- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason) ; +- a `special_tokens_map.json`, which is part of your :doc:`tokenizer ` save; +- a `tokenizer_config.json`, which is part of your :doc:`tokenizer ` save; +- a `vocab.txt`, which is the vocabulary of your tokenizer, part of your :doc:`tokenizer ` + save; +- maybe an `added_tokens.json`, which is part of your :doc:`tokenizer ` save. + +Other files can safely be deleted. + +Upload your model with the CLI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Now go in a terminal and run the following command. It should be in the virtual environment where you installed 🤗 +Transformers, since that command :obj:`transformers-cli` comes from the library. + +:: + + transformers-cli login + +Then log in using the same credentials as on huggingface.co. To upload your model, just type + +:: + + transformers-cli upload path/to/awesome-name-you-picked/ + +This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section. + +If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add), +just type: + +:: + + transformers-cli upload path/to/awesome-name-you-picked/that-file + +or + +:: + + transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name + +if you want to change its filename. + +This uploads the model to your personal account. 
If you want your model to be namespaced by your organization name +rather than your username, add the following flag to any command: + +:: + + --organization organization_name + +so for instance: + +:: + + transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name + +Your model will then be accessible through its identifier, which is, as we saw above, +`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`. + +Add a model card +^^^^^^^^^^^^^^^^ + +To make sure everyone knows what your model can do, what its limitations and potential bias or ethical +considerations are, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should be named +`awesome-name-you-picked-README.md` and follow `this template `__. + +If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do), +don't forget to link to its model card so that people can fully trace how your model was built. + +If you have never made a pull request to the 🤗 Transformers repo, look at the +:doc:`contributing guide ` to see the steps to follow. + +Using your model +^^^^^^^^^^^^^^^^ + +Your model now has a page on huggingface.co/models 🔥 + +Anyone can load it from code: + +:: + + tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked") + model = AutoModel.from_pretrained("namespace/awesome-name-you-picked") + +Additional commands +^^^^^^^^^^^^^^^^^^^ + +You can list all the files you uploaded on the hub like this: + +:: + + transformers-cli s3 ls + +You can also delete unneeded files with + +:: + + transformers-cli s3 rm awesome-name-you-picked/filename + diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst index 523ede72bf..88bbb5ed3e 100644 --- a/docs/source/quicktour.rst +++ b/docs/source/quicktour.rst @@ -282,7 +282,7 @@ Models are standard `torch.nn.Module `__ so you can use them in your usual training loop. 
🤗 Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if you are using TensorFlow) class to help with your training (taking care of things such as distributed training, mixed -precision, etc.). See the training tutorial (coming soon) for more details. +precision, etc.). See the :doc:`training tutorial ` for more details. Once your model is fine-tuned, you can save it with its tokenizer the following way: diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst deleted file mode 100644 index f65c29e576..0000000000 --- a/docs/source/serialization.rst +++ /dev/null @@ -1,89 +0,0 @@ -Serialization best-practices -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This section explain how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL). -There are three types of files you need to save to be able to reload a fine-tuned model: - - -* the model itself which should be saved following PyTorch serialization `best practices `__\ , -* the configuration file of the model which is saved as a JSON file, and -* the vocabulary (and the merges for the BPE-based models GPT and GPT-2). - -The *default filenames* of these files are as follow: - - -* the model weights file: ``pytorch_model.bin``\ , -* the configuration file: ``config.json``\ , -* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary), -* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``. - -**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.** - -Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards: - -.. 
code-block:: python - - from transformers import WEIGHTS_NAME, CONFIG_NAME - - output_dir = "./models/" - - # Step 1: Save a model, configuration and vocabulary that you have fine-tuned - - # If we have a distributed model, save only the encapsulated model - # (it was wrapped in PyTorch DistributedDataParallel or DataParallel) - model_to_save = model.module if hasattr(model, 'module') else model - - # If we save using the predefined names, we can load using `from_pretrained` - output_model_file = os.path.join(output_dir, WEIGHTS_NAME) - output_config_file = os.path.join(output_dir, CONFIG_NAME) - - torch.save(model_to_save.state_dict(), output_model_file) - model_to_save.config.to_json_file(output_config_file) - tokenizer.save_pretrained(output_dir) - - # Step 2: Re-load the saved model and vocabulary - - # Example for a Bert model - model = BertForQuestionAnswering.from_pretrained(output_dir) - tokenizer = BertTokenizer.from_pretrained(output_dir) # Add specific options if needed - # Example for a GPT model - model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir) - tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir) - -Here is another way you can save and reload the model if you want to use specific paths for each type of files: - -.. 
code-block:: python - - output_model_file = "./models/my_own_model_file.bin" - output_config_file = "./models/my_own_config_file.bin" - output_vocab_file = "./models/my_own_vocab_file.bin" - - # Step 1: Save a model, configuration and vocabulary that you have fine-tuned - - # If we have a distributed model, save only the encapsulated model - # (it was wrapped in PyTorch DistributedDataParallel or DataParallel) - model_to_save = model.module if hasattr(model, 'module') else model - - torch.save(model_to_save.state_dict(), output_model_file) - model_to_save.config.to_json_file(output_config_file) - tokenizer.save_vocabulary(output_vocab_file) - - # Step 2: Re-load the saved model and vocabulary - - # We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`. - # Here is how to do it in this situation: - - # Example for a Bert model - config = BertConfig.from_json_file(output_config_file) - model = BertForQuestionAnswering(config) - state_dict = torch.load(output_model_file) - model.load_state_dict(state_dict) - tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case) - - # Example for a GPT model - config = OpenAIGPTConfig.from_json_file(output_config_file) - model = OpenAIGPTDoubleHeadsModel(config) - state_dict = torch.load(output_model_file) - model.load_state_dict(state_dict) - tokenizer = OpenAIGPTTokenizer(output_vocab_file) -