From f553c3ce4c34c8c0d991a21e50b2eb085e74c10d Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Tue, 5 Apr 2022 10:48:42 -0700
Subject: [PATCH] Update summary of the tasks (#16528)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* 📝 add image/vision classification and asr

* 🖍 minor formatting fixes

* Fixed a typo in legacy seq2seq_trainer.py (#16531)

* Add ONNX export for BeiT (#16498)

* Add beit onnx conversion support

* Updated docs

* Added cross reference to ViT ONNX config

* call on_train_end when trial is pruned (#16536)

* Type hints added (#16529)

* Fix Bart type hints (#16297)

* Add type hints to PLBart PyTorch

* Remove pending merge conflicts

* Fix PLBart Type Hints

* Add changes from review

* Add VisualBert type hints (#16544)

* Adding missing type hints for mBART model (PyTorch) (#16429)

* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

Co-authored-by: matt <rocketknight1@gmail.com>

* Remove MBart subclass of XLMRoberta in tokenzier docs (#16546)

* Remove MBart subclass of XLMRoberta in tokenzier

* Fix style

* Copy docs from MBart50 tokenizer

* Use random_attention_mask for TF tests (#16517)

* use random_attention_mask for TF tests

* Fix for TFCLIP test (for now).

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Improve code example (#16450)

Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>

* Pin tokenizers version <0.13 (#16539)

* Pin tokenizers version <0.13

* Style

* Add code samples for TF speech models (#16494)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* [FlaxSpeechEncoderDecoder] Fix dtype bug (#16581)

* [FlaxSpeechEncoderDecoder] Fix dtype bug

* more fixes

* Making the impossible to connect error actually report the right URL. (#16446)

* Fix flax import in __init__.py: modeling_xglm -> modeling_flax_xglm (#16556)

* Add utility to find model labels (#16526)

* Add utility to find model labels

* Use it in the Trainer

* Update src/transformers/utils/generic.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Quality

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Enable doc in Spanish (#16518)

* Reorganize doc for multilingual support

* Fix style

* Style

* Toc trees

* Adapt templates

* Add use_auth to load_datasets for private datasets to PT and TF examples (#16521)

* fix formatting and remove use_auth

* Add use_auth_token to Flax examples

* add a test checking the format of `convert_tokens_to_string`'s output (#16540)

* add new tests

* add comment to overridden tests

* TF: Finalize `unpack_inputs`-related changes (#16499)

* Add unpack_inputs to remaining models

* removed kwargs to `call()` in TF models

* fix TF T5 tests

* [SpeechEncoderDecoderModel] Correct Encoder Last Hidden State Output (#16586)

* initialize the default rank set on TrainerState (#16530)

* initialize the default rank set on TrainerState

* fix style

* Trigger doc build

* Fix CI: test_inference_for_pretraining in ViTMAEModelTest (#16591)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* add a template to add missing tokenization test (#16553)

* add a template to add missing tokenization test

* add cookiecutter setting

* improve doc

* Update templates/adding_a_missing_tokenization_test/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* made _load_pretrained_model_low_mem static + bug fix (#16548)

* handle torch_dtype in low cpu mem usage (#16580)

* [Doctests] Correct filenaming (#16599)

* [Doctests] Correct filenaming

* improve quicktour

* make style

* Adding new train_step logic to make things less confusing for users (#15994)

* Adding new train_step logic to make things less confusing for users

* DO NOT ASK WHY WE NEED THAT SUBCLASS

* Metrics now working, at least for single-output models with type annotations!

* Updates and TODOs for the new train_step

* Make fixup

* Temporary test workaround until T5 has types

* Temporary test workaround until T5 has types

* I think this actually works! Needs a lot of tests though

* MAke style/quality

* Revert changes to T5 tests

* Deleting the aforementioned unmentionable subclass

* Deleting the aforementioned unmentionable subclass

* Adding a Keras API test

* Style fixes

* Removing unneeded TODO and comments

* Update test_step too

* Stop trying to compute metrics with the dummy_loss, patch up test

* Make style

* make fixup

* Docstring cleanup

* make fixup

* make fixup

* Stop expanding 1D input tensors when using dummy loss

* Adjust T5 test given the new compile()

* make fixup

* Skipping test for convnext

* Removing old T5-specific Keras test now that we have a common one

* make fixup

* make fixup

* Only skip convnext test on CPU

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Avoiding TF import issues

* make fixup

* Update compile() to support TF 2.3

* Skipping model.fit() on template classes for now

* Skipping model.fit() on template class tests for now

* Replace ad-hoc solution with find_labels

* make fixup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding missing type hints for BigBird model   (#16555)

* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

* Type hints for BigBird

* removing typos

Co-authored-by: matt <rocketknight1@gmail.com>

* [deepspeed] fix typo, adjust config name (#16597)

* 🖍 apply feedback

Co-authored-by: Cathy <815244047@qq.com>
Co-authored-by: Jim Rohrer <jrohrer1@gmail.com>
Co-authored-by: Ferdinand Schlatt <fschlatt@gmail.com>
Co-authored-by: Dahlbomii <101373053+Dahlbomii@users.noreply.github.com>
Co-authored-by: Gunjan Chhablani <chhablani.gunjan@gmail.com>
Co-authored-by: Rishav Chandra Varma <rishavchandra.v16@iiits.in>
Co-authored-by: matt <rocketknight1@gmail.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniel Stancl <46073029+stancld@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Karim Foda <35491698+KMFODA@users.noreply.github.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Andres Codas <andrescodas@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
Co-authored-by: Francesco Saverio Zuppichini <francesco.zuppichini@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
---
 docs/source/en/task_summary.mdx | 155 ++++++++++++++++++++++++++++++++
 1 file changed, 155 insertions(+)

diff --git a/docs/source/en/task_summary.mdx b/docs/source/en/task_summary.mdx
index 95c2d9c201..55e4a230a1 100644
--- a/docs/source/en/task_summary.mdx
+++ b/docs/source/en/task_summary.mdx
@@ -967,3 +967,158 @@ Here is an example of doing translation using a model and a tokenizer. The proce
 </frameworkcontent>
 
 We get the same translation as with the pipeline example.
+
+## Audio classification
+
+Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or this [how-to guide](./tasks/audio_classification).
+
+The following examples demonstrate how to use a [`pipeline`] and a model and feature extractor for audio classification inference:
+
+```py
+>>> from transformers import pipeline
+
+>>> audio_classifier = pipeline(
+...     task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
+... )
+>>> audio_classifier("jfk_moon_speech.wav")
+[{'label': 'calm', 'score': 0.13856211304664612},
+ {'label': 'disgust', 'score': 0.13148026168346405},
+ {'label': 'happy', 'score': 0.12635163962841034},
+ {'label': 'angry', 'score': 0.12439591437578201},
+ {'label': 'fearful', 'score': 0.12404385954141617}]
+```
+
+The general process for using a model and feature extractor for audio classification is:
+
+1. Instantiate a feature extractor and a model from the checkpoint name.
+2. Process the audio signal to be classified with a feature extractor.
+3. Pass the input through the model and take the `argmax` to retrieve the most likely class.
+4. Convert the class id to a class name with `id2label` to return an interpretable result.
+
+<frameworkcontent>
+<pt>
+```py
+>>> from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
+>>> from datasets import load_dataset
+>>> import torch
+
+>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+>>> dataset = dataset.sort("id")
+>>> sampling_rate = dataset.features["audio"].sampling_rate
+
+>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
+>>> model = AutoModelForAudioClassification.from_pretrained("superb/wav2vec2-base-superb-ks")
+
+>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+
+>>> predicted_class_id = torch.argmax(logits, dim=-1).item()
+>>> predicted_label = model.config.id2label[predicted_class_id]
+>>> predicted_label
+```
+</pt>
+</frameworkcontent>
+
+## Automatic speech recognition
+
+Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. Each example contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) scripts or this [how-to guide](./tasks/asr).
+
+The following examples demonstrate how to use a [`pipeline`] and a model and processor for automatic speech recognition inference:
+
+```py
+>>> from transformers import pipeline
+
+>>> speech_recognizer = pipeline(
+...     task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h"
+... )
+>>> speech_recognizer("jfk_moon_speech.wav")
+{'text': "PRESENTETE MISTER VICE PRESIDENT GOVERNOR CONGRESSMEN THOMAS SAN O TE WILAN CONGRESSMAN MILLA MISTER WEBB MSTBELL SCIENIS DISTINGUISHED GUESS AT LADIES AND GENTLEMAN I APPRECIATE TO YOUR PRESIDENT HAVING MADE ME AN HONORARY VISITING PROFESSOR AND I WILL ASSURE YOU THAT MY FIRST LECTURE WILL BE A VERY BRIEF I AM DELIGHTED TO BE HERE AND I'M PARTICULARLY DELIGHTED TO BE HERE ON THIS OCCASION WE MEED AT A COLLEGE NOTED FOR KNOWLEGE IN A CITY NOTED FOR PROGRESS IN A STATE NOTED FOR STRAINTH AN WE STAND IN NEED OF ALL THREE"}
+```
+
+The general process for using a model and processor for automatic speech recognition is:
+
+1. Instantiate a processor (which combines a feature extractor for input processing and a tokenizer for decoding) and a model from the checkpoint name.
+2. Process the audio signal with the processor.
+3. Pass the input through the model and take the `argmax` to retrieve the predicted token ids.
+4. Decode the predicted token ids with the processor's tokenizer to obtain the transcription.
+
+<frameworkcontent>
+<pt>
+```py
+>>> from transformers import AutoProcessor, AutoModelForCTC
+>>> from datasets import load_dataset
+>>> import torch
+
+>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+>>> dataset = dataset.sort("id")
+>>> sampling_rate = dataset.features["audio"].sampling_rate
+
+>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
+>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")
+
+>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+>>> predicted_ids = torch.argmax(logits, dim=-1)
+
+>>> transcription = processor.batch_decode(predicted_ids)
+>>> transcription[0]
+```
+</pt>
+</frameworkcontent>
+
+## Image classification
+
+Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. Each example contains an image and its corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or this [how-to guide](./tasks/image_classification).
+
+The following examples demonstrate how to use a [`pipeline`] and a model and feature extractor for image classification inference:
+
+```py
+>>> from transformers import pipeline
+
+>>> vision_classifier = pipeline(task="image-classification")
+>>> vision_classifier(
+...     images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+... )
+[{'label': 'lynx, catamount', 'score': 0.4403027892112732},
+ {'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor',
+  'score': 0.03433405980467796},
+ {'label': 'snow leopard, ounce, Panthera uncia',
+  'score': 0.032148055732250214},
+ {'label': 'Egyptian cat', 'score': 0.02353910356760025},
+ {'label': 'tiger cat', 'score': 0.023034192621707916}]
+```
+
+The general process for using a model and feature extractor for image classification is:
+
+1. Instantiate a feature extractor and a model from the checkpoint name.
+2. Process the image to be classified with a feature extractor.
+3. Pass the input through the model and take the `argmax` to retrieve the predicted class.
+4. Convert the class id to a class name with `id2label` to return an interpretable result.
+
+<frameworkcontent>
+<pt>
+```py
+>>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification
+>>> import torch
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("huggingface/cats-image")
+>>> image = dataset["test"]["image"][0]
+
+>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
+
+>>> inputs = feature_extractor(image, return_tensors="pt")
+
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+
+>>> predicted_class_id = logits.argmax(-1).item()
+>>> print(model.config.id2label[predicted_class_id])
+Egyptian cat
+```
+</pt>
+</frameworkcontent>