From f553c3ce4c34c8c0d991a21e50b2eb085e74c10d Mon Sep 17 00:00:00 2001 From: Steven Liu <59462357+stevhliu@users.noreply.github.com> Date: Tue, 5 Apr 2022 10:48:42 -0700 Subject: [PATCH] Update summary of the tasks (#16528) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 📝 add image/vision classification and asr * 🖍 minor formatting fixes * Fixed a typo in legacy seq2seq_trainer.py (#16531) * Add ONNX export for BeiT (#16498) * Add beit onnx conversion support * Updated docs * Added cross reference to ViT ONNX config * call on_train_end when trial is pruned (#16536) * Type hints added (#16529) * Fix Bart type hints (#16297) * Add type hints to PLBart PyTorch * Remove pending merge conflicts * Fix PLBart Type Hints * Add changes from review * Add VisualBert type hints (#16544) * Adding missing type hints for mBART model (PyTorch) (#16429) * added type hints for mbart tensorflow tf implementation * Adding missing type hints for mBART model Tensorflow Implementation model added with missing type hints * Missing Type hints - correction For TF model * Code fixup using make quality tests * Hint types - typo error * make fix-copies and make fixup * type hints * updated files * type hints update * making dependent modesls coherent Co-authored-by: matt * Remove MBart subclass of XLMRoberta in tokenzier docs (#16546) * Remove MBart subclass of XLMRoberta in tokenzier * Fix style * Copy docs from MBart50 tokenizer * Use random_attention_mask for TF tests (#16517) * use random_attention_mask for TF tests * Fix for TFCLIP test (for now). Co-authored-by: ydshieh * Improve code example (#16450) Co-authored-by: Niels Rogge * Pin tokenizers version <0.13 (#16539) * Pin tokenizers version <0.13 * Style * Add code samples for TF speech models (#16494) Co-authored-by: ydshieh * [FlaxSpeechEncoderDecoder] Fix dtype bug (#16581) * [FlaxSpeechEncoderDecoder] Fix dtype bug * more fixes * Making the impossible to connect error actually report the right URL. (#16446) * Fix flax import in __init__.py: modeling_xglm -> modeling_flax_xglm (#16556) * Add utility to find model labels (#16526) * Add utility to find model labels * Use it in the Trainer * Update src/transformers/utils/generic.py Co-authored-by: Matt * Quality Co-authored-by: Matt * Enable doc in Spanish (#16518) * Reorganize doc for multilingual support * Fix style * Style * Toc trees * Adapt templates * Add use_auth to load_datasets for private datasets to PT and TF examples (#16521) * fix formatting and remove use_auth * Add use_auth_token to Flax examples * add a test checking the format of `convert_tokens_to_string`'s output (#16540) * add new tests * add comment to overridden tests * TF: Finalize `unpack_inputs`-related changes (#16499) * Add unpack_inputs to remaining models * removed kwargs to `call()` in TF models * fix TF T5 tests * [SpeechEncoderDecoderModel] Correct Encoder Last Hidden State Output (#16586) * initialize the default rank set on TrainerState (#16530) * initialize the default rank set on TrainerState * fix style * Trigger doc build * Fix CI: test_inference_for_pretraining in ViTMAEModelTest (#16591) Co-authored-by: ydshieh * add a template to add missing tokenization test (#16553) * add a template to add missing tokenization test * add cookiecutter setting * improve doc * Update templates/adding_a_missing_tokenization_test/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * made _load_pretrained_model_low_mem static + bug fix (#16548) * handle torch_dtype in low cpu mem usage (#16580) * [Doctests] Correct filenaming (#16599) * [Doctests] Correct filenaming * improve quicktour * make style * Adding new train_step logic to make things less confusing for users (#15994) * Adding new train_step logic to make things less confusing for users * DO NOT ASK WHY WE NEED THAT SUBCLASS * Metrics now working, at least for single-output models with type annotations! * Updates and TODOs for the new train_step * Make fixup * Temporary test workaround until T5 has types * Temporary test workaround until T5 has types * I think this actually works! Needs a lot of tests though * MAke style/quality * Revert changes to T5 tests * Deleting the aforementioned unmentionable subclass * Deleting the aforementioned unmentionable subclass * Adding a Keras API test * Style fixes * Removing unneeded TODO and comments * Update test_step too * Stop trying to compute metrics with the dummy_loss, patch up test * Make style * make fixup * Docstring cleanup * make fixup * make fixup * Stop expanding 1D input tensors when using dummy loss * Adjust T5 test given the new compile() * make fixup * Skipping test for convnext * Removing old T5-specific Keras test now that we have a common one * make fixup * make fixup * Only skip convnext test on CPU * Update src/transformers/modeling_tf_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/modeling_tf_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Avoiding TF import issues * make fixup * Update compile() to support TF 2.3 * Skipping model.fit() on template classes for now * Skipping model.fit() on template class tests for now * Replace ad-hoc solution with find_labels * make fixup Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Adding missing type hints for BigBird model (#16555) * added type hints for mbart tensorflow tf implementation * Adding missing type hints for mBART model Tensorflow Implementation model added with missing type hints * Missing Type hints - correction For TF model * Code fixup using make quality tests * Hint types - typo error * make fix-copies and make fixup * type hints * updated files * type hints update * making dependent modesls coherent * Type hints for BigBird * removing typos Co-authored-by: matt * [deepspeed] fix typo, adjust config name (#16597) * 🖍 apply feedback Co-authored-by: Cathy <815244047@qq.com> Co-authored-by: Jim Rohrer Co-authored-by: Ferdinand Schlatt Co-authored-by: Dahlbomii <101373053+Dahlbomii@users.noreply.github.com> Co-authored-by: Gunjan Chhablani Co-authored-by: Rishav Chandra Varma Co-authored-by: matt Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: ydshieh Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Co-authored-by: Niels Rogge Co-authored-by: Lysandre Debut Co-authored-by: Patrick von Platen Co-authored-by: Nicolas Patry Co-authored-by: Daniel Stancl <46073029+stancld@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Matt Co-authored-by: Karim Foda <35491698+KMFODA@users.noreply.github.com> Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> Co-authored-by: Joao Gante Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Andres Codas Co-authored-by: Sylvain Gugger Co-authored-by: Francesco Saverio Zuppichini Co-authored-by: Suraj Patil Co-authored-by: Stas Bekman --- docs/source/en/task_summary.mdx | 155 ++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+) diff --git a/docs/source/en/task_summary.mdx b/docs/source/en/task_summary.mdx index 95c2d9c201..55e4a230a1 100644 --- a/docs/source/en/task_summary.mdx +++ b/docs/source/en/task_summary.mdx @@ -967,3 +967,158 @@ Here is an example of doing translation using a model and a tokenizer. The proce We get the same translation as with the pipeline example. + +## Audio classification + +Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or this [how-to guide](./tasks/audio_classification). + +The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for audio classification inference: + +```py +>>> from transformers import pipeline + +>>> audio_classifier = pipeline( +... task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" +... ) +>>> audio_classifier("jfk_moon_speech.wav") +[{'label': 'calm', 'score': 0.13856211304664612}, + {'label': 'disgust', 'score': 0.13148026168346405}, + {'label': 'happy', 'score': 0.12635163962841034}, + {'label': 'angry', 'score': 0.12439591437578201}, + {'label': 'fearful', 'score': 0.12404385954141617}] +``` + +The general process for using a model and feature extractor for audio classification is: + +1. Instantiate a feature extractor and a model from the checkpoint name. +2. Process the audio signal to be classified with a feature extractor. +3. Pass the input through the model and take the `argmax` to retrieve the most likely class. +4. Convert the class id to a class name with `id2label` to return an interpretable result. + + + +```py +>>> from transformers import AutoFeatureExtractor, AutoModelForAudioClassification +>>> from datasets import load_dataset +>>> import torch + +>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") +>>> dataset = dataset.sort("id") +>>> sampling_rate = dataset.features["audio"].sampling_rate + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks") +>>> model = AutoModelForAudioClassification.from_pretrained("superb/wav2vec2-base-superb-ks") + +>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") + +>>> with torch.no_grad(): +... logits = model(**inputs).logits + +>>> predicted_class_ids = torch.argmax(logits, dim=-1).item() +>>> predicted_label = model.config.id2label[predicted_class_ids] +>>> predicted_label +``` + + + +## Automatic speech recognition + +Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) scripts or this [how-to guide](./tasks/asr). + +The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for automatic speech recognition inference: + +```py +>>> from transformers import pipeline + +>>> speech_recognizer = pipeline( +... task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h" +... ) +>>> speech_recognizer("jfk_moon_speech.wav") +{'text': "PRESENTETE MISTER VICE PRESIDENT GOVERNOR CONGRESSMEN THOMAS SAN O TE WILAN CONGRESSMAN MILLA MISTER WEBB MSTBELL SCIENIS DISTINGUISHED GUESS AT LADIES AND GENTLEMAN I APPRECIATE TO YOUR PRESIDENT HAVING MADE ME AN HONORARY VISITING PROFESSOR AND I WILL ASSURE YOU THAT MY FIRST LECTURE WILL BE A VERY BRIEF I AM DELIGHTED TO BE HERE AND I'M PARTICULARLY DELIGHTED TO BE HERE ON THIS OCCASION WE MEED AT A COLLEGE NOTED FOR KNOWLEGE IN A CITY NOTED FOR PROGRESS IN A STATE NOTED FOR STRAINTH AN WE STAND IN NEED OF ALL THREE"} +``` + +The general process for using a model and processor for automatic speech recognition is: + +1. Instantiate a processor (which regroups a feature extractor for input processing and a tokenizer for decoding) and a model from the checkpoint name. +2. Process the audio signal and text with a processor. +3. Pass the input through the model and take the `argmax` to retrieve the predicted text. +4. Decode the text with a tokenizer to obtain the transcription. + + + +```py +>>> from transformers import AutoProcessor, AutoModelForCTC +>>> from datasets import load_dataset +>>> import torch + +>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") +>>> dataset = dataset.sort("id") +>>> sampling_rate = dataset.features["audio"].sampling_rate + +>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") +>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h") + +>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") +>>> with torch.no_grad(): +... logits = model(**inputs).logits +>>> predicted_ids = torch.argmax(logits, dim=-1) + +>>> transcription = processor.batch_decode(predicted_ids) +>>> transcription[0] +``` + + + +## Image classification + +Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or this [how-to guide](./tasks/image_classification). + +The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for image classification inference: + +```py +>>> from transformers import pipeline + +>>> vision_classifier = pipeline(task="image-classification") +>>> vision_classifier( +... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +[{'label': 'lynx, catamount', 'score': 0.4403027892112732}, + {'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor', + 'score': 0.03433405980467796}, + {'label': 'snow leopard, ounce, Panthera uncia', + 'score': 0.032148055732250214}, + {'label': 'Egyptian cat', 'score': 0.02353910356760025}, + {'label': 'tiger cat', 'score': 0.023034192621707916}] +``` + +The general process for using a model and feature extractor for image classification is: + +1. Instantiate a feature extractor and a model from the checkpoint name. +2. Process the image to be classified with a feature extractor. +3. Pass the input through the model and take the `argmax` to retrieve the predicted class. +4. Convert the class id to a class name with `id2label` to return an interpretable result. + + + +```py +>>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification +>>> import torch +>>> from datasets import load_dataset + +>>> dataset = load_dataset("huggingface/cats-image") +>>> image = dataset["test"]["image"][0] + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224") +>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224") + +>>> inputs = feature_extractor(image, return_tensors="pt") + +>>> with torch.no_grad(): +... logits = model(**inputs).logits + +>>> predicted_label = logits.argmax(-1).item() +>>> print(model.config.id2label[predicted_label]) +Egyptian cat +``` + +