Add TAPEX (#16473)
* Add TapexTokenizer
* Improve docstrings and provide option to provide answer
* Remove option for pretokenized inputs
* Add TAPEX to README
* Fix copies
* Remove option for pretokenized inputs
* Initial commit: add tapex fine-tuning examples on both table-based question answering and table-based fact verification.
* - Draft a README file for running the script and introducing some background.
  - Remove unused code lines in tabfact script.
  - Disable the default `pad_to_max_length` option which is memory-consuming.
* * Support `as_target_tokenizer` function for TapexTokenizer.
* Fix the do_lower_case behaviour of TapexTokenizer.
* Add unit tests for target scenarios and cased/uncased scenarios for both source and target.
* * Replace the label BartTokenizer with TapexTokenizer's as_target_tokenizer function.
* Fix typos in tapex example README.
* * fix the evaluation script - remove the property `task_name`
* * Make the label space more clear for tabfact tasks
* * Using a new fine-tuning script for tapex-base on tabfact.
* * Remove the lowercase code outside the tokenizer - we use the tokenizer to control whether do_lower_case
* Guarantee the hyper-parameter can be run without out-of-memory on 16GB card and report the new reproduced number on wikisql
* * Remove the default tokenizer_name option.
* Provide evaluation command.
* * Support for WikiTableQuestion dataset.
* Fix a typo in README.
* * Fix the datasets's key name in WikiTableQuestions
* Run make fixup and move test to folder
* Fix quality
* Apply suggestions from code review
* Apply suggestions from code review
  Co-authored-by: Suraj Patil <surajp815@gmail.com>
* Apply suggestions from code review
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Apply some more suggestions from code review
* Improve docstrings
* Overwrite failing test
* Improve comment in example scripts
* Fix rebase
* Add TAPEX to Auto mapping
* Add TAPEX to auto config mappings
* Put TAPEX higher than BART in auto mapping
* Add TAPEX to doc tests

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MBP.localdomain>
Co-authored-by: SivilTaram <qianlxc@outlook.com>
Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
@@ -330,6 +330,8 @@
  title: T5v1.1
- local: model_doc/tapas
  title: TAPAS
- local: model_doc/tapex
  title: TAPEX
- local: model_doc/transfo-xl
  title: Transformer XL
- local: model_doc/trocr

@@ -139,6 +139,7 @@ The library currently contains JAX, PyTorch and TensorFlow implementations, pret
1. **[T5](model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1. **[TAPEX](model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
1. **[Transformer-XL](model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
@@ -252,6 +253,7 @@ Flax), PyTorch, and/or TensorFlow.
| Swin | ❌ | ❌ | ✅ | ❌ | ❌ |
| T5 | ✅ | ✅ | ✅ | ✅ | ✅ |
| TAPAS | ✅ | ❌ | ✅ | ✅ | ❌ |
| TAPEX | ✅ | ✅ | ✅ | ✅ | ✅ |
| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |

130
docs/source/en/model_doc/tapex.mdx
Normal file
@@ -0,0 +1,130 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# TAPEX

## Overview

The TAPEX model was proposed in [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu,
Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. TAPEX pre-trains a BART model to execute synthetic SQL queries, after
which it can be fine-tuned to answer natural language questions related to tabular data, as well as to perform table fact checking.

TAPEX has been fine-tuned on several datasets:
- [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253) (Sequential Question Answering by Microsoft)
- [WTQ](https://github.com/ppasupat/WikiTableQuestions) (Wiki Table Questions by Stanford University)
- [WikiSQL](https://github.com/salesforce/WikiSQL) (by Salesforce)
- [TabFact](https://tabfact.github.io/) (by UCSB NLP Lab).

The abstract from the paper is the following:

*Recent progress in language model pre-training has achieved a great success via leveraging large-scale unstructured textual data. However, it is
still a challenge to apply pre-training on structured tabular data due to the absence of large-scale high-quality tabular data. In this paper, we
propose TAPEX to show that table pre-training can be achieved by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically
synthesizing executable SQL queries and their execution outputs. TAPEX addresses the data scarcity challenge via guiding the language model to mimic a SQL
executor on the diverse, large-scale and high-quality synthetic corpus. We evaluate TAPEX on four benchmark datasets. Experimental results demonstrate that
TAPEX outperforms previous table pre-training approaches by a large margin and achieves new state-of-the-art results on all of them. This includes improvements
on the weakly-supervised WikiSQL denotation accuracy to 89.5% (+2.3%), the WikiTableQuestions denotation accuracy to 57.5% (+4.8%), the SQA denotation accuracy
to 74.5% (+3.5%), and the TabFact accuracy to 84.2% (+3.2%). To our knowledge, this is the first work to exploit table pre-training via synthetic executable programs
and to achieve new state-of-the-art results on various downstream tasks.*

Tips:

- TAPEX is a generative (seq2seq) model. The weights of TAPEX can be loaded directly into a BART model.
- TAPEX has checkpoints on the hub that are either pre-trained only, or fine-tuned on WTQ, SQA, WikiSQL and TabFact.
- Sentences + tables are presented to the model as `sentence + " " + linearized table`. The linearized table has the following format:
  `col: col1 | col2 | col3 row 1 : val1 | val2 | val3 row 2 : ...`.
- TAPEX has its own tokenizer, which makes it easy to prepare all data for the model. One can pass Pandas DataFrames and strings to the tokenizer,
  and it will automatically create the `input_ids` and `attention_mask` (as shown in the usage examples below).

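To make the linearized format from the tips concrete, here is a minimal sketch in plain Python. The helper names `linearize_table` and `build_model_input` are hypothetical and not part of the library. In practice [`TapexTokenizer`] performs this step internally, and its exact spacing, casing and truncation behaviour may differ from this sketch.

```python
from typing import Dict, List


def linearize_table(data: Dict[str, List[str]]) -> str:
    """Flatten a column-oriented table into the `col: ... row 1 : ...` layout.

    Hypothetical helper for illustration only; TapexTokenizer does this internally.
    """
    headers = list(data.keys())
    parts = ["col: " + " | ".join(headers)]
    # zip(*columns) walks the table row by row
    for index, row in enumerate(zip(*data.values()), start=1):
        parts.append(f"row {index} : " + " | ".join(str(cell) for cell in row))
    return " ".join(parts)


def build_model_input(sentence: str, data: Dict[str, List[str]]) -> str:
    # The sentence and the linearized table are joined by a single space
    return sentence + " " + linearize_table(data)


data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
print(linearize_table(data))
# col: Actors | Number of movies row 1 : Brad Pitt | 87 row 2 : Leonardo Di Caprio | 53 row 3 : George Clooney | 69
```

Seeing the flattened string also makes it clear why large tables can exceed the model's maximum input length: every cell contributes tokens to a single sequence.
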
## Usage: inference

Below, we illustrate how to use TAPEX for table question answering. As the example shows, the weights of TAPEX can be loaded directly into a BART model.
We use the [Auto API](auto), which will automatically instantiate the appropriate tokenizer ([`TapexTokenizer`]) and model ([`BartForConditionalGeneration`]) for us,
based on the configuration file of the checkpoint on the hub.

```python
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> import pandas as pd

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/tapex-large-finetuned-wtq")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/tapex-large-finetuned-wtq")

>>> # prepare table + question
>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> table = pd.DataFrame.from_dict(data)
>>> question = "how many movies does Leonardo Di Caprio have?"

>>> encoding = tokenizer(table, question, return_tensors="pt")

>>> # let the model generate an answer autoregressively
>>> outputs = model.generate(**encoding)

>>> # decode back to text
>>> predicted_answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
>>> print(predicted_answer)
53
```

Note that [`TapexTokenizer`] also supports batched inference. Hence, one can provide a batch of different tables/questions, or a batch of a single table
and multiple questions, or a batch of a single question and multiple tables. The example below pairs a single table with several questions:

```python
>>> # prepare table + question
>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> table = pd.DataFrame.from_dict(data)
>>> questions = [
...     "how many movies does Leonardo Di Caprio have?",
...     "which actor has 69 movies?",
...     "what's the first name of the actor who has 87 movies?",
... ]
>>> encoding = tokenizer(table, questions, padding=True, return_tensors="pt")

>>> # let the model generate an answer autoregressively
>>> outputs = model.generate(**encoding)

>>> # decode back to text
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
[' 53', ' george clooney', ' brad pitt']
```

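The decoded strings in the batched output carry a leading space. The following is a small sketch of pairing each question with a whitespace-stripped answer. `pair_answers` is a hypothetical helper, and the hard-coded `decoded` list simply copies the output shown in the batched example rather than being produced by the model here.

```python
from typing import Dict, List


def pair_answers(questions: List[str], decoded: List[str]) -> Dict[str, str]:
    """Map each question to its decoded answer with surrounding whitespace removed."""
    if len(questions) != len(decoded):
        raise ValueError("expected one decoded answer per question")
    return {question: answer.strip() for question, answer in zip(questions, decoded)}


questions = [
    "how many movies does Leonardo Di Caprio have?",
    "which actor has 69 movies?",
    "what's the first name of the actor who has 87 movies?",
]
# Copied from the batched example output, not generated in this snippet
decoded = [" 53", " george clooney", " brad pitt"]

for question, answer in pair_answers(questions, decoded).items():
    print(f"{question} -> {answer}")
```
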
For table fact verification (i.e. the task of determining whether a given sentence is supported or refuted by the contents
of a table), one can instantiate a [`BartForSequenceClassification`] model. TAPEX has checkpoints on the hub fine-tuned on TabFact, an important
benchmark for table fact checking, on which it achieves 84% accuracy. The code example below again leverages the [Auto API](auto).

```python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> import pandas as pd

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/tapex-large-finetuned-tabfact")
>>> model = AutoModelForSequenceClassification.from_pretrained("microsoft/tapex-large-finetuned-tabfact")

>>> # prepare table + sentence
>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> table = pd.DataFrame.from_dict(data)
>>> sentence = "George Clooney has 30 movies"

>>> encoding = tokenizer(table, sentence, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**encoding)

>>> # print prediction
>>> predicted_class_idx = outputs.logits[0].argmax(dim=0).item()
>>> print(model.config.id2label[predicted_class_idx])
Refused
```


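The `argmax` in the verification example yields only the predicted label. If a confidence score is also useful, the logits can be normalized with a softmax. The sketch below uses plain Python and made-up logit values. The `id2label` mapping is an assumption for illustration (only the `Refused` label appears in the example output), so check `model.config.id2label` for the real mapping.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Assumed label mapping and made-up logits, for illustration only
id2label: Dict[int, str] = {0: "Refused", 1: "Entailed"}
logits = [2.0, -1.0]

probabilities = softmax(logits)
predicted_class_idx = max(range(len(probabilities)), key=probabilities.__getitem__)
print(id2label[predicted_class_idx], round(probabilities[predicted_class_idx], 3))
```

With the PyTorch tensors from the verification example, `outputs.logits.softmax(dim=-1)` computes the same quantity without leaving the framework.
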
## TapexTokenizer

[[autodoc]] TapexTokenizer
    - __call__
    - save_vocabulary
@@ -67,6 +67,7 @@ Ready-made configurations include the following architectures:
- PLBart
- RoBERTa
- T5
- TAPEX
- ViT
- XLM-RoBERTa
- XLM-RoBERTa-XL