Add LayoutLMv2 + LayoutXLM (#12604)
* First commit * Make style * Fix dummy objects * Add Detectron2 config * Add LayoutLMv2 pooler * More improvements, add documentation * More improvements * Add model tests * Add clarification regarding image input * Improve integration test * Fix bug * Fix another bug * Fix another bug * Fix another bug * More improvements * Make more tests pass * Make more tests pass * Improve integration test * Remove gradient checkpointing and add head masking * Add integration test * Add LayoutLMv2ForSequenceClassification to the tests * Add LayoutLMv2ForQuestionAnswering * More improvements * More improvements * Small improvements * Fix _LazyModule * Fix fast tokenizer * Move sync_batch_norm to a separate method * Replace dummies by requires_backends * Move calculation of visual bounding boxes to separate method + update README * Add models to main init * First draft * More improvements * More improvements * More improvements * More improvements * More improvements * Remove is_split_into_words * More improvements * Simply tesseract - no use of pandas anymore * Add LayoutLMv2Processor * Update is_pytesseract_available * Fix bugs * Improve feature extractor * Fix bug * Add print statement * Add truncation of bounding boxes * Add tests for LayoutLMv2FeatureExtractor and LayoutLMv2Tokenizer * Improve tokenizer tests * Make more tokenizer tests pass * Make more tests pass, add integration tests * Finish integration tests * More improvements * More improvements - update API of the tokenizer * More improvements * Remove support for VQA training * Remove some files * Improve feature extractor * Improve documentation and one more tokenizer test * Make quality and small docs improvements * Add batched tests for LayoutLMv2Processor, remove fast tokenizer * Add truncation of labels * Apply suggestions from code review * Improve processor tests * Fix failing tests and add suggestion from code review * Fix tokenizer test * Add detectron2 CI job * Simplify CI job * Comment out non-detectron2 jobs and specify number of processes * Add pip install torchvision * Add durations to see which tests are slow * Fix tokenizer test and make model tests smaller * Frist draft * Use setattr * Possible fix * Proposal with configuration * First draft of fast tokenizer * More improvements * Enable fast tokenizer tests * Make more tests pass * Make more tests pass * More improvements * Addd padding to fast tokenizer * Mkae more tests pass * Make more tests pass * Make all tests pass for fast tokenizer * Make fast tokenizer support overflowing boxes and labels * Add support for overflowing_labels to slow tokenizer * Add support for fast tokenizer to the processor * Update processor tests for both slow and fast tokenizers * Add head models to model mappings * Make style & quality * Remove Detectron2 config file * Add configurable option to label all subwords * Fix test * Skip visual segment embeddings in test * Use ResNet-18 backbone in tests instead of ResNet-101 * Proposal * Re-enable all jobs on CI * Fix installation of tesseract * Fix failing test * Fix index table * Add LayoutXLM doc page, first draft of code examples * Improve documentation a lot * Update expected boxes for Tesseract 4.0.0 beta * Use offsets to create labels instead of checking if they start with ## * Update expected boxes for Tesseract 4.1.1 * Fix conflict * Make variable names cleaner, add docstring, add link to notebooks * Revert "Fix conflict" This reverts commit a9b46ce9afe47ebfcfe7b45e6a121d49e74ef2c5. * Revert to make integration test pass * Apply suggestions from @LysandreJik's review * Address @patrickvonplaten's comments * Remove fixtures DocVQA in favor of dataset on the hub Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
This commit is contained in:
314
docs/source/model_doc/layoutlmv2.rst
Normal file
314
docs/source/model_doc/layoutlmv2.rst
Normal file
@@ -0,0 +1,314 @@
|
||||
..
|
||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
LayoutLMV2
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The LayoutLMV2 model was proposed in `LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
|
||||
<https://arxiv.org/abs/2012.14740>`__ by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu,
|
||||
Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves `LayoutLM
|
||||
<https://huggingface.co/transformers/model_doc/layoutlm.html>`__ to obtain state-of-the-art results across several
|
||||
document image understanding benchmarks:
|
||||
|
||||
- information extraction from scanned documents: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a
|
||||
collection of 199 annotated forms comprising more than 30,000 words), the `CORD <https://github.com/clovaai/cord>`__
|
||||
dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the `SROIE
|
||||
<https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for training and 347 receipts for testing)
|
||||
and the `Kleister-NDA <https://github.com/applicaai/kleister-nda>`__ dataset (a collection of non-disclosure
|
||||
agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203
|
||||
documents for testing).
|
||||
- document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
|
||||
400,000 images belonging to one of 16 classes).
|
||||
- document visual question answering: the `DocVQA <https://arxiv.org/abs/2007.00398>`__ dataset (a collection of 50,000
|
||||
questions defined on 12,000+ document images).
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to
|
||||
its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this
|
||||
paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model
|
||||
architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked
|
||||
visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training
|
||||
stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention
|
||||
mechanism into the Transformer architecture, so that the model can fully understand the relative positional
|
||||
relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and
|
||||
achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks,
|
||||
including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852),
|
||||
RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained LayoutLMv2 model is publicly available at
|
||||
this https URL.*
|
||||
|
||||
Tips:
|
||||
|
||||
- The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
|
||||
pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
|
||||
- LayoutLMv2 adds both a relative 1D attention bias as well as a spatial 2D attention bias to the attention scores in
|
||||
the self-attention layers. Details can be found on page 5 of the `paper <https://arxiv.org/abs/2012.14740>`__.
|
||||
- Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found `here
|
||||
<https://github.com/NielsRogge/Transformers-Tutorials>`__.
|
||||
- LayoutLMv2 uses Facebook AI's `Detectron2 <https://github.com/facebookresearch/detectron2/>`__ package for its visual
|
||||
backbone. See `this link <https://detectron2.readthedocs.io/en/latest/tutorials/install.html>`__ for installation
|
||||
instructions.
|
||||
- In addition to :obj:`input_ids`, :meth:`~transformer.LayoutLMv2Model.forward` expects 2 additional inputs, namely
|
||||
:obj:`image` and :obj:`bbox`. The :obj:`image` input corresponds to the original document image in which the text
|
||||
tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of
|
||||
document images, :obj:`image` should be a tensor of shape (batch_size, 3, 224, 224). This can be either a
|
||||
:obj:`torch.Tensor` or a :obj:`Detectron2.structures.ImageList`. You don't need to normalize the channels, as this is
|
||||
done by the model. Important to note is that the visual backbone expects BGR channels instead of RGB, as all models
|
||||
in Detectron2 are pre-trained using the BGR format. The :obj:`bbox` input are the bounding boxes (i.e. 2D-positions)
|
||||
of the input text tokens. This is identical to :class:`~transformer.LayoutLMModel`. These can be obtained using an
|
||||
external OCR engine such as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python
|
||||
wrapper <https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1)
|
||||
format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1)
|
||||
represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on
|
||||
a 0-1000 scale. To normalize, you can use the following function:
|
||||
|
||||
.. code-block::
|
||||
|
||||
def normalize_bbox(bbox, width, height):
|
||||
return [
|
||||
int(1000 * (bbox[0] / width)),
|
||||
int(1000 * (bbox[1] / height)),
|
||||
int(1000 * (bbox[2] / width)),
|
||||
int(1000 * (bbox[3] / height)),
|
||||
]
|
||||
|
||||
Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
|
||||
occurs (before resizing the image). Those can be obtained using the Python Image Library (PIL) library for example, as
|
||||
follows:
|
||||
|
||||
.. code-block::
|
||||
|
||||
from PIL import Image
|
||||
|
||||
image = Image.open("name_of_your_document - can be a png file, pdf, etc.")
|
||||
|
||||
width, height = image.size
|
||||
|
||||
However, this model includes a brand new :class:`~transformer.LayoutLMv2Processor` which can be used to directly
|
||||
prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage"
|
||||
section below.
|
||||
|
||||
- Internally, :class:`~transformer.LayoutLMv2Model` will send the :obj:`image` input through its visual backbone to
|
||||
obtain a lower-resolution feature map, whose shape is equal to the :obj:`image_feature_pool_shape` attribute of
|
||||
:class:`~transformer.LayoutLMv2Config`. This feature map is then flattened to obtain a sequence of image tokens. As
|
||||
the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text
|
||||
tokens, and send through the Transformer encoder. This means that the last hidden states of the model will have a
|
||||
length of 512 + 49 = 561, if you pad the text tokens up to the max length. More generally, the last hidden states
|
||||
will have a shape of :obj:`seq_length` + :obj:`image_feature_pool_shape[0]` *
|
||||
:obj:`config.image_feature_pool_shape[1]`.
|
||||
- When calling :meth:`~transformer.LayoutLMv2Model.from_pretrained`, a warning will be printed with a long list of
|
||||
parameter names that are not initialized. This is not a problem, as these parameters are batch normalization
|
||||
statistics, which are going to have values when fine-tuning on a custom dataset.
|
||||
- If you want to train the model in a distributed environment, make sure to call :meth:`synchronize_batch_norm` on the
|
||||
model in order to properly synchronize the batch normalization layers of the visual backbone.
|
||||
|
||||
In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on
|
||||
:doc:`LayoutXLM's documentation page <layoutxlm>`.
|
||||
|
||||
Usage: LayoutLMv2Processor
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The easiest way to prepare data for the model is to use :class:`~transformer.LayoutLMv2Processor`, which internally
|
||||
combines a feature extractor (:class:`~transformer.LayoutLMv2FeatureExtractor`) and a tokenizer
|
||||
(:class:`~transformer.LayoutLMv2Tokenizer` or :class:`~transformer.LayoutLMv2TokenizerFast`). The feature extractor
|
||||
handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
|
||||
for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
|
||||
modality.
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor
|
||||
|
||||
feature_extractor = LayoutLMv2FeatureExtractor() # apply_ocr is set to True by default
|
||||
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
|
||||
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
|
||||
|
||||
In short, one can provide a document image (and possibly additional data) to :class:`~transformer.LayoutLMv2Processor`,
|
||||
and it will create the inputs expected by the model. Internally, the processor first uses
|
||||
:class:`~transformer.LayoutLMv2FeatureExtractor` to apply OCR on the image to get a list of words and normalized
|
||||
bounding boxes, as well to resize the image to a given size in order to get the :obj:`image` input. The words and
|
||||
normalized bounding boxes are then provided to :class:`~transformer.LayoutLMv2Tokenizer` or
|
||||
:class:`~transformer.LayoutLMv2TokenizerFast`, which converts them to token-level :obj:`input_ids`,
|
||||
:obj:`attention_mask`, :obj:`token_type_ids`, :obj:`bbox`. Optionally, one can provide word labels to the processor,
|
||||
which are turned into token-level :obj:`labels`.
|
||||
|
||||
:class:`~transformer.LayoutLMv2Processor` uses `PyTesseract <https://pypi.org/project/pytesseract/>`__, a Python
|
||||
wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
|
||||
choice, and provide the words and normalized boxes yourself. This requires initializing
|
||||
:class:`~transformer.LayoutLMv2FeatureExtractor` with :obj:`apply_ocr` set to :obj:`False`.
|
||||
|
||||
In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
|
||||
use cases work for both batched and non-batched inputs (we illustrate them for non-batched inputs).
|
||||
|
||||
**Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr =
|
||||
True**
|
||||
|
||||
This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
|
||||
the words and normalized bounding boxes.
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import LayoutLMv2Processor
|
||||
from PIL import Image
|
||||
|
||||
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
|
||||
|
||||
image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
|
||||
encoding = processor(image, return_tensors="pt") # you can also add all tokenizer parameters here such as padding, truncation
|
||||
print(encoding.keys())
|
||||
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
||||
|
||||
**Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
|
||||
|
||||
In case one wants to do OCR themselves, one can initialize the feature extractor with :obj:`apply_ocr` set to
|
||||
:obj:`False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
|
||||
the processor.
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import LayoutLMv2Processor
|
||||
from PIL import Image
|
||||
|
||||
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
|
||||
|
||||
image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
|
||||
words = ["hello", "world"]
|
||||
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
|
||||
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
|
||||
print(encoding.keys())
|
||||
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
||||
|
||||
**Use case 3: token classification (training), apply_ocr=False**
|
||||
|
||||
For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word
|
||||
labels in order to train a model. The processor will then convert these into token-level :obj:`labels`. By default, it
|
||||
will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the
|
||||
:obj:`ignore_index` of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can
|
||||
initialize the tokenizer with :obj:`only_label_first_subword` set to :obj:`False`.
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import LayoutLMv2Processor
|
||||
from PIL import Image
|
||||
|
||||
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
|
||||
|
||||
image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
|
||||
words = ["hello", "world"]
|
||||
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
|
||||
word_labels = [1, 2]
|
||||
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
|
||||
print(encoding.keys())
|
||||
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
|
||||
|
||||
**Use case 4: visual question answering (inference), apply_ocr=True**
|
||||
|
||||
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the
|
||||
processor will apply OCR on the image, and create [CLS] question tokens [SEP] word tokens [SEP].
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import LayoutLMv2Processor
|
||||
from PIL import Image
|
||||
|
||||
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
|
||||
|
||||
image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
|
||||
question = "What's his name?"
|
||||
encoding = processor(image, question, return_tensors="pt")
|
||||
print(encoding.keys())
|
||||
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
||||
|
||||
**Use case 5: visual question answering (inference), apply_ocr=False**
|
||||
|
||||
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to
|
||||
perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor.
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import LayoutLMv2Processor
|
||||
from PIL import Image
|
||||
|
||||
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
|
||||
|
||||
image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
|
||||
question = "What's his name?"
|
||||
words = ["hello", "world"]
|
||||
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
|
||||
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
|
||||
print(encoding.keys())
|
||||
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
||||
|
||||
LayoutLMv2Config
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2Config
|
||||
:members:
|
||||
|
||||
|
||||
LayoutLMv2FeatureExtractor
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2FeatureExtractor
|
||||
:members: __call__
|
||||
|
||||
|
||||
LayoutLMv2Tokenizer
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2Tokenizer
|
||||
:members: __call__, save_vocabulary
|
||||
|
||||
|
||||
LayoutLMv2TokenizerFast
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2TokenizerFast
|
||||
:members: __call__
|
||||
|
||||
|
||||
LayoutLMv2Processor
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2Processor
|
||||
:members: __call__
|
||||
|
||||
|
||||
LayoutLMv2Model
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2Model
|
||||
:members: forward
|
||||
|
||||
|
||||
LayoutLMv2ForSequenceClassification
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2ForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
LayoutLMv2ForTokenClassification
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2ForTokenClassification
|
||||
:members:
|
||||
|
||||
|
||||
LayoutLMv2ForQuestionAnswering
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMv2ForQuestionAnswering
|
||||
:members:
|
||||
47
docs/source/model_doc/layoutxlm.rst
Normal file
47
docs/source/model_doc/layoutxlm.rst
Normal file
@@ -0,0 +1,47 @@
|
||||
..
|
||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
LayoutXLM
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
LayoutXLM was proposed in `LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
|
||||
<https://arxiv.org/abs/2104.08836>`__ by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
|
||||
Zhang, Furu Wei. It's a multilingual extension of the `LayoutLMv2 model <https://arxiv.org/abs/2012.14740>`__ trained
|
||||
on 53 languages.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document
|
||||
understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In
|
||||
this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to
|
||||
bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also
|
||||
introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in
|
||||
7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled
|
||||
for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
|
||||
cross-lingual pre-trained models on the XFUN dataset.*
|
||||
|
||||
One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
|
||||
|
||||
.. code-block::
|
||||
|
||||
from transformers import LayoutLMv2Model
|
||||
|
||||
model = LayoutLMv2Model.from_pretrained('microsoft/layoutxlm-base')
|
||||
|
||||
As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to :doc:`LayoutLMv2's documentation page
|
||||
<layoutlmv2>` for all tips, code examples and notebooks.
|
||||
|
||||
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
|
||||
<https://github.com/microsoft/unilm>`__.
|
||||
Reference in New Issue
Block a user