Improve LayoutLM (#9476)
* Add LayoutLMForSequenceClassification and integration tests Improve docs Add LayoutLM notebook to list of community notebooks * Make style & quality * Address comments by @sgugger, @patrickvonplaten and @LysandreJik * Fix rebase with master * Reformat in one line * Improve code examples as requested by @patrickvonplaten Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
This commit is contained in:
@@ -13,32 +13,72 @@
|
||||
LayoutLM
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
.. _Overview:
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
|
||||
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
|
||||
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
|
||||
information extraction tasks, such as form understanding and receipt understanding.
|
||||
information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
|
||||
on several downstream tasks:
|
||||
|
||||
- form understanding: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a collection of 199 annotated
|
||||
forms comprising more than 30,000 words).
|
||||
- receipt understanding: the `SROIE <https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for
|
||||
training and 347 receipts for testing).
|
||||
- document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
|
||||
400,000 images belonging to one of 16 classes).
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
|
||||
widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
|
||||
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
|
||||
the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
|
||||
which is beneficial for a great number of real-world document image understanding tasks such as information extraction
|
||||
from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
|
||||
LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
|
||||
framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
|
||||
including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
|
||||
classification (from 93.07 to 94.42).*
|
||||
the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is
|
||||
beneficial for a great number of real-world document image understanding tasks such as information extraction from
|
||||
scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM.
|
||||
To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for
|
||||
document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form
|
||||
understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
|
||||
(from 93.07 to 94.42).*
|
||||
|
||||
Tips:
|
||||
|
||||
- LayoutLM has an extra input called :obj:`bbox`, which is the bounding boxes of the input tokens.
|
||||
- The :obj:`bbox` requires the data that on 0-1000 scale, which means you should normalize the bounding box before
|
||||
passing them into model.
|
||||
- In addition to `input_ids`, :meth:`~transformer.LayoutLMModel.forward` also expects the input :obj:`bbox`, which are
|
||||
the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
|
||||
as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python wrapper
|
||||
<https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1) format, where
|
||||
(x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
|
||||
position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
|
||||
scale. To normalize, you can use the following function:
|
||||
|
||||
.. code-block::
|
||||
|
||||
def normalize_bbox(bbox, width, height):
|
||||
return [
|
||||
int(1000 * (bbox[0] / width)),
|
||||
int(1000 * (bbox[1] / height)),
|
||||
int(1000 * (bbox[2] / width)),
|
||||
int(1000 * (bbox[3] / height)),
|
||||
]
|
||||
|
||||
Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
|
||||
occurs. Those can be obtained using the Python Image Library (PIL) library for example, as follows:
|
||||
|
||||
.. code-block::
|
||||
|
||||
from PIL import Image
|
||||
|
||||
image = Image.open("name_of_your_document - can be a png file, pdf, etc.")
|
||||
|
||||
width, height = image.size
|
||||
|
||||
- For a demo which shows how to fine-tune :class:`LayoutLMForTokenClassification` on the `FUNSD dataset
|
||||
<https://guillaumejaume.github.io/FUNSD/>`__ (a collection of annotated forms), see `this notebook
|
||||
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb>`__.
|
||||
It includes an inference part, which shows how to use Google's Tesseract on a new document.
|
||||
|
||||
The original code can be found `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.
|
||||
|
||||
@@ -78,6 +118,13 @@ LayoutLMForMaskedLM
|
||||
:members:
|
||||
|
||||
|
||||
LayoutLMForSequenceClassification
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LayoutLMForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
LayoutLMForTokenClassification
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
||||
Reference in New Issue
Block a user