Add Nougat (#25942)

* Add conversion script

* Add NougatImageProcessor

* Add crop margin

* More improvements

* Add docs, READMEs

* Remove print statements

* Include model_max_length

* Add NougatTokenizerFast

* Fix imports

* Improve postprocessing

* Improve image processor

* Fix image processor

* Improve normalize method

* More improvements

* More improvements

* Add processor, improve docs

* Simplify fast tokenizer

* Remove test file

* Fix docstrings

* Use NougatProcessor in conversion script

* Add is_levenshtein_available

* Add tokenizer tests

* More improvements

* Use numpy instead of opencv

* Add is_cv2_available

* Fix cv2_available

* Add is_nltk_available

* Add image processor tests, improve crop_margin

* Add integration tests

* Improve integration test

* Use do_rescale instead of hacks, thanks Amy

* Remove random_padding

* Address comments

* Address more comments

* Add import

* Address more comments

* Address more comments

* Address comment

* Address comment

* Set max_model_input_sizes

* Add tests

* Add requires_backends

* Add Nougat to exotic tests

* Use to_pil_image

* Address comment regarding nltk

* Add NLTK

* Improve variable names, integration test

* Add test

* refactor, document, and test regexes

* remove named capture groups, add comments

* format

* add non-markdown fixed tokenization

* format

* correct flakiness of args parse

* add regex comments

* test functionalities for crop_image, align long axis and expected output

* add regex tests

* remove cv2 dependency

* test crop_margin equality between cv2 and python

* refactor table regexes to markdown

* add newline

* change print to log, improve doc

* fix high count tables correction

* address PR comments: naming, linting, asserts

* Address comments

* Add copied from

* Update conversion script

* Update conversion script to convert both small and base versions

* Add inference example

* Add more info

* Fix style

* Add require annotators to test

* Define all keyword arguments explicitly

* Move cv2 annotator

* Add tokenizer init method

* Transfer checkpoints

* Add reference to Donut

* Address comments

* Skip test

* Remove cv2 method

* Add copied from statements

* Use cached_property

* Fix docstring

* Add file to not doctested

---------

Co-authored-by: Pablo Montalvo <pablo.montalvo.leroux@gmail.com>
NielsRogge
2023-09-26 07:06:04 +02:00
committed by GitHub
parent 5e09af2acd
commit ace74d16bd
31 changed files with 2347 additions and 5 deletions
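
The diff below adds `require_cv2`, `require_levenshtein`, and `require_nltk`, each wrapping `unittest.skipUnless` around an availability check. As a rough illustration of that pattern (not the library's own code), the sketch below uses two hypothetical helpers, `is_package_available` and `require_package`, and assumes that locating an import spec is a good enough proxy for "installed":

```python
import importlib.util
import unittest


def is_package_available(name: str) -> bool:
    """Hypothetical helper: True if a top-level module `name` can be located."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str):
    """Hypothetical decorator factory mirroring the require_* pattern in the diff."""

    def decorator(test_case):
        # skipUnless returns a decorator; apply it to the test function or class.
        return unittest.skipUnless(is_package_available(name), f"test requires {name}")(test_case)

    return decorator


@require_package("json")  # standard library module, so this case runs
class AlwaysRuns(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)


@require_package("surely_not_installed_pkg_xyz")  # assumed absent, so this case is skipped
class AlwaysSkipped(unittest.TestCase):
    def test_never_executes(self):
        self.fail("should have been skipped")


def run(case):
    """Run one TestCase class and return the collected TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run(AlwaysSkipped)` records one skipped test rather than a failure, which is the behaviour the new decorators give tests that depend on OpenCV, Levenshtein, or NLTK.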


@@ -55,6 +55,7 @@ from .utils import (
    is_auto_gptq_available,
    is_bitsandbytes_available,
    is_bs4_available,
    is_cv2_available,
    is_cython_available,
    is_decord_available,
    is_detectron2_available,
@@ -69,8 +70,10 @@ from .utils import (
    is_jinja_available,
    is_jumanpp_available,
    is_keras_nlp_available,
    is_levenshtein_available,
    is_librosa_available,
    is_natten_available,
    is_nltk_available,
    is_onnx_available,
    is_optimum_available,
    is_pandas_available,
@@ -311,6 +314,36 @@ def require_bs4(test_case):
    return unittest.skipUnless(is_bs4_available(), "test requires BeautifulSoup4")(test_case)


def require_cv2(test_case):
    """
    Decorator marking a test that requires OpenCV.
    These tests are skipped when OpenCV isn't installed.
    """
    return unittest.skipUnless(is_cv2_available(), "test requires OpenCV")(test_case)


def require_levenshtein(test_case):
    """
    Decorator marking a test that requires Levenshtein.
    These tests are skipped when Levenshtein isn't installed.
    """
    return unittest.skipUnless(is_levenshtein_available(), "test requires Levenshtein")(test_case)


def require_nltk(test_case):
    """
    Decorator marking a test that requires NLTK.
    These tests are skipped when NLTK isn't installed.
    """
    return unittest.skipUnless(is_nltk_available(), "test requires NLTK")(test_case)


def require_accelerate(test_case):
    """
    Decorator marking a test that requires accelerate. These tests are skipped when accelerate isn't installed.