* Adding support for raw python `generator` in addition to `Dataset`
The main goal is to ease the creation of streaming data for the pipeline.
`Dataset` is more involved and PyTorch-specific.
This PR provides a way to use a Python iterator too.
This enabled #14250 but can be proposed as a standalone PR.
```python
from transformers import pipeline


def read_data(filename):
    # Lazily yield one line at a time so the whole file is never loaded in memory.
    with open(filename, "r") as f:
        for line in f:
            yield line


pipe = pipeline("text-classification")
for classified in pipe(read_data("large_file.txt")):
    print("Success ! ", classified)
```
The main caveat is the interaction with `DataLoader` when
`num_workers>1`. With multiple workers, each one receives a copy
of the generator (as with `IterableDataset`). A naive iterator
therefore fails, since every worker iterates over all items of the generator.
There are ways to do clever "skipping", but this can still be costly:
every worker still has to pass through all items of the generator
and simply ignores the items it does not handle.
Using `num_workers=1` is the simplest fix, and it should be good enough
if the cost of loading your data is small. In the example above, for
instance, smart tricks to skip some lines are unlikely to be a net
positive.
If there are better ways to "jump" through the data, then using
`Dataset` is advised (since different workers can then jump on their
own).
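The "skipping" approach mentioned above can be sketched in plain Python. This is only an illustration, not the code from this PR: `read_numbers`, `shard_for_worker`, `worker_id` and `num_workers` are hypothetical names, and the workers are simulated one after another instead of going through a real `DataLoader`. Each simulated worker gets its own copy of the generator, still walks through every item, and keeps only the items whose index maps to it.

```python
from itertools import islice


def read_numbers(n):
    # Stand-in for a streaming data source (hypothetical example data).
    for i in range(n):
        yield i


def shard_for_worker(iterator, worker_id, num_workers):
    # Keep items worker_id, worker_id + num_workers, and so on.
    # islice still consumes the skipped items from the underlying
    # iterator, so every worker pays the full cost of reading the data.
    return islice(iterator, worker_id, None, num_workers)


num_workers = 3

# Each worker receives its own copy of the generator, as described above.
shards = [
    list(shard_for_worker(read_numbers(10), worker_id, num_workers))
    for worker_id in range(num_workers)
]

# Without skipping, every worker yields all 10 items, producing duplicates.
naive = [list(read_numbers(10)) for _ in range(num_workers)]
```

With a cheap source such as lines read from a file, the extra pass made by each worker is likely to cost more than it saves, which is why `num_workers=1` is suggested above.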
* Adding iterator support for `tf` too.
* fix loading flax bf16 weights in pt
* fix clip test
* fix t5 test
* add logging statement
* Update src/transformers/modeling_flax_pytorch_utils.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* switch back to native any
* fix check for bf16 weights
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Start the work for TFViTModel
* Convert to TF code - need to check in the follow up commits
* Clean up model code
* Expose TFViTModel
* make style
* make quality
* Add test
* make style & quality
* Fix some imports
* fix wrong usage - *kwargs => **kwargs
* Fix Conv2D weight loading (PT->TF) issue
* Add tests for images with different sizes + fix model
* Fix some common tests for TFViTModel
* Use inputs instead of input_ids in test_compile_tf_model
* Add a comment about transpose and Conv2D in convert_tf_weight_name_to_pt_weight_name
* Avoid transpose in TFViT call
* Fix Conv2D issue in load_tf2_weights_in_pytorch_model
* Use tf.keras.layers.Conv2D instead of tf.nn.conv2d
* Using simpler heuristic to detect Conv2D layer
* Change convert_tf_weight_name_to_pt_weight_name to return TransposeType
* Check tf_weight_shape is not None before using it
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* fix missing comma
* fix input dtype
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* correct order of overflowing tokens for LayoutLmV2 tokenizer
* test to check order of overflowing_tokens for a seq of input_ids
* fix up quality
* added suggested changes
* check that tests the bbox sequence
* pair_input test added
* pass quality test
* check bbox sequence added
* unittest method
* comments added
* add overflowing bbox test
* improved "seq_1"
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
* improve code quality
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
* Adding support for `truncation` parameter on `feature-extraction`
pipeline.
Fixes #14183
* Fixing tests on ibert, longformer, and roberta.
* Rebase fix.
* minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf"
* more consistent implementation for numpy_mask_tokens
* Add cross attentions to TFGPT2Model
* change to is_pt_tf_cross_test
* A minor correction to a comment
* Remove n_ctx when creating self.crossattention
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* check test_configuration_tie
* Fix test_configuration_tie
* make test slow again
* Remove property and use model.module.bind
* revert to slow test
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Add first draft
* Make forward pass work
* Improve conversion script
* Add notebook that checks if it works
* Add BeitForSemanticSegmentation to the tests
* More improvements
* Make BeitForSemanticSegmentation consistent with Segformer
* Small bug fix
* Add BeitForSemanticSegmentation to docs
* Make sure model doesn't output hidden states when the user doesn't want to
* Make it possible to convert the large model
* Fix issue
* Fix conversion script for large model
* Add auxiliary_head option to semantic segmentation model
* Apply suggestions from @sgugger's review
* Apply suggestions from code review
* Fix failing test
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Adding `handle_long_generation` parameter for `text-generation` pipeline.
* More error handling
* Fixing tests by dropping tf support for this functionality; it needs
`max_new_tokens` to make it possible to understand the user's intent.
Otherwise, `max_length` == `tokenizer.model_max_length` <
`input_ids.shape[0]`.
* Fixing doc ?
* Doc ?
* Remove link from doc.
* Caught an issue on roberta.
* Damn doc.
* Non BC proposal ?
* Cleaning the fix ?
* Finally using only a test override.
* Don't need to modify this.
* Bad print.
* Add support for the fast (rust) implementation of BlenderbotTokenizer
* Fix a converter and a typo in a doc
* Apply patil-suraj's suggestion
* (Nitpick) Fast tokenization -> Fast Tokenization in doc
* Apply SaulLu's suggestion
* Apply Narsil's suggestion to fix test pipelines
* Add encoder_no_repeat_ngram_size according to Narsil's suggestion
* Revert the last (unnecessary) commit
* Override pipeline config for Blenderbot to allow for larger pos. emb.
* make fix-copies
* Remove n_ctx from configs
* Fix GPTJ and OpenAIGPT, both are acceptable breaking changes as there are no configs such that it breaks
* Remove unnecessary n_positions from TFOpenAIGPT
* First draft
* Make style & quality
* Improve conversion script
* Add print statement to see actual slice
* Make absolute tolerance smaller
* Fix image classification models
* Add post_process_semantic method
* Disable padding
* Improve conversion script
* Rename to ForSemanticSegmentation, add integration test, remove post_process methods
* Improve docs
* Fix code quality
* Fix feature extractor tests
* Fix tests for image classification model
* Delete file
* Add is_torch_available to feature extractor
* Improve documentation of feature extractor methods
* Apply suggestions from @sgugger's code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Apply some more suggestions of code review
* Rebase with master
* Fix rebase issues
* Make sure model only outputs hidden states when the user wants to
* Apply suggestions from code review
* Add pad method
* Support padding of 2d images
* Add print statement
* Add print statement
* Move padding method to SegformerFeatureExtractor
* Fix issue
* Add casting of segmentation maps
* Add test for padding
* Add small note about padding
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* unispeech
* add copy from
* remove hubert copy from
* finish for today
* add unispeech-sat
* adapt more
* up
* up
* up
* up
* add modeling
* add tests
* up
* up
* finish
* up
* Apply suggestions from code review
* up
* up
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* up
* up
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>