Add some iteration method on a dataset column (specific for inference) #4180
Comments
Thanks for the suggestion! I agree it would be nice to have something directly in cc @albertvillanova @mariosasko @polinaeterna What do you think if we have something similar to pandas It would be a breaking change though, since Your alternative suggestion with
I agree that current behavior (decoding all audio files in the dataset when accessing Therefore I upvote for a "useful" behavior of @lhoestq on the other hand I don't understand your proposal about Pandas-like...
I recall I had the same idea while working on the
@lhoestq @mariosasko Could you please give a link to that new feature of
Yeah, pandas doesn't do lazy loading. I was referring to pandas.Series to say that they have a dedicated class to represent a column ;)
Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is.
Currently, `dataset["audio"]` will load EVERY element in the dataset in RAM, which can be quite big for an audio dataset.
Having an iterator (or sequence) type of object would make inference with `transformers`' `pipeline` easier to use and not so memory hungry.
Describe the solution you'd like
A clear and concise description of what you want to happen.
For a non breaking change:
For a breaking change solution (not necessary), changing the type of `dataset["audio"]` to a sequence type so that
could work
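To make the sequence-type idea concrete, here is a minimal sketch of what a lazy column view could look like. It is not the `datasets` API: `LazyColumn` is a hypothetical name, and a plain list of row dicts stands in for a real dataset. The only assumption is that indexing the dataset with an integer returns a single row as a dict.

```python
from collections.abc import Sequence


class LazyColumn(Sequence):
    """Hypothetical lazy view over one column of a row-indexable dataset."""

    def __init__(self, dataset, column):
        self._dataset = dataset
        self._column = column

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices are resolved row by row, still without touching other rows.
            return [self[i] for i in range(*index.indices(len(self)))]
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Only this one row is fetched (and, for audio, decoded).
        return self._dataset[index][self._column]


# Stand-in data: a list of row dicts instead of a real dataset (assumption).
rows = [{"audio": f"clip-{i}", "text": f"t{i}"} for i in range(3)]
column = LazyColumn(rows, "audio")

# Iteration comes from Sequence and pulls one element at a time.
first_two = [item for _, item in zip(range(2), column)]
```

Because `Sequence` supplies `__iter__` on top of `__getitem__`, nothing is materialized up front, which is the memory behavior asked for above.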
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
This works but requires the helper function which feels slightly clunky.
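For illustration, a helper of this kind might look like the sketch below. `iter_column` is a hypothetical name, not something provided by `datasets`, and a list of row dicts is used in place of a real dataset so the example runs on its own. It assumes integer indexing returns one row as a dict.

```python
def iter_column(dataset, column):
    """Yield the values of `column` one row at a time (hypothetical helper)."""
    for index in range(len(dataset)):
        # Each step reads a single row, so memory use stays flat.
        yield dataset[index][column]


# Stand-in data instead of a real dataset (assumption).
rows = [{"audio": f"clip-{i}", "text": "x" * (i + 1)} for i in range(3)]

# A consumer that processes items as they arrive, like an inference loop would.
lengths = [len(text) for text in iter_column(rows, "text")]
```

A generator like this can be handed to any consumer that accepts an iterable, which is why it works as a workaround, but every user has to write it themselves.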
Additional context
Add any other context about the feature request here.
The context is actually to showcase better integration between `pipeline` and `datasets` in the Quicktour demo: https://github.com/huggingface/transformers/pull/16723/files
@lhoestq