Add some iteration method on a dataset column (specific for inference) #4180

Narsil · 2022-04-19T09:15:45Z

Is your feature request related to a problem? Please describe.

Currently, dataset["audio"] loads EVERY element of the dataset into RAM, which can be very large for an audio dataset.
Having an iterator (or sequence) type of object would make inference with the transformers pipeline easier to use and less memory-hungry.

Describe the solution you'd like

For a non breaking change:

for audio in dataset.iterate("audio"):
    # audio == {"array": np.array(...), "sampling_rate": ...}
    ...

For a breaking change solution (not necessary), changing the type of dataset["audio"] to a sequence type so that

pipe = pipeline(model="...")
for out in pipe(dataset["audio"]):
    # out == {"text": ...}
    ...

could work.

Describe alternatives you've considered

def iterate(dataset, key):
    # Yield one column value per row instead of loading the whole column.
    for item in dataset:
        yield item[key]

for out in pipe(iterate(dataset, "audio")):
    # out == {"text": ...}
    ...

This works but requires the helper function, which feels slightly clunky.
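For reference, here is a self-contained sketch of that helper that runs without datasets or transformers installed. The `toy_dataset` list is only a stand-in invented for illustration; the assumption is that a real dataset yields row dictionaries in the same way when iterated.

```python
from typing import Any, Dict, Iterable, Iterator


def iterate(dataset: Iterable[Dict[str, Any]], key: str) -> Iterator[Any]:
    """Yield one column value per row, so only one row is held at a time."""
    for item in dataset:
        yield item[key]


# Hypothetical stand-in for a dataset: any iterable of row dictionaries.
toy_dataset = [
    {"id": 0, "audio": {"array": [0.0, 0.1], "sampling_rate": 16000}},
    {"id": 1, "audio": {"array": [0.2, 0.3], "sampling_rate": 16000}},
]

# Values are produced lazily, one per loop iteration.
sampling_rates = [audio["sampling_rate"] for audio in iterate(toy_dataset, "audio")]
```

Because `iterate` is a generator function, nothing is read from the dataset until the consumer (here a list comprehension, in the issue a pipeline) asks for the next item.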

Additional context

The context is actually to showcase better integration between pipeline and datasets in the Quicktour demo: https://github.com/huggingface/transformers/pull/16723/files

@lhoestq


lhoestq · 2022-04-19T15:39:43Z

Thanks for the suggestion! I agree it would be nice to have something directly in datasets to do something as simple as that.

cc @albertvillanova @mariosasko @polinaeterna What do you think about having something similar to a pandas Series that wouldn't bring everything into memory when doing dataset["audio"]? Currently it returns a list with all the decoded audio data in memory.

It would be a breaking change though, since isinstance(dataset["audio"], list) wouldn't work anymore, but we could implement a Sequence so that dataset["audio"][0] still works and only loads one item in memory.
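To make the Sequence idea concrete, here is a minimal sketch. `LazyColumn` and `CountingRows` are hypothetical names invented for this illustration and are not part of datasets; the sketch assumes the underlying row store supports `len()` and integer indexing, and it handles integer indices only.

```python
from collections.abc import Sequence


class LazyColumn(Sequence):
    """Hypothetical column view that fetches a single row per lookup."""

    def __init__(self, rows, key):
        self._rows = rows  # assumed to support len() and integer indexing
        self._key = key

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch only supports integer indices")
        return self._rows[index][self._key]


class CountingRows:
    """Toy row store that counts how many rows have been decoded."""

    def __init__(self, n):
        self.n = n
        self.decoded = 0

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        self.decoded += 1
        return {"audio": {"array": [float(i)], "sampling_rate": 16000}}


rows = CountingRows(1000)
column = LazyColumn(rows, "audio")
first = column[0]  # touches exactly one row of the toy store
```

In this sketch `isinstance(column, list)` is false, which is the breaking part, while `isinstance(column, Sequence)` is true and iteration comes for free from the `Sequence` mixin methods.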

Your alternative suggestion with iterate is also sensible, though maybe less satisfactory in terms of experience IMO

albertvillanova · 2022-04-19T15:54:44Z

I agree that the current behavior (decoding all audio files in the dataset when accessing dataset["audio"]) is not useful, IMHO. Indeed, in our docs, we are constantly warning our collaborators not to do that.

Therefore I vote for a "useful" behavior of dataset["audio"]. I don't think the breaking change is important in this case, as I guess not many people use it with its current behavior. Therefore, to me it seems reasonable to return a generator (instead of an in-memory list) for "special" features, like Audio/Image.
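As a small illustration of what returning a generator would imply for callers, the sketch below uses a plain list of dictionaries as a stand-in for a dataset. `column_generator` is a hypothetical helper, not an existing datasets function.

```python
def column_generator(rows, key):
    """Hypothetical generator-style column: values are produced on demand."""
    for row in rows:
        yield row[key]


rows = [{"audio": i} for i in range(3)]  # toy stand-in for a dataset
gen = column_generator(rows, "audio")

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left: a generator can be iterated only once

try:
    gen[0]
    indexable = True
except TypeError:
    # Generators do not support indexing such as gen[0].
    indexable = False
```

So a generator keeps memory low, but code that indexes the column or iterates over it twice would need to change.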

@lhoestq on the other hand I don't understand your proposal about Pandas-like...

mariosasko · 2022-04-20T12:01:35Z

I recall I had the same idea while working on the Image feature, so I agree implementing something similar to pd.Series that lazily brings elements in memory would be beneficial.

albertvillanova · 2022-04-20T12:15:44Z

@lhoestq @mariosasko Could you please give a link to that new feature of pandas.Series? As far as I remember, having worked with pandas for more than 6 years, there was no lazy-loading feature; everything was in memory, which is why other frameworks such as Vaex or Dask were created.

lhoestq · 2022-04-21T10:30:57Z

Yea pandas doesn't do lazy loading. I was referring to pandas.Series to say that they have a dedicated class to represent a column ;)

Narsil added the enhancement New feature or request label Apr 19, 2022

Narsil mentioned this issue Apr 19, 2022

Adding support for array key in raw dictionnaries in ASR pipeline. huggingface/transformers#16827

Merged

5 tasks
