Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add some iteration method on a dataset column (specific for inference) #4180

Open
Narsil opened this issue Apr 19, 2022 · 5 comments
Open

Add some iteration method on a dataset column (specific for inference) #4180

Narsil opened this issue Apr 19, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@Narsil
Copy link
Contributor

Narsil commented Apr 19, 2022

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is.

Currently, dataset["audio"] will load EVERY element in the dataset in RAM, which can be quite big for an audio dataset.
Having an iterator (or sequence) type of object, would make inference with transformers 's pipeline easier to use and not so memory hungry.

Describe the solution you'd like
A clear and concise description of what you want to happen.

For a non breaking change:

for audio in dataset.iterate("audio"):
    # {"array": np.array(...), "sampling_rate":...}

For a breaking change solution (not necessary), changing the type of dataset["audio"] to a sequence type so that

pipe = pipeline(model="...")
for out in pipe(dataset["audio"]):
    # {"text":....}

could work

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

def iterate(dataset, key):
    for item in dataset:
        yield dataset[key]

for out in pipeline(iterate(dataset, "audio")):
    # {"array": ...}

This works but requires the helper function which feels slightly clunky.

Additional context
Add any other context about the feature request here.

The context is actually to showcase better integration between pipeline and datasets in the Quicktour demo: https://github.com/huggingface/transformers/pull/16723/files

@lhoestq

@lhoestq
Copy link
Member

lhoestq commented Apr 19, 2022

Thanks for the suggestion ! I agree it would be nice to have something directly in datasets to do something as simple as that

cc @albertvillanova @mariosasko @polinaeterna What do you think if we have something similar to pandas Series that wouldn't bring everything in memory when doing dataset["audio"] ? Currently it returns a list with all the decoded audio data in memory.

It would be a breaking change though, since isinstance(dataset["audio"], list) wouldn't work anymore, but we could implement a Sequence so that dataset["audio"][0] still works and only loads one item in memory.

Your alternative suggestion with iterate is also sensible, though maybe less satisfactory in terms of experience IMO

@albertvillanova
Copy link
Member

albertvillanova commented Apr 19, 2022

I agree that current behavior (decoding all audio file sin the dataset when accessing dataset["audio"]) is not useful, IMHO. Indeed in our docs, we are constantly warning our collaborators not to do that.

Therefore I upvote for a "useful" behavior of dataset["audio"]. I don't think the breaking change is important in this case, as I guess no many people use it with its current behavior. Therefore, for me it seems reasonable to return a generator (instead of an in-memeory list) for "special" features, like Audio/Image.

@lhoestq on the other hand I don't understand your proposal about Pandas-like...

@mariosasko
Copy link
Collaborator

I recall I had the same idea while working on the Image feature, so I agree implementing something similar to pd.Series that lazily brings elements in memory would be beneficial.

@albertvillanova
Copy link
Member

albertvillanova commented Apr 20, 2022

@lhoestq @mariosasko Could you please give a link to that new feature of pandas.Series? As far as I remember since I worked with pandas for more than 6 years, there was no lazy in-memory feature; it was everything in-memory; that was the reason why other frameworks were created, like Vaex or Dask, e.g.

@lhoestq
Copy link
Member

lhoestq commented Apr 21, 2022

Yea pandas doesn't do lazy loading. I was referring to pandas.Series to say that they have a dedicated class to represent a column ;)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants