Validate features during add_frame + Add 2D-to-5D + Add string #720


Cadene
Collaborator

@Cadene Cadene commented Feb 12, 2025

What this does

Motivation: We want to support more modalities and data types that Hugging Face datasets already supports. See: https://github.com/huggingface/datasets/blob/339e9dc3914e4ed837f2c9b24972e211334025ec/src/datasets/features/features.py#L462-L646

  • Add ndarray features of shape: (1,), (n,), (n,m), (n,m,l), (n,m,l,o), (n,m,l,o,p)
  • Add string features like caption per frame
  • Add a sanity check at the top of add_frame to validate that the frame matches the self.features specification
    • Display informative error messages when features are missing, extra features are present, or a dtype or shape is wrong.
  • Add range check to image_array_to_pil_image which is executed in the image writer threads
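The validation described in the list above could look roughly like the sketch below. This is an illustration only, not the code from this PR: the function name `validate_frame`, the error wording, and the assumption that each feature spec is a dict with `dtype` and `shape` keys are all hypothetical. It collects every mismatch before raising, so a single error message reports missing features, extra features, wrong dtypes, and wrong shapes together.

```python
import numpy as np


def validate_frame(frame: dict, features: dict) -> None:
    """Hypothetical helper: raise ValueError listing every mismatch."""
    errors = []
    missing = set(features) - set(frame)
    extra = set(frame) - set(features)
    if missing:
        errors.append(f"Missing features: {sorted(missing)}")
    if extra:
        errors.append(f"Extra features: {sorted(extra)}")

    for name in sorted(set(features) & set(frame)):
        spec, value = features[name], frame[name]
        if spec["dtype"] in ("image", "video"):
            continue  # images are out of scope for this sketch
        if spec["dtype"] == "string":
            if not isinstance(value, str):
                errors.append(f"'{name}': expected str, got {type(value).__name__}")
            continue
        if not isinstance(value, np.ndarray):
            errors.append(f"'{name}': expected np.ndarray, got {type(value).__name__}")
            continue
        if value.dtype != np.dtype(spec["dtype"]):
            errors.append(f"'{name}': expected dtype {spec['dtype']}, got {value.dtype}")
        if value.shape != tuple(spec["shape"]):
            errors.append(f"'{name}': expected shape {tuple(spec['shape'])}, got {value.shape}")

    if errors:
        raise ValueError("Frame does not match features:\n" + "\n".join(errors))


# Assumed feature spec: one 2D array feature and one per-frame string.
features = {
    "state": {"dtype": "float32", "shape": (2, 3)},
    "caption": {"dtype": "string", "shape": (1,)},
}
good_frame = {
    "state": np.zeros((2, 3), dtype=np.float32),
    "caption": "pick up the cube",
}
validate_frame(good_frame, features)  # passes silently
```

Reporting all problems at once, rather than failing on the first, is a design choice of this sketch; it tends to shorten the fix-and-retry loop when a frame has several issues.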

How it was tested

  • Added many tests for add_frame; run them with pytest -sx tests/test_datasets.py
  • Added tests for image_array_to_pil_image with range_check=True.
  • The range check computes a max and a min, which is fast enough. Average time taken for computing max and min over 100 runs: 0.000188 seconds, measured with the following script:
import numpy as np
import time

def main():
    array = np.random.rand(3,480,640)
    num_repetitions = 100
    elapsed_times = []
    for _ in range(num_repetitions):
        # Measure the time taken to compute array.max() and array.min()
        start_time = time.perf_counter()
        array.max()
        array.min()
        end_time = time.perf_counter()

        # Calculate the elapsed time
        elapsed_time = end_time - start_time
        elapsed_times.append(elapsed_time)

    # Calculate the average elapsed time
    average_elapsed_time = sum(elapsed_times) / num_repetitions
    print(f"Average time taken for computing max and min over {num_repetitions} runs: {average_elapsed_time:.6f} seconds")

if __name__ == "__main__":
    main()
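
For context, a range check of the kind timed above might be sketched as follows. This is not the actual image_array_to_pil_image implementation: the function name `float_image_to_uint8` and the convention that float images must lie in [0, 1] are assumptions made for illustration. The check costs one min and one max over the array, which is what the benchmark measures.

```python
import numpy as np


def float_image_to_uint8(image: np.ndarray, range_check: bool = True) -> np.ndarray:
    """Hypothetical helper: convert a float image in [0, 1] to uint8."""
    if image.dtype == np.uint8:
        return image  # already in the target format
    if range_check:
        lo, hi = image.min(), image.max()
        if lo < 0.0 or hi > 1.0:
            raise ValueError(
                f"Expected float image values in [0, 1], got range [{lo}, {hi}]"
            )
    return (image * 255).astype(np.uint8)


# Same array shape as in the benchmark script.
converted = float_image_to_uint8(np.random.rand(3, 480, 640))
```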

@Cadene Cadene changed the base branch from main to user/rcadene/2025_01_27_dataset_v2.1 February 12, 2025 17:33
@Cadene Cadene force-pushed the user/rcadene/2025_02_11_2d_features branch from f2d59a9 to e94c029 Compare February 12, 2025 18:56
@Cadene Cadene marked this pull request as ready for review February 12, 2025 19:11
@Cadene Cadene requested a review from aliberts February 12, 2025 19:11
@Cadene Cadene changed the title from "Validate features during add_frame + Add 2d, 3d, 4d, 5d features" to "Validate features during add_frame + Add 2D-to-5D + Add string" Feb 12, 2025
@aliberts aliberts mentioned this pull request Feb 13, 2025
3 tasks
Base automatically changed from user/rcadene/2025_01_27_dataset_v2.1 to user/aliberts/2025_02_10_dataset_v2.1 February 14, 2025 13:22
…_v2.1' into user/rcadene/2025_02_11_2d_features
@Cadene Cadene mentioned this pull request Feb 14, 2025
6 tasks
Collaborator

@aliberts aliberts left a comment

First round, let's iterate on the comments a bit ;)

Collaborator

@aliberts aliberts left a comment

LGTM

@Cadene Cadene merged commit 7c2bbee into user/aliberts/2025_02_10_dataset_v2.1 Feb 14, 2025
7 checks passed
@Cadene Cadene deleted the user/rcadene/2025_02_11_2d_features branch February 14, 2025 18:59
Comment on lines 818 to +820
 if key in ["index", "episode_index", "task_index"] or ft["dtype"] in ["image", "video"]:
     continue
-elif len(ft["shape"]) == 1 and ft["shape"][0] == 1:
-    episode_buffer[key] = np.array(episode_buffer[key], dtype=ft["dtype"])
-elif len(ft["shape"]) == 1 and ft["shape"][0] > 1:
-    episode_buffer[key] = np.stack(episode_buffer[key])
-else:
-    raise ValueError(key)
+episode_buffer[key] = np.stack(episode_buffer[key])
Collaborator

Note: This causes any string features (such as "caption") to be stored as strings inside np.arrays, which is really not ideal. It causes failing tests after rebasing these changes on the per-episodes stats branch (see hack). We should probably just continue in these cases, WDYT @Cadene?
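
One possible reading of the "just continue" suggestion is sketched below. It is not the fix that was actually adopted: the function name `stack_episode_buffer` and the choice to identify string features by `ft["dtype"] == "string"` are assumptions for illustration. String features are skipped alongside images and videos, so they stay as plain Python lists instead of being wrapped in np.arrays.

```python
import numpy as np

SKIPPED_KEYS = {"index", "episode_index", "task_index"}


def stack_episode_buffer(episode_buffer: dict, features: dict) -> dict:
    """Hypothetical helper: stack numeric features, leave strings untouched."""
    stacked = dict(episode_buffer)
    for key, ft in features.items():
        # Assumption: string features are marked with dtype "string".
        if key in SKIPPED_KEYS or ft["dtype"] in ("image", "video", "string"):
            continue
        stacked[key] = np.stack(episode_buffer[key])
    return stacked


features = {
    "state": {"dtype": "float32", "shape": (2,)},
    "caption": {"dtype": "string", "shape": (1,)},
}
episode_buffer = {
    "state": [np.zeros(2, dtype=np.float32) for _ in range(3)],
    "caption": ["reach", "grasp", "lift"],
}
result = stack_episode_buffer(episode_buffer, features)
```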

2 participants