Validate features during `add_frame` + Add 2D-to-5D + Add string #720

Cadene · 2025-02-12T17:33:25Z

What this does

Motivation: We want to support more modalities and data types that Hugging Face datasets already supports. See: https://github.com/huggingface/datasets/blob/339e9dc3914e4ed837f2c9b24972e211334025ec/src/datasets/features/features.py#L462-L646

Add ndarray features of shape: (1,), (n,), (n,m), (n,m,l), (n,m,l,o), (n,m,l,o,p)
Add string features, such as a caption per frame
Add a sanity check at the top of add_frame to validate that the frame matches the self.features specification
- Display informative error messages when features are missing, extra features are present, the dtype is wrong, or the shape is wrong.
Add a range check to image_array_to_pil_image, which is executed in the image writer threads
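To make the sanity check concrete, here is a minimal sketch of the kind of validation described above. It is not the code from this PR: the function name `validate_frame` is hypothetical, and the assumption is that each entry of the features dict carries a "dtype" and a "shape" key, as in the diff discussed in the review. Image and video features are skipped because their dtype is not a NumPy dtype.

```python
import numpy as np


def validate_frame(frame: dict, features: dict) -> None:
    """Hypothetical helper: raise ValueError listing every mismatch found."""
    expected, actual = set(features), set(frame)
    errors = []

    # Report missing and extra keys first.
    if expected - actual:
        errors.append(f"Missing features: {sorted(expected - actual)}")
    if actual - expected:
        errors.append(f"Extra features: {sorted(actual - expected)}")

    for name in sorted(expected & actual):
        spec, value = features[name], frame[name]
        if spec["dtype"] in ("image", "video"):
            continue  # not covered by this sketch
        if spec["dtype"] == "string":
            if not isinstance(value, str):
                errors.append(f"'{name}': expected str, got {type(value).__name__}")
            continue
        if not isinstance(value, np.ndarray):
            errors.append(f"'{name}': expected np.ndarray, got {type(value).__name__}")
            continue
        if value.dtype != np.dtype(spec["dtype"]):
            errors.append(f"'{name}': dtype {value.dtype} != expected {spec['dtype']}")
        if value.shape != tuple(spec["shape"]):
            errors.append(f"'{name}': shape {value.shape} != expected {tuple(spec['shape'])}")

    if errors:
        raise ValueError("Frame does not match features:\n" + "\n".join(errors))
```

Collecting all problems before raising means a single error message can report missing keys, extra keys, and dtype or shape mismatches together, rather than one failure per call.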

How it was tested

Added a lot of tests for add_frame (run with pytest -sx tests/test_datasets.py).
Added tests for image_array_to_pil_image with range_check=True.
The range check computes a max and a min, which is fast enough. Average time taken for computing max and min over 100 runs: 0.000188 seconds

import numpy as np
import time

def main():
    array = np.random.rand(3,480,640)
    num_repetitions = 100
    elapsed_times = []
    for _ in range(num_repetitions):
        # Measure the time taken to compute the array max and min with NumPy
        start_time = time.perf_counter()
        array.max()
        array.min()
        end_time = time.perf_counter()

        # Calculate the elapsed time
        elapsed_time = end_time - start_time
        elapsed_times.append(elapsed_time)

    # Calculate the average elapsed time
    average_elapsed_time = sum(elapsed_times) / num_repetitions
    print(f"Average time taken for computing max and min over {num_repetitions} runs: {average_elapsed_time:.6f} seconds")

if __name__ == "__main__":
    main()
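For context on what the benchmark above is timing, the following is a rough sketch of a range check on a float image before conversion to uint8. It is an assumption about the general approach, not the actual body of image_array_to_pil_image: the helper name `float_image_to_uint8` is hypothetical, and the expected range of [0, 1] for float images is assumed.

```python
import numpy as np


def float_image_to_uint8(array: np.ndarray, range_check: bool = True) -> np.ndarray:
    """Hypothetical helper: convert a float image assumed to be in [0, 1] to uint8."""
    if array.dtype == np.uint8:
        return array  # already in the target dtype, nothing to check

    if range_check:
        # These two reductions are the cost measured by the benchmark.
        min_value = float(array.min())
        max_value = float(array.max())
        if min_value < 0.0 or max_value > 1.0:
            raise ValueError(
                f"Expected float image values in [0, 1], got range [{min_value}, {max_value}]."
            )

    return (array * 255).astype(np.uint8)
```

Without such a check, out-of-range values would be silently mangled by the cast, which is hard to trace once the conversion runs in a background writer thread.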


aliberts

First round, let's iterate on the comments a bit ;)

lerobot/common/datasets/lerobot_dataset.py

examples/port_datasets/pusht_zarr.py

lerobot/common/utils/utils.py

lerobot/common/datasets/utils.py

tests/test_datasets.py

tests/test_image_writer.py

aliberts

LGTM

aliberts · 2025-02-14T23:56:02Z

lerobot/common/datasets/lerobot_dataset.py

            if key in ["index", "episode_index", "task_index"] or ft["dtype"] in ["image", "video"]:
                continue
-            elif len(ft["shape"]) == 1 and ft["shape"][0] == 1:
-                episode_buffer[key] = np.array(episode_buffer[key], dtype=ft["dtype"])
-            elif len(ft["shape"]) == 1 and ft["shape"][0] > 1:
-                episode_buffer[key] = np.stack(episode_buffer[key])
-            else:
-                raise ValueError(key)
+            episode_buffer[key] = np.stack(episode_buffer[key])


Note: This causes any string features (such as "caption") to be stored as strings inside np.arrays, which is really not ideal. It causes failing tests after rebasing these changes on the per-episode stats branch (see hack). We should probably just continue in these cases, WDYT @Cadene?
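A minimal sketch of the suggestion above, assuming string features are identified by ft["dtype"] == "string". The function name `stack_episode_buffer` is hypothetical and this is not the code that was merged; it only illustrates skipping string features so they stay as Python lists.

```python
import numpy as np


def stack_episode_buffer(episode_buffer: dict, features: dict) -> dict:
    """Hypothetical helper: stack numeric features, leave strings untouched."""
    skip_keys = {"index", "episode_index", "task_index"}
    skip_dtypes = {"image", "video", "string"}  # "string" is the proposed addition

    stacked = dict(episode_buffer)
    for key, ft in features.items():
        if key in skip_keys or ft["dtype"] in skip_dtypes:
            continue
        stacked[key] = np.stack(episode_buffer[key])
    return stacked


# Stacking strings yields a fixed-width unicode array rather than a list of str.
captions = np.stack(["a cat", "a dog"])
print(captions.dtype.kind)  # "U" marks a NumPy unicode string dtype
```

Leaving captions as a plain list avoids the fixed-width unicode array that np.stack would otherwise produce.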

Cadene added 2 commits February 11, 2025 17:20

Add possibility to add task per frame

b8ca7e7

Add tests add_frame

3281e66

Cadene changed the base branch from main to user/rcadene/2025_01_27_dataset_v2.1 February 12, 2025 17:33

Cadene added 2 commits February 12, 2025 18:41

Fix unit tests

450b04d

Add check_feature, Add 2D, 3D, 4D, 5D features

e94c029

Cadene force-pushed the user/rcadene/2025_02_11_2d_features branch from f2d59a9 to e94c029 Compare February 12, 2025 18:56

Cadene marked this pull request as ready for review February 12, 2025 19:11

Cadene requested a review from aliberts February 12, 2025 19:11

Add string

53bb86a

Cadene changed the title ~~Validate features during add_frame + Add 2d, 3d, 4d, 5d features~~ Validate features during add_frame + Add 2D-to-5D + Add string Feb 12, 2025

Cadene added 2 commits February 13, 2025 10:55

Add sanity check for image range

7a81132

Fix unit tests

15cc1f8

aliberts mentioned this pull request Feb 13, 2025

LeRobotDataset v2.1 #711

Open

3 tasks

Base automatically changed from user/rcadene/2025_01_27_dataset_v2.1 to user/aliberts/2025_02_10_dataset_v2.1 February 14, 2025 13:22

Merge remote-tracking branch 'origin/user/aliberts/2025_02_10_dataset…

6932755

…_v2.1' into user/rcadene/2025_02_11_2d_features

Cadene mentioned this pull request Feb 14, 2025

Per-episode stats #521

Merged

6 tasks

aliberts reviewed Feb 14, 2025


address comments

9394191

aliberts approved these changes Feb 14, 2025


Cadene merged commit 7c2bbee into user/aliberts/2025_02_10_dataset_v2.1 Feb 14, 2025
7 checks passed

Cadene deleted the user/rcadene/2025_02_11_2d_features branch February 14, 2025 18:59

aliberts reviewed Feb 14, 2025

