
Remove reshape to speed up loading of scalar features in TensorFlow #116

Merged
merged 2 commits into NVIDIA-Merlin:main on Mar 28, 2023

Conversation

oliverholworthy
Member

Goal: Improve the loading time of the TensorFlow dataloader with scalar features.

Details

Removes the `_reshape_dim` method. The reshape was required because scalar columns with the same dtype were grouped together during conversion in `_process_dataframe`, so each column then had to be extracted back out into flat values later in `_process_batch`. This PR removes the need for that later reshape by processing each column separately, as we already do for the list columns.
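The change can be illustrated with a small sketch (plain NumPy here; the names and shapes are illustrative, not the dataloader's actual internals):

```python
import numpy as np

# Two scalar int columns of a batch (hypothetical stand-in for a dataframe).
batch = {"a": np.arange(4), "b": np.arange(4, 8)}

# Before: columns with the same dtype were packed together into one array...
packed = np.stack(list(batch.values()), axis=1)  # shape (4, 2)
# ...and each column later had to be sliced out and reshaped back to flat values.
unpacked = {name: packed[:, i].reshape(-1) for i, name in enumerate(batch)}

# After: each column is converted to a flat tensor on its own,
# mirroring how list columns are handled, so no reshape step is needed.
direct = {name: np.asarray(col) for name, col in batch.items()}

assert all((unpacked[k] == direct[k]).all() for k in batch)
```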

Timing

import random
import time

import cupy
import cudf

from merlin.io import Dataset


def get_dataset(num_rows, *, num_list_features=0, num_int_features=0, num_float_features=0):    
    list_features = {
        f"list_{i}": [[random.randint(1, 10) for _ in range(4)] for _ in range(num_rows)]
        for i in range(num_list_features)
    }
    scalar_int_features = {
        f"scalar_int_{i}": cupy.random.randint(1, 10, size=num_rows)
        for i in range(num_int_features)
    }
    scalar_float_features = {
        # key prefix fixed: was "scalar_int_{i}", which collided with the int features
        f"scalar_float_{i}": cupy.random.uniform(size=num_rows)
        for i in range(num_float_features)
    }
    features = {**list_features, **scalar_int_features, **scalar_float_features}
    df = cudf.DataFrame(features)
    return Dataset(df)


def dataset_load_time(dataset, loader_cls, batch_size):
    with loader_cls(dataset, batch_size=batch_size) as loader:
        start_t = time.time()
        for batch in loader:
            pass
        end_t = time.time()
        return end_t - start_t


num_rows = 100_000
num_features = 10
dataset = get_dataset(num_rows, num_int_features=num_features)
batch_size = 10

TensorFlow

from merlin.dataloader.tensorflow import Loader as TFLoader

%timeit -n 10 dataset_load_time(dataset, TFLoader, batch_size=batch_size)
  • Before: 12.7 s ± 44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    • Note: this performance regression is only present in the current unreleased development branch
    • Release 23.02: 386 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • After: 222 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

PyTorch

from merlin.dataloader.torch import Loader as TorchLoader

%timeit -n 10 dataset_load_time(dataset, TorchLoader, batch_size=batch_size)
  • Before 386 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • After 152 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
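Outside a notebook, the `%timeit` magic is unavailable; a rough stand-in for its mean ± std. dev. over several runs can be sketched with the stdlib `timeit` module (`bench` and its defaults are illustrative, not part of this PR):

```python
import timeit

def bench(fn, runs=7, loops=10):
    """Return (mean, std dev) of per-loop wall time, like %timeit's summary."""
    per_loop = [timeit.timeit(fn, number=loops) / loops for _ in range(runs)]
    mean = sum(per_loop) / len(per_loop)
    std = (sum((t - mean) ** 2 for t in per_loop) / len(per_loop)) ** 0.5
    return mean, std

# Usage (assuming dataset_load_time, dataset, and a loader class from above):
# mean, std = bench(lambda: dataset_load_time(dataset, TFLoader, batch_size=10))
```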

The pack/unpack methods are designed mostly for series, not cupy arrays.
@karlhigley karlhigley merged commit 020e538 into NVIDIA-Merlin:main Mar 28, 2023