adding array version of dataloader #111
Conversation
I've been looking at comparing load time between the array version and the current version. The array version from the latest commit on this PR appears to take longer to load in most cases. The exception is the TensorFlow Loader with scalar features of the same type, which is slower with the current dataloader version.
import random
import time
import cupy
import cudf
from merlin.io import Dataset
def get_dataset(num_rows, *, num_list_features=0, num_int_features=0, num_float_features=0):
list_features = {
f"list_{i}": [[random.randint(1, 10) for _ in range(4)] for _ in range(num_rows)]
for i in range(num_list_features)
}
scalar_int_features = {
f"scalar_int_{i}": cupy.random.randint(1, 10, size=num_rows)
for i in range(num_int_features)
}
scalar_float_features = {
f"scalar_int_{i}": cupy.random.uniform(size=num_rows)
for i in range(num_float_features)
}
features = {**list_features, **scalar_int_features, **scalar_float_features}
df = cudf.DataFrame(features)
return Dataset(df)
def dataset_load_time(dataset, loader_cls, batch_size):
start_t = time.time()
for batch in loader_cls(dataset, batch_size=batch_size):
pass
end_t = time.time()
return end_t - start_t
from merlin.dataloader.tensorflow import Loader as TFLoader
from merlin.dataloader.torch import Loader as TorchLoader
# Array Versions (PR #111)
from merlin.dataloader.frameworks.torch import TorchArrayDataloader
from merlin.dataloader.frameworks.tensorflow import TFArrayDataloader
# -----------------------------------------------------------------------------
# List Features
dataset = get_dataset(4000, num_list_features=10)
batch_size = 10
print("\n# List Features")
print("\nTensorFlow")
for loader_cls in [TFArrayDataloader, TFLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("\nTorch")
for loader_cls in [TorchArrayDataloader, TorchLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
# -----------------------------------------------------------------------------
# Scalar Features
dataset = get_dataset(100_000, num_int_features=10)
batch_size = 10
print("\n# Scalar Features")
print("\nTensorFlow")
for loader_cls in [TFArrayDataloader, TFLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("\nTorch")
for loader_cls in [TorchArrayDataloader, TorchLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
|
Given that the array loader is faster for TF scalar features, I'd guess we might be hitting issues with the reshapes required to get list features through DLpack. It seems like it's more than that, but that might be one contributor. |
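For background on why list features cost more than scalars (my sketch, not the loader's actual code): a list column is typically stored as a flat values array plus an offsets array, so on top of the raw DLPack handoff the loader has to do extra slicing/reshaping work to recover per-row structure:

```python
import numpy as np

# A list column with 3 rows of 4 elements each, stored CSR-style
# (cuDF represents list columns similarly: child values + offsets):
values = np.arange(12)             # flattened element values
offsets = np.array([0, 4, 8, 12])  # row boundaries into `values`

# Scalar columns can cross DLPack unchanged; list columns need this
# extra step to recover a (rows, list_length) view per batch.
row_lengths = np.diff(offsets)
assert (row_lengths == row_lengths[0]).all()  # fixed-length lists only
dense = values.reshape(len(offsets) - 1, row_lengths[0])
```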
After profiling with
|
If I further disable the dtype conversion and set
But at that point we start to hit a limitation of Torch's DLpack implementation, which grabs the current CUDA stream every time |
Turns out that the CUDA initialization is getting counted as part of whichever loader goes first, so I added this line above the profiling:
And due to the issue with
to this:
which uses the older style DLpack API, bypassing Torch's stream handling code but still seems to work okay. That gets me to:
where the performance difference seems to break down to about 2/3 converting through DLpack and 1/3 creating |
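For anyone following along, the difference between the two DLPack styles is the protocol hook involved: in the newer style the consumer calls the producer's `__dlpack__` method (which is where Torch queries the current CUDA stream), while the older capsule-based style skips that negotiation. A toy CPU-only illustration of the newer protocol, using NumPy as the consumer (the wrapper class is hypothetical):

```python
import numpy as np

class RecordingProducer:
    """Wraps an array and records calls to the new-style DLPack hook."""

    def __init__(self, arr):
        self.arr = arr
        self.dlpack_calls = 0

    def __dlpack__(self, **kwargs):
        # This is the hook where a GPU consumer (e.g. Torch) can pass
        # its current stream; the older capsule API never reaches it.
        self.dlpack_calls += 1
        return self.arr.__dlpack__(**kwargs)

    def __dlpack_device__(self):
        return self.arr.__dlpack_device__()

producer = RecordingProducer(np.arange(4))
out = np.from_dlpack(producer)  # new-style: goes through __dlpack__
```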
Initializing both frameworks ahead of time and using the older PyTorch DLpack API, along with disabling the values/offsets validation, shape computation, and dtype coercion gets me to here:
The existing TF dataloader seems to be spending a lot of time reshaping scalar columns, which we could probably improve. |
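To make the scalar reshape concrete (a NumPy sketch under my reading of #101, not the loader's actual code): the output shape change from 2-D to 1-D means each scalar column picks up a per-batch reshape from (batch_size, 1) to (batch_size,). The reshape itself is a zero-copy view, but issuing one framework op per column per batch adds up:

```python
import numpy as np

batch_size = 8
column = np.arange(batch_size).reshape(-1, 1)  # old 2-D output: (batch, 1)

# Post-#101 1-D output: cheap as a NumPy view, but a per-column,
# per-batch op inside TF/Torch can still cost noticeable time.
flat = column.reshape(-1)
assert flat.base is not None  # zero-copy: shares memory with `column`
```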
I've opened PR #116 to speed up the TF dataloader for scalar columns (it makes the Torch versions slightly faster too). We have an unreleased performance regression that was introduced when the reshape was added recently, as part of the output shape change from 2-D to 1-D in #101. |
Using the same script from this comment with the latest version of this branch alongside the latest version of core, I'm still seeing a big difference in loading time between the two versions. One thing to note is that a modification is required to make this script work. One way to get it to run is to copy the modified
|
There's still a timing difference because the changes required to take advantage of the dispatching optimizations in Core aren't yet present in this PR |
With this PR, core/pull/264, and the following profiling script, I get these timings:
import random
import time
import cupy
import cudf
import tensorflow as tf
import torch as th
from merlin.io import Dataset
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
def get_dataset(num_rows, *, num_list_features=0, num_int_features=0, num_float_features=0):
list_features = {
f"list_{i}": [[random.randint(1, 10) for _ in range(4)] for _ in range(num_rows)]
for i in range(num_list_features)
}
scalar_int_features = {
f"scalar_int_{i}": cupy.random.randint(1, 10, size=num_rows)
for i in range(num_int_features)
}
scalar_float_features = {
f"scalar_int_{i}": cupy.random.uniform(size=num_rows)
for i in range(num_float_features)
}
features = {**list_features, **scalar_int_features, **scalar_float_features}
df = cudf.DataFrame(features)
return Dataset(df)
def dataset_load_time(dataset, loader_cls, batch_size):
start_t = time.time()
for batch in loader_cls(dataset, batch_size=batch_size):
pass
end_t = time.time()
return end_t - start_t
from merlin.dataloader.tensorflow import Loader as TFLoader
from merlin.dataloader.torch import Loader as TorchLoader
# Array Versions (PR #111)
from merlin.dataloader.frameworks.torch import TorchArrayDataloader
from merlin.dataloader.frameworks.tensorflow import TFArrayDataloader
with tf.device("gpu"):
tf_force_init = tf.constant([1,2,3])
th_force_init = th.tensor([1,2,3]).cuda()
# -----------------------------------------------------------------------------
# List Features
print("\nList Features\n")
dataset = get_dataset(1_000_000, num_list_features=10)
batch_size = 1000
print("TensorFlow")
for loader_cls in [TFLoader, TFArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("Torch")
for loader_cls in [TorchLoader, TorchArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
# -----------------------------------------------------------------------------
# Scalar Features
print("\nScalar Features\n")
dataset = get_dataset(1_000_000, num_int_features=10)
batch_size = 1000
print("TensorFlow")
for loader_cls in [TFLoader, TFArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("Torch")
for loader_cls in [TorchLoader, TorchArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
# -----------------------------------------------------------------------------
# Mixed Features
print("\nMixed Features\n")
dataset = get_dataset(1_000_000, num_list_features=5, num_int_features=5)
batch_size = 1000
print("TensorFlow")
for loader_cls in [TFLoader, TFArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("Torch")
for loader_cls in [TorchLoader, TorchArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds") |
I've tried the same and I'm getting similar results.
|
The last example used the dataset from the first dataset example, run 10 times. One thing I notice here is that the first run on list features is slower for the array-based implementation. What mechanism causes subsequent runs to load faster? Since this timing includes instantiating a new loader each time, is there some global state being changed? |
And could the same mechanism that speeds up list features after the first run have something to do with why the scalar features are 5-10x slower, even though list features load faster than with the equivalent current loader after the first run? |
I'm pretty sure the first run takes longer no matter which version of the dataloader you use, because whichever comes first gets attributed the cost of initializing the framework. In the version of the profiling script included above, I addressed that by forcing the frameworks to initialize outside the timing function:
|
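The warm-up lines referenced above appear in the full script earlier in the thread; the pattern generalizes to any lazily-initialized backend: trigger the one-time setup before the timer starts so it isn't billed to whichever loader runs first. A framework-free sketch of the same idea:

```python
import time

_initialized = False

def _lazy_init():
    # stands in for CUDA context / framework startup cost
    global _initialized
    if not _initialized:
        time.sleep(0.05)
        _initialized = True

def timed_load(n):
    _lazy_init()  # first caller would otherwise pay the init cost
    return sum(range(n))

_lazy_init()  # force initialization OUTSIDE the timed region

start = time.time()
total = timed_load(1000)
elapsed = time.time() - start
```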
Seems like we need the PR that swaps out |
This PR adds a new data loader that creates NumPy arrays (CPU) or CuPy arrays (GPU), depending on the device.
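A minimal sketch of the dispatch described here (the helper names are hypothetical, not the PR's actual API): pick the array module by device and build batches with it, so the same loader code yields NumPy arrays on CPU and CuPy arrays on GPU:

```python
import numpy as np

def _array_module(device):
    """Return the array namespace for a device string (hypothetical helper)."""
    if device == "gpu":
        import cupy  # only imported when a GPU is requested
        return cupy
    return np

def make_batch(rows, device="cpu"):
    xp = _array_module(device)
    return xp.asarray(rows)  # numpy array on CPU, cupy array on GPU

batch = make_batch([1, 2, 3])
```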