adding array version of dataloader #111
Conversation
I've been looking at comparing load time between the array version and the current version. The array version from the latest commit on this PR appears to take longer to load in most cases. The exception is the TensorFlow Loader with scalar features of the same type, which is slower with the current dataloader version.
import random
import time
import cupy
import cudf
from merlin.io import Dataset
def get_dataset(num_rows, *, num_list_features=0, num_int_features=0, num_float_features=0):
list_features = {
f"list_{i}": [[random.randint(1, 10) for _ in range(4)] for _ in range(num_rows)]
for i in range(num_list_features)
}
scalar_int_features = {
f"scalar_int_{i}": cupy.random.randint(1, 10, size=num_rows)
for i in range(num_int_features)
}
scalar_float_features = {
f"scalar_int_{i}": cupy.random.uniform(size=num_rows)
for i in range(num_float_features)
}
features = {**list_features, **scalar_int_features, **scalar_float_features}
df = cudf.DataFrame(features)
return Dataset(df)
def dataset_load_time(dataset, loader_cls, batch_size):
start_t = time.time()
for batch in loader_cls(dataset, batch_size=batch_size):
pass
end_t = time.time()
return end_t - start_t
from merlin.dataloader.tensorflow import Loader as TFLoader
from merlin.dataloader.torch import Loader as TorchLoader
# Array Versions (PR #111)
from merlin.dataloader.frameworks.torch import TorchArrayDataloader
from merlin.dataloader.frameworks.tensorflow import TFArrayDataloader
# -----------------------------------------------------------------------------
# List Features
dataset = get_dataset(4000, num_list_features=10)
batch_size = 10
print("\n# List Features")
print("\nTensorFlow")
for loader_cls in [TFArrayDataloader, TFLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("\nTorch")
for loader_cls in [TorchArrayDataloader, TorchLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
# -----------------------------------------------------------------------------
# Scalar Features
dataset = get_dataset(100_000, num_int_features=10)
batch_size = 10
print("\n# Scalar Features")
print("\nTensorFlow")
for loader_cls in [TFArrayDataloader, TFLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("\nTorch")
for loader_cls in [TorchArrayDataloader, TorchLoader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
|
Given that the array loader is faster for TF scalar features, I'd guess we might be hitting issues with the reshapes required to get list features through DLpack. It seems like it's more than that, but that might be one contributor. |
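For background on why list features cost more than scalars (my sketch, not the loader's actual code): a list column is typically stored as a flat values array plus an offsets array, so on top of the raw DLPack handoff the loader has to do extra slicing/reshaping work to recover per-row structure:

```python
import numpy as np

# A list column with 3 rows of 4 elements each, stored CSR-style
# (cuDF represents list columns similarly: child values + offsets):
values = np.arange(12)             # flattened element values
offsets = np.array([0, 4, 8, 12])  # row boundaries into `values`

# Scalar columns can cross DLPack unchanged; list columns need this
# extra step to recover a (rows, list_length) view per batch.
row_lengths = np.diff(offsets)
assert (row_lengths == row_lengths[0]).all()  # fixed-length lists only
dense = values.reshape(len(offsets) - 1, row_lengths[0])
```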
After profiling with
|
If I further disable the dtype conversion and set
But at that point we start to hit a limitation of Torch's DLpack implementation, which grabs the current CUDA stream every time |
Turns out that the CUDA initialization is getting counted as part of whichever loader goes first, so I added this line above the profiling:
And due to the issue with
to this:
which uses the older style DLpack API, bypassing Torch's stream handling code but still seems to work okay. That gets me to:
where the performance difference seems to break down to about 2/3 converting through DLpack and 1/3 creating |
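For anyone following along, the difference between the two DLPack styles is the protocol hook involved: in the newer style the consumer calls the producer's `__dlpack__` method (which is where Torch queries the current CUDA stream), while the older capsule-based style skips that negotiation. A toy CPU-only illustration of the newer protocol, using NumPy as the consumer (the wrapper class is hypothetical):

```python
import numpy as np

class RecordingProducer:
    """Wraps an array and records calls to the new-style DLPack hook."""

    def __init__(self, arr):
        self.arr = arr
        self.dlpack_calls = 0

    def __dlpack__(self, **kwargs):
        # This is the hook where a GPU consumer (e.g. Torch) can pass
        # its current stream; the older capsule API never reaches it.
        self.dlpack_calls += 1
        return self.arr.__dlpack__(**kwargs)

    def __dlpack_device__(self):
        return self.arr.__dlpack_device__()

producer = RecordingProducer(np.arange(4))
out = np.from_dlpack(producer)  # new-style: goes through __dlpack__
```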
Initializing both frameworks ahead of time and using the older PyTorch DLpack API, along with disabling the values/offsets validation, shape computation, and dtype coercion gets me to here:
The existing TF dataloader seems to be spending a lot of time reshaping scalar columns, which we could probably improve. |
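To make the scalar reshape concrete (a NumPy sketch under my reading of #101, not the loader's actual code): the output shape change from 2-D to 1-D means each scalar column picks up a per-batch reshape from (batch_size, 1) to (batch_size,). The reshape itself is a zero-copy view, but issuing one framework op per column per batch adds up:

```python
import numpy as np

batch_size = 8
column = np.arange(batch_size).reshape(-1, 1)  # old 2-D output: (batch, 1)

# Post-#101 1-D output: cheap as a NumPy view, but a per-column,
# per-batch op inside TF/Torch can still cost noticeable time.
flat = column.reshape(-1)
assert flat.base is not None  # zero-copy: shares memory with `column`
```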
I've opened PR #116 to speed up the TF dataloader for scalar columns (it makes the Torch versions slightly faster too). We have an unreleased performance regression that was introduced when the reshape was added recently, as part of the output shape change from 2-D to 1-D in #101. |
Using the same script from this comment with the latest version of this branch alongside the latest version of core, I'm still seeing a big difference in loading time between the two versions. One thing to note is that a modification is required to make this script work. One way to get it to run is to copy the modified
|
There's still a timing difference because the changes required to take advantage of the dispatching optimizations in Core aren't yet present in this PR |
With this PR, core/pull/264, and the following profiling script, I get these timings:
import random
import time
import cupy
import cudf
import tensorflow as tf
import torch as th
from merlin.io import Dataset
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
def get_dataset(num_rows, *, num_list_features=0, num_int_features=0, num_float_features=0):
list_features = {
f"list_{i}": [[random.randint(1, 10) for _ in range(4)] for _ in range(num_rows)]
for i in range(num_list_features)
}
scalar_int_features = {
f"scalar_int_{i}": cupy.random.randint(1, 10, size=num_rows)
for i in range(num_int_features)
}
scalar_float_features = {
f"scalar_int_{i}": cupy.random.uniform(size=num_rows)
for i in range(num_float_features)
}
features = {**list_features, **scalar_int_features, **scalar_float_features}
df = cudf.DataFrame(features)
return Dataset(df)
def dataset_load_time(dataset, loader_cls, batch_size):
start_t = time.time()
for batch in loader_cls(dataset, batch_size=batch_size):
pass
end_t = time.time()
return end_t - start_t
from merlin.dataloader.tensorflow import Loader as TFLoader
from merlin.dataloader.torch import Loader as TorchLoader
# Array Versions (PR #111)
from merlin.dataloader.frameworks.torch import TorchArrayDataloader
from merlin.dataloader.frameworks.tensorflow import TFArrayDataloader
with tf.device("gpu"):
tf_force_init = tf.constant([1,2,3])
th_force_init = th.tensor([1,2,3]).cuda()
# -----------------------------------------------------------------------------
# List Features
print("\nList Features\n")
dataset = get_dataset(1_000_000, num_list_features=10)
batch_size = 1000
print("TensorFlow")
for loader_cls in [TFLoader, TFArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("Torch")
for loader_cls in [TorchLoader, TorchArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
# -----------------------------------------------------------------------------
# Scalar Features
print("\nScalar Features\n")
dataset = get_dataset(1_000_000, num_int_features=10)
batch_size = 1000
print("TensorFlow")
for loader_cls in [TFLoader, TFArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("Torch")
for loader_cls in [TorchLoader, TorchArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
# -----------------------------------------------------------------------------
# Mixed Features
print("\nMixed Features\n")
dataset = get_dataset(1_000_000, num_list_features=5, num_int_features=5)
batch_size = 1000
print("TensorFlow")
for loader_cls in [TFLoader, TFArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds")
print("Torch")
for loader_cls in [TorchLoader, TorchArrayDataloader]:
load_time = dataset_load_time(dataset, loader_cls, batch_size=batch_size)
print(loader_cls.__name__, f"{load_time:.02f} seconds") |
I've tried the same and I'm getting similar results.
|
The last example used the dataset from the first dataset example, run 10 times. One thing I notice here is that the first run on list features is slower for the array-based implementation. What mechanism causes subsequent runs to load faster? Since this timing includes instantiating a new loader each time, is there some global state being changed? |
And could the same mechanism that speeds up list features after the first run have something to do with why the scalar features are 5-10x slower, even though list features load faster than with the equivalent current loader after the first run? |
I'm pretty sure the first run takes longer no matter which version of the dataloader you use, because whichever comes first gets attributed the cost of initializing the framework. In the version of the profiling script included above, I addressed that by forcing the frameworks to initialize outside the timing function:
|
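The warm-up lines referenced above appear in the full script earlier in the thread; the pattern generalizes to any lazily-initialized backend: trigger the one-time setup before the timer starts so it isn't billed to whichever loader runs first. A framework-free sketch of the same idea:

```python
import time

_initialized = False

def _lazy_init():
    # stands in for CUDA context / framework startup cost
    global _initialized
    if not _initialized:
        time.sleep(0.05)
        _initialized = True

def timed_load(n):
    _lazy_init()  # first caller would otherwise pay the init cost
    return sum(range(n))

_lazy_init()  # force initialization OUTSIDE the timed region

start = time.time()
total = timed_load(1000)
elapsed = time.time() - start
```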
Seems like we need the PR that swaps out |
This PR adds a new data loader that creates NumPy arrays (CPU) or CuPy arrays (GPU), depending on the device.
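A minimal sketch of the dispatch described here (the helper names are hypothetical, not the PR's actual API): pick the array module by device and build batches with it, so the same loader code yields NumPy arrays on CPU and CuPy arrays on GPU:

```python
import numpy as np

def _array_module(device):
    """Return the array namespace for a device string (hypothetical helper)."""
    if device == "gpu":
        import cupy  # only imported when a GPU is requested
        return cupy
    return np

def make_batch(rows, device="cpu"):
    xp = _array_module(device)
    return xp.asarray(rows)  # numpy array on CPU, cupy array on GPU

batch = make_batch([1, 2, 3])
```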