
Refactor ragged tensor processing for readability and improved performance #104

Merged

Conversation

Member

@oliverholworthy oliverholworthy commented Feb 20, 2023

Refactor ragged tensor processing for readability and improved performance.

Performance

Loading of list columns is now roughly 10x faster.

Torch

This change fixes a couple of things that were inconsistent about the Torch loader.

  • Returns 2-d scalar arrays (last dim of size 1) consistently, matching the TensorFlow loader. Previously we returned either a 1-d or a 2-d array depending on whether more than one column shared the same dtype, with further differences depending on whether the data was on CPU or GPU (cudf/pandas).
  • The ragged output representation for list columns (values, offsets) now returns the offsets correctly, with the last value representing the total length of the values. We were previously omitting the last value. (See the sketch below.)
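
For illustration, a minimal sketch of the two conventions described above (the tensors and values here are hypothetical, not taken from the loader code):

import torch

# Scalar column: one value per row, returned as a 2-d array with last dim of size 1.
scalar_col = torch.tensor([[3], [7], [1]])  # shape (3, 1), not (3,)

# Ragged list column with rows [10, 11], [12], [13, 14, 15]:
values = torch.tensor([10, 11, 12, 13, 14, 15])
# Offsets include the final value (the total length), so every row i,
# including the last, is values[offsets[i]:offsets[i + 1]].
offsets = torch.tensor([0, 2, 3, 6])

rows = [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
# rows == [tensor([10, 11]), tensor([12]), tensor([13, 14, 15])]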

@oliverholworthy oliverholworthy added the enhancement New feature or request label Feb 20, 2023
@oliverholworthy oliverholworthy self-assigned this Feb 20, 2023

@oliverholworthy oliverholworthy marked this pull request as ready for review February 24, 2023 09:27
@@ -98,63 +98,6 @@ def test_simple_model():
_ = model.evaluate(loader)


def test_nested_list():
Contributor

Just want to make sure I understand this correctly: we're removing this test because the ragged representation is no longer supported and the dataloader now always outputs a tuple of values and offsets, correct?

Member Author

This PR doesn't change the output representation; the output types and shapes should be the same as before, apart from the Torch version, which now outputs the offsets with the last value correctly and consistently returns scalar columns as 2-d arrays with the last dim of size 1.

The reason for the removal of the test is mostly that it is now broken due to our implementation of pull_apart_list behaving differently between cudf and pandas. I'll try restoring these two lines https://github.com/NVIDIA-Merlin/dataloader/blob/v0.0.4/merlin/dataloader/loader_base.py#L577-L579 and see if I can restore this test.

The current implementation, and what this test is checking, isn't really nested list support. It currently loses the information about the nesting during the transformation, and what comes out of the dataloader cannot be turned back into the nested lists in the original data.
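
For example, a hypothetical sketch (not code from this repo) of how the nesting is lost: a doubly-nested row needs two levels of offsets to round-trip, but a single (values, offsets) pair only captures the outermost boundary.

# One row containing a nested list: [[1, 2], [3]]
values = [1, 2, 3]
offsets = [0, 3]  # row boundary only; the inner split between [1, 2] and [3] is gone
# Reconstructing from (values, offsets) yields [1, 2, 3], not [[1, 2], [3]]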

Member Author

Updated to restore the test_nested_list test, and the tests are all passing. We can leave it to another PR if we want to change the behaviour. Thanks for checking on this.

Contributor

Thanks for the explanation.

Contributor

@edknv edknv left a comment

This is awesome! I have one more comment.

Comment on lines +494 to +496
tensors_by_name[column_name] = self._to_tensor(leaves), self._to_tensor(
col_offsets
)
Contributor

Super nit: I think black's formatting makes this harder to read (at least for me). I wonder if using parens or tuple(), e.g. tensors_by_name[column_name] = (self._to_tensor(leaves), self._to_tensor(col_offsets)), would improve readability with black.

Comment on lines -148 to -151
if HAS_GPU:
offsets = torch.cat([offsets, torch.cuda.LongTensor([len(values)], device=self.device)])
else:
offsets = torch.cat([offsets, torch.LongTensor([len(values)])])
Contributor

I remember there was some issue with NVTabular and/or T4Rec in the multi-GPU case, and device=self.device was added on line 149. What do you think about still keeping something like if HAS_GPU: offsets = offsets.to(device=self.device)? (Sketched below.)
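
A minimal sketch of that suggestion (a hypothetical standalone version; self.device stands in for the loader's device attribute):

import torch

HAS_GPU = torch.cuda.is_available()

def move_offsets(offsets: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Keep the offsets on the same device as the value tensors to avoid
    # device-mismatch errors in ops like torch.cat.
    if HAS_GPU:
        offsets = offsets.to(device=device)
    return offsets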

Contributor

On second thought, it was probably an issue because torch.cat was called with offsets and the other tensor on different devices, so it should be okay with the refactor.

Member Author

The reason for the removal of the lines doing torch.cat is that the offsets were previously missing the last value. After this change, the offsets are computed in the _row_lengths_to_offsets method, which doesn't require this logic. There's still a torch.cat in use in that method; it concatenates a leading zero with the cumulative sum of the row lengths.
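
A minimal sketch of that computation (a hypothetical standalone version; the actual method lives on the loader class):

import torch

def row_lengths_to_offsets(row_lengths: torch.Tensor) -> torch.Tensor:
    # Prepend a zero, then take the cumulative sum: offsets[i] is the start
    # of row i, and offsets[-1] is the total number of values.
    zero = torch.zeros(1, dtype=row_lengths.dtype, device=row_lengths.device)
    return torch.cat([zero, row_lengths.cumsum(0)])

row_lengths = torch.tensor([2, 1, 3])
offsets = row_lengths_to_offsets(row_lengths)  # tensor([0, 2, 3, 6])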

@oliverholworthy oliverholworthy merged commit 4301447 into NVIDIA-Merlin:main Feb 28, 2023
@oliverholworthy oliverholworthy deleted the ragged-tensor-handling branch February 28, 2023 15:58
oliverholworthy added a commit that referenced this pull request Mar 13, 2023
…ance (#104)

* Update ragged tensor handling to improve load performance

* Store values/row_lengths in dict in make_tensors

* Implement _sum method for jax

* Rename `_handle_tensors` to `_process_batch`

* Rename `_create_tensors` to `_process_dataframe`

* Update docstring of `make_tensors`

* Handle case where `HAS_GPU=True` and `CUDA_VISIBLE_DEVICES=""`

* Enable offsets row-partition output for torch ragged tensors

* Return 2-d scalars from torch dataloader consistently

* Remove `test_nested_list` from tensorflow dataloader tests

This functionality is currently unsupported with undefined behaviour.
This test was returning different shapes between CPU/GPU due to the difference in
the implementation of `pull_apart_list`.

* Add squeeze to torch embeddings lookup to handle 2-d keys

* Add squeeze to batch before calling loss

Otherwise this results in implicit broadcasting that consumes a large amount of memory

* Revert "Remove `test_nested_list` from tensorflow dataloader tests"

This reverts commit 5aa7104.

* Restore nested list handling for pandas