
Refactor ragged tensor processing for readability and improved performance #104

Merged

Conversation

Member

@oliverholworthy oliverholworthy commented Feb 20, 2023

Refactor ragged tensor processing for readability and improved performance.

Performance

Loading of list columns is now roughly 10x faster.

Torch

This change fixes a couple of things that were inconsistent about the Torch loader.

  • Returns 2-d scalar arrays (last dim of size 1) consistently, matching the TensorFlow loader. Previously we returned either a 1-d or a 2-d array depending on whether more than one column shared the same dtype, with further differences depending on whether the data was on CPU or GPU (cudf/pandas).
  • The ragged output representation for list columns (values, offsets) now returns the offsets correctly, with the last value representing the total length of the values. We were previously omitting the last value. (See the sketch below.)
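
For illustration, a minimal sketch of the two conventions described above (the tensors and values here are hypothetical, not taken from the loader code):

import torch

# Scalar column: one value per row, returned as a 2-d array with last dim of size 1.
scalar_col = torch.tensor([[3], [7], [1]])  # shape (3, 1), not (3,)

# Ragged list column with rows [10, 11], [12], [13, 14, 15]:
values = torch.tensor([10, 11, 12, 13, 14, 15])
# Offsets include the final value (the total length), so every row i,
# including the last, is values[offsets[i]:offsets[i + 1]].
offsets = torch.tensor([0, 2, 3, 6])

rows = [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
# rows == [tensor([10, 11]), tensor([12]), tensor([13, 14, 15])]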

@oliverholworthy oliverholworthy added the enhancement New feature or request label Feb 20, 2023
@oliverholworthy oliverholworthy self-assigned this Feb 20, 2023

@oliverholworthy oliverholworthy marked this pull request as ready for review February 24, 2023 09:27
@@ -98,63 +98,6 @@ def test_simple_model():
_ = model.evaluate(loader)


def test_nested_list():
Contributor

Just want to make sure I understand this correctly: we're removing this test because the ragged representation is no longer supported and the dataloader now always outputs a tuple of values and offsets, correct?

Member Author

This PR doesn't change the output representation; the output types and shapes should be the same as before, apart from the Torch version, which now outputs the offsets with the last value correctly and consistently returns scalar columns as 2-d arrays with the last dim of size 1.

The reason for the removal of the test is mostly that it is now broken due to our implementation of pull_apart_list behaving differently between cudf and pandas. I'll try restoring these two lines https://github.com/NVIDIA-Merlin/dataloader/blob/v0.0.4/merlin/dataloader/loader_base.py#L577-L579 and see if I can restore this test.

The current implementation, and what this test is checking, isn't really nested list support. It currently loses the information about the nesting during the transformation, and what comes out of the dataloader cannot be turned back into the nested lists in the original data.
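
For example, a hypothetical sketch (not code from this repo) of how the nesting is lost: a doubly-nested row needs two levels of offsets to round-trip, but a single (values, offsets) pair only captures the outermost boundary.

# One row containing a nested list: [[1, 2], [3]]
values = [1, 2, 3]
offsets = [0, 3]  # row boundary only; the inner split between [1, 2] and [3] is gone
# Reconstructing from (values, offsets) yields [1, 2, 3], not [[1, 2], [3]]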

Member Author

Updated to restore the test_nested_list test, and the tests are all passing. We can leave it to another PR if we want to change the behaviour. Thanks for checking on this.

Contributor

Thanks for the explanation.

Contributor

@edknv edknv left a comment

This is awesome! I have one more comment.

Comment on lines +494 to +496
tensors_by_name[column_name] = self._to_tensor(leaves), self._to_tensor(
col_offsets
)
Contributor

Super nit: I think black's formatting makes this harder to read (at least for me). I wonder if using parens or tuple(), e.g. tensors_by_name[column_name] = (self._to_tensor(leaves), self._to_tensor(col_offsets)), would improve readability with black.

Comment on lines -148 to -151
if HAS_GPU:
offsets = torch.cat([offsets, torch.cuda.LongTensor([len(values)], device=self.device)])
else:
offsets = torch.cat([offsets, torch.LongTensor([len(values)])])
Contributor

I remember there was some issue with NVTabular and/or T4Rec in the multi-GPU case, and device=self.device was added on line 149. What do you think about still keeping something like if HAS_GPU: offsets = offsets.to(device=self.device)? (Sketched below.)
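
A minimal sketch of that suggestion (a hypothetical standalone version; self.device stands in for the loader's device attribute):

import torch

HAS_GPU = torch.cuda.is_available()

def move_offsets(offsets: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Keep the offsets on the same device as the value tensors to avoid
    # device-mismatch errors in ops like torch.cat.
    if HAS_GPU:
        offsets = offsets.to(device=device)
    return offsets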

Contributor

On second thought, it was probably an issue because torch.cat was called with offsets and the other tensor on different devices, so it should be okay with the refactor.

Member Author

The reason for the removal of the lines doing torch.cat is that the offsets were previously missing the last value. After this change, the offsets are computed in the _row_lengths_to_offsets method, which doesn't require this logic. There's still a torch.cat in use in that method; it concatenates a leading zero with the cumulative sum of the row lengths.
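
A minimal sketch of that computation (a hypothetical standalone version; the actual method lives on the loader class):

import torch

def row_lengths_to_offsets(row_lengths: torch.Tensor) -> torch.Tensor:
    # Prepend a zero, then take the cumulative sum: offsets[i] is the start
    # of row i, and offsets[-1] is the total number of values.
    zero = torch.zeros(1, dtype=row_lengths.dtype, device=row_lengths.device)
    return torch.cat([zero, row_lengths.cumsum(0)])

row_lengths = torch.tensor([2, 1, 3])
offsets = row_lengths_to_offsets(row_lengths)  # tensor([0, 2, 3, 6])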

@oliverholworthy oliverholworthy merged commit 4301447 into NVIDIA-Merlin:main Feb 28, 2023
@oliverholworthy oliverholworthy deleted the ragged-tensor-handling branch February 28, 2023 15:58
oliverholworthy added a commit that referenced this pull request Mar 13, 2023
…ance (#104)

* Update ragged tensor handling to improve load performance

* Store values/row_lengths in dict in make_tensors

* Implement _sum method for jax

* Rename `_handle_tensors` to `_process_batch`

* Rename `_create_tensors` to `_process_dataframe`

* Update docstring of `make_tensors`

* Handle case where `HAS_GPU=True` and `CUDA_VISIBLE_DEVICES=""`

* Enable offsets row-partition output for torch ragged tensors

* Return 2-d scalars from torch dataloader consistently

* Remove `test_nested_list` from tensorflow dataloader tests

This functionality is currently unsupported with undefined behaviour.
This test was returning different shapes between CPU/GPU due to the difference in
the implementation of `pull_apart_list`.

* Add squeeze to torch embeddings lookup to handle 2-d keys

* Add squeeze to batch before calling loss

Otherwise this results in implicit broadcasting that consumes a large amount of memory

* Revert "Remove `test_nested_list` from tensorflow dataloader tests"

This reverts commit 5aa7104.

* Restore nested list handling for pandas