
Update dataloader to provide new output structure #101

Merged

merged 16 commits into NVIDIA-Merlin:main on Mar 15, 2023

Conversation

bschifferer
Contributor

Currently the dataloaders and Merlin Models use values/row_lengths, which is incompatible with Triton's ragged batching support.

We want to update:

  • The (values, row_lengths) tuple is also problematic and we should get rid of it, because it leads to inconsistent column names between Merlin schemas and TensorFlow saved models (TF assigns weird suffixes)
  • We should use a flat dictionary with separate keys for values and offsets at the boundaries of models (see the sketch after this list), like:
    {"col_name__values": [1, 2, ...], "col_name__offsets": [0, 4, 7]}
  • What about multiple ragged dimensions (more than one offsets array; example from the TensorFlow docs)?
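
A minimal sketch (not part of this PR) of how the proposed flat dictionary maps back to a ragged structure; "col_name" is a hypothetical column and the values are made up:

import tensorflow as tf

batch = {
    "col_name__values": tf.constant([1, 2, 3, 4, 5, 6, 7]),
    "col_name__offsets": tf.constant([0, 4, 7]),
}

# rebuild the ragged column from the two 1-d tensors
ragged = tf.RaggedTensor.from_row_splits(
    values=batch["col_name__values"],
    row_splits=batch["col_name__offsets"],
)
# ragged -> [[1, 2, 3, 4], [5, 6, 7]]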

Currently:

  • Added support to return only 1-d columns
  • Fixed a bug where a single categorical column was 1-d although it should be 2-d (backwards compatible)
  • List columns are no longer returned as a tuple

@bschifferer
Contributor Author

@jperez999 @oliverholworthy I would be interested in your early feedback :)

@oliverholworthy
Member

Based on the TF RaggedTensor docs, it seems like we could handle this with a multi-dimensional array of offsets in the {col}__offsets value, so the proposed output may not need to change if we implement support for multiple ragged dims in the future.

@bschifferer
Contributor Author

  • example from TensorFlow docs

Thanks for the documentation - although I posted it, I hadn't seen it. I was wondering how to deal with that case.

I am not sure if a multi-dimensional __offsets column will work. We will need to be able to feed the same structure into Triton. I think the idea is that we provide only 1-d tensors in every output column. In the TensorFlow documentation, the two offsets vectors have different lengths:

import tensorflow as tf

rt = tf.RaggedTensor.from_row_splits(
    values=tf.RaggedTensor.from_row_splits(
        values=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        row_splits=[0, 3, 3, 5, 9, 10]),
    row_splits=[0, 1, 1, 5])

I think we should provide multiple offsets values with __offsets_0, __offsets_1, etc? What do you think?
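
A rough sketch (not this PR's implementation) of what multiple 1-d offsets columns could look like for the nested example above; "col" is a hypothetical column name:

import tensorflow as tf

rt = tf.RaggedTensor.from_row_splits(
    values=tf.RaggedTensor.from_row_splits(
        values=[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        row_splits=[0, 3, 3, 5, 9, 10]),
    row_splits=[0, 1, 1, 5])

flat = {
    "col__values": rt.flat_values,            # 1-d, the 10 innermost values
    "col__offsets_0": rt.row_splits,          # 1-d: [0, 1, 1, 5]
    "col__offsets_1": rt.values.row_splits,   # 1-d: [0, 3, 3, 5, 9, 10]
}

Each offsets array stays 1-d, so each could be passed to Triton as a separate input tensor.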

@karlhigley
Contributor

Multiple arrays worth of offsets sound reasonable, but I think we should ignore the case with multiple ragged dimensions for now, since we don't currently have a use case for it.

X.update(lists)

for column_name in self.sparse_names:
    if column_name in self.sparse_max:
        # raise ValueError(
        #     f"Did not convert {column_name} to sparse due to missing sparse_max entry"
        # )
        X[column_name] = self._to_sparse_tensor(X[column_name], column_name)
    if self.lists_as_tuple:
        tensor = (X[column_name][0], X[column_name][1][:-1])
Contributor

It seems like we want the full set of offsets here, rather than dropping the last one with [:-1]

Contributor Author

Currently, the PyTorch dataloader does NOT provide the last offset (the length of a batch). See comment: #101 (comment)

The function _to_sparse_tensor (actually I think the result is a dense tensor) does not work when the last offset (the length) is part of the offsets tensor.

Contributor

Ohhhhh, I get it now 👍🏻

Contributor Author

I noticed the issue was that the torch dataloader adds the last offset in the _to_sparse_tensor:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/torch.py#L148-L151

I will remove these lines in my updated PR

@@ -116,6 +116,9 @@ def __init__(
        drop_last=False,
        transforms=None,
        device=None,
        use_row_lengths=False,
        tensors_as_1d = True,
        lists_as_tuple = False
Contributor

Not sure how others feel about this, but in the interest of simplifying the code I wonder if we can live without the extra configuration flags and just make the new behavior the default/standard

Member

Unless we're making the default values preserve the existing behaviour, it's going to be a breaking change that forces some action. Even so, I agree that not having the flags seems like a reasonable option in this case.


def _reshape_dim(self, tensor):
    if self.tensors_as_1d:
        return tf.reshape(tensor, shape=[-1])
Contributor

👍🏻

In the course of researching this change, I learned that reshape is essentially free because it just changes how the data in memory is interpreted, but transpose actually copies the data so it's more expensive. In theory, either one would work here, but reshape is definitely the better choice.
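
A tiny illustration of the point above (not from the PR); both calls flatten a [3, 1] column, but reshape only reinterprets the existing buffer while transpose goes through an op that materializes a new tensor:

import tensorflow as tf

t = tf.constant([[1], [2], [3]])     # shape [3, 1]
flat = tf.reshape(t, shape=[-1])     # shape [3], no data movement
swapped = tf.transpose(t)            # shape [1, 3], produces a new tensor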

    else:
        return tf.reshape(tensor, shape=[-1, 1])

def _add_last_offset(self, index, value):
Contributor

I'm kind of surprised we're not already getting the length as one of the offsets 🤔

Contributor Author

I think the issue is how we create batches from a chunk of dataframe:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L411

If we have an offsets array [0, 3, 5, 9, 10, 15, 20, 25, 27, 30] with batch_size 3, it gets split into
[0, 3, 5] and [9, 10, 15, 20, 25, 27, 30] - the last offset of the first chunk goes into the new batch as its initial offset. The 2nd iteration gives [9, 10, 15] and [20, 25, 27, 30].

The offsets are then adjusted in this line: https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L438

I thought about changing it, but that would result in the new batch missing its initial offset:
[0, 3, 5, 9], [10, 15, 20, 25, 27, 30].

Note: Currently, the PyTorch dataloader does NOT return the last offset value of a batch.

I think we need to duplicate the boundary offsets (9 and 20 in the example); a rough sketch of that idea is below. I am not sure if there is a cleaner way to achieve it. Without trying to refactor the full logic ( https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L388-L440 ), I felt this was the best way.
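
An illustrative sketch (not the dataloader's actual code) of the boundary-offset duplication described above, splitting a global offsets array into per-batch offsets and rebasing each chunk to start at 0:

offsets = [0, 3, 5, 9, 10, 15, 20, 25, 27, 30]
batch_size = 3

batches = []
for start in range(0, len(offsets) - 1, batch_size):
    chunk = offsets[start:start + batch_size + 1]  # keep the boundary offset in both chunks
    batches.append([o - chunk[0] for o in chunk])  # rebase so each batch starts at 0

# batches -> [[0, 3, 5, 9], [0, 1, 6, 11], [0, 5, 7, 10]]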

Contributor

Yeah, that makes sense. Thanks for the explanation! I can't think of a cleaner way to do it yet either.

Member

I am not sure, if there is a cleaner way to achieve it. Without trying to refactor the full logic

I'm working on refactoring the full logic you referenced here in #104, which may remove the requirement for this method.

That PR (#104) is currently incomplete since it only returns row_lengths (it still needs a final row_lengths -> offsets conversion). The row_lengths representation turned out to be easier to work with for splitting into batches, since the batch split array corresponds to the row lengths rather than the offsets.
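
A minimal sketch of such a row_lengths -> offsets conversion (illustration only, not code from #104):

import numpy as np

row_lengths = np.array([3, 2, 4, 1])
offsets = np.concatenate([[0], np.cumsum(row_lengths)])
# offsets -> [0, 3, 5, 9, 10]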

if self.tensors_as_1d:
    return tensor.view(-1)
else:
    return tensor.view(-1, 1)
Contributor

view() seems good here 👍🏻

Contributor Author

I tried squeeze, but it was incorrect for a single batch:

{"cat": [1]} got squeezed to {"cat": 1}
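
A quick sketch of the difference (illustration only): squeeze drops the batch dimension when the batch holds a single element, while view(-1) keeps it:

import torch

t = torch.tensor([[1]])       # a one-row batch, shape [1, 1]
t.squeeze().shape             # torch.Size([])  -> a scalar, the batch dim is gone
t.view(-1).shape              # torch.Size([1]) -> still a 1-element 1-d tensor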

print(col)
print(feature_tensor)
if tensors_as_1d or "__values" in col or "__offsets" in col or col in spa_lst:
    assert len(list(feature_tensor.shape)) == 1
Contributor

Might be good to add test assertions about the offsets either in this test or in a new test, just to double-check that the resulting offsets are what we expect
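
Something along these lines might do it (a hedged sketch; the values are made up for illustration):

import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7])
offsets = np.array([0, 4, 7])

assert offsets[0] == 0                  # offsets start at zero
assert offsets[-1] == len(values)       # the last offset equals the number of values
assert np.all(np.diff(offsets) >= 0)    # offsets are non-decreasing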

@bschifferer bschifferer added the enhancement New feature or request label Feb 23, 2023
@bschifferer
Contributor Author

I had to modify some unit tests, which I think were incorrect / did not work.

Updates based on the discussion on Slack + the meeting:

  • Default behavior is a 1-d tensor shape
  • Default behavior is that list columns are represented as values and offsets (both TensorFlow and PyTorch)
  • Default behavior is that list columns are NOT tuples anymore and are part of the dictionary
  • Backwards compatibility is dropped to keep the API simple
  1. I dropped nested list column support in TensorFlow. As we moved to the values/offsets structure, the unit test does not work for nested columns. We do not need nested lists yet, and we should enable the feature for both TensorFlow and PyTorch, not only one framework:
    https://github.com/NVIDIA-Merlin/dataloader/pull/101/files#diff-2895e2ef272fe5a536569e29059f5811d4d6d421650818baa7204eae369cda44R107-R112

  2. We test the validator in the TensorFlow script. First, the original model had a softmax activation function although we train with binary cross-entropy. As we had only 1 output logit, the model always predicted 1 (see the short illustration after this list), so that seemed off to me. I increased batch_size and the number of samples to get more reliable metrics (more samples -> better convergence). However, the Keras metrics are not accurate in the test. I increased the threshold, and we need a follow-up ticket for why the Keras metrics are off. In my opinion, this could move to Merlin Models:
    https://github.com/NVIDIA-Merlin/dataloader/pull/101/files#diff-2895e2ef272fe5a536569e29059f5811d4d6d421650818baa7204eae369cda44L416

  3. The unit test with multiple label columns had the same issue. It uses binary cross-entropy, but provides 5 output logits in the 2nd output. The model would interpret these as 5 independent binary targets and expect a target column of shape [batch_size, 5]. I changed the loss function to SparseCategoricalCrossentropy, which fits that use case better:
    https://github.com/NVIDIA-Merlin/dataloader/pull/101/files#diff-2895e2ef272fe5a536569e29059f5811d4d6d421650818baa7204eae369cda44R705
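
A quick numerical illustration (not from the test) of the problem in point 2: a softmax over a single logit always produces 1, regardless of the logit's value.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

softmax(np.array([0.3]))    # -> array([1.])
softmax(np.array([-7.0]))   # -> array([1.])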

@karlhigley karlhigley added this to the Merlin 23.03 milestone Feb 23, 2023
@bschifferer bschifferer changed the title [WIP] update dataloader to provide new output structure Update dataloader to provide new output structure Feb 23, 2023
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



Contributor Author

@bschifferer bschifferer left a comment

I reviewed @oliverholworthy's changes. I guess he modified the logic so that certain functions are not required anymore (adding the last offset or reshaping the tensors). As the unit tests pass, I think the code works correctly.

Co-authored-by: Karl Higley <kmhigley@gmail.com>
@karlhigley karlhigley merged commit 3d1845b into NVIDIA-Merlin:main Mar 15, 2023
@oliverholworthy oliverholworthy linked an issue Mar 27, 2023 that may be closed by this pull request
Labels
breaking, enhancement (New feature or request)

Successfully merging this pull request may close these issues:

Change dataloader to output 1D tensors for scalar features

3 participants