
Dl transforms #41

Merged · 24 commits · Nov 8, 2022

Conversation

jperez999
Collaborator

This PR adds the ability to run a merlin graph of transforms over the batches of data that come out of the data loader. The operator introduced here is the embedding operator, which allows batch-level addition of embedding representations to records.
Follows up on #37, which was closed because it was created while the repo was private.
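As a rough sketch of the idea (the class and method names here are illustrative, not the actual merlin API), a batch-level embedding operator looks up a vector for each id in a batch column and attaches the result to the batch:

```python
import numpy as np

class EmbeddingOperator:
    """Hypothetical sketch of a batch-level embedding operator;
    names are illustrative, not the actual merlin API."""

    def __init__(self, embeddings, id_column, output_column):
        self.embeddings = embeddings        # (num_ids, dim) lookup table
        self.id_column = id_column          # batch column holding ids
        self.output_column = output_column  # column added to the batch

    def transform(self, batch):
        # look up one embedding row per id and attach it to the batch
        ids = batch[self.id_column]
        batch[self.output_column] = self.embeddings[ids]
        return batch

embeddings = np.arange(10 * 4, dtype="float32").reshape(10, 4)
op = EmbeddingOperator(embeddings, "item_id", "item_embedding")
out = op.transform({"item_id": np.array([2, 5])})
# out["item_embedding"] has shape (2, 4): one row per id in the batch
```

Running such an operator after the loader keeps the embedding table out of the model itself, which is the batch-level addition the PR describes.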

# are all operators going to need to know about lists as tuples?
# seems like we could benefit from an object here that encapsulates
# both lists and scalar tensor types?
if self.transforms:
Collaborator Author

We should think about creating a comprehensive "column" class that can be sub-classed into ScalarColumn and ListColumn. This would hide the tuple format behind a df-series-type interface that would be friendlier to the other parts of merlin, i.e. the graph. The use case: what if I want to do some post-dataloader, in-batch processing on a list column? It would be easier to abstract that tuple representation (values, nnz) so the user doesn't have to keep track of all that themselves.
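A minimal sketch of that proposal, assuming the (values, nnz) tuple format described above (all class names are hypothetical, not existing merlin code):

```python
import numpy as np

class Column:
    """Hypothetical base class hiding tensor representations
    behind a small series-like interface."""
    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

class ScalarColumn(Column):
    """One value per row; nothing extra to track."""

class ListColumn(Column):
    """Wraps the (values, nnz) pair used for list features so callers
    don't have to carry the offsets around themselves."""
    def __init__(self, values, nnz):
        super().__init__(values)
        self.nnz = nnz  # number of values in each row

    def __len__(self):
        return len(self.nnz)

    def rows(self):
        # rebuild the per-row lists from the flat values + row lengths
        offsets = np.concatenate([[0], np.cumsum(self.nnz)])
        return [self.values[offsets[i]:offsets[i + 1]]
                for i in range(len(self))]

col = ListColumn(np.array([1, 2, 3, 4, 5]), np.array([2, 3]))
# col.rows() -> [array([1, 2]), array([3, 4, 5])]
```

Downstream code (e.g. graph operators) could then branch on the column type instead of pattern-matching on raw tuples.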

from merlin.schema import ColumnSchema, Schema, Tags


class TFEmbeddingOperator(BaseOperator):
Collaborator Author

Most of these are repeated with small tweaks; it would be nice to converge them so we don't have three operators for the same thing that just take different inputs.

Contributor

Seems like that's the best we can currently do though, so 🤷🏻

Member

Would your proposal for columns remove the need for ColumnSchema?

I'm guessing Tags is significantly different but I could be wrong.

from merlin.schema import ColumnSchema, Schema, Tags


class TorchEmbeddingOperator(BaseOperator):
Collaborator Author

Same as in the tensorflow case: many of the operators are only a little different, but to avoid confusion and let users understand the use cases more clearly, we have kept these operators separate. It would be good to move to a state where we have just one operator for this (as previously stated).
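One possible direction for converging the per-framework operators is to dispatch on the tensor type at lookup time rather than keeping a TF and a Torch class. This is only a sketch under that assumption (the function and its behavior are hypothetical, not part of this PR):

```python
import numpy as np

def lookup_embeddings(ids, table):
    """Hypothetical converged lookup: dispatch on the framework of
    `ids` instead of keeping one operator class per framework.
    Assumes `table` is a tensor of the same framework as `ids`
    (or a numpy array for the fallback path)."""
    module = type(ids).__module__
    if module.startswith("torch"):
        # torch advanced indexing gathers one row per id
        return table[ids]
    if module.startswith("tensorflow"):
        import tensorflow as tf
        return tf.gather(table, ids)
    # numpy fallback: plain fancy indexing
    return np.asarray(table)[np.asarray(ids)]

table = np.arange(8, dtype="float32").reshape(4, 2)
out = lookup_embeddings(np.array([1, 3]), table)
# out -> rows 1 and 3 of the table: [[2., 3.], [6., 7.]]
```

The trade-off is exactly the one raised in the comment: a single entry point is less to maintain, but the per-framework classes make the supported inputs explicit.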



@pytest.fixture(scope="session")
def rev_embedding_ids(embedding_ids, tmpdir_factory):
Collaborator Author

Reverse embeddings are used to ensure that id_lookup is working correctly. In this case the indexes are reversed ([99999:1]), while in embedding_ids above they are [1:99999]. This lets us enumerate the batches, pull out the expected embedding values, and assert they match what came back in each batch.
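The idea behind the reversed fixture can be shown in miniature. Here `id_lookup` is a stand-in for the real lookup (a `searchsorted`-based sketch, not the PR's implementation), and the range is 0..9 instead of 1..99999: because the ids are reversed, a lookup that naively used batch position instead of the id would return the wrong row, so the test catches it.

```python
import numpy as np

def id_lookup(ids, id_index, embeddings):
    """Hypothetical lookup: map each raw id to its row in the
    embedding table via the fixture's id ordering."""
    order = np.argsort(id_index)
    positions = np.searchsorted(id_index, ids, sorter=order)
    return embeddings[order[positions]]

ids = np.arange(10)[::-1]                 # reversed ids, as in the fixture
embeddings = np.arange(10 * 3).reshape(10, 3)

# id 9 lives at row 0, so looking up id 9 must return embeddings[0];
# a position-based (broken) lookup would return embeddings[9] instead
out = id_lookup(np.array([9]), ids, embeddings)
```

With the ascending `embedding_ids` fixture, id and position coincide, so that bug would go undetected; the reversed fixture is what makes the assertion meaningful.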

)


def test_embedding_torch_np_mmap_dl_with_lookup(tmpdir, rev_embedding_ids, np_embeddings_from_pq):
Collaborator Author

Tests are kept separate so that developers and users can quickly understand how the APIs should be used. They could be converged, but that would be more convoluted.

@jperez999 jperez999 self-assigned this Nov 3, 2022
@jperez999 jperez999 added the enhancement New feature or request label Nov 3, 2022
@jperez999 jperez999 added this to the Merlin 22.11 milestone Nov 3, 2022
@github-actions

github-actions bot commented Nov 3, 2022

Documentation preview

https://nvidia-merlin.github.io/dataloader/review/pr-41

@jperez999
Collaborator Author

rerun tests

3 similar comments

merlin/loader/loader_base.py — outdated, resolved
merlin/loader/loader_base.py — outdated, resolved
merlin/loader/loader_base.py — resolved
merlin/loader/ops/embeddings/torch_embedding_op.py — outdated, resolved
merlin/loader/torch.py — outdated, resolved
@karlhigley karlhigley merged commit b6f9a67 into NVIDIA-Merlin:main Nov 8, 2022