
Dl transforms #41

Merged · 24 commits · Nov 8, 2022

Conversation

jperez999
Collaborator

This PR adds the ability to run a merlin graph of transforms over the batches of data that come out of the data loader. The operator introduced here is the embedding operator, which allows batch-level addition of embedding representations to records.
Follows up on #37, which was closed because it was created while the repo was private.
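As a rough sketch of the idea (the class and method names here are illustrative, not the actual merlin API), a batch-level embedding operator looks up a vector for each id in a batch column and attaches the result to the batch:

```python
import numpy as np

class EmbeddingOperator:
    """Hypothetical sketch of a batch-level embedding operator;
    names are illustrative, not the actual merlin API."""

    def __init__(self, embeddings, id_column, output_column):
        self.embeddings = embeddings        # (num_ids, dim) lookup table
        self.id_column = id_column          # batch column holding ids
        self.output_column = output_column  # column added to the batch

    def transform(self, batch):
        # look up one embedding row per id and attach it to the batch
        ids = batch[self.id_column]
        batch[self.output_column] = self.embeddings[ids]
        return batch

embeddings = np.arange(10 * 4, dtype="float32").reshape(10, 4)
op = EmbeddingOperator(embeddings, "item_id", "item_embedding")
out = op.transform({"item_id": np.array([2, 5])})
# out["item_embedding"] has shape (2, 4): one row per id in the batch
```

Running such an operator after the loader keeps the embedding table out of the model itself, which is the batch-level addition the PR describes.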

# are all operators going to need to know about lists as tuples?
# seems like we could benefit from an object here that encapsulates
# both lists and scalar tensor types?
if self.transforms:
Collaborator Author

We should think about creating a comprehensive "column" class that can be sub-classed into ScalarColumn and ListColumn. This would hide the tuple format behind a df-series-type interface that would be friendlier to the other parts of merlin, i.e. the graph. The use case: what if I want to do some post-dataloader, in-batch processing on a list column? It would be easier to abstract that tuple representation (values, nnz) so the user doesn't have to keep track of all that themselves.
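A minimal sketch of that proposal, assuming the (values, nnz) tuple format described above (all class names are hypothetical, not existing merlin code):

```python
import numpy as np

class Column:
    """Hypothetical base class hiding tensor representations
    behind a small series-like interface."""
    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

class ScalarColumn(Column):
    """One value per row; nothing extra to track."""

class ListColumn(Column):
    """Wraps the (values, nnz) pair used for list features so callers
    don't have to carry the offsets around themselves."""
    def __init__(self, values, nnz):
        super().__init__(values)
        self.nnz = nnz  # number of values in each row

    def __len__(self):
        return len(self.nnz)

    def rows(self):
        # rebuild the per-row lists from the flat values + row lengths
        offsets = np.concatenate([[0], np.cumsum(self.nnz)])
        return [self.values[offsets[i]:offsets[i + 1]]
                for i in range(len(self))]

col = ListColumn(np.array([1, 2, 3, 4, 5]), np.array([2, 3]))
# col.rows() -> [array([1, 2]), array([3, 4, 5])]
```

Downstream code (e.g. graph operators) could then branch on the column type instead of pattern-matching on raw tuples.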

from merlin.schema import ColumnSchema, Schema, Tags


class TFEmbeddingOperator(BaseOperator):
Collaborator Author

Most of these are repeated with small tweaks; it would be nice to converge them so we don't have three operators for the same thing that just take different inputs.

Contributor

Seems like that's the best we can currently do though, so 🤷🏻

Member

Would your proposal for columns remove the need for ColumnSchema?

I'm guessing Tags is significantly different but I could be wrong.

from merlin.schema import ColumnSchema, Schema, Tags


class TorchEmbeddingOperator(BaseOperator):
Collaborator Author

Same as in the tensorflow case: many of the operators are only a little different, but to avoid confusion and let users understand the use cases more clearly, we have kept these operators separate. It would be good to move to a state where we have just one operator for this (as previously stated).
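One possible direction for converging the per-framework operators is to dispatch on the tensor type at lookup time rather than keeping a TF and a Torch class. This is only a sketch under that assumption (the function and its behavior are hypothetical, not part of this PR):

```python
import numpy as np

def lookup_embeddings(ids, table):
    """Hypothetical converged lookup: dispatch on the framework of
    `ids` instead of keeping one operator class per framework.
    Assumes `table` is a tensor of the same framework as `ids`
    (or a numpy array for the fallback path)."""
    module = type(ids).__module__
    if module.startswith("torch"):
        # torch advanced indexing gathers one row per id
        return table[ids]
    if module.startswith("tensorflow"):
        import tensorflow as tf
        return tf.gather(table, ids)
    # numpy fallback: plain fancy indexing
    return np.asarray(table)[np.asarray(ids)]

table = np.arange(8, dtype="float32").reshape(4, 2)
out = lookup_embeddings(np.array([1, 3]), table)
# out -> rows 1 and 3 of the table: [[2., 3.], [6., 7.]]
```

The trade-off is exactly the one raised in the comment: a single entry point is less to maintain, but the per-framework classes make the supported inputs explicit.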



@pytest.fixture(scope="session")
def rev_embedding_ids(embedding_ids, tmpdir_factory):
Collaborator Author

Reverse embeddings are used to ensure that id_lookup is working correctly. In this case the indexes are reversed ([99999:1]), while in embedding_ids above they are [1:99999]. This lets us enumerate the batches, pull out the expected embedding values, and assert they match what came back in each batch.
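The idea behind the reversed fixture can be shown in miniature. Here `id_lookup` is a stand-in for the real lookup (a `searchsorted`-based sketch, not the PR's implementation), and the range is 0..9 instead of 1..99999: because the ids are reversed, a lookup that naively used batch position instead of the id would return the wrong row, so the test catches it.

```python
import numpy as np

def id_lookup(ids, id_index, embeddings):
    """Hypothetical lookup: map each raw id to its row in the
    embedding table via the fixture's id ordering."""
    order = np.argsort(id_index)
    positions = np.searchsorted(id_index, ids, sorter=order)
    return embeddings[order[positions]]

ids = np.arange(10)[::-1]                 # reversed ids, as in the fixture
embeddings = np.arange(10 * 3).reshape(10, 3)

# id 9 lives at row 0, so looking up id 9 must return embeddings[0];
# a position-based (broken) lookup would return embeddings[9] instead
out = id_lookup(np.array([9]), ids, embeddings)
```

With the ascending `embedding_ids` fixture, id and position coincide, so that bug would go undetected; the reversed fixture is what makes the assertion meaningful.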

)


def test_embedding_torch_np_mmap_dl_with_lookup(tmpdir, rev_embedding_ids, np_embeddings_from_pq):
Collaborator Author

Tests are kept separate so that developers and users can quickly understand how the APIs should be used. They could be converged, but that would be more convoluted.

@jperez999 jperez999 self-assigned this Nov 3, 2022
@jperez999 jperez999 added the enhancement New feature or request label Nov 3, 2022
@jperez999 jperez999 added this to the Merlin 22.11 milestone Nov 3, 2022
@github-actions

github-actions bot commented Nov 3, 2022

Documentation preview

https://nvidia-merlin.github.io/dataloader/review/pr-41

@jperez999
Collaborator Author

rerun tests

3 similar comments

merlin/loader/loader_base.py — outdated, resolved
merlin/loader/loader_base.py — outdated, resolved
merlin/loader/loader_base.py — resolved
merlin/loader/ops/embeddings/torch_embedding_op.py — outdated, resolved
merlin/loader/torch.py — outdated, resolved
@karlhigley karlhigley merged commit b6f9a67 into NVIDIA-Merlin:main Nov 8, 2022