-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Dl transforms #41
Dl transforms #41
Conversation
… into dl-transforms
# are all operators going to need to know about lists as tuples? | ||
# seems like we could benefit from an object here that encapsulates | ||
# both lists and scalar tensor types? | ||
if self.transforms: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should think about creating a comprehensive "column" class that can be sub-classed to ScalarColumn and ListColumn. This will hide the tuple format behind a df series type interface that will be more friendly to the other parts of merlin, i.e. the graph. The use case is what if I want to do some after dataloader inbatch processing to a list column. It will be easier to abstract that tuple representation (values, nnz) and allow the user to not have to worry about keeping track of all that.
from merlin.schema import ColumnSchema, Schema, Tags | ||
|
||
|
||
class TFEmbeddingOperator(BaseOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these are repeated with small tweaks, would be nice to be able to converge so we dont have three operators for the same thing just using different inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like that's the best we can currently do though, so 🤷🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would your proposal for columns remove the need for ColumnSchema?
I'm guessing Tags is significantly different but I could be wrong.
from merlin.schema import ColumnSchema, Schema, Tags | ||
|
||
|
||
class TorchEmbeddingOperator(BaseOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as in tensorflow case, so many of the operators are just a little different, but to avoid confusions and allow users to understand more clearly uses and use cases we have kept these operators separate. Would be good to move to a state where we just have one operator for this (as previously stated).
|
||
|
||
@pytest.fixture(scope="session") | ||
def rev_embedding_ids(embedding_ids, tmpdir_factory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverse embeddings is used to ensure that id_lookup is working correctly, in this case the indexes are reversed, [99999:1] , In embedding_ids above its [1:99999]. This allows us to use enumeration of batches to pull out the correct (what should be in the embeddings) values and assert they are what came back in each batch.
) | ||
|
||
|
||
def test_embedding_torch_np_mmap_dl_with_lookup(tmpdir, rev_embedding_ids, np_embeddings_from_pq): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are kept separate to allow developers and users to quickly understand how the APIs should be used. Can be converged but would be more convoluted.
Documentation preview |
rerun tests |
3 similar comments
rerun tests |
rerun tests |
rerun tests |
from merlin.schema import ColumnSchema, Schema, Tags | ||
|
||
|
||
class TFEmbeddingOperator(BaseOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like that's the best we can currently do though, so 🤷🏻
This PR adds the ability to run a merlin graph transforms over the batches of data that come out of the data loader. Operator introduced here is the embedding operators. Allowing for batch level additions of the embedding representations for records.
From previous #37, which was closed because it was created while repo was private.