Data lineage programmatic API #6003
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
I think we can have a nice unification of the CLI and programmatic API here.
Viewing a single LID:
Querying a collection of LIDs:
It might also make sense to refactor Finally, I think that
I have pushed what I showed today.
future.exceptionally(this.&handlerException)
}

static DataflowWriteChannel queryLineage(String queryString) {
Let's just use a Map here instead of a query string. I would apply the same change to the find command as well. There is no need to add the extra complexity of URL encoding.
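To illustrate the URL-encoding concern, here is a minimal plain-Groovy sketch. It is not Nextflow code: `encodeQuery` and `decodeQuery` are hypothetical helpers written only for this comparison, and the parameter names are made up. It shows that a value such as `foo=bar` has to be escaped and unescaped when it travels in a query string, whereas a Map carries it unchanged.

```groovy
// Hypothetical helpers for illustration only; they are not part of Nextflow.

// Query-string style: every key and value must be URL-encoded,
// otherwise '=' and '&' inside a value would break the parsing.
String encodeQuery(Map<String, String> params) {
    params.collect { k, v ->
        URLEncoder.encode(k, 'UTF-8') + '=' + URLEncoder.encode(v, 'UTF-8')
    }.join('&')
}

// The receiving side has to split and decode the string again.
Map<String, String> decodeQuery(String query) {
    query.tokenize('&').collectEntries { pair ->
        def (k, v) = pair.split('=', 2).collect { URLDecoder.decode(it, 'UTF-8') }
        [(k): v]
    }
}

// Map style: the same parameters are passed as-is, with no escaping step.
def params = [label: 'foo=bar', sample: 'a&b']

println encodeQuery(params)               // escaped form needed for a query string
println decodeQuery(encodeQuery(params))  // round trip back to the original Map
println params                            // what a Map-based API receives directly
```

With a Map argument the encode/decode pair disappears entirely, which is the simplification being suggested here.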
return filePattern instanceof QueryablePath && (filePattern as QueryablePath).hasQuery()
}

private boolean applyQueryablePath0(QueryablePath path) {
We don't need to add query support to fromPath because it will already be supported through queryLineage. Besides, I would like to get away from using URI query params everywhere.
The queryLineage factory should return a channel of metadata objects, so you can chain it with a map operator to extract the actual files.
It seems one of the use cases that would be nice to support is getting outputs annotated with some metadata. This is only possible with a query, so I implemented it in both fromPath and queryLineage to see which is the best option.
I had doubts about what queryLineage should return. The find command returns the lids, not the objects. I returned the LinPaths with the mentioned use case in mind. If the LinPath is a FileOutput, it accesses the real file and checks its integrity. If we return the FileOutput description, users must get the path and check the integrity themselves.
Maybe the best is returning the lid, letting the user do what they want according to the query. If it is cast to Path, it will be converted to a LinPath, and if it is passed to the lineage function, it will be converted to the object.
Maybe the best is returning the lid, letting the user do what they want according to the query.
This is exactly it. The lineage function and queryLineage factory should return just the metadata descriptions. Users can compose these with other operators to do things like get the actual files.
Meanwhile, fromPath should only be used to get individual files.
Made the requested changes:
Usage example:
Latest changes, following our last discussions:
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Give a try to this. I'm not convinced about I find counterintuitive especially when using it with
From our discussion today:
A possible example usage:
channel.queryLineage(workflowRun: '...', annotations: ['my-label', 'foo=bar'])
Focusing on
Real path or the LinPath? If we return the LinPath, integrity is checked when the file is accessed. I think that is a better option than just returning the real path.
Indeed 👍
Maybe I'm misunderstanding. If you return the LinPath, will that just be the JSON metadata-as-a-file or the actual file? Because the user will want to use the actual file.
It is both. When the LinPath URI points to a FileOutput, it is just a wrapper around the real file, including the checksum validation. When it points to other metadata types, it is the JSON metadata-as-a-file.
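The wrapper behaviour described above can be sketched in plain Groovy as follows. This is only an illustration under assumptions: `OutputRecord`, `sha256Hex` and `resolveChecked` are hypothetical names, and SHA-256 is an assumed algorithm. None of this is the actual LinPath implementation.

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

// Hypothetical stand-in for a FileOutput description:
// the real file location plus the checksum recorded for it.
class OutputRecord {
    Path path
    String sha256
}

// Hex-encoded SHA-256 digest of the file content (algorithm is an assumption).
String sha256Hex(Path file) {
    MessageDigest.getInstance('SHA-256')
        .digest(Files.readAllBytes(file))
        .encodeHex()
        .toString()
}

// Resolve a record to the real file, refusing to return it
// when the content no longer matches the recorded checksum.
Path resolveChecked(OutputRecord rec) {
    String actual = sha256Hex(rec.path)
    if (actual != rec.sha256) {
        throw new IllegalStateException('Checksum mismatch for ' + rec.path)
    }
    return rec.path
}

// Demo: record a file, read it back, then modify it and try again.
def tmp = Files.createTempFile('lineage-demo', '.txt')
Files.write(tmp, 'hello'.bytes)
def rec = new OutputRecord(path: tmp, sha256: sha256Hex(tmp))
println "resolved: ${resolveChecked(rec)}"

Files.write(tmp, 'tampered'.bytes)
try {
    resolveChecked(rec)
} catch (IllegalStateException e) {
    println "rejected: ${e.message}"
}
```

Returning the plain real path would leave this check to every caller, which is the argument made above for returning the wrapper instead.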
Perfect 👍
Implemented the latest comments:
Usage example:
Initial fromLineage factory implementation