Data lineage programmatic API #6003

Merged
merged 24 commits into master from lineage-factory
May 2, 2025

jorgee (Contributor) commented Apr 24, 2025

Initial fromLineage factory implementation

jorgee added 2 commits April 24, 2025 14:56
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>

bentsherman changed the title from "Lineage factory channel factory" to "Data lineage channel factories" Apr 25, 2025
bentsherman (Member) commented Apr 25, 2025

I think we can have a nice unification of the CLI and programmatic API here

Viewing a single LID:

  • CLI: view lid://<hash>[/<path>] -> use jq to inspect further
  • API: channel.fromLineage('lid://<hash>[/<path>]') -> use json-path param to inspect further

Querying a collection of LIDs:

  • CLI: find <name=value> <name=value> ...
  • API: channel.queryLineage(foo: 'bar', baz: 'qux') -> returns queue channel of items, use json-path param to inspect further

This analysis suggests that the json-path functionality should be provided as a standalone function so that it can be used in different ways. You can add it to Nextflow.groovy. Actually you could just add it as an extra param to both APIs if that would be easier.

It might also make sense to refactor channel.fromLineage as a function instead of a channel factory, e.g. lineage(lid), because as a channel factory it will always return a value channel which is a needless constraint. The queryLineage factory makes sense because it can emit results while it's querying.

Finally, I think that find / queryLineage should really just use key-value pairs, rather than the URI query syntax which is unnecessary and error-prone
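
To illustrate why a URI query string can be error-prone compared with plain key-value pairs, here is a minimal standalone sketch in Python (not Nextflow code). The helper name `parse_query_string` and the sample annotation value are assumptions made for illustration only; they do not come from the Nextflow code base.

```python
from urllib.parse import parse_qsl, urlencode


def parse_query_string(query: str) -> dict:
    """Parse a URI-style query string into a flat dict of criteria (hypothetical helper)."""
    return dict(parse_qsl(query, keep_blank_values=True))


# Intended criteria: one annotation whose value happens to contain '&' and '='.
wanted = {"annotations.key": "experiment", "annotations.value": "a&b=c"}

# Hand-written query string without URL encoding: the value is silently split
# into two criteria instead of raising an error.
naive = "annotations.key=experiment&annotations.value=a&b=c"
naive_result = parse_query_string(naive)

# The criteria survive only if the caller remembers to URL-encode them first.
encoded = urlencode(wanted)
encoded_result = parse_query_string(encoded)

# With a key-value map there is no encoding step to forget: `wanted` is the query.
```

In this sketch the unencoded string parses without complaint but yields different criteria than intended, which is the kind of mistake that passing a map directly avoids.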

jorgee added 2 commits April 25, 2025 21:11
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
jorgee (Contributor, Author) commented Apr 25, 2025

I have pushed what I showed today.

  • Files published with publishDir are included in the outputs
  • fromLineage channel factory: it should return the same as the CLI view command.
    Usage: Channel.fromLineage("lid://xxxx")
  • queryLineage channel factory: it returns a channel with the LinPaths matching the query string, almost the same as the CLI find command.
    Usage: Channel.queryLineage("type=FileOutput&annotations.value=test")

future.exceptionally(this.&handlerException)
}

static DataflowWriteChannel queryLineage(String queryString) {
Member

Let's just use a Map here instead of query string. I would apply the same change to the find command as well. There is no need to add the extra complexity of URL encoding

return filePattern instanceof QueryablePath && (filePattern as QueryablePath).hasQuery()
}

private boolean applyQueryablePath0(QueryablePath path) {
Member

We don't need to add query support to fromPath because it will already be supported through queryLineage. Besides I would like to get away from using URI query params everywhere.

The queryLineage factory should return a channel of metadata objects, so you can chain it with a map operator to extract the actual files

jorgee (Contributor, Author) Apr 28, 2025

It seems one of the use cases that would be nice to support is to get outputs annotated with some metadata. This is only possible to do with a query, so I implemented it within fromPath and queryLineage to see what's the best option.

I had doubts about what queryLineage should return. The find command returns the LIDs, not the objects. I returned the LinPaths with the mentioned use case in mind. If the LinPath points to a FileOutput, it accesses the real file and checks its integrity. If we return the FileOutput description instead, users must get the path and check the integrity themselves.

Maybe the best is returning the lid, letting the user do what they want according to the query. If it is cast to Path, it will be converted to a LinPath, and if it is passed to the lineage function, it will be converted to the object.

Member

Maybe the best is returning the lid, letting the user do what they want according to the query.

This is exactly it. The lineage function and queryLineage factory should return just the metadata descriptions. Users can compose these with other operators to do things like get the actual files.

Meanwhile, fromPath should only be used to get individual files.

bentsherman changed the title from "Data lineage channel factories" to "Data lineage programmatic API" Apr 28, 2025
jorgee added 2 commits April 29, 2025 11:46
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
jorgee (Contributor, Author) commented Apr 29, 2025

Made the requested changes:

  • fromLineage factory changed to a lineage operator (still using fragments, to be merged with "Remove URL fragment in lineage IDs" #6011)
  • queryLineage factory and find command moved from query string to key-value map
  • removed the fromPath and publishDir changes

Usage example:

process ls {
    input:
    path('file')

    output:
    stdout

    script:
    """
    ls -l file
    """
}

workflow {
    // Print lineage metadata
    channel.of("lid://xxxx") | lineage | view

    // Get outputs containing a certain annotation (e.g. experiment=test) and use them in a task
    channel.queryLineage('type': 'FileOutput', 'annotations.key': 'experiment', 'annotations.value': params.value) | ls | view

    // Query lineage metadata and print it
    channel.queryLineage('type': 'WorkflowOutput', 'workflowRun': 'lid://xxxx') | lineage | view
}

jorgee and others added 5 commits April 29, 2025 12:18
jorgee (Contributor, Author) commented Apr 30, 2025

Latest changes, following the last discussions:

  • lineage changed from an operator to a function. It also returns objects instead of Strings with JSON serializations.
  • Removed support for queries in view and fromPath. If the user adds a query to a LinPath URI, it is ignored and a warning message is printed.
  • The fragment is also validated at LinPath creation to detect errors earlier. Previously, incorrect fragments were only detected when accessing the LinPath.

@bentsherman bentsherman self-requested a review April 30, 2025 13:24
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
pditommaso (Member)

I gave this a try and I'm not convinced about lineage either as a function or as an operator.

I find it counterintuitive, especially when using it with queryLineage. I would expect the latter to emit map objects corresponding to the matched metadata. It could even be interesting to give an option to decide the object format, e.g. string (JSON), Path or Map.

bentsherman (Member)

From our discussion today:

  • remove the lineage function
  • refactor queryLineage to always return Channel<Path>
    • allows the following fields from FileOutput as named args
      • workflowRun
      • taskRun
      • annotations
    • return the real paths associated with FileOutput records
    • multiple annotations are treated as AND

A possible example usage:

channel.queryLineage(workflowRun: '...', annotations: ['my-label', 'foo=bar'])

Focusing on FileOutput allows us to document a specific set of keys that can be used as named args to the factory.
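
The matching rules listed above (only FileOutput records, optional workflowRun and taskRun, multiple annotations treated as AND) can be sketched as follows. This is a standalone Python illustration under assumed data shapes, not the Nextflow implementation: the function `matches_file_output`, the record dictionaries and the `lid://run1` identifier are all hypothetical.

```python
from typing import Iterable, Optional


def matches_file_output(
    record: dict,
    workflow_run: Optional[str] = None,
    task_run: Optional[str] = None,
    annotations: Iterable[str] = (),
) -> bool:
    """Return True if a (hypothetical) lineage record satisfies every given criterion."""
    # Only FileOutput records are considered at all.
    if record.get("type") != "FileOutput":
        return False
    # Criteria that were not supplied are simply skipped.
    if workflow_run is not None and record.get("workflowRun") != workflow_run:
        return False
    if task_run is not None and record.get("taskRun") != task_run:
        return False
    # AND semantics: every requested annotation must be present on the record.
    return set(annotations).issubset(record.get("annotations", []))


# Hypothetical records, for illustration only.
records = [
    {"type": "FileOutput", "path": "/data/a.txt", "workflowRun": "lid://run1",
     "annotations": ["my-label", "foo=bar"]},
    {"type": "FileOutput", "path": "/data/b.txt", "workflowRun": "lid://run1",
     "annotations": ["my-label"]},
    {"type": "WorkflowOutput", "workflowRun": "lid://run1"},
]

selected = [
    r["path"]
    for r in records
    if matches_file_output(r, workflow_run="lid://run1",
                           annotations=["my-label", "foo=bar"])
]
```

Here only the first record is selected, because the second one lacks `foo=bar` and the third is not a FileOutput.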

jorgee (Contributor, Author) commented May 2, 2025

  • return the real paths associated with FileOutput records

The real path or the LinPath? If we return the LinPath, it checks the integrity when the file is accessed. I think that is a better option than just returning the real path.

pditommaso (Member)

Indeed 👍

bentsherman (Member)

Maybe I'm misunderstanding. If you return the LinPath, will that just be the JSON metadata-as-a-file or the actual file? Because the user will want to use the actual file

jorgee (Contributor, Author) commented May 2, 2025

It is both. When the LinPath URI points to a FileOutput, it is just a wrapper around the real file that includes checksum validation. When it points to other metadata types, it is the JSON metadata-as-a-file.
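
The idea of a path that wraps the real file and validates its checksum on access can be sketched like this. It is a standalone Python illustration, not the LinPath implementation: the `CheckedFile` class, the `ChecksumMismatch` exception and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


class ChecksumMismatch(Exception):
    """Raised when the file content no longer matches the recorded checksum."""


class CheckedFile:
    """Hypothetical wrapper around a real file plus the checksum recorded for it."""

    def __init__(self, real_path, expected_sha256: str):
        self.real_path = Path(real_path)
        self.expected_sha256 = expected_sha256

    def read_bytes(self) -> bytes:
        # Integrity is verified at access time, not when the wrapper is created.
        data = self.real_path.read_bytes()
        actual = hashlib.sha256(data).hexdigest()
        if actual != self.expected_sha256:
            raise ChecksumMismatch(
                f"{self.real_path}: expected {self.expected_sha256}, got {actual}"
            )
        return data
```

A caller uses the wrapper like the underlying file and gets an error instead of silently reading content that changed after it was recorded, which is the advantage over returning the bare real path.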

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
bentsherman (Member)

Perfect 👍

jorgee added 3 commits May 2, 2025 16:01
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
jorgee (Contributor, Author) commented May 2, 2025

Implemented the latest comments:

  • lineage function removed
  • queryLineage refactored to return Channel<Path>. Allowed params: workflowRun, taskRun, and labels
  • multiple labels are treated as AND
  • Conflicts with master resolved
  • Documentation updated

Usage Example:

channel.queryLineage(workflowRun: 'lid://xxxx', taskRun: 'lid://xxx', labels: ['my-label', 'foo=bar'])

jorgee and others added 5 commits May 2, 2025 17:44
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman marked this pull request as ready for review May 2, 2025 20:13
@bentsherman bentsherman requested review from a team as code owners May 2, 2025 20:13
@bentsherman bentsherman merged commit 85b9d00 into master May 2, 2025
23 checks passed
@bentsherman bentsherman deleted the lineage-factory branch May 2, 2025 20:50
3 participants