Skip to content

Remove URL fragment in lineage IDs #6011

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bentsherman
Copy link
Member

@bentsherman bentsherman commented Apr 25, 2025

This PR removes the use of URL fragments in LIDs, since jq can be used on the command line. The #output shortcut is replaced by adding both WorkflowLaunch and WorkflowRun to the history log.

  • Rename WorkflowRun -> WorkflowLaunch
  • Rename WorkflowOutput -> WorkflowRun
  • Add launch LID / run LID to nextflow lineage list
  • Remove TaskOutput since it is not used
  • Remove the use of URL fragments

@bentsherman bentsherman requested a review from jorgee April 25, 2025 21:39
@bentsherman bentsherman marked this pull request as draft April 25, 2025 21:39
@bentsherman bentsherman marked this pull request as ready for review April 25, 2025 21:39
Copy link

netlify bot commented Apr 25, 2025

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 0193cfe
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/681ec3d0df2f7d0008d91d0b
😎 Deploy Preview https://deploy-preview-6011--nextflow-docs-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@bentsherman
Copy link
Member Author

This PR removes a lot of unnecessary complexity around the LID filesystem that I feel is getting in the way of core use cases.

There are a few tests still failing, mainly because the view command now goes through LinPath instead of LinUtils.query() to resolve the LID, and it is currently trying to return the actual path of FileOutputs instead of the metadata description.

@pditommaso pditommaso marked this pull request as draft April 26, 2025 12:16
@bentsherman
Copy link
Member Author

Before we move this forward, I think we should decide whether we actually want to embed the workflow output value in the metadata. Like I mentioned, a pipeline with a samples output could produce 100,000 samples, which would be saved as a giant list in the metadata.

Alternatively, the user can save the output to an index file (i.e. samplesheet) and just reference that index file in a downstream pipeline, like they already do. We could simply reference this index file in the metadata instead of the contents. I intend to explore this as part of the TraceObserverV2 proposal.

In that case, maybe we could drop the use of URI fragments entirely. Users already have the index file as a drop-in replacement for samplesheets, and traversing the metadata can already be done more effectively via jq or json-path.

So I think my main goal for the 25.04 release is to remove the URI syntax overloading (fragment, query params) in favor of something more familiar and flexible.

@jorgee
Copy link
Contributor

jorgee commented Apr 28, 2025

Alternatively, the user can save the output to an index file (i.e. samplesheet) and just reference that index file in a downstream pipeline, like they already do. We could simply reference this index file in the metadata instead of the contents. I intend to explore this as part of the TraceObserverV2 proposal.

AFAIK, the index file is not mandatory, users need to indicate they want to write the index file, right? In the case, they do not write it, what we should put as output? Maybe we should write the index file in all the cases and publish to the file if indicated or store as lineage metadata.

@jorgee
Copy link
Contributor

jorgee commented Apr 28, 2025

There are a few tests still failing, mainly because the view command now goes through LinPath instead of LinUtils.query() to resolve the LID, and it is currently trying to return the actual path of FileOutputs instead of the metadata description.

I think when view, it should just try to load and check the fragment. In LinPath it is also checking if the path is a subpath of a description but it is not needed in this case.

@bentsherman bentsherman force-pushed the data-lineage-simplify-fragment branch from 4e29bb8 to e897e77 Compare May 3, 2025 00:21
@bentsherman bentsherman changed the title Use LID fragment to access workflow output by name Data lineage finishing touches May 3, 2025
@bentsherman bentsherman added this to the 25.04 milestone May 3, 2025
@bentsherman bentsherman marked this pull request as ready for review May 3, 2025 00:27
@bentsherman bentsherman requested a review from a team as a code owner May 3, 2025 00:27
@bentsherman
Copy link
Member Author

Update this PR based on the latest changes, focusing on removing the use of URL fragments entirely in the LID filesystem. This makes the user experience simpler (don't need to remember the #output hack) and removes a lot of complex code.

Every workflow execution now has a "launch LID" and "run LID". The former is added to the history log when the workflow begins, the latter is added when the workflow completes.

Here's a simple test you can run to get started:

rm -rf .lineage/
nextflow run rnaseq-nf -r lineage -profile conda -resume --labels foo,bar
nextflow lineage list

@jorgee if these changes make sense to you, can you help me fix the tests? I think they are the same ones as before

@bentsherman bentsherman requested a review from jorgee May 3, 2025 00:36
@bentsherman
Copy link
Member Author

Expected output:

$ nextflow lineage list
TIMESTAMP               RUN NAME        SESSION ID                              LAUNCH LID                              RUN LID                               
2025-05-02 19:06:15 CDT stoic_shaw      bc79451f-c573-4b7d-8e7c-697be8d9cefc    lid://cd7197c02ab1250eafc2bf7499715e5f  lid://304c57e48ab6b324715ad2c5ba55b25e

$ nextflow lineage view lid://304c57e48ab6b324715ad2c5ba55b25e
{
  "type": "WorkflowRun",
  "createdAt": "2025-05-02T19:06:16.160559118-05:00",
  "workflowLaunch": "lid://cd7197c02ab1250eafc2bf7499715e5f",
  "output": [
    {
      "type": "Path",
      "name": "summary",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/summary/multiqc_report.html"
    },
    {
      "type": "Collection",
      "name": "samples",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/samples.json"
    }
  ]
}

@bentsherman
Copy link
Member Author

I added the completion status to the WorkflowRun. It can be SUCCEEDED, FAILED, or CANCELLED, consistent with platform terminology

@bentsherman
Copy link
Member Author

I think this changes are valuable enough to merge now.

@jorgee do you have any remaining concerns? You mentioned something about the workflow outputs not having LIDs, but I don't remember exactly. Otherwise if these changes make sense to you, please approve and I will merge later

@jorgee
Copy link
Contributor

jorgee commented May 5, 2025

My main concern was about the two LIDs and the path of the workflow output files. They are based on the LAUNCH LID but the users could expect to have it in the RUN LID.

@bentsherman
Copy link
Member Author

Could it be done the other way? I would expected the final workflow run hash to include the workflow outputs as components, so there would be no way for a workflow output to refer to the final workflow run that is referring to it

@bentsherman bentsherman modified the milestones: 25.04, 25.10 May 6, 2025
@bentsherman bentsherman changed the title Data lineage finishing touches Remove URL fragment in lineage IDs May 6, 2025
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman force-pushed the data-lineage-simplify-fragment branch from 3453af8 to 0193cfe Compare May 10, 2025 03:11
@bentsherman bentsherman marked this pull request as draft May 10, 2025 03:11
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants