fix: coverage and genome-build #113

famosab · 2025-01-24T13:25:43Z

Related to #99

Summary by CodeRabbit

Release Notes

Configuration Updates
- Added explicit genome build configuration (GRCh38)
- Disabled contig renaming for variant calls
- Updated benchmark configurations to specify coverage status for specific benchmarks
Workflow Improvements
- Enhanced genome build and truth set retrieval logic
- Added validation for benchmark configurations
- Improved handling of genome-related settings
Validation
- Added checks for specific benchmark requirements
- Refined configuration parameter handling

coderabbitai · 2025-01-24T13:25:51Z

📥 Commits

Reviewing files that changed from the base of the PR and between 77f354f and 2a31ecb.

📒 Files selected for processing (1)

workflow/rules/common.smk (4 hunks)

Walkthrough

This pull request introduces configuration updates across multiple files to enhance genome build and benchmark handling. The changes primarily focus on clarifying genome build settings in config/config.yaml, adding high-coverage flags to benchmark presets in workflow/resources/presets.yaml, and modifying the genome build retrieval logic in workflow/rules/common.smk. The modifications aim to improve configuration specificity and error handling for variant calling and benchmark processes.

Changes

| File | Change Summary |
|---|---|
| `config/config.yaml` | Added `genome-build: grch38` in the variant-calls section; set `rename-contigs: false` explicitly |
| `workflow/resources/presets.yaml` | Added `high-coverage: false` for the `giab-na12878-exome` and `chm-eval` benchmarks |
| `workflow/rules/common.smk` | Updated `get_genome_build()` to accept `wildcards`; added validation for benchmark configurations; modified genome build and contig retrieval logic |

Sequence Diagram

sequenceDiagram
    participant Config as Configuration
    participant Common as Common Rules
    participant Benchmarks as Benchmark Presets

    Config->>Common: Provide genome build
    Config->>Common: Set rename-contigs flag
    Benchmarks->>Common: Specify high-coverage status
    Common->>Common: Validate configuration
    Common-->>Config: Return validated settings

Possibly related PRs

feat: Add heatmap coloring for coverage stratification in fp fn report, hide class label #112: Changes in workflow/resources/datavzrd/fp-fn-per-callset-config.yte.yaml regarding the high_coverage parameter are related to the main PR's updates in config/config.yaml, specifically the introduction of the genome-build and rename-contigs settings, which may influence how coverage is interpreted in the workflow.

Suggested reviewers

johanneskoester

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

workflow/rules/common.smk (2)

Line range hint 27-31: Clarify or remove the grch37 check.
The TODO comment implies that this logic might be obsolete. If genome-build settings at the callset level now supersede the global grch37 flag, consider removing or adjusting this check to avoid confusion.

400-400: Explicitly define a fallback for .get("high-coverage").
If "high-coverage" is omitted from a benchmark, .get("high-coverage") returns None. Provide a default (False or similar) for clarity and to avoid surprises.

config/config.yaml (1)

20-24: Synchronize the comment and actual genome build.
The inline comment says “Set to grch37,” but the actual value is grch38. Clarify or remove the outdated reference to avoid user confusion.
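
To illustrate the `.get("high-coverage")` nitpick above, here is a minimal sketch. The `benchmarks` dictionary is a hypothetical stand-in for the loaded presets, and the `"no-flag"` entry is made up for illustration. Without a default, a missing key yields `None`; because `None` and `False` are both falsy, an `if high_cov_status:` branch behaves the same either way, so an explicit default mostly documents intent.

```python
# Hypothetical stand-in for the presets loaded from workflow/resources/presets.yaml.
benchmarks = {
    "giab-na12878-exome": {"high-coverage": False},
    "no-flag": {},  # made-up entry without a "high-coverage" key
}

implicit = benchmarks["no-flag"].get("high-coverage")         # missing key -> None
explicit = benchmarks["no-flag"].get("high-coverage", False)  # missing key -> False

print(implicit, explicit)                 # None False
print(bool(implicit) == bool(explicit))   # True: both values are falsy
```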

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0d3ea9 and bb10504.

📒 Files selected for processing (3)

config/config.yaml (1 hunks)
workflow/resources/presets.yaml (1 hunks)
workflow/rules/common.smk (7 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Formatting

🔇 Additional comments (7)

workflow/rules/common.smk (5)

107-107: Add error handling for unknown genome builds.
Accessing genome["truth"][get_genome_build(wildcards)] can raise a KeyError if the computed genome build isn’t present. Consider validating the key or raising a clearer exception.

119-119: Repeat the key check for truthsets.
As above, verify that the retrieved key from get_genome_build(wildcards) is present in genome["truth"].

150-151: Validate the callset wildcard usage.
Ensure that wildcards always includes a valid callset and that genome-build is consistently defined, otherwise this might raise a KeyError.

403-403: Same fallback suggestion for coverage retrieval.

413-413: Same fallback suggestion for coverage retrieval.

workflow/resources/presets.yaml (2)

7-7: Check that “high-coverage” is accurately labeled for giab-na12878-exome.
If this preset truly represents lower coverage, the update is correct. Otherwise, revise coverage flags accordingly.

13-13: Same coverage labeling advice for chm-eval.

coderabbitai · 2025-01-24T13:28:24Z

workflow/rules/common.smk

+    if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
+        return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt")
+    if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
+        return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt")
+    else:
+        config["variant-calls"][wildcards.callset]["rename-contigs"]


⚠️ Potential issue

Fix missing return in the else branch.
The else clause references rename-contigs without returning anything, possibly causing None to be returned. Use a return statement to maintain a consistent function contract.

-        config["variant-calls"][wildcards.callset]["rename-contigs"]
+        return config["variant-calls"][wildcards.callset]["rename-contigs"]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

     if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
         return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt")
     if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
         return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt")
     else:
-        config["variant-calls"][wildcards.callset]["rename-contigs"]
+        return config["variant-calls"][wildcards.callset]["rename-contigs"]
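
The effect of the missing `return` can be reproduced in isolation. The two functions below are simplified, hypothetical stand-ins for the `else` branch discussed above, not the workflow's actual code: a bare expression statement is evaluated and discarded, so the function falls through and returns `None`.

```python
def without_return(flag):
    if flag:
        return "path/to/file"
    else:
        flag  # evaluated and discarded; the function then falls off the end


def with_return(flag):
    if flag:
        return "path/to/file"
    else:
        return flag


print(without_return(False))  # None
print(with_return(False))     # False
```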

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

workflow/rules/common.smk (1)

340-355: Refactor for better maintainability and validation.

The function contains duplicate logic and lacks validation in the else clause.

 def get_rename_contig_file(wildcards):
+    RENAME_PATHS = {
+        "grch37": "../resources/rename-contigs/grch37_ucsc2ensembl.txt",
+        "grch38": "../resources/rename-contigs/grch38_ucsc2ensembl.txt"
+    }
+    callset_config = config["variant-calls"][wildcards.callset]
+    genome_build = callset_config["genome-build"]
+    rename_contigs = callset_config["rename-contigs"]
+    
+    if genome_build in RENAME_PATHS and rename_contigs:
+        return workflow.source_path(RENAME_PATHS[genome_build])
+    elif rename_contigs is None:
+        raise ValueError(f"Missing rename-contigs configuration for callset {wildcards.callset}")
+    else:
+        return rename_contigs
-    if (
-        config["variant-calls"][wildcards.callset]["genome-build"] == "grch37"
-        and config["variant-calls"][wildcards.callset]["rename-contigs"]
-    ):
-        return workflow.source_path(
-            "../resources/rename-contigs/grch37_ucsc2ensembl.txt"
-        )
-    if (
-        config["variant-calls"][wildcards.callset]["genome-build"] == "grch38"
-        and config["variant-calls"][wildcards.callset]["rename-contigs"]
-    ):
-        return workflow.source_path(
-            "../resources/rename-contigs/grch38_ucsc2ensembl.txt"
-        )
-    else:
-        return config["variant-calls"][wildcards.callset]["rename-contigs"]
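
The lookup-table variant suggested above can be exercised outside Snakemake with a couple of stubs. In this sketch, `workflow` and `config` are stand-ins invented for illustration (the real objects are provided by Snakemake), and the callset names are made up. Note that the `is None` branch only fires when `rename-contigs` is present with a null value; a missing key would raise `KeyError` on the lookup first.

```python
from types import SimpleNamespace

# Stubs so the function runs outside Snakemake; both are assumptions.
workflow = SimpleNamespace(source_path=lambda rel: f"<source>/{rel}")
config = {
    "variant-calls": {
        "renamed": {"genome-build": "grch38", "rename-contigs": True},
        "as-is": {"genome-build": "grch38", "rename-contigs": False},
        "unset": {"genome-build": "grch38", "rename-contigs": None},
    }
}

RENAME_PATHS = {
    "grch37": "../resources/rename-contigs/grch37_ucsc2ensembl.txt",
    "grch38": "../resources/rename-contigs/grch38_ucsc2ensembl.txt",
}


def get_rename_contig_file(wildcards):
    callset_config = config["variant-calls"][wildcards.callset]
    genome_build = callset_config["genome-build"]
    rename_contigs = callset_config["rename-contigs"]

    if genome_build in RENAME_PATHS and rename_contigs:
        return workflow.source_path(RENAME_PATHS[genome_build])
    elif rename_contigs is None:
        raise ValueError(
            f"Missing rename-contigs configuration for callset {wildcards.callset}"
        )
    else:
        return rename_contigs


print(get_rename_contig_file(SimpleNamespace(callset="renamed")))
print(get_rename_contig_file(SimpleNamespace(callset="as-is")))  # False
```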

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb10504 and 77f354f.

📒 Files selected for processing (1)

workflow/rules/common.smk (7 hunks)

🔇 Additional comments (1)

workflow/rules/common.smk (1)

Line range hint 27-34: Enhance error message and verify TODO.

The error handling for genome build compatibility is crucial, but could be improved:

The error message should explain why grch37 must be false.

The TODO comment needs clarification or resolution.

Let's verify if this check is still needed:

✅ Verification successful

Keep the check but improve error message clarity

The check is necessary because the giab-NA12878-exome benchmark data is in GRCh37/hg19 format, and while the workflow can lift over callsets from GRCh37 to GRCh38, it doesn't support lifting over benchmark data. Suggested improvements:

Update error message to: "When using giab-NA12878-exome benchmark (which is in GRCh37), set grch37=false in config to allow automatic liftover of callsets to GRCh38"

Remove the TODO comment as the check is required
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check if giab-NA12878-exome benchmark requires grch37=false
# Test 1: Search for references to giab-NA12878-exome and grch37
rg -l "giab-NA12878-exome|grch37" --type yaml

# Test 2: Check if there are any genome build constraints in benchmark configurations
ast-grep --pattern 'genome-build: $_'
Length of output: 185

Script:
#!/bin/bash
# Check config files content
echo "=== config/config.yaml ==="
cat config/config.yaml
echo -e "\n=== workflow/resources/presets.yaml ==="
cat workflow/resources/presets.yaml

# Search for documentation about this constraint
echo -e "\n=== Searching for documentation ==="
rg -i "giab.*na12878.*genome.*build" --type md
rg -i "na12878.*grch" --type md

# Check for other references to this benchmark
echo -e "\n=== Other references to benchmark ==="
rg "giab-NA12878-exome" -C 2
Length of output: 8527

coderabbitai · 2025-01-24T13:33:11Z

workflow/rules/common.smk

+        high_cov_status = benchmarks[wildcards.benchmark].get("high-coverage")
    else:
        benchmark = config["variant-calls"][wildcards.callset]["benchmark"]
-        high_cov_status = benchmarks[benchmark].get("high-coverage", False)
+        high_cov_status = benchmarks[benchmark].get("high-coverage")


🛠️ Refactor suggestion

Consolidate coverage status handling and add defaults.

The code duplicates the high coverage status retrieval logic and lacks default values, which could lead to unexpected behavior.

+def _get_benchmark_high_coverage_status(benchmark_name):
+    """Helper function to get high coverage status with a default value."""
+    return benchmarks[benchmark_name].get("high-coverage", False)
+
 def get_coverages(wildcards):
     if hasattr(wildcards, "benchmark"):
-        high_cov_status = benchmarks[wildcards.benchmark].get("high-coverage")
+        high_cov_status = _get_benchmark_high_coverage_status(wildcards.benchmark)
     else:
         benchmark = config["variant-calls"][wildcards.callset]["benchmark"]
-        high_cov_status = benchmarks[benchmark].get("high-coverage")
+        high_cov_status = _get_benchmark_high_coverage_status(benchmark)
     if high_cov_status:
         coverages = high_coverages
     else:
         coverages = low_coverages
     return coverages

 def get_coverages_of_callset(callset):
     benchmark = config["variant-calls"][callset]["benchmark"]
-    high_cov_status = benchmarks[benchmark].get("high-coverage")
+    high_cov_status = _get_benchmark_high_coverage_status(benchmark)
     if high_cov_status:
         coverages = high_coverages
     else:
         coverages = low_coverages
     return coverages

Also applies to: 423-423
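
For a quick check of both lookup paths (by `benchmark` wildcard and by `callset`), the helper can be run against stub data. Everything below except the function names is assumed for illustration: the `"wgs-demo"` benchmark, the callset name, and the contents of `high_coverages` and `low_coverages` are placeholders, and `get_coverages` is condensed compared with the suggestion above.

```python
from types import SimpleNamespace

# Placeholder data standing in for values defined elsewhere in common.smk.
benchmarks = {
    "chm-eval": {"high-coverage": False},
    "wgs-demo": {"high-coverage": True},  # made-up benchmark
}
config = {"variant-calls": {"my-callset": {"benchmark": "wgs-demo"}}}
high_coverages = ["high"]  # placeholder contents
low_coverages = ["low"]    # placeholder contents


def _get_benchmark_high_coverage_status(benchmark_name):
    """Return the high-coverage flag, defaulting to False when it is absent."""
    return benchmarks[benchmark_name].get("high-coverage", False)


def get_coverages(wildcards):
    # Prefer the benchmark wildcard; otherwise resolve it via the callset.
    if hasattr(wildcards, "benchmark"):
        name = wildcards.benchmark
    else:
        name = config["variant-calls"][wildcards.callset]["benchmark"]
    if _get_benchmark_high_coverage_status(name):
        return high_coverages
    return low_coverages


print(get_coverages(SimpleNamespace(benchmark="chm-eval")))  # ['low']
print(get_coverages(SimpleNamespace(callset="my-callset")))  # ['high']
```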

coderabbitai · 2025-01-24T13:33:11Z

workflow/rules/common.smk

+def get_genome_build(wildcards):
+    return config["variant-calls"][wildcards.callset]["genome-build"]



🛠️ Refactor suggestion

Add error handling for missing configuration.

The function should handle cases where the required configuration keys are missing to prevent KeyError exceptions.

 def get_genome_build(wildcards):
-    return config["variant-calls"][wildcards.callset]["genome-build"]
+    try:
+        return config["variant-calls"][wildcards.callset]["genome-build"]
+    except KeyError as e:
+        raise ValueError(f"Missing required configuration: {e}") from e

📝 Committable suggestion

Suggested change

 def get_genome_build(wildcards):
-    return config["variant-calls"][wildcards.callset]["genome-build"]
+    try:
+        return config["variant-calls"][wildcards.callset]["genome-build"]
+    except KeyError as e:
+        raise ValueError(f"Missing required configuration: {e}") from e
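
To see what the wrapped error looks like in practice, the suggested function can be run against a stub `config`. The callset names `"ok"` and `"incomplete"` are invented for this sketch. Since `str()` of a `KeyError` is the repr of the missing key, the message names the absent key in quotes.

```python
from types import SimpleNamespace

# Stub config for illustration; "incomplete" deliberately lacks "genome-build".
config = {"variant-calls": {"ok": {"genome-build": "grch38"}, "incomplete": {}}}


def get_genome_build(wildcards):
    try:
        return config["variant-calls"][wildcards.callset]["genome-build"]
    except KeyError as e:
        raise ValueError(f"Missing required configuration: {e}") from e


print(get_genome_build(SimpleNamespace(callset="ok")))  # grch38

try:
    get_genome_build(SimpleNamespace(callset="incomplete"))
except ValueError as err:
    print(err)  # Missing required configuration: 'genome-build'
```

The message does not say which level of the config was missing (the callset or the `genome-build` key), so a caller may still need the chained `KeyError` for context.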

fix: coverage and genome-build

bb10504

famosab mentioned this pull request Jan 24, 2025

grch37: false can only be set once while grch37: true can be set per callset #99

Open

coderabbitai bot reviewed Jan 24, 2025

fix: get-rename-contig function

77f354f

coderabbitai bot reviewed Jan 24, 2025

fix: general genome build

2a31ecb
