Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix: coverage and genome-build #113

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

fix: coverage and genome-build #113

wants to merge 3 commits into from

Conversation

famosab
Copy link
Collaborator

@famosab famosab commented Jan 24, 2025

Related to #99

Summary by CodeRabbit

Release Notes

  • Configuration Updates

    • Added explicit genome build configuration (GRCh38)
    • Disabled contig renaming for variant calls
    • Updated benchmark configurations to specify coverage status for specific benchmarks
  • Workflow Improvements

    • Enhanced genome build and truth set retrieval logic
    • Added validation for benchmark configurations
    • Improved handling of genome-related settings
  • Validation

    • Added checks for specific benchmark requirements
    • Refined configuration parameter handling

Copy link

coderabbitai bot commented Jan 24, 2025

Warning

Rate limit exceeded

@famosab has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 19 minutes and 22 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 77f354f and 2a31ecb.

📒 Files selected for processing (1)
  • workflow/rules/common.smk (4 hunks)

Walkthrough

This pull request introduces configuration updates across multiple files to enhance genome build and benchmark handling. The changes primarily focus on clarifying genome build settings in config/config.yaml, adding high-coverage flags to benchmark presets in workflow/resources/presets.yaml, and modifying the genome build retrieval logic in workflow/rules/common.smk. The modifications aim to improve configuration specificity and error handling for variant calling and benchmark processes.

Changes

File Change Summary
config/config.yaml - Added genome-build: grch38 in variant-calls section
- Set rename-contigs: false explicitly
workflow/resources/presets.yaml - Added high-coverage: false for giab-na12878-exome and chm-eval benchmarks
workflow/rules/common.smk - Updated get_genome_build() to accept wildcards
- Added validation for benchmark configurations
- Modified genome build and contig retrieval logic

Sequence Diagram

sequenceDiagram
    participant Config as Configuration
    participant Common as Common Rules
    participant Benchmarks as Benchmark Presets

    Config->>Common: Provide genome build
    Config->>Common: Set rename-contigs flag
    Benchmarks->>Common: Specify high-coverage status
    Common->>Common: Validate configuration
    Common-->>Config: Return validated settings
Loading

Possibly related PRs

Suggested reviewers

  • johanneskoester

Poem

🐰 A Rabbit's Config Delight

In YAML's realm, settings take flight,
Genome builds dancing left and right,
Contig names whisper, coverage gleams bright,
With each tweak, our workflow shines tight!

Hop, configure, validate! 🧬


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
workflow/rules/common.smk (2)

Line range hint 27-31: Clarify or remove the grch37 check.
The TODO comment implies that this logic might be obsolete. If genome-build settings at the callset level now supersede the global grch37 flag, consider removing or adjusting this check to avoid confusion.


400-400: Explicitly define a fallback for .get("high-coverage").
If "high-coverage" is omitted from a benchmark, .get("high-coverage") returns None. Provide a default (False or similar) for clarity and to avoid surprises.

config/config.yaml (1)

20-24: Synchronize the comment and actual genome build.
The inline comment says “Set to grch37,” but the actual value is grch38. Clarify or remove the outdated reference to avoid user confusion.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0d3ea9 and bb10504.

📒 Files selected for processing (3)
  • config/config.yaml (1 hunks)
  • workflow/resources/presets.yaml (1 hunks)
  • workflow/rules/common.smk (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Formatting
🔇 Additional comments (7)
workflow/rules/common.smk (5)

107-107: Add error handling for unknown genome builds.
Accessing genome["truth"][get_genome_build(wildcards)] can raise a KeyError if the computed genome build isn’t present. Consider validating the key or raising a clearer exception.


119-119: Repeat the key check for truthsets.
As above, verify that the retrieved key from get_genome_build(wildcards) is present in genome["truth"].


150-151: Validate the callset wildcard usage.
Ensure that wildcards always includes a valid callset and that genome-build is consistently defined, otherwise this might raise a KeyError.


403-403: Same fallback suggestion for coverage retrieval.


413-413: Same fallback suggestion for coverage retrieval.

workflow/resources/presets.yaml (2)

7-7: Check that “high-coverage” is accurately labeled for giab-na12878-exome.
If this preset truly represents lower coverage, the update is correct. Otherwise, revise coverage flags accordingly.


13-13: Same coverage labeling advice for chm-eval.

Comment on lines 340 to 345
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt")
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt")
else:
config["variant-calls"][wildcards.callset]["rename-contigs"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix missing return in the else branch.
The else clause references rename-contigs without returning anything, possibly causing None to be returned. Use a return statement to maintain a consistent function contract.

-    config["variant-calls"][wildcards.callset]["rename-contigs"]
+    return config["variant-calls"][wildcards.callset]["rename-contigs"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt")
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt")
else:
config["variant-calls"][wildcards.callset]["rename-contigs"]
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt")
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]:
return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt")
else:
return config["variant-calls"][wildcards.callset]["rename-contigs"]

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
workflow/rules/common.smk (1)

340-355: Refactor for better maintainability and validation.

The function contains duplicate logic and lacks validation in the else clause.

 def get_rename_contig_file(wildcards):
+    RENAME_PATHS = {
+        "grch37": "../resources/rename-contigs/grch37_ucsc2ensembl.txt",
+        "grch38": "../resources/rename-contigs/grch38_ucsc2ensembl.txt"
+    }
+    callset_config = config["variant-calls"][wildcards.callset]
+    genome_build = callset_config["genome-build"]
+    rename_contigs = callset_config["rename-contigs"]
+    
+    if genome_build in RENAME_PATHS and rename_contigs:
+        return workflow.source_path(RENAME_PATHS[genome_build])
+    elif rename_contigs is None:
+        raise ValueError(f"Missing rename-contigs configuration for callset {wildcards.callset}")
+    else:
+        return rename_contigs
-    if (
-        config["variant-calls"][wildcards.callset]["genome-build"] == "grch37"
-        and config["variant-calls"][wildcards.callset]["rename-contigs"]
-    ):
-        return workflow.source_path(
-            "../resources/rename-contigs/grch37_ucsc2ensembl.txt"
-        )
-    if (
-        config["variant-calls"][wildcards.callset]["genome-build"] == "grch38"
-        and config["variant-calls"][wildcards.callset]["rename-contigs"]
-    ):
-        return workflow.source_path(
-            "../resources/rename-contigs/grch38_ucsc2ensembl.txt"
-        )
-    else:
-        return config["variant-calls"][wildcards.callset]["rename-contigs"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb10504 and 77f354f.

📒 Files selected for processing (1)
  • workflow/rules/common.smk (7 hunks)
🔇 Additional comments (1)
workflow/rules/common.smk (1)

Line range hint 27-34: Enhance error message and verify TODO.

The error handling for genome build compatibility is crucial, but could be improved:

  1. The error message should explain why grch37 must be false.
  2. The TODO comment needs clarification or resolution.

Let's verify if this check is still needed:

✅ Verification successful

Keep the check but improve error message clarity

The check is necessary because the giab-NA12878-exome benchmark data is in GRCh37/hg19 format, and while the workflow can lift over callsets from GRCh37 to GRCh38, it doesn't support lifting over benchmark data. Suggested improvements:

  • Update error message to: "When using giab-NA12878-exome benchmark (which is in GRCh37), set grch37=false in config to allow automatic liftover of callsets to GRCh38"
  • Remove the TODO comment as the check is required
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if giab-NA12878-exome benchmark requires grch37=false
# Test 1: Search for references to giab-NA12878-exome and grch37
rg -l "giab-NA12878-exome|grch37" --type yaml

# Test 2: Check if there are any genome build constraints in benchmark configurations
ast-grep --pattern 'genome-build: $_'

Length of output: 185


Script:

#!/bin/bash
# Check config files content
echo "=== config/config.yaml ==="
cat config/config.yaml
echo -e "\n=== workflow/resources/presets.yaml ==="
cat workflow/resources/presets.yaml

# Search for documentation about this constraint
echo -e "\n=== Searching for documentation ==="
rg -i "giab.*na12878.*genome.*build" --type md
rg -i "na12878.*grch" --type md

# Check for other references to this benchmark
echo -e "\n=== Other references to benchmark ==="
rg "giab-NA12878-exome" -C 2

Length of output: 8527

Comment on lines +410 to +413
high_cov_status = benchmarks[wildcards.benchmark].get("high-coverage")
else:
benchmark = config["variant-calls"][wildcards.callset]["benchmark"]
high_cov_status = benchmarks[benchmark].get("high-coverage", False)
high_cov_status = benchmarks[benchmark].get("high-coverage")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consolidate coverage status handling and add defaults.

The code duplicates the high coverage status retrieval logic and lacks default values, which could lead to unexpected behavior.

+def _get_benchmark_high_coverage_status(benchmark_name):
+    """Helper function to get high coverage status with a default value."""
+    return benchmarks[benchmark_name].get("high-coverage", False)
+
 def get_coverages(wildcards):
     if hasattr(wildcards, "benchmark"):
-        high_cov_status = benchmarks[wildcards.benchmark].get("high-coverage")
+        high_cov_status = _get_benchmark_high_coverage_status(wildcards.benchmark)
     else:
         benchmark = config["variant-calls"][wildcards.callset]["benchmark"]
-        high_cov_status = benchmarks[benchmark].get("high-coverage")
+        high_cov_status = _get_benchmark_high_coverage_status(benchmark)
     if high_cov_status:
         coverages = high_coverages
     else:
         coverages = low_coverages
     return coverages

 def get_coverages_of_callset(callset):
     benchmark = config["variant-calls"][callset]["benchmark"]
-    high_cov_status = benchmarks[benchmark].get("high-coverage")
+    high_cov_status = _get_benchmark_high_coverage_status(benchmark)
     if high_cov_status:
         coverages = high_coverages
     else:
         coverages = low_coverages
     return coverages

Also applies to: 423-423

Comment on lines 150 to 152
def get_genome_build(wildcards):
return config["variant-calls"][wildcards.callset]["genome-build"]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for missing configuration.

The function should handle cases where the required configuration keys are missing to prevent KeyError exceptions.

 def get_genome_build(wildcards):
-    return config["variant-calls"][wildcards.callset]["genome-build"]
+    try:
+        return config["variant-calls"][wildcards.callset]["genome-build"]
+    except KeyError as e:
+        raise ValueError(f"Missing required configuration: {e}") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_genome_build(wildcards):
return config["variant-calls"][wildcards.callset]["genome-build"]
def get_genome_build(wildcards):
try:
return config["variant-calls"][wildcards.callset]["genome-build"]
except KeyError as e:
raise ValueError(f"Missing required configuration: {e}") from e

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant