-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
fix: coverage and genome-build #113
base: main
Are you sure you want to change the base?
Conversation
Warning Rate limit exceeded@famosab has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 19 minutes and 22 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughThis pull request introduces configuration updates across multiple files to enhance genome build and benchmark handling. The changes primarily focus on clarifying genome build settings in Changes
Sequence DiagramsequenceDiagram
participant Config as Configuration
participant Common as Common Rules
participant Benchmarks as Benchmark Presets
Config->>Common: Provide genome build
Config->>Common: Set rename-contigs flag
Benchmarks->>Common: Specify high-coverage status
Common->>Common: Validate configuration
Common-->>Config: Return validated settings
Possibly related PRs
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
workflow/rules/common.smk (2)
Line range hint
27-31
: Clarify or remove thegrch37
check.
The TODO comment implies that this logic might be obsolete. Ifgenome-build
settings at the callset level now supersede the globalgrch37
flag, consider removing or adjusting this check to avoid confusion.
400-400
: Explicitly define a fallback for.get("high-coverage")
.
If"high-coverage"
is omitted from a benchmark,.get("high-coverage")
returnsNone
. Provide a default (False
or similar) for clarity and to avoid surprises.config/config.yaml (1)
20-24
: Synchronize the comment and actual genome build.
The inline comment says “Set to grch37,” but the actual value isgrch38
. Clarify or remove the outdated reference to avoid user confusion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
config/config.yaml
(1 hunks)workflow/resources/presets.yaml
(1 hunks)workflow/rules/common.smk
(7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Formatting
🔇 Additional comments (7)
workflow/rules/common.smk (5)
107-107
: Add error handling for unknown genome builds.
Accessinggenome["truth"][get_genome_build(wildcards)]
can raise aKeyError
if the computed genome build isn’t present. Consider validating the key or raising a clearer exception.
119-119
: Repeat the key check fortruthsets
.
As above, verify that the retrieved key fromget_genome_build(wildcards)
is present ingenome["truth"]
.
150-151
: Validate the callset wildcard usage.
Ensure thatwildcards
always includes a validcallset
and thatgenome-build
is consistently defined, otherwise this might raise aKeyError
.
403-403
: Same fallback suggestion for coverage retrieval.
413-413
: Same fallback suggestion for coverage retrieval.workflow/resources/presets.yaml (2)
7-7
: Check that “high-coverage” is accurately labeled for giab-na12878-exome.
If this preset truly represents lower coverage, the update is correct. Otherwise, revise coverage flags accordingly.
13-13
: Same coverage labeling advice for chm-eval.
workflow/rules/common.smk
Outdated
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]: | ||
return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt") | ||
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]: | ||
return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt") | ||
else: | ||
config["variant-calls"][wildcards.callset]["rename-contigs"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix missing return
in the else branch.
The else clause references rename-contigs
without returning anything, possibly causing None
to be returned. Use a return statement to maintain a consistent function contract.
- config["variant-calls"][wildcards.callset]["rename-contigs"]
+ return config["variant-calls"][wildcards.callset]["rename-contigs"]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]: | |
return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt") | |
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]: | |
return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt") | |
else: | |
config["variant-calls"][wildcards.callset]["rename-contigs"] | |
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" and config["variant-calls"][wildcards.callset]["rename-contigs"]: | |
return workflow.source_path("../resources/rename-contigs/grch37_ucsc2ensembl.txt") | |
if config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" and config["variant-calls"][wildcards.callset]["rename-contigs"]: | |
return workflow.source_path("../resources/rename-contigs/grch38_ucsc2ensembl.txt") | |
else: | |
return config["variant-calls"][wildcards.callset]["rename-contigs"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
workflow/rules/common.smk (1)
340-355
: Refactor for better maintainability and validation.The function contains duplicate logic and lacks validation in the else clause.
def get_rename_contig_file(wildcards): + RENAME_PATHS = { + "grch37": "../resources/rename-contigs/grch37_ucsc2ensembl.txt", + "grch38": "../resources/rename-contigs/grch38_ucsc2ensembl.txt" + } + callset_config = config["variant-calls"][wildcards.callset] + genome_build = callset_config["genome-build"] + rename_contigs = callset_config["rename-contigs"] + + if genome_build in RENAME_PATHS and rename_contigs: + return workflow.source_path(RENAME_PATHS[genome_build]) + elif rename_contigs is None: + raise ValueError(f"Missing rename-contigs configuration for callset {wildcards.callset}") + else: + return rename_contigs - if ( - config["variant-calls"][wildcards.callset]["genome-build"] == "grch37" - and config["variant-calls"][wildcards.callset]["rename-contigs"] - ): - return workflow.source_path( - "../resources/rename-contigs/grch37_ucsc2ensembl.txt" - ) - if ( - config["variant-calls"][wildcards.callset]["genome-build"] == "grch38" - and config["variant-calls"][wildcards.callset]["rename-contigs"] - ): - return workflow.source_path( - "../resources/rename-contigs/grch38_ucsc2ensembl.txt" - ) - else: - return config["variant-calls"][wildcards.callset]["rename-contigs"]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
workflow/rules/common.smk
(7 hunks)
🔇 Additional comments (1)
workflow/rules/common.smk (1)
Line range hint
27-34
: Enhance error message and verify TODO.The error handling for genome build compatibility is crucial, but could be improved:
- The error message should explain why grch37 must be false.
- The TODO comment needs clarification or resolution.
Let's verify if this check is still needed:
✅ Verification successful
Keep the check but improve error message clarity
The check is necessary because the giab-NA12878-exome benchmark data is in GRCh37/hg19 format, and while the workflow can lift over callsets from GRCh37 to GRCh38, it doesn't support lifting over benchmark data. Suggested improvements:
- Update error message to: "When using giab-NA12878-exome benchmark (which is in GRCh37), set grch37=false in config to allow automatic liftover of callsets to GRCh38"
- Remove the TODO comment as the check is required
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check if giab-NA12878-exome benchmark requires grch37=false # Test 1: Search for references to giab-NA12878-exome and grch37 rg -l "giab-NA12878-exome|grch37" --type yaml # Test 2: Check if there are any genome build constraints in benchmark configurations ast-grep --pattern 'genome-build: $_'Length of output: 185
Script:
#!/bin/bash # Check config files content echo "=== config/config.yaml ===" cat config/config.yaml echo -e "\n=== workflow/resources/presets.yaml ===" cat workflow/resources/presets.yaml # Search for documentation about this constraint echo -e "\n=== Searching for documentation ===" rg -i "giab.*na12878.*genome.*build" --type md rg -i "na12878.*grch" --type md # Check for other references to this benchmark echo -e "\n=== Other references to benchmark ===" rg "giab-NA12878-exome" -C 2Length of output: 8527
high_cov_status = benchmarks[wildcards.benchmark].get("high-coverage") | ||
else: | ||
benchmark = config["variant-calls"][wildcards.callset]["benchmark"] | ||
high_cov_status = benchmarks[benchmark].get("high-coverage", False) | ||
high_cov_status = benchmarks[benchmark].get("high-coverage") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consolidate coverage status handling and add defaults.
The code duplicates the high coverage status retrieval logic and lacks default values, which could lead to unexpected behavior.
+def _get_benchmark_high_coverage_status(benchmark_name):
+ """Helper function to get high coverage status with a default value."""
+ return benchmarks[benchmark_name].get("high-coverage", False)
+
def get_coverages(wildcards):
if hasattr(wildcards, "benchmark"):
- high_cov_status = benchmarks[wildcards.benchmark].get("high-coverage")
+ high_cov_status = _get_benchmark_high_coverage_status(wildcards.benchmark)
else:
benchmark = config["variant-calls"][wildcards.callset]["benchmark"]
- high_cov_status = benchmarks[benchmark].get("high-coverage")
+ high_cov_status = _get_benchmark_high_coverage_status(benchmark)
if high_cov_status:
coverages = high_coverages
else:
coverages = low_coverages
return coverages
def get_coverages_of_callset(callset):
benchmark = config["variant-calls"][callset]["benchmark"]
- high_cov_status = benchmarks[benchmark].get("high-coverage")
+ high_cov_status = _get_benchmark_high_coverage_status(benchmark)
if high_cov_status:
coverages = high_coverages
else:
coverages = low_coverages
return coverages
Also applies to: 423-423
workflow/rules/common.smk
Outdated
def get_genome_build(wildcards): | ||
return config["variant-calls"][wildcards.callset]["genome-build"] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for missing configuration.
The function should handle cases where the required configuration keys are missing to prevent KeyError exceptions.
def get_genome_build(wildcards):
- return config["variant-calls"][wildcards.callset]["genome-build"]
+ try:
+ return config["variant-calls"][wildcards.callset]["genome-build"]
+ except KeyError as e:
+ raise ValueError(f"Missing required configuration: {e}") from e
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_genome_build(wildcards): | |
return config["variant-calls"][wildcards.callset]["genome-build"] | |
def get_genome_build(wildcards): | |
try: | |
return config["variant-calls"][wildcards.callset]["genome-build"] | |
except KeyError as e: | |
raise ValueError(f"Missing required configuration: {e}") from e |
Related to #99
Summary by CodeRabbit
Release Notes
Configuration Updates
Workflow Improvements
Validation