-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
feat: add vaf calculation for strelka results #109
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThe pull request introduces functionality for calculating Variant Allele Frequency (VAF) in a bioinformatics workflow. A new Conda environment for Changes
Sequence DiagramsequenceDiagram
participant Input as Filtered Variants
participant VAFCalc as VAF Calculation Script
participant Output as VAF-Added Variants
Input->>VAFCalc: Process BCF file
VAFCalc->>VAFCalc: Calculate SNV VAF
VAFCalc->>VAFCalc: Calculate INDEL VAF
VAFCalc->>Output: Generate annotated BCF
Assessment against linked issues
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (4)
workflow/scripts/calc-vaf.py (1)
3-4
: Remove hardcoded file paths from comments.The commented file paths appear to be local development paths. Remove them to avoid confusion.
-#vcf = "/Users/famke/01-pm4onco/osf-download/pipeline-results-of-imgag-data/qbic/strelka/tumor_5perc_vs_normal_5perc.strelka.somatic_snvs_VEP.ann.vcf" #snakemake.input -#indel = "/Users/famke/01-pm4onco/osf-download/pipeline-results-of-imgag-data/qbic/strelka/tumor_5perc_vs_normal_5perc.strelka.somatic_indels_VEP.ann.vcf.gz"workflow/rules/eval.smk (1)
94-105
: Enhance the calculate_vaf rule configuration.The rule needs some improvements:
- Add resources section for memory management
- Fix the log path to include {wildcards.callset}
- Add benchmark section for performance tracking
Here's a suggested implementation:
rule calculate_vaf: input: "results/filtered-variants/{callset}.bcf" output: "results/calculate-vaf/{callset}.added-vaf.bcf" log: - "logs/calculate-vaf/" + "logs/calculate-vaf/{callset}.log" + benchmark: + "benchmarks/calculate-vaf/{callset}.tsv" + resources: + mem_mb=4000 conda: "../envs/cyvcf.yaml" script: "../scripts/calc-vaf.py"workflow/rules/common.smk (1)
458-459
: Document the "tbc" condition in get_vaf_status.Add a comment explaining what "tbc" means and why it triggers a True return value.
if vaf_benchmark is None: return False + # Return True for "tbc" (to be calculated) to enable VAF calculation + # for variants that don't have pre-calculated VAF values if vaf_benchmark == "tbc": return Trueworkflow/envs/cyvcf.yaml (1)
1-5
: Enhance the conda environment configuration.The configuration needs improvements:
- Pin the cyvcf2 version for reproducibility
- Add a newline at the end of file
Here's a suggested implementation:
channels: - conda-forge - bioconda dependencies: - - cyvcf2 + - cyvcf2=0.30.22 +🧰 Tools
🪛 yamllint (1.35.1)
[error] 5-5: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.gitignore
(1 hunks)workflow/envs/cyvcf.yaml
(1 hunks)workflow/rules/common.smk
(1 hunks)workflow/rules/eval.smk
(1 hunks)workflow/scripts/calc-vaf.py
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🧰 Additional context used
🪛 yamllint (1.35.1)
workflow/envs/cyvcf.yaml
[error] 5-5: no new line character at the end of file
(new-line-at-end-of-file)
def get_indel_allele_freq(vcf): | ||
for variant in VCF(vcf): | ||
tier1RefCounts = variant.format("TAR")[0,0] | ||
tier1AltCounts = variant.format("TIR")[0,0] | ||
|
||
vaf = tier1AltCounts / (tier1AltCounts + tier1RefCounts) | ||
|
||
print(vaf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Apply similar improvements to INDEL VAF calculation.
The INDEL function needs the same enhancements as the SNV function.
Here's a suggested implementation:
def get_indel_allele_freq(vcf):
+ """Calculate Variant Allele Frequency (VAF) for INDELs.
+
+ Args:
+ vcf: Path to the VCF file containing INDEL variants
+
+ Returns:
+ List of tuples containing (variant_id, vaf)
+ """
+ results = []
for variant in VCF(vcf):
tier1RefCounts = variant.format("TAR")[0,0]
tier1AltCounts = variant.format("TIR")[0,0]
- vaf = tier1AltCounts / (tier1AltCounts + tier1RefCounts)
+ total_counts = tier1AltCounts + tier1RefCounts
+ if total_counts == 0:
+ vaf = 0.0
+ else:
+ vaf = tier1AltCounts / total_counts
- print(vaf)
+ results.append((variant.ID or f"{variant.CHROM}:{variant.POS}", vaf))
+ return results
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_indel_allele_freq(vcf): | |
for variant in VCF(vcf): | |
tier1RefCounts = variant.format("TAR")[0,0] | |
tier1AltCounts = variant.format("TIR")[0,0] | |
vaf = tier1AltCounts / (tier1AltCounts + tier1RefCounts) | |
print(vaf) | |
def get_indel_allele_freq(vcf): | |
"""Calculate Variant Allele Frequency (VAF) for INDELs. | |
Args: | |
vcf: Path to the VCF file containing INDEL variants | |
Returns: | |
List of tuples containing (variant_id, vaf) | |
""" | |
results = [] | |
for variant in VCF(vcf): | |
tier1RefCounts = variant.format("TAR")[0,0] | |
tier1AltCounts = variant.format("TIR")[0,0] | |
total_counts = tier1AltCounts + tier1RefCounts | |
if total_counts == 0: | |
vaf = 0.0 | |
else: | |
vaf = tier1AltCounts / total_counts | |
results.append((variant.ID or f"{variant.CHROM}:{variant.POS}", vaf)) | |
return results |
def get_snv_allele_freq(vcf): | ||
for variant in VCF(vcf): | ||
refCounts = variant.format(variant.REF + "U") | ||
altCounts = variant.format(variant.ALT[0] + "U") | ||
|
||
# TODO: check which value is the correct one from the matrix (this leads to many zero VAF) | ||
tier1RefCounts = refCounts[0, 0] | ||
tier1AltCounts = altCounts[0, 0] | ||
|
||
vaf = tier1AltCounts / (tier1AltCounts + tier1RefCounts) | ||
|
||
print(vaf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Enhance the SNV VAF calculation implementation.
The current implementation needs improvements:
- Add error handling for division by zero
- Add docstring explaining the function
- Store results instead of just printing
- Validate the matrix values as noted in TODO
Here's a suggested implementation:
def get_snv_allele_freq(vcf):
+ """Calculate Variant Allele Frequency (VAF) for SNVs.
+
+ Args:
+ vcf: Path to the VCF file containing SNV variants
+
+ Returns:
+ List of tuples containing (variant_id, vaf)
+ """
+ results = []
for variant in VCF(vcf):
refCounts = variant.format(variant.REF + "U")
altCounts = variant.format(variant.ALT[0] + "U")
- # TODO: check which value is the correct one from the matrix (this leads to many zero VAF)
tier1RefCounts = refCounts[0, 0]
tier1AltCounts = altCounts[0, 0]
- vaf = tier1AltCounts / (tier1AltCounts + tier1RefCounts)
+ total_counts = tier1AltCounts + tier1RefCounts
+ if total_counts == 0:
+ vaf = 0.0
+ else:
+ vaf = tier1AltCounts / total_counts
- print(vaf)
+ results.append((variant.ID or f"{variant.CHROM}:{variant.POS}", vaf))
+ return results
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_snv_allele_freq(vcf): | |
for variant in VCF(vcf): | |
refCounts = variant.format(variant.REF + "U") | |
altCounts = variant.format(variant.ALT[0] + "U") | |
# TODO: check which value is the correct one from the matrix (this leads to many zero VAF) | |
tier1RefCounts = refCounts[0, 0] | |
tier1AltCounts = altCounts[0, 0] | |
vaf = tier1AltCounts / (tier1AltCounts + tier1RefCounts) | |
print(vaf) | |
def get_snv_allele_freq(vcf): | |
"""Calculate Variant Allele Frequency (VAF) for SNVs. | |
Args: | |
vcf: Path to the VCF file containing SNV variants | |
Returns: | |
List of tuples containing (variant_id, vaf) | |
""" | |
results = [] | |
for variant in VCF(vcf): | |
refCounts = variant.format(variant.REF + "U") | |
altCounts = variant.format(variant.ALT[0] + "U") | |
tier1RefCounts = refCounts[0, 0] | |
tier1AltCounts = altCounts[0, 0] | |
total_counts = tier1AltCounts + tier1RefCounts | |
if total_counts == 0: | |
vaf = 0.0 | |
else: | |
vaf = tier1AltCounts / total_counts | |
results.append((variant.ID or f"{variant.CHROM}:{variant.POS}", vaf)) | |
return results |
Closes #100
This is a work in progress. The following steps still have to be solved:
Summary by CodeRabbit
Release Notes
New Features
Configuration Updates
.gitignore
to exclude test and report directories.Workflow Improvements