Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Improve chromsizes File Validation to Catch Formatting Errors Early #458

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ShigrafS
Copy link

Improve chromsizes File Validation to Catch Formatting Errors Early (#209 )

Original Issue: #142

Previously, improperly formatted chromsizes files (e.g., files with spaces instead of tabs or hidden characters) could be silently parsed into a DataFrame, leading to NaN values in the "length" column. This resulted in downstream crashes, such as the ValueError: cannot convert float NaN to integer when attempting to compute bins.

This update improves the validation in the read_chromsizes function by immediately converting the "length" column to numeric values and checking for NaNs. If any NaN values are encountered, a clear ValueError is raised, informing the user to ensure the file is properly formatted as a tab-delimited file with exactly two columns: sequence name and integer length. This proactive validation helps users catch formatting issues earlier in the pipeline, preventing cryptic error messages later.

Error Before Fix:

  • Cryptic error when chromsizes file is not properly formatted.

  • Example error message when misformatted chromsizes file is used:

    ValueError: cannot convert float NaN to integer
    
  • Example of command causing the error:

    cooler cload pairix --nproc 9 --assembly gal5 gal5Allele.chrom.sizes:1000 MNP-DT40-1-3-3-R1-T1__gal5.nodups.pairs.gz MNP-DT40-1-3-3-R1-T1__gal5.1000.cool
    

Cause:

  • When hidden characters or formatting issues were present in the .chrom.sizes file, such as spaces instead of tabs, the file was misinterpreted, leading to NaN values being parsed.
  • This issue was overly permissive, allowing incorrect files to pass unnoticed. For instance, a file that misinterpreted a chrom name (allele1) as a valid sequence with NaN as its length would cause problems downstream.

Solution:

  • Immediate validation of the chromsizes file by converting the "length" column to numeric values and checking for NaNs right away.
  • If any invalid data is found, a clear error is raised to guide the user to correct the issue.

This ensures that errors are caught early, avoiding confusing issues later in the pipeline and improving overall robustness.

@ShigrafS ShigrafS changed the title Improve chromsizes file validation to catch formatting errors early (… Improve chromsizes File Validation to Catch Formatting Errors Early Feb 26, 2025
@nvictus
Copy link
Member

nvictus commented Feb 26, 2025

Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of toy.chrom.sizes.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants