Improve chromsizes File Validation to Catch Formatting Errors Early #458

ShigrafS · 2025-02-26T10:29:12Z

Improve chromsizes File Validation to Catch Formatting Errors Early (#209)

Original Issue: #142

Previously, improperly formatted chromsizes files (e.g., files with spaces instead of tabs, or with hidden characters) could be silently parsed into a DataFrame with NaN values in the "length" column. This caused downstream crashes, such as ValueError: cannot convert float NaN to integer when computing bins.

This update tightens validation in the read_chromsizes function: the "length" column is converted to numeric values immediately after parsing and checked for NaNs. If any are found, a ValueError is raised that tells the user to provide a tab-delimited file with exactly two columns: sequence name and integer length. Formatting problems are therefore caught at read time instead of surfacing as cryptic errors later in the pipeline.
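A minimal sketch of the check described above, assuming a pandas-based reader. The function name `read_chromsizes_checked` and the error wording are illustrative and are not the code in this PR; the sketch also handles only plain two-column input.

```python
import io

import pandas as pd


def read_chromsizes_checked(source):
    """Hypothetical stand-in for the validated reader (not the PR's code).

    `source` is a path or file-like object holding tab-delimited
    "name<TAB>length" rows with no header.
    """
    table = pd.read_csv(
        source,
        sep="\t",
        names=["name", "length"],
        dtype={"name": str},
    )
    # Coerce anything that is not a number to NaN so it can be detected.
    table["length"] = pd.to_numeric(table["length"], errors="coerce")
    if table["length"].isnull().any():
        raise ValueError(
            "Invalid chromsizes input: expected a tab-delimited file with "
            "exactly two columns (sequence name, integer length)."
        )
    return table.set_index("name")["length"].astype("int64")


# Well-formed input parses to integer lengths.
sizes = read_chromsizes_checked(io.StringIO("chr1\t1000\nchr2\t500\n"))

# Space-delimited input is rejected up front.
try:
    read_chromsizes_checked(io.StringIO("chr1 1000\nchr2 500\n"))
except ValueError as exc:
    print(f"rejected: {exc}")
```

Coercing with `errors="coerce"` and then testing for NaN covers both failure modes with one check: a missing second column and a second column that is not numeric.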

Error Before Fix:

A misformatted chromsizes file produced a cryptic error message:
```
ValueError: cannot convert float NaN to integer
```

Example of command causing the error:

cooler cload pairix --nproc 9 --assembly gal5 gal5Allele.chrom.sizes:1000 MNP-DT40-1-3-3-R1-T1__gal5.nodups.pairs.gz MNP-DT40-1-3-3-R1-T1__gal5.1000.cool

Cause:

When the .chrom.sizes file contained hidden characters or formatting problems, such as spaces instead of tabs, it was misparsed and the lengths were read as NaN.
The parser was overly permissive and let malformed files pass unnoticed. For instance, a malformed line could be accepted as a valid sequence named allele1 with NaN as its length, which then caused problems downstream.
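The failure mode can be reproduced with pandas alone. This is a sketch under assumptions: the sample rows are made up, and the `read_csv` arguments are a simplified approximation of a tab-delimited reader, not a copy of the library's call.

```python
import io

import pandas as pd

# Made-up content that uses spaces where tabs belong.
broken = "allele1 1000\nallele2 500\n"

table = pd.read_csv(
    io.StringIO(broken),
    sep="\t",
    names=["name", "length"],
    dtype={"name": str},
)

# With no tab to split on, each whole line lands in "name" and
# "length" is filled with NaN instead of raising an error.
print(table)
nan_count = int(table["length"].isna().sum())

# Converting such a value to an integer later is what fails.
try:
    int(table["length"].iloc[0])
    conversion_error = None
except ValueError as exc:
    conversion_error = str(exc)
print(conversion_error)
```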

Solution:

Immediate validation of the chromsizes file by converting the "length" column to numeric values and checking for NaNs right away.
If any invalid data is found, a clear error is raised to guide the user to correct the issue.

This ensures that errors are caught early, avoiding confusing issues later in the pipeline and improving overall robustness.


nvictus · 2025-02-26T19:50:50Z

Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of toy.chrom.sizes.
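One possible shape for such a test, as a rough sketch. The helper name `raises_value_error_on_broken_file` is hypothetical, and the broken content is written inline rather than derived from toy.chrom.sizes; the helper accepts any reader callable so it does not assume a particular import path.

```python
import os
import tempfile


def raises_value_error_on_broken_file(reader):
    """Return True if `reader(path)` raises ValueError for a broken file.

    Hypothetical helper: `reader` is any callable that takes a file path.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "broken.chrom.sizes")
        with open(path, "w") as handle:
            # Spaces instead of tabs: the malformed case from this PR.
            handle.write("chr1 1000\nchr2 500\n")
        try:
            reader(path)
        except ValueError:
            return True
        return False
```

In the test suite this would be called with the real reader (for example `read_chromsizes`, wherever it is imported from) inside a test function that asserts the result is true.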

Improve chromsizes file validation to catch formatting errors early (o…

f093a77

…pen2c#209)

ShigrafS changed the title ~~Improve chromsizes file validation to catch formatting errors early (…~~ Improve chromsizes File Validation to Catch Formatting Errors Early Feb 26, 2025
