Improve chromsizes File Validation to Catch Formatting Errors Early #458
Improve chromsizes File Validation to Catch Formatting Errors Early (#209)
Original Issue: #142
Previously, improperly formatted chromsizes files (e.g., files with spaces instead of tabs or hidden characters) could be silently parsed into a DataFrame, leaving `NaN` values in the "length" column. This caused downstream crashes, such as `ValueError: cannot convert float NaN to integer` when attempting to compute bins.

This update improves the validation in the `read_chromsizes` function by immediately converting the "length" column to numeric values and checking for `NaN`s. If any `NaN` values are encountered, a clear `ValueError` is raised, telling the user to ensure the file is a properly formatted tab-delimited file with exactly two columns: sequence name and integer length. This proactive validation helps users catch formatting issues earlier in the pipeline, preventing cryptic error messages later.

Error Before Fix:
Cryptic error when chromsizes file is not properly formatted.
Example error message when misformatted chromsizes file is used:
Example of command causing the error:
Cause:
- When the `.chrom.sizes` file had formatting issues, such as spaces instead of tabs, the file was misinterpreted, leading to `NaN` values being parsed.
- Treating a malformed entry (e.g., `allele1`) as a valid sequence with `NaN` as its length would cause problems downstream.

Solution:
- Validate the `chromsizes` file by converting the "length" column to numeric values and checking for `NaN`s right away.

This ensures that errors are caught early, avoiding confusing issues later in the pipeline and improving overall robustness.
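
The validation described above can be illustrated with a minimal sketch. It is not the actual diff from this PR: the helper name `read_chromsizes_checked`, the error wording, and the in-memory sample files are assumptions made for illustration. It only shows the general pattern of coercing the "length" column with `pd.to_numeric` and rejecting any `NaN`.

```python
import io

import pandas as pd


def read_chromsizes_checked(source):
    """Hypothetical helper: read a two-column, tab-delimited chromsizes file."""
    df = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["name", "length"],
        dtype={"name": str},
    )
    # Coerce non-numeric or missing lengths to NaN so they can be detected here
    # rather than surfacing later as an integer-conversion error.
    df["length"] = pd.to_numeric(df["length"], errors="coerce")
    if df["length"].isnull().any():
        raise ValueError(
            "Could not parse chromsizes: expected a tab-delimited file with "
            "exactly two columns (sequence name, integer length)."
        )
    df["length"] = df["length"].astype("int64")
    return df


# Illustrative inputs (not taken from the original issue).
good = io.StringIO("chr1\t1000\nchr2\t500\n")
bad = io.StringIO("chr1 1000\nchr2 500\n")  # spaces instead of tabs

sizes = read_chromsizes_checked(good)

try:
    read_chromsizes_checked(bad)
    bad_error = None
except ValueError as exc:
    bad_error = str(exc)
```

With the space-delimited input, each line is read as a single field, so the "length" column is entirely missing and the check raises at load time instead of letting `NaN` reach the binning step.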