451: Avoid Unnecessary Re-Creation of Bins in scool to Prevent File Bloat #457
This patch addresses an issue in the current scool file creation workflow: during append operations, the bins group is always deleted and re-created, even when the bins data (i.e., the "chrom", "start", and "end" columns) hasn’t changed. Over time, this leads to file bloat, because HDF5 does not automatically reclaim the space freed by deleted objects.
What’s Changed:
Conditional Bins Update:
The patch introduces a check that compares the existing bins in the file with the new bins data. If they match, the bins group is left intact, avoiding unnecessary deletion and re-creation.
Consistent Chromosome Data:
The chroms group is always updated to ensure consistency, while the bins group is only rewritten when there is an actual difference in the underlying data.
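The conditional check described above can be sketched roughly as follows. This is an illustration, not the code in this patch: the helper names `bins_unchanged` and `maybe_rewrite_bins` and the `write_bins` callback are hypothetical. It assumes the existing bins can be read as a mapping of column name to array (an `h5py.Group` or a plain dict both fit that shape), and that "chrom" is compared in whatever encoding both sides share.

```python
import numpy as np

# Coordinate columns that define the bin table (from the PR description).
BIN_COLUMNS = ("chrom", "start", "end")


def bins_unchanged(existing, new, columns=BIN_COLUMNS):
    """Return True if every coordinate column exists on both sides and is equal.

    `existing` and `new` are any mapping-like objects keyed by column name,
    e.g. an h5py.Group, a pandas DataFrame, or a dict of arrays (assumption).
    """
    for col in columns:
        if col not in existing or col not in new:
            return False
        # np.asarray reads the column into memory; array_equal also
        # returns False when the lengths differ.
        if not np.array_equal(np.asarray(existing[col]), np.asarray(new[col])):
            return False
    return True


def maybe_rewrite_bins(root, new_bins, write_bins):
    """Rewrite root["bins"] only when the data differs.

    `write_bins(root, new_bins)` is a hypothetical callback that creates the
    group. Returns True if the group was (re)written, False if left intact.
    """
    if "bins" in root and bins_unchanged(root["bins"], new_bins):
        return False  # identical bins: skip the delete and re-create
    if "bins" in root:
        del root["bins"]
    write_bins(root, new_bins)
    return True
```

With a plain dict standing in for the file, `maybe_rewrite_bins(root, same_bins, writer)` returns `False` and leaves `root["bins"]` untouched, while passing bins with a different "end" column returns `True` and replaces the group.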
Why This Matters:
Space Efficiency:
By skipping redundant writes when the bins are identical, we prevent the accumulation of dead space and reduce the need for costly file repacking operations (e.g., using h5repack).
Improved Performance:
Skipping the redundant bins write reduces the I/O done per append, which can make append operations faster over multiple iterations.
Testing:
The patch was tested locally by creating an initial scool file, appending cells with unchanged bins (which showed minimal file size increase), and then appending with modified bins (which resulted in a larger file increase as expected).
Additional Notes:
HDF5 Limitations:
Even with this patch, HDF5 will not automatically reclaim space from deleted datasets. For existing files with accumulated dead space, tools like h5repack are still recommended.
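For files that already carry dead space, a repack step could be scripted along these lines. This is a minimal sketch that assumes the `h5repack` command-line tool from the HDF5 distribution is on `PATH`; the helper names `h5repack_command` and `repack` are hypothetical, and only the basic `h5repack <input> <output>` form is used.

```python
import shutil
import subprocess


def h5repack_command(src, dst):
    """Build the basic h5repack invocation: copy src into a compacted dst."""
    return ["h5repack", str(src), str(dst)]


def repack(src, dst):
    """Run h5repack if it is installed.

    Returns True if the tool was run, False if it was not found on PATH.
    Raises subprocess.CalledProcessError if h5repack exits with an error.
    """
    if shutil.which("h5repack") is None:
        return False
    subprocess.run(h5repack_command(src, dst), check=True)
    return True
```

Comparing `os.path.getsize` on the input and output files afterwards gives a rough measure of how much dead space the original file had accumulated.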
Future Enhancements:
A longer-term improvement might involve accepting a single shared bins DataFrame with per-cell iterators, reducing memory usage and further streamlining the workflow.
Please review the changes and let me know if there are any questions or further improvements needed.