451: Avoid Unnecessary Re-Creation of Bins in scool to Prevent File Bloat #457

Open
wants to merge 4 commits into
base: master
from

Conversation

Snowman-cpu

This patch addresses an issue in the current scool file creation workflow: during append operations, the bins group is always deleted and re-created, even when the bins data (i.e., the "chrom", "start", and "end" columns) has not changed. Because HDF5 does not automatically reclaim the space freed by deleted objects, repeated appends lead to file bloat over time.

What’s Changed:

Conditional Bins Update:
The patch introduces a check that compares the existing bins in the file with the new bins data. If they match, the bins group is left intact, avoiding unnecessary deletion and re-creation.
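
A minimal sketch of such a comparison is shown below. It is illustrative only and is not the code in this patch: the helper name `bins_match` is hypothetical, and it assumes both sides encode "chrom" the same way (for example, both as integer IDs). It accepts anything that maps column names to array-likes, such as an open h5py group or a dict of NumPy arrays.

```python
import numpy as np

# Columns named in the PR description as defining the bin table.
BIN_COLUMNS = ("chrom", "start", "end")


def bins_match(existing, new, columns=BIN_COLUMNS):
    """Return True if every column in `columns` is identical on both sides.

    `existing` and `new` map column names to array-likes (hypothetical
    interface: an h5py group, a dict of arrays, or a DataFrame).
    Assumption: "chrom" uses the same encoding in both containers.
    """
    for name in columns:
        # A missing column counts as a mismatch.
        if name not in existing or name not in new:
            return False
        old_col = np.asarray(existing[name][:])
        new_col = np.asarray(new[name][:])
        # array_equal is False when shapes or any element differ.
        if not np.array_equal(old_col, new_col):
            return False
    return True
```

When the function returns True, the caller can skip deleting and re-creating the bins group.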

Consistent Chromosome Data:
The chroms group is always updated to ensure consistency, while the bins group is only rewritten when there is an actual difference in the underlying data.
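
The write policy described here can be sketched with a plain dict standing in for the open file. This is a simplified model, not the patch itself: `write_shared_groups` is a hypothetical name, and real code would operate on HDF5 groups rather than Python dicts.

```python
def write_shared_groups(store, chroms, bins):
    """Update `store` in place; return True if the bins were rewritten.

    `store` is a dict standing in for the file (an assumption made for
    illustration); each table is a dict of column name -> list.
    """
    # chroms is refreshed on every append.
    store["chroms"] = dict(chroms)

    # bins is rewritten only when it is missing or differs.
    if store.get("bins") == bins:
        return False
    store["bins"] = dict(bins)
    return True
```

The return value makes it easy to log or assert whether a given append triggered a bins rewrite.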

Why This Matters:

Space Efficiency:
By skipping redundant writes when the bins are identical, we prevent the accumulation of dead space and reduce the need for costly file repacking operations (e.g., using h5repack).

Improved Performance:
Avoiding unnecessary I/O operations helps maintain a leaner file size and can lead to faster append operations over multiple iterations.

Testing:

The patch was tested locally by creating an initial scool file, appending cells with unchanged bins (which showed minimal file size increase), and then appending with modified bins (which resulted in a larger file increase as expected).
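
A check along these lines can be reproduced with a small helper that measures how much a file grows across one operation. The helper name `size_growth` is hypothetical, and `action` stands for whatever call performs the append.

```python
import os


def size_growth(path, action):
    """Run `action()` and return how many bytes the file at `path` grew."""
    # A file that does not exist yet is treated as having size 0.
    before = os.path.getsize(path) if os.path.exists(path) else 0
    action()
    return os.path.getsize(path) - before
```

Comparing the value returned for an append with unchanged bins against one with modified bins gives the two measurements described above.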

Additional Notes:

HDF5 Limitations:
Even with this patch, HDF5 will not automatically reclaim space from deleted datasets. For existing files with accumulated dead space, tools like h5repack are still recommended.
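
For reference, repacking can be scripted by calling the `h5repack` command-line tool, which copies a file into a new one without the unused space. The sketch below assumes `h5repack` is installed and on the PATH; the function names and any file names passed to them are hypothetical.

```python
import shutil
import subprocess


def repack_command(src, dst):
    """Build the h5repack invocation: h5repack <input> <output>."""
    return ["h5repack", str(src), str(dst)]


def repack(src, dst):
    """Copy `src` to `dst` with h5repack, leaving `src` untouched."""
    if shutil.which("h5repack") is None:
        raise RuntimeError("h5repack not found on PATH")
    # check=True raises CalledProcessError if h5repack exits non-zero.
    subprocess.run(repack_command(src, dst), check=True)
```

Writing to a separate output file keeps the original intact until the repacked copy has been verified.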

Future Enhancements:
A longer-term improvement might involve accepting a single shared bins DataFrame with per-cell iterators, reducing memory usage and further streamlining the workflow.

Please review the changes and let me know if there are any questions or further improvements needed.
