Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

batch reading on load to avoid memory issue #44

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gmossessian
Copy link

@gmossessian gmossessian commented Dec 3, 2020

This fixes an issue that can cause the dawg to allocate a lot of memory on load in the case of a corrupted file.

Unfortunately, the original dawgdic library looks like it has long since ceased to be maintained, so I am opening a PR against the python wrapper.

To replicate the issue, use the following script:

from dawg import BytesDAWG as dawg

with open('corrupt.dawg', 'w') as f:
    f.write('corrupt!')

try:
    d = dawg().load('corrupt.dawg')
except:
    print('failed, as expected')

If you run this under gtime on OSX, you will see somewhere between 4 and 6GB of RAM being used.

$ gtime -v python load_corrupted_dawg
failed, as expected
	Command being timed: "python load_corrupted_dawg"
	User time (seconds): 2.94
	System time (seconds): 3.00
	Percent of CPU this job got: 84%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.04
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6256108    <-------------- SIX GIGABYTES!!
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 258
	Minor (reclaiming a frame) page faults: 1890722
	Voluntary context switches: 338
	Involuntary context switches: 15360
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After this PR, this number is greatly reduced

	Maximum resident set size (kbytes): 35152

@gmossessian
Copy link
Author

@kmike would love a review if you're still maintaining this library! thank you!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant