Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277
Comments
Using the pystack debugger I managed to use gdb to get a thread dump of some of our hanging processes. I've attached the dump of one of our hung threads as a file to this comment. The short version is that the thread dump points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/1.3.0/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L45
Got a dump of the parent PID of the defunct processes. The results are attached below. This one points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105
Hi @corey-hammerton, I suspect this is related to the same stalls seen in #165. I've also traced it down with thread dumps, strace, etc. and cannot explain why 'multiprocessing' and/or threading is stalling. There are a few issues on GitHub for multiprocessing that sound related to this problem, but I haven't seen a clear solution yet (one that doesn't mean large changes like Python 3+, etc.). I'm hoping someone with more familiarity with Python internals can help take this investigation further.
On our backup server we experience cases where backups fail to complete successfully because the Oplog Resolver from a previous backup didn't complete. This happens on many of our backup configurations using default oplog configuration settings.
The ResolverThreads appear to complete successfully, as displayed by the logs below with hostnames redacted. The resolver, however, doesn't free the threads and continue with the backup. Manually killing the defunct processes releases the locks, and the MainProcess thread then logs that Oplog resolving completed.
Killing the processes on the server with high Virtual Memory Size achieves this.
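For anyone who wants to experiment with automating that manual kill, here is a minimal sketch of the general pattern: wait on each pool result with a deadline and terminate the workers if it passes. This is not the project's Resolver code. It assumes the stall is a blocking wait on pool workers, `resolve_shard` and `run_with_deadline` are hypothetical names, and it is written for Python 3 on a Unix host (it uses the "fork" start method, which Python 2.7 does not expose through `get_context`).

```python
import multiprocessing
import time


def resolve_shard(seconds):
    # Hypothetical stand-in for one resolver task; it only sleeps.
    time.sleep(seconds)
    return seconds


def run_with_deadline(durations, timeout):
    """Run the tasks in a pool and kill the workers if the deadline passes.

    Returns the list of results, or None if the pool had to be terminated.
    """
    # Assumption: a Unix host where the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    pool = ctx.Pool(processes=2)
    results = [pool.apply_async(resolve_shard, (d,)) for d in durations]
    pool.close()  # no further tasks will be submitted

    deadline = time.time() + timeout
    try:
        # get(timeout=...) raises multiprocessing.TimeoutError instead of
        # blocking forever when a worker never reports back.
        values = [
            r.get(timeout=max(0.0, deadline - time.time())) for r in results
        ]
    except multiprocessing.TimeoutError:
        pool.terminate()  # stop the stuck workers
        pool.join()       # reap them so they do not linger as defunct
        return None

    pool.join()
    return values


if __name__ == "__main__":
    print(run_with_deadline([0.1, 0.2], timeout=5))
    print(run_with_deadline([0.1, 30], timeout=1))
```

Whether terminating the workers is safe depends on what they were doing when they stalled, so treat this as a debugging aid rather than a fix.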
In our testing and debugging we have narrowed it down to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105. Please advise.
Generic server information: