[Feature #600]: Use multiprocessing to speed up the parsing #601
base: develop
If this is not done inside the "if __name__ == '__main__'" block, it will be executed again inside every new process on macOS/Windows
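To illustrate the point about the main guard, here is a minimal sketch, not the code from this PR. `parse_chunk` and the sample chunks are hypothetical stand-ins. On platforms that start workers with "spawn" (macOS and Windows by default), each worker re-imports the main module, so anything left at module level runs once per process, while code under the guard runs only in the parent.

```python
from concurrent.futures import ProcessPoolExecutor


def parse_chunk(chunk):
    """Hypothetical stand-in for parsing one chunk of records."""
    return [value * 2 for value in chunk]


def main():
    # Pool creation and one-time setup belong here, not at module level:
    # with the "spawn" start method every worker re-imports this module.
    chunks = [[1, 2], [3, 4], [5, 6]]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(parse_chunk, chunks))
    print(results)
    return results


if __name__ == "__main__":
    # Runs only in the parent process, never in re-imported workers.
    main()
```

Keeping the worker function at module level also matters, because it has to be importable by name in the child processes.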
Since the processing is now async, this print might confuse the users
Thank you @AlexandraImbrisca for the implementation and sending the detailed report which reads coherently! What I stumbled across so far:
The speed-up stalls somewhere from 5 cores onward. I can imagine this flattening is caused by a) write concurrency, b) other processes running on my laptop, or c) the number of parallel processes decreasing once most of the tasks are done?
(The column
Thanks a lot for the detailed review and suggestions @nesnoj!
Hey @AlexandraImbrisca !
Sounds good to me.
An alternative way could be to create separate SQLite DBs and finally merge them. Dunno if this is a viable option..
It terminates :(
Sounds good to me as well!
Instead of "timeout", we can use "connect_timeout" which works for both SQLite and PostgreSQL
Awesome, thanks a lot both! A few updates from my side:
About merging the DBs: that might work, but it might get quite messy with many processes (i.e., we could end up with 10+ temporary DBs) and we have to make sure that we clean everything up eventually 🤔 Using temporary tables performed better than I expected (source)
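For reference, the write-separately-and-merge alternative discussed here could look roughly like the sketch below. It is not what this PR implements. The `units` table, its columns, and the function names are hypothetical, and it only covers SQLite (via `ATTACH DATABASE`), including the clean-up of the temporary files mentioned above.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema, used only for this illustration.
SCHEMA = "CREATE TABLE IF NOT EXISTS units (id INTEGER PRIMARY KEY, name TEXT)"


def write_part(path, rows):
    """Write one worker's rows into its own temporary SQLite file."""
    con = sqlite3.connect(path)
    with con:  # commits on success
        con.execute(SCHEMA)
        con.executemany("INSERT INTO units (id, name) VALUES (?, ?)", rows)
    con.close()


def merge_parts(target_path, part_paths):
    """Copy every part into the target DB, delete the parts, return row count."""
    con = sqlite3.connect(target_path)
    with con:
        con.execute(SCHEMA)
    for part in part_paths:
        # ATTACH/DETACH must happen outside of an open transaction.
        con.execute("ATTACH DATABASE ? AS part", (part,))
        with con:
            con.execute("INSERT INTO units SELECT id, name FROM part.units")
        con.execute("DETACH DATABASE part")
        os.remove(part)  # clean up the temporary DB
    total = con.execute("SELECT COUNT(*) FROM units").fetchone()[0]
    con.close()
    return total


if __name__ == "__main__":
    tmp_dir = tempfile.mkdtemp()
    parts = [os.path.join(tmp_dir, f"part_{i}.db") for i in range(2)]
    write_part(parts[0], [(1, "a"), (2, "b")])
    write_part(parts[1], [(3, "c"), (4, "d")])
    print(merge_parts(os.path.join(tmp_dir, "merged.db"), parts))
```

The sketch assumes primary keys never collide across parts. With 10+ workers the merge step runs serially, which is one reason the temporary-table approach may still be the simpler choice.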
Thx for the quick update!
I'll get back to this later
The column issue seems to be solved, but now I keep getting a privileges error in PostgreSQL, see below for full log. The user has all privileges for the DB (superuser) and the tables are created, but no data is written. I think it is related to the implementation rather than to the actual privileges, but I wasn't able to track it down any further right now.
Great that you already did some testing in the past! The write-temp-and-merge strategy was just a quick thought, it probably comes with other consequences I cannot estimate and also requires more testing. I'm also fine with the current implementation but open for discussion ;).
Thanks a bunch for finding this bug! I was using an unauthenticated database and I didn't realise that this could be an issue. The connection_url obfuscates the password so I updated the code to properly set the password. Could you please try again and let me know if you see the same issue?
Ensure correct type of NUMBER_OF_PROCESSES and add error handling for non-numeric types
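The kind of check this commit describes might look like the sketch below. It is an illustration under assumptions, not the patched code: the function name, the default of 1 when the value is unset, and the cap at the CPU count are all hypothetical choices.

```python
import os


def resolve_number_of_processes(raw_value, cpu_count=None):
    """Turn a raw NUMBER_OF_PROCESSES setting into a usable positive int.

    Assumptions for this sketch: an unset value means 1 process, and the
    result is capped at the number of available CPUs.
    """
    available = cpu_count or os.cpu_count() or 1
    if raw_value is None or str(raw_value).strip() == "":
        return 1
    try:
        requested = int(str(raw_value).strip())
    except ValueError:
        raise ValueError(
            f"NUMBER_OF_PROCESSES must be an integer, got {raw_value!r}"
        ) from None
    if requested < 1:
        raise ValueError(
            f"NUMBER_OF_PROCESSES must be at least 1, got {requested}"
        )
    return min(requested, available)


if __name__ == "__main__":
    # Reading from the environment is an assumption made for this example.
    print(resolve_number_of_processes(os.environ.get("NUMBER_OF_PROCESSES")))
```

Failing early with a clear message avoids handing a string or a negative number to the process pool, where the resulting error is harder to read.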
These two small things needed a fix, so I patched them.
Now it works fine with psql, thank you!
Thank you for spotting the issues and fixing them! If you have any other suggestions, please let me know
Is this now the version that should be merged to develop and released afterwards? If yes, I would start with the comparison of the two databases:
Yes, I think this is the final version (unless we find any other bugs/suggestions). If you can help testing, that would be great! I will also test a bit more
Did you test on Windows? Without setting os.environ, my program immediately crashes: concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Could you please try again and let me know if any other error is being printed? I unfortunately don't have my own Windows system and I only tested the previous version before adding the os.environ variables. I'll try accessing Windows today and test the code again
I just tested on Windows 11 and I had no issues. I tried with WSL 2.0 and similarly, the program is running correctly. I tested without setting os.environ as well as with setting each of the fields.
Just saw it now, I'll work on this hopefully within this or next week.
Follow up to the previous PR: