[Bug]: Poor performance of taps #1194
Comments
Made a PR with my proposed improvement here: #1196
@Jack-Burnett thanks for opening the issue and PR! I've added it to our Engineering Board to have one of the engineers take a look 😄 cc @edgarrmondragon @aaronsteers
@Jack-Burnett thanks for reporting! You're absolutely right that
https://docs.python.org/3/library/dataclasses.html This was my oversight when re-implementing the message classes.
That was a quick turnaround! Thanks
Singer SDK Version
0.13.1
Python Version
3.10
Bug scope
Taps (catalog, state, stream maps, etc.)
Operating System
MacOS
Description
I've been trying to optimise the performance of a tap I am working on, and I find that a lot of its time is spent in the code of the library itself.
Roughly 0.5-0.7 seconds per 1000 records.
Here is an excerpt from a profiled run (via PyCharm):
As you can see, it spends around half of its total time on the to_dict calls inside format_message.
If we switch line 165 in messages.py from:
return json.dumps(message.to_dict(), use_decimal=True, default=str)
to
return json.dumps(message.__dict__, use_decimal=True, default=str)
this overhead is almost entirely eliminated and the throughput of the tap doubles.
My understanding is that __dict__ just returns the object's existing attribute dict, whereas to_dict does a complex deep clone.
In my personal tests this has no downsides: all of my tests still pass (I haven't tried the SDK tests yet), though I'm not sure if there are ramifications beyond that.
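To illustrate the difference, here is a minimal sketch. It assumes to_dict is built on dataclasses.asdict (which the dataclasses link in the comments suggests, but which is not confirmed here). The RecordMessage class below is a hypothetical stand-in, not the SDK's real message class, and it uses the standard-library json module rather than the json.dumps(..., use_decimal=True) call from the issue.

```python
import dataclasses
import json
import timeit


@dataclasses.dataclass
class RecordMessage:
    """Hypothetical stand-in for an SDK message class (not the real one)."""

    stream: str
    record: dict

    def to_dict(self) -> dict:
        # Assumption: the slow path is built on dataclasses.asdict, which
        # recurses into dicts, lists, and tuples and copies what it finds.
        return dataclasses.asdict(self)


message = RecordMessage(stream="users", record={"id": 1, "tags": ["a", "b"]})

deep = message.to_dict()    # freshly built, recursively copied dict
shallow = message.__dict__  # the instance's own attribute dict, no copying

# asdict builds new containers; __dict__ shares them with the instance.
copied_record = deep["record"] is not message.record
shared_record = shallow["record"] is message.record

# For a flat message like this one, both serialize to the same JSON.
same_json = json.dumps(deep) == json.dumps(shallow)

print(copied_record, shared_record, same_json)
print("to_dict :", timeit.timeit(message.to_dict, number=10_000))
print("__dict__:", timeit.timeit(lambda: message.__dict__, number=10_000))
```

Two caveats apply to the __dict__ route: the returned dict is the live attribute dict, so mutating it mutates the message, and any nested dataclass fields are not converted to plain dicts the way asdict would convert them. Neither matters when the dict is passed straight to json.dumps and the fields are already plain dicts, lists, and scalars.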
Thoughts?
Code
No response