Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[Bug]: Poor performance of taps #1194

Closed
Jack-Burnett opened this issue Nov 16, 2022 · 4 comments · Fixed by #1196
Closed

[Bug]: Poor performance of taps #1194

Jack-Burnett opened this issue Nov 16, 2022 · 4 comments · Fixed by #1196

Comments

@Jack-Burnett
Copy link
Contributor

Jack-Burnett commented Nov 16, 2022

Singer SDK Version

0.13.1

Python Version

3.10

Bug scope

Taps (catalog, state, stream maps, etc.)

Operating System

MacOS

Description

I've been trying to optimise the performance of a tap i am working on, and find that a lot of it's time is spent in the code of the library itself.
Roughly 0.5-0.7 seconds per 1000 records.

Here is an excerpt from a profiled run (via pycharm);
Screenshot 2022-11-16 at 11 54 05

As you can see, it spends around half of it's total time on the to_dict calls inside format_message.

If we switch line 165 in messages.py from;
return json.dumps(message.to_dict(), use_decimal=True, default=str)
to
return json.dumps(message.__dict__, use_decimal=True, default=str)
it basically entirely eliminates this and doubles the throughput of the tap.

My understanding is that dict just returns the existing object dict, whereas to_dict does a complex deep clone.

In my personal tests this has no downsides, all the tests still pass (didn't try the sdk tests yet), though I'm not sure if there are ramifications beyond that.

Thoughts?

Code

No response

@Jack-Burnett Jack-Burnett added kind/Bug Something isn't working valuestream/SDK labels Nov 16, 2022
@Jack-Burnett
Copy link
Contributor Author

Made a PR with my proposed improvement here: #1196
My initial suggestion of using dict had some issues with BatchMessage, the solution in this PR seems pretty clean though.
The same change could (should?) be applied to the other message types too, though RecordMessage is of course by far the most common one.

@tayloramurphy
Copy link
Collaborator

@Jack-Burnett thanks for opening the issue and PR! I've added it to our Engineering Board to have one of the engineers take a look 😄 cc @edgarrmondragon @aaronsteers

@edgarrmondragon
Copy link
Collaborator

@Jack-Burnett thanks for reporting! You're absolutely right that asdict does a deepcopy:

Converts the dataclass obj to a dict (by using the factory function dict_factory). Each dataclass is converted to a dict of its fields, as name: value pairs. dataclasses, dicts, lists, and tuples are recursed into. Other objects are copied with copy.deepcopy().

https://docs.python.org/3/library/dataclasses.html

This was my oversight when re-implementing the message classes.

@Jack-Burnett
Copy link
Contributor Author

That was a quick turnaround! Thanks

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants