
feat: Cache SQL columns and schemas #1779

Merged: 16 commits into main, Oct 3, 2023
Conversation

pnadolny13
Contributor

@pnadolny13 pnadolny13 commented Jun 20, 2023

These two methods get called a lot, and behind the scenes SQLAlchemy is running a query, so it gets pretty expensive if they aren't cached. If the schema or the table isn't found in the cache, then it runs the normal workflow of querying the database; but if it's cached, it skips the query. I originally implemented a version of this in MeltanoLabs/target-snowflake#57.

I know that for targets these methods get called at the initialization step of the sink, and a new sink and connector is created if the schema changes, so there's no worry about having to invalidate the cache, especially since targets commonly alter table columns. For taps I'm less familiar with the workflow, but it seems rare and out of scope for source tables to be updated mid-sync and require us to invalidate the cache.

@edgarrmondragon @kgpayne any thoughts?


📚 Documentation preview 📚: https://meltano-sdk--1779.org.readthedocs.build/en/1779/
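To make the cost concrete, here is a minimal sketch of the pattern being optimized, using stdlib sqlite3 rather than SQLAlchemy (the actual connector code lives in singer_sdk/connectors/sql.py and is not reproduced here). Without caching, every existence check is itself a round trip to the database catalog:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER)")

def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    # Without caching, every call runs this catalog query against the DB.
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,)
    ).fetchone()
    return row is not None
```

Called once per record batch or per sink initialization, queries like this add up quickly, which is what motivates caching the results.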

@codecov

codecov bot commented Jun 20, 2023

Codecov Report

Merging #1779 (0ef8701) into main (7adc1a8) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1779      +/-   ##
==========================================
+ Coverage   87.38%   87.39%   +0.01%     
==========================================
  Files          59       59              
  Lines        5121     5126       +5     
  Branches      828      830       +2     
==========================================
+ Hits         4475     4480       +5     
  Misses        451      451              
  Partials      195      195              
Files Coverage Δ
singer_sdk/connectors/sql.py 84.41% <100.00%> (+0.25%) ⬆️


Contributor

@kgpayne kgpayne left a comment


@pnadolny13 I think caching at this level makes sense. We use functools.lru_cache in other places in the SDK for a similar purpose, which may be a bit cleaner than implementing our own cache. Will let @edgarrmondragon take a look too 🙂

@edgarrmondragon edgarrmondragon changed the title feat: Cache sql columns and schemas feat: Cache SQL columns and schemas Jul 11, 2023
@pnadolny13
Contributor Author

@kgpayne I picked this back up and explored whether lru_cache would work. I don't think it will for this case, since we're calling these methods to check if the column/schema exists, and if not we create it. If we used lru_cache, then in the case where a column doesn't yet exist we would get a return value of False; then we'd add that column elsewhere, and a second call to this method would also return the cached False even though the column has since been created.
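The staleness problem described above can be reproduced in a few lines; this is a hypothetical illustration, with a module-level set standing in for the database's real column catalog:

```python
from functools import lru_cache

# Hypothetical stand-in for the database's actual set of columns.
EXISTING = {"id"}

@lru_cache(maxsize=None)
def column_exists(name: str) -> bool:
    # The first result for each name is cached forever.
    return name in EXISTING

first = column_exists("email")   # not there yet; False is now cached
EXISTING.add("email")            # the column gets created elsewhere
second = column_exists("email")  # still False: the stale cached result wins
```

Because lru_cache keys only on the arguments, there is no way for the second call to notice that the underlying state changed, which is exactly the scenario a target hits after creating a missing column.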

My implementation tries to avoid needless calls to the DB if we already know the column/schema exists, but if it isn't in our cache then we fall back to calling the DB. The first time the method is called, the cache is hydrated. Later, if a requested column/schema exists in the cache, it's returned without hitting the DB again; but if a requested column/schema isn't in the cache, we go back to the DB to refresh the cache and double-check, so newly added columns are accounted for.

Does that make sense? Can you think of a better way to do it?

cc @edgarrmondragon
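The hydrate-on-miss behavior described above can be sketched as follows. This is a minimal illustration, not the SDK's actual code; names like `ColumnCache` and `query_columns_from_db` are invented, and `query_columns_from_db` stands in for the SQLAlchemy inspection call:

```python
class ColumnCache:
    """Caches known columns per table; re-queries the DB only on a miss."""

    def __init__(self, query_columns_from_db):
        # query_columns_from_db(table) -> set of column names (hits the DB).
        self._query = query_columns_from_db
        self._cache: dict[str, set[str]] = {}

    def column_exists(self, table: str, column: str) -> bool:
        # Fast path: column already known to exist, no DB round trip.
        if column in self._cache.get(table, ()):
            return True
        # Miss: re-hydrate from the DB so newly created columns are seen.
        self._cache[table] = self._query(table)
        return column in self._cache[table]
```

Note the asymmetry that makes this safe where lru_cache is not: a cache hit is trusted (columns are not expected to disappear mid-sync), but a negative answer always triggers a fresh lookup, so a column created after the cache was hydrated is still found.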

Collaborator

@edgarrmondragon edgarrmondragon left a comment


I think this implementation makes sense without lru_cache. What it does is call inspection only if the object hasn't already been processed.

(Resolved review thread on singer_sdk/connectors/sql.py)
@pnadolny13 pnadolny13 added this pull request to the merge queue Oct 3, 2023
Merged via the queue into main with commit b4f9ac5 Oct 3, 2023
25 checks passed
@pnadolny13 pnadolny13 deleted the cache_sql_columns_and_schemas branch October 3, 2023 18:36
@raulbonet
Contributor

raulbonet commented Mar 18, 2024

Hello,

As mentioned in the Target Postgres issue, we are trying to add the overwrite functionality to that loader.

I am trying to use the built-in load_method: overwrite functionality, but the singer test suite is failing in the Update Schema tests, and I wonder if the problem is a bug in this implementation. @pnadolny13 can I maybe get your opinion?

You say "... a new connector is created if the schema changes", but I am not sure that is the case. I put loggers all over the place in Target Postgres, and the sink is indeed re-initialized if the schema changes, but NOT the connector, which is where the cache is stored.

Indeed, if you look at the code here:

class SQLTarget(Target):
    def get_sink(self):
        # This part of the code calls `add_sqlsink()`
        ...

    def add_sqlsink(self, stream_name, schema, key_properties=None):
        sink = sink_class(
            target=self,
            stream_name=stream_name,
            schema=schema,
            key_properties=key_properties,
            # HERE: the Sink is being initialized with the EXISTING connector
            connector=self.target_connector,
        )

As you can see, the Sink class is initialized with the existing target connector, so the sink does not recreate it.

Am I misunderstanding the code?

@pnadolny13
Contributor Author

@raulbonet it looks like the methods you mentioned were added in #1864, after this PR merged. Can you create a separate bug issue for this?
