feat: Cache SQL columns and schemas #1779
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##             main    #1779      +/-   ##
==========================================
+ Coverage   87.38%   87.39%   +0.01%
==========================================
  Files          59       59
  Lines        5121     5126       +5
  Branches      828      830       +2
==========================================
+ Hits         4475     4480       +5
  Misses        451      451
  Partials      195      195
```
@pnadolny13 I think caching at this level makes sense. We use functools.lru_cache in other places in the SDK for a similar purpose, which may be a bit cleaner than implementing our own cache. Will let @edgarrmondragon take a look too 🙂
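For reference, this is roughly the functools.lru_cache pattern being suggested. A minimal sketch; the class, method, and its body are illustrative stand-ins, not the SDK's actual code:

```python
from functools import lru_cache


class Connector:
    def __init__(self, known_schemas: set[str]) -> None:
        self._known_schemas = known_schemas

    @lru_cache(maxsize=None)
    def schema_exists(self, schema_name: str) -> bool:
        # Stand-in for an expensive database round trip; with lru_cache,
        # repeat calls with the same argument reuse the memoized result.
        return schema_name in self._known_schemas
```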
This reverts commit 668832b.
@kgpayne I picked this back up and explored whether lru_cache would work. I don't think it will for this case, since we call these methods to check whether the column/schema exists, and if it doesn't we create it. With lru_cache, when a column doesn't yet exist we would get a return value of False; we would then add that column elsewhere, and a second call to the method would still return the cached False even though the column has since been created. My implementation tries to avoid needless calls to the DB when we already know the column/schema exists, but falls back to calling the DB when it isn't in our cache. The first time the method is called, the cache is hydrated. Later, if a requested column/schema exists in the cache, it's returned without hitting the DB again; if it isn't in the cache, we go back to the DB to refresh the cache and double-check, so newly added columns are accounted for. Does that make sense? Can you think of a better way to do it?
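A minimal sketch of the hydrate-and-refresh behavior described above, assuming SQLAlchemy's inspection API; the class and attribute names are illustrative, not the PR's exact code:

```python
import sqlalchemy


class CachingConnector:
    """Caches known table names per schema, refreshing the cache on a miss."""

    def __init__(self, engine: sqlalchemy.engine.Engine) -> None:
        self._engine = engine
        self._table_cache: dict[str, set[str]] = {}

    def table_exists(self, schema_name: str, table_name: str) -> bool:
        # Fast path: a previous inspection already saw this table.
        if table_name in self._table_cache.get(schema_name, set()):
            return True
        # Slow path: re-inspect the database and refresh the cache, so a
        # table created since the last inspection is picked up. A stale
        # "does not exist" answer is never cached, which is exactly the
        # problem lru_cache would have here.
        inspector = sqlalchemy.inspect(self._engine)
        self._table_cache[schema_name] = set(
            inspector.get_table_names(schema=schema_name)
        )
        return table_name in self._table_cache[schema_name]
```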
I think this implementation makes sense without lru_cache. What it does is call inspection only when the object hasn't already been processed.
Hello,

As mentioned in the Target Postgres issue, we are trying to add the overwrite functionality to that loader, and I am trying to use the built-in methods.

You say "... a new connector is created if the schema changes", but I am not sure that is the case. I put loggers all over the place in Target Postgres, and the sink is indeed initialized when the schema changes, but NOT the connector, which is where the cache is stored. Indeed, if you look at the code here:

```python
class SQLTarget(Target):
    def get_sink(self):
        # This part of the code calls `add_sqlsink()`
        ...

    def add_sqlsink(self):
        sink = sink_class(
            target=self,
            stream_name=stream_name,
            schema=schema,
            key_properties=key_properties,
            # HERE: the Sink is being initialized with the EXISTING connector
            connector=self.target_connector,
        )
```

As you can see, the Sink class is initialized with the existing target connector, so the sink does not recreate it. Am I misunderstanding the code?
@raulbonet it looks like the methods you mentioned were added in #1864, after this PR merged. Can you create a separate bug issue for this?
These two methods get called a lot, and behind the scenes SQLAlchemy runs a query each time, so they get pretty expensive if they aren't cached. If the schema or the table isn't found in the cache, the normal workflow of querying the database runs; if it is cached, that query is skipped. I originally implemented a version of this in MeltanoLabs/target-snowflake#57.
I know that for targets these methods get called at the initialization step of the sink, and a new sink and connector are created if the schema changes, so there's no worry about having to invalidate the cache, especially since targets commonly alter table columns. For taps I'm less familiar with the workflow, but it seems rare, and out of scope, for source tables to be updated mid-sync in a way that would require us to invalidate the cache.
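For context on the cost, each uncached existence check amounts to a catalog query against the database. A minimal usage sketch with SQLAlchemy's inspection API; the engine URL, table name, and schema are placeholders:

```python
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///:memory:")  # placeholder URL
inspector = sqlalchemy.inspect(engine)

# Each call below issues a catalog query against the database, so doing
# this once per stream (or worse, per batch) without a cache adds up.
inspector.get_schema_names()
inspector.has_table("some_table", schema="main")
```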
@edgarrmondragon @kgpayne any thoughts?
📚 Documentation preview 📚: https://meltano-sdk--1779.org.readthedocs.build/en/1779/