Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

bug: Cache SQL columns and schemas does not work with a shared connector #2325

Closed
1 task done
raulbonet opened this issue Mar 18, 2024 · 7 comments · Fixed by #2352 or raulbonet/sdk#1
Closed
1 task done

bug: Cache SQL columns and schemas does not work with a shared connector #2325

raulbonet opened this issue Mar 18, 2024 · 7 comments · Fixed by #2352 or raulbonet/sdk#1
Labels
kind/Bug Something isn't working valuestream/SDK

Comments

@raulbonet
Copy link
Contributor

raulbonet commented Mar 18, 2024

Singer SDK Version

0.35.2

Is this a regression?

  • Yes

Python Version

3.9

Bug scope

Taps (catalog, state, etc.)

Operating System

Linux

Description

Meltano throws an exception whenever there is a schema change in the same run.

This feature does not seem to work after the release of this feature.

Whenever a schema is changed, the cache of the connector should no longer valid. But currently, upon sink initialization, the connector is reused.

Indeed, if you look at the add_sqlsink() method, the sink is initialized with the existing connector

class SQLTarget(Target):
    def get_sink():
        # This part of the code calls `add_sqlsink()`
        ...
        
     def add_sqlsink():
       sink = sink_class(
            target=self,
            stream_name=stream_name,
            schema=schema,
            key_properties=key_properties,
            # HERE: the Sink is being initialized with the EXISTING connector
            connector=self.target_connector,
        )

Code

No response

@raulbonet raulbonet added kind/Bug Something isn't working valuestream/SDK labels Mar 18, 2024
@raulbonet
Copy link
Contributor Author

As per @pnadolny13 request here, I opened an issue with this.

I have been giving a thought to all of this and with the current version of the SDK I am not sure if we can optimize even more the calls to prepare_columns(), get_column_type() etc. , which I understand were the original reason to develop the caching.

  1. The mentioned methods I think are called only as part of the prepare_table() method. But this method is only called by the add_sink() method (or I think that's how it should be as I mention here ). , which in turn is only called by the get_sink() method .

  2. But the get_sink() method already checks for schema changes, and only calls add_sink() whenever there has been a schema change. If there were no changes, an existing sink is returned.

And if there has been a schema change, I want to invalidate the cache.

So I don't think that this process can be optimized any more, and therefore the caching cannot be done.

Any thoughts? I am new to the Meltano SDK so maybe I am understanding the whole code wrong!

@edgarrmondragon
Copy link
Collaborator

cc @BuzzCutNorman in case you have thoughts about this

@BuzzCutNorman
Copy link
Contributor

BuzzCutNorman commented Mar 20, 2024

@edgarrmondragon I don't have any initial thoughts. I will need to find a way to reproduce the issue.

@raulbonet
Copy link
Contributor Author

raulbonet commented Mar 22, 2024

I agree with the the approach and reasoning of @BuzzCutNorman : from an OOD perspective, the way it was intended is for the SQLConnector class to serve as a wrapper/interface around the connection, to encapsulate some common operations like "rename", "delete" or "get_ddl".

I think then the SQLConnector 's implementation class has to be as close as possible to the underlying database connector. It makes sense to leverage connection pooling, as the underlying connection does.

But I wonder if the logic of caching is the connector's responsibility. In any case, I don't think any caching can be done as I exposed.

Would it be OK if I just remove the caching implemented here?

@BuzzCutNorman
Copy link
Contributor

@raulbonet Thank you for creating the issue and starting the conversation. I understand what you are saying in theory I just have not gotten time to reproduce the issue you are describing. Once I can observe it and work with it in a testing scenario, then I can attempt to have a knowledgeable discussion on ways to resolve this. If you could provide the error that you are seeing or the traceback of the error that would be very helpful. My guess is you are running your PR version of meltanolabs target-postgres but I don't know if you are seeing the error when you run the testing or if the error occurs when you are giving target-postgres data from a particular tap.

@raulbonet
Copy link
Contributor Author

Hello @BuzzCutNorman and company,

Sorry, I have been sick and not been able to follow up on this until now. In this PR I replaced (almost) all the custom overridden methods of the target-postgres SDK by the native Meltano SDK functions.

If you do:

docker compose up
pytest

You will see that 2 tests of the test suite fail:
TestTargetPostgres.test_target_schema_updates
test_schema_updates

I am attaching the full log. It's interesting to note that they only fail once: if you re-execute the test, they will be successful: that's because the test fails but it is able to create the new column, so if the test is executed without restarting the Postgres instance, it is successful.

If I remove the part of the code belonging to the caching, the tests are successful.
pytest.log

@raulbonet
Copy link
Contributor Author

Closed by mistake when merging to my personal fork, re-opened.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
kind/Bug Something isn't working valuestream/SDK
Projects
None yet
3 participants