Skip to content

[EWT-1250] Sqlalchemy/Superset expectations from python-DBAPI #17

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

prantogg
Copy link
Contributor

@prantogg prantogg commented Aug 1, 2024

This PR introduces the following requirements that have risen from the Wherobots x Superset integration -

  • SQLAlchemy expects the result set to be a List of Tuples or List of Lists. SQLAlchemy wraps this result set in it's own Row object.
  • Superset requires rollback() and commit() to be implemented. Other OLAP databases such as pyhive simple "pass" the not implemented rollback() and commit() methods. For context - Superset's background processes often bypass the SQLAlchemy dialect and directly interacts with DBAPI. This is why overriding the rollback and commit methods in the Dialect doesn't suffice.
  • Adds Web socket URL parameter, ws_url, to connection. This helps maintain static connection pool configuration in Superset.

@prantogg prantogg marked this pull request as ready for review August 1, 2024 18:44
@prantogg prantogg changed the title Sqlalchemy/Superset expectations from python-DBAPI [EWT-1250] Sqlalchemy/Superset expectations from python-DBAPI Aug 1, 2024
Copy link

schema = reader.schema
columns = schema.names
column_types = [field.type for field in schema]
rows = reader.read_all().to_pandas().values.tolist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need both read_all().to_pandas() or can you read_pandas() directly? (to be fair according to the docs it does the same thing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Using read_pandas() now for better readability.

},
headers=headers,
if ws_url:
session_uri = ws_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things here:

  1. This can be done earlier, so if we feel confident in using the ws_url we can do that right after checking the token/api-keys.
  2. This needs a little bit different logging, since we're not requesting a new runtime (see line 70).
  3. This should early return so that we don't have to indent everything afterwards. Makes it more readable.

@prantogg prantogg requested a review from peterfoldes August 1, 2024 23:09
@@ -51,6 +50,7 @@ def connect(
results_format: Union[ResultsFormat, None] = None,
data_compression: Union[DataCompression, None] = None,
geometry_representation: Union[GeometryRepresentation, None] = None,
ws_url: str = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required, instead of calling connect_direct() directly?

The reason this parameter wasn't included in connect() is that it creates ambiguity between the rest of the parameters (like runtime/region) and the runtime you'd actually connect to when providing a ws_url, which may not match those choices.

query.handler(json.loads(result_bytes.decode("utf-8")))
data = json.loads(result_bytes.decode("utf-8"))
columns = data["columns"]
column_types = data.get("column_types")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is column_types optional? If so, then it's good that you're using data.get() here, but then in Cursor.__get_results you expect column_types to be non-None. You either need to ensure column_types is always provided, or change __get_results to be more defensive.

None, # precision
None, # scale
True, # null_ok; Assuming all columns can accept NULL values
)
for col_name in result.columns
for i, col_name in enumerate(columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use https://docs.python.org/3/library/functions.html#zip to avoid jumping hoops with an index (it's much nicer to read, and also more efficient):

self.__description = [
  (
    col_name,
    _TYPE_MAP.get(col_type, 'STRING'),
    ...
  )
  for (col_name, col_type) in zip(columns, column_types)
]

@prantogg prantogg marked this pull request as draft August 13, 2024 17:17
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants