feat: add duckdb as DataSource - Fixes #14563 #19317
needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)
update _time_grain_expressions
Codecov Report
@@            Coverage Diff             @@
##           master   #19317      +/-   ##
==========================================
- Coverage   66.53%   66.50%   -0.04%
==========================================
  Files        1667     1672       +5
  Lines       64360    64564     +204
  Branches     6493     6493
==========================================
+ Hits        42824    42936     +112
- Misses      19854    19946      +92
  Partials     1682     1682
Except for the lint error, LGTM. BTW, I was impressed by DuckDB as a lightweight, column-based database.
LGTM - a few non-blocking style related comments
except RuntimeError:
    # Catches the equivalent single-threading error from duckdb.
    alive = engine.dialect.do_ping(conn)
nit: could we have these in the same except:
except (sqlite3.ProgrammingError, RuntimeError):
    # SQLite can't run on a separate thread, so ``func_timeout`` fails
    # RuntimeError catches the equivalent single-threading error from duckdb.
    alive = engine.dialect.do_ping(conn)
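For anyone reading along, here is a minimal, self-contained sketch of the fallback pattern being suggested. It is not the Superset code: `ThreadPoolExecutor` stands in for `func_timeout`, and `ping` and `ping_with_fallback` are hypothetical names made up for illustration. It only uses the standard-library `sqlite3` module, whose connections refuse to be used from another thread by default.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def ping(conn: sqlite3.Connection) -> bool:
    """Hypothetical stand-in for engine.dialect.do_ping(conn)."""
    conn.execute("SELECT 1")
    return True


def ping_with_fallback(conn: sqlite3.Connection, timeout: float = 5.0) -> bool:
    """Try the ping on a worker thread; fall back to the calling thread."""
    try:
        # Stand-in for func_timeout: run the ping on a separate thread.
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(ping, conn).result(timeout=timeout)
    except (sqlite3.ProgrammingError, RuntimeError):
        # sqlite3 raises ProgrammingError when a connection is used from a
        # thread other than the one that created it. RuntimeError is listed
        # for an engine that reports the same condition that way.
        return ping(conn)


conn = sqlite3.connect(":memory:")
alive = ping_with_fallback(conn)
print(alive)
```

Both exception types share one handler because the recovery is identical: retry the ping on the thread that owns the connection.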
superset/db_engine_specs/duckdb.py
@classmethod
def get_table_names(
    cls, database: "Database", inspector: Inspector, schema: Optional[str]
nit: if we add from __future__ import annotations at the beginning of the file, we can remove the quotes. See an example here:
superset/superset/common/query_context.py
Line 17 in b7ecb14
from __future__ import annotations
- cls, database: "Database", inspector: Inspector, schema: Optional[str]
+ cls, database: Database, inspector: Inspector, schema: Optional[str]
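A small sketch of why the quotes become unnecessary, under the assumption that the annotation is only needed for type checking. The `EngineSpec` and `Database` classes below are hypothetical stand-ins, not the Superset classes. With the future import, annotations are stored as strings and not evaluated when the function is defined, so a name defined later in the module can be referenced unquoted.

```python
from __future__ import annotations

from typing import Dict, List, Optional


class EngineSpec:
    @classmethod
    def get_table_names(cls, database: Database, schema: Optional[str]) -> List[str]:
        # "Database" is defined further down; without the future import this
        # unquoted annotation would raise NameError at definition time.
        return sorted(database.tables.get(schema or "main", []))


class Database:
    """Hypothetical stand-in holding table names per schema."""

    def __init__(self, tables: Dict[str, List[str]]) -> None:
        self.tables = tables


db = Database({"main": ["users", "events"]})
names = EngineSpec.get_table_names(db, None)
print(names)
```

The future import has to be the first statement in the module (after any docstring and comments), which is why the review asks for it at the beginning of the file.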
Thanks for the review @villebro @zhaoyongjie - I'll incorporate your feedback and get the linter passing.
OK - I am about 90% confident that I am running the linter properly locally and that all the tests will pass now - looks like I cannot trigger the CI myself, but I think if someone kicks that off we'll see the build go green.
@rwhaling, the CI looks like it is waiting for other tasks to finish. Once the CI is all green, I will merge it. Thanks for following up!
"Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in ..."
I had one small comment in case it is helpful! DuckDB is indeed embedded in your local process, but it is multi-threaded and can use as many CPU cores as you would like. Thanks for building this connector!
* + duckdb support needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)
* Update duckdb.py update _time_grain_expressions
* removed superfluous get_all_datasource_names def in duckdb engine spec
* added exception handling for duckdb single-threaded RuntimeError
* fixed linter blips and other stylistic cleanup in duckdb.py
* one last round of linter tweaks in test_connection.py for duckdb support
Co-authored-by: Steven Lee <admin@alitrack.com>
Co-authored-by: Richard Whaling <richardwhaling@Richards-MacBook-Pro.local>
(cherry picked from commit 202e34a)
Hi, testing this and Superset doesn't work when installing
Here's my full Dockerfile
@inakianduaga heya, that's on me - unfortunately sqlalchemy doesn't really document what's expected from a dialect, so it's hard to keep up with the changes on their side. I'll push out a fixed version shortly
@inakianduaga it should work now if you try with duckdb_engine 0.1.11, sorry about that 😅
ok thanks. It actually worked for me by going to the fixed
I want to access Parquet files on S3 from the Superset SQL Editor. While I am able to do this with DuckDB in my Python shell, I am wondering how to do it with Superset and/or duckdb-engine. My Python snippet to load Parquet files from S3:
import duckdb
cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id='**************'")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet')")
@Mause Could you please help with loading S3 parquet files with your engine?
if you're having an issue with
PS. @zhaoyongjie any chance you could lock this conversation?
SUMMARY
Adds DuckDB as an embedded, in-process OLAP database engine.
DuckDB can directly query CSV or Parquet files on disk - eventually, we should be able to query Parquet files directly on S3 as well.
Supersedes #19265
Relying on https://github.com/Mause/duckdb_engine for the SQLAlchemy implementation, and building on top of @alitrack's original work.
Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the threading blip.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
pip install duckdb-engine
duckdb:////Users/whoever/path/to/duck.db
duckdb:///:memory: seems to work?
SELECT * from 'test.parquet' etc.
ADDITIONAL INFORMATION
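A note on the two connection URIs in the testing instructions above: in the SQLAlchemy URL convention, everything after the third slash is the database name, so the four-slash form yields an absolute path and `:memory:` passes through unchanged. The sketch below illustrates this with plain string handling; `database_from_uri` is a hypothetical helper written for this example, not part of SQLAlchemy or duckdb-engine, and it assumes a URI with no user, host, or query string.

```python
def database_from_uri(uri: str) -> str:
    """Return the database part of a 'dialect:///database' style URI.

    Illustrative only: assumes an empty host and no credentials or query.
    """
    _scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not a URI: {uri!r}")
    # 'rest' is '<host>/<database>'; the host is empty for file-based engines.
    _host, _, database = rest.partition("/")
    return database


# Four slashes: the fourth slash is the start of an absolute path.
print(database_from_uri("duckdb:////Users/whoever/path/to/duck.db"))
# Three slashes followed by the special in-memory name.
print(database_from_uri("duckdb:///:memory:"))
```

A URI with only three slashes and a plain file name would, by the same rule, resolve to a relative path.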