
feat: add duckdb as DataSource - Fixes #14563 #19317

Merged
merged 7 commits into master on Mar 24, 2022

Conversation

rwhaling
Contributor

@rwhaling rwhaling commented Mar 22, 2022

SUMMARY

Adds DuckDB as an embedded, in-process OLAP database engine.
DuckDB can directly query CSV or Parquet files on disk; eventually, we should be able to query Parquet files directly on S3 as well.
Supersedes #19265
Relying on https://github.com/Mause/duckdb_engine for the SQLAlchemy implementation, and building on top of @alitrack's original work.
Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the threading blip.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

(Four screenshots attached, taken Mar 22, 2022.)

TESTING INSTRUCTIONS

  • Install duckdb_engine via pip install duckdb-engine
  • DuckDB appears in the "add a database" dropdown
  • Provide the path to a DuckDB database on disk in the URL, e.g., duckdb:////Users/whoever/path/to/duck.db
  • In SQL Lab, query any tables defined in your on-disk database; tables, schemas, and column definitions all appear as normal
  • I have not tested in-memory mode much, but duckdb:///:memory: seems to work
  • Once you are in SQL Lab, you can directly query Parquet or CSV files on disk: SELECT * FROM 'test.parquet' etc.

ADDITIONAL INFORMATION

  • Has associated issue: duckdb as DataSource #14563
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov

codecov bot commented Mar 22, 2022

Codecov Report

Merging #19317 (4aa477b) into master (d645579) will decrease coverage by 0.03%.
The diff coverage is 74.77%.

❗ Current head 4aa477b differs from pull request most recent head d1e16c4. Consider uploading reports for the commit d1e16c4 to get more accurate results.

@@            Coverage Diff             @@
##           master   #19317      +/-   ##
==========================================
- Coverage   66.53%   66.50%   -0.04%     
==========================================
  Files        1667     1672       +5     
  Lines       64360    64564     +204     
  Branches     6493     6493              
==========================================
+ Hits        42824    42936     +112     
- Misses      19854    19946      +92     
  Partials     1682     1682              
Flag Coverage Δ
hive ?
mysql 81.65% <88.63%> (+0.16%) ⬆️
postgres 81.69% <88.63%> (+0.16%) ⬆️
presto ?
python 81.78% <88.63%> (-0.18%) ⬇️
sqlite 81.46% <88.63%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...t-ui-chart-controls/src/operators/pivotOperator.ts 100.00% <ø> (ø)
...i-chart-controls/src/operators/resampleOperator.ts 100.00% <ø> (ø)
.../superset-ui-core/src/query/types/QueryFormData.ts 100.00% <ø> (ø)
...tend/packages/superset-ui-core/src/style/index.tsx 100.00% <ø> (ø)
...end/plugins/legacy-plugin-chart-chord/src/Chord.js 0.00% <0.00%> (ø)
...ns/legacy-plugin-chart-chord/src/transformProps.js 0.00% <0.00%> (ø)
.../legacy-plugin-chart-country-map/src/CountryMap.js 0.00% <0.00%> (ø)
...acy-plugin-chart-country-map/src/transformProps.js 0.00% <0.00%> (ø)
...ns/legacy-plugin-chart-histogram/src/Histogram.jsx 0.00% <0.00%> (ø)
...egacy-plugin-chart-histogram/src/transformProps.js 0.00% <0.00%> (ø)
... and 258 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d645579...d1e16c4. Read the comment docs.

@zhaoyongjie zhaoyongjie self-requested a review March 23, 2022 02:06
Member

@zhaoyongjie zhaoyongjie left a comment


Except for the lint error, LGTM. BTW, I was impressed by DuckDB as a lightweight columnar database.

Member

@villebro villebro left a comment


LGTM - a few non-blocking style-related comments

Comment on lines 97 to 99
except RuntimeError:
# Catches the equivalent single-threading error from duckdb.
alive = engine.dialect.do_ping(conn)
Member


nit: could we have these in the same except:

                except (sqlite3.ProgrammingError, RuntimeError):
                    # SQLite can't run on a separate thread, so ``func_timeout`` fails
                    # RuntimeError catches the equivalent single-threading error from duckdb.
                    alive = engine.dialect.do_ping(conn)


@classmethod
def get_table_names(
cls, database: "Database", inspector: Inspector, schema: Optional[str]
Member


nit: if we add from __future__ import annotations at the beginning of the file, we can remove the quotes. See an example here:

from __future__ import annotations

Suggested change
cls, database: "Database", inspector: Inspector, schema: Optional[str]
cls, database: Database, inspector: Inspector, schema: Optional[str]
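As a standalone sketch of why the quotes become unnecessary: with `from __future__ import annotations` (PEP 563), annotations are no longer evaluated at definition time and are stored as plain strings, so forward references resolve lazily:

```python
from __future__ import annotations

from typing import Optional

def get_table_names(database: Database, schema: Optional[str]) -> list:
    # ``Database`` is only defined below, but this is fine: the annotation
    # is kept as the string "Database" and never evaluated here.
    return []

class Database:
    pass

# Annotations are stored verbatim as strings.
print(get_table_names.__annotations__["database"])  # Database
```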

@rwhaling
Contributor Author

Thanks for the review @villebro @zhaoyongjie - I'll incorporate your feedback and get the linter passing.

@rwhaling
Contributor Author

OK - I am about 90% confident that I am running the linter properly on my local and that all the tests will pass now - looks like I cannot trigger the CI myself, but I think if someone kicks that off we'll see the build go green.

@zhaoyongjie
Member

OK - I am about 90% confident that I am running the linter properly on my local and that all the tests will pass now - looks like I cannot trigger the CI myself, but I think if someone kicks that off we'll see the build go green.

@rwhaling, the CI looks like it is waiting for other tasks to finish. When the CI is all green, I will merge it. Thanks for following up!

@zhaoyongjie zhaoyongjie reopened this Mar 24, 2022
@zhaoyongjie zhaoyongjie merged commit 202e34a into apache:master Mar 24, 2022
@villebro
Member

@rwhaling - One thing that I forgot while reviewing: would you be able to open a PR that adds the relevant documentation for connecting to DuckDB? Something similar to what this PR did for Trino (there are some unrelated changes, you can ignore those): #13171

@rwhaling
Contributor Author

@rwhaling - One thing that I forgot while reviewing: would you be able to open a PR that adds the relevant documentation for connecting to DuckDB? Something similar to what this PR did for Trino (there are some unrelated changes, you can ignore those): #13171

absolutely 👍

@Alex-Monahan

Alex-Monahan commented Mar 24, 2022

Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the threading blip.

I had one small comment in case it is helpful! DuckDB is indeed embedded in your local process, but it is multi-threaded and can use as many CPU cores as you would like.

Thanks for building this connector!

villebro pushed a commit that referenced this pull request Apr 3, 2022
* + duckdb support

needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)

* Update duckdb.py

update  _time_grain_expressions

* removed superfluous get_all_datasource_names def in duckdb engine spec

* added exception handling for duckdb single-threaded RuntimeError

* fixed linter blips and other stylistic cleanup in duckdb.py

* one last round of linter tweaks in test_connection.py for duckdb support

Co-authored-by: Steven Lee <admin@alitrack.com>
Co-authored-by: Richard Whaling <richardwhaling@Richards-MacBook-Pro.local>
(cherry picked from commit 202e34a)
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
@inakianduaga

inakianduaga commented Jun 21, 2022

Hi, I'm testing this and Superset doesn't work after installing duckdb-engine: it throws a 500 internal error on every page in the UI. The backend says:

module 'duckdb_engine' has no attribute 'name'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/app/superset/utils/log.py", line 245, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/views/core.py", line 2677, in welcome
    "common": common_bootstrap_payload(),
  File "/app/superset/views/base.py", line 368, in common_bootstrap_payload
    available_specs = get_available_engine_specs()
  File "/app/superset/db_engine_specs/__init__.py", line 141, in get_available_engine_specs
    backend = dialect.name

Here's my full Dockerfile

FROM apache/superset

# Switching to root to install the required packages
USER root

# Example: installing the MySQL driver to connect to the metadata database
# if you prefer Postgres, you may want to use `psycopg2-binary` instead
RUN pip install psycopg2-binary

# Example: installing a driver to connect to Redshift
# Find which driver you need based on the analytics database
# you want to connect to here:
# https://superset.apache.org/installation.html#database-dependencies
# RUN pip install sqlalchemy-redshift
RUN pip install pyhive[hive]
RUN pip install pyhive[presto]

# DuckDB support
RUN pip install duckdb-engine

ADD ./config/superset_config.py /app/pythonpath/superset_config.py

# Switching back to using the `superset` user
USER superset

@Mause

Mause commented Jun 21, 2022

@inakianduaga heya, that's on me - unfortunately sqlalchemy doesn't really document what's expected from a dialect, so it's hard to keep up with the changes on their side. I'll push out a fixed version shortly
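The failure mode is easy to reproduce in miniature: Superset iterates over installed dialect modules and reads their `name` attribute, so a module that doesn't define it blows up the loop. A defensive sketch (the names here are simplified stand-ins, not the actual Superset code):

```python
from types import SimpleNamespace

def available_backends(dialects):
    """Collect backend names, skipping dialect modules that don't expose
    the ``name`` attribute SQLAlchemy dialects are expected to have."""
    backends = []
    for dialect in dialects:
        name = getattr(dialect, "name", None)
        if name is None:
            continue  # e.g. duckdb_engine releases before 0.1.11
        backends.append(name)
    return backends

good = SimpleNamespace(name="postgresql")
broken = SimpleNamespace()  # mimics a dialect module missing ``name``
print(available_backends([good, broken]))  # ['postgresql']
```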

@Mause

Mause commented Jun 21, 2022

@inakianduaga it should work now if you try with duckdb_engine 0.1.11, sorry about that 😅

@inakianduaga

Ok, thanks. It actually worked for me after switching to the pinned superset:1.5.1 Docker image instead of latest. Not sure whether that made the difference or whether it was your fix landing at the same time, but either way, it works now 🙇

@Mageswaran1989

I wanted to access S3 Parquet files from the Superset SQL editor. I am able to do this with DuckDB in my Python shell, but I am wondering how to do it with Superset and/or duckdb-engine.

My Python snippet to load Parquet files from S3:

import duckdb

cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id='**************'")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet')")
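For reference, the session-setup half of the snippet can be factored into a helper (a sketch only; `s3_setup_statements` is a made-up name and the credential values are placeholders; whether duckdb_engine can run such statements at connect time is a question for that project):

```python
def s3_setup_statements(region: str, key_id: str, secret: str) -> list:
    """Build the DuckDB session-setup statements from the snippet above."""
    return [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        f"SET s3_region='{region}';",
        f"SET s3_access_key_id='{key_id}';",
        f"SET s3_secret_access_key='{secret}';",
    ]

# Placeholder credentials; never hard-code real ones.
stmts = s3_setup_statements("us-east-1", "AKIA...", "********")
print(stmts[2])  # SET s3_region='us-east-1';
```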

@Mageswaran1989

@Mause Could you please help with loading S3 parquet files with your engine?
Reference: https://duckdb.org/docs/guides/import/s3_import

@Mause

Mause commented Jul 14, 2022

@Mageswaran1989

if you're having an issue with

PS. @zhaoyongjie any chance you could lock this conversation?

@apache apache locked as off-topic and limited conversation to collaborators Jul 15, 2022
@apache apache unlocked this conversation Jul 15, 2022
@apache apache locked as resolved and limited conversation to collaborators Jul 15, 2022
@mistercrunch mistercrunch added 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels lts-v1 size/M 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🚢 2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants