chore(datasets): Add /api/v1/dataset/{pk}/column endpoint #22332

john-bodley · 2022-12-05T21:54:29Z

SUMMARY

Per SIP-64, this PR adds the /api/v1/dataset/{pk}/column endpoint for obtaining just the columns associated with a dataset, to aid the performance of the native filters.

As indicated in the SIP, the /api/v1/dataset/{pk} endpoint is highly inefficient when fetching a dataset comprising a large number of columns and/or metrics. Specifically, at Airbnb we have a virtual dataset which houses thousands of metrics and dimensions, and this results in an interim query which computes the cross product of # metrics x # columns x # owners, resulting in millions of rows.

Given the native filters only use the columns/metrics independently from the /api/v1/dataset/{pk} endpoint this PR simply adds a new endpoint to fetch only the columns metadata associated with said dataset.
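To illustrate why the cross product described above grows so quickly, here is a back-of-the-envelope sketch in Python. The function names and the counts are hypothetical, chosen only to match the "thousands of metrics and dimensions" order of magnitude; they are not Airbnb's actual numbers, and the sketch does not reproduce the real FAB/SQLAlchemy query.

```python
def joined_row_count(n_metrics: int, n_columns: int, n_owners: int) -> int:
    """Rows produced when three one-to-many relationships are joined in a single query."""
    return n_metrics * n_columns * n_owners


def separate_row_count(n_metrics: int, n_columns: int, n_owners: int) -> int:
    """Rows produced when each relationship is fetched with its own query."""
    return n_metrics + n_columns + n_owners


# Hypothetical counts, for illustration only.
rows_joined = joined_row_count(n_metrics=1_000, n_columns=2_000, n_owners=3)
rows_separate = separate_row_count(n_metrics=1_000, n_columns=2_000, n_owners=3)

print(rows_joined)    # 6000000
print(rows_separate)  # 3003
```

Under these assumed counts, fetching each relationship independently touches a few thousand rows instead of millions, which is the motivation for a columns-only endpoint.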

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Added unit tests. Also tested the new endpoint with a dataset comprising ~7,000 columns: the /api/v1/dataset/{pk}/column endpoint returned in ~1 second, whereas /api/v1/dataset/{pk} timed out.

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

john-bodley · 2022-12-05T21:54:56Z

docs/static/resources/openapi.json

@@ -257,7 +257,7 @@
      "AnnotationLayerRestApi.get_list": {
        "properties": {
          "changed_by": {
-            "$ref": "#/components/schemas/AnnotationLayerRestApi.get_list.User1"
+            "$ref": "#/components/schemas/AnnotationLayerRestApi.get_list.User"


Regenerated by running superset update-api-docs.

codecov · 2022-12-06T02:18:44Z

Codecov Report

Merging #22332 (e10ca2b) into master (d41cb66) will decrease coverage by 0.64%.
The diff coverage is 78.26%.

❗ Current head e10ca2b differs from pull request most recent head 6845322. Consider uploading reports for the commit 6845322 to get more accurate results

@@            Coverage Diff             @@
##           master   #22332      +/-   ##
==========================================
- Coverage   66.75%   66.11%   -0.65%     
==========================================
  Files        1847     1847              
  Lines       70562    70577      +15     
  Branches     7742     7737       -5     
==========================================
- Hits        47106    46664     -442     
- Misses      21455    21906     +451     
- Partials     2001     2007       +6

Flag	Coverage Δ
hive	`52.53% <76.19%> (?)`
mysql	`?`
postgres	`?`
presto	`52.43% <76.19%> (+<0.01%)`	⬆️
python	`79.67% <76.19%> (-1.34%)`	⬇️
sqlite	`76.48% <76.19%> (-0.01%)`	⬇️
unit	`50.92% <76.19%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
superset/datasets/columns/api.py	`70.00% <76.19%> (-30.00%)`	⬇️
...tersConfigModal/FiltersConfigForm/ColumnSelect.tsx	`76.92% <100.00%> (ø)`
superset/datasets/commands/create.py	`30.61% <0.00%> (-69.39%)`	⬇️
superset/datasets/commands/update.py	`25.00% <0.00%> (-69.05%)`	⬇️
superset/datasets/commands/bulk_delete.py	`33.33% <0.00%> (-53.34%)`	⬇️
superset/datasets/columns/commands/delete.py	`44.11% <0.00%> (-52.95%)`	⬇️
superset/datasets/metrics/commands/delete.py	`44.11% <0.00%> (-52.95%)`	⬇️
superset/datasets/dao.py	`44.21% <0.00%> (-50.35%)`	⬇️
superset/sql_validators/postgres.py	`50.00% <0.00%> (-50.00%)`	⬇️
superset/datasets/commands/delete.py	`48.57% <0.00%> (-48.58%)`	⬇️
... and 45 more

rusackas · 2022-12-06T16:41:38Z

Thanks for taking this performance issue under your wing. I'm hopeful this can mean other orgs with this issue can take this new filters feature to production, and move us another step toward 3.0 and the end of Filter Boxes :D

ktmud

Not sure why Cypress is failing, but the change makes sense to me.

ktmud · 2022-12-06T19:46:24Z

...src/dashboard/components/nativeFilters/FiltersConfigModal/FiltersConfigForm/ColumnSelect.tsx

@@ -98,17 +98,17 @@ export function ColumnSelect({
    }
    if (datasetId != null) {
      cachedSupersetGet({
-        endpoint: `/api/v1/dataset/${datasetId}`,
+        endpoint: `/api/v1/dataset/${datasetId}/column`,


Suggested change

endpoint: `/api/v1/dataset/${datasetId}/column`,

endpoint: `/api/v1/dataset/${datasetId}/columns`,

Should this endpoint be /columns?

+1 on columns as it represents a set of columns.

I added a ticket to Superset 3.0 project board to refactor the endpoints to use the plural form.

@ktmud and @michael-s-molina I agree that it should be columns and not column, and likewise datasets and not dataset. However, we already use the singular form, hence using column ensures consistency, which I sense is more important. Per @michael-s-molina's comment, I think changing to plural forms should be a bulk operation.

john-bodley · 2022-12-06T23:33:58Z

superset/datasets/columns/api.py

+        rison_data.setdefault("filters", [])
+        rison_data["filters"].append({"col": "table_id", "opr": "eq", "value": pk})
+        rison_data["page_size"] = -1
+        return self.get_list_headless(**kwargs)


@dpgaspar, @michael-s-molina, or @villebro I would love your input on this as I'm not overly familiar with the new FAB API and/or how best to define said endpoint. I came across the pattern by prodding around.

Can we use something more like: https://github.com/apache/superset/blob/master/superset/annotation_layers/annotations/api.py#L129

Why the rison_data["page_size"] = -1? Is it because the frontend is not handling pagination yet?

Also, we probably need to apply DatasourceFilter here, or users will be able to view dataset columns from datasets they're not authorized to view.

@dpgaspar I added the rison_data["page_size"] = -1 as the /api/v1/dataset/{pk} response is non-paginated and I wanted to ensure consistency. Additionally this endpoint is currently solely used in a non-paginated way, i.e., it's not powering any /*/list/ endpoints.

Regarding the pattern I see it being only used twice and it only applies one rison override (as opposed to two which is needed in our case). I think there's merit in refactoring if it's used more than once. Furthermore this example is actually a misnomer in terms of naming, i.e., the function name has been copied and pasted as is even though it's applying the report_schedule as opposed to layer.

I agree, let's refactor this afterwards. I still think we need to apply the DatasourceFilter here; have you checked it?
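The override pattern discussed in this thread can be sketched in isolation as plain Python. This is a minimal sketch under stated assumptions: the helper name scope_to_dataset is hypothetical, it operates on an already-parsed rison dictionary, and it does not call FAB or apply DatasourceFilter, so it is not the PR's actual implementation.

```python
from typing import Any, Dict


def scope_to_dataset(rison_data: Dict[str, Any], pk: int) -> Dict[str, Any]:
    """Return a copy of the parsed rison query restricted to one dataset."""
    scoped = dict(rison_data)
    # Copy the filter list so the caller's query is not mutated.
    filters = list(scoped.get("filters", []))
    # Same filter shape as in the diff: only columns whose table_id equals pk.
    filters.append({"col": "table_id", "opr": "eq", "value": pk})
    scoped["filters"] = filters
    # The PR sets page_size to -1 to request a non-paginated response.
    scoped["page_size"] = -1
    return scoped


query = scope_to_dataset({"filters": [{"col": "is_dttm", "opr": "eq", "value": True}]}, pk=42)
print(query["filters"][-1])  # {'col': 'table_id', 'opr': 'eq', 'value': 42}
print(query["page_size"])    # -1
```

Any filters supplied by the caller are preserved and the dataset filter is appended last, which mirrors the setdefault/append sequence in the diff while avoiding in-place mutation.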

john-bodley · 2022-12-06T23:35:51Z

superset/datasets/columns/api.py

    method_permission_name = MODEL_API_RW_METHOD_PERMISSION_MAP
-
+    list_columns = [  # See DatasetRestApi.show_select_columns.


I'm not overly sure of the difference between list_columns and show_columns, but I thought it was prudent that the table_columns columns returned were consistent with the columns field in the /api/v1/dataset response.

list_columns is for get_list and show_columns is for get item. Not great naming; at the time I aimed for compatibility with the previous ModelView classes on FAB to allow for a smooth migration.

Can we use this opportunity to add comments in the code explaining the difference?
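As one possible shape for such comments, here is a plain-Python sketch of the distinction described above. The class DatasetColumnsApiSketch and the helper columns_for are hypothetical stand-ins that deliberately avoid importing FAB, and the field names are illustrative rather than the PR's actual list.

```python
from typing import List, Type


class DatasetColumnsApiSketch:
    """Plain-Python stand-in for a FAB model REST API class (no FAB import)."""

    # list_columns: fields serialized for each item returned by get_list
    # (the collection endpoint).
    list_columns = ["id", "column_name", "type"]

    # show_columns: fields serialized by get (the single-item endpoint).
    show_columns = ["id", "column_name", "type", "description"]


def columns_for(api: Type[DatasetColumnsApiSketch], method: str) -> List[str]:
    """Pick the attribute that governs the response fields for a route method."""
    if method == "get_list":
        return api.list_columns
    if method == "get":
        return api.show_columns
    raise ValueError(f"unsupported method: {method}")


print(columns_for(DatasetColumnsApiSketch, "get_list"))  # ['id', 'column_name', 'type']
```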

dpgaspar

This is great @john-bodley! thank you

dpgaspar · 2022-12-07T10:56:10Z

superset/datasets/columns/api.py

@@ -36,14 +38,99 @@
 class DatasetColumnsRestApi(BaseSupersetModelRestApi):
    datamodel = SQLAInterface(TableColumn)

-    include_route_methods = {"delete"}
+    allow_browser_login = True


we should probably check and move allow_browser_login = True into BaseSupersetModelRestApi since it seems it's used everywhere, I'll make a PR for it

dpgaspar · 2022-12-07T10:59:12Z

superset/datasets/columns/api.py

    class_permission_name = "Dataset"
+    include_route_methods = {RouteMethod.DELETE, RouteMethod.GET, RouteMethod.GET_LIST}


we just need the RouteMethod.GET_LIST

rusackas · 2022-12-08T23:18:45Z

Kicking CI 🤞

john-bodley · 2022-12-10T00:44:49Z

Regrettably when examining the initial problem I missed that the /api/v1/dataset/{pk} endpoint is also used as part of the native filters to fetch the metrics + dataset metadata here.

I thought it was prudent to return to the drawing board and close this PR in favor of trying to optimize the underlying FAB/SQLAlchemy logic when fetching a model with its associated relationships. @dpgaspar hopefully dpgaspar/Flask-AppBuilder#1959 is a step in the right direction.

All going well I'll follow up with another PR which updates the existing /api/v1/dataset/{pk} queries to filter on only the relevant dataset, columns, and metrics fields which should further improve query performance and reduce the response payload size.

pull-request-size bot added the size/XXL label Dec 5, 2022

john-bodley force-pushed the john-bodley--api-datasets-columns branch 4 times, most recently from f4c2c43 to a4aa974 Compare December 6, 2022 01:17

john-bodley marked this pull request as ready for review December 6, 2022 01:31

john-bodley force-pushed the john-bodley--api-datasets-columns branch from a4aa974 to 939d13e Compare December 6, 2022 02:10

john-bodley force-pushed the john-bodley--api-datasets-columns branch from 939d13e to f8d36c7 Compare December 6, 2022 08:50

rusackas requested review from dpgaspar, betodealmeida and villebro December 6, 2022 16:39

john-bodley force-pushed the john-bodley--api-datasets-columns branch 3 times, most recently from 7684198 to f1bd7f2 Compare December 6, 2022 18:44

john-bodley requested review from jinghua-qa and geido as code owners December 6, 2022 18:44

john-bodley requested a review from ktmud December 6, 2022 19:14

ktmud approved these changes Dec 6, 2022

dpgaspar reviewed Dec 7, 2022

chore(datasets): Add /api/v1/dataset/{pk}/column endpoint

6845322

john-bodley force-pushed the john-bodley--api-datasets-columns branch from f1bd7f2 to 6845322 Compare December 8, 2022 19:55

john-bodley changed the title ~~chore(datasets): Add /api/v1/dataset/{pk}/column endpoint~~ chore(datasets): Add /api/v1/dataset/{pk}/[column|metric] endpoints Dec 9, 2022

john-bodley changed the title ~~chore(datasets): Add /api/v1/dataset/{pk}/[column|metric] endpoints~~ chore(datasets): Add /api/v1/dataset/{pk}/column endpoint Dec 9, 2022

john-bodley closed this Dec 10, 2022

john-bodley mentioned this pull request Dec 14, 2022

chore(sqla): Add explicit bidirectional performant relationships for SQLA model #22413

Merged

9 tasks
