Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat(ingest): improve tableau lineage, workbooks query, fix pagination #5756

Merged

Conversation

mayurinehate
Copy link
Collaborator

@mayurinehate mayurinehate commented Aug 29, 2022

Changes:

  1. Add support for platform_instance_map config to specify platform instance for external tables (e.g. snowflake, hive)
  2. Add support for overriding platform name . lineage_overrides.platform_override_map
  3. Restore pagination change for custom sql and published datasource query, accidently reverted during this refractor [Issue]
  4. Emit upstream tables after processing workbooks, embedded datasources, custom SQL tables, published datasources, not to miss emitting external table aspects for any. [Issue]
  5. Correct casing for "view" subtype to show "View Definition" tab [Issue]
  6. Separate embeddedDatasourcesConnection from workbooksConnection query to avoid node limit exceeded error for complex workbooks [Issue]

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@mayurinehate mayurinehate force-pushed the tableau_platform_instance_map branch from ae8f8e9 to 8773816 Compare August 29, 2022 11:29
@github-actions
Copy link

github-actions bot commented Aug 29, 2022

Unit Test Results (metadata ingestion)

       8 files         8 suites   55m 37s ⏱️
   672 tests    669 ✔️ 3 💤 0
1 344 runs  1 338 ✔️ 6 💤 0

Results for commit cc2f4f8.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Aug 29, 2022

Unit Test Results (build & test)

517 tests  ±0   517 ✔️ ±0   8m 36s ⏱️ +17s
121 suites ±0       0 💤 ±0 
121 files   ±0       0 ±0 

Results for commit cc2f4f8. ± Comparison against base commit 2a95248.

♻️ This comment has been updated with latest results.

@shirshanka
Copy link
Contributor

Does this PR also address this concern?

@mayurinehate
Copy link
Collaborator Author

Does this PR also address this concern?

Not all of the concerns. Only part where page_size needs to be supported for customSQLTablesConnection and publishedDatasourcesConnection queries.

@mayurinehate
Copy link
Collaborator Author

Does this PR also address this concern?

Not all of the concerns. Only part where page_size needs to be supported for customSQLTablesConnection and publishedDatasourcesConnection queries.

All the concerns from this thread are now addressed.

@mayurinehate mayurinehate force-pushed the tableau_platform_instance_map branch from 9b10af8 to cc2f4f8 Compare September 2, 2022 07:15
@mayurinehate mayurinehate changed the title feat(tableau): add platform_instance_map, lineage_overrides in tableau feat(ingest): improve tableau lineage, workbooks query, fix pagination Sep 2, 2022
@cuong-pham
Copy link
Contributor

@mayurinehate I don't know if we should make this PR bigger. But i observed some behaviour with Tableau Metadata API with regarding CustomSQLTables (we have some very complex custom sql in tableau, that's how i noticed the behaviour):

  • If the query is simple enough, tables properties in CustomSQLTables actually represents which tables the CustomSQLTable use. Currently we get the upstreamTables in datasouces, which is not always correct. I can have multiple tables connected to a datasource but my custom query only use one of them, or worse only use the connection to the database.
  • If the query is a bit more complex, tables will be empty. We have 2 options, using the existing parsing logic (upstreamTables are populated in datasources field) or we can use sqllineage to parse the the query, but sqllineage is quite buggy.
  • The some cases, if the query is too complex, both tables and upstreamTables in datasources are empty, you have to rely on parsing queries.

I will put some sample payload up

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Merging this to make progress and improve this connector.
We can improve on the cases @cuong-pham mentioned in a future PR.

@shirshanka shirshanka added the ingestion PR or Issue related to the ingestion of metadata label Sep 6, 2022
@shirshanka shirshanka merged commit a8c1397 into datahub-project:master Sep 6, 2022
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Sep 8, 2022
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Sep 8, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants