
databricks workflow failing with 'too many 503 error responses' #892

Open
mkjain1982 opened this issue Dec 27, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@mkjain1982

Describe the bug

A dbt job in a workflow terminates with the error Max retries exceeded with url: ... (Caused by ResponseError('too many 503 error responses')) just as it starts sending SQL commands to the cluster.
No changes have been made to the code or yml files.
This occurs only with SQL Warehouse Pro, not with SQL Warehouse Serverless.

Steps To Reproduce

The problem occurs in a workflow in a Databricks workspace with the following settings:
Running on Azure, Databricks Premium, not Unity Catalog
Job cluster single node Standard_DS3_v2
Work cluster SQL Warehouse Pro X-Small, Cluster count: Active 0 Min 1 Max 1, Channel Current, Cost optimized
git source Azure Devops
Settings for library version dbt-databricks>=1.0.0,<2.0.0

Start the workflow. After the job cluster has been created and the SQL Warehouse has started, an error is shown in the log:

  • dbt deps --profiles-dir ../misc/misc/ -t prod
    10:14:06 Running with dbt=1.9.1
    10:14:07 Updating lock file in file path: /tmp/tmp-dbt-run-126728701395951/piab/dbt/package-lock.yml
    10:14:07 Installing calogica/dbt_expectations
    10:14:07 Installed from version 0.10.4
    10:14:07 Up to date!
    10:14:07 Installing dbt-labs/dbt_utils
    10:14:08 Installed from version 1.1.1
    10:14:08 Updated version available: 1.3.0
    10:14:08 Installing calogica/dbt_date
    10:14:08 Installed from version 0.10.1
    10:14:08 Up to date!
    10:14:08
    10:14:08 Updates available for packages: ['dbt-labs/dbt_utils']
    Update your versions in packages.yml, then run dbt deps

  • dbt build --profiles-dir ../misc/misc/ -t prod -f
    10:14:10 Running with dbt=1.9.1
    10:14:11 Registered adapter: databricks=1.9.1
    10:14:12 Unable to do partial parsing because saved manifest not found. Starting full parse.
    10:14:31 Found 435 models, 103 snapshots, 1 analysis, 8 seeds, 1559 data tests, 123 sources, 8 exposures, 999 macros
    10:14:32
    10:14:32 Concurrency: 12 threads (target='prod')
    10:14:32
    10:14:58
    10:14:58 Finished running in 0 hours 0 minutes and 26.49 seconds (26.49s).
    10:14:58 Encountered an error:
    Database Error
    HTTPSConnectionPool(host='adb-130132662866554.14.azuredatabricks.net', port=443): Max retries exceeded with url: /sql/1.0/warehouses/660a2880f1cab4fb (Caused by ResponseError('too many 503 error responses'))

Expected behavior

The dbt-databricks workflow starts without any error, as shown below:

  • dbt deps --profiles-dir ../misc/misc/ -t prod
    10:20:37 Running with dbt=1.9.1
    10:20:37 Updating lock file in file path: /tmp/tmp-dbt-run-355636934123336/piab/dbt/package-lock.yml
    10:20:37 Installing calogica/dbt_expectations
    10:20:38 Installed from version 0.10.4
    10:20:38 Up to date!
    10:20:38 Installing dbt-labs/dbt_utils
    10:20:38 Installed from version 1.1.1
    10:20:38 Updated version available: 1.3.0
    10:20:38 Installing calogica/dbt_date
    10:20:38 Installed from version 0.10.1
    10:20:38 Up to date!
    10:20:38
    10:20:38 Updates available for packages: ['dbt-labs/dbt_utils']
    Update your versions in packages.yml, then run dbt deps

  • dbt build --profiles-dir ../misc/misc/ -t prod -f
    10:20:41 Running with dbt=1.9.1
    10:20:42 Registered adapter: databricks=1.9.1
    10:20:42 Unable to do partial parsing because saved manifest not found. Starting full parse.
    10:21:01 Found 435 models, 103 snapshots, 1 analysis, 8 seeds, 1559 data tests, 123 sources, 8 exposures, 999 macros
    10:21:02
    10:21:02 Concurrency: 12 threads (target='prod')
    10:21:02
    10:21:19 1 of 1986 START sql table model staging.rollup12helper ......................... [RUN]
    10:21:19 2 of 1986 START sql table model staging.rollup24helper ......................... [RUN]

Screenshots and log output

NA

System information

The output of dbt --version:

dbt=1.9.1
Registered adapter: databricks=1.9.1


The operating system you're using:
NA
The output of python --version:
NA

Additional context

Add any other context about the problem here.

@mkjain1982 mkjain1982 added the bug Something isn't working label Dec 27, 2024
@KristoRSparkle

We have the same issue with dbt-databricks==1.9.1
Downgraded to 1.8.7 and that works.
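
A minimal sketch of such a pin, assuming the adapter is installed via a PyPI requirement specifier like the one in the issue's settings (the exact bounds are illustrative, not a confirmed fix):

  dbt-databricks>=1.8.7,<1.9.0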

@mkjain1982
Author

It was working fine until last week; suddenly we started getting this error. Temporarily we have changed the SQL Warehouse to Serverless instead of Pro, and it's working. However, I want to know the root cause of the issue.

@spenaustin

My team has been noticing this too. Here's what we found:

This only occurs when our SQL Warehouse is in the "Stopped" status.

This is extremely similar to #570, which was solved by Pull Request #578, which pinned the databricks-sql-connector package back to an older version. I think this is likely a similar problem: databricks-sql-connector just received an upgrade to version 3.7 on December 23rd, and we started seeing this issue on December 24th.

Looking into it further, it looks like version 3.7 altered the library's retry backoff behavior, which was also the issue in #570. Pinning our version of databricks-sql-connector to version 3.6 seems to have solved the problem for us, but leaving it unspecified will let pip default to installing the newest version.
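
A minimal sketch of such a pin, assuming the libraries are installed as PyPI requirement specifiers alongside the adapter (the exact bounds are illustrative):

  dbt-databricks>=1.9.0,<2.0.0
  databricks-sql-connector>=3.6.0,<3.7.0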

@benc-db
Collaborator

benc-db commented Jan 6, 2025

@spenaustin thanks for the investigation, you are probably right, and I will notify the sql connector team.

@benc-db
Collaborator

benc-db commented Jan 6, 2025

It looks like the issue is that defaults were changed; in your profile, you can try

connection_parameters:
  _retry_stop_after_attempts_count: 30

@anouar-zh

@benc-db Looks like a great suggestion, but where in the profile should you add this?

@benc-db
Collaborator

benc-db commented Jan 7, 2025

@benc-db looks like a great suggestion but where in the profile should you add this?

At the same level as you provide your credentials. I'm hopeful that a new version of the Python Databricks connector will be out shortly as well.
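
A minimal profiles.yml sketch of that placement, with placeholder values (only connection_parameters and its retry count come from this thread; the rest is a standard dbt-databricks output shown for illustration):

my_project:
  target: prod
  outputs:
    prod:
      type: databricks
      host: adb-xxxxxxxxxxxxxxxx.xx.azuredatabricks.net
      http_path: /sql/1.0/warehouses/xxxxxxxxxxxxxxxx
      schema: analytics
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 12
      connection_parameters:
        _retry_stop_after_attempts_count: 30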

@alexeyegorov

To provide more information:
we do not use Databricks SQL, but All Purpose Clusters. We run into the same issue every time we run a query against a cluster while it is stopped. We have a PR job that runs some simple tests, which no longer works with the current version.

@ddors3y

ddors3y commented Jan 14, 2025

We are continuing to see this error in dbt Cloud even after adding the connection_parameters in the extended attributes. The only surefire way to resolve the issue seems to be setting the version back to 1.7 (the last named version), which causes issues if the team is using features from later versions (i.e. microbatching).

Is there a timeline of when this issue will be resolved?

@benc-db
Collaborator

benc-db commented Jan 14, 2025

The issue should already be resolved. In your logs, what version do you see after this:

"databricks_sql_connector_version":

@ddors3y

ddors3y commented Jan 14, 2025

I don't see that specific line in the logs, but it looks like the adapter being used is 1.9.0.

From the logs:
2025-01-14 17:21:04.483773 (MainThread): 17:21:04 Registered adapter: databricks=1.9.0-post8+5e20eeaef43e671913f995d8079d4ec2b8a1da6d

@ilyaberd

The issue should already be resolved.

@benc-db Could you please link the PR that resolved this issue?

Starting Jan 15th, in dbt Cloud, the issue now exists in the "compatible" version as well. Previously the dbt team suggested using the "compatible" version as one of the workarounds.

@benc-db
Collaborator

benc-db commented Jan 16, 2025

databricks/databricks-sql-python#486

@benc-db
Collaborator

benc-db commented Jan 17, 2025

When I release 1.9.2 next week, I will ensure we have reasonable defaults coming from dbt-databricks. Users can still override, but if they don't, we will provide defaults to the SQL connector that allow sufficient time to start up clusters. This should work regardless of the version of the SQL connector, provided it's > 3 :).
