Feature request: configure all-purpose cluster libraries through DAB #1860

Open
rsayn opened this issue Oct 25, 2024 · 7 comments
Assignees: andrewnester
Labels: DABs, Enhancement

rsayn commented Oct 25, 2024

Describe the issue

Since CLI version 0.229.0, all-purpose (interactive) clusters can be created via DAB.

With job clusters, it's straightforward to install a DAB wheel artifact by listing it under libraries for a task that runs on that cluster.

With all-purpose clusters this is currently not possible; the only workaround is to attach the library programmatically after deployment, using the SDK or the REST API.

Configuration

bundle:
  name: demo-dab
  databricks_cli_version: 0.231.0

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  clusters:
    interactive:
      cluster_name: ${bundle.name} cluster
      data_security_mode: SINGLE_USER
      # [...] cluster config pointing to an all-purpose policy ID
      # these next lines are currently not valid
      libraries:
        - whl: "../dist/*.whl"

Expected Behavior

There should be a way to specify the deployed bundle wheel as a cluster-scoped library on the all-purpose cluster.

Actual Behavior

There's currently no way to specify this behaviour.
The wheel needs to be post-attached to the cluster via the SDK by:

  1. Retrieving the cluster's ID
  2. Attaching the libraries

Both steps are sketched below. Note that they would greatly benefit from the variable substitution happening inside DABs; without it, the cluster name and library path have to be inferred some other way.
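
A minimal sketch of that workaround with the Databricks SDK for Python (databricks-sdk); the cluster name and workspace wheel path here are assumptions that have to be kept in sync with the bundle config by hand:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library

w = WorkspaceClient()

# 1. Retrieve the cluster's ID by matching on its (assumed) name
cluster = next(
    c for c in w.clusters.list() if c.cluster_name == "demo-dab cluster"
)

# 2. Attach the deployed wheel as a cluster-scoped library
#    (illustrative path, not the bundle's real artifact location)
w.libraries.install(
    cluster_id=cluster.cluster_id,
    libraries=[Library(whl="/Workspace/path/to/demo_dab-0.1.0-py3-none-any.whl")],
)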

OS and CLI version

  • Databricks CLI v0.231.0
  • MacOS

Is this a regression?

No, this is a new feature request

Debug Logs

N/A

rsayn added the DABs label Oct 25, 2024
andrewnester added the Bug label Oct 29, 2024
andrewnester self-assigned this Oct 29, 2024
andrewnester (Contributor) commented

Hi @rsayn! Thanks for reporting the issue. Just to confirm: when you run a workflow on this cluster, is the library not installed either?

rsayn (Author) commented Oct 29, 2024

Hey @andrewnester! If I define jobs to run on this cluster, I can include libraries in the job / task definition.
However, my use case here is to boot a small interactive cluster for dev / debugging via attached notebooks, and I'd like to avoid the overhead of manually installing the project wheel that I deploy through DABs.

My request comes from the fact that you can specify cluster-scoped libraries from the Databricks UI, the SDK, or via a cluster policy (sketched below), but not via DABs.
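
For comparison, here is a minimal sketch of the cluster-policy route via the Python SDK, assuming the Cluster Policies API's libraries field is used to pin the wheel; the policy name, definition, and wheel path are illustrative:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library

w = WorkspaceClient()

# Libraries attached to a policy are installed on clusters
# created from (or restarted under) that policy.
w.cluster_policies.create(
    name="demo-dab-policy",  # hypothetical name
    definition='{"data_security_mode": {"type": "fixed", "value": "SINGLE_USER"}}',
    libraries=[Library(whl="/Workspace/path/to/demo_dab-0.1.0-py3-none-any.whl")],
)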

andrewnester (Contributor) commented

@rsayn thanks for clarifying, that makes sense. My expectation was that, with a configuration like yours, the libraries would be installed when the cluster is started (i.e. when the corresponding job starts). If that's not the case, it has to be fixed on our side and I'll look into it.

rsayn (Author) commented Oct 29, 2024

All right, thanks a lot! To further clarify: I think (please confirm) all-purpose clusters can still be used for jobs.

In that case, I'd expect any library configured on the job's tasks to override the default cluster libraries (which I believe is the current behaviour when you attach libraries via a cluster policy) 🤔

andrewnester added the Enhancement label and removed the Bug label Oct 29, 2024
andrewnester (Contributor) commented

I think I might have misunderstood the original issue. In any case, even an interactive cluster can be used in job tasks. But for the libraries to be installed, you need to specify them in the libraries section of the tasks, not of the clusters, so it could look like this:

resources:
  clusters:
    test_cluster:
      cluster_name: "test-cluste"
      spark_version: "13.3.x-snapshot-scala2.12"
      num_workers: 1
      data_security_mode: USER_ISOLATION

  jobs:
    some_other_job:
      name: "[${bundle.target}] Test Wheel Job"
      tasks:
        - task_key: TestTask
          existing_cluster_id: "${resources.clusters.test_cluster.cluster_id}"
          python_wheel_task:
            package_name: my_test_code
            entry_point: run
            parameters:
              - "one"
              - "two"
          libraries:
            - whl: ./dist/*.whl

rsayn (Author) commented Oct 29, 2024

Exactly. In my case I don't have any jobs attached to the cluster, so I can't use the setup you suggested.

rsayn (Author) commented Nov 8, 2024

Hello @andrewnester, any news about this? 🙏 LMK if I can help in any way!
