This repository contains examples of Azure DevOps (azdo) Pipelines that demonstrate how end-to-end Azure Databricks workspace automation can be done.
The Azure Pipelines can use either Terraform or scripts (with Azure CLI and ARM templates).
The main goal is to have a Databricks Delta pipeline orchestrated with Azure Data Factory while starting with an empty Azure account (with only an empty Subscription and DevOps Organization).
- all of this by simply running `./run_all.sh`:
- Create the Subscription and DevOps Organization. If using the free tier, request a free Azure DevOps Parallelism grant by filling out the following form: https://aka.ms/azpipelines-parallelism-request
- Fork this GitHub repository, as Azure DevOps needs access to it and changing the Azure Pipelines variables requires committing and pushing changes.
- Customize the variables (see the `vars` files described below).
- Use the `run_all.sh` script:

```bash
export USE_TERRAFORM="yes"
export AZDO_ORG_SERVICE_URL="https://dev.azure.com/myorg/"   # or set it in vars.sh
export AZDO_PERSONAL_ACCESS_TOKEN="xvwepmf76..."             # not required if USE_TERRAFORM="no"
export AZDO_GITHUB_SERVICE_CONNECTION_PAT="ghp_9xSDnG..."    # GitHub PAT
./run_all.sh
```
Security was a central part in designing the main steps of this example and is reflected in the minimum user privileges required for each step (a quick verification sketch follows the list):
- step 1: administrator user (`Owner` of the Subscription and `Global administrator` of the AD)
- step 2: infra service principal that is `Owner` of the project Resource Group
- step 3: data service principal that can deploy and run a Data Factory pipeline
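Before starting step 1, one way to confirm the signed-in user holds the subscription-level `Owner` role is the sketch below (the `Global administrator` AD role is easiest to verify in the Azure portal); this is only a sanity check, not part of the automation:

```bash
# Show the signed-in user and the current subscription.
az account show --query "{subscription: name, user: user.name}" -o table

# List the scopes where this user has the Owner role.
az role assignment list \
  --assignee "$(az ad signed-in-user show --query userPrincipalName -o tsv)" \
  --query "[?roleDefinitionName=='Owner'].scope" -o tsv
```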
Builds the Azure core infrastructure (using a privileged user / Administrator):
- this is the foundation for the next step: Resource Groups, Azure DevOps Project and Pipelines, Service Principals, Project group and role assignments.
- the user creating these resources needs to be `Owner` of the Subscription and `Global administrator` of the Active Directory tenant.
- it can be seen as deploying an empty shell for a project or business unit, including the Service Principal (the `Infra SP`) assigned to that project that would have control over the project resources (a rough sketch follows this list).
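For illustration, the kind of resources this step creates could be sketched with the Azure CLI as follows; the resource group and service principal names are hypothetical, and the actual setup is done by the scripts or Terraform code below:

```bash
# Hypothetical names only; the real ones come from admin/vars.sh or the Terraform code.
az group create --name "project-rg" --location "westeurope"

# Create the project's Infra SP and grant it Owner over the project Resource Group.
subscription_id=$(az account show --query id -o tsv)
az ad sp create-for-rbac \
  --name "project-infra-sp" \
  --role "Owner" \
  --scopes "/subscriptions/${subscription_id}/resourceGroups/project-rg"
```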
To run this step, use one of the scripts depending on the tool preference:
- Terraform: `./admin/setup-with-terraform.sh` (code)
- Scripts with Azure CLI: `./admin/setup-with-azure-cli.sh` (code)

Before using either, check and personalize the variables in the `admin/vars.sh` file.
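A typical local invocation could look like this (a minimal sketch; both scripts read `admin/vars.sh`):

```bash
# Review and adjust the project variables first (names and values are project specific).
"${EDITOR:-vi}" admin/vars.sh

# Then run the preferred flavour.
./admin/setup-with-terraform.sh      # Terraform
# ./admin/setup-with-azure-cli.sh    # or Azure CLI / ARM templates
```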
Builds the Azure infrastructure for the data pipeline and project (using the project-specific `Infra SP`):
- this is the Azure infrastructure required to run a Databricks data pipeline, including the Data Lake Gen 2 account and containers, Azure Data Factory, Azure Databricks workspace and Azure permissions (a rough Azure CLI sketch follows this list).
- the service principal creating these resources is the `Infra SP` deployed at step 1 (the Resource Group owner).
- it is run as the first stage in the Azure DevOps infra pipeline, with the pipeline name defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
- there are two Azure Pipelines YAML definitions for this deployment and either one can be used depending on the tool preference:
  - Terraform: `pipelines/azure-pipelines-infra-with-terraform.yml` (code)
  - ARM templates and Azure CLI: `pipelines/azure-pipelines-infra-with-azure-cli.yml` (code)
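For orientation only, part of what this stage provisions could be sketched with the Azure CLI as below; all names are placeholders, the pipeline definitions above are the source of truth, and `az databricks` requires the `databricks` CLI extension:

```bash
# Data Lake Gen 2 account (hierarchical namespace enabled) and a container.
az storage account create --name "projectdatalake" --resource-group "project-rg" \
  --location "westeurope" --sku "Standard_LRS" --kind "StorageV2" --hns true
az storage container create --name "landing" --account-name "projectdatalake" \
  --auth-mode login

# Databricks workspace (az extension add --name databricks).
az databricks workspace create --name "project-workspace" --resource-group "project-rg" \
  --location "westeurope" --sku premium
```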
To run this step:
- either use the az CLI command as `run_all.sh` does (see the sketch below),
- or use the Azure DevOps portal by clicking the `Run pipeline` button on the pipeline with the name defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
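When run from the command line, the invocation is roughly of this shape (a sketch assuming the `azure-devops` CLI extension is installed):

```bash
# Queue the infra pipeline by name; pass --org and --project explicitly
# if no az devops defaults are configured.
az pipelines run --name "$AZURE_DEVOPS_INFRA_PIPELINE_NAME"
```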
Before using either, check and personalize the variables in the `pipelines/vars.yml` file (don't forget to push any changes to Git before running).
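Since the pipeline reads `pipelines/vars.yml` from the repository, commit and push any edits before queuing a run, for example:

```bash
git add pipelines/vars.yml
git commit -m "Personalize pipeline variables"
git push    # to the branch the Azure Pipelines are configured to build from
```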
This step is executed together with the one above and after deploying the Azure infrastructure and the Databricks workspace itself:
- it bootstraps the Databricks workspace with the required workspace objects for a new project and pipeline, including Instance Pools, Clusters, Policies, Notebooks, Groups and Service Principals.
- the service principal creating these resources is the `Infra SP` deployed at step 1; it is already a Databricks workspace admin since it deployed the workspace.
- it is run as the second stage in the Azure DevOps infra pipeline, with the pipeline name defined in the `AZURE_DEVOPS_INFRA_PIPELINE_NAME` variable.
This is run together with the previous step, but if it needs to be run separately:
- in the Azure DevOps portal, before clicking the `Run pipeline` button on the Infra pipeline, deselect the `Deploy infrastructure` job.
- with Terraform, use the `terraform/deployments/run-deployment.sh` script.
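After the bootstrap stage completes, a quick way to confirm the workspace objects were created is the Databricks CLI (a sketch, assuming the CLI is installed and authenticated against the new workspace):

```bash
# List the bootstrapped instance pools, clusters and workspace notebooks.
databricks instance-pools list
databricks clusters list
databricks workspace list /
```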
This step is executed after the infrastructure deployment and workspace bootstrap:
- it's a simple Azure DevOps Pipeline that uses ARM templates to deploy an Azure Data Factory data pipeline together with the Databricks linked service.
- it then invokes the Azure Data Factory data pipeline with the Azure DevOps Pipeline parameters.
- the service principal deploying and running the pipeline is the `Data SP` deployed at step 1; it has the necessary Databricks and Data Factory permissions granted at step 2.
- this service principal also has permission to write data into the Data Lake.
- the Databricks linked service can be of two types:
To run this step:
- either use the az CLI command as `run_all.sh` does,
- or use the Azure DevOps portal by clicking the `Run pipeline` button on the pipeline with the name defined in the `AZURE_DEVOPS_DATA_PIPELINE_NAME` variable.
It will use some of the variables in the `pipelines/vars.yml` file, and it can be customized using pipeline parameters such as database and table names, source data location, etc.
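For example, a parameterized run from the command line might look like the following sketch; the parameter names here are purely illustrative, the real ones are defined in the data pipeline YAML:

```bash
# Queue the data pipeline by name; --parameters takes space-separated "name=value"
# pairs (requires a recent azure-devops CLI extension).
az pipelines run \
  --name "$AZURE_DEVOPS_DATA_PIPELINE_NAME" \
  --parameters "database_name=demo_db" "table_name=demo_table"
```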