Dagster is a system for building modern data applications.
- Elegant programming model: Dagster is a set of abstractions for building self-describing, testable, and reliable data applications. It embraces the principles of functional data programming; gradual, optional typing; and testability as a first-class value (see the short sketch after this list).
- Flexible & incremental: Dagster integrates with your existing tools and infrastructure, and can invoke any computation, whether it be Spark, Python, a Jupyter notebook, or SQL. It is also designed to deploy to any workflow engine, such as Airflow.
- Beautiful tools: Dagit, Dagster's development environment, is designed for data engineers, machine learning engineers, and data scientists, and enables astoundingly productive local development.
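As a taste of the gradual typing mentioned above, here is a minimal sketch of a typed solid (Dagster's unit of computation, introduced below). It assumes annotation-based typing like the `name: str` annotation in the hello example that follows; the `add_one` solid is a hypothetical illustration, not part of this README:

```python
from dagster import solid


@solid
def add_one(_, x: int) -> int:
    # The int annotations are checked by Dagster's gradual type system at
    # execution time; unannotated inputs and outputs default to Any.
    return x + 1
```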
```
pip install dagster dagit
```
This installs two modules:
- dagster: The core programming model and abstraction stack; stateless, single-node, single-process and multi-process execution engines; and a CLI tool for driving those engines.
- dagit: A UI and rich development environment for Dagster, including a DAG browser, a type-aware config editor, and a streaming execution interface.
hello_dagster.py:

```python
from dagster import execute_pipeline, pipeline, solid


@solid
def get_name(_):
    return 'dagster'


@solid
def hello(context, name: str):
    context.log.info('Hello, {name}!'.format(name=name))


@pipeline
def hello_pipeline():
    hello(get_name())
```
Let's execute our first pipeline via any of three different mechanisms:
- From arbitrary Python scripts, use dagster's Python API:

  ```python
  if __name__ == "__main__":
      execute_pipeline(hello_pipeline)  # Hello, dagster!
  ```

- From the command line, use the dagster CLI:

  ```
  $ dagster pipeline execute -f hello_dagster.py -n hello_pipeline
  ```

- From a rich graphical interface, use the dagit GUI tool:

  ```
  $ dagit -f hello_dagster.py -n hello_pipeline
  ```

  Navigate to http://localhost:3000 and start your journey with Dagit.
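Because solids are plain functions plus metadata, they are also straightforward to test in isolation, which is what "testability as a first-class value" means in practice. Below is a minimal sketch of unit tests for the pipeline above, assuming the `execute_solid` test helper available alongside `execute_pipeline` in this version of the API; the test file and function names are hypothetical:

```python
from dagster import execute_pipeline, execute_solid

from hello_dagster import get_name, hello_pipeline


def test_get_name():
    # Execute a single solid in isolation and inspect its output.
    result = execute_solid(get_name)
    assert result.success
    assert result.output_value() == 'dagster'


def test_hello_pipeline():
    # Execute the whole pipeline and check that every step succeeded.
    result = execute_pipeline(hello_pipeline)
    assert result.success
```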
Next, jump right into our tutorial, or read our complete documentation. If you're actively using Dagster or have questions on getting started, we'd love to hear from you.
For details on contributing or running the project for development, check out our contributing guide.
Dagster works with the tools and systems that you're already using with your data, including:
| Integration | Dagster Library |
| --- | --- |
| Apache Airflow | dagster-airflow: Allows Dagster pipelines to be scheduled and executed, either containerized or uncontainerized, as Apache Airflow DAGs. |
| Apache Spark | dagster-spark · dagster-pyspark: Libraries for interacting with Apache Spark and PySpark. |
| Dask | dagster-dask: Provides a Dagster integration with Dask / Dask.Distributed. |
| Datadog | dagster-datadog: Provides a Dagster resource for publishing metrics to Datadog. |
| Jupyter / Papermill | dagstermill: Built on the papermill library, dagstermill is meant for integrating productionized Jupyter notebooks into Dagster pipelines. |
| PagerDuty | dagster-pagerduty: A library for creating PagerDuty alerts from Dagster workflows. |
| Snowflake | dagster-snowflake: A library for interacting with the Snowflake Data Warehouse. |
| **Cloud Providers** | |
| AWS | dagster-aws: A library for interacting with Amazon Web Services. Provides integrations with S3, EMR, and (coming soon!) Redshift. |
| GCP | dagster-gcp: A library for interacting with Google Cloud Platform. Provides integrations with BigQuery and Cloud Dataproc. |
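To show how these libraries plug into the core abstractions, here is a minimal, hypothetical sketch of publishing a metric through the dagster-datadog resource. The `datadog_resource` import, its config fields, and the `gauge` call are assumptions about the dagster-datadog API, not excerpts from this README:

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, solid
from dagster_datadog import datadog_resource  # assumed entry point of dagster-datadog


@solid(required_resource_keys={'datadog'})
def emit_run_metric(context):
    # The resource exposes a Datadog client; here we emit a gauge metric.
    context.resources.datadog.gauge('hello_pipeline.runs', 1)


@pipeline(mode_defs=[ModeDefinition(resource_defs={'datadog': datadog_resource})])
def metrics_pipeline():
    emit_run_metric()


if __name__ == '__main__':
    # Credentials are supplied through resource config at execution time.
    execute_pipeline(
        metrics_pipeline,
        environment_dict={
            'resources': {
                'datadog': {'config': {'api_key': '...', 'app_key': '...'}}
            }
        },
    )
```

The same pattern applies across the integration libraries: the integration is packaged as a resource or solid library, and pipelines opt into it through their mode definitions rather than hard-coding it into business logic.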
This list is growing: we are actively building more integrations, and we welcome contributions!
Several example projects demonstrating how to use Dagster are provided under the examples folder, including:
- examples/airline-demo: A substantial demo project illustrating how these tools can be used together to manage a realistic data pipeline.
- examples/event-pipeline-demo: An example illustrating a typical web event processing pipeline with S3, Scala Spark, and Snowflake.