diff --git a/README.md b/README.md
index 1a0f1ba9..3c6e77f2 100644
--- a/README.md
+++ b/README.md
@@ -9,79 +9,61 @@ data-diff: Compare datasets fast, within or across SQL databases
-> [Make sure to join us at our virtual hands-on lab series where our team walks through live how to get set-up with it!](https://www.datafold.com/virtual-hands-on-lab)
+> [Join our live virtual lab series to learn how to set it up!](https://www.datafold.com/virtual-hands-on-lab)
-# Use Cases
+# What's a Data Diff?
+A data diff is the value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.
+
+There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.
-## Data Migration & Replication Testing
-Compare source to target and check for discrepancies when moving data between systems:
-- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
-- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
-- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
+# Use Cases
+### Data Migration & Replication Testing
+data-diff is a powerful tool for comparing data when you're moving it between systems. Use it to ensure data accuracy and identify discrepancies during tasks like:
+- **Migrating** to a new data warehouse (e.g., Oracle -> Snowflake)
+- **Converting SQL** to a new transformation framework (e.g., stored procedures -> dbt)
+- Continuously **replicating data** from an OLTP database to OLAP data warehouse (e.g., MySQL -> Redshift)
-## Data Development Testing
-Test SQL code and preview changes by comparing development/staging environment data to production:
-1. Make a change to some SQL code
+### Data Development Testing
+When developing SQL code, data-diff helps you validate and preview changes by comparing data between development/staging environments and production. Here's how it works:
+1. Make a change to your SQL code
 2. Run the SQL code to create a new dataset
-3. Compare the dataset with its production version or another iteration
+3. Compare this dataset with its production version or other iterations
+# dbt Integration

dbt -

- -
- data-diff integrates with dbt Core to seamlessly compare local development to production datasets +

-
+data-diff integrates with [dbt Core](https://github.com/dbt-labs/dbt-core) to seamlessly compare local development to production datasets.
-![data-development-testing](docs/development_testing.png)
+Learn more about how data-diff works with dbt:
+* Read our docs to get started with [data-diff & dbt](https://docs.datafold.com/development_testing/cli) or :eyes: **watch the [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+* dbt Cloud users should check out [Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+* Get support from the dbt Community Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU)
-
-> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+# Getting Started
-:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+### ⚡ Validating dbt model changes between dev and prod
+Looking to use data-diff in dbt development?
-**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
-
-Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
-
-
-# How it works
+Development testing with Datafold enables you to see the impact of dbt code changes on data as you write the code, whether in your IDE or CLI.
-When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
+ Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/cli) to get started with a development testing workflow!
-## `joindiff`
-- Recommended for comparing data within the same database
-- Uses the outer join operation to diff the rows as efficiently as possible within the same database
-- Fully relies on the underlying database engine for computation
-- Requires both datasets to be queryable with a single SQL query
-- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
+### 🔀 Compare data tables between databases
+1. Install `data-diff` with adapters
-## `hashdiff`
-- Recommended for comparing datasets across different databases
-- Can also be helpful in diffing very large tables with few expected differences within the same database
-- Employs a divide-and-conquer algorithm based on hashing and binary search
-- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
-- Time complexity approximates COUNT(*) operation when there are few differences
-- Performance degrades when datasets have a large number of differences
-
-More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)
-
-# Get started
-
-## Validating dbt model changes between dev and prod
-⚡ Looking to use `data-diff` in dbt development? Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/how_it_works) to get started!
-
-## Compare data tables between databases
-🔀 To compare data between databases, install `data-diff` with specific database adapters, e.g.:
+To compare data between databases, install `data-diff` with specific database adapters. For example, install it for PostgreSQL and Snowflake like this:
 ```
 pip install data-diff 'data-diff[postgresql,snowflake]' -U
 ```
-Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
+2. Run `data-diff` with connection URIs
+
+Then, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
 ```bash
 data-diff \
@@ -93,8 +75,9 @@ data-diff \
   -c \
   -w
 ```
+3. Set up your configuration
-Run `data-diff` with a `toml` configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:
+You can use a `toml` configuration file to run your `data-diff` job. In this example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:
 ```toml
 ## DATABASE CONNECTION ##
@@ -103,7 +86,6 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
   # filepath = "datafold_demo.duckdb" # local duckdb file example
   # filepath = "md:" # default motherduck connection example
   filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
-  database = "datafold_demo"
 [database.snowflake_connection]
   driver = "snowflake"
@@ -132,8 +114,12 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
   verbose = false
 ```
+4. Run your `data-diff` job
+
+Make sure to export relevant environment variables as needed. For example, we compare data based on the earlier configuration:
 ```bash
+
 # export relevant environment variables, example below
 export motherduck_token=
@@ -148,11 +134,13 @@ data-diff --conf datadiff.toml \
 + 1, returned
 ```
-Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
+5. Review the output
+After running your data-diff job, review the output to identify and analyze differences in your data.
-# Supported databases
+Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
+# Supported databases
 | Database | Status | Connection string |
 |---------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -161,8 +149,8 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
 | Snowflake | 🟢 | `"snowflake://[:]@//?warehouse=&role=[&authenticator=externalbrowser]"` |
 | BigQuery | 🟢 | `bigquery:///` |
 | Redshift | 🟢 | `redshift://:@:5439/` |
-| DuckDB | 🟢 | `duckdb://@` |
-| MotherDuck | 🟢 | `duckdb://@` |
+| DuckDB | 🟢 | `duckdb://` |
+| MotherDuck | 🟢 | `duckdb://` |
 | Oracle | 🟡 | `oracle://:@/servive_or_sid` |
 | Presto | 🟡 | `presto://:@:8080/` |
 | Databricks | 🟡 | `databricks://:@//` |
@@ -172,8 +160,7 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
 | ElasticSearch | 📝 | |
 | Planetscale | 📝 | |
 | Pinot | 📝 | |
-| Druid | 📝 | |
-| Kafka | 📝 | |
+| Druid | 📝 | | |
 | SQLite | 📝 | |
 * 🟢: Implemented and thoroughly tested.
@@ -189,9 +176,48 @@ Your database not listed here?
+# How it works
+
+`data-diff` efficiently compares data using two modes:
+
+**joindiff**: Ideal for comparing data within the same database, utilizing outer joins for efficient row comparisons. It relies on the database engine for computation and has consistent performance.
+
+**hashdiff**: Recommended for comparing datasets across different databases or large tables with minimal differences. It uses hashing and binary search, capable of diffing data across distinct database engines.
+
+
+Click here to learn more about joindiff and hashdiff
+
+### `joindiff`
+* Recommended for comparing data within the same database
+* Uses the outer join operation to diff the rows as efficiently as possible within the same database
+* Fully relies on the underlying database engine for computation
+* Requires both datasets to be queryable with a single SQL query
+* Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
+
+### `hashdiff`:
+* Recommended for comparing datasets across different databases
+* Can also be helpful in diffing very large tables with few expected differences within the same database
+* Employs a divide-and-conquer algorithm based on hashing and binary search
+* Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
+* Time complexity approximates COUNT(*) operation when there are few differences
+* Performance degrades when datasets have a large number of differences
+
+
+
+
+For detailed algorithm and performance insights, explore [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md), or head to our docs to [learn more about how Datafold diffs data](https://docs.datafold.com/data_diff/how-datafold-diffs-data).
+
+
+# data-diff OSS & Datafold Cloud
+data-diff is an open source utility for running stateless diffs on your local computer for a great single player experience.
+
+Scale up with [Datafold Cloud](https://www.datafold.com/) to make data diffing a company-wide experience to both supercharge your data diffing CLI experience (ex: data-diff --dbt --cloud) and run diffs manually in the UI. This includes [column-level lineage](https://www.datafold.com/column-level-lineage), [CI testing](https://docs.datafold.com/deployment_testing/how_it_works/), and diff history.
+
 ## Contributors
-We thank everyone who contributed so far!
+We thank everyone who contributed so far!
+
+We'd love to see your face here: [Contributing Instructions](CONTRIBUTING.md)
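
The `joindiff` and `hashdiff` modes described in the "How it works" hunk can be illustrated with a minimal Python sketch on SQLite. This is not data-diff's implementation: the function names (`make_table`, `joindiff`, `segment`, `hashdiff`), the two-column `(id, value)` table layout, the two-way split by key range, and the `threshold` parameter are all assumptions made for illustration. In particular, the sketch hashes each segment on the client, whereas the README states that `data-diff` relies on the underlying databases for computation as much as possible.

```python
import hashlib
import sqlite3


def make_table(conn, name, rows):
    """Create a hypothetical two-column table (id, value) and load rows."""
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)


def joindiff(conn, left, right):
    """Same-database diff: one outer-join query returns every differing key."""
    # A FULL OUTER JOIN is emulated with two LEFT JOINs so the sketch also
    # runs on SQLite versions older than 3.39.
    sql = f"""
        SELECT a.id, a.value, b.value
        FROM {left} a LEFT JOIN {right} b ON a.id = b.id
        WHERE b.id IS NULL OR a.value IS NOT b.value
        UNION
        SELECT b.id, a.value, b.value
        FROM {right} b LEFT JOIN {left} a ON a.id = b.id
        WHERE a.id IS NULL
        ORDER BY 1
    """
    return conn.execute(sql).fetchall()


def segment(conn, table, lo, hi):
    """Return (digest, rows) for the rows whose key lies in [lo, hi)."""
    rows = conn.execute(
        f"SELECT id, value FROM {table} WHERE id >= ? AND id < ? ORDER BY id",
        (lo, hi),
    ).fetchall()
    # Simplification: the digest is computed client-side from fetched rows.
    return hashlib.md5(repr(rows).encode()).hexdigest(), rows


def hashdiff(conn_a, conn_b, table, lo, hi, threshold=4):
    """Cross-database diff: yield ('-' or '+', row) for keys in [lo, hi)."""
    digest_a, rows_a = segment(conn_a, table, lo, hi)
    digest_b, rows_b = segment(conn_b, table, lo, hi)
    if digest_a == digest_b:
        return  # identical segment, nothing to report
    if hi - lo <= threshold:
        # Segment is small enough: compare its rows directly.
        set_a, set_b = set(rows_a), set(rows_b)
        for row in sorted(set_a - set_b):
            yield ("-", row)
        for row in sorted(set_b - set_a):
            yield ("+", row)
        return
    # Otherwise split the key range in two and recurse into each half.
    mid = (lo + hi) // 2
    yield from hashdiff(conn_a, conn_b, table, lo, mid, threshold)
    yield from hashdiff(conn_a, conn_b, table, mid, hi, threshold)


if __name__ == "__main__":
    base = [(i, f"v{i}") for i in range(1, 17)]
    other = [(i, "changed" if i == 7 else v) for i, v in base if i != 12]

    same_db = sqlite3.connect(":memory:")
    make_table(same_db, "prod", base)
    make_table(same_db, "dev", other)
    print(joindiff(same_db, "prod", "dev"))

    db_a, db_b = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    make_table(db_a, "t", base)
    make_table(db_b, "t", other)
    print(list(hashdiff(db_a, db_b, "t", 1, 17)))
```

The sketch mirrors the trade-off listed in the bullets: `joindiff` needs both tables reachable from a single query, while `hashdiff` only exchanges digests for matching segments and descends into the segments that differ, so it does more work as the number of differences grows.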