diff --git a/README.md b/README.md
index 1a0f1ba9..3c6e77f2 100644
--- a/README.md
+++ b/README.md
@@ -9,79 +9,61 @@ data-diff: Compare datasets fast, within or across SQL databases
-> [Make sure to join us at our virtual hands-on lab series where our team walks through live how to get set-up with it!](https://www.datafold.com/virtual-hands-on-lab)
+> [Join our live virtual lab series to learn how to set it up!](https://www.datafold.com/virtual-hands-on-lab)
-# Use Cases
+# What's a Data Diff?
+A data diff is the value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.
+
+There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.
-## Data Migration & Replication Testing
-Compare source to target and check for discrepancies when moving data between systems:
-- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
-- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
-- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
+# Use Cases
+### Data Migration & Replication Testing
+data-diff is a powerful tool for comparing data when you're moving it between systems. Use it to ensure data accuracy and identify discrepancies during tasks like:
+- **Migrating** to a new data warehouse (e.g., Oracle -> Snowflake)
+- **Converting SQL** to a new transformation framework (e.g., stored procedures -> dbt)
+- Continuously **replicating data** from an OLTP database to an OLAP data warehouse (e.g., MySQL -> Redshift)
-## Data Development Testing
-Test SQL code and preview changes by comparing development/staging environment data to production:
-1. Make a change to some SQL code
+### Data Development Testing
+When developing SQL code, data-diff helps you validate and preview changes by comparing data between development/staging environments and production. Here's how it works:
+1. Make a change to your SQL code
2. Run the SQL code to create a new dataset
-3. Compare the dataset with its production version or another iteration
+3. Compare this dataset with its production version or other iterations
+# dbt Integration
-
-
-
- data-diff integrates with dbt Core to seamlessly compare local development to production datasets
+
-
+data-diff integrates with [dbt Core](https://github.com/dbt-labs/dbt-core) to seamlessly compare local development to production datasets.
-
+Learn more about how data-diff works with dbt:
+* Read our docs to get started with [data-diff & dbt](https://docs.datafold.com/development_testing/cli) or :eyes: **watch the [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+* dbt Cloud users should check out [Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+* Get support from the dbt Community Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU)
-
-> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+# Getting Started
-:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+### ⚡ Validating dbt model changes between dev and prod
+Looking to use data-diff in dbt development?
-**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
-
-Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
-
-
-# How it works
+Development testing with Datafold enables you to see the impact of dbt code changes on data as you write the code, whether in your IDE or CLI.
-When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
+ Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/cli) to get started with a development testing workflow!
-## `joindiff`
-- Recommended for comparing data within the same database
-- Uses the outer join operation to diff the rows as efficiently as possible within the same database
-- Fully relies on the underlying database engine for computation
-- Requires both datasets to be queryable with a single SQL query
-- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
+### 🔀 Compare data tables between databases
+1. Install `data-diff` with adapters
-## `hashdiff`
-- Recommended for comparing datasets across different databases
-- Can also be helpful in diffing very large tables with few expected differences within the same database
-- Employs a divide-and-conquer algorithm based on hashing and binary search
-- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
-- Time complexity approximates COUNT(*) operation when there are few differences
-- Performance degrades when datasets have a large number of differences
-
-More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)
-
-# Get started
-
-## Validating dbt model changes between dev and prod
-⚡ Looking to use `data-diff` in dbt development? Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/how_it_works) to get started!
-
-## Compare data tables between databases
-🔀 To compare data between databases, install `data-diff` with specific database adapters, e.g.:
+To compare data between databases, install `data-diff` with specific database adapters. For example, install it for PostgreSQL and Snowflake like this:
```
pip install data-diff 'data-diff[postgresql,snowflake]' -U
```
-Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
+2. Run `data-diff` with connection URIs
+
+The following example compares tables between PostgreSQL and Snowflake using the hashdiff algorithm:
```bash
data-diff \
@@ -93,8 +75,9 @@ data-diff \
-c \
-w
```
+3. Set up your configuration
-Run `data-diff` with a `toml` configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:
+You can use a `toml` configuration file to run your `data-diff` job. In this example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:
```toml
## DATABASE CONNECTION ##
@@ -103,7 +86,6 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
# filepath = "datafold_demo.duckdb" # local duckdb file example
# filepath = "md:" # default motherduck connection example
filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
- database = "datafold_demo"
[database.snowflake_connection]
driver = "snowflake"
@@ -132,8 +114,12 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
verbose = false
```
+4. Run your `data-diff` job
+
+Export any environment variables your configuration references, then run the job. The following example uses the configuration from the previous step:
```bash
+
# export relevant environment variables, example below
export motherduck_token=
@@ -148,11 +134,13 @@ data-diff --conf datadiff.toml \
+ 1, returned
```
-Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
+5. Review the output
+After running your data-diff job, review the output to identify and analyze differences in your data.
-# Supported databases
+Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
+# Supported databases
| Database | Status | Connection string |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -161,8 +149,8 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
| Snowflake | 🟢 | `"snowflake://[:]@//?warehouse=&role=[&authenticator=externalbrowser]"` |
| BigQuery | 🟢 | `bigquery:///` |
| Redshift | 🟢 | `redshift://:@:5439/` |
-| DuckDB | 🟢 | `duckdb://@` |
-| MotherDuck | 🟢 | `duckdb://@` |
+| DuckDB | 🟢 | `duckdb://` |
+| MotherDuck | 🟢 | `duckdb://` |
| Oracle | 🟡 | `oracle://:@/service_or_sid` |
| Presto | 🟡 | `presto://:@:8080/` |
| Databricks | 🟡 | `databricks://:@//` |
@@ -172,8 +160,7 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
| ElasticSearch | 📝 | |
| Planetscale | 📝 | |
| Pinot | 📝 | |
-| Druid | 📝 | |
-| Kafka | 📝 | |
+| Druid | 📝 | |
| SQLite | 📝 | |
* 🟢: Implemented and thoroughly tested.
@@ -189,9 +176,48 @@ Your database not listed here?
+# How it works
+
+`data-diff` efficiently compares data using two modes:
+
+**joindiff**: Ideal for comparing data within the same database. It diffs rows with an outer join, relies fully on the database engine for computation, and runs in roughly the time of a JOIN regardless of how many differences exist.
+
+**hashdiff**: Recommended for comparing datasets across different database engines, or very large tables with few expected differences. It uses a divide-and-conquer algorithm based on hashing and binary search, and slows down as the number of differences grows.
+
+
+Click here to learn more about joindiff and hashdiff
+
+### `joindiff`
+* Recommended for comparing data within the same database
+* Uses the outer join operation to diff the rows as efficiently as possible within the same database
+* Fully relies on the underlying database engine for computation
+* Requires both datasets to be queryable with a single SQL query
+* Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
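
The outer-join idea in the list above can be sketched in a few lines of Python. This is an illustration only, not data-diff's implementation: it uses the standard-library `sqlite3` module so it runs without a server, the helper `join_diff` and the `prod`/`dev` tables are hypothetical, and the full outer join is emulated with two LEFT JOINs combined by UNION ALL.

```python
import sqlite3


def join_diff(conn, table_a, table_b, key, columns):
    """Return ('-', row) for rows only in table_a and ('+', row) for rows only in table_b.

    A changed row appears twice: once with '-' (old values) and once with '+' (new values).
    Hypothetical helper: identifiers are interpolated directly, so pass trusted names only.
    """
    cols = [key, *columns]
    select_a = ", ".join(f"a.{c}" for c in cols)
    select_b = ", ".join(f"b.{c}" for c in cols)
    # "IS NOT" is SQLite's NULL-safe inequality operator.
    differs = " OR ".join(f"a.{c} IS NOT b.{c}" for c in columns) or "0"
    query = f"""
        SELECT '-' AS sign, {select_a}
        FROM {table_a} AS a LEFT JOIN {table_b} AS b ON a.{key} = b.{key}
        WHERE b.{key} IS NULL OR {differs}
        UNION ALL
        SELECT '+' AS sign, {select_b}
        FROM {table_b} AS b LEFT JOIN {table_a} AS a ON a.{key} = b.{key}
        WHERE a.{key} IS NULL OR {differs}
        ORDER BY 2, 1 DESC
    """
    # The database does all the comparison work; Python only reshapes the result.
    return [(row[0], tuple(row[1:])) for row in conn.execute(query)]


def demo_join_diff():
    """Build two small in-memory tables and diff them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE prod (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dev  (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO prod VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
        INSERT INTO dev  VALUES (1, 'alice'), (2, 'bobby'), (4, 'dave');
    """)
    return join_diff(conn, "prod", "dev", "id", ["name"])
```

Calling `demo_join_diff()` reports the changed row with `id = 2` on both sides, the row with `id = 3` as removed, and the row with `id = 4` as added. Because both tables must be reachable from one query, this approach only applies within a single database.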
+
+### `hashdiff`
+* Recommended for comparing datasets across different databases
+* Can also be helpful in diffing very large tables with few expected differences within the same database
+* Employs a divide-and-conquer algorithm based on hashing and binary search
+* Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
+* Time complexity approximates COUNT(*) operation when there are few differences
+* Performance degrades when datasets have a large number of differences
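
The divide-and-conquer idea in the list above can be sketched as follows. This is a simplified in-memory illustration, not data-diff's implementation: the function names, the `bisection_factor` and `threshold` parameters, and the use of MD5 over `repr(row)` are all assumptions made for the example, and a real run would compute each segment checksum inside the database instead of in Python.

```python
import hashlib


def checksum(rows):
    """Hash a sequence of rows; in a real database this would be one aggregate query."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def hash_diff(rows_a, rows_b, lo, hi, bisection_factor=2, threshold=4):
    """Diff (key, value) rows whose integer key lies in [lo, hi).

    Assumes bisection_factor >= 2. Returns ('-', row) for rows only in rows_a
    and ('+', row) for rows only in rows_b, ordered by key.
    """
    seg_a = sorted(r for r in rows_a if lo <= r[0] < hi)
    seg_b = sorted(r for r in rows_b if lo <= r[0] < hi)

    # Matching checksums: the whole key range is identical, so skip it.
    if checksum(seg_a) == checksum(seg_b):
        return []

    # Small segment: stop splitting and compare the rows directly.
    if max(len(seg_a), len(seg_b)) <= threshold or hi - lo <= 1:
        only_a = set(seg_a) - set(seg_b)
        only_b = set(seg_b) - set(seg_a)
        diffs = [("-", r) for r in only_a] + [("+", r) for r in only_b]
        return sorted(diffs, key=lambda d: (d[1][0], d[0] == "+"))

    # Otherwise split the key range into sub-ranges and recurse into each.
    step = -(-(hi - lo) // bisection_factor)  # ceiling division
    diffs = []
    for start in range(lo, hi, step):
        end = min(start + step, hi)
        diffs += hash_diff(rows_a, rows_b, start, end, bisection_factor, threshold)
    return diffs
```

When only a few rows differ, most segments are dismissed by a single checksum comparison, which is why the cost stays close to a full scan. When many rows differ, nearly every segment has to be split down to the threshold, which matches the performance caveat in the last bullet.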
+
+
+
+
+For details on the algorithm and performance considerations, see the [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md), or head to our docs to [learn more about how Datafold diffs data](https://docs.datafold.com/data_diff/how-datafold-diffs-data).
+
+
+# data-diff OSS & Datafold Cloud
+data-diff is an open source utility for running stateless diffs on your local computer, which makes for a great single-player experience.
+
+Scale up with [Datafold Cloud](https://www.datafold.com/) to make data diffing a company-wide experience: supercharge your data-diff CLI workflow (e.g., `data-diff --dbt --cloud`) and run diffs manually in the UI. Datafold Cloud also includes [column-level lineage](https://www.datafold.com/column-level-lineage), [CI testing](https://docs.datafold.com/deployment_testing/how_it_works/), and diff history.
+
## Contributors
-We thank everyone who contributed so far!
+We thank everyone who contributed so far!
+
+We'd love to see your face here: [Contributing Instructions](CONTRIBUTING.md)