Compare source to target and check for discrepancies when moving data between systems:
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
Test SQL code and preview changes by comparing development/staging environment data to production:
- Make a change to some SQL code
- Run the SQL code to create a new dataset
- Compare the dataset with its production version or another iteration
dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration
π Watch 4-min demo video
Get started with data-diff & dbt
Also available in a VS Code Extension
Reach out on the dbt Slack in #tools-datafold for advice and support
When comparing the data, data-diff
utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
- Recommended for comparing data within the same database
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
- Fully relies on the underlying database engine for computation
- Requires both datasets to be queryable with a single SQL query
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
- Recommended for comparing datasets across different databases
- Can also be helpful in diffing very large tables with few expected differences within the same database
- Employs a divide-and-conquer algorithm based on hashing and binary search
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
- Time complexity approximates COUNT(*) operation when there are few differences
- Performance degrades when datasets have a large number of differences
More information about the algorithm and performance considerations can be found here
β‘ Looking to use data-diff
in dbt development? Head over to our data-diff
dbt
documentation to get started!
π To compare data between databases, install data-diff
with specific database adapters, e.g.:
pip install data-diff 'data-diff[postgresql,snowflake]' -U
Run data-diff
with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k <primary key column> \
-c <columns to compare> \
-w <filter condition>
Run data-diff
with a toml
configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:
## DATABASE CONNECTION ##
[database.duckdb_connection]
driver = "duckdb"
# filepath = "datafold_demo.duckdb" # local duckdb file example
# filepath = "md:" # default motherduck connection example
filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
database = "datafold_demo"
[database.snowflake_connection]
driver = "snowflake"
database = "DEV"
user = "sung"
password = "${SNOWFLAKE_PASSWORD}" # or "<PASSWORD_STRING>"
# the info below is only required for snowflake
account = "${ACCOUNT}" # by33919
schema = "DEVELOPMENT"
warehouse = "DEMO"
role = "DEMO_ROLE"
## RUN PARAMETERS ##
[run.default]
verbose = true
## EXAMPLE DATA DIFF JOB ##
[run.demo_xdb_diff]
# Source 1 ("left")
1.database = "duckdb_connection"
1.table = "development.raw_orders"
# Source 2 ("right")
2.database = "snowflake_connection"
2.table = "RAW_ORDERS" # note that snowflake table names are case-sensitive
verbose = false
# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>
# run the configured data-diff job
data-diff --conf datadiff.toml \
--run demo_xdb_diff \
-k "id" \
-c status
# output example
- 1, completed
1, returned
Check out documentation for the full command reference.
Database | Status | Connection string |
---|---|---|
PostgreSQL >=10 | π’ | postgresql://<user>:<password>@<host>:5432/<database> |
MySQL | π’ | mysql://<user>:<password>@<hostname>:5432/<database> |
Snowflake | π’ | "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]" |
BigQuery | π’ | bigquery://<project>/<dataset> |
Redshift | π’ | redshift://<username>:<password>@<hostname>:5439/<database> |
DuckDB | π’ | duckdb://<dbname>@<filepath> |
MotherDuck | π’ | duckdb://<dbname>@<filepath> |
Oracle | π‘ | oracle://<username>:<password>@<hostname>/servive_or_sid |
Presto | π‘ | presto://<username>:<password>@<hostname>:8080/<database> |
Databricks | π‘ | databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema> |
Trino | π‘ | trino://<username>:<password>@<hostname>:8080/<database> |
Clickhouse | π‘ | clickhouse://<username>:<password>@<hostname>:9500/<database> |
Vertica | π‘ | vertica://<username>:<password>@<hostname>:5433/<database> |
ElasticSearch | π | |
Planetscale | π | |
Pinot | π | |
Druid | π | |
Kafka | π | |
SQLite | π |
- π’: Implemented and thoroughly tested.
- π‘: Implemented, but not thoroughly tested yet.
- β³: Implementation in progress.
- π: Implementation planned. Contributions welcome.
Your database not listed here?
- Contribute a new database adapter βΒ we accept pull requests!
- Get in touch about enterprise support and adding new adapters and features
We thank everyone who contributed so far!
This project is licensed under the terms of the MIT License.