Python CLI tool and library for comparing CSV database dumps and finding differences.
pip install git+https://github.com/datsom1/db-diff.git
To force a reinstall of the latest code from the repository (you probably don't need to do this):
pip install --upgrade --force-reinstall git+https://github.com/datsom1/db-diff.git
Consider two CSV files:
one.csv
Id,name,age
1,Cleo,4
2,Pancakes,2
two.csv
Id,name,age
1,Cleo,5
3,Bailey,1
db-diff can show a human-readable summary of differences between the files:
$ db-diff one.csv two.csv --key=Id
1 rows changed, 1 rows added, 1 rows removed
1 rows changed
Rows 1
age: "4" => "5"
1 rows added
Id: 3
name: Bailey
age: 1
1 rows removed
Id: 2
name: Pancakes
age: 2
The --key=Id option means that the Id column should be treated as the unique key, which is used to identify which records have changed.
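As a rough illustration of what keyed comparison means, the sketch below indexes each file's rows by the key column and then classifies rows as added, removed, or changed. The helper names `index_by_key` and `keyed_diff` are hypothetical and are not part of db-diff; this is a simplified model of the idea, not the tool's actual implementation.

```python
import csv
import io


def index_by_key(text, key):
    """Parse CSV text and index each row by the value of its key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def keyed_diff(previous, current):
    """Classify rows as added, removed, or changed by comparing keyed rows."""
    added = [current[k] for k in current if k not in previous]
    removed = [previous[k] for k in previous if k not in current]
    changed = {}
    for k in previous:
        if k in current and previous[k] != current[k]:
            # Record (old, new) for every field whose value differs.
            changed[k] = {
                field: (previous[k].get(field), current[k].get(field))
                for field in previous[k]
                if previous[k].get(field) != current[k].get(field)
            }
    return added, removed, changed


# The same data as one.csv and two.csv above, inlined so the sketch is self-contained.
one = "Id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
two = "Id,name,age\n1,Cleo,5\n3,Bailey,1\n"

added, removed, changed = keyed_diff(index_by_key(one, "Id"), index_by_key(two, "Id"))
print(added)
print(removed)
print(changed)
```

Without a key there would be no way to tell that row 1 was edited rather than deleted and re-added, which is why the key column has to hold a unique value for each row.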
The tool automatically detects whether your files are comma- or tab-separated. You can override this detection and force a specific format with --format=tsv or --format=csv.
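How db-diff performs its detection is not described here. As a hedged illustration of content-based delimiter detection in general, Python's standard `csv.Sniffer` can guess between a comma and a tab from a sample of the file. The helper name `guess_delimiter` is hypothetical.

```python
import csv


def guess_delimiter(sample):
    """Guess whether a text sample is comma- or tab-separated."""
    # Restrict the candidates to the two delimiters we care about.
    dialect = csv.Sniffer().sniff(sample, delimiters=",\t")
    return dialect.delimiter


comma_sample = "Id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
tab_sample = "Id\tname\tage\n1\tCleo\t4\n2\tPancakes\t2\n"

print(repr(guess_delimiter(comma_sample)))
print(repr(guess_delimiter(tab_sample)))
```

Sniffing can fail on very short or irregular files, which is the kind of case where forcing the format explicitly is useful.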
You can also feed it JSON files, provided they are a JSON array of objects where each object has the same keys. Use --format=json if your input files are JSON.
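If you are unsure whether a JSON file has the required shape, a small check like the following can be run before diffing. This is a sketch under the assumption that "the same keys" means identical key sets on every object; `load_json_rows` is a hypothetical helper, not a db-diff function.

```python
import json


def load_json_rows(text):
    """Parse JSON text and check it is an array of objects sharing one key set."""
    rows = json.loads(text)
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("expected a JSON array of objects")
    # Collect the distinct key sets; more than one means the objects disagree.
    key_sets = {frozenset(r) for r in rows}
    if len(key_sets) > 1:
        raise ValueError("objects do not all have the same keys")
    return rows


rows = load_json_rows(
    '[{"Id": 1, "name": "Cleo", "age": 4}, {"Id": 2, "name": "Pancakes", "age": 2}]'
)
print(len(rows))
```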
Use --showunchanged to include full details of the unchanged values for rows with at least one change in the diff output:
$ db-diff one.csv two.csv --key=Id --showunchanged
1 rows changed
Id: 1
age: "4" => "5"
Unchanged:
name: "Cleo"
You can use the --output=json option to get a machine-readable difference:
$ db-diff one.csv two.csv --key=Id --output=json
{
    "added": [
        {
            "id": "3",
            "name": "Bailey",
            "age": "1"
        }
    ],
    "removed": [
        {
            "id": "2",
            "name": "Pancakes",
            "age": "2"
        }
    ],
    "changed": [
        {
            "key": "1",
            "changes": {
                "age": [
                    "4",
                    "5"
                ]
            }
        }
    ],
    "columns_added": [],
    "columns_removed": []
}
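The JSON structure above is straightforward to consume from a script. The sketch below parses a copy of that example output and prints a summary plus one line per changed field. The helper names `summarize` and `describe_changes` are hypothetical, and the code assumes only the keys visible in the example.

```python
import json


def summarize(diff):
    """Build a one-line summary from the top-level lists of the diff."""
    return "{} changed, {} added, {} removed".format(
        len(diff["changed"]), len(diff["added"]), len(diff["removed"])
    )


def describe_changes(diff):
    """Return one line per changed field, as 'key: column old -> new'."""
    lines = []
    for entry in diff["changed"]:
        for column, (old, new) in entry["changes"].items():
            lines.append(f'{entry["key"]}: {column} {old!r} -> {new!r}')
    return lines


# A copy of the example output shown above.
report = json.loads("""
{
    "added": [{"id": "3", "name": "Bailey", "age": "1"}],
    "removed": [{"id": "2", "name": "Pancakes", "age": "2"}],
    "changed": [{"key": "1", "changes": {"age": ["4", "5"]}}],
    "columns_added": [],
    "columns_removed": []
}
""")

print(summarize(report))
for line in describe_changes(report):
    print(line)
```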
You can use the --output=jsonfile and --outputfile= options together to automatically save the output to a .json file:
$ db-diff one.csv two.csv --key=Id --output=jsonfile --outputfile=diffs.json
You can use the --time option to measure the time it takes:
$ db-diff one.csv two.csv --key=Id --time
.
.
.
Elapsed time: 0.016 seconds
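When calling the comparison from your own Python code rather than the CLI, a similar measurement can be taken with `time.perf_counter`. This is a generic sketch; how db-diff measures time internally is not stated here, and `timed` is a hypothetical helper.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Any callable can be timed; sorted() stands in for a diff operation here.
result, elapsed = timed(sorted, [3, 1, 2])
print(f"Elapsed time: {elapsed:.3f} seconds")
```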
You can also import the Python library into your own code like so:
from csv_diff import load_csv, compare
diff = compare(
    load_csv(open("one.csv"), key="Id"),
    load_csv(open("two.csv"), key="Id")
)
diff will now contain the same data structure as the output in the --output=json example above.
If the columns in the CSV have changed, those added or removed columns will be ignored when calculating changes made to specific rows.
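To make that behavior concrete, here is a minimal sketch of comparing two versions of one row while ignoring columns that exist on only one side. It assumes the added and removed columns are reported separately, as the `columns_added` and `columns_removed` keys in the JSON output suggest. `compare_shared_columns` is a hypothetical helper, not the library's own code.

```python
def compare_shared_columns(previous_row, current_row):
    """Report column changes, then compare values only in shared columns."""
    columns_added = [c for c in current_row if c not in previous_row]
    columns_removed = [c for c in previous_row if c not in current_row]
    # Only columns present in both versions can count as a row-level change.
    changes = {
        c: [previous_row[c], current_row[c]]
        for c in previous_row
        if c in current_row and previous_row[c] != current_row[c]
    }
    return columns_added, columns_removed, changes


old = {"Id": "1", "name": "Cleo", "age": "4"}
new = {"Id": "1", "age": "5", "weight": "9"}

columns_added, columns_removed, changes = compare_shared_columns(old, new)
print(columns_added)
print(columns_removed)
print(changes)
```

Here the dropped `name` column and the new `weight` column do not appear in the row's changes; only the `age` value, which exists in both versions, is reported as changed.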
$ db-diff --help
Usage: db-diff [OPTIONS] PREVIOUS CURRENT
Compare the differences between two CSV or JSON files to find differences.
Options:
  --format TEXT      Explicitly specify input format. Available (csv|tsv|json) [default: auto-detect based on file extension]
  --encoding TEXT    Input File Encoding. Available: (utf-8|utf-16|utf-16le|utf-16be|latin1|cp1252|ascii|...) [default: utf-8]
  --key TEXT         Column to use as a unique ID for each row (ex: --key=Id) [default: first column if not specified]
  --output TEXT      Output format. Available: (readable|json|jsonfile) [default: readable]
  --outputfile FILE  File to write JSON output to (only used with --output=jsonfile)
  --showunchanged    If a record is changed, show ALL fields, not just the changed fields.
  --time             Measure and display elapsed time for the diff operation
  --version          Show the version and exit.
  -h, --help         Show this message and exit.
Example: db-diff old.csv new.csv --key=Id --output=jsonfile --outputfile=diff.json