ryan-williams/dvc-helpers

dvc-helpers

Git plugins and Bash scripts/aliases for DVC.

git {diff,show} plugins

git-diff-dvc.sh and git-textconv-dvc.sh can be used to render human-readable git diff and git show summaries of DVC-tracked files and directories.

Setup

Install jq and yq, then configure Git diff/show via:

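# From a clone of this repo: ensure git-diff-dvc.sh is on your $PATH
# (\$PATH is escaped so ~/.bashrc extends $PATH at login instead of freezing its current value)
echo "export PATH=\"\$PATH:$PWD\"" >> ~/.bashrc && . ~/.bashrc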

# Git configs
git config --global diff.dvc.command git-diff-dvc.sh       # For git diff
git config --global diff.dvc.textconv git-textconv-dvc.sh  # For git show

# Git attributes (map globs/extensions to commands above):
git config --global core.attributesfile ~/.gitattributes
echo "*.dvc diff=dvc" >> ~/.gitattributes

# Or, configure just the current repo:
git config diff.dvc.command git-diff-dvc.sh       # For git diff
git config diff.dvc.textconv git-textconv-dvc.sh  # For git show
echo "*.dvc diff=dvc" >> .gitattributes
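
Before relying on the configuration above, it can help to confirm that the prerequisites are reachable. The sketch below is not part of this repo: `missing_tools` is a hypothetical helper name, and it only checks that each named command resolves on `$PATH`.

```shell
# Hypothetical helper (not part of dvc-helpers): print each command name
# that cannot be found on $PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the prerequisites named in the setup steps above.
missing=$(missing_tools jq yq git-diff-dvc.sh git-textconv-dvc.sh)
if [ -n "$missing" ]; then
  echo "Not found on \$PATH:" $missing
else
  echo "All prerequisites found"
fi
```

Once the attribute mapping is in place, running `git check-attr diff -- test.txt.dvc` inside a repository should report `test.txt.dvc: diff: dvc`.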

Examples

The examples below use commits from the @test branch, and are verified in GitHub Actions by ci.yml and docker.yml.

Add text file

8ec2060 added a DVC-tracked text file, test.txt (with test.txt.dvc committed to Git):

git diff '8ec2060^..8ec2060'
diff --git .gitignore .gitignore
new file mode 100644
index 0000000..341707b
--- /dev/null
+++ .gitignore
@@ -0,0 +1 @@
+/test.txt
test.txt
--- /dev/null .
+++ .dvc/cache/files/md5/3b/0332e02daabf31651a5a0d81ba830a f8af2eab0b7cb904c4fa697593684bbf716f091b
diff --git b/test.txt b/test.txt
new file mode 100644
index 0000000..f00c965
--- /dev/null
+++ b/test.txt
@@ -0,0 +1,10 @@
+1
+2
+3
+4
+5
+6
+7
+8
+9
+10

git show 8ec2060
commit 8ec2060ab71c85da8e8eb1ab07df56bf91b045f8
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Wed Dec 25 10:01:22 2024 -0500

    add `test.txt.dvc`

diff --git .gitignore .gitignore
new file mode 100644
index 0000000..341707b
--- /dev/null
+++ .gitignore
@@ -0,0 +1 @@
+/test.txt
diff --git test.txt.dvc test.txt.dvc
new file mode 100644
index 0000000..f8af2ea
--- /dev/null
+++ test.txt.dvc
@@ -0,0 +1,17 @@
+outs:
+- md5: 3b0332e02daabf31651a5a0d81ba830a
+  size: 21
+  hash: md5
+  path: test.txt
+
+test.txt .dvc/cache/files/md5/3b/0332e02daabf31651a5a0d81ba830a
+1
+2
+3
+4
+5
+6
+7
+8
+9
+10
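
The cache paths in the output above follow from the MD5 recorded in the .dvc file: in these examples the first two hex characters become a directory name and the remainder the file name, under .dvc/cache/files/md5. A minimal sketch of that mapping follows. `md5_to_cache_path` is a hypothetical name, and the repo's own `dvc_local_cache_path` may work differently.

```shell
# Hypothetical helper: map a DVC MD5 (as written in a .dvc file) to the
# cache path layout seen in the example output.
md5_to_cache_path() {
  md5=$1
  prefix=$(printf '%s' "$md5" | cut -c1-2)   # first two hex characters
  rest=$(printf '%s' "$md5" | cut -c3-)      # everything after them
  printf '.dvc/cache/files/md5/%s/%s\n' "$prefix" "$rest"
}

md5_to_cache_path 3b0332e02daabf31651a5a0d81ba830a
# → .dvc/cache/files/md5/3b/0332e02daabf31651a5a0d81ba830a
```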

Update text file

0455b50 appended some lines to test.txt:

git diff '0455b50^..0455b50'
test.txt
--- .dvc/cache/files/md5/3b/0332e02daabf31651a5a0d81ba830a f8af2eab0b7cb904c4fa697593684bbf716f091b
+++ .dvc/cache/files/md5/fc/a18e3023be1c0a6e14ca2003b1524a 0ceff0bc7527454d140941483082b9cb892ffab2
diff --git a/test.txt b/test.txt
index f00c965..97b3d1a 100644
--- a/test.txt
+++ b/test.txt
@@ -8,3 +8,8 @@
 8
 9
 10
+11
+12
+13
+14
+15

git show 0455b50
commit 0455b50f4716a40b63595addb2df62658bae6d88
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Wed Dec 25 12:15:55 2024 -0500

    `seq 15 > test.txt`

diff --git test.txt.dvc test.txt.dvc
index f8af2ea..0ceff0b 100644
--- test.txt.dvc
+++ test.txt.dvc
@@ -1,10 +1,10 @@
 outs:
-- md5: 3b0332e02daabf31651a5a0d81ba830a
-  size: 21
+- md5: fca18e3023be1c0a6e14ca2003b1524a
+  size: 36
   hash: md5
   path: test.txt
 
-test.txt .dvc/cache/files/md5/3b/0332e02daabf31651a5a0d81ba830a
+test.txt .dvc/cache/files/md5/fc/a18e3023be1c0a6e14ca2003b1524a
 1
 2
 3
@@ -15,3 +15,8 @@ test.txt .dvc/cache/files/md5/3b/0332e02daabf31651a5a0d81ba830a
 8
 9
 10
+11
+12
+13
+14
+15

Add Parquet file

git-diff-dvc.sh delegates to other diff drivers, which you can configure per file type (matched by path name). For example, if git-diff-parquet.sh is configured, f92c1d2 (which added test.parquet) renders like this:

git diff 'f92c1d2^..f92c1d2' -- test.parquet.dvc
test.parquet
--- /dev/null .
+++ .dvc/cache/files/md5/43/79600b26647a50dfcd0daa824e8219 33d076033596bcccf90a442c58eb83f44499ea40
diff --git b/test.parquet b/test.parquet
new file mode 100644
index 0000000..918850d
--- /dev/null
+++ b/test.parquet
@@ -0,0 +1,25 @@
+MD5: 4379600b26647a50dfcd0daa824e8219
+1635 bytes
+5 rows
+message schema {
+  OPTIONAL INT64 num;
+  OPTIONAL BYTE_ARRAY str (STRING);
+}
+First 2 rows:
+{
+  "num": 111,
+  "str": "aaa"
+}
+{
+  "num": 222,
+  "str": "bbb"
+}
+Last 2 rows:
+{
+  "num": 444,
+  "str": "ddd"
+}
+{
+  "num": 555,
+  "str": "eee"
+}

git show f92c1d2
commit f92c1d2958e4b61dffe95eb68ed98b1a968c2432
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Wed Dec 25 12:32:31 2024 -0500

    add `test.parquet.dvc`

diff --git .gitignore .gitignore
index 341707b..a35ca01 100644
--- .gitignore
+++ .gitignore
@@ -1 +1,2 @@
 /test.txt
+/test.parquet
diff --git test.parquet.dvc test.parquet.dvc
new file mode 100644
index 0000000..33d0760
--- /dev/null
+++ test.parquet.dvc
@@ -0,0 +1,32 @@
+outs:
+- md5: 4379600b26647a50dfcd0daa824e8219
+  size: 1635
+  hash: md5
+  path: test.parquet
+
+test.parquet .dvc/cache/files/md5/43/79600b26647a50dfcd0daa824e8219
+MD5: 4379600b26647a50dfcd0daa824e8219
+1635 bytes
+5 rows
+message schema {
+  OPTIONAL INT64 num;
+  OPTIONAL BYTE_ARRAY str (STRING);
+}
+First 2 rows:
+{
+  "num": 111,
+  "str": "aaa"
+}
+{
+  "num": 222,
+  "str": "bbb"
+}
+Last 2 rows:
+{
+  "num": 444,
+  "str": "ddd"
+}
+{
+  "num": 555,
+  "str": "eee"
+}
diff --git test.py test.py
new file mode 100644
index 0000000..fcfac0e
--- /dev/null
+++ test.py
@@ -0,0 +1,7 @@
+import pandas as pd
+
+df = pd.DataFrame({
+    'num': [111, 222, 333, 444, 555],
+    'str': ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
+})
+df.to_parquet('test.parquet', index=False)
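
The delegation described at the start of this section amounts to choosing a driver from the tracked file's path. The sketch below illustrates that idea only. `driver_for_path` and the driver names it prints are hypothetical, and the real script may resolve drivers differently (for example through Git attributes).

```shell
# Hypothetical illustration of path-based driver selection; the names
# "parquet" and "default" are placeholders, not taken from git-diff-dvc.sh.
driver_for_path() {
  case "$1" in
    *.parquet) echo parquet ;;   # hand Parquet files to a Parquet-aware driver
    *)         echo default ;;   # fall back to a plain text diff
  esac
}

driver_for_path test.parquet        # → parquet
driver_for_path test.txt            # → default
driver_for_path data/test.parquet   # → parquet
```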

Update Parquet file

f29e52a updated test.parquet, appending 3 rows and changing a dtype (from int64 to int32):

git diff 'f29e52a^..f29e52a' -- test.parquet.dvc
test.parquet
--- .dvc/cache/files/md5/43/79600b26647a50dfcd0daa824e8219 33d076033596bcccf90a442c58eb83f44499ea40
+++ .dvc/cache/files/md5/be/082c87786f3364ca9efec061a3cc21 718c8cd68af7fc28fb60e8ab1ee678a03cda86fe
a/test.parquet..b/test.parquet
1,3c1,3
< MD5: 4379600b26647a50dfcd0daa824e8219
< 1635 bytes
< 5 rows
---
> MD5: be082c87786f3364ca9efec061a3cc21
> 1622 bytes
> 8 rows
5c5
<   OPTIONAL INT64 num;
---
>   OPTIONAL INT32 num;
19,20c19,20
<   "num": 444,
<   "str": "ddd"
---
>   "num": 777,
>   "str": "ggg"
23,24c23,24
<   "num": 555,
<   "str": "eee"
---
>   "num": 888,
>   "str": "hhh"

git show f29e52a
commit f29e52a12d176e27c39fae5e87ce50317432279a
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Wed Dec 25 12:34:53 2024 -0500

    append to `test.parquet`, change "num" to int32

diff --git test.parquet.dvc test.parquet.dvc
index 33d0760..718c8cd 100644
--- test.parquet.dvc
+++ test.parquet.dvc
@@ -1,15 +1,15 @@
 outs:
-- md5: 4379600b26647a50dfcd0daa824e8219
-  size: 1635
+- md5: be082c87786f3364ca9efec061a3cc21
+  size: 1622
   hash: md5
   path: test.parquet
 
-test.parquet .dvc/cache/files/md5/43/79600b26647a50dfcd0daa824e8219
-MD5: 4379600b26647a50dfcd0daa824e8219
-1635 bytes
-5 rows
+test.parquet .dvc/cache/files/md5/be/082c87786f3364ca9efec061a3cc21
+MD5: be082c87786f3364ca9efec061a3cc21
+1622 bytes
+8 rows
 message schema {
-  OPTIONAL INT64 num;
+  OPTIONAL INT32 num;
   OPTIONAL BYTE_ARRAY str (STRING);
 }
 First 2 rows:
@@ -23,10 +23,10 @@ First 2 rows:
 }
 Last 2 rows:
 {
-  "num": 444,
-  "str": "ddd"
+  "num": 777,
+  "str": "ggg"
 }
 {
-  "num": 555,
-  "str": "eee"
+  "num": 888,
+  "str": "hhh"
 }
diff --git test.py test.py
index fcfac0e..8721b78 100644
--- test.py
+++ test.py
@@ -1,7 +1,7 @@
 import pandas as pd
 
 df = pd.DataFrame({
-    'num': [111, 222, 333, 444, 555],
-    'str': ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
-})
+    'num': [111, 222, 333, 444, 555, 666, 777, 888],
+    'str': ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh'],
+}).astype({ 'num': 'int32' })
 df.to_parquet('test.parquet', index=False)

Customize .parquet.dvc diff with $PQT_TXT_OPTS

git-diff-parquet.sh supports $PQT_TXT_OPTS for customizing how Parquet files are converted to text (before being compared):

PQT_TXT_OPTS=-sn, git diff 'f29e52a^..f29e52a' -- test.parquet.dvc
test.parquet
--- .dvc/cache/files/md5/43/79600b26647a50dfcd0daa824e8219 33d076033596bcccf90a442c58eb83f44499ea40
+++ .dvc/cache/files/md5/be/082c87786f3364ca9efec061a3cc21 718c8cd68af7fc28fb60e8ab1ee678a03cda86fe
a/test.parquet..b/test.parquet
1,3c1,3
< MD5: 4379600b26647a50dfcd0daa824e8219
< 1635 bytes
< 5 rows
---
> MD5: be082c87786f3364ca9efec061a3cc21
> 1622 bytes
> 8 rows
5c5
<   OPTIONAL INT64 num;
---
>   OPTIONAL INT32 num;
12a13,15
> {"num":666,"str":"fff"}
> {"num":777,"str":"ggg"}
> {"num":888,"str":"hhh"}

  • -s renders one object per line (instead of one field per line)
  • -n, means "print all the rows" (before a diff is performed)
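
Because `PQT_TXT_OPTS` is given as a prefix assignment in the command above, it applies to that one command only. The sketch below contrasts a per-command setting with an exported one, using `sh -c` as a stand-in for `git diff`; nothing here calls git-diff-parquet.sh itself.

```shell
unset PQT_TXT_OPTS

# Prefix assignment: visible to this one command only.
PQT_TXT_OPTS=-sn, sh -c 'echo "one-off: ${PQT_TXT_OPTS:-<unset>}"'
# → one-off: -sn,

sh -c 'echo "afterwards: ${PQT_TXT_OPTS:-<unset>}"'
# → afterwards: <unset>

# Exported: visible to every later command in this shell session.
export PQT_TXT_OPTS=-sn,
sh -c 'echo "exported: ${PQT_TXT_OPTS:-<unset>}"'
# → exported: -sn,
```

Exporting the variable (for example from ~/.bashrc) would make the chosen rendering the default for every Parquet diff in that shell.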

Add directory, remove files

3257258 moved test.txt and test.parquet into a new DVC-tracked directory, data/ (with tracking file data.dvc):

git diff '3257258^..3257258' -- data.dvc
data
--- /dev/null .
+++ .dvc/cache/files/md5/63/9653e88148f06346d0b965fd0318cc.dir e9c2c3a1ce3f416a21df573905667f4083122bc3
1c1,4
< {}
---
> {
>   "test.parquet": "c07bba3fae2b64207aa92f422506e4a2",
>   "test.txt": "e20b902b49a98b1a05ed62804c757f94"
> }

data/test.parquet
--- /dev/null null
+++ .dvc/cache/files/md5/c0/7bba3fae2b64207aa92f422506e4a2 c07bba3fae2b64207aa92f422506e4a2
diff --git b/data/test.parquet b/data/test.parquet
new file mode 100644
index 0000000..0109fa9
--- /dev/null
+++ b/data/test.parquet
@@ -0,0 +1,25 @@
+MD5: c07bba3fae2b64207aa92f422506e4a2
+1592 bytes
+5 rows
+message schema {
+  OPTIONAL INT32 num;
+  OPTIONAL BYTE_ARRAY str (STRING);
+}
+First 2 rows:
+{
+  "num": 111,
+  "str": "aaa"
+}
+{
+  "num": 222,
+  "str": "bbb"
+}
+Last 2 rows:
+{
+  "num": 444,
+  "str": "ddd"
+}
+{
+  "num": 555,
+  "str": "eee"
+}


data/test.txt
--- /dev/null null
+++ .dvc/cache/files/md5/e2/0b902b49a98b1a05ed62804c757f94 e20b902b49a98b1a05ed62804c757f94
diff --git b/data/test.txt b/data/test.txt
new file mode 100644
index 0000000..8b1acc1
--- /dev/null
+++ b/data/test.txt
@@ -0,0 +1,10 @@
+0
+1
+2
+3
+4
+5
+6
+7
+8
+9

Notice how both data/test.{txt,parquet} are rendered (the latter using the appropriate diff driver).
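
The first hunk of the `git diff` output above summarizes the directory as a path-to-MD5 object. DVC stores a tracked directory's listing in its `.dir` cache entry as a JSON array of records with `md5` and `relpath` keys, so a summary of that shape can be produced with jq. This is a sketch of the idea, not necessarily what git-diff-dvc.sh runs; `dir_summary` is a hypothetical name.

```shell
# Hypothetical helper: turn a DVC .dir listing (JSON array on stdin)
# into a key-sorted {relpath: md5} object. Requires jq.
dir_summary() {
  jq -S 'map({key: .relpath, value: .md5}) | from_entries'
}

# Sample listing, using the hashes from the example above
listing='[
  {"md5": "c07bba3fae2b64207aa92f422506e4a2", "relpath": "test.parquet"},
  {"md5": "e20b902b49a98b1a05ed62804c757f94", "relpath": "test.txt"}
]'

if command -v jq >/dev/null 2>&1; then
  printf '%s\n' "$listing" | dir_summary
  echo '[]' | dir_summary   # an empty listing yields {}
fi
```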

The full commit also shows the previous test.{txt,parquet} files as deleted:

git show 3257258
commit 3257258cce6f8b70e2d30d3deec8e00919a22079
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Wed Dec 25 13:03:28 2024 -0500

    mv `test.{txt,parquet}.dvc` into dvc-tracked dir `data/`

diff --git .gitignore .gitignore
index a35ca01..3af0ccb 100644
--- .gitignore
+++ .gitignore
@@ -1,2 +1 @@
-/test.txt
-/test.parquet
+/data
diff --git data.dvc data.dvc
new file mode 100644
index 0000000..e9c2c3a
--- /dev/null
+++ data.dvc
@@ -0,0 +1,45 @@
+outs:
+- md5: 639653e88148f06346d0b965fd0318cc.dir
+  size: 1612
+  nfiles: 2
+  hash: md5
+  path: data
+
+test.parquet .dvc/cache/files/md5/c0/7bba3fae2b64207aa92f422506e4a2
+MD5: c07bba3fae2b64207aa92f422506e4a2
+1592 bytes
+5 rows
+message schema {
+  OPTIONAL INT32 num;
+  OPTIONAL BYTE_ARRAY str (STRING);
+}
+First 2 rows:
+{
+  "num": 111,
+  "str": "aaa"
+}
+{
+  "num": 222,
+  "str": "bbb"
+}
+Last 2 rows:
+{
+  "num": 444,
+  "str": "ddd"
+}
+{
+  "num": 555,
+  "str": "eee"
+}
+
+test.txt .dvc/cache/files/md5/e2/0b902b49a98b1a05ed62804c757f94
+0
+1
+2
+3
+4
+5
+6
+7
+8
+9
diff --git test.parquet.dvc test.parquet.dvc
deleted file mode 100644
index 718c8cd..0000000
--- test.parquet.dvc
+++ /dev/null
@@ -1,32 +0,0 @@
-outs:
-- md5: be082c87786f3364ca9efec061a3cc21
-  size: 1622
-  hash: md5
-  path: test.parquet
-
-test.parquet .dvc/cache/files/md5/be/082c87786f3364ca9efec061a3cc21
-MD5: be082c87786f3364ca9efec061a3cc21
-1622 bytes
-8 rows
-message schema {
-  OPTIONAL INT32 num;
-  OPTIONAL BYTE_ARRAY str (STRING);
-}
-First 2 rows:
-{
-  "num": 111,
-  "str": "aaa"
-}
-{
-  "num": 222,
-  "str": "bbb"
-}
-Last 2 rows:
-{
-  "num": 777,
-  "str": "ggg"
-}
-{
-  "num": 888,
-  "str": "hhh"
-}
diff --git test.py test.py
index 8721b78..065a6f3 100644
--- test.py
+++ test.py
@@ -1,7 +1,15 @@
+from os import makedirs
+
 import pandas as pd
 
+makedirs('data', exist_ok=True)
+
 df = pd.DataFrame({
-    'num': [111, 222, 333, 444, 555, 666, 777, 888],
-    'str': ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh'],
+    'num': [111, 222, 333, 444, 555],
+    'str': ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
 }).astype({ 'num': 'int32' })
-df.to_parquet('test.parquet', index=False)
+df.to_parquet('data/test.parquet', index=False)
+
+with open('data/test.txt', 'w') as f:
+    for i in range(10):
+        print(f"{i}", file=f)
diff --git test.txt.dvc test.txt.dvc
deleted file mode 100644
index 0ceff0b..0000000
--- test.txt.dvc
+++ /dev/null
@@ -1,22 +0,0 @@
-outs:
-- md5: fca18e3023be1c0a6e14ca2003b1524a
-  size: 36
-  hash: md5
-  path: test.txt
-
-test.txt .dvc/cache/files/md5/fc/a18e3023be1c0a6e14ca2003b1524a
-1
-2
-3
-4
-5
-6
-7
-8
-9
-10
-11
-12
-13
-14
-15

Update files in DVC-tracked directory

ae8638a changed values in data/test.parquet, and added rows to data/test.txt:

git diff 'ae8638a^..ae8638a' -- data.dvc
data
--- .dvc/cache/files/md5/63/9653e88148f06346d0b965fd0318cc.dir e9c2c3a1ce3f416a21df573905667f4083122bc3
+++ .dvc/cache/files/md5/06/3f561a84adbf367a10e21aa33479dd.dir cb8a498df96e6a595dba21f186793e464d12282f
2,3c2,3
<   "test.parquet": "c07bba3fae2b64207aa92f422506e4a2",
<   "test.txt": "e20b902b49a98b1a05ed62804c757f94"
---
>   "test.parquet": "f46dd86f608b1dc00993056c9fc55e6e",
>   "test.txt": "9306ec0709cc72558045559ada26573b"

data/test.parquet
--- .dvc/cache/files/md5/c0/7bba3fae2b64207aa92f422506e4a2 c07bba3fae2b64207aa92f422506e4a2
+++ .dvc/cache/files/md5/f4/6dd86f608b1dc00993056c9fc55e6e f46dd86f608b1dc00993056c9fc55e6e
a/data/test.parquet..b/data/test.parquet
1c1
< MD5: c07bba3fae2b64207aa92f422506e4a2
---
> MD5: f46dd86f608b1dc00993056c9fc55e6e
10c10
<   "num": 111,
---
>   "num": 11,
14c14
<   "num": 222,
---
>   "num": 22,
19c19
<   "num": 444,
---
>   "num": 44,
23c23
<   "num": 555,
---
>   "num": 55,



data/test.txt
--- .dvc/cache/files/md5/e2/0b902b49a98b1a05ed62804c757f94 e20b902b49a98b1a05ed62804c757f94
+++ .dvc/cache/files/md5/93/06ec0709cc72558045559ada26573b 9306ec0709cc72558045559ada26573b
diff --git a/data/test.txt b/data/test.txt
index 8b1acc1..aa44898 100644
--- a/data/test.txt
+++ b/data/test.txt
@@ -8,3 +8,8 @@
 7
 8
 9
+10
+11
+12
+13
+14

git show ae8638a
commit ae8638a47e0ed11f4e0f6d451d69d951b34c12c7
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Wed Dec 25 15:21:11 2024 -0500

    modify DVC-dir files

diff --git data.dvc data.dvc
index e9c2c3a..cb8a498 100644
--- data.dvc
+++ data.dvc
@@ -1,12 +1,12 @@
 outs:
-- md5: 639653e88148f06346d0b965fd0318cc.dir
-  size: 1612
+- md5: 063f561a84adbf367a10e21aa33479dd.dir
+  size: 1627
   nfiles: 2
   hash: md5
   path: data
 
-test.parquet .dvc/cache/files/md5/c0/7bba3fae2b64207aa92f422506e4a2
-MD5: c07bba3fae2b64207aa92f422506e4a2
+test.parquet .dvc/cache/files/md5/f4/6dd86f608b1dc00993056c9fc55e6e
+MD5: f46dd86f608b1dc00993056c9fc55e6e
 1592 bytes
 5 rows
 message schema {
@@ -15,24 +15,24 @@ message schema {
 }
 First 2 rows:
 {
-  "num": 111,
+  "num": 11,
   "str": "aaa"
 }
 {
-  "num": 222,
+  "num": 22,
   "str": "bbb"
 }
 Last 2 rows:
 {
-  "num": 444,
+  "num": 44,
   "str": "ddd"
 }
 {
-  "num": 555,
+  "num": 55,
   "str": "eee"
 }
 
-test.txt .dvc/cache/files/md5/e2/0b902b49a98b1a05ed62804c757f94
+test.txt .dvc/cache/files/md5/93/06ec0709cc72558045559ada26573b
 0
 1
 2
@@ -43,3 +43,8 @@ test.txt .dvc/cache/files/md5/e2/0b902b49a98b1a05ed62804c757f94
 7
 8
 9
+10
+11
+12
+13
+14
diff --git test.py test.py
index 065a6f3..34e93bb 100644
--- test.py
+++ test.py
@@ -5,11 +5,11 @@ import pandas as pd
 makedirs('data', exist_ok=True)
 
 df = pd.DataFrame({
-    'num': [111, 222, 333, 444, 555],
+    'num': [11, 22, 33, 44, 55],
     'str': ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
 }).astype({ 'num': 'int32' })
 df.to_parquet('data/test.parquet', index=False)
 
 with open('data/test.txt', 'w') as f:
-    for i in range(10):
+    for i in range(15):
         print(f"{i}", file=f)

Diff-pipelining with dvc-utils / dvc-diff

Sometimes diffs of DVC-tracked blobs are too complex to grok with generic file-type-based diffs (like git-diff-parquet.sh, as used above).

The dvc-utils package provides a dvc-diff CLI that supports applying arbitrary bash pipelines to DVC-tracked blobs, before diffing them. See its examples for more info.
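
The underlying idea, running the same pipeline over both versions and diffing the results, can be sketched in plain shell. This is an illustration only and does not reproduce dvc-diff's interface; `diff_after` is a hypothetical helper that takes a pipeline string and two file paths.

```shell
# Hypothetical helper (not dvc-diff): run the same shell pipeline over two
# files, then diff the pipeline outputs. Exit status is that of `diff`.
diff_after() (
  pipeline=$1 old=$2 new=$3
  tmp=$(mktemp -d) || exit 2
  sh -c "$pipeline" < "$old" > "$tmp/old"
  sh -c "$pipeline" < "$new" > "$tmp/new"
  status=0
  diff "$tmp/old" "$tmp/new" || status=$?
  rm -rf "$tmp"
  exit "$status"
)

# Example: compare only the last line of two versions of a file
work=$(mktemp -d)
seq 10 > "$work/v1.txt"
seq 15 > "$work/v2.txt"
diff_after 'tail -n 1' "$work/v1.txt" "$work/v2.txt" || true
# 1c1
# < 10
# ---
# > 15
rm -rf "$work"
```

Reducing each version to a small text summary first keeps the diff focused on the property of interest rather than on every byte that changed.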

Other Bash DVC scripts/aliases

.dvc-rc can be sourced from ~/.bashrc, and provides useful aliases, e.g.:

  • dvlp (dvc_local_cache_path)
  • dvz (dvc_size)
  • dvc-diff aliases (from dvc-utils)
