Notes on how to accomplish various operations using DVC. Most of these are just distilled notes from the DVC documentation.
This document guides you through working with the llm-eval repo and with DVC backed by a JASMIN object store. The instructions can then be applied to any other repository that you want to set up to work with DVC.
$ git clone git@github.com:NERC-CEH/llm-eval.git
$ cd llm-eval
DVC can be installed using pip, which provides the basic CLI needed to execute DVC commands. The recommended way when working with this repository is to create a new Python virtual environment and then install the appropriate DVC packages via the requirements.txt file:
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
If you are working on a different repository, the packages can be installed separately:
$ pip install dvc
$ pip install 'dvc[s3]'
DVC remotes backed by various other technologies (besides s3) can be used. See the DVC documentation for details.
This repository has a corresponding object store on JASMIN, llm-eval-o. To work with the data in this repository managed by DVC, you must request access to the object store from the object store manager, Matt Coole.
Once you have been granted USER access, log in through the JASMIN Object Store Portal and create an access key for the llm-eval-o object store. Instructions for creating keys can be found in the JASMIN Documentation.
Make sure you store your secret somewhere safe as you will not be able to view it again after the initial creation of your key.
Once you have access to the object store and have created a key, you will need to set up your credentials:
$ dvc remote modify --local jasmin access_key_id '<ACCESS_KEY_ID>'
$ dvc remote modify --local jasmin secret_access_key '<KEY_SECRET>'
Note: The configuration for DVC is tracked in .dvc/config, but your credentials should be stored in a separate file (.dvc/config.local), which should not be tracked by version control to avoid secrets being leaked. Make sure to use --local when configuring credentials.
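As a quick sanity check on the note above, you can ask git whether the local config file is ignored. This is only a minimal sketch: the function name check_local_config_ignored is hypothetical, and it assumes you run it from the repository root. As far as I can tell, DVC normally creates a .dvc/.gitignore that covers config.local, so this should pass on a standard setup.

```shell
# Hypothetical helper (not part of DVC or git): report whether git ignores
# the file that holds your DVC credentials. Run from the repository root.
check_local_config_ignored() {
    # `git check-ignore -q` exits with 0 when the path is ignored.
    if git check-ignore -q .dvc/config.local; then
        echo "ok: .dvc/config.local is ignored by git"
    else
        echo "warning: .dvc/config.local is NOT ignored by git" >&2
        return 1
    fi
}

# Example usage:
#   check_local_config_ignored
```

If the warning appears, it is worth fixing the ignore rules before committing anything.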
WARNING: There appears to be a subtle bug in DVC credentials management where any kind of quote character (' or ") in your secret key will invalidate your credentials, and you will receive a 403: Forbidden error when attempting to access the JASMIN object store. The only way I've found around this so far is to keep generating new access keys until you get one without any quote characters.
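To avoid finding out about a problematic key only when the 403 appears, you could check a freshly generated secret for quote characters before configuring it. The following is a minimal sketch in POSIX shell; the function name has_quote_char and the KEY_SECRET variable are hypothetical and not part of DVC.

```shell
# Hypothetical helper (not part of DVC): succeeds (exit status 0) when the
# given string contains a single or double quote character.
has_quote_char() {
    case "$1" in
        *\'*|*\"*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Example usage, assuming the secret is held in a variable named KEY_SECRET:
#   if has_quote_char "$KEY_SECRET"; then
#       echo "Secret contains a quote character; generate a new key."
#   fi
```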
Assuming that configuration and credentials have been set up correctly, you should now be able to pull the data that is tracked by DVC from the JASMIN object store. This is done using the dvc pull command.
$ dvc pull
You should now be able to see the data folder and its contents:
data
├── evaluation-sets
│   ├── eidc-eval.csv
│   └── eidc-eval-sample.csv
└── synthetic-datasets
    └── eidc_rag_test_set.csv
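If you want to confirm the pull worked without eyeballing the tree, a small check along these lines may help. It is a sketch only: the function name check_pulled_data is hypothetical, the file list is copied from the tree above, and it assumes you run it from the repository root.

```shell
# Hypothetical helper: verify that the files listed in the tree above exist
# after `dvc pull`. Run from the repository root.
check_pulled_data() {
    missing=0
    for f in \
        data/evaluation-sets/eidc-eval.csv \
        data/evaluation-sets/eidc-eval-sample.csv \
        data/synthetic-datasets/eidc_rag_test_set.csv
    do
        # Report each missing file on stderr and remember the failure.
        [ -f "$f" ] || { echo "missing: $f" >&2; missing=1; }
    done
    return "$missing"
}

# Example usage:
#   check_pulled_data && echo "all expected data files are present"
```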
To make changes to your data, use dvc add on the local file and then use dvc push to push to the remote store. It is then important to commit the .dvc files to git as well, e.g.
$ dvc add my-data-file.csv
$ dvc push
$ git add my-data-file.csv.dvc
$ git commit my-data-file.csv.dvc -m 'Updated data file'
my-data-file.csv.dvc is a placeholder that DVC creates to tell it about the files/folders being tracked. This placeholder will be tracked by git and the actual data tracked by DVC.
DVC should also automatically add the file/directory to .gitignore so it won't end up being accidentally tracked in git as well.
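To see what a .dvc placeholder actually points at, you can read the path entry out of it. The sketch below assumes the usual layout, where the placeholder is a small YAML file with a "path:" entry per tracked output; the function name dvc_tracked_paths is hypothetical and the exact fields may differ between DVC versions.

```shell
# Hypothetical helper: print the path(s) recorded in a .dvc placeholder file.
# Assumes each tracked output has a line of the form "  path: <name>".
dvc_tracked_paths() {
    # Strip leading whitespace or list dashes and the "path:" key,
    # printing only the lines where the substitution matched.
    sed -n 's/^[[:space:]-]*path:[[:space:]]*//p' "$1"
}

# Example usage:
#   dvc_tracked_paths my-data-file.csv.dvc
```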
Any files or folders that you add to DVC must not be tracked by git. To switch from tracking a file with git to DVC, first untrack it with git:
$ git rm --cached data-file-in-git.csv
then follow the steps [above](#Making changes) to add the file(s) to be tracked by DVC.
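Put together, moving a file from git to DVC might look like the sketch below. The function name migrate_to_dvc and the commit message are hypothetical. It assumes DVC writes the .gitignore entry next to the data file, which is the behaviour I have seen, and it stages that .gitignore alongside the placeholder.

```shell
# Hypothetical helper: stop tracking a file in git and track it with DVC.
# $1 = path to the data file currently tracked by git
migrate_to_dvc() {
    git rm --cached "$1" &&
    dvc add "$1" &&
    # Stage the placeholder and the .gitignore that DVC updated.
    git add "$1.dvc" "$(dirname "$1")/.gitignore" &&
    git commit -m "Track $1 with DVC instead of git" &&
    dvc push
}

# Example usage:
#   migrate_to_dvc data-file-in-git.csv
```

Each step only runs if the previous one succeeded, so a failure part-way through leaves you at a point you can inspect with git status.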
Note: Whilst dvc commands seem to somewhat mirror git commands, there doesn't seem to be quite the same concept of a staging area. I would suggest that dvc add is more like an amalgamation (in DVC) of git add + git commit.
To switch between versions of your data tracked by DVC, you can simply use git checkout as you typically would to check out a particular version of code, and then follow this up with dvc checkout to check out the corresponding version of the data, e.g.
$ git checkout c474fcc
$ dvc checkout
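Since the two commands always go together, they can be wrapped in a small function. This is a sketch with a hypothetical name (switch_data_version), not something DVC provides.

```shell
# Hypothetical helper: check out a git revision and the matching DVC data.
# $1 = git revision (commit hash, branch, or tag)
switch_data_version() {
    if [ -z "${1:-}" ]; then
        echo "usage: switch_data_version <git-revision>" >&2
        return 2
    fi
    # Only update the data if the code checkout succeeded.
    git checkout "$1" && dvc checkout
}

# Example usage:
#   switch_data_version c474fcc
```

If that version of the data is not in your local DVC cache, dvc checkout will report it as missing; running dvc pull afterwards should fetch it from the remote.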
Up until here it was assumed you were working with a repository already set up with DVC, but to set up DVC on your own git repository there are just a few initial steps to configure:
To set up your own git repository to track data files using DVC, run dvc init in the repository's directory.
$ dvc init
You will see a .dvc/ directory and a .dvcignore file, which you should add to your version-controlled files.
Now add a bucket to be used as a remote (make sure you create the bucket in your object store first):
$ dvc remote add jasmin s3://test-dvc
This will initially set up the remote to use the test-dvc bucket (check the config in .dvc/config).
Next you need to add the endpoint URL. If you are using a JASMIN object store, this should look something like this:
$ dvc remote modify jasmin endpointurl https://my-test-store-o.s3-ext.jc.rl.ac.uk
Where your object store is called my-test-store-o.
Finally you can configure your credentials as described [above](#Connecting to JASMIN object storage).
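The setup steps can be collected into one function, sketched below. The names jasmin_endpoint_url and setup_jasmin_remote are hypothetical, and the URL pattern is simply the one from the example above. One addition that is not in the steps above: the -d flag on dvc remote add marks the remote as the default, so that a plain dvc push or dvc pull knows where to go. Drop it if you would rather pass -r jasmin each time.

```shell
# Build the endpoint URL for a JASMIN object store from its name,
# following the pattern used in the example above.
jasmin_endpoint_url() {
    echo "https://$1.s3-ext.jc.rl.ac.uk"
}

# Hypothetical helper: initialise DVC and register a JASMIN-backed remote.
# $1 = bucket name (must already exist), $2 = object store name
setup_jasmin_remote() {
    dvc init &&
    # -d is an addition to the steps above: it makes "jasmin" the default remote.
    dvc remote add -d jasmin "s3://$1" &&
    dvc remote modify jasmin endpointurl "$(jasmin_endpoint_url "$2")" &&
    git add .dvc .dvcignore &&
    git commit -m "Initialise DVC with JASMIN remote"
}

# Example usage:
#   setup_jasmin_remote test-dvc my-test-store-o
```

Credentials are deliberately left out of the function so that they are only ever written with --local.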