Integrations

Data Version Control (DVC)

Data assets like training corpora or pretrained weights are at the core of any NLP project, but they're often difficult to manage: you can't just check them into your Git repo to version and keep track of them. And if you have multiple steps that depend on each other, like a preprocessing step that generates your training data, you need to make sure the data is always up-to-date, and re-run all steps of your process every time, just to be safe.

Data Version Control (DVC) is a standalone open-source tool that integrates into your workflow like Git, builds a dependency graph for your data pipelines, and tracks and caches your data files. If you're downloading data from an external source, like a storage bucket, DVC can tell whether the resource has changed. It can also determine whether a step needs to be re-run, depending on whether its inputs have changed. All metadata can be checked into a Git repo, so you'll always be able to reproduce your experiments.

To set up DVC, install the package and initialize your Weasel project as a Git and DVC repo. You can also customize your DVC installation to include support for remote storage like Google Cloud Storage, S3, Azure, SSH and more.

pip install dvc   # Install DVC
git init          # Initialize a Git repo
dvc init          # Initialize a DVC project
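
As a rough sketch of the remote storage customization mentioned above: DVC ships optional extras per storage backend, installed through pip. The example assumes S3 and an already initialized DVC repo. The remote name `storage` and the bucket path are placeholders, not values taken from this project.

```shell
# Install DVC together with the optional dependencies for one backend.
# Other extras include [gs], [azure] and [ssh], or [all] for everything.
pip install "dvc[s3]"

# Register a default remote (-d) named "storage".
# The bucket path is a placeholder; replace it with your own.
dvc remote add -d storage s3://my-bucket/dvc-store
```

The remote definition is written to `.dvc/config`, which can be committed to Git along with the rest of the DVC metadata.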

⚠️ Important note on privacy

DVC enables usage analytics by default, so if you're working in a privacy-sensitive environment, make sure to opt out manually.
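
One way to opt out, sketched here with DVC's `core.analytics` config option. Check the DVC documentation for your installed version before relying on it in a sensitive environment.

```shell
# Disable usage analytics for the current project only.
# Run this inside the initialized DVC repo; the setting is stored in .dvc/config.
dvc config core.analytics false

# Alternatively, disable it for every DVC project of the current user.
dvc config --global core.analytics false
```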

The weasel dvc command creates a dvc.yaml config file based on a workflow defined in your project.yml. Whenever you update your project, you can re-run the command to update your DVC config. You can then manage your Weasel project like any other DVC project: run dvc add to add and track assets, and dvc repro to reproduce the workflow or individual commands.

python -m weasel dvc [project_dir] [workflow_name]
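
For illustration, here is a minimal sketch of the round trip, assuming Weasel and DVC are installed and the current directory is already initialized as a Git and DVC repo. The command name `preprocess`, the script `scripts/preprocess.py`, the file paths and the workflow name `all` are all hypothetical.

```shell
# Write a minimal, hypothetical project.yml with a single workflow named "all".
cat > project.yml <<'EOF'
commands:
  - name: preprocess
    script:
      - "python scripts/preprocess.py assets/raw.jsonl corpus/train.jsonl"
    deps:
      - "assets/raw.jsonl"
    outputs:
      - "corpus/train.jsonl"
workflows:
  all:
    - preprocess
EOF

# Generate (or update) dvc.yaml from the "all" workflow of the project
# in the current directory.
python -m weasel dvc . all
```

After this, dvc repro can decide from the listed deps and outputs whether the step needs to run again.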

⚠️ Important note for multiple workflows

DVC currently expects a single workflow per project, so when creating the config with weasel dvc, you need to specify the name of a workflow defined in your project.yml. You can still use multiple workflows, but only one can be tracked by DVC.