Import ignore cache #10657

Open
konstantin-frolov opened this issue Dec 20, 2024 · 2 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p1-important Important, aka current backlog of things to do

Comments

@konstantin-frolov

konstantin-frolov commented Dec 20, 2024

Bug Report

DVC 3.56 Import ignore cache

Description

I have a local DVC data repo in which each JSON annotation file is added individually, plus a large data storage folder with thousands of image files added as a whole directory.
I use symlinks for the cache.
But importing from that repo into another repo does not create symlinks to the image data directly. DVC downloads the files first, then links them.

Reproduce

Local data repo

Config

cache.type=symlink
core.autostage=true

Local storage dirs:

annotations/
     master_annotation.json
     train_annotation.json
     test_annotations.json
data_storage/
     image_0
     image_1
     ...
     image_N

Commands run in the local data repo:

dvc add ./annotations/*
dvc add ./data_storage

Project repo

Config

cache.type=symlink
cache.dir=path/to/local/data/repo/.dvc/cache
core.autostage=true

commands:

dvc import path/to/local/data/repo data_storage

This command starts downloading (copying) the files from the cache.

dvc import path/to/local/data/repo data_storage --no-download

This creates the data_storage.dvc file in the project repo (I checked it), but

dvc checkout data_storage.dvc

or

dvc checkout data_storage.dvc --relink

starts downloading the files again.

Expected

I think DVC should create symlinks to the files without downloading the originals.
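To check whether the workspace ended up in the expected state, something like the following could be run in the project repo after the import. This is only a minimal sketch using the Python standard library (Python 3.9+ for `Path.is_relative_to`); `classify_links` is a hypothetical helper, not part of DVC, and it assumes DVC links files one by one inside a regular directory.

```python
import os
from pathlib import Path


def classify_links(workspace_dir, cache_dir):
    """Count files under workspace_dir by how they relate to cache_dir.

    Hypothetical helper for illustration; not a DVC API.
    """
    cache_root = Path(cache_dir).resolve()
    counts = {"linked_to_cache": 0, "linked_elsewhere": 0, "regular_copy": 0}
    for root, _dirs, files in os.walk(workspace_dir):
        for name in files:
            path = Path(root) / name
            if path.is_symlink():
                # Resolve the link target and test whether it lives in the cache.
                if path.resolve().is_relative_to(cache_root):
                    counts["linked_to_cache"] += 1
                else:
                    counts["linked_elsewhere"] += 1
            else:
                # A regular file (or hardlink) occupies the workspace itself.
                counts["regular_copy"] += 1
    return counts
```

With the paths from this report, the call would be `classify_links("data_storage", "path/to/local/data/repo/.dvc/cache")`. In the expected outcome every file is counted under `linked_to_cache` and `regular_copy` stays at zero.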

Environment information

Output of dvc doctor in local data repo:

-------------------------
Platform: Python 3.12.7 on Linux-5.15.0-86-generic-x86_64-with-glibc2.31
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.19.0),
        gdrive (pydrive2 = 1.21.1),
        gs (gcsfs = 2024.10.0),
        hdfs (fsspec = 2024.10.0, pyarrow = 18.0.0),
        http (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        https (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.10.0, boto3 = 1.35.36),
        ssh (sshfs = 2024.9.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.10.0)
Config:
        Global: /home/user/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs on ip-addr:/storage/
Caches: local
Remotes: None
Workspace directory: nfs on ip-addr:/storage/
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/76de345055c7e5635fd954ee44e5d4e2

Output of dvc doctor in project repo:

DVC version: 3.56.0 (deb)
-------------------------
Platform: Python 3.12.7 on Linux-5.15.0-86-generic-x86_64-with-glibc2.31
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.19.0),
        gdrive (pydrive2 = 1.21.1),
        gs (gcsfs = 2024.10.0),
        hdfs (fsspec = 2024.10.0, pyarrow = 18.0.0),
        http (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        https (aiohttp = 3.10.10, aiohttp-retry = 2.9.0),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.10.0, boto3 = 1.35.36),
        ssh (sshfs = 2024.9.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.10.0)
Config:
        Global: /home/user/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: nfs on ip-addr:/storage/
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/f967073321531b0cc07fba234dd73d7b
@shcheklein
Member

Yes, I can confirm that it first copies the files from the cache, then removes them and replaces them with links. This can be problematic with a shared NAS cache, where we want to work with links alone (we don't want, or can't have, the data on the local disk).

@skshetry do you remember if this is expected behavior or a regression?
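The copy-then-relink behavior described above could be observed from outside DVC by sampling how many bytes of real (non-symlink) file data sit in the workspace while a command runs. The sketch below is an assumption-laden illustration, not DVC code: `materialized_bytes` and `peak_materialized_bytes` are hypothetical names, hardlinks are counted as regular files, and copies that live for less than one sampling interval can be missed.

```python
import os
import subprocess
import time


def materialized_bytes(directory):
    """Sum the sizes of regular files under directory, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if not os.path.islink(path):
                    total += os.lstat(path).st_size
            except FileNotFoundError:
                # The file vanished between listing and stat,
                # e.g. because it was just replaced by a link.
                continue
    return total


def peak_materialized_bytes(command, directory, interval=0.2):
    """Run command and return the largest materialized size seen meanwhile."""
    process = subprocess.Popen(command)
    peak = materialized_bytes(directory)
    while process.poll() is None:
        time.sleep(interval)
        peak = max(peak, materialized_bytes(directory))
    # Take one last sample after the command has exited.
    return max(peak, materialized_bytes(directory))
```

For example, `peak_materialized_bytes(["dvc", "import", "path/to/local/data/repo", "data_storage"], "data_storage")` uses the command from the report. A peak close to the dataset size, followed by `materialized_bytes("data_storage")` returning roughly zero, would be consistent with files being copied first and relinked afterwards.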

@shcheklein shcheklein added bug Did we break something? p1-important Important, aka current backlog of things to do A: data-sync Related to dvc get/fetch/import/pull/push labels Dec 22, 2024
@konstantin-frolov
Author

Hi!
@shcheklein @skshetry do you have any updates on this issue?
