unable to get indi/fcon1000 data #33

Open
loj opened this issue Jun 23, 2020 · 12 comments

@loj

loj commented Jun 23, 2020

What is the problem?

I am unable to get any data from the fcon1000 dataset.

What steps will reproduce the problem?

❱ datalad install ///indi/fcon1000 
install(ok): /home/loj/tmp/fcon1000 (dataset)

❱ cd fcon1000

❱ datalad get -n AnnArbor_a 
install(ok): /home/loj/tmp/fcon1000/AnnArbor_a (dataset) [Installed subdataset in order to get /home/loj/tmp/fcon1000/AnnArbor_a]

❱ datalad get AnnArbor_a/sub04111 
[INFO   ] To obtain some keys we need to fetch an archive of size 1.6 GB                                                                                                                                            
Total (0 ok, 4 failed out of 3):   0%|                                                                                                                                                  | 0.00/70.2M [00:03<?, ?B/s][WARNING] Running get resulted in stderr output: [INFO] To obtain some keys we need to fetch an archive of size 1.6 GB                                                                                              
[INFO] PROGRESS-JSON: {"byte-progress":16384,"action":{"command":"get","note":"from web...","key":"MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar","file":null},"total-size":1602406400,"percent-progress":"0%"} 
[INFO] PROGRESS-JSON: {"command":"get","wanted":[{"here":false,"uuid":"00000000-0000-0000-0000-000000000001","description":"web"},{"here":false,"uuid":"ccec1cce-6820-4a71-8041-7abd4d6603ac","description":"yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a"}],"note":"from web...\nUnable to access these remotes: web\nTry making some of these repositories available:\n\t00000000-0000-0000-0000-000000000001 -- web\n \tccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a\n","skipped":[],"success":false,"key":"MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar","file":null} 
git-annex: get: 3 failed
 
                                                                                                                                                                                                                    [INFO   ] PROGRESS-JSON: {"command":"get","wanted":[{"here":false,"uuid":"00000000-0000-0000-0000-000000000001","description":"web"},{"here":false,"uuid":"ccec1cce-6820-4a71-8041-7abd4d6603ac","description":"yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a"}],"note":"from web...\nUnable to access these remotes: web\nTry making some of these repositories available:\n\t00000000-0000-0000-0000-000000000001 -- web\n \tccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a\n","skipped":[],"success":false,"key":"MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar","file":null} 
[ERROR  ] from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;         ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz)] 
get(error): AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz (file) [from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;     ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]]
[ERROR  ] from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;         ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz)] 
get(error): AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz (file) [from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;  ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]]
[ERROR  ] from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;         ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/func/rest.nii.gz)] 
get(error): AnnArbor_a/sub04111/func/rest.nii.gz (file) [from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;  ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]]
[WARNING] could not get some content in /home/loj/tmp/fcon1000/AnnArbor_a/sub04111 ['/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/func/rest.nii.gz'] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111)] 
get(impossible): AnnArbor_a/sub04111 (directory) [could not get some content in /home/loj/tmp/fcon1000/AnnArbor_a/sub04111 ['/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/func/rest.nii.gz']]
action summary:
  get (error: 3, impossible: 1, notneeded: 1)

What version of DataLad are you using?

datalad wtf

❱ datalad wtf                                                                                                                                                                                                   1 !
# WTF
## configuration <SENSITIVE, report disabled by configuration>
## datalad 
  - full_version: 0.12.7
  - version: 0.12.7
## dataset 
  - id: b6101c84-7aea-11e6-9d5d-002590f97d84
  - metadata: <SENSITIVE, report disabled by configuration>
  - path: /home/loj/tmp/fcon1000
  - repo: AnnexRepo
## dependencies 
  - appdirs: 1.4.4
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 7.20190819+git2-g908476a9b-1~ndall+1
  - cmd:bundled-git: 2.20.1
  - cmd:git: 2.20.1
  - cmd:system-git: 2.27.0
  - cmd:system-ssh: 8.3p1
  - git: 3.1.3
  - gitdb: 4.0.5
  - humanize: 2.4.0
  - iso8601: 0.1.12
  - keyring: 21.2.1
  - keyrings.alt: 3.4.0
  - msgpack: 1.0.0
  - requests: 2.24.0
  - tqdm: 4.46.1
  - wrapt: 1.12.1
## environment 
  - GIT_PYTHON_GIT_EXECUTABLE: /usr/lib/git-annex.linux/git
  - LANG: en_US.UTF-8
  - LANGUAGE: en_US.UTF-8
  - LC_ALL: en_US.UTF-8
  - LC_CTYPE: en_US.UTF-8
  - PATH: /home/loj/.venv/datalad_fresh/bin:/home/loj/.dotfiles/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin:/usr/local/games:/usr/games
## extensions 
## git-annex 
  - build flags: 
    - Assistant
    - Webapp
    - Pairing
    - S3
    - WebDAV
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Feeds
    - Testsuite
  - dependency versions: 
    - aws-0.20
    - bloomfilter-2.0.1.0
    - cryptonite-0.25
    - DAV-1.3.3
    - feed-1.0.0.0
    - ghc-8.4.4
    - http-client-0.5.13.1
    - persistent-sqlite-2.8.2
    - torrent-10000.1.1
    - uuid-1.3.13
    - yesod-1.6.0
  - key/value backends: 
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
  - local repository version: 5
  - operating system: linux x86_64
  - remote types: 
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - hook
    - external
  - supported repository versions: 
    - 5
    - 7
  - upgrade supported from repository versions: 
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
  - version: 7.20190819+git2-g908476a9b-1~ndall+1
## location 
  - path: /home/loj/tmp/fcon1000
  - type: dataset
## metadata_extractors 
  - annex: 
    - load_error: None
    - module: datalad.metadata.extractors.annex
    - version: None
  - audio: 
    - load_error: No module named 'mutagen' [audio.py:<module>:17]
    - module: datalad.metadata.extractors.audio
  - datacite: 
    - load_error: None
    - module: datalad.metadata.extractors.datacite
    - version: None
  - datalad_core: 
    - load_error: None
    - module: datalad.metadata.extractors.datalad_core
    - version: None
  - datalad_rfc822: 
    - load_error: None
    - module: datalad.metadata.extractors.datalad_rfc822
    - version: None
  - exif: 
    - load_error: No module named 'exifread' [exif.py:<module>:16]
    - module: datalad.metadata.extractors.exif
  - frictionless_datapackage: 
    - load_error: None
    - module: datalad.metadata.extractors.frictionless_datapackage
    - version: None
  - image: 
    - load_error: No module named 'PIL' [image.py:<module>:16]
    - module: datalad.metadata.extractors.image
  - xmp: 
    - load_error: No module named 'libxmp' [xmp.py:<module>:20]
    - module: datalad.metadata.extractors.xmp
## python 
  - implementation: CPython
  - version: 3.8.3
## system 
  - distribution: debian/unstable/sid
  - encoding: 
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - max_path_length: 278
  - name: Linux
  - release: 5.4.0-4-amd64
  - type: posix
  - version: #1 SMP Debian 5.4.19-1 (2020-02-13)

@yarikoptic
Member

I think it is due to NITRC starting to require a login to get access to that file :-/

details
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111
$> git annex whereis anat/mprage_anonymized.nii.gz
whereis anat/mprage_anonymized.nii.gz (2 copies) 
  	ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a
   	f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]

  datalad-archives: dl+archive:MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar#path=sub04111/anat/mprage_anonymized.nii.gz&size=3914814
ok
(dev3) 1 39266.....................................:Tue 23 Jun 2020 12:01:46 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111
$> git annex whereis --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar#
whereis MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar# (0 copies) failed
git-annex: whereis: 1 failed
(dev3) 1 39267 ->1.....................................:Tue 23 Jun 2020 12:02:00 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111
$> git annex whereis --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar 
whereis MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a

  web: http://www.nitrc.org/frs/downloadlink.php/1992
ok
(dev3) 1 39268.....................................:Tue 23 Jun 2020 12:02:02 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111
$> git annex get --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar
get MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar (from web...) 

  verification of content failed

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
   	ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a
failed
git-annex: get: 1 failed
(dev3) 1 39269 ->1.....................................:Tue 23 Jun 2020 12:02:25 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111
$> git annex get --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar --debug
[2020-06-23 12:02:34.293205307] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2020-06-23 12:02:34.305624863] process done ExitSuccess
[2020-06-23 12:02:34.305860717] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","show-ref","refs/heads/master"]
[2020-06-23 12:02:34.319716634] process done ExitSuccess
get MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar [2020-06-23 12:02:34.321403394] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","show-ref","git-annex"]
[2020-06-23 12:02:34.331076977] process done ExitSuccess
[2020-06-23 12:02:34.33168571] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2020-06-23 12:02:34.346998595] process done ExitSuccess
[2020-06-23 12:02:34.347577569] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","log","refs/heads/git-annex..88214397de46cdb2d9e0aae77ed89f995d80332f","--pretty=%H","-n1"]
[2020-06-23 12:02:34.357381935] process done ExitSuccess
[2020-06-23 12:02:34.37867209] chat: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","cat-file","--batch"]
[2020-06-23 12:02:34.381210874] chat: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
(from web...) 
[2020-06-23 12:02:34.45923921] Request {
  host                 = "www.nitrc.org"
  port                 = 80
  secure               = False
  requestHeaders       = [("Accept-Encoding","identity"),("User-Agent","git-annex/8.20200501+git61-g64e081d58-1~ndall+1")]
  path                 = "/frs/downloadlink.php/1992"
  queryString          = ""
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}

                                      
  verification of content failed

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
   	ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a
failed
[2020-06-23 12:02:35.993770571] process done ExitSuccess
[2020-06-23 12:02:35.994441736] process done ExitSuccess
git-annex: get: 1 failed

So going to http://www.nitrc.org/frs/downloadlink.php/1992 now requires logging in, and I guess that is what has changed; maybe @chaselgrove could confirm that?
Would all downloads require login now?
If not all -- is there a list which would tell which ones?

As a workaround solution, we could

As a more permanent and reliable solution, I guess we would need to adjust our downloaders to handle the ad-hoc "you need to login" web page: make the datalad downloader "smarter" by allowing a first non-authenticated attempt, then parsing the output (if not too large) to discover whether we got something other than what we expected (e.g. a login page), and only then authenticating...
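The "try anonymously first, then authenticate" idea above could be sketched roughly as follows. This is only an illustration and not datalad's downloader code: `looks_like_login_page`, `fetch_with_fallback`, the `fetch` and `authenticate` callables, and the 64 KiB sniffing cap are all hypothetical names and values, and the password-field check is just one possible heuristic.

```python
from typing import Callable

# Hypothetical cap: only inspect the first 64 KiB of a response.
MAX_SNIFF_BYTES = 64 * 1024


def looks_like_login_page(head: bytes) -> bool:
    """Heuristic: HTML markup plus a password field suggests a login form."""
    text = head[:MAX_SNIFF_BYTES].decode("utf-8", errors="replace").lower()
    is_html = "<html" in text or "<!doctype html" in text
    return is_html and 'type="password"' in text


def fetch_with_fallback(
    url: str,
    fetch: Callable[[str], bytes],
    authenticate: Callable[[], None],
) -> bytes:
    """Try without credentials first; authenticate only if a login page comes back."""
    body = fetch(url)
    if looks_like_login_page(body):
        authenticate()
        body = fetch(url)
        if looks_like_login_page(body):
            # Authentication did not help; report instead of saving the HTML.
            raise RuntimeError(f"still received a login page for {url}")
    return body
```

Passing `fetch` and `authenticate` in as callables keeps the sketch independent of any particular HTTP library; a real implementation would also want to avoid holding multi-gigabyte responses in memory.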

I quickly tested that with this patch `datalad download-url` works, but I didn't wait for the download to finish (my home internet isn't fast enough to fetch gigabytes at the moment):
$> git diff
diff --git a/.travis.yml b/.travis.yml
index 48d9c5a64..9517a6657 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -51,7 +51,7 @@ before_install:
   # Install git-annex
   - OLD_PATH="$PATH"
   - eval source tools/ci/install-annex.sh ${_DL_ANNEX_INSTALL_SCENARIO}
-  # if PATH was changed, we need to make it available in the login sessions
+  # if PATH was changed, we need to
   - if [ "$PATH" != "$OLD_PATH" ]; then echo export PATH=$PATH >> ~/.bashrc; fi
   # Optionally install the latest Git.  Exit code 100 indicates that bundled is same as the latest.
   - if [ ! -z "${_DL_UPSTREAM_GIT:-}" ]; then
diff --git a/datalad/downloaders/configs/nitrc.cfg b/datalad/downloaders/configs/nitrc.cfg
index a97e16bf8..ebd6e8601 100644
--- a/datalad/downloaders/configs/nitrc.cfg
+++ b/datalad/downloaders/configs/nitrc.cfg
@@ -10,6 +10,7 @@
 # to accomplish the mission here
 url_re = https?://fcon_1000\.projects\.nitrc\.org/indi/adhd200/index\.html
          https?://www\.nitrc\.org/frs/downloadlink\.php/(7058|3075|3479|9108)
+         https?://www\.nitrc\.org/frs/downloadlink\.php/([0-9][0-9]*)
 credential = nitrc
 authentication_type = html_form
 html_form_url = https://www.nitrc.org/account/#.php
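To illustrate what the added `url_re` line changes, here is a small check of the old and new patterns against URLs from this thread. The two patterns are copied from the diff; the `matches` helper and the use of `re.fullmatch` are assumptions for illustration, since how the downloader applies `url_re` is not shown here.

```python
import re

# Patterns copied from the url_re lines in the diff above.
OLD_PATTERN = r"https?://www\.nitrc\.org/frs/downloadlink\.php/(7058|3075|3479|9108)"
NEW_PATTERN = r"https?://www\.nitrc\.org/frs/downloadlink\.php/([0-9][0-9]*)"


def matches(pattern: str, url: str) -> bool:
    # fullmatch is an assumption made for this sketch.
    return re.fullmatch(pattern, url) is not None


url = "http://www.nitrc.org/frs/downloadlink.php/1992"
print(matches(OLD_PATTERN, url))  # False: 1992 is not in the explicit ID list
print(matches(NEW_PATTERN, url))  # True: any numeric ID is accepted
```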

@chaselgrove

Yes, it appears that login is now required for that file (and others; compare https://www.nitrc.org/frs/?group_id=296 logged in to logged out).

@yarikoptic
Member

@chaselgrove what about

Would all downloads require login now?
If not all -- is there a list which would tell which ones?

@chaselgrove

If you look at the link I sent and compare it logged in and logged out, you get what can best be described as "a list [of] which ones." :)

Not "all downloads," but perhaps all that you're concerned with. Certainly everything in the fcon_1000 package (what appears to be all the site tarballs).

@yarikoptic
Member

My question was more generic: by now I do not remember what other datasets from NITRC, beyond fcon_1000, we might have in datasets.datalad.org. So I wondered if there is some list of which datasets started to require authentication.

But I guess it could be any project's admin who enables or disables requiring authentication for download, right? I.e. it might not have been you (NITRC) who decided to require it for data that is otherwise distributed under a license allowing redistribution. Am I correct, @chaselgrove?

@chaselgrove

My first response would be to say look at https://www.nitrc.org/ir/, but that doesn't match the fcon_1000 permissions problem we're seeing here. Didn't we set things up to get data from NITRC-IR?

You are correct on the second point. It is in fact never NITRC that makes these decisions for data provided by others.

@fangq

fangq commented Feb 25, 2024

I am also having trouble downloading indi/fcon1000 using datalad, and found this thread.

Here is the new error message. datalad was able to download about 1.8 GB of files, but stalled on one of the submodules; I let it run overnight and it just would not download anything more.

I added --on-failure continue but nothing changed. Is the issue still related to NITRC permissions?

datalad  --on-failure continue install -r -g https://datasets.datalad.org/indi/fcon1000 
[INFO   ] Installing Dataset(neurojson/fcon1000/orig/fcon1000) to get neurojson/fcon1000/orig/fcon1000 recursively 
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/19/cn-e3eca5763f10a6525e7036cf385cd6.xz (file) [not available]                                                                                                                                   
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/19/ds-e3eca5763f10a6525e7036cf385cd6 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/1a/cn-318fd4a160260a41b5094d73bbd2b5.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/1a/ds-318fd4a160260a41b5094d73bbd2b5 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/26/cn-0ad917bee8d05db1dd27d0ad50c1bb.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/26/ds-0ad917bee8d05db1dd27d0ad50c1bb (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/29/cn-29fa0eaba9b0555f900cc7bda87c69.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/29/ds-29fa0eaba9b0555f900cc7bda87c69 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/45/cn-bb76bda106d7aa78527fc618ffeb7b.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/45/ds-bb76bda106d7aa78527fc618ffeb7b (file) [not available]
Total:  42%|██████████████████████████████████████████████████████████████████████████████████████████████▋                                                                                                                                 | 1.41G/3.34G [5:53:28<8:03:07, 66.5k Bytes/s]
ERROR:                                                                                                                                                                                                                                                                                    
Interrupted by user while doing magic: KeyboardInterrupt()
Total:  42%|██████████████████████████████████████████████████████████████████████████████████████████████▋                                                                                                                                 | 1.41G/3.34G [5:54:03

@yarikoptic
Member

I have now pushed those metadata files. Report back if you find some other files not downloadable... But note that in principle you don't need any of those for your analyses of any kind -- those are internal to (now somewhat deprecated) datalad search.

@fangq

fangq commented Feb 27, 2024

@yarikoptic, thanks for the update. The error messages related to the .datalad/metadata folder weren't really my concern (by the way, datalad still complains that these metadata files are missing), because my JSON converter skips the .git/.datalad folders.

The issue is that the download of the main data folder seems to have stalled in the middle. Is there a flag I can turn on to print out the stalled URL?

@yarikoptic
Member

For a file, you can run `git annex whereis` to see where the file is available from, e.g. URLs.

You can run `git annex find --not --in here` to see what is not yet here... actually, you can just run `git annex whereis --not --in here` to see the URLs for files which are not here yet.
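If you want the URLs in a script rather than on screen, one option is to add `--json` to the `whereis` call and parse the result. The sketch below assumes one JSON object per line, each with a `file` field and a `whereis` list whose entries may carry a `urls` list; that layout is my understanding of git-annex's JSON output and should be checked against your version. `urls_by_file` is a hypothetical helper name.

```python
import json


def urls_by_file(json_lines):
    """Map each file to the URLs listed for it in whereis JSON output."""
    result = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        urls = []
        # Assumed layout: one entry per remote, each with an optional "urls" list.
        for remote in record.get("whereis", []):
            urls.extend(remote.get("urls", []))
        result[record.get("file")] = urls
    return result
```

The input could come from something like `subprocess.run(["git", "annex", "whereis", "--json", "--not", "--in", "here"], capture_output=True, text=True).stdout.splitlines()`, run inside the subdataset.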

@fangq

fangq commented Feb 27, 2024

@yarikoptic, I think the issue is to first identify which file(s) is hanging the download.

After adding --log-level 5 and rerunning the install command, I was able to locate the step that caused the stall:

datalad --log-level 5 --on-failure continue install -r -g https://datasets.datalad.org/indi/fcon1000

...
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.'] (cwd=/drives/tu1/users/neurojson/fcon1000/orig/fcon1000/Cleveland CCF) 
[Level 8] Process 1717702 started 
[Level 5] ReaderThread(<_io.FileIO name=5 mode='rb' closefd=True>, <queue.Queue object at 0x7f671d325ed0>, ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.']) started 
[Level 5] ReaderThread(<_io.FileIO name=3 mode='rb' closefd=True>, <queue.Queue object at 0x7f671d325ed0>, ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.']) started 
[Level 5] Read 192 bytes from 1717702[stderr]                                                                                                                                               
[Level 5] Read 67 bytes from 1717702[stderr]                                                                                                                                                

......                                                                                                                                          

[Level 5] Read 150 bytes from 1717702[stderr]                                                                                                                                               
Total:   0%|                                                                                             | 12.5k/3.34G [02:02<9024:08:09, 103 Bytes/s]                                                                                        

After this point, the download simply hangs at 0%. From the DEBUG line immediately above the hang, it seems the folder it tries to download is fcon1000/Cleveland CCF, but if I go to the downloaded Cleveland CCF folder and run git pull, it says "already up to date". So I am not entirely sure whether the debug info above actually pinpoints the submodule that caused the stall.

It is also strange that the progress bar showed that it downloaded a number of submodules ranging between 1 GB and 8 GB before getting to this 3.34 GB repo that caused the hang, but when I checked the downloaded folder size, it had only reached 1.8 GB. I don't know whether the progress bar reported the size correctly.

Anyhow, I was able to download the dataset from https://www.nitrc.org/ir/app/action/ProjectDownloadAction/project/fcon_1000 as a guest, although its folder organization is less BIDS-like.

@yarikoptic
Member

re progress bar stall: I think we are experiencing an issue

since the file(s) are to come from an archive:

❯ git annex whereis phenotypic.csv
whereis phenotypic.csv (2 copies) 
  	978192a9-f540-4f5a-b6c5-ca57c0c9552f -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/Cleveland CCF
   	c402c7ef-34d8-4f1f-a180-a63babc57733 -- [datalad-archives]

  datalad-archives: dl+archive:MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz#path=INDI_Lite_NIFTI/phenotypic.csv&size=489
ok
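For reference, the `dl+archive:` URL in that output can be split into the archive key, the path inside the archive, and the member size. The sketch below only mirrors the shape visible in this thread (`dl+archive:<key>#path=<path>&size=<bytes>`); `parse_archive_url` is a hypothetical helper, not a datalad function.

```python
from urllib.parse import parse_qs


def parse_archive_url(url: str) -> dict:
    """Split a dl+archive URL into archive key, member path and member size."""
    scheme, _, rest = url.partition(":")
    if scheme != "dl+archive":
        raise ValueError(f"not a dl+archive URL: {url!r}")
    # Everything before '#' is the key of the archive; the fragment
    # carries query-style parameters describing the member file.
    key, _, fragment = rest.partition("#")
    params = parse_qs(fragment)
    size = params.get("size", [None])[0]
    return {
        "archive_key": key,
        "path": params.get("path", [None])[0],
        "size": int(size) if size is not None else None,
    }


info = parse_archive_url(
    "dl+archive:MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz"
    "#path=INDI_Lite_NIFTI/phenotypic.csv&size=489"
)
print(info["archive_key"])  # the key to pass to `git annex get --key`
```

So a 489-byte CSV is only obtainable by first fetching the whole archive behind that key.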

For me it doesn't hang if asked for that single file, but it relatively quickly complains multiple times with the same boring message:

❯ datalad get phenotypic.csv
get(error): phenotypic.csv (file) [Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']
Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']
Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']]

But if asked for more files, it indeed just keeps its problems to itself for quite a while, fetching some files once in a while as well.

FWIW, to ease debugging etc., you can just invoke `git annex get` directly to see what is going on... So

❯ git annex get --key MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz
get MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz (from datalad...) 
[INFO] Downloading 'http://www.nitrc.org/frs/downloadlink.php/3479' into '.git/annex/tmp/MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz' 
                                     
  Verification of content failed

  Unable to access these remotes: datalad

  Maybe add some of these git remotes (git remote add ...):
  	978192a9-f540-4f5a-b6c5-ca57c0c9552f -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/Cleveland CCF
failed
get: 1 failed
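The "Verification of content failed" message is consistent with the download not matching the key: an `MD5E-s<size>--<md5>.<ext>` key records the expected size and MD5 digest, and a login page has neither. A rough way to check a downloaded file by hand is sketched below; the regular expression only covers keys shaped like the ones in this thread, and `parse_md5e_key` and `stream_matches_key` are hypothetical helpers, not git-annex or datalad APIs.

```python
import hashlib
import re

# Covers keys shaped like MD5E-s<size>--<32 hex digits>[.ext[.ext]] only.
KEY_RE = re.compile(r"^MD5E-s(?P<size>\d+)--(?P<md5>[0-9a-f]{32})(?:\.\w+)*$")


def parse_md5e_key(key: str):
    """Return (expected_size, expected_md5) encoded in an MD5E key."""
    match = KEY_RE.match(key)
    if match is None:
        raise ValueError(f"unsupported key format: {key!r}")
    return int(match.group("size")), match.group("md5")


def stream_matches_key(stream, key: str, chunk_size: int = 1 << 20) -> bool:
    """Hash a binary stream in chunks and compare size and MD5 with the key."""
    expected_size, expected_md5 = parse_md5e_key(key)
    digest = hashlib.md5()
    seen = 0
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
        seen += len(chunk)
    return seen == expected_size and digest.hexdigest() == expected_md5
```

For example, `with open(path, "rb") as f: stream_matches_key(f, key)` would return `False` for a saved login page, since even the size alone cannot match a multi-gigabyte archive.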

So it uses the datalad special remote to download it, but that one failed for me... Knowing that we also expose that via `datalad download-url`, I do

❯ datalad download-url http://www.nitrc.org/frs/downloadlink.php/3479
[INFO   ] Downloading 'http://www.nitrc.org/frs/downloadlink.php/3479' into '/tmp/' 
download_url(ok): /tmp/#.php (file) 

to see that the damn thing downloads just the login page :-/ Since NITRC doesn't provide a proper interface for clients with corresponding 4xx codes, just a web UI, we try to figure out when it wants a login etc., and I guess that detection now fails. Its configuration is at
https://github.com/datalad/datalad/blob/master/datalad/downloaders/configs/nitrc.cfg#L11 and it even includes this URL in the regex... so it is providing credentials but then gets sent back to that login page. The page now also shows, in red:

Cookies must be enabled past this point.

@chaselgrove could you guide me on how to download from NITRC nowadays in a scripted manner?
