Consider moving our storage from AWS to Digital Alliance #162
I can't remember off the top of my head what Digital Alliance's policy is for this, but I know I'd talked to Nick about it before. My memory is that they're happy to supply the bandwidth, but I could be mistaken. I'll submit a ticket about it now.
It should be possible, but I'll just point out a technical difference:
So, it's not exactly the same interface/setup. But I'm sure there is open-source software that can present the same interface as S3. Or we can look at the other types of special remotes that git-annex supports.
Thank you for the clarification @mguaypaq. Would it make sense to move spine-generic to a NeuroGitea server?
I think GitHub still gives us a lot of value:
Probably we just want to change the storage backend for the large data files; that should be a much smaller change from the point of view of our users. It should be very doable.
In particular, it looks like the Digital Alliance already has a service that's compatible with S3, so maybe this can be easy:
I heard back from Digital Alliance on Arbutus (way quicker than I expected... 30 minute turnaround time!). There's no specific policy about using Arbutus for public datasets, but the bandwidth they can provide is pretty limited, unless it's primarily our own users who are downloading the dataset. They have fast uplinks to Canadian university networks, but relatively slow uplinks to everyone else, from the sounds of things. Do we know how many external people are using the dataset?
I'd say ~10 people/month? But uplinks don't matter too much. What matters is downlinks. And as long as it is not insanely slow (which it is not), we should be fine.
I mean uplink in this case would be their upload speed to external networks, which would affect our users' download speed, but:
I think that fits our needs, very cool! @mguaypaq do you have a vision for how you might switch backends in git-annex? I imagine we can just import the dataset from S3 to Arbutus without much trouble, but I'm not sure how this works with git-annex.
Just like git, git-annex supports having multiple remotes. So, I imagine we would:
So, a nice gradual transition, with plenty of opportunity to roll back if there are problems.
@nullnik-0 is becoming our resident expert on ComputeCanada already! I'd be down to work alongside her since I have admin perms on our CC projects. I'll loop all three of us into a Slack convo and we could talk about permissions.
Preliminary tests seem to work! We should be able to migrate to Arbutus object storage fairly quickly and reduce our Amazon bandwidth costs. Steps:
I'm out of time for this week, but next week I'll try to migrate both
Stumbling on that convo... Also, I don't think you need to set 2 special remotes for read/write respectively; the first one should work for both (write with credentials only).
@bpinsard have you gotten the single read/write remote to work in the past? I remember trying (in the past year) to use a single special remote for both, but couldn't get it to work. Possibly something to do with this interaction between the config settings.
We do not use that setup in production (only authenticated access), but I just tested it with a MinIO S3 server (so not the same as Digital Alliance). I think a good way to avoid that is to set the S3 remote for read/write data management only, not autoenabled, and then add an httpalso sameas remote, crafting the https URL depending on the server, bucket name and requeststyle.
This can save a lot of user-support headaches.
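A minimal sketch of the URL crafting described above, assuming the two common S3 addressing styles: `path` (bucket in the URL path) and `dns` (bucket as a subdomain). The host `objects.example.org`, the bucket `my-bucket`, the remote names, and both helper functions are hypothetical placeholders, not values from this project. The second helper only assembles an argument list in the shape the git-annex documentation shows for `initremote --sameas=... type=httpalso url=...`; it does not run anything, and whether the resulting URL matches a given server's layout would need to be checked by hand.

```python
from typing import List


def public_bucket_url(host: str, bucket: str, requeststyle: str) -> str:
    """Build the public https base URL for a bucket (hypothetical helper).

    requeststyle "path" -> https://<host>/<bucket>/
    requeststyle "dns"  -> https://<bucket>.<host>/
    """
    if requeststyle == "path":
        return f"https://{host}/{bucket}/"
    if requeststyle == "dns":
        return f"https://{bucket}.{host}/"
    raise ValueError(f"unknown requeststyle: {requeststyle!r}")


def httpalso_initremote_args(s3_remote: str, http_remote: str, url: str) -> List[str]:
    """Assemble (but do not run) a git-annex command that adds a read-only
    httpalso remote sharing its content with an existing S3 remote."""
    return [
        "git", "annex", "initremote",
        f"--sameas={s3_remote}",  # reuse the S3 remote's content and UUID
        http_remote,              # name of the new read-only remote
        "type=httpalso",
        f"url={url}",
    ]


if __name__ == "__main__":
    # Placeholder host, bucket, and remote names; substitute real ones.
    url = public_bucket_url("objects.example.org", "my-bucket", "path")
    print(" ".join(httpalso_initremote_args("s3-rw", "s3-public", url)))
```

Keeping the write-capable S3 remote non-autoenabled and giving anonymous users only the httpalso remote means they never touch the credential-related settings, which is the corner case discussed above.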
Oh! I didn't know about the httpalso remote type, that makes a lot of sense. It's still two remotes with a sameas, but probably with fewer corner cases.
We have a backup of what's on gitea (and what was on gitolite); however, I understand those to be git-annex archives rather than the datasets themselves.
Whoops! I misunderstood: we don't have spine-generic backed up on restic. Mathieu helped me remember that that one's on GitHub.
Then we should probably create a backup, no?
Good point. I made a ticket for getting this put in restic (with some questions for Mathieu). Should be pretty straightforward.
The spine-generic dataset is being downloaded more and more, which comes at a cost. For example, the cost for 2023 was $478, which is not negligible (cost per month below):
I'm wondering how feasible/difficult it would be to move the git-annex server to a Digital Alliance cloud?