Add --inplace option for syncing e.g. 15G files with small changes, trading away safety for speed #65

Open
klemens-u opened this issue Feb 14, 2017 · 15 comments
Labels
  • enhancement: issue is a request for a feature, and not a defect
  • feedback: Information has been requested; may be closed in 30 days if not provided.
  • impact-low: low importance
  • wontfix: maintainers choose not to work on this, but PR would still be considered

Comments

@klemens-u

klemens-u commented Feb 14, 2017

Hello, after some investigation I was happy to see that unison uses an intelligent algorithm which transfers only the modified part of a file, e.g. of a 15GB virtual disk image.

But still, the progress is quite slow and generates a lot of I/O.

My test scenario:

  • A 15GB virtual machine disk image on my local machine
  • Already transferred once via unison to a remote machine
  • Local start and shutdown of the virtual machine to provoke changes in the disk image file
  • Running unison again.

This process takes more than 10 minutes although only a few megabytes of data are transferred over the network.
Therefore I monitored I/O on both machines and the network traffic.
Here is what I found:

  • Unison detects changes to the local file -> good.
  • Full read of the 15GB file on the local machine -> ok, surely necessary to detect changes
  • Full read of the 15GB file on the remote machine -> for comparison and changeset calculation?
  • Network transfer of the changed data (just a few MB) and simultaneous creation of a temporary 15GB file on the remote machine. Since only little data is transferred over the network, I assume most of the data is copied from the original remote file. This creates massive I/O on the target machine, since 15GB are read from and written to the same disk... -> why isn't there an rsync-like "inplace" option?
  • Full read of the 15GB file on the local side and full read of the 15GB file on the target side -> why? verify? This is very time and I/O consuming...
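
The full reads listed above follow from how a checksum-based delta works: each side has to read its whole copy to hash it, even when only a few blocks differ. A minimal sketch, assuming fixed-size blocks compared by hash (this is a simplification for illustration, not Unison's actual rsync-like transfer code; `BLOCK_SIZE`, `block_digests` and `changed_blocks` are made-up names):

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 4096  # illustrative block size, not a Unison setting


def block_digests(path, block_size=BLOCK_SIZE):
    """Read the whole file once and return one digest per fixed-size block."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).digest())
    return digests


def changed_blocks(local_path, remote_path, block_size=BLOCK_SIZE):
    """Return the indices of blocks that differ between the two files."""
    local = block_digests(local_path, block_size)
    remote = block_digests(remote_path, block_size)
    longest = max(len(local), len(remote))
    return [
        i for i in range(longest)
        if i >= len(local) or i >= len(remote) or local[i] != remote[i]
    ]


def demo():
    """Build two 8-block files that differ in one byte of block 3."""
    with tempfile.TemporaryDirectory() as d:
        old_path = os.path.join(d, "old.img")
        new_path = os.path.join(d, "new.img")
        data = bytearray(os.urandom(BLOCK_SIZE * 8))
        with open(old_path, "wb") as f:
            f.write(data)
        data[BLOCK_SIZE * 3] ^= 0xFF  # flip one byte inside block 3
        with open(new_path, "wb") as f:
            f.write(data)
        return changed_blocks(new_path, old_path)
```

Both files are read in full, yet only one block index is reported as changed, which matches the pattern of heavy disk I/O with little network traffic.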

Apropos rsync - in reference to "Making Unison Faster on Large Files" (http://www.cis.upenn.edu/~bcpierce/unison/download/releases/stable/unison-manual.html#speeding):

Regarding the "copyprog" option: I couldn't detect any reference to rsync in the debug output (debug=all), or as a separate process. How can I make sure Unison is delegating sync of large files to rsync?

Furthermore, as I understand, rsync is only used for the initial first-time transfer. Is it possible to use rsync also for subsequent syncs?

Thanks!

@klemens-u
Author

For comparison: rsync --inplace takes about half the time (around 5 minutes) for the example above.

@alkuzad

alkuzad commented Feb 14, 2017

@klemens-u Have you checked whether fastcheck=true is enabled? Which OSes do you sync? And if you only need one-way synchronization, rsync can be less problematic. Unison is best used for multiple replicas.

@brabalan
Collaborator

@klemens-u copyprog only works for new files.

@klemens-u
Author

@alkuzad:

  • yes, fastcheck is enabled.
  • I use Ubuntu Linux. Local machine = 14.04, remote machine 16.04, both 64bit.
  • rsync: yes true, but it would be so nice to handle all backup jobs with one tool - unison

@brabalan: yes, I saw that copyprog works only for new files. Do you know why there is no option to use rsync also for subsequent syncs?

@klemens-u
Author

klemens-u commented Feb 15, 2017

I did some further investigating, and there is another big issue: I'm using btrfs and btrbk snapshot backup on the remote machine. Now because there is no "inplace" option, unison creates a separate 15GB file during transfer which finally replaces the original file when finished. This means that for btrfs all blocks have changed! So every btrfs snapshot of a changed virtual machine image will take the whole 15GB space instead of only a few changed MBs....
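
To illustrate the difference described here, the following sketch overwrites only the blocks that differ and counts the bytes written, compared with the total file size that a write-new-file-and-replace update would write. It is a toy model under stated assumptions: `patch_in_place` and `BLOCK_SIZE` are hypothetical names, the new content is held in memory for brevity, and nothing here is btrfs-specific or taken from Unison.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # illustrative block size


def patch_in_place(path, new_data, block_size=BLOCK_SIZE):
    """Overwrite only the blocks whose content differs; return bytes written."""
    written = 0
    with open(path, "r+b") as f:
        for offset in range(0, len(new_data), block_size):
            new_block = new_data[offset:offset + block_size]
            f.seek(offset)
            if f.read(len(new_block)) != new_block:
                f.seek(offset)
                f.write(new_block)
                written += len(new_block)
        f.truncate(len(new_data))  # handle a file that became shorter
    return written


def demo():
    """Return (bytes written in place, bytes a full rewrite would write)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "disk.img")
        old = bytearray(os.urandom(BLOCK_SIZE * 16))
        with open(path, "wb") as f:
            f.write(old)
        new = bytearray(old)
        new[BLOCK_SIZE * 5] ^= 0xFF  # change one byte in block 5
        written = patch_in_place(path, bytes(new))
        with open(path, "rb") as f:
            assert f.read() == bytes(new)  # the file now has the new content
        return written, len(new)
```

In this toy case one block is written instead of sixteen; on a copy-on-write filesystem, the blocks that are never rewritten are the ones a snapshot can keep sharing.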

@brabalan
Collaborator

Ouch, this is painful.

I think the reason we don't use rsync and don't do inplace transfer is to minimize the time where the filesystem is in an inconsistent state. In other words, if you interrupt unison, you either want the old file or the new file, and not something in between.

I guess we could have an option to allow unsafe inplace transfer, but the code to do that needs to be written.
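
The trade-off can be sketched as follows. This is an illustration of the general technique (write to a temporary file, then rename over the target), not Unison's code; `replace_atomically`, `overwrite_in_place` and `interrupted` are hypothetical helpers, and the interruption is simulated with an exception.

```python
import os
import tempfile


def replace_atomically(path, chunks):
    """Write new content to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # same directory, same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # the target switches from old to new in one step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # the old file is still untouched
        raise


def overwrite_in_place(path, chunks):
    """Write new content directly over the existing file."""
    with open(path, "r+b") as f:
        for chunk in chunks:
            f.write(chunk)


def interrupted(chunks, after):
    """Yield `after` chunks, then raise to simulate an interrupted transfer."""
    for i, chunk in enumerate(chunks):
        if i == after:
            raise ConnectionError("simulated interruption")
        yield chunk


def demo():
    """Interrupt both strategies halfway and return the resulting contents."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        strategies = (("atomic", replace_atomically),
                      ("inplace", overwrite_in_place))
        for name, update in strategies:
            path = os.path.join(d, name)
            with open(path, "wb") as f:
                f.write(b"old!" * 4)
            try:
                update(path, interrupted([b"NEW!"] * 4, after=2))
            except ConnectionError:
                pass
            with open(path, "rb") as f:
                results[name] = f.read()
    return results
```

After the simulated interruption, the atomic variant still holds the complete old file, while the in-place variant holds a mix of new and old data, which is the "something in between" state mentioned above.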

@klemens-u
Author

klemens-u commented Feb 15, 2017

@brabalan: you're right, for normal operation unison's attempt to minimize inconsistent states is very good and the way to go.

I don't have any insight into the inner workings of the current "copyprog" option. But wouldn't it be a simple and clean way to extend the "copyprog / copythreshold" options to delegate all syncing of big files to rsync?

This is what a future config file could look like:

# use rsync 
copyprog = rsync --inplace --no-whole-file
# for files bigger than x
copythreshold = 10000
# new option: use rsync for all transfers, not just the initial one
copyprogalways = true

Benefits:

  • Unison's default behaviour remains unchanged
  • More efficient sync of big files:
    • Faster
    • Less I/O
    • No problems with (btrfs) snapshots, only changed blocks are modified in the target filesystem
  • Best of both worlds, proven unison functionality for small files, good performance for big files.
  • But all without the complexity of multiple tools

Your thoughts?
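
The decision the proposed option would change can be sketched like this. Note that `copyprogalways` does not exist in Unison; it is only the option proposed in this comment. The `Prefs` class and `use_copyprog` function are hypothetical, and treating a negative `copythreshold` as "never use copyprog" and measuring sizes in kilobytes are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Prefs:
    copyprog: str = "rsync --inplace --no-whole-file"
    copythreshold: int = -1       # assumed: size in KB; negative disables copyprog
    copyprogalways: bool = False  # hypothetical option proposed in this thread


def use_copyprog(size_kb, is_new_file, prefs):
    """Decide whether a transfer would be delegated to the external program."""
    if prefs.copythreshold < 0 or size_kb < prefs.copythreshold:
        return False  # below the threshold: use the built-in transfer
    # Current behaviour per this thread: only new files are delegated.
    # The proposal would also delegate updates of existing files.
    return is_new_file or prefs.copyprogalways
```

With `copythreshold = 10000`, an update to an existing 15GB image would be delegated only when the hypothetical `copyprogalways` is set; with the option left at its default, behaviour stays as it is today.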

@brabalan
Collaborator

I think it would be great, but I don't know how big of a change this would be. One needs to change this line

&& update = `Copy
(there Copy means the file is new), but I don't know where unison deals with temporary files.

@bcpierce00 : would this be difficult to implement?

@bcpierce00
Owner

bcpierce00 commented Feb 15, 2017 via email

@klemens-u
Author

Thanks @bcpierce00. I'd appreciate it if someone could take a look at it.

@pbillen

pbillen commented Sep 11, 2018

I believe that the introduction of copyprogalways = true will also resolve the following request: #219.

gdt added the impact-low (low importance) label and removed the impact-medium (medium importance) label on Oct 23, 2020
gdt changed the title from "Slow performance with large files e.g. virtual machine disk images" to "Add --inplace option for syncing e.g. 15G files with small changes, trading away safety for speed" on Oct 23, 2020
@gdt
Collaborator

gdt commented Mar 19, 2023

This issue is old and there have been many improvements over the years. Please retest with 2.53.1 to see if the builtin sync is slower than using rsync, and post results and a repro recipe, preferably a script. Or, really, the question is to articulate the difference between how unison behaves and some rsync invocation. Without test results, I'll assume this issue is no longer relevant (standard 30-day feedback timer).

(In addition, I'm not really comfortable with an optimization which can result in bad data.)

gdt added the wontfix (maintainers choose not to work on this, but PR would still be considered) and feedback (Information has been requested; may be closed in 30 days if not provided.) labels on Mar 19, 2023
@tleedjarv
Contributor

There are several issues (or areas of potential improvement) here.

First, the lack of inplace update, which not only causes a lot of extra I/O but also ruins fs block-level snapshots. An inplace update seems like a valuable option for expert users to have. It is still an extremely dangerous option but it could be useful for people who know what they're doing (can correctly re-run a sync, or can restore from a snapshot, for example), and it makes the next point below (verification) even more important. I can see this being implemented directly in Unison or delegated to rsync, either way could work.

Note that, while not exactly inplace updates, some work has already been done to make block-level snapshots easier and reduce I/O loads. See #577. That code is currently working for whole-file copies. At least on some systems, this work could hopefully be adapted easily to simulate inplace updates. (I don't know if any fs actually supports this; if not then a "real" inplace could be implemented.) The good thing about such a simulated inplace update is that it would be safer than a real inplace update because it would never leave the target file in an inconsistent state.

Then, the last bullet from the original report:

  • Full read of the 15GB file on the local side and full read of the 15GB file on the target side -> why? verify? This is very time and I/O consuming...

Yes, this is done to verify that 1) the transfer was correct; and 2) that the source file hasn't changed during the transfer. I can only agree that this is very time and I/O consuming but I can't see turning this off either, even for expert users.
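
The two checks can be sketched as follows, assuming a whole-file digest recorded when the change was detected. This is an illustration, not Unison's implementation; `file_digest` and `verify_transfer` are hypothetical names. It also shows why the step costs a full read on each side.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Hash a file by reading it in full, one chunk at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_transfer(source, target, expected_digest):
    """Return None if the transfer is good, otherwise the reason it is not."""
    if file_digest(source) != expected_digest:  # full read of the source
        return "source changed during transfer"
    if file_digest(target) != expected_digest:  # full read of the target
        return "target differs from source"
    return None


def demo():
    """Run the check on a good copy, a modified source and a bad target."""
    outcomes = []
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.img")
        dst = os.path.join(d, "dst.img")
        with open(src, "wb") as f:
            f.write(b"A" * 10000)
        expected = file_digest(src)  # recorded when the change was detected
        shutil.copyfile(src, dst)
        outcomes.append(verify_transfer(src, dst, expected))
        with open(src, "ab") as f:   # source modified after the scan
            f.write(b"B")
        outcomes.append(verify_transfer(src, dst, expected))
        with open(src, "wb") as f:   # restore the source, corrupt the target
            f.write(b"A" * 10000)
        with open(dst, "wb") as f:
            f.write(b"C" * 10000)
        outcomes.append(verify_transfer(src, dst, expected))
    return outcomes
```

Skipping either read would leave one of the two failure cases undetected, which is the reason given above for not making the check optional.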

@tleedjarv
Contributor

I've opened a PR with a proof-of-concept alternative solution to this request. See #876.

Granted, it only works in some configurations (must have platform and filesystem support) but when it does work then it is completely safe, unlike rsync's --inplace.

It works as you'd expect, not copying unchanged data and not breaking snapshots. Anyone interested in this is welcome to test the PR. Even though it is supposed to be safe, please do initial tests on non-production data.

@gdt
Collaborator

gdt commented Oct 9, 2023

@klemens-u Have you been able to update to a recent version and test the draft PR?
