Pack multiple stdin outputs into a single snapshot #2133

Open
dhoffend opened this issue Jan 2, 2019 · 7 comments
@dhoffend
Contributor

dhoffend commented Jan 2, 2019

Output of restic version

restic 0.9.3 compiled with go1.11.1 on linux/amd64

What are you trying to do?

I would like to run several hundred mysqldump commands (backing up separate tables instead of the whole database), and I ran into several issues that make such backups impractical and hard to use:

  1. Every stdin backup becomes its own snapshot, which clutters the snapshot list and makes it hard to read. The only way to group them would be to use tags with dates included, which is not very practical tbh.
  2. The main problem: each restic invocation takes 10-20 s before it actually starts doing its job (due to index loading).
  3. Instead of restoring all mysqldumps at once, you have to restore from every single snapshot ...

What should restic do differently? Which functionality do you think we should add?

I would like to propose an alternative way to backup multiple stdin outputs into a single snapshot

  1. Please provide a --stdin-commands-file <file> option for the backup command.
  2. The --stdin-commands-file would contain a list of backup jobs/commands (one per line), with the resulting filename as the first argument: <filename><whitespace><command whose stdout should be saved><newline> (one filename + command per line). A config file with a different syntax would also be okay.
    Example:
db01.sql mysqldump [...] "db01"
db02.sql mysqldump [...] "db02"
db03.sql mysqldump [...] "db03"
  3. The path name for the whole snapshot could be the basename of the stdin-commands-file or the --stdin-filename parameter.
  4. restic itself would execute every single command and pipe the stdout into the archiver code internally, under the given filename.

This way, a commands file can be prepared prior to running restic, and a single restic instance can save multiple stdin streams into a single snapshot. Not only is it easier to handle mysqldump backup jobs (or anything similar), you also get faster execution (only 1x the index loading instead of multiple hundred) and are done in minutes rather than hours.

Maybe this could help with #1873 as well. In my case I would like to avoid piping mysqldumps to disk first before backing them up as a single snapshot.

Did restic help you or made you happy in any way?

Sure. I'm about to switch my private servers to restic (from rsnapshot), and I'm already using restic in a different environment to back up 100+ servers, but I'm struggling with database dumps and other performance-related things (like index loading and memory usage, though it has become far better in the last year).

@dhoffend
Contributor Author

dhoffend commented Jan 2, 2019

Maybe it makes more sense to call it a --commands-file parameter, to avoid confusion with the --stdin mode. Basically, backup would execute the commands and then back up their stdout. This no longer has anything to do with the original stdin mode.

@dhoffend
Contributor Author

dhoffend commented Jan 2, 2019

After looking at the code for a while, I can see the following possible way to go:

Clone the fs_reader code to provide a directory of fake files with their commands, or enhance the fs_reader object to support multiple fake files, including optional stdout of commands instead of stdin.

The scanner would get a list of all fake files defined in the commands file and hand them over to the Archiver. When the Archiver code calls Open() (or OpenFile()?) and fs.Command is set for that fake file entry, we would execute the given command and pipe cmd.Stdout into the returned reader, so the archiver code can store it until the command execution reaches EOF. The scanner would then continue with the next fake entry and call the next Open().

It sounds possible without too large a change, apart from messing with the fs.Reader for fake stdin files.

Any ideas?

@fd0
Member

fd0 commented Jan 6, 2019

Thanks for taking the time to submit this idea. To be honest, I'm not convinced this is a good thing to add to restic. It makes the (already complex) code for reading something from stdin even more complicated. After all, restic is intended to be a tool to back up files.

Would you mind elaborating why the straightforward way (running mysqldump to create files, then backup those files) does not work for you?

I'm sorry if my comment comes across as negative, it's not meant that way. We are a Free Software project for which most people (at least me) work in their spare time, and our development/maintenance/debugging time is very limited. So we're trying to keep restic's scope as small as possible. This also applies to #1873. :)

@fd0 added the labels "state: need feedback" (waiting for feedback, e.g. from the submitter) and "type: feature suggestion" (suggesting a new feature) on Jan 6, 2019
@dhoffend
Contributor Author

dhoffend commented Jan 8, 2019

Hi fd0, thanks for taking the time to comment on this. I know restic's strength is backing up files. The feature to back up data from stdin also makes it a great tool for non-file-based content.

The main reason why I prefer backups via mysqldump ... | restic backup --stdin ... is I/O usage. Imagine you have one or more databases with a size of multiple gigabytes (10 GB+). Running mysqldump to the filesystem first and then running the backup creates a lot of write I/O while the system is doing lots of reads as well, which reduces overall performance quite a lot. In the past I've countered this by piping the dump through gzip first. A second problem comes in when you run servers that sync their storage over DRBD: creating mysqldump files first would also generate network traffic, with the I/O operations and latency bound to the network link. This is why I would like to avoid write I/O operations where possible.

The reason why I would like some way of backing up multiple stdin streams is restic's memory usage and index loading time. It takes quite some time before restic starts backing things up (~10 s, but it depends on the size of the repo). If you then want to back up several hundred stdin commands (say, mysqldumps) while avoiding lots of write I/O, you end up executing restic several hundred times, and the index-loading wait stacks up.

Sure, a --stdin-commands parameter is not mission critical, and the old-fashioned way (creating files and then backing them up) still works. But the larger the dumps get, the more I would like to save write operations in shared environments or when using DRBD over the network.

Thanks in advance. Maybe I can get my head around it myself, but I've never used Go before.

@fd0
Member

fd0 commented Jan 11, 2019

Okay, thanks for taking the time to describe your use case.

@micw

micw commented Jan 29, 2019

@fd0 We are internally brainstorming our backup strategies, and it turns out that the use case "backup files plus large streamed output of multiple commands" is very common for us. Streaming the command outputs to files and backing them up along with all the regular files is a workaround, but it has massive drawbacks in performance and space usage. E.g. we back up apps containing small local data plus large elasticsearch dumps (which would not even fit on a single disk of the system that runs restic). Having all of this together in one snapshot would be great for restore consistency.

I'd appreciate it if you'd consider supporting this use case.

Best regards,
Michael.

@MichaelEischer
Member

Related to #4804
