
Feature Request: Target number of blocks/chunks for read performance #142

Closed
xcfmc opened this issue May 9, 2023 · 7 comments

Comments

@xcfmc

xcfmc commented May 9, 2023

Can we add a feature that attempts to target a final number of blocks/chunks?

Reason:
I'm using this to compress hard drive images. I mount the DwarFS archive through FUSE, and then mount the drive image inside that. Read performance can get really bad if the final number of blocks is high (especially on a drive image that is fragmented). I've been playing with the -B, -S, -W and -w parameters to find a decent compression/performance balance, but it takes a long time to compress and recompress a 1TB image.

Possible Solutions:

  1. Implement a recompress stage where dwarfs deduplicates with large blocks first, then checks whether the final number of blocks is below the desired target. If so, split those blocks into smaller blocks, continue deduplicating, and only then compress.
  2. Implement a 'scan' phase that attempts to estimate the final number of blocks based on the type of data it sees. This could be a routine that attempts to deduplicate 100MB samples at 1GB distances, and self-adjusts the -B/-S/-W/-w parameters to aim for the desired final block count.
  3. Implement some kind of "simulate" feature that skips compression and writes only the minimum amount of data (hashes only?) needed to determine the block count for a specified block size. This could then be driven from a shell script that simulates several parameter variations until the outcome is close to the desired block count.

One last wishlist feature... Can you add the ability to directly compress a drive (/dev/sda) to a dwarfs image? Maybe it defaults to a single file inside called sda.img?

Thanks again for this awesome program!

@xcfmc
Author

xcfmc commented May 12, 2023

Actually... I have a better idea. Does the dwarfs database record how many times each block/chunk is referenced? It would be awesome if dwarfs had a feature to precache the most frequently referenced blocks, up to a desired cache size. Precaching could be done with IDLE priority, to not interfere with immediate user read requests. What I'm doing now is precaching the whole thing after mounting it, but that burns time on blocks that don't need caching.
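Roughly what I have in mind, as a made-up C++ sketch (not based on the actual DwarFS internals; the types and names here are purely illustrative): count how often each block is referenced by a chunk, then fill the cache budget with the most referenced blocks first.

// Illustrative sketch only -- not DwarFS code. Assume we can enumerate every
// chunk's block id and know each block's uncompressed size.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Chunk {
  uint32_t block_id;  // the block this chunk lives in
};

// Pick the most frequently referenced blocks that fit into cache_budget bytes,
// e.g. to warm the block cache at IDLE priority right after mounting.
std::vector<uint32_t> pick_blocks_to_precache(
    const std::vector<Chunk>& chunks,
    const std::unordered_map<uint32_t, uint64_t>& block_sizes,
    uint64_t cache_budget) {
  // Count how many chunks reference each block.
  std::unordered_map<uint32_t, uint64_t> ref_count;
  for (const auto& c : chunks) {
    ++ref_count[c.block_id];
  }

  // Rank blocks by reference count, most referenced first.
  std::vector<std::pair<uint32_t, uint64_t>> ranked(ref_count.begin(),
                                                    ref_count.end());
  std::sort(ranked.begin(), ranked.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });

  // Greedily fill the cache budget.
  std::vector<uint32_t> selected;
  uint64_t used = 0;
  for (const auto& entry : ranked) {
    const uint32_t block_id = entry.first;
    const uint64_t size = block_sizes.at(block_id);
    if (used + size > cache_budget) {
      continue;  // doesn't fit; a less referenced but smaller block still might
    }
    used += size;
    selected.push_back(block_id);
  }
  return selected;
}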

On a side note, I've been running gnome-disks to benchmark a loop-mounted image inside dwarfs. For some reason, read performance drops rapidly whenever it is reading blocks that are empty. Below is a performance graph of a 1TB drive with Windows 10 and stock software from an MSI laptop. Only ~120GB at the beginning of the drive is used, and the rest is all zeros. In the graph, dwarfs has great read performance on the data parts, but there are odd dips to ~90 MB/s. The more concerning issue is that it reads the zero parts quite slowly. Shouldn't reading out sparse zeros be fast?

[gnome-disks benchmark graph: B1S26W22w1]

The benchmark was done with 1000 samples and a 10 MB sample size.
File was compressed with these parameters:
-B 1 -S 26 -W 22 -w 1
(similar graphs happen with other parameter variations)
File is mounted with these parameters for max caching performance:
-o auto_unmount -o cache_image -o cache_files -o cachesize=64g -o workers=48 -o mlock=must -o large_read -o splice_read -o decratio=0

The system is a dual Xeon E5-2697 v2 (24 cores/48 with hyper threading) and 256GB of RAM.

@mhx
Owner

mhx commented May 12, 2023

Hi, thanks for your input! I'm currently travelling with only sporadic internet connectivity. I'll take a closer look when I'm back mid June.

@mhx
Owner

mhx commented Jul 4, 2023

Hi @xcfmc, I'm just getting back to this.

Curiously, I made a related observation only last week while working on the next release and running some benchmarks against fuse-archive. One of the benchmarks was compressing a 1 GiB file of only zeroes, and I was surprised that fuse-archive could actually have higher throughput than DwarFS when reading that file.

So I dug into it and was reminded that random accesses into files are O(n) in DwarFS. That's simply because all chunks have to be traversed from the beginning of the file up to the read offset. Unfortunately, that means that even when reading a huge fragmented file sequentially, you get O(n**2) performance, as for each read() request, DwarFS has to traverse the list of chunks again. While the 1 GiB of zeroes compressed down to an 850-byte DwarFS image, reading got progressively slower because the file was composed of almost 100,000 chunks.
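In a heavily simplified sketch (not the real code; names and types are just for illustration), the per-read lookup looks something like this:

// Heavily simplified illustration of the problem -- not the actual DwarFS code.
// An inode's data is a list of chunks; finding the chunk that covers a given
// file offset means walking that list from the start.
#include <cstddef>
#include <cstdint>
#include <vector>

struct chunk {
  uint64_t size;  // uncompressed size of this chunk
  // ... plus block number and offset within that block
};

// O(n) in the number of chunks for a single read() at `offset`. Reading an
// n-chunk file sequentially issues O(n) such lookups, so without any caching
// the total work is O(n^2).
size_t find_chunk_index(const std::vector<chunk>& chunks, uint64_t offset) {
  uint64_t pos = 0;
  for (size_t i = 0; i < chunks.size(); ++i) {
    if (offset < pos + chunks[i].size) {
      return i;  // this chunk covers the requested offset
    }
    pos += chunks[i].size;
  }
  return chunks.size();  // offset is past the end of the file
}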

What I did to improve this was to implement an "offset cache" that would cache both the offset and index of the last chunk read for the most recently accessed inodes. The cache is tiny (64 inodes) and is only used if an inode has more than 64 chunks. However, sequential reads from fragmented files are now no longer quadratic complexity and throughput improved dramatically. To give you an idea, I tried to reproduce your use case and built a DwarFS image from a single 64 GiB EXT4 file system image. The inode of the EXT4 image consists of just over 3 million chunks. Just copying the EXT4 image to /dev/null takes ages using the 0.7.0-RC4 DwarFS release:

$ dd if=mnt/perl-install.ext4 of=/dev/null status=progress bs=16384
4042506240 bytes (4.0 GB, 3.8 GiB) copied, 60 s, 67.4 MB/s^C
247199+0 records in
247198+0 records out
4050092032 bytes (4.1 GB, 3.8 GiB) copied, 60.2922 s, 67.2 MB/s

I interrupted this after a minute because it was getting slower and slower.

Now, with the 0.7.0-RC5 DwarFS release from today, things look very different:

$ dd if=mnt/perl-install.ext4 of=/dev/null status=progress bs=16384
67681239040 bytes (68 GB, 63 GiB) copied, 37 s, 1.8 GB/s
4194304+0 records in
4194304+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 37.3407 s, 1.8 GB/s

Throughput is pretty much constant across the whole image. However, that's obviously not a typical workload for a file system, so let's measure building a tar archive from the mounted EXT4 image. First with the "old" 0.7.0-RC4:

$ time tar cf - perl/ | dd of=/dev/null status=progress bs=16384
230912000 bytes (231 MB, 220 MiB) copied, 100 s, 2.3 MB/s^C
8800+9434 records in
8800+9434 records out
232960000 bytes (233 MB, 222 MiB) copied, 100.318 s, 2.3 MB/s

And now with the "new" 0.7.0-RC5 release including the offset cache:

$ time tar cf - perl/ | dd of=/dev/null status=progress bs=16384
752240640 bytes (752 MB, 717 MiB) copied, 100 s, 7.5 MB/s^C
19615+43281 records in
19615+43281 records out
752394240 bytes (752 MB, 718 MiB) copied, 100.246 s, 7.5 MB/s

That's definitely better, but still a far cry from what I'd want the performance to be.

Now, the good news is that I actually thought about random access into large, fragmented files when implementing the offset cache, and I have a theoretical solution that I dismissed only because of the added complexity and because I didn't see this being a real use case. So watch this space, it shouldn't be too hard to implement this.

With all that being said, I'm a bit puzzled by the fact that you're storing file system images inside DwarFS images, because DwarFS would most certainly perform a lot better when used directly on the data stored in those file systems: compression would very likely be much faster and better, and file access would also be much faster. Is there a reason why you're doing it that way?

@xcfmc
Author

xcfmc commented Jul 4, 2023

Thanks for the update! I agree that storing files instead of images would be infinitely better. Unfortunately I have several use cases where that is not possible. Even though dwarfs isn't made for these cases, it still works absolute magic on them, compared to other solutions, like Opendedup and Borg backup. Your use of efficient multithreading blows them away on throughput.

Here's the list:

  1. Forensic disk image archiving - In the forensics field, deleted blocks and obscure filesystem metadata are often more important than the files themselves. These bytes tell a story that can make or break a case, and a court will only treat it as fact if the data can be proven to be a byte-for-byte copy of the original source material.
  2. Retro computer disk imaging - These drive images are in formats that haven't been used in decades. The goal is to preserve that history in its original bootable/executable form.
  3. Virtual machine images - Some of the programs I use are only available on Windows, so I have no choice but to run them in virtual machines. For performance and security reasons, each of these programs is installed on its own fresh image of Windows. When there is a program that I don't need for a while, I archive the image with Dwarfs and store it on a spinning HD to free up NVMe space. Dwarfs does a great job at deduplicating common OS data when these are stored together.
  4. Android flash images - These have to be stored as partition images in order to be restored quickly and booted again.

There's no pressure to work on these use cases. As I said, Dwarfs is already performing magic on them, compared to other solutions. They may be helpful as stress tests / benchmarks though, as the code evolves.

@mhx
Owner

mhx commented Jul 5, 2023

That's a super interesting set of use cases, thanks for the clarification!

I've built and tested a more elaborate version of the offset cache today and I'm pretty happy with the results. Performing random reads with 32 concurrent processes from the mounted EXT4 image inside the DwarFS image, I'm getting an average read speed of around 250 MB/s:

$ find perl -type f -print0 | xargs -0 -P32 -n32 cat | dd of=/dev/null status=progress
50936064512 bytes (51 GB, 47 GiB) copied, 209 s, 244 MB/s
99921108+9076 records in
99925689+1 records out
51161953159 bytes (51 GB, 48 GiB) copied, 209.65 s, 244 MB/s

That's more than 100 times faster than the version you've been using. And the best thing is that this is going to work with all the DwarFS images you already have. :)

Sequential reads of the EXT4 image itself are still at around 1.8 GB/s.

@xcfmc
Author

xcfmc commented Jul 6, 2023

I just tested it, and the performance is awesome! 100x performance increase for my test case as well. Thank you!

@mhx
Owner

mhx commented Jul 6, 2023

I just tested it, and the performance is awesome! 100x performance increase for my test case as well. Thank you!

Oh, you can only have tested the "old" cache from 0.7.0-RC5 :)

The new cache has only been pushed a few minutes ago, but it's very likely not going to matter much in your case.

I'm judging from the screenshot that you built with -W 22, which means the smallest chunk size is 4 MiB. That chunk size will give you pretty good throughput even if it takes some time to find a chunk. In the test case I've used, I built the image with -W 12, i.e. a smallest chunk size of just 4 KiB. In the absolute worst case, that means your 1 TB disk image could end up in an inode of 250,000 chunks. I doubt the first 120 GB will be very fragmented, as overlaps of 4 MiB should be quite rare. The 880 GB of zeroes will be maximally fragmented (into around 220,000 chunks), and this is where reads have slowed down in the past. I'm assuming the benchmark tool just did a sequential read of those zeros, so that read speed should be much improved even by the old cache from RC5.
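For reference, the worst-case numbers above are just this arithmetic (plain C++, nothing DwarFS-specific; I'm using GiB-based sizes as a rough stand-in for your 1 TB / 880 GB figures):

// Back-of-the-envelope calculation, not DwarFS code.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t GiB = 1ULL << 30;
  const uint64_t smallest_chunk = 1ULL << 22;  // -W 22 -> 4 MiB minimum chunk

  const uint64_t whole_image = 1000 * GiB;  // roughly a 1 TB disk image
  const uint64_t zero_tail = 880 * GiB;     // the all-zero remainder

  // Worst case: the data is maximally fragmented into minimum-size chunks.
  std::printf("worst-case chunks, whole image: %llu\n",
              static_cast<unsigned long long>(whole_image / smallest_chunk));
  // prints 256000, i.e. the ~250,000 worst case above
  std::printf("worst-case chunks, zero tail:   %llu\n",
              static_cast<unsigned long long>(zero_tail / smallest_chunk));
  // prints 225280, i.e. the ~220,000 chunks for the zeroes
  return 0;
}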

If you're interested, you can check the number of chunks in your image with dwarfsck:

$ dwarfsck perl-install-ext4.dwarfs 
DwarFS version 2.4 [2]
created by: libdwarfs v0.7.0-RC5
block size: 64 MiB
block count: 381
inode count: 2
preferred path separator: /
original filesystem size: 64 GiB
compressed block size: 3.2 GiB (13.44%)
uncompressed block size: 23.81 GiB
compressed metadata size: 6.047 MiB (26.74%)
uncompressed metadata size: 22.62 MiB
options: mtime_only
metadata memory usage:
               total metadata............23,714,980 bytes       11857490.0 bytes/inode
     3,110,146 chunks....................23,714,864 bytes 100.0%   7.6 bytes/item
             1 compact_names.....................19 bytes   0.0%  19.0 bytes/item
               |- data                           17 bytes   0.0%  17.0 bytes/item
               '- index                           2 bytes   0.0%   2.0 bytes/item
             2 chunk_table........................6 bytes   0.0%   3.0 bytes/item
             2 modes..............................4 bytes   0.0%   2.0 bytes/item
             1 uids...............................2 bytes   0.0%   2.0 bytes/item
             2 inodes.............................2 bytes   0.0%   1.0 bytes/item
             1 gids...............................1 bytes   0.0%   1.0 bytes/item
             2 dir_entries........................1 bytes   0.0%   0.5 bytes/item
             2 directories........................1 bytes   0.0%   0.5 bytes/item
             0 devices............................0 bytes   0.0%   0.0 bytes/item
             0 shared_files_table.................0 bytes   0.0%   0.0 bytes/item
             0 symlink_table......................0 bytes   0.0%   0.0 bytes/item

So this 64 GB image has an order of magnitude more chunks than your 1 TB image due to the much smaller match window size.

I'm closing this issue, but feel free to let me know if there's anything else.

@mhx mhx closed this as completed Jul 6, 2023