Greetings,
This is a feature/enhancement request for zfs-check, and a braindump on possible implementation.
There has been discussion on this topic in the past, on reddit in June 2022. Screenshot attached for quick reference, and here is a link to the thread:
Having recently restored some datasets from backups, I had reason to do some checksums on various parts of the data, and zfs-check came back to my mind as a nice tool for this.
I checked the latest code in the repo and didn't see any support for sparse file handling yet, so I thought I'd make this request and provide some details. If I get the time I will make a PR for it myself, but in the meantime here is the request/info.
To illustrate what I'm referring to, take for example a raw GPT-partitioned disk image with XFS partition(s), stored on a ZFS dataset. Example creation: `qemu-img create -f raw test.raw 4T`, then the usual gdisk commands to create the GUID partition table, and `mkfs.xfs -m crc=1,reflink=1` to create a filesystem on the partition.
This set of commands creates a sparse 4TiB file (in my case stored on a ZFS dataset), creates the partition table, and formats a partition with XFS. Such a raw disk would then be provisioned to a KVM guest, for example one running on Proxmox.
Today, when zfs-check checks such a file it chunks up and iterates over all the bytes in the sparse file. It does not treat the sparse blocks differently from normal blocks, so it actually reads all 4TiB and computes the checksums for every chunk. This was covered in the reddit post from June 2022 (see above).
What is a sparse file? A sparsely allocated file is a file that contains one or more HOLE(s). A HOLE is a sequence of zeros that (normally) has not been allocated in the underlying file storage. Cite: https://linux.die.net/man/2/lseek
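As a quick illustration of the above (the path below is just an example), a few lines of Python can create a file that is one big HOLE and show the difference between its apparent size and the space actually allocated:

```python
import os

path = "/tmp/sparse-demo.raw"  # example path, adjust as needed

# Create a 1 GiB file without writing any data: the whole file is a HOLE.
with open(path, "wb") as f:
    f.truncate(1 << 30)

st = os.stat(path)
print("apparent size:", st.st_size)          # 1073741824 bytes (1 GiB)
print("allocated    :", st.st_blocks * 512)  # st_blocks is in 512-byte units; ~0 for a hole
```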
As a sanity check of the June 2022 post, I measured how long it takes to "optimally read" the sparse store1 4TiB raw disk, which has ~2.6TiB normally allocated and the remainder sparse:
| 4TiB raw disk | read duration |
|---|---|
| Sparse optimised | 5-6 hours |
| zfs-check read | 8-9 hours |
So, by my calculation there is definitely room for optimisation in zfs-check in this regard.
As an example of common utilities handling sparse files efficiently, consider using dd to read a 4TiB sparse file and piping it into xxh64sum to create the checksum:
root@viper:~# dd iflag=direct bs=1M if=/store5b/data/vm/raw/.zfs/snapshot/2023-07-24-blah/images/102/vm-102-disk-0.raw status=progress | xxh64sum
4397679509504 bytes (4.4 TB, 4.0 TiB) copied, 21016 s, 209 MB/s
4194304+0 records in
4194304+0 records out
4398046511104 bytes (4.4 TB, 4.0 TiB) copied, 21016.2 s, 209 MB/s
7e2042fc6c4c43d9 stdin
OK, nothing special at face value, BUT when we study the details during the operation using pv to monitor the relevant file descriptor, the output format is `$(date) %{Elapsed time} %{Current data transfer rate} %{Average data transfer rate} %{Bytes transferred so far}`:
root@viper:~# { while ps -p 3331237 >/dev/null 2>&1; do st=$(date); timeout 2m pv -d 3331237:0 -F "${st} %t %r %a %b"; done }
Wed Oct 11 02:21:27 AM UTC 2023 0:01:59 [ 102MiB/s] [ 100MiB/s] 222GiB
Wed Oct 11 02:23:27 AM UTC 2023 0:01:59 [ 104MiB/s] [ 102MiB/s] 234GiB
Wed Oct 11 02:25:27 AM UTC 2023 0:01:59 [93.6MiB/s] [ 103MiB/s] 246GiB
Wed Oct 11 02:27:27 AM UTC 2023 0:01:59 [ 105MiB/s] [ 100MiB/s] 258GiB
Wed Oct 11 02:29:27 AM UTC 2023 0:01:59 [ 888MiB/s] [ 255MiB/s] 288GiB
Wed Oct 11 02:31:27 AM UTC 2023 0:01:59 [ 144MiB/s] [ 222MiB/s] 314GiB
Wed Oct 11 02:33:27 AM UTC 2023 0:01:59 [ 143MiB/s] [ 128MiB/s] 329GiB
Wed Oct 11 02:35:27 AM UTC 2023 0:01:59 [ 121MiB/s] [ 130MiB/s] 345GiB
Wed Oct 11 02:37:27 AM UTC 2023 0:01:59 [ 128MiB/s] [ 151MiB/s] 362GiB
Wed Oct 11 02:39:27 AM UTC 2023 0:01:59 [1.86GiB/s] [ 932MiB/s] 471GiB
Wed Oct 11 02:41:27 AM UTC 2023 0:01:59 [1.90GiB/s] [1.94GiB/s] 704GiB
Wed Oct 11 02:43:27 AM UTC 2023 0:01:59 [1.93GiB/s] [1.92GiB/s] 934GiB
Wed Oct 11 02:45:27 AM UTC 2023 0:01:59 [ 189MiB/s] [ 867MiB/s] 1.01TiB
Wed Oct 11 02:47:28 AM UTC 2023 0:01:59 [ 112MiB/s] [ 146MiB/s] 1.03TiB
Wed Oct 11 02:49:28 AM UTC 2023 0:01:35 [ 164MiB/s] [ 135MiB/s] 1.04TiB
What you see here is a human-readable log containing 2-minute samples of the progress and data rates for the dd operation.
Note the 3rd column, the %{Average data transfer rate}. You'll notice that sometimes a sample is a few hundred MiB/s, determined by the physical read-speed limits of the underlying vdev(s) when reading normally allocated blocks, and sometimes ~2GiB/s, which is obviously an order of magnitude faster. The high-speed samples show where dd, its libraries, the kernel and OpenZFS have conspired to efficiently process the sparse blocks and sections of the sparse 4TiB raw partition file.
NB: In my example I used xxHash by Yann Collet because, as per its name, it is an "Extremely fast non-cryptographic hash algorithm" (github link); it's modern, maintained, and can provide "near RAM speed" hashing. zfs-check defaults to sha1, which has limited performance and is perhaps not best suited for sparse file handling. Note the hash performance comparison table on the xxHash GitHub README.
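For what it's worth, the third-party Python bindings for xxHash (the `xxhash` package on PyPI) expose an interface similar to hashlib, so swapping the hash in Python code is a small change. A rough illustration, not tied to zfs-check's actual code:

```python
import hashlib

import xxhash  # third-party bindings: pip install xxhash

chunk = b"\x00" * (1024 * 1024)  # 1 MiB stand-in for one chunk of file data

# zfs-check's current default:
print(hashlib.sha1(chunk).hexdigest())

# xxHash via the same update()/hexdigest() style of interface:
print(xxhash.xxh64(chunk).hexdigest())
```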
The sparse-allocation optimisation is also reflected in the 4th column, %{Bytes transferred so far}: in the samples that show reading of sparsely allocated blocks, the amount of logical data read per sample is naturally a lot higher than in the samples for normally allocated blocks.
Outcome: progress in reading through the file is much quicker (than zfs-check) because the sparse allocations/sections are handled efficiently.
A quick note on the system in my examples: it's an entry-level enterprise storage node running Proxmox 7.x, which is Debian bookworm based and runs the Ubuntu 6.2.x kernel, with 12 physical Xeon CPU cores across two sockets and 128 GB RAM.
For my dd example, what I've not yet understood is which code (or combination of code) in the chain is responsible for the sparse-file read optimisation. From glancing at the dd code it doesn't seem that the optimisation happens at that level, and you can see from my dd example that I'm not passing any special arguments to handle sparse files. I think the optimisation is more likely located in the lower-level read() and seek() paths, combined with the filesystem's support for sparse file allocation. OpenZFS definitely supports efficient sparse file reads, as demonstrated in the dd example above.
On Debian, the coreutils package, which contains the dd source, might have some clues: https://sources.debian.org/src/coreutils/9.1-1/. I took a quick look but didn't find a smoking gun; in fact the dd code primarily uses lseek(fd, offset, SEEK_CUR), which means either I missed something or the optimisation actually happens in the kernel/filesystem code (for example fs/read_write.c here) or in OpenZFS code. NB: Proxmox uses the Ubuntu kernel.
For reference, and perhaps part of the solution for zfs-check: for Python there is a useful (but probably not exactly what is needed) answer here: https://stackoverflow.com/a/62312222. This approach seeks through the entire file first to determine the ranges of DATA and HOLE(s).
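As an illustration of that approach, a minimal sketch (assuming a Linux filesystem that supports SEEK_DATA/SEEK_HOLE, Python 3.3+, and a hypothetical file path) that enumerates the DATA and HOLE ranges of a file could look like this:

```python
import os

def data_and_hole_ranges(path):
    """Yield ("DATA"|"HOLE", start, end) ranges by walking the file with
    SEEK_DATA/SEEK_HOLE. Only works where the OS and filesystem support
    hole detection."""
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.lseek(fd, 0, os.SEEK_END)
        pos = 0
        while pos < end:
            try:
                data_start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                # ENXIO: no more data after pos, the rest is one big hole.
                yield ("HOLE", pos, end)
                return
            if data_start > pos:
                yield ("HOLE", pos, data_start)
            hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
            yield ("DATA", data_start, hole_start)
            pos = hole_start
    finally:
        os.close(fd)

# Example usage (hypothetical path):
# for kind, start, stop in data_and_hole_ranges("/store1/test.raw"):
#     print(kind, start, stop)
```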
From what I can tell, zfs_autobackup/BlockHasher.py will need a logic update, something like the following (a rough Python sketch follows this list):
- Ensure logic exists somewhere to skip the "sparse logic" on file descriptors, like pipes, that don't support seeking backwards or hole detection.
- `lseek(fd, offset, SEEK_HOLE)` to determine whether the returned offset is the end of the file:
  - IF offset == EOF, then the file contains no HOLE(s) and is not sparse: rewind `fd`, the existing logic can be used.
  - IF offset != EOF, then the file contains at least one HOLE: use the sparse logic.

SPARSE LOGIC:
- Start with `lseek(fd, offset, SEEK_DATA)` until a HOLE chunk is detected (a chunk that is a sequence of zeros).
- HOLE DETECTED: `lseek(fd, offset, SEEK_HOLE)`, iterating forward n chunks at a time to find the end of the HOLE.
  - If the return == `offset`, we are still in a contiguous hole:
    - keep iterating and seeking forwards, counting `nr_contiguous_hole_chunks`.
  - If the return != `offset` and != EOF, there is more DATA to process:
    - compute the checksum of the HOLE? (`zeros` x `nr_contiguous_hole_chunks` x `chunk_size`)
    - rewind `fd` to resume processing DATA at the end offset of the last detected contiguous HOLE chunk.
  - If the return == EOF (there are no more HOLEs), rewind `fd` to resume processing DATA at the end offset of the last detected contiguous HOLE chunk. The logic then iterates over the remaining DATA.
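A minimal, untested sketch of hole-aware chunk hashing along those lines, using SEEK_DATA to jump over hole chunks rather than the explicit `nr_contiguous_hole_chunks` counting above (the chunk size, function name and generator interface here are hypothetical; the real BlockHasher API may differ):

```python
import hashlib
import os

CHUNK_SIZE = 4096 * 1024  # hypothetical chunk size; zfs-check's actual default may differ


def sparse_aware_hashes(path, chunk_size=CHUNK_SIZE, hash_factory=hashlib.sha1):
    """Yield (chunk_nr, hexdigest) for each chunk of `path`.

    Chunks that fall entirely inside a HOLE are never read from disk; they are
    hashed as all-zero buffers so the results match a plain sequential read.
    """
    zero_hash = hash_factory(b"\x00" * chunk_size).hexdigest()
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        pos = 0
        chunk_nr = 0
        while pos < size:
            try:
                next_data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                next_data = size  # ENXIO: nothing but hole until EOF
            if next_data >= pos + chunk_size:
                # The whole chunk lies inside a hole: emit the precomputed zero hash.
                yield (chunk_nr, zero_hash)
            else:
                # The chunk contains data (or is the short final chunk): read and hash it.
                os.lseek(fd, pos, os.SEEK_SET)
                data = os.read(fd, chunk_size)
                yield (chunk_nr, hash_factory(data).hexdigest())
            pos += chunk_size
            chunk_nr += 1
    finally:
        os.close(fd)
```

The key design point is that a hole chunk hashes to the hash of `chunk_size` zero bytes, which can be precomputed once, so the per-chunk results stay identical to what the existing sequential-read logic would produce.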
It would be interesting to find examples of other code that has implemented similar logic, to sanity check the above and look for any inefficiencies or flaws.
💡 IDEA: Maybe it's enough to switch zfs_autobackup/BlockHasher.py from using fh.seek(pos) to using os.lseek(fd, pos, whence), and to use xxhash rather than the slower sha1, and the optimisation would be performed automatically at a lower level? As per the dd example, which uses lseek(fd, offset, SEEK_CUR) with no mention of SEEK_DATA or SEEK_HOLE. This might be worth a try before inventing new logic per my brain dump herein.
python ref for os.lseek(): https://docs.python.org/3/library/os.html#os.lseek
python ref for fh.seek() (file object methods): https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Thanks for reading, look forward to any comments or critique.
Cheers
Kyle
