Greetings,
This is a feature/enhancement request for zfs-check, and a braindump on possible implementation.
There has been discussion on this topic in the past, on reddit in June 2022. Screenshot attached for quick reference, and here is a link to the thread:
Having recently restored some datasets from backups, I had reason to do some checksums on various parts of the data, and zfs-check came back to my mind as a nice tool for this.
I checked the latest code in the repo and didn't see any support for sparse file handling yet, so I thought I'd make this request and provide some details. If I get the time I will make a PR for it myself, but in the meantime here is the request/info.
To illustrate what I'm referring to, take for example a raw GPT-partitioned disk image with XFS partition(s), stored on a ZFS dataset. Example creation: `qemu-img create -f raw test.raw 4T`, then the usual gdisk commands to create the GUID partition table, and `mkfs.xfs -m crc=1,reflink=1` to create a filesystem on the partition.
This set of commands creates a sparse 4TiB file (in my case stored on a ZFS dataset), creates the partition table, and formats a partition with XFS. Such a raw disk would then be provisioned to a KVM guest, for example one running on Proxmox.
Today, when zfs-check checks such a file it chunks up and iterates over all the bytes in the sparse file. It does not treat the sparse blocks differently from normal blocks, so it actually reads all 4TiB and computes the checksums for every chunk. This was covered in the reddit post from June 2022 (see above).
What is a sparse file? A sparsely allocated file is a file that contains one or more HOLE(s). A HOLE is a sequence of zeros that (normally) has not been allocated in the underlying file storage. Cite: https://linux.die.net/man/2/lseek
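As a quick illustration of the above (the path below is just an example), a few lines of Python can create a file that is one big HOLE and show the difference between its apparent size and the space actually allocated:

```python
import os

path = "/tmp/sparse-demo.raw"  # example path, adjust as needed

# Create a 1 GiB file without writing any data: the whole file is a HOLE.
with open(path, "wb") as f:
    f.truncate(1 << 30)

st = os.stat(path)
print("apparent size:", st.st_size)          # 1073741824 bytes (1 GiB)
print("allocated    :", st.st_blocks * 512)  # st_blocks is in 512-byte units; ~0 for a hole
```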
As a sanity check of the June 2022 post, I measured how long it takes to "optimally read" the sparse store1 4TiB raw disk, which has ~2.6TiB normally allocated and the remainder sparse:
| 4TiB raw disk | read duration |
|---|---|
| Sparse optimised | 5-6 hours |
| zfs-check read | 8-9 hours |
So, by my calculation there is definitely room for optimisation in zfs-check in this regard.
As an example of common utilities handling sparse files efficiently, consider using dd to read a 4TiB sparse file and piping it into xxh64sum to create the checksum:
root@viper:~# dd iflag=direct bs=1M if=/store5b/data/vm/raw/.zfs/snapshot/2023-07-24-blah/images/102/vm-102-disk-0.raw status=progress | xxh64sum
4397679509504 bytes (4.4 TB, 4.0 TiB) copied, 21016 s, 209 MB/s
4194304+0 records in
4194304+0 records out
4398046511104 bytes (4.4 TB, 4.0 TiB) copied, 21016.2 s, 209 MB/s
7e2042fc6c4c43d9 stdin
OK, nothing special at face value, BUT when we study the details during the operation using pv to monitor the relevant file descriptor, the output format is `$(date) %{Elapsed time} %{Current data transfer rate} %{Average data transfer rate} %{Bytes transferred so far}`:
root@viper:~# { while ps -p 3331237 >/dev/null 2>&1; do st=$(date); timeout 2m pv -d 3331237:0 -F "${st} %t %r %a %b"; done }
Wed Oct 11 02:21:27 AM UTC 2023 0:01:59 [ 102MiB/s] [ 100MiB/s] 222GiB
Wed Oct 11 02:23:27 AM UTC 2023 0:01:59 [ 104MiB/s] [ 102MiB/s] 234GiB
Wed Oct 11 02:25:27 AM UTC 2023 0:01:59 [93.6MiB/s] [ 103MiB/s] 246GiB
Wed Oct 11 02:27:27 AM UTC 2023 0:01:59 [ 105MiB/s] [ 100MiB/s] 258GiB
Wed Oct 11 02:29:27 AM UTC 2023 0:01:59 [ 888MiB/s] [ 255MiB/s] 288GiB
Wed Oct 11 02:31:27 AM UTC 2023 0:01:59 [ 144MiB/s] [ 222MiB/s] 314GiB
Wed Oct 11 02:33:27 AM UTC 2023 0:01:59 [ 143MiB/s] [ 128MiB/s] 329GiB
Wed Oct 11 02:35:27 AM UTC 2023 0:01:59 [ 121MiB/s] [ 130MiB/s] 345GiB
Wed Oct 11 02:37:27 AM UTC 2023 0:01:59 [ 128MiB/s] [ 151MiB/s] 362GiB
Wed Oct 11 02:39:27 AM UTC 2023 0:01:59 [1.86GiB/s] [ 932MiB/s] 471GiB
Wed Oct 11 02:41:27 AM UTC 2023 0:01:59 [1.90GiB/s] [1.94GiB/s] 704GiB
Wed Oct 11 02:43:27 AM UTC 2023 0:01:59 [1.93GiB/s] [1.92GiB/s] 934GiB
Wed Oct 11 02:45:27 AM UTC 2023 0:01:59 [ 189MiB/s] [ 867MiB/s] 1.01TiB
Wed Oct 11 02:47:28 AM UTC 2023 0:01:59 [ 112MiB/s] [ 146MiB/s] 1.03TiB
Wed Oct 11 02:49:28 AM UTC 2023 0:01:35 [ 164MiB/s] [ 135MiB/s] 1.04TiB
What you see here is a human-readable log containing 2-minute samples of the progress and data rates for the dd operation.
Note the 3rd column, the %{Average data transfer rate}. You'll notice that sometimes a sample is a few hundred MiB/s, determined by the physical read-speed limits of the underlying vdev(s) when reading normally allocated blocks, and sometimes ~2GiB/s, which is obviously an order of magnitude faster. The high-speed samples show where dd, its libraries, the kernel and OpenZFS have conspired to efficiently process the sparse blocks and sections of the sparse 4TiB raw partition file.
NB: In my example I used xxHash by Yann Collet because, as per its name, it is an "Extremely fast non-cryptographic hash algorithm" (github link); it's modern, maintained, and can provide "near RAM speed" hashing. zfs-check defaults to sha1, which has limited performance and is perhaps not best suited for sparse file handling. Note the hash performance comparison table on the xxHash GitHub README.
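For what it's worth, the third-party Python bindings for xxHash (the `xxhash` package on PyPI) expose an interface similar to hashlib, so swapping the hash in Python code is a small change. A rough illustration, not tied to zfs-check's actual code:

```python
import hashlib

import xxhash  # third-party bindings: pip install xxhash

chunk = b"\x00" * (1024 * 1024)  # 1 MiB stand-in for one chunk of file data

# zfs-check's current default:
print(hashlib.sha1(chunk).hexdigest())

# xxHash via the same update()/hexdigest() style of interface:
print(xxhash.xxh64(chunk).hexdigest())
```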
The sparse-allocation optimisation is also reflected in the 4th column, %{Bytes transferred so far}: in the samples that show reading of sparsely allocated blocks, the amount of logical data read per sample is naturally a lot higher than in the samples for normally allocated blocks.
Outcome: progress in reading through the file is much quicker (than zfs-check) because the sparse allocations/sections are handled efficiently.
A quick note on the system in my examples: it's an entry-level enterprise storage node running Proxmox 7.x, which is Debian bookworm based and runs the Ubuntu 6.2.x kernel, with 12 physical Xeon CPU cores across two sockets and 128 GB RAM.
For my dd example, what I've not yet understood is which code (or combination of code) in the chain is responsible for the sparse-file read optimisation. From glancing at the dd code it doesn't seem that the optimisation happens at that level, and you can see from my dd example that I'm not passing any special arguments to handle sparse files. I think the optimisation is more likely located in the lower-level read() and seek() paths, combined with the filesystem's support for sparse file allocation. OpenZFS definitely supports efficient sparse file reads, as demonstrated in the dd example above.
On Debian, the coreutils package, which contains the dd source, might have some clues: https://sources.debian.org/src/coreutils/9.1-1/. I took a quick look but didn't find a smoking gun; in fact the dd code primarily uses lseek(fd, offset, SEEK_CUR), which means either I missed something or the optimisation actually happens in the kernel/filesystem code (for example fs/read_write.c here) or in OpenZFS code. NB: Proxmox uses the Ubuntu kernel.
For reference, and perhaps part of the solution for zfs-check: for Python there is a useful (but probably not exactly what is needed) answer here: https://stackoverflow.com/a/62312222. This approach seeks through the entire file first to determine the ranges of DATA and HOLE(s).
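As an illustration of that approach, a minimal sketch (assuming a Linux filesystem that supports SEEK_DATA/SEEK_HOLE, Python 3.3+, and a hypothetical file path) that enumerates the DATA and HOLE ranges of a file could look like this:

```python
import os

def data_and_hole_ranges(path):
    """Yield ("DATA"|"HOLE", start, end) ranges by walking the file with
    SEEK_DATA/SEEK_HOLE. Only works where the OS and filesystem support
    hole detection."""
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.lseek(fd, 0, os.SEEK_END)
        pos = 0
        while pos < end:
            try:
                data_start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                # ENXIO: no more data after pos, the rest is one big hole.
                yield ("HOLE", pos, end)
                return
            if data_start > pos:
                yield ("HOLE", pos, data_start)
            hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
            yield ("DATA", data_start, hole_start)
            pos = hole_start
    finally:
        os.close(fd)

# Example usage (hypothetical path):
# for kind, start, stop in data_and_hole_ranges("/store1/test.raw"):
#     print(kind, start, stop)
```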
From what I can tell, zfs_autobackup/BlockHasher.py will need a logic update, something like the following (a rough Python sketch follows this list):
- Ensure logic exists somewhere to skip the "sparse logic" on file descriptors, like pipes, that don't support seeking backwards or hole detection.
- `lseek(fd, offset, SEEK_HOLE)` to determine whether the returned offset is the end of the file:
  - IF offset == EOF, then the file contains no HOLE(s) and is not sparse: rewind `fd`, the existing logic can be used.
  - IF offset != EOF, then the file contains at least one HOLE: use the sparse logic.

SPARSE LOGIC:
- Start with `lseek(fd, offset, SEEK_DATA)` until a HOLE chunk is detected (a chunk that is a sequence of zeros).
- HOLE DETECTED: `lseek(fd, offset, SEEK_HOLE)`, iterating forward n chunks at a time to find the end of the HOLE.
  - If the return == `offset`, we are still in a contiguous hole:
    - keep iterating and seeking forwards, counting `nr_contiguous_hole_chunks`.
  - If the return != `offset` and != EOF, there is more DATA to process:
    - compute the checksum of the HOLE? (`zeros` x `nr_contiguous_hole_chunks` x `chunk_size`)
    - rewind `fd` to resume processing DATA at the end offset of the last detected contiguous HOLE chunk.
  - If the return == EOF (there are no more HOLEs), rewind `fd` to resume processing DATA at the end offset of the last detected contiguous HOLE chunk. The logic then iterates over the remaining DATA.
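A minimal, untested sketch of hole-aware chunk hashing along those lines, using SEEK_DATA to jump over hole chunks rather than the explicit `nr_contiguous_hole_chunks` counting above (the chunk size, function name and generator interface here are hypothetical; the real BlockHasher API may differ):

```python
import hashlib
import os

CHUNK_SIZE = 4096 * 1024  # hypothetical chunk size; zfs-check's actual default may differ


def sparse_aware_hashes(path, chunk_size=CHUNK_SIZE, hash_factory=hashlib.sha1):
    """Yield (chunk_nr, hexdigest) for each chunk of `path`.

    Chunks that fall entirely inside a HOLE are never read from disk; they are
    hashed as all-zero buffers so the results match a plain sequential read.
    """
    zero_hash = hash_factory(b"\x00" * chunk_size).hexdigest()
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        pos = 0
        chunk_nr = 0
        while pos < size:
            try:
                next_data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:
                next_data = size  # ENXIO: nothing but hole until EOF
            if next_data >= pos + chunk_size:
                # The whole chunk lies inside a hole: emit the precomputed zero hash.
                yield (chunk_nr, zero_hash)
            else:
                # The chunk contains data (or is the short final chunk): read and hash it.
                os.lseek(fd, pos, os.SEEK_SET)
                data = os.read(fd, chunk_size)
                yield (chunk_nr, hash_factory(data).hexdigest())
            pos += chunk_size
            chunk_nr += 1
    finally:
        os.close(fd)
```

The key design point is that a hole chunk hashes to the hash of `chunk_size` zero bytes, which can be precomputed once, so the per-chunk results stay identical to what the existing sequential-read logic would produce.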
It would be interesting to find examples of other code that has implemented similar logic, to sanity check the above and look for any inefficiencies or flaws.
💡 IDEA: Maybe it's enough to switch zfs_autobackup/BlockHasher.py from using fh.seek(pos) to using os.lseek(fd, pos, whence), and to use xxhash rather than the slower sha1, and the optimisation would be performed automatically at a lower level? As per the dd example, which uses lseek(fd, offset, SEEK_CUR) with no mention of SEEK_DATA or SEEK_HOLE. This might be worth a try before inventing new logic per my brain dump herein.
python ref for os.lseek(): https://docs.python.org/3/library/os.html#os.lseek
python ref for fh.seek() (file object methods): https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Thanks for reading, look forward to any comments or critique.
Cheers
Kyle
