-
Thanks --- this is very helpful. On the TensorStore side we'll definitely look into these results to better understand where the bottlenecks are in TensorStore.
-
Some thoughts:
-
Sorry, I just spotted that the "busy_time" for the disk is sometimes insanely low (even though I'm using
-
As an update, I was able to collect a CPU profile for TensorStore with Uncompressed_20000_Chunks as follows:

```bash
sudo apt-get install libgoogle-perftools-dev google-perftools  # on debian
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so CPUPROFILE=/tmp/prof python perfcapture/scripts/cli.py --data-path ~/benchmark_temp/ --recipe-path zarr-benchmark/recipes --selected-workloads TensorStoreLoadEntireArray --selected-datasets Uncompressed_20000_Chunks
google-pprof -web $(which python3) /tmp/prof
```

See the profile result as an svg here (open in a separate window/tab and zoom in to view properly). It turns out that the tensorstore packages on PyPI do have sufficient symbol information for useful profiling.

In this benchmark, the chunk size is 500x100 bytes. 53.8% of the time is spent in CopyReadChunk (which copies the data from tensorstore's chunk cache to the output array), and an additional 5.1% of the time is spent on overhead in managing each chunk (PartitionIndexTransformOverGrid).

The copying operation is non-trivial because the 500x100-byte chunks need to be copied into a larger array, which ultimately involves 500 separate copies of 100 contiguous bytes (via memcpy). There is likely some per-operation overhead in memcpy itself, and tensorstore adds its own overhead in setting up each of these memcpy operations. Additionally, the fact that the inner chunk size of 100 bytes is not a multiple of the 64-byte cache line size probably means additional memory bandwidth is consumed, though that will probably not become significant until the other sources of overhead are eliminated.

Currently TensorStore handles copying like this through an "nd iteration" mechanism that converts an operation on an ndarray into a sequence of 1-d operations on one of 3 types of buffers: contiguous, strided, or indexed by a separate index array. TensorStore dispatches to the appropriate "kernel" for a given buffer type and data type. While there is surely some room to reduce the tensorstore overhead, ultimately, to come close to the memory bandwidth limit, we will need our "memory copying" kernel to operate on much more than 100 bytes at a time. To do that, it will probably be necessary to introduce a fourth buffer type: 2-d, where the inner dimension is contiguous and the outer dimension is strided with a fixed stride. This would allow various forms of loop unrolling.
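To make that copy pattern concrete, here is a rough NumPy sketch of what copying one 500x100-byte chunk into a larger destination array involves; the shapes, dtype, and offsets are hypothetical, and this is not TensorStore's actual code path:

```python
import numpy as np

CHUNK_ROWS, CHUNK_COLS = 500, 100             # one 500 x 100-byte chunk
chunk = np.ones((CHUNK_ROWS, CHUNK_COLS), dtype=np.uint8)
dest = np.empty((500, 2000), dtype=np.uint8)  # hypothetical larger output array
col_offset = 300                              # hypothetical placement of this chunk

# The chunk's rows are contiguous, but they land at strided positions in
# `dest`, so the copy decomposes into CHUNK_ROWS separate copies of
# CHUNK_COLS contiguous bytes each (conceptually one memcpy per row).
for row in range(CHUNK_ROWS):
    dest[row, col_offset:col_offset + CHUNK_COLS] = chunk[row]
```

Presumably, the fourth buffer type described above would let a single kernel invocation cover this whole strided-outer/contiguous-inner loop, rather than dispatching 100 bytes at a time.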
-
Here are some simple benchmark results for TensorStore and Zarr-Python.
In every case, we're loading the entirety of an array that is 1 GB when uncompressed. All the Zarr benchmarks use Zarr version 2. The machine is an AMD EPYC server. The benchmark datasets are stored on a 4 TB Seagate FireCuda 530 Gen4 PCIe SSD, which is nominally capable of 7,250 MB/s sequential reads and 1 million IOPS for 4 kB reads at a queue depth of 32. The Linux operating system is stored on a different SSD.
The code defining the datasets and the workloads is here. The code for creating the plots below is here.
There are 6 different datasets (their definitions are in the code linked above). One of the workloads uses `numpy.load` to read an entire 1 GB `.npy` file into RAM.

Each workload is run three times. The graphs below show the mean across those three runs. There is only a little variance between the runs (the variance is not shown in the plots below).
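For concreteness, here is a minimal sketch of what a "load the entire array" workload amounts to; the paths are placeholders and the exact calls are assumptions, since the real workload and dataset definitions live in the code linked above:

```python
import numpy as np
import tensorstore as ts
import zarr

# TensorStore: open the Zarr store and read the whole (~1 GB) array into RAM.
store = ts.open({
    "driver": "zarr",
    "kvstore": {"driver": "file", "path": "/path/to/dataset.zarr"},  # placeholder path
}).result()
arr_ts = store.read().result()

# Zarr-Python (v2): same idea.
arr_zarr = zarr.open("/path/to/dataset.zarr", mode="r")[:]

# numpy baseline: read a 1 GB .npy file directly.
arr_np = np.load("/path/to/dataset.npy")
```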
The first graph, below, shows the total runtime (lower is better):
The second graph, below, shows the "GB/sec to numpy", which is the size of the uncompressed numpy array (1 GB) divided by the total runtime in seconds (higher is better). Because the uncompressed array size is always 1 GB, this second graph shows exactly the same information as the first graph; it's just that "GB/s" feels like a useful measure:
The red line at the top shows the max measured bandwidth from the SSD. TensorStore is faster than Zarr-Python in every case except the case of loading a single 1 GB Zarr chunk.
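For reference, that metric is just the fixed uncompressed size divided by runtime; a minimal sketch, assuming 1 GB means 10^9 bytes and runtimes are measured in seconds:

```python
UNCOMPRESSED_BYTES = 1_000_000_000  # every dataset decompresses to ~1 GB

def gb_per_sec_to_numpy(total_runtime_seconds: float) -> float:
    """'GB/sec to numpy': uncompressed array size divided by total runtime."""
    return UNCOMPRESSED_BYTES / total_runtime_seconds / 1e9
```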
The third figure, below, shows the IO throughput (calculated as the total number of bytes read from disk, divided by the disk's "busy time", in seconds, during the workload). The red bars show the maximum throughput I observed using `fio` with the same number of files and the same file sizes as the dataset in question (e.g., for the LZ4_20000_Chunks dataset, I asked `fio` to create and read 20,000 files, each 8 kB in size). Impressively, TensorStore achieves the same throughput as `fio` for all the datasets except the single 1 GB chunk (higher is better):

edit: I updated the graph above after removing some NaNs from the dataset! The updated results show TensorStore in an even better light!
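For anyone wanting to reproduce that throughput calculation, one way to get per-disk "busy time" on Linux is psutil's disk counters; the device name and the placeholder run_workload() are assumptions, and the benchmark harness may measure this differently:

```python
import psutil

DISK = "nvme0n1"  # hypothetical device name

before = psutil.disk_io_counters(perdisk=True)[DISK]
run_workload()    # placeholder for running one benchmark workload
after = psutil.disk_io_counters(perdisk=True)[DISK]

bytes_read = after.read_bytes - before.read_bytes
busy_seconds = (after.busy_time - before.busy_time) / 1000.0  # busy_time is in ms (Linux only)
io_throughput_gb_per_s = bytes_read / busy_seconds / 1e9
```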
It's worth noting that the OS & hardware can only read tiny (8 kB) files at about 1 GB/sec (supposedly because of the overhead of opening each file). However, if we instead read 8 kB chunks from a single 1 GB file, then `fio` can hit 2.3 GB/sec (no matter whether those chunks are read sequentially or at random from the 1 GB file). This suggests that Zarr sharding should offer significant speed improvements for small chunk sizes!
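As an illustration of that access pattern (this is not the actual `fio` job, and the path is hypothetical), reading 8 kB blocks at random offsets from a single large file pays the open() cost only once:

```python
import os
import random

BLOCK = 8 * 1024
path = "/path/to/single_1GB_file.bin"  # hypothetical path
size = os.path.getsize(path)

offsets = [i * BLOCK for i in range(size // BLOCK)]
random.shuffle(offsets)  # sequential or random made little difference in the fio test

fd = os.open(path, os.O_RDONLY)
try:
    for offset in offsets:
        data = os.pread(fd, BLOCK, offset)  # one open() in total, many small reads
finally:
    os.close(fd)
```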
Appendix: `fio` commands

If anyone's interested, here are the `fio` commands: