-
Thanks --- this is very helpful. On the TensorStore side we'll definitely look into these results to better understand where the bottlenecks are in TensorStore.
-
Some thoughts:
-
Sorry, I just spotted that the "busy_time" for the disk is sometimes insanely low (even though I'm using
-
As an update, I was able to collect a CPU profile for TensorStore with Uncompressed_20000_Chunks as follows:

```bash
sudo apt-get install libgoogle-perftools-dev google-perftools  # on debian
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so CPUPROFILE=/tmp/prof python perfcapture/scripts/cli.py --data-path ~/benchmark_temp/ --recipe-path zarr-benchmark/recipes --selected-workloads TensorStoreLoadEntireArray --selected-datasets Uncompressed_20000_Chunks
google-pprof -web $(which python3) /tmp/prof
```

See the profile result as an svg here (open in a separate window/tab and zoom in to view properly). It turns out that the tensorstore packages on PyPI do have sufficient symbol information for useful profiling.

In this benchmark, the chunk size is 500x100 bytes. 53.8% of the time is spent in CopyReadChunk (which copies the data from tensorstore's chunk cache to the output array), and an additional 5.1% of the time is spent on overhead in managing each chunk (PartitionIndexTransformOverGrid).

The copying operation is non-trivial because the 500x100-byte chunks need to be copied into a larger array, which ultimately involves 500 separate copies of 100 contiguous bytes (via memcpy). There is likely some per-operation overhead in memcpy itself, and tensorstore adds its own overhead in setting up each of these memcpy operations. Additionally, the fact that the inner chunk size of 100 bytes is not a multiple of the 64-byte cache line size probably means additional memory bandwidth is consumed, though that will probably not become significant until the other sources of overhead are eliminated.

Currently TensorStore handles copying like this through an "nd iteration" mechanism that converts an operation on an ndarray into a sequence of 1-d operations on one of 3 types of buffers: contiguous, strided, or indexed by a separate index array. TensorStore dispatches to the appropriate "kernel" for a given buffer type and data type. While there is surely some room to reduce the tensorstore overhead, ultimately, to come close to the memory bandwidth limit, we will need our "memory copying" kernel to operate on much more than 100 bytes at a time. To do that, it will probably be necessary to introduce a fourth buffer type: 2-d, where the inner dimension is contiguous and the outer dimension is strided with a fixed stride. This would allow various forms of loop unrolling.
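To make that copy pattern concrete, here is a rough NumPy sketch of what copying one 500x100-byte chunk into a larger destination array involves; the shapes, dtype, and offsets are hypothetical, and this is not TensorStore's actual code path:

```python
import numpy as np

CHUNK_ROWS, CHUNK_COLS = 500, 100             # one 500 x 100-byte chunk
chunk = np.ones((CHUNK_ROWS, CHUNK_COLS), dtype=np.uint8)
dest = np.empty((500, 2000), dtype=np.uint8)  # hypothetical larger output array
col_offset = 300                              # hypothetical placement of this chunk

# The chunk's rows are contiguous, but they land at strided positions in
# `dest`, so the copy decomposes into CHUNK_ROWS separate copies of
# CHUNK_COLS contiguous bytes each (conceptually one memcpy per row).
for row in range(CHUNK_ROWS):
    dest[row, col_offset:col_offset + CHUNK_COLS] = chunk[row]
```

Presumably, the fourth buffer type described above would let a single kernel invocation cover this whole strided-outer/contiguous-inner loop, rather than dispatching 100 bytes at a time.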
-
Here are some simple benchmark results for TensorStore and Zarr-Python.
In every case, we're loading the entirety of an array that is 1 GB when uncompressed. All the Zarr benchmarks use Zarr version 2. The machine is an AMD EPYC server. The benchmark datasets are stored on a 4 TB Seagate FireCuda 530 Gen4 PCIe SSD, which is nominally capable of 7,250 MB/s sequential reads and 1 million IOPS for 4 kB reads at a queue depth of 32. The Linux operating system is stored on a different SSD.
The code defining the datasets and the workloads is here. The code for creating the plots below is here.
There are 6 different datasets (their definitions are in the code linked above). One of the workloads uses `numpy.load` to read an entire 1 GB `.npy` file into RAM.

Each workload is run three times. The graphs below show the mean across those three runs. There is only a little variance between the runs (the variance is not shown in the plots below).
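For concreteness, here is a minimal sketch of what a "load the entire array" workload amounts to; the paths are placeholders and the exact calls are assumptions, since the real workload and dataset definitions live in the code linked above:

```python
import numpy as np
import tensorstore as ts
import zarr

# TensorStore: open the Zarr store and read the whole (~1 GB) array into RAM.
store = ts.open({
    "driver": "zarr",
    "kvstore": {"driver": "file", "path": "/path/to/dataset.zarr"},  # placeholder path
}).result()
arr_ts = store.read().result()

# Zarr-Python (v2): same idea.
arr_zarr = zarr.open("/path/to/dataset.zarr", mode="r")[:]

# numpy baseline: read a 1 GB .npy file directly.
arr_np = np.load("/path/to/dataset.npy")
```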
The first graph, below, shows the total runtime (lower is better):
The second graph, below, shows the "GB/sec to numpy", which is the size of the uncompressed numpy array (1 GB) divided by the total runtime in seconds (higher is better). Because the uncompressed array size is always 1 GB, this second graph shows exactly the same information as the first graph; it's just that "GB/s" feels like a useful measure:
The red line at the top shows the max measured bandwidth from the SSD. TensorStore is faster than Zarr-Python in every case except the case of loading a single 1 GB Zarr chunk.
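For reference, that metric is just the fixed uncompressed size divided by runtime; a minimal sketch, assuming 1 GB means 10^9 bytes and runtimes are measured in seconds:

```python
UNCOMPRESSED_BYTES = 1_000_000_000  # every dataset decompresses to ~1 GB

def gb_per_sec_to_numpy(total_runtime_seconds: float) -> float:
    """'GB/sec to numpy': uncompressed array size divided by total runtime."""
    return UNCOMPRESSED_BYTES / total_runtime_seconds / 1e9
```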
The third figure, below, shows the IO throughput (calculated as the total number of bytes read from disk, divided by the disk's "busy time", in seconds, during the workload). The red bars show the maximum throughput I observed using `fio` with the same number of files and the same file sizes as the dataset in question (e.g., for the LZ4_20000_Chunks dataset, I asked `fio` to create and read 20,000 files, each 8 kB in size). Impressively, TensorStore achieves the same throughput as `fio` for all the datasets except the single 1 GB chunk (higher is better):

edit: I updated the graph above after removing some NaNs from the dataset! The updated results show TensorStore in an even better light!
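For anyone wanting to reproduce that throughput calculation, one way to get per-disk "busy time" on Linux is psutil's disk counters; the device name and the placeholder run_workload() are assumptions, and the benchmark harness may measure this differently:

```python
import psutil

DISK = "nvme0n1"  # hypothetical device name

before = psutil.disk_io_counters(perdisk=True)[DISK]
run_workload()    # placeholder for running one benchmark workload
after = psutil.disk_io_counters(perdisk=True)[DISK]

bytes_read = after.read_bytes - before.read_bytes
busy_seconds = (after.busy_time - before.busy_time) / 1000.0  # busy_time is in ms (Linux only)
io_throughput_gb_per_s = bytes_read / busy_seconds / 1e9
```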
It's worth noting that the OS & hardware can only read tiny (8 kB) files at about 1 GB/sec (supposedly because of the overhead of opening each file). However, if we instead read 8 kB chunks from a single 1 GB file, then `fio` can hit 2.3 GB/sec (no matter whether those chunks are read sequentially or at random from the 1 GB file). This suggests that Zarr sharding should offer significant speed improvements for small chunk sizes!
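As an illustration of that access pattern (this is not the actual `fio` job, and the path is hypothetical), reading 8 kB blocks at random offsets from a single large file pays the open() cost only once:

```python
import os
import random

BLOCK = 8 * 1024
path = "/path/to/single_1GB_file.bin"  # hypothetical path
size = os.path.getsize(path)

offsets = [i * BLOCK for i in range(size // BLOCK)]
random.shuffle(offsets)  # sequential or random made little difference in the fio test

fd = os.open(path, os.O_RDONLY)
try:
    for offset in offsets:
        data = os.pread(fd, BLOCK, offset)  # one open() in total, many small reads
finally:
    os.close(fd)
```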
Appendix: `fio` commands

If anyone's interested, here are the `fio` commands: