Is there a fast way to access data in DatasetV2? #280
Replies: 2 comments
-
Hey @PatWalters, this is a great question!

If the entire uncompressed dataset fits in memory, you can do `ds.load_to_memory()`.

If it doesn't, you can speed things up by iterating over the data chunk by chunk. Assuming you're interested in the `molecule_smiles` column:

```python
import numpy as np
import polaris as po

ds = po.dataset.DatasetV2.from_json("belka_dir/belka-v1.json")

# Iterate over chunk-aligned slices so each compressed Zarr chunk
# is only decompressed once.
chunk_size = ds.zarr_root["molecule_smiles"].chunks[0]
n_chunks = int(np.ceil(len(ds) / chunk_size))

for i in range(n_chunks):
    istart = i * chunk_size
    iend = istart + chunk_size
    data = ds.zarr_data["molecule_smiles"][istart:iend]
    for smi in data:
        ...
```

You can access multiple chunks in parallel to speed things up further.

For context: Polaris uses something called Zarr. Zarr is a format for chunked, compressed, N-dimensional arrays. It is a very versatile format, but you pay a performance penalty for each chunk access. We have been planning a more performant, out-of-the-box way to iterate through large datasets, but we haven't gotten to it yet.
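As a rough illustration of the parallel-chunk idea, here is a sketch that continues from the snippet above (it reuses `ds`, `chunk_size`, and `n_chunks`; the thread pool and worker count are my own choices, not part of the Polaris API):

```python
from concurrent.futures import ThreadPoolExecutor

def read_chunk(i: int):
    # Slice exactly one chunk so each worker decompresses independent data.
    istart = i * chunk_size
    iend = istart + chunk_size
    return ds.zarr_data["molecule_smiles"][istart:iend]

# Threads help when the compression codec releases the GIL during decompression
# (e.g. Blosc); otherwise a process pool may be worth trying instead.
with ThreadPoolExecutor(max_workers=8) as pool:
    for data in pool.map(read_chunk, range(n_chunks)):
        for smi in data:
            ...
```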
-
Thanks, @cwognum! `load_to_memory` did the trick. It might be useful to add an example; the docs on that function are kind of slim.
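A minimal sketch of that path, assuming the uncompressed dataset fits in RAM (the calls are the ones used elsewhere in this thread; the column name is specific to BELKA):

```python
import polaris as po

# Read the dataset description from disk, then pull all Zarr data into RAM.
ds = po.dataset.DatasetV2.from_json("belka_dir/belka-v1.json")
ds.load_to_memory()

# Subsequent column access no longer pays the per-chunk decompression cost.
for smi in ds.zarr_data["molecule_smiles"][:]:
    ...
```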
-
I've been doing some work with the belka-v1 dataset. I began by downloading the dataset and writing it to disk. I then read the dataset back from disk.
I'd now like to perform some action on the dataset, like calculating fingerprints. However, I don't see a way to access the data quickly, and doing something like the snippet below is painfully slow: it takes about 1.5 min for 1,000 rows.
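The pattern in question is element-wise reads from the Zarr-backed columns, roughly like this (illustrative sketch, not the exact script; the fingerprint step is elided):

```python
import polaris as po

ds = po.dataset.DatasetV2.from_json("belka_dir/belka-v1.json")

# Element-wise access: each read touches a compressed Zarr chunk,
# which is what makes this loop so slow.
for i in range(1000):
    smi = ds.zarr_data["molecule_smiles"][i]
    ...  # e.g. compute a fingerprint from smi
```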
I'm sure I'm missing something. Can someone please point me in the right direction? Thanks!