Is there a fast way to access data in DatasetV2? #280
Replies: 2 comments
-
Hey @PatWalters, this is a great question!

If the entire uncompressed dataset fits in memory, you can do `ds.load_to_memory()`.

If it doesn't, you can speed things up by iterating over the data chunk by chunk. Assuming you're interested in the `molecule_smiles` column:

```python
import numpy as np
import polaris as po

ds = po.dataset.DatasetV2.from_json("belka_dir/belka-v1.json")

# Iterate over chunk-aligned slices so each compressed Zarr chunk
# is only decompressed once.
chunk_size = ds.zarr_root["molecule_smiles"].chunks[0]
n_chunks = int(np.ceil(len(ds) / chunk_size))

for i in range(n_chunks):
    istart = i * chunk_size
    iend = istart + chunk_size
    data = ds.zarr_data["molecule_smiles"][istart:iend]
    for smi in data:
        ...
```

You can access multiple chunks in parallel to speed things up further.

For context: Polaris uses something called Zarr. Zarr is a format for chunked, compressed, N-dimensional arrays. It is a very versatile format, but you pay a performance penalty for each chunk access. We have been planning a more performant, out-of-the-box way to iterate through large datasets, but we haven't gotten to it yet.
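As a rough illustration of the parallel-chunk idea, here is a sketch that continues from the snippet above (it reuses `ds`, `chunk_size`, and `n_chunks`; the thread pool and worker count are my own choices, not part of the Polaris API):

```python
from concurrent.futures import ThreadPoolExecutor

def read_chunk(i: int):
    # Slice exactly one chunk so each worker decompresses independent data.
    istart = i * chunk_size
    iend = istart + chunk_size
    return ds.zarr_data["molecule_smiles"][istart:iend]

# Threads help when the compression codec releases the GIL during decompression
# (e.g. Blosc); otherwise a process pool may be worth trying instead.
with ThreadPoolExecutor(max_workers=8) as pool:
    for data in pool.map(read_chunk, range(n_chunks)):
        for smi in data:
            ...
```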
-
Thanks, @cwognum! `load_to_memory` did the trick. It might be useful to add an example; the docs on that function are kind of slim.
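A minimal sketch of that path, assuming the uncompressed dataset fits in RAM (the calls are the ones used elsewhere in this thread; the column name is specific to BELKA):

```python
import polaris as po

# Read the dataset description from disk, then pull all Zarr data into RAM.
ds = po.dataset.DatasetV2.from_json("belka_dir/belka-v1.json")
ds.load_to_memory()

# Subsequent column access no longer pays the per-chunk decompression cost.
for smi in ds.zarr_data["molecule_smiles"][:]:
    ...
```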
-
I've been doing some work with the belka-v1 dataset. I began by downloading the dataset and writing it to disk. I then read the dataset back from disk.
I'd now like to perform some action on the dataset, like calculating fingerprints. However, I don't see a way to access the data quickly, and doing something like the snippet below is painfully slow: it takes about 1.5 min for 1,000 rows.
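The pattern in question is element-wise reads from the Zarr-backed columns, roughly like this (illustrative sketch, not the exact script; the fingerprint step is elided):

```python
import polaris as po

ds = po.dataset.DatasetV2.from_json("belka_dir/belka-v1.json")

# Element-wise access: each read touches a compressed Zarr chunk,
# which is what makes this loop so slow.
for i in range(1000):
    smi = ds.zarr_data["molecule_smiles"][i]
    ...  # e.g. compute a fingerprint from smi
```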
I'm sure I'm missing something. Can someone please point me in the right direction? Thanks!