
Commit a694c3d

Merge pull request #443 from Blosc/updateDocs
Update docs
2 parents 911c96a + 28aae39 commit a694c3d

14 files changed: +1096 -1895 lines

doc/conf.py

Lines changed: 38 additions & 0 deletions

@@ -14,6 +14,7 @@
     "sphinx.ext.autodoc",
     "sphinx.ext.intersphinx",
     "sphinx.ext.napoleon",
+    "sphinx.ext.linkcode",
     "numpydoc",
     "myst_parser",
     "sphinx_paramlinks",
@@ -71,6 +72,43 @@
 hidden = "_ignore_multiple_size"
 
 
+def linkcode_resolve(domain, info):
+    if domain != "py":
+        return None
+    if not info["module"]:
+        return None
+
+    import importlib
+    import inspect
+    import os
+
+    # Modify this to point to your package
+    module_name = info["module"]
+    full_name = info["fullname"]
+
+    try:
+        module = importlib.import_module(module_name)
+    except ImportError:
+        return None
+
+    obj = module
+    for part in full_name.split("."):
+        obj = getattr(obj, part, None)
+        if obj is None:
+            return None
+
+    try:
+        fn = inspect.getsourcefile(obj)
+        source, lineno = inspect.getsourcelines(obj)
+    except Exception:
+        return None
+
+    # Replace this with your repo info
+    github_base_url = "https://github.com/Blosc/python-blosc2/blob/main/"
+    relpath = os.path.relpath(fn, start=os.path.dirname(module.__file__))
+    return f"{github_base_url}{relpath}#L{lineno}"
+
+
 def process_sig(app, what, name, obj, options, signature, return_annotation):
     if signature and hidden in signature:
         signature = signature.split(hidden)[0] + ")"
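The `linkcode_resolve` function added above follows the `sphinx.ext.linkcode` contract: return a source URL for a documented object, or `None` to skip the link. A self-contained copy of the same logic can be exercised outside Sphinx; here it is driven with the stdlib `json` module purely for illustration, since any importable package resolves the same way:

```python
import importlib
import inspect
import os


def linkcode_resolve(domain, info):
    """Map a documented Python object to a GitHub source URL (or None)."""
    if domain != "py" or not info["module"]:
        return None
    try:
        module = importlib.import_module(info["module"])
    except ImportError:
        return None
    # Walk the attribute path, e.g. "MyClass.method".
    obj = module
    for part in info["fullname"].split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    try:
        fn = inspect.getsourcefile(obj)
        _, lineno = inspect.getsourcelines(obj)
    except Exception:
        return None
    github_base_url = "https://github.com/Blosc/python-blosc2/blob/main/"
    # Path relative to the package directory, as in the conf.py version.
    relpath = os.path.relpath(fn, start=os.path.dirname(module.__file__))
    return f"{github_base_url}{relpath}#L{lineno}"


print(linkcode_resolve("py", {"module": "json", "fullname": "dumps"}))
```

Note that the resolver computes paths relative to the documented package's own directory, so the resulting URL is only meaningful for modules that actually live in the linked repository; the `json` call above merely demonstrates the mechanics.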

doc/getting_started/overview.rst

Lines changed: 95 additions & 71 deletions
@@ -3,41 +3,44 @@
 What is it?
 ===========
 
-Python-Blosc2 is a high-performance compressed ndarray library with a flexible
-compute engine. It uses the C-Blosc2 library as the compression backend.
+Python-Blosc2 is a high-performance compressed ndarray library with a
+flexible compute engine. The compression functionality comes courtesy of the
+C-Blosc2 library.
 `C-Blosc2 <https://github.com/Blosc/c-blosc2>`_ is the next generation of
 Blosc, an `award-winning <https://www.blosc.org/posts/prize-push-Blosc2/>`_
 library that has been around for more than a decade, and that is being used
 by many projects, including `PyTables <https://www.pytables.org/>`_ or
 `Zarr <https://zarr.readthedocs.io/en/stable/>`_.
 
-Python-Blosc2 is a Python wrapper around the C-Blosc2 library, enhanced with
-an integrated compute engine. This allows for complex computations on
+Python-Blosc2's bespoke compute engine allows for complex computations on
 compressed data, whether the operands are in memory, on disk, or
 `accessed over a network <https://github.com/ironArray/Caterva2>`_. This
 capability makes it easier to `work with very large datasets
 <https://ironarray.io/blog/compute-bigger>`_, even in distributed
 environments.
 
-Most importantly, Python-Blosc2 uses the
-`C-Blosc2 simple and open format <https://github.com/Blosc/c-blosc2/blob/main/README_FORMAT.rst>`_
-for storing compressed data. This facilitates seamless integration with other
-systems and tools.
-
 Interacting with the Ecosystem
-==============================
+------------------------------
 
 Python-Blosc2 is designed to integrate seamlessly with existing libraries
-and tools, offering:
+and tools in the Python ecosystem, including:
 
 * Support for NumPy's `universal functions
   mechanism <https://numpy.org/doc/2.1/reference/ufuncs.html>`_, enabling
-  the combination of NumPy and Blosc2 computation engines.
+  the combination of the NumPy and Blosc2 computation engines.
 * Excellent integration with Numba and Cython via
   `User Defined
   Functions <https://www.blosc.org/python-blosc2/getting_started/tutorials/03.lazyarray-udf.html>`_.
-* Lazy expressions that are evaluated only when needed and can be stored
-  for future use.
+* By making use of the simple and open
+  `C-Blosc2 format <https://github.com/Blosc/c-blosc2/blob/main/README_FORMAT.rst>`_
+  for storing compressed data, Python-Blosc2 facilitates seamless integration
+  with many other systems and tools.
+
+Python-Blosc2's compute engine
+==============================
+
+The compute engine is based on lazy expressions that are evaluated only when
+needed and can be stored for future use.
 
 Python-Blosc2 leverages both `NumPy <https://numpy.org>`_ and
 `NumExpr <https://numexpr.readthedocs.io/en/latest/>`_ to achieve high
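The deferred-evaluation model the new "compute engine" text describes can be illustrated with a tiny, self-contained sketch. This is emphatically not Blosc2's LazyArray machinery; the `LazyExpr` class and `lazy` helper below are invented for illustration and only show the general pattern: operators build an expression tree, and no work happens until `.compute()`:

```python
class LazyExpr:
    """Record an expression tree; evaluate it only on .compute()."""

    def __init__(self, func, *operands):
        self.func = func
        self.operands = operands

    def __add__(self, other):
        return LazyExpr(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return LazyExpr(lambda a, b: a * b, self, other)

    def compute(self):
        # Recursively evaluate child expressions, then apply this node's op.
        args = [op.compute() if isinstance(op, LazyExpr) else op
                for op in self.operands]
        return self.func(*args)


def lazy(value):
    # Wrap a plain value as a leaf node of the expression tree.
    return LazyExpr(lambda: value)


expr = lazy(2) * lazy(3) + lazy(4)  # nothing is evaluated yet
result = expr.compute()             # evaluation happens here: 10
```

In the real library the leaves are compressed NDArray operands and evaluation is driven chunk by chunk through NumExpr, but the build-then-compute shape of the API is the same.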
@@ -57,7 +60,16 @@ computing engine and NumPy or NumExpr include:
 Data Containers
 ===============
 
-The main data container objects in Python-Blosc2 are:
+When working with data that is too large to fit in memory, one solution is to
+load the data in chunks, process each chunk, and then write the results back
+to disk. If each chunk is compressed, say by a factor of 10, this approach
+can be especially efficient, since one is essentially able to send the data
+10x faster over the network and store it 10x smaller on disk. Even if the
+data fits in memory, it is often beneficial to use compression and chunking
+to make more effective use of the cache structure of modern CPUs.
+
+The combined chunking-compression approach is the basis of the main data
+container objects in Python-Blosc2:
 
 * ``SChunk``: A 64-bit compressed store suitable for any data type supporting the
   `buffer protocol <https://docs.python.org/3/c-api/buffer.html>`_.
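The chunk-at-a-time pattern motivated in the new paragraph can be sketched with the standard library alone. Here `zlib` stands in for Blosc2's much faster codecs, and the per-chunk byte sum stands in for a real computation; all names are illustrative, not Blosc2 API:

```python
import zlib

# Compressible sample data: 1 MiB of repetitive bytes, split into 64 KiB chunks.
data = b"0123456789abcdef" * 65536
chunk_size = 64 * 1024
chunks = [zlib.compress(data[i:i + chunk_size])
          for i in range(0, len(data), chunk_size)]

# Each chunk is decompressed, processed, and the partial result accumulated,
# so the full uncompressed buffer never has to live in memory at once.
total = 0
for c in chunks:
    buf = zlib.decompress(c)
    total += sum(buf)  # stand-in for a real per-chunk computation

compressed_size = sum(len(c) for c in chunks)
print(f"ratio: {len(data) / compressed_size:.0f}x")
```

The compression ratio here comes from the highly repetitive sample data; real ratios depend on the dataset and codec, as the paragraph's "say by a factor of 10" hedging suggests.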
@@ -70,15 +82,21 @@ SChunk: a 64-bit compressed store
 ---------------------------------
 
 ``SChunk`` is a simple data container that handles setting, expanding and
-getting data and metadata. In contrast to chunks, a super-chunk can update
-and resize the data that it contains, supports user metadata, and has virtually
-unlimited storage capacity (chunks, on the other hand, cannot store more than 2 GB).
-
-Additionally, you can convert a SChunk into a contiguous, serialized buffer
-(aka `cframe
-<https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst>`_) and
-vice-versa; as a bonus, the serialization/deserialization process also works
-with NumPy arrays and PyTorch/TensorFlow tensors at lightning-fast speed:
+getting data and metadata. A super-chunk is a wrapper around a set of
+chunked data; it can update and resize the data that it contains, supports
+user metadata, and has virtually unlimited storage capacity (each constituent
+chunk of the super-chunk cannot store more than 2 GB). The separate chunks
+are in general not stored sequentially, which allows for efficient extension
+of the super-chunk (a new chunk may be inserted anywhere there is space
+available, and the super-chunk can be extended with a reference to the
+location of the new chunk).
+
+However, since it may be advantageous (e.g. for faster file transfer) to
+convert a SChunk into a contiguous, serialized buffer (aka a `cframe
+<https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst>`_),
+such functionality is supported; likewise one may convert a cframe into a
+SChunk. The serialization/deserialization process also works with NumPy
+arrays and PyTorch/TensorFlow tensors at lightning-fast speed:
 
 .. |compress| image:: https://github.com/Blosc/python-blosc2/blob/main/images/linspace-compress.png?raw=true
    :width: 100%
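The SChunk/cframe round trip described above amounts to packing independently compressed chunks into one contiguous buffer. The real layout is specified in README_CFRAME_FORMAT.rst; the toy length-prefixed format below, using `zlib` and `struct` rather than anything Blosc2-specific, is only meant to convey the idea:

```python
import struct
import zlib


def serialize(chunks):
    # Pack compressed chunks contiguously:
    # [chunk count][len0][chunk0][len1][chunk1]...
    out = [struct.pack("<I", len(chunks))]
    for c in chunks:
        out.append(struct.pack("<I", len(c)))
        out.append(c)
    return b"".join(out)


def deserialize(buf):
    (n,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    chunks = []
    for _ in range(n):
        (size,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        chunks.append(buf[offset:offset + size])
        offset += size
    return chunks


chunks = [zlib.compress(bytes([i]) * 1000) for i in range(4)]
frame = serialize(chunks)   # one contiguous, sendable buffer
assert deserialize(frame) == chunks  # round-trips losslessly
```

A contiguous frame trades the in-place resizability of scattered chunks for a buffer that can be written or transmitted in a single pass, which is exactly the trade-off the paragraph describes.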
@@ -99,8 +117,8 @@ while reaching excellent compression ratios:
    :align: center
    :alt: Compression ratio for different codecs
 
-Also, if you are a Mac M1/M2 owner, do yourself a favor and use its native arm64
-arch (yes, we are distributing Mac arm64 wheels too; you're welcome ;-) ):
+Also, if you are an Apple Silicon owner, you may make use of its native arm64
+arch, since we distribute Mac arm64 wheels too:
 
 .. |pack_arm| image:: https://github.com/Blosc/python-blosc2/blob/main/images/M1-i386-vs-arm64-pack.png?raw=true
    :width: 100%
@@ -122,12 +140,12 @@ NDArray: an N-Dimensional store
 
 A recent feature in Python-Blosc2 is the
 `NDArray <https://www.blosc.org/python-blosc2/reference/ndarray_api.html>`_
-object. It builds upon the ``SChunk`` object, offering a NumPy-like API
-for compressed n-dimensional data.
+object. It rests atop the ``SChunk`` object, offering a NumPy-like API
+for compressed n-dimensional data, with the same chunked storage.
 
 It efficiently reads/writes n-dimensional datasets using an n-dimensional
-two-level partitioning scheme, enabling fine-grained slicing of large,
-compressed data:
+two-level partitioning scheme (each chunk is itself divided into blocks),
+enabling fine-grained slicing of large, compressed data:
 
 .. image:: https://github.com/Blosc/python-blosc2/blob/main/images/b2nd-2level-parts.png?raw=true
    :width: 75%
@@ -138,72 +156,79 @@ orthogonal to different axes of a 4-dimensional dataset:
 .. image:: https://github.com/Blosc/python-blosc2/blob/main/images/Read-Partial-Slices-B2ND.png?raw=true
    :width: 75%
 
-More information is available in this blog post:
-https://www.blosc.org/posts/blosc2-ndim-intro
-
-Check this short video explaining `why slicing in a pineapple-style (aka
-double partition) is useful
-<https://www.youtube.com/watch?v=LvP9zxMGBng>`_:
+More information on chunk-block double partitioning is available in this
+`blog post <https://www.blosc.org/posts/blosc2-ndim-intro>`_. Or, if you are
+a visual learner, see this
+`short video <https://www.youtube.com/watch?v=LvP9zxMGBng>`_.
 
 .. image:: https://github.com/Blosc/blogsite/blob/master/files/images/slicing-pineapple-style.png?raw=true
    :width: 50%
    :alt: Slicing a dataset in pineapple-style
    :target: https://www.youtube.com/watch?v=LvP9zxMGBng
 
-Operating with NDArrays
+Computing with NDArrays
 =======================
 
-Python-Blosc2's ``NDArray`` objects are designed for ease of use,
-demonstrated by this example:
+Python-Blosc2's ``NDArray`` objects are designed for ease of use, demonstrated
+by this example, which closely mirrors the familiar NumPy syntax:
 
 .. code-block:: python
 
     import blosc2
 
     N = 20_000
     # N = 70_000  # for large scenario
-    a = blosc2.linspace(0, 1, N * N).reshape(N, N)
-    b = blosc2.linspace(1, 2, N * N).reshape(N, N)
-    c = blosc2.linspace(-10, 10, N * N).reshape(N, N)
+    a = blosc2.linspace(0, 1, N * N, shape=(N, N))
+    b = blosc2.linspace(1, 2, N * N, shape=(N, N))
+    c = blosc2.linspace(-10, 10, N * N, shape=(N, N))
     expr = ((a**3 + blosc2.sin(c * 2)) < b) & (c > 0)
 
     out = expr.compute()
     print(out.info)
 
-``NDArray`` instances resemble NumPy arrays but store compressed data,
-processed efficiently by Python-Blosc2's engine.
+``NDArray`` instances resemble NumPy arrays, since they expose their shape,
+dtype, etc. via attributes (try ``a.shape`` in the example above), but store
+compressed data, processed efficiently by Python-Blosc2's engine. This means
+that you can work with datasets larger than would be feasible with e.g. NumPy.
 
-When operands fit in memory (20,000 x 20,000), performance nears
-top-tier libraries like NumExpr, exceeding NumPy and Numba, with low memory use
-via default compression. As you can see, Blosc2 compression can speed
-computation via fast codecs and filters, plus efficient CPU cache use.
+To see this, we can compare the execution time for the above example (see the
+`benchmark here <https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-dask-small.ipynb>`_)
+when the operands fit in memory uncompressed (20,000 x 20,000). Performance
+for Blosc2 then matches that of top-tier libraries like NumExpr, and exceeds
+that of NumPy and Numba, with low memory use via default compression. Even
+for in-memory computations, then, Blosc2 compression can speed up computation
+via fast codecs and filters, plus efficient CPU cache use.
 
 .. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-dask-small.png?raw=true
    :width: 100%
    :alt: Performance when operands comfortably fit in-memory
 
-For larger datasets exceeding memory, Python-Blosc2 rivals Dask+Zarr in
-performance (70,000 x 70,000).
+When the operands are so large that they exceed memory unless compressed
+(70,000 x 70,000), one can no longer use NumPy or other uncompressed
+libraries such as NumExpr. Python-Blosc2's compression and chunking mean the
+arrays may be stored compressed in memory and then processed chunk by chunk;
+both memory footprint and execution time are greatly reduced compared to
+Dask+Zarr, which also uses compression (see the
+`benchmark here <https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-dask-large.ipynb>`_).
 
 .. image:: https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-dask-large.png?raw=true
    :width: 100%
    :alt: Performance when operands do not fit in memory (uncompressed)
 
-Blosc2 can utilize MKL-enabled Numexpr for optimized transcendental
-functions on Intel compatible CPUs (as used for the above plots).
-
-Benchmark notebooks:
-
-https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-dask-small.ipynb
-
-https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-dask-large.ipynb
+Note: for these plots, we made use of Blosc2's support for MKL-enabled
+NumExpr for optimized transcendental functions on Intel-compatible CPUs.
 
 Reductions and disk-based computations
 --------------------------------------
 
-One key feature of Python-Blosc2's compute engine is its ability to
-perform reductions on compressed data, optionally stored on disk, enabling
-calculations on datasets too large for memory.
+Of course, it may be the case that, even compressed, data is still too large
+to fit in memory. Python-Blosc2's compute engine is perfectly capable of
+working with data stored on disk, loading the chunked data efficiently to
+minimize latency and optimizing calculations on datasets too large for
+memory. Computation results may also be stored on disk if necessary. We can
+see this at work for reductions, which are 1) computationally demanding, and
+2) an important class of operations in data analysis, where we often wish to
+compute a single value from an array, such as the sum or mean.
 
 Example:
 
@@ -226,13 +251,12 @@ Example:
 This example computes the sum of a boolean array resulting from an
 expression, where the operands are on disk, with the result being a
 1D array stored in memory (or optionally on disk via the ``out=``
-parameter in ``compute()`` or ``sum()`` functions).
-
-Check out a blog post about this feature, with performance comparisons, at:
-https://ironarray.io/blog/compute-bigger
-
-Hopefully, this overview has provided a good understanding of
-Python-Blosc2's capabilities. To begin your journey with Python-Blosc2,
-proceed to the `installation instructions <installation>`_.
-Then explore the `tutorials <tutorials>`_ and
-`reference <../reference>`_ sections for further information!
+parameter of the ``compute()`` or ``sum()`` functions). For a more in-depth
+look at this example, with performance comparisons, see this
+`blog post <https://ironarray.io/blog/compute-bigger>`_.
+
+Hopefully, this overview has provided a good understanding of Python-Blosc2's
+capabilities. To begin your journey with Python-Blosc2, proceed to the
+`installation instructions <installation>`_. Then explore the
+`tutorials <tutorials>`_ and `reference <../reference>`_ sections for further
+information.
