
Support custom serialization #606


Merged: mrocklin merged 27 commits into dask:master on Nov 2, 2016

Conversation

@mrocklin (Member) commented Oct 25, 2016

This changes the protocol to support custom serialization methods on a per-type basis.

Fixes #604, which also includes an explanation of the approach taken here.

As a proof of concept I implemented custom serialization for numpy ndarrays that both uses blosc (if available) and avoids the two memory copies incurred by pickle.dumps.
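For illustration, the per-type hooks described above have roughly the following shape; the registry dicts and the numpy handlers below are a sketch, not the exact code in this PR:

import numpy as np

serializers = {}      # 'module.classname' -> serialize function
deserializers = {}    # 'module.classname' -> deserialize function

def register_serialization(cls, serialize_fn, deserialize_fn):
    # Key the registry by the 'modulename.classname' string so the
    # header can carry the identifier over the wire
    name = cls.__module__ + '.' + cls.__name__
    serializers[name] = serialize_fn
    deserializers[name] = deserialize_fn

def serialize_numpy_ndarray(x):
    header = {'dtype': x.dtype.str, 'shape': x.shape}
    frames = [x.tobytes()]   # the real implementation avoids this copy
                             # and may compress the frame with blosc
    return header, frames

def deserialize_numpy_ndarray(header, frames):
    return np.frombuffer(frames[0], dtype=header['dtype']).reshape(header['shape'])

register_serialization(np.ndarray,
                       serialize_numpy_ndarray,
                       deserialize_numpy_ndarray)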


    header['keys'].append(key)
    out_frames.extend(frames)

out_frames = [bytes(f) for f in out_frames]
Member Author

This possible copy should become unnecessary if we can get something like tornadoweb/tornado#1691 into Tornado.



default = config.get('compression', 'auto')
if default != 'auto':
Member

'auto' doesn't do anything?

Member Author

If config['compression'] == 'auto' then we use the default_compression determined above during the import process.
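In other words, resolution would look roughly like this (get_default_compression and the 'lz4' placeholder are illustrative; config is assumed to be the dict loaded from ~/.dask at import time):

from distributed.config import config   # assumption: the config dict from distributed/config.py

default_compression = 'lz4'   # stand-in for whatever import-time detection chose; may be None

def get_default_compression():
    default = config.get('compression', 'auto')
    if default != 'auto':
        return default            # an explicit user setting always wins
    return default_compression    # 'auto' defers to the import-time default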

    yield write(stream, response)
    yield gen.sleep(interval)
except (OSError, IOError, StreamClosedError):
    with log_errors(pdb=True):
Member

I suppose we shouldn't keep this in production code.

    else:
        return result
else:
    if isinstance(x, pickle_types) or b'__main__' not in result:
Member

I'm unsure what pickle_types is here?

Member Author

Originally these were types for which pickle definitely worked and for which we didn't want to use cloudpickle; this notably included numpy arrays. I must have lost the variable while moving code around. Clearly this code path isn't covered by tests, though, and needs to be fixed.

Member Author

Fixed

    return header, frames


def deserialize_bytes(header, frames):
    return b''.join(frames)  # the frames may be cut up in transit
Member

When you say "in transit", is that in case some level of compression is enabled?

Member Author

There are various reasons to cut up large bytestrings. I think in the current case we do need this for compression, yes.

Arguably, though, we should be able to handle this internally without the custom functions being aware of it.
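For illustration, a large payload might be cut into fixed-size frames before compression and rejoined on the other side; frame_split and the 64 MB block size below are invented for the sketch:

def frame_split(data, blocksize=2**26):
    # Cut a large bytestring into ~64 MB pieces so that each piece can be
    # compressed and shipped independently
    return [data[i:i + blocksize] for i in range(0, len(data), blocksize)]

def deserialize_bytes(header, frames):
    return b''.join(frames)   # undo the splitting on the receiving end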

Member Author

This may be fixed in 05198a8.

    header['keys'].append(key)
    out_frames.extend(frames)

for key, (head, frames) in pre.items():
Member

Why don't pre-serialized items get automatic compression?

Member Author

Ideally the frames should already be compressed

Member Author

This is now tested in 9d68847


def deserialize_numpy_ndarray(header, frames):
    with log_errors():
        assert len(frames) == 1
Member

This doesn't seem to match distributed.protocol.core.decompress(), which returns an arbitrary number of frames.

@pitrou (Member) commented Oct 31, 2016

I'm a bit confused trying to follow the layers of protocol encoding. Whose responsibility is it to call to_serialize()?

@mrocklin (Member Author)

I'm a bit confused trying to follow the layers of protocol encoding. Whose responsibility is it to call to_serialize()?

Application code, like the worker and client functions. For example:

class Worker:
    def get_data(self, keys):
        return {key: to_serialize(self.data[key]) for key in keys}

@pitrou (Member) commented Nov 1, 2016

Application code like the worker and client functions.

What happens if a dask graph produces, e.g., intermediate results as Numpy arrays (think of rechunk())? Do those get the special serialization treatment as well when shuffled around between workers?

@mrocklin (Member Author) commented Nov 1, 2016

Yes, worker-worker communication is actually handled by exactly the function I used above:

class Worker:
    def get_data(self, keys):
        return {key: to_serialize(self.data[key]) for key in keys}

So those messages contain data in custom-serialized form. Any time you don't wrap part of a message in to_serialize, you are implying that it can be serialized with msgpack.
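For example, a typical message mixes plain msgpack-friendly metadata with wrapped payloads; the message contents and the import path below are illustrative:

from distributed.protocol import to_serialize   # assumed import path
import numpy as np

big_numpy_array = np.random.random((1000, 1000))

msg = {'op': 'update-data',                     # small metadata: msgpack encodes this
       'who': 'worker-1',
       'data': to_serialize(big_numpy_array)}   # large payload: gets the custom path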

@mrocklin (Member Author) commented Nov 1, 2016

I've added a test to demonstrate that inter-worker communication works fine.

    return typ.__module__ + '.' + typ.__name__


def serialize(x):
@pitrou (Member) Nov 1, 2016

I suppose the case where x is e.g. a tuple of Numpy arrays (instead of a single array) isn't covered by this... Does such a case never happen?

Member Author

That's a good point. We would fall back to pickle in that case. Perhaps we should dive into standard containers.
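Container handling along these lines is what the suggestion points at; this is a hypothetical sketch, not something in this PR, and the follow-up comment below defers it:

def typename(typ):
    return typ.__module__ + '.' + typ.__name__

def serialize_collection(xs):
    # Hypothetical: give each element its own per-type treatment and record
    # how many frames it contributed, so the pieces can be sliced back apart
    headers, all_frames = [], []
    for x in xs:
        header, frames = serialize(x)       # serialize() is the per-type dispatch above;
        header['num-frames'] = len(frames)  # it falls back to pickle as needed
        headers.append(header)
        all_frames.extend(frames)
    return {'is-collection': True, 'sub-headers': headers}, all_frames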

Member Author

I think I'm going to wait on this until it comes up in practice. There are some tricky things to consider (how to avoid type-checking every entry of a long list of text, for example) and I'd like to ensure that what's in here handles itself well.

@minrk (Contributor) commented Nov 1, 2016

Nice! This is a lot like IPython's custom serialization. How do you propagate the custom deserializer to peers, so they know how to deserialize a custom-serialized message?

@mrocklin (Member Author) commented Nov 1, 2016

Each piece of serialized data comes along with a header that records the kind of serialization used, the compression per frame, the length per frame, etc. Currently we select serialization by type and use the 'modulename.classname' string as an identifier.
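Such a header for a numpy array might look roughly like this (field names and values are illustrative, not the exact wire format):

header = {'type': 'numpy.ndarray',       # the 'modulename.classname' identifier
          'dtype': '<f8',
          'shape': (1000, 1000),
          'compression': ['blosc'],      # compression applied to each frame
          'lengths': [8000000]}          # byte length of each frame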

@minrk (Contributor) commented Nov 1, 2016

Right - I think I meant how does the field in deserializers get populated on the destination? I get how the key gets there, just not the value.

@mrocklin (Member Author) commented Nov 1, 2016

Ah, we currently assume that it is imported on the client.

This works fine for types that we manage, like numpy arrays.

We could probably use something like worker environments (in a stalled PR) to distribute custom serialization functions.

@mrocklin mrocklin closed this Nov 1, 2016
@mrocklin mrocklin reopened this Nov 1, 2016
@mrocklin (Member Author) commented Nov 1, 2016

I would like to merge this soon.

@@ -76,8 +77,6 @@ def gather_from_workers(who_has, deserialize=True, rpc=rpc, close=True,
    bad_addresses |= {v for k, v in rev.items() if k not in response}
    results.update(merge(response))

    if deserialize:
Member

Does this mean the deserialize parameter is now obsolete on this function?

Member Author

Removed

@@ -492,6 +492,10 @@ def ensure_bytes(s):
    """
    if isinstance(s, bytes):
        return s
    if isinstance(s, memoryview):
        return s.tobytes()
    if isinstance(s, buffer):
Member

buffer doesn't exist on Python 3.

Member Author

Guarded. Also added this to the tests.
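A sketch of what such a guard could look like; the _buffer_types tuple is invented here, the actual diff may differ, and the real ensure_bytes has more cases:

import sys

if sys.version_info[0] == 2:
    _buffer_types = (buffer,)   # noqa: F821 -- only evaluated on Python 2
else:
    _buffer_types = ()

def ensure_bytes(s):
    """Coerce a bytes-like object into plain bytes."""
    if isinstance(s, bytes):
        return s
    if isinstance(s, memoryview):
        return s.tobytes()
    if _buffer_types and isinstance(s, _buffer_types):
        return bytes(s)
    return bytes(s)   # fall back for other bytes-like objects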

@mrocklin (Member Author) commented Nov 2, 2016

Merging.

@mrocklin mrocklin merged commit a9bb18a into dask:master Nov 2, 2016
@mrocklin mrocklin deleted the protocol-serialize branch November 2, 2016 12:46

try:
    import blosc
    n = blosc.set_nthreads(2)
Member

Why was this hard-coded to 2?

Member Author

We want some benefit from threads, but don't want to take them all (dask itself does parallelism). Two seemed like a decent default (more decent than "all"). Do you have other thoughts on what this should be?

@jakirkham (Member) Nov 4, 2016

Having a sensible default is fine. Having it hard-coded can be problematic, though.

Is there some way for the user to override things like this for dask already? If so, my suggestion would be to expose this through the same mechanism. If not, maybe we should discuss how this and any other hard-coded values can be tuned through a config file or similar.

Member Author

Yeah, we could place this in the ~/.dask/config file. Users can also call blosc.set_nthreads themselves. It's possible that blosc itself should have some place to make this configurable instead.

Member

Yeah, we could place this in the ~/.dask/config file.

This seems like the most desirable option.

Member Author

Can I interest you in raising an issue or perhaps a short PR? The relevant file is distributed/config.py and an example is here:

distributed/scheduler.py:BANDWIDTH = config.get('bandwidth', 100e6)
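Following that pattern, a config-driven thread count could look roughly like this; the 'compression-threads' key is made up here, and config is assumed to be the dict exposed by distributed/config.py:

from distributed.config import config

try:
    import blosc
    # let users tune the thread count the same way as other dask settings
    n = blosc.set_nthreads(config.get('compression-threads', 2))
except ImportError:
    blosc = None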

Member

Agreed. Done in issue #1054.
