Description
Motivation: urllib3's stream method allows users to request that each chunk be streamed to them as it arrives. This is very difficult to do with h11 as it currently stands: while h11 will try to emit one Data event per chunk, if the buffer contains a partial chunk h11 will emit a Data event for that partial chunk and empty the buffer rather than sit on it.
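To make that behaviour concrete, here is a toy model of an eager chunked-body reader. This is not h11's code: `EagerChunkReader` and the `Data` stand-in are hypothetical names made up for this sketch, and chunk extensions and trailers are ignored.

```python
from dataclasses import dataclass


@dataclass
class Data:
    """Stand-in for an h11-style Data event (not h11's real class)."""
    data: bytes


class EagerChunkReader:
    """Toy chunked-body reader that never holds back payload bytes."""

    def __init__(self):
        self._buf = b""
        self._remaining = 0       # payload bytes still owed for this chunk
        self._need_crlf = False   # True once a chunk's payload is complete
        self._done = False        # True after the zero-size final chunk

    def feed(self, data: bytes) -> list:
        if self._done:
            return []
        self._buf += data
        events = []
        while self._buf:
            if self._need_crlf:
                # Consume the CRLF that terminates each chunk's payload.
                if len(self._buf) < 2:
                    break
                self._buf = self._buf[2:]
                self._need_crlf = False
            elif self._remaining == 0:
                # Parse the chunk-size line: hex digits followed by CRLF.
                line, sep, rest = self._buf.partition(b"\r\n")
                if not sep:
                    break
                self._buf = rest
                self._remaining = int(line, 16)
                if self._remaining == 0:
                    self._done = True  # final chunk; trailers ignored here
                    break
            else:
                # Emit whatever has arrived, even if the chunk is partial.
                piece = self._buf[: self._remaining]
                self._buf = self._buf[len(piece):]
                self._remaining -= len(piece)
                self._need_crlf = self._remaining == 0
                events.append(Data(piece))
        return events


reader = EagerChunkReader()
first = reader.feed(b"5\r\nhel")                       # partial 5-byte chunk
second = reader.feed(b"lo\r\n3\r\nabc\r\n0\r\n\r\n")   # rest, plus a new chunk
```

In this sketch the 5-byte chunk comes out as two events (`hel`, then `lo`), and nothing in the event stream tells the consumer that `lo` ends one chunk while `abc` is a whole chunk of its own.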
This is an entirely defensible design decision: while urllib3's users seem to want to be able to receive the chunks as they come in, chunk delimiters are not supposed to be semantic. However, for better or worse there are some use-cases where it is very helpful to know where chunk delimiters are.
There are three ways I can see of doing this:
1. Change h11's behaviour to emit NEED_DATA when a partial chunk is in the buffer, rather than a Data event for that partial chunk. This is probably inefficient in the case where people don't care about the chunk sizes, and also allows for pernicious behaviour where the user just keeps shoving data into h11's buffer without h11 ever being able to emit it.
   (I should note that this is basically what h2 does with DATA frames: it emits one DataReceived event per frame. This is less problematic for h2 because of SETTINGS_MAX_FRAME_SIZE, which limits the total memory cost of buffering an entire frame.)
2. Add a flag to swap between the current mode and the mode described in (1), which defaults to the current mode. I think this is a bad idea, but I did want to bring it up for completeness' sake. This has all the downsides of (1) plus an extra bit of interface complexity and testing surface to go with it. Not recommended.
3. Add a flag to Data events that signals whether they mark the completed end of a chunk; otherwise keep the current behaviour the same. This would allow tools like urllib3 that care about where the chunk boundaries are to basically just do a tight loop on recv() until they see a Data event with end_chunk=True. Because of h11's current semantics, any prior Data events that don't have that flag set are part of the same chunk as the one that does, and any subsequent Data events are part of a new chunk.
   This has the advantage of being the smallest logical change, it's likely pretty easy and performant to implement, and it is extremely unobtrusive to users that don't care about this concept. Altogether I think this is the best of the three possibilities in terms of giving tools that care about this (and, to be clear: as much as possible tools should try not to care about this) the ability to get what they need, while keeping that unusual use-case as far away from affecting other users as possible.
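As a rough sketch of what the consumer side of the third option might look like: the `Data` class below is a stand-in dataclass rather than h11's actual event class, `end_chunk` is the flag proposed here (so it is an assumption, not an existing attribute), and `iter_chunks` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Data:
    """Stand-in for a Data event carrying the proposed flag."""
    data: bytes
    end_chunk: bool = False  # hypothetical: True on the last piece of a chunk


def iter_chunks(events):
    """Regroup a stream of Data events into one bytes object per chunk."""
    parts = []
    for event in events:
        parts.append(event.data)
        if event.end_chunk:
            # Everything buffered since the previous flagged event
            # belongs to the chunk that this event completes.
            yield b"".join(parts)
            parts = []


# A 5-byte chunk delivered in two pieces, followed by a whole 3-byte chunk.
events = [
    Data(b"hel"),
    Data(b"lo", end_chunk=True),
    Data(b"abc", end_chunk=True),
]
chunks = list(iter_chunks(events))
```

Users that don't care about chunk boundaries would simply never look at the flag, so their code stays exactly as it is today.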
Thoughts?