-
Notifications
You must be signed in to change notification settings - Fork 77
Proposal: make it possible to observe chunk delimiters. #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So, my first response is going to be to push back and ask if urllib3 really needs this? ...Are you sure? :-) Looking at the docs, I can't see any mention of this "give me one encoded chunk at a time" behavior, or chunk boundaries at all? Indeed it looks like the Also note that proxies are explicitly allowed to mangle chunk boundaries ("The chunked encoding is specific to each connection and is likely to be removed or recoded by each recipient (including intermediaries)" -- RFC 7230), so even if the API is there then this can hardly be reliable in the general internet... That said...
Right, h11 guarantees that it never uses unbounded memory, so option (1) is definitely out. And I don't like (2) for the same reason -- having an bounded-memory guarantee isn't very useful if the first thing popular wrappers like urllib3 do is disable it. (Presumably urllib3 will do whatever it's going to do even for callers who don't care about chunk boundaries, i.e. all of them.) Also, I don't see how this would allow
If we're going to do this, then probably the thing to do would be to bite the bullet and start supporting chunk extensions at the same time. These are an extremely obscure feature, where each chunk can have some arbitrary key/value metadata attached to it. Right now we just throw away this metadata (which is a totally compliant option -- even proxies are allowed to do this!). Since this metadata appears at the beginning of chunks, probably the thing to do would be to instead of flagging the end of chunks, add a field to But again, I'd rather not do it at all :-) |
You're misreading that doc. The relevant work-trail is kennethreitz/requests#2449 and urllib3/urllib3#550, and the documentation for the feature while not present in urllib3 is present in the Requests documentation. So while I note that the feature cannot be reliable (indeed, I basically said as much above), that doesn't seem to me to be a compelling reason not to surface the meta-data and allow implementations to make the decision about whether to care. In this case, requests and urllib3 received feature requests to allow this feature which were implemented, and that means that we cannot remove the feature short of a full deprecation cycle. That turns out to be a pain in the neck. As to your suggestion about supporting chunk extensions: I suspect that you really don't need to do that. Certainly, if it's on your list of features you'd like to support then by all means while we're messing about in the code it's worth adding. But you already have the data I'm asking for: you're just choosing to throw it away, where I'd like it surfaced to the user. Getting chunk extensions requires additional parsing code, which rather changes the workload of the feature. ;) |
While I'm here, let me ask a follow-on. If the position of h11 were that chunk boundaries are non-semantic and cannot be relied up (a reasonable and intellectually consistent position, by the way), it seems odd that the |
Reading that history, it's not clear to me that the user even cared about chunk boundaries really -- it sorta looks like the (It might make sense in any case to update the requests docs to emphasize that one really should set
Well, parsing them isn't terrible hard -- just a regexp match, and we're already isolating them from the stream to discard them. But yeah, it's not crucial. My point was more that, given there is this other feature that potentially might make sense to support, and that other feature prefers treating chunk starts specially instead of chunk ends, then we should probably follow that here as well. (AFAICT if the goal is just to detect chunk boundaries then either approach is equally good, right?)
Because the "read a (partial) chunk" state machine was much simpler than the "read one or more complete chunks + a trailing partial". The current parser structure is stupid-simple and I don't like writing code when I can skip it :-). |
This collection of sentences describes like 90% of my opinions about software.
Yep, I certainly have no objection to reversing the logic. It requires a bit more out-of-h11 buffering but urllib3 is going to have to do loads of out-of-h11 buffering anyway so we can probably tolerate it.
That's interesting, I'd have expected that the second is basically just "keep doing the first until we get a partial match". Anyway, not important. ;) |
Actually it looks like implementing flags for chunk start, or chunk end, or both, is all pretty simple -- see #20 for a non-working proof of concept. (I'll leave you to turn it into a real patch? :-)) |
Fixed by #20. |
Motivation: urllib3's
stream
method allows users to request that they be streamed each chunk as it arrives. This is very difficult to do with h11 as it currently stands, because while h11 will try to emit oneData
event per chunk, if the buffer contains a partial chunk h11 will prefer to emit thatData
event and empty the buffer than to sit on it.This is an entirely defensible design decision: while urllib3's users seem to want to be able to receive the chunks as they come in, chunk delimiters are not supposed to be semantic. However, for better or worse there are some use-cases where it is very helpful to know where chunk delimiters are.
There are three ways I can see of doing this:
Change h11's behaviour to emit
NEED_DATA
when a partial chunk is in the buffer, rather than aData
event for that partial chunk. This is probably inefficient in the case where people don't care about the chunk sizes, and also allows for pernicious behaviour where the user just keeps shoving data into h11's buffer without h11 ever being able to emit it.(I should note that this is basically what h2 does with DATA frames: it emits one DataReceived event per frame. This is less problematic for h2 because of SETTINGS_MAX_FRAME_SIZE, which limits the total memory cost of buffering an entire frame.)
Add a flag to swap between the current mode and the mode described in (1), which defaults to the current mode. I think this is a bad idea, but I did want to bring it up for completeness' sake. This has all the downsides of (1) plus an extra bit of interface complexity and testing surface to go with it. Not recommended.
Add a flag to Data events that signal whether they mark the completed end of a chunk: otherwise keep the current behaviour the same. This would allow tools like urllib3 that want to care about where the chunk boundaries are basically just do a tight loop on
recv()
until they see aData
event withend_chunk=True
. Because of h11's current semantics, any priorData
events that don't have that flag set are part of the same chunk as the one that does, and any subsequentData
events are part of a new chunk.This has the advantage of being the smallest logical change, it's likely pretty easy and preformant to implement, and it is extremely unobtrusive to users that don't care about this concept. Altogether I think this is the best of the three possibilities in terms of giving tools that care about this (and, to be clear: as much as possible tools should try not to care about this) the ability to get what they need, while keeping that unusual use-case as far away from affecting other users as possible.
Thoughts?
The text was updated successfully, but these errors were encountered: