feat(openai): add token usage stream options to request (#11606)
This PR adds special casing so that, unless explicitly specified otherwise, a user's streamed
OpenAI chat/completion requests will include token usage as part of the
streamed response by default.
### Motivation
OpenAI streamed responses have historically not provided token usage
details as part of the streamed response. However, earlier this year
OpenAI added a `stream_options: {"include_usage": True}` kwarg option to
the chat/completions API that provides token usage details in an
additional stream chunk at the end of the streamed response.
If the user does not specify this kwarg, OpenAI does not provide token
usage, and our current behavior is a best effort to either 1) use the
`tiktoken` library to calculate token counts, or 2) fall back to a very
crude heuristic to estimate token counts. Neither alternative is ideal,
as neither takes function/tool calling into account. **It is simpler and
more accurate to just request the token counts from OpenAI directly.**
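For reference, this is roughly what the option looks like on a raw (untraced) OpenAI client call; the model name and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.usage is not None:
        # The final extra chunk: empty ``choices`` and a populated ``usage``.
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)
    elif chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
```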
### Proposed design
There are two major components to this feature:
1. If a user does not specify `stream_options: {"include_usage": True}`
as a kwarg on the chat/completions call, we need to manually insert it
into the kwargs before the request is made.
2. If a user does not specify `stream_options: {"include_usage": True}`
as a kwarg on the chat/completions call but we add that option on the
integration side, the returned streamed response will include an
additional chunk (with empty content) at the end containing token usage
information. To avoid disrupting user applications with one more chunk
(with different content/fields) than expected, the integration should
automatically extract that last chunk under the hood.
Note: if a user does explicitly specify `stream_options:
{"include_usage": False}`, then we must respect their intent and avoid
adding token usage to the kwargs. We'll note in the release note that
we cannot guarantee 100% accurate token counts in this case. (A sketch
of both components follows.)
### Streamed reading logic change
Additionally, we change the `__iter__`/`__aiter__` methods of our
traced streamed responses. Previously we returned the traced streamed
response itself (and relied on the underlying `__next__`/`__anext__`
methods), but to ensure spans are finished even if the streamed response
is not fully consumed, we change the `__iter__`/`__aiter__` methods to
implement the stream consumption themselves using a try/except/finally.
Note:
1. This only applies when users iterate via `__iter__()`/`__aiter__()`;
directly calling `__next__()`/`__anext__()` individually does not let us
know when the overall response is fully consumed.
2. When users use `__aiter__()` and break early, they are still
responsible for calling `resp.close()`, since asynchronous generators do
not automatically close when the context manager is exited (cleanup is
deferred until `close()` is called, either manually or by the garbage
collector). A sketch of the iteration logic follows.
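A simplified sketch of the synchronous side of this change, assuming a wrapper holding the underlying stream and an active span with ddtrace-like `set_metric`/`set_exc_info`/`finish` methods (class, attribute, and metric names are illustrative):

```python
class TracedStream:
    def __init__(self, wrapped, span, extract_usage_chunk):
        self._wrapped = wrapped  # the underlying openai.Stream
        self._span = span
        self._extract_usage_chunk = extract_usage_chunk  # flag from component 1

    def __iter__(self):
        try:
            for chunk in self._wrapped:
                if self._extract_usage_chunk and getattr(chunk, "usage", None):
                    # The trailing usage chunk the integration requested:
                    # record the counts on the span, but hide the chunk.
                    self._span.set_metric(
                        "openai.response.usage.total_tokens",
                        chunk.usage.total_tokens,
                    )
                    continue
                yield chunk
        except Exception as e:
            self._span.set_exc_info(type(e), e, e.__traceback__)
            raise
        finally:
            # Runs when the stream is exhausted, an error is raised, or the
            # generator is closed (e.g. the caller breaks and it is garbage
            # collected), so the span is finished in every case.
            self._span.finish()
```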
### Testing
This PR simplifies the existing OpenAI streamed completion/chat
completion tests (using snapshots where possible instead of large
numbers of tedious assertions) and adds coverage for the token
extraction behavior: existing tests drop the `include_usage: True`
option to assert that the automatic extraction works, and a couple of
new tests assert our original behavior when `include_usage: False` is
explicitly set.
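For illustration, a hypothetical test along these lines (the `traced_openai_client` fixture is invented for this sketch and is not the actual test harness):

```python
def test_streamed_chat_hides_auto_added_usage_chunk(traced_openai_client):
    stream = traced_openai_client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": "hi"}],
        stream=True,  # note: no stream_options set by the caller
    )
    chunks = list(stream)
    # The integration requested usage itself, so the trailing usage chunk
    # must not be surfaced to the application.
    assert all(chunk.usage is None for chunk in chunks)
    assert all(chunk.choices for chunk in chunks)
```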
## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))
## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
openai: Introduces automatic extraction of token usage from streamed chat completions.
Unless ``stream_options: {"include_usage": False}`` is explicitly set on your streamed chat completion request,
the OpenAI integration will add ``stream_options: {"include_usage": True}`` to your request and automatically extract the token usage chunk from the streamed response.