Skip to content

feat(ds): Use sample rate in trace context if missing from DSC #4620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

mjq
Copy link
Member

@mjq mjq commented Mar 27, 2025

Our Electron SDK is not populating the sample_rate field in DSC (see getsentry/sentry-electron#1114). As a result, span metric extrapolation does not work as the client's sample rate is unknown.

Although we will also fix this in the SDK, it will take time for clients to upgrade. In the meantime, it would be nice if data sent by existing SDK versions could work too.

In the case of the Electron SDK, the client sample rate is being sent in the transaction event payload, inside the trace context:

{
  "contexts": {
    "trace": {
      "data": {
        "sentry.sample_rate": 0.1
      }
    }
  }
}

This PR checks if we are missing a DSC sample rate (because that field is missing, or the entire DSC is missing), and if so, tries to find a client sample rate in the transaction event payload to use and copies it into the DSC. The first place we check is trace context's documented client_sample_rate field, falling back to the non-standard sentry.sample_rate in trace context's data being used by Electron.


I would like to add an integration test to this PR as well, but I'll get to that tomorrow if the concept and direction seem sound. Thanks!

@mjq mjq requested a review from a team as a code owner March 27, 2025 00:46
Dav1dde
Dav1dde previously approved these changes Mar 27, 2025
@Dav1dde
Copy link
Member

Dav1dde commented Mar 27, 2025

I would like to add an integration test to this PR as well, but I'll get to that tomorrow if the concept and direction seem sound. Thanks!

Yep seems like the right place to do this. Integration test 😙🤌.

@Dav1dde Dav1dde dismissed their stale review March 27, 2025 15:07

More information

@Dav1dde
Copy link
Member

Dav1dde commented Mar 27, 2025

@jan-auer brought up a good point to me, do we need to make this change, ideally people affected by this can upgrade their SDK version, this change may have unintended side effects in Relay, when the DSC is accessed before running this correction.

@mjq
Copy link
Member Author

mjq commented Mar 27, 2025

@Dav1dde Yes, I agree that the ideal is not to touch Relay in this case. I am not sure we are in ideal circumstances though 😄 @jan-auer and others are currently looking at how widespread this issue is - if it was just Electron, that would be one thing, but Cocoa and the latest Python SDKs also appear to be missing the DSC sample rate (but populate it in trace context). We'll see how the investigation shakes out.

@mcannizz
Copy link
Contributor

@jan-auer brought up a good point to me, do we need to make this change, ideally people affected by this can upgrade their SDK version, this change may have unintended side effects in Relay, when the DSC is accessed before running this correction.

If we settle on a solution that requires customers to upgrade their SDKs, we have no control over when or if they will upgrade. That creates a risk to our business that we need to avoid at all reasonable costs. This is not a theoretical discussion; we have a product launch that depends on our ability to record accurate sampling rates.

While I understand the possibility of unintended side effects, we can and should mitigate this risk with automated tests.

This is what I recommend:

  1. Make this change and add automated tests now, to unblock our product launch.
  2. Find all the SDKs that don't pass sample_rate as intended and fix them.
  3. Once all or mostly all of our customers have upgraded their SDKs, remove this workaround. (I realize this may take a long time.)

@jan-auer thoughts?

@mjq
Copy link
Member Author

mjq commented Mar 27, 2025

I'm going to start working on an integration test for this while the discussion continues. (I don't see a test for making sure the client_sample_rate makes it to Kafka yet, so that's valuable no matter where we land.)

@mjq
Copy link
Member Author

mjq commented Mar 28, 2025

Added integration tests.

They showed that the logic I added to try to read the sample rate from trace context client_sample_rate was flawed: client_sample_rate is a field written by Relay, not something clients can send in. So the value always matches the DSC anyway. Removed that logic.

On the plus side, the tests do confirm that the workaround for Electron (reading trace context's data's sentry.sample_rate) does indeed write client_sample_rate into span measurements as required by Snuba/EAP.

On the minus side, this change does not address the issue we found in the Python SDK - in that case the final event JSON is missing sample_rate in the DSC (based on the _dsc field of the event), but does have it in client_sample_rate in the trace context (example). Given what I just said about Relay generating this field, that ought to be impossible. Looked at the Python SDK, Relay and Sentry and can't figure out how an event gets into that state. Would love to hear ideas if anyone has any.

@jan-auer jan-auer changed the title feat(ds): use sample rate in trace context if missing from dsc feat(ds): Use sample rate in trace context if missing from DSC Mar 28, 2025
@jan-auer
Copy link
Member

jan-auer commented Mar 28, 2025

I'm going to respond to several comments here without quoting the individual points:

  1. @cleptric could you please provide some information on the previous investigation of missing sample rates in the DSC? I know that a couple of months ago we have already investigated all SDKs and started to fix them all. @mcannizz has just requested this again, and we should double-check what we have missed in the past.
  2. Note that we're not dealing with some intentional legacy behavior here, but with custom data that the Electron SDK has been accidentally sending (notice the field is not in the schema). If we move forward with this, we need to:
    1. Put protections in place that Relay never reads the sample rate field from the DSC before this normalization has run. The simplest way to assure this is by writing an E2E test.
    2. Determine the minimum version where Electron SDK started sending the erroneous field. All versions before that will not have any sample rate at all, so these customers must update in any case (see here).
    3. Technically we can monitor SDK usage and eventually remove the workaround. We are very strict with this, however, so I would plan for this workaround to persist indefinitely.
  3. Even if we normalize in Relay, it is still missing from baggage in SDKs. Therefore, propagation of the sample rate between connected SDKs is likely broken. This means that spans/transactions from downstream clients will not have any sample rate available and extrapolation will be broken regardless. There's nothing that can be done about this in Relay if the sample rate is missing entirely.

In combination with other SDK bugs where no sample rate or the wrong sample rate was sent to Relay, this means that for every SDK there is a certain minimum version to allow correct extrapolation. We need to document these versions, and potentially suggest an SDK update in-product. For most SDKs, this has been added a long time ago, but in some this is recent.

We can move forward with this workaround, but it will not solve the larger problem affecting far more users. I strongly suggest we determine all affected SDKs and encourage upgrades.

@Dav1dde
Copy link
Member

Dav1dde commented May 20, 2025

Closing for now, with the reasons above described by Jan.

@Dav1dde Dav1dde closed this May 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants