
Conversation

@nimanikoo (Contributor)

Closes #6282 ([feature request] Limit Maximum Retry Attempts)

Description

This pull request addresses the `// TODO` comment in the `OtlpRetry` helper and resolves the issue of potentially indefinite retries in the OTLP exporter. It introduces a hardcoded maximum of 5 retry attempts for failed exports.

Changes Made

  • Implemented Max Retry Limit: The core logic in `OtlpRetry.TryGetGrpcRetryResult` and `OtlpRetry.TryGetHttpRetryResult` has been updated to check the number of attempts. If `attempt >= 5`, the methods now return `false`, stopping the retry loop (see the sketch after this list).

  • Updated Existing Tests: The existing test case "Exponential backoff after throttling" in `GrpcRetryTestCase`, which previously simulated 9 retries, has been adjusted to respect the new limit of 5.

  • Added New Boundary Tests: To ensure the limit is strictly enforced, new test cases named "Max retries exceeded (fail on 6th attempt)" have been added for both `GrpcRetryTestCase` and `HttpRetryTestCase`. These tests verify that:

    • The first 5 retry attempts are permitted (`true`).
    • The 6th attempt correctly fails (`false`).
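
A minimal, self-contained C# sketch of the guard described above. The class and method names here are illustrative stand-ins, not the real `OtlpRetry` API, which also deals with gRPC/HTTP status codes and server-supplied throttling delays:

```csharp
using System;

// Illustrative stand-in for the OtlpRetry helper. The real
// TryGetGrpcRetryResult/TryGetHttpRetryResult methods inspect status codes
// and throttling hints; this sketch reduces the change to the attempt
// guard plus a simplified backoff.
internal static class OtlpRetrySketch
{
    private const int MaxRetryAttempts = 5;

    public static bool TryGetRetryDelay(int attempt, out TimeSpan delay)
    {
        delay = TimeSpan.Zero;

        // attempt is zero-based: indexes 0..4 (the first five attempts) may
        // retry; the 6th call (attempt == 5) refuses and ends the loop.
        if (attempt >= MaxRetryAttempts)
        {
            return false;
        }

        // Simplified exponential backoff: 1s, 2s, 4s, 8s, 16s.
        delay = TimeSpan.FromSeconds(Math.Pow(2, attempt));
        return true;
    }
}
```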

This change provides a safer default behavior for the exporter and serves as a foundational step for potentially making this limit configurable in the future.

Implements a hardcoded maximum of 5 retry attempts in the OTLP exporter's backoff logic, preventing indefinite retries.

The changes include:
- Modifying the `OtlpRetry` helper to stop retrying after the 5th attempt (i.e., on the 6th call).
- Updating existing `OtlpRetryTests` that were failing due to the new limit.
- Adding new dedicated test cases for both gRPC and HTTP to verify that the retry mechanism stops exactly on the 6th attempt (a minimal version is sketched below).
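
For illustration, a minimal xUnit-style boundary test against the `OtlpRetrySketch` shown earlier (hypothetical; the real tests are data-driven via `GrpcRetryTestCase`/`HttpRetryTestCase`, and an internal class would need `InternalsVisibleTo` from a separate test assembly):

```csharp
using Xunit;

public class OtlpRetrySketchTests
{
    [Fact]
    public void Retry_StopsOnSixthAttempt()
    {
        // The first five attempts (indexes 0..4) are allowed to retry.
        for (int attempt = 0; attempt < 5; attempt++)
        {
            Assert.True(OtlpRetrySketch.TryGetRetryDelay(attempt, out _));
        }

        // The sixth call must refuse to retry.
        Assert.False(OtlpRetrySketch.TryGetRetryDelay(5, out _));
    }
}
```
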
nimanikoo requested a review from a team as a code owner on December 4, 2025 at 16:43
github-actions bot added the `pkg:OpenTelemetry.Exporter.OpenTelemetryProtocol` label on Dec 4, 2025
codecov bot commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.87%. Comparing base (f630c55) to head (973bc2c).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ansmission/OtlpExporterRetryTransmissionHandler.cs | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

```diff
@@            Coverage Diff             @@
##             main    #6751      +/-   ##
==========================================
- Coverage   86.88%   86.87%   -0.01%
==========================================
  Files         259      259
  Lines       12066    12073       +7
==========================================
+ Hits        10483    10489       +6
- Misses       1583     1584       +1
```
| Flag | Coverage Δ |
|---|---|
| unittests-Project-Experimental | 86.67% <91.66%> (-0.11%) ⬇️ |
| unittests-Project-Stable | 86.72% <91.66%> (-0.07%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...yProtocol/Implementation/ExportClient/OtlpRetry.cs | 86.25% <100.00%> (+0.91%) ⬆️ |
| ...tlpExporterPersistentStorageTransmissionHandler.cs | 92.85% <100.00%> (ø) |
| ...rotocol/Implementation/Transmission/RetryHelper.cs | 100.00% <100.00%> (ø) |
| ...ansmission/OtlpExporterRetryTransmissionHandler.cs | 75.00% <66.66%> (-5.00%) ⬇️ |

nimanikoo changed the title from "feat(exporter-otlp): Add hardcoded max retry limit" to "feat(exporter-otlp): Add max retry limit" on Dec 4, 2025
@rajkumar-rangaraj (Member)

@open-telemetry/dotnet-approvers There's no OTLP spec requirement for max retry attempts. The JS SDK seems to use 5, while the Collector uses time-based limits instead.
Should we accept this implementation-specific change, or encourage driving this in the spec first?

@nimanikoo (Contributor, Author)

Thanks for bringing this up. I completely agree that adhering to the spec is crucial, but the current behavior, which allows potentially infinite retries, poses a significant safety risk (e.g., in serverless environments like AWS Lambda).

Regarding the implementation-specific concern:

Precedent in other SDKs: As you mentioned, the JS SDK limits this. Similarly, the Java SDK has an established `RetryPolicy` that supports max attempts. You can see it used in their tests here:
Java SDK RetryPolicy Usage

Community Demand: There is an active discussion in the spec repo (issue #3639) where users are explicitly asking for configurable retry limits, specifically for better compatibility with environments like Lambda.

I’ve implemented this hardcoded limit of 5 as a pragmatic safeguard to align .NET with the other SDKs and prevent infinite loops. This serves as a safe default for now, and I am fully open to updating or refactoring this logic (e.g., making it configurable) as soon as the spec evolves or if you have specific requirements for the implementation.

thanks

@open-telemetry/dotnet-approvers
@rajkumar-rangaraj

@pellared (Member)

> @open-telemetry/dotnet-approvers There's no OTLP spec requirement for max retry attempts. Seems JS SDK uses 5, Collector uses time-based limits instead. Should we accept this implementation-specific change, or encourage driving this in the spec first?

Some specification is here: https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md

For an exponential backoff strategy, it makes more sense to have a more readable and predictable configuration than a retry count (I do not believe a user will know how much time, e.g., 5 retries takes). I find time-based limits better. This is the configuration we have in OTel Go: https://pkg.go.dev/go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploghttp/internal/retry#Config (IMO there should also be a "multiply factor" field).
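
For readers who don't follow the link: a rough C# transliteration of the time-based options that Go `retry.Config` exposes could look like the following. None of these names exist in the .NET SDK today; the defaults mirror my reading of the Go package's documented behavior, and `Multiplier` is the extra field suggested above:

```csharp
using System;

// Hypothetical C# equivalent of OTel Go's retry.Config; purely a sketch.
public sealed class TimeBasedRetryConfig
{
    // Whether retrying of temporarily failed exports is enabled at all.
    public bool Enabled { get; set; } = true;

    // Delay before the first retry.
    public TimeSpan InitialInterval { get; set; } = TimeSpan.FromSeconds(5);

    // Upper bound on the delay between consecutive retries.
    public TimeSpan MaxInterval { get; set; } = TimeSpan.FromSeconds(30);

    // Total time budget for the request plus all of its retries.
    public TimeSpan MaxElapsedTime { get; set; } = TimeSpan.FromMinutes(1);

    // The suggested "multiply factor": growth rate of the backoff interval
    // between attempts (not part of the Go config today).
    public double Multiplier { get; set; } = 1.5;
}
```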

Moreover, we should also have a design that accepts a custom backoff strategy (example interface from Go: https://pkg.go.dev/github.com/cenkalti/backoff/v5#BackOff).

It would be good to add some recommendations in the protocol spec. This will help validate our proposal/design.

@nimanikoo (Contributor, Author)

Thanks for the feedback.

To be clear, implementing a simple time-based limit (e.g., a hardcoded 30s timeout) is technically straightforward here. However, implementing a fully pluggable/custom backoff strategy involves a larger refactoring of the retry architecture, which I believe is out of scope for this specific PR.

My main goal here is to fix a critical bug: the current code allows infinite retries.

Since other SDKs are split on this (Java and JS use max attempts, Go uses time-based limits), could we please merge this `MaxAttempts = 5` change as an immediate safeguard to stop the infinite loops?

This fixes the burning issue now without blocking a move to a more sophisticated time-based/pluggable design in a future PR (aligned with spec updates).

@pellared

@pellared (Member)

@nimanikoo

I agree. We just need to be sure that we will not have backwards-compatibility problems when we want to make it more sophisticated (add more parameters). Still, I do not think it should be a problem. I am just very careful when adding things to the public API.

> Pluggable design in a future PR (aligned with Spec updates).

This can be done later and this should also not cause any backwards compatibility problems.

@nimanikoo (Contributor, Author)

@pellared

I completely understand and appreciate your caution regarding the public API surface. Stability is definitely key.

I am glad we agree on the path forward. I will leave the pluggable design for a future PR aligned with Spec updates, as discussed.

Please let me know if there are any final checks needed or if we are good to merge.

@pellared (Member)

I am not sure that a hard-coded `MaxRetryAttempts = 5` is a good and safe proposal. How much time does it amount to? Where is this documented? Note also that this can be a breaking behavioral change for users who rely on the existing behavior. There needs to be a way to configure the retry mechanism for users who are not happy with the new behavior.

Here is the OTel Go behavior (from https://pkg.go.dev/go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploghttp#WithRetry):

> If unset, the default retry policy will be used. It will retry the export 5 seconds after receiving a retryable error and increase exponentially after each error for no more than a total time of 1 minute.

I propose changing this PR to allow setting time-based limits and to provide reasonable default values.
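
To make the time-based default concrete: assuming the quoted Go defaults (5 s initial interval, 1-minute total budget) and a 1.5× backoff multiplier (an assumption; the multiplier is not stated in the quote), the successive delays would be roughly 5 s + 7.5 s + 11.25 s + 16.9 s + 25.3 s ≈ 65.9 s, so about four to five retries fit before `MaxElapsedTime` stops the loop. A time budget makes that bound predictable in a way a raw retry count does not.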

Note that I am not a maintainer or an approver in this repository, so I suggest waiting for feedback from, e.g., @rajkumar-rangaraj.

@nimanikoo (Contributor, Author) commented Dec 10, 2025

I agree that hard-coding values is generally not ideal.

Regarding the "breaking change" concern: the current behavior (infinite retries) is widely considered a defect, especially in serverless environments, where it causes timeouts and increased costs. It is unlikely that users rely on infinite retries as a desired feature, and most other OTel SDKs (Java, JS) impose a default limit for safety.

However, to address your concern about flexibility:
I can expose a `MaxRetryAttempts` property in the options (instead of hard-coding it internally); a rough sketch follows the list below.

  • Default: 5 (aligning with Java/JS and providing safety by default).
  • User override: users who genuinely need more retries can increase this value.
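
A rough sketch of that option, assuming it would hang off the existing `OtlpExporterOptions`; the property below is hypothetical and does not exist in the shipped package:

```csharp
using System;

// Hypothetical options shape; the real OtlpExporterOptions has no such
// property today.
public class OtlpExporterOptionsSketch
{
    private int maxRetryAttempts = 5;

    // Maximum number of retry attempts before the exporter gives up.
    // The default of 5 would align with the Java and JS SDKs.
    public int MaxRetryAttempts
    {
        get => this.maxRetryAttempts;
        set
        {
            if (value < 0)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(value), "MaxRetryAttempts must be non-negative.");
            }

            this.maxRetryAttempts = value;
        }
    }
}
```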

Implementing a full time-based limit currently requires a larger refactor of the retry logic, which @rajkumar-rangaraj and I agreed to postpone to a future PR to keep this fix minimal.

Does exposing MaxRetryAttempts in the options satisfy your concern for configurability?

@alanwest (Member)

> My main goal here is to fix a critical bug: The current code allows for infinite retries.

I would support an option to configure the max retry attempts if it were supported by the specification. That said, we do not have a bug: the OTLP exporter has a timeout option, which defaults to 10 seconds, and retry attempts should not exceed the configured timeout.

@nimanikoo (Contributor, Author)

Thanks for the comment @alanwest .

While I agree that the timeout (default 10s) acts as a hard stop and prevents the process from hanging forever, relying solely on the timeout for retry logic is not ideal, for a few reasons:

  1. Fail fast and resource efficiency: With persistent errors, we may want to stop retrying after 5 attempts (which might take only 2-3 seconds) rather than spinning and consuming CPU/network resources until the full 10s timeout fires.
  2. Consistency with other SDKs: The OpenTelemetry Java SDK, for example, explicitly implements `max_attempts` (default 5). Bringing this parity to .NET would be beneficial.
  3. Spec alignment: The OTel specification generally recommends implementing retry strategies. Some SDKs use time-based limits (Go), others count-based limits (Java). Adding a configurable `MaxRetryAttempts` covers this need without being overly complex.

So, while the timeout prevents an "infinite hang," `MaxRetryAttempts` provides the necessary control over the retry behavior itself.

@alanwest (Member)

I agree that a configurable max number of retries would be useful, but I'd like to see the specification permit this configuration first.

I opened an issue against the specification a while ago: open-telemetry/opentelemetry-specification#3639. It includes the suggestion of making the max number of retries configurable.

I do not have bandwidth to drive this issue at this time, but you're welcome to chime in and maybe even suggest a change to the specification.

@nimanikoo (Contributor, Author) commented Dec 17, 2025

Thanks for sharing the context on the specification issue. I agree that we should wait for the Spec before introducing any public API changes.

So, let's put the configuration option aside.

Instead, I propose we stick to an internal limit of 5 attempts (hard-coded, no public API change).

This aligns with the Java SDK and acts as a "fail-fast" optimization for serverless environments (avoiding full timeout waits), without conflicting with the Specification.

If this internal-only approach works for you, I'll make sure the PR reflects this logic so we can move forward.

@alanwest

@alanwest (Member)

With exponential backoff, I don't think you achieve your goal of failing fast. How long a fixed number of retries will take is nondeterministic.

At this time, I think the best option for your scenario is to reduce the timeout to a range that is acceptable for your environment. That way you have complete control over the maximum time the request plus any retries will take.
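
For anyone landing on this thread with the same problem, the suggestion above uses an option that already exists today. `TimeoutMilliseconds` defaults to 10000; the 3-second budget and endpoint below are example values:

```csharp
using System;
using OpenTelemetry;
using OpenTelemetry.Trace;

// Requires the OpenTelemetry and OpenTelemetry.Exporter.OpenTelemetryProtocol
// packages. Lowering the exporter timeout caps the total time an export
// (the request plus any retries) may take.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("http://localhost:4317");
        options.TimeoutMilliseconds = 3000; // example: 3 s total budget
    })
    .Build();
```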

I think @pellared's thought of a time-based limit for retries is interesting, but I'd still want to see this driven in the spec.

@nimanikoo (Contributor, Author)

I understand the concern about the specification, but I'm worried that strictly waiting for a spec update will stall this improvement indefinitely.

This is a practical technical fix that aligns with the Java SDK and mitigates a real-world risk. However, driving a full Specification change process is likely outside my current bandwidth.

If strictly following the spec process is the only way forward, does that mean we must block this PR until someone else picks up the spec work? Or is there a pragmatic middle ground (e.g., treating it as an internal safeguard) that lets us move forward without the bureaucratic delay?

@rajkumar-rangaraj
@alanwest

@rajkumar-rangaraj (Member)

I understand the concern. However, from the .NET SDK side we cannot merge changes that alter behavior without Specification backing. Once released, this becomes a contract we cannot easily undo.

To stay aligned with the Specification and avoid locking in behavior prematurely, we will pause this PR until the Spec provides clarity or guidance.

