### Describe the bug

Unbounded memory and socket retention due to an AWS SDK Netty `NioSocketChannel` leak under high-concurrency S3 async operations.

When performing high-throughput, parallel S3 object locking operations with the AWS SDK v2 async Netty client, the application experienced:

- Heap memory growth from retained Netty `Channel.attr(...)` state (e.g. `RequestContext`, metrics).
- Socket exhaustion and eventual TCP zero-window scenarios.
- Thread explosion under error or retry paths due to unconsumed `CompletableFuture` chains in Reactor.
- Persistent `CLOSE_WAIT` / `ESTABLISHED` sockets, even after all logical requests completed.
### Regression Issue

- [ ] Select this option if this issue appears to be a regression.
### Expected Behavior

Given the use of the AWS SDK v2's `NettyNioAsyncHttpClient`, we expected:

- At most X concurrent `NioSocketChannel` instances, where X = `maxConcurrency` set on the async HTTP client.
- Each `Channel` to be fully released back to the pool after the request completes, with:
  - all `Channel.attr(...)` entries cleared (e.g. `RequestContext`, execution id), and
  - no dangling user-defined or internal AWS SDK metadata.
### Current Behavior

- Unbounded `NioSocketChannel` accumulation:
  - Far more channels than `maxConcurrency` were created and retained.
  - Old channels were not cleaned up or released, even after request completion.
- TCP port exhaustion:
  - Persistent `ESTABLISHED` and `CLOSE_WAIT` states observed on the client side.
  - The system was unable to open new connections, leading to TCP zero-window behavior.
- Heap retention and GC pressure:
  - Channels held reference chains via `Channel.attr(...)` (e.g. `RequestContext`, `SdkHttpRequest`, execution id).
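To make the retention chain concrete, here is a stdlib-only analogy (not the actual Netty API): each pooled "channel" carries an attribute map, mirroring `Channel.attr(...)`. As long as the pool keeps the channel alive and nothing clears the attribute, the completed request's state stays reachable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AttrRetentionDemo {
    // Stand-in for a pooled Netty channel with its attribute map.
    static class FakeChannel {
        final Map<String, Object> attrs = new HashMap<>();
    }

    static long retainedCount(List<FakeChannel> pool) {
        return pool.stream().filter(c -> c.attrs.get("REQUEST_CONTEXT") != null).count();
    }

    public static void main(String[] args) {
        List<FakeChannel> pool = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            FakeChannel ch = new FakeChannel();
            // Request completes, but its per-request state is never cleared.
            ch.attrs.put("REQUEST_CONTEXT", new byte[1024]);
            pool.add(ch);
        }
        System.out.println("retained before cleanup: " + retainedCount(pool));

        // Clearing the attribute on release breaks the reference chain.
        pool.forEach(c -> c.attrs.put("REQUEST_CONTEXT", null));
        System.out.println("retained after cleanup: " + retainedCount(pool));
    }
}
```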
### Reproduction Steps

The issue can be reproduced by configuring a high-throughput S3 async client using the AWS SDK v2 with `NettyNioAsyncHttpClient`. The snippet below simulates many concurrent `headObject` (or `putObjectRetention`) calls and reveals the `NioSocketChannel` leak.
```java
SdkAsyncHttpClient httpClient = NettyNioAsyncHttpClient.builder()
        .maxConcurrency(64) // expected channel limit
        .readTimeout(Duration.ofSeconds(10))
        .writeTimeout(Duration.ofSeconds(10))
        .connectionAcquisitionTimeout(Duration.ofSeconds(15))
        .build();

S3AsyncClient s3Client = S3AsyncClient.builder()
        .httpClient(httpClient)
        .region(Region.US_EAST_1)
        .build();

ExecutorService executor = Executors.newFixedThreadPool(100);
for (int i = 0; i < 10000; i++) {
    int id = i;
    executor.submit(() -> {
        s3Client.headObject(HeadObjectRequest.builder()
                .bucket("your-bucket")
                .key("test-object-" + id)
                .build())
            .whenComplete((resp, err) -> {
                if (err != null) {
                    System.err.println("Failed: " + err);
                }
            });
    });
}
```
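One caller-side mitigation worth noting while the leak exists: cap in-flight submissions with a `Semaphore` so callers cannot outrun `maxConcurrency`. This is a hedged, stdlib-only sketch of the pattern, not part of the reported fix; `supplyAsync` stands in for `s3Client.headObject(...)`, and the permit count mirrors the client's `maxConcurrency`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedSubmitDemo {
    public static void main(String[] args) throws Exception {
        Semaphore inFlight = new Semaphore(64); // match maxConcurrency
        ExecutorService pool = Executors.newFixedThreadPool(8);
        CountDownLatch done = new CountDownLatch(1000);

        for (int i = 0; i < 1000; i++) {
            inFlight.acquire(); // blocks once 64 calls are outstanding
            CompletableFuture.supplyAsync(() -> "ok", pool) // stand-in for the async S3 call
                .whenComplete((resp, err) -> {
                    inFlight.release(); // always release, success or failure
                    done.countDown();
                });
        }

        done.await();
        pool.shutdown();
        System.out.println("completed all requests, permits back: " + inFlight.availablePermits());
    }
}
```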
Monitoring tools:

- `lsof -iTCP -sTCP:ESTABLISHED | grep java`
- `netstat -an | grep '.443 ' | grep ESTABLISHED | wc -l`
- VisualVM / JFR to inspect heap usage and `NioSocketChannel` retention
### Possible Solution

`NettyRequestExecutor.configureChannel(...)` puts the `RequestContext` into a channel attribute. This prevents many completed `RequestContext` objects from being garbage collected, and `RequestContext` in turn holds references to the executor. Channels offered back to the pool for reuse, even while idle, still reference their completed `RequestContext`.

In the callback created by `createExecutionFuture`, the channel is closed without cleaning up its attributes:
```java
if (ch.attr(IN_USE) != null && ch.attr(IN_USE).get() && executionIdKey != null) {
    ch.pipeline().fireExceptionCaught(new FutureCancelledException(this.executionId, t));
} else {
    // here the attributes set for this call can be cleaned up
    cleanupChannelAttributes(ch);
    ch.close().addListener(closeFuture -> context.channelPool().release(ch));
}
```
```java
private void cleanupChannelAttributes(Channel ch) {
    if (ch == null) {
        return;
    }
    Long currentExecutionId = ch.attr(EXECUTION_ID_KEY).get();
    RequestContext currentContext = ch.attr(REQUEST_CONTEXT_KEY).get();
    boolean sameExecution = Objects.equals(currentExecutionId, this.executionId);
    boolean sameContext = Objects.equals(currentContext, this.context);
    if (sameExecution || sameContext) {
        ch.attr(REQUEST_CONTEXT_KEY).set(null);
        ch.attr(EXECUTE_FUTURE_KEY).set(null);
        ch.attr(RESPONSE_COMPLETE_KEY).set(null);
        ch.attr(STREAMING_COMPLETE_KEY).set(null);
        ch.attr(RESPONSE_CONTENT_LENGTH).set(null);
        ch.attr(RESPONSE_DATA_READ).set(null);
        ch.attr(EXECUTION_ID_KEY).set(null); // optional
        log.trace(ch, () -> "Cleaned channel attributes for executionId=" + executionId);
    } else {
        log.trace(ch, () -> "Skipped cleanup: channel reused by another request (executionId=" + currentExecutionId + ")");
    }
}
```
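The ownership guard above (`sameExecution || sameContext`) is the safety-critical part: a stale execution must not clear attributes a newer request has already claimed. Here is a stdlib-only sketch of that guard, with plain maps and a hypothetical `cleanupIfOwner` helper standing in for `Channel.attr(...)` and the real method:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CleanupGuardDemo {
    // Only the execution that currently owns the channel's attributes may clear them.
    static boolean cleanupIfOwner(Map<String, Object> attrs, long myExecutionId) {
        Long current = (Long) attrs.get("EXECUTION_ID");
        if (Objects.equals(current, myExecutionId)) {
            attrs.put("REQUEST_CONTEXT", null);
            attrs.put("EXECUTION_ID", null);
            return true; // cleaned
        }
        return false; // channel reused by another execution; skip
    }

    public static void main(String[] args) {
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("EXECUTION_ID", 7L);
        attrs.put("REQUEST_CONTEXT", new Object());

        // A stale execution (id 8) must not clear state owned by execution 7.
        System.out.println("stale owner cleans: " + cleanupIfOwner(attrs, 8L));
        // The real owner (id 7) clears its own state.
        System.out.println("real owner cleans: " + cleanupIfOwner(attrs, 7L));
        System.out.println("context cleared: " + (attrs.get("REQUEST_CONTEXT") == null));
    }
}
```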
`NettyRequestExecutor` also keeps per-request state in class-level fields.

Problem class and code:
```java
private CompletableFuture<Void> executeFuture;
private Channel channel;
private RequestAdapter requestAdapter;

public NettyRequestExecutor(RequestContext context) {
    this.context = context;
}

@SuppressWarnings("unchecked")
public CompletableFuture<Void> execute() {
    Promise<Channel> channelFuture = context.eventLoopGroup().next().newPromise();
    executeFuture = createExecutionFuture(channelFuture);
    acquireChannel(channelFuture);
    channelFuture.addListener((GenericFutureListener) this::makeRequestListener);
    return executeFuture;
}
```
Before the fix, a heap dump shows 2130 open `NioSocketChannel` objects for a `maxConcurrency` setting of 200.

After the fix inside the AWS SDK HTTP client, we see at most 200 `NioSocketChannel` objects, which matches the configured `maxConcurrency`.

Here is the fix that worked and stopped the `NioSocketChannel` leak:
- Remove the class-level references:

```java
private CompletableFuture<Void> executeFuture;
private Channel channel;
private RequestAdapter requestAdapter;
```

- Move them to method level and pass them around to the private methods:
```java
public CompletableFuture<Void> execute() {
    Promise<Channel> channelFuture = context.eventLoopGroup().next().newPromise();
    CompletableFuture<Void> executeFuture = createExecutionFuture(channelFuture);
    acquireChannel(channelFuture);
    channelFuture.addListener(f -> {
        if (channelFuture.isSuccess()) {
            Channel channel = channelFuture.getNow();
            NettyUtils.doInEventLoop(channel.eventLoop(), () -> {
                try {
                    configureChannel(channel, executeFuture);
                    RequestAdapter adapter = configurePipeline(channel);
                    makeRequest(channel, adapter, executeFuture);
                } catch (Throwable t) {
                    closeAndRelease(channel);
                    handleFailure(channel, () -> "Failed to initiate request to " + endpoint(), t, executeFuture);
                }
            });
        } else {
            handleFailure(null, () -> "Failed to create connection to " + endpoint(), channelFuture.cause(), executeFuture);
        }
    });
    return executeFuture;
}
```
The private helpers now take the per-request state as parameters:

```java
private void configureChannel(Channel channel, CompletableFuture<Void> executeFuture) { ... }

private RequestAdapter configurePipeline(Channel channel) throws IOException { ... }

private void makeRequest(Channel channel, RequestAdapter adapter, CompletableFuture<Void> executeFuture) {
    HttpRequest request = adapter.adapt(context.executeRequest().request());
    writeRequest(channel, request, executeFuture);
}
```
With this change, we no longer see an excessive number of `NioSocketChannel` objects.
### Additional Information/Context
If the solution is valid and safe, I am happy to create a fix PR.
### AWS Java SDK version used
2.29.6
### JDK version used
sdkman/candidates/java/21.0.6-tem/bin/java
### Operating System and version
Linux and macOS