KAFKA-19425: Stop the server when fail to initialize to avoid local segment never got deleted. #20007
@jiafu1115 Could you share the exception you got? I would like to check whether the exception is retriable. If it is, we may not use ...
[2025-06-03 20:44:27,356] ERROR [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Error building remote log auxiliary state for MyTopicName (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.internals.FatalExitError
    at org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.ensureInitializedAndNotClosed(TopicBasedRemoteLogMetadataManager.java:553)
    at org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.remoteLogSegmentMetadata(TopicBasedRemoteLogMetadataManager.java:221)
    at kafka.log.remote.RemoteLogManager.fetchRemoteLogSegmentMetadata(RemoteLogManager.java:586)
    at kafka.server.TierStateMachine.buildRemoteLogAuxState(TierStateMachine.java:231)
    at kafka.server.TierStateMachine.start(TierStateMachine.java:113)
    at kafka.server.AbstractFetcherThread.handleOffsetsMovedToTieredStorage(AbstractFetcherThread.scala:763)
    at kafka....
@FrankYang0529 I checked the code: the exception thrown is "FatalExitError", and it comes from ensureInitializedAndNotClosed, which multiple methods call. As for how we found it, and why the manager failed to finish initializing: see #20008 and the description of https://issues.apache.org/jira/browse/KAFKA-19371 for more information.
From comment in [0] Lines 562 to 565 in 4387132
[1] kafka/storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManager.java Lines 1979 to 1988 in 4387132
[2] kafka/storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManager.java Lines 2019 to 2038 in 4387132
[3] Lines 379 to 383 in 4387132
"we don't use another thread to initialize resources in TopicBasedRemoteLogMetadataManager [3] and throw exception immediately if there is an error." That is why I propose stopping the server when initialization fails, while still keeping the initialization in its own thread. Then we can delete this check from the many methods that perform it. WDYT?
@FrankYang0529 Could you check the code again? Thanks. I have made all the requested changes. After this change we no longer rely on the callers (at least five methods) to check whether the server should stop, which is not easy to get right; we just stop it as soon as possible at the point where the failure originates. BTW: Why do I think it is a critical issue?
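To make the idea in the comment above concrete, here is a minimal Java sketch of "stop at the origin": the initialization thread itself triggers the shutdown when it fails, so callers do not need their own check. This is not the code in the PR. The class `FailFastInit`, the `initializer` runnable, and the `exitProcedure` callback are all hypothetical stand-ins chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/** Hypothetical sketch; not the actual TopicBasedRemoteLogMetadataManager. */
final class FailFastInit {
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    private final Runnable initializer;      // assumed stand-in for the resource setup
    private final IntConsumer exitProcedure; // assumed stand-in for halting the broker

    FailFastInit(Runnable initializer, IntConsumer exitProcedure) {
        this.initializer = initializer;
        this.exitProcedure = exitProcedure;
    }

    /** Runs initialization on its own thread and waits for that thread to finish. */
    void startAndAwait() {
        Thread t = new Thread(() -> {
            try {
                initializer.run();
                initialized.set(true);
            } catch (RuntimeException e) {
                // Stop at the origin instead of leaving the check to every caller.
                exitProcedure.accept(1);
            }
        }, "init-thread");
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isInitialized() {
        return initialized.get();
    }
}
```

In a real process the exit callback would terminate the JVM; passing it in as a parameter keeps the sketch testable without killing the test runner.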
A label of 'needs-attention' was automatically added to this PR in order to raise the ...
We found that one broker's local segments on disk were never removed, no matter how long they had been stored. Disk usage kept increasing.
Note: the node hosting partition 2 is the node with the exception.
After troubleshooting, we found that if a broker is very slow to start up, TopicBasedRemoteLogMetadataManager#initializeResources sometimes fails (which is expected, since the server is not ready in time). The failure does not stop the server: it keeps running and only logs some exceptions, without shutting down. That broker then never uploads its local segments to remote storage, so the local segments are never deleted.
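The effect described above can be illustrated with a deliberately simplified Java model, assuming that a local segment becomes deletable only once it has both exceeded local retention and been copied to remote storage. The class `LocalRetentionModel`, its `Segment` type, and the parameter names are hypothetical and are not Kafka's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model; not Kafka's retention code. */
final class LocalRetentionModel {
    /** A local segment described by its last offset and its age. */
    static final class Segment {
        final long endOffset;
        final long ageMs;

        Segment(long endOffset, long ageMs) {
            this.endOffset = endOffset;
            this.ageMs = ageMs;
        }
    }

    /**
     * Returns the segments that may be deleted locally: those older than the
     * local retention time AND already copied to remote storage.
     * highestCopiedOffset is -1 when nothing has been uploaded yet.
     */
    static List<Segment> deletable(List<Segment> local,
                                   long highestCopiedOffset,
                                   long localRetentionMs) {
        List<Segment> result = new ArrayList<>();
        for (Segment s : local) {
            boolean expired = s.ageMs > localRetentionMs;
            boolean copied = s.endOffset <= highestCopiedOffset;
            if (expired && copied) {
                result.add(s);
            }
        }
        return result;
    }
}
```

Under this model, a broker whose uploads never run keeps `highestCopiedOffset` at -1, so no segment ever qualifies for deletion regardless of its age, which matches the ever-growing disk usage reported here.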
So this PR proposes shutting down the broker when that initialization fails, to avoid a silent critical error that makes disk usage grow forever.