EKS Auto Mode: Two bugs prevent node provisioning — empty instance profile + dry-run validation deadlock #9016
Description
Observed Behavior:
Two bugs prevent any node provisioning in EKS Auto Mode:
Bug 1: The default NodeClass uses spec.role. The controller calls CreateInstanceProfile but never calls AddRoleToInstanceProfile. The instance profile is created with Roles: []. NodeClass status = InstanceProfileCreationFailed.
CloudTrail shows:
- CreateInstanceProfile: 1 event from AWSServiceRoleForAmazonEKS — succeeds
- AddRoleToInstanceProfile: 0 events from EKS service — never attempted
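The CloudTrail comparison above can be reproduced with a short script. This is a minimal sketch, assuming a boto3 CloudTrail client is passed in; the helper name `count_events` is hypothetical. It counts events by name only and does not filter by calling principal, so the results still need to be narrowed to AWSServiceRoleForAmazonEKS by inspecting the returned events.

```python
def count_events(cloudtrail, event_names, start_time=None, end_time=None):
    """Count CloudTrail events per event name.

    `cloudtrail` is assumed to behave like boto3.client("cloudtrail"):
    it must expose lookup_events(**kwargs) returning a dict with
    "Events" and, when more pages exist, "NextToken".
    """
    counts = {}
    for name in event_names:
        # LookupEvents accepts a single lookup attribute per call.
        kwargs = {
            "LookupAttributes": [
                {"AttributeKey": "EventName", "AttributeValue": name}
            ]
        }
        if start_time is not None:
            kwargs["StartTime"] = start_time
        if end_time is not None:
            kwargs["EndTime"] = end_time

        total = 0
        while True:
            page = cloudtrail.lookup_events(**kwargs)
            total += len(page.get("Events", []))
            token = page.get("NextToken")
            if not token:
                break
            kwargs["NextToken"] = token  # fetch the next page
        counts[name] = total
    return counts


# Possible usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("cloudtrail", region_name="us-east-1")
#   print(count_events(client, ["CreateInstanceProfile",
#                               "AddRoleToInstanceProfile"]))
```

A result with a nonzero count for CreateInstanceProfile and zero for AddRoleToInstanceProfile would match the observation reported here.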
IAM policy simulation on AWSServiceRoleForAmazonEKS confirms:
- iam:CreateInstanceProfile on eks-* -> allowed (AmazonEKSServiceRolePolicy)
- iam:AddRoleToInstanceProfile on eks-* -> implicitDeny (NOT in AmazonEKSServiceRolePolicy)
- AllowedByOrganizations: true (SCPs not blocking)
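For anyone who wants to repeat the policy simulation, here is a minimal sketch using the IAM `SimulatePrincipalPolicy` API through boto3. The helper name `simulate_actions` is hypothetical, and the ARNs in the usage comment are placeholders to fill in for your account. It reads only the first page of results, which should be enough for a handful of actions.

```python
def simulate_actions(iam, principal_arn, actions, resource_arn):
    """Return the simulated decision for each action.

    `iam` is assumed to behave like boto3.client("iam") and expose
    simulate_principal_policy(PolicySourceArn=..., ActionNames=...,
    ResourceArns=...).
    """
    resp = iam.simulate_principal_policy(
        PolicySourceArn=principal_arn,
        ActionNames=list(actions),
        ResourceArns=[resource_arn],
    )
    results = {}
    for item in resp.get("EvaluationResults", []):
        # OrganizationsDecisionDetail is only present when the account
        # belongs to an AWS Organization.
        org = item.get("OrganizationsDecisionDetail", {})
        results[item["EvalActionName"]] = {
            "decision": item["EvalDecision"],
            "allowed_by_organizations": org.get("AllowedByOrganizations"),
        }
    return results


# Possible usage (requires boto3 and AWS credentials; ARNs are placeholders):
#   import boto3
#   out = simulate_actions(
#       boto3.client("iam"),
#       "<arn-of-AWSServiceRoleForAmazonEKS>",
#       ["iam:CreateInstanceProfile", "iam:AddRoleToInstanceProfile"],
#       "<arn-of-the-eks-instance-profile>",
#   )
#   for action, detail in out.items():
#       print(action, detail)
```

An `implicitDeny` decision together with `allowed_by_organizations` set to true would indicate that the gap is in the role's own policy rather than in an SCP.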
Bug 2: After working around Bug 1 with a custom NodeClass that uses spec.instanceProfile (a pre-created profile with the role attached), InstanceProfileReady becomes True. However, ValidationSucceeded stays permanently stuck at AwaitingReconciliation. The controller calls DescribeLaunchTemplates for hash-based names (eks.amazonaws.com/) that don't exist and receives InvalidLaunchTemplateName.NotFoundException. No CreateFleet or RunInstances calls are ever made.
We manually created dummy launch templates with the expected names. DescribeLaunchTemplates then succeeds (confirmed via CloudTrail), but the controller still does not progress.
NodeClass conditions:
- InstanceProfileReady: True
- SubnetsReady: True
- SecurityGroupsReady: True
- CapacityReservationsReady: True
- ValidationSucceeded: Unknown (AwaitingReconciliation) — stuck indefinitely
- Ready: Unknown
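To pull the same summary out of a cluster, the conditions can be filtered from the NodeClass JSON. This is a minimal sketch, assuming the object was fetched with something like `kubectl get nodeclass <name> -o json` and that its status uses the standard Kubernetes condition fields (`type`, `status`, `reason`). The helper name `blocking_conditions` is hypothetical.

```python
import json


def blocking_conditions(nodeclass):
    """Return (type, status, reason) for every condition that is not True."""
    conditions = nodeclass.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("status"), c.get("reason", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


if __name__ == "__main__":
    # Example input shaped like the conditions listed in this report.
    sample = json.loads("""
    {"status": {"conditions": [
      {"type": "InstanceProfileReady", "status": "True"},
      {"type": "SubnetsReady", "status": "True"},
      {"type": "ValidationSucceeded", "status": "Unknown",
       "reason": "AwaitingReconciliation"},
      {"type": "Ready", "status": "Unknown"}
    ]}}
    """)
    for ctype, status, reason in blocking_conditions(sample):
        print(f"{ctype}: {status} {reason}".rstrip())
```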
Zero nodes. All pods Pending. Reproduced across 2 separate AWS Organizations with different management accounts and different SCPs.
Related: #8720, aws/containers-roadmap#2557
Expected Behavior:
- The controller should call AddRoleToInstanceProfile after CreateInstanceProfile (or AmazonEKSServiceRolePolicy should include this permission)
- The controller should create launch templates before attempting to describe/validate them
- Nodes should provision and pods should schedule
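The first expectation corresponds to the following IAM call sequence. This is a sketch of the expected order of operations written against boto3, not the controller's actual code; the helper name `create_profile_with_role` is hypothetical. The same sequence can be used to pre-create the profile for the spec.instanceProfile workaround.

```python
def create_profile_with_role(iam, profile_name, role_name):
    """Create an instance profile, attach a role, and verify the result.

    `iam` is assumed to behave like boto3.client("iam"). Returns the list
    of role names attached to the profile after the calls complete.
    """
    # Step 1: create the (initially empty) instance profile.
    iam.create_instance_profile(InstanceProfileName=profile_name)

    # Step 2: attach the role. This is the call that never appears in
    # CloudTrail in the reported behavior.
    iam.add_role_to_instance_profile(
        InstanceProfileName=profile_name, RoleName=role_name
    )

    # Step 3: read the profile back and confirm Roles is not empty.
    profile = iam.get_instance_profile(
        InstanceProfileName=profile_name
    )["InstanceProfile"]
    attached = [r["RoleName"] for r in profile.get("Roles", [])]
    if role_name not in attached:
        raise RuntimeError(
            f"role {role_name!r} not attached to profile {profile_name!r}"
        )
    return attached


# Possible usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   create_profile_with_role(boto3.client("iam"),
#                            "<profile-name>", "<node-role-name>")
```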
Reproduction Steps (Please include YAML):
- Create an EKS cluster with Auto Mode enabled
- Observe the default NodeClass — status shows InstanceProfileCreationFailed
- Check CloudTrail: CreateInstanceProfile succeeds, AddRoleToInstanceProfile never called
- Run aws iam simulate-principal-policy against AWSServiceRoleForAmazonEKS for iam:AddRoleToInstanceProfile; it returns implicitDeny
- Work around Bug 1 by applying a custom NodeClass with spec.instanceProfile:
apiVersion: eks.amazonaws.com/v1
kind: NodeClass
metadata:
  name: custom
spec:
  instanceProfile: <pre-created-profile-with-role-attached>
  subnetSelectorTerms:
    - tags:
        kubernetes.io/cluster/<cluster-name>: shared
  securityGroupSelectorTerms:
    - tags:
        kubernetes.io/cluster/<cluster-name>: owned
- InstanceProfileReady becomes True, but ValidationSucceeded stays at AwaitingReconciliation indefinitely
- Check CloudTrail: DescribeLaunchTemplates for eks.amazonaws.com/ returns NotFoundException repeatedly
Versions:
Chart Version: N/A (EKS Auto Mode — AWS-managed Karpenter on control plane)
Kubernetes Version: v1.31
EKS Platform Version: eks.latest
Region: us-east-1