EKS Auto Mode: Two bugs prevent node provisioning — empty instance profile + dry-run validation deadlock #9016

@nuritizra

Description

Observed Behavior:

Two bugs prevent any node provisioning in EKS Auto Mode:

Bug 1: The default NodeClass uses spec.role. The controller calls CreateInstanceProfile but never calls AddRoleToInstanceProfile, so the instance profile is created with Roles: [] and the NodeClass status is InstanceProfileCreationFailed.

CloudTrail shows:

  • CreateInstanceProfile: 1 event from AWSServiceRoleForAmazonEKS — succeeds
  • AddRoleToInstanceProfile: 0 events from EKS service — never attempted
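
For anyone who wants to repeat the CloudTrail check, here is a minimal sketch of how the event counts could be gathered. It assumes a boto3 CloudTrail client is passed in (for example `boto3.client("cloudtrail", region_name="us-east-1")`) and relies on `lookup_events`, which only covers recent management events. The helper names `issuer_name` and `count_events_by_role` are made up for illustration.

```python
import json
from collections import Counter


def issuer_name(event):
    """Best-effort name of the role behind an event, falling back to Username."""
    detail = json.loads(event.get("CloudTrailEvent", "{}"))
    issuer = (
        detail.get("userIdentity", {})
        .get("sessionContext", {})
        .get("sessionIssuer", {})
    )
    return issuer.get("userName") or event.get("Username", "<unknown>")


def count_events_by_role(cloudtrail, event_name):
    """Count CloudTrail events named `event_name`, grouped by calling role."""
    counts = Counter()
    kwargs = {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ]
    }
    while True:
        page = cloudtrail.lookup_events(**kwargs)
        for event in page.get("Events", []):
            counts[issuer_name(event)] += 1
        token = page.get("NextToken")
        if not token:
            return counts
        # Follow pagination until CloudTrail stops returning a token.
        kwargs["NextToken"] = token
```

Running it once for `CreateInstanceProfile` and once for `AddRoleToInstanceProfile` should make the asymmetry described above visible; an empty counter for the second call corresponds to "never attempted".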

IAM policy simulation on AWSServiceRoleForAmazonEKS confirms:

  • iam:CreateInstanceProfile on eks-* -> allowed (AmazonEKSServiceRolePolicy)
  • iam:AddRoleToInstanceProfile on eks-* -> implicitDeny (NOT in AmazonEKSServiceRolePolicy)
  • AllowedByOrganizations: true (SCPs not blocking)
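
The simulation above can be scripted roughly as follows. This is a sketch, not the exact command we ran: it assumes a boto3 IAM client is passed in, that the caller knows the ARN of the service-linked role, and that a small action list fits in one response page (truncation via `IsTruncated`/`Marker` is not handled). `simulate_actions` is a hypothetical helper name.

```python
def simulate_actions(iam, role_arn, actions, resource_arn):
    """Return {action: (decision, allowed_by_organizations)} for a principal."""
    response = iam.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=list(actions),
        ResourceArns=[resource_arn],
    )
    results = {}
    for item in response["EvaluationResults"]:
        org = item.get("OrganizationsDecisionDetail", {})
        results[item["EvalActionName"]] = (
            item["EvalDecision"],  # "allowed", "explicitDeny" or "implicitDeny"
            org.get("AllowedByOrganizations"),  # None if no org detail returned
        )
    return results


# Example call (ARNs are placeholders, fill in your own account and profile):
# import boto3
# simulate_actions(
#     boto3.client("iam"),
#     "arn:aws:iam::<account-id>:role/aws-service-role/eks.amazonaws.com/AWSServiceRoleForAmazonEKS",
#     ["iam:CreateInstanceProfile", "iam:AddRoleToInstanceProfile"],
#     "arn:aws:iam::<account-id>:instance-profile/<eks-profile-name>",
# )
```

A result of `("implicitDeny", True)` for `iam:AddRoleToInstanceProfile` would match what is reported here: no policy grants the action, and the SCPs are not the cause.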

Bug 2: After working around Bug 1 with a custom NodeClass that uses spec.instanceProfile (a pre-created profile with the role attached), InstanceProfileReady becomes True. However, ValidationSucceeded stays permanently at AwaitingReconciliation. The controller calls DescribeLaunchTemplates for hash-based names (eks.amazonaws.com/) that don't exist and receives InvalidLaunchTemplateName.NotFoundException. No CreateFleet or RunInstances calls are ever made.

We manually created dummy launch templates with the expected names. DescribeLaunchTemplates then succeeds (confirmed via CloudTrail), but the controller still does not progress.

NodeClass conditions:

  • InstanceProfileReady: True
  • SubnetsReady: True
  • SecurityGroupsReady: True
  • CapacityReservationsReady: True
  • ValidationSucceeded: Unknown (AwaitingReconciliation) — stuck indefinitely
  • Ready: Unknown
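
To pull the same picture out of a cluster quickly, a small helper along these lines can list every condition that is not yet True. It is only a sketch: it assumes the NodeClass has been dumped to a dict (for example from `kubectl get nodeclass custom -o json`) and that conditions follow the usual `type`/`status`/`reason` layout. The sample below mirrors the list above; `blocking_conditions` is a made-up name.

```python
def blocking_conditions(nodeclass):
    """Return (type, status, reason) for every condition whose status is not "True"."""
    conditions = nodeclass.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("status"), c.get("reason", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Sample built from the conditions reported in this issue.
sample = {
    "status": {
        "conditions": [
            {"type": "InstanceProfileReady", "status": "True"},
            {"type": "SubnetsReady", "status": "True"},
            {"type": "SecurityGroupsReady", "status": "True"},
            {"type": "CapacityReservationsReady", "status": "True"},
            {"type": "ValidationSucceeded", "status": "Unknown",
             "reason": "AwaitingReconciliation"},
            {"type": "Ready", "status": "Unknown"},
        ]
    }
}

print(blocking_conditions(sample))
# [('ValidationSucceeded', 'Unknown', 'AwaitingReconciliation'), ('Ready', 'Unknown', '')]
```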

Zero nodes. All pods Pending. Reproduced across 2 separate AWS Organizations with different management accounts and different SCPs.

Related: #8720, aws/containers-roadmap#2557

Expected Behavior:

  1. The controller should call AddRoleToInstanceProfile after CreateInstanceProfile (or AmazonEKSServiceRolePolicy should include this permission)
  2. The controller should create launch templates before attempting to describe/validate them
  3. Nodes should provision and pods should schedule

Reproduction Steps (Please include YAML):

  1. Create an EKS cluster with Auto Mode enabled
  2. Observe the default NodeClass — status shows InstanceProfileCreationFailed
  3. Check CloudTrail: CreateInstanceProfile succeeds, AddRoleToInstanceProfile never called
  4. Run aws iam simulate-principal-policy against AWSServiceRoleForAmazonEKS for iam:AddRoleToInstanceProfile; it returns implicitDeny
  5. Work around Bug 1 by applying a custom NodeClass with spec.instanceProfile:
     apiVersion: eks.amazonaws.com/v1
     kind: NodeClass
     metadata:
       name: custom
     spec:
       instanceProfile: <pre-created-profile-with-role-attached>
       subnetSelectorTerms:
         - tags:
             kubernetes.io/cluster/<cluster-name>: shared
       securityGroupSelectorTerms:
         - tags:
             kubernetes.io/cluster/<cluster-name>: owned
  6. InstanceProfileReady becomes True, but ValidationSucceeded stays at AwaitingReconciliation indefinitely
  7. Check CloudTrail: DescribeLaunchTemplates for eks.amazonaws.com/ returns NotFoundException repeatedly
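
For the pre-created profile used in the custom NodeClass step, something like the following could be used. It is a sketch under assumptions: a boto3 IAM client is passed in, the node role already exists, and the profile and role names are whatever you choose (none are given in this report). It performs the create-then-attach sequence that the controller appears to skip. `ensure_instance_profile` is a hypothetical helper name.

```python
def ensure_instance_profile(iam, profile_name, role_name):
    """Create the instance profile if missing and attach `role_name` to it.

    Returns True if the role had to be attached, False if it was already there.
    """
    try:
        profile = iam.get_instance_profile(InstanceProfileName=profile_name)
    except iam.exceptions.NoSuchEntityException:
        profile = iam.create_instance_profile(InstanceProfileName=profile_name)

    attached = [r["RoleName"] for r in profile["InstanceProfile"].get("Roles", [])]
    if role_name in attached:
        return False

    # An instance profile holds at most one role, so this call fails if a
    # different role is already attached; that case is not handled here.
    iam.add_role_to_instance_profile(
        InstanceProfileName=profile_name, RoleName=role_name
    )
    return True


# Example call (names are placeholders):
# import boto3
# ensure_instance_profile(boto3.client("iam"), "<profile-name>", "<node-role-name>")
```

The resulting profile name is what goes into spec.instanceProfile in the NodeClass above.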

Versions:

Chart Version: N/A (EKS Auto Mode — AWS-managed Karpenter on control plane)
Kubernetes Version: v1.31
EKS Platform Version: eks.latest
Region: us-east-1

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), triage/needs-information (Marks that the issue still needs more information to properly triage)
