
feat: expose AMI cache TTL as runtime flag#9052

Closed
chrisdoherty4 wants to merge 1 commit into aws:main from chrisdoherty4:cpd-ami-cache-requeue-01

Conversation

@chrisdoherty4

@chrisdoherty4 chrisdoherty4 commented Apr 3, 2026

Fixes #N/A

Description

Operators running large fleets (15,000 nodes across 50+ clusters) with tens of node classes can generate significant DescribeImages API call volume, because the reconciler requeues periodically (on the order of 30s-1m) and uses a hardcoded 1-minute cache TTL. This change makes the cache TTL independently configurable so users can choose an appropriate AMI cache time for their use case:

| Flag | Env var | Default |
| --- | --- | --- |
| `--ami-cache-ttl` | `AMI_CACHE_TTL` | `1m` |

Defaults preserve existing behavior.

How was this change tested?

  • Unit tests added to pkg/operator/options/suite_test.go covering CLI
    flag override and env var fallback, and validation rejection of non-positive values.
  • All existing unit tests pass.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

@chrisdoherty4 chrisdoherty4 requested a review from a team as a code owner April 3, 2026 02:52
@chrisdoherty4 chrisdoherty4 requested a review from ryan-mist April 3, 2026 02:52
@chrisdoherty4 chrisdoherty4 marked this pull request as draft April 3, 2026 14:17
@chrisdoherty4
Author

chrisdoherty4 commented Apr 3, 2026

Looking deeper, it seems a handful of reconcilers set a shorter TTL than the minimum requeue time for the AMI reconciler, making the --ami-requeue-interval flag rather useless.

The cache TTL configurability does help reduce the API calls, so that still feels like a worthwhile configuration option; longer cache windows are acceptable in our case.

@chrisdoherty4 chrisdoherty4 marked this pull request as ready for review April 3, 2026 16:43
@chrisdoherty4 chrisdoherty4 force-pushed the cpd-ami-cache-requeue-01 branch from a25243a to d3b7986 Compare April 3, 2026 19:51
@chrisdoherty4
Author

chrisdoherty4 commented Apr 3, 2026

Modified the PR to only expose the AMI cache TTL. Being able to tweak this for our use case greatly reduces API call volume and avoids hitting rate limits.

@chrisdoherty4 chrisdoherty4 changed the title feat: expose AMI cache TTL and requeue interval as runtime flags feat: expose AMI cache TTL as runtime flags Apr 7, 2026
@chrisdoherty4 chrisdoherty4 changed the title feat: expose AMI cache TTL as runtime flags feat: expose AMI cache TTL as runtime flag Apr 7, 2026
Operators running large fleets can generate significant DescribeImages
API call volume due to frequent AMI reconciles. This change makes the
AMI cache TTL configurable so operators can tune it for their workload
without rebuilding.

  --ami-cache-ttl        (env: AMI_CACHE_TTL,        default: 1m)

Defaults preserve existing behaviour.
@chrisdoherty4 chrisdoherty4 force-pushed the cpd-ami-cache-requeue-01 branch from d3b7986 to a1f37c7 Compare April 7, 2026 16:27
@DerekFrank
Copy link
Copy Markdown
Contributor

DerekFrank commented Apr 7, 2026

We generally avoid surfacing too much config if we can avoid it. What were you going to set this to? We might just raise the default; 1m seems a bit low.

@chrisdoherty4
Author

We generally avoid surfacing too much config if we can avoid it. What were you going to set this to? We might just raise the default; 1m seems a bit low.

Either 15m or 1h. We haven't decided, and the flexibility is what would let us tweak things. I'm curious what problems there are with surfacing the configuration, assuming it's a sane default and well documented?

@chrisdoherty4
Author

chrisdoherty4 commented Apr 8, 2026

When I ran this patch, I found that DescribeSubnets and DescribeSecurityGroups calls went up by 7x and 3.5x respectively. That likely isn't acceptable to us either. I'm still trying to determine why.

Turns out this seems to be a regression somewhere between 1.8.1 and 1.10. The jumps here are when I deployed v1.10.

[image: API call volume before and after deploying v1.10]

Opened #9063

@jmdeal
Contributor

jmdeal commented Apr 9, 2026

I wanted to float an alternative approach to solving this issue that we've discussed internally. We're hesitant to expose cache TTL configurations directly for a couple of reasons:

  • Karpenter's caching logic is an internal implementation detail, and the exact way it works is subject to change from version to version. Knowing what value to tweak requires an understanding of Karpenter's internal caching logic.
  • Some cache TTLs are dependent on one another; tweaking one without understanding its relation to others could cause subtle issues.

An alternative we could consider is surfacing per-API client side rate-limit buckets as a configuration. I believe this more directly addresses the core issue - limiting the impact of individual Karpenter controllers - while also not exposing internal implementation details. All internal reconcilers need to be tolerant to rate limiting whether it's from the client or from the server.
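To make the per-API bucket idea concrete, here is a hedged sketch of client-side rate limiting with a hand-rolled token bucket keyed by operation name. The `bucket` type and the `DescribeImages` keying are illustrative assumptions; a real implementation would more likely wrap the AWS SDK with something like golang.org/x/time/rate:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is a minimal token bucket: it holds up to cap tokens and
// refills at rate tokens per second. (Illustrative only.)
type bucket struct {
	mu     sync.Mutex
	tokens float64
	cap    float64
	rate   float64
	last   time.Time
}

func newBucket(capacity, perSecond float64) *bucket {
	return &bucket{tokens: capacity, cap: capacity, rate: perSecond, last: time.Now()}
}

// Allow reports whether a call may proceed now, consuming one token if so.
// A reconciler that is denied would simply requeue and retry later.
func (b *bucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.cap {
		b.tokens = b.cap
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// One bucket per AWS API operation, surfaced as operator config.
	limits := map[string]*bucket{
		"DescribeImages": newBucket(5, 1), // burst of 5, then 1 call/s
	}
	allowed := 0
	for i := 0; i < 10; i++ {
		if limits["DescribeImages"].Allow() {
			allowed++
		}
	}
	fmt.Println("allowed:", allowed) // the initial burst passes; the rest are throttled
}
```

This keeps caching an internal detail: operators bound each API's call rate directly, and every reconciler must already tolerate being throttled, whether by the client or the server.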

@chrisdoherty4
Author

I wanted to float an alternative approach to solving this issue that we've discussed internally. We're hesitant to expose cache TTL configurations directly for a couple of reasons:

  • Karpenter's caching logic is an internal implementation detail, and the exact way it works is subject to change from version to version. Knowing what value to tweak requires an understanding of Karpenter's internal caching logic.
  • Some cache TTLs are dependent on one another; tweaking one without understanding its relation to others could cause subtle issues.

An alternative we could consider is surfacing per-API client side rate-limit buckets as a configuration. I believe this more directly addresses the core issue - limiting the impact of individual Karpenter controllers - while also not exposing internal implementation details. All internal reconcilers need to be tolerant to rate limiting whether it's from the client or from the server.

Hi @jmdeal. Expressing this as client side rate limiting would work for us, thanks.

@ryan-mist
Contributor

Hi @chrisdoherty4,

Just checking back in on this - is this something you'd be interested in working on? If not then we can also work on this on our side. Thanks!

@chrisdoherty4
Author

@ryan-mist I'm not planning to implement anything.

