Description
Ask your question here:
Hello,
I'm setting up an infrastructure based on scale-to-zero, and therefore scale-from-zero as well.
To do this, we're using the well-known Cluster Autoscaler, coupled with Cluster API (specifically the MachineDeployment resource with a few annotations).
Node scaling itself works fine.
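For context, here is a minimal sketch of the kind of MachineDeployment annotations I mean (the min/max-size annotation keys are the ones documented for the Cluster API provider of the Cluster Autoscaler; the name and values are illustrative, matching the node group shown in the scale-up event further down, not my real manifest):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: gpu-nodes
  annotations:
    # allow the Cluster Autoscaler to manage this node group and scale it from zero
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "30"
# (rest of the MachineDeployment spec omitted)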
For the moment, I'm just trying to create an "autoscale-go" Knative Service on the cluster while no suitable node is available.
The pod is then "Pending", which is expected.
NAME                                             READY   STATUS    RESTARTS   AGE
user-service-00001-deployment-6f6d577c45-rtjvz   0/2     Pending   0          1m32s
Here is the configuration I used to create the service:
apiVersion: v1
kind: Namespace
metadata:
  name: 6d2ef157
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: user-service
  namespace: 6d2ef157
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/max-scale: "10"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/scale-down-delay: "15m"
        autoscaling.knative.dev/window: "240s"
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1800s"
      creationTimestamp: null
    spec:
      containerConcurrency: 50
      containers:
      - env:
        - name: TARGET
          value: Sample
        image: ghcr.io/knative/autoscale-go:latest
        name: app
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources:
          limits:
            cpu: "12"
            memory: 78Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "12"
            memory: 78Gi
            nvidia.com/gpu: "1"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - CAP_SYS_ADMIN
          runAsNonRoot: true
          runAsUser: 1000
          seccompProfile:
            type: RuntimeDefault
      enableServiceLinks: false
      nodeSelector:
        nvidia.com/gpu.count: "1"
        nvidia.com/gpu.product: NVIDIA-GeForce-RTX-2080-Ti
      runtimeClassName: nvidia
      timeoutSeconds: 1800
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
  traffic:
  - latestRevision: true
    percent: 100
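I apply the manifest and watch the pods with something along these lines (command sketch; the file name is just illustrative):

kubectl apply -f user-service.yaml
kubectl get pods -n 6d2ef157 -w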
After a few minutes, the pod is still Pending, but we get an event showing that the Cluster Autoscaler has been triggered:
Normal TriggeredScaleUp 2m16s cluster-autoscaler pod triggered scale-up: [{MachineDeployment/gpu-nodes 0->1 (max: 30)}]
Once the node is available, the pod is created and becomes Running:
NAME                                             READY   STATUS    RESTARTS   AGE
user-service-00001-deployment-6f6d577c45-rtjvz   2/2     Running   0          6m7s
However, the service is not ready, and the revision never becomes ready.
NAME           URL                               LATESTCREATED        LATESTREADY   READY   REASON
user-service   http://6d2ef157.some.domain.net   user-service-00001                 False   RevisionMissing
NAME                 CONFIG NAME    K8S SERVICE NAME   GENERATION   READY   REASON          ACTUAL REPLICAS   DESIRED REPLICAS
user-service-00001   user-service                      1            False   Unschedulable   1                 0
These are the events I get from the revision:
Warning InternalError 7m29s revision-controller failed to update deployment "user-service-00001-deployment": Operation cannot be fulfilled on deployments.apps "user-service-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
Warning InternalError 7m29s revision-controller failed to update PA "user-service-00001": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "user-service-00001": the object has been modified; please apply your changes to the latest version and try again
The PodAutoscaler resource is not ready, and the DesiredScale is 0.
NAME                 DESIREDSCALE   ACTUALSCALE   READY   REASON
user-service-00001   0              1             False   NoTraffic
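For reference, that listing comes from something like this (command sketch):

kubectl get podautoscaler -n 6d2ef157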
The status of the PodAutoscaler resource:
Status:
  Actual Scale:  1
  Conditions:
    Last Transition Time:  2024-02-23T16:32:02Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Status:                False
    Type:                  Active
    Last Transition Time:  2024-02-23T16:32:02Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-23T16:38:03Z
    Status:                True
    Type:                  SKSReady
    Last Transition Time:  2024-02-23T16:32:02Z
    Status:                True
    Type:                  ScaleTargetInitialized
  Desired Scale:         0
  Metrics Service Name:  user-service-00001-private
  Observed Generation:   2
  Service Name:          user-service-00001
I also get error logs from the autoscaler pod:
{"severity":"ERROR","timestamp":"2024-02-23T15:55:24.847361414Z","logger":"autoscaler","caller":"podautoscaler/reconciler.go:314","message":"Returned an error","commit":"239b73e","knative.dev/controller":"knative.dev.serving.pkg.reconciler.autoscaling.kpa.Reconciler","knative.dev/kind":"autoscaling.internal.knative.dev.PodAutoscaler","knative.dev/traceid":"2c39855d-329c-43a0-99a9-204f4944e4af","knative.dev/key":"3010eb09/user-service-00001","targetMethod":"ReconcileKind","error":"error scaling target: failed to get scale target {Deployment user-service-00001-deployment apps/v1 }: error fetching Pod Scalable 3010eb09/user-service-00001-deployment: deployments.apps \"user-service-00001-deployment\" not found","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/autoscaling/v1alpha1/podautoscaler.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/autoscaling/v1alpha1/podautoscaler/reconciler.go:314\nmain.(*leaderAware).Reconcile\n\tknative.dev/serving/cmd/autoscaler/leaderelection.go:44\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/[email protected]/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/[email protected]/controller/controller.go:491"}
{"severity":"ERROR","timestamp":"2024-02-23T15:55:24.847442144Z","logger":"autoscaler","caller":"controller/controller.go:566","message":"Reconcile error","commit":"239b73e","knative.dev/controller":"knative.dev.serving.pkg.reconciler.autoscaling.kpa.Reconciler","knative.dev/kind":"autoscaling.internal.knative.dev.PodAutoscaler","knative.dev/traceid":"2c39855d-329c-43a0-99a9-204f4944e4af","knative.dev/key":"3010eb09/user-service-00001","duration":"787.035µs","error":"error scaling target: failed to get scale target {Deployment user-service-00001-deployment apps/v1 }: error fetching Pod Scalable 3010eb09/user-service-00001-deployment: deployments.apps \"user-service-00001-deployment\" not found","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/[email protected]/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/[email protected]/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/[email protected]/controller/controller.go:491"}
The PodAutoscaler resource:
spec:
  containerConcurrency: 50
  protocolType: http1
  reachability: Unreachable
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service-00001-deployment
I then manually changed reachability from "Unreachable" to "" and desiredScale from 0 to 1, and the revision and service became ready:
NAME                 CONFIG NAME    K8S SERVICE NAME   GENERATION   READY   REASON   ACTUAL REPLICAS   DESIRED REPLICAS
user-service-00001   user-service                      1            True             1                 1
NAME           URL                               LATESTCREATED        LATESTREADY          READY   REASON
user-service   http://6d2ef157.some.domain.net   user-service-00001   user-service-00001   True
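For clarity, those manual changes are roughly equivalent to the following patches (a sketch only; the field paths are taken from the resource above, and patching desiredScale touches the status subresource, so it needs a kubectl recent enough to support --subresource):

kubectl patch podautoscaler user-service-00001 -n 6d2ef157 --type merge \
  -p '{"spec":{"reachability":""}}'
kubectl patch podautoscaler user-service-00001 -n 6d2ef157 --subresource=status --type merge \
  -p '{"status":{"desiredScale":1}}'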
The configuration I tried
I started playing with the configuration in an attempt to find the parameter that would unlock everything, but without success. Please note that the values are intentionally exaggerated to try to highlight a pattern.
config-autoscaler:
apiVersion: v1
data:
  allow-zero-initial-scale: "true"
  enable-scale-to-zero: "true"
  initial-scale: "0"
  scale-down-delay: 15m
  scale-to-zero-grace-period: 1800s
  scale-to-zero-pod-retention-period: 1800s
  stable-window: 360s
  target-burst-capacity: "211"
  window: 240s
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
config-deployment:
apiVersion: v1
data:
  progress-deadline: 3600s
  queue-sidecar-image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:d569f30abd31cbe105ba32b512a321dd82431b0a8e205bebf14538fddb4dfa54
  queueSidecarImage: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:9b8dad0630029dfcab124e6b4fa7c8e39b453249f0b31282c48e008bfc16faa3
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
config-defaults:
apiVersion: v1
data:
  max-revision-timeout-seconds: "3600"
  revision-response-start-timeout-seconds: "1800"
  revision-timeout-seconds: "1800"
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving
The problem I'm facing
I'm not sure what I'm doing wrong. It looks as if the revision is not being reconciled, but I'm not certain.
The pod is running and the Kubernetes service is created, but the revision never becomes ready, which is why the Knative service is not ready, and that's a bit of a mystery.
Could you please help me understand what is wrong with my configuration?