Intermittent Hangs at crane.Push() on Registry Push #2104

@ranimbal

Description

Environment

Device and OS: Rocky 8 EC2
App version: 0.29.2
Kubernetes distro being used: RKE2 v1.26.9+rke2r1
Other: Bigbang v2.11.1

Steps to reproduce

  1. zarf package deploy zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst --confirm -l=debug
  2. Roughly 80% of the time, the above command hangs at crane.Push(). A retry usually works.

Expected result

The zarf package deploy... command should not hang; it should run to completion.

Actual Result

The zarf package deploy... command hangs at crane.Push() and never returns.

Visual Proof (screenshots, videos, text, etc)

  DEBUG  2023-10-23T18:37:19Z  -  Pushing ...1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3
  DEBUG  2023-10-23T18:37:19Z  -  crane.Push() /tmp/zarf-3272389118/images:registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3 -> 127.0.0.1:39357/ironbank/neuvector/neuvector/manager:5.1.3-zarf-487612511)
section_end:1698087620:step_script
ERROR: Job failed: execution took longer than 35m0s seconds

Severity/Priority

There is a workaround: retry the deploy until it succeeds.
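
Since a hung deploy never exits on its own, the retry workaround needs a per-attempt timeout to be automated. Here is a minimal sketch in Python of how that could look; it is not part of Zarf. The helper name `run_with_retries`, its parameters, and the limits in the commented example are assumptions for illustration, picked so that a single attempt stays well under the 35-minute CI job limit. It does not check whether killing a deploy mid-push leaves the cluster or registry in a clean state.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, timeout_s=600, pause_s=5.0):
    """Run cmd; kill and retry it if it fails or exceeds timeout_s.

    Hypothetical helper, not a Zarf feature.
    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails or times out.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return attempt
            reason = f"exit code {result.returncode}"
        except subprocess.TimeoutExpired:
            reason = f"no exit within {timeout_s}s (assumed hung)"
        print(f"attempt {attempt}/{attempts} failed: {reason}")
        if attempt < attempts:
            time.sleep(pause_s)
    raise RuntimeError(f"command failed after {attempts} attempts")


# Example call, using the command from the reproduction steps.
# The attempt count and timeout are assumed values, not recommendations:
#
# run_with_retries(
#     ["zarf", "package", "deploy",
#      "zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst",
#      "--confirm", "-l=debug"],
#     attempts=5,
#     timeout_s=900,
# )
```

This only hides the symptom. The hang in crane.Push() itself still needs a fix.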

Additional Context

This looks exactly like #1568, which was closed.

We have a multi-node cluster on AWS EC2, and our package size is about 2.9G. Here are a few things we noticed after extensive testing:

  • This issue is not seen on a single-node EC2 RKE2 cluster; it seems to occur only on multi-node clusters.
  • Our zarf docker registry is backed by S3. The issue is always seen with this setup, but only on a multi-node cluster.
  • If we back the registry with the default PVC (instead of S3), the issue is not seen at all. Since data transfer to S3 is slower than to the EBS-backed PVC, maybe this extra time causes the problem to appear?
  • Disabling or enabling the zarf docker registry HPA doesn't seem to matter either way.

Metadata

Labels

bug 🐞Something isn't working

Projects

Status

Done
