calico_rr install role stops playbook cluster.yml prematurely when failures occur during retry iterations #13278

@MxFbk

What happened?

Hi all,

to trigger the calico_rr installation, we added the calico_rr group to our inventory.
All tasks needed to complete its installation were executed.

Inside roles/network_plugin/calico/rr/tasks/update-node.yml the block has retry logic, which succeeded on our side after 3 iterations.

The playbook cluster.yml sets the "any_errors_fatal" var to "true" if it is not defined:

... ... ... ...
- name: Install Calico Route Reflector
  hosts: calico_rr
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: network_plugin/calico/rr, tags: ['network', 'calico_rr'] }
... ... ... ...

I suspect that the failures that occurred during the iterations, even though they were successfully rescued, cause the whole playbook to stop prematurely because of the value of "any_errors_fatal", so none of the Kubernetes apps tasks are executed.

In this output extract you can see a successful recap, but no other tasks are triggered after the calico_rr install ends.
In the same extract you can see that the "retry_count" var is incremented to "2" (below the limit of 10) during the rescue of the block.
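To help isolate this outside of Kubespray, here is a minimal sketch of a playbook that combines "any_errors_fatal: true" with a block whose failure is rescued, followed by a second play. Everything in it (file name, play names, the simulated failing command) is hypothetical and not taken from Kubespray. It only exercises the suspected interaction on a single host, so it may not reproduce a failure that happens on some hosts of a larger group, and the result may depend on the ansible-core version.

```yaml
# repro.yml: hypothetical minimal playbook, NOT part of Kubespray.
# Run with: ansible-playbook repro.yml
- name: Play with a rescued failure and any_errors_fatal enabled
  hosts: localhost
  gather_facts: false
  any_errors_fatal: true
  tasks:
    - name: Block whose failure is handled by rescue
      block:
        - name: Simulated failing task (assumption, stands in for the real task)
          ansible.builtin.command: /bin/false
      rescue:
        - name: Rescue section, reached after the failure above
          ansible.builtin.debug:
            msg: "failure rescued"

- name: Follow-up play (stands in for the kubernetes-apps plays)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Reached only if the previous play did not end the run
      ansible.builtin.debug:
        msg: "second play reached"
```

If the second play is skipped here as well, that would point at the interaction between rescued failures and "any_errors_fatal" rather than at the calico_rr role itself.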
extract_calico_rr.txt

We aren't executing the playbook with any tags or skips, and after removing the calico_rr group from the inventory all Kubernetes apps were installed successfully.

Could you help us?

Thanks

Massimiliano

What did you expect to happen?

We expect calico_rr to be installed and the kubernetes-apps, such as metallb, to be enabled too.

We are asking whether, for this specific case, the default value "true" for "any_errors_fatal" is really needed, or whether it would be better to add a check that verifies if the failures are real or not.
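Since the play reads the value as "{{ any_errors_fatal | default(true) }}", one thing that could be tried in the meantime is overriding the variable through extra vars. The sketch below is an assumption, not a verified fix: the file name is hypothetical, we have not confirmed that it changes the behavior described above, and it would also disable the fail-fast behavior for every other play that uses the same expression.

```yaml
# override.yml: hypothetical extra-vars file, passed on the command line with
#   ansible-playbook -i $inventory --become --become-user root -e @override.yml <playbook>
# Extra vars take precedence over the default(true) fallback used in cluster.yml.
any_errors_fatal: false
```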

How can we reproduce it (as minimally and precisely as possible)?

In our case the issue happens just by trying to install calico_rr (adding the group to the inventory) and adding "metallb_enabled: true" to the play vars.
metallb is not installed at all if the execution of the task "network_plugin/calico/rr/tasks/update-node.yml:34" fails on one or more hosts of the target group.

OS

RHEL 9

Version of Ansible

ansible [core 2.18.16]
config file = None
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/python3_12_0/venvs/kubespray_main/lib64/python3.12/site-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = ./ansible
python version = 3.12.1 (main, Nov 25 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-4)] (/opt/python3_12_0/venvs/kubespray_main/bin/python3)
jinja version = 3.1.6
libyaml = True

Version of Python

Python 3.12.1

Version of Kubespray (commit)

2.31.0

Network plugin used

calico

Full inventory with variables

---
#####################################
upgrade_infra: false
reset_infra: false
#####################################

ansible_user: k8sansible

ansible_ssh_private_key_file: ./secrets/.ssh/k8sansiblersa

ansible_timeout: 120
ansible_become_timeout: 120
ansible_ssh_extra_args: >-
  -o ServerAliveInterval=30
  -o ServerAliveCountMax=10
  -o StrictHostKeyChecking=no
  -o ControlMaster=auto
  -o ControlPersist=30m
  -o ControlPath=/tmp/ansible-ssh-%h-%p-%r

# Debug var
unsafe_show_logs: true
################################

## KUBESPRAY vars
kubespray:
  python:
    version: "3.12.0"

download_run_once: true
download_localhost: true
download_force_cache: true # MUST be set to TRUE to use ansible CONTROLLER as CACHE
download_cache_dir: /opt/gitlab-runner/ansible/kubespray_cache

download_container: true

kube_image_repo: acr.azurecr.io/k8s
quay_image_repo: acr.azurecr.io/quayiok8s
docker_image_repo: acr.azurecr.io/dockeriok8s

################################

## K8S vars
bin_dir: /usr/bin

kube_version: 1.35.0
kube_network_plugin: calico
kube_log_level: 2
disable_ipv6: true
disable_ipv6_dns: true
disable_selinux: true
etcd_deployment_type: host
k8s_image_pull_policy: IfNotPresent

# Container management
container_manager: containerd
containerd_storage_dir: /opt/containerd/images
containerd_state_dir: /opt/containerd/state

containerd_registries_mirrors:
  - prefix: acr.azurecr.io
    mirrors:
      - host: https://acr.azurecr.io
        capabilities: ["pull", "resolve"]
        skip_verify: false
        header:
          Authorization: "Basic *************************************************************************************"

################################

## DNS CONFIGURATION
# NO CHANGES MUST be APPLIED to the nodes' /etc/resolv.conf because the company DNS servers cannot be overridden.
# They ARE MANDATORY for all commands that rely on lookups to AD and DS for authentication and authorization of nodes and users.
resolvconf_mode: none

# Upstream DNS servers for early cluster deployment and fallback
# If an infrastructure service (outside of cluster) are defined through FQDN somewhere
# following DNS will be the only one used, otherwise timeout.
upstream_dns_servers:
  - 172.17.8.105

# CoreDNS static host entries for Azure ACR using dns_etchosts
# This is used by both CoreDNS and NodeLocalDNS
# dns_etchosts: |
#   10.163.68.69 acr.azurecr.io
#   10.163.68.68 acrfbkpronpci01.germanywestcentral.data.azurecr.io

# /etc/hosts custom entries for static host resolutions (for node-level resolution)
custom_etc_hosts:
  # Azure ACR - static IP resolution via /etc/hosts (Private Endpoint IP)
  - domain: "acr.azurecr.io"
    ip: "10.163.68.69"

################################

## ADDONS
dashboard_enabled: true
metrics_server_enabled: true
helm_enabled: true
cert_manager_enabled: true
# Override cert-manager image repo (default: quay.io/jetstack)
jetstack_image_repo: acr.azurecr.io/quayiok8s/jetstack
################################

## METALLB configuration
# EXAMPLE: https://github.com/TayoG/Kubernetes-kubespray/blob/master/docs/metallb.md
kube_proxy_strict_arp: true
metallb_enabled: true
metallb_speaker_enabled: true
metallb_namespace: metallb
metallb_config:
  address_pools:
    primary:
      ip_range:
        - 172.17.253.8-172.17.253.10
      metallb_auto_assign: true
  layer2:
    - primary

# Enable InPlacePodVerticalScaling
kube_feature_gates:
  - InPlacePodVerticalScaling=true

Command used to invoke ansible

ansible-playbook -vv -i $inventory --become --become-user root $wd/../ansible/kubespray_setup.yml

Output of ansible run

Only the final tasks are included, because the whole play output could not be uploaded, even when compressed.

extract_calico_rr.txt

Anything else we need to know

No response

Labels: RHEL 9, kind/bug