From 87cdc4b9fc6bf6fb0c2ca81cb966c5927b54e4e1 Mon Sep 17 00:00:00 2001 From: Sebastian Bernauer Date: Thu, 5 Dec 2024 11:44:31 +0100 Subject: [PATCH 1/6] Add concepts page on temporary credentials lifetime Co-authored-by: Razvan-Daniel Mihai --- modules/concepts/nav.adoc | 1 + modules/concepts/pages/index.adoc | 2 +- .../pages/operations/cluster_operations.adoc | 2 +- modules/concepts/pages/operations/index.adoc | 10 ++- .../temporary_credentials_lifetime.adoc | 88 +++++++++++++++++++ 5 files changed, 97 insertions(+), 6 deletions(-) create mode 100644 modules/concepts/pages/operations/temporary_credentials_lifetime.adoc diff --git a/modules/concepts/nav.adoc b/modules/concepts/nav.adoc index 083ea63e7..d8c9a0c45 100644 --- a/modules/concepts/nav.adoc +++ b/modules/concepts/nav.adoc @@ -20,6 +20,7 @@ *** xref:operations/pod_disruptions.adoc[] *** xref:operations/pod_placement.adoc[] *** xref:operations/graceful_shutdown.adoc[] +*** xref:operations/temporary_credentials_lifetime.adoc[] ** Observability *** xref:labels.adoc[] *** xref:logging.adoc[] diff --git a/modules/concepts/pages/index.adoc b/modules/concepts/pages/index.adoc index 44eccd3a1..ad3da9a8c 100644 --- a/modules/concepts/pages/index.adoc +++ b/modules/concepts/pages/index.adoc @@ -30,7 +30,7 @@ It also includes xref:tls-server-verification.adoc[]. == Operations The xref:operations/index.adoc[operations] section is directed at platform maintainers. -It covers xref:operations/cluster_operations.adoc[starting, stopping and restarts] of products, xref:operations/graceful_shutdown.adoc[] and other topics related to maintenance and ensuring stability of the platform operation. +It covers xref:operations/cluster_operations.adoc[starting, stopping and restarts] of products, xref:operations/graceful_shutdown.adoc[] and other topics related to maintenance and ensuring stability and availability of the platform operation. 
== Observability diff --git a/modules/concepts/pages/operations/cluster_operations.adoc b/modules/concepts/pages/operations/cluster_operations.adoc index 95ceb3ac5..2786eeda5 100644 --- a/modules/concepts/pages/operations/cluster_operations.adoc +++ b/modules/concepts/pages/operations/cluster_operations.adoc @@ -123,4 +123,4 @@ You can add more labels to make finer grained restarts. == Automatic restarts The Commons Operator of the Stackable Platform may restart Pods automatically, for purposes such as ensuring that TLS certificates are up-to-date. -For details, see the xref:commons-operator:index.adoc[Commons Operator documentation]. +For details, see xref:operations/temporary_credentials_lifetime.adoc[] as well as the xref:commons-operator:index.adoc[Commons Operator documentation]. diff --git a/modules/concepts/pages/operations/index.adoc b/modules/concepts/pages/operations/index.adoc index 9f16192ab..1c650da46 100644 --- a/modules/concepts/pages/operations/index.adoc +++ b/modules/concepts/pages/operations/index.adoc @@ -11,7 +11,7 @@ Make sure to go through the following checklist to achieve the maximum level of 1. Make setup highly available (HA): In case the product supports running in an HA fashion, our operators will automatically configure it for you. You only need to make sure that you deploy a sufficient number of replicas. Please note that some products don't support HA. -2. Reduce the number of simultaneous pod disruptions (unavailable replicas). +2. Reduce the number of simultaneous pod disruptions (unavailable replicas): The Stackable operators write defaults based upon knowledge about the fault tolerance of the product, which should cover most of the use-cases. For details have a look at xref:operations/pod_disruptions.adoc[]. 3. 
Reduce impact of pod disruptions: @@ -19,13 +19,15 @@ Make sure to go through the following checklist to achieve the maximum level of The flow is as follows: Kubernetes wants to shut down the Pod and calls a hook into the Pod, which in turn interacts with the product, signaling it to gracefully shut down. The final deletion of the Pod is then blocked until the product has successfully migrated running workloads away from the Pod that is to be shut down. Details covering the graceful shutdown mechanism are described in xref:operations/graceful_shutdown.adoc[] as well as the actual operator documentation. -+ -WARNING: Graceful shutdown is not implemented for all products yet. Please check the documentation specific to the product operator to see if it is supported (such as e.g. xref:trino:usage-guide/operations/graceful-shutdown.adoc[the documentation for Trino]. - 4. Spread workload across multiple Kubernetes nodes, racks, datacenter rooms or datacenters to guarantee availability in the case of e.g. power outages or fire in parts of the datacenter. All of this is supported by configuring an https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/[antiAffinity] as documented in xref:operations/pod_placement.adoc[] +5. Reduce the frequency of disruptions: + Although we try our best to reduce the impact of disruptions, some tools simply don't support HA setups. + One example is the Trino coordinator - if you restart it, all running queries will fail. + Many products use temporary credentials (such as TLS certificates), which have a short lifetime by default for maximum security. + Please read on xref:operations/temporary_credentials_lifetime.adoc[] on how you can increase the lifetime of this temporary credentials too avoid frequent restarts. 
== Maintenance actions diff --git a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc new file mode 100644 index 000000000..c3d083e47 --- /dev/null +++ b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc @@ -0,0 +1,88 @@ += Temporary credentials lifetime +:description: Customize the lifetime of temporary credentials. + +== Usages + +=== TLS certificates + +Currently the only temporary credentials are TLS certificates. + +Many products use TLS to secure the communications, often times customers use the xref:secret-operator:secretclass.adoc#backend-autotls[secret-operator autoTls] backend to create TLS certificates for the Pods on the fly. +For maximum security these temporary credentials have a short lifetime by default, which will result in e.g. your Trino coordinator restarting every ~24 hours (minus some safety buffer) to avoid using expired certificates. + +== Configure the lifetime + +In high load production environments, restarting Pods can be a costly operation, as it can disrupt services and in some cases even lead to data loss. +To avoid frequent restarts, the lifetime of all temporary credentials (such as the TLS certificates) can be increased as needed. + +Here is an example for configuring the temporary credentials lifetime to 7 days in a HDFS stacklet. +It should result in the HDFS Pods restarting weekly instead of daily: + +[source,yaml] +---- +--- +apiVersion: hdfs.stackable.tech/v1alpha1 +kind: HdfsCluster +metadata: + name: hdfs +spec: + nameNodes: + config: + requestedSecretLifetime: 7d # <1> + roleGroups: + default: + replicas: 2 + dataNodes: + config: + requestedSecretLifetime: 7d # <2> + roleGroups: + default: + replicas: 2 + journalNodes: + roleGroups: + default: + replicas: 3 + config: + requestedSecretLifetime: 7d # <3> +---- +<1> The lifetime of the TLS certificates for *all* NameNode roleGroups is set to 7 days. 
+<2> The lifetime of the TLS certificates for *all* DataNode roleGroups is set to 7 days. +<3> The lifetime of the TLS certificates for the `default` JournalNode group is set to 7 days. + +NOTE: The configuration for the JournalNodes is done at roleGroup level for demonstration purposes. + +=== TLS certificate lifetimes + +Even though operators allow setting this property to a value of your choice, the xref:secret-operator:index.adoc[secret-operator] will not exceed the `maxCertificateLifetime` value specified in SecretClass creating the TLS certificates (see xref:secret-operator/secretclass.adoc#certificate_lifetime). + +=== Operators supporting the lifetime configuration + +Similar to the example above, users can configure the lifetime of temporary credentials for the following operators: + +* Apache Druid +* Apache Hadoop +* Apache HBase +* Apache NiFi +* Apache Spark +* Apache Zookeeper +* Trino + +== Check the lifetime + +Pods are normally not restarted "randomly" by Stackable operators. +Instead, when a temporary credential is added to a Pod, an annotation is added as well. +It starts with `restarter.stackable.tech/expires-at.` and instructs the xref:commons-operator:index.adoc[commons-operator] to restart the Pod once the specified instant is reached. + +Given the following Pod + +[source,yaml] +---- +kind: Pod +metadata: + annotations: + restarter.stackable.tech/expires-at.b887492af14bfe84f10cb2ff1b60acb0: "2024-12-05T14:03:47.131570189+00:00" + restarter.stackable.tech/expires-at.ea77192c1184326d33e8ee32cfe921ea: "2024-12-05T15:49:10.043722965+00:00" +---- + +You can always determine the timestamp the Pod will be restarted by the xref:commons-operator:index.adoc[commons-operator] by taking the earliest timestamp, `2024-12-05T14:03:47.131570189+00:00` in this case. +This way you can verify that the changes you made to the temporary credentials lifetime take effect. 
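The cap described under "TLS certificate lifetimes" in this patch can be sketched as below. This is only an illustration, assuming the effective lifetime is the smaller of the requested value and the SecretClass maximum. `parse_duration` and `effective_lifetime` are hypothetical helpers rather than part of any Stackable API, and the `15d` maximum in the usage note is a made-up value.

```python
import re
from datetime import timedelta

# Hypothetical helper: understands simple durations such as "7d", "24h" or "1d12h".
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text: str) -> timedelta:
    """Parse a duration string made of <number><unit> pairs (units: d, h, m, s)."""
    parts = re.findall(r"(\d+)([dhms])", text)
    # Reject input that contains anything besides the recognised pairs.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def effective_lifetime(requested: str, max_lifetime: str) -> timedelta:
    """Return the requested lifetime, capped at the assumed SecretClass maximum."""
    return min(parse_duration(requested), parse_duration(max_lifetime))
```

Under these assumptions, `effective_lifetime("7d", "15d")` yields seven days, while a request of `30d` against the same maximum would be capped at fifteen days.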
From 9e2d7c11a782e298ebb6a074517423472b46c137 Mon Sep 17 00:00:00 2001 From: Sebastian Bernauer Date: Thu, 5 Dec 2024 12:13:13 +0100 Subject: [PATCH 2/6] Apply suggestions from code review Co-authored-by: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com> --- modules/concepts/pages/operations/index.adoc | 2 +- .../pages/operations/temporary_credentials_lifetime.adoc | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/modules/concepts/pages/operations/index.adoc b/modules/concepts/pages/operations/index.adoc index 1c650da46..949c75b6f 100644 --- a/modules/concepts/pages/operations/index.adoc +++ b/modules/concepts/pages/operations/index.adoc @@ -27,7 +27,7 @@ Make sure to go through the following checklist to achieve the maximum level of Although we try our best to reduce the impact of disruptions, some tools simply don't support HA setups. One example is the Trino coordinator - if you restart it, all running queries will fail. Many products use temporary credentials (such as TLS certificates), which have a short lifetime by default for maximum security. - Please read on xref:operations/temporary_credentials_lifetime.adoc[] on how you can increase the lifetime of this temporary credentials too avoid frequent restarts. + The xref:operations/temporary_credentials_lifetime.adoc[] page describes how you can increase the lifetime of these temporary credentials to avoid frequent restarts. == Maintenance actions diff --git a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc index c3d083e47..b675721e3 100644 --- a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc +++ b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc @@ -8,7 +8,7 @@ Currently the only temporary credentials are TLS certificates. 
Many products use TLS to secure the communications, often times customers use the xref:secret-operator:secretclass.adoc#backend-autotls[secret-operator autoTls] backend to create TLS certificates for the Pods on the fly. -For maximum security these temporary credentials have a short lifetime by default, which will result in e.g. your Trino coordinator restarting every ~24 hours (minus some safety buffer) to avoid using expired certificates. +To increase security, these temporary credentials have a short lifetime by default, which will result in e.g. Trino coordinator Pods restarting every ~24 hours (minus some safety buffer) to avoid using expired certificates. == Configure the lifetime @@ -55,7 +55,7 @@ NOTE: The configuration for the JournalNodes is done at roleGroup level for demo Even though operators allow setting this property to a value of your choice, the xref:secret-operator:index.adoc[secret-operator] will not exceed the `maxCertificateLifetime` value specified in SecretClass creating the TLS certificates (see xref:secret-operator/secretclass.adoc#certificate_lifetime). -=== Operators supporting the lifetime configuration +=== Operator support Similar to the example above, users can configure the lifetime of temporary credentials for the following operators: From 9a69f06c1ecf4e46af7497a0ab656b9719824608 Mon Sep 17 00:00:00 2001 From: Sebastian Bernauer Date: Thu, 5 Dec 2024 12:15:15 +0100 Subject: [PATCH 3/6] typo --- .../pages/operations/temporary_credentials_lifetime.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc index b675721e3..e1ab6c2c8 100644 --- a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc +++ b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc @@ -8,7 +8,7 @@ Currently the only temporary credentials are TLS certificates. 
Many products use TLS to secure the communications, often times customers use the xref:secret-operator:secretclass.adoc#backend-autotls[secret-operator autoTls] backend to create TLS certificates for the Pods on the fly. -To increase security, these temporary credentials have a short lifetime by default, which will result in e.g. Trino coordinator Pods restarting every ~24 hours (minus some safety buffer) to avoid using expired certificates. +To increase security, these temporary credentials have a short lifetime by default, which will result in e.g. Trino coordinator Pods restarting every ~24 hours (minus some safety buffer) to avoid using expired certificates. == Configure the lifetime From 846ccde45bd0d56acb946d86630248d4da306cdd Mon Sep 17 00:00:00 2001 From: Sebastian Bernauer Date: Thu, 5 Dec 2024 12:29:47 +0100 Subject: [PATCH 4/6] Better explain checking the lifetime --- .../operations/temporary_credentials_lifetime.adoc | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc index e1ab6c2c8..86c308912 100644 --- a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc +++ b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc @@ -69,9 +69,12 @@ Similar to the example above, users can configure the lifetime of temporary cred == Check the lifetime -Pods are normally not restarted "randomly" by Stackable operators. -Instead, when a temporary credential is added to a Pod, an annotation is added as well. -It starts with `restarter.stackable.tech/expires-at.` and instructs the xref:commons-operator:index.adoc[commons-operator] to restart the Pod once the specified instant is reached. +After configuring the lifetime as described above you could simply observe your stacklet for a week/month (or whatever your new lifetime is), to see if your changes take effect. 
+However, it's much quicker to check at what point of time your Pods will be restarted next. + +Pods are not restarted "randomly" by Stackable operators, but in a predictable manner. +When a temporary credential is added to a Pod, an annotation is added as well. +It starts with `restarter.stackable.tech/expires-at.` and instructs the xref:commons-operator/restarter.adoc[restart-controller] to restart the Pod once the specified point in time is reached. Given the following Pod @@ -84,5 +87,6 @@ metadata: restarter.stackable.tech/expires-at.ea77192c1184326d33e8ee32cfe921ea: "2024-12-05T15:49:10.043722965+00:00" ---- -You can always determine the timestamp the Pod will be restarted by the xref:commons-operator:index.adoc[commons-operator] by taking the earliest timestamp, `2024-12-05T14:03:47.131570189+00:00` in this case. -This way you can verify that the changes you made to the temporary credentials lifetime take effect. +You can always determine the instant the Pod will be restarted by the xref:commons-operator/restarter.adoc[restart-controller] by taking the earliest timestamp, `2024-12-05T14:03:47.131570189+00:00` in this case. + +You can use this timestamp to check if your changes have been applied as intended. 
From a734f69ac0d0021336756d5185e502a46c9f146a Mon Sep 17 00:00:00 2001 From: Sebastian Bernauer Date: Thu, 5 Dec 2024 12:32:23 +0100 Subject: [PATCH 5/6] fix rendering --- .../pages/operations/temporary_credentials_lifetime.adoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc index 86c308912..8831a6c65 100644 --- a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc +++ b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc @@ -74,7 +74,7 @@ However, it's much quicker to check at what point of time your Pods will be rest Pods are not restarted "randomly" by Stackable operators, but in a predictable manner. When a temporary credential is added to a Pod, an annotation is added as well. -It starts with `restarter.stackable.tech/expires-at.` and instructs the xref:commons-operator/restarter.adoc[restart-controller] to restart the Pod once the specified point in time is reached. +It starts with `restarter.stackable.tech/expires-at.` and instructs the xref:commons-operator:restarter.adoc[restart-controller] to restart the Pod once the specified point in time is reached. Given the following Pod @@ -87,6 +87,6 @@ metadata: restarter.stackable.tech/expires-at.ea77192c1184326d33e8ee32cfe921ea: "2024-12-05T15:49:10.043722965+00:00" ---- -You can always determine the instant the Pod will be restarted by the xref:commons-operator/restarter.adoc[restart-controller] by taking the earliest timestamp, `2024-12-05T14:03:47.131570189+00:00` in this case. +You can always determine the instant the Pod will be restarted by the xref:commons-operator:restarter.adoc[restart-controller] by taking the earliest timestamp, `2024-12-05T14:03:47.131570189+00:00` in this case. You can use this timestamp to check if your changes have been applied as intended. 
From 65dd822bc6000e3840cfea2a8df94a7025489c31 Mon Sep 17 00:00:00 2001 From: Sebastian Bernauer Date: Thu, 5 Dec 2024 13:19:19 +0100 Subject: [PATCH 6/6] Rename section to Pod lifetime annotations --- .../pages/operations/temporary_credentials_lifetime.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc index 8831a6c65..a3f529e9e 100644 --- a/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc +++ b/modules/concepts/pages/operations/temporary_credentials_lifetime.adoc @@ -67,7 +67,7 @@ Similar to the example above, users can configure the lifetime of temporary cred * Apache Zookeeper * Trino -== Check the lifetime +== Pod lifetime annotations After configuring the lifetime as described above you could simply observe your stacklet for a week/month (or whatever your new lifetime is), to see if your changes take effect. However, it's much quicker to check at what point of time your Pods will be restarted next.
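The "take the earliest timestamp" rule from the Pod lifetime annotations section can be sketched as a small script. This is a minimal illustration, not part of the patch: it assumes the Pod's annotations have already been loaded into a plain dictionary (for example from the Pod's YAML or JSON), and `next_restart` and `parse_instant` are hypothetical helper names.

```python
import re
from datetime import datetime
from typing import Dict, Optional

# Annotation prefix as documented in the page added by this patch series.
PREFIX = "restarter.stackable.tech/expires-at."


def parse_instant(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, trimming nanoseconds to microseconds."""
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", value)
    return datetime.fromisoformat(trimmed)


def next_restart(annotations: Dict[str, str]) -> Optional[datetime]:
    """Return the earliest expires-at instant, or None if no such annotation exists."""
    instants = [
        parse_instant(value)
        for key, value in annotations.items()
        if key.startswith(PREFIX)
    ]
    return min(instants) if instants else None


if __name__ == "__main__":
    # The two annotations from the example Pod in the documentation.
    pod_annotations = {
        PREFIX + "b887492af14bfe84f10cb2ff1b60acb0": "2024-12-05T14:03:47.131570189+00:00",
        PREFIX + "ea77192c1184326d33e8ee32cfe921ea": "2024-12-05T15:49:10.043722965+00:00",
    }
    print(next_restart(pod_annotations))
```

Annotations that do not start with the prefix are ignored. The sub-microsecond digits are dropped only so that the standard library can parse the value, which does not change which timestamp is earliest in this example.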