Commit 1be19a6

stefannica and schustmi committed
Document self-hosted run templates (#3552)
* Document self-hosted run templates
* Update docs/book/getting-started/zenml-pro/self-hosted.md

  Co-authored-by: Michael Schuster <[email protected]>

* Update docs/book/getting-started/zenml-pro/self-hosted.md

  Co-authored-by: Michael Schuster <[email protected]>

* Minor correction
* Add version warning

---------

Co-authored-by: Michael Schuster <[email protected]>
1 parent 3fc79e2 commit 1be19a6

File tree

1 file changed: +112 -1 lines changed


docs/book/getting-started/zenml-pro/self-hosted.md

Lines changed: 112 additions & 1 deletion
@@ -14,7 +14,7 @@ ZenML Pro can be installed as a self-hosted deployment. You need to be granted a

This document will guide you through the process.

{% hint style="info" %}
-Please note that the SSO (Single Sign-On) and [Run Templates](https://docs.zenml.io/how-to/trigger-pipelines) (i.e. running pipelines from the dashboard) features are currently not available in the on-prem version of ZenML Pro. These features are on our roadmap and will be added in future releases.
+Please note that the SSO (Single Sign-On) feature is currently not available in the on-prem version of ZenML Pro. This feature is on our roadmap and will be added in future releases.
{% endhint %}
## Preparation and prerequisites
@@ -1494,6 +1494,117 @@ export ZENML_PRO_API_URL=https://zenml-pro.staging.cloudinfra.zenml.io/api/v1
zenml login
```
## Enabling Run Templates Support

The ZenML Pro workspace server can optionally be configured to support Run Templates - the ability to run pipelines straight from the dashboard. This feature is not enabled by default and requires a few additional setup steps.

{% hint style="warning" %}
The Run Templates feature is only available from ZenML workspace server version 0.81.0 onwards.
{% endhint %}
The Run Templates feature comes with some optional sub-features that can be turned on or off to customize its behavior:

* **Building runner container images**: Running pipelines from the dashboard relies on Kubernetes jobs (a.k.a. "runner" jobs) that are triggered by the ZenML workspace server. These jobs need to use container images that have the correct Python packages installed to be able to launch the pipelines.

  The good news is that run templates are based on pipeline runs that have already executed in the past and therefore already have container images built and associated with them. The same container images can be reused by the ZenML workspace server for the "runner" jobs. However, for this to work, the Kubernetes cluster itself has to be able to access the container registries where these images are stored. This can be achieved in several ways:

  * use implicit workload identity access to the container registry - available in most cloud providers by granting the Kubernetes service account access to the container registry
  * configure a service account with implicit access to the container registry - associating a cloud service identity (e.g. a GCP service account, an AWS IAM role, etc.) with the Kubernetes service account used by the "runner" jobs
  * configure an image pull secret for the service account - similar to the previous option, but using a Kubernetes secret instead of a cloud service identity

  When none of the above are available or desirable, an alternative approach is to configure the ZenML workspace server itself to build these "runner" container images and push them to a different container registry. This can be achieved by setting the `ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE` environment variable to `true` and the `ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY` environment variable to the container registry where the "runner" images will be pushed.

  Yet another alternative is to configure the ZenML workspace server to use a single pre-built "runner" image for all pipeline runs. This can be achieved by keeping the `ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE` environment variable set to `false` and setting the `ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE` environment variable to the URI of the pre-built "runner" image. Note that this image needs to have all the requirements installed to instantiate the stack that will be used for the template run.

* **Storing logs externally**: By default, the ZenML workspace server uses the logs extracted from the "runner" job pods to populate the run template logs shown in the ZenML dashboard. These pods may disappear after a while, so the logs may no longer be available.

  To avoid this, you can configure the ZenML workspace server to store the logs in an external location, such as an S3 bucket, by setting the `ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS` environment variable to `true`.

  This option is currently only available with the AWS implementation of the Run Templates feature and also requires the `ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET` environment variable to be set to the S3 bucket where the logs will be stored.
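As an illustration of the image pull secret option mentioned above, the secret and the service account used by the "runner" jobs could be sketched as follows. This is not part of ZenML itself; the secret name, namespace and service account name are hypothetical placeholders that you would replace with your own:

```yaml
# Registry credentials for pulling the pre-built pipeline images
# (the name and namespace are hypothetical examples).
apiVersion: v1
kind: Secret
metadata:
  name: runner-registry-credentials
  namespace: zenml-workspace-namespace
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded Docker config JSON>
---
# The service account used by the "runner" jobs,
# with the image pull secret attached.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: zenml-workspace-service-account
  namespace: zenml-workspace-namespace
imagePullSecrets:
  - name: runner-registry-credentials
```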
1. Decide on an implementation.

   There are currently three different implementations of the Run Templates feature:

   * **Kubernetes**: runs pipelines in the same Kubernetes cluster as the ZenML Pro workspace server.
   * **AWS**: extends the Kubernetes implementation with the ability to build and push container images to AWS ECR and to store the run template logs in AWS S3.
   * **GCP**: currently the same as the Kubernetes implementation, but we plan to extend it to be able to push container images to GCP GCR and to store run template logs in GCP GCS.

   If you're after a fast, minimal setup, choose the Kubernetes implementation. If you want a complete cloud provider solution with all features enabled, choose the AWS implementation.
2. Prepare the Run Templates configuration.

   You'll need to prepare a list of environment variables that will be added to the Helm chart values used to deploy the ZenML workspace server.

   For all implementations, the following variables are supported:

   * `ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE` (mandatory): one of the following values, matching the implementation you chose in step 1:
     * `zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager`
     * `zenml_cloud_plugins.aws_kubernetes_workload_manager.AWSKubernetesWorkloadManager`
     * `zenml_cloud_plugins.gcp_kubernetes_workload_manager.GCPKubernetesWorkloadManager`
   * `ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE` (mandatory): the Kubernetes namespace where the "runner" jobs will be launched. It must exist before run templates are enabled.
   * `ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT` (mandatory): the Kubernetes service account to use for the "runner" jobs. It must exist before run templates are enabled.
   * `ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE` (optional): whether to build the "runner" container images. Defaults to `false`.
   * `ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY` (optional): the container registry where the "runner" images will be pushed. Mandatory if `ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE` is set to `true`, ignored otherwise.
   * `ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE` (optional): the "runner" container image to use. Only used if `ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE` is set to `false`, ignored otherwise.
   * `ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS` (optional): whether to store the logs of the "runner" jobs in an external location. Defaults to `false`. Currently only supported with the AWS implementation and requires the `ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET` variable to be set as well.
   * `ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE_POD_RESOURCES` (optional): the Kubernetes pod resources specification to use for the "runner" jobs, in JSON format. Example: `{"requests": {"cpu": "100m", "memory": "400Mi"}, "limits": {"memory": "700Mi"}}`.

   For the AWS implementation, the following additional variables are supported:

   * `ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET` (optional): the S3 bucket where the logs will be stored (e.g. `s3://my-bucket/run-template-logs`). Mandatory if `ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS` is set to `true`.
   * `ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION` (optional): the AWS region where the container images will be pushed (e.g. `eu-central-1`). Mandatory if `ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE` is set to `true`.
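   For example, the single pre-built "runner" image alternative described earlier combines these variables as follows. This is a sketch: the namespace, service account and image URI are hypothetical placeholders:

   ```yaml
   zenml:
     environment:
       ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager
       ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workspace-namespace
       ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workspace-service-account
       ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "false"
       # Hypothetical image URI; the image must have all requirements
       # installed to instantiate the stack used for the template run.
       ZENML_KUBERNETES_WORKLOAD_MANAGER_RUNNER_IMAGE: my-registry.example.com/zenml-runner:0.81.0
   ```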
3. Create the Kubernetes resources.

   For the Kubernetes implementation, you'll need to create the following resources:

   * the Kubernetes namespace passed in the `ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE` variable
   * the Kubernetes service account passed in the `ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT` variable. This service account will be used to build images and run the "runner" jobs, so it needs to have the necessary permissions to do so (e.g. access to the container images, permissions to push container images to the configured container registry, permissions to access the configured bucket, etc.).
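   A minimal sketch of these two resources, reusing the placeholder names from the Helm values examples in step 4:

   ```yaml
   # Namespace referenced by ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE.
   apiVersion: v1
   kind: Namespace
   metadata:
     name: zenml-workspace-namespace
   ---
   # Service account referenced by ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT.
   # Grant it registry/bucket permissions separately, as described above.
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: zenml-workspace-service-account
     namespace: zenml-workspace-namespace
   ```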
4. Finally, update the ZenML workspace server configuration to use the new implementation.

   The environment variables you prepared in step 2 need to be added to the Helm chart values used to deploy the ZenML workspace server, and the ZenML server has to be updated as covered in the [Day 2 Operations: Upgrades and Updates](self-hosted.md#day-2-operations-upgrades-and-updates) section.

   Example updated Helm values file (minimal configuration):

   ```yaml
   zenml:
     environment:
       ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.kubernetes_workload_manager.KubernetesWorkloadManager
       ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workspace-namespace
       ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workspace-service-account
   ```
   Example updated Helm values file (full AWS configuration):

   ```yaml
   zenml:
     environment:
       ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.aws_kubernetes_workload_manager.AWSKubernetesWorkloadManager
       ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workspace-namespace
       ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workspace-service-account
       ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "true"
       ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY: 339712793861.dkr.ecr.eu-central-1.amazonaws.com
       ZENML_KUBERNETES_WORKLOAD_MANAGER_ENABLE_EXTERNAL_LOGS: "true"
       ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE_POD_RESOURCES: '{"requests": {"cpu": "100m", "memory": "400Mi"}, "limits": {"memory": "700Mi"}}'
       ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_BUCKET: s3://my-bucket/run-template-logs
       ZENML_AWS_KUBERNETES_WORKLOAD_MANAGER_REGION: eu-central-1
   ```
   Example updated Helm values file (full GCP configuration):

   ```yaml
   zenml:
     environment:
       ZENML_SERVER_WORKLOAD_MANAGER_IMPLEMENTATION_SOURCE: zenml_cloud_plugins.gcp_kubernetes_workload_manager.GCPKubernetesWorkloadManager
       ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE: zenml-workspace-namespace
       ZENML_KUBERNETES_WORKLOAD_MANAGER_SERVICE_ACCOUNT: zenml-workspace-service-account
       ZENML_KUBERNETES_WORKLOAD_MANAGER_BUILD_RUNNER_IMAGE: "true"
       ZENML_KUBERNETES_WORKLOAD_MANAGER_DOCKER_REGISTRY: europe-west3-docker.pkg.dev/zenml-project/zenml-run-templates/zenml
       ZENML_KUBERNETES_WORKLOAD_MANAGER_NAMESPACE_POD_RESOURCES: '{"requests": {"cpu": "100m", "memory": "400Mi"}, "limits": {"memory": "700Mi"}}'
   ```
## Day 2 Operations: Upgrades and Updates
This section covers how to upgrade or update your ZenML Pro deployment. The process involves updating both the ZenML Pro Control Plane and the ZenML Pro workspace servers.
