Changes from 29 commits
37 commits
6d7cab2
add runpod
safoinme Mar 13, 2024
70d1f29
Update Skypilot RunPod orchestrator
safoinme Mar 13, 2024
40547e9
Auto-update of Starter template
actions-user Mar 13, 2024
f869e91
Add Skypilot OCI and Skypilot Lambda integrations
safoinme Mar 14, 2024
ef7cadc
Update Skypilot integrations to use newer versions
safoinme Mar 14, 2024
66a532f
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
safoinme Mar 14, 2024
37eddaa
Add SkypilotRunPodIntegration and SkypilotLambdaIntegration to integr…
safoinme Mar 17, 2024
63e4558
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
safoinme Mar 18, 2024
6fd192d
Update Skypilot integration requirements
safoinme Mar 18, 2024
039ca12
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
safoinme Mar 20, 2024
05fb8be
Update Docker commands to use sudo
safoinme Mar 21, 2024
90ceaf6
Auto-update of LLM Finetuning template
actions-user Mar 21, 2024
6421c6d
Fix package name extraction in Integration class
safoinme Mar 21, 2024
d50d7e6
Merge branch 'feature/OSSK-470-add-more-skypilot-options' of github.c…
safoinme Mar 21, 2024
438231f
Add debug logs for docker run command
safoinme Mar 22, 2024
cb933ae
Update Skypilot VM orchestrator and integration
safoinme Mar 23, 2024
a3f1567
Refactor integration.py and sql_zen_store.py
safoinme Mar 23, 2024
e8256b3
Add GPU support for ZenML pipelines runs
safoinme Mar 24, 2024
ced9602
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
safoinme Mar 24, 2024
24e83bb
Fix debug log message formatting in Integration class
safoinme Mar 24, 2024
1aaa003
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
safoinme Mar 24, 2024
9ad1b0b
Merge branch 'feature/OSSK-470-add-more-skypilot-options' of github.c…
safoinme Mar 24, 2024
aa1244b
Fix IndexError in integration.py
safoinme Mar 24, 2024
ca1b918
Fix dependency resolution issue in Integration class
safoinme Mar 24, 2024
bd43a92
fix docstring
safoinme Mar 24, 2024
1ae6189
Fix fileio import in SkypilotLambdaOrchestrator.py
safoinme Mar 24, 2024
e37cc96
Fix formatting in SkypilotLambdaOrchestratorSettings docstring
safoinme Mar 25, 2024
da576f7
Update Lambda Labs orchestrator documentation and logo
safoinme Mar 25, 2024
c93d2d9
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
strickvl Apr 3, 2024
ff0f22c
Apply suggestions from code review
safoinme Apr 4, 2024
281d809
Apply suggestions from code review
safoinme Apr 4, 2024
daaa498
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
safoinme Apr 4, 2024
35bbca1
Update docs/book/stacks-and-components/component-guide/orchestrators/…
safoinme Apr 4, 2024
99c7dc4
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
safoinme Apr 5, 2024
3e6bfb1
Update Skypilot integration requirements to version 0.5.0
safoinme Apr 5, 2024
2e5ec43
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
strickvl Apr 8, 2024
aba1fc1
Merge branch 'develop' into feature/OSSK-470-add-more-skypilot-options
strickvl Apr 8, 2024
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,14 @@ Read more about how to configure step-specific resources [here](#configuring-ste
The SkyPilot VM Orchestrator does not currently support the ability to [schedule pipelines runs](/docs/book/user-guide/advanced-guide/pipelining-features/schedule-pipeline-runs.md)
{% endhint %}

{% hint style="info" %}
All ZenML pipelines runs are executed using docker containers within the VMs provisioned by the orchestrator.
Suggested change
All ZenML pipelines runs are executed using docker containers within the VMs provisioned by the orchestrator.
All ZenML pipelines runs are executed using Docker containers within the VMs provisioned by the orchestrator.

For that reason, you may need to configure your pipeline settings with `docker_run_args=["--gpus=all"]`
to enable GPU support in the Docker container.
{% endhint %}

{% hint style="info" %}


## How to deploy it

Expand Down Expand Up @@ -240,6 +248,36 @@ zenml orchestrator connect <ORCHESTRATOR_NAME> --connector azure-skypilot-vm
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
```
{% endtab %}

{% tab title="Lambda Labs" %}

Lambda Labs is a cloud provider that offers GPU instances for machine learning workloads. Unlike the major cloud providers, with Lambda Labs we don't need to configure a service connector to authenticate with the cloud provider. Instead, we can directly use API keys to authenticate with the Lambda Labs API.

```shell
zenml integration install skypilot_lambda
```

Once the integration is installed, we can register the orchestrator with the following command:

```shell
# Recommended, more secure approach: register the API key as a secret
zenml secret create lambda_api_key --scope user --api_key=<VALUE_1>
# Register the orchestrator
zenml orchestrator register <ORCHESTRATOR_NAME> --flavor vm_lambda --api_key={{lambda_api_key.api_key}}
# Register and activate a stack with the new orchestrator
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
```
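
The `--api_key={{lambda_api_key.api_key}}` value above points at a secret by name and key instead of embedding the key itself. The sketch below only illustrates the shape of such a reference, assuming the `{{<secret_name>.<key>}}` form shown in the command. `parse_secret_reference` is a hypothetical helper, not ZenML's own parser, and the set of allowed characters is an assumption.

```python
import re
from typing import Optional, Tuple

# Assumed shape of a secret reference: {{<secret_name>.<key>}}
SECRET_REFERENCE = re.compile(r"^\{\{\s*([\w-]+)\.([\w-]+)\s*\}\}$")


def parse_secret_reference(value: str) -> Optional[Tuple[str, str]]:
    """Split a reference into (secret_name, key); return None for plain values.

    Hypothetical helper for illustration only.
    """
    match = SECRET_REFERENCE.match(value)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_secret_reference("{{lambda_api_key.api_key}}"))
    print(parse_secret_reference("a-literal-api-key"))
```

A literal value yields `None`, so the same field can carry either a reference or a plain string.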

{% hint style="info" %}
The Lambda Labs orchestrator does not support some features, such as `spot_recovery`, `disk_tier`, `image_id`, `zone`, `idle_minutes_to_autostop`, `disk_size`, and `use_spot`. We recommend not using these settings with the Lambda Labs orchestrator, and not using [step-specific settings](#configuring-step-specific-resources) either.
{% endhint %}

{% hint style="warning" %}
While testing the orchestrator, we have noticed that the Lambda Labs orchestrator does not support the `down` flag. This means that the orchestrator will not automatically tear down the cluster after all jobs finish. We recommend manually tearing down the cluster after all jobs finish to avoid unnecessary costs.
{% endhint %}
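
The unsupported settings listed above could be caught before a run is launched. The following is a minimal sketch that assumes the settings are available as a plain dictionary. `find_unsupported_lambda_settings` is a hypothetical helper and not part of ZenML or SkyPilot.

```python
from typing import Any, Dict, List

# Settings that the hint above lists as unsupported on Lambda Labs.
UNSUPPORTED_ON_LAMBDA = (
    "spot_recovery",
    "disk_tier",
    "image_id",
    "zone",
    "idle_minutes_to_autostop",
    "disk_size",
    "use_spot",
)


def find_unsupported_lambda_settings(settings: Dict[str, Any]) -> List[str]:
    """Return the names of unsupported settings that were explicitly set.

    Hypothetical helper for illustration; a value of None counts as unset.
    """
    return [
        name
        for name in UNSUPPORTED_ON_LAMBDA
        if settings.get(name) is not None
    ]


if __name__ == "__main__":
    candidate = {
        "instance_type": "gpu_1x_h100_pcie",
        "use_spot": True,
        "zone": None,
    }
    print(find_unsupported_lambda_settings(candidate))
```

Treating `None` as unset is a design choice of this sketch; a stricter check could flag any key that is present at all.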

{% endtab %}

{% endtabs %}

#### Additional Configuration
Expand All @@ -263,6 +301,7 @@ For additional configuration of the Skypilot orchestrator, you can pass `Setting
* `idle_minutes_to_autostop`: Automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster's job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running `sky.launch(..., detach_run=True, ...)` and then `sky.autostop(idle_minutes=<minutes>)`. If not set, the cluster will not be autostopped.
* `down`: Tear down the cluster after all jobs finish (successfully or abnormally). If `idle_minutes_to_autostop` is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down for debugging purposes.
* `stream_logs`: If True, show the logs in the terminal as they are generated while the cluster is running.
* `docker_run_args`: Additional arguments to pass to the `docker run` command. For example, `['--gpus=all']` to use all GPUs available on the VM.

The following code snippets show how to configure the orchestrator settings for each cloud provider:

Expand Down Expand Up @@ -291,6 +330,7 @@ skypilot_settings = SkypilotAWSOrchestratorSettings(
idle_minutes_to_autostop=60,
down=True,
stream_logs=True,
docker_run_args=["--gpus=all"]
)


Expand Down Expand Up @@ -374,6 +414,34 @@ skypilot_settings = SkypilotAzureOrchestratorSettings(
)
```

{% endtab %}

{% tab title="Lambda" %}

**Code Example:**

```python
from zenml.integrations.skypilot_lambda.flavors.skypilot_orchestrator_lambda_vm_flavor import SkypilotLambdaOrchestratorSettings


skypilot_settings = SkypilotLambdaOrchestratorSettings(
instance_type="gpu_1x_h100_pcie",
cluster_name="my_cluster",
retry_until_up=True,
# `idle_minutes_to_autostop` and `down` are omitted: the hints above list them as unsupported on Lambda Labs.
stream_logs=True,
docker_run_args=["--gpus=all"]
)


@pipeline(
settings={
"orchestrator.vm_lambda": skypilot_settings
}
)
```

{% endtab %}
{% endtabs %}

Expand Down
1 change: 1 addition & 0 deletions src/zenml/integrations/__init__.py
Expand Up @@ -62,6 +62,7 @@
from zenml.integrations.skypilot_aws import SkypilotAWSIntegration # noqa
from zenml.integrations.skypilot_gcp import SkypilotGCPIntegration # noqa
from zenml.integrations.skypilot_azure import SkypilotAzureIntegration # noqa
from zenml.integrations.skypilot_lambda import SkypilotLambdaIntegration # noqa
from zenml.integrations.slack import SlackIntegration # noqa
from zenml.integrations.spark import SparkIntegration # noqa
from zenml.integrations.tekton import TektonIntegration # noqa
Expand Down
1 change: 1 addition & 0 deletions src/zenml/integrations/constants.py
Expand Up @@ -57,6 +57,7 @@
SKYPILOT_AWS = "skypilot_aws"
SKYPILOT_GCP = "skypilot_gcp"
SKYPILOT_AZURE = "skypilot_azure"
SKYPILOT_LAMBDA = "skypilot_lambda"
SLACK = "slack"
SPARK = "spark"
TEKTON = "tekton"
Expand Down
9 changes: 7 additions & 2 deletions src/zenml/integrations/integration.py
Expand Up @@ -13,6 +13,7 @@
# permissions and limitations under the License.
"""Base and meta classes for ZenML integrations."""

import re
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type, cast

import pkg_resources
Expand Down Expand Up @@ -83,15 +84,19 @@ def check_installation(cls) -> bool:
try:
requirements = dist.requires(extras=[extra]) # type: ignore[arg-type]
except pkg_resources.UnknownExtra as e:
logger.debug("Unknown extra: " + str(e))
logger.debug(f"Unknown extra: {str(e)}")
return False
deps.extend(requirements)
else:
deps = dist.requires()

for ri in deps:
try:
pkg_resources.get_distribution(ri)
# Remove the "extra == ..." part from the requirement string
cleaned_req = re.sub(
r"; extra == \"\w+\"", "", str(ri)
)
pkg_resources.get_distribution(cleaned_req)
except pkg_resources.DistributionNotFound as e:
logger.debug(
f"Unable to find required dependency "
Expand Down
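
The change above strips the `; extra == "..."` marker from each requirement string before looking up the distribution. The standalone sketch below reproduces only that regular expression so its effect can be checked in isolation. `strip_extra_marker` is an illustrative name, not a function in ZenML.

```python
import re

# Same pattern as in the diff above: matches `; extra == "<name>"`,
# where <name> consists of word characters only.
EXTRA_MARKER = re.compile(r"; extra == \"\w+\"")


def strip_extra_marker(requirement: str) -> str:
    """Remove the extra marker from a requirement string (illustrative helper)."""
    return EXTRA_MARKER.sub("", requirement)


if __name__ == "__main__":
    print(strip_extra_marker('awscli; extra == "aws"'))
    print(strip_extra_marker("pyyaml>=5.1"))
    # A hyphenated extra name is not matched by \w+ and is left untouched.
    print(strip_extra_marker('foo; extra == "my-extra"'))
```

Because `\w+` does not match hyphens, a requirement whose extra name contains one would pass through unchanged. Whether such names reach this code path depends on how the installed metadata spells them, which this sketch does not cover.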
Expand Up @@ -268,7 +268,7 @@ def prepare_or_run_pipeline(
if docker_creds := stack.container_registry.credentials:
docker_username, docker_password = docker_creds
setup = (
f"docker login --username $DOCKER_USERNAME --password "
f"sudo docker login --username $DOCKER_USERNAME --password "
f"$DOCKER_PASSWORD {stack.container_registry.config.uri}"
)
task_envs = {
Expand All @@ -286,10 +286,14 @@ def prepare_or_run_pipeline(

try:
task = sky.Task(
run=f"docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}",
run=f"sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}",
setup=setup,
envs=task_envs,
)
logger.debug(
f"Running run: sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}"
)
logger.debug(f"Running setup: {setup}")
task = task.set_resources(
sky.Resources(
cloud=self.cloud,
Expand Down
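
The hunk above interpolates `custom_run_args` and `docker_environment_str` into a single `sudo docker run` string. Their construction is not shown in this diff, so the sketch below makes assumptions about it: the run args are joined with spaces and followed by one trailing space, and environment variables become `-e KEY=VALUE` flags. `build_docker_run_command` is a hypothetical function written for illustration.

```python
from typing import Dict, List


def build_docker_run_command(
    image: str,
    entrypoint: str,
    arguments: str,
    docker_run_args: List[str],
    environment: Dict[str, str],
) -> str:
    """Assemble a command string in the shape used by the hunk above.

    Hypothetical sketch: the joining rules are assumptions, and values
    are not shell-quoted.
    """
    # Extra `docker run` flags, e.g. ["--gpus=all"], with a trailing space
    # so the f-string below does not need a separator after them.
    custom_run_args = " ".join(docker_run_args)
    if custom_run_args:
        custom_run_args += " "

    # Environment variables passed into the container.
    docker_environment_str = " ".join(
        f"-e {key}={value}" for key, value in environment.items()
    )

    return (
        f"sudo docker run --rm {custom_run_args}{docker_environment_str} "
        f"{image} {entrypoint} {arguments}"
    )


if __name__ == "__main__":
    print(
        build_docker_run_command(
            image="registry.example.com/zenml:latest",
            entrypoint="python -m my_entrypoint",
            arguments="--step trainer",
            docker_run_args=["--gpus=all"],
            environment={"MY_VAR": "1"},
        )
    )
```

The image name, entrypoint, and arguments in the usage example are placeholders. With `docker_run_args=["--gpus=all"]`, the flag lands directly after `--rm`, which is how the setting documented earlier reaches the container.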
2 changes: 1 addition & 1 deletion src/zenml/integrations/skypilot_aws/__init__.py
Expand Up @@ -31,7 +31,7 @@ class SkypilotAWSIntegration(Integration):
"""Definition of Skypilot AWS Integration for ZenML."""

NAME = SKYPILOT_AWS
REQUIREMENTS = ["skypilot[aws]<=0.4.1"]
REQUIREMENTS = ["skypilot[aws]"]
APT_PACKAGES = ["openssh-client","rsync"]

@classmethod
Expand Down
2 changes: 1 addition & 1 deletion src/zenml/integrations/skypilot_azure/__init__.py
Expand Up @@ -33,7 +33,7 @@ class SkypilotAzureIntegration(Integration):
"""Definition of Skypilot (Azure) Integration for ZenML."""

NAME = SKYPILOT_AZURE
REQUIREMENTS = ["skypilot[azure]<=0.4.1"]
REQUIREMENTS = ["skypilot[azure]"]
APT_PACKAGES = ["openssh-client","rsync"]

@classmethod
Expand Down
2 changes: 1 addition & 1 deletion src/zenml/integrations/skypilot_gcp/__init__.py
Expand Up @@ -30,7 +30,7 @@ class SkypilotGCPIntegration(Integration):
"""Definition of Skypilot (GCP) Integration for ZenML."""

NAME = SKYPILOT_GCP
REQUIREMENTS = ["skypilot[gcp]<=0.4.1"]
REQUIREMENTS = ["skypilot[gcp]"]
APT_PACKAGES = ["openssh-client","rsync"]

@classmethod
Expand Down
50 changes: 50 additions & 0 deletions src/zenml/integrations/skypilot_lambda/__init__.py
@@ -0,0 +1,50 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Initialization of the Skypilot Lambda integration for ZenML.

The Skypilot integration sub-module powers an alternative to the local
orchestrator for a remote orchestration of ZenML pipelines on VMs.
"""
from typing import List, Type

from zenml.integrations.constants import (
SKYPILOT_LAMBDA,
)
from zenml.integrations.integration import Integration
from zenml.stack import Flavor

SKYPILOT_LAMBDA_ORCHESTRATOR_FLAVOR = "vm_lambda"


class SkypilotLambdaIntegration(Integration):
"""Definition of Skypilot Lambda Integration for ZenML."""

NAME = SKYPILOT_LAMBDA
REQUIREMENTS = ["skypilot[lambda]"]

@classmethod
def flavors(cls) -> List[Type[Flavor]]:
"""Declare the stack component flavors for the Skypilot Lambda integration.

Returns:
List of stack component flavors for this integration.
"""
from zenml.integrations.skypilot_lambda.flavors import (
SkypilotLambdaOrchestratorFlavor,
)

return [SkypilotLambdaOrchestratorFlavor]


SkypilotLambdaIntegration.check_installation()
26 changes: 26 additions & 0 deletions src/zenml/integrations/skypilot_lambda/flavors/__init__.py
@@ -0,0 +1,26 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Skypilot integration flavor for Skypilot Lambda orchestrator."""

from zenml.integrations.skypilot_lambda.flavors.skypilot_orchestrator_lambda_vm_flavor import (
SkypilotLambdaOrchestratorConfig,
SkypilotLambdaOrchestratorFlavor,
SkypilotLambdaOrchestratorSettings,
)

__all__ = [
"SkypilotLambdaOrchestratorConfig",
"SkypilotLambdaOrchestratorFlavor",
"SkypilotLambdaOrchestratorSettings",
]