Inferentia inf1 observability #153

Merged
Binary file added aws-quickstart-eks-blueprints-1.13.1.tgz
Binary file not shown.
6 changes: 6 additions & 0 deletions bin/single-new-eks-inferentia-opensource-observability.ts
@@ -0,0 +1,6 @@
import SingleNewEksInferentiaOpenSourceObservabilityPattern from '../lib/single-new-eks-opensource-observability-pattern/neuron/inferentia-index';
import { configureApp } from '../lib/common/construct-utils';

const app = configureApp();

new SingleNewEksInferentiaOpenSourceObservabilityPattern(app, 'single-new-eks-inferentia-opensource');
6 changes: 0 additions & 6 deletions bin/single-new-eks-neuron-opensource-observability.ts

This file was deleted.

28 changes: 16 additions & 12 deletions cdk.json
@@ -21,37 +21,41 @@
"name": "grafana-dashboards",
"namespace": "grafana-operator",
"repository": {
"repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
"repoUrl": "https://github.com/freschri/aws-observability-accelerator",
Review comment (Contributor): Why are we changing this?

"name": "grafana-dashboards",
"targetRevision": "main",
"targetRevision": "neuron-dashboard",
Review comment (Contributor): Why are we changing this?

"path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
},
"values": {
"GRAFANA_CLUSTER_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
"GRAFANA_KUBELET_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
"GRAFANA_NSWRKLDS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
"GRAFANA_NODEEXP_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
"GRAFANA_NODES_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
"GRAFANA_WORKLOADS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json"
"GRAFANA_CLUSTER_DASH_URL" : "https://raw.githubusercontent.com/freschri/aws-observability-accelerator/neuron-dashboard/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
"GRAFANA_KUBELET_DASH_URL" : "https://raw.githubusercontent.com/freschri/aws-observability-accelerator/neuron-dashboard/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
"GRAFANA_NSWRKLDS_DASH_URL" : "https://raw.githubusercontent.com/freschri/aws-observability-accelerator/neuron-dashboard/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
"GRAFANA_NODEEXP_DASH_URL" : "https://raw.githubusercontent.com/freschri/aws-observability-accelerator/neuron-dashboard/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
"GRAFANA_NODES_DASH_URL" : "https://raw.githubusercontent.com/freschri/aws-observability-accelerator/neuron-dashboard/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
"GRAFANA_WORKLOADS_DASH_URL" : "https://raw.githubusercontent.com/freschri/aws-observability-accelerator/neuron-dashboard/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
"GRAFANA_NEURON_DASH_URL" : "https://raw.githubusercontent.com/freschri/aws-observability-accelerator/neuron-dashboard/artifacts/grafana-dashboards/eks/neuron/neuron-monitor.json"
},
"kustomizations": [
{
"kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
},
{
"kustomizationPath": "./artifacts/grafana-operator-manifests/eks/neuron"
}
]
},
"gpuNodeGroup": {
"instanceType": "g4dn.xlarge",
"desiredSize": 2,
"minSize": 2,
"desiredSize": 2,
"minSize": 2,
"maxSize": 3,
"ebsSize": 50
},
"neuronNodeGroup": {
"instanceClass": "inf1",
"instanceSize": "2xlarge",
"desiredSize": 1,
"minSize": 1,
"desiredSize": 1,
"minSize": 1,
"maxSize": 3,
"ebsSize": 512
},
@@ -1,12 +1,12 @@
# Single Cluster Open Source Observability - Inferentia-based cluster

[AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) are accelerated Machine Learning (ML) chips (or ML accelerators), designed by AWS. They are also referred to as Neuron chips. Each Neuron chip includes multiple NeuronCores, the machine learning compute cores.
[AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) is an accelerated Machine Learning (ML) chip (or ML accelerator), designed and built by AWS. It is also referred to as a Neuron device. Each Neuron device includes multiple NeuronCores, the machine learning compute cores.

Amazon EC2 ML instances belong to the [Trn1/Trn1n](https://aws.amazon.com/ec2/instance-types/trn1/), [Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) and [Inf2](https://aws.amazon.com/ec2/instance-types/inf2/) families. Trn1/Trn1n instances feature multiple AWS Trainium accelerators and support high-performance training. Inf1 and Inf2 instances feature multiple AWS Inferentia accelerators and support high-performance and low-latency inference.
Amazon EC2 ML instances belong to the [Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) and [Inf2](https://aws.amazon.com/ec2/instance-types/inf2/) families. Inf1 and Inf2 instances feature multiple AWS Inferentia accelerators and support high-performance and low-latency inference.

[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is an SDK with a compiler, runtime, and profiling tools that helps developers deploy models on AWS Inferentia accelerators and train them on AWS Trainium accelerators. It integrates natively with popular ML frameworks, such as PyTorch and TensorFlow.

This pattern shows you how to monitor the performance of ML accelerators, used in an Amazon EKS cluster leveraging Inferentia-based Amazon EC2 Inf1/Inf2 instances.
This pattern shows you how to monitor the performance of ML accelerators, used in an Amazon EKS cluster leveraging Inferentia-based Amazon EC2 Inf1 and Inf2 instances.

Amazon Managed Service for Prometheus and Amazon Managed Grafana are managed services for open source tools, used in this pattern to collect and visualise metrics respectively.

@@ -16,7 +16,7 @@ Amazon Managed Grafana is a managed service for Grafana, a popular open-source a

## Objective

This pattern deploys an Amazon EKS cluster with a node group that includes Inf1/Inf2 instances.
This pattern deploys an Amazon EKS cluster with a node group that includes Inf1 instances.

The AMI type of the node group is `AL2_x86_64_GPU AMI`, which uses the [Amazon EKS-optimized accelerated AMI](https://aws.amazon.com/marketplace/pp/prodview-nwwwodawoxndm). In addition to the standard Amazon EKS-optimized AMI configuration, the accelerated AMI includes the AWS Neuron container runtime.

@@ -142,13 +142,13 @@ Example settings: Update the context in `cdk.json` file located in `cdk-eks-blue
}
```

**Note**: you can replace the inf1 instance type with inf2 and the size as you prefer; to check availability in your selected Region, you can run the following command (amend `Values` below as you see fit):
**Note**: ensure your selected instance type is available in your selected Region. To check that, you can run the following command (amend `Values` below as you see fit):

```bash
aws ec2 describe-instance-type-offerings \
--filters Name=instance-type,Values="inf1*" \
--query "InstanceTypeOfferings[].InstanceType" \
--region $AWS_REGION
--region us-east-2
```

8. For the `neuron-monitor` DaemonSet, you can either use the image already referenced in the manifest, or build your own using the Dockerfile at `cdk-aws-observability-accelerator/lib/common/resources/neuron/neuron-monitor.dockerfile`, [push it to an ECR repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html) of your choice, and update the image URL in the manifest at `cdk-aws-observability-accelerator/lib/single-new-eks-opensource-observability-pattern/neuron/neuron-monitor.yaml`.
@@ -157,7 +157,7 @@ aws ec2 describe-instance-type-offerings \

```bash
make build
make pattern single-new-eks-neuron-opensource-observability deploy
make pattern single-new-eks-inferentia-opensource-observability deploy
```
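
Before running the deploy, it can help to sanity-check the `neuronNodeGroup` values in `cdk.json`. The following is a minimal TypeScript sketch and not part of the pattern's code: the `NeuronNodeGroupConfig` interface and the `validateNeuronNodeGroup` and `instanceTypeOf` helpers are hypothetical names introduced for illustration, and the field names are assumed to mirror the `neuronNodeGroup` context entry.

```typescript
// Hypothetical shape, assumed to mirror the `neuronNodeGroup` entry in cdk.json.
interface NeuronNodeGroupConfig {
  instanceClass: string;
  instanceSize: string;
  desiredSize: number;
  minSize: number;
  maxSize: number;
  ebsSize: number;
}

// Builds the EC2 instance type string (class + "." + size), which is the
// kind of value the `instance-type` filter of the AWS CLI call expects.
function instanceTypeOf(cfg: NeuronNodeGroupConfig): string {
  return `${cfg.instanceClass}.${cfg.instanceSize}`;
}

// Returns a list of problems; an empty list means the settings look consistent.
function validateNeuronNodeGroup(cfg: NeuronNodeGroupConfig): string[] {
  const problems: string[] = [];
  if (cfg.instanceClass.length === 0 || cfg.instanceSize.length === 0) {
    problems.push("instanceClass and instanceSize must be non-empty");
  }
  if (cfg.minSize > cfg.maxSize) {
    problems.push("minSize must not exceed maxSize");
  }
  if (cfg.desiredSize < cfg.minSize || cfg.desiredSize > cfg.maxSize) {
    problems.push("desiredSize must be between minSize and maxSize");
  }
  if (cfg.ebsSize <= 0) {
    problems.push("ebsSize must be positive");
  }
  return problems;
}

// Values taken from the example settings in this pattern.
const example: NeuronNodeGroupConfig = {
  instanceClass: "inf1",
  instanceSize: "2xlarge",
  desiredSize: 1,
  minSize: 1,
  maxSize: 3,
  ebsSize: 512,
};

console.log(instanceTypeOf(example)); // inf1.2xlarge
console.log(validateNeuronNodeGroup(example)); // []
```

The sketch only checks internal consistency of the settings. Whether the resulting instance type is offered in your Region still has to be confirmed with the `aws ec2 describe-instance-type-offerings` command.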

## Verify the resources
@@ -264,5 +264,5 @@ Grafana Operator and Flux always work together to synchronize your dashboards wi
You can tear down the whole CDK stack with the following command:

```bash
make pattern single-new-eks-neuron-opensource-observability destroy
make pattern single-new-eks-inferentia-opensource-observability destroy
```