A Helm chart for running Deepgram services in a self-hosted environment
Homepage: https://developers.deepgram.com/docs/self-hosted-introduction
Deepgram Self-Hosted Kubernetes Guides: https://developers.deepgram.com/docs/kubernetes
Kubernetes: >=1.28.0-0
| Repository | Name | Version |
|---|---|---|
| https://helm.ngc.nvidia.com/nvidia | gpu-operator | ^24.3.0 |
| https://kubernetes.github.io/autoscaler | cluster-autoscaler | ^9.37.0 |
| https://prometheus-community.github.io/helm-charts | kube-prometheus-stack | ^60.2.0 |
| https://prometheus-community.github.io/helm-charts | prometheus-adapter | ^4.10.0 |
helm repo add deepgram https://deepgram.github.io/self-hosted-resources
helm repo update
The Deepgram self-hosted chart requires Helm 3.7+ to install successfully. Please check your Helm version before installation.
You will need to provide your self-service Deepgram licensing and credentials information. See global.deepgramSecretRef and global.pullSecretRef in the Values section for more details, and the Deepgram Self-Hosted Kubernetes Guides for instructions on how to create these secrets.
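For reference, a minimal my-values.yaml that points the chart at pre-created Secrets might look like the sketch below; the Secret names shown are placeholders for whatever names you chose when creating them:

global:
  # Placeholder name of the K8s Secret containing your Deepgram self-hosted API key
  deepgramSecretRef: dg-self-hosted-api-key
  # Placeholder name of the K8s Secret containing your Quay image pull credentials
  pullSecretRef: dg-regcred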
You may also override any default configuration values. See the Values section for a list of available options, and the samples directory for examples of a standard installation.
helm install -f my-values.yaml [RELEASE_NAME] deepgram/deepgram-self-hosted --atomic --timeout 45m
To upgrade the Deepgram components to a new version, follow these steps:
- Update the various image.tag values in the values.yaml file to the desired version.
- Run the Helm upgrade command:
helm upgrade -f my-values.yaml [RELEASE_NAME] deepgram/deepgram-self-hosted --atomic --timeout 60m
Important
January 2026 Release (release-260115) - Breaking Change for TTS Deployments
The January 2026 self-hosted release includes changes to improve TTS response times. This release is not backwards-compatible with previous releases when serving TTS traffic due to changes in how API and Engine containers communicate.
To avoid downtime, the updated Engine container (3.107.0-1) must be deployed before the updated API container (1.176.0). The new Engine version is compatible with previous API versions, so deploy Engine first. Blue-green deployment is one strategy that satisfies this requirement. This only applies to deployments serving TTS traffic; STT-only deployments are unaffected.
See the January 2026 changelog for more details.
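As a rough sketch of the required ordering, you might run one upgrade with only the Engine tag bumped, wait for that rollout to complete, and then run a second upgrade that bumps the API tag; the exact rollout strategy (e.g. blue-green) is up to you:

# Step 1: upgrade Engine first (compatible with the previous API release)
engine:
  image:
    tag: "3.107.0-1"

# Step 2: after the Engine rollout completes, upgrade API in a second helm upgrade
api:
  image:
    tag: "1.176.0"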
If you encounter any issues during the upgrade process, you can perform a rollback to the previous version:
helm rollback deepgram
Before upgrading, ensure that you have reviewed the release notes and any migration guides provided by Deepgram for the specific version you are upgrading to.
helm uninstall [RELEASE_NAME]
This removes all the Kubernetes components associated with the chart and deletes the release.
See the chart CHANGELOG for a list of relevant changes for each version of the Helm chart.
For more details on changes to the underlying Deepgram resources, such as the container images or available models, see the official Deepgram changelog (RSS feed).
The Deepgram Helm chart supports different persistent storage options for storing Deepgram models and data. The available options include:
- AWS Elastic File System (EFS)
- Google Cloud Persistent Disk (GPD)
- Custom PersistentVolumeClaim (PVC)
To configure a specific storage option, see the engine.modelManager.volumes configuration values. Make sure to provide the necessary configuration values for the selected storage option, such as the EFS file system ID or the GPD disk type and size.
For detailed instructions on setting up and configuring each storage option, refer to the Deepgram self-hosted guides and the respective cloud provider's documentation.
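For example, a custom PVC can be wired in with a values snippet along these lines (the claim name is a placeholder; the PV and PVC must use a ReadWriteMany or ReadOnlyMany access mode):

engine:
  modelManager:
    volumes:
      customVolumeClaim:
        enabled: true
        # Placeholder name of your pre-configured PersistentVolumeClaim
        name: deepgram-models-pvc
        # Directory within the volume where the model files are stored
        modelsDirectory: "/"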
The Deepgram Helm chart provides flexible service configuration options for exposing the API, Engine, and License Proxy services. By default, all services use ClusterIP type, which provides internal cluster access only.
- ClusterIP (default): Exposes the service on a cluster-internal IP. This is the default and recommended option for most deployments.
- NodePort: Exposes the service on each Node's IP at a static port. Useful for development or when you need direct node access.
- LoadBalancer: Exposes the service externally using a cloud provider's load balancer. Recommended for production deployments requiring external access.
API Service with LoadBalancer (with security restrictions):
api:
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    loadBalancerSourceRanges:
      - "10.0.0.0/8"     # Allow access from private networks
      - "192.168.1.0/24" # Allow access from specific subnet
    externalTrafficPolicy: "Local" # Preserve source IP and reduce hops

Engine Metrics Service with NodePort:
engine:
  service:
    type: NodePort

License Proxy Service with LoadBalancer (restricted access):
licenseProxy:
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    loadBalancerSourceRanges:
      - "10.0.0.0/8" # Only allow internal network access
    externalTrafficPolicy: "Cluster" # Allow traffic from any node

When using LoadBalancer service type, you can configure additional security and performance options:
- loadBalancerSourceRanges: Restrict access to specific IP CIDR ranges. This provides network-level security by only allowing traffic from the specified IP ranges.
- externalTrafficPolicy: Controls how external traffic is routed:
  - Cluster (default): Traffic can be routed to any node in the cluster, then forwarded to the target pod.
  - Local: Traffic is only routed to nodes that have the target pod running, preserving source IP addresses.
Autoscaling your cluster's capacity to meet incoming traffic demands involves both node autoscaling and pod autoscaling. Node autoscaling for supported cloud providers is set up by default when using this Helm chart and creating your cluster with the Deepgram self-hosted guides. Pod autoscaling can be enabled via the scaling.auto.enabled configuration option in this chart.
The Engine component is the core of the Deepgram self-hosted platform, responsible for performing inference using your deployed models. Autoscaling increases the number of Engine replicas to maintain consistent performance for incoming traffic.
There are currently two primary ways to scale the Engine component: scaling with a hard request limit per Engine Pod, or scaling with a soft request limit per Engine pod.
To set a hard limit on which to scale, configure engine.concurrencyLimit.activeRequests and scaling.auto.engine.metrics.requestCapacityRatio. The activeRequests parameter sets a hard limit on how many requests any given Engine pod will accept, and the requestCapacityRatio governs scaling the Engine deployment when a certain percentage of "available request slots" is filled. For example, a requestCapacityRatio of 0.8 will scale the Engine deployment when the current number of active requests is >=80% of the active request concurrency limit. If the cluster is not able to scale in time and current active requests hit 100% of the preset limit, additional client requests to the API will receive a 429 Too Many Requests HTTP response. This hard limit means that if a request is accepted for inference, it will have consistent performance, as the cluster will refuse surplus requests that could overload the cluster and degrade performance, at the expense of possibly rejecting some incoming requests if capacity does not scale in time.
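As an illustrative sketch (the numbers are examples, not recommendations), a hard-limit configuration might look like:

engine:
  concurrencyLimit:
    # Example: each Engine pod accepts at most 20 concurrent requests
    activeRequests: 20
scaling:
  auto:
    enabled: true
    engine:
      metrics:
        # Example: scale out once active requests reach 80% of that limit
        requestCapacityRatio: "0.8"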
To set a soft limit on which to scale, configure scaling.auto.engine.metrics.{speechToText,textToSpeech}.{batch,streaming}.requestsPerPod, depending on the primary traffic source for your environment. The cluster will attempt to scale to meet this target for the number of requests per Engine pod, but will not reject extra requests with a 429 Too Many Requests HTTP response like the hard limit will. If the number of extra requests increases faster than the cluster can scale additional capacity, all incoming requests will still be accepted, but the performance of individual requests may degrade.
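A soft-limit configuration instead targets a request count per Engine pod; for example, for a primarily streaming speech-to-text environment (the target value here is illustrative):

scaling:
  auto:
    enabled: true
    engine:
      metrics:
        speechToText:
          streaming:
            # Example target of concurrent streaming STT requests per Engine pod
            requestsPerPod: 30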
Note
Deepgram recommends provisioning separate environments for batch speech-to-text, streaming speech-to-text, and text-to-speech workloads because typical latency and throughput tradeoffs are different for each of those use cases.
There is also a scaling.auto.engine.metrics.custom configuration value available to define your own custom scaling metric, if needed.
The API component is responsible for accepting incoming requests and forming responses, delegating inference work to the Deepgram Engine as needed. A single API pod can typically handle delegating requests to multiple Engine pods, so it is more compute efficient to deploy fewer API pods relative to the number of Engine pods. The scaling.auto.api.metrics.engineToApiRatio configuration value defines the ratio between Engine to API pods. The default value is appropriate for most deployments.
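If you do need to change the ratio, it is a single value (4 is the chart default, meaning roughly one API pod per four Engine pods):

scaling:
  auto:
    enabled: true
    api:
      metrics:
        # One API pod for every 4 Engine pods
        engineToApiRatio: 4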
There is also a scaling.auto.api.metrics.custom configuration value available to define your own custom scaling metric, if needed.
The License Proxy is intended to be deployed as a fixed-scale deployment that proxies all licensing requests from your environment. It should not be upscaled with the traffic demands of your environment.
This chart deploys one License Proxy Pod per environment by default. If you wish to deploy a second License Proxy Pod for redundancy, set licenseProxy.deploySecondReplica to true.
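For example, enabling the License Proxy with a redundant second replica is a two-value change:

licenseProxy:
  enabled: true
  # Deploy a second Pod for redundancy in highly available environments
  deploySecondReplica: true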
Role-Based Access Control (RBAC) is used to control access to Kubernetes resources based on the roles and permissions assigned to users or service accounts. The Deepgram Helm chart includes default RBAC roles and bindings for the API, Engine, and License Proxy components.
To use custom RBAC roles and bindings based on your specific security requirements, you can individually specify pre-existing ServiceAccounts to bind to each deployment by specifying the following options in values.yaml:
{api|engine|licenseProxy}.serviceAccount.create=false
{api|engine|licenseProxy}.serviceAccount.name=<your-pre-existing-sa>
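For example, to attach the Engine Deployment to a pre-existing ServiceAccount (the account name below is a placeholder):

engine:
  serviceAccount:
    create: false
    name: my-existing-engine-sa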
Make sure to review and adjust the RBAC configuration according to the principle of least privilege, granting only the necessary permissions for each component.
The Deepgram Helm chart takes references to two existing secrets - one containing your distribution credentials to pull container images from Deepgram's image repository, and one containing your Deepgram self-hosted API key.
Consult the official Kubernetes documentation for best practices on configuring Secrets for use in your cluster.
See the Getting Help section in the root of this repository for a list of resources to help you troubleshoot and resolve issues.
If you encounter issues while deploying or using Deepgram, consider the following troubleshooting steps:
- Check the pod status and logs:
  - Use kubectl get pods to check the status of the Deepgram pods.
  - Use kubectl logs <pod-name> to view the logs of a specific pod.
- Verify resource availability:
  - Ensure that the cluster has sufficient CPU, memory, and storage resources to accommodate the Deepgram components.
  - Check for any resource constraints or limits imposed by the namespace or the cluster.
- Review the Kubernetes events:
  - Use kubectl get events to view any events or errors related to the Deepgram deployment.
- Check the network connectivity:
  - Verify that the Deepgram components can communicate with each other and with the Deepgram license server (license.deepgram.com).
  - Check the network policies and firewall rules to ensure that the necessary ports and protocols are allowed.
- Collect diagnostic information:
  - Gather relevant logs and metrics.
  - Export your existing Helm chart values:
    helm get values [RELEASE_NAME] > my-deployed-values.yaml
  - Provide the collected diagnostic information to Deepgram for assistance.
| Key | Type | Default | Description |
|---|---|---|---|
| agent.allowNonpublicEndpoints | bool | false |
Whether to allow non-public URLs (such as localhost) in custom endpoints. Disabled by default |
| agent.enabled | bool | false |
Whether to enable voice agent. Disabled by default |
| agent.eotTimeoutMs | int | 3500 |
Timeout in milliseconds for end-of-turn detection |
| agent.llmProviders | object | `` | Configuration for LLM providers and their available models |
| agent.llmProviders.anthropic | object | `` | Anthropic provider configuration |
| agent.llmProviders.anthropic.models | object | `` | Available Anthropic models and their configurations |
| agent.llmProviders.deepgram | object | `` | Deepgram provider configuration |
| agent.llmProviders.deepgram.models | object | `` | Available Deepgram models and their configurations |
| agent.llmProviders.google | object | `` | Google provider configuration |
| agent.llmProviders.google.models | object | `` | Available Google models and their configurations |
| agent.llmProviders.groq | object | `` | Groq provider configuration |
| agent.llmProviders.groq.models | object | `` | Available Groq models and their configurations |
| agent.llmProviders.open_ai | object | `` | OpenAI provider configuration |
| agent.llmProviders.open_ai.models | object | `` | Available OpenAI models and their configurations |
| agent.llmProviders.open_ai.models.gpt-4o-mini.name | string | "GPT-4o mini" |
Display name for the GPT-4o mini model |
| agent.llmProviders.open_ai.models.gpt-4o-mini.public | bool | true |
Whether this model is publicly available |
| agent.llmProviders.open_ai.models.gpt-4o-mini.tier | string | "standard" |
Service tier for this model (standard or advanced) |
| agent.llmProviders.open_ai.name | string | "OpenAI" |
Display name for the OpenAI provider |
| agent.llmProviders.x_ai | object | `` | xAI provider configuration |
| agent.llmProviders.x_ai.models | object | `` | Available xAI models and their configurations |
| agent.maxConversationChars | int | 15000 |
Maximum number of characters allowed in a conversation history |
| api.additionalAnnotations | object | nil |
Additional annotations to add to the API deployment |
| api.additionalLabels | object | {} |
Additional labels to add to API resources |
| api.affinity | object | {} |
Affinity and anti-affinity to apply for API pods. |
| api.containerSecurityContext | object | {} |
Container-level security context for API containers. |
| api.customToml | string | nil |
Custom TOML sections can be added to extend api.toml |
| api.driverPool | object | `` | driverPool configures the backend pool of speech engines (generically referred to as "drivers" here). The API will load-balance among drivers in the standard pool; if one standard driver fails, the next one will be tried. |
| api.driverPool.standard | object | `` | standard is the main driver pool to use. |
| api.driverPool.standard.maxResponseSize | string | "1073741824" |
Maximum response to deserialize from Driver (in bytes). Default is 1GB, expressed in bytes. |
| api.driverPool.standard.retryBackoff | float | 1.6 |
retryBackoff is the factor to increase the retrySleep by for each additional retry (for exponential backoff). |
| api.driverPool.standard.retrySleep | string | "2s" |
retrySleep defines the initial sleep period (in humantime duration) before attempting a retry. |
| api.driverPool.standard.timeoutBackoff | float | 1.2 |
timeoutBackoff is the factor to increase the timeout by for each additional retry (for exponential backoff). |
| api.extraEnv | list | [] |
Extra environment variables for the API container. |
| api.features | object | `` | Enable ancillary features |
| api.features.diskBufferPath | string | nil |
If API is receiving requests faster than Engine can process them, a request queue will form. By default, this queue is stored in memory. Under high load, the queue may grow too large and cause Out-Of-Memory errors. To avoid this, set a diskBufferPath to buffer the overflow on the request queue to disk. WARN: This is only to temporarily buffer requests during high load. If there is not enough Engine capacity to process the queued requests over time, the queue (and response time) will grow indefinitely. |
| api.features.entityDetection | bool | false |
Enables entity detection on pre-recorded audio if a valid entity detection model is available. |
| api.features.entityRedaction | bool | true |
Enables entity-based redaction on pre-recorded audio if a valid entity detection model is available. |
| api.features.formatEntityTags | bool | true |
Enables format entity tags on pre-recorded audio if a valid NER model is available. |
| api.features.listenV2 | bool | false |
Enables Flux turn-based streaming STT |
| api.features.redactUsage | bool | true |
Enables usage metadata redaction; set to false to disable redaction of usage metadata |
| api.image.path | string | "quay.io/deepgram/self-hosted-api" |
path configures the image path to use for creating API containers. You may change this from the public Quay image path if you have imported Deepgram images into a private container registry. |
| api.image.pullPolicy | string | "IfNotPresent" |
pullPolicy configures how the Kubelet attempts to pull the Deepgram API image |
| api.image.tag | string | "release-260430" |
tag defines which Deepgram release to use for API containers |
| api.livenessProbe | object | `` | Liveness probe customization for API pods. |
| api.namePrefix | string | "deepgram-api" |
namePrefix is the prefix to apply to the name of all K8s objects associated with the Deepgram API containers. |
| api.nodeSelector | object | {} |
Node selector to apply to API pods. |
| api.readinessProbe | object | `` | Readiness probe customization for API pods. |
| api.resolver | object | `` | Specify custom DNS resolution options. |
| api.resolver.maxTTL | int | nil |
maxTTL sets the DNS TTL value if specifying a custom DNS nameserver. |
| api.resolver.nameservers | list | [] |
nameservers allows for specifying custom domain name server(s). A valid list item's format is "{IP} {PORT} {PROTOCOL (tcp or udp)}", e.g. "127.0.0.1 53 udp". |
| api.resources | object | `` | Configure resource limits per API container. See Deepgram's documentation for more details. |
| api.securityContext | object | {} |
Pod-level security context for API pods. |
| api.server | object | `` | Configure how the API will listen for your requests |
| api.server.callbackConnTimeout | string | "1s" |
callbackConnTimeout configures how long to wait for a connection to a callback URL. See Deepgram's callback documentation for more details. The value should be a humantime duration. |
| api.server.callbackTimeout | string | "10s" |
callbackTimeout configures how long to wait for a response from a callback URL. See Deepgram's callback documentation for more details. The value should be a humantime duration. |
| api.server.fetchConnTimeout | string | "1s" |
fetchConnTimeout configures how long to wait for a connection to a fetch URL. The value should be a humantime duration. A fetch URL is a URL passed in an inference request from which a payload should be downloaded. |
| api.server.fetchTimeout | string | "60s" |
fetchTimeout configures how long to wait for a response from a fetch URL. The value should be a humantime duration. A fetch URL is a URL passed in an inference request from which a payload should be downloaded. |
| api.server.host | string | "0.0.0.0" |
host is the IP address to listen on. You will want to listen on all interfaces to interact with other pods in the cluster. |
| api.server.port | int | 8080 |
port to listen on. |
| api.service | object | `` | Service configuration for the API external service |
| api.service.annotations | object | `` | Additional annotations to add to the service when type is LoadBalancer |
| api.service.externalTrafficPolicy | string | `` | External traffic policy for LoadBalancer service. Options: Cluster, Local Only applies when service type is LoadBalancer |
| api.service.loadBalancerSourceRanges | list | `` | List of IP CIDR ranges allowed to access the LoadBalancer service Only applies when service type is LoadBalancer |
| api.service.type | string | ClusterIP |
Service type for the API external service. Options: ClusterIP, NodePort, LoadBalancer |
| api.serviceAccount.create | bool | true |
Specifies whether to create a default service account for the Deepgram API Deployment. |
| api.serviceAccount.name | string | nil |
Allows providing a custom service account name for the API component. If left empty, the default service account name will be used. If specified, and api.serviceAccount.create = true, this defines the name of the default service account. If specified, and api.serviceAccount.create = false, this provides the name of a preconfigured service account you wish to attach to the API deployment. |
| api.tolerations | list | [] |
Tolerations to apply to API pods. |
| api.topologySpreadConstraints | list | [] |
Topology spread constraints to apply to API pods. |
| api.updateStrategy.rollingUpdate.maxSurge | int | 1 |
The maximum number of extra API pods that can be created during a rollingUpdate, relative to the number of replicas. See the Kubernetes documentation for more details. |
| api.updateStrategy.rollingUpdate.maxUnavailable | int | 0 |
The maximum number of API pods, relative to the number of replicas, that can go offline during a rolling update. See the Kubernetes documentation for more details. |
| aura2 | object | `` | Aura-2 specific configuration options |
| aura2.enabled | bool | false |
Enable Aura-2 features and configuration |
| aura2.english | object | `` | English language configuration for Aura-2 |
| aura2.polyglot | object | `` | Polyglot language configuration for Aura-2 (Dutch, German, French, Italian, Japanese) |
| aura2.spanish | object | `` | Spanish language configuration for Aura-2 |
| billing | object | `` | Configuration options for the Deepgram Billing container. The Billing container is used for airgapped deployments where API and Engine cannot reach license.deepgram.com. It validates licenses locally using a license file and maintains a usage journal. For complete airgapped deployment instructions, see samples/airgapped.md |
| billing.additionalAnnotations | object | nil |
Additional annotations to add to the Billing deployment |
| billing.additionalLabels | object | {} |
Additional labels to add to Billing resources |
| billing.affinity | object | {} |
Affinity and anti-affinity to apply for Billing pods. |
| billing.containerSecurityContext | object | {} |
Container-level security context for Billing containers. |
| billing.enabled | bool | false |
Enable the Billing container for airgapped deployments. When enabled, API and Engine will connect to the Billing container instead of license.deepgram.com. |
| billing.extraEnv | list | [] |
Extra environment variables for the Billing container. |
| billing.image.path | string | "quay.io/deepgram/self-hosted-billing" |
path configures the image path to use for creating Billing containers. You may change this from the public Quay image path if you have imported Deepgram images into a private container registry. |
| billing.image.pullPolicy | string | "IfNotPresent" |
pullPolicy configures how the Kubelet attempts to pull the Deepgram Billing image |
| billing.image.tag | string | "release-260430" |
tag defines which Deepgram release to use for Billing containers |
| billing.journal | object | `` | Configuration for the usage journal volume. The journal tracks usage for billing purposes and must be persisted. WARNING: Do not delete or lose this volume as it contains critical billing data. Failure to persist and return journal files may result in suspension of service. |
| billing.journal.aws | object | `` | AWS-specific volume configuration for billing journals |
| billing.journal.aws.efs.size | string | "1Gi" |
Size of the EFS PVC for journals. |
| billing.journal.aws.efs.storageClassName | string | "" |
StorageClass to use for the EFS PVC. Leave empty to automatically use the StorageClass created by engine.modelManager.volumes.aws.efs. Set a custom value to use a different StorageClass. |
| billing.journal.existingPvcName | string | "" |
Name of an existing PVC to use for journal storage (e.g., EFS-backed shared volume). When set, each Billing pod writes to its own subdirectory: journal-/ Leave empty to auto-provision per-pod EBS volumes via volumeClaimTemplates (default). |
| billing.journal.size | string | "1Gi" |
Size of the journal volume. Only used if existingPvcName is empty and aws.efs.enabled is false. |
| billing.journal.storageClass | string | "" |
Storage class to use for the journal PVC. Only used if existingPvcName is empty and aws.efs.enabled is false. |
| billing.licenseFile | object | `` | Configuration for the Deepgram license file. The license file is a 1-line JSON file provided by Deepgram for airgapped deployments. You must create a Kubernetes Secret containing this file before installing the chart. |
| billing.licenseFile.secretKey | string | "license.dg" |
Key within the Secret that contains the license file content |
| billing.licenseFile.secretRef | string | nil |
Name of the pre-configured K8s Secret containing your Deepgram license file |
| billing.livenessProbe | object | `` | Liveness probe customization for Billing pods. |
| billing.namePrefix | string | "deepgram-billing" |
namePrefix is the prefix to apply to the name of all K8s objects associated with the Deepgram Billing containers. |
| billing.nodeSelector | object | {} |
Node selector to apply to Billing pods. |
| billing.readinessProbe | object | `` | Readiness probe customization for Billing pods. |
| billing.replicas | int | 1 |
Number of Billing replicas. Default is 1 (singleton). Can be increased for high availability. Each replica maintains its own journal file. |
| billing.resources | object | `` | Configure resource limits per Billing container. |
| billing.securityContext | object | {} |
Pod-level security context for Billing pods. |
| billing.server | object | `` | Configure how the Billing container will listen for licensing requests. |
| billing.server.baseUrl | string | "/" |
baseUrl is the prefix for incoming license verification requests. |
| billing.server.certificatesPort | int | 8080 |
certificatesPort is the port for the HTTP certificates endpoint (/v1/certificates). Must be set explicitly for the certificates endpoint to be reachable. |
| billing.server.host | string | "0.0.0.0" |
host is the IP address to listen on. You will want to listen on all interfaces to interact with other pods in the cluster. |
| billing.server.port | int | 8443 |
port to listen on. |
| billing.serviceAccount.create | bool | true |
Specifies whether to create a default service account for the Deepgram Billing Deployment. |
| billing.serviceAccount.name | string | nil |
Allows providing a custom service account name for the Billing component. If left empty, the default service account name will be used. If specified, and billing.serviceAccount.create = true, this defines the name of the default service account. If specified, and billing.serviceAccount.create = false, this provides the name of a preconfigured service account you wish to attach to the Billing deployment. |
| billing.tolerations | list | [] |
Tolerations to apply to Billing pods. |
| billing.topologySpreadConstraints | list | [] |
Topology spread constraints to apply to Billing pods. |
| billing.updateStrategy.rollingUpdate | object | `` | Update strategy for the Billing StatefulSet. StatefulSets update pods one at a time in reverse ordinal order (e.g. billing-2, billing-1, billing-0). |
| billing.updateStrategy.rollingUpdate.maxUnavailable | int | 0 |
Maximum number of Billing pods that can be unavailable during updates. |
| cluster-autoscaler.autoDiscovery.clusterName | string | nil |
Name of your AWS EKS cluster. Using the Cluster Autoscaler on AWS requires knowledge of certain cluster metadata. |
| cluster-autoscaler.awsRegion | string | nil |
Region of your AWS EKS cluster. Using the Cluster Autoscaler on AWS requires knowledge of certain cluster metadata. |
| cluster-autoscaler.enabled | bool | false |
Set to true to enable node autoscaling with AWS EKS. Not needed for GKE, as autoscaling is enabled by a CLI option on cluster creation. |
| cluster-autoscaler.rbac.serviceAccount.annotations."eks.amazonaws.com/role-arn" | string | nil |
Replace with the AWS Role ARN configured for the Cluster Autoscaler. See the Deepgram AWS EKS guide or Cluster Autoscaler AWS documentation for details. |
| cluster-autoscaler.rbac.serviceAccount.name | string | "cluster-autoscaler-sa" |
Name of the IAM Service Account with the necessary permissions |
| engine.additionalAnnotations | object | nil |
Additional annotations to add to the Engine deployment |
| engine.additionalLabels | object | {} |
Additional labels to add to Engine resources |
| engine.affinity | object | {} |
Affinity and anti-affinity to apply for Engine pods. |
| engine.agentOverrides | object | `` | Per-engine-type resource overrides. Only applied when agent.enabled: true. Allows setting different GPU counts per engine type without affecting other resources. Supported keys: agent-speech-to-text, agent-text-to-speech, agent-end-of-turn The agent-text-to-speech engine defaults to gpu: 2 because Aura-2 TTS requires a minimum of 2 GPUs to serve traffic. STT and EOT engines are unaffected and continue to use the global engine.resources values (gpu: 1 by default). If you are not using Aura-2 and want to keep the TTS engine at 1 GPU, override: engine: agentOverrides: agent-text-to-speech: resources: requests: gpu: 1 limits: gpu: 1 |
| engine.agentOverrides.agent-text-to-speech.resources.limits.gpu | int | 2 |
Number of GPUs to limit for the TTS engine. Aura-2 requires 2. |
| engine.agentOverrides.agent-text-to-speech.resources.requests.gpu | int | 2 |
Number of GPUs to request for the TTS engine. Aura-2 requires 2. |
| engine.chunking | object | `` | chunking defines the size of audio chunks to process in seconds. Adjusting these values will affect both inference performance and accuracy of results. Please contact your Deepgram Account Representative if you want to adjust any of these values. |
| engine.chunking.speechToText.batch.maxDuration | float | nil |
maxDuration is the maximum audio duration for a STT chunk size for a batch request |
| engine.chunking.speechToText.batch.minDuration | float | nil |
minDuration is the minimum audio duration for a STT chunk size for a batch request |
| engine.chunking.speechToText.streaming.maxDuration | float | nil |
maxDuration is the maximum audio duration for a STT chunk size for a streaming request |
| engine.chunking.speechToText.streaming.minDuration | float | nil |
minDuration is the minimum audio duration for a STT chunk size for a streaming request |
| engine.chunking.speechToText.streaming.step | float | 1 |
step defines how often to return interim results, in seconds. This value may be lowered to increase the frequency of interim results. However, this also causes a significant decrease in the number of concurrent streams supported by a single GPU. Please contact your Deepgram Account representative for more details. |
| engine.concurrencyLimit.activeRequests | int | nil |
activeRequests limits the number of active requests handled by a single Engine container. If additional requests beyond the limit are sent, the API container forming the request will try a different Engine pod. If no Engine pods are able to accept the request, the API will return a 429 HTTP response to the client. The nil default means no limit will be set. |
| engine.containerSecurityContext | object | {} |
Container-level security context for Engine containers. |
| engine.customToml | string | nil |
Custom TOML sections can be added to extend engine.toml |
| engine.extraEnv | list | [] |
Extra environment variables for the Engine container. |
| engine.features.streamingNer | bool | true |
Enables format entity tags on streaming audio if a valid NER model is available. |
| engine.features.useV2LanguageDetection | bool | false |
Enables use of 36-language detection model. |
| engine.flux.enabled | bool | false |
Enables Flux turn-based streaming STT |
| engine.flux.max_streams | string | nil |
Specify the maximum number of streams for the Flux model in the Engine process. This must be set for production workloads based on the GPU type used in the deployment. |
| engine.flux.model_name | string | "flux-general-en" |
Specify which Flux model to load. Options: flux-general-en (default), flux-general-multi. |
| engine.halfPrecision.state | string | "auto" |
Engine will automatically enable half precision operations if your GPU supports them. You can explicitly enable or disable this behavior with the state parameter which supports "enable", "disabled", and "auto". |
| engine.health.gpuRequired | bool | false |
Engine will automatically fall back to CPU when a GPU is not detected. You can explicitly require a GPU by setting this option to true; production deployments must use a GPU for acceptable performance. |
| engine.image.path | string | "quay.io/deepgram/self-hosted-engine" |
path configures the image path to use for creating Engine containers. You may change this from the public Quay image path if you have imported Deepgram images into a private container registry. |
| engine.image.pullPolicy | string | "IfNotPresent" |
pullPolicy configures how the Kubelet attempts to pull the Deepgram Engine image |
| engine.image.tag | string | "release-260430" |
tag defines which Deepgram release to use for Engine containers |
| engine.lifecycle | object | `` | Configuration for container lifecycle hooks |
| engine.lifecycle.postStart.command | list | [] |
Command to execute in a postStart hook. Leave empty to disable. Example: ["/sbin/ldconfig"] |
| engine.livenessProbe | object | `` | Liveness probe customization for Engine pods. |
| engine.metricsServer | object | `` | metricsServer exposes an endpoint on each Engine container for reporting inference-specific system metrics. See https://developers.deepgram.com/docs/metrics-guide#deepgram-engine for more details. |
| engine.metricsServer.host | string | "0.0.0.0" |
host is the IP address to listen on for metrics requests. You will want to listen on all interfaces to interact with other pods in the cluster. |
| engine.metricsServer.port | int | 9991 |
port to listen on for metrics requests |
| engine.modelManager.models.add | list | [] |
Links to your Deepgram models to automatically download into storage backing a persistent volume. Automatic model management is currently supported for AWS EFS volumes only. Insert each model link provided to you by your Deepgram Account Representative. |
| engine.modelManager.models.links | list | [] |
Deprecated field to automatically download models. Functionality still supported, but migration to use engine.modelManager.models.add is strongly recommended. |
| engine.modelManager.models.remove | list | [] |
If desiring to remove a model from storage (to reduce number of models loaded by Engine on startup), move a link from the engine.modelManager.models.add section to this section. You can also use a model name instead of the full link to designate for removal. Automatic model management is currently supported for AWS EFS volumes only. |
| engine.modelManager.volumes.aws.efs.enabled | bool | false |
Whether to use an AWS Elastic File System to store Deepgram models for use by Engine containers. This option requires your cluster to be running in AWS EKS. |
| engine.modelManager.volumes.aws.efs.fileSystemId | string | nil |
FileSystemId of existing AWS Elastic File System where Deepgram model files will be persisted. You can find it using the AWS CLI: $ aws efs describe-file-systems --query "FileSystems[*].FileSystemId" |
| engine.modelManager.volumes.aws.efs.forceDownload | bool | false |
Whether to force a fresh download of all model links provided, even if models are already present in EFS. |
| engine.modelManager.volumes.aws.efs.namePrefix | string | "dg-models" |
Name prefix for the resources associated with the model storage in AWS EFS. |
| engine.modelManager.volumes.aws.efs.storageClass | object | {"create":true,"name":null} |
Configuration for the EFS StorageClass |
| engine.modelManager.volumes.aws.efs.storageClass.create | bool | true |
Specifies whether a StorageClass should be created |
| engine.modelManager.volumes.aws.efs.storageClass.name | string | nil |
The name of the StorageClass to use. If not set and create is true, a name is generated using <namePrefix>-aws-efs-sc. Required when create is false. |
| engine.modelManager.volumes.customVolumeClaim.enabled | bool | false |
You may manually create your own PersistentVolume and PersistentVolumeClaim to store and expose model files to the Deepgram Engine. Configure your storage beforehand, and enable here. Note: Make sure the PV and PVC accessMode are set to readWriteMany or readOnlyMany |
| engine.modelManager.volumes.customVolumeClaim.modelsDirectory | string | "/" |
Name of the directory within your pre-configured PersistentVolume where the models are stored |
| engine.modelManager.volumes.customVolumeClaim.name | string | nil |
Name of your pre-configured PersistentVolumeClaim |
| engine.modelManager.volumes.gcp.gpd.enabled | bool | false |
Whether to use a GKE Persistent Disk to store Deepgram models for use by Engine containers. This option requires your cluster to be running in GCP GKE. See the GKE documentation on using pre-existing persistent disks. |
| engine.modelManager.volumes.gcp.gpd.fsType | string | "ext4" |
|
| engine.modelManager.volumes.gcp.gpd.namePrefix | string | "dg-models" |
Name prefix for the resources associated with the model storage in GCP GPD. |
| engine.modelManager.volumes.gcp.gpd.storageCapacity | string | "40G" |
The size of your pre-existing persistent disk. |
| engine.modelManager.volumes.gcp.gpd.storageClassName | string | "standard-rwo" |
The storageClassName of the existing persistent disk. |
| engine.modelManager.volumes.gcp.gpd.volumeHandle | string | "" |
The identifier of your pre-existing persistent disk. The format is projects/{project_id}/zones/{zone_name}/disks/{disk_name} for Zonal persistent disks, or projects/{project_id}/regions/{region_name}/disks/{disk_name} for Regional persistent disks. |
| engine.namePrefix | string | "deepgram-engine" |
namePrefix is the prefix to apply to the name of all K8s objects associated with the Deepgram Engine containers. |
| engine.nodeSelector | object | {} |
Node selector to apply to Engine pods. |
| engine.readinessProbe | object | `` | Readiness probe customization for Engine pods. |
| engine.resources | object | `` | Configure resource limits per Engine container. See Deepgram's documentation for more details. |
| engine.resources.gpuResourceName | string | "nvidia.com/gpu" |
Name of the GPU resource to use (e.g., nvidia.com/gpu for standard GPUs, or nvidia.com/mig-4g.40gb for MIG slices). This allows using different GPU resource naming conventions as configured by the NVIDIA GPU Operator. |
| engine.resources.limits.gpu | int | 1 |
Number of GPUs to limit |
| engine.resources.requests.gpu | int | 1 |
Number of GPUs to request |
| engine.runtimeClassName | string | nil |
Runtime class to use for Engine pods. Set to "nvidia" when using KOPS-managed NVIDIA drivers or other environments where the NVIDIA runtime is not the default containerd runtime and the GPU Operator is not installed. |
| engine.securityContext | object | {} |
Pod-level security context for Engine pods. |
| engine.server | object | `` | Configure Engine containers to listen for requests from API containers. |
| engine.server.host | string | "0.0.0.0" |
host is the IP address to listen on for inference requests. You will want to listen on all interfaces to interact with other pods in the cluster. |
| engine.server.port | int | 8080 |
port to listen on for inference requests |
| engine.service | object | `` | Service configuration for the Engine metrics service |
| engine.service.annotations | object | `` | Additional annotations to add to the service when type is LoadBalancer |
| engine.service.externalTrafficPolicy | string | `` | External traffic policy for LoadBalancer service. Options: Cluster, Local Only applies when service type is LoadBalancer |
| engine.service.loadBalancerSourceRanges | list | `` | List of IP CIDR ranges allowed to access the LoadBalancer service Only applies when service type is LoadBalancer |
| engine.service.type | string | ClusterIP |
Service type for the Engine metrics service. Options: ClusterIP, NodePort, LoadBalancer |
| engine.serviceAccount.create | bool | true |
Specifies whether to create a default service account for the Deepgram Engine Deployment. |
| engine.serviceAccount.name | string | nil |
Allows providing a custom service account name for the Engine component. If left empty, the default service account name will be used. If specified, and engine.serviceAccount.create = true, this defines the name of the default service account. If specified, and engine.serviceAccount.create = false, this provides the name of a preconfigured service account you wish to attach to the Engine deployment. |
| engine.startupProbe | object | `` | The startupProbe combination of periodSeconds and failureThreshold allows time for the container to load all models and start listening for incoming requests. Model load time can be affected by hardware I/O speeds, as well as network speeds if you are using a network volume mount for the models. If you are hitting the failure threshold before models are finished loading, you may want to extend the startup probe. However, this will also extend the time it takes to detect a pod that can't establish a network connection to validate its license. |
| engine.startupProbe.failureThreshold | int | 60 |
failureThreshold defines how many unsuccessful startup probe attempts are allowed before the container will be marked as Failed |
| engine.startupProbe.periodSeconds | int | 10 |
periodSeconds defines how often to execute the probe. |
| engine.tolerations | list | [] |
Tolerations to apply to Engine pods. |
| engine.topologySpreadConstraints | list | [] |
Topology spread constraints to apply to Engine pods. |
| engine.ttlSecondsAfterFinished | int | 900 |
This determines the number of seconds that the Deepgram model manager Pod will be retained after it finishes. As long as this Pod exists, it will keep a claim on the Kubernetes PersistentVolumeClaim, which can cause the PVC to hang in the "terminating" state if the Helm chart is deleted (uninstalled). |
| engine.updateStrategy.rollingUpdate.maxSurge | int | 1 |
The maximum number of extra Engine pods that can be created during a rollingUpdate, relative to the number of replicas. See the Kubernetes documentation for more details. |
| engine.updateStrategy.rollingUpdate.maxUnavailable | int | 0 |
The maximum number of Engine pods, relative to the number of replicas, that can go offline during a rolling update. See the Kubernetes documentation for more details. |
| global.additionalLabels | object | {} |
Additional labels to add to all Deepgram resources |
| global.deepgramLicenseSecretRef | string | nil |
Name of the pre-configured K8s Secret containing your Deepgram license key for airgapped deployments with Billing container. Only required when billing.enabled is true. |
| global.deepgramSecretRef | string | nil |
Name of the pre-configured K8s Secret containing your Deepgram self-hosted API key. See chart docs for more details. |
| global.initContainer.image.pullPolicy | string | "IfNotPresent" |
|
| global.initContainer.image.registry | string | "docker.io" |
|
| global.initContainer.image.repository | string | "ubuntu" |
|
| global.initContainer.image.tag | string | "22.04" |
|
| global.initContainer.pullSecretRef | string | "" |
Optional: Override imagePullSecrets for init container. Leave empty to default to global.pullSecretRef |
| global.outstandingRequestGracePeriod | int | 1800 |
When an API or Engine container is signaled to shutdown via Kubernetes sending a SIGTERM signal, the container will stop listening on its port, and no new requests will be routed to that container. However, the container will continue to run until all existing batch or streaming requests have completed, after which it will gracefully shut down. Batch requests should be finished within 10-15 minutes, but streaming requests can proceed indefinitely. outstandingRequestGracePeriod defines the period (in sec) after which Kubernetes will forcefully shutdown the container, terminating any outstanding connections. 1800 / 60 sec/min = 30 mins |
| global.pullSecretRef | string | nil |
If using images from the Deepgram Quay image repositories, or another private registry to which your cluster doesn't have default access, you will need to provide a pre-configured K8s Secret with image repository credentials. See chart docs for more details. |
| gpu-operator | object | `` | Passthrough values for NVIDIA GPU Operator Helm chart You may use the NVIDIA GPU Operator to manage installation of NVIDIA drivers and the container toolkit on nodes with attached GPUs. |
| gpu-operator.driver.enabled | bool | true |
Whether to install NVIDIA drivers on nodes where a NVIDIA GPU is detected. If your Kubernetes nodes run a base image that comes with NVIDIA drivers pre-configured, disable this option, but keep the parent gpu-operator and sibling toolkit options enabled. |
| gpu-operator.driver.version | string | "550.54.15" |
NVIDIA driver version to install. |
| gpu-operator.enabled | bool | true |
Whether to install the NVIDIA GPU Operator to manage driver and/or container toolkit installation. See the list of supported Operating Systems to verify compatibility with your cluster/nodes. Disable this option if your cluster/nodes are not compatible. If disabled, you will need to self-manage NVIDIA software installation on all nodes where you want to schedule Deepgram Engine pods. |
| gpu-operator.toolkit.enabled | bool | true |
Whether to install the NVIDIA container toolkit on nodes where a NVIDIA GPU is detected. |
| gpu-operator.toolkit.version | string | "v1.15.0-ubi8" |
NVIDIA container toolkit to install. The default ubuntu image tag for the toolkit requires a dynamic runtime link to a version of GLIBC that may not be present on nodes running older Linux distribution releases, such as Ubuntu 22.04. Therefore, we specify the ubi8 image, which statically links the GLIBC library and avoids this issue. |
| kube-prometheus-stack | object | `` | Passthrough values for Prometheus k8s stack Helm chart. Prometheus (and its adapter) should be configured when scaling.auto is enabled. You may choose to use the installation/configuration bundled in this Helm chart, or you may configure an existing Prometheus installation in your cluster to expose the needed values. See source Helm chart for explanation of available values. Default values provided in this chart are used to provide pod autoscaling for Deepgram pods. |
| kube-prometheus-stack.includeDependency | bool | nil |
Normally, this chart will be installed if scaling.auto.enabled is true. However, if you wish to manage the Prometheus adapter in your cluster on your own and not as part of the Deepgram Helm chart, you can force it to not be installed by setting this to false. |
| licenseProxy | object | `` | Configuration options for the optional Deepgram License Proxy. |
| licenseProxy.additionalAnnotations | object | nil |
Additional annotations to add to the LicenseProxy deployment |
| licenseProxy.additionalLabels | object | {} |
Additional labels to add to License Proxy resources |
| licenseProxy.affinity | object | {} |
Affinity and anti-affinity to apply for License Proxy pods. |
| licenseProxy.containerSecurityContext | object | {} |
Container-level security context for License Proxy containers. |
| licenseProxy.deploySecondReplica | bool | false |
If the License Proxy is deployed, one replica should be sufficient to support many API/Engine pods. Highly available environments may wish to deploy a second replica to ensure uptime, which can be toggled with this option. |
| licenseProxy.enabled | bool | false |
The License Proxy is optional, but highly recommended to be deployed in production to enable highly available environments. |
| licenseProxy.extraEnv | list | [] |
Extra environment variables for the License Proxy container. |
| licenseProxy.image.path | string | "quay.io/deepgram/self-hosted-license-proxy" |
path configures the image path to use for creating License Proxy containers. You may change this from the public Quay image path if you have imported Deepgram images into a private container registry. |
| licenseProxy.image.pullPolicy | string | "IfNotPresent" |
pullPolicy configures how the Kubelet attempts to pull the Deepgram License Proxy image |
| licenseProxy.image.tag | string | "release-260430" |
tag defines which Deepgram release to use for License Proxy containers |
| licenseProxy.keepUpstreamServerAsBackup | bool | true |
Even with a License Proxy deployed, API and Engine pods can be configured to keep the upstream license.deepgram.com license server as a fallback licensing option if the License Proxy is unavailable. Disable this option if you are restricting API/Engine Pod network access for security reasons, and only the License Proxy should send egress traffic to the upstream license server. |
| licenseProxy.livenessProbe | object | `` | Liveness probe customization for Proxy pods. |
| licenseProxy.namePrefix | string | "deepgram-license-proxy" |
namePrefix is the prefix to apply to the name of all K8s objects associated with the Deepgram License Proxy containers. |
| licenseProxy.nodeSelector | object | {} |
Node selector to apply to License Proxy pods. |
| licenseProxy.readinessProbe | object | `` | Readiness probe customization for License Proxy pods. |
| licenseProxy.resources | object | `` | Configure resource limits per License Proxy container. See Deepgram's documentation for more details. |
| licenseProxy.securityContext | object | {} |
Pod-level security context for License Proxy pods. |
| licenseProxy.server | object | `` | Configure how the license proxy will listen for licensing requests. |
| licenseProxy.server.baseUrl | string | "/" |
baseUrl is the prefix for incoming license verification requests. |
| licenseProxy.server.host | string | "0.0.0.0" |
host is the IP address to listen on. You will want to listen on all interfaces to interact with other pods in the cluster. |
| licenseProxy.server.port | int | 8443 |
port to listen on. |
| licenseProxy.server.statusPort | int | 8080 |
statusPort is the port to listen on for the status/health endpoint. |
| licenseProxy.service | object | `` | Service configuration for the License Proxy status service |
| licenseProxy.service.annotations | object | `` | Additional annotations to add to the service when type is LoadBalancer |
| licenseProxy.service.externalTrafficPolicy | string | `` | External traffic policy for LoadBalancer service. Options: Cluster, Local Only applies when service type is LoadBalancer |
| licenseProxy.service.loadBalancerSourceRanges | list | `` | List of IP CIDR ranges allowed to access the LoadBalancer service Only applies when service type is LoadBalancer |
| licenseProxy.service.type | string | ClusterIP |
Service type for the License Proxy status service. Options: ClusterIP, NodePort, LoadBalancer |
| licenseProxy.serviceAccount.create | bool | true |
Specifies whether to create a default service account for the Deepgram License Proxy Deployment. |
| licenseProxy.serviceAccount.name | string | nil |
Allows providing a custom service account name for the LicenseProxy component. If left empty, the default service account name will be used. If specified, and licenseProxy.serviceAccount.create = true, this defines the name of the default service account. If specified, and licenseProxy.serviceAccount.create = false, this provides the name of a preconfigured service account you wish to attach to the License Proxy deployment. |
| licenseProxy.tolerations | list | [] |
Tolerations to apply to License Proxy pods. |
| licenseProxy.topologySpreadConstraints | list | [] |
Topology spread constraints to apply to License Proxy pods. |
| licenseProxy.updateStrategy.rollingUpdate | object | `` | For the LicenseProxy, we only expose maxSurge and not maxUnavailable. This is to avoid accidentally having all LicenseProxy nodes go offline during upgrades, which could impact the entire cluster's connection to the Deepgram License Server. |
| licenseProxy.updateStrategy.rollingUpdate.maxSurge | int | 1 |
The maximum number of extra License Proxy pods that can be created during a rollingUpdate, relative to the number of replicas. See the Kubernetes documentation for more details. |
| prometheus-adapter | object | `` | Passthrough values for Prometheus Adapter Helm chart. Prometheus, and its adapter here, should be configured when scaling.auto is enabled. You may choose to use the installation/configuration bundled in this Helm chart, or you may configure an existing Prometheus installation in your cluster to expose the needed values. See source Helm chart for explanation of available values. Default values provided in this chart are used to provide pod autoscaling for Deepgram pods. |
| prometheus-adapter.includeDependency | string | nil |
Normally, this chart will be installed if scaling.auto.enabled is true. However, if you wish to manage the Prometheus adapter in your cluster on your own and not as part of the Deepgram Helm chart, you can force it to not be installed by setting this to false. |
| scaling | object | `` | Configuration options for horizontal scaling of Deepgram services. Only one of static and auto options can be enabled. |
| scaling.auto | object | `` | Enable pod autoscaling based on system load/traffic. |
| scaling.auto.api.metrics.custom | list | nil |
If you have custom metrics you would like to scale with, you may add them here. See the k8s docs for how to structure a list of metrics |
| scaling.auto.api.metrics.engineToApiRatio | int | 4 |
Scale the API deployment to this Engine-to-API pod ratio |
| scaling.auto.engine.behavior | object | "See values.yaml file for default" | Configurable scaling behavior |
| scaling.auto.engine.maxReplicas | int | 10 |
Maximum number of Engine replicas. |
| scaling.auto.engine.metrics.custom | list | [] |
If you have custom metrics you would like to scale with, you may add them here. See the k8s docs for how to structure a list of metrics |
| scaling.auto.engine.metrics.requestCapacityRatio | string | nil |
If engine.concurrencyLimit.activeRequests is set, this variable will define the ratio of current active requests to maximum active requests at which the Engine pods will scale. Setting this value too close to 1.0 may lead to a situation where the cluster is at max capacity and rejects incoming requests. Setting the ratio too close to 0.0 will over-optimistically scale your cluster and increase compute costs unnecessarily. |
| scaling.auto.engine.metrics.speechToText.batch.requestsPerPod | int | nil |
Scale the Engine pods based on a static desired number of speech-to-text batch requests per pod |
| scaling.auto.engine.metrics.speechToText.streaming.requestsPerPod | int | nil |
Scale the Engine pods based on a static desired number of speech-to-text streaming requests per pod |
| scaling.auto.engine.metrics.textToSpeech.batch.requestsPerPod | int | nil |
Scale the Engine pods based on a static desired number of text-to-speech batch requests per pod |
| scaling.auto.engine.minReplicas | int | 1 |
Minimum number of Engine replicas. |
| scaling.replicas | object | `` | Number of replicas to set during initial installation. |
| scaling.replicas.engine | int | 1 |
Engine replicas can be specified either as a single number for one engine type, or as individual counts for each engine type when Voice Agent is enabled. |
| Name | Email | Url |
|---|---|---|
| Deepgram Self-Hosted | self.hosted@deepgram.com | |