Production-ready Azure Kubernetes Service (AKS) infrastructure with GitOps using ArgoCD, fully automated via GitHub Actions and Terraform.
This project demonstrates a complete GitOps workflow from zero to a fully automated Kubernetes cluster:
- Bootstrap → Create backend storage for Terraform state
- Setup → Configure service principals with OIDC authentication
- Deploy → Push to GitHub, infrastructure deploys automatically
- GitOps → ArgoCD syncs applications from Git every 30 seconds
- Scale → KEDA autoscales based on CPU/memory metrics
- Monitor → Prometheus + Grafana for metrics, Loki for logs
- Cleanup → One command destroys everything
Total setup time: ~20 minutes (mostly waiting for AKS cluster)
Manual steps: Only 2 (bootstrap, add 2 secrets)
Everything else: Fully automated via GitHub Actions and ArgoCD
- AKS Cluster: Kubernetes 1.34 with Cluster Autoscaler (1-5 nodes, starts with 2)
- Networking: VNet with dedicated subnet
- Storage: Azure-managed persistent volumes
- Autoscaling: Cluster Autoscaler for automatic node scaling
- ArgoCD: Automated application deployment with app-of-apps pattern
- GitHub Actions: Dual-credential CI/CD pipeline
- Terraform: Infrastructure as Code with remote state
- nginx: Web server with KEDA autoscaling
- KEDA: Event-driven autoscaling (CPU/Memory triggers)
- Prometheus Stack: Metrics collection and alerting
- Grafana: Metrics visualization and dashboards
- Loki: Log aggregation backend
- Promtail: Log collection from all pods
- ✅ Azure Workload Identity (OIDC): No stored credentials
- ✅ Federated Authentication: GitHub Actions authenticates via OIDC
- ✅ Dual-Credential Approach: Separate read/write permissions
- ✅ Azure RBAC: Role-based access control on AKS
- ✅ Least Privilege: Minimal permissions for each service principal
- ✅ Encrypted State: Terraform state in Azure Storage with encryption
- ✅ No Secrets in Code: All sensitive data in GitHub Secrets
- ✅ Branch Protection: PRs required, no direct pushes to main
- ✅ Trivy: IaC security scanning in CI/CD pipeline
- ✅ Terraform Validation: Format and validation checks
- ℹ️ Note: CodeQL is not included because this project contains only IaC. It would be worth adding for application code.
- Azure CLI (az login)
- GitHub CLI (gh auth login)
- Terraform (v1.13.5+)
- kubectl
- Git
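Before running the scripts, it can help to confirm that the tools listed above are installed. The following is a minimal sketch, not part of the repository: the `missing_tools` helper is a hypothetical name, and it checks only that each command is on PATH, not that `az login` or `gh auth login` has been completed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every given tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools named in the prerequisites list.
missing="$(missing_tools az gh terraform kubectl git)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
else
  echo "All prerequisite tools found."
fi
```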
./scripts/bootstrap-backend.sh
What it does:
- Creates Azure Storage account for Terraform state
- Automatically updates terraform/backend.tf with the storage account name
- No manual configuration needed!
Output:
✅ Backend created successfully!
✅ Updated terraform/backend.tf automatically!
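To illustrate the kind of file the bootstrap step maintains, here is a hedged sketch that renders an `azurerm` backend block from a storage account name. The `render_backend_tf` function and every example value (resource group, account, container, key) are assumptions for illustration. The actual script may write different names or extra settings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Terraform azurerm backend block for the given names.
render_backend_tf() {
  local rg="$1" account="$2" container="$3" key="$4"
  cat <<EOF
terraform {
  backend "azurerm" {
    resource_group_name  = "${rg}"
    storage_account_name = "${account}"
    container_name       = "${container}"
    key                  = "${key}"
  }
}
EOF
}

# Placeholder values only; these are not the names the real script uses.
render_backend_tf "tfstate-rg" "tfstateexample123" "tfstate" "aks-gitops-lab.tfstate"
```

Redirecting the output to terraform/backend.tf would reproduce the "automatic update" behaviour described above.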
./scripts/setup-complete-access.sh
What it does:
- Creates 2 service principals (full-access + read-only)
- Assigns Azure roles (Contributor, User Access Administrator, Reader)
- Configures federated credentials for GitHub Actions
- Automatically adds 5 GitHub secrets
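The federated credentials are what let GitHub Actions authenticate without a stored secret. As a rough sketch of what such a credential looks like, the snippet below prints the JSON payload for one credential per trigger. The `federated_credential_json` function, the credential names, and the `your-org/your-repo` repository are illustrative assumptions. The issuer, audience, and subject formats follow GitHub's OIDC conventions for Azure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a federated credential definition for GitHub Actions OIDC.
federated_credential_json() {
  local name="$1" subject="$2"
  cat <<EOF
{
  "name": "${name}",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "${subject}",
  "audiences": ["api://AzureADTokenExchange"]
}
EOF
}

# Placeholder repository; replace with your own org/repo.
REPO="your-org/your-repo"

# Pushes to main would map to the full-access service principal.
federated_credential_json "github-main" "repo:${REPO}:ref:refs/heads/main"

# Pull requests would map to the read-only service principal.
federated_credential_json "github-pr" "repo:${REPO}:pull_request"
```

Each payload could then be saved to a file and passed to `az ad app federated-credential create --id <app-id> --parameters <file>`.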
Action: Add 2 secrets manually:
gh secret set GIT_USERNAME -b "your-github-username"
gh secret set GIT_TOKEN -b "your-github-pat"
git add .
git commit -m "Initial deployment"
git push origin main
That's it! GitHub Actions will:
- Run the security scan and terraform plan
- Deploy AKS cluster (~15 minutes)
- Install ArgoCD
- Deploy all applications automatically
┌─────────────────────────────────────────────────────────────┐
│ GitHub Actions │
├─────────────────────────────────────────────────────────────┤
│ │
│ Pull Request (Feature Branch) │
│ ├─ Service Principal: aks-gitops-lab-readonly │
│ ├─ Permissions: Reader, Storage access, AKS read │
│ ├─ Action: terraform plan only │
│ └─ Purpose: Safe testing before merge │
│ │
│ Main Branch (After Merge) │
│ ├─ Service Principal: aks-gitops-lab-github │
│ ├─ Permissions: Contributor, User Access Admin │
│ ├─ Action: terraform apply │
│ └─ Purpose: Deploy infrastructure │
│ │
└─────────────────────────────────────────────────────────────┘
Developer → PR → Plan (read-only) → Review → Merge → Apply (full-access) → ArgoCD syncs apps
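The branching logic in the flow above can be sketched as a small shell function. This is an illustration of the decision, not the workflow's actual implementation: the function names are hypothetical, while the service principal names come from the diagram and `GITHUB_EVENT_NAME` / `GITHUB_REF` are the standard GitHub Actions environment variables.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the service principal for a given trigger.
select_service_principal() {
  local event="$1" ref="$2"
  if [ "$event" = "pull_request" ]; then
    echo "aks-gitops-lab-readonly"
  elif [ "$event" = "push" ] && [ "$ref" = "refs/heads/main" ]; then
    echo "aks-gitops-lab-github"
  else
    return 1  # Any other trigger gets no credential in this sketch.
  fi
}

# Hypothetical helper: map a service principal to the Terraform action it may run.
terraform_action() {
  case "$1" in
    aks-gitops-lab-readonly) echo "plan" ;;
    aks-gitops-lab-github)   echo "apply" ;;
    *) return 1 ;;
  esac
}

# Defaults to a pull request when run outside GitHub Actions.
if sp="$(select_service_principal "${GITHUB_EVENT_NAME:-pull_request}" "${GITHUB_REF:-}")"; then
  echo "$sp -> terraform $(terraform_action "$sp")"
fi
```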
┌──────────────────────────────────────────────────────────────┐
│ ArgoCD │
├──────────────────────────────────────────────────────────────┤
│ │
│ core-apps (App of Apps) │
│ ├─ Monitors: argocd-apps/ directory │
│ ├─ Auto-sync: Every 30 seconds │
│ └─ Auto-prune: Removes deleted apps │
│ │
│ Applications │
│ ├─ nginx (with KEDA autoscaling) │
│ ├─ keda (autoscaling controller) │
│ ├─ kube-prometheus-stack (monitoring) │
│ ├─ loki (log aggregation) │
│ └─ promtail (log collection) │
│ │
└──────────────────────────────────────────────────────────────┘
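For readers new to the app-of-apps pattern, the parent application is itself an ArgoCD Application whose source path is the directory of child definitions. The sketch below prints what such a manifest might look like. The `render_core_apps` function and the repository URL are placeholders, and the repository's real core-apps definition may differ. The sync interval is not part of the Application spec. It would be set elsewhere, for example through ArgoCD's `timeout.reconciliation` setting, which this sketch does not show.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an app-of-apps Application that watches argocd-apps/.
render_core_apps() {
  local repo_url="$1" revision="${2:-main}"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: core-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ${repo_url}
    targetRevision: ${revision}
    path: argocd-apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # Remove child apps deleted from Git.
      selfHeal: true   # Revert manual drift in the cluster.
EOF
}

# Placeholder repository URL.
render_core_apps "https://github.com/your-org/your-repo.git"
```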
.
├── .github/workflows/
│   ├── terraform.yml  # Main CI/CD pipeline
│   └── destroy.yml  # Infrastructure cleanup
├── apps/  # Helm charts for applications
│   ├── nginx/
│   │   ├── Chart.yaml
│   │   ├── values.yaml
│   │   └── templates/
│   │       ├── deployment.yaml
│   │       ├── service.yaml
│   │       ├── namespace.yaml
│   │       └── scaledobject.yaml  # KEDA autoscaling
│   ├── keda/
│   ├── kube-prometheus-stack/
│   ├── loki/
│   └── promtail/
├── argocd-apps/  # ArgoCD application definitions
│   ├── nginx.yaml
│   ├── keda.yaml
│   ├── kube-prometheus-stack.yaml
│   ├── loki.yaml
│   └── promtail.yaml
├── terraform/  # Terraform infrastructure
│   ├── modules/  # Terraform modules
│   │   ├── aks/  # AKS cluster configuration
│   │   ├── argocd/  # ArgoCD Helm deployment
│   │   ├── resource-group/  # Azure resource group
│   │   └── vnet/  # Virtual network
│   ├── backend.tf  # Terraform backend configuration
│   ├── main.tf  # Main Terraform configuration
│   ├── variables.tf  # Variable definitions
│   ├── outputs.tf  # Output definitions
│   └── provider.tf  # Provider configuration
├── scripts/  # Automation scripts
│   ├── bootstrap-backend.sh
│   ├── setup-complete-access.sh
│   └── cleanup-all.sh
└── README.md
mkdir -p apps/myapp/templates
Create apps/myapp/Chart.yaml:
apiVersion: v2
name: myapp
version: 1.0.0
Create apps/myapp/values.yaml:
replicaCount: 2
image:
  repository: myapp
  tag: "latest"
Create argocd-apps/myapp.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/your-repo.git
    targetRevision: main
    path: apps/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
git add apps/ argocd-apps/
git commit -m "Add myapp"
git push
ArgoCD will automatically deploy your app in ~30 seconds!
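The walkthrough above creates Chart.yaml and values.yaml but no manifest under templates/, and a chart with an empty templates directory renders no resources. The sketch below scaffolds the same files plus a minimal Deployment template. The `scaffold_app` function and the template contents are illustrative assumptions, not files from this repository, and a real chart would likely add a Service and resource requests.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scaffold a minimal Helm chart under <root>/apps/<name>.
scaffold_app() {
  local name="$1" root="${2:-.}"
  mkdir -p "$root/apps/$name/templates"

  cat > "$root/apps/$name/Chart.yaml" <<EOF
apiVersion: v2
name: ${name}
version: 1.0.0
EOF

  cat > "$root/apps/$name/values.yaml" <<EOF
replicaCount: 2
image:
  repository: ${name}
  tag: "latest"
EOF

  # The quoted heredoc delimiter keeps the Helm {{ ... }} expressions literal.
  cat > "$root/apps/$name/templates/deployment.yaml" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Chart.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Chart.Name }}
  template:
    metadata:
      labels:
        app: {{ .Chart.Name }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
EOF
}

# Scaffold into a scratch directory so nothing in the repo is overwritten.
demo_dir="$(mktemp -d)"
scaffold_app myapp "$demo_dir"
find "$demo_dir" -type f | sort
```

Running `scaffold_app myapp .` from the repository root would write the files in place. The ArgoCD Application in argocd-apps/ still needs to be added separately.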
# Get credentials
az aks get-credentials --resource-group aks-gitops-lab --name aks-gitops-lab-aks --admin
# Check cluster
kubectl get nodes
kubectl get pods --all-namespaces
# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Get password
kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath="{.data.password}" | base64 -d
# Open browser
open https://localhost:8080
# Username: admin
# Password: (from above command)
# Port forward
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Get password
kubectl get secret kube-prometheus-stack-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Open browser
open http://localhost:3000
# Username: admin
# Password: (from above command)
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
open http://localhost:9090
./scripts/cleanup-all.sh
This removes:
- ✅ Service principals and role assignments
- ✅ GitHub secrets
- ✅ Backend storage account
- ✅ All resource groups
- ✅ Local Terraform state files
# Destroy infrastructure only
gh workflow run destroy.yml -f confirm=destroy
Solution: The read-only service principal needs the AKS Cluster Admin role. This is configured automatically in Terraform (modules/aks/main.tf).
Possible causes:
- GitHub token expired
- Repository URL incorrect
- Branch name mismatch
Solution:
# Check ArgoCD repo secret
kubectl get secret argocd-repo -n argocd -o yaml
# Update if needed
kubectl delete secret argocd-repo -n argocd
# Re-run terraform apply to recreate
Solution: Scale up nodes
# Edit terraform/modules/aks/variables.tf
node_count = 3 # Increase from 2
# Commit and push
git add terraform/modules/aks/variables.tf
git commit -m "Scale to 3 nodes"
git push
Solution: This is cosmetic if using ServerSideApply with webhooks. The application is still functional. Remove ServerSideApply if it bothers you:
syncOptions:
  - CreateNamespace=true
  # Remove: - ServerSideApply=true
- Node metrics: CPU, memory, disk, network
- Pod metrics: Resource usage per pod
- Cluster metrics: Overall cluster health
- Custom metrics: Application-specific metrics
- Centralized logging: All pod logs in one place
- Query language: LogQL for powerful log queries
- Retention: Configurable log retention policies
- Integration: Grafana dashboards for log visualization
- CPU-based: Scale on CPU utilization
- Memory-based: Scale on memory usage
- Custom metrics: Scale on any Prometheus metric
- Event-driven: Scale on queue depth, HTTP requests, etc.
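As an illustration of the CPU-based case, the sketch below prints a KEDA ScaledObject with a single `cpu` trigger. The `render_scaledobject` function and the replica counts and threshold are assumed example values, not the contents of the repository's scaledobject.yaml. It also assumes the target Deployment has the same name as the ScaledObject and declares CPU requests, which utilization-based scaling relies on.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a ScaledObject that scales a Deployment on CPU utilization.
render_scaledobject() {
  local name="$1" min="$2" max="$3" cpu_pct="$4"
  cat <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ${name}
spec:
  scaleTargetRef:
    name: ${name}          # Deployment to scale (assumed to share the name).
  minReplicaCount: ${min}
  maxReplicaCount: ${max}
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "${cpu_pct}"  # Target average CPU utilization in percent.
EOF
}

# Example values only: 2 to 10 replicas, targeting 70% CPU.
render_scaledobject nginx 2 10 70
```

A memory trigger follows the same shape with `type: memory`.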
- AKS: 2 x Standard_B2s nodes (~$60/month)
- Storage: Standard_LRS (~$0.10/month)
- Load Balancer: Standard (~$20/month)
- Total: ~$80-90/month
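The total can be checked by summing the itemized figures. The snippet below is a small sketch using a hypothetical `monthly_total` helper. The three listed items add up to roughly $80, so the upper end of the quoted range presumably leaves room for charges that are not itemized here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum monthly cost figures (USD) and print the total with two decimals.
monthly_total() {
  printf '%s\n' "$@" | LC_ALL=C awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# AKS nodes + storage + load balancer, using the estimates listed above.
monthly_total 60 0.10 20   # prints 80.10
```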
- Use spot instances for non-production workloads
- Scale down when not in use
- Use smaller node sizes for dev/test
- Enable cluster autoscaler to scale user node pools to zero (AKS system node pools must keep at least one node)
- Destroy infrastructure when not needed
# Destroy when not in use
gh workflow run destroy.yml -f confirm=destroy
# Redeploy when needed
git commit --allow-empty -m "Redeploy" && git push
- ✅ No credentials in code or version control
- ✅ Federated authentication (OIDC)
- ✅ Separate read/write service principals
- ✅ Encrypted Terraform state
- ✅ Azure RBAC on AKS cluster
- ✅ Network policies (via Azure CNI)
- ✅ Secrets stored in GitHub Secrets
Security Enhancements:
- 🔲 External Secrets Operator - Sync secrets from Azure Key Vault
- 🔲 Private Cluster Endpoint - Restrict API server access
- 🔲 Network Policies - Control pod-to-pod traffic
- 🔲 Pod Security Standards - Enforce security policies
- 🔲 Azure Policy - Compliance and governance
Infrastructure Improvements:
- 🔲 Separate Node Pools - System vs user workloads
- 🔲 Production VM Sizes - Standard_D2s_v3 instead of B2s
- 🔲 Resource Limits - CPU/memory limits on all pods
- 🔲 Velero Backups - Disaster recovery
- 🔲 Multi-region - High availability
Operational:
- 🔲 Cost Alerts - Azure Cost Management budgets
- 🔲 Terraform Workspaces - Dev/staging/prod environments
- 🔲 Runbooks - Incident response procedures
- 🔲 SLO/SLA Monitoring - Service level objectives
- ✅ Backend storage creation
- ✅ Backend configuration auto-update
- ✅ Service principal creation and configuration
- ✅ Role assignments (subscription and cluster level)
- ✅ GitHub secrets (5 of 7 automated)
- ✅ AKS cluster deployment
- ✅ Cluster Autoscaler configuration
- ✅ ArgoCD installation and configuration
- ✅ Application deployment via GitOps
- ✅ KEDA autoscaling setup
- ✅ Monitoring stack deployment
- ❌ Add GIT_USERNAME secret (one-time)
- ❌ Add GIT_TOKEN secret (one-time)
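To confirm the two manual secrets are in place before pushing, a check along the following lines may help. It is a sketch under assumptions: `missing_secrets` is a hypothetical helper, and it assumes `gh secret list` prints one secret per line with the name in the first column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read secret names (first column per line) on stdin and
# print each required name that is absent.
missing_secrets() {
  local present required
  present="$(awk '{ print $1 }')"
  for required in "$@"; do
    printf '%s\n' "$present" | grep -qxF "$required" || printf '%s\n' "$required"
  done
}

# Only runs when the GitHub CLI is available; prints nothing if both secrets exist.
if command -v gh >/dev/null 2>&1; then
  gh secret list 2>/dev/null | missing_secrets GIT_USERNAME GIT_TOKEN
fi
```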
- Azure Kubernetes Service Documentation
- ArgoCD Documentation
- KEDA Documentation
- Terraform Azure Provider
- GitOps Principles
MIT
This is a learning lab project. Feel free to fork and adapt for your needs!
- Purpose: Learning and portfolio demonstration
- Environment: Lab/Development
- VM Size: Standard_B2s (burstable, cost-optimized)
- Security: Basic (OIDC, RBAC, encrypted state)
This setup provides a solid foundation but requires these enhancements:
Must Have:
- Private cluster endpoint
- Network policies
- Resource limits on all pods
- External Secrets Operator with Key Vault
- Velero backups
- Production VM sizes (Standard_D2s_v3+)
Should Have:
- Separate system/user node pools
- Cost alerts and budgets
- Multi-environment setup (dev/staging/prod)
- Comprehensive monitoring and alerting
- Disaster recovery plan
Cost Considerations:
- Current setup: ~$80-90/month
- Production setup: ~$200-300/month (with redundancy)
- Remember to destroy resources when not in use