chiju/aks-gitops-lab

AKS GitOps Lab

Production-ready Azure Kubernetes Service (AKS) infrastructure with GitOps using ArgoCD, fully automated via GitHub Actions and Terraform.

🚀 From Scratch to Production

This project demonstrates a complete GitOps workflow from zero to a fully automated Kubernetes cluster:

  1. Bootstrap → Create backend storage for Terraform state
  2. Setup → Configure service principals with OIDC authentication
  3. Deploy → Push to GitHub, infrastructure deploys automatically
  4. GitOps → ArgoCD syncs applications from Git every 30 seconds
  5. Scale → KEDA autoscales based on CPU/memory metrics
  6. Monitor → Prometheus + Grafana for metrics, Loki for logs
  7. Cleanup → One command destroys everything

Total setup time: ~20 minutes (mostly waiting for AKS cluster)

Manual steps: Only 2 (bootstrap, add 2 secrets)

Everything else: Fully automated via GitHub Actions and ArgoCD

🎯 What Gets Deployed

Infrastructure

  • AKS Cluster: Kubernetes 1.34 with Cluster Autoscaler (1-5 nodes, starts with 2)
  • Networking: VNet with dedicated subnet
  • Storage: Azure-managed persistent volumes
  • Autoscaling: Cluster Autoscaler for automatic node scaling

GitOps & Automation

  • ArgoCD: Automated application deployment with app-of-apps pattern
  • GitHub Actions: Dual-credential CI/CD pipeline
  • Terraform: Infrastructure as Code with remote state

Applications & Services

  • nginx: Web server with KEDA autoscaling
  • KEDA: Event-driven autoscaling (CPU/Memory triggers)
  • Prometheus Stack: Metrics collection and alerting
  • Grafana: Metrics visualization and dashboards
  • Loki: Log aggregation backend
  • Promtail: Log collection from all pods

🔐 Security Features

Authentication & Authorization

  • Azure Workload Identity (OIDC): No stored credentials
  • Federated Authentication: GitHub Actions authenticates via OIDC
  • Dual-Credential Approach: Separate read/write permissions
  • Azure RBAC: Role-based access control on AKS
  • Least Privilege: Minimal permissions for each service principal
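
Under the hood, the OIDC trust is a federated credential on each service principal. A minimal sketch of the parameters such a credential uses (the credential name and repo slug here are illustrative, not taken from the setup script):

```json
{
  "name": "github-main",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:your-org/your-repo:ref:refs/heads/main",
  "audiences": ["api://AzureADTokenExchange"]
}
```

For pull requests, the subject takes the form `repo:your-org/your-repo:pull_request`. No client secret is stored anywhere: GitHub presents a short-lived OIDC token that Azure exchanges for credentials.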

Data Protection

  • Encrypted State: Terraform state in Azure Storage with encryption
  • No Secrets in Code: All sensitive data in GitHub Secrets
  • Branch Protection: PRs required, no direct pushes to main

Security Scanning

  • Trivy: IaC security scanning in CI/CD pipeline
  • Terraform Validation: Format and validation checks
  • ℹ️ Note: CodeQL is not included; that is appropriate for an IaC-focused project. Add it if application code is introduced.

📋 Prerequisites

  • Azure CLI (az login)
  • GitHub CLI (gh auth login)
  • Terraform (v1.13.5+)
  • kubectl
  • Git

🚀 Quick Start (3 Steps)

1. Bootstrap Backend

./scripts/bootstrap-backend.sh

What it does:

  • Creates Azure Storage account for Terraform state
  • Automatically updates terraform/backend.tf with storage account name
  • No manual configuration needed!

Output:

✅ Backend created successfully!
✅ Updated terraform/backend.tf automatically!

2. Setup Service Principals

./scripts/setup-complete-access.sh

What it does:

  • Creates 2 service principals (full-access + read-only)
  • Assigns Azure roles (Contributor, User Access Administrator, Reader)
  • Configures federated credentials for GitHub Actions
  • Automatically adds 5 GitHub secrets

Action: Add 2 secrets manually:

gh secret set GIT_USERNAME -b "your-github-username"
gh secret set GIT_TOKEN -b "your-github-pat"

3. Deploy

git add .
git commit -m "Initial deployment"
git push origin main

That's it! GitHub Actions will:

  1. Run a Trivy security scan and terraform plan
  2. Deploy AKS cluster (~15 minutes)
  3. Install ArgoCD
  4. Deploy all applications automatically

🏗️ Architecture

Dual-Credential CI/CD

┌─────────────────────────────────────────────────────────────┐
│                     GitHub Actions                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Pull Request (Feature Branch)                             │
│  ├─ Service Principal: aks-gitops-lab-readonly            │
│  ├─ Permissions: Reader, Storage access, AKS read         │
│  ├─ Action: terraform plan only                           │
│  └─ Purpose: Safe testing before merge                    │
│                                                             │
│  Main Branch (After Merge)                                 │
│  ├─ Service Principal: aks-gitops-lab-github              │
│  ├─ Permissions: Contributor, User Access Admin           │
│  ├─ Action: terraform apply                               │
│  └─ Purpose: Deploy infrastructure                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

GitOps Flow

Developer → PR → Plan (read-only) → Review → Merge → Apply (full-access) → ArgoCD syncs apps
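
The dual-credential flow above could be wired up roughly like this (an abbreviated, illustrative sketch, not the repository's terraform.yml verbatim; the secret names are assumptions, and checkout/Terraform setup steps are omitted):

```yaml
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC token exchange
      contents: read
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID_READONLY }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: terraform plan          # read-only credential: plan only

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: terraform apply -auto-approve   # full-access credential
```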

Application Deployment

┌──────────────────────────────────────────────────────────────┐
│                        ArgoCD                                │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  core-apps (App of Apps)                                    │
│  ├─ Monitors: argocd-apps/ directory                       │
│  ├─ Auto-sync: Every 30 seconds                            │
│  └─ Auto-prune: Removes deleted apps                       │
│                                                              │
│  Applications                                                │
│  ├─ nginx (with KEDA autoscaling)                          │
│  ├─ keda (autoscaling controller)                          │
│  ├─ kube-prometheus-stack (monitoring)                     │
│  ├─ loki (log aggregation)                                 │
│  └─ promtail (log collection)                              │
│                                                              │
└──────────────────────────────────────────────────────────────┘
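
The core-apps Application in the diagram above is itself just an ArgoCD Application pointing at the argocd-apps/ directory. A sketch (the repo URL is a placeholder; the 30-second sync interval is a server-side polling setting, not a field on the Application):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: core-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/your-repo.git
    targetRevision: main
    path: argocd-apps        # every manifest here becomes a child Application
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true            # auto-prune: removes deleted apps
      selfHeal: true
```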

📁 Project Structure

.
├── .github/workflows/
│   ├── terraform.yml      # Main CI/CD pipeline
│   └── destroy.yml        # Infrastructure cleanup
├── apps/                  # Helm charts for applications
│   ├── nginx/
│   │   ├── Chart.yaml
│   │   ├── values.yaml
│   │   └── templates/
│   │       ├── deployment.yaml
│   │       ├── service.yaml
│   │       ├── namespace.yaml
│   │       └── scaledobject.yaml  # KEDA autoscaling
│   ├── keda/
│   ├── kube-prometheus-stack/
│   ├── loki/
│   └── promtail/
├── argocd-apps/          # ArgoCD application definitions
│   ├── nginx.yaml
│   ├── keda.yaml
│   ├── kube-prometheus-stack.yaml
│   ├── loki.yaml
│   └── promtail.yaml
├── terraform/            # Terraform infrastructure
│   ├── modules/          # Terraform modules
│   │   ├── aks/         # AKS cluster configuration
│   │   ├── argocd/      # ArgoCD Helm deployment
│   │   ├── resource-group/  # Azure resource group
│   │   └── vnet/        # Virtual network
│   ├── backend.tf       # Terraform backend configuration
│   ├── main.tf          # Main Terraform configuration
│   ├── variables.tf     # Variable definitions
│   ├── outputs.tf       # Output definitions
│   └── provider.tf      # Provider configuration
├── scripts/              # Automation scripts
│   ├── bootstrap-backend.sh
│   ├── setup-complete-access.sh
│   └── cleanup-all.sh
└── README.md

🔧 Adding Applications

1. Create Helm Chart

mkdir -p apps/myapp/templates

Create apps/myapp/Chart.yaml:

apiVersion: v2
name: myapp
version: 1.0.0

Create apps/myapp/values.yaml:

replicaCount: 2
image:
  repository: myapp
  tag: "latest"
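
To make the chart deploy something, add a template that consumes those values. A minimal sketch of apps/myapp/templates/deployment.yaml (the container port and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Chart.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Chart.Name }}
  template:
    metadata:
      labels:
        app: {{ .Chart.Name }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 80
```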

2. Create ArgoCD Application

Create argocd-apps/myapp.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/your-repo.git
    targetRevision: main
    path: apps/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true

3. Deploy

git add apps/ argocd-apps/
git commit -m "Add myapp"
git push

ArgoCD will automatically deploy your app in ~30 seconds!

🎮 Accessing Services

AKS Cluster

# Get credentials
az aks get-credentials --resource-group aks-gitops-lab --name aks-gitops-lab-aks --admin

# Check cluster
kubectl get nodes
kubectl get pods --all-namespaces

ArgoCD UI

# Port forward
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get password
kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath="{.data.password}" | base64 -d

# Open browser
open https://localhost:8080
# Username: admin
# Password: (from above command)

Grafana

# Port forward
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80

# Get password
kubectl get secret kube-prometheus-stack-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d

# Open browser
open http://localhost:3000
# Username: admin
# Password: (from above command)

Prometheus

kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
open http://localhost:9090

🧹 Cleanup

Complete Cleanup

./scripts/cleanup-all.sh

This removes:

  • ✅ Service principals and role assignments
  • ✅ GitHub secrets
  • ✅ Backend storage account
  • ✅ All resource groups
  • ✅ Local Terraform state files

Partial Cleanup (Keep Backend)

# Destroy infrastructure only
gh workflow run destroy.yml -f confirm=destroy

🐛 Troubleshooting

Issue: PR workflow fails with permission error

Solution: The read-only service principal needs the AKS Cluster Admin role. Terraform assigns this automatically (terraform/modules/aks/main.tf).

Issue: ArgoCD not syncing apps

Possible causes:

  1. GitHub token expired
  2. Repository URL incorrect
  3. Branch name mismatch

Solution:

# Check ArgoCD repo secret
kubectl get secret argocd-repo -n argocd -o yaml

# Update if needed
kubectl delete secret argocd-repo -n argocd
# Re-run terraform apply to recreate

Issue: Pods pending due to insufficient resources

Solution: Scale up nodes

# Edit terraform/modules/aks/variables.tf
node_count = 3  # Increase from 2

# Commit and push
git add terraform/modules/aks/variables.tf
git commit -m "Scale to 3 nodes"
git push

Issue: KEDA ScaledObject shows OutOfSync

Solution: This is cosmetic when ServerSideApply is combined with admission webhooks that mutate the resource; the application still works. Remove ServerSideApply from the sync options if the status bothers you:

syncOptions:
  - CreateNamespace=true
  # Remove: - ServerSideApply=true

📊 Monitoring & Observability

Metrics (Prometheus + Grafana)

  • Node metrics: CPU, memory, disk, network
  • Pod metrics: Resource usage per pod
  • Cluster metrics: Overall cluster health
  • Custom metrics: Application-specific metrics
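
A couple of example queries against these metrics (the metric names come from node-exporter and cAdvisor, both deployed by kube-prometheus-stack; the nginx namespace is illustrative):

```promql
# Per-node CPU usage in cores
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# Memory working set per pod in the nginx namespace
sum by (pod) (container_memory_working_set_bytes{namespace="nginx"})
```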

Logs (Loki + Promtail)

  • Centralized logging: All pod logs in one place
  • Query language: LogQL for powerful log queries
  • Retention: Configurable log retention policies
  • Integration: Grafana dashboards for log visualization
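
Example LogQL queries to run from Grafana's Explore view (label names depend on the Promtail scrape config; namespace and pod labels are the usual defaults):

```logql
# All logs from the nginx namespace
{namespace="nginx"}

# Per-namespace rate of lines containing "error" over 5 minutes
sum by (namespace) (rate({namespace=~".+"} |= "error" [5m]))
```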

Autoscaling (KEDA)

  • CPU-based: Scale on CPU utilization
  • Memory-based: Scale on memory usage
  • Custom metrics: Scale on any Prometheus metric
  • Event-driven: Scale on queue depth, HTTP requests, etc.
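
A CPU-triggered ScaledObject in the spirit of the nginx chart's scaledobject.yaml might look like this (replica bounds and threshold are illustrative, not copied from the chart; the cpu scaler requires CPU requests on the target pods):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nginx
  namespace: nginx
spec:
  scaleTargetRef:
    name: nginx              # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"          # target average CPU utilization (%)
```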

💰 Cost Optimization

Current Setup (2 nodes)

  • AKS: 2 x Standard_B2s nodes (~$60/month)
  • Storage: Standard_LRS (~$0.10/month)
  • Load Balancer: Standard (~$20/month)
  • Total: ~$80-90/month

Cost Saving Tips

  1. Use spot instances for non-production workloads
  2. Scale down when not in use
  3. Use smaller node sizes for dev/test
  4. Enable the cluster autoscaler to scale user node pools to zero when idle
  5. Destroy infrastructure when not needed

# Destroy when not in use
gh workflow run destroy.yml -f confirm=destroy

# Redeploy when needed
git commit --allow-empty -m "Redeploy" && git push

🔒 Security Best Practices

Implemented

  • ✅ No credentials in code or version control
  • ✅ Federated authentication (OIDC)
  • ✅ Separate read/write service principals
  • ✅ Encrypted Terraform state
  • ✅ Azure RBAC on AKS cluster
  • ✅ Network policy support (Azure CNI)
  • ✅ Secrets stored in GitHub Secrets

Recommended for Production

Security Enhancements:

  • 🔲 External Secrets Operator - Sync secrets from Azure Key Vault
  • 🔲 Private Cluster Endpoint - Restrict API server access
  • 🔲 Network Policies - Control pod-to-pod traffic
  • 🔲 Pod Security Standards - Enforce security policies
  • 🔲 Azure Policy - Compliance and governance
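
As a starting point for the Network Policies item, a default-deny ingress policy per namespace is the usual first step (the namespace is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: nginx
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress                # no ingress rules listed => deny all inbound
```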

Infrastructure Improvements:

  • 🔲 Separate Node Pools - System vs user workloads
  • 🔲 Production VM Sizes - Standard_D2s_v3 instead of B2s
  • 🔲 Resource Limits - CPU/memory limits on all pods
  • 🔲 Velero Backups - Disaster recovery
  • 🔲 Multi-region - High availability

Operational:

  • 🔲 Cost Alerts - Azure Cost Management budgets
  • 🔲 Terraform Workspaces - Dev/staging/prod environments
  • 🔲 Runbooks - Incident response procedures
  • 🔲 SLO/SLA Monitoring - Service level objectives

📚 What's Automated

  • ✅ Backend storage creation
  • ✅ Backend configuration auto-update
  • ✅ Service principal creation and configuration
  • ✅ Role assignments (subscription and cluster level)
  • ✅ GitHub secrets (5 of 7 automated)
  • ✅ AKS cluster deployment
  • ✅ Cluster Autoscaler configuration
  • ✅ ArgoCD installation and configuration
  • ✅ Application deployment via GitOps
  • ✅ KEDA autoscaling setup
  • ✅ Monitoring stack deployment

✋ What's Manual

  • ❌ Add GIT_USERNAME secret (one-time)
  • ❌ Add GIT_TOKEN secret (one-time)

🎓 Learning Resources

📝 License

MIT

🤝 Contributing

This is a learning lab project. Feel free to fork and adapt for your needs!

⚠️ Important Notes

Current Setup

  • Purpose: Learning and portfolio demonstration
  • Environment: Lab/Development
  • VM Size: Standard_B2s (burstable, cost-optimized)
  • Security: Basic (OIDC, RBAC, encrypted state)

For Production Use

This setup provides a solid foundation but requires these enhancements:

Must Have:

  • Private cluster endpoint
  • Network policies
  • Resource limits on all pods
  • External Secrets Operator with Key Vault
  • Velero backups
  • Production VM sizes (Standard_D2s_v3+)

Should Have:

  • Separate system/user node pools
  • Cost alerts and budgets
  • Multi-environment setup (dev/staging/prod)
  • Comprehensive monitoring and alerting
  • Disaster recovery plan

Cost Considerations:

  • Current setup: ~$80-90/month
  • Production setup: ~$200-300/month (with redundancy)
  • Remember to destroy resources when not in use
