A complete end-to-end demonstration of deploying and managing Databricks Lakeflow jobs using StackQL-Deploy for infrastructure provisioning and Databricks Asset Bundles (DABs) for data pipeline management.
This repository demonstrates modern DataOps practices by combining:
- Infrastructure as Code: Using StackQL and stackql-deploy for SQL-based infrastructure management
- Data Pipeline Management: Using Databricks Asset Bundles for job orchestration and deployment
- GitOps CI/CD: Automated infrastructure provisioning and data pipeline deployment via GitHub Actions
Provisions Databricks Infrastructure using StackQL-Deploy:
- AWS IAM roles and cross-account permissions
- S3 buckets for workspace storage
- Databricks workspace with Unity Catalog
- Storage credentials and external locations
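The resource definitions live under infrastructure/ and are driven by stackql_manifest.yml. As a rough, hypothetical sketch of how such a manifest enumerates resources in dependency order (the names and globals below are illustrative assumptions, not the repository's actual configuration):

```yaml
# Hypothetical manifest sketch; see infrastructure/stackql_manifest.yml for the real one.
version: 1
name: databricks-lakeflow-infra
providers:
  - aws
  - databricks_account
globals:
  - name: region
    value: us-east-1                 # illustrative default
resources:
  - name: iam_cross_account_role     # AWS IAM role Databricks assumes
  - name: workspace_root_bucket      # S3 bucket for workspace storage
  - name: databricks_workspace       # workspace with Unity Catalog
  - name: storage_credential         # Unity Catalog storage credential
  - name: external_location          # Unity Catalog external location
```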
Deploys a Retail Data Pipeline using Databricks Asset Bundles:
- Multi-stage data processing (Bronze → Silver → Gold)
- Parallel task execution with dependency management
- State-based conditional processing
- For-each loops for parallel state processing
Automates Everything with GitHub Actions:
- Infrastructure provisioning on push to main
- DAB validation and deployment
- Multi-environment support (dev/prod)
graph TB
subgraph "GitHub Repository"
A[infrastructure/] --> B[StackQL-Deploy]
C[retail-job/] --> D[Databricks Asset Bundle]
end
subgraph "AWS Cloud"
B --> E[IAM Roles]
B --> F[S3 Buckets]
B --> G[VPC/Security Groups]
end
subgraph "Databricks Platform"
B --> H[Workspace]
D --> I[Lakeflow Jobs]
H --> I
I --> J[Bronze Tables]
I --> K[Silver Tables]
I --> L[Gold Tables]
end
subgraph "CI/CD Pipeline"
M[GitHub Actions] --> B
M --> D
M --> N[Multi-Environment Deployment]
end
databricks-lakeflow-jobs-example/
├── infrastructure/                  # StackQL infrastructure templates
│   ├── README.md                    # Infrastructure setup guide
│   ├── stackql_manifest.yml         # StackQL deployment configuration
│   └── resources/                   # Cloud resource templates
│       ├── aws/                     # AWS resources (IAM, S3)
│       ├── databricks_account/      # Account-level Databricks resources
│       └── databricks_workspace/    # Workspace configurations
├── retail-job/                      # Databricks Asset Bundle
│   ├── databricks.yml               # DAB configuration
│   └── Task Files/                  # Data pipeline notebooks
│       ├── 01_data_ingestion/       # Bronze layer data ingestion
│       ├── 02_data_loading/         # Customer data loading
│       ├── 03_data_processing/      # Silver layer transformations
│       ├── 04_data_transformation/  # Gold layer clean data
│       └── 05_state_processing/     # State-specific processing
└── .github/workflows/               # CI/CD automation
    └── databricks-dab.yml           # GitHub Actions workflow
- AWS account with administrative permissions
- Databricks account (see infrastructure setup guide)
- Python 3.8+ and Git
git clone https://github.com/stackql/databricks-lakeflow-jobs-example.git
cd databricks-lakeflow-jobs-example
Follow the comprehensive Infrastructure Setup Guide to:
- Configure AWS and Databricks accounts
- Set up service principals and permissions
- Deploy infrastructure using StackQL-Deploy
Once infrastructure is provisioned:
cd retail-job
# Validate the bundle
databricks bundle validate --target dev
# Deploy the data pipeline
databricks bundle deploy --target dev
# Run the complete pipeline
databricks bundle run retail_data_processing_job --target dev
The retail data pipeline demonstrates a complete medallion architecture (Bronze → Silver → Gold):
Bronze Layer - Data Ingestion
- Orders Ingestion: Loads raw sales orders data
- Sales Ingestion: Loads raw sales transaction data
- Tables: `orders_bronze`, `sales_bronze`
Silver Layer - Data Processing
- Customer Loading: Loads customer master data
- Data Joining: Joins customers with sales and orders
- Duplicate Removal: Conditional deduplication based on data quality
- Tables: `customers_bronze`, `customer_sales_silver`, `customer_orders_silver`
Gold Layer - Data Transformation
- Clean & Transform: Business-ready, curated datasets
- State Processing: Parallel processing for each US state using for-each loops
- Tables: `retail_gold`, `state_summary_gold`
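In DAB terms, this medallion flow maps onto a single job whose tasks declare their upstream dependencies with depends_on. The snippet below is a simplified sketch using assumed task keys and notebook paths; the authoritative definition is retail-job/databricks.yml.

```yaml
# Simplified sketch of the task graph (task keys and paths are assumptions)
resources:
  jobs:
    retail_data_processing_job:
      name: retail_data_processing_job
      tasks:
        - task_key: ingest_orders                 # Bronze: raw orders
          notebook_task:
            notebook_path: ./01_data_ingestion/ingest_orders
        - task_key: ingest_sales                  # Bronze: raw sales, runs in parallel
          notebook_task:
            notebook_path: ./01_data_ingestion/ingest_sales
        - task_key: load_customers                # Silver: customer master data
          notebook_task:
            notebook_path: ./02_data_loading/load_customers
        - task_key: join_customer_data            # Silver: join customers with sales and orders
          depends_on:
            - task_key: ingest_orders
            - task_key: ingest_sales
            - task_key: load_customers
          notebook_task:
            notebook_path: ./03_data_processing/join_customer_data
        - task_key: clean_and_transform           # Gold: business-ready tables
          depends_on:
            - task_key: join_customer_data
          notebook_task:
            notebook_path: ./04_data_transformation/clean_and_transform
```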
- Parallel Execution: Multiple tasks run concurrently where dependencies allow
- Conditional Tasks: Deduplication only runs if duplicates are detected
- For-Each Loops: State processing runs in parallel for multiple states
- Notifications: Email alerts on job success/failure
- Timeouts & Limits: Job execution controls and concurrent run limits
- Parameters: Dynamic state-based processing with base parameters
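The conditional deduplication and per-state fan-out from the list above correspond to the Jobs condition_task and for_each_task types. A minimal sketch follows, assuming illustrative task keys, task values, and state list; the bundle's own task definitions may differ.

```yaml
# Minimal sketch of a condition task plus a for-each fan-out (names are assumptions)
tasks:
  - task_key: duplicates_found               # condition: did upstream detect duplicates?
    condition_task:
      op: GREATER_THAN
      left: "{{tasks.data_quality_check.values.duplicate_count}}"
      right: "0"
  - task_key: remove_duplicates              # only runs when the condition is true
    depends_on:
      - task_key: duplicates_found
        outcome: "true"
    notebook_task:
      notebook_path: ./03_data_processing/remove_duplicates
  - task_key: process_states                 # fan out one iteration per state
    for_each_task:
      inputs: '["CA", "NY", "TX", "WA"]'
      concurrency: 4
      task:
        task_key: process_state_iteration
        notebook_task:
          notebook_path: ./05_state_processing/process_state
          base_parameters:
            state: "{{input}}"
```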
The GitHub Actions workflow (.github/workflows/databricks-dab.yml) provides complete automation:
- Pull Requests: Validates changes against dev environment
- Main Branch Push: Deploys to production environment
- Path-Based: Only triggers on infrastructure or job configuration changes
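The trigger section of such a workflow typically combines branch and path filters. The snippet below is a sketch of that pattern, not a copy of databricks-dab.yml:

```yaml
# Sketch of the trigger pattern (branch and path filters are assumptions)
on:
  pull_request:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
```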
Infrastructure Provisioning

    - name: Deploy Infrastructure with StackQL
      uses: stackql/[email protected]
      with:
        command: 'build'
        stack_dir: 'infrastructure'
        stack_env: ${{ env.ENVIRONMENT }}
Workspace Configuration
- Extracts workspace details from StackQL deployment
- Configures Databricks CLI with workspace credentials
- Sets up environment-specific configurations
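One common way to wire this up (a sketch only; the job-output and secret names are assumptions) is to surface the workspace URL from the provisioning job and let the Databricks CLI read DATABRICKS_HOST and DATABRICKS_TOKEN from the job environment:

```yaml
# Sketch: expose workspace credentials to later DAB steps (names are assumptions)
env:
  DATABRICKS_HOST: ${{ needs.provision.outputs.workspace_url }}   # hypothetical job output
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}               # hypothetical repo secret
steps:
  - name: Verify workspace connectivity
    run: databricks current-user me   # fails fast if host/token are wrong
```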
DAB Validation & Deployment

    - name: Validate Databricks Asset Bundle
      run: databricks bundle validate --target ${{ env.ENVIRONMENT }}
    - name: Deploy Databricks Jobs
      run: databricks bundle deploy --target ${{ env.ENVIRONMENT }}
Pipeline Testing
- Runs the complete data pipeline
- Validates job execution and data quality
- Reports results and generates summaries
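A smoke-test step along these lines (a sketch; the actual workflow step may differ) runs the deployed job end to end against the target environment:

```yaml
# Sketch: run the deployed job as an end-to-end test
- name: Run retail data pipeline
  run: databricks bundle run retail_data_processing_job --target ${{ env.ENVIRONMENT }}
```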
The workflow supports multiple environments with automatic detection:
- Dev Environment: For pull requests and feature development
- Production Environment: For main branch deployments
Environment-specific configurations are managed through:
- StackQL environment variables and stack environments
- Databricks Asset Bundle targets (`dev`, `prd`)
- GitHub repository secrets for credentials
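Within the bundle, these environments appear as targets in databricks.yml. A minimal sketch, assuming placeholder workspace hosts:

```yaml
# Minimal sketch of dev/prd targets (hosts are placeholders)
bundle:
  name: retail-job

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com    # placeholder
  prd:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com   # placeholder
```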
- SQL-based Infrastructure: Manage cloud resources using familiar SQL syntax
- State-free Operations: No state files - query infrastructure directly from APIs
- Multi-cloud Support: Consistent interface across AWS, Azure, GCP, and SaaS providers
- GitOps Ready: Native CI/CD integration with GitHub Actions
- Environment Consistency: Deploy the same code across dev/staging/prod
- Version Control: Infrastructure and code in sync with Git workflows
- Advanced Orchestration: Complex dependencies, conditions, and parallel execution
- Resource Management: Automated cluster provisioning and job scheduling
- Infrastructure as Code: Everything versioned and reproducible
- GitOps Workflows: Pull request-based infrastructure changes
- Environment Parity: Identical configurations across environments
- Automated Testing: Pipeline validation and data quality checks
- Infrastructure Setup Guide: Complete StackQL-Deploy setup and usage
- StackQL Documentation: Learn SQL-based infrastructure management
- Databricks Asset Bundles: DAB concepts and advanced patterns
- stackql-deploy GitHub Action: CI/CD integration guide
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Cost Management: This project provisions billable cloud resources. Always run teardown commands after testing.
- Cleanup Required: Cancel Databricks subscription after completing the exercise to avoid ongoing charges.
- Security: Never commit credentials to version control. Use environment variables and CI/CD secrets.
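For cleanup, the teardown pass can be expressed as workflow steps. The sketch below is an assumption to verify against the stackql-deploy and Databricks CLI docs, not the repository's actual cleanup job:

```yaml
# Sketch of a cleanup job (verify teardown support against the respective docs)
- name: Destroy Databricks Asset Bundle resources
  run: databricks bundle destroy --target ${{ env.ENVIRONMENT }} --auto-approve
- name: Teardown infrastructure with StackQL
  uses: stackql/[email protected]
  with:
    command: 'teardown'            # assumption: mirrors the stackql-deploy CLI's teardown command
    stack_dir: 'infrastructure'
    stack_env: ${{ env.ENVIRONMENT }}
```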
Demonstrating the future of DataOps with SQL-based infrastructure management and modern data pipeline orchestration.