
Databricks Lakeflow Jobs with StackQL-Deploy

A complete end-to-end demonstration of deploying and managing Databricks Lakeflow jobs using StackQL-Deploy for infrastructure provisioning and Databricks Asset Bundles (DABs) for data pipeline management.

Databricks Asset Bundle CI/CD

🎯 Project Overview

This repository demonstrates modern DataOps practices by combining:

  • 🏗️ Infrastructure as Code: Using StackQL and stackql-deploy for SQL-based infrastructure management
  • 📊 Data Pipeline Management: Using Databricks Asset Bundles for job orchestration and deployment
  • 🚀 GitOps CI/CD: Automated infrastructure provisioning and data pipeline deployment via GitHub Actions

What This Project Does

  1. Provisions Databricks Infrastructure using StackQL-Deploy:

    • AWS IAM roles and cross-account permissions
    • S3 buckets for workspace storage
    • Databricks workspace with Unity Catalog
    • Storage credentials and external locations
  2. Deploys a Retail Data Pipeline using Databricks Asset Bundles:

    • Multi-stage data processing (Bronze → Silver → Gold)
    • Parallel task execution with dependency management
    • State-based conditional processing
    • For-each loops for parallel state processing
  3. Automates Everything with GitHub Actions:

    • Infrastructure provisioning on push to main
    • DAB validation and deployment
    • Multi-environment support (dev/prod)

πŸ›οΈ Architecture

graph TB
    subgraph "GitHub Repository"
        A[infrastructure/] --> B[StackQL-Deploy]
        C[retail-job/] --> D[Databricks Asset Bundle]
    end
    
    subgraph "AWS Cloud"
        B --> E[IAM Roles]
        B --> F[S3 Buckets]
        B --> G[VPC/Security Groups]
    end
    
    subgraph "Databricks Platform"
        B --> H[Workspace]
        D --> I[Lakeflow Jobs]
        H --> I
        I --> J[Bronze Tables]
        I --> K[Silver Tables]
        I --> L[Gold Tables]
    end
    
    subgraph "CI/CD Pipeline"
        M[GitHub Actions] --> B
        M --> D
        M --> N[Multi-Environment Deployment]
    end

πŸ“ Repository Structure

databricks-lakeflow-jobs-example/
├── infrastructure/                   # StackQL infrastructure templates
│   ├── README.md                     # Infrastructure setup guide
│   ├── stackql_manifest.yml          # StackQL deployment configuration
│   └── resources/                    # Cloud resource templates
│       ├── aws/                      # AWS resources (IAM, S3)
│       ├── databricks_account/       # Account-level Databricks resources
│       └── databricks_workspace/     # Workspace configurations
├── retail-job/                       # Databricks Asset Bundle
│   ├── databricks.yml                # DAB configuration
│   └── Task Files/                   # Data pipeline notebooks
│       ├── 01_data_ingestion/        # Bronze layer data ingestion
│       ├── 02_data_loading/          # Customer data loading
│       ├── 03_data_processing/       # Silver layer transformations
│       ├── 04_data_transformation/   # Gold layer clean data
│       └── 05_state_processing/      # State-specific processing
└── .github/workflows/                # CI/CD automation
    └── databricks-dab.yml            # GitHub Actions workflow
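
The stackql_manifest.yml file drives each stackql-deploy run. The snippet below is a hypothetical, abbreviated sketch of what such a manifest can look like; the resource names, properties, and values are illustrative assumptions, not copied from this repository.

version: 1
name: databricks-lakeflow-infra           # illustrative stack name
providers:
  - aws
  - databricks_account
  - databricks_workspace
globals:
  - name: region
    value: "{{ AWS_REGION }}"             # supplied via environment or -e flags
resources:
  - name: aws/s3_workspace_bucket         # illustrative; resolves to a template under resources/aws/
    props:
      - name: bucket_name
        value: "lakeflow-{{ stack_env }}-root"
  - name: databricks_account/workspace    # illustrative account-level resource
    props:
      - name: workspace_name
        value: "lakeflow-{{ stack_env }}"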

🚀 Quick Start

Prerequisites

1. Clone Repository

git clone https://github.com/stackql/databricks-lakeflow-jobs-example.git
cd databricks-lakeflow-jobs-example

2. Set Up Infrastructure

Follow the comprehensive Infrastructure Setup Guide to:

  • Configure AWS and Databricks accounts
  • Set up service principals and permissions
  • Deploy infrastructure using StackQL-Deploy
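
As a sketch of what that last step looks like on the command line (the stack directory and environment names follow this repository's layout; the variables passed with -e are illustrative assumptions):

# install the deployment tool (requires Python 3)
pip install stackql-deploy

# provision the dev stack defined in infrastructure/stackql_manifest.yml
stackql-deploy build infrastructure dev \
  -e AWS_REGION=us-east-1 \
  -e DATABRICKS_ACCOUNT_ID=<your-account-id>

# verify the stack converged
stackql-deploy test infrastructure dev

# tear everything down when finished to avoid ongoing charges
stackql-deploy teardown infrastructure dev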

3. Deploy Data Pipeline

Once infrastructure is provisioned:

cd retail-job

# Validate the bundle
databricks bundle validate --target dev

# Deploy the data pipeline
databricks bundle deploy --target dev

# Run the complete pipeline
databricks bundle run retail_data_processing_job --target dev

📊 Data Pipeline Deep Dive

The retail data pipeline demonstrates a complete medallion architecture (Bronze → Silver → Gold):

Pipeline Stages

  1. 🥉 Bronze Layer - Data Ingestion

    • Orders Ingestion: Loads raw sales orders data
    • Sales Ingestion: Loads raw sales transaction data
    • Tables: orders_bronze, sales_bronze
  2. 🥈 Silver Layer - Data Processing

    • Customer Loading: Loads customer master data
    • Data Joining: Joins customers with sales and orders
    • Duplicate Removal: Conditional deduplication based on data quality
    • Tables: customers_bronze, customer_sales_silver, customer_orders_silver
  3. 🥇 Gold Layer - Data Transformation

    • Clean & Transform: Business-ready, curated datasets
    • State Processing: Parallel processing for each US state using for-each loops
    • Tables: retail_gold, state_summary_gold
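
To make the medallion flow concrete, here is a minimal Spark SQL sketch of the silver and gold steps; the table names follow the list above, while the column names and aggregation logic are assumptions rather than excerpts from the repository's notebooks.

-- Bronze -> Silver: join customer master data to raw sales (columns assumed)
CREATE OR REPLACE TABLE customer_sales_silver AS
SELECT
  c.customer_id,
  c.customer_name,
  c.state,
  s.sale_id,
  s.amount,
  s.sale_date
FROM customers_bronze AS c
JOIN sales_bronze AS s
  ON c.customer_id = s.customer_id;

-- Silver -> Gold: per-state summary (aggregation assumed)
CREATE OR REPLACE TABLE state_summary_gold AS
SELECT state, COUNT(*) AS sales_count, SUM(amount) AS total_sales
FROM customer_sales_silver
GROUP BY state;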

Advanced DAB Features Demonstrated

  • 🔄 Parallel Execution: Multiple tasks run concurrently where dependencies allow
  • 🎯 Conditional Tasks: Deduplication only runs if duplicates are detected
  • 🔁 For-Each Loops: State processing runs in parallel for multiple states
  • 📧 Notifications: Email alerts on job success/failure
  • ⏱️ Timeouts & Limits: Job execution controls and concurrent run limits
  • 🎛️ Parameters: Dynamic state-based processing with base parameters
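
The snippet below is a hypothetical, cut-down illustration of how these features are expressed in a databricks.yml job definition; task keys, notebook paths, the email address, and the state list are assumptions, not the repository's actual configuration.

resources:
  jobs:
    retail_data_processing_job:
      name: retail-data-processing
      timeout_seconds: 3600                  # job-level execution control
      max_concurrent_runs: 1
      email_notifications:
        on_failure: ["[email protected]"]    # placeholder address
      tasks:
        - task_key: check_duplicates
          notebook_task:
            notebook_path: ./03_data_processing/check_duplicates
        # conditional task: only passes when the upstream check flags duplicates
        - task_key: duplicates_found
          depends_on: [{task_key: check_duplicates}]
          condition_task:
            op: EQUAL_TO
            left: "{{tasks.check_duplicates.values.has_duplicates}}"
            right: "true"
        # for-each task: fans out one iteration per state, run in parallel
        - task_key: process_states
          depends_on: [{task_key: duplicates_found, outcome: "true"}]
          for_each_task:
            inputs: '["CA", "NY", "TX", "WA"]'
            concurrency: 4
            task:
              task_key: process_state_iteration
              notebook_task:
                notebook_path: ./05_state_processing/process_state
                base_parameters:
                  state: "{{input}}"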

🔄 CI/CD Pipeline

The GitHub Actions workflow (.github/workflows/databricks-dab.yml) provides complete automation:

Workflow Triggers

  • Pull Requests: Validates changes against dev environment
  • Main Branch Push: Deploys to production environment
  • Path-Based: Only triggers on infrastructure or job configuration changes
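
A trigger block implementing that behaviour looks roughly like the following; the branch and path filters are assumptions based on the repository layout, so check .github/workflows/databricks-dab.yml for the actual definition.

on:
  pull_request:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'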

Deployment Steps

  1. πŸ—οΈ Infrastructure Provisioning

    - name: Deploy Infrastructure with StackQL
      uses: stackql/[email protected]
      with:
        command: 'build'
        stack_dir: 'infrastructure'
        stack_env: ${{ env.ENVIRONMENT }}
  2. 📊 Workspace Configuration

    • Extracts workspace details from StackQL deployment
    • Configures Databricks CLI with workspace credentials
    • Sets up environment-specific configurations
  3. ✅ DAB Validation & Deployment

    - name: Validate Databricks Asset Bundle
      run: databricks bundle validate --target ${{ env.ENVIRONMENT }}
    
    - name: Deploy Databricks Jobs
      run: databricks bundle deploy --target ${{ env.ENVIRONMENT }}
  4. 🧪 Pipeline Testing

    • Runs the complete data pipeline
    • Validates job execution and data quality
    • Reports results and generates summaries
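
For steps 2 and 4, a minimal sketch of how a workflow can hand the provisioned workspace to the Databricks CLI and then exercise the pipeline is shown below; the host URL, secret, and variable names are placeholders rather than the workflow's actual identifiers.

# point the Databricks CLI at the workspace created by stackql-deploy
export DATABRICKS_HOST="https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"   # placeholder URL
export DATABRICKS_TOKEN="$WORKSPACE_TOKEN"                                # from CI/CD secrets

# run the full pipeline; a failed job run fails the workflow step
databricks bundle run retail_data_processing_job --target "$ENVIRONMENT"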

Environment Management

The workflow supports multiple environments with automatic detection:

  • Dev Environment: For pull requests and feature development
  • Production Environment: For main branch deployments

Environment-specific configurations are managed through:

  • StackQL environment variables and stack environments
  • Databricks Asset Bundle targets (dev, prd)
  • GitHub repository secrets for credentials
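
In databricks.yml, those targets are declared roughly as follows; the workspace hosts shown are placeholders, with the real values coming from the provisioned infrastructure.

bundle:
  name: retail-job

targets:
  dev:
    mode: development     # dev resources are prefixed per user and easy to tear down
    default: true
    workspace:
      host: https://<dev-workspace>.cloud.databricks.com
  prd:
    mode: production
    workspace:
      host: https://<prod-workspace>.cloud.databricks.com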

πŸ› οΈ Key Technologies

StackQL & stackql-deploy

  • SQL-based Infrastructure: Manage cloud resources using familiar SQL syntax
  • State-free Operations: No state files; infrastructure is queried directly from provider APIs
  • Multi-cloud Support: Consistent interface across AWS, Azure, GCP, and SaaS providers
  • GitOps Ready: Native CI/CD integration with GitHub Actions
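
For a flavour of the SQL interface, the queries below show the kind of statements stackql-deploy issues; the provider, service, and resource names follow the public StackQL provider registries, but the columns and filter values here are illustrative assumptions.

-- list S3 buckets in a region via the AWS provider
SELECT *
FROM aws.s3.buckets
WHERE region = 'us-east-1';

-- list workspaces for a Databricks account (account_id is a placeholder)
SELECT workspace_name, workspace_status
FROM databricks_account.provisioning.workspaces
WHERE account_id = '<databricks-account-id>';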

Databricks Asset Bundles

  • Environment Consistency: Deploy the same code across dev/staging/prod
  • Version Control: Infrastructure and code in sync with Git workflows
  • Advanced Orchestration: Complex dependencies, conditions, and parallel execution
  • Resource Management: Automated cluster provisioning and job scheduling

Modern DataOps Practices

  • Infrastructure as Code: Everything versioned and reproducible
  • GitOps Workflows: Pull request-based infrastructure changes
  • Environment Parity: Identical configurations across environments
  • Automated Testing: Pipeline validation and data quality checks

📚 Learn More

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Important Notes

  • Cost Management: This project provisions billable cloud resources. Always run teardown commands after testing.
  • Cleanup Required: Cancel Databricks subscription after completing the exercise to avoid ongoing charges.
  • Security: Never commit credentials to version control. Use environment variables and CI/CD secrets.

Demonstrating the future of DataOps with SQL-based infrastructure management and modern data pipeline orchestration.
