A curated list of notable ETL (extract, transform, load) frameworks, libraries and software.
The premise of this list: you don't need fancy, specialized ETL frameworks. Well-structured code using mainstream, well-supported libraries gets you surprisingly far and is easier to test, review, and version control than tools that make those things difficult. Tools here are selected for real-world adoption and staying power, not novelty.
Open source tools are strongly preferred. Proprietary or restrictively licensed tools are only included when they're mainstream enough to be genuinely hard to ignore. See CONTRIBUTING.md for full inclusion criteria.
- Workflow Management/Engines
- Job Scheduling
- Java
- Python
- Ruby
- Go
- Cloud Services
- Big Data (Hadoop Stack)
- ETL Tools (GUI)
- Further Reading
## Workflow Management/Engines

- Airflow - "Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed."
- Argo - "an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes."
- Dagster - "Dagster is a data orchestrator for machine learning, analytics, and ETL. It lets you define pipelines in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of pipelines and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke."
- Luigi - "a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
- Prefect - "a workflow orchestration framework for building resilient data pipelines in Python."
- Temporal - "a scalable and reliable runtime for durable function executions called Temporal Workflow Executions."
- Toil - "an open-source pure-Python workflow engine that lets people write better pipelines."
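The engines above all revolve around the same core idea: declare tasks and their dependencies as a DAG, and let a scheduler run them in a valid order. A toy sketch of that idea using only the standard library's `graphlib` (the task names and functions are invented for illustration, not any engine's API):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

results = {}

def extract():
    results["raw"] = [3, 1, 2]

def transform():
    results["clean"] = sorted(results["raw"])

def load():
    results["loaded"] = len(results["clean"])

tasks = {"extract": extract, "transform": transform, "load": load}

# DAG edges: each task maps to the set of tasks it depends on.
deps = {"transform": {"extract"}, "load": {"transform"}}

# static_order() yields a dependency-respecting run order.
order = list(TopologicalSorter(deps).static_order())
for name in order:
    tasks[name]()
```

Real engines add the parts this sketch omits: retries, scheduling, parallel workers, and a UI for monitoring runs.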
## Job Scheduling

- Jenkins - "the leading open-source automation server. Built with Java, it provides over 1000 plugins to support automating virtually anything, so that humans can actually spend their time doing things machines cannot."
## Java

- Apache Camel - "an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data."
- Spring Batch - "A lightweight, comprehensive batch framework designed to enable the development of robust batch applications that are vital for the daily operations of enterprise systems."
## Python

- BeautifulSoup - "a Python library for pulling data out of HTML and XML files."
- Celery - "an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well."
- Dask - "a flexible parallel computing library for analytics."
- dataset - A wrapper around SQLAlchemy that simplifies database operations (including upserting).
- dbt-core - "enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications."
- dlt - "an open-source Python library that loads data from various, often messy data sources into well-structured datasets."
- DuckDB - "an analytical in-process SQL database management system."
- Great Expectations - "a Python library for validating, documenting, and profiling your data to maintain quality and improve communication between teams about data and data pipelines."
- hamilton - "helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does."
- ijson - "Iterative JSON parser with Pythonic interfaces."
- ingestr - "a CLI tool to copy data between any databases with a single command seamlessly."
- Joblib - "a set of tools to provide lightweight pipelining in Python."
- lxml - "the most feature-rich and easy-to-use library for processing XML and HTML in the Python language."
- Meltano - "the declarative code-first data integration engine."
- Pandas - "Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more."
- parse - "Parse strings using a specification based on the Python format() syntax."
- PETL - "a general purpose Python package for extracting, transforming and loading tables of data."
- polars - "Extremely fast Query Engine for DataFrames, written in Rust."
- PyQuery - "A jquery-like library for python."
- Scrapy - "a fast high-level web crawling & scraping framework for Python."
- SQLAlchemy - "the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL."
- tenacity - "a general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything."
- Toolz - "A functional standard library for Python."
- xmltodict - "Python module that makes working with XML feel like you are working with JSON."
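Several of the libraries above are thin conveniences over capabilities the database already has. For example, the upserting that dataset simplifies is a few lines with the standard library's `sqlite3` (the table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def upsert_user(user_id, name):
    # SQLite's ON CONFLICT clause (3.24+) updates the row if the key exists.
    conn.execute(
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (user_id, name),
    )

upsert_user(1, "Ada")
upsert_user(1, "Ada Lovelace")  # second call updates instead of failing
rows = conn.execute("SELECT id, name FROM users").fetchall()
```

A wrapper like dataset earns its keep when you need the same pattern to work across database engines with different upsert syntax.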
## Ruby

- Embulk - "a parallel bulk data loader that helps data transfer between various storages, databases, NoSQL and cloud services."
- Kiba - "lets you define and run high-quality ETL jobs using Ruby."
- nokogiri - "Nokogiri makes it easy and painless to work with XML and HTML from Ruby."
- Sequel - "a simple, flexible, and powerful SQL database access toolkit for Ruby."
## Go

- CloudQuery - "a cloud asset inventory built for platform teams. Sync your cloud infrastructure metadata into your data warehouse, powering insights and automation."
- Pachyderm - "provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking."
- Redpanda Connect - "a declarative data streaming and integration tool with 300+ pre-built connectors, configured via YAML."
## Cloud Services

- Airbyte - "Airbyte is an open-source data integration engine that helps you consolidate your data in your data warehouses, lakes and databases."
- Alteryx - "combines data preparation, data blending, and analytics — predictive, statistical, and spatial — in a visual workflow designer."
- AWS Batch - "enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS."
- AWS Glue - "a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources."
- Cloud Data Fusion - "Fully managed, cloud-native data integration platform."
- Fivetran - "automates data movement from disparate sources into your destination."
- Google Dataflow - "Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines."
- Hevo - "a no-code data movement platform that is usable by your most technical as well as your non-technical and business users."
- Microsoft Azure Data Factory - "A fully managed, serverless data integration service that helps you visually integrate data sources with more than 90 built-in, maintenance-free connectors."
- Stitch - "Stitch is a cloud-first, open source platform for rapidly moving data. A simple, powerful ETL service, Stitch connects to all your data sources – from databases like MySQL and MongoDB, to SaaS applications like Salesforce and Zendesk – and replicates that data to a destination of your choosing."
## Big Data (Hadoop Stack)

- Apache Beam - "a unified programming model for Batch and Streaming data processing."
- Apache Flink - "a framework and distributed processing engine for stateful computations over unbounded and bounded data streams."
- Debezium - "Change data capture for a variety of databases."
- Kafka Connect - "a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka."
- Spark - "a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing."
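The idea behind change data capture tools like Debezium is to emit only rows that changed since the last read. Debezium itself does this by tailing the database's transaction log; a toy pure-Python sketch of the simpler polling approximation, tracking a high-water mark (the table and cursor column are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 9.99, 100), (2, 24.50, 105)],
)

last_seen = 0  # high-water mark; a real pipeline persists this between runs

def poll_changes():
    # Fetch only rows modified since the last poll, then advance the mark.
    global last_seen
    rows = conn.execute(
        "SELECT id, total, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    if rows:
        last_seen = rows[-1][2]
    return rows

first = poll_changes()   # both rows on the first poll
conn.execute("UPDATE orders SET total = 19.99, updated_at = 110 WHERE id = 1")
second = poll_changes()  # only the changed row
```

Log-based CDC avoids the gaps polling can miss (deletes, rapid successive updates), which is why tools in this category read the transaction log instead.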
## ETL Tools (GUI)

Warning: If you're already familiar with a scripting language, GUI ETL tools are not a good replacement for a well-structured application written in a scripting language. These tools lack flexibility and are a good example of the "inner-platform effect". With a large project, you will most likely run into instances where "the tool doesn't do that" and end up implementing something hacky with a script run by the GUI ETL tool. Also, the GUI can conceal complexity, and the files these tools generate are effectively impossible to code review. However, the GUI and out-of-the-box functionality can make some tasks simpler, especially for people not comfortable with writing code.
- Apache NiFi - "a rich, web-based interface for designing, controlling, and monitoring a dataflow."
- CDAP - "Use Cask Data Application Platform to visually build and manage data applications in hybrid and multi-cloud environments."
- Informatica PowerCenter - An ETL tool for extracting data from source systems, transforming it, and loading it into target systems using a visual mapping and workflow designer.
- Microsoft SSIS - "a component of the Microsoft SQL Server database software that can be used to perform a broad range of data migration tasks."
- n8n - "Free and open fair-code licensed node based Workflow Automation Tool. Easily automate tasks across different services."
- Pentaho Data Integration (PDI) - "a graphical ETL tool for designing data integration workflows using a drag-and-drop interface, also known as Kettle."
## Further Reading

- Fundamentals of Data Engineering - Joe Reis & Matt Housley's tool-agnostic overview of the data engineering lifecycle, including the ETL-to-ELT shift (2022).
- The Rise of Data Contracts - Chad Sanderson on formalizing schema and quality guarantees between data producers and consumers.
- ELT 101: The Why and What of ELT - Why cheap cloud warehouse compute flipped the ETL paradigm to ELT.
Contributions welcome! Read the contribution guidelines first.
- awesome-pipeline - A curated list of awesome pipeline toolkits.