This project demonstrates how to build a lightweight disaster recovery and backup validation workflow on AWS using Terraform and AWS-native services.
The goal was not just to create backups, but to prove that backups can be restored, monitored, and validated in a way that reflects real operational thinking.
The solution includes automated RDS snapshot creation, scheduled execution with EventBridge, failure alerting with CloudWatch and SNS, and restore validation by creating a temporary PostgreSQL instance from a snapshot.
- Automate backup creation for an AWS RDS PostgreSQL database
- Schedule backup jobs with EventBridge
- Monitor backup failures with CloudWatch
- Send email notifications with SNS when backup jobs fail
- Validate recoverability by restoring the latest snapshot to a temporary database
- Control cost by deleting temporary restore infrastructure after validation
- Provision the environment with Terraform
This architecture shows how Terraform provisions the AWS resources, EventBridge schedules the backup workflow, Lambda creates manual RDS snapshots, CloudWatch monitors execution, SNS sends failure alerts, and restore validation is performed using a temporary PostgreSQL instance before cleanup.
The project uses the following AWS services:
- Amazon RDS PostgreSQL for the primary database
- AWS Lambda for snapshot automation
- Amazon EventBridge for scheduling backup jobs
- Amazon CloudWatch for logs and alarms
- Amazon SNS for email alerting
- Terraform for infrastructure provisioning
- Terraform provisions the networking, RDS instance, SNS topic, Lambda function, EventBridge rule, and CloudWatch alarm.
- EventBridge triggers the Lambda function on a schedule.
- Lambda creates a manual snapshot of the RDS database.
- CloudWatch captures logs and monitors Lambda errors.
- SNS sends an email notification when the CloudWatch alarm enters the ALARM state.
- A restore validation test is performed by restoring a snapshot into a temporary PostgreSQL instance.
- The temporary restore-test database is deleted after validation to reduce cost.
- Terraform
- AWS RDS PostgreSQL
- AWS Lambda
- Amazon EventBridge
- Amazon CloudWatch
- Amazon SNS
- VPC, subnets, route tables, and security groups
Provisioned the following with Terraform:
- VPC
- Public and private subnets
- RDS PostgreSQL instance
- SNS topic
Built a Lambda function that creates manual RDS snapshots and triggered it using EventBridge.
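As a rough illustration of the snapshot step, the handler below is a minimal sketch and not the project's actual function. The `DB_INSTANCE_ID` environment variable, the helper names, and the snapshot naming scheme are assumptions made for this example. Only `create_db_snapshot` is a real boto3 RDS call.

```python
import os
from datetime import datetime, timezone


def build_snapshot_id(db_instance_id, now):
    """Build a unique snapshot identifier from the instance name and a UTC timestamp."""
    # RDS identifiers allow letters, digits, and hyphens, so the timestamp avoids colons.
    return f"{db_instance_id}-manual-{now.strftime('%Y-%m-%d-%H-%M-%S')}"


def create_manual_snapshot(rds_client, db_instance_id, now=None):
    """Request a manual snapshot and return the identifier RDS reports back."""
    now = now or datetime.now(timezone.utc)
    response = rds_client.create_db_snapshot(
        DBInstanceIdentifier=db_instance_id,
        DBSnapshotIdentifier=build_snapshot_id(db_instance_id, now),
    )
    return response["DBSnapshot"]["DBSnapshotIdentifier"]


def lambda_handler(event, context):
    # boto3 is imported lazily so the helpers above can be exercised without AWS access.
    import boto3

    # DB_INSTANCE_ID is a hypothetical variable name chosen for this sketch.
    db_instance_id = os.environ["DB_INSTANCE_ID"]
    snapshot_id = create_manual_snapshot(boto3.client("rds"), db_instance_id)
    return {"snapshot_id": snapshot_id}
```

Passing the client in as an argument keeps the snapshot logic testable with a stub. Any unhandled exception, such as a missing environment variable, fails the invocation and is counted in the Lambda `Errors` metric.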
Validated the phase by:
- manually invoking the Lambda function
- confirming successful execution
- verifying the manual snapshot in the RDS console
- confirming the EventBridge schedule exists
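The last check in this list was done by hand, but it could also be scripted. The sketch below assumes a boto3 EventBridge (`events`) client. The function name and its arguments are hypothetical, while `list_rules` and `list_targets_by_rule` are real client calls.

```python
def schedule_is_wired(events_client, rule_name, function_arn):
    """Return True if an enabled scheduled rule exists and targets the given Lambda ARN."""
    # NamePrefix can match several rules, so look for an exact name match.
    rules = events_client.list_rules(NamePrefix=rule_name)["Rules"]
    rule = next((r for r in rules if r["Name"] == rule_name), None)
    if rule is None or rule.get("State") != "ENABLED":
        return False
    # A scheduled rule carries a ScheduleExpression such as "rate(1 day)".
    if not rule.get("ScheduleExpression"):
        return False
    targets = events_client.list_targets_by_rule(Rule=rule_name)["Targets"]
    return any(t["Arn"] == function_arn for t in targets)
```

With real credentials, the call might look like `schedule_is_wired(boto3.client("events"), "<rule name>", "<function ARN>")`, with both values taken from the Terraform outputs.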
Created a CloudWatch alarm for Lambda errors and connected it to SNS email notifications.
Validated the phase by:
- intentionally breaking the Lambda configuration
- confirming the function failed
- verifying the CloudWatch alarm entered the ALARM state
- confirming the SNS email alert was received
- restoring the correct configuration and verifying the Lambda succeeded again
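The alarm in this project is provisioned with Terraform. Purely as an illustration of what an alarm on Lambda errors involves, the sketch below builds the equivalent arguments for the boto3 `put_metric_alarm` call. The alarm name, period, threshold, and missing-data handling are assumptions for this example and not the project's confirmed settings.

```python
def lambda_error_alarm_params(function_name, sns_topic_arn):
    """Build put_metric_alarm arguments for an alarm on a Lambda function's Errors metric."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # assumed: evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1,  # assumed: a single failed invocation triggers the alarm
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # A daily job emits no data for most periods; treat that as healthy.
        "TreatMissingData": "notBreaching",
        # The SNS topic receives a message when the alarm enters the ALARM state.
        "AlarmActions": [sns_topic_arn],
    }
```

These arguments could be applied with `boto3.client("cloudwatch").put_metric_alarm(**lambda_error_alarm_params(name, topic_arn))`. With `notBreaching`, the alarm can return to OK on its own once the failing period has passed.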
Validated backup recoverability by restoring the latest RDS snapshot into a temporary PostgreSQL instance.
Confirmed:
- the restore was successful
- the restored database reached the Available state
- the original and restored databases existed side by side during validation
The restore-test database was deleted afterward to control cost.
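The restore test described above was a manual exercise. A scripted version might look roughly like the sketch below, which assumes a boto3 RDS client. The function names, the `db.t3.micro` default, and the choice to skip the final snapshot on cleanup are assumptions for this example.

```python
def latest_available_snapshot(rds_client, db_instance_id):
    """Return the identifier of the newest manual snapshot in the 'available' state."""
    # Note: this reads a single page of results; a long history would need pagination.
    snapshots = rds_client.describe_db_snapshots(
        DBInstanceIdentifier=db_instance_id, SnapshotType="manual"
    )["DBSnapshots"]
    available = [s for s in snapshots if s.get("Status") == "available"]
    if not available:
        raise RuntimeError(f"no available manual snapshot for {db_instance_id}")
    newest = max(available, key=lambda s: s["SnapshotCreateTime"])
    return newest["DBSnapshotIdentifier"]


def validate_restore(rds_client, db_instance_id, restore_id, instance_class="db.t3.micro"):
    """Restore the latest snapshot to a temporary instance, wait for it, then delete it."""
    snapshot_id = latest_available_snapshot(rds_client, db_instance_id)
    rds_client.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=restore_id,
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceClass=instance_class,  # assumed small class to keep the test cheap
        PubliclyAccessible=False,
    )
    try:
        # Blocks until the restored instance is available, or raises on timeout.
        rds_client.get_waiter("db_instance_available").wait(DBInstanceIdentifier=restore_id)
    finally:
        # Attempt cleanup whether or not the wait succeeded, to control cost.
        rds_client.delete_db_instance(
            DBInstanceIdentifier=restore_id,
            SkipFinalSnapshot=True,
            DeleteAutomatedBackups=True,
        )
    return snapshot_id
```

A real run would likely also pass `DBSubnetGroupName` and `VpcSecurityGroupIds` so the temporary instance lands in the intended VPC. The sketch leaves them out because the source does not give those values.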
For this demo project, I used the following recovery targets:
- RPO (Recovery Point Objective): 24 hours
  This means the acceptable maximum data loss window is one day.
- RTO (Recovery Time Objective): 30 to 60 minutes
  This means the target recovery time for restoring the database from snapshot is under one hour for the demo environment.
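To make these targets concrete, the sketch below shows one way they could be checked against observed timestamps. The function names are hypothetical, and the 60-minute RTO limit uses the upper end of the range stated above.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=24)    # maximum acceptable age of the newest snapshot
RTO = timedelta(minutes=60)  # upper bound of the 30 to 60 minute target


def rpo_is_met(latest_snapshot_time, now, rpo=RPO):
    """True if the newest snapshot is no older than the RPO window."""
    return now - latest_snapshot_time <= rpo


def rto_is_met(restore_started, instance_available, rto=RTO):
    """True if the restore reached the available state within the RTO."""
    return instance_available - restore_started <= rto


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(rpo_is_met(datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc), now))  # 23 hours old: True
print(rpo_is_met(datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc), now))  # 25 hours old: False
```

A once-a-day schedule sits right at the edge of a 24-hour RPO, so a single missed or late run would breach it. This is one reason failure alerting matters for this workflow.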
Each phase includes its own screenshot evidence folder for easier review.
- Backups are not enough unless restores are tested
- Alerting is essential for backup reliability
- Temporary restore validation is a practical way to prove recoverability without building a complex DR platform
- Terraform makes the solution repeatable and easier to explain
- Cost control matters when testing disaster recovery workflows
To avoid unnecessary cost:
- the restore-test database should be deleted after validation
- snapshots should be reviewed and cleaned up as needed
- Terraform resources should be destroyed when the demo environment is no longer needed
- Automate restore validation with Lambda or Step Functions
- Add snapshot retention cleanup logic
- Add cross-region backup replication
- Add a more detailed operational runbook
- Add architecture diagrams
- Extend validation to include database connectivity checks