philmcmahon/data-pipeline

Data Pipeline Workshop

The aim of this workshop is to build a scalable pipeline for processing large datasets. We'll focus on the following tasks:

  • OCR (extracting text from images)
  • Transcription
  • Running LLM prompts on a document

As source data, we will use an Amazon S3 bucket containing:

  • podcasts/ - a selection of recent episodes of podcasts from several European countries
  • podcast_samples/ - the first 10 minutes of a selection of podcasts
  • arms_oil_company_reports/ - annual reports of oil/arms companies
  • company_reports/ - a (very big) selection of company reports from the FTSE 350 companies on the London Stock Exchange

The workshop has three parts:

  1. Creating the cloud infrastructure we will use for the project (infrastructure/):
     • A queue, which we'll use to schedule jobs for our workers
     • A worker pool - the computers that perform the OCR/transcription/LLM operations. Known as an 'auto scaling group' in AWS.
  2. Scheduling some work. See populator/.
  3. Running the desired task on the work we have scheduled. See worker/.
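
The queue-and-workers pattern described above can be sketched in a few lines of Python. This is an in-memory illustration only, not the workshop code: `queue.Queue` stands in for the AWS queue, threads stand in for the worker machines, and the `handle` function, the message fields and the file names are all hypothetical.

```python
import queue
import threading


def handle(job):
    # Hypothetical stand-in for the real OCR / transcription / LLM work.
    return f"{job['jobType']} done for {job['key']}"


def run_pool(jobs, worker_count):
    """Drain `jobs` with `worker_count` workers and return the results."""
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker stops
            outcome = handle(job)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical file names; each job is picked up by exactly one worker.
jobs = [{"jobType": "transcribe", "key": f"podcast_samples/{n}.mp3"} for n in range(5)]
print(sorted(run_pool(jobs, worker_count=3)))
```

Because every worker pulls from the same queue, adding workers changes how quickly the queue drains but not which jobs get done.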

Sign into AWS

First of all, sign in to Amazon Web Services (AWS) using the provided credentials at this URL: https://708599814125.signin.aws.amazon.com/console/?region=eu-west-1

Setup - using your own laptop

Firstly, make sure you have installed the AWS CLI. You can check if it's installed by running:

aws --version

If you get some output similar to aws-cli/2.34.53 ... then it is installed. To install it, follow the guide here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

Once you have the CLI installed, in a terminal window run the command:

aws login --profile dataharvest --region eu-west-1

It should pop open a browser. Select the account you just signed into.

To test that your credentials are working, run aws s3 ls --profile dataharvest - you should see a list of data harvest related S3 buckets.

Setup - using GitHub Codespaces

  1. Sign in to GitHub - a free account is fine. Go to the repo, click on the dropdown next to the + icon in the top right and select 'New codespace'.
  2. Once the codespace has launched, open a terminal and run the setup script to install the AWS CLI, OpenTofu, and uv:
bash codespaces-setup.sh

At the end of the script it should output a url you can use to sign in to AWS - copy it and paste it into your browser in a new tab. You'll get a code you need to paste back into the terminal.

Creating your pipeline

To manage the cloud infrastructure we'll need for this workshop, we're using a tool called OpenTofu. This allows us to define everything we need in a single file, rather than having to click lots of buttons in the cloud console. To begin with, make sure you have installed OpenTofu by following the instructions here: https://opentofu.org/docs/intro/install/.

Once OpenTofu is installed, we need to initialise the OpenTofu project we'll be using. Run tofu init inside the infrastructure directory:

cd infrastructure/
tofu init

Next, take a look at infrastructure/workers.tf - this is the OpenTofu file that defines the infrastructure we'll need for the pipeline. There isn't time to learn the details of OpenTofu now, but you could read through the comments of the file to get an idea of what will be created in AWS.

Change the projectName variable inside the infrastructure/workers.tf file from 'phil-test' to something else. This ensures that you'll be able to tell the cloud resources you've created apart from those of others doing the workshop.

Once you have changed the project name, run tofu apply to create your infrastructure. OpenTofu will show a list of everything it is going to do; type 'yes' to tell it to go ahead.

Once OpenTofu has finished creating your infrastructure, you can check that it's there in the AWS console.

Scheduling some work

Now we've got our infrastructure, we need to put some messages on the queue containing the work we want done.

Take a look at the contents of the source data S3 bucket and decide which files you want to work on. Initially it's best to pick something out of the short_files directory. Once you have identified a directory, note down the path to the files, e.g. podcast_samples/

Next, take a look at the populator/ subdirectory. This contains the script we'll use to add messages to the queue. The basic behaviour of the script is:

  • List files in the S3 bucket
  • Add a reference to the file, together with a jobType and any extra necessary data, to the queue.
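
As a rough sketch of those two steps (not the actual populator code), the following builds one JSON message per file and hands it to a `send` function. The field names `key` and `jobType`, the `extra` dictionary and the example file names are assumptions made for illustration; the real message schema may differ.

```python
import json


def build_message(key, job_type, extra=None):
    """Return the JSON body for one queue message.

    The field names are assumptions, not the populator's real schema.
    """
    body = {"key": key, "jobType": job_type}
    if extra:
        body.update(extra)
    return json.dumps(body, sort_keys=True)


def populate(keys, job_type, send, extra=None):
    """Call `send` once per file and return how many messages were sent."""
    count = 0
    for key in keys:
        send(build_message(key, job_type, extra))
        count += 1
    return count


# Example with a plain list standing in for the queue (hypothetical file names).
sent = []
populate(["podcast_samples/a.mp3", "podcast_samples/b.mp3"], "transcribe", sent.append)
print(sent[0])
```

With real AWS clients, the keys could come from boto3's `list_objects_v2` paginator and `send` could wrap `send_message(QueueUrl=..., MessageBody=...)` on an SQS client; they are left out here so the sketch runs without credentials.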

Whilst it's possible to write some code to auto-detect the type of each file and process it accordingly, to keep things simple we'll tell the populator which job we want to run on the input files. Initially, we'll start with transcription.

To run the populator you need a program called uv installed. Follow the instructions here https://docs.astral.sh/uv/getting-started/installation/ to install it.

Once you have uv installed, run uv sync to install the dependencies needed by the populator script:

cd data-pipeline/
uv sync

Now we can run the populator script:

uv run populator [QUEUE_URL] [PATH] [JOB_TYPE]

You'll need to replace the arguments:

  • QUEUE_URL: Fetch it from the SQS console
  • PATH: path to the files in S3 you want to process
  • JOB_TYPE: either 'transcribe' or 'ocr'

Running the workers and checking the output

Now we have some work in the queue, we need to start a worker. Go to the autoscaling console, find your worker autoscaling group and set the desired number of instances to 1. If you go to the 'instance management' tab you should see a new instance starting up.
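
If you would rather script this step than click through the console, the same change can be made through the Auto Scaling API. The sketch below assumes a boto3-style client is passed in; the group name is hypothetical, and a small recording class stands in for AWS so the example runs without credentials.

```python
def set_worker_count(autoscaling_client, group_name, count):
    """Ask the auto scaling group to run `count` instances."""
    if count < 0:
        raise ValueError("count must be zero or more")
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=count,
    )


# With real credentials the client would come from boto3 (not imported here
# so that the sketch runs without it), for example:
#   import boto3
#   session = boto3.Session(profile_name="dataharvest", region_name="eu-west-1")
#   set_worker_count(session.client("autoscaling"), "my-project-workers", 1)
# "my-project-workers" is a hypothetical group name.


class RecordingClient:
    """Tiny stand-in that records the call instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def set_desired_capacity(self, **kwargs):
        self.calls.append(kwargs)


demo = RecordingClient()
set_worker_count(demo, "my-project-workers", 1)
print(demo.calls)
```

Note that AWS only accepts a desired capacity that lies between the group's minimum and maximum size.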

The worker has been pre-configured to install any necessary dependencies and start processing the queue. To see what it is doing, we need to look at the output logs from the service. There are various ways of doing this; for now we'll just log in to the machine using the AWS console.

(Note that if you're using another cloud provider, or don't want to be locked in to AWS, you can use SSH for this step. The AWS tools are just easier in the workshop context.)

Find your instance by going to the autoscaling group console, finding your autoscaling group, selecting the 'instance management' tab and then clicking on the instance id of your instance. This will open in a new tab. Click 'connect' in the top right and then on the next page 'connect' in the bottom right.

Once you are logged in, there are two files of interest. /var/log/cloud-init-output.log will show you everything that happened when the instance started up. At the end of the startup process it starts the worker. From then on, logs will appear in /opt/dlami/nvme/worker.log.

To see what's in these files, you can use tail -f to follow along whilst stuff is happening:

tail -f /var/log/cloud-init-output.log
tail -f /opt/dlami/nvme/worker.log

Or you can use cat to dump the whole file to the screen:

cat /var/log/cloud-init-output.log
cat /opt/dlami/nvme/worker.log

worker.log contains all the output from the worker.py script.

If you'd prefer to use your own terminal rather than the browser, you can install the Session Manager plugin for the AWS CLI (see the AWS Systems Manager documentation) and then log in using the instance id:

aws ssm start-session --profile dataharvest --target <id>
sudo su ubuntu
cd /home/ubuntu

If you'd like to understand the worker better, note that all the files related to the worker are installed in the folder /opt/dlami/nvme - this location is useful as it is on the faster disk that comes with AWS's GPU instances. The startup script for the worker can be seen in the workers.tf OpenTofu file.

Analysing the output using an LLM

Hopefully by this point you have some data in your output bucket. The next step is to feed this output into an LLM to do some analysis.

Firstly, let's purge the queue of any leftover work. You can do this in the SQS console.

Next, edit prompts/system_prompt.txt to contain a description of the type of data you are looking for. See example_prompt.txt for inspiration.

Now, run the populator with your prompt and the location of the output files. Note that this time you'll need to tell the populator to read from your output bucket:

uv run populator [QUEUE_URL] [PATH] prompt --bucket [OUTPUT_BUCKET] --system-prompt-file prompts/system_prompt.txt
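
To make the shape of the populator commands in this README easier to see, here is a hedged sketch of how such a command line could be parsed with Python's argparse. It mirrors only the arguments shown here; the real populator may define them differently, and the rule that prompt jobs require a system prompt file is an assumption.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="populator")
    parser.add_argument("queue_url", help="URL of the queue to add messages to")
    parser.add_argument("path", help="path to the files in S3 to process")
    parser.add_argument("job_type", choices=["transcribe", "ocr", "prompt"])
    parser.add_argument("--bucket", help="bucket to read the input files from")
    parser.add_argument("--system-prompt-file", help="file containing the system prompt")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a prompt job makes no sense without a system prompt.
    if args.job_type == "prompt" and not args.system_prompt_file:
        parser.error("--system-prompt-file is required for prompt jobs")
    return args


args = parse([
    "https://example.invalid/my-queue",  # placeholder queue URL
    "transcribe/",                       # hypothetical path in the output bucket
    "prompt",
    "--bucket", "my-output-bucket",      # hypothetical bucket name
    "--system-prompt-file", "prompts/system_prompt.txt",
])
print(args.job_type, args.bucket, args.system_prompt_file)
```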

It's worth logging onto the machine again to check it's doing as you expect. Eventually you should start seeing files appear in the S3 console at the PATH you provided. You should be able to download and take a look at the files.

Gathering output into a single file

It would be easier to look at the output data if we had it all in one place. There is a script 'collector' provided which will download all the output prompt files and combine them into a single CSV (spreadsheet) file. You can run it like this:

uv run collector [OUTPUT_BUCKET] prompt/

You should be left with a CSV file you can open in a spreadsheet editor.
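
The combining step can be sketched as follows. This is not the collector's actual code: it assumes the output files have already been downloaded into a dictionary of file key to text, and the column names and example contents are hypothetical.

```python
import csv
import io


def combine_to_csv(outputs):
    """Turn {file key: text} into CSV text with one row per file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "output"])
    for key in sorted(outputs):
        writer.writerow([key, outputs[key]])
    return buffer.getvalue()


# Hypothetical outputs, as if already downloaded from the output bucket.
example = {
    "prompt/report_b.txt": "No relevant mentions.",
    "prompt/report_a.txt": 'Mentions "Project X", twice.',
}
print(combine_to_csv(example))
```

Using the csv module rather than joining strings by hand means that commas, quotes and line breaks inside the LLM output are escaped properly, so the file still opens cleanly in a spreadsheet editor.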

Scaling up

If there's time, we can try processing a larger amount of data (e.g. the company_reports/ folder). Use the populator to add the jobs to the queue, and then set the desired number of instances of the autoscaling group to 3. You now have 3 workers processing the work simultaneously, hooray!

About

Repository for workshop at data harvest 2026 on rapidly analysing documents using the public cloud
