
Beer Reviews Data Pipeline

Project Description

The Beer Reviews Data Pipeline is a data engineering project that extracts, preprocesses, and stores beer review data from a Kaggle dataset in a Google Cloud Storage data lake. The pipeline is built with Python and Prefect, and includes a Metabase dashboard for data visualization.

Technologies Used

  • Python
  • Prefect
  • Docker
  • Google Cloud Storage
  • Metabase

[Architecture diagram: Beer Reviews Data Pipeline]

Prerequisites

Before running the Beer Reviews Data Pipeline, you must have the following installed:

  • Python

For Metabase:

  • Docker
  • Docker Compose

Also needed:

  • GCP Service Account
  • Kaggle API keys
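
Both credential files are plain JSON. As a quick preflight check before running anything (the file paths are the ones this README uses; the "username"/"key" fields are Kaggle's standard kaggle.json format):

    # Preflight: confirm both credential files exist and parse as JSON.
    import json
    from pathlib import Path

    for path in ("./app/creds.json", "./.kaggle/kaggle.json"):
        p = Path(path)
        if not p.is_file():
            raise SystemExit(f"missing credentials file: {path}")
        keys = sorted(json.loads(p.read_text()))
        print(f"{path}: OK (keys: {', '.join(keys)})")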

Usage

Data Pipeline

  1. Clone the repository to your local machine:

    git clone https://github.com/directdetour/BeerReviewsDataPipeline.git

  2. Create a Google Cloud Storage bucket to store the data:

    bucket_name = 'beer_reviews_bucket'

  3. Add GCS credentials to ./app/creds.json
    • The GCP Service Account needs GCS and BigQuery read/write permissions
  4. Visit kaggle.com, generate API credentials, and save them to ./.kaggle/kaggle.json (a JSON file containing your "username" and "key")
  5. Run start.sh to activate the venv, install the Python requirements, launch the data pipeline, and create the BigQuery table (a sketch of the flow's core logic follows these steps):

    ./start.sh
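
The repository's actual logic lives in flows/download_flow.py, flows/upload_flow.py, and app.py; purely as an illustration of what such a Prefect flow can look like (the dataset slug comes from the Notes below, while the bucket name, file names, and preprocessing are assumptions):

    # Illustrative sketch only -- the project's real tasks live in flows/ and app.py.
    import pandas as pd
    from google.cloud import storage
    from kaggle.api.kaggle_api_extended import KaggleApi
    from prefect import flow, task

    @task
    def download_dataset(slug: str, dest: str = "data") -> str:
        # Reads credentials from ./.kaggle/kaggle.json
        api = KaggleApi()
        api.authenticate()
        api.dataset_download_files(slug, path=dest, unzip=True)
        return f"{dest}/reviews.csv"  # assumed name of the CSV inside the archive

    @task
    def to_parquet(csv_path: str) -> str:
        # Convert to Parquet (columnar) so BigQuery can read it as an external table
        df = pd.read_csv(csv_path)
        parquet_path = csv_path.replace(".csv", ".parquet")
        df.to_parquet(parquet_path, index=False)
        return parquet_path

    @task
    def upload_to_gcs(parquet_path: str, bucket_name: str) -> None:
        client = storage.Client.from_service_account_json("./app/creds.json")
        blob = client.bucket(bucket_name).blob(parquet_path.split("/")[-1])
        blob.upload_from_filename(parquet_path)

    @flow
    def beer_reviews_pipeline():
        csv = download_dataset("thedevastator/1-5-million-beer-reviews-from-beer-advocate")
        upload_to_gcs(to_parquet(csv), "beer_reviews_bucket")

    if __name__ == "__main__":
        beer_reviews_pipeline()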
    

Data Viz

  1. Launch the Metabase Docker instance (you may need to use sudo):

    docker-compose up --build -d
    
  2. Open a web browser and go to http://localhost:3000 to access the Metabase dashboard.

  3. Update the "Beer Reviews" database to use your GCP credentials. (see the tip below; an API sketch follows)
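
Under the hood, step 3 means editing Metabase's database entry for BigQuery; metabase-data/bqdb_update.sh does this through Metabase's REST API. As a rough illustration (the /api/session and /api/database endpoints are Metabase's documented API, but the login, database id, and "service-account-json" detail key are assumptions about this setup, not the script's actual contents):

    # Hypothetical sketch -- see metabase-data/bqdb_update.sh for the real script.
    import requests

    MB = "http://localhost:3000"

    # Open a session; POST /api/session returns an auth token
    token = requests.post(
        f"{MB}/api/session",
        json={"username": "admin@example.com", "password": "secret"},  # assumed login
    ).json()["id"]
    headers = {"X-Metabase-Session": token}

    # Fetch the preconfigured database entry, swap in the new service-account JSON
    db_id = 2  # assumed id of the "Beer Reviews" database
    db = requests.get(f"{MB}/api/database/{db_id}", headers=headers).json()
    db["details"]["service-account-json"] = open("./app/creds.json").read()
    requests.put(f"{MB}/api/database/{db_id}", headers=headers, json=db)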

Folder Structure

beerreviewsproject/
├── .kaggle/
│   └── kaggle.json
├── app/
│   └── creds.json
├── config/
│   └── metabase_database.env
├── flows/
│   ├── download_flow.py
│   └── upload_flow.py
├── metabase-data/
│   ├── metabase.db/
│   ├── bqdb_firstrun.sh
│   ├── bqdb_update.sh
│   └── metabase_accounts.sh
├── .gitignore
├── app.py
├── bq_provision.py
├── docker-compose.yml
├── Dockerfile.pipeline-unused
├── requirements.txt
├── README.md
└── start.sh
  • .kaggle/kaggle.json: Kaggle API credentials, required to download the dataset via the API in this pipeline
  • app/creds.json: JSON key from a Google Cloud IAM Service Account
  • config/metabase_database.env: Settings used by docker-compose when building the Metabase container
  • flows/: Folder containing Prefect flow files for downloading and preprocessing the data
  • metabase-data/: Contains the database file used to run the Metabase dashboard, plus bash scripts for interacting with the Metabase system API

    Tip: Use the UI to update the GCP credentials for the preconfigured "Beer Reviews" database source, or execute bqdb_update.sh (see the API sketch in the Data Viz section above).

  • app.py: Python script that defines the Prefect flow and tasks.
  • bq_provision.py: Python script that creates a BigQuery external table from the uploaded Parquet data (a sketch of the idea follows this list)
  • docker-compose.yml: Defines the Docker service for the Metabase Dashboard.
  • README.md: Markdown file containing project description, installation instructions, usage instructions, and folder structure.
  • requirements.txt: File containing Python libraries required to run the project.
  • start.sh: Bash script that launches the data pipeline
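
The gist of bq_provision.py can be illustrated with the google-cloud-bigquery client; a rough sketch (the dataset, table, and bucket names here are assumptions, not the script's actual values):

    # Hypothetical sketch of provisioning a BigQuery external table over the
    # Parquet data in GCS -- see bq_provision.py for the project's real logic.
    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json("./app/creds.json")

    # Dataset to hold the external table (name is an assumption)
    client.create_dataset("beer_reviews", exists_ok=True)

    # External table definition pointing at the uploaded Parquet object(s)
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = ["gs://beer_reviews_bucket/*.parquet"]

    table = bigquery.Table(f"{client.project}.beer_reviews.reviews")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)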

Contributing

Contributions to the Beer Reviews Data Pipeline project are welcome. To contribute, please follow these steps:

  1. Fork the repository: https://github.com/directdetour/BeerReviewsDataPipeline.git
  2. Create a new branch: git checkout -b feature/your-feature
  3. Make your changes and commit them: git commit -am 'Add some feature'
  4. Push to the branch: git push origin feature/your-feature
  5. Submit a pull request

More Info

Notes

  • Kaggle API CLI command to download the dataset:

    kaggle datasets download -d thedevastator/1-5-million-beer-reviews-from-beer-advocate

  • Metabase is configured to use Google BigQuery as the data source; BigQuery reads the data through an external table over the Parquet file stored in the Google Cloud Storage bucket.

  • The GCP Service Account will work if given the Cloud Storage Admin and BigQuery Admin roles, but for least privilege, create a custom role with the following minimum permissions:

bigquery.datasets.create
bigquery.datasets.get
bigquery.datasets.update
bigquery.jobs.create
bigquery.jobs.delete
bigquery.jobs.get
bigquery.jobs.list
bigquery.jobs.listAll
bigquery.jobs.listExecutionMetadata
bigquery.jobs.update
bigquery.tables.create
bigquery.tables.get
bigquery.tables.list
bigquery.tables.update
storage.buckets.create
storage.buckets.get
storage.buckets.update
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.list
storage.objects.update

[Screenshot: Main Dashboard]

[Screenshot: Drill Through Details Dashboard]

Acknowledgements
