This repository holds the summarized data for the Wordle Twitter project, as well as the Python code that automates the data collection and manipulation.
This project is a follow-up to my Observable-based data exploration, *Wordle, 15 Million Tweets Later*.
Why do this? Fair question; three reasons:
- After publishing that piece, I was curious to see if the trends and correlations would continue to hold true over time, and whether new words would emerge as the hardest or easiest. I didn't want to continually update the prose/structure of the article, so a dashboard made more sense.
- Twitter's search API is limited: it is very hard to fetch tweets more than 7 days old. This dataset would therefore not be easily reproducible if collection were ever to stop.
- This was a good opportunity to become more familiar with GitHub workflows, Google Cloud Platform, and data pipeline architecture more generally.
The data pipeline currently is as follows:
- The `fetch_tweets.yml` workflow is run on a daily basis.
  - This workflow calls `WordleTwitterAPIScrape.py`, which fetches the last day's full set of Wordle tweets.
  - This data is compiled to a CSV and uploaded to Google Cloud Storage (GCS).
- The upload to GCS triggers a Cloud Function, which runs `GCPCompileFiles.py`.
  - This script condenses and anonymizes the data.
  - The data is then uploaded to a Google BigQuery (GBQ) dataset holding all the data from previous days.
  - The script then runs queries on the GBQ dataset to generate summary tables, such as round counts by day.
  - Finally, the script triggers the `download_views.yml` workflow in the GitHub repo.
- The `download_views.yml` workflow downloads the summary data.
  - The queries run against GBQ are stored in `GHQueryForViewData.py`.
  - The data is saved to the CSVs in `data_views/`.
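For context, extracting the round number, guess count, and share matrix from a Wordle tweet can be done with a small helper like the one below. This is a minimal sketch of the idea; the actual extraction logic in `WordleTwitterAPIScrape.py` may differ.

```python
import re

# Matches share headers like "Wordle 210 3/6" or "Wordle 1,023 X/6".
HEADER = re.compile(r"Wordle\s+(\d[\d,]*)\s+([1-6X])/6")
# ⬛/⬜ = absent letter (dark/light mode), 🟨 = present, 🟩 = correct.
SQUARES = set("⬛⬜🟨🟩")

def parse_wordle_tweet(text):
    """Return the round number, guess count, and matrix rows of a share
    tweet, or None if the text is not a Wordle result."""
    m = HEADER.search(text)
    if m is None:
        return None
    # Matrix rows are the lines made up entirely of result squares.
    rows = [line for line in text.splitlines() if line and set(line) <= SQUARES]
    return {
        "round": int(m.group(1).replace(",", "")),
        "guesses": None if m.group(2) == "X" else int(m.group(2)),
        "rows": rows,
    }
```

Fields like these are what ultimately feed the "round counts by day" style summary tables.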
Dashboard to come
Although all of the collected tweets are publicly available, steps have been taken to protect the identity of the user behind each tweet:
- The tweet ID has been removed
- The tweet creator ID has been replaced by an increasing index, which still allows for user-based analysis
- Only the Wordle matrix text is saved
It is still possible to recover the original tweet from this data alone, but not trivially.
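The ID replacement amounts to mapping each raw author ID to a first-seen increasing index and dropping everything else. A sketch of the idea (the actual logic in `GCPCompileFiles.py` may differ):

```python
def anonymize(records):
    """Drop tweet ids and replace each raw author id with a first-seen
    increasing index, keeping only the fields needed for analysis."""
    index = {}  # raw author id -> anonymous index
    out = []
    for rec in records:
        author = rec["author_id"]
        if author not in index:
            index[author] = len(index)  # assigned in first-seen order
        out.append({"user_index": index[author], "matrix": rec["matrix"]})
    return out
```

Because the index is stable within the dataset, repeat players can still be grouped and analyzed without exposing their Twitter IDs.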