This repository holds the summarized data for the Wordle Twitter project, as well as the Python code that automates the data collection and manipulation.
This project is a follow-up to my Observable-based data exploration, *Wordle, 15 Million Tweets Later*.
Why do this? Fair question; three reasons:
- After publishing that piece, I was curious to see if the trends and correlations would continue to hold true over time, and whether new words would emerge as the hardest or easiest. I didn't want to continually update the prose/structure of the article, so a dashboard made more sense.
- Twitter's search API is limited: it is very hard to fetch tweets more than 7 days old. This dataset would therefore not be easily reproducible if collection were ever to stop.
- This was a good opportunity to become more familiar with GitHub workflows, Google Cloud Platform, and data pipeline architecture more generally.
The data pipeline currently is as follows:
- The `fetch_tweets.yml` workflow is run on a daily basis.
  - This workflow calls `WordleTwitterAPIScrape.py`, which fetches the last day's full set of Wordle tweets.
  - This data is compiled to a CSV and uploaded to Google Cloud Storage (GCS).
- The upload to GCS triggers a Cloud Function, which runs `GCPCompileFiles.py`.
  - This script condenses and anonymizes the data.
  - The data is then uploaded to a Google BigQuery (GBQ) dataset holding all the data from previous days.
  - The script then runs queries on the GBQ dataset to generate summary tables, such as round counts by day.
  - Finally, the script triggers the `download_views.yml` workflow in the GitHub repo.
- The `download_views.yml` workflow downloads the summary data.
  - The queries run against GBQ are stored in `GHQueryForViewData.py`.
  - The data is saved to the CSVs in `data_views/`.
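For context, extracting the round number, guess count, and share matrix from a Wordle tweet can be done with a small helper like the one below. This is a minimal sketch of the idea; the actual extraction logic in `WordleTwitterAPIScrape.py` may differ.

```python
import re

# Matches share headers like "Wordle 210 3/6" or "Wordle 1,023 X/6".
HEADER = re.compile(r"Wordle\s+(\d[\d,]*)\s+([1-6X])/6")
# ⬛/⬜ = absent letter (dark/light mode), 🟨 = present, 🟩 = correct.
SQUARES = set("⬛⬜🟨🟩")

def parse_wordle_tweet(text):
    """Return the round number, guess count, and matrix rows of a share
    tweet, or None if the text is not a Wordle result."""
    m = HEADER.search(text)
    if m is None:
        return None
    # Matrix rows are the lines made up entirely of result squares.
    rows = [line for line in text.splitlines() if line and set(line) <= SQUARES]
    return {
        "round": int(m.group(1).replace(",", "")),
        "guesses": None if m.group(2) == "X" else int(m.group(2)),
        "rows": rows,
    }
```

Fields like these are what ultimately feed the "round counts by day" style summary tables.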
Dashboard to come
Although all of the collected tweets are publicly available, steps have been taken to protect the identity of the user behind each tweet:
- The tweet ID has been removed
- The tweet creator ID has been replaced by an increasing index, which still allows for user-based analysis
- Only the Wordle matrix text is saved
It is still possible to recover the original tweet from this data alone, but not trivially.
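The ID replacement amounts to mapping each raw author ID to a first-seen increasing index and dropping everything else. A sketch of the idea (the actual logic in `GCPCompileFiles.py` may differ):

```python
def anonymize(records):
    """Drop tweet ids and replace each raw author id with a first-seen
    increasing index, keeping only the fields needed for analysis."""
    index = {}  # raw author id -> anonymous index
    out = []
    for rec in records:
        author = rec["author_id"]
        if author not in index:
            index[author] = len(index)  # assigned in first-seen order
        out.append({"user_index": index[author], "matrix": rec["matrix"]})
    return out
```

Because the index is stable within the dataset, repeat players can still be grouped and analyzed without exposing their Twitter IDs.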