
Tech Wellcome Funds and Data Representation Labels

This repo contains the exploratory notebooks and the modelling and analysis scripts used to find out what tech Wellcome funds, as well as the code to create the data representation labels.

Contents

  1. Set-up
  2. Tech Wellcome Funds
  3. Data Representation Labels
  4. Project structure

Set-up

Set up the virtual environment

Running

make virtualenv

will create a virtual environment with the packages listed in requirements.txt.

Then, whenever you want to develop and run code, activate the environment by running

source build/virtualenv/bin/activate

Download the data

If you have the AWS command line tool installed and your credentials configured, then you can either download the open version of the data and models folders from our publicly available S3 bucket:

make sync_open_data_from_s3
make sync_open_models_from_s3

(this only contains the essential and most recent files), or (if you are a Wellcome employee) download the private version of all the data and models:

make sync_data_from_s3
make sync_models_from_s3

or, for a smaller set of private data (no legacy data/models):

make sync_latest_files_from_s3

If you don't have AWS credentials then you will need to download the essential open files from this URL:

https://datalabs-public.s3.eu-west-2.amazonaws.com/nutrition-labels/open_data_models.zip

This will download a zipped file. Unzip it and move its contents into the main directory (e.g. the 'data/processed' folder from the zip should replace the main 'data/processed' folder).
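If you'd rather script this step, here is a minimal Python sketch of the download and unzip, assuming you extract into the repository root (the URL is the one above; everything else follows the manual instructions):

```python
# Minimal sketch: fetch and unpack the open data/models bundle.
# The URL is from this README; extracting into the repo root and then
# merging the folders as described above is assumed.
import urllib.request
import zipfile

URL = "https://datalabs-public.s3.eu-west-2.amazonaws.com/nutrition-labels/open_data_models.zip"
urllib.request.urlretrieve(URL, "open_data_models.zip")

with zipfile.ZipFile("open_data_models.zip") as zf:
    zf.extractall(".")  # then move the extracted contents into the main directory
```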

This data contains the file data/raw/wellcome-grants-awarded-2005-2019.csv, the openly available 360Giving grants data of 16,914 grants from 2005 to 2019. This file is the basis of much of this project.
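As a quick sanity check, you can load this file with pandas (a sketch; the column layout is not documented here, so only the row count is checked):

```python
# Minimal sketch: load the 360Giving grants data underpinning this project.
# The path is from this README; pandas is an assumption, not a pinned dependency.
import pandas as pd

grants = pd.read_csv("data/raw/wellcome-grants-awarded-2005-2019.csv")
print(len(grants))  # expected: 16,914 grants from 2005 to 2019
```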

Make sure to upload any processed data to this folder too.

Jupyter notebooks

To run a notebook, run

jupyter notebook

Tests

Unit tests can be run with make test. After any changes to this codebase, this command should be run to check nothing has broken.

To check that the pipeline of training models and predicting grants is working, run:

chmod +x pipelines/tech_grants_pipeline.sh
pipelines/tech_grants_pipeline.sh configs/pipeline/2021.05.01.test.ini

This should take about a minute.

Tech Wellcome Funds

The code for this project is in the nutrition_labels and notebooks folders. More information about the experiments and results of this project is given in the documents:

Pipeline overview

ResearchFish and EPMC evaluation data

Create RF and EPMC evaluation data by running:

python nutrition_labels/create_training_data.py --config_path configs/training_data/2021.03.29.epmc.ini
python nutrition_labels/create_training_data.py --config_path configs/training_data/2021.03.29.rf.ini

Previously these commands were part of the pipeline bash script, but they require different config parameters for create_training_data.py, which isn't possible now that we use a single config for the entire pipeline.

Pipeline bash command

The pipeline to create training data, train models, and make predictions and evaluate models can be run with the commands:

chmod +x pipelines/tech_grants_pipeline.sh
pipelines/tech_grants_pipeline.sh configs/pipeline/2021.05.01.private.ini

or

chmod +x pipelines/tech_grants_pipeline.sh
pipelines/tech_grants_pipeline.sh configs/pipeline/2021.05.01.open.ini

The former uses internally available FortyTwo data to train and make predictions on; the latter is for external users and trains and predicts on the publicly available 360Giving grants dataset.

Be warned that this takes over 5 hours, since it includes making predictions on the data.

An overview of these pipeline steps (if you were to run them one by one), and the latest files used for each of them as of 21/04/2021, is as follows.

1. Create the training data

Description: Create training data using the expanded tech definition and the tagged grants data. ResearchFish and EPMC data are not included.

Input: configs/training_data/2021.03.08.ini

Command:

python nutrition_labels/create_training_data.py --config_path configs/training_data/2021.03.08.ini

Output: 313 tech grants and 488 not tech grants (Internal ID <-> Relevance code) in data/processed/training_data/210308/training_data.csv.
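A quick way to sanity-check this output is to count the labels (a sketch; the exact column name is an assumption based on the 'Internal ID <-> Relevance code' description above):

```python
# Minimal sketch: check the class balance of the generated training data.
# The file path is this step's output; the "Relevance code" column name is
# an assumption from the description above.
import pandas as pd

training_data = pd.read_csv("data/processed/training_data/210308/training_data.csv")
print(training_data["Relevance code"].value_counts())  # expect 313 tech (1), 488 not tech (0)
```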

2. Train model(s)

Description: Train a BERT + logistic regression classifier model.

Input: Internally we want to train and predict on FortyTwo data, but for external communication we need to use the public 360Giving dataset, so we have different configs to train on each of these:

  • configs/train_model/2021.04.03.ini - to train using the FortyTwo grants data downloaded on 20th April 2021.
  • configs/train_model/2021.04.04.ini - to train using the 360Giving grants dataset.

Command example:

python nutrition_labels/grant_tagger.py --config_path configs/train_model/2021.04.03.ini

Output: All outputs are stored in models/210403/. The pickled trained classifier and vectorizer are stored in the folder models/210403/bert_log_reg_210403/ along with an evaluation_results.txt file containing the test/train metrics. A further file, models/210403/training_information.json, records which data points were in the test/train split and what the model predicted.
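The real implementation lives in nutrition_labels/grant_tagger.py and is driven by the config files above. For orientation only, here is a minimal sketch of the general BERT + logistic regression pattern, using sentence-transformers as a stand-in encoder (the model name, texts and labels are all assumptions, not the repo's actual code):

```python
# Illustrative BERT-embedding + logistic regression classifier. This is NOT
# the repo's grant_tagger.py; the encoder name and toy data are assumptions.
import pickle

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["grant one ...", "grant two ...", "grant three ...", "grant four ..."]
labels = [1, 0, 1, 0]  # 1 = tech grant, 0 = not tech

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-style encoder
X = encoder.encode(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))

with open("bert_log_reg.pkl", "wb") as f:  # the repo similarly pickles its model
    pickle.dump(clf, f)
```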

3. Predict tech grants using a single or an ensemble of models

Description: Predict whether grants should be classified as tech grants or not, using the trained models/210403/bert_log_reg_210403 model with a prediction threshold of 0.55.

Input: Internally we want to predict on FortyTwo data, but for external communication we need to use the public 360Giving dataset, so we have different configs to predict on each of these:

  • configs/predict/2021.04.04.ini - to predict using the 360Giving grants dataset.
  • configs/predict/2021.04.05.ini - to predict using the FortyTwo grants data downloaded on 20th April 2021.

Command:

python nutrition_labels/predict.py --config_path configs/predict/2021.04.04.ini
python nutrition_labels/predict.py --config_path configs/predict/2021.04.05.ini

Output:

  • data/processed/predictions/210404/wellcome-grants-awarded-2005-2019_tagged.csv
  • data/processed/predictions/210405/all_grants_fortytwo_info_210420_tagged.csv
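nutrition_labels/predict.py is the real entry point; as a sketch of the thresholding logic, classification at the 0.55 threshold looks roughly like this (the pickle file name and encoder are hypothetical placeholders):

```python
# Illustrative thresholded prediction, mirroring the 0.55 threshold above.
# The pickle file name and the encoder are hypothetical placeholders.
import pickle

from sentence_transformers import SentenceTransformer

with open("models/210403/bert_log_reg_210403/model.pkl", "rb") as f:  # hypothetical name
    clf = pickle.load(f)

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(["A new grant description to classify"])

tech_probability = clf.predict_proba(X)[:, 1]
print(tech_probability >= 0.55)  # True = tagged as a tech grant
```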

4. (optional) Create evaluation data

Description: A sample of ResearchFish (self-reported) and EPMC (publications) outputs data was tagged as containing a tech output. This can be seen as 'hidden' tech, so the grant description may not mention tech at all. It is interesting to see how well the model does at predicting these grants, but first this data needs to be processed.

Input:

  • configs/training_data/2021.03.29.epmc.ini
  • configs/training_data/2021.03.29.rf.ini

Command:

python nutrition_labels/create_training_data.py --config_path configs/training_data/2021.03.29.epmc.ini

and

python nutrition_labels/create_training_data.py --config_path configs/training_data/2021.03.29.rf.ini

Output:

  • data/processed/training_data/210329epmc/training_data.csv
  • data/processed/training_data/210329rf/training_data.csv

5. (optional) Evaluate the model

Description: The model is evaluated on up to 4 different datasets: the test data, unseen grants data containing only not-tech grants, and the tech outputs from the EPMC and ResearchFish outputs data. The unseen not-tech grants data is the portion of training data that was discarded to give the test and training datasets a balanced number of tech and not-tech grants. If step 4 wasn't done, this will still work and will just output the test and unseen data metrics.

Input: configs/evaluation/2021.04.03.ini

Command:

python nutrition_labels/evaluate.py --config_path configs/evaluation/2021.04.03.ini

Output: Metrics for all 4 evaluation datasets are written to data/processed/evaluation/210403/evaluation_results.txt.
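Conceptually, the evaluation is just the same model scored against each labelled dataset in turn; a minimal sketch (the paths are from step 4, while the column name and placeholder predictions are assumptions):

```python
# Illustrative multi-dataset evaluation in the spirit of evaluate.py.
# Paths come from step 4; the "Relevance code" column is an assumption and
# the placeholder predictions should be replaced by the trained model's output.
import pandas as pd
from sklearn.metrics import classification_report

datasets = {
    "EPMC": "data/processed/training_data/210329epmc/training_data.csv",
    "ResearchFish": "data/processed/training_data/210329rf/training_data.csv",
}

for name, path in datasets.items():
    data = pd.read_csv(path)
    y_true = data["Relevance code"]
    y_pred = [1] * len(data)  # placeholder: substitute real model predictions
    print(name)
    print(classification_report(y_true, y_pred, zero_division=0))
```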

Pipeline additional details

Creating training data

In Finding_Tech_Grants.md we describe the first stage of tagging data to create the training data; this was last updated on 7th August 2020. In 2021 we updated the definition of 'tech', as described in Expanding_tech_grants.md; this created a new training dataset on 26th January 2021. Then we decided to use active learning in Prodigy to tag more training data, a process described in Prodigy_training_data.md, which resulted in a training dataset on 21st February 2021.

Finally, we found that adding grants tagged via EPMC and ResearchFish may actually decrease the model scores, so we created training data not containing these data points; this was done on 8th March 2021.

| Tag code | Meaning | Number of grants - 200807 | Number of grants - 210126 | Number of grants - 210221 | Number of grants - 210308 |
|---|---|---|---|---|---|
| 1 | Relevant | 214 | 347 | 495 | 313 |
| 0 | Not relevant | 883 | 349 | 485 | 488 |

To create these datasets you should run:

python nutrition_labels/create_training_data.py --config_path configs/training_data/2021.03.08.ini

with the config file set to '2020.08.07.ini', '2021.01.26.ini', '2021.02.21.ini' or '2021.03.08.ini' as appropriate.
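To regenerate all four historical datasets in one go, a small loop over the config files listed above works (a sketch using subprocess):

```python
# Minimal sketch: run create_training_data.py once per historical config.
import subprocess

configs = ["2020.08.07.ini", "2021.01.26.ini", "2021.02.21.ini", "2021.03.08.ini"]
for config in configs:
    subprocess.run(
        [
            "python",
            "nutrition_labels/create_training_data.py",
            "--config_path",
            f"configs/training_data/{config}",
        ],
        check=True,  # stop if any dataset fails to build
    )
```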

Model Experiments

Several experiments were performed to come up with the best model for this task. The outcomes of these fed into the design of each of the config files in the pipeline. These experiments are discussed in much more detail in the "Experiments" and "Ensemble model/Parameter experiments" sections of Tech_Grants_2021.md.

Fairness

In the notebook Fairness.ipynb we perform group fairness checks for the models. The results of these are written up in Tech_grant_model_fairness.md.
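The notebook is the authoritative analysis; as an illustration of the general shape of a group fairness check, comparing one metric across subgroups looks roughly like this (the group column and all values are made-up placeholders):

```python
# Illustrative group fairness check: compare a metric across subgroups.
# The "group" column and all values are made-up placeholders, not repo data.
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame(
    {
        "group": ["A", "A", "B", "B"],
        "y_true": [1, 0, 1, 0],
        "y_pred": [1, 0, 0, 0],
    }
)

for group, rows in results.groupby("group"):
    print(group, accuracy_score(rows["y_true"], rows["y_pred"]))
```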

Computational Science tags comparison

We compare the tech grants with another set of grants tagged by a different model in the notebook Science tags - Tech grant comparison.ipynb.

Clustering

We perform cluster analysis to look at themes within the tech grants. This analysis is written up in Tech_grant_clusters.md.
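For a flavour of what such a cluster analysis involves (the actual method is documented in Tech_grant_clusters.md; TF-IDF + KMeans here is only an illustrative stand-in):

```python
# Illustrative theme clustering over grant descriptions. TF-IDF + KMeans is
# a stand-in for demonstration; see Tech_grant_clusters.md for the real analysis.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "machine learning platform for genomics",
    "mobile app for clinical data collection",
    "database of imaging biomarkers",
    "software tool for epidemic modelling",
]

X = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster label per grant, used to inspect themes
```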

Data Representation Labels

In the Top_Datasets.md document you can find how we identified some of the datasets to include in the data representation labels.

The folder representation_labels/ contains the code needed to produce the data representation labels html.

Project structure

├── configs
│   Config files for the models
├── data
│   ├── processed - Data that we generate
│   └── raw - Raw data
├── docs
│   Documents for this project
├── models
│   Any models created including clustering
├── notebooks
│   Jupyter notebooks for experimentation and analysis
├── nutrition_labels
│   Scripts for the tech wellcome funds work
├── README.md
├── representation_labels
│   Scripts for the data representation labels work
└── requirements.txt - a list of all the python packages needed for this project

Deploying

1. Pre-requisites

You need Docker installed, and to download model-cli from https://github.com/wellcometrust/hal9000/releases/tag/cli-0.1.0. To add model-cli to your path you can run:

chmod +x ~/Downloads/model-cli && mv ~/Downloads/model-cli /usr/local/bin

You also need awscli set up (pip install awscli if not), and the AWS_ACCOUNT_ID environment variable. If you don't know the value of this variable, contact Antonio Campello or Jeff Uren.

To create a deployment:

  1. Change the code
  2. Modify the Makefile parameters $(VERSION) and $(LATEST_MODEL_FILE)
  3. Run make run-debug.

Running make docker-push will push a new version. The files will be under s3://wt-model-hub/nutrition-labels/${VERSION}.
