
Implement multitask training #25


Merged (33 commits, Apr 13, 2020)

Commits
a3323e5
chg: Use lists for y case to allow multiple labels
ivyleavedtoadflax Mar 24, 2020
0de34d5
chg: Fix tests and evaluate method
ivyleavedtoadflax Mar 24, 2020
af73eb4
chg: revert OOV and NUM tokens
ivyleavedtoadflax Mar 25, 2020
a989b50
fix: output_layers count
ivyleavedtoadflax Mar 25, 2020
711354f
chg: Save predictions for multiple tasks
ivyleavedtoadflax Mar 25, 2020
57e7588
chg: Use the max_len sent at init
ivyleavedtoadflax Mar 25, 2020
322792e
chg: Combine artefacts into indices.pickle
ivyleavedtoadflax Mar 25, 2020
1f8da40
fixup indices
ivyleavedtoadflax Mar 25, 2020
a65d260
chg: Update CHANGELOG
ivyleavedtoadflax Mar 26, 2020
3d1a055
new: Bump version to 2020.3.3
ivyleavedtoadflax Mar 26, 2020
966cba8
chg: fix: missing logging statements
ivyleavedtoadflax Mar 26, 2020
c33dd75
chg: Solve issue with quotes in tsv files
ivyleavedtoadflax Mar 27, 2020
b61de98
chg: Fix logging messages in split and parse
ivyleavedtoadflax Mar 27, 2020
182a9cb
new: Add multitask config
ivyleavedtoadflax Mar 30, 2020
3e48684
new: Update parser and splitter model
ivyleavedtoadflax Mar 30, 2020
f392f9f
new: Add multitask split_parse command
ivyleavedtoadflax Mar 31, 2020
795679d
Add multitask 3.18 tsvs to datasets in Makefile
lizgzil Apr 2, 2020
3e1b20b
chg: Use lower level weight loading
ivyleavedtoadflax Apr 5, 2020
20afa75
chg: Update predict function for multitask scenario
ivyleavedtoadflax Apr 5, 2020
77971e5
chg: Update split_parse to deal with multiple predictions
ivyleavedtoadflax Apr 5, 2020
b33c2b2
new: Handle no config error
ivyleavedtoadflax Apr 12, 2020
5b17587
new: Add logic to handle single task case
ivyleavedtoadflax Apr 12, 2020
fdfe5d3
chg: Update to 2020.3.19 multitask model
ivyleavedtoadflax Apr 12, 2020
1be3864
chg: Update datasets recipe
ivyleavedtoadflax Apr 12, 2020
88f1a24
Merge pull request #30 from wellcometrust/add-multitask-makefile
ivyleavedtoadflax Apr 12, 2020
0edf65f
Merge branch 'feature/ivyleavedtoadflax/multitask_2' of github.com:we…
ivyleavedtoadflax Apr 12, 2020
15306b4
chg: Use output labels to detect output size
ivyleavedtoadflax Apr 12, 2020
8e3a155
fix: failing test
ivyleavedtoadflax Apr 12, 2020
ad6d6bb
new: Add tests for SplitParser
ivyleavedtoadflax Apr 12, 2020
fad8d58
chg: Update README.md
ivyleavedtoadflax Apr 12, 2020
6db7d8e
new: Update CHANGELOG
ivyleavedtoadflax Apr 12, 2020
0e6658c
new: Add split_parse model config to setup.py
ivyleavedtoadflax Apr 12, 2020
fceed1b
fix: typo
ivyleavedtoadflax Apr 13, 2020
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# Changelog

## 2020.3.3 - Pre-release

NOTE: This version includes changes to both the way that model artefacts are packaged and saved, and the way that data are loaded and parsed from tsv files. This results in a significantly faster training time (c. 14 hours -> c. 0.5 hours), but older models will no longer be compatible. For compatibility you must use multitask models >= 2020.3.19, splitting models >= 2020.3.6, and parsing models >= 2020.3.8. These models currently perform less well than previous versions, but performance is expected to improve with more data and experimentation, predominantly around sequence length.

* Adds support for multitask models, as in the original Rodrigues et al. paper
* Combines artefacts into a single `indices.pickle` rather than the several previous pickles. Now the model just requires the embedding, `indices.pickle`, and `weights.h5`.
* Updates load_tsv to better handle quoting.
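
A minimal sketch of loading the combined artefact, assuming `indices.pickle` bundles the dictionaries that previously shipped as separate pickles (`word2ind`, `ind2word`, `char2ind`, `label2ind`, `ind2label`, `maxes`); the keys shown are illustrative and the real layout may differ:

```python
import pickle

# Hypothetical sketch: load the single combined artefact. The key names
# below mirror the old per-file pickles and may not match the real layout.
with open("models/multitask/2020.3.19_multitask/indices.pickle", "rb") as f:
    indices = pickle.load(f)

word2ind = indices.get("word2ind")    # token -> integer index
ind2label = indices.get("ind2label")  # integer index -> output label
```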


## 2020.3.2 - Pre-release

* Adds parse command that can be called with `python -m deep_reference_parser parse`
5 changes: 4 additions & 1 deletion Makefile
@@ -83,7 +83,10 @@ datasets = data/splitting/2019.12.0_splitting_train.tsv \
data/splitting/2019.12.0_splitting_valid.tsv \
data/parsing/2020.3.2_parsing_train.tsv \
data/parsing/2020.3.2_parsing_test.tsv \
data/parsing/2020.3.2_parsing_valid.tsv
data/parsing/2020.3.2_parsing_valid.tsv \
data/multitask/2020.3.19_multitask_train.tsv \
data/multitask/2020.3.19_multitask_test.tsv \
data/multitask/2020.3.19_multitask_valid.tsv


rodrigues_datasets = data/rodrigues/clean_train.txt \
136 changes: 82 additions & 54 deletions README.md
@@ -2,63 +2,87 @@

# Deep Reference Parser

Deep Reference Parser is a Bi-direction Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) tool to replace a number of existing machine learning models which find references, and extract the constituent parts (e.g. author, year, publication, volume, etc).
Deep Reference Parser is a deep learning model for recognising references in free text. In this context we mean references to other works, for example an academic paper or a book. Given an arbitrary block of text (nominally a section containing references), the model will extract the limits of the individual references and identify key information such as authors, year of publication, and title.

The BiLSTM model is based on Rodrigues et al. (2018), and like this project, the intention is to implement a MultiTask model which will complete three tasks simultaneously: reference span detection (splitting), reference component detection (parsing), and reference type classification (classification) in a single neural network and stacked CRF.
The model itself is a Bi-directional Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF). It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) application to replace a number of existing machine learning models that find references and extract their constituent parts.

The BiLSTM model is based on [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing), who developed a model to find (split) references, parse them into constituent parts, and classify them according to the type of reference (e.g. primary reference, secondary reference, etc.). This implementation covers the first two tasks and is intended for use in the medical field. Three models are implemented here: individual splitting and parsing models, and a combined multitask model which both splits and parses. We have not yet attempted to include reference type classification, but this may be done in the future.

### Current status:

|Component|Individual|MultiTask|
|---|---|---|
|Spans (splitting)|✔️ Implemented|❌ Not Implemented|
|Components (parsing)|✔️ Implemented|❌ Not Implemented|
|Spans (splitting)|✔️ Implemented|✔️ Implemented|
|Components (parsing)|✔️ Implemented|✔️ Implemented|
|Type (classification)|❌ Not Implemented|❌ Not Implemented|

### The model

The model itself is based on the work of [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing), although the implemention here differs significantly. The main differences are:

* We use a combination of the training data used by Rodrigues, et al. (2018) in addition to data that we have labelled ourselves. No Rodrigues et al. data are included in the test and validation sets.
* We also use a new word embedding that has been trained on documents relevant to the medicine.
* We use a combination of the training data used by Rodrigues et al. (2018) and data that we have annotated ourselves. No Rodrigues et al. data are included in the test and validation sets.
* We also use a new word embedding that has been trained on documents relevant to the field of medicine.
* Whereas Rodrigues et al. split documents on lines and sent the lines to the model, we combine the lines of the document together and then send larger chunks to the model, giving it more context to work with when training and predicting.
* Whilst the model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)).
* Whilst the splitter model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)); see the sketch after this list.
* Hyperparameters are passed to the model in a config (.ini) file. This is partly to keep track of experiments, but also because it is difficult to save the model with the CRF architecture, so it is necessary to rebuild (not re-train!) the model object each time you want to use it. Storing the hyperparameters in a config file makes this easier.
* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2019.12.0.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2020.3.19_multitask.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (index dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
* The model includes a command line interface inspired by [SpaCy](https://github.com/explosion/spaCy); functions can be called from the command line with `python -m deep_reference_parser` ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/predict.py)).
* Python version updated to 3.7, along with dependencies (although more to do)
* Python version updated to 3.7, along with dependencies (although more to do).
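
A minimal sketch of the naive token-to-reference splitting mentioned above (a simplified illustration, not the exact `tokens_to_references` implementation):

```python
# Simplified sketch: turn token-level span predictions into reference
# strings by starting a new reference at each b-r (begin) token. The real
# tokens_to_references implementation may differ in detail.
def tokens_to_references(tokens, labels):
    references, current = [], []
    for token, label in zip(tokens, labels):
        if label == "b-r":               # a new reference starts here
            if current:
                references.append(" ".join(current))
            current = [token]
        elif label in ("i-r", "e-r") and current:
            current.append(token)        # continue the open reference
        # "o" tokens fall outside any reference and are skipped
    if current:
        references.append(" ".join(current))
    return references

tokens = ["1", "The", "potency", "of", "history"]
labels = ["o", "b-r", "i-r", "i-r", "e-r"]
print(tokens_to_references(tokens, labels))  # ['The potency of history']
```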

### Performance

On the validation set.

#### Span detection (splitting)
#### Finding reference spans (splitting)

|token|f1|support|
|---|---|---|
|b-r|0.9364|2472|
|e-r|0.9312|2424|
|i-r|0.9833|92398|
|o|0.9561|32666|
|weighted avg|0.9746|129959|
Current model version: *2020.3.6_splitting*

#### Components (parsing)
|token|f1|
|---|---|
|b-r|0.8146|
|e-r|0.7075|
|i-r|0.9623|
|o|0.8463|
|weighted avg|0.9326|

|token|f1|support|
|---|---|---|
|author|0.9467|2818|
|title|0.8994|4931|
|year|0.8774|418|
|o|0.9592|13685|
|weighted avg|0.9425|21852|
#### Identifying reference components (parsing)

Current model version: *2020.3.8_parsing*

|token|f1|
|---|---|
|author|0.9053|
|title|0.8607|
|year|0.8639|
|o|0.9340|
|weighted avg|0.9124|

#### Multitask model (splitting and parsing)

Current model version: *2020.3.19_multitask*

|token|f1|
|---|---|
|author|0.9102|
|title|0.8809|
|year|0.7469|
|o|0.8892|
|parsing weighted avg|0.8869|
|b-r|0.8254|
|e-r|0.7908|
|i-r|0.9563|
|o|0.7560|
|weighted avg|0.9240|

#### Computing requirements

Models are trained on AWS instances using CPU only.

|Model|Time Taken|Instance type|Instance cost (p/h)|Total cost|
|---|---|---|---|---|
|Span detection|16:02:00|m4.4xlarge|$0.88|$14.11|
|Components|11:02:59|m4.4xlarge|$0.88|$9.72|
|Span detection|00:26:41|m4.4xlarge|$0.88|$0.39|
|Components|00:17:22|m4.4xlarge|$0.88|$0.25|
|MultiTask|00:19:56|m4.4xlarge|$0.88|$0.29|

## tl;dr: Just get me to the references!

@@ -77,15 +101,20 @@ cat > references.txt <<EOF
EOF


# Run the splitter model. This will take a little time while the weights and
# Run the MultiTask model. This will take a little time while the weights and
# embeddings are downloaded. The weights are about 300MB, and the embeddings
# 950MB.

python -m deep_reference_parser split "$(cat references.txt)"
python -m deep_reference_parser split_parse -t "$(cat references.txt)"

# For parsing:

python -m deep_reference_parser parse "$(cat references.txt)"

# For splitting:

python -m deep_reference_parser split "$(cat references.txt)"

```

## The longer guide
Expand All @@ -106,22 +135,24 @@ A [config file](https://github.com/wellcometrust/deep_reference_parser/blob/mast

```
[DEFAULT]
version = 2019.12.0
version = 2020.3.19_multitask
description = Same as 2020.3.13 but with adam rather than rmsprop
deep_reference_parser_version = b61de984f95be36445287c40af4e65a403637692

[data]
test_proportion = 0.25
valid_proportion = 0.25
data_path = data/
respect_line_endings = 0
respect_doc_endings = 1
line_limit = 250
policy_train = data/2019.12.0_train.tsv
policy_test = data/2019.12.0_test.tsv
policy_valid = data/2019.12.0_valid.tsv
line_limit = 150
policy_train = data/multitask/2020.3.19_multitask_train.tsv
policy_test = data/multitask/2020.3.19_multitask_test.tsv
policy_valid = data/multitask/2020.3.19_multitask_valid.tsv
s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/

[build]
output_path = models/2020.2.0/
output_path = models/multitask/2020.3.19_multitask/
output = crf
word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
pretrained_embedding = 0
@@ -133,13 +164,10 @@ char_embedding_type = BILSTM
optimizer = rmsprop

[train]
epochs = 10
epochs = 60
batch_size = 100
early_stopping_patience = 5
metric = val_f1

[evaluate]
out_file = evaluation_data.tsv
```
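
Because the model object is rebuilt from these hyperparameters each time, reading them back is straightforward with the standard library; a minimal sketch, assuming the shipped multitask config path (the package's own config loader may differ):

```python
import configparser

# Minimal sketch of reading the hyperparameters back; the path assumes
# the multitask config that ships with the package.
config = configparser.ConfigParser()
config.read("deep_reference_parser/configs/2020.3.19_multitask.ini")

line_limit = config.getint("data", "line_limit")  # 150
optimizer = config.get("build", "optimizer")      # "rmsprop"
epochs = config.getint("train", "epochs")         # 60
```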

### Getting help
@@ -198,21 +226,21 @@ Data must be prepared in the following tab separated format (tsv). We use [prodi
You must provide the train/test/validation data splits in this format in pre-prepared files that are defined in the config file.

```
References o
1 o
The b-r
potency i-r
of i-r
history i-r
was i-r
on i-r
display i-r
at i-r
a i-r
workshop i-r
held i-r
in i-r
February i-r
References o o
1 o o
The b-r title
potency i-r title
of i-r title
history i-r title
was i-r title
on i-r title
display i-r title
at i-r title
a i-r title
workshop i-r title
held i-r title
in i-r title
February i-r title
```
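
A minimal sketch of reading this three-column format (a hypothetical helper, not the package's `load_tsv`); quoting is disabled to mirror the quote-handling fix in this PR:

```python
import csv

# Hypothetical sketch of reading the three-column multitask tsv; the
# package's load_tsv may differ. quoting=csv.QUOTE_NONE avoids csv
# misinterpreting stray quote characters inside token columns.
def read_multitask_tsv(path):
    tokens, span_labels, type_labels = [], [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 3:                # skip blank separator lines
                tokens.append(row[0])
                span_labels.append(row[1])   # e.g. b-r / i-r / e-r / o
                type_labels.append(row[2])   # e.g. title / author / o
    return tokens, span_labels, type_labels
```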

### Making predictions
2 changes: 2 additions & 0 deletions deep_reference_parser/__main__.py
@@ -12,11 +12,13 @@
from .train import train
from .split import split
from .parse import parse
from .split_parse import split_parse

commands = {
"split": split,
"parse": parse,
"train": train,
"split_parse": split_parse,
}

if len(sys.argv) == 1:
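
For orientation, a hypothetical sketch of how a dispatch table like `commands` is typically consumed; the real `__main__.py` wiring (truncated above) may differ, e.g. it may delegate argument parsing to a CLI library:

```python
import sys

# Hypothetical continuation of the truncated block above.
if len(sys.argv) == 1:
    print("Available commands:", ", ".join(sorted(commands)))
    sys.exit(1)

command = sys.argv.pop(1)             # e.g. "split_parse"
if command in commands:
    commands[command](*sys.argv[1:])  # forward remaining CLI arguments
else:
    print(f"Unknown command: {command}")
    sys.exit(1)
```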
7 changes: 4 additions & 3 deletions deep_reference_parser/__version__.py
@@ -1,9 +1,10 @@
__name__ = "deep_reference_parser"
__version__ = "2020.3.2"
__version__ = "2020.3.3"
__description__ = "Deep learning model for finding and parsing references"
__url__ = "https://github.com/wellcometrust/deep_reference_parser"
__author__ = "Wellcome Trust DataLabs Team"
__author_email__ = "[email protected]"
__license__ = "MIT"
__splitter_model_version__ = "2019.12.0_splitting"
__parser_model_version__ = "2020.3.2_parsing"
__splitter_model_version__ = "2020.3.6_splitting"
__parser_model_version__ = "2020.3.8_parsing"
__splitparser_model_version__ = "2020.3.19_multitask"
14 changes: 7 additions & 7 deletions deep_reference_parser/common.py
@@ -5,8 +5,12 @@
from logging import getLogger
from urllib import parse, request

from .__version__ import (
__parser_model_version__,
__splitparser_model_version__,
__splitter_model_version__,
)
from .logger import logger
from .__version__ import __splitter_model_version__, __parser_model_version__


def get_path(path):
@@ -15,6 +19,7 @@ def get_path(path):

SPLITTER_CFG = get_path(f"configs/{__splitter_model_version__}.ini")
PARSER_CFG = get_path(f"configs/{__parser_model_version__}.ini")
MULTITASK_CFG = get_path(f"configs/{__splitparser_model_version__}.ini")


def download_model_artefact(artefact, s3_slug):
@@ -47,13 +52,8 @@ def download_model_artefacts(model_dir, s3_slug, artefacts=None):
if not artefacts:

artefacts = [
"char2ind.pickle",
"ind2label.pickle",
"ind2word.pickle",
"label2ind.pickle",
"maxes.pickle",
"indices.pickle",
"weights.h5",
"word2ind.pickle",
]

for artefact in artefacts:
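
For reference, a hypothetical sketch of how a single artefact could be fetched with the `urllib` imports shown at the top of this file; the real `download_model_artefact` body is not shown in this diff and may differ:

```python
import os
from urllib import parse, request

# Hypothetical sketch only; not the actual implementation.
def download_model_artefact_sketch(artefact, s3_slug):
    if os.path.exists(artefact):
        return  # already cached locally
    url = parse.urljoin(s3_slug, artefact)
    os.makedirs(os.path.dirname(artefact) or ".", exist_ok=True)
    request.urlretrieve(url, artefact)  # fetch from the public S3 bucket
```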
35 changes: 0 additions & 35 deletions deep_reference_parser/configs/2019.12.0_splitting.ini

This file was deleted.

37 changes: 37 additions & 0 deletions deep_reference_parser/configs/2020.3.19_multitask.ini
@@ -0,0 +1,37 @@
[DEFAULT]
version = 2020.3.19_multitask
description = Same as 2020.3.13 but with adam rather than rmsprop
deep_reference_parser_version = b61de984f95be36445287c40af4e65a403637692

[data]
# Note that test and valid proportion are only used for data creation steps,
# not when running the train command.
test_proportion = 0.25
valid_proportion = 0.25
data_path = data/
respect_line_endings = 0
respect_doc_endings = 1
line_limit = 150
policy_train = data/multitask/2020.3.19_multitask_train.tsv
policy_test = data/multitask/2020.3.19_multitask_test.tsv
policy_valid = data/multitask/2020.3.19_multitask_valid.tsv
s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/

[build]
output_path = models/multitask/2020.3.19_multitask/
output = crf
word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
pretrained_embedding = 0
dropout = 0.5
lstm_hidden = 400
word_embedding_size = 300
char_embedding_size = 100
char_embedding_type = BILSTM
optimizer = rmsprop

[train]
epochs = 60
batch_size = 100
early_stopping_patience = 5
metric = val_f1
