
Commit ba9ac92

fix preprocess/evaluate instructions

1 parent: c2e50cd

3 files changed: +9 -8 lines

README.md
Lines changed: 2 additions & 2 deletions

@@ -50,7 +50,7 @@ Set up environment and data for training and evaluation:
 * All data and config files are placed relative to the: `base_dir = /path/to/project` in [local.conf](local.conf) so change it to point to the root of this repo
 * All splits created using the `split_*` Python scripts will need to be processed using `preprocess.py` to be used as training input for the model, for example, to split the DROC dataset run:
 - `python split_droc.py --type-system-xml /path/to/DROC-Release/droc/src/main/resources/CorefTypeSystem.xml /path/to/DROC-Release/droc/DROC-xmi data/german.droc_gold_conll`
-- `python preprocess.py --input_dir data/droc_full --output_dir data/droc_full --seg_len 512 --language german --tokenizer_name german-nlp-group/electra-base-german-uncased --input_suffix droc_gold_conll --input_format conll-2012`
+- `python preprocess.py --input_dir data/droc_full --output_dir data/droc_full --seg_len 512 --language german --tokenizer_name german-nlp-group/electra-base-german-uncased --input_suffix droc_gold_conll --input_format conll-2012 --model_type electra`
 
 
 ## Evaluation
@@ -59,7 +59,7 @@ If you want to use the official evaluator, download and unzip [official conll 20
 Evaluate a model on the dev/test set:
 * Download the corresponding model file (`.mar`) and extract `model*.bin` from it and place it in `data_dir/<experiment_id>/`
 * `python evaluate.py [config] [model_id] [gpu_id] ([output_file])`
-* e.g. News, SemEval-2010, ELECTRA uncased (base): `python evaluate.py se10_electra_uncased Apr30_08-52-00_56879 0`
+* e.g. News, SemEval-2010, ELECTRA uncased (base): `python evaluate.py se10_electra_uncased tuba10_electra_uncased_Apr30_08-52-00_56879 0`
 
 ## Training
 
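A quick way to sanity-check the corrected `preprocess.py` step before training (a minimal sketch: the jsonlines layout and the output file name are assumptions based on common coreference pipelines, not something this commit confirms):

```python
import json

# Hypothetical output name; adjust to whatever preprocess.py actually
# emits for your --seg_len and --language settings.
path = 'data/droc_full/train.german.512.jsonlines'

# One JSON document per line, already segmented by the tokenizer.
with open(path) as f:
    docs = [json.loads(line) for line in f]

print(len(docs), 'documents')
print(sorted(docs[0].keys()))  # inspect the fields the Tensorizer will read
```
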
experiments.conf
Lines changed: 5 additions & 4 deletions

@@ -281,13 +281,14 @@ news = ${base}{
   model_type = electra
   incremental = false
   postprocess_merge_overlapping_spans = false
+  language = german
 }
 
 # SemEval 2010
 
 se10 = ${news}{
-  conll_eval_path = ${base.data_dir}/se10.dev.german.v4_gold_conll
-  conll_test_path = ${base.data_dir}/se10.test.german.v4_gold_conll
+  conll_eval_path = ${base.data_dir}/dev.german.v4_gold_conll
+  conll_test_path = ${base.data_dir}/test.german.v4_gold_conll
   num_epochs = 48
   long_doc_strategy = truncate
 }
@@ -308,8 +309,8 @@ se10_gelectra_large = ${se10}{
 # TuBa-D/Z 10.0
 
 tuba10 = ${news}{
-  conll_eval_path = ${base.data_dir}/tuba10.dev.german.tuebdz_gold_conll
-  conll_test_path = ${base.data_dir}/tuba10.test.german.tuebdz_gold_conll
+  conll_eval_path = ${base.data_dir}/dev.german.tuebdz_gold_conll
+  conll_test_path = ${base.data_dir}/test.german.tuebdz_gold_conll
   max_training_sentences = 3
   num_epochs = 24
 }
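These experiment blocks use HOCON-style inheritance (`se10 = ${news}{...}`), so a quick way to verify that the new `language = german` key and the corrected paths resolve as intended is to parse the file (a sketch, assuming the project loads its configs with pyhocon or a compatible HOCON parser):

```python
from pyhocon import ConfigFactory

config = ConfigFactory.parse_file('experiments.conf')
se10 = config['se10']

# Inherited from news via se10 = ${news}{...}
print(se10['language'])          # expected: german

# The corrected paths, with ${base.data_dir} substituted
print(se10['conll_eval_path'])   # .../dev.german.v4_gold_conll
print(se10['conll_test_path'])   # .../test.german.v4_gold_conll
```
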

tensorize.py
Lines changed: 2 additions & 2 deletions

@@ -1,7 +1,7 @@
 import util
 import numpy as np
 import random
-from transformers import BertTokenizer
+from transformers import AutoTokenizer
 import os
 from os.path import join
 import json
@@ -91,7 +91,7 @@ class Tensorizer:
     def __init__(self, config):
         self.config = config
         self.long_doc_strategy = config['long_doc_strategy']
-        self.tokenizer = BertTokenizer.from_pretrained(config['bert_tokenizer_name'])
+        self.tokenizer = AutoTokenizer.from_pretrained(config['bert_tokenizer_name'])
 
         # Will be used in evaluation
         self.stored_info = {}
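The `BertTokenizer` to `AutoTokenizer` swap matters because `AutoTokenizer` dispatches on the checkpoint's config, so an ELECTRA checkpoint like the one named in the README gets its own tokenizer class instead of being forced through BERT's. A quick way to see what the change loads (a sketch; the exact resolved class depends on the checkpoint and the installed `transformers` version):

```python
from transformers import AutoTokenizer

# AutoTokenizer reads the checkpoint's config and picks the right class.
tok = AutoTokenizer.from_pretrained('german-nlp-group/electra-base-german-uncased')
print(type(tok).__name__)  # e.g. ElectraTokenizerFast, not BertTokenizer
print(tok.tokenize('Die Koreferenz ist aufgelöst.'))
```
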
