two bug-fixes and new cli flags #3

Open: wants to merge 2 commits into base: master

MartinBorcin

I fixed a bug that created empty labels when a label span specified in the JSON also covered a space before or after the word. This happens a lot when people use labeling software and don't notice that they also selected a space before or after a word when tagging it. The fix was simply to add .strip() to label_text = text[beg_index:end_index] in convert.py.
For example, a JSON record like this:

{"data": "James is a cool name .", "label": [[0, 6, "PER"]]}

would have been parsed as this:

James B-PER
 I-PER
is O
a O
cool O
name O
. O

but is now parsed as:

James B-PER
is O
a O
cool O
name O
. O
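
For reviewers, here is a minimal sketch of why the trailing space produced an empty token. It is not the code in convert.py: the helper name tag_span and its strip parameter are made up for illustration, and it assumes the label text is split on single spaces. Only the expression label_text = text[beg_index:end_index] and the added .strip() come from the change itself.

```python
def tag_span(text, beg_index, end_index, tag, strip=True):
    """Return (token, tag) pairs for one labeled span.

    Illustrative helper only; the name and the `strip` switch are hypothetical.
    """
    label_text = text[beg_index:end_index]
    if strip:
        # The fix: drop whitespace that the annotator selected by accident.
        label_text = label_text.strip()
    # Assumption: the span is tokenized by splitting on single spaces.
    tokens = label_text.split(" ")
    return [(token, ("B-" if i == 0 else "I-") + tag)
            for i, token in enumerate(tokens)]


text = "James is a cool name ."
print(tag_span(text, 0, 6, "PER", strip=False))  # [('James', 'B-PER'), ('', 'I-PER')]
print(tag_span(text, 0, 6, "PER"))               # [('James', 'B-PER')]
```

Without the strip, "James " splits into "James" and an empty string, and the empty string becomes the stray I-PER row shown above.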

I also fixed a minor bug that caused the JSON decoder to fail on empty lines in a jsonl file. Regular jsonl files shouldn't contain empty lines, but some do. They are usually introduced either by editors that automatically add a newline at the end of the file when saving, or by concatenating several such files. I rephrased read_jsonl(filename) in cli.py and added a condition to handle them.
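
A rough sketch of the idea, assuming the condition simply skips lines that are empty or whitespace-only. The body below is not the code in cli.py; iter_jsonl_records is a hypothetical helper added so the logic can be tried without a file, and the UTF-8 encoding is also an assumption.

```python
import json


def iter_jsonl_records(lines):
    """Yield one parsed object per non-blank line (hypothetical helper)."""
    for line in lines:
        if not line.strip():
            # Blank lines would make json.loads raise JSONDecodeError.
            continue
        yield json.loads(line)


def read_jsonl(filename):
    """Read a jsonl file into a list, tolerating empty lines."""
    with open(filename, encoding="utf-8") as f:
        return list(iter_jsonl_records(f))


sample = ['{"id": 0}\n', '\n', '{"id": 1}\n', '   \n']
print(list(iter_jsonl_records(sample)))  # [{'id': 0}, {'id': 1}]
```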

I added CLI flags to specify the JSON field names for text and labels, since different labeling tools export jsonl files with different field names. Lastly, I added a flag to choose the separator between words and tags in the CoNLL file, since some parsers expect different separator characters.

Oh, and I also reflected the changes in the READMEs.

Now, you can process an input.jsonl file like this:

{"id": 0, "data": "James is a cool name .", "label": [[0, 6, "PER"]]}
{"id": 1, "data": "Facebook is a social network .", "label": [[0, 9, "ORG"]]}

(note the empty line at the end, the tagged substrings that include trailing spaces, and the different name for the text field)
by running

 jsonl-to-conll input.jsonl output.conll -s $'\t' --text_field 'data'

to generate a tab-separated output.conll file:

James	B-PER
is	O
a	O
cool	O
name	O
.	O

Facebook	B-ORG
is	O
a	O
social	O
network	O
.	O
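
To make the role of the field names and the separator concrete, here is a small self-contained sketch that reproduces the output above for the two example records. It is not the implementation in this PR: record_to_conll is a hypothetical function, its default values are assumptions, and it uses a different technique (regex tokenization on whitespace with character offsets) rather than the slicing done in convert.py.

```python
import re


def record_to_conll(record, text_field="text", label_field="label", separator=" "):
    """Render one JSON record as CoNLL lines (hypothetical helper and defaults)."""
    text = record[text_field]
    spans = record.get(label_field, [])
    lines = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for beg, end, name in spans:
            if beg <= match.start() < end:
                # The token is the first in its span if only whitespace
                # lies between the span start and the token start.
                first = not text[beg:match.start()].strip()
                tag = ("B-" if first else "I-") + name
                break
        lines.append(f"{match.group()}{separator}{tag}")
    return lines


record = {"id": 0, "data": "James is a cool name .", "label": [[0, 6, "PER"]]}
print("\n".join(record_to_conll(record, text_field="data", separator="\t")))
```

Because tokens are matched on non-whitespace runs, a span that includes a leading or trailing space never yields an empty token in this sketch.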

Martin Borcin added 2 commits September 8, 2021 17:48
…y lines in jsonl file and added options to specify json field names