two bug-fixes and new cli flags #3
Open
I fixed a bug that created empty labels when the span of a label specified in the JSON also covered a space next to a word. This happens a lot when people use labeling software and don't notice they also selected a space after or before a word when tagging it. The fix was just adding a `.strip()` to the `label_text = text[beg_index:end_index]` line in `convert.py`.
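To illustrate the effect of the `.strip()` call, here is a minimal sketch. The helper name `extract_label_text` and the sample sentence are made up for illustration; only the slicing expression and `.strip()` come from the description above.

```python
def extract_label_text(text, beg_index, end_index):
    """Return the labeled substring without surrounding whitespace.

    Hypothetical helper: it wraps the slice from the description above
    so that the effect of .strip() is easy to see in isolation.
    """
    return text[beg_index:end_index].strip()


text = "Alice lives in Paris"

# The span [0, 6) accidentally includes the space after "Alice".
raw = text[0:6]
fixed = extract_label_text(text, 0, 6)
print(repr(raw))    # the raw slice still carries the trailing space
print(repr(fixed))  # the stripped slice is just the word

# A span that covers only a space becomes an empty string after stripping.
print(repr(extract_label_text(text, 5, 6)))
```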
For example, a JSON record like this:
would have been parsed as this:
but now is parsed as:
I also fixed a minor bug that caused the JSON decoder to fail on empty lines in a jsonl file. Even though regular jsonl files shouldn't contain empty lines, some do. These are usually introduced either by editors that automatically add a newline at the end of a file when saving, or by concatenating multiple such files. I just rephrased `read_jsonl(filename)` in `cli.py` and added a condition.

I added CLI flags to specify the JSON field names for text and labels, as different labeling tools export jsonl files with different field names. Lastly, I also added a flag to choose the separator between words and tags in the conll file, as some parsers prefer different characters as separators.
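The empty-line handling in `read_jsonl(filename)` can be sketched roughly as follows. This is not the exact code from `cli.py`, just one plausible way to write the condition; the return type (a list of parsed records) is an assumption.

```python
import json
import os
import tempfile


def read_jsonl(filename):
    """Parse a JSON Lines file, skipping empty or whitespace-only lines.

    Sketch only: the real function in cli.py may differ in details.
    """
    records = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                # Blank line, e.g. from an editor or from concatenated files.
                continue
            records.append(json.loads(line))
    return records


# Small demonstration with a temporary file that has blank lines
# in the middle and at the end.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"id": 1}\n\n{"id": 2}\n\n\n')
    demo_path = tmp.name

demo_records = read_jsonl(demo_path)
os.remove(demo_path)
print(demo_records)
```

Without the `if not line.strip()` check, `json.loads` would raise a `JSONDecodeError` on the first blank line.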
Oh, and I also reflected the changes in the READMEs.
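To show how configurable field names and a configurable separator fit together, here is a hypothetical sketch. The function `to_conll_lines`, its parameters `text_key`, `label_key` and `sep`, the `[start, end, tag]` label layout, the whitespace tokenization and the plain (non-BIO) tags are all assumptions for illustration; they are not the tool's actual flags or implementation.

```python
import re


def to_conll_lines(record, text_key="text", label_key="labels", sep="\t"):
    """Turn one parsed jsonl record into 'token<sep>tag' lines.

    Assumes labels are [start, end, tag] triples and tokens are
    whitespace-delimited. A token gets a span's tag when the token
    starts inside that span; otherwise it is tagged "O".
    """
    text = record[text_key]
    spans = record.get(label_key, [])
    lines = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for beg, end, label in spans:
            if beg <= match.start() < end:
                tag = label
                break
        lines.append(f"{match.group()}{sep}{tag}")
    return lines


# The text lives under "data" instead of "text", and both spans
# include a trailing space.
record = {
    "data": "Alice lives in Paris ",
    "labels": [[0, 6, "PER"], [15, 21, "LOC"]],
}

for line in to_conll_lines(record, text_key="data", sep=" "):
    print(line)
```

Passing the field names and separator as parameters keeps the conversion independent of any particular labeling tool's export format.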
Now you can process an `input.jsonl` file like this (note the empty lines at the end, the tagged substrings that include trailing spaces, and the different name for the text field):
by running
you generate a tab-separated `output.conll` file: