Update BosphorusSign #49
src/datasets/BosphorusSign.json
Outdated
-  "#items": 636,
-  "#samples": "24,161 Samples",
+  "#items": 595,
+  "#samples": "22,670 Samples",
BosphorusSign Turkish Sign Language corpus, which consists of 855 sign and p
https://aclanthology.org/L16-1220.pdf
Where is this info from? This is the number for BosphorusSign22k, no?
Table 3 of the 22k paper lists statistics for both datasets.
Edit: https://arxiv.org/pdf/2004.01283 is the link for 22k
I'm frankly not sure where the original figures of 636 lexicon items and 24,161 clips come from, so I went with the info from the updated citation. Presumably if we went through the dataset access process now and specifically asked for BosphorusSign, not BosphorusSign22k, this is what we'd get?
In the original BosphorusSign citation the number given is 855, not 636 or 595. We have:
- "The corpus contains 855 signs" in the conclusion section
- Table 2 talks about modalities/features
- Table 1 talks about other datasets
- "We have collected 855 signs and phrase samples..." in the introduction section
- "When completed, the corpus will have at least six repetitions of each sign performed by 10 signers, giving a wide variance to the data."
What I presume happened is that between the two papers they decided to trim down the "publicly available" data to 595 signs.
Edit: and of course 855 is listed in Table 3 of the BosphorusSign22k paper as well, as the overall lexicon size rather than the publicly available subset.
It also seems that for whatever reason the "when completed... 10 signers" did not happen, as the newer citation lists only 6, and has this to say:
"Our dataset is based on the BosphorusSign (Camgoz et al., 2016c) corpus which was collected with the purpose of helping both linguistic and computer science communities. It contains isolated videos of Turkish Sign Language glosses from three different domains: Health, finance and commonly used everyday signs. Videos in this dataset were performed by six native signers, as shown in Figure 1, which makes this dataset valuable for user independent sign language studies."
I interpreted "this dataset" to mean BosphorusSign, which would mean that both BosphorusSign and BosphorusSign22k have the same number of signers, namely 6.
So the question here is whether to go with overall stats, or stats for the "publicly available" subset I suppose.
I think overall stats are more "correct" to use. Thanks for checking!
Hmmm, in that case I'm not sure what to put for "number of clips", because Table 3 only has "-" for that. Looking through both papers, here are the candidates:
- 1257, the figure directly above in the table, from HospiSign. That seems unlikely. This dataset has way more signs, signers, etc.
- 22670, the figure directly below. But that's the reduced publicly available set.
- 855 signs * 6 signers/sign * 4 repetitions/signer = 20520?
I think I will just compromise and list it in the JSON with a little note?
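For reference, the third candidate and the "little note" idea could be sketched roughly like this. This is only an illustration: the `"#items"` and `"#samples"` keys come from the diff in this PR, but the `"note"` key, its wording, and the helper names are hypothetical, and the 4 repetitions per signer is the guess from the estimate above, not a figure reported in Table 3.

```python
import json

# Figures quoted in this thread.
LEXICON_SIZE = 855           # overall lexicon size from the papers
SIGNERS = 6                  # signers listed in the BosphorusSign22k paper
REPETITIONS_PER_SIGNER = 4   # assumption taken from the estimate in this thread


def estimate_clips(signs: int, signers: int, repetitions: int) -> int:
    """Rough upper bound: every signer performs every sign `repetitions` times."""
    return signs * signers * repetitions


def build_entry() -> dict:
    """Build a dataset entry that flags the clip count as an estimate."""
    estimate = estimate_clips(LEXICON_SIZE, SIGNERS, REPETITIONS_PER_SIGNER)
    return {
        "#items": LEXICON_SIZE,
        "#samples": f"~{estimate:,} Samples (estimated)",
        # Hypothetical key: the real JSON schema may not have a "note" field.
        "note": "Clip count is not reported for the full corpus; "
                "estimated as signs * signers * repetitions.",
    }


if __name__ == "__main__":
    print(json.dumps(build_entry(), indent=2))
```

Keeping the estimate as a formatted string with a "~" prefix matches the existing string-valued `"#samples"` field while making it obvious the number is not taken from either paper.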
Updating BosphorusSign details including broken link, adding features for Kinectv2.
TODO: a second pull request adding BosphorusSign22k, the updated release.
#48 details.