Update BosphorusSign #49

Merged

cleong110
Contributor

Updating BosphorusSign details: fixing a broken link and adding features for Kinectv2.

TODO: a second pull request adding BosphorusSign22k, the updated release.

#48 details.

- "#items": 636,
- "#samples": "24,161 Samples",
+ "#items": 595,
+ "#samples": "22,670 Samples",
Contributor

BosphorusSign Turkish Sign Language corpus, which consists of 855 sign and p

https://aclanthology.org/L16-1220.pdf

Where is this info from? This is the number for BosphorusSign22k, no?

Contributor Author

@cleong110 cleong110 May 27, 2024

Table 3 of the 22k paper lists statistics for both datasets.
[Image: attached screenshot]

Edit: https://arxiv.org/pdf/2004.01283 is the link for 22k

Contributor Author

I'm frankly not sure where the original figures of 636 lexicon items and 24,161 clips come from, so I went with the info from the updated citation. Presumably if we went through the dataset access process now and specifically asked for BosphorusSign, not BosphorusSign22k, this is what we'd get?

Contributor Author

@cleong110 cleong110 May 27, 2024

In the original BosphorusSign citation the number given is 855, not 636 or 595. We have:

  • "The corpus contains 855 signs" in the conclusion section
  • Table 2 talks about modalities/features
  • Table 1 talks about other datasets
  • "We have collected 855 signs and phrase samples..." in the introduction section
  • "When completed, the corpus will have at least six repetitions of each sign performed by 10 signers, giving a wide variance to the data."

What I presume happened is that between the two papers they decided to trim down the "publicly available" data to 595 signs.

Edit: and of course 855 is listed in table 3 of the BosphorusSign22k paper as well, as the overall lexicon size rather than the publicly available subset.

Contributor Author

It also seems that for whatever reason the "when completed... 10 signers" did not happen, as the newer citation lists only 6, and has this to say:

Our dataset is based on the BosphorusSign (Camgoz et al., 2016c) corpus which was collected with the purpose of helping both linguistic and computer science communities. It contains isolated videos of Turkish Sign Language glosses from three different domains: Health, finance and commonly used everyday signs. Videos in this dataset were performed by six native signers, as shown in Figure 1, which makes this dataset valuable for user independent sign language studies.

I interpreted "this dataset" to mean BosphorusSign, meaning that both BosphorusSign and BosphorusSign22k have the same number of signers, namely 6.

Contributor Author

So the question here is whether to go with overall stats, or stats for the "publicly available" subset I suppose.

Contributor

i think overall stats are more "correct" to use. thanks for checking!

Contributor Author

Hmmm, then in that case I'm not sure what to put for "number of clips", because Table 3 only has "-" for that. Looking through both papers, here are the candidates:

  • 1257, the figure directly above in the table, from HospiSign. That seems unlikely. This dataset has way more signs, signers, etc.
  • 22670, the figure directly below. But that's the reduced publicly available set.
  • 855 signs × 6 signers/sign × 4 repetitions/signer = 20520?
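
The third candidate is plain arithmetic. A minimal sketch to check it, assuming 4 repetitions per signer as in the bullet above (neither paper is quoted in this thread as giving that figure for the full corpus, so the result is a rough estimate only). The function name is hypothetical:

```python
def estimate_clips(signs: int, signers: int, repetitions: int) -> int:
    """Rough estimate: every signer performs every sign `repetitions` times."""
    return signs * signers * repetitions


# Figures from this thread: 855 signs, 6 signers, an assumed 4 repetitions each.
estimate = estimate_clips(signs=855, signers=6, repetitions=4)
print(estimate)  # 20520
```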

I think I will just compromise and list it in the JSON with a little note?
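
One possible shape for such a compromise, as a sketch only: the `#items` and `#samples` keys are the ones shown in this PR's diff, while the `note` key and the helper function are hypothetical and not confirmed to match the repository's schema or what was actually merged.

```python
import json


def entry_with_note(items: int, samples: str, note: str) -> dict:
    # "#items" and "#samples" appear in this PR's diff;
    # "note" is a hypothetical key used only for illustration.
    return {"#items": items, "#samples": samples, "note": note}


entry = entry_with_note(
    items=855,  # overall lexicon size discussed in the thread
    samples="22,670 Samples",  # figure for the publicly available subset
    note="#items is the overall lexicon size; #samples covers only the "
         "publicly available subset.",
)
print(json.dumps(entry, indent=2))
```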

@AmitMY AmitMY merged commit 969a923 into sign-language-processing:master May 29, 2024
1 check failed
@cleong110 cleong110 deleted the dataset/BosphorusSign_update branch June 7, 2024 18:18