From 756f6f6d7ff9bf92e0b531bea80b2ff41a0be84c Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Fri, 7 Jun 2024 14:30:29 -0400
Subject: [PATCH 1/7] CDL: adding YouTube-ASL.json

---
 src/datasets/YouTube-ASL.json | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
 create mode 100644 src/datasets/YouTube-ASL.json

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
new file mode 100644
index 0000000..930f9d8
--- /dev/null
+++ b/src/datasets/YouTube-ASL.json
@@ -0,0 +1,16 @@
+{
+  "pub": {
+    "name": "YouTube-ASL",
+    "year": 2023,
+    "publication": "dataset:uthus2023YoutubeASL",
+    "url": "https://github.com/google-research/google-research/tree/master/youtube_asl"
+  },
+  "#loader": null,
+  "#items": "~60K",
+  "#samples": "984 hours",
+  "#signers": ">2519",
+  "features": ["video", "text:English"],
+  "language": "American",
+  "license": null,
+  "licenseUrl": null
+}
\ No newline at end of file

From 7b11437716d4e2aa16a08cb47ff3c0caf5bbf3a8 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Fri, 7 Jun 2024 15:24:05 -0400
Subject: [PATCH 2/7] CDL: adding a basic explanation of YouTube-ASL

---
 src/index.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/index.md b/src/index.md
index 839d83d..18e474e 100644
--- a/src/index.md
+++ b/src/index.md
@@ -768,9 +768,11 @@ In training, the VQ-Sign "character-level" module is trained with a context pred
 The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:huang2018video] datasets without relying on gloss annotations.
-
+
+
-@rust2024PrivacyAwareSign introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT).
+@rust2024PrivacyAwareSign introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT).
 The first stage involves self-supervised pretraining of a Hiera vision transformer [@ryali2023HieraVisionTransformer] on large unannotated video datasets [@dataset:duarte2020how2sign, @dataset:uthus2023YoutubeASL].
 In the second stage, the vision model's outputs are fed into a multilingual language model [@raffel2020T5Transformer] for finetuning on the How2Sign dataset [@dataset:duarte2020how2sign].
 To mitigate privacy risks, the framework employs facial blurring during pretraining.
@@ -1016,6 +1018,16 @@ Anvil installation is [available](http://www.anvil-software.de/download/index.ht
 Research papers which do not necessarily contribute new theory or architectures are actually important and useful enablers of other research.
 Furthermore, the advancement of the dataset creation process itself is important, and the pipeline of creation and curation is a potential target for improvements and advancements.
 
+@dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
+They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
+From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
+They train a baseline translation model leveraging this data, pretraining on Youtube data.
+Predicting poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
+Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
+They conclude that that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition/translation.
+
+
 @dataset:joshi-etal-2023-isltranslate introduce ISLTranslate, a large translation dataset for Indian Sign Language based on publicly available educational videos intended for hard-of-hearing children, which happen to contain both Indian Sign Language and English audio voiceover conveying the same content. They use a speech-to-text model to transcribe the audio content, which they later manually corrected with the help of accompanying books also containing the same content. They also use MediaPipe to extract pose features, and have a certified ISL signer validate a small portion of the sign-text pairs. They provide a baseline based on the architecture proposed in @camgoz2020sign, and provide code.
 
 ###### Bilingual dictionaries {-}

From 1705ff60c28b977384d6c33c364c9ab2f16449b2 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Tue, 11 Jun 2024 16:08:01 -0400
Subject: [PATCH 3/7] CDL minor rewrites, as suggested

---
 src/index.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/index.md b/src/index.md
index c393174..44a44a7 100644
--- a/src/index.md
+++ b/src/index.md
@@ -1003,10 +1003,9 @@ Research papers which do not necessarily contribute new theory or architectures
 @dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
 They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
 From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
-They train a baseline translation model leveraging this data, pretraining on Youtube data.
-Predicting poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
+They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
 Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
-They conclude that that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition/translation.
+They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation.
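
The baseline described in patches 2-3 (estimate signer poses with MediaPipe Holistic, then project the keypoints into a multilingual language model) can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' implementation: the mT5 checkpoint, the body-plus-hands landmark subset, the zero-filling of missed detections, and the single linear projection are all assumptions made for the example.

```python
# Hedged sketch of the pose-to-text baseline described above. All sizes,
# checkpoints, and the landmark subset are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
import mediapipe as mp
from transformers import AutoTokenizer, MT5ForConditionalGeneration

POSE_DIM = (33 + 21 + 21) * 3  # body + both hands, (x, y, z) each: an assumed subset

def video_to_pose_sequence(frames):
    """Run MediaPipe Holistic over RGB frames; return a (T, POSE_DIM) float32 array."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    rows = []
    for frame in frames:  # each frame: np.uint8 array of shape (H, W, 3), RGB
        res = holistic.process(frame)
        feats = []
        for landmarks, n in [(res.pose_landmarks, 33),
                             (res.left_hand_landmarks, 21),
                             (res.right_hand_landmarks, 21)]:
            if landmarks is None:  # part not detected: zero-fill (an assumption)
                feats.append(np.zeros(n * 3, dtype=np.float32))
            else:
                feats.append(np.array(
                    [[lm.x, lm.y, lm.z] for lm in landmarks.landmark],
                    dtype=np.float32).reshape(-1))
        rows.append(np.concatenate(feats))
    holistic.close()
    return np.stack(rows)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
project = nn.Linear(POSE_DIM, model.config.d_model)  # pose frame -> token embedding

def translation_loss(pose_seq, caption):
    """Cross-entropy loss for one (pose sequence, English caption) pair."""
    inputs_embeds = project(torch.from_numpy(pose_seq)).unsqueeze(0)  # (1, T, d)
    labels = tokenizer(caption, return_tensors="pt").input_ids
    return model(inputs_embeds=inputs_embeds, labels=labels).loss
```

Training would iterate `translation_loss` over caption-aligned clips; the point of the projection is that a pose stream enters the pretrained text model as if it were a sequence of token embeddings.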
From 417b63eb0ba5225044faa0aef68338fc1edafe3b Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Tue, 11 Jun 2024 16:08:28 -0400
Subject: [PATCH 4/7] CDL: adding license for YouTube-ASL

---
 src/datasets/YouTube-ASL.json | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
index 930f9d8..05979b1 100644
--- a/src/datasets/YouTube-ASL.json
+++ b/src/datasets/YouTube-ASL.json
@@ -11,6 +11,6 @@
   "#signers": ">2519",
   "features": ["video", "text:English"],
   "language": "American",
-  "license": null,
-  "licenseUrl": null
+  "license": "CC-By-SA (video IDs)",
+  "licenseUrl": "https://creativecommons.org/licenses/by/4.0/"
 }
\ No newline at end of file

From 1191ae637bca125a6345b1bd2d2830632ffea542 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Wed, 12 Jun 2024 18:15:54 -0400
Subject: [PATCH 5/7] CDL: some requested changes

---
 src/datasets/YouTube-ASL.json | 2 +-
 src/index.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
index 05979b1..e3b6a84 100644
--- a/src/datasets/YouTube-ASL.json
+++ b/src/datasets/YouTube-ASL.json
@@ -11,6 +11,6 @@
   "#signers": ">2519",
   "features": ["video", "text:English"],
   "language": "American",
-  "license": "CC-By-SA (video IDs)",
+  "license": "CC By-SA (video IDs)",
   "licenseUrl": "https://creativecommons.org/licenses/by/4.0/"
 }
\ No newline at end of file
diff --git a/src/index.md b/src/index.md
index 44a44a7..11a7eb5 100644
--- a/src/index.md
+++ b/src/index.md
@@ -306,7 +306,7 @@ leveraging a spatio-temporal graph convolutional network (ST-GCN; @Yu2017SpatioT
 
 @segmentation:bull2021aligning presented a Transformer-based approach to segment sign language videos and align them with subtitles simultaneously, encoding subtitles by BERT [@devlin-etal-2019-bert] and videos by CNN video representations.
 
-@segmentation:moryossef-etal-2023-linguistically presented a method motivated by linguistic cues observed in sign language corpora, such as prosody (pauses, pace, etc) and handshape changes. They also find that using BIO, an annotation scheme that notes the beginning, inside and outside, makes a significant difference over previous ones that only note IO (inside or outside). They find that including optical flow and 3D hand normalization helps with out-of-domain generalization and other signed languages as well.
+@segmentation:moryossef-etal-2023-linguistically presented a method motivated by linguistic cues observed in sign language corpora, such as prosody (pauses, pace, etc) and handshape changes. They also find that using BIO, an annotation scheme that notes the beginning, inside and outside, makes a difference over previous ones that only note IO (inside or outside). They find that including optical flow and 3D hand normalization helps with out-of-domain generalization and other signed languages as well.
@@ -1004,7 +1004,7 @@ Research papers which do not necessarily contribute new theory or architectures
 @dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
 They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
 From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
 They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
-Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
+Results on the How2Sign dataset [@dataset:duarte2020how2sign] show improvements over previous methods.
 They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation.

From 459a6197a9b85d39c69883bc56eb82f1e428a311 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Wed, 12 Jun 2024 18:15:54 -0400
Subject: [PATCH 6/7] CDL: some requested changes

---
 src/datasets/YouTube-ASL.json | 2 +-
 src/index.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
index 05979b1..e3b6a84 100644
--- a/src/datasets/YouTube-ASL.json
+++ b/src/datasets/YouTube-ASL.json
@@ -11,6 +11,6 @@
   "#signers": ">2519",
   "features": ["video", "text:English"],
   "language": "American",
-  "license": "CC-By-SA (video IDs)",
+  "license": "CC By-SA (video IDs)",
   "licenseUrl": "https://creativecommons.org/licenses/by/4.0/"
 }
\ No newline at end of file
diff --git a/src/index.md b/src/index.md
index 44a44a7..088ed37 100644
--- a/src/index.md
+++ b/src/index.md
@@ -1004,7 +1004,7 @@ Research papers which do not necessarily contribute new theory or architectures
 From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
 They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
-Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
+Results on the How2Sign dataset [@dataset:duarte2020how2sign] show improvements over previous methods.
 They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation.

From 2274652dba0b032a916d8937a42511de8dac6336 Mon Sep 17 00:00:00 2001
From: Colin Leong <--unset>
Date: Thu, 20 Jun 2024 12:48:22 -0400
Subject: [PATCH 7/7] CDL: requested changes and a comment about YouTube-SL-25

---
 src/index.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/index.md b/src/index.md
index 11a7eb5..28939bb 100644
--- a/src/index.md
+++ b/src/index.md
@@ -1002,11 +1002,13 @@ Research papers which do not necessarily contribute new theory or architectures
 
 @dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
 They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
-From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
+From over 88,000 candidate videos they filter down to about 11,000 with well-aligned English translations and comprehensible ASL content.
They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer]. Results on the How2Sign dataset [@dataset:duarte2020how2sign] show improvements over previous methods. They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation. + + @dataset:joshi-etal-2023-isltranslate introduce ISLTranslate, a large translation dataset for Indian Sign Language based on publicly available educational videos intended for hard-of-hearing children, which happen to contain both Indian Sign Language and English audio voiceover conveying the same content. They use a speech-to-text model to transcribe the audio content, which they later manually corrected with the help of accompanying books also containing the same content. They also use MediaPipe to extract pose features, and have a certified ISL signer validate a small portion of the sign-text pairs. They provide a baseline based on the architecture proposed in @camgoz2020sign, and provide code.
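
The ISLTranslate construction above follows a pattern that is easy to sketch: transcribe the English voiceover with a speech-to-text model, keep the timestamps, and treat each timed caption as a candidate sign-text pair pending manual correction. In the sketch below, Whisper is purely an illustrative stand-in (the description above says only "a speech-to-text model"), and the input file name is hypothetical.

```python
# Hedged sketch of the transcribe-then-pair step described for ISLTranslate.
# Whisper stands in for the unspecified speech-to-text model; file names and
# the output layout are assumptions for illustration.
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")
result = asr.transcribe("isl_lesson_001.mp4")  # hypothetical educational video

candidate_pairs = []
for seg in result["segments"]:  # each segment carries start/end times in seconds
    candidate_pairs.append({
        "video": "isl_lesson_001.mp4",
        "start": seg["start"],        # clip span containing the signed rendition
        "end": seg["end"],
        "text": seg["text"].strip(),  # to be manually corrected against the books
    })

print(f"collected {len(candidate_pairs)} candidate sign-text pairs")
```

Pose features would then be extracted over each [start, end] span with MediaPipe, mirroring the YouTube-ASL sketch earlier in this series.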