From 756f6f6d7ff9bf92e0b531bea80b2ff41a0be84c Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Fri, 7 Jun 2024 14:30:29 -0400
Subject: [PATCH 1/7] CDL: adding YouTube-ASL.json

---
 src/datasets/YouTube-ASL.json | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
 create mode 100644 src/datasets/YouTube-ASL.json

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
new file mode 100644
index 0000000..930f9d8
--- /dev/null
+++ b/src/datasets/YouTube-ASL.json
@@ -0,0 +1,16 @@
+{
+  "pub": {
+    "name": "YouTube-ASL",
+    "year": 2023,
+    "publication": "dataset:uthus2023YoutubeASL",
+    "url": "https://github.com/google-research/google-research/tree/master/youtube_asl"
+  },
+  "#loader": null,
+  "#items": "~60K",
+  "#samples": "984 hours",
+  "#signers": ">2519",
+  "features": ["video", "text:English"],
+  "language": "American",
+  "license": null,
+  "licenseUrl": null
+}
\ No newline at end of file

From 7b11437716d4e2aa16a08cb47ff3c0caf5bbf3a8 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Fri, 7 Jun 2024 15:24:05 -0400
Subject: [PATCH 2/7] CDL: adding a basic explanation of YouTube-ASL

---
 src/index.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/index.md b/src/index.md
index 839d83d..18e474e 100644
--- a/src/index.md
+++ b/src/index.md
@@ -768,9 +768,11 @@ In training, the VQ-Sign "character-level" module is trained with a context pred
 The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:huang2018video] datasets without relying on gloss annotations.
-
+
+
-@rust2024PrivacyAwareSign introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT).
+@rust2024PrivacyAwareSign introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT).
 The first stage involves self-supervised pretraining of a Hiera vision transformer [@ryali2023HieraVisionTransformer] on large unannotated video datasets [@dataset:duarte2020how2sign, @dataset:uthus2023YoutubeASL].
 In the second stage, the vision model's outputs are fed into a multilingual language model [@raffel2020T5Transformer] for finetuning on the How2Sign dataset [@dataset:duarte2020how2sign].
 To mitigate privacy risks, the framework employs facial blurring during pretraining.
@@ -1016,6 +1018,16 @@ Anvil installation is [available](http://www.anvil-software.de/download/index.ht
 Research papers which do not necessarily contribute new theory or architectures are actually important and useful enablers of other research.
 Furthermore, the advancement of the dataset creation process itself is important, and the pipeline of creation and curation is a potential target for improvements and advancements.
 
+@dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
+They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
+From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
+They train a baseline translation model leveraging this data, pretraining on Youtube data.
+Predicting poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
+Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
+They conclude that that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition/translation.
+
+
 @dataset:joshi-etal-2023-isltranslate introduce ISLTranslate, a large translation dataset for Indian Sign Language based on publicly available educational videos intended for hard-of-hearing children, which happen to contain both Indian Sign Language and English audio voiceover conveying the same content. They use a speech-to-text model to transcribe the audio content, which they later manually corrected with the help of accompanying books also containing the same content. They also use MediaPipe to extract pose features, and have a certified ISL signer validate a small portion of the sign-text pairs. They provide a baseline based on the architecture proposed in @camgoz2020sign, and provide code.
 
 ###### Bilingual dictionaries {-}

From 1705ff60c28b977384d6c33c364c9ab2f16449b2 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Tue, 11 Jun 2024 16:08:01 -0400
Subject: [PATCH 3/7] CDL minor rewrites, as suggested

---
 src/index.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/index.md b/src/index.md
index c393174..44a44a7 100644
--- a/src/index.md
+++ b/src/index.md
@@ -1003,10 +1003,9 @@ Research papers which do not necessarily contribute new theory or architectures
 @dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
 They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
 From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
-They train a baseline translation model leveraging this data, pretraining on Youtube data.
-Predicting poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
+They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
 Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
-They conclude that that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition/translation.
+They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation.
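
The baseline described in patches 2-3 (estimate signer poses with MediaPipe Holistic, then project the keypoints into a multilingual language model) can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' implementation: the mT5 checkpoint, the body-plus-hands landmark subset, the zero-filling of missed detections, and the single linear projection are all assumptions made for the example.

```python
# Hedged sketch of the pose-to-text baseline described above. All sizes,
# checkpoints, and the landmark subset are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
import mediapipe as mp
from transformers import AutoTokenizer, MT5ForConditionalGeneration

POSE_DIM = (33 + 21 + 21) * 3  # body + both hands, (x, y, z) each: an assumed subset

def video_to_pose_sequence(frames):
    """Run MediaPipe Holistic over RGB frames; return a (T, POSE_DIM) float32 array."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    rows = []
    for frame in frames:  # each frame: np.uint8 array of shape (H, W, 3), RGB
        res = holistic.process(frame)
        feats = []
        for landmarks, n in [(res.pose_landmarks, 33),
                             (res.left_hand_landmarks, 21),
                             (res.right_hand_landmarks, 21)]:
            if landmarks is None:  # part not detected: zero-fill (an assumption)
                feats.append(np.zeros(n * 3, dtype=np.float32))
            else:
                feats.append(np.array(
                    [[lm.x, lm.y, lm.z] for lm in landmarks.landmark],
                    dtype=np.float32).reshape(-1))
        rows.append(np.concatenate(feats))
    holistic.close()
    return np.stack(rows)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
project = nn.Linear(POSE_DIM, model.config.d_model)  # pose frame -> token embedding

def translation_loss(pose_seq, caption):
    """Cross-entropy loss for one (pose sequence, English caption) pair."""
    inputs_embeds = project(torch.from_numpy(pose_seq)).unsqueeze(0)  # (1, T, d)
    labels = tokenizer(caption, return_tensors="pt").input_ids
    return model(inputs_embeds=inputs_embeds, labels=labels).loss
```

Training would iterate `translation_loss` over caption-aligned clips; the point of the projection is that a pose stream enters the pretrained text model as if it were a sequence of token embeddings.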
From 417b63eb0ba5225044faa0aef68338fc1edafe3b Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Tue, 11 Jun 2024 16:08:28 -0400
Subject: [PATCH 4/7] CDL: adding license for YouTube-ASL

---
 src/datasets/YouTube-ASL.json | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
index 930f9d8..05979b1 100644
--- a/src/datasets/YouTube-ASL.json
+++ b/src/datasets/YouTube-ASL.json
@@ -11,6 +11,6 @@
   "#signers": ">2519",
   "features": ["video", "text:English"],
   "language": "American",
-  "license": null,
-  "licenseUrl": null
+  "license": "CC-By-SA (video IDs)",
+  "licenseUrl": "https://creativecommons.org/licenses/by/4.0/"
 }
\ No newline at end of file

From 1191ae637bca125a6345b1bd2d2830632ffea542 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Wed, 12 Jun 2024 18:15:54 -0400
Subject: [PATCH 5/7] CDL: some requested changes

---
 src/datasets/YouTube-ASL.json | 2 +-
 src/index.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
index 05979b1..e3b6a84 100644
--- a/src/datasets/YouTube-ASL.json
+++ b/src/datasets/YouTube-ASL.json
@@ -11,6 +11,6 @@
   "#signers": ">2519",
   "features": ["video", "text:English"],
   "language": "American",
-  "license": "CC-By-SA (video IDs)",
+  "license": "CC By-SA (video IDs)",
   "licenseUrl": "https://creativecommons.org/licenses/by/4.0/"
 }
\ No newline at end of file
diff --git a/src/index.md b/src/index.md
index 44a44a7..11a7eb5 100644
--- a/src/index.md
+++ b/src/index.md
@@ -306,7 +306,7 @@ leveraging a spatio-temporal graph convolutional network (ST-GCN; @Yu2017SpatioT
 
 @segmentation:bull2021aligning presented a Transformer-based approach to segment sign language videos and align them with subtitles simultaneously, encoding subtitles by BERT [@devlin-etal-2019-bert] and videos by CNN video representations.
 
-@segmentation:moryossef-etal-2023-linguistically presented a method motivated by linguistic cues observed in sign language corpora, such as prosody (pauses, pace, etc) and handshape changes. They also find that using BIO, an annotation scheme that notes the beginning, inside and outside, makes a significant difference over previous ones that only note IO (inside or outside). They find that including optical flow and 3D hand normalization helps with out-of-domain generalization and other signed languages as well.
+@segmentation:moryossef-etal-2023-linguistically presented a method motivated by linguistic cues observed in sign language corpora, such as prosody (pauses, pace, etc) and handshape changes. They also find that using BIO, an annotation scheme that notes the beginning, inside and outside, makes a difference over previous ones that only note IO (inside or outside). They find that including optical flow and 3D hand normalization helps with out-of-domain generalization and other signed languages as well.
@@ -1004,7 +1004,7 @@ Research papers which do not necessarily contribute new theory or architectures
 @dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
 They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
 From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
 They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
-Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
+Results on the How2Sign dataset [@dataset:duarte2020how2sign] show improvements over previous methods.
 They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation.

From 459a6197a9b85d39c69883bc56eb82f1e428a311 Mon Sep 17 00:00:00 2001
From: Colin Leong <122366389+cleong110@users.noreply.github.com>
Date: Wed, 12 Jun 2024 18:15:54 -0400
Subject: [PATCH 6/7] CDL: some requested changes

---
 src/datasets/YouTube-ASL.json | 2 +-
 src/index.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/datasets/YouTube-ASL.json b/src/datasets/YouTube-ASL.json
index 05979b1..e3b6a84 100644
--- a/src/datasets/YouTube-ASL.json
+++ b/src/datasets/YouTube-ASL.json
@@ -11,6 +11,6 @@
   "#signers": ">2519",
   "features": ["video", "text:English"],
   "language": "American",
-  "license": "CC-By-SA (video IDs)",
+  "license": "CC By-SA (video IDs)",
   "licenseUrl": "https://creativecommons.org/licenses/by/4.0/"
 }
\ No newline at end of file
diff --git a/src/index.md b/src/index.md
index 44a44a7..088ed37 100644
--- a/src/index.md
+++ b/src/index.md
@@ -1004,7 +1004,7 @@ Research papers which do not necessarily contribute new theory or architectures
 From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
 They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer].
-Results on the How2Sign dataset [@dataset:duarte2020how2sign] show significant improvements over previous methods.
+Results on the How2Sign dataset [@dataset:duarte2020how2sign] show improvements over previous methods.
 They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation.

From 2274652dba0b032a916d8937a42511de8dac6336 Mon Sep 17 00:00:00 2001
From: Colin Leong <--unset>
Date: Thu, 20 Jun 2024 12:48:22 -0400
Subject: [PATCH 7/7] CDL: requested changes and a comment about YouTube-SL-25

---
 src/index.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/index.md b/src/index.md
index 11a7eb5..28939bb 100644
--- a/src/index.md
+++ b/src/index.md
@@ -1002,11 +1002,13 @@ Research papers which do not necessarily contribute new theory or architectures
 
 @dataset:uthus2023YoutubeASL introduce YouTube-ASL, a large-scale dataset of American Sign Language videos with accompanying English captions mined from YouTube.
 They use an iterative process to first automatically identify candidate videos, and then filter for quality using native signers.
-From over 88,000 candidate videos they filter down to about 11k with well-aligned English translations and comprehensible ASL content.
+From over 88,000 candidate videos they filter down to about 11,000 with well-aligned English translations and comprehensible ASL content.
They train a baseline translation model leveraging this data, pretraining on YouTube data by estimating poses with MediaPipe [@mediapipe2020holistic] and projecting into a multilingual language model [@raffel2020T5Transformer]. Results on the How2Sign dataset [@dataset:duarte2020how2sign] show improvements over previous methods. They conclude that further mining of large-scale datasets with wide signer variety may be useful for sign language recognition and translation. + + @dataset:joshi-etal-2023-isltranslate introduce ISLTranslate, a large translation dataset for Indian Sign Language based on publicly available educational videos intended for hard-of-hearing children, which happen to contain both Indian Sign Language and English audio voiceover conveying the same content. They use a speech-to-text model to transcribe the audio content, which they later manually corrected with the help of accompanying books also containing the same content. They also use MediaPipe to extract pose features, and have a certified ISL signer validate a small portion of the sign-text pairs. They provide a baseline based on the architecture proposed in @camgoz2020sign, and provide code.
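
The ISLTranslate construction above follows a pattern that is easy to sketch: transcribe the English voiceover with a speech-to-text model, keep the timestamps, and treat each timed caption as a candidate sign-text pair pending manual correction. In the sketch below, Whisper is purely an illustrative stand-in (the description above says only "a speech-to-text model"), and the input file name is hypothetical.

```python
# Hedged sketch of the transcribe-then-pair step described for ISLTranslate.
# Whisper stands in for the unspecified speech-to-text model; file names and
# the output layout are assumptions for illustration.
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")
result = asr.transcribe("isl_lesson_001.mp4")  # hypothetical educational video

candidate_pairs = []
for seg in result["segments"]:  # each segment carries start/end times in seconds
    candidate_pairs.append({
        "video": "isl_lesson_001.mp4",
        "start": seg["start"],        # clip span containing the signed rendition
        "end": seg["end"],
        "text": seg["text"].strip(),  # to be manually corrected against the books
    })

print(f"collected {len(candidate_pairs)} candidate sign-text pairs")
```

Pose features would then be extracted over each [start, end] span with MediaPipe, mirroring the YouTube-ASL sketch earlier in this series.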