Adding "CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning" #56

Merged
2 changes: 1 addition & 1 deletion src/datasets/ASLLVD.json
@@ -2,7 +2,7 @@
"pub": {
"name": "ASLLVD",
"year": 2008,
"publication": "dataset:athitsos2008american",
"publication": "dataset:athitsos2008american,athitsos2010LargeLexiconIndexingRetrieval",
"url": "https://crystal.uta.edu/~athitsos/projects/asl_lexicon/"
},
"features": ["gloss:ASL", "video:RGB"],
23 changes: 21 additions & 2 deletions src/index.md
@@ -902,8 +902,16 @@ TODO

Sign Language Retrieval is the task of finding a particular data item given some input. In contrast to translation, generation, or production tasks, the correct output may already exist somewhere in a collection, and the task is to find it among many candidates, if it is there at all.

<!-- TODO: text-to-sign-video (T2V) section, sign-video-to-text (V2T) retrieval -->
<!-- TODO: CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning -->
Metrics used include retrieval at Rank K (R@K, higher is better) and median rank (MedR, lower is better).
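
As a minimal sketch (with hypothetical ranks), both metrics can be computed from the 1-indexed rank at which each query's correct match is retrieved:

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """R@K and MedR from the 1-indexed rank of each query's correct match."""
    ranks = np.asarray(ranks)
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

# e.g., the correct video is ranked 1st, 3rd, 12th, 2nd and 7th for five queries:
print(retrieval_metrics([1, 3, 12, 2, 7]))
# {'R@1': 0.2, 'R@5': 0.6, 'R@10': 0.8, 'MedR': 3.0}
```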

<!-- TODO: text-to-sign-video (T2V) section, sign-video-to-text (V2T) retrieval? -->
@athitsos2010LargeLexiconIndexingRetrieval present one of the earliest works on this task: a method based on hand centroids and dynamic time warping that lets users query the ASL Lexicon Video Dataset [@dataset:athitsos2008american] by submitting a video of a sign.
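
The matching step can be sketched as dynamic time warping over centroid trajectories; the following is an illustrative reimplementation, not the authors' code:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two hand-centroid
    trajectories, given as arrays of shape [T, 2]."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame cost
            D[i, j] = cost + min(D[i - 1, j],       # skip a frame of a
                                 D[i, j - 1],       # skip a frame of b
                                 D[i - 1, j - 1])   # match the two frames
    return D[n, m]

# Lexicon entries are then ranked by DTW distance to the query trajectory.
```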

@Zhang2010RevisedEditDistanceSignVideoRetrieval provide another early method for video-based querying.
They use classical image feature extraction methods to compute movement trajectories.
They then use a revised string edit distance between these trajectories to find similar videos.
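
As a rough sketch of the idea (using a plain Levenshtein distance rather than the authors' revised variant), trajectories can be quantized into direction symbols and compared as strings:

```python
import numpy as np

def direction_string(traj, bins=8):
    """Quantize a 2D trajectory (array of [T, 2] points) into
    chain-code direction symbols in {0, ..., bins-1}."""
    deltas = np.diff(np.asarray(traj, dtype=float), axis=0)
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])  # in [-pi, pi]
    return list(((angles + np.pi) / (2 * np.pi / bins)).astype(int) % bins)

def edit_distance(s, t):
    """Plain Levenshtein distance between two symbol sequences."""
    d = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (a != b))
    return d[-1]

# Videos whose trajectory strings have a small edit distance to the
# query's string are returned as matches.
```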

<!-- TODO: write about SPOT-ALIGN. Cheng2023CiCoSignLanguageRetrieval say retrieval is "recently introduced... by SPOT-ALIGN" and cite Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, and Gül Varol. Sign language video retrieval with free-form textual queries. -->

@costerQueryingSignLanguage2023 present a method to query sign language dictionaries using dense vector search.
They pretrain a [Sign Language Recognition model](#pose-to-gloss) on a subset of the VGT corpus [@dataset:herreweghe2015VGTCorpus] to embed sign inputs.
@@ -912,6 +920,17 @@ When a user submits a query video, the system compares the input embeddings with
Tests on a [proof-of-concept Flemish Sign Language dictionary](https://github.com/m-decoster/VGT-SL-Dictionary) show that the system can successfully retrieve a limited vocabulary of signs, including some not in the training set.
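
A minimal sketch of such a dense-vector lookup (assuming embeddings come from the pretrained recognition model; names are illustrative):

```python
import numpy as np

def top_k_signs(query_emb, dictionary_embs, k=5):
    """Rank dictionary entries by cosine similarity to the query.

    query_emb: [D] embedding of the user's query video
    dictionary_embs: [N, D] precomputed embeddings of the dictionary's signs
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = dictionary_embs / np.linalg.norm(dictionary_embs, axis=1, keepdims=True)
    sims = d @ q                 # cosine similarity of each entry to the query
    idx = np.argsort(-sims)[:k]  # indices of the k most similar signs
    return idx, sims[idx]
```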
<!-- TODO: add VGT Corpus (dataset:herreweghe2015VGTCorpus) to list of datasets -->

<!-- TODO: Sign language video retrieval with free-form textual queries was the only other paper that Cheng2023CiCoSignLanguageRetrieval compared with. -->

@Cheng2023CiCoSignLanguageRetrieval introduce a video-to-text (V2T) and text-to-video (T2V) retrieval method based on cross-lingual contrastive learning.
Using a "domain-agnostic" I3D encoder pretrained on large-scale sign language datasets [@Varol2021ReadAndAttend], they generate pseudo-labels on the target datasets and fine-tune a "domain-aware" encoder.
Combining the two encoders, they pre-extract features from sign language videos.
They then use cross-lingual contrastive learning [@Radford2021LearningTV] to contrast feature/text pairs, mapping them into a shared embedding space.
Embeddings of matched pairs are pulled together, while those of non-matched pairs are pushed apart.
They evaluate on How2Sign [@dataset:duarte2020how2sign] and the RWTH-PHOENIX-Weather 2014T dataset [@cihan2018neural], improving substantially over the previous state-of-the-art method [@Duarte2022SignVideoRetrivalWithTextQueries].
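
The contrastive objective follows CLIP [@Radford2021LearningTV]; a minimal sketch of the symmetric loss over a batch of matched video-feature/text embedding pairs (illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched video/text pairs."""
    v = F.normalize(video_emb, dim=-1)  # [B, D]
    t = F.normalize(text_emb, dim=-1)   # [B, D]
    logits = v @ t.T / temperature      # [B, B] scaled cosine similarities
    labels = torch.arange(len(v), device=v.device)
    # Diagonal entries are matched pairs (pulled together); off-diagonal
    # entries are non-matched pairs (pushed apart).
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```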

<!-- TODO: add BSL-1K dataset, cited in Cheng2023CiCoSignLanguageRetrieval. https://github.com/gulvarol/bsl1k -->

### Fingerspelling

Fingerspelling is spelling a word letter-by-letter, borrowing from the spoken language alphabet [@battison1978lexical;@wilcox1992phonetics;@brentari2001language;@patrie2011fingerspelled].
69 changes: 69 additions & 0 deletions src/references.bib
@@ -357,6 +357,23 @@ @inproceedings{dataset:athitsos2008american
year = {2008}
}

@inproceedings{athitsos2010LargeLexiconIndexingRetrieval,
author = {Athitsos, Vassilis and Neidle, Carol and Sclaroff, Stan and Nash, Joan and Stefan, Alexandra and Thangali, Ashwin and Wang, Haijing and Yuan, Quan},
title = {Large Lexicon Project: {American} {Sign} {Language} Video Corpus and Sign Language Indexing/Retrieval Algorithms},
pages = {11--14},
editor = {Dreuw, Philippe and Efthimiou, Eleni and Hanke, Thomas and Johnston, Trevor and Mart{\'i}nez Ruiz, Gregorio and Schembri, Adam},
booktitle = {Proceedings of the {LREC2010} 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies},
maintitle = {7th International Conference on Language Resources and Evaluation ({LREC} 2010)},
publisher = {{European Language Resources Association (ELRA)}},
address = {Valletta, Malta},
day = {22--23},
month = may,
year = {2010},
language = {english},
url = {https://www.sign-lang.uni-hamburg.de/lrec/pub/10022.pdf}
}


@inproceedings{dataset:dreuw2008benchmark,
address = {Marrakech, Morocco},
author = {Dreuw, Philippe and
@@ -3150,3 +3167,55 @@ @inproceedings{sellam-etal-2020-bleurt
url = {https://aclanthology.org/2020.acl-main.704},
year = {2020}
}

@inproceedings{Cheng2023CiCoSignLanguageRetrieval,
author={Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang},
booktitle={2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning},
year={2023},
doi={10.1109/CVPR52729.2023.01823}
}


@inproceedings{Varol2021ReadAndAttend,
author={Varol, Gül and Momeni, Liliane and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Read and Attend: Temporal Localisation in Sign Language Videos},
year={2021},
doi={10.1109/CVPR46437.2021.01658}
}

@inproceedings{Zhang2010RevisedEditDistanceSignVideoRetrieval,
author={Zhang, Shilin and Zhang, Bo},
booktitle={2010 Second International Conference on Computational Intelligence and Natural Computing},
title={Using revised string edit distance to sign language video retrieval},
year={2010},
volume={1},
pages={45--49},
doi={10.1109/CINC.2010.5643895}
}

@inproceedings{Radford2021LearningTV,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
booktitle = {Proceedings of the 38th International Conference on Machine Learning},
pages = {8748--8763},
year = {2021},
editor = {Meila, Marina and Zhang, Tong},
volume = {139},
series = {Proceedings of Machine Learning Research},
month = {18--24 Jul},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v139/radford21a/radford21a.pdf},
url = {https://proceedings.mlr.press/v139/radford21a.html}
}


@inproceedings{Duarte2022SignVideoRetrivalWithTextQueries,
author={Duarte, Amanda and Albanie, Samuel and Giró-I-Nieto, Xavier and Varol, Gül},
booktitle={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Sign Language Video Retrieval with Free-Form Textual Queries},
year={2022},
pages={14074--14084},
doi={10.1109/CVPR52688.2022.01370}
}