Adding "CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning" (#56)
* CDL: Cheng2023CiCoSignLanguageRetrieval citation
* CDL: rough-draft summary of CiCo continued
* CDL: adding citation for Read and Attend
* CDL: add refs for athitsos2010LargeLexiconIndexingRetrieval and Zhang2010RevisedEditDistanceSignVideoRetrieval, and update ASLLVD, which has updated statistics in the updated citation.
* CDL: updated CiCo Summary
* CDL: adding another citation to ASLLVD json
* CDL: add in a reference for Duarte2022SignVideoRetrivalWithTextQueries
* CDL: remove TODO comment
* CDL: adding some more references for CiCo
* CDL: a rewrite of CiCo for clarity of pseudo-labelling
* CDL: rewrote, synthesizing mine with ChatGPT suggestions
* CDL: extra 'and' removed
* CDL: another rewrite of CiCo!
* CDL: {CiCo}
* CDL: INPROCEEDINGS fix
* CDL: some requested changes
src/index.md (+25 −5)
@@ -734,7 +734,7 @@ and so they have broken the dependency upon costly annotated gloss information i
@chen2022TwoStreamNetworkSign present a two-stream network for sign language recognition (SLR) and translation (SLT), utilizing a dual visual encoder architecture to encode RGB video frames and pose keypoints in separate streams.
These streams interact via bidirectional lateral connections.
For SLT, the visual encoders based on an S3D backbone [@xie2018SpatiotemporalS3D] output to a multilingual translation network using mBART [@liu-etal-2020-multilingual-denoising].
- The model achieves state-of-the-art performance on the RWTH-PHOENIX-Weather-2014 [@dataset:forster2014extensions], RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:huang2018video] datasets.
+ The model achieves state-of-the-art performance on the RWTH-PHOENIX-Weather-2014 [@dataset:forster2014extensions], RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets.

@zhang2023sltunet propose a multi-modal, multi-task learning approach to end-to-end sign language translation.
The model features shared representations for different modalities such as text and video and is trained jointly
@@ -745,7 +745,7 @@ The approach allows leveraging external data such as parallel data for spoken la
Their approach involves guiding the model to encode visual and textual data similarly through two paths: one with visual data alone and one with both modalities.
Using KL divergences, they steer the model towards generating consistent embeddings and accurate outputs regardless of the path.
Once the model achieves consistent performance across paths, it can be utilized for translation without gloss supervision.
- Evaluation on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:huang2018video] datasets demonstrates its efficacy.
+ Evaluation on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets demonstrates its efficacy.
They provide a [code implementation](https://github.com/rzhao-zhsq/CV-SLT) based largely on @chenSimpleMultiModalityTransfer2022a.
<!-- The CV-SLT code looks pretty nice! Conda env file, data prep, not too old, paths in .yaml files, checkpoints provided (including the ones for replication), commands to train and evaluate, very nice -->
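The two-path consistency objective in the hunk above is easy to picture in code. Below is a minimal sketch, with hypothetical tensor shapes and a stand-in linear decoder rather than the authors' implementation: one pass decodes visual features alone, another decodes visual plus text features, and a KL term pulls the two predictive distributions together.

```python
import torch
import torch.nn.functional as F

# Illustrative only: assumed shapes and a toy decoder, not the CV-SLT code.
B, T, D, V = 2, 16, 512, 1000           # batch, time, feature dim, vocab
visual = torch.randn(B, T, D)           # pre-extracted visual features
textual = torch.randn(B, T, D)          # embedded target-text features

decoder = torch.nn.Linear(D, V)         # stand-in for the shared decoder

logits_visual = decoder(visual)         # path 1: visual data alone
logits_both = decoder(visual + textual) # path 2: both modalities

# KL divergence between the two paths' predictive distributions;
# minimizing it steers the model toward path-consistent outputs.
kl = F.kl_div(
    F.log_softmax(logits_visual, dim=-1),
    F.softmax(logits_both, dim=-1),
    reduction="batchmean",
)
```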
@@ -757,7 +757,7 @@ SignLLM converts sign videos into discrete and hierarchical representations comp
During inference, the "word-level" tokens are projected into the LLM's embedding space, which is then prompted for translation.
The LLM itself can be taken "off the shelf" and does not need to be trained.
In training, the VQ-Sign "character-level" module is trained with a context prediction task, the CRA "word-level" module with an optimal transport technique, and a sign-text alignment loss further enhances the semantic alignment between sign and text tokens.
- The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:huang2018video] datasets without relying on gloss annotations.
+ The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets without relying on gloss annotations.
<!-- TODO: c.f. SignLLM with https://github.com/sign-language-processing/sign-vq? -->

<!-- TODO: YoutubeASL explanation would fit nicely here before Rust et al 2024. They don't just do data IIRC. -->
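Returning to SignLLM's quantize-then-prompt pipeline above: the following toy sketch shows the general shape of the idea under assumed names and dimensions (`codebook` and `project` are hypothetical; this is not the paper's code). Continuous sign features are snapped to their nearest codebook entries, and the resulting discrete tokens are projected into a frozen LLM's input embedding space.

```python
import torch

# Assumed sizes: frames, feature dim, codebook size, LLM embedding dim.
T, D, K, D_llm = 16, 512, 1024, 4096
features = torch.randn(T, D)            # continuous sign-video features
codebook = torch.randn(K, D)            # learned "character-level" codebook

# Nearest-neighbour lookup: each frame feature -> closest codebook entry.
token_ids = torch.cdist(features, codebook).argmin(dim=-1)   # (T,)

# Project the quantized embeddings into the (frozen) LLM's input space,
# after which the LLM can be prompted for translation.
project = torch.nn.Linear(D, D_llm)
llm_inputs = project(codebook[token_ids])                    # (T, D_llm)
```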
@@ -888,16 +888,36 @@ TODO

### Sign Language Retrieval

- Sign Language Retrieval is the task of finding a particular data item, given some input. In contrast to translation, generation or production tasks, there can exist a correct corresponding piece of data already, and the task is to find it out of many, if it exists.
+ Sign Language Retrieval is the task of finding a particular data item given some input. In contrast to translation, generation, or production tasks, a correct corresponding item may already exist, and the task is to find it among many candidates, if it is present at all. Metrics used include recall at rank K (R@K, higher is better) and median rank (MedR, lower is better).
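As a minimal sketch of how these metrics are typically computed, assuming a square similarity matrix where the correct match for query i is candidate i (the helper below is illustrative, not taken from any of the cited papers):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """R@K and MedR from a (queries x candidates) similarity matrix,
    assuming the correct match for query i is candidate i."""
    # Rank of the correct candidate for each query (1 = best).
    order = np.argsort(-sim, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1
                      for i in range(sim.shape[0])])
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

# Example: random scores over 100 query/candidate pairs.
print(retrieval_metrics(np.random.randn(100, 100)))
```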
+ @athitsos2010LargeLexiconIndexingRetrieval present one of the early works on this task, using a method based on hand centroids and dynamic time warping to enable users to submit videos of a sign and thus query within the ASL Lexicon Video Dataset [@dataset:athitsos2008american].
+
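To illustrate the dynamic-time-warping component of such query-by-video systems, here is a generic textbook DTW distance over 2-D centroid trajectories; it shows the matching idea only, not the authors' actual pipeline.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two centroid
    trajectories a and b, each of shape (time, 2)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three possible alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Retrieval: rank dictionary videos by DTW distance to the query trajectory.
query, entry = np.random.randn(40, 2), np.random.randn(55, 2)
print(dtw_distance(query, entry))
```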
+ @Zhang2010RevisedEditDistanceSignVideoRetrieval provide another early method for video-based querying.
+ They use classical image feature extraction methods to calculate movement trajectories.
+ They then use modified string edit distances between these trajectories as a way to find similar videos.
+
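The paper's revised edit distance differs in its details, but the underlying idea of quantizing trajectories into symbol strings and comparing them with a string edit distance can be sketched as follows (`direction_string` and its 8-way binning are illustrative assumptions, and the distance shown is the classic Levenshtein form):

```python
import numpy as np

def direction_string(traj, bins=8):
    """Quantize a 2-D trajectory (time, 2) into direction symbols."""
    deltas = np.diff(traj, axis=0)
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])   # in (-pi, pi]
    return list(((angles + np.pi) / (2 * np.pi) * bins).astype(int) % bins)

def edit_distance(s, t):
    """Classic Levenshtein distance between two symbol strings."""
    d = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
    d[:, 0] = np.arange(len(s) + 1)
    d[0, :] = np.arange(len(t) + 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i, j] = min(d[i - 1, j] + 1,      # deletion
                          d[i, j - 1] + 1,      # insertion
                          d[i - 1, j - 1] + (s[i - 1] != t[j - 1]))
    return d[len(s), len(t)]

# Similar videos = small edit distance between their direction strings.
a, b = np.random.randn(30, 2), np.random.randn(30, 2)
print(edit_distance(direction_string(a), direction_string(b)))
```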
+ <!-- TODO: write here about SPOT-ALIGN, a.k.a. Duarte2022SignVideoRetrivalWithTextQueries. Cheng2023CiCoSignLanguageRetrieval say retrieval was "recently introduced... by SPOT-ALIGN" and cite Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, and Gül Varol, "Sign language video retrieval with free-form textual queries". That was also the only other paper that Cheng2023CiCoSignLanguageRetrieval compared with. -->
+
+ <!-- TODO: write here also about jui-etal-2022-machine, the other paper cited by Cheng2023CiCoSignLanguageRetrieval for "Previous methods [16, 18] have demonstrated the feasibility of transferring a sign encoder pre-trained on large-scale sign-spotting data into downstream tasks." -->

@costerQueryingSignLanguage2023 present a method to query sign language dictionaries using dense vector search.
They pretrain a [Sign Language Recognition model](#pose-to-gloss) on a subset of the VGT corpus [@dataset:herreweghe2015VGTCorpus] to embed sign inputs.
Once the encoder is trained, they use it to generate embeddings for all dictionary signs.
When a user submits a query video, the system compares the input embeddings with those of the dictionary entries using Euclidean distance.
Tests on a [proof-of-concept Flemish Sign Language dictionary](https://github.com/m-decoster/VGT-SL-Dictionary) show that the system can successfully retrieve a limited vocabulary of signs, including some not in the training set.
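The query step then reduces to a nearest-neighbour search. A minimal sketch, with assumed embedding sizes and placeholder data rather than the actual VGT dictionary:

```python
import numpy as np

# Placeholder data: one 256-d embedding per dictionary sign, plus a
# query embedding produced by the pretrained SLR encoder.
dictionary_embeddings = np.random.randn(5000, 256)
query_embedding = np.random.randn(256)

# Rank dictionary entries by Euclidean distance to the query.
distances = np.linalg.norm(dictionary_embeddings - query_embedding, axis=1)
top10 = np.argsort(distances)[:10]   # indices of the 10 closest signs
```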
+ @Cheng2023CiCoSignLanguageRetrieval introduce a video-to-text and text-to-video retrieval method using cross-lingual contrastive learning.
+ Inspired by transfer learning from sign-spotting/segmentation models [@jui-etal-2022-machine;@Duarte2022SignVideoRetrivalWithTextQueries], the authors employ a "domain-agnostic" I3D encoder, pretrained on large-scale sign language datasets for the sign-spotting task [@Varol2021ReadAndAttend].
+ On target datasets with continuous signing videos, they use this model with a sliding window to identify high-confidence predictions, which are then used to finetune a "domain-aware" sign-spotting encoder.
+ The two encoders each pre-extract features from videos, which are then fused via a weighted sum.
+ Cross-lingual contrastive learning [@Radford2021LearningTV] is then applied to align the extracted features with paired texts within a shared embedding space.
+ This allows the calculation of similarity scores between text and video embeddings, and thus retrieval in either direction.
+ Evaluations on the How2Sign [@dataset:duarte2020how2sign] and RWTH-PHOENIX-Weather-2014T [@cihan2018neural] datasets demonstrate improvement over the previous state-of-the-art [@Duarte2022SignVideoRetrivalWithTextQueries].
+ Baseline retrieval results are also provided for the CSL-Daily dataset [@dataset:Zhou2021_SignBackTranslation_CSLDaily].
+
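A simplified sketch of the CLIP-style contrastive objective this family of methods builds on [@Radford2021LearningTV], with assumed batch and feature sizes (CiCo's full training setup has more components, such as the two-encoder fusion): matched video/text pairs lie on the diagonal of the similarity matrix, and a symmetric cross-entropy pulls them together while pushing mismatched pairs apart.

```python
import torch
import torch.nn.functional as F

B, D = 32, 512                                        # assumed sizes
video_emb = F.normalize(torch.randn(B, D), dim=-1)    # fused sign-video features
text_emb = F.normalize(torch.randn(B, D), dim=-1)     # paired text features
temperature = 0.07

# (B, B) cosine similarities; entry (i, j) scores video i against text j.
logits = video_emb @ text_emb.T / temperature
targets = torch.arange(B)                              # matches on the diagonal

# Symmetric InfoNCE: video->text and text->video directions.
loss = (F.cross_entropy(logits, targets)
        + F.cross_entropy(logits.T, targets)) / 2
```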
+ <!-- TODO: tie in with Automatic Dense Annotation of Large-Vocabulary Sign Language Videos, mentioned in video-to-gloss? Also uses pseudo-labeling -->
+

### Fingerspelling

Fingerspelling is spelling a word letter-by-letter, borrowing from the spoken language alphabet [@battison1978lexical;@wilcox1992phonetics;@brentari2001language;@patrie2011fingerspelled].