Adding "CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning" #56

Merged
2 changes: 1 addition & 1 deletion src/datasets/ASLLVD.json
@@ -2,7 +2,7 @@
"pub": {
"name": "ASLLVD",
"year": 2008,
"publication": "dataset:athitsos2008american",
"publication": "dataset:athitsos2008american,athitsos2010LargeLexiconIndexingRetrieval",
"url": "https://crystal.uta.edu/~athitsos/projects/asl_lexicon/"
},
"features": ["gloss:ASL", "video:RGB"],
23 changes: 21 additions & 2 deletions src/index.md
@@ -902,8 +902,16 @@ TODO

Sign Language Retrieval is the task of finding a particular data item given some input. In contrast to translation, generation, or production tasks, the correct output may already exist somewhere in a collection, and the task is to find it among many candidates, if it is there at all.

<!-- TODO: text-to-sign-video (T2V) section, sign-video-to-text (V2T) retrieval -->
<!-- TODO: CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning -->
Metrics used include retrieval at Rank K (R@K, higher is better) and median rank (MedR, lower is better).
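
As a minimal sketch (with hypothetical ranks), both metrics can be computed from the 1-indexed rank at which each query's correct match is retrieved:

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """R@K and MedR from the 1-indexed rank of each query's correct match."""
    ranks = np.asarray(ranks)
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

# e.g., the correct video is ranked 1st, 3rd, 12th, 2nd and 7th for five queries:
print(retrieval_metrics([1, 3, 12, 2, 7]))
# {'R@1': 0.2, 'R@5': 0.6, 'R@10': 0.8, 'MedR': 3.0}
```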

<!-- TODO: text-to-sign-video (T2V) section, sign-video-to-text (V2T) retrieval? -->
@athitsos2010LargeLexiconIndexingRetrieval present one of the earliest works on this task: a method based on hand centroids and dynamic time warping that lets users query the ASL Lexicon Video Dataset [@dataset:athitsos2008american] by submitting a video of a sign.
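
The matching step can be sketched as dynamic time warping over centroid trajectories; the following is an illustrative reimplementation, not the authors' code:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two hand-centroid
    trajectories, given as arrays of shape [T, 2]."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame cost
            D[i, j] = cost + min(D[i - 1, j],       # skip a frame of a
                                 D[i, j - 1],       # skip a frame of b
                                 D[i - 1, j - 1])   # match the two frames
    return D[n, m]

# Lexicon entries are then ranked by DTW distance to the query trajectory.
```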

@Zhang2010RevisedEditDistanceSignVideoRetrieval provide another early method for video-based querying.
They use classical image feature extraction methods to compute movement trajectories.
They then use a revised string edit distance between these trajectories to find similar videos.
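
As a rough sketch of the idea (using a plain Levenshtein distance rather than the authors' revised variant), trajectories can be quantized into direction symbols and compared as strings:

```python
import numpy as np

def direction_string(traj, bins=8):
    """Quantize a 2D trajectory (array of [T, 2] points) into
    chain-code direction symbols in {0, ..., bins-1}."""
    deltas = np.diff(np.asarray(traj, dtype=float), axis=0)
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])  # in [-pi, pi]
    return list(((angles + np.pi) / (2 * np.pi / bins)).astype(int) % bins)

def edit_distance(s, t):
    """Plain Levenshtein distance between two symbol sequences."""
    d = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (a != b))
    return d[-1]

# Videos whose trajectory strings have a small edit distance to the
# query's string are returned as matches.
```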

<!-- TODO: write about SPOT-ALIGN. Cheng2023CiCoSignLanguageRetrieval say retrieval is "recently introduced... by SPOT-ALIGN" and cite Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, and Gül Varol. Sign language video retrieval with free-form textual queries. -->

@costerQueryingSignLanguage2023 present a method to query sign language dictionaries using dense vector search.
They pretrain a [Sign Language Recognition model](#pose-to-gloss) on a subset of the VGT corpus [@dataset:herreweghe2015VGTCorpus] to embed sign inputs.
@@ -912,6 +920,17 @@ When a user submits a query video, the system compares the input embeddings with
Tests on a [proof-of-concept Flemish Sign Language dictionary](https://github.com/m-decoster/VGT-SL-Dictionary) show that the system can successfully retrieve a limited vocabulary of signs, including some not in the training set.
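
A minimal sketch of such a dense-vector lookup (assuming embeddings come from the pretrained recognition model; names are illustrative):

```python
import numpy as np

def top_k_signs(query_emb, dictionary_embs, k=5):
    """Rank dictionary entries by cosine similarity to the query.

    query_emb: [D] embedding of the user's query video
    dictionary_embs: [N, D] precomputed embeddings of the dictionary's signs
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = dictionary_embs / np.linalg.norm(dictionary_embs, axis=1, keepdims=True)
    sims = d @ q                 # cosine similarity of each entry to the query
    idx = np.argsort(-sims)[:k]  # indices of the k most similar signs
    return idx, sims[idx]
```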
<!-- TODO: add VGT Corpus (dataset:herreweghe2015VGTCorpus) to list of datasets -->

<!-- TODO: Sign language video retrieval with free-form textual queries was the only other paper that Cheng2023CiCoSignLanguageRetrieval compared with. -->

@Cheng2023CiCoSignLanguageRetrieval introduce a video-to-text (V2T) and text-to-video (T2V) retrieval method based on cross-lingual contrastive learning.
Using a "domain-agnostic" I3D encoder pretrained on large-scale sign language datasets [@Varol2021ReadAndAttend], they generate pseudo-labels on the target datasets and fine-tune a "domain-aware" encoder.
Combining the two encoders, they pre-extract features from sign language videos.
They then use cross-lingual contrastive learning [@Radford2021LearningTV] to contrast feature/text pairs, mapping them into a shared embedding space.
Embeddings of matched pairs are pulled together, while those of non-matched pairs are pushed apart.
They evaluate on How2Sign [@dataset:duarte2020how2sign] and the RWTH-PHOENIX-Weather 2014T dataset [@cihan2018neural], improving substantially over the previous state-of-the-art method [@Duarte2022SignVideoRetrivalWithTextQueries].
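
The contrastive objective follows CLIP [@Radford2021LearningTV]; a minimal sketch of the symmetric loss over a batch of matched video-feature/text embedding pairs (illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched video/text pairs."""
    v = F.normalize(video_emb, dim=-1)  # [B, D]
    t = F.normalize(text_emb, dim=-1)   # [B, D]
    logits = v @ t.T / temperature      # [B, B] scaled cosine similarities
    labels = torch.arange(len(v), device=v.device)
    # Diagonal entries are matched pairs (pulled together); off-diagonal
    # entries are non-matched pairs (pushed apart).
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```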

<!-- TODO: add BSL-1K dataset, cited in Cheng2023CiCoSignLanguageRetrieval. https://github.com/gulvarol/bsl1k -->

### Fingerspelling

Fingerspelling is spelling a word letter-by-letter, borrowing from the spoken language alphabet [@battison1978lexical;@wilcox1992phonetics;@brentari2001language;@patrie2011fingerspelled].
69 changes: 69 additions & 0 deletions src/references.bib
@@ -357,6 +357,23 @@ @inproceedings{dataset:athitsos2008american
year = {2008}
}

@inproceedings{athitsos2010LargeLexiconIndexingRetrieval,
author = {Athitsos, Vassilis and Neidle, Carol and Sclaroff, Stan and Nash, Joan and Stefan, Alexandra and Thangali, Ashwin and Wang, Haijing and Yuan, Quan},
title = {Large Lexicon Project: {American} {Sign} {Language} Video Corpus and Sign Language Indexing/Retrieval Algorithms},
pages = {11--14},
editor = {Dreuw, Philippe and Efthimiou, Eleni and Hanke, Thomas and Johnston, Trevor and Mart{\'i}nez Ruiz, Gregorio and Schembri, Adam},
booktitle = {Proceedings of the {LREC2010} 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies},
maintitle = {7th International Conference on Language Resources and Evaluation ({LREC} 2010)},
publisher = {{European Language Resources Association (ELRA)}},
address = {Valletta, Malta},
day = {22--23},
month = may,
year = {2010},
language = {english},
url = {https://www.sign-lang.uni-hamburg.de/lrec/pub/10022.pdf}
}


@inproceedings{dataset:dreuw2008benchmark,
address = {Marrakech, Morocco},
author = {Dreuw, Philippe and
@@ -3150,3 +3167,55 @@ @inproceedings{sellam-etal-2020-bleurt
url = {https://aclanthology.org/2020.acl-main.704},
year = {2020}
}

@inproceedings{Cheng2023CiCoSignLanguageRetrieval,
author={Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang},
booktitle={2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning},
year={2023},
doi={10.1109/CVPR52729.2023.01823}
}


@inproceedings{Varol2021ReadAndAttend,
author={Varol, Gül and Momeni, Liliane and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Read and Attend: Temporal Localisation in Sign Language Videos},
year={2021},
doi={10.1109/CVPR46437.2021.01658}
}

@inproceedings{Zhang2010RevisedEditDistanceSignVideoRetrieval,
author={Zhang, Shilin and Zhang, Bo},
booktitle={2010 Second International Conference on Computational Intelligence and Natural Computing},
title={Using revised string edit distance to sign language video retrieval},
year={2010},
volume={1},
pages={45--49},
doi={10.1109/CINC.2010.5643895}
}

@inproceedings{Radford2021LearningTV,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
booktitle = {Proceedings of the 38th International Conference on Machine Learning},
pages = {8748--8763},
year = {2021},
editor = {Meila, Marina and Zhang, Tong},
volume = {139},
series = {Proceedings of Machine Learning Research},
month = {18--24 Jul},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v139/radford21a/radford21a.pdf},
url = {https://proceedings.mlr.press/v139/radford21a.html}
}


@inproceedings{Duarte2022SignVideoRetrivalWithTextQueries,
author={Duarte, Amanda and Albanie, Samuel and Giró-I-Nieto, Xavier and Varol, Gül},
booktitle={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Sign Language Video Retrieval with Free-Form Textual Queries},
year={2022},
pages={14074--14084},
doi={10.1109/CVPR52688.2022.01370}
}