
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning #16


Closed · 7 of 21 tasks · Tracked by #2
cleong110 opened this issue May 31, 2024 · 6 comments

Comments

@cleong110 (Owner) commented on May 31, 2024:

  • sync, pull and merge master first!
  • Search for the correct citation on Semantic Scholar
  • Make a new branch ("You should always branch out from master")
  • Add the citation to references.bib. If it is a dataset, prefix the key with dataset:. Exclude wordy abstracts. (The Better BibTeX extension for Zotero can exclude keys.)
  • Check for egregious {} in the BibTeX
  • Write a summary and add it to the appropriate section in index.md.
  • Make sure the citation keys match.
  • Add a newline after each sentence in a paragraph. It still renders as one paragraph but makes git diffs easier.
  • ChatGPT 3.5 can suggest rewrites and improve the writing.
  • Check that acronyms are explained
  • Copy-paste into https://dillinger.io/ and check that it renders correctly
  • Make a PR from the branch on my fork to master on the source repo

PR:

  • sync master of both forks
  • git pull master on local
  • git merge master on branch
  • git push
  • THEN make the PR

Writing/style:

@cleong110 (Owner, Author) commented:

Official IEEE BibTeX:

@INPROCEEDINGS{10204832,
  author={Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang},
  booktitle={2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  title={CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning}, 
  year={2023},
  volume={},
  number={},
  pages={19016-19026},
  keywords={Visualization;Semantics;Gesture recognition;Speech recognition;Assistive technologies;Linguistics;Feature extraction;Vision applications and systems},
  doi={10.1109/CVPR52729.2023.01823}}

@cleong110 (Owner, Author) commented:

Edited to fit the website style guide:

@inproceedings{Cheng2023CiCoSignLanguageRetrieval,
  author={Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang},
  booktitle={2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  title={CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning}, 
  year={2023},
  doi={10.1109/CVPR52729.2023.01823}
}

@cleong110 (Owner, Author) commented:


Inputs:

  • RGB Video?

Outputs:

  • For T2V retrieval, a sign video from a closed set; for V2T retrieval, a text from a closed set.

Datasets:

  • How2Sign
  • RWTH-PHOENIX-Weather 2014T
  • CSL-Daily

Key idea: it is not just video-to-text retrieval, it is also cross-lingual retrieval, so a CLIP-style contrastive approach can be applied to the visual features and the texts.

Key ideas include:

  • "Considering the linguistic characteristics of sign languages, we formulate sign language retrieval as a cross-lingual retrieval [6, 34, 60] problem in addition to a video-text retrieval [5, 24, 42–44, 58, 69] task."
  • SL retrieval is hard because: (1) it is hard to map between the two languages, since word orders do not line up; (2) data is small, e.g., How2Sign has only ~30k pairs; (3) many fine-grained features are needed to distinguish subtleties of the face, fingers... (4)
  • "We consider the linguistic rules (e.g., word order) of both sign languages and natural languages.... While contrasting the sign videos and the texts in a joint embedding space as achieved in most vision-language pre-training frameworks [5, 44, 58], we simultaneously identify the finegrained cross-lingual (sign-to-word) mappings between two types of languages via our proposed cross-lingual contrastive learning as shown in Figure 2." So it is Video-CLIP-style contrastive learning, but it also learns sign-to-word mappings?
  • The sign encoder has two parts: (1) a "domain-agnostic" part pre-trained on "large-scale" sign videos; (2) a "domain-aware" part fine-tuned on "pseudo-labeled data from target datasets."
  • "In this work, pseudo-labeling is served as our sign spotting approach to localize isolated signs in untrimmed videos from target sign language retrieval datasets."
  • "In order to effectively model long videos, we decouple our framework into two disjoint parts: (1) a sign encoder which adopts a sliding window on sign-videos to pre-extract their vision features; (2) a cross-lingual contrastive learning module which encodes the extracted vision features and their corresponding texts in a joint embedding space." So tl;dr, you CLIP on the pre-extract vision features and the texts.
  • Code and models are available at: https://github.com/FangyunWei/SLRT.
  • "Sign language recognition and translation (SLRT)... recognizing the arbitrary semantic meanings conveyed by sign languages." But lack of data! Hampers it badly! But retrieval from closed set, doable.
  • Early works of sign language retrieval primarily investigate query-by-example searching [4, 71], which queries individual instances with given sign examples.

[4] Vassilis Athitsos, Carol Neidle, Stan Sclaroff, Joan Nash, Alexandra Stefan, Ashwin Thangali, Haijing Wang, and Quan Yuan. Large lexicon project: American sign language video corpus and sign language indexing/retrieval algorithms. In Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, volume 2, pages 11–14, 2010.
[71] Shilin Zhang and Bo Zhang. Using revised string edit distance to sign language video retrieval. In International Conference on Computational Intelligence and Natural Computing, volume 1, pages 45–49. IEEE, 2010.
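
To make the cross-lingual contrastive learning bullet above more concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective on pre-extracted sign-video features and text features. This is my own illustration, not the authors' released code: the module names, feature dimensions, and pooling choices are assumptions, and the fine-grained sign-to-word mapping the paper adds on top is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignTextContrastive(nn.Module):
    """Sketch of a CLIP-style contrastive head over pre-extracted features."""

    def __init__(self, sign_feat_dim=1024, text_feat_dim=768, embed_dim=512):
        super().__init__()
        # Project pre-extracted sign features and text features into a joint embedding space.
        self.sign_proj = nn.Linear(sign_feat_dim, embed_dim)
        self.text_proj = nn.Linear(text_feat_dim, embed_dim)
        # Learnable temperature, initialized as in CLIP (log(1/0.07)).
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, sign_feats, text_feats):
        # sign_feats: (batch, sign_feat_dim) pooled features from the sign encoder.
        # text_feats: (batch, text_feat_dim) pooled features from a text encoder.
        v = F.normalize(self.sign_proj(sign_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()  # (batch, batch) similarity matrix
        labels = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: matched video-text pairs are pulled together,
        # all other pairs in the batch are pushed apart.
        loss_v2t = F.cross_entropy(logits, labels)
        loss_t2v = F.cross_entropy(logits.t(), labels)
        return (loss_v2t + loss_t2v) / 2
```

At retrieval time the same similarity matrix would be used directly: rank videos by similarity to a text query for T2V, and rank texts for V2T.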

@cleong110 (Owner, Author) commented:

Making a quick prompt for ChatGPT to provide suggestions:

I am writing a summary of an academic paper. Based on what I have provided below (abstract and/or attached full text), can you provide suggestions so I can rewrite my first version of the summary to be more concise and professional? 

Please go line by line, and explain your suggested changes, as well as any issues with writing quality or inaccuracies in my original summary. Be sure the suggestions you provide are accurate to the abstract or full text, if provided.

If I have missed a key contribution from the paper, please note that and suggest additions. If something is not clear, request clarification and I can provide additional snippets.

Please cite your sources for important details, e.g. "from the abstract" or "based on the full text". My summary is in markdown syntax and contains citations to a BibTex bibliography, the citations begin with "@". Please use the same citation style.

In addition, please follow the following style guide:

STYLE GUIDE
- **Citations**: Use the format `@authorYearKeyword` for inline citations, and `[@authorYearKeyword]` for citations wrapped in parentheses. To include multiple citations, use a semicolon (;) to separate them (e.g., "@authorYearKeyword;@authorYearKeyword").
- **Background & Related Work**: Use simple past tense to describe previous work (e.g., "@authorYearKeyword used...").
- **Abbreviations**: Define abbreviations in parentheses after the full term (e.g., Langue des Signes Française (LSF)).
- **Percentages**: Use the percent sign (%) with no space between the number and the sign (e.g., 95%).
- **Spacing**: Use a single space after periods and commas.
- **Hyphenation**: Use hyphens (-) for compound adjectives (e.g., video-to-pose).
- **Lists**: Use "-" for list items, followed by a space.
- **Code**: Use backticks (`) for inline code, and triple backticks (```) for code blocks.
- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater.
- **Contractions**: Avoid contractions (e.g., use "do not" instead of "don't").
- **Compound Words**: Use a forward slash (/) to separate alternative compound words (e.g., 2D / 3D).
- **Phrasing**: Prefer active voice over passive voice (e.g., "The authors used..." instead of "The work was used by the authors...").
- **Structure**: Present information in a logical order.
- **Capitalization**: Capitalize the first word of a sentence, and proper nouns.
- **Emphasis**: Use italics for emphasis by wrapping a word with asterisks (e.g., *emphasis*).
- **Quote marks**: Use double quotes (").
- **Paragraphs**: When a subsection header starts with ######, add "{-}" to the end of the subsection title to indicate a new paragraph. If it starts with #, ##, ###, ####, or ##### do not add the "{-}".
- **Mathematics**: Use LaTeX math notation (e.g., $x^2$) wrapped in dollar signs ($).

Here is the abstract: 
This work focuses on sign language retrieval—a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos, not only contain visual signals but also carry abundant semantic meanings by themselves due to the fact that sign languages are also natural languages. Considering this character, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed as cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue—sign language datasets are orders of magnitude smaller in scale than that of speech recognition. We alleviate this issue by adopting a domain-agnostic sign encoder pre-trained on large-scale sign videos into the target domain via pseudo-labeling. Our framework, termed as domain-aware sign language retrieval via Cross-lingual Contrastive learning or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on PHOENIX-2014T dataset. Code and models are available at: https://github.com/FangyunWei/SLRT.

Here is my summary: 

@Cheng2023CiCoSignLanguageRetrieval introduce a video-to-text and text-to-video retrieval method based on cross-lingual contrastive learning.
Inspired by previous work on transfer learning from sign-spotting / segmentation models [@jui-etal-2022-machine;@Duarte2022SignVideoRetrivalWithTextQueries], they adopt a "domain-agnostic" I3D encoder pretrained on large-scale sign language datasets for the sign-spotting task [@Varol2021ReadAndAttend].
Then, on target datasets with continuous signing videos, they apply that sign-spotting model with a sliding window to find high-confidence predictions.
Those predictions are then used to fine-tune a "domain-aware" sign-spotting encoder for the target dataset.
Combining the two encoders, they pre-extract features from sign language videos.
Cross-lingual contrastive learning [@Radford2021LearningTV] is then used to contrast feature / text pairs, mapping them to a joint embedding space.
Embeddings of matched pairs are pulled together and those of non-matched pairs are pushed apart.
They evaluate on the How2Sign [@dataset:duarte2020how2sign] and RWTH-PHOENIX-Weather 2014T [@cihan2018neural] datasets, improving substantially over the previous state-of-the-art method [@Duarte2022SignVideoRetrivalWithTextQueries].
In addition, they provide baseline retrieval results for the CSL-Daily dataset [@dataset:Zhou2021_SignBackTranslation_CSLDaily].
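
As a rough illustration of the sliding-window pseudo-labeling step described in the summary, here is a sketch under my own assumptions, not the released SLRT code; the `spotter` callable, window size, stride, and confidence threshold are all hypothetical.

```python
import torch

def spot_signs(spotter, video_frames, window_size=16, stride=8, threshold=0.8):
    """Slide a window over an untrimmed video and keep high-confidence isolated-sign predictions.

    spotter: a classifier over isolated signs (e.g., an I3D-style model); hypothetical here.
    video_frames: tensor of shape (num_frames, C, H, W).
    """
    pseudo_labels = []
    for start in range(0, video_frames.size(0) - window_size + 1, stride):
        clip = video_frames[start:start + window_size].unsqueeze(0)  # (1, T, C, H, W)
        with torch.no_grad():
            probs = torch.softmax(spotter(clip), dim=-1).squeeze(0)
        confidence, sign_id = probs.max(dim=-1)
        if confidence.item() >= threshold:
            # Keep this window as a pseudo-labeled isolated sign,
            # later used to fine-tune the domain-aware encoder.
            pseudo_labels.append({
                "start_frame": start,
                "end_frame": start + window_size,
                "sign_id": sign_id.item(),
                "confidence": confidence.item(),
            })
    return pseudo_labels
```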

@cleong110 (Owner, Author) commented:

merged
