
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning #16


Closed · 7 of 21 tasks · Tracked by #2
cleong110 opened this issue May 31, 2024 · 6 comments

Comments

@cleong110 (Owner) commented on May 31, 2024:

  • sync, pull and merge master first!
  • Search for the correct citation on Semantic Scholar
  • Make a new branch ("You should always branch out from master")
  • Add the citation to references.bib. If it is a dataset, prefix the key with dataset:. Exclude wordy abstracts. (The Better BibTeX extension for Zotero can exclude keys.)
  • Check for egregious {} in the BibTeX
  • Write a summary and add it to the appropriate section in index.md.
  • Make sure the citation keys match.
  • Add a newline after each sentence in a paragraph. It still renders as one paragraph but makes git diffs easier.
  • ChatGPT 3.5 can suggest rewrites and improve the writing.
  • Check that acronyms are explained
  • Copy-paste into https://dillinger.io/ and check that it renders correctly
  • Make a PR from the branch on my fork to master on the source repo

PR:

  • sync master of both forks
  • git pull master on local
  • git merge master on branch
  • git push
  • THEN make the PR

Writing/style:

@cleong110 (Owner, Author) commented:

Official IEEE BibTeX:

@INPROCEEDINGS{10204832,
  author={Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang},
  booktitle={2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  title={CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning}, 
  year={2023},
  volume={},
  number={},
  pages={19016-19026},
  keywords={Visualization;Semantics;Gesture recognition;Speech recognition;Assistive technologies;Linguistics;Feature extraction;Vision applications and systems},
  doi={10.1109/CVPR52729.2023.01823}}

@cleong110 (Owner, Author) commented:

Edited to fit the website style guide:

@inproceedings{Cheng2023CiCoSignLanguageRetrieval,
  author={Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang},
  booktitle={2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  title={CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning}, 
  year={2023},
  doi={10.1109/CVPR52729.2023.01823}
}

@cleong110 (Owner, Author) commented:


Inputs:

  • RGB Video?

Outputs:

  • For T2V retrieval, a sign video from a closed set; for V2T retrieval, a text from a closed set.

Datasets:

  • How2Sign
  • RWTH-PHOENIX-Weather 2014T
  • CSL-Daily

Key idea: it is not just video-to-text retrieval, it is also cross-lingual retrieval, so a CLIP-style contrastive approach can be applied to the visual features and the texts.

Key ideas include:

  • "Considering the linguistic characteristics of sign languages, we formulate sign language retrieval as a cross-lingual retrieval [6, 34, 60] problem in addition to a video-text retrieval [5, 24, 42–44, 58, 69] task."
  • SL retrieval is hard because: (1) it is hard to map between the two languages, since word orders do not line up; (2) data is small, e.g., How2Sign has only ~30k pairs; (3) many fine-grained features are needed to distinguish subtleties of the face, fingers... (4)
  • "We consider the linguistic rules (e.g., word order) of both sign languages and natural languages.... While contrasting the sign videos and the texts in a joint embedding space as achieved in most vision-language pre-training frameworks [5, 44, 58], we simultaneously identify the finegrained cross-lingual (sign-to-word) mappings between two types of languages via our proposed cross-lingual contrastive learning as shown in Figure 2." So it is Video-CLIP-style contrastive learning, but it also learns sign-to-word mappings?
  • The sign encoder has two parts: (1) a "domain-agnostic" part pre-trained on "large-scale" sign videos; (2) a "domain-aware" part fine-tuned on "pseudo-labeled data from target datasets."
  • "In this work, pseudo-labeling is served as our sign spotting approach to localize isolated signs in untrimmed videos from target sign language retrieval datasets."
  • "In order to effectively model long videos, we decouple our framework into two disjoint parts: (1) a sign encoder which adopts a sliding window on sign-videos to pre-extract their vision features; (2) a cross-lingual contrastive learning module which encodes the extracted vision features and their corresponding texts in a joint embedding space." So tl;dr, you CLIP on the pre-extract vision features and the texts.
  • Code and models are available at: https://github.com/FangyunWei/SLRT.
  • "Sign language recognition and translation (SLRT)... recognizing the arbitrary semantic meanings conveyed by sign languages." But lack of data! Hampers it badly! But retrieval from closed set, doable.
  • Early works of sign language retrieval primarily investigate query-by-example searching [4, 71], which queries individual instances with given sign examples.

[4] Vassilis Athitsos, Carol Neidle, Stan Sclaroff, Joan Nash, Alexandra Stefan, Ashwin Thangali, Haijing Wang, and Quan Yuan. Large lexicon project: American sign language video corpus and sign language indexing/retrieval algorithms. In Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, volume 2, pages 11–14, 2010.
[71] Shilin Zhang and Bo Zhang. Using revised string edit distance to sign language video retrieval. In International Conference on Computational Intelligence and Natural Computing, volume 1, pages 45–49. IEEE, 2010.
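
To make the cross-lingual contrastive learning bullet above more concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective on pre-extracted sign-video features and text features. This is my own illustration, not the authors' released code: the module names, feature dimensions, and pooling choices are assumptions, and the fine-grained sign-to-word mapping the paper adds on top is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignTextContrastive(nn.Module):
    """Sketch of a CLIP-style contrastive head over pre-extracted features."""

    def __init__(self, sign_feat_dim=1024, text_feat_dim=768, embed_dim=512):
        super().__init__()
        # Project pre-extracted sign features and text features into a joint embedding space.
        self.sign_proj = nn.Linear(sign_feat_dim, embed_dim)
        self.text_proj = nn.Linear(text_feat_dim, embed_dim)
        # Learnable temperature, initialized as in CLIP (log(1/0.07)).
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, sign_feats, text_feats):
        # sign_feats: (batch, sign_feat_dim) pooled features from the sign encoder.
        # text_feats: (batch, text_feat_dim) pooled features from a text encoder.
        v = F.normalize(self.sign_proj(sign_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()  # (batch, batch) similarity matrix
        labels = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: matched video-text pairs are pulled together,
        # all other pairs in the batch are pushed apart.
        loss_v2t = F.cross_entropy(logits, labels)
        loss_t2v = F.cross_entropy(logits.t(), labels)
        return (loss_v2t + loss_t2v) / 2
```

At retrieval time the same similarity matrix would be used directly: rank videos by similarity to a text query for T2V, and rank texts for V2T.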

@cleong110 (Owner, Author) commented:

Making a quick prompt for ChatGPT to provide suggestions:

I am writing a summary of an academic paper. Based on what I have provided below (abstract and/or attached full text), can you provide suggestions so I can rewrite my first version of the summary to be more concise and professional? 

Please go line by line, and explain your suggested changes, as well as any issues with writing quality or inaccuracies in my original summary. Be sure the suggestions you provide are accurate to the abstract or full text, if provided.

If I have missed a key contribution from the paper, please note that and suggest additions. If something is not clear, request clarification and I can provide additional snippets.

Please cite your sources for important details, e.g. "from the abstract" or "based on the full text". My summary is in markdown syntax and contains citations to a BibTex bibliography, the citations begin with "@". Please use the same citation style.

In addition, please follow the following style guide:

STYLE GUIDE
- **Citations**: Use the format `@authorYearKeyword` for inline citations, and `[@authorYearKeyword]` for citations wrapped in parentheses. To include multiple citations, use a semicolon (;) to separate them (e.g., "@authorYearKeyword;@authorYearKeyword").
- **Background & Related Work**: Use simple past tense to describe previous work (e.g., "@authorYearKeyword used...").
- **Abbreviations**: Define abbreviations in parentheses after the full term (e.g., Langue des Signes Française (LSF)).
- **Percentages**: Use the percent sign (%) with no space between the number and the sign (e.g., 95%).
- **Spacing**: Use a single space after periods and commas.
- **Hyphenation**: Use hyphens (-) for compound adjectives (e.g., video-to-pose).
- **Lists**: Use "-" for list items, followed by a space.
- **Code**: Use backticks (`) for inline code, and triple backticks (```) for code blocks.
- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater.
- **Contractions**: Avoid contractions (e.g., use "do not" instead of "don't").
- **Compound Words**: Use a forward slash (/) to separate alternative compound words (e.g., 2D / 3D).
- **Phrasing**: Prefer active voice over passive voice (e.g., "The authors used..." instead of "The work was used by the authors...").
- **Structure**: Present information in a logical order.
- **Capitalization**: Capitalize the first word of a sentence, and proper nouns.
- **Emphasis**: Use italics for emphasis by wrapping a word with asterisks (e.g., *emphasis*).
- **Quote marks**: Use double quotes (").
- **Paragraphs**: When a subsection header starts with ######, add "{-}" to the end of the subsection title to indicate a new paragraph. If it starts with #, ##, ###, ####, or ##### do not add the "{-}".
- **Mathematics**: Use LaTeX math notation (e.g., $x^2$) wrapped in dollar signs ($).

Here is the abstract: 
This work focuses on sign language retrieval—a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos, not only contain visual signals but also carry abundant semantic meanings by themselves due to the fact that sign languages are also natural languages. Considering this character, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed as cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue—sign language datasets are orders of magnitude smaller in scale than that of speech recognition. We alleviate this issue by adopting a domain-agnostic sign encoder pre-trained on large-scale sign videos into the target domain via pseudo-labeling. Our framework, termed as domain-aware sign language retrieval via Cross-lingual Contrastive learning or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on PHOENIX-2014T dataset. Code and models are available at: https://github.com/FangyunWei/SLRT.

Here is my summary: 

@Cheng2023CiCoSignLanguageRetrieval introduce a video-to-text and text-to-video retrieval method based on cross-lingual contrastive learning.
Inspired by previous work on transfer learning from sign-spotting / segmentation models [@jui-etal-2022-machine;@Duarte2022SignVideoRetrivalWithTextQueries], they adopt a "domain-agnostic" I3D encoder pretrained on large-scale sign language datasets for the sign-spotting task [@Varol2021ReadAndAttend].
Then, on target datasets with continuous signing videos, they apply that sign-spotting model with a sliding window to find high-confidence predictions.
Those predictions are then used to fine-tune a "domain-aware" sign-spotting encoder for the target dataset.
Combining the two encoders, they pre-extract features from sign language videos.
Cross-lingual contrastive learning [@Radford2021LearningTV] is then used to contrast feature / text pairs, mapping them to a joint embedding space.
Embeddings of matched pairs are pulled together and those of non-matched pairs are pushed apart.
They evaluate on the How2Sign [@dataset:duarte2020how2sign] and RWTH-PHOENIX-Weather 2014T [@cihan2018neural] datasets, improving substantially over the previous state-of-the-art method [@Duarte2022SignVideoRetrivalWithTextQueries].
In addition, they provide baseline retrieval results for the CSL-Daily dataset [@dataset:Zhou2021_SignBackTranslation_CSLDaily].
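
As a rough illustration of the sliding-window pseudo-labeling step described in the summary, here is a sketch under my own assumptions, not the released SLRT code; the `spotter` callable, window size, stride, and confidence threshold are all hypothetical.

```python
import torch

def spot_signs(spotter, video_frames, window_size=16, stride=8, threshold=0.8):
    """Slide a window over an untrimmed video and keep high-confidence isolated-sign predictions.

    spotter: a classifier over isolated signs (e.g., an I3D-style model); hypothetical here.
    video_frames: tensor of shape (num_frames, C, H, W).
    """
    pseudo_labels = []
    for start in range(0, video_frames.size(0) - window_size + 1, stride):
        clip = video_frames[start:start + window_size].unsqueeze(0)  # (1, T, C, H, W)
        with torch.no_grad():
            probs = torch.softmax(spotter(clip), dim=-1).squeeze(0)
        confidence, sign_id = probs.max(dim=-1)
        if confidence.item() >= threshold:
            # Keep this window as a pseudo-labeled isolated sign,
            # later used to fine-tune the domain-aware encoder.
            pseudo_labels.append({
                "start_frame": start,
                "end_frame": start + window_size,
                "sign_id": sign_id.item(),
                "confidence": confidence.item(),
            })
    return pseudo_labels
```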

@cleong110 (Owner, Author) commented:

merged
