SignBERT+ (and SignBERT) #14


Closed · 11 of 23 tasks · Tracked by #2
cleong110 opened this issue May 27, 2024 · 18 comments


cleong110 commented May 27, 2024

Given that SignBERT+ is a direct "sequel" to SignBERT, I think it would be good to cover both in one PR.

SignBERT+: https://ieeexplore.ieee.org/document/10109128
SignBERT: https://openaccess.thecvf.com/content/ICCV2021/html/Hu_SignBERT_Pre-Training_of_Hand-Model-Aware_Representation_for_Sign_Language_Recognition_ICCV_2021_paper.html

Checklist

  • Sync, pull, and merge master first!
  • Search for the correct citation on Semantic Scholar.
  • Make a new branch ("You should always branch out from master").
  • Add the citation to references.bib. If it is a dataset, prepend the key with dataset:. Exclude wordy abstracts. (The Better BibTeX extension for Zotero can exclude keys.)
  • Check for egregious {} in the BibTeX.
  • Write a summary and add it to the appropriate section in index.md.
  • Make sure the citation keys match.
  • Add a newline after each sentence in a paragraph. It still renders as one paragraph but makes git diffs easier.
  • ChatGPT 3.5 can suggest rewrites and improve writing.
  • Check that acronyms are explained.
  • Copy-paste into https://dillinger.io/ and see if it looks OK.
  • Make a PR from the branch on my fork to master on the source repo.

PR:

  • Sync master on both forks.
  • git pull master on the local clone.
  • git merge master into the branch.
  • git push.
  • THEN make the PR.

Writing/style:

Additional:

cleong110 changed the title from "SignBERT+, https://ieeexplore.ieee.org/document/10109128" to "SignBERT+ (and SignBERT)" on May 27, 2024
cleong110 commented:

Here's what the authors have to say about the difference between SignBERT and SignBERT+:

“This work is an extension of the conference paper [5] with improvement in a number of aspects. 1) Considering the characteristics of sign language, we further introduce spatial-temporal global position encoding into embedding, along with the masked clip modeling for modeling temporal dynamics. Those new techniques further bring a notable performance gain. 2) We extend the original framework to two more downstream tasks in video-based sign language understanding, i.e., continuous SLR and SLT.” (Hu et al., 2023, p. 2)
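
For my own reference, here is a minimal, generic sketch of adding a sinusoidal temporal position encoding to per-frame pose tokens. The paper's "spatial-temporal global position encoding" is its own design, so treat this purely as an illustration of the general idea; the frame count and token dimension below are made up.

```python
# Generic transformer-style temporal position encoding (illustration only).
import numpy as np

def sinusoidal_positions(num_frames: int, dim: int) -> np.ndarray:
    """Standard sinusoidal position encoding, shape (num_frames, dim)."""
    positions = np.arange(num_frames)[:, None]                      # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    enc = np.zeros((num_frames, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

# Example: 30 pose-frame tokens of dimension 256 (random stand-ins here).
tokens = np.random.randn(30, 256)
tokens_with_time = tokens + sinusoidal_positions(30, 256)
print(tokens_with_time.shape)  # (30, 256)
```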


cleong110 commented May 27, 2024

OK, what is the main thing to know about these two papers? Self-supervised pretraining with a sign-language-specific prior, basically. They are incorporating domain knowledge.

tl;dr: self-supervised pose-sequence pretraining designed specifically for sign language processing (SLP). You can then take the pretrained encoder and finetune it on downstream tasks like isolated SLR, continuous SLR, or SLT.

They do try it out on all three, including Sign2Text.

They attribute the "S2T setting" to N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, “Neural sign language translation,” in IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7784–7793.
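
To make the recipe concrete, here is a hedged PyTorch sketch of the overall pretrain-then-finetune flow as I understand it. All module names, sizes, the crude masking, and the reconstruction target are placeholders of mine, not the authors' code.

```python
# Sketch: self-supervised pretraining on pose sequences, then task finetuning.
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    def __init__(self, in_dim=133 * 2, d_model=256):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, poses):                   # poses: (batch, frames, 133*2)
        return self.encoder(self.embed(poses))  # (batch, frames, d_model)

encoder = PoseEncoder()

# 1) Pretraining: reconstruct the clean poses from a corrupted copy.
recon_head = nn.Linear(256, 133 * 2)
poses = torch.randn(2, 30, 133 * 2)             # fake batch: 2 clips, 30 frames
masked = poses.clone()
masked[:, 10:15, :] = 0.0                       # crude stand-in for masking
loss_pretrain = nn.functional.mse_loss(recon_head(encoder(masked)), poses)

# 2) Finetuning: keep the encoder, swap in a task head (e.g., isolated SLR).
cls_head = nn.Linear(256, 500)                  # e.g., 500 glosses for SLR500
logits = cls_head(encoder(poses).mean(dim=1))   # (batch, 500)
print(loss_pretrain.item(), logits.shape)
```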


cleong110 commented May 27, 2024

Inputs: 2D pose sequences from MMPose, 133 keypoints per frame.
Outputs: embeddings, basically.
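
A tiny sketch of the data layout this implies, assuming one flattened "visual token" per frame; the 256-dim embedding size and the random "encoder" are arbitrary placeholders, not from the paper.

```python
# Data layout: (frames, 133 keypoints, 2) -> one token per frame -> embeddings.
import numpy as np

num_frames, num_keypoints = 30, 133
pose_sequence = np.random.rand(num_frames, num_keypoints, 2)   # (x, y) per keypoint

tokens = pose_sequence.reshape(num_frames, num_keypoints * 2)  # (30, 266): one token per frame
embeddings = tokens @ np.random.rand(num_keypoints * 2, 256)   # stand-in for the learned encoder
print(tokens.shape, embeddings.shape)  # (30, 266) (30, 256)
```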


cleong110 commented May 27, 2024

Datasets:

  • Pretraining: HANDS17, How2Sign, and NMFs-CSL, plus (apparently) training data from all the other datasets below.
  • Isolated SLR: "MSASL [17], WLASL [18], and SLR500 [16]"
  • Continuous SLR: "RWTH-Phoenix [14] and RWTH-PhoenixT [33]."
  • SLT: "RWTH-PhoenixT"

"During the pre-training stage, the utilized data includes the training data from all aforementioned sign datasets, along with other collected data from [84], [85]. In total, the pre-training data volume is 230,246 videos."

[84] H. Hu, W. Zhou, J. Pu, and H. Li, “Global-local enhancement network for NMFs-aware sign language recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 17, no. 3, pp. 1–18, 2021.

[85] A. Duarte, S. Palaskar, L. Ventura, D. Ghadiyaram, K. DeHaan, F. Metze, J. Torres, and X. Giro-i Nieto, “How2sign: a large-scale multimodal dataset for continuous american sign language,” in IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2735–2744.

S. Yuan, Q. Ye, G. Garcia-Hernando, and T.-K. Kim, “The 2017 hands in the million challenge on 3D hand pose estimation,” arXiv, pp. 1–7, 2017.


cleong110 commented May 27, 2024

Official citation from IEEE; I am using hu2023SignBertPlus as the key:

@ARTICLE{10109128,
  author={Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Li, Houqiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding}, 
  year={2023},
  volume={45},
  number={9},
  pages={11221-11239},
  keywords={Task analysis;Assistive technologies;Gesture recognition;Visualization;Bit error rate;Transformers;Hidden Markov models;Self-supervised pre-training;masked modeling strategies;model-aware hand prior;sign language understanding},
  doi={10.1109/TPAMI.2023.3269220}}

cleong110 commented:

Official citation for the NMFs-CSL dataset, but using our normal key style:

@article{hu2021NMFAwareSLR,
	author = {Hu, Hezhen and Zhou, Wengang and Pu, Junfu and Li, Houqiang},
	title = {Global-Local Enhancement Network for NMF-Aware Sign Language Recognition},
	year = {2021},
	issue_date = {August 2021},
	publisher = {Association for Computing Machinery},
	address = {New York, NY, USA},
	volume = {17},
	number = {3},
	issn = {1551-6857},
	url = {https://doi.org/10.1145/3436754},
	doi = {10.1145/3436754},
	journal = {ACM Trans. Multimedia Comput. Commun. Appl.},
	month = {jul},
	articleno = {80},
	numpages = {19}
}

cleong110 commented:

Looking for the official citation for HANDS17:

Also, HANDS2019 is a thing.


cleong110 commented May 27, 2024

Oh, and here's the official citation for SignBERT, taken from https://openaccess.thecvf.com/content/ICCV2021/html/Hu_SignBERT_Pre-Training_of_Hand-Model-Aware_Representation_for_Sign_Language_Recognition_ICCV_2021_paper.html

@InProceedings{Hu_2021_ICCV,
    author    = {Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Wang, Yuechen and Li, Houqiang},
    title     = {SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {11087-11096}
}

I will use hu2021SignBert as the key

cleong110 commented:

[figure from the paper]

The pretraining strategy is in Section 3.2. They randomly pick some portion of the pose tokens and do one of the following (rough sketch after the list):

  • Mask out or jitter some number of the joints.
  • Drop out the whole frame. The idea is that on complex backgrounds, pose detectors sometimes output nothing at all.
  • Drop out a whole short clip, to handle cases where motion blur makes pose detection cut out for a short time.
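
Here is the rough sketch mentioned above of the three corruption types, applied to a pose array of shape (frames, keypoints, 2). The ratios, jitter magnitude, and clip length are invented for illustration, not the paper's values.

```python
# Multi-level corruption of a pose sequence: joints, frames, and a short clip.
import numpy as np

rng = np.random.default_rng(0)

def corrupt(poses, joint_ratio=0.15, frame_ratio=0.05, clip_len=5):
    poses = poses.copy()
    num_frames, num_joints, _ = poses.shape

    # 1) Joint-level: zero out or jitter a random subset of joints.
    selected = rng.random((num_frames, num_joints)) < joint_ratio
    jitter = rng.random((num_frames, num_joints)) < 0.5
    poses[selected & ~jitter] = 0.0
    poses[selected & jitter] += rng.normal(0.0, 0.02, size=poses[selected & jitter].shape)

    # 2) Frame-level: drop whole frames (detector returns nothing).
    dropped = rng.random(num_frames) < frame_ratio
    poses[dropped] = 0.0

    # 3) Clip-level: drop a short contiguous run of frames (e.g., motion blur).
    start = rng.integers(0, max(1, num_frames - clip_len))
    poses[start:start + clip_len] = 0.0
    return poses

corrupted = corrupt(rng.random((30, 133, 2)))
print(corrupted.shape)  # (30, 133, 2)
```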

What's a "token"?

cleong110 commented:

Apparently they use the MANO hand model in the decoder? They call it a "hand-model-aware decoder" and cite https://dl.acm.org/doi/abs/10.1145/3130800.3130883.

cleong110 commented:

As far as I can tell, they treat each pose in the sequence as a token. So if the pose estimation gives them 30 poses for 30 frames, that is 30 tokens.


cleong110 commented May 27, 2024

OK, I think it's time. Let's build our initial summary and prompt ChatGPT for help.

Here's my initial version, which I add to the "pose-to-text" section.

@hu2023SignBertPlus introduce SignBERT+, a hand-model-aware self-supervised pretraining method which they validate on sign language recognition (SLR) and sign language translation (SLT).
Collecting over 230k videos from a number of datasets, they extract pose sequences using MMPose [@mmpose2020].
They then treat these pose sequences as sequences of visual tokens, and pretrain their encoder through masking of joints, frames, and short clips.
They incorporate a statistical hand model [@romero2017MANOHandModel] to constrain their decoder.
They finetune and validate on a number of downstream tasks including:
- isolated SLR on MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], and SLR500 [@huang2019attention3DCNNsSLR]
- continuous SLR and SLT using RWTH-PHOENIX-Weather [@koller2015ContinuousSLR] and RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]
Results show state-of-the-art performance.

cleong110 commented:

Building my prompt:

I am writing a summary of an academic paper. Based on what I have provided below, can you rewrite my first version of the summary to be more concise and professional? Please provide 3 alternative rewrites, and explain your suggested changes, as well as any issues with writing quality or inaccuracy in my original summary. Be sure the summaries you provide are accurate to the figure and the abstract. If I have missed a key contribution from the paper, please note that and suggest additions. If something is not clear, request clarification and I can provide additional snippets. Please cite your sources for important details, e.g. "from the abstract" or "based on the full text". My summary is in markdown syntax and contains citations to a BibTeX bibliography; the citations begin with "@". Please use the same citation style.

In addition, please follow the following style guide:

STYLE GUIDE
- **Citations**: Use the format `@authorYearKeyword` for inline citations, and `[@authorYearKeyword]` for citations wrapped in parentheses. To include multiple citations, use a semicolon (;) to separate them (e.g., "@authorYearKeyword;@authorYearKeyword").
- **Background & Related Work**: Use simple past tense to describe previous work (e.g., "@authorYearKeyword used...").
- **Abbreviations**: Define abbreviations in parentheses after the full term (e.g., Langue des Signes Française (LSF)).
- **Percentages**: Use the percent sign (%) with no space between the number and the sign (e.g., 95%).
- **Spacing**: Use a single space after periods and commas.
- **Hyphenation**: Use hyphens (-) for compound adjectives (e.g., video-to-pose).
- **Lists**: Use "-" for list items, followed by a space.
- **Code**: Use backticks (`) for inline code, and triple backticks (```) for code blocks.
- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater.
- **Contractions**: Avoid contractions (e.g., use "do not" instead of "don't").
- **Compound Words**: Use a forward slash (/) to separate alternative compound words (e.g., 2D / 3D).
- **Phrasing**: Prefer active voice over passive voice (e.g., "The authors used..." instead of "The work was used by the authors...").
- **Structure**: Present information in a logical order.
- **Capitalization**: Capitalize the first word of a sentence, and proper nouns.
- **Emphasis**: Use italics for emphasis by wrapping a word with asterisks (e.g., *emphasis*).
- **Quote marks**: Use double quotes (").
- **Paragraphs**: When a subsection header starts with ######, add "{-}" to the end of the subsection title to indicate a new paragraph. If it starts with #, ##, ###, ####, or ##### do not add the "{-}".
- **Mathematics**: Use LaTeX math notation (e.g., $x^2$) wrapped in dollar signs ($).


All right, here is information about the paper I am trying to summarize:

Paper Title: "SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding"

Abstract: 
"Hand gesture serves as a crucial role during the expression of sign language. Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource and suffer limited interpretability. In this paper, we propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated. In our framework, the hand pose is regarded as a visual token, which is derived from an off-the-shelf detector. Each visual token is embedded with gesture state and spatial-temporal position encoding. To take full advantage of current sign data resource, we first perform self-supervised learning to model its statistics. To this end, we design multi-level masked modeling strategies (joint, frame and clip) to mimic common failure detection cases. Jointly with these masked modeling strategies, we incorporate model-aware hand prior to better capture hierarchical context over the sequence. After the pre-training, we carefully design simple yet effective prediction heads for downstream tasks. To validate the effectiveness of our framework, we perform extensive experiments on three main SLU tasks, involving isolated and continuous sign language recognition (SLR), and sign language translation (SLT). Experimental results demonstrate the effectiveness of our method, achieving new state-of-the-art performance with a notable gain."

Full Text: see attached PDF

My Summary:  
"@hu2023SignBertPlus introduce SignBERT+, a hand-model-aware self-supervised pretraining method which they validate on sign language recognition (SLR) and sign language translation (SLT).
Collecting over 230k videos from a number of datasets, they extract pose sequences using MMPose [@mmpose2020].
They then treat these pose sequences as sequences of visual tokens, and pretrain their encoder through masking of joints, frames, and short clips.
When embedding pose sequences they use temporal positional encodings.
They incorporate a statistical hand model [@romero2017MANOHandModel] to constrain their decoder.
They finetune and validate on a number of downstream tasks:

- isolated SLR on MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], and SLR500 [@huang2019attention3DCNNsSLR].
- Continuous SLR using RWTH-PHOENIX-Weather [@koller2015ContinuousSLR] and RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural].
- SLT using RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural].

Results show state-of-the-art performance on these tasks."

All right, remember my initial instructions; please go ahead and provide the requested concise, professional rewrite suggestions for my summary, with the requested explanations and citations, following the style guide. In particular, I feel that my summary lacks clarity on the "hand-model-aware decoder".


cleong110 commented May 28, 2024

Resulting ChatGPT conversation: https://chatgpt.com/share/1cf76e17-b778-49c4-9887-d12770fa922a. The main gist of the suggestions is to:

  • use some variation on "a statistical hand model [@romero2017MANOHandModel] to enhance the decoder's accuracy"
  • integrate the three tasks into one sentence instead of a bullet list
  • say something like "use multi-level masking (joints, frames, clips)"
  • phrase it as "extract pose sequences ... treat these as visual tokens"

cleong110 commented:

Metrics (rough sketches of PCK and WER after the list):

  • Pose estimation: Percentage of Correct Keypoints (PCK) and the Area Under the Curve (AUC) of PCK over thresholds ranging from 20 to 40 pixels.
  • Isolated SLR: "We utilize the accuracy metrics, including per-instance (P-I) and per-class (P-C) metrics. P-I and P-C denote the average accuracy over all the instances and classes, respectively. Following previous works [5], [13], we report Top-1 and Top-5 P-I and P-C metrics... Since each class in SLR500 contains the same number of samples, P-I is equal to P-C and we only report one of them."
  • Continuous SLR: "[For] continuous SLR, we utilize Word Error Rate (WER) as the evaluation metric."
  • SLT: BLEU and ROUGE, specifically the ROUGE-L F1 score.
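
As promised above, quick self-contained sketches of PCK and WER, using the generic textbook definitions rather than the paper's evaluation code:

```python
# PCK at a pixel threshold, and WER as word-level edit distance / reference length.
import numpy as np

def pck(pred, gt, threshold_px=20):
    """Fraction of keypoints whose predicted position is within threshold_px of ground truth."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # (..., keypoints)
    return float((dists <= threshold_px).mean())

def wer(reference, hypothesis):
    """Word Error Rate: Levenshtein distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[-1, -1] / max(1, len(ref))

print(pck(np.zeros((21, 2)), np.full((21, 2), 5.0)))   # 1.0 (all within 20px)
print(wer("im norden regen", "im norden viel regen"))   # 0.333... (one insertion)
```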

cleong110 commented:

OK, I think I get it about MANO and how that helps, based on figure 2:
[figure 2 from the paper]

So basically it guides/hints the model during pretraining to reconstruct the masked poses more accurately.
"Hey I dropped some joints. But here's a statistical hand model of how human hands are in real life. Knowing that, can you reconstruct properly?"


cleong110 commented May 28, 2024

OK, rewriting/synthesizing...
@hu2023SignBertPlus introduce SignBERT+, a self-supervised pretraining method for sign language understanding (SLU) incorporating a hand-model-aware approach.
They extract pose sequences from over 230k videos using MMPose [@mmpose2020], treating these as visual tokens embedded with temporal positional encodings.
They pretrain using multi-level masked modeling (joints, frames, clips) and integrate a statistical hand model [@romero2017MANOHandModel] to enhance the decoder's accuracy and constrain its predictions for anatomical realism.
Validation on isolated SLR (MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR]), continuous SLR (RWTH-PHOENIX-Weather [@koller2015ContinuousSLR]), and SLT (RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]) demonstrates state-of-the-art performance.

cleong110 commented:

merged
