Adding "BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization" #61

Merged
11 changes: 9 additions & 2 deletions src/index.md
@@ -953,14 +953,21 @@ Finally, they used this information to construct an animation system using lette

In this paradigm, rather than targeting a specific task (e.g. pose-to-text), the aim is to learn a generally useful Sign Language Understanding model or representation which can be applied or finetuned to specific downstream tasks.

@hu2023SignBertPlus introduce SignBERT+, a self-supervised pretraining method for sign language understanding (SLU) based on masked modeling of pose sequences.
This is an extension of their earlier SignBERT [@hu2021SignBert], with several improvements.
For pretraining they extract pose sequences from over 230k videos using MMPose [@mmpose2020].
They then perform multi-level masked modeling (joints, frames, clips) on these sequences, integrating a statistical hand model [@romero2017MANOHandModel] to constrain the decoder's predictions for anatomical realism and enhanced accuracy.
Validation on isolated SLR (MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR]), continuous SLR (RWTH-PHOENIX-Weather [@koller2015ContinuousSLR]), and SLT (RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]) demonstrates state-of-the-art performance.

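To make the multi-level masking concrete, below is a minimal NumPy sketch, assuming pose sequences shaped `(frames, joints, coordinates)`. The masking ratios, clip length, and zero-fill mask token are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def mask_pose_sequence(poses, joint_ratio=0.1, frame_ratio=0.1, clip_len=5, rng=None):
    """Mask a pose sequence at the joint, frame, and clip level.

    poses: array of shape (T, J, C) -- T frames, J joints, C coordinates.
    Returns a masked copy of the sequence and a boolean mask of shape (T, J).
    """
    if rng is None:
        rng = np.random.default_rng()
    T, J, _ = poses.shape
    masked = poses.copy()
    mask = np.zeros((T, J), dtype=bool)

    # Joint-level masking: hide random individual joints in random frames.
    mask |= rng.random((T, J)) < joint_ratio

    # Frame-level masking: hide every joint in a few randomly chosen frames.
    frames = rng.choice(T, size=max(1, int(T * frame_ratio)), replace=False)
    mask[frames, :] = True

    # Clip-level masking: hide a short contiguous run of frames.
    start = int(rng.integers(0, max(1, T - clip_len)))
    mask[start:start + clip_len, :] = True

    masked[mask] = 0.0  # zero stands in for a learned mask embedding
    return masked, mask
```

The pretraining objective is then to reconstruct the hidden joints from the visible context, with the hand-model prior constraining the decoder's predictions.
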
@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST (BERT Pre-training for Sign Language Recognition with Coupling Tokenization), a pre-training method based on masked modeling of pose sequences using a coupled tokenization scheme.
This method takes pose triplet units (left hand, right hand, and upper-body with arms) as inputs, each tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning] that are then coupled together.
Masked modeling is then applied, in which any or all components of the triplet (left hand, right hand, or upper-body) may be masked, so that the model learns hierarchical correlations among them.
Unlike @hu2023SignBertPlus, BEST does not mask multi-frame pose sequences or individual joints.
The authors validate their pre-training method on isolated sign recognition (ISR) tasks using MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
Besides pose-to-gloss, they also experiment with video-to-gloss tasks via fusion with I3D [@carreira2017quo].
Results on these datasets demonstrate state-of-the-art performance compared to previous methods and are comparable to those of SignBERT+ [@hu2023SignBertPlus].

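The coupled tokenization can be sketched as follows, assuming per-part pose features have already been extracted. The nearest-codebook lookup follows the standard vector-quantization recipe [@van_den_Oord_2017NeuralDiscreteRepresentationLearning]; all shapes, codebook sizes, and names here are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # features: (T, D), codebook: (K, D) -> token indices: (T,)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def couple_tokens(parts, codebooks):
    """Tokenize each triplet part, then couple the discrete codes per frame."""
    tokens = {name: quantize(feats, codebooks[name]) for name, feats in parts.items()}
    # One coupled unit per frame; during pre-training any subset of the three
    # components can be replaced by a mask token and predicted from the rest.
    return list(zip(tokens["left"], tokens["right"], tokens["body"]))

# Illustrative shapes: 16 frames, 64-dim part features, 256-entry codebooks.
rng = np.random.default_rng(0)
parts = {name: rng.normal(size=(16, 64)) for name in ("left", "right", "body")}
codebooks = {name: rng.normal(size=(256, 64)) for name in ("left", "right", "body")}
coupled = couple_tokens(parts, codebooks)  # 16 (left, right, body) code triplets
```
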
## Annotation Tools

##### ELAN - EUDICO Linguistic Annotator
279 changes: 196 additions & 83 deletions src/references.bib
@@ -357,6 +357,23 @@ @inproceedings{dataset:athitsos2008american
year = {2008}
}

@inproceedings{athitsos2010LargeLexiconIndexingRetrieval,
author = {Athitsos, Vassilis and Neidle, Carol and Sclaroff, Stan and Nash, Joan and Stefan, Alexandra and Thangali, Ashwin and Wang, Haijing and Yuan, Quan},
title = {Large Lexicon Project: {American} {Sign} {Language} Video Corpus and Sign Language Indexing/Retrieval Algorithms},
pages = {11--14},
editor = {Dreuw, Philippe and Efthimiou, Eleni and Hanke, Thomas and Johnston, Trevor and Mart{\'i}nez Ruiz, Gregorio and Schembri, Adam},
booktitle = {Proceedings of the {LREC2010} 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies},
maintitle = {7th International Conference on Language Resources and Evaluation ({LREC} 2010)},
publisher = {{European Language Resources Association (ELRA)}},
address = {Valletta, Malta},
day = {22--23},
month = may,
year = {2010},
language = {english},
url = {https://www.sign-lang.uni-hamburg.de/lrec/pub/10022.pdf}
}


@inproceedings{dataset:dreuw2008benchmark,
address = {Marrakech, Morocco},
author = {Dreuw, Philippe and
@@ -3152,108 +3169,102 @@ @inproceedings{sellam-etal-2020-bleurt
}

@article{hu2023SignBertPlus,
  author = {Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Li, Houqiang},
  title = {{SignBERT}+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year = {2023},
  volume = {45},
  number = {9},
  pages = {11221--11239},
  doi = {10.1109/TPAMI.2023.3269220}
}

@inproceedings{hu2021SignBert,
  author = {Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Wang, Yuechen and Li, Houqiang},
  title = {{SignBERT}: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2021},
  pages = {11087--11096}
}


@article{hu2021NMFAwareSLR,
  author = {Hu, Hezhen and Zhou, Wengang and Pu, Junfu and Li, Houqiang},
  title = {Global-Local Enhancement Network for {NMF}-Aware Sign Language Recognition},
  journal = {ACM Trans. Multimedia Comput. Commun. Appl.},
  year = {2021},
  issue_date = {August 2021},
  month = jul,
  volume = {17},
  number = {3},
  articleno = {80},
  numpages = {19},
  issn = {1551-6857},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  doi = {10.1145/3436754},
  url = {https://doi.org/10.1145/3436754}
}

@misc{yuan20172017,
  author = {Shanxin Yuan and Qi Ye and Guillermo Garcia-Hernando and Tae-Kyun Kim},
  title = {The 2017 Hands in the Million Challenge on {3D} Hand Pose Estimation},
  year = {2017},
  eprint = {1707.02237},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV}
}


@article{romero2017MANOHandModel,
  author = {Romero, Javier and Tzionas, Dimitrios and Black, Michael J.},
  title = {Embodied hands: modeling and capturing hands and bodies together},
  journal = {ACM Trans. Graph.},
  year = {2017},
  issue_date = {December 2017},
  month = nov,
  volume = {36},
  number = {6},
  articleno = {245},
  numpages = {17},
  issn = {0730-0301},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  doi = {10.1145/3130800.3130883},
  url = {https://doi.org/10.1145/3130800.3130883}
}

@misc{mmpose2020,
  author = {MMPose Contributors},
  title = {{OpenMMLab} Pose Estimation Toolbox and Benchmark},
  howpublished = {\url{https://github.com/open-mmlab/mmpose}},
  year = {2020}
}

@article{huang2019attention3DCNNsSLR,
  author = {Huang, Jie and Zhou, Wengang and Li, Houqiang and Li, Weiping},
  title = {Attention-Based {3D-CNNs} for Large-Vocabulary Sign Language Recognition},
  journal = {IEEE Transactions on Circuits and Systems for Video Technology},
  year = {2019},
  volume = {29},
  number = {9},
  pages = {2822--2832},
  doi = {10.1109/TCSVT.2018.2870740}
}


@article{koller2015ContinuousSLR,
  author = {Oscar Koller and Jens Forster and Hermann Ney},
  title = {Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers},
  journal = {Computer Vision and Image Understanding},
  year = {2015},
  volume = {141},
  pages = {108--125},
  note = {Pose \& Gesture},
  issn = {1077-3142},
  doi = {10.1016/j.cviu.2015.09.013},
  url = {https://www.sciencedirect.com/science/article/pii/S1077314215002088}
}

@@ -3267,3 +3278,105 @@ @inproceedings{dataset:starner_et_al_2023_PopSignASL_v1
volume = {36},
year = {2023}
}

@inproceedings{Cheng2023CiCoSignLanguageRetrieval,
author={Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang},
booktitle={2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning},
year={2023},
doi={10.1109/CVPR52729.2023.01823}
}


@inproceedings{Varol2021ReadAndAttend,
author={Varol, Gül and Momeni, Liliane and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Read and Attend: Temporal Localisation in Sign Language Videos},
year={2021},
doi={10.1109/CVPR46437.2021.01658}
}

@inproceedings{Zhang2010RevisedEditDistanceSignVideoRetrieval,
author={Shilin Zhang and Bo Zhang},
booktitle={2010 Second International Conference on Computational Intelligence and Natural Computing},
title={Using revised string edit distance to sign language video retrieval},
year={2010},
volume={1},
pages={45-49},
doi={10.1109/CINC.2010.5643895}
}

@inproceedings{Radford2021LearningTV,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
booktitle = {Proceedings of the 38th International Conference on Machine Learning},
pages = {8748--8763},
year = {2021},
editor = {Meila, Marina and Zhang, Tong},
volume = {139},
series = {Proceedings of Machine Learning Research},
month = {18--24 Jul},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v139/radford21a/radford21a.pdf},
url = {https://proceedings.mlr.press/v139/radford21a.html}
}


@inproceedings{Duarte2022SignVideoRetrivalWithTextQueries,
author={Duarte, Amanda and Albanie, Samuel and Giró-i-Nieto, Xavier and Varol, Gül},
booktitle={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Sign Language Video Retrieval with Free-Form Textual Queries},
year={2022},
pages={14074-14084},
doi={10.1109/CVPR52688.2022.01370}
}

@inproceedings{jui-etal-2022-machine,
title = "A Machine Learning-based Segmentation Approach for Measuring Similarity between Sign Languages",
author = "Jui, Tonni Das and
Bejarano, Gissella and
Rivas, Pablo",
booktitle = "Proceedings of the LREC2022 10th Workshop on the Representation and Processing of Sign Languages: Multilingual Sign Language Resources",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.signlang-1.15",
pages = "94--101"
}

@inproceedings{dataset:Zhou2021_SignBackTranslation_CSLDaily,
author={Zhou, Hao and Zhou, Wengang and Qi, Weizhen and Pu, Junfu and Li, Houqiang},
booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Improving Sign Language Translation with Monolingual Data by Sign Back-Translation},
year={2021},
pages={1316-1325},
doi={10.1109/CVPR46437.2021.00137}
}

@article{Zhao2023BESTPretrainingSignLanguageRecognition,
title = {{BEST}: {BERT} Pre-training for {S}ign Language Recognition with Coupling {T}okenization},
volume = {37},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/25470},
doi = {10.1609/aaai.v37i3.25470},
number = {3},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
author = {Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Shi, Jiaxin and Li, Houqiang},
year = {2023},
month = {Jun.},
pages = {3597-3605}
}

@inproceedings{van_den_Oord_2017NeuralDiscreteRepresentationLearning,
author = {van den Oord, Aaron and Vinyals, Oriol and Kavukcuoglu, Koray},
title = {Neural discrete representation learning},
year = {2017},
isbn = {9781510860964},
publisher = {Curran Associates Inc.},
address = {Red Hook, NY, USA},
booktitle = {Proceedings of the 31st International Conference on Neural Information Processing Systems},
pages = {6309–6318},
numpages = {10},
location = {Long Beach, California, USA},
series = {NIPS'17}
}