
BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization #19


Closed · 12 of 21 tasks · Tracked by #2
cleong110 opened this issue Jun 4, 2024 · 27 comments

Comments

cleong110 commented Jun 4, 2024

https://ojs.aaai.org/index.php/AAAI/article/view/25470

  • sync, pull and merge master first!
  • Search for the correct citation on Semantic Scholar
  • Make a new branch ("You should always branch out from master")
  • Add citation to references.bib. If dataset, prepend with dataset:. Exclude wordy abstracts. (The Better BibTeX extension for Zotero can exclude keys.)
  • Check for egregious {} in the bibtex
  • write a summary and add to the appropriate section in index.md.
  • Make sure the citation keys match.
  • Add a newline after each sentence in a paragraph. It still renders as one paragraph, but makes git diffs easier.
  • ChatGPT 3.5 can suggest rewrites and improve writing.
  • Check if acronyms are explained
  • Copy-Paste into https://dillinger.io/, see if it looks OK
  • Make a PR from the branch on my fork to master on the source repo

PR:

  • sync master of both forks
  • git pull master on local
  • git merge master on branch
  • git push
  • THEN make the PR

Writing/style:

cleong110 changed the title from "BEST" to "BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization" on Jun 4, 2024
cleong110 commented:

Official Citation:

@article{Zhao_Hu_Zhou_Shi_Li_2023, title={BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization}, volume={37}, url={https://ojs.aaai.org/index.php/AAAI/article/view/25470}, DOI={10.1609/aaai.v37i3.25470}, abstractNote={In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition~(SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally locating in continuous space, which prevents the direct adoption of the BERT cross entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture / body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.}, number={3}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, author={Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Shi, Jiaxin and Li, Houqiang}, year={2023}, month={Jun.}, pages={3597-3605} }

cleong110 commented Jun 4, 2024

Reformatted:

@article{Zhao2023BESTPretrainingSignLanguageRecognition,
  title        = {BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization},
  volume       = {37},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/25470},
  doi          = {10.1609/aaai.v37i3.25470},
  number       = {3},
  journal      = {Proceedings of the AAAI Conference on Artificial Intelligence},
  author       = {Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Shi, Jiaxin and Li, Houqiang},
  year         = {2023},
  month        = {Jun.},
  pages        = {3597-3605}
}

cleong110 commented:

Abstract:

In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition~(SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally locating in continuous space, which prevents the direct adoption of the BERT cross entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture / body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.


cleong110 commented Jun 4, 2024

cf. #14

[SignBERT+ figure]

SignBERT+ uses ONLY hand poses. They say: "We organize the pre-extracted 2D poses of both hands as the visual token sequence."

Also different: SignBERT+ talks about joint, frame, and clip masking, and it sounds like BEST doesn't do different levels of masking, just the frame-level?

cleong110 commented Jun 4, 2024

Back to BEST:

Our contributions are summarized as follows,
• We propose a self-supervised pre-trainable framework. It leverages the BERT success, jointly with the specific design for the sign language domain.

• We organize the main hand and body movement as the pose triplet unit and propose the masked unit modeling (MUM) pretext task. To utilize the BERT objective, we generate the pseudo label for this task via coupling tokenization on the pose triplet unit.

• Extensive experiments on downstream SLR validate the effectiveness of our proposed method, achieving new state-of-the-art performance on four benchmarks with a notable gain.

cleong110 commented:

Datasets:

We conduct experiments on four public sign language datasets, i.e., NMFs-CSL (Hu et al. 2021b), SLR500 (Huang et al. 2018), WLASL (Li et al. 2020a) and MSASL (Joze and Koller 2018).


cleong110 commented Jun 4, 2024

[results table from the paper]
They compare with SignBERT (but not SignBERT+?) Yeah, they cite the 2021 one.

Global-local enhancement network for NMF-aware sign language recognition is "HMA".

cleong110 commented:

Honestly, I think this one should not be merged until #14 is merged

cleong110 commented:

Interesting:

Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally located in continuous space, which prevents the direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit.

OK, what's that mean?

cleong110 commented:

However, the main obstacle to leverage its success in video SLR is the different characteristics of the input signal. In NLP, the input word token is discrete and pre-defined with high semantics. In contrast, the video signal of sign language is continuous with the spatial and temporal dimensions. This signal is quite low-level, making the original BERT objective not applicable. Besides, since the sign language video is mainly characterized by hand and body movements, the direct adoption of the BERT framework may not be optimal.

cleong110 commented:

Basically, our framework contains two stages, i.e., self-supervised pre-training and downstream fine-tuning. During pre-training, we propose the masked unit modeling (MUM) pretext task to capture the context cues. The input hand or body unit embedding is randomly masked, and then the framework reconstructs the masked unit from this corrupted input sequence. Similar to BERT, self-reconstruction is optimized via the cross-entropy objective. To this end, we jointly tokenize the pose triplet unit as the pseudo label, which represents the gesture/body state. After pre-training, the pre-trained Transformer encoder is fine-tuned with the newly added prediction head to perform the SLR task.
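To make that concrete, here's a minimal sketch of the BERT-style objective as described: a transformer encoder predicts the tokenizer's discrete pseudo label at each masked position, and cross-entropy is computed only there. All names, sizes, and the 30% mask rate are my assumptions, not the paper's:

```python
import torch
import torch.nn as nn

vocab_size, d_model, T = 512, 256, 16                  # assumed sizes

# Transformer encoder plus a head over the tokenizer's codebook vocabulary.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=4,
)
head = nn.Linear(d_model, vocab_size)

corrupted = torch.randn(2, T, d_model)                 # embedded, partially masked units
pseudo_labels = torch.randint(0, vocab_size, (2, T))   # discrete codes from the tokenizer
is_masked = torch.rand(2, T) < 0.3                     # which positions were corrupted

# As in BERT, the loss is cross-entropy at the masked positions only.
logits = head(encoder(corrupted))                      # (2, T, vocab_size)
loss = nn.functional.cross_entropy(logits[is_masked], pseudo_labels[is_masked])
loss.backward()
```

For the fine-tuning stage, the same encoder would be kept and `head` swapped for a newly added classification layer over the sign vocabulary.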

cleong110 commented:

"Pose Triplet Unit"?

cleong110 commented Jun 5, 2024

Oh, this seems important: they're using a d-VAE. cf. sign-language-processing#37

The tokenization provides pseudo labels for our designed pretext task during pre-training. We utilize a discrete variational autoencoder (d-VAE) to jointly convert the pose triplet unit into the triplet tokens (body, left and right hand), motivated by VQ-VAE (Van Den Oord, Vinyals et al. 2017).
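Roughly, the quantization works like VQ-VAE: encode the continuous pose part, then snap the latent to its nearest codebook entry, and that entry's index is the discrete pseudo label. A sketch of just that lookup (the d-VAE training itself, with its reconstruction loss and straight-through gradients, is omitted, and every name and dimension here is my assumption):

```python
import torch
import torch.nn as nn

class PoseTokenizer(nn.Module):
    """VQ-VAE-style tokenizer sketch: encode a continuous pose part,
    then look up the nearest codebook entry to get a discrete token."""
    def __init__(self, pose_dim=42, latent_dim=64, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, pose):                     # pose: (B, T, pose_dim)
        z = self.encoder(pose)                   # (B, T, latent_dim)
        # L2 distance from each frame's latent to every codebook entry.
        codes = self.codebook.weight[None].expand(z.size(0), -1, -1)
        dists = torch.cdist(z, codes)            # (B, T, codebook_size)
        return dists.argmin(dim=-1)              # (B, T) discrete pseudo labels

# One tokenizer per part of the triplet (left hand, right hand, body).
left_hand = torch.randn(2, 16, 42)               # 2 clips, 16 frames, 21 joints x 2D
tokens = PoseTokenizer()(left_hand)              # (2, 16) integer tokens
```

The "coupling" part, as I understand it, is that the three parts are tokenized jointly rather than independently, so a hand token is informed by the body it's attached to.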



cleong110 commented:

Our designed pretext task is MUM, which aims to exploit the hierarchical correlation context among internal and external triplet pose units. Given a pose sequence with a triplet pose unit of length T, we first randomly choose the α·T frames to process the mask operation. For clarification, we define three parts of the pose triplet unit as $f^{l}_{\text{sign},t}$, $f^{r}_{\text{sign},t}$ and $f^{b}_{\text{sign},t}$, respectively. If a unit is masked, a learnable masked token $e_{\text{mask}} \in \mathbb{R}^{D_{\text{part}}}$ is utilized to replace each part of the triplet unit with 50% probability. Therefore, the masked triplet unit includes three masking cases: only hand masked, only body masked and hand-body masked.
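As I read it, the procedure is: pick α·T frames, then within each chosen frame replace each part of the triplet with the learned mask token with 50% probability. A sketch under those assumptions (the re-draw when no part gets selected is my guess, not something the paper specifies):

```python
import torch

def mask_triplet_units(units, e_mask, alpha=0.3, p_part=0.5):
    """units: (T, 3, D) -- per frame: (left hand, right hand, body) embeddings.
    e_mask: (D,) learnable mask token. alpha is the fraction of frames masked."""
    T = units.size(0)
    corrupted = units.clone()
    frame_idx = torch.randperm(T)[:int(alpha * T)]       # frames chosen for masking
    for t in frame_idx:
        part_mask = torch.rand(3) < p_part               # mask each part w.p. 50%
        if not part_mask.any():                          # my assumption: re-draw so the
            part_mask[torch.randint(0, 3, (1,))] = True  # chosen frame is corrupted
        corrupted[t, part_mask] = e_mask                 # hand-only / body-only / hand-body
    return corrupted, frame_idx

units = torch.randn(16, 3, 64)                           # 16 frames, 3 parts, assumed D=64
corrupted, which_frames = mask_triplet_units(units, torch.zeros(64))
```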

cleong110 commented:

OK, in my own words now, real informally:

So the thing to know about BEST is that they wanted to do BERT-style masked language modeling, but, you know, BERT assumes you've already got discrete, semantically meaningful tokens.
So they were, like, well we've got left hand, right hand, body (no face keypoints, mind you!), let's organize those into triplet units and couple them together.
OK so then what? Well, van den Oord and Vinyals wrote "Neural discrete representation learning", which lets you take continuous signals and make them into discrete codes, with like a codebook and stuff.
So they use one of those to make a tokenizer: they put the coupled triplets into it and get discrete tokens out.
OK then what? Well then you've got the discrete tokens, and you mask hand, body, or both, and the transformer has to reconstruct the correct hand position, or body position, or whatever from surrounding context.
"You shall know a ~~word~~ handshape from the company it keeps," I suppose.
(and you use positional encodings to inform the model about temporal stuff, SignBERT+ does that too)
And then they tried it on Isolated SLR and it seemed to work pretty good.
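On that positional-encoding point, the standard sinusoidal encoding is enough to show the idea: each frame embedding gets a position-dependent signal added so the transformer knows frame order. A generic sketch, not BEST's or SignBERT+'s exact code:

```python
import torch

def sinusoidal_positional_encoding(T, d_model):
    """Standard sin/cos positional encoding (Vaswani et al. 2017)."""
    pos = torch.arange(T, dtype=torch.float32)[:, None]   # (T, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)  # even dimensions
    angles = pos / (10000.0 ** (i / d_model))             # (T, d_model/2)
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

frames = torch.randn(16, 256)                             # assumed frame embeddings
frames = frames + sinusoidal_positional_encoding(16, 256)
```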

cleong110 commented:

Also let's do a quick compare/contrast...

SignBERT+ results:
[SignBERT+ results table]

BEST results:
[BEST results table]

WINNER: SignBERT+

cleong110 commented:

WLASL

SignBERT+
[SignBERT+ WLASL results table]

BEST
[BEST WLASL results table]

WINNER: SignBERT+ again

cleong110 commented:

SLR500:

SignBERT+
[SignBERT+ SLR500 results table]
BEST
[BEST SLR500 results table]

WINNER: TIE

cleong110 commented Jun 11, 2024

OK, here's the transcript of the ChatGPT-assisted rewrite process: https://chatgpt.com/share/f42199e8-bd03-4ec3-9913-e5e832bb2885

My original summary was flawed in a number of ways.

@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST, a pretraining method based on masked modeling of pose sequences using a coupled tokenization scheme.
The method takes in pose triplet units (left hand, right hand, and upper-body with arms) as inputs.
The pose for each part of the triplet is tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning].
Then masked modeling is employed: any or all of the three parts may be masked, e.g. left hand, or right hand, or body+hand, or all of them...
Unlike @hu2023SignBertPlus, they do not mask multi-frame sequences ("clips") or sub-frame portions of a pose unit (joints).
They validate their pretraining method on isolated sign recognition (ISR) (MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR] and NMFs-CSL [@hu2021NMFAwareSLR]).
They experiment with both pose-to-gloss and video-to-gloss via fusion with I3D [@carreira2017quo].
Results on these datasets are SOTA compared to previous methods, and quite similar to those of SignBERT+ [@hu2023SignBertPlus].

Here are the first rewrite suggestions:

@Zhao2023BESTPretrainingSignLanguageRecognition introduced BEST, a pre-training method for sign language recognition based on masked modeling of pose sequences using a coupled tokenization scheme. 
The method uses pose triplet units, consisting of the left hand, right hand, and upper-body with arms, as inputs. 
Each part of the pose triplet is tokenized into discrete codes to facilitate the modeling process [@van_den_Oord_2017NeuralDiscreteRepresentationLearning]. 
Masked modeling is then employed, where any or all of the three parts may be masked, such as the left hand, right hand, or upper-body with hands. 
Unlike @hu2023SignBertPlus, BEST does not mask multi-frame sequences ("clips") or sub-frame portions of a pose unit (joints). 
They validated their pre-training method on isolated sign recognition (ISR) datasets: MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
They experimented with both pose-to-gloss and video-to-gloss tasks, incorporating fusion with I3D [@carreira2017quo]. 
The results on these datasets demonstrated state-of-the-art (SOTA) performance compared to previous methods and were similar to those achieved by SignBERT+ [@hu2023SignBertPlus].

Here's the second:

Zhao et al. (2023) introduce BEST, a pretraining framework designed to enhance sign language recognition (SLR) by leveraging BERT's masked modeling approach and a novel coupled tokenization scheme. 
The method uses pose triplet units, which include the left hand, right hand, and upper-body (including arms), as input data. 
Each component of the pose triplet unit is tokenized into discrete codes, following the approach by van den Oord et al. (2017). 
Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked to learn the hierarchical correlation among them. 
Unlike SignBERT+ by Hu et al. (2023), the BEST framework does not mask multi-frame sequences ("clips") or sub-frame portions (joints) of a pose unit. 
The authors validate their pretraining method on isolated sign recognition (ISR) tasks using the MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR] datasets.
The authors conduct experiments on both pose-to-gloss and video-to-gloss tasks by integrating their method with the Inflated 3D ConvNet (I3D) [@carreira2017quo]. 
The results demonstrate state-of-the-art (SOTA) performance on all evaluated datasets, showing notable improvements over previous methods and comparable results to SignBERT+ [@hu2023SignBertPlus].

cleong110 commented:

In addition, the PR had various suggestions, e.g., sign-language-processing#61 (comment)

cleong110 commented:

@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST (BERT Pre-training for Sign Language Recognition with Coupling Tokenization), a pre-training method based on masked modeling of pose sequences using a coupled tokenization scheme.
This method takes pose triplet units (left hand, right hand, and upper-body with arms) as inputs, each tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning].
Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked, to learn hierarchical correlations among them.
Unlike @hu2023SignBertPlus, BEST does not mask multi-frame pose sequences or individual joints. 
The authors validate their pre-training method on isolated sign recognition (ISR) tasks using MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
Besides pose-to-gloss, they also experiment with video-to-gloss tasks via fusion with I3D [@carreira2017quo].
Results on these datasets demonstrate state-of-the-art performance compared to previous methods and are comparable to those of SignBERT+ [@hu2023SignBertPlus].

cleong110 commented:

Merged
