Introducing Bert2D for Morphologically Rich Languages #38707
- Port utility functions from fast tokenizer (is_subword, create_word_ids, create_subword_ids, etc.)
- Override __call__ method to generate word_ids and subword_ids alongside standard tokenization
- Add BERT2D-specific parameters (max_intermediate_subword_positions_per_word, subword_embedding_order, intermediate_subword_distribution_strategy)
- Update model_input_names to include word_ids and subword_ids
- Handle both tensor and list outputs properly
- Implement padding logic for custom IDs
- Maintain compatibility with BertTokenizer base functionality

…tionality
- Implemented full BERT2D slow tokenizer based on BertTokenizer structure
- Ported utility functions from fast tokenizer (is_subword, create_word_ids, create_subword_ids, etc.)
- Added 2D positional embedding support via word_ids and subword_ids generation
- Implemented proper batch handling, padding, and tensor conversion
- Added comprehensive __main__ test suite for validation
- Supports all BERT2D-specific parameters (max_intermediate_subword_positions_per_word, subword_embedding_order, intermediate_subword_distribution_strategy)
- Maintains full compatibility with BertTokenizer base functionality
- Clean implementation without debug prints, ready for production use

…al ID generation
- Added Bert2DTokenizer inheriting from BertTokenizer.
- Implemented __init__ to handle custom Bert2D parameters (max_intermediate_subword_positions_per_word, subword_embedding_order, intermediate_subword_distribution_strategy) and update init_kwargs.
- Overrode __call__ to generate word_ids and subword_ids using ported utility functions from the fast tokenizer.
- Ensured model_input_names includes word_ids and subword_ids.
- Code cleaned of debug statements.
Note: Standalone tests pass, but integration tests with TokenizerTesterMixin show issues likely related to the test environment or mixin behavior with subclassed tokenizers.

…Fix the _pad method signature to exactly match PreTrainedTokenizerBase._pad:
- Remove **kwargs parameter
- Change return type from Dict[str, Any] to dict
- Ensure all parameters match exactly, including padding_side
Hi @yigit353, the model looks cool but we probably can't accept a PR for the main codebase right now! We usually only accept those when a model has a lot of users, because it means the Hugging Face team has to take over maintenance at that point. However, you can still share the model! What you can do is upload the modeling code as a "custom code model". Basically, you can just copy your configuration/modeling/tokenization files into the model repo and add an
Thank you for the clarification, @Rocketknight1. The problem is that there are no instructions for adding custom tokenizers, and I have both a slow and a fast one. How can I proceed?
Hi @yigit353, you can take a look at https://huggingface.co/Salesforce/codegen25-7b-multi_P as an example. They include their
Yes, thank you @Rocketknight1. The model and tokenizer are now up and running.
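For readers following the same route, here is a minimal sketch of the general "custom code" sharing flow discussed above. It assumes the Bert2D tokenizers live in local files such as tokenization_bert2d.py and tokenization_bert2d_fast.py; the module names, vocab path, and repository id are illustrative, not the actual repo layout.

```python
# Sketch: sharing custom slow + fast tokenizers as "custom code" on the Hub.
# Module/class names below are illustrative assumptions about the local file layout.
from tokenization_bert2d import Bert2DTokenizer
from tokenization_bert2d_fast import Bert2DTokenizerFast

# Register the classes so AutoTokenizer can resolve them from the repo's code files.
Bert2DTokenizer.register_for_auto_class("AutoTokenizer")
Bert2DTokenizerFast.register_for_auto_class("AutoTokenizer")

slow_tokenizer = Bert2DTokenizer("vocab.txt")  # assumes a local WordPiece vocab file
slow_tokenizer.push_to_hub("your-username/bert2d-demo")  # placeholder repo id
# The fast tokenizer can be instantiated and pushed to the same repo in the same way.

# Later, users only need trust_remote_code, since the .py files travel with the repo:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/bert2d-demo", trust_remote_code=True)
```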
What does this PR do?
This pull request introduces Bert2D, a novel architecture based on BertModel that incorporates a two-dimensional word embedding system. This new model is specifically designed to enhance performance on morphologically rich languages, such as Turkish and Finnish. This initial release includes the model implementation and a pretrained checkpoint for Turkish.

This work is based on the research outlined in the paper "Bert2D: A 2D-Word Embedding for Morphologically Rich Languages", which has been accepted by IEEE and is available at: https://ieeexplore.ieee.org/document/10542953.
A working and pretrained model checkpoint for Turkish is available on the Hugging Face Hub at: https://huggingface.co/yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2
Description
The core innovation of Bert2D is the introduction of a 2D positional embedding mechanism. Standard BERT models utilize a 1D positional embedding, which can be suboptimal for languages with complex morphological structures and more flexible word order. Bert2D addresses this by employing a dual embedding system: a word-level positional embedding that encodes the position of each whole word in the sequence, and a subword-level positional embedding that encodes the position of each sub-token within its word.
This two-dimensional approach allows the model to better understand the relationships between words and their constituent morphemes, leading to a more nuanced representation of meaning, which is particularly beneficial for agglutinative languages.
Additionally, this implementation incorporates Whole Word Masking (WWM), a training technique where all sub-tokens corresponding to a single word are masked together. This encourages the model to learn deeper contextual relationships between words.
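To make the dual-ID scheme concrete, here is a minimal sketch of how word-level and subword-level position IDs can be derived from WordPiece tokens. The helper names echo those used by the tokenizer (is_subword, create_word_ids, create_subword_ids), but the bodies below are simplified illustrations and ignore the intermediate_subword_distribution_strategy handling of the actual implementation.

```python
# Illustrative sketch only: derive per-token word indices and within-word subword
# indices from WordPiece-style tokens ("##" marks a continuation sub-token).
from typing import List


def is_subword(token: str) -> bool:
    """A token is a continuation sub-token if it carries the '##' prefix."""
    return token.startswith("##")


def create_word_ids(tokens: List[str]) -> List[int]:
    """First positional axis: the index of the word each token belongs to."""
    word_ids = []
    word_index = -1
    for token in tokens:
        if not is_subword(token):
            word_index += 1  # a new word starts at every non-continuation token
        word_ids.append(max(word_index, 0))
    return word_ids


def create_subword_ids(tokens: List[str]) -> List[int]:
    """Second positional axis: the position of each sub-token inside its word."""
    subword_ids = []
    position_in_word = 0
    for token in tokens:
        position_in_word = position_in_word + 1 if is_subword(token) else 0
        subword_ids.append(position_in_word)
    return subword_ids


# Example with agglutinative words split into several sub-tokens (illustrative split):
tokens = ["evler", "##imiz", "##de", "kal", "##dık"]
print(create_word_ids(tokens))     # [0, 0, 0, 1, 1]
print(create_subword_ids(tokens))  # [0, 1, 2, 0, 1]
```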
Architectural Innovations and Implementation
This pull request introduces the following key components:
- Bert2DModel: A new model class that inherits from BertPreTrainedModel and implements the 2D embedding logic. The core changes are within the embeddings layer to accommodate the dual positional encoding.
- Bert2DTokenizer and Bert2DTokenizerFast: Custom tokenizer implementations that are compatible with the Bert2D model.
- Task-specific head classes: Bert2DForMaskedLM, Bert2DForSequenceClassification, Bert2DForTokenClassification, and Bert2DForQuestionAnswering.

Bert2DConfig introduces new parameters to control the 2D embeddings:

- max_word_position_embeddings: An integer that defines the maximum number of words (not sub-tokens) the model can process in a single sequence. Defaults to 512.
- max_intermediate_subword_position_embeddings: An integer that controls the position embeddings assigned to intermediate sub-tokens within a word. For the NSW2 strategy, this is set to 2.

The 2D embeddings are summed with the token and segment embeddings before being passed to the Transformer layers, ensuring seamless integration with the standard BERT architecture. The parameter count is nearly identical to a standard BERT model; the 128K in the checkpoint name refers to the vocabulary size, not the number of parameters.
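As a rough illustration of the summation described above, the sketch below composes token, segment, word-position, and subword-position embeddings. Attribute names, table sizes, and defaults are assumptions for the sake of the example, not the PR's exact implementation.

```python
# Illustrative sketch of a dual (2D) positional embedding layer; not the PR's exact code.
from torch import nn


class DualPositionEmbeddings(nn.Module):
    def __init__(self, vocab_size=128_000, hidden_size=768, type_vocab_size=2,
                 max_word_position_embeddings=512, num_subword_positions=4):
        super().__init__()
        # Standard BERT token and segment embeddings.
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        # First positional axis: index of the word within the sequence (word_ids).
        self.word_position_embeddings = nn.Embedding(max_word_position_embeddings, hidden_size)
        # Second positional axis: index of the sub-token within its word (subword_ids).
        # The table size depends on max_intermediate_subword_position_embeddings and the
        # distribution strategy, so 4 here is just a placeholder.
        self.subword_position_embeddings = nn.Embedding(num_subword_positions, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids, token_type_ids, word_ids, subword_ids):
        # All four embeddings are summed before the result enters the encoder stack.
        embeddings = (
            self.token_embeddings(input_ids)
            + self.token_type_embeddings(token_type_ids)
            + self.word_position_embeddings(word_ids)
            + self.subword_position_embeddings(subword_ids)
        )
        return self.layer_norm(embeddings)
```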
Example Usage

The Bert2D model can be easily used with the pipeline API for tasks like fill-mask.
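A sketch of fill-mask usage with the published checkpoint. The example sentence is illustrative, and trust_remote_code=True is assumed to be needed while Bert2D is distributed as custom Hub code rather than as part of the library.

```python
# Illustrative fill-mask usage with the Turkish Bert2D checkpoint.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2",
    trust_remote_code=True,  # assumed necessary while the model ships as custom Hub code
)

# Example Turkish sentence (illustrative); [MASK] is the BERT-style mask token.
for prediction in fill_mask("Bu kitap gerçekten çok [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```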
Fine-Tuning Considerations
When fine-tuning a Bert2D model, users must pay close attention to the model's specific configuration. The introduction of max_word_position_embeddings and max_intermediate_subword_position_embeddings means that standard BERT configuration files are not directly compatible. Ensure that you are using the Bert2DConfig and its associated parameters to achieve correct and optimal performance.
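A sketch of a fine-tuning setup that respects these constraints: the checkpoint's own config is loaded and reused instead of a plain BERT config. Loading through the Auto classes with trust_remote_code=True, the availability of a sequence-classification mapping, and the example task are all assumptions.

```python
# Illustrative fine-tuning setup: reuse the checkpoint's Bert2DConfig rather than a
# standard BertConfig so the 2D position parameters are preserved.
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

config = AutoConfig.from_pretrained(checkpoint, trust_remote_code=True)
# The Bert2D-specific fields travel with the config; do not overwrite them with
# values taken from a standard BERT configuration file.
print(config.max_word_position_embeddings)
print(config.max_intermediate_subword_position_embeddings)

config.num_labels = 2  # e.g. a binary sentiment task (illustrative)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, config=config, trust_remote_code=True
)

# The tokenizer also returns word_ids and subword_ids, which the model consumes
# alongside input_ids, attention_mask, and token_type_ids.
batch = tokenizer("Bu film harikaydı.", return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)
```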
Motivation and Context

Languages with rich morphology, like Turkish, Finnish, and Hungarian, pose a significant challenge for traditional NLP models. The vast number of possible word forms for a single root makes it difficult for models with 1D positional embeddings to generalize effectively. The Bert2D architecture was developed to directly address this limitation, and our initial experiments on Turkish have shown that it consistently outperforms strong monolingual models across a range of downstream tasks.
Future Work and Call for Contributions
We believe that the Bert2D architecture holds significant promise for improving NLP performance in a wide range of languages. We are actively seeking community contributions, in particular pretrained checkpoints and evaluations for other morphologically rich languages.
We believe that the addition of Bert2D to the Transformers library will be a valuable resource for the community and will spur further research into developing more effective models for a wider range of the world's languages.
Thank you @ArthurZucker
EDIT: All tests passed
EDIT 2: Opened issue #38708