
Introducing Bert2D for Morphologically Rich Languages #38707

Closed
wants to merge 94 commits

Conversation

yigit353 commented Jun 9, 2025

What does this PR do?

This pull request introduces Bert2D, a novel architecture based on BertModel that incorporates a two-dimensional word embedding system. This new model is specifically designed to enhance performance on morphologically rich languages, such as Turkish and Finnish. This initial release includes the model implementation and a pretrained checkpoint for Turkish.

This work is based on the research outlined in the paper "Bert2D: A 2D-Word Embedding for Morphologically Rich Languages", published by IEEE and available at: https://ieeexplore.ieee.org/document/10542953.

A working and pretrained model checkpoint for Turkish is available on the Hugging Face Hub at: https://huggingface.co/yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2


Description

The core innovation of Bert2D is the introduction of a 2D positional embedding mechanism. Standard BERT models utilize a 1D positional embedding, which can be suboptimal for languages with complex morphological structures and more flexible word order. Bert2D addresses this by employing a dual embedding system:

  1. Whole-Word Positional Embeddings (1st Dimension): Captures the absolute position of each word (not sub-word token) in the sequence.
  2. Sub-word Relative Positional Embeddings (2nd Dimension): Encodes the relative position of sub-words within each word. This is the key innovation, allowing the model to differentiate between the start, middle, and end sub-tokens of a word.

This two-dimensional approach allows the model to better understand the relationships between words and their constituent morphemes, leading to a more nuanced representation of meaning, which is particularly beneficial for agglutinative languages.
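
To make the two dimensions concrete, the toy sketch below derives both position sequences from a WordPiece-style tokenization. The tokenization, loop, and variable names are illustrative only; the actual tokenizer in this PR additionally caps intermediate sub-word positions according to the configured strategy (e.g. NSW2).

# Illustrative sketch only: derive the two position sequences for a toy
# WordPiece tokenization ("##" marks a continuation sub-token, as in BERT).
tokens = ["[CLS]", "Adam", "##ın", "meslek", "##i", "mühendis", "##tir", "[SEP]"]

word_ids = []     # 1st dimension: absolute position of the whole word
subword_ids = []  # 2nd dimension: relative position of the sub-token in its word
word_pos = -1
for token in tokens:
    if token.startswith("##"):
        subword_ids.append(subword_ids[-1] + 1)  # next sub-token of the same word
    else:
        word_pos += 1
        subword_ids.append(0)                    # first sub-token of a new word
    word_ids.append(word_pos)

print(word_ids)     # [0, 1, 1, 2, 2, 3, 3, 4]
print(subword_ids)  # [0, 0, 1, 0, 1, 0, 1, 0]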

Additionally, this implementation incorporates Whole Word Masking (WWM), a training technique where all sub-tokens corresponding to a single word are masked together. This encourages the model to learn deeper contextual relationships between words.
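
As a minimal sketch of the whole-word-masking idea (not the data collator used in this PR), the word_ids from the previous snippet can be used to mask every sub-token of a randomly chosen word:

import random

tokens = ["[CLS]", "Adam", "##ın", "meslek", "##i", "mühendis", "##tir", "[SEP]"]
word_ids = [0, 1, 1, 2, 2, 3, 3, 4]
special = {"[CLS]", "[SEP]"}

# Pick whole words at random (a real collator would mask roughly 15% of them)
# and replace every sub-token belonging to a chosen word with [MASK].
candidates = sorted({w for t, w in zip(tokens, word_ids) if t not in special})
masked_words = set(random.sample(candidates, k=1))

masked = ["[MASK]" if w in masked_words else t for t, w in zip(tokens, word_ids)]
print(masked)  # e.g. ['[CLS]', 'Adam', '##ın', '[MASK]', '[MASK]', 'mühendis', '##tir', '[SEP]']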


Architectural Innovations and Implementation

This pull request introduces the following key components:

  • Bert2DModel: A new model class that inherits from BertPreTrainedModel and implements the 2D embedding logic. The core changes are within the embeddings layer to accommodate the dual positional encoding.
  • Bert2DTokenizer and Bert2DTokenizerFast: Custom tokenizer implementations that are compatible with the Bert2D model.
  • Model Variants: Includes all standard variants of the BERT architecture, such as Bert2DForMaskedLM, Bert2DForSequenceClassification, Bert2DForTokenClassification, and Bert2DForQuestionAnswering.
  • New Configuration Parameters: The Bert2DConfig introduces new parameters to control the 2D embeddings:
    • max_word_position_embeddings: An integer that defines the maximum number of words (not sub-tokens) the model can process in a single sequence. Defaults to 512.
    • max_intermediate_subword_position_embeddings: An integer that caps the number of distinct relative positions assigned to the intermediate sub-tokens within a word. For the NSW2 strategy, this is set to 2. (A configuration sketch follows this list.)
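
A minimal sketch of inspecting these parameters, assuming the custom configuration class is resolvable (e.g. from the Hub checkpoint with trust_remote_code=True) and that the attribute names match the description above:

from transformers import AutoConfig

# Load the configuration shipped with the Turkish checkpoint; trust_remote_code
# is only needed as long as Bert2D lives outside the main library.
config = AutoConfig.from_pretrained(
    "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2",
    trust_remote_code=True,
)

# Attribute names as described in this PR (treat them as assumptions).
print(config.max_word_position_embeddings)                  # e.g. 512
print(config.max_intermediate_subword_position_embeddings)  # e.g. 2 for NSW2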

The 2D embeddings are summed with the token and segment embeddings before being passed to the Transformer layers, ensuring seamless integration with the standard BERT architecture. The parameter count is nearly identical to a standard BERT model; the 128K in the checkpoint name refers to the vocabulary size, not the number of parameters.
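
The sketch below is a deliberately simplified version of such an embeddings layer, meant only to show the summation; it is not the exact module added in this PR, and the sizes are placeholders.

from torch import nn

class TwoDEmbeddingsSketch(nn.Module):
    """Simplified sketch: token + segment + whole-word + sub-word position embeddings."""

    def __init__(self, vocab_size=128000, hidden_size=768, type_vocab_size=2,
                 max_word_positions=512, max_subword_positions=4):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        # 1st dimension: absolute position of the whole word.
        self.word_position_embeddings = nn.Embedding(max_word_positions, hidden_size)
        # 2nd dimension: relative position of the sub-token inside its word.
        self.subword_position_embeddings = nn.Embedding(max_subword_positions, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids, word_ids, subword_ids):
        embeddings = (
            self.token_embeddings(input_ids)
            + self.token_type_embeddings(token_type_ids)
            + self.word_position_embeddings(word_ids)
            + self.subword_position_embeddings(subword_ids)
        )
        return self.dropout(self.LayerNorm(embeddings))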


Example Usage

The Bert2D model can be easily used with the pipeline API for tasks like fill-mask.

from transformers import pipeline

# Initialize the fill-mask pipeline with the Bert2D model
fill_mask_pipe = pipeline("fill-mask", model="yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2")

# Example usage
masked_sentence = "Adamın mesleği [MASK] midir acaba?"
predictions = fill_mask_pipe(masked_sentence)

# Print the top predictions
for prediction in predictions:
    print(f"Token: {prediction['token_str']}")
    print(f"Sequence: {prediction['sequence']}")
    print(f"Score: {prediction['score']:.4f}")
    print("-" * 20)

Predicted Output:

Token: mühendis
Sequence: Adamın mesleği mühendis midir acaba?
Score: 0.2393
--------------------
Token: doktor
Sequence: Adamın mesleği doktor midir acaba?
Score: 0.1698
--------------------
Token: asker
Sequence: Adamın mesleği asker midir acaba?
Score: 0.0537
--------------------
Token: memur
Sequence: Adamın mesleği memur midir acaba?
Score: 0.0471
--------------------
Token: öğretmen
Sequence: Adamın mesleği öğretmen midir acaba?
Score: 0.0463
--------------------

Fine-Tuning Considerations

When fine-tuning a Bert2D model, users must pay close attention to the model's specific configuration. The introduction of max_word_position_embeddings and max_intermediate_subword_position_embeddings means that standard BERT configuration files are not directly compatible. Ensure that you are using the Bert2DConfig and its associated parameters to achieve correct and optimal performance.
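
As a rough fine-tuning sketch (assuming the custom classes are resolvable via trust_remote_code=True and mapped for the task; everything else below is standard transformers API):

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# Load the Bert2D configuration (not a plain BertConfig) so the 2D-specific
# parameters described above are preserved.
config = AutoConfig.from_pretrained(checkpoint, num_labels=2, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, config=config, trust_remote_code=True
)

# The tokenizer is expected to emit word_ids and subword_ids alongside the
# usual inputs (per this PR's model_input_names); the model consumes them.
batch = tokenizer("Adamın mesleği mühendistir.", return_tensors="pt")
outputs = model(**batch)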


Motivation and Context

Languages with rich morphology, like Turkish, Finnish, and Hungarian, pose a significant challenge for traditional NLP models. The vast number of possible word forms for a single root makes it difficult for models with 1D positional embeddings to generalize effectively. The Bert2D architecture was developed to directly address this limitation, and our initial experiments on Turkish have shown that it consistently outperforms strong monolingual models across a range of downstream tasks.


Future Work and Call for Contributions

We believe that the Bert2D architecture holds significant promise for improving NLP performance in a wide range of languages. We are actively seeking contributions in the following areas:

  • Pretraining on other languages: We are particularly interested in seeing Bert2D trained on other morphologically complex languages like Finnish, Hungarian, and Korean.
  • Further architectural enhancements: We are open to suggestions and improvements to the current architecture.
  • Downstream task fine-tuning and evaluation: We encourage the community to fine-tune and evaluate Bert2D on various downstream tasks and report their findings.

We believe that the addition of Bert2D to the Transformers library will be a valuable resource for the community and will spur further research into developing more effective models for a wider range of the world's languages.

Thank you @ArthurZucker

EDIT: All tests passed

EDIT 2: Opened issue #38708

yigit353 added 30 commits May 22, 2025 12:26
- Port utility functions from fast tokenizer (is_subword, create_word_ids, create_subword_ids, etc.)
- Override __call__ method to generate word_ids and subword_ids alongside standard tokenization
- Add BERT2D-specific parameters (max_intermediate_subword_positions_per_word, subword_embedding_order, intermediate_subword_distribution_strategy)
- Update model_input_names to include word_ids and subword_ids
- Handle both tensor and list outputs properly
- Implement padding logic for custom IDs
- Maintain compatibility with BertTokenizer base functionality
…tionality

- Implemented full BERT2D slow tokenizer based on BertTokenizer structure
- Ported utility functions from fast tokenizer (is_subword, create_word_ids, create_subword_ids, etc.)
- Added 2D positional embedding support via word_ids and subword_ids generation
- Implemented proper batch handling, padding, and tensor conversion
- Added comprehensive __main__ test suite for validation
- Supports all BERT2D-specific parameters (max_intermediate_subword_positions_per_word,
  subword_embedding_order, intermediate_subword_distribution_strategy)
- Maintains full compatibility with BertTokenizer base functionality
- Clean implementation without debug prints ready for production use
…al ID generation

- Added Bert2DTokenizer inheriting from BertTokenizer.
- Implemented __init__ to handle custom Bert2D parameters (max_intermediate_subword_positions_per_word, subword_embedding_order, intermediate_subword_distribution_strategy) and update init_kwargs.
- Overrode __call__ to generate word_ids and subword_ids using ported utility functions from the fast tokenizer.
- Ensured model_input_names includes word_ids and subword_ids.
- Code cleaned of debug statements.

Note: Standalone tests pass, but integration tests with TokenizerTesterMixin show issues likely related to the test environment or mixin behavior with subclassed tokenizers.
…Fix the _pad method signature to exactly match PreTrainedTokenizerBase._pad:

- Remove **kwargs parameter
- Change return type from Dict[str, Any] to dict
- Ensure all parameters match exactly including padding_side
@Rocketknight1
Member

Hi @yigit353, the model looks cool but we probably can't accept a PR for the main codebase right now! We usually only accept those when a model has a lot of users, because it means the Hugging Face team has to take over maintenance at that point.

However, you can still share the model! What you can do is upload the modeling code as a "custom code model". Basically, you can just copy your configuration/modeling/tokenization files into the model repo and add an auto_map entry to config.json. See https://huggingface.co/docs/transformers/en/custom_models for some tips, or you can just look on the Hub for other custom code models. These models work just like library models, and users can load them with AutoModel.from_pretrained(trust_remote_code=True)
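
As a rough illustration of the auto_map step (the module and class file names below are assumptions, not the verified contents of the repo):

import json

repo_dir = "Bert2D-cased-Turkish-128K-WWM-NSW2"  # hypothetical local clone of the model repo

with open(f"{repo_dir}/config.json") as f:
    cfg = json.load(f)

# Point the Auto* classes at the custom files copied into the repo.
cfg["auto_map"] = {
    "AutoConfig": "configuration_bert2d.Bert2DConfig",
    "AutoModel": "modeling_bert2d.Bert2DModel",
    "AutoModelForMaskedLM": "modeling_bert2d.Bert2DForMaskedLM",
}

with open(f"{repo_dir}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)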

@yigit353
Author

Thank you for the clarification @Rocketknight1. The problem is that there are no instructions for adding custom tokenizers, and I have both a slow and a fast one. How can I proceed?

@Rocketknight1
Member

Hi @yigit353, you can take a look at https://huggingface.co/Salesforce/codegen25-7b-multi_P as an example. They include their tokenization.py file, and then add an auto_map line in tokenizer_config.json. You can replace the null entry in that tuple with your fast tokenizer class and it should work!
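
Concretely, the tokenizer registration might look like the following sketch (file and class names are assumptions based on this PR):

import json

repo_dir = "Bert2D-cased-Turkish-128K-WWM-NSW2"  # hypothetical local clone of the model repo

with open(f"{repo_dir}/tokenizer_config.json") as f:
    tok_cfg = json.load(f)

# The AutoTokenizer entry is a (slow class, fast class) pair; replacing the
# null second entry with the fast tokenizer class registers both.
tok_cfg["auto_map"] = {
    "AutoTokenizer": [
        "tokenization_bert2d.Bert2DTokenizer",
        "tokenization_bert2d_fast.Bert2DTokenizerFast",
    ]
}

with open(f"{repo_dir}/tokenizer_config.json", "w") as f:
    json.dump(tok_cfg, f, indent=2)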

@yigit353
Author

Yes, thank you @Rocketknight1. The model and tokenizer are now up and running.

yigit353 closed this Jun 10, 2025