
Introducing Bert2D for Morphologically Rich Languages #38707

Closed
wants to merge 94 commits

Conversation

yigit353 commented Jun 9, 2025

What does this PR do?

This pull request introduces Bert2D, a novel architecture based on BertModel that incorporates a two-dimensional word embedding system. This new model is specifically designed to enhance performance on morphologically rich languages, such as Turkish and Finnish. This initial release includes the model implementation and a pretrained checkpoint for Turkish.

This work is based on the research outlined in the paper "Bert2D: A 2D-Word Embedding for Morphologically Rich Languages", published by IEEE and available at: https://ieeexplore.ieee.org/document/10542953.

A working and pretrained model checkpoint for Turkish is available on the Hugging Face Hub at: https://huggingface.co/yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2


Description

The core innovation of Bert2D is the introduction of a 2D positional embedding mechanism. Standard BERT models utilize a 1D positional embedding, which can be suboptimal for languages with complex morphological structures and more flexible word order. Bert2D addresses this by employing a dual embedding system:

  1. Whole-Word Positional Embeddings (1st Dimension): Captures the absolute position of each word (not sub-word token) in the sequence.
  2. Sub-word Relative Positional Embeddings (2nd Dimension): Encodes the relative position of sub-words within each word. This is the key innovation, allowing the model to differentiate between the start, middle, and end sub-tokens of a word.

This two-dimensional approach allows the model to better understand the relationships between words and their constituent morphemes, leading to a more nuanced representation of meaning, which is particularly beneficial for agglutinative languages.
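
To make the two dimensions concrete, the toy sketch below derives both position sequences from a WordPiece-style tokenization. The tokenization, loop, and variable names are illustrative only; the actual tokenizer in this PR additionally caps intermediate sub-word positions according to the configured strategy (e.g. NSW2).

# Illustrative sketch only: derive the two position sequences for a toy
# WordPiece tokenization ("##" marks a continuation sub-token, as in BERT).
tokens = ["[CLS]", "Adam", "##ın", "meslek", "##i", "mühendis", "##tir", "[SEP]"]

word_ids = []     # 1st dimension: absolute position of the whole word
subword_ids = []  # 2nd dimension: relative position of the sub-token in its word
word_pos = -1
for token in tokens:
    if token.startswith("##"):
        subword_ids.append(subword_ids[-1] + 1)  # next sub-token of the same word
    else:
        word_pos += 1
        subword_ids.append(0)                    # first sub-token of a new word
    word_ids.append(word_pos)

print(word_ids)     # [0, 1, 1, 2, 2, 3, 3, 4]
print(subword_ids)  # [0, 0, 1, 0, 1, 0, 1, 0]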

Additionally, this implementation incorporates Whole Word Masking (WWM), a training technique where all sub-tokens corresponding to a single word are masked together. This encourages the model to learn deeper contextual relationships between words.
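
As a minimal sketch of the whole-word-masking idea (not the data collator used in this PR), the word_ids from the previous snippet can be used to mask every sub-token of a randomly chosen word:

import random

tokens = ["[CLS]", "Adam", "##ın", "meslek", "##i", "mühendis", "##tir", "[SEP]"]
word_ids = [0, 1, 1, 2, 2, 3, 3, 4]
special = {"[CLS]", "[SEP]"}

# Pick whole words at random (a real collator would mask roughly 15% of them)
# and replace every sub-token belonging to a chosen word with [MASK].
candidates = sorted({w for t, w in zip(tokens, word_ids) if t not in special})
masked_words = set(random.sample(candidates, k=1))

masked = ["[MASK]" if w in masked_words else t for t, w in zip(tokens, word_ids)]
print(masked)  # e.g. ['[CLS]', 'Adam', '##ın', '[MASK]', '[MASK]', 'mühendis', '##tir', '[SEP]']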


Architectural Innovations and Implementation

This pull request introduces the following key components:

  • Bert2DModel: A new model class that inherits from BertPreTrainedModel and implements the 2D embedding logic. The core changes are within the embeddings layer to accommodate the dual positional encoding.
  • Bert2DTokenizer and Bert2DTokenizerFast: Custom tokenizer implementations that are compatible with the Bert2D model.
  • Model Variants: Includes all standard variants of the BERT architecture, such as Bert2DForMaskedLM, Bert2DForSequenceClassification, Bert2DForTokenClassification, and Bert2DForQuestionAnswering.
  • New Configuration Parameters: The Bert2DConfig introduces new parameters to control the 2D embeddings:
    • max_word_position_embeddings: An integer that defines the maximum number of words (not sub-tokens) the model can process in a single sequence. Defaults to 512.
    • max_intermediate_subword_position_embeddings: An integer that caps the number of distinct relative positions assigned to the intermediate sub-tokens within a word. For the NSW2 strategy, this is set to 2. (A configuration sketch follows this list.)
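
A minimal sketch of inspecting these parameters, assuming the custom configuration class is resolvable (e.g. from the Hub checkpoint with trust_remote_code=True) and that the attribute names match the description above:

from transformers import AutoConfig

# Load the configuration shipped with the Turkish checkpoint; trust_remote_code
# is only needed as long as Bert2D lives outside the main library.
config = AutoConfig.from_pretrained(
    "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2",
    trust_remote_code=True,
)

# Attribute names as described in this PR (treat them as assumptions).
print(config.max_word_position_embeddings)                  # e.g. 512
print(config.max_intermediate_subword_position_embeddings)  # e.g. 2 for NSW2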

The 2D embeddings are summed with the token and segment embeddings before being passed to the Transformer layers, ensuring seamless integration with the standard BERT architecture. The parameter count is nearly identical to a standard BERT model; the 128K in the checkpoint name refers to the vocabulary size, not the number of parameters.
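
The sketch below is a deliberately simplified version of such an embeddings layer, meant only to show the summation; it is not the exact module added in this PR, and the sizes are placeholders.

from torch import nn

class TwoDEmbeddingsSketch(nn.Module):
    """Simplified sketch: token + segment + whole-word + sub-word position embeddings."""

    def __init__(self, vocab_size=128000, hidden_size=768, type_vocab_size=2,
                 max_word_positions=512, max_subword_positions=4):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        # 1st dimension: absolute position of the whole word.
        self.word_position_embeddings = nn.Embedding(max_word_positions, hidden_size)
        # 2nd dimension: relative position of the sub-token inside its word.
        self.subword_position_embeddings = nn.Embedding(max_subword_positions, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids, word_ids, subword_ids):
        embeddings = (
            self.token_embeddings(input_ids)
            + self.token_type_embeddings(token_type_ids)
            + self.word_position_embeddings(word_ids)
            + self.subword_position_embeddings(subword_ids)
        )
        return self.dropout(self.LayerNorm(embeddings))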


Example Usage

The Bert2D model can be easily used with the pipeline API for tasks like fill-mask.

from transformers import pipeline

# Initialize the fill-mask pipeline with the Bert2D model
fill_mask_pipe = pipeline("fill-mask", model="yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2")

# Example usage
masked_sentence = "Adamın mesleği [MASK] midir acaba?"
predictions = fill_mask_pipe(masked_sentence)

# Print the top predictions
for prediction in predictions:
    print(f"Token: {prediction['token_str']}")
    print(f"Sequence: {prediction['sequence']}")
    print(f"Score: {prediction['score']:.4f}")
    print("-" * 20)

Predicted Output:

Token: mühendis
Sequence: Adamın mesleği mühendis midir acaba?
Score: 0.2393
--------------------
Token: doktor
Sequence: Adamın mesleği doktor midir acaba?
Score: 0.1698
--------------------
Token: asker
Sequence: Adamın mesleği asker midir acaba?
Score: 0.0537
--------------------
Token: memur
Sequence: Adamın mesleği memur midir acaba?
Score: 0.0471
--------------------
Token: öğretmen
Sequence: Adamın mesleği öğretmen midir acaba?
Score: 0.0463
--------------------

Fine-Tuning Considerations

When fine-tuning a Bert2D model, users must pay close attention to the model's specific configuration. The introduction of max_word_position_embeddings and max_intermediate_subword_position_embeddings means that standard BERT configuration files are not directly compatible. Ensure that you are using the Bert2DConfig and its associated parameters to achieve correct and optimal performance.
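
As a rough fine-tuning sketch (assuming the custom classes are resolvable via trust_remote_code=True and mapped for the task; everything else below is standard transformers API):

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# Load the Bert2D configuration (not a plain BertConfig) so the 2D-specific
# parameters described above are preserved.
config = AutoConfig.from_pretrained(checkpoint, num_labels=2, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, config=config, trust_remote_code=True
)

# The tokenizer is expected to emit word_ids and subword_ids alongside the
# usual inputs (per this PR's model_input_names); the model consumes them.
batch = tokenizer("Adamın mesleği mühendistir.", return_tensors="pt")
outputs = model(**batch)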


Motivation and Context

Languages with rich morphology, like Turkish, Finnish, and Hungarian, pose a significant challenge for traditional NLP models. The vast number of possible word forms for a single root makes it difficult for models with 1D positional embeddings to generalize effectively. The Bert2D architecture was developed to directly address this limitation, and our initial experiments on Turkish have shown that it consistently outperforms strong monolingual models across a range of downstream tasks.


Future Work and Call for Contributions

We believe that the Bert2D architecture holds significant promise for improving NLP performance in a wide range of languages. We are actively seeking contributions in the following areas:

  • Pretraining on other languages: We are particularly interested in seeing Bert2D trained on other morphologically complex languages like Finnish, Hungarian, and Korean.
  • Further architectural enhancements: We are open to suggestions and improvements to the current architecture.
  • Downstream task fine-tuning and evaluation: We encourage the community to fine-tune and evaluate Bert2D on various downstream tasks and report their findings.

We believe that the addition of Bert2D to the Transformers library will be a valuable resource for the community and will spur further research into developing more effective models for a wider range of the world's languages.

Thank you @ArthurZucker

EDIT: All tests passed

EDIT 2: Opened issue #38708

yigit353 added 30 commits May 22, 2025 12:26
- Port utility functions from fast tokenizer (is_subword, create_word_ids, create_subword_ids, etc.)
- Override __call__ method to generate word_ids and subword_ids alongside standard tokenization
- Add BERT2D-specific parameters (max_intermediate_subword_positions_per_word, subword_embedding_order, intermediate_subword_distribution_strategy)
- Update model_input_names to include word_ids and subword_ids
- Handle both tensor and list outputs properly
- Implement padding logic for custom IDs
- Maintain compatibility with BertTokenizer base functionality
…tionality

- Implemented full BERT2D slow tokenizer based on BertTokenizer structure
- Ported utility functions from fast tokenizer (is_subword, create_word_ids, create_subword_ids, etc.)
- Added 2D positional embedding support via word_ids and subword_ids generation
- Implemented proper batch handling, padding, and tensor conversion
- Added comprehensive __main__ test suite for validation
- Supports all BERT2D-specific parameters (max_intermediate_subword_positions_per_word,
  subword_embedding_order, intermediate_subword_distribution_strategy)
- Maintains full compatibility with BertTokenizer base functionality
- Clean implementation without debug prints ready for production use
…al ID generation

- Added Bert2DTokenizer inheriting from BertTokenizer.
- Implemented __init__ to handle custom Bert2D parameters (max_intermediate_subword_positions_per_word, subword_embedding_order, intermediate_subword_distribution_strategy) and update init_kwargs.
- Overrode __call__ to generate word_ids and subword_ids using ported utility functions from the fast tokenizer.
- Ensured model_input_names includes word_ids and subword_ids.
- Code cleaned of debug statements.

Note: Standalone tests pass, but integration tests with TokenizerTesterMixin show issues likely related to the test environment or mixin behavior with subclassed tokenizers.
…Fix the _pad method signature to exactly match PreTrainedTokenizerBase._pad:

- Remove **kwargs parameter
- Change return type from Dict[str, Any] to dict
- Ensure all parameters match exactly including padding_side
@Rocketknight1
Member

Hi @yigit353, the model looks cool but we probably can't accept a PR for the main codebase right now! We usually only accept those when a model has a lot of users, because it means the Hugging Face team has to take over maintenance at that point.

However, you can still share the model! What you can do is upload the modeling code as a "custom code model". Basically, you can just copy your configuration/modeling/tokenization files into the model repo and add an auto_map entry to config.json. See https://huggingface.co/docs/transformers/en/custom_models for some tips, or you can just look on the Hub for other custom code models. These models work just like library models, and users can load them with AutoModel.from_pretrained(trust_remote_code=True)
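
As a rough illustration of the auto_map step (the module and class file names below are assumptions, not the verified contents of the repo):

import json

repo_dir = "Bert2D-cased-Turkish-128K-WWM-NSW2"  # hypothetical local clone of the model repo

with open(f"{repo_dir}/config.json") as f:
    cfg = json.load(f)

# Point the Auto* classes at the custom files copied into the repo.
cfg["auto_map"] = {
    "AutoConfig": "configuration_bert2d.Bert2DConfig",
    "AutoModel": "modeling_bert2d.Bert2DModel",
    "AutoModelForMaskedLM": "modeling_bert2d.Bert2DForMaskedLM",
}

with open(f"{repo_dir}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)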

@yigit353
Author

Thank you for the clarification @Rocketknight1. The problem is that there are no instructions for adding custom tokenizers, and I have both a slow and a fast one. How can I proceed?

@Rocketknight1
Member

Hi @yigit353, you can take a look at https://huggingface.co/Salesforce/codegen25-7b-multi_P as an example. They include their tokenization.py file, and then add an auto_map line in tokenizer_config.json. You can replace the null entry in that tuple with your fast tokenizer class and it should work!
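
Concretely, the tokenizer registration might look like the following sketch (file and class names are assumptions based on this PR):

import json

repo_dir = "Bert2D-cased-Turkish-128K-WWM-NSW2"  # hypothetical local clone of the model repo

with open(f"{repo_dir}/tokenizer_config.json") as f:
    tok_cfg = json.load(f)

# The AutoTokenizer entry is a (slow class, fast class) pair; replacing the
# null second entry with the fast tokenizer class registers both.
tok_cfg["auto_map"] = {
    "AutoTokenizer": [
        "tokenization_bert2d.Bert2DTokenizer",
        "tokenization_bert2d_fast.Bert2DTokenizerFast",
    ]
}

with open(f"{repo_dir}/tokenizer_config.json", "w") as f:
    json.dump(tok_cfg, f, indent=2)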

@yigit353
Author

Yes, thank you @Rocketknight1. The model and tokenizer are now up and running.

yigit353 closed this Jun 10, 2025