Mohamed-Ashraf273 (Contributor) commented on Aug 26, 2025:

@rkazants
@mvafin
@mlukasze
@evkotov
@CuriousPanCake
@itikhono

### Performance issue description

## Problem

The OpenVINO backend exhibits **excessive memory consumption** during GPT-2 model inference compared with the other Keras backends (TensorFlow, PyTorch, JAX). The issue occurs during the model compilation phase, when the model is converted from Keras to OpenVINO format, and results in significantly higher memory usage that makes OpenVINO unsuitable for memory-constrained environments.

**In short**: OpenVINO uses substantially more memory than the other backends during the compilation/inference phase.

## Summary of the solution

Fixes issue #31390.
I first tried to solve this by introducing an `EinsumDecomposition` pass at MOC in PR #31482, but then found a better solution.

My first fix added `EinsumDecomposition` to MOC, and I found that both this version and the original `EinsumDecomposition` in `CommonOptimizations` introduced `Broadcast` nodes. In the MOC variant, however, the rest of the MOC pipeline later removed them, which allowed constants to be shared before the `ConstantFolding` pass that otherwise duplicates them in `CommonOptimizations`, and so reduced memory usage. Comparing the two, I realized that both decompositions initially produce the same graph; the MOC version simply benefits from an additional simplification step that cleans up the broadcasts. After debugging, I identified that pass as `NopElimination`. Running it in `CommonOptimizations` just before `ConstantFolding` achieves the same effect: the broadcasts disappear, constants are shared, and memory usage drops, without moving `EinsumDecomposition` into MOC.

### 📊 Complete Analysis & Benchmarks

For comprehensive performance comparison, optimization results, and technical details across all Keras backends:

**[Detailed Performance Report & Memory Optimization Analysis](https://gist.github.com/Mohamed-Ashraf273/1ecc15bd5e83c229d7e3f07851624bc8)**

The report includes cross-backend benchmarks taken before and after both fixes; the two fixes gave the same results for OpenVINO.


### Step-by-step reproduction

Use the Keras source: https://github.com/keras-team/keras.git
Also use this keras_hub PR: keras-team/keras-hub#2350

```python
import os
os.environ["KERAS_BACKEND"] = "openvino"

import keras_hub
causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
output = causal_lm.generate("Hello", max_length=10)  # Memory spike occurs here
```

Example Graph:

```python
# Imports needed to run this snippet (assumes the OpenVINO Python API with opset13).
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as ops


def create_einsum_constant_model():
    """Create a model with both constant and non-constant einsum patterns from different sources."""
    
    input_tensor = ops.parameter([1, 10, 1024], np.float32, name="input")
    
    # Create diverse constant sources for einsum operations
    # Source 1: Direct constant weight matrix
    weight_data_1 = np.random.randn(1024, 16, 64).astype(np.float32)
    const_weight_1 = ops.constant(weight_data_1, name="const_weight_1")
    
    # Source 2: Constant from addition 
    base_weight_2 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="base_weight_2")
    bias_weight_2 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="bias_weight_2")
    const_weight_2 = ops.add(base_weight_2, bias_weight_2)  # Constant folded
    
    # Source 3: Constant from a scalar multiply
    base_weight_3 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="base_weight_3")
    scale_3 = ops.constant(np.array(0.125, dtype=np.float32), name="scale_3")
    const_weight_3 = ops.multiply(base_weight_3, scale_3)  # Constant folded
    
    # Source 4: Constant from reshape
    flat_weight_4 = ops.constant(np.random.randn(1024*16*64).astype(np.float32), name="flat_weight_4")
    const_weight_4 = ops.reshape(flat_weight_4, [1024, 16, 64], special_zero=False)
    
    # Source 5: Constant from transpose
    orig_weight_5 = ops.constant(np.random.randn(16, 1024, 64).astype(np.float32), name="orig_weight_5")
    const_weight_5 = ops.transpose(orig_weight_5, [1, 0, 2])  # [1024, 16, 64]
    
    current = input_tensor
    
    # Create 10 einsum operations with constants (WILL BE OPTIMIZED)
    const_sources = [const_weight_1, const_weight_2, const_weight_3, const_weight_4, const_weight_5]
    
    for i in range(5):  # Use each constant source twice (5*2 = 10)
        for j in range(2):
            const_idx = i
            einsum_out = ops.einsum([current, const_sources[const_idx]], "abc,cde->abde")
            
            # Add bias to continue the chain
            bias = ops.constant(np.random.randn(16, 64).astype(np.float32), name=f"bias_{i}_{j}")
            current = ops.add(einsum_out, bias)
            
            # Reshape to prepare for next iteration
            if i < 4 or j < 1:  # Not the last iteration
                proj_weight = ops.constant(np.random.randn(16*64, 1024).astype(np.float32), name=f"proj_{i}_{j}")
                reshaped = ops.reshape(current, [1, 10, 16*64], special_zero=False)
                current = ops.matmul(reshaped, proj_weight, transpose_a=False, transpose_b=False)
    
    # Now create variable tensors from different sources for non-constant einsums
    # Start fresh with current tensor for variable operations
    var_source = ops.reshape(current, [1, 10, 16, 64], special_zero=False)
    
    # Create 20 einsum operations without constants (WON'T BE OPTIMIZED)
    for i in range(10):
        # Source 1: Split operations to create variable tensors
        split_axis = ops.constant(np.array(3, dtype=np.int32), name=f"split_axis_{i}")
        split_lengths = ops.constant(np.array([32, 32], dtype=np.int32), name=f"split_lengths_{i}")
        split_result = ops.variadic_split(var_source, split_axis, split_lengths)
        
        var_tensor_1 = split_result.output(0)  # [1, 10, 16, 32] - Variable
        var_tensor_2 = split_result.output(1)  # [1, 10, 16, 32] - Variable
        
        # EINSUM 1: Element-wise pattern (variable x variable)
        einsum_var_1 = ops.einsum([var_tensor_1, var_tensor_2], "abcd,abcd->abcd")
        
        # Source 2: Create more variable tensors from different operations
        # Use subtract to create another variable tensor
        var_tensor_3 = ops.subtract(var_tensor_1, var_tensor_2)  # [1, 10, 16, 32] - Variable
        
        # Use relu to create another variable tensor
        var_tensor_4 = ops.relu(var_tensor_2)  # [1, 10, 16, 32] - Variable
        
        # EINSUM 2: Another variable x variable pattern  
        einsum_var_2 = ops.einsum([var_tensor_3, var_tensor_4], "abcd,abcd->abcd")
        
        # Combine and use for next iteration
        combined = ops.add(einsum_var_1, einsum_var_2)
        
        # Concatenate back to [1, 10, 16, 64] for next iteration
        var_source = ops.concat([combined, combined], axis=3)  # [1, 10, 16, 64]
    
    # Final projection to output
    final_proj = ops.constant(np.random.randn(16*64, 1024).astype(np.float32), name="final_proj")
    final_reshaped = ops.reshape(var_source, [1, 10, 16*64], special_zero=False)
    final_output = ops.matmul(final_reshaped, final_proj, transpose_a=False, transpose_b=False)
    
    # Final output
    model = ov.Model([final_output], [input_tensor], name="EinsumConstantTest")
    
    # Print model statistics
    ops_by_type = {}
    for op in model.get_ops():
        op_type = op.get_type_name()
        ops_by_type[op_type] = ops_by_type.get(op_type, 0) + 1
    
    print("Original model operations:")
    for op_type, count in sorted(ops_by_type.items()):
        print(f"  {op_type}: {count}")
    
    print(f"\nEinsum breakdown:")
    print(f"  - Einsums with constants (WILL BE OPTIMIZED): 10")
    print(f"    * From direct constant: 2")
    print(f"    * From constant addition: 2") 
    print(f"    * From constant multiply: 2")
    print(f"    * From constant reshape: 2")
    print(f"    * From constant transpose: 2")
    print(f"  - Einsums without constants (WON'T BE OPTIMIZED): 20")
    print(f"    * From variadic_split operations: 10")
    print(f"    * From subtract + relu operations: 10")
    print(f"  - Total Einsums: 30")
    return model
```

You can find the original IR, the compiled IR, and the IR before and after NopElimination here:
https://drive.google.com/drive/folders/1xxNVFotGOZLeUf5ECtmJhm4fytJNoBLN?usp=sharing


Original Graph:
![Screenshot from 2025-08-26 12-40-15](https://github.com/user-attachments/assets/37a93d33-4dd4-4b6b-9f83-1c21676e6551)

Before NopElimination:
![Screenshot from 2025-08-26 15-20-51](https://github.com/user-attachments/assets/45fe58dc-b702-4510-b30a-1cc15cc43acc)

After NopElimination:
![Screenshot from 2025-08-26 15-21-26](https://github.com/user-attachments/assets/1b7f19a6-45f8-4d60-b04d-bcd416749267)

@Mohamed-Ashraf273 Mohamed-Ashraf273 requested a review from a team as a code owner August 26, 2025 01:06
@Mohamed-Ashraf273 Mohamed-Ashraf273 requested review from CuriousPanCake and removed request for a team August 26, 2025 01:06
CuriousPanCake (Contributor) commented:

.build_jenkins

mvafin (Contributor) commented on Aug 26, 2025:

Please add a test here:

It should verify that a model with two einsum ops sharing a constant still has that constant shared after the CommonOptimizations transformation runs.

Mohamed-Ashraf273 (Contributor, Author) commented:

@CuriousPanCake
@mvafin
Images have been updated to the correct graphs.

mvafin (Contributor) commented on Aug 26, 2025:

build_jenkins

Mohamed-Ashraf273 (Contributor, Author) commented:

@mvafin
I added the test.

mvafin (Contributor) commented on Aug 26, 2025:

build_jenkins

@github-actions github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Aug 26, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Aug 26, 2025
CuriousPanCake (Contributor) commented:

@Mohamed-Ashraf273 can you run functional tests on your local setup?

Mohamed-Ashraf273 (Contributor, Author) commented on Aug 27, 2025:

> @Mohamed-Ashraf273 can you run functional tests on your local setup?

@CuriousPanCake

Screenshot from 2025-08-27 12-47-28

This is the only failed test:

Screenshot from 2025-08-27 12-47-40

Mohamed-Ashraf273 (Contributor, Author) commented:

@CuriousPanCake
I printed the node that makes the test pass. Without elimination:
Screenshot from 2025-08-27 13-46-45

With elimination:
Screenshot from 2025-08-27 13-47-57

Mohamed-Ashraf273 (Contributor, Author) commented on Aug 27, 2025:

@CuriousPanCake
@mvafin

The Issue: A pattern-matching inconsistency was discovered in the RoPE fusion pass, where the value pattern `"-1, head_cnt, 1, ndims/2, 1"` contained a semantic error. The pattern specified a fixed value 1 in position 2, while the actual model constants carry the head-count dimension (typically 32) in that position.

The Evidence: Debug analysis showed that shape constants like `[-1, 128, 32, 32, 1]` failed to match the pattern because position 2 held the actual head-count value 32 instead of the expected fixed value 1.

The Solution: The pattern was corrected to `"-1, batch, head_cnt, ndims/2, 1"` to properly represent the semantic structure, in which position 1 corresponds to the batch size, position 2 to the head count, and position 3 to half the rotary dimensions.

This fix ensures correctness, and the test passes for both models, before and after the fix.

CuriousPanCake (Contributor) commented:

build_jenkins

mvafin (Contributor) commented on Aug 28, 2025:

@CuriousPanCake RoPE fusion was modified here; could you review whether the change makes sense?

@mvafin mvafin requested a review from CuriousPanCake August 28, 2025 13:40
CuriousPanCake (Contributor) commented:

build_jenkins

mvafin (Contributor) commented on Aug 28, 2025:

build_jenkins

Mohamed-Ashraf273 (Contributor, Author) commented:

@CuriousPanCake
@mvafin

Mohamed-Ashraf273 (Contributor, Author) commented:

@CuriousPanCake
@mvafin
@rkazants
After rerunning the tests locally, all ov_transformations_tests, ov_cpu_func_tests, and keras_hub tests are now passing successfully.
This PR is now ready!

rkazants (Collaborator) commented on Sep 1, 2025:

build_jenkins

rkazants (Collaborator) commented on Sep 1, 2025:

build_jenkins

@praasz praasz added this to the 2025.4 milestone Sep 1, 2025
CuriousPanCake (Contributor) commented:

AFAIK the Python unit tests are to be fixed soon.

@rkazants rkazants enabled auto-merge September 2, 2025 11:06
@rkazants rkazants added this pull request to the merge queue Sep 2, 2025
auto-merge was automatically disabled September 2, 2025 14:58 (Pull Request is not mergeable)

Merged via the queue into openvinotoolkit:master with commit 4dd79a7 Sep 2, 2025
297 of 308 checks passed
@Mohamed-Ashraf273 Mohamed-Ashraf273 deleted the fix_mem_issue branch September 4, 2025 23:47
praasz pushed a commit to praasz/openvino that referenced this pull request Sep 8, 2025