[Performance] Solve high memory usage issue during model compilation using OpenVINO backend on Keras 3 #31482
Conversation
```cpp
// the order is important
const char* enable_einsum = std::getenv("OV_ENABLE_EINSUM_DECOMPOSITION");
if (enable_einsum) {
    REGISTER_PASS(manager, EinsumDecomposition)
```
I don't think this is a good way to fix this. Doing this in MOC means we will have decomposed einsum in the IR.
As I understand, this is really needed only for einsums that have constant inputs, so that they can be constant-folded before reaching the plugin. Can we do it differently? Maybe modify this transformation to work only on constant inputs for the offline step? @CuriousPanCake
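For context, a minimal numpy sketch of why the constant-input case is the one worth decomposing early (this is not the actual `EinsumDecomposition` output, and the shapes are assumptions): the constant-side handling can be folded offline, leaving only a plain MatMul at runtime.

```python
import numpy as np

# Assumed shapes, echoing an EinsumDense-style projection: x is a runtime
# activation, W is a weight that becomes a Constant after ConstantFolding.
x = np.random.randn(1, 10, 1024).astype(np.float32)
W = np.random.randn(1024, 16, 64).astype(np.float32)

# Offline (constant-foldable) part: reshaping the constant weight once.
W_2d = W.reshape(1024, 16 * 64)

# Runtime part: a plain matmul plus a reshape of the result.
out = (x.reshape(-1, 1024) @ W_2d).reshape(1, 10, 16, 64)

# Same result as the original einsum, but the weight-side work is gone at runtime.
assert np.allclose(out, np.einsum("abc,cde->abde", x, W), rtol=1e-3, atol=1e-2)
```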
@mvafin
I updated it to check if at least one of the inputs is a constant, and it worked too.
from:
================================================================================
FIXED MEMORY TEST: KERAS GPT2 + OPENVINO
================================================================================
[STAGE] 0_INITIAL: 775.24 MB (swap: 0.00 MB) - Initial state after imports
>>> Loading GPT2 model from preset...
[STAGE] 1_MODEL_LOADED: 2314.67 MB (swap: 0.00 MB) - gpt2_medium_en model loaded (10.0s)
[STAGE] 2_BEFORE_INFERENCE: 2314.67 MB (swap: 0.00 MB) - Before first inference
>>> Running first inference (compilation + execution)...
⏳ Converting Keras -> OPENVINO and compiling...
[STAGE] 3_FIRST_INFERENCE: 4512.82 MB (swap: 0.00 MB) - First inference completed via generate (7.7s)
>>> Second inference (no compilation)...
[STAGE] 4_SECOND_INFERENCE: 4510.38 MB (swap: 0.00 MB) - Second inference (2.0s)
[STAGE] 5_FINAL: 4510.38 MB (swap: 0.00 MB) - Final state
================================================================================
PERFORMANCE RESULTS
================================================================================
✅ Generated text: 'Hello everyone,
We've been busy'
✅ Second generation: 'Testimony before the House Judiciary Committee on April'
Backend: openvino
First inference latency: 7.69s
Second inference latency: 2.045s
Throughput: 0.65 tokens/sec
Speedup: 3.8x
📊 DETAILED MEMORY ANALYSIS:
+---------------------+------------+-------------+--------------+---------------+
| STAGE | RAM (MB) | SWAP (MB) | RAM CHANGE | SWAP CHANGE |
+=====================+============+=============+==============+===============+
| Initial | 775.2 | 0 | - | - |
+---------------------+------------+-------------+--------------+---------------+
| After model load | 2314.7 | 0 | +1539.4 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Before inference | 2314.7 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 1st inference | 4512.8 | 0 | +2198.1 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 2nd inference | 4510.4 | 0 | -2.4 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Final | 4510.4 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Peak recorded | 4522.9 | 0 | +3747.7 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
🔍 MAIN MEMORY CONSUMERS:
📚 Model loading: +1539.4 MB RAM +0.0 MB swap (41.2% of total)
⚡ Compilation/inference: +2198.1 MB RAM +0.0 MB swap (58.9% of total)
📈 SUMMARY:
💾 Total RAM growth: +3735.1 MB
💿 Total swap change: +0.0 MB
📊 Peak RAM consumption: +3747.7 MB above initial
🔥 Highest RAM recorded: 4522.9 MB
💿 Peak swap consumption: +0.0 MB above initial
🔥 Highest swap recorded: 0.0 MB
🎯 MEMORY HEALTH CHECK:
❌ CRITICAL: RAM usage 3748 MB is very high (target <1GB)
✅ GOOD: Low peak swap usage 0 MB
🚨 ALERT: Combined memory impact 4523 MB is very high
🎯 Test completed: {'success': True, 'model_loading_mb': 1539.4296875, 'compilation_mb': 2198.1484375, 'total_mb': 3735.13671875, 'peak_mb': 3747.6640625, 'peak_swap_mb': 0.0}
to:
[STAGE] 0_INITIAL: 781.90 MB (swap: 0.00 MB) - Initial state after imports
>>> Loading GPT2 model from preset...
[STAGE] 1_MODEL_LOADED: 2321.91 MB (swap: 0.00 MB) - gpt2_medium_en model loaded (13.4s)
[STAGE] 2_BEFORE_INFERENCE: 2321.91 MB (swap: 0.00 MB) - Before first inference
>>> Running first inference (compilation + execution)...
⏳ Converting Keras -> OPENVINO and compiling...
[STAGE] 3_FIRST_INFERENCE: 3548.79 MB (swap: 0.00 MB) - First inference completed via generate (7.6s)
>>> Second inference (no compilation)...
[STAGE] 4_SECOND_INFERENCE: 3546.42 MB (swap: 0.00 MB) - Second inference (2.7s)
[STAGE] 5_FINAL: 3546.42 MB (swap: 0.00 MB) - Final state
================================================================================
PERFORMANCE RESULTS
================================================================================
✅ Generated text: 'Hello! I'm a student studying computer programming'
✅ Second generation: 'Testimonials
I was a new'
Backend: openvino
First inference latency: 7.62s
Second inference latency: 2.673s
Throughput: 0.92 tokens/sec
Speedup: 2.9x
📊 DETAILED MEMORY ANALYSIS:
+---------------------+------------+-------------+--------------+---------------+
| STAGE | RAM (MB) | SWAP (MB) | RAM CHANGE | SWAP CHANGE |
+=====================+============+=============+==============+===============+
| Initial | 781.9 | 0 | - | - |
+---------------------+------------+-------------+--------------+---------------+
| After model load | 2321.9 | 0 | +1540.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Before inference | 2321.9 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 1st inference | 3548.8 | 0 | +1226.9 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 2nd inference | 3546.4 | 0 | -2.4 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Final | 3546.4 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Peak recorded | 3567.8 | 0 | +2785.9 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
🔍 MAIN MEMORY CONSUMERS:
📚 Model loading: +1540.0 MB RAM +0.0 MB swap (55.7% of total)
⚡ Compilation/inference: +1226.9 MB RAM +0.0 MB swap (44.4% of total)
📈 SUMMARY:
💾 Total RAM growth: +2764.5 MB
💿 Total swap change: +0.0 MB
📊 Peak RAM consumption: +2785.9 MB above initial
🔥 Highest RAM recorded: 3567.8 MB
💿 Peak swap consumption: +0.0 MB above initial
🔥 Highest swap recorded: 0.0 MB
🎯 MEMORY HEALTH CHECK:
❌ CRITICAL: RAM usage 2786 MB is very high (target <1GB)
✅ GOOD: Low peak swap usage 0 MB
🎯 Test completed: {'success': True, 'model_loading_mb': 1540.0078125, 'compilation_mb': 1226.88671875, 'total_mb': 2764.5234375, 'peak_mb': 2785.86328125, 'peak_swap_mb': 0.0}
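For reference, a minimal sketch of how staged RSS numbers like the ones above could be collected. This is not the author's actual test script; `psutil` and the stage names are assumptions here:

```python
import os
os.environ["KERAS_BACKEND"] = "openvino"  # must be set before keras is imported

import time
import psutil  # assumption: RSS is sampled with psutil
import keras_hub

proc = psutil.Process(os.getpid())

def stage(name, note=""):
    # Resident set size of this process, reported in MB like the logs above.
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print(f"[STAGE] {name}: {rss_mb:.2f} MB - {note}")

stage("0_INITIAL", "Initial state after imports")

t0 = time.time()
causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
stage("1_MODEL_LOADED", f"gpt2_medium_en model loaded ({time.time() - t0:.1f}s)")

t0 = time.time()
causal_lm.generate("Hello", max_length=10)  # first call: conversion + compilation + run
stage("3_FIRST_INFERENCE", f"First inference completed via generate ({time.time() - t0:.1f}s)")

t0 = time.time()
causal_lm.generate("Hello", max_length=10)  # second call: no compilation
stage("4_SECOND_INFERENCE", f"Second inference ({time.time() - t0:.1f}s)")
```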
```cpp
REGISTER_PASS(manager, ConstantFolding)
REGISTER_PASS(manager, Validate)

// the order is important
```
Please add a better comment explaining before which transformation it should be called.
Done!
```cpp
// Restrict the decomposition to Einsum nodes that have at least one Constant
// input; only these can be (partially) constant-folded before reaching the plugin.
if (m_check_const) {
    bool has_const = false;
    for (auto& input : einsum_node->input_values()) {
        auto node_ptr = input.get_node_shared_ptr();
        auto constant_ptr = ov::as_type_ptr<ov::op::v0::Constant>(node_ptr);
        if (constant_ptr) {
            has_const = true;
            break;
        }
    }
    // No constant inputs -> leave this Einsum for the regular decomposition path.
    if (!has_const)
        return false;
}
```
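For reference, a rough Python-side equivalent of this check (illustrative only; `has_constant_input` is a hypothetical helper, not part of the OpenVINO API):

```python
import openvino as ov

def has_constant_input(node) -> bool:
    """Return True if at least one input of the node is produced by a Constant."""
    return any(
        inp.get_node().get_type_name() == "Constant"
        for inp in node.input_values()
    )

# Example: list the Einsum ops in a model that would pass the constant-input check.
# model = ov.Core().read_model("model.xml")  # hypothetical model path
# eligible = [op for op in model.get_ops()
#             if op.get_type_name() == "Einsum" and has_constant_input(op)]
```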
Could you provide more details about the einsum operation you want to optimize? Maybe link to the code of the model or a picture of the subgraph.
This optimization targets specific Einsum operations in transformer models like GPT-2 where at least one input is a constant tensor. After ConstantFolding, weight matrices become constants, enabling more efficient decomposition patterns.

Specific Einsum Operations Being Optimized:

1. Query-Key attention scores computation:
   - Location: https://github.com/keras-team/keras/blob/master/keras/src/layers/attention/multi_head_attention.py#L493
   - Pattern: `einsum("aecd,abcd->acbe", key, query)`
   - Code: `attention_scores = ops.einsum(self._dot_product_equation, key, query)`
2. Attention-Value combination:
   - Location: https://github.com/keras-team/keras/blob/master/keras/src/layers/attention/multi_head_attention.py#L509-L511
   - Pattern: `einsum("acbe,aecd->abcd", attention_scores, value)`
   - Code: `attention_output = ops.einsum(self._combine_equation, final_attn_scores, value)`
3. Weight matrix projections (Q/K/V transformations):
   - Location: https://github.com/keras-team/keras/blob/master/keras/src/layers/core/einsum_dense.py#L214
   - Pattern: `einsum("abc,cd->abd", input, weight_matrix)`
   - Code: `x = ops.einsum(self.equation, inputs, self.kernel)`

Optimization Application:
Note: the optimization is only applied when at least one einsum input is constant. In the examples above:
- ✅ Weight matrix projections (example 3): `weight_matrix` becomes constant after ConstantFolding → optimization applied
- ❌ Attention scores (examples 1 & 2): both `key` and `query` are variable tensors → no optimization
For more details and examples visit:
https://gist.github.com/Mohamed-Ashraf273/59eddcd120918cb0761ffa5020800d5d
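As a quick illustration of these three patterns, here is a small numpy sketch (the shapes are assumptions echoing a GPT-2-sized block; this is not the Keras code itself):

```python
import numpy as np

# Assumed shapes: batch=1, seq=10, d_model=1024, 16 heads x 64 dims per head.
x = np.random.randn(1, 10, 1024).astype(np.float32)  # activation (variable)
w = np.random.randn(1024, 1024).astype(np.float32)   # EinsumDense kernel (constant after folding)

# 3. Weight projection: one operand is a weight -> constant input -> optimization applied.
proj = np.einsum("abc,cd->abd", x, w)

# 1./2. Attention einsums: both operands are activations -> no constant input -> skipped.
query = proj.reshape(1, 10, 16, 64)
key = proj.reshape(1, 10, 16, 64)
value = proj.reshape(1, 10, 16, 64)
scores = np.einsum("aecd,abcd->acbe", key, query)       # [1, 16, 10, 10]
attn_out = np.einsum("acbe,aecd->abcd", scores, value)  # [1, 10, 16, 64]
```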
@rkazants @mvafin @mlukasze @evkotov @CuriousPanCake @itikhono

### Performance issue description

## Problem

The OpenVINO backend exhibits **excessive memory consumption** during GPT-2 model inference compared to other Keras backends (TensorFlow, PyTorch, JAX). The issue occurs during the model compilation phase when converting from Keras to OpenVINO format, resulting in significantly higher memory usage that makes OpenVINO unsuitable for memory-constrained environments.

**Problem**: OpenVINO uses substantially more memory than other backends during the compilation/inference phase.

## Summary of the solution

Solving issue #31390. First I tried to solve this problem by introducing `EinsumDecomposition` at MOC in this PR (#31482), but I then found another solution. My first fix was to add `EinsumDecomposition` in MOC, and I found that both this version and the original `EinsumDecomposition` in `CommonOptimizations` introduced `Broadcast` nodes. However, in my fix the MOC pipeline later removed them, which allowed constants to be shared before the `ConstantFolding` pass that otherwise duplicates them in `CommonOptimizations`, leading to reduced memory usage. By comparing the two, I realized that both decompositions actually produced the same graph initially, but the MOC version benefited from an additional simplification step that cleaned up the broadcasts. After debugging, I identified the responsible pass as `NopElimination`. When I applied this pass in `CommonOptimizations` just before `ConstantFolding`, it achieved the same effect: broadcasts disappeared, constants were shared, and memory usage dropped, without needing to move `EinsumDecomposition` into MOC. (A loose numpy analogy of this constant-sharing effect is sketched after this description.)

### 📊 Complete Analysis & Benchmarks

For a comprehensive performance comparison, optimization results, and technical details across all Keras backends:

**[Detailed Performance Report & Memory Optimization Analysis](https://gist.github.com/Mohamed-Ashraf273/1ecc15bd5e83c229d7e3f07851624bc8)**

The report includes cross-backend benchmarks before and after both fixes, which gave the same results for OpenVINO.

---

### Step-by-step reproduction

Use keras source: https://github.com/keras-team/keras.git
Also use this PR from keras_hub: keras-team/keras-hub#2350

```python
import os
os.environ["KERAS_BACKEND"] = "openvino"

import keras_hub

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
output = causal_lm.generate("Hello", max_length=10)  # Memory spike occurs here
```

Example Graph:

```python
# Assumed imports: the original snippet relies on `np`, `ov` and an opset alias `ops`.
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as ops


def create_einsum_constant_model():
    """Create a model with both constant and non-constant einsum patterns from different sources"""
    input_tensor = ops.parameter([1, 10, 1024], np.float32, name="input")

    # Create diverse constant sources for einsum operations
    # Source 1: Direct constant weight matrix
    weight_data_1 = np.random.randn(1024, 16, 64).astype(np.float32)
    const_weight_1 = ops.constant(weight_data_1, name="const_weight_1")

    # Source 2: Constant from addition
    base_weight_2 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="base_weight_2")
    bias_weight_2 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="bias_weight_2")
    const_weight_2 = ops.add(base_weight_2, bias_weight_2)  # Constant folded

    # Source 3: Constant from multiply (your original source)
    base_weight_3 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="base_weight_3")
    scale_3 = ops.constant(np.array(0.125, dtype=np.float32), name="scale_3")
    const_weight_3 = ops.multiply(base_weight_3, scale_3)  # Constant folded

    # Source 4: Constant from reshape
    flat_weight_4 = ops.constant(np.random.randn(1024 * 16 * 64).astype(np.float32), name="flat_weight_4")
    const_weight_4 = ops.reshape(flat_weight_4, [1024, 16, 64], special_zero=False)

    # Source 5: Constant from transpose
    orig_weight_5 = ops.constant(np.random.randn(16, 1024, 64).astype(np.float32), name="orig_weight_5")
    const_weight_5 = ops.transpose(orig_weight_5, [1, 0, 2])  # [1024, 16, 64]

    current = input_tensor

    # Create 10 einsum operations with constants (WILL BE OPTIMIZED)
    const_sources = [const_weight_1, const_weight_2, const_weight_3, const_weight_4, const_weight_5]
    for i in range(5):
        # Use each constant source twice (5*2 = 10)
        for j in range(2):
            const_idx = i
            einsum_out = ops.einsum([current, const_sources[const_idx]], "abc,cde->abde")

            # Add bias to continue the chain
            bias = ops.constant(np.random.randn(16, 64).astype(np.float32), name=f"bias_{i}_{j}")
            current = ops.add(einsum_out, bias)

            # Reshape to prepare for next iteration
            if i < 4 or j < 1:  # Not the last iteration
                proj_weight = ops.constant(np.random.randn(16 * 64, 1024).astype(np.float32), name=f"proj_{i}_{j}")
                reshaped = ops.reshape(current, [1, 10, 16 * 64], special_zero=False)
                current = ops.matmul(reshaped, proj_weight, transpose_a=False, transpose_b=False)

    # Now create variable tensors from different sources for non-constant einsums
    # Start fresh with current tensor for variable operations
    var_source = ops.reshape(current, [1, 10, 16, 64], special_zero=False)

    # Create 20 einsum operations without constants (WON'T BE OPTIMIZED)
    for i in range(10):
        # Source 1: Split operations to create variable tensors
        split_axis = ops.constant(np.array(3, dtype=np.int32), name=f"split_axis_{i}")
        split_lengths = ops.constant(np.array([32, 32], dtype=np.int32), name=f"split_lengths_{i}")
        split_result = ops.variadic_split(var_source, split_axis, split_lengths)
        var_tensor_1 = split_result.output(0)  # [1, 10, 16, 32] - Variable
        var_tensor_2 = split_result.output(1)  # [1, 10, 16, 32] - Variable

        # EINSUM 1: Element-wise pattern (variable x variable)
        einsum_var_1 = ops.einsum([var_tensor_1, var_tensor_2], "abcd,abcd->abcd")

        # Source 2: Create more variable tensors from different operations
        # Use subtract to create another variable tensor
        var_tensor_3 = ops.subtract(var_tensor_1, var_tensor_2)  # [1, 10, 16, 32] - Variable
        # Use relu to create another variable tensor
        var_tensor_4 = ops.relu(var_tensor_2)  # [1, 10, 16, 32] - Variable

        # EINSUM 2: Another variable x variable pattern
        einsum_var_2 = ops.einsum([var_tensor_3, var_tensor_4], "abcd,abcd->abcd")

        # Combine and use for next iteration
        combined = ops.add(einsum_var_1, einsum_var_2)

        # Concatenate back to [1, 10, 16, 64] for next iteration
        var_source = ops.concat([combined, combined], axis=3)  # [1, 10, 16, 64]

    # Final projection to output
    final_proj = ops.constant(np.random.randn(16 * 64, 1024).astype(np.float32), name="final_proj")
    final_reshaped = ops.reshape(var_source, [1, 10, 16 * 64], special_zero=False)
    final_output = ops.matmul(final_reshaped, final_proj, transpose_a=False, transpose_b=False)

    # Final output
    model = ov.Model([final_output], [input_tensor], name="EinsumConstantTest")

    # Print model statistics
    ops_by_type = {}
    for op in model.get_ops():
        op_type = op.get_type_name()
        ops_by_type[op_type] = ops_by_type.get(op_type, 0) + 1

    print("Original model operations:")
    for op_type, count in sorted(ops_by_type.items()):
        print(f"  {op_type}: {count}")

    print(f"\nEinsum breakdown:")
    print(f"  - Einsums with constants (WILL BE OPTIMIZED): 10")
    print(f"    * From direct constant: 2")
    print(f"    * From constant addition: 2")
    print(f"    * From constant multiply: 2")
    print(f"    * From constant reshape: 2")
    print(f"    * From constant transpose: 2")
    print(f"  - Einsums without constants (WON'T BE OPTIMIZED): 20")
    print(f"    * From variadic_split operations: 10")
    print(f"    * From subtract + relu operations: 10")
    print(f"  - Total Einsums: 30")

    return model
```

You can find the original IR, the compiled IR, and the IR before and after NopElimination here:
https://drive.google.com/drive/folders/1xxNVFotGOZLeUf5ECtmJhm4fytJNoBLN?usp=sharing

---

Original Graph:
<img width="1130" height="918" alt="Screenshot from 2025-08-26 12-40-15" src="https://github.com/user-attachments/assets/37a93d33-4dd4-4b6b-9f83-1c21676e6551" />

Before NopElimination:
<img width="655" height="919" alt="Screenshot from 2025-08-26 15-20-51" src="https://github.com/user-attachments/assets/45fe58dc-b702-4510-b30a-1cc15cc43acc" />

After NopElimination:
<img width="655" height="919" alt="Screenshot from 2025-08-26 15-21-26" src="https://github.com/user-attachments/assets/1b7f19a6-45f8-4d60-b04d-bcd416749267" />

---------

Co-authored-by: Maxim Vafin <[email protected]>
Co-authored-by: Andrii Staikov <[email protected]>
Co-authored-by: Roman Kazantsev <[email protected]>
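As referenced in the summary above, here is a loose numpy analogy (not the actual pass behavior; the consumer count and weight shape are assumptions) of why eliminating the no-op `Broadcast` nodes before `ConstantFolding` keeps one shared constant instead of one folded copy per consumer:

```python
import numpy as np

w = np.random.randn(1024, 1024).astype(np.float32)  # one shared ~4 MB weight
n_consumers = 24                                     # assumed number of consuming nodes

# Without NopElimination: ConstantFolding folds broadcast(w, w.shape) separately for
# each consumer, materialising an independent copy of the data every time.
folded_copies = [np.broadcast_to(w, w.shape).copy() for _ in range(n_consumers)]
print("duplicated:", sum(c.nbytes for c in folded_copies) / 2**20, "MB")  # ~96 MB

# With NopElimination first: the no-op Broadcast disappears, so every consumer keeps
# referencing the single original constant.
shared_refs = [w for _ in range(n_consumers)]
print("shared:", w.nbytes / 2**20, "MB")  # ~4 MB
```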
@rkazants
@itikhono
Solving issue #31390, and referring back to #30934.
Adding `EinsumDecomposition` to `MOC transformations` helped reduce memory usage during model compilation. Running this script using the memory profiling from #31516:
Use keras source https://github.com/keras-team/keras.git
Also use this PR from keras_hub: keras-team/keras-hub#2350
Then run the following script.
Then enable `os.environ["OV_ENABLE_MEMORY_PROFILING"] = "1"` by uncommenting it.
Without fix:
With fix:
By adding:
Note: its position in the pass order is important.
I am still exploring what else can help reduce memory usage further. I would appreciate any suggestions or recommendations.