Fix memory issue while compiling keras models #31873
Conversation
build_jenkins
Please add a test here: openvino/src/common/transformations/tests/common_optimizations/nop_elimination.cpp (line 29 in 658c1df).
It should verify that a model with 2 einsum ops with a shared const still has the shared const after the run.
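For illustration only, here is a rough Python sketch of the property that test should assert. The actual test belongs in the C++ nop_elimination.cpp suite referenced above; the opset import is an assumption, and running the NopElimination pass itself is omitted here because that pass is not exposed through the Python API.

```python
import numpy as np
import openvino as ov
import openvino.runtime.opset11 as ops  # assumed opset module; any recent opset works

# Build a toy model in which two Einsum ops consume the very same Constant node.
inp = ops.parameter([1, 10, 1024], np.float32, name="input")
shared_w = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="shared_w")
einsum_1 = ops.einsum([inp, shared_w], "abc,cde->abde")
einsum_2 = ops.einsum([inp, shared_w], "abc,cde->abde")
model = ov.Model([einsum_1, einsum_2], [inp], name="two_einsums_shared_const")

# The C++ test would run the transformation pipeline here; afterwards the check is
# that both Einsum ops still read their weights from a single shared producer node.
weight_producers = {
    op.input_value(1).get_node().get_name()
    for op in model.get_ops()
    if op.get_type_name() == "Einsum"
}
assert len(weight_producers) == 1, "Einsum ops no longer share their constant"
```

In the C++ fixture the same expectation would typically be expressed by comparing the transformed model against a reference model whose two Einsum ops consume one shared Constant.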
@CuriousPanCake
build_jenkins
@mvafin
build_jenkins
@Mohamed-Ashraf273 can you run functional tests on your local setup?
This is the only failed test:
@CuriousPanCake |
1b148c5 to 8d75547
The Issue: a pattern-matching inconsistency was discovered in the RoPE fusion pass.
The Evidence: debug analysis of the shape constants involved confirmed the inconsistency.
The Solution: the pattern was corrected accordingly. This fix ensures correctness, and the test passed for both models before and after the fix.
db00abf to 93f64b7
build_jenkins
@CuriousPanCake RoPE fusion was modified here, could you review whether the change makes sense?
...ansformations/src/transformations/common_optimizations/fuse_rotary_positional_embeddings.cpp
Co-authored-by: Andrii Staikov <[email protected]>
build_jenkins
build_jenkins
@CuriousPanCake
build_jenkins
build_jenkins
AFAIK the Python unit tests are to be fixed soon.
Pull Request is not mergeable
4dd79a7
@rkazants @mvafin @mlukasze @evkotov @CuriousPanCake @itikhono

### Performance issue description

## Problem

The OpenVINO backend exhibits **excessive memory consumption** during GPT-2 model inference compared to the other Keras backends (TensorFlow, PyTorch, JAX). The issue occurs during the model compilation phase, when converting from Keras to OpenVINO format, and results in significantly higher memory usage that makes OpenVINO unsuitable for memory-constrained environments.

**Problem**: OpenVINO uses substantially more memory than other backends during the compilation/inference phase.

## Summary of the solution

This solves issue openvinotoolkit#31390.

I first tried to solve this problem by introducing `EinsumDecomposition` at MOC in this PR: openvinotoolkit#31482. But I found another solution:

My first fix was to add `EinsumDecomposition` in MOC, and I found that both this version and the original `EinsumDecomposition` in `CommonOptimizations` introduced `Broadcast` nodes. However, in my fix the MOC pipeline later removed them, which allowed constants to be shared before the `ConstantFolding` pass that otherwise duplicates them in `CommonOptimizations`, leading to reduced memory usage.

By comparing the two, I realized that both decompositions actually produced the same graph initially, but the MOC version benefited from an additional simplification step that cleaned up the broadcasts. After debugging, I identified the responsible pass as `NopElimination`. When I applied this pass in `CommonOptimizations` just before `ConstantFolding`, it achieved the same effect: broadcasts disappeared, constants were shared, and memory usage dropped, without needing to move `EinsumDecomposition` into MOC.

### 📊 Complete Analysis & Benchmarks

For a comprehensive performance comparison, optimization results, and technical details across all Keras backends:

**[Detailed Performance Report & Memory Optimization Analysis](https://gist.github.com/Mohamed-Ashraf273/1ecc15bd5e83c229d7e3f07851624bc8)**

The report includes cross-backend benchmarks before and after both fixes; both fixes gave the same results for OpenVINO.

---

### Step-by-step reproduction

Use the keras source: https://github.com/keras-team/keras.git
Also use this PR from keras_hub: keras-team/keras-hub#2350

```python
import os

os.environ["KERAS_BACKEND"] = "openvino"

import keras_hub

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
output = causal_lm.generate("Hello", max_length=10)  # Memory spike occurs here
```

Example Graph:

```python
import numpy as np
import openvino as ov
import openvino.runtime.opset11 as ops  # assumed import for the opset helpers used below


def create_einsum_constant_model():
    """Create a model with both constant and non-constant einsum patterns from different sources."""
    input_tensor = ops.parameter([1, 10, 1024], np.float32, name="input")

    # Create diverse constant sources for einsum operations
    # Source 1: Direct constant weight matrix
    weight_data_1 = np.random.randn(1024, 16, 64).astype(np.float32)
    const_weight_1 = ops.constant(weight_data_1, name="const_weight_1")

    # Source 2: Constant from addition
    base_weight_2 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="base_weight_2")
    bias_weight_2 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="bias_weight_2")
    const_weight_2 = ops.add(base_weight_2, bias_weight_2)  # Constant folded

    # Source 3: Constant from multiply (your original source)
    base_weight_3 = ops.constant(np.random.randn(1024, 16, 64).astype(np.float32), name="base_weight_3")
    scale_3 = ops.constant(np.array(0.125, dtype=np.float32), name="scale_3")
    const_weight_3 = ops.multiply(base_weight_3, scale_3)  # Constant folded

    # Source 4: Constant from reshape
    flat_weight_4 = ops.constant(np.random.randn(1024 * 16 * 64).astype(np.float32), name="flat_weight_4")
    const_weight_4 = ops.reshape(flat_weight_4, [1024, 16, 64], special_zero=False)

    # Source 5: Constant from transpose
    orig_weight_5 = ops.constant(np.random.randn(16, 1024, 64).astype(np.float32), name="orig_weight_5")
    const_weight_5 = ops.transpose(orig_weight_5, [1, 0, 2])  # [1024, 16, 64]

    current = input_tensor

    # Create 10 einsum operations with constants (WILL BE OPTIMIZED)
    const_sources = [const_weight_1, const_weight_2, const_weight_3, const_weight_4, const_weight_5]
    for i in range(5):  # Use each constant source twice (5*2 = 10)
        for j in range(2):
            const_idx = i
            einsum_out = ops.einsum([current, const_sources[const_idx]], "abc,cde->abde")

            # Add bias to continue the chain
            bias = ops.constant(np.random.randn(16, 64).astype(np.float32), name=f"bias_{i}_{j}")
            current = ops.add(einsum_out, bias)

            # Reshape to prepare for next iteration
            if i < 4 or j < 1:  # Not the last iteration
                proj_weight = ops.constant(np.random.randn(16 * 64, 1024).astype(np.float32), name=f"proj_{i}_{j}")
                reshaped = ops.reshape(current, [1, 10, 16 * 64], special_zero=False)
                current = ops.matmul(reshaped, proj_weight, transpose_a=False, transpose_b=False)

    # Now create variable tensors from different sources for non-constant einsums
    # Start fresh with current tensor for variable operations
    var_source = ops.reshape(current, [1, 10, 16, 64], special_zero=False)

    # Create 20 einsum operations without constants (WON'T BE OPTIMIZED)
    for i in range(10):
        # Source 1: Split operations to create variable tensors
        split_axis = ops.constant(np.array(3, dtype=np.int32), name=f"split_axis_{i}")
        split_lengths = ops.constant(np.array([32, 32], dtype=np.int32), name=f"split_lengths_{i}")
        split_result = ops.variadic_split(var_source, split_axis, split_lengths)
        var_tensor_1 = split_result.output(0)  # [1, 10, 16, 32] - Variable
        var_tensor_2 = split_result.output(1)  # [1, 10, 16, 32] - Variable

        # EINSUM 1: Element-wise pattern (variable x variable)
        einsum_var_1 = ops.einsum([var_tensor_1, var_tensor_2], "abcd,abcd->abcd")

        # Source 2: Create more variable tensors from different operations
        # Use subtract to create another variable tensor
        var_tensor_3 = ops.subtract(var_tensor_1, var_tensor_2)  # [1, 10, 16, 32] - Variable
        # Use relu to create another variable tensor
        var_tensor_4 = ops.relu(var_tensor_2)  # [1, 10, 16, 32] - Variable

        # EINSUM 2: Another variable x variable pattern
        einsum_var_2 = ops.einsum([var_tensor_3, var_tensor_4], "abcd,abcd->abcd")

        # Combine and use for next iteration
        combined = ops.add(einsum_var_1, einsum_var_2)

        # Concatenate back to [1, 10, 16, 64] for next iteration
        var_source = ops.concat([combined, combined], axis=3)  # [1, 10, 16, 64]

    # Final projection to output
    final_proj = ops.constant(np.random.randn(16 * 64, 1024).astype(np.float32), name="final_proj")
    final_reshaped = ops.reshape(var_source, [1, 10, 16 * 64], special_zero=False)
    final_output = ops.matmul(final_reshaped, final_proj, transpose_a=False, transpose_b=False)

    # Final output
    model = ov.Model([final_output], [input_tensor], name="EinsumConstantTest")

    # Print model statistics
    ops_by_type = {}
    for op in model.get_ops():
        op_type = op.get_type_name()
        ops_by_type[op_type] = ops_by_type.get(op_type, 0) + 1

    print("Original model operations:")
    for op_type, count in sorted(ops_by_type.items()):
        print(f"  {op_type}: {count}")

    print("\nEinsum breakdown:")
    print("  - Einsums with constants (WILL BE OPTIMIZED): 10")
    print("    * From direct constant: 2")
    print("    * From constant addition: 2")
    print("    * From constant multiply: 2")
    print("    * From constant reshape: 2")
    print("    * From constant transpose: 2")
    print("  - Einsums without constants (WON'T BE OPTIMIZED): 20")
    print("    * From variadic_split operations: 10")
    print("    * From subtract + relu operations: 10")
    print("  - Total Einsums: 30")

    return model
```

You can find the original IR, the compiled IR, and the IR before and after NopElimination here:
https://drive.google.com/drive/folders/1xxNVFotGOZLeUf5ECtmJhm4fytJNoBLN?usp=sharing

---

Original Graph:

<img width="1130" height="918" alt="Screenshot from 2025-08-26 12-40-15" src="https://github.com/user-attachments/assets/37a93d33-4dd4-4b6b-9f83-1c21676e6551" />

Before NopElimination:

<img width="655" height="919" alt="Screenshot from 2025-08-26 15-20-51" src="https://github.com/user-attachments/assets/45fe58dc-b702-4510-b30a-1cc15cc43acc" />

After NopElimination:

<img width="655" height="919" alt="Screenshot from 2025-08-26 15-21-26" src="https://github.com/user-attachments/assets/1b7f19a6-45f8-4d60-b04d-bcd416749267" />

---------

Co-authored-by: Maxim Vafin <[email protected]>
Co-authored-by: Andrii Staikov <[email protected]>
Co-authored-by: Roman Kazantsev <[email protected]>
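Not part of the PR, but as a reference for the benchmark numbers above: a minimal sketch of how the memory spike can be observed locally during the reproduction. It assumes psutil is installed; the rss_mb helper and the printed labels are illustrative only.

```python
import os

os.environ["KERAS_BACKEND"] = "openvino"  # must be set before importing keras/keras_hub

import keras_hub
import psutil

proc = psutil.Process(os.getpid())


def rss_mb():
    # Resident set size of the current process, in MiB.
    return proc.memory_info().rss / (1024 * 1024)


print(f"baseline:            {rss_mb():.0f} MiB")
causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
print(f"after preset load:   {rss_mb():.0f} MiB")
output = causal_lm.generate("Hello", max_length=10)  # compilation + first inference happen here
print(f"after generate():    {rss_mb():.0f} MiB")
```

Switching KERAS_BACKEND between backends, or running with and without this fix, and comparing the last line should reproduce the gap described in the linked report.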



