[OpenVINO] support ai-sage/GigaChat3-10B-A1.8B-bf16 #1626

Open
Mohamed-Ashraf273 wants to merge 17 commits into huggingface:main from Mohamed-Ashraf273:support_gigachat3

Conversation

@Mohamed-Ashraf273
Contributor

@Mohamed-Ashraf273 Mohamed-Ashraf273 commented Feb 28, 2026

What does this PR do?

Conversion command line for ai-sage/GigaChat3-10B-A1.8B-bf16:

optimum-cli export openvino -m ai-sage/GigaChat3-10B-A1.8B-bf16 ./output_dir --task text-generation-with-past

Inference of ai-sage/GigaChat3-10B-A1.8B-bf16 using OpenVINO backend:

from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

model_dir = "output_dir"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir)

# Prepare input
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")

# Run inference
output_ids = model.generate(**inputs, max_new_tokens=10)
output_text = tokenizer.decode(output_ids[0])

print(output_text)

Solving Issue: #1608

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@Mohamed-Ashraf273 Mohamed-Ashraf273 changed the title modify patcher [OpenVINO] support gigachat3 Feb 28, 2026
@Mohamed-Ashraf273 Mohamed-Ashraf273 marked this pull request as ready for review March 3, 2026 13:19
@Mohamed-Ashraf273
Contributor Author

Hi @popovaan ,
Can you take a look?
Thanks!

@Mohamed-Ashraf273 Mohamed-Ashraf273 changed the title [OpenVINO] support gigachat3 [OpenVINO] support ai-sage/GigaChat3-10B-A1.8B-bf16 Mar 3, 2026
@popovaan
Copy link
Collaborator

popovaan commented Mar 3, 2026

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the optimum-intel-internal-testing group so that you can publish the model there. If not, I’ll publish it myself and share the link.

@Mohamed-Ashraf273
Contributor Author

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the optimum-intel-internal-testing group so that you can publish the model there. If not, I’ll publish it myself and share the link.

Got it, thanks!
I’ll add the tests with a locally generated tiny model.

@Mohamed-Ashraf273
Contributor Author

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the optimum-intel-internal-testing group so that you can publish the model there. If not, I’ll publish it myself and share the link.

Hi @popovaan, @rkazants,
I've added a tiny model along with the tests. Could you please take a look?
Thanks!

Collaborator

@rkazants rkazants left a comment


Please also add export tests: the same test set that you added for the previous model.
Also update the documentation.

Contributor

Copilot AI left a comment


Pull request overview

This PR aims to add OpenVINO export/inference support coverage for the ai-sage/GigaChat3-10B-A1.8B-bf16 family by extending OpenVINO test fixtures and adjusting DeepSeek patching logic used during export.

Changes:

  • Add a gigachat3 tiny-random model fixture and include it in OpenVINO decoder integration coverage.
  • Update decoder tests for gigachat3 (expected SDPA count, relaxed logits tolerance, and skip conditions for incompatible Transformers versions).
  • Refactor DeepSeek attention patching to use a versioned factory function and extend MoE patching to handle MLP blocks exposing experts but not moe_infer.
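The "versioned factory function" pattern mentioned above can be sketched generically. Everything below is illustrative (the function names and the 4.54.0 cutoff are hypothetical), not the actual model_patcher API:

```python
# Sketch of a version-dispatched forward factory: select the patched forward
# matching the installed transformers API once, at patch time, instead of
# branching inside the forward itself. Names and the version cutoff are
# hypothetical, not the real optimum-intel patcher code.
from packaging import version


def make_attention_forward(transformers_version: str):
    """Return a forward function compatible with the given transformers version."""
    if version.parse(transformers_version) >= version.parse("4.54.0"):
        def forward(hidden_states):
            # branch for the newer attention call convention
            return ("new", hidden_states)
    else:
        def forward(hidden_states):
            # branch for the legacy call convention
            return ("legacy", hidden_states)
    return forward


print(make_attention_forward("4.57.1")("x"))  # ('new', 'x')
print(make_attention_forward("4.46.0")("x"))  # ('legacy', 'x')
```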

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
tests/openvino/utils_tests.py Adds the gigachat3 test model mapping; adjusts which models are treated as remote-code in tests.
tests/openvino/test_decoder.py Adds gigachat3 to tested architectures and config expectations; tweaks tolerance/skip logic; adds debug output.
optimum/exporters/openvino/model_patcher.py Updates DeepSeek patcher to use a unified attention forward factory and broadens MoE patching behavior.


@Mohamed-Ashraf273
Contributor Author

Mohamed-Ashraf273 commented Mar 4, 2026

Please also add export tests: the same test set that you added for the previous model. Also update the documentation.

@rkazants
Thanks for your feedback!
I've added export tests and updated documentation for the newly added model!

@Mohamed-Ashraf273
Contributor Author

Mohamed-Ashraf273 commented Mar 4, 2026

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the optimum-intel-internal-testing group so that you can publish the model there. If not, I’ll publish it myself and share the link.

Hi @popovaan,

I’ve finished adding the tests and temporarily published tiny-random-gigachat3 on my Hugging Face profile (mohamedashraf273/tiny-random-gigachat3) until it can be moved to optimum-intel-internal-testing.

Would it be possible to invite me to the group so I can publish it there, or would you prefer to handle the publishing?

Please let me know if any changes are needed.
Thanks!

@savvadesogle

Hi. Can I help test the model?
You are so great, thank you so much!♥️🔥😊

@Mohamed-Ashraf273
Contributor Author

Hi. Can I help test the model?
You are so great, thank you so much!♥️🔥😊

Hi!
That would be great, thank you so much!
Please feel free to test it and let me know if you encounter any issues or unexpected behavior.

@rkazants
Collaborator

rkazants commented Mar 5, 2026

@Mohamed-Ashraf273, please check tests locally on your machine:
[screenshot]

@popovaan
Collaborator

popovaan commented Mar 5, 2026

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the optimum-intel-internal-testing group so that you can publish the model there. If not, I’ll publish it myself and share the link.

Hi @popovaan,

I’ve finished adding the tests and temporarily published tiny-random-gigachat3 on my Hugging Face profile (mohamedashraf273/tiny-random-gigachat3) until it can be moved to optimum-intel-internal-testing.

Would it be possible to invite me to the group so I can publish it there, or would you prefer to handle the publishing?

Please let me know if any changes are needed. Thanks!

Hi @Mohamed-Ashraf273!
I checked your model, it’s currently about 269 MB, which is too large. Could you please reduce it to around 5 MB?

You can refer to these tiny model examples, the generation code is available in the model cards:
https://huggingface.co/optimum-intel-internal-testing/tiny-random-lfm2
https://huggingface.co/optimum-intel-internal-testing/tiny-random-mistral3
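For reference, building such a tiny-random checkpoint mostly means shrinking every config dimension. The sketch below uses LlamaConfig purely as a stand-in (the real tiny-random-gigachat3 fixture would be built from the GigaChat3/DeepSeek configuration class):

```python
# Tiny-random model sketch. LlamaConfig is a stand-in; the actual fixture
# uses the GigaChat3/DeepSeek config class, but the idea is the same:
# shrink every dimension so the serialized weights stay in the low-MB range.
import tempfile

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=1000,
    hidden_size=16,
    intermediate_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    max_position_embeddings=128,
)
model = LlamaForCausalLM(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")  # tens of thousands, well under 5 MB in fp32

with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir)  # in practice, push this folder to the Hub
```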

@savvadesogle

savvadesogle commented Mar 5, 2026

@Mohamed-Ashraf273
Hello!

xeon 2699v3 x2
A770 x2
Ubuntu
Linux xpu 6.19.4-061904-generic #202602262342 SMP PREEMPT_DYNAMIC Fri Feb 27 00:13:55 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
intel mesa v: 26.0.1
Driver: XE

Conversion works (int4/int8)
[screenshots]

Working with openvino-genai (openarc) - At least I think so.

openvino 2026.1.0.dev20260221
openvino-genai 2026.1.0.0.dev20260221
openvino-telemetry 2025.2.0
openvino-tokenizers 2026.1.0.0.dev20260221

[screenshot]

The only thing I need to figure out is how to set up the chat template. Or maybe there is a problem with the conversion. Or I just need to update openvino-genai 🤷.
[screenshots]

LOG

logs.txt

Metrics 1x A770

[LLM Worker: GigaChat3-10B-A1.8B-ov-int8] Metrics: {'load_time (s)': 38.04, 'ttft (s)': 1.9, 'tpot (ms)': 85.07317, 'prefill_throughput (tokens/s)': 522.06, 'decode_throughput (tokens/s)': 11.75459, 'decode_duration (s)': 8.53637, 'input_token': 992, 'new_token': 79, 'total_token': 1071, 'stream': True, 'stream_chunk_tokens': 1}

Low GPU utilization and low decoding speed

[screenshots]

After

sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400

PP is better:
[screenshot]

@Mohamed-Ashraf273
Contributor Author

Mohamed-Ashraf273 commented Mar 5, 2026

@savvadesogle
Great work!
I found a few issues as well and tried to address them. You can check the updated model if you’d like.

@Mohamed-Ashraf273
Contributor Author

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the optimum-intel-internal-testing group so that you can publish the model there. If not, I’ll publish it myself and share the link.

Hi @popovaan,
I’ve finished adding the tests and temporarily published tiny-random-gigachat3 on my Hugging Face profile (mohamedashraf273/tiny-random-gigachat3) until it can be moved to optimum-intel-internal-testing.
Would it be possible to invite me to the group so I can publish it there, or would you prefer to handle the publishing?
Please let me know if any changes are needed. Thanks!

Hi @Mohamed-Ashraf273! I checked your model, it’s currently about 269 MB, which is too large. Could you please reduce it to around 5 MB?

You can refer to these tiny model examples, the generation code is available in the model cards: https://huggingface.co/optimum-intel-internal-testing/tiny-random-lfm2 https://huggingface.co/optimum-intel-internal-testing/tiny-random-mistral3

@popovaan
Thanks for letting me know!
I've uploaded a 2.07 MB model.

@Mohamed-Ashraf273
Contributor Author

@Mohamed-Ashraf273, please check tests locally on your machine: [screenshot]

@rkazants
Thanks for letting me know!
Tests are now passing on my local machine.

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-7.4.4, pluggy-1.6.0 -- /home/mohamed-ashraf/Desktop/projects/env/bin/python3
cachedir: .pytest_cache
rootdir: /home/mohamed-ashraf/Desktop/projects/GSoC26/optimum-intel
configfile: pyproject.toml
plugins: anyio-4.12.1, langsmith-0.7.10
collecting ... collected 497 items / 489 deselected / 8 selected

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_beam_search_59_deepseek SKIPPED [ 12%]
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_beam_search_60_gigachat3 SKIPPED [ 25%]
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek PASSED [ 37%]
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3 PASSED [ 50%]
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_pipeline_59_deepseek SKIPPED [ 62%]
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_pipeline_60_gigachat3 SKIPPED [ 75%]
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek PASSED [ 87%]
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3 PASSED [100%]

=============================== warnings summary ===============================
../../../../../../usr/lib/python3.12/multiprocessing/popen_fork.py:66: 1 warning
tests/openvino/test_decoder.py: 122 warnings
tests/openvino/test_export.py: 242 warnings
  /usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=213103) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../env/lib/python3.12/site-packages/torch/jit/_script.py:1480
  /home/mohamed-ashraf/Desktop/projects/env/lib/python3.12/site-packages/torch/jit/_script.py:1480: DeprecationWarning: `torch.jit.script` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
  /home/mohamed-ashraf/Desktop/projects/env/lib/python3.12/site-packages/torch/jit/_trace.py:1000: DeprecationWarning: `torch.jit.trace` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
  /home/mohamed-ashraf/Desktop/projects/env/lib/python3.12/site-packages/torch/jit/_trace.py:1139: DeprecationWarning: `torch.jit.trace_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
  /home/mohamed-ashraf/Desktop/projects/env/lib/python3.12/site-packages/transformers/cache_utils.py:132: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
    if not self.is_initialized or self.keys.numel() == 0:

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
  /home/mohamed-ashraf/Desktop/projects/env/lib/python3.12/site-packages/transformers/masking_utils.py:207: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
    if (padding_length := kv_length + kv_offset - attention_mask.shape[-1]) > 0:

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
  /home/mohamed-ashraf/Desktop/projects/GSoC26/optimum-intel/optimum/exporters/openvino/model_patcher.py:233: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
    torch.tensor(0.0, device=mask.device, dtype=dtype),

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
  /home/mohamed-ashraf/Desktop/projects/GSoC26/optimum-intel/optimum/exporters/openvino/model_patcher.py:234: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
    torch.tensor(torch.finfo(torch.float16).min, device=mask.device, dtype=dtype),

tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_59_deepseek
tests/openvino/test_decoder.py::OVModelForCausalLMIntegrationTest::test_compare_to_transformers_60_gigachat3
tests/openvino/test_export.py::ExportModelTest::test_export_26_deepseek
tests/openvino/test_export.py::ExportModelTest::test_export_27_gigachat3
  /home/mohamed-ashraf/Desktop/projects/env/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py:81: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
    is_causal = query.shape[2] > 1 and attention_mask is None and getattr(module, "is_causal", True)

tests/openvino/test_decoder.py: 58 warnings
  /home/mohamed-ashraf/Desktop/projects/GSoC26/optimum-intel/optimum/intel/openvino/modeling_decoder.py:820: DeprecationWarning: __array__ implementation doesn't accept a copy keyword, so passing copy=False failed. __array__ must implement 'dtype' and 'copy' keyword arguments. To learn more, see the migration guide https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
    np.array(beam_idx) if not self._second_iter_beam_search else self.next_beam_idx

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==== 4 passed, 4 skipped, 489 deselected, 458 warnings in 75.58s (0:01:15) =====

@SearchSavior

@Mohamed-Ashraf273 great work on this PR!

Were you able to verify that the issues @savvadesogle reported above were solved by your export changes?

PS We are very excited to deploy this model in OpenArc with a qwen ASR/TTS system I am working on, thanks for tackling the issue so quickly.

@Mohamed-Ashraf273
Contributor Author

Mohamed-Ashraf273 commented Mar 6, 2026

Hi @SearchSavior
Thank you so much!

Yes, I verified that the issue mentioned by @savvadesogle is resolved. I’ve attached an image below showing the successful result.
[screenshot from 2026-03-06 07-02-05]

@Mohamed-Ashraf273
Contributor Author

Hi @popovaan, @rkazants, @IlyasMoutawwakil,

Tests are now passing and the issues should be resolved.
Could you please take a look?

Thanks!

MIN_TRANSFORMERS_VERSION = "4.46.0"
MAX_TRANSFORMERS_VERSION = "4.53.3"
MIN_TRANSFORMERS_VERSION = "4.53.0"
MAX_TRANSFORMERS_VERSION = None
Member


Why explicitly set it to None?

Contributor Author


MiniCPM3OpenVINOConfig (the parent class) sets MAX_TRANSFORMERS_VERSION = "4.53.3" because it was originally written when support for transformers >= 4.54 was uncertain.

DeepseekOpenVINOConfig explicitly overrides this value to None to indicate that there is no upper bound. The Deepseek/GigaChat3 export has been validated with transformers up to 4.57.x, and it is expected to remain compatible with future versions.

Without this override, DeepseekOpenVINOConfig would inherit the 4.53.3 limit from MiniCPM3OpenVINOConfig, which would incorrectly block export on newer transformers versions. Setting it to None explicitly removes that inherited ceiling and ensures the correct behavior.
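In plain Python, the inheritance behavior described above looks like this (the class bodies are simplified stand-ins for the real export configs):

```python
# Simplified stand-ins for the real export config classes: a class attribute
# is inherited from the parent unless the subclass explicitly redefines it.
class MiniCPM3OpenVINOConfig:
    MIN_TRANSFORMERS_VERSION = "4.46.0"
    MAX_TRANSFORMERS_VERSION = "4.53.3"


class DeepseekOpenVINOConfig(MiniCPM3OpenVINOConfig):
    MIN_TRANSFORMERS_VERSION = "4.53.0"
    MAX_TRANSFORMERS_VERSION = None  # explicitly lifts the inherited ceiling


class WithoutOverride(MiniCPM3OpenVINOConfig):
    MIN_TRANSFORMERS_VERSION = "4.53.0"
    # MAX_TRANSFORMERS_VERSION intentionally not redefined


print(DeepseekOpenVINOConfig.MAX_TRANSFORMERS_VERSION)  # None
print(WithoutOverride.MAX_TRANSFORMERS_VERSION)  # 4.53.3 (would block newer transformers)
```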

Contributor Author


Hi @IlyasMoutawwakil,
I’d really appreciate it if you could review my latest updates.
Thanks in advance!

@savvadesogle

savvadesogle commented Mar 6, 2026

Thank you, @Mohamed-Ashraf273 ❤️

Working with CPU

[screenshots]

Not working with GPU yet...

It loads endlessly on GPU, but it works on CPU in the OpenArc tool (OVGenAI engine).
As in the screenshot below: RAM gradually grows, and after 10-20 minutes nothing happens.

[screenshot]

@Mohamed-Ashraf273
Contributor Author

Mohamed-Ashraf273 commented Mar 6, 2026

Hi @savvadesogle,

I ran a demo test with a tiny GigaChat3 model on GPU and it worked correctly. I was able to successfully:

  • Export the model (on CPU)
  • Load/compile it on GPU
  • Run a forward pass
  • Run generate()
  • Run batched generation

All steps completed without issues and the GPU execution finished successfully.

From your description, it sounds like the GPU loading/compilation for the full model may simply require more time and RAM. The tiny model finishes quickly, but the real GigaChat3 model is significantly larger, so it would be expected that:

  • GPU loading/compilation takes longer than 20–30 minutes, and
  • RAM usage may keep increasing during the process before it stabilizes.

Since the same pipeline works correctly with the tiny model on GPU, the real model should also work, but it may just need more time and memory for the GPU compilation step.

For reference, here is the script I used for the GPU test:

import torch
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM
import openvino as ov

# ── 0. Check available devices ────────────────────────────────────────────────
core = ov.Core()
print("Available devices:", core.available_devices)
assert "GPU" in " ".join(core.available_devices), "No Intel GPU found!"

MODEL_DIR = "./tiny-random-gigachat3"

# ── 1. Export (CPU export, then load on GPU) ──────────────────────────────────
print("\n[1] Exporting tiny-random-gigachat3 to OpenVINO...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
ov_model = OVModelForCausalLM.from_pretrained(
    MODEL_DIR,
    export=True,
    device="GPU",        # compile directly on GPU after export
)
print("    Export + GPU compile: OK")

# ── 2. Basic forward pass ─────────────────────────────────────────────────────
print("\n[2] Running forward pass on GPU...")
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = ov_model(**inputs)
logits = outputs.logits
print(f"    Logits shape : {logits.shape}")
print(f"    Logits dtype : {logits.dtype}")
print(f"    Logits sample: {logits[0, -1, :5].tolist()}")
assert logits.shape[0] == 1, "Batch size mismatch"
print("    Forward pass : OK")

# ── 3. Generation ─────────────────────────────────────────────────────────────
print("\n[3] Running generate() on GPU...")
ov_model.generation_config.eos_token_id = None   # avoid early stop on tiny model
output_ids = ov_model.generate(**inputs, max_new_tokens=10, do_sample=False)
decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"    Output : {decoded!r}")
assert output_ids.shape[1] > inputs["input_ids"].shape[1], "No tokens generated"
print("    Generate     : OK")

# ── 4. Batch generation ───────────────────────────────────────────────────────
print("\n[4] Running batched generate() on GPU...")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
prompts = ["Hello world", "The sky is"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = ov_model.generate(**batch, max_new_tokens=5, do_sample=False)
for i, ids in enumerate(output_ids):
    print(f"    Batch[{i}]: {tokenizer.decode(ids, skip_special_tokens=True)!r}")
print("    Batched generate: OK")

print("\n✅ All GPU tests passed!")

Output:

(env) mohamed-ashraf@mohamed-ashraf-LOQ-15IRX9:~/Desktop/projects/GSoC26/optimum-intel$ python test_gpu.py 2>&1 | grep -v TracerWarning | grep -v "site-packages" | grep -v "^$"
Multiple distributions found for package optimum. Picked distribution: optimum-intel
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
  or not self.key_cache[layer_idx].numel()  # the layer has no cache
  if (padding_length := kv_length + kv_offset - attention_mask.shape[-1]) > 0:
  torch.tensor(0.0, device=mask.device, dtype=dtype),
  torch.tensor(torch.finfo(torch.float16).min, device=mask.device, dtype=dtype),
  not self.key_cache[layer_idx].numel()  # prefers not t.numel() to len(t) == 0 to export the model
  is_causal = query.shape[2] > 1 and attention_mask is None and getattr(module, "is_causal", True)
Available devices: ['CPU', 'GPU.0', 'GPU.1']
[1] Exporting tiny-random-gigachat3 to OpenVINO...
    Export + GPU compile: OK
[2] Running forward pass on GPU...
    Logits shape : torch.Size([1, 31, 32000])
    Logits dtype : torch.float32
    Logits sample: [-0.0056610107421875, -0.0082855224609375, -0.04266357421875, 0.06475830078125, -0.002960205078125]
    Forward pass : OK
[3] Running generate() on GPU...
    Output : 'What is the capital of France?qualurtheremon{}) chiqish\\])\\ир flat المو'
    Generate     : OK
[4] Running batched generate() on GPU...
    Batch[0]: 'Hello world_en дlinux aylandi[tahr'
    Batch[1]: 'The sky isшт г may extentimg'
    Batched generate: OK
✅ All GPU tests passed!

@savvadesogle

savvadesogle commented Mar 6, 2026

  • GPU loading/compilation takes longer than 20–30 minutes, and

I didn't expect the process to take so long. I'll have to wait and see. The conversion happens very quickly, up to 3 minutes to a regular int4, without any additional parameters. I'll definitely give it a try. I have 128 GB of RAM, so that should be enough.

Other models load much faster on the GPU. I'll try waiting longer.
Thank you

@Mohamed-Ashraf273
Contributor Author

Hi @popovaan, @rkazants, @IlyasMoutawwakil

I’ve fixed the remaining issues. Could you please take a look when you have time?
Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Mohamed-Ashraf273
Contributor Author

Hi @popovaan, @rkazants, @IlyasMoutawwakil,

All tests are now passing. I’d really appreciate it if you could take a final look.

Thanks!
