[OpenVINO] support ai-sage/GigaChat3-10B-A1.8B-bf16 #1626
Mohamed-Ashraf273 wants to merge 17 commits into huggingface:main
Conversation
Hi @popovaan,

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the …

Got it, thanks!

Hi @popovaan, @rkazants,
rkazants left a comment
Please also add export tests: the same test set that you added for the previous model.
Update documentation.
Pull request overview
This PR aims to add OpenVINO export/inference support coverage for the ai-sage/GigaChat3-10B-A1.8B-bf16 family by extending OpenVINO test fixtures and adjusting DeepSeek patching logic used during export.
Changes:
- Add a `gigachat3` tiny-random model fixture and include it in OpenVINO decoder integration coverage.
- Update decoder tests for `gigachat3` (expected SDPA count, relaxed logits tolerance, and skip conditions for incompatible Transformers versions).
- Refactor DeepSeek attention patching to use a versioned factory function and extend MoE patching to handle MLP blocks exposing `experts` but not `moe_infer`.
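The broadened MoE condition described above can be sketched as follows. This is an illustrative stand-in, not the actual patcher code: `FakeMoEBlock*` and `needs_moe_patch` are hypothetical names, and the real logic lives in `optimum/exporters/openvino/model_patcher.py`.

```python
# Sketch of the relaxed MoE-patching condition: patch any MLP block that
# exposes `experts`, even when no `moe_infer` method is present.
class FakeMoEBlockWithInfer:
    experts = ["expert_0", "expert_1"]

    def moe_infer(self, hidden_states):
        return hidden_states


class FakeMoEBlockExpertsOnly:
    experts = ["expert_0", "expert_1"]  # no `moe_infer` method at all


def needs_moe_patch(block) -> bool:
    # An earlier check would have required both `experts` and `moe_infer`;
    # the relaxed condition only requires `experts`.
    return hasattr(block, "experts")


print(needs_moe_patch(FakeMoEBlockWithInfer()))    # True
print(needs_moe_patch(FakeMoEBlockExpertsOnly()))  # True
```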
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `tests/openvino/utils_tests.py` | Adds the `gigachat3` test model mapping; adjusts which models are treated as remote-code in tests. |
| `tests/openvino/test_decoder.py` | Adds `gigachat3` to tested architectures and config expectations; tweaks tolerance/skip logic; adds debug output. |
| `optimum/exporters/openvino/model_patcher.py` | Updates the DeepSeek patcher to use a unified attention forward factory and broadens MoE patching behavior. |
@rkazants

Hi @popovaan, I've finished adding the tests and temporarily published the tiny model. Would it be possible to invite me to the group so I can publish it there, or would you prefer to handle the publishing? Please let me know if any changes are needed.

Hi. Can I help test the model?

Hi!

@Mohamed-Ashraf273, please check tests locally on your machine:

Hi @Mohamed-Ashraf273! You can refer to these tiny model examples; the generation code is available in the model cards:
@Mohamed-Ashraf273 Xeon 2699v3 x2, openvino 2026.1.0.dev20260221
Working with openvino-genai (OpenArc), at least I think so.
The only thing I need to figure out is how to set up the chat template. Or maybe there is a problem with the conversion. Or I just need to update openvino-genai 🤷.
Metrics on 1x A770: low GPU utilization and low decoding speed.
After …
@savvadesogle @popovaan @rkazants
@Mohamed-Ashraf273, great work on this PR! Were you able to verify that @savvadesogle's issue above was solved by your export changes? P.S. We are very excited to deploy this model in OpenArc with a Qwen ASR/TTS system I am working on; thanks for tackling the issue so quickly.

Hi @SearchSavior! Yes, I verified that the issue mentioned by @savvadesogle is resolved. I've attached an image below showing the successful result.
Hi @popovaan, @rkazants, @IlyasMoutawwakil, tests are now passing and the issues should be resolved. Thanks!
```diff
- MIN_TRANSFORMERS_VERSION = "4.46.0"
- MAX_TRANSFORMERS_VERSION = "4.53.3"
+ MIN_TRANSFORMERS_VERSION = "4.53.0"
+ MAX_TRANSFORMERS_VERSION = None
```
Why explicitly set it to None?
MiniCPM3OpenVINOConfig (the parent class) sets MAX_TRANSFORMERS_VERSION = "4.53.3" because it was originally written when support for transformers >= 4.54 was uncertain.
DeepseekOpenVINOConfig explicitly overrides this value to None to indicate that there is no upper bound. The Deepseek/GigaChat3 export has been validated with transformers up to 4.57.x, and it is expected to remain compatible with future versions.
Without this override, DeepseekOpenVINOConfig would inherit the 4.53.3 limit from MiniCPM3OpenVINOConfig, which would incorrectly block export on newer transformers versions. Setting it to None explicitly removes that inherited ceiling and ensures the correct behavior.
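The inheritance behavior described above can be shown with a minimal sketch. The class names mirror the real OpenVINO export configs, but the bodies are illustrative stand-ins, and `HypotheticalConfigWithoutOverride` is invented purely to demonstrate what happens without the override.

```python
# Minimal sketch of the class-attribute override: a subclass attribute
# shadows the value inherited from the parent.
class MiniCPM3OpenVINOConfig:
    MIN_TRANSFORMERS_VERSION = "4.46.0"
    MAX_TRANSFORMERS_VERSION = "4.53.3"


class DeepseekOpenVINOConfig(MiniCPM3OpenVINOConfig):
    MIN_TRANSFORMERS_VERSION = "4.53.0"
    MAX_TRANSFORMERS_VERSION = None  # explicitly lift the inherited ceiling


class HypotheticalConfigWithoutOverride(MiniCPM3OpenVINOConfig):
    pass  # silently inherits the 4.53.3 ceiling from the parent


print(DeepseekOpenVINOConfig.MAX_TRANSFORMERS_VERSION)             # None
print(HypotheticalConfigWithoutOverride.MAX_TRANSFORMERS_VERSION)  # 4.53.3
```

Without the explicit `None`, the DeepSeek config would behave like `HypotheticalConfigWithoutOverride` and refuse export on transformers newer than 4.53.3.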
Hi @IlyasMoutawwakil,
I’d really appreciate it if you could review my latest updates.
Thanks in advance!
Thank you, @Mohamed-Ashraf273 ❤️ Working with CPU.
Not working with GPU yet... it loads endlessly on GPU, but it works with the CPU in the OpenArc tool (OVGenAI engine).
Hi @savvadesogle, I ran a demo test with a tiny GigaChat3 model on GPU and it worked correctly. I was able to successfully:
All steps completed without issues and the GPU execution finished successfully. From your description, it sounds like the GPU loading/compilation for the full model may simply require more time and RAM. The tiny model finishes quickly, but the real GigaChat3 model is significantly larger, so it would be expected that:
Since the same pipeline works correctly with the tiny model on GPU, the real model should also work; it may just need more time and memory for the GPU compilation step. For reference, here is the script I used for the GPU test:

```python
import torch
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM
import openvino as ov

# ── 0. Check available devices ────────────────────────────────────────────────
core = ov.Core()
print("Available devices:", core.available_devices)
assert "GPU" in " ".join(core.available_devices), "No Intel GPU found!"

MODEL_DIR = "./tiny-random-gigachat3"

# ── 1. Export (CPU export, then load on GPU) ──────────────────────────────────
print("\n[1] Exporting tiny-random-gigachat3 to OpenVINO...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
ov_model = OVModelForCausalLM.from_pretrained(
    MODEL_DIR,
    export=True,
    device="GPU",  # compile directly on GPU after export
)
print("    Export + GPU compile: OK")

# ── 2. Basic forward pass ─────────────────────────────────────────────────────
print("\n[2] Running forward pass on GPU...")
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = ov_model(**inputs)
logits = outputs.logits
print(f"    Logits shape : {logits.shape}")
print(f"    Logits dtype : {logits.dtype}")
print(f"    Logits sample: {logits[0, -1, :5].tolist()}")
assert logits.shape[0] == 1, "Batch size mismatch"
print("    Forward pass : OK")

# ── 3. Generation ─────────────────────────────────────────────────────────────
print("\n[3] Running generate() on GPU...")
ov_model.generation_config.eos_token_id = None  # avoid early stop on tiny model
output_ids = ov_model.generate(**inputs, max_new_tokens=10, do_sample=False)
decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"    Output : {decoded!r}")
assert output_ids.shape[1] > inputs["input_ids"].shape[1], "No tokens generated"
print("    Generate : OK")

# ── 4. Batch generation ───────────────────────────────────────────────────────
print("\n[4] Running batched generate() on GPU...")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
prompts = ["Hello world", "The sky is"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = ov_model.generate(**batch, max_new_tokens=5, do_sample=False)
for i, ids in enumerate(output_ids):
    print(f"    Batch[{i}]: {tokenizer.decode(ids, skip_special_tokens=True)!r}")
print("    Batched generate: OK")

print("\n✅ All GPU tests passed!")
```

Output:
I didn't expect the process to take so long; I'll have to wait and see. The conversion itself happens very quickly, up to 3 minutes to a regular int4 without any additional parameters. I'll definitely give it a try. I have 128 GB of RAM, so that should be enough. Other models load much faster on the GPU. I'll try waiting longer.
Hi @popovaan, @rkazants, @IlyasMoutawwakil, I've fixed the remaining issues. Could you please take a look when you have time?

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Hi @popovaan, @rkazants, @IlyasMoutawwakil, all tests are now passing. I'd really appreciate it if you could take a final look. Thanks!















What does this PR do?
Conversion cmd-line for ai-sage/GigaChat3-10B-A1.8B-bf16:
```shell
optimum-cli export openvino -m ai-sage/GigaChat3-10B-A1.8B-bf16 ./output_dir --task text-generation-with-past
```

Inference of ai-sage/GigaChat3-10B-A1.8B-bf16 using OpenVINO backend:

Solving Issue: #1608
Before submitting