### Feature request
Add native ONNX export support for the SmolVLM2 family (`HuggingFaceTB/SmolVLM2-256M-Instruct`,
`HuggingFaceTB/SmolVLM2-500M-Instruct`, `HuggingFaceTB/SmolVLM2-2.2B-Instruct`, etc.) so they can be exported
with `optimum-cli export onnx --model HuggingFaceTB/SmolVLM2-2.2B-Instruct --task image-text-to-text <output>`
without needing a custom `OnnxConfig`.
Reproduction (current failure)
pip install optimum optimum-onnx
optimum-cli export onnx \
--model HuggingFaceTB/SmolVLM2-2.2B-Instruct \
--task image-text-to-text \
--trust-remote-code \
./out
Result:
ValueError: Trying to export a smolvlm model, that is a custom or unsupported architecture, but no custom onnx
configuration was passed as `custom_onnx_configs`. Please refer to
https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models
for an example on how to export custom models. Please open an issue at
https://github.com/huggingface/optimum/issues if you would like the model type smolvlm to be supported
natively in the ONNX export.
Adding `--monolith` or switching to `--task feature-extraction` does not help: both attempts fail with the
same error, because the model-type check fires before any task logic runs. Inspecting
`optimum/exporters/onnx/model_configs.py` in optimum-onnx (latest as of 2026-05-03) confirms that it contains
zero `SmolVLM*OnnxConfig` entries.
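
The file inspection above can be repeated with a short script. This is a minimal sketch, assuming the contents of `model_configs.py` have been read into a string. The helper name `find_onnx_configs` and the sample source text (including the `SomeBase` parent class) are illustrative stand-ins, not part of optimum.

```python
import re


def find_onnx_configs(source: str, prefix: str) -> list[str]:
    """Return class names in `source` that start with `prefix` and end with `OnnxConfig`."""
    pattern = re.compile(
        rf"^class\s+({re.escape(prefix)}\w*OnnxConfig)\b",
        re.MULTILINE | re.IGNORECASE,
    )
    return pattern.findall(source)


# Illustrative stand-in for the file contents (NOT the real file; base class is made up).
sample = """
class ColPaliOnnxConfig(SomeBase):
    pass

class Pix2StructOnnxConfig(SomeBase):
    pass
"""

print(find_onnx_configs(sample, "SmolVLM"))  # [] -> no SmolVLM config classes
print(find_onnx_configs(sample, "ColPali"))  # ['ColPaliOnnxConfig']
```

To run it against an installed copy, replace `sample` with the text of the real file, wherever it lives in your environment.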
### Motivation
SmolVLM2 (released 2025) is one of the most popular small VLMs on the Hub —
HuggingFaceTB/SmolVLM2-2.2B-Instruct has 8.5K likes and is widely used for on-device vision-language tasks.
It's the natural upgrade path from SmolVLM-Instruct (v1, 2B), which already has ONNX exports and is widely
deployed in browser-based stacks like Transformers.js + WebGPU.
We're shipping an iPhone PWA (Vite + React + Transformers.js v3) that uses on-device VLM inference to parse
gym whiteboard photos into structured exercise data. We currently use HuggingFaceTB/SmolVLM-500M-Instruct
(q4f16 ONNX, ~358 MB) but its instruction-following and OCR quality are insufficient — the model echoes prompt
template text back as JSON values and falls into repetition loops. We compared candidate replacement models
locally via MLX:
| Model | Quality on real gym photo | Has ONNX? |
|---|---|---|
| SmolVLM-500M-Instruct (current) | 0/10 parseable rows | ✓ |
| SmolVLM-Instruct v1 (2B) | 5/10 parseable rows, field-shape artifacts | ✓ |
| SmolVLM2-2.2B-Instruct | 10/10 parseable rows, correct schema, both stations | ✗ |
| Qwen2-VL-2B-Instruct | 2.7 GB ONNX, untested in spike | ✓ |

SmolVLM2-2.2B is the clear quality winner but unreachable for any browser-based stack until ONNX export is
supported.
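
For readers wondering what "parseable rows" means in the comparison above, here is a minimal sketch of how such a score could be computed, assuming the model is asked to emit one JSON object per whiteboard row. The field names (`exercise`, `sets`, `reps`), the function `count_parseable_rows`, and the sample output are hypothetical and only illustrate the idea; this is not the exact harness used for the numbers above.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"exercise": str, "sets": int, "reps": int}


def count_parseable_rows(lines: list[str]) -> int:
    """Count lines that parse as a JSON object with every required field of the right type."""
    ok = 0
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON (e.g. truncated output)
        if not isinstance(row, dict):
            continue  # valid JSON, but not an object
        if all(isinstance(row.get(key), typ) for key, typ in REQUIRED_FIELDS.items()):
            ok += 1
    return ok


# Made-up model output, for illustration only.
output = [
    '{"exercise": "Back squat", "sets": 5, "reps": 5}',
    '{"exercise": "<exercise name>", "sets": "<int>", "reps": "<int>"}',  # echoed template text
    '{"exercise": "Row", "sets": 3',  # truncated JSON
]
print(count_parseable_rows(output))  # 1
```

Type-checking the fields is what catches the template-echo failure mode: the echoed row is valid JSON but carries placeholder strings where integers are expected.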
### Your contribution
I'm a downstream user, not currently positioned to write the OnnxConfig myself, but happy to:
- Test PR branches against my real-world iPhone PWA gym-photo workload.
- Provide failure-mode comparison data (SmolVLM v1 vs v2 quality on the same prompt + image set).
- Validate q4f16 quantization quality vs the PyTorch reference.
If a maintainer or community contributor takes this on, the closest existing precedents in model_configs.py
appear to be ColPaliOnnxConfig (Gemma-backed VLM), Pix2StructOnnxConfig (vision+seq2seq), and
VisionEncoderDecoderOnnxConfig (generic encoder+decoder). SmolVLM2 is Idefics3-derived, so the Idefics3 export
path (if/when it lands) would be the natural foundation.
Thank you!