ONNX export support for SmolVLM2 architecture (smolvlm2) #2431

@stuart-bernstein

### Feature request

Add native ONNX export support for the SmolVLM2 family (HuggingFaceTB/SmolVLM2-256M-Instruct,
HuggingFaceTB/SmolVLM2-500M-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct, etc.) so that these models can be
exported with `optimum-cli export onnx --model HuggingFaceTB/SmolVLM2-2.2B-Instruct --task image-text-to-text <output>`
without a custom OnnxConfig.

Reproduction (current failure)

pip install optimum optimum-onnx
optimum-cli export onnx \
  --model HuggingFaceTB/SmolVLM2-2.2B-Instruct \
  --task image-text-to-text \
  --trust-remote-code \
  ./out
                                    
Result:

ValueError: Trying to export a smolvlm model, that is a custom or unsupported architecture, but no custom onnx
configuration was passed as `custom_onnx_configs`. Please refer to
https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models
for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues
if you would like the model type smolvlm to be supported natively in the ONNX export.
                                                                                                            
Attempted workarounds with --monolith and --task feature-extraction fail with the same error, because the
model-type check fires before any task logic runs. I confirmed this by inspecting
optimum/exporters/onnx/model_configs.py in optimum-onnx (latest as of 2026-05-03): it contains zero
SmolVLM*OnnxConfig entries.
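
The file inspection described above can be repeated with a small script. This is a minimal sketch, not part of optimum: `find_onnx_config_classes` is a hypothetical helper, and `sample_source` is an illustrative stand-in rather than a real excerpt of model_configs.py.

```python
import re


def find_onnx_config_classes(source: str, name_prefix: str) -> list[str]:
    """Return names of classes in `source` that start with `name_prefix`
    and end with `OnnxConfig` (hypothetical helper, not an optimum API)."""
    pattern = re.compile(
        r"^class\s+(" + re.escape(name_prefix) + r"\w*OnnxConfig)\b",
        re.MULTILINE,
    )
    return pattern.findall(source)


# Illustrative stand-in for the file contents (assumption, not a real excerpt).
sample_source = """
class ColPaliOnnxConfig(SomeBase):
    pass

class Pix2StructOnnxConfig(SomeBase):
    pass
"""

smolvlm_configs = find_onnx_config_classes(sample_source, "SmolVLM")
colpali_configs = find_onnx_config_classes(sample_source, "ColPali")
print(smolvlm_configs, colpali_configs)
```

To run it against an installed copy, read the real file's text into `source` instead of `sample_source`; the file's location depends on the local environment.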

### Motivation

SmolVLM2 (released 2025) is one of the most popular small VLMs on the Hub:
HuggingFaceTB/SmolVLM2-2.2B-Instruct has 8.5K likes and is widely used for on-device vision-language tasks.
It is the natural upgrade path from SmolVLM-Instruct (v1, 2B), which already has ONNX exports and is widely
deployed in browser-based stacks such as Transformers.js + WebGPU.
                                    
We're shipping an iPhone PWA (Vite + React + Transformers.js v3) that uses on-device VLM inference to parse
gym whiteboard photos into structured exercise data. We currently use HuggingFaceTB/SmolVLM-500M-Instruct
(q4f16 ONNX, ~358 MB), but its instruction-following and OCR quality are insufficient: the model echoes prompt
template text back as JSON values and falls into repetition loops. We compared candidate replacement models
locally via MLX:
                                                                                                            
| Model | Quality on real gym photo | Has ONNX? |
| --- | --- | --- |
| SmolVLM-500M-Instruct (current) | 0/10 parseable rows | ✓ |
| SmolVLM-Instruct v1 (2B) | 5/10 parseable rows, field-shape artifacts | ✓ |
| SmolVLM2-2.2B-Instruct | 10/10 parseable rows, correct schema, both stations | ✗ |
| Qwen2-VL-2B-Instruct | 2.7 GB ONNX, untested in spike | ✓ |

SmolVLM2-2.2B is the clear quality winner but unreachable for any browser-based stack until ONNX export is    
supported.  
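
For readers who want to reproduce a "parseable rows" count like the one in the comparison, here is one way such a score could be computed. It is only a sketch: the schema in `REQUIRED_FIELDS` and the helper `count_parseable_rows` are hypothetical and are not the exact check used for the numbers in this issue.

```python
import json

# Hypothetical row schema (assumption): field name -> expected Python type.
REQUIRED_FIELDS = {"exercise": str, "sets": int, "reps": int}


def _has_type(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, expected) and not isinstance(value, bool)


def count_parseable_rows(raw_rows: list[str]) -> int:
    """Count rows that are valid JSON objects matching REQUIRED_FIELDS."""
    parseable = 0
    for raw in raw_rows:
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all, e.g. a repetition loop
        if not isinstance(row, dict):
            continue
        if all(_has_type(row.get(k), t) for k, t in REQUIRED_FIELDS.items()):
            parseable += 1
    return parseable


# Made-up rows for illustration, not real model output.
rows = [
    '{"exercise": "Back squat", "sets": 5, "reps": 5}',
    '{"exercise": "<exercise name>", "sets": "<sets>", "reps": "<reps>"}',
    "exercise exercise exercise",
]
score = count_parseable_rows(rows)
print(f"{score}/{len(rows)} parseable rows")
```

The second sample row mimics the template-echo failure described above: it is valid JSON but fails the type check, so it does not count as parseable.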

### Your contribution

I'm a downstream user, not currently positioned to write the OnnxConfig myself, but happy to:                 
- Test PR branches against my real-world iPhone PWA gym-photo workload.
- Provide failure-mode comparison data (SmolVLM v1 vs v2 quality on the same prompt + image set).             
- Validate q4f16 quantization quality vs the PyTorch reference.                                             
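
As an illustration of the quantization check in the last bullet, the sketch below compares one logit vector from the PyTorch reference with the matching vector from a quantized export. `compare_logits` is a hypothetical helper, the numbers are made up, and any acceptance thresholds would have to be agreed on separately.

```python
import math


def compare_logits(reference: list[float], candidate: list[float]) -> dict:
    """Compare two equally long logit vectors (hypothetical helper)."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("logit vectors must be non-empty and equally long")
    pairs = list(zip(reference, candidate))
    max_abs_diff = max(abs(r - c) for r, c in pairs)
    dot = sum(r * c for r, c in pairs)
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    # Cosine similarity is undefined when either vector is all zeros.
    if norm_ref and norm_cand:
        cosine = dot / (norm_ref * norm_cand)
    else:
        cosine = float("nan")
    # Does the quantized model pick the same most likely token?
    same_top1 = reference.index(max(reference)) == candidate.index(max(candidate))
    return {"max_abs_diff": max_abs_diff, "cosine": cosine, "same_top1": same_top1}


# Made-up values for illustration only.
reference_logits = [0.1, 2.5, -1.0, 0.7]
quantized_logits = [0.2, 2.3, -0.9, 0.8]
report = compare_logits(reference_logits, quantized_logits)
print(report)
```

Logit-level agreement is only a proxy. For this workload, the end-to-end check (same prompt and image set, compare parsed rows) is what matters most.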
                                                                                                              
If a maintainer or community contributor takes this on, the closest existing precedents in model_configs.py
appear to be ColPaliOnnxConfig (Gemma-backed VLM), Pix2StructOnnxConfig (vision+seq2seq), and
VisionEncoderDecoderOnnxConfig (generic encoder+decoder). SmolVLM2 is Idefics3-derived, so the Idefics3
export path (if/when it lands) would be the natural foundation.
                                                                                                            
Thank you!                                         
