ONNX export support for SmolVLM2 architecture (model type `smolvlm`) #2431

### Feature request

Add native ONNX export support for the SmolVLM2 family (`HuggingFaceTB/SmolVLM2-256M-Instruct`, `HuggingFaceTB/SmolVLM2-500M-Instruct`, `HuggingFaceTB/SmolVLM2-2.2B-Instruct`, etc.) so that these models can be exported with `optimum-cli export onnx --model HuggingFaceTB/SmolVLM2-2.2B-Instruct --task image-text-to-text <output>` without a custom `OnnxConfig`.
                                                                                                              
### Reproduction (current failure)

```bash
pip install optimum optimum-onnx
optimum-cli export onnx \
  --model HuggingFaceTB/SmolVLM2-2.2B-Instruct \
  --task image-text-to-text \
  --trust-remote-code \
  ./out
```

Result:

```
ValueError: Trying to export a smolvlm model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type smolvlm to be supported natively in the ONNX export.
```
                                                                                                              
Workarounds using `--monolith` or `--task feature-extraction` fail with the same error, because the model-type check runs before any task-specific logic. I confirmed this by inspecting `optimum/exporters/onnx/model_configs.py` in optimum-onnx (latest as of 2026-05-03): it contains no `SmolVLM*OnnxConfig` entries.

### Motivation

SmolVLM2 (released 2025) is one of the most popular small VLMs on the Hub. `HuggingFaceTB/SmolVLM2-2.2B-Instruct` has 8.5K likes and is widely used for on-device vision-language tasks. It is the natural upgrade path from SmolVLM-Instruct (v1, 2B), which already has ONNX exports and is widely deployed in browser-based stacks such as Transformers.js + WebGPU.
                                      
We're shipping an iPhone PWA (Vite + React + Transformers.js v3) that uses on-device VLM inference to parse gym whiteboard photos into structured exercise data. We currently use `HuggingFaceTB/SmolVLM-500M-Instruct` (q4f16 ONNX, ~358 MB), but its instruction-following and OCR quality are insufficient. The model echoes prompt-template text back as JSON values and falls into repetition loops. We compared candidate replacement models locally via MLX:
                                                                                                              
| Model | Quality on real gym photo | ONNX export available? |
|---|---|---|
| SmolVLM-500M-Instruct (current) | 0/10 parseable rows | ✓ |
| SmolVLM-Instruct v1 (2B) | 5/10 parseable rows, field-shape artifacts | ✓ |
| SmolVLM2-2.2B-Instruct | 10/10 parseable rows, correct schema, both stations | ✗ |
| Qwen2-VL-2B-Instruct | Untested in spike (ONNX export is 2.7 GB) | ✓ |

  SmolVLM2-2.2B is the clear quality winner but unreachable for any browser-based stack until ONNX export is    
  supported.  

### Your contribution

I'm a downstream user and not currently in a position to write the `OnnxConfig` myself, but I'm happy to:
- Test PR branches against my real-world iPhone PWA gym-photo workload.
- Provide failure-mode comparison data (SmolVLM v1 vs. v2 quality on the same prompt and image set).
- Validate q4f16 quantization quality against the PyTorch reference.
                                                                                                                
If a maintainer or community contributor takes this on, the closest existing precedents in `model_configs.py` appear to be `ColPaliOnnxConfig` (Gemma-backed VLM), `Pix2StructOnnxConfig` (vision + seq2seq), and `VisionEncoderDecoderOnnxConfig` (generic encoder + decoder). SmolVLM2 is derived from Idefics3, so an Idefics3 export path (if and when it lands) would be the natural foundation.
                                                                                                              
  Thank you!                                         
