Parsit is a specialized vision-language model for document analysis. It combines Qwen3 language models (1.7B and 4B) with a SigLIP-2 vision encoder to deliver strong performance on OCR, question answering, and structured document understanding.
- Document-Focused Architecture: Tailored specifically for document analysis (no video or general vision tasks)
- Modern Components: Qwen3 LLM (1.7B/4B) + SigLIP vision encoder + MLP projector
- Flexible Training: Supports full fine-tuning, LoRA, and multiple training modes
- Simple Inference: Python API for document processing and structured text extraction (see the sketch after this list)
- Comprehensive Evaluation: Built-in metrics for QA accuracy and OCR performance
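The inference API can be sketched roughly as follows. This is an illustrative example only: `parsit.model`, `load_pretrained_model`, and the processor interface are assumed names, not the verified parsit API, so check the repository for the actual entry points.

```python
# Illustrative inference sketch -- module and function names below are
# assumptions, not the verified parsit API.
from PIL import Image
from parsit.model import load_pretrained_model  # hypothetical helper

model, processor = load_pretrained_model("path/to/parsit-qwen3-1.7b")

image = Image.open("invoice.png").convert("RGB")
prompt = "Extract the invoice number and the total amount."

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```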
git clone https://github.com/your-repo/parsit.git
cd parsit
pip install -e .

# Pre-training (vision-language alignment)
bash scripts/pretrain.sh # Default: 1.7B model
MODEL_SIZE=4B bash scripts/pretrain.sh # For 4B model
# Model-specific scripts
bash scripts/pretrain_4b.sh # Optimized for 4B model
# Fine-tuning (instruction following)
bash scripts/finetune.sh
# LoRA training (memory efficient)
bash scripts/train_lora.sh # 1.7B model
bash scripts/train_lora_4b.sh # 4B model
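Under the hood, LoRA freezes the base weights and trains small low-rank adapter matrices on top of them. A minimal sketch with Hugging Face `peft` is below; the rank, target modules, and other hyperparameters are example values, not necessarily what `train_lora.sh` uses.

```python
# LoRA sketch with Hugging Face peft; hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

lora_config = LoraConfig(
    r=16,                     # adapter rank (example value)
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```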
# Full document training
bash scripts/train_parsit_documents.sh

Document Image → SigLIP-2 Encoder → MLP Projector → Qwen3 LLM → Text Response
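In code, this pipeline amounts to encoding the page, projecting the patch features into the LLM's embedding space, and prepending them to the text embeddings. The sketch below is schematic, assuming Hugging Face-style `vision_encoder`, `projector`, and `llm` modules rather than the exact parsit internals.

```python
# Schematic forward pass for the pipeline above (not the verified parsit code).
import torch

def forward_document(image_pixels, prompt_ids, vision_encoder, projector, llm):
    vision_feats = vision_encoder(image_pixels)        # (B, num_patches, d_vision)
    vision_tokens = projector(vision_feats)            # (B, num_patches, d_llm)
    text_embeds = llm.get_input_embeddings()(prompt_ids)
    # Visual tokens are prepended to the prompt embeddings before decoding.
    inputs_embeds = torch.cat([vision_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds)
```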
- Vision Encoder: SigLIP-2 (google/siglip-so400m-patch14-384)
- Language Model: Qwen3-1.7B or Qwen3-4B
- Projector: 2-layer MLP with GELU activation (mlp2x_gelu); see the sketch after this list
- DeepSpeed: ZeRO-2 and ZeRO-3 configurations for efficient training
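In LLaVA-style codebases, mlp2x_gelu denotes two linear layers with a GELU between them, mapping the vision feature width to the LLM hidden width. A sketch follows; the widths are examples (SigLIP so400m outputs 1152-dimensional features, and the Qwen3 hidden size depends on the model variant).

```python
# Sketch of an mlp2x_gelu projector; widths are example values.
import torch.nn as nn

def build_mlp2x_gelu(vision_hidden: int = 1152, llm_hidden: int = 2048) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(vision_hidden, llm_hidden),
        nn.GELU(),
        nn.Linear(llm_hidden, llm_hidden),
    )
```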
- 1.7B Model: 8GB+ VRAM, suitable for RTX 3070/4060 Ti
- 4B Model: 12GB+ VRAM, recommended RTX 3060 12GB/4070 or better
- Training: Single GPU sufficient for both models with optimized configurations
Key parameters for document analysis training:
--mm_projector_type "mlp2x_gelu" # Projector architecture
--image_aspect_ratio "pad" # Handle document aspect ratios (sketched below)
--group_by_modality_length True # Efficient batching
--mm_vision_select_layer -2 # SigLIP layer selection
--model_max_length 2048 # Context length for documents
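The pad option matters for documents because pages are usually tall and narrow, and resizing them directly to a square would distort the text. A typical implementation pads the page to a square first, roughly as sketched below; the white fill is an example value, and implementations often pad with the image mean instead.

```python
# Sketch of square padding as done by LLaVA-style preprocessing.
from PIL import Image

def pad_to_square(img: Image.Image, fill=(255, 255, 255)) -> Image.Image:
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas
```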
This project is released under a permissive open-source license.

- Uses Qwen3-1.7B and Qwen3-4B language models
- Powered by SigLIP-2 vision encoder