Parsit is a specialized vision-language model designed for document analysis tasks. Parsit combines Qwen3 language models with SigLIP-2 vision encoders to provide state-of-the-art performance on document understanding, OCR, and analysis tasks.

notkisk/Parsit-VLM

Parsit: Vision-Language Model for Document Analysis

Parsit is a specialized vision-language model for document analysis. It pairs a Qwen3 language model (1.7B or 4B parameters) with a SigLIP vision encoder and targets OCR, question answering, and structured document understanding.

Features

  • Document-Focused Architecture: Tailored specifically for document analysis (no video or general vision tasks)
  • Modern Components: Qwen3 LLM (1.7B/4B) + SigLIP vision encoder + MLP projector
  • Flexible Training: Supports full fine-tuning, LoRA, and multiple training modes
  • Simple Inference: Python API for document processing and structured text extraction
  • Comprehensive Evaluation: Built-in metrics for QA accuracy and OCR performance

Quick Start

Installation

git clone https://github.com/notkisk/Parsit-VLM.git
cd Parsit-VLM
pip install -e .

Training Scripts

# Pre-training (vision-language alignment)
bash scripts/pretrain.sh  # Default: 1.7B model
MODEL_SIZE=4B bash scripts/pretrain.sh  # For 4B model

# Model-specific scripts
bash scripts/pretrain_4b.sh  # Optimized for 4B model

# Fine-tuning (instruction following)
bash scripts/finetune.sh

# LoRA training (memory efficient)
bash scripts/train_lora.sh  # 1.7B model
bash scripts/train_lora_4b.sh  # 4B model

# Full document training
bash scripts/train_parsit_documents.sh

Model Architecture

Document Image → SigLIP Encoder → MLP Projector → Qwen3 LLM → Text Response
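
To make the pipeline above concrete, the sketch below estimates how many vision tokens one page contributes and how much of the context window is left for text. It assumes a 384x384 input with 14-pixel patches (read off the checkpoint name listed under Components), no extra class token, and that image tokens count against the 2048-token `model_max_length` shown under Configuration. The function names are hypothetical and are not part of the Parsit codebase.

```python
def vision_token_count(image_size: int = 384, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT-style encoder emits for a square image.

    Assumes non-overlapping patches, so a partial patch at the border is dropped.
    """
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def remaining_text_budget(model_max_length: int, n_vision_tokens: int) -> int:
    """Tokens left for the prompt and the answer once the image is embedded."""
    return max(model_max_length - n_vision_tokens, 0)


n_vision = vision_token_count()                  # 27 * 27 patches
budget = remaining_text_budget(2048, n_vision)   # context left for text
print(n_vision, budget)                          # prints: 729 1319
```

Under these assumptions a single page uses roughly a third of the context window, which is worth keeping in mind when prompts or expected outputs are long.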

Components

  • Vision Encoder: SigLIP (google/siglip-so400m-patch14-384)
  • Language Model: Qwen3-1.7B or Qwen3-4B
  • Projector: 2-layer MLP with GELU activation (mlp2x_gelu)
  • DeepSpeed: ZeRO-2 and ZeRO-3 configurations for efficient training
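
The projector in the list above is described as a 2-layer MLP with GELU activation. A minimal pure-Python sketch of that shape follows, assuming `mlp2x_gelu` means Linear → GELU → Linear applied to each patch token independently. The dimensions are toy values chosen for readability, the weights are random, and `make_projector` is a hypothetical helper, not the Parsit implementation.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU, expressed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute W x + b for one vector; weight is a list of out_dim rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def make_projector(vision_dim, llm_dim, seed=0):
    """Build a Linear -> GELU -> Linear map from vision space to LLM space."""
    rng = random.Random(seed)
    w1, b1 = init_linear(vision_dim, llm_dim, rng)
    w2, b2 = init_linear(llm_dim, llm_dim, rng)

    def project(patch_tokens):
        projected = []
        for token in patch_tokens:
            hidden = [gelu(h) for h in linear(token, w1, b1)]
            projected.append(linear(hidden, w2, b2))
        return projected

    return project


# Toy sizes: 3 patch tokens of width 8 are mapped to width 4.
project = make_projector(vision_dim=8, llm_dim=4)
patch_tokens = [[0.1 * i] * 8 for i in range(3)]
projected = project(patch_tokens)
print(len(projected), len(projected[0]))  # prints: 3 4
```

The token count is unchanged by the projector; only the feature width changes, so that vision tokens can be placed in the same sequence as the language model's text embeddings.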

Hardware Requirements

  • 1.7B Model: 8GB+ VRAM, suitable for RTX 3070/4060 Ti
  • 4B Model: 12GB+ VRAM, recommended RTX 3060 12GB/4070 or better
  • Training: A single GPU is sufficient for both models when using the optimized configurations
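
As a rough cross-check on the VRAM figures above, the sketch below computes the memory taken by the language model weights alone. It assumes nominal parameter counts of 1.7e9 and 4e9 and 2 bytes per parameter (bf16 or fp16). It ignores the vision encoder, activations, the KV cache, and optimizer state, so real usage is higher. `weight_memory_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "bf16") -> float:
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("1.7B", 1.7e9), ("4B", 4.0e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.1f} GB of weights in bf16")
# prints:
# 1.7B: ~3.4 GB of weights in bf16
# 4B: ~8.0 GB of weights in bf16
```

Under these assumptions the weights fit within the stated 8GB and 12GB budgets with a few gigabytes left over for everything else.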

Configuration

Key parameters for document analysis training:

--mm_projector_type "mlp2x_gelu"          # Projector architecture
--image_aspect_ratio "pad"                # Handle document aspect ratios  
--group_by_modality_length True           # Efficient batching
--mm_vision_select_layer -2               # SigLIP layer selection
--model_max_length 2048                   # Context length for documents
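
The `--image_aspect_ratio "pad"` option is described above only as handling document aspect ratios. One common interpretation in LLaVA-style training code, assumed here rather than confirmed for Parsit, is to center the page on a square canvas before resizing, so that a tall or wide page is not stretched. The sketch works on a plain 2D list of pixel values to stay dependency-free; `pad_to_square` is a hypothetical name.

```python
def pad_to_square(pixels, fill):
    """Center a (height x width) grid of pixels on a square canvas of `fill`."""
    height = len(pixels)
    width = len(pixels[0])
    side = max(width, height)
    top = (side - height) // 2    # rows of padding above the page
    left = (side - width) // 2    # columns of padding left of the page
    canvas = [[fill] * side for _ in range(side)]
    for y, row in enumerate(pixels):
        for x, value in enumerate(row):
            canvas[top + y][left + x] = value
    return canvas


# A portrait "page": 4 rows by 2 columns of foreground pixels.
page = [[1, 1] for _ in range(4)]
square = pad_to_square(page, fill=0)
for row in square:
    print(row)
# prints four rows of: [0, 1, 1, 0]
```

Padding keeps text proportions intact at the cost of spending some patches on empty background, which matters most for very elongated pages such as receipts.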

License

This project is released under a permissive open-source license.
