feat: integrate structured extraction and multimodal role-based pipeline (Feat/multimodal pipeline) by danielaskdd · Pull Request #2830 · HKUDS/LightRAG

danielaskdd · 2026-03-24T04:05:54Z

Description

This PR integrates a complete multimodal document pipeline with role-based LLM routing into LightRAG, building on the earlier JSON structured extraction work. It replaces synchronous API-layer document extraction with a three-stage pipeline, adds per-role model isolation for all four LLM roles (extract/keyword/query/VLM), and implements full DOCX multimodal support, including image extraction, paragraph position tracking, and a structured interchange format.

Changes Made

1. JSON Structured Entity Extraction

Replace delimiter-based extraction with JSON structured output for improved robustness
Support native JSON mode for OpenAI (response_format: json_object), Ollama (format="json"), and Gemini (response_mime_type="application/json")
Provider fallback logic when native JSON formatting is unsupported
Backward compatibility via ENTITY_EXTRACTION_USE_JSON (default: true)
Auto-detect JSON vs delimiter format during cache rebuild
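
The provider-specific JSON switches and the format auto-detection listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `json_mode_kwargs` and `detect_extraction_format` are hypothetical, and the sample delimiter string is only a stand-in for the legacy format.

```python
import json
from typing import Any, Dict


def json_mode_kwargs(provider: str) -> Dict[str, Any]:
    """Return the native JSON-mode keyword arguments named in the PR text."""
    if provider == "openai":
        return {"response_format": {"type": "json_object"}}
    if provider == "ollama":
        return {"format": "json"}
    if provider == "gemini":
        return {"response_mime_type": "application/json"}
    # Assumption: other providers get no JSON switch and rely on the prompt alone.
    return {}


def detect_extraction_format(cached_text: str) -> str:
    """Classify a cached LLM response as "json" or "delimiter"."""
    stripped = cached_text.strip()
    if stripped.startswith("{"):
        try:
            if isinstance(json.loads(stripped), dict):
                return "json"
        except json.JSONDecodeError:
            pass  # looks like JSON but is not parseable; treat as delimiter text
    return "delimiter"
```

A cache rebuild could call `detect_extraction_format` on each cached response and dispatch to the matching parser, so entries written before and after the switch can coexist.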

2. Three-Stage Multimodal Document Pipeline

PARSE: Structured DOCX extraction with heading-aware semantic chunking, table structure preservation, and OMML formula extraction
ANALYZE: VLM-based multimodal analysis with sidecar writeback for drawings, tables, and equations
PROCESS: Entity/relation extraction from text and multimodal chunks, graph and vector store construction
Pipeline status tracking: PENDING → PARSING → ANALYZING → PROCESSING → PROCESSED/FAILED
Parser engine routing via LIGHTRAG_PARSER env var and filename hints (supports native/mineru/docling)
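
The status progression above can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the `DocStatus` enum and `advance` helper are hypothetical names, and it assumes that any non-terminal stage may fail.

```python
from enum import Enum


class DocStatus(str, Enum):
    PENDING = "PENDING"
    PARSING = "PARSING"
    ANALYZING = "ANALYZING"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


# Happy-path successor of each non-terminal status.
_NEXT = {
    DocStatus.PENDING: DocStatus.PARSING,
    DocStatus.PARSING: DocStatus.ANALYZING,
    DocStatus.ANALYZING: DocStatus.PROCESSING,
    DocStatus.PROCESSING: DocStatus.PROCESSED,
}


def advance(current: DocStatus, failed: bool = False) -> DocStatus:
    """Move a document to its next status, or to FAILED if the stage failed."""
    if current not in _NEXT:
        raise ValueError(f"{current.value} is a terminal status")
    return DocStatus.FAILED if failed else _NEXT[current]
```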

3. DOCX Upload Deferred to Pipeline

.docx uploads no longer do synchronous content extraction at the API layer
Files are enqueued with pending_parse format; the pipeline's parse_native() handles structured extraction
Supports parse_docx_to_interchange_jsonl with heading semantics, para_id positions, and image asset extraction
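
The deferral described above amounts to queuing the file without touching its content. A minimal sketch, assuming a simple in-memory queue: `QueuedDocument`, `enqueue_upload`, and the `"text"` format name are hypothetical; only the `pending_parse` format name comes from the PR text.

```python
from dataclasses import dataclass
from typing import List, Optional

PENDING_PARSE = "pending_parse"  # format name from the PR; this constant name is hypothetical


@dataclass
class QueuedDocument:
    file_path: str
    doc_format: str
    content: Optional[str] = None  # stays None until the pipeline parses the file


def enqueue_upload(
    queue: List[QueuedDocument], file_path: str, text: Optional[str] = None
) -> QueuedDocument:
    """Queue an upload; .docx files skip extraction and wait for the PARSE stage."""
    if file_path.lower().endswith(".docx"):
        doc = QueuedDocument(file_path, PENDING_PARSE)
    else:
        # Assumption: other file types still carry already-extracted text.
        doc = QueuedDocument(file_path, "text", content=text)
    queue.append(doc)
    return doc
```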

4. Interchange JSONL Enhancements

positions field now populated with paraid entries from Word w14:paraId attributes
engine_capabilities includes "i" when embedded images are detected
asset_dir flag set when *.blocks.assets directory is created
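
For readers unfamiliar with `w14:paraId`, the sketch below shows one way to collect these attributes from a document's `word/document.xml` using only the standard library. It is not the PR's parser; the function name `collect_para_ids` and the sample XML are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14_NS = "http://schemas.microsoft.com/office/word/2010/wordml"


def collect_para_ids(document_xml: str) -> List[str]:
    """Return the w14:paraId of every paragraph that has one, in document order."""
    root = ET.fromstring(document_xml)
    ids = []
    for para in root.iter(f"{{{W_NS}}}p"):
        para_id = para.get(f"{{{W14_NS}}}paraId")
        if para_id:  # paragraphs written by older tools may lack the attribute
            ids.append(para_id)
    return ids


# Tiny hand-written sample; a real file would come from the .docx zip package.
SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}" xmlns:w14="{W14_NS}"><w:body>'
    '<w:p w14:paraId="1D6D83BF"/><w:p w14:paraId="68ED75D1"/><w:p/>'
    "</w:body></w:document>"
)
```

Here `collect_para_ids(SAMPLE_XML)` yields the two IDs and skips the third paragraph, which has no `paraId`.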

5. Image Binary Extraction & Assets Directory

New extract_docx_images() extracts embedded images via doc.part.rels relationship API
Images written to *.blocks.assets/ directory alongside interchange JSONL
_extract_drawing_info() extended to return r:embed relationship ID
Paragraph dataclass extended with drawing_rIds for image-paragraph association
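
The PR resolves images through python-docx's `doc.part.rels`. The sketch below swaps in a standard-library equivalent that reads the same relationship data directly from the .docx zip package, to show how an `r:embed` ID maps to image bytes. The function names are hypothetical and the sample package is a stub, not a valid Word document.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"


def read_embedded_images(docx_bytes: bytes) -> Dict[str, Tuple[str, bytes]]:
    """Map each image relationship ID (the r:embed value) to (file name, bytes)."""
    images = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        rels_root = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        for rel in rels_root.iter(f"{{{REL_NS}}}Relationship"):
            if rel.get("Type") != IMAGE_TYPE or rel.get("TargetMode") == "External":
                continue  # skip non-image parts and linked (not embedded) images
            # Targets are relative to the word/ folder, e.g. "media/image1.png".
            member = posixpath.normpath(posixpath.join("word", rel.get("Target")))
            member = member.lstrip("/")
            images[rel.get("Id")] = (posixpath.basename(member), package.read(member))
    return images


def make_sample_docx() -> bytes:
    """Build a stub package containing one image relationship, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "word/_rels/document.xml.rels",
            f'<Relationships xmlns="{REL_NS}">'
            f'<Relationship Id="rId5" Type="{IMAGE_TYPE}" Target="media/image1.png"/>'
            "</Relationships>",
        )
        package.writestr("word/media/image1.png", b"\x89PNG fake bytes")
    return buffer.getvalue()
```

Writing each returned `(file name, bytes)` pair into a `*.blocks.assets/` directory, and matching the relationship ID against a paragraph's `drawing_rIds`, would give the image-paragraph association described above.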

6. Role-Based LLM/VLM Routing

Four independent roles: extract, keyword, query, vlm — each with its own function, concurrency queue, timeout, and provider options
Runtime reconfiguration via update_llm_role_config() with atomic rollback on failure
Per-role provider option overrides via options_dict_for_role() (e.g., EXTRACT_OPENAI_LLM_TEMPERATURE)
Ollama role functions now explicitly bind role_model instead of falling back to global config
Cross-provider kwargs isolation: base ollama kwargs won't pollute openai role calls
Ollama API /generate and /chat bypass paths use query_llm_model_kwargs
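
Two of the ideas above, rollback on a failed runtime update and per-role option overrides, can be sketched as below. This is an illustration under stated assumptions, not the behavior of `update_llm_role_config()` or `options_dict_for_role()`: `RoleRegistry` and `role_option` are hypothetical, the validation rules are invented for the example, and the fallback to an unprefixed `PROVIDER_LLM_NAME` variable is an assumption.

```python
import copy
from typing import Any, Dict, Mapping, Optional

ROLES = ("extract", "keyword", "query", "vlm")


class RoleRegistry:
    """Holds one independent settings dict per role (hypothetical helper)."""

    def __init__(self, configs: Mapping[str, Dict[str, Any]]) -> None:
        self._configs = {role: dict(configs[role]) for role in ROLES}

    def get(self, role: str) -> Dict[str, Any]:
        return dict(self._configs[role])

    def update(self, role: str, **changes: Any) -> None:
        """Apply changes to one role; restore the previous state if validation fails."""
        if role not in self._configs:
            raise KeyError(f"unknown role: {role}")
        snapshot = copy.deepcopy(self._configs)
        try:
            self._configs[role].update(changes)
            self._validate(self._configs[role])
        except Exception:
            self._configs = snapshot  # roll back every role to the snapshot
            raise

    @staticmethod
    def _validate(config: Dict[str, Any]) -> None:
        # Example-only rules; the real checks are not described in the PR text.
        if not config.get("model"):
            raise ValueError("model must be a non-empty string")
        if config.get("timeout", 1) <= 0:
            raise ValueError("timeout must be positive")


def role_option(
    role: str, provider: str, name: str, env: Mapping[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Prefer ROLE_PROVIDER_LLM_NAME, then PROVIDER_LLM_NAME, then the default."""
    role_key = f"{role}_{provider}_LLM_{name}".upper()
    base_key = f"{provider}_LLM_{name}".upper()
    return env.get(role_key, env.get(base_key, default))
```

With `os.environ` passed as `env`, `role_option("extract", "openai", "temperature", os.environ)` would read `EXTRACT_OPENAI_LLM_TEMPERATURE` first, which keeps one role's tuning from leaking into the others.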

7. Relation Merge Robustness

Defensive timeout handling for relation VDB upserts
Fine-grained logging around entity/relation upsert stages
Improved observability for edge-processing waits and pending tasks

Test Results

Offline Test Suite

315 passed, 1 skipped, 0 failed (48.97s)

End-to-End Test with Real Multimodal DOCX

Tested with Chapter 2 of a real academic paper (79 paragraphs, 6 tables, 6 embedded images, 25 OMML formulas):

| Verification Point | Result |
| --- | --- |
| `extraction_format` | `interchange_jsonl` (not `legacy`) |
| `format_version` | `2.0` |
| `engine_capabilities` | `["t", "i"]` (tables + images) |
| `positions` field | Contains `paraid` entries (e.g., `["1D6D83BF", "68ED75D1"]`) |
| `*.blocks.assets` directory | Created with 6 extracted PNG images |
| Formula Q&A | Correctly explained formulas (2-1) and (2-2) for self-attention |
| Table Q&A | Correctly listed 8 comparison features from Table 2-1 |
| Image Q&A | Correctly described all 6 figures with structural details |
| Image detail Q&A | Described CLIP model diagram components: dual encoders, N×N similarity matrix, arrows, color scheme |

Sample Q&A: Image Structure Details

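Question: "What specific visual elements and components does Figure 2-1 contain? Please describe the structural layout of this figure in detail."

Answer (abridged):

Figure 2-1 is themed "CLIP Model Architecture and Contrastive Learning" and uses an overall layout split into upper and lower regions with left-right symmetry. The upper half shows the model architecture (a two-tower design with the "image encoder" on the left and the "text encoder" on the right); the lower half shows the N×N similarity matrix and the symmetric cross-entropy loss. A thick vertical arrow points from the model-architecture region to the matrix, labeled "compute cosine similarity"; a looping dashed arrow labeled "backpropagation optimization" closes the training loop. The image side uses blue tones and the text side uses green tones...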

Related Issues

Supersedes closed PR #2684 ("feat: Entity extraction uses JSON structured output instead of delimiter-based text")

Checklist

Changes tested locally
Pre-commit checks pass (ruff-format, ruff, trailing-whitespace, end-of-file, requirements-txt-fixer)
315 offline tests pass
End-to-end multimodal DOCX test verified (upload → parse → extract → query)
Unit tests added for role isolation, runtime updates, rollback, provider options, Ollama kwargs

…ter-based text
- Add EntityExtractionResult Pydantic model for structured JSON output
- Add JSON-mode prompt templates for entity/relationship extraction
- Add _process_json_extraction_result() JSON parser in extraction pipeline
- Add entity_extraction_use_json config option, default True
- Add extraction_max_tokens config to prevent output truncation
- OpenAI: use response_format json_object with auto-fallback retry
- Ollama/Gemini: use native JSON mode for entity extraction
- Other providers: pop entity_extraction kwarg for compatibility
- Cache rebuild auto-detects JSON vs delimiter format
- Skip relationships with empty descriptions to prevent merge errors

Bring the RAG-Anything parsing/analyze flow into LightRAG's document pipeline and let extract, keyword, query, and VLM roles run with independent model settings. This keeps structured extraction, DOCX interchange ingestion, and relation merge hardening upstreamable without including the entity disambiguation experiment. Made-with: Cursor

Document the companion parser-side changes made in RAG-Anything so reviewers can understand how heading and table normalization align with this LightRAG multimodal pipeline PR without including external repository code here. Made-with: Cursor

Add safe example settings for the new structured extraction, parser integration, staged pipeline, and role-specific LLM/VLM routing options while keeping env.example ready for cp env.example .env deployment without exposing private models or secrets. Made-with: Cursor

Refresh env.example against the latest interactive setup template updates, including runtime metadata, Mongo/OpenSearch storage guidance, and device-related comments, while preserving this PR's structured extraction, multimodal parsing, staged pipeline, and role-specific model settings. Made-with: Cursor

Add runtime role-specific LLM reconfiguration, provider option overrides per role, isolated role kwargs for Ollama query paths, and expanded offline coverage for role isolation, VLM, rollback behavior, and Ollama role kwargs. Made-with: Cursor

- Defer .docx upload parsing to three-stage pipeline via pending_parse format instead of synchronous API-layer extraction
- Fill positions field in interchange JSONL with para_id from docx
- Extract embedded images from .docx to *.blocks.assets directory
- Add extract_docx_images() for binary image extraction via r:embed
- Extend _extract_drawing_info() to return relationship ID
- Add FULL_DOCS_FORMAT_PENDING_PARSE constant and parse_native branch
- Mark engine_capabilities with "i" when images are present

Made-with: Cursor

Cherry-pick all files that exist in upstream/main but were missing from this branch. CI merge preview collects tests from both branches, so missing source modules cause ImportError during test collection. Includes: Makefile, docker-compose-full, InteractiveSetup docs, OpenSearch examples, setup wizard scripts/templates, and 4 upstream test files (zhipu, opensearch, runtime_target, interactive_setup). Note: test_interactive_setup_outputs has 4 pre-existing failures in upstream/main itself (port mapping assertion format mismatch). Made-with: Cursor

The port mapping assertions expected variable-format strings (${HOST:-0.0.0.0}:${PORT:-9621}:9621) but setup.sh produces concrete values (127.0.0.1:8080:9621). Update assertions to match the actual behavior. Made-with: Cursor

Keep test assertions identical to upstream/main to avoid merge conflicts. The 4 port-mapping assertion failures are a known upstream issue (setup.sh outputs concrete values but tests expect variable-format strings). Made-with: Cursor

MrGidea and others added 30 commits March 8, 2026 15:59

fix: resolve CI linting - add extraction_max_tokens definition, apply…

c115342

… ruff formatting

refactor: remove extraction_max_tokens (not essential, json_repair ha…

a6da346

…ndles truncation)

fix: apply lint cleanup for multimodal PR

7c940af

Align the combined structured extraction and multimodal pipeline branch with the repository pre-commit rules so the upstream PR can pass formatting and lint checks cleanly. Made-with: Cursor

chore: align env.example with interactive setup template

deeb177

Sync env.example with the upstream interactive setup wizard template while keeping the multimodal parsing, staged pipeline, JSON extraction, and role-specific LLM settings exposed for configuration. Made-with: Cursor

chore: refresh env.example against latest upstream wizard

8147a47

Rebase env.example onto the latest upstream interactive setup template while preserving the multimodal parsing, staged pipeline, JSON extraction, and role-specific model configuration added by this PR. Made-with: Cursor

fix: align runtime target lines in env.example

bd491a4

Keep LIGHTRAG_RUNTIME_TARGET in the latest upstream wizard template form so env.example remains a configuration template rather than activating a concrete runtime value in place. Made-with: Cursor

fix: install offline storage deps for offline tests

1e85f95

Add the OpenSearch client to offline storage dependencies and make the offline test workflow install storage backends so mocked OpenSearch storage tests can be collected in CI. Made-with: Cursor

fix: sort offline storage requirements

4df267e

Reorder requirements-offline-storage.txt to match the repository requirements formatter so pre-commit passes cleanly in CI. Made-with: Cursor

fix: sync zhipu.py with upstream to resolve merge conflict

4de2dfd

Made-with: Cursor

fix: add upstream runtime_validation module for CI compatibility

584307c

Cherry-pick runtime_validation.py from upstream/main so the test_runtime_target_validation tests can be collected during CI merge preview. Made-with: Cursor

fix: add upstream opensearch storage impl for CI compatibility

651a996

Cherry-pick opensearch_impl.py and update kg/__init__.py from upstream/main so test_opensearch_storage tests can be collected during CI merge preview. Made-with: Cursor

chore: sync setup.sh with upstream fix for port mapping

2f60c67

Pull the latest setup.sh from upstream/main which now produces variable-format port mappings matching the test assertions. Made-with: Cursor

Merge branch 'dev' into feat/upstream-combined-no-disambiguation

881096f

Merge branch 'dev' into ydh/multimodal-pipeline

4cdb582

Merge branch 'dev' into feat/multimodal-pipeline

1df9f45

👷 ci(workflows): skip CI runs on draft pull requests

b4ddad3

- add `ready_for_review` to PR trigger types for linting and tests workflows - add conditional check to skip jobs when PR is in draft state

Merge branch 'dev' into feat/multimodal-pipeline

669a676

Update README.md

ec160ba

Merge branch 'dev' into feat/multimodal-pipeline

b152083

Merge branch 'dev' into feat/multimodal-pipeline

db6ce44

danielaskdd added 10 commits March 21, 2026 13:05

Merge branch 'dev' into feat/multimodal-pipeline

415a275

👷 ci(tests): remove offline-storage dependency from test installation

35226ed

- remove offline-storage extra from pip install command in CI workflow - offline tests do not require storage backends

Merge branch 'dev' into feat/multimodal-pipeline

598ff5a

Merge branch 'dev' into feat/multimodal-pipeline

85807f6

Merge branch 'dev' into feat/multimodal-pipeline

184c85e

Merge branch 'dev' into feat/multimodal-pipeline

6295f7f

Merge branch 'dev' into feat/multimodal-pipeline

84f07f7

Merge branch 'dev' into feat/multimodal-pipeline

1e9d5f6

🔧 chore(lightrag): suppress type checking warning for raganything import

a425600

- add `type: ignore[import-untyped]` comment to silence mypy warning for untyped raganything.parser module

Merge branch 'dev' into feat/multimodal-pipeline

b86b48c

danielaskdd marked this pull request as ready for review March 24, 2026 04:07

danielaskdd added 2 commits March 24, 2026 12:23

Merge branch 'dev' into feat/multimodal-pipeline

983cc1b

Merge branch 'dev' into feat/multimodal-pipeline

56fcda6

danielaskdd added the "tracked" label (Issue is tracked by project) Mar 24, 2026

danielaskdd merged commit 0154f0c into dev Mar 25, 2026
3 checks passed

danielaskdd deleted the feat/multimodal-pipeline branch March 27, 2026 12:41

LarFii mentioned this pull request Apr 21, 2026

feat: parallelize text and multimodal processing in process_document_complete HKUDS/RAG-Anything#227

Closed

BdM-15 mentioned this pull request Apr 30, 2026

Migrate entity extraction from tuple-delimited to JSON structured output (LightRAG dev branch) BdM-15/proj-theseus#124

Closed

35 tasks

LarFii mentioned this pull request May 6, 2026

feat: incremental folder scan — skip unchanged files via MD5 manifest HKUDS/RAG-Anything#239

Closed

7 tasks

danielaskdd merged 42 commits into dev from feat/multimodal-pipeline
