
feat: parallelize text and multimodal processing in process_document_complete #227

Closed
ndcorder wants to merge 2 commits into HKUDS:main from ndcorder:feat/parallel-text-multimodal

Conversation


@ndcorder ndcorder commented Mar 21, 2026

Text insertion and multimodal processing in process_document_complete currently run sequentially but don't share any state. This PR runs them concurrently with asyncio.gather, so total time is max(text, multimodal) instead of text + multimodal.

Uses return_exceptions=True so if one branch fails the other still completes.
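A minimal sketch of the change described above. The two stage functions here are illustrative stand-ins, not the actual raganything/processor.py internals:

```python
import asyncio

# Hypothetical stand-ins for the two pipeline stages; the real code lives
# in raganything/processor.py and has different signatures.
async def insert_text(doc_id: str) -> str:
    await asyncio.sleep(0.02)  # simulate text insertion latency
    return f"text:{doc_id}"

async def process_multimodal(doc_id: str) -> str:
    await asyncio.sleep(0.03)  # simulate multimodal processing latency
    return f"multimodal:{doc_id}"

async def process_document_complete(doc_id: str) -> list:
    # Run both branches concurrently: wall time is ~max of the two stages,
    # not their sum. return_exceptions=True keeps one branch's failure
    # from cancelling the sibling (see the review below for the caveat
    # this introduces).
    return await asyncio.gather(
        insert_text(doc_id),
        process_multimodal(doc_id),
        return_exceptions=True,
    )

results = asyncio.run(process_document_complete("doc-1"))
```

Note that asyncio.gather returns results in the order the awaitables were passed, regardless of which finishes first.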

@ndcorder ndcorder force-pushed the feat/parallel-text-multimodal branch from eede248 to ab1167c on March 21, 2026 at 09:31
@LarFii LarFii (Collaborator) commented Mar 24, 2026

Thanks for your contribution!

I found one P1 issue. In raganything/processor.py, the new parallel branch uses asyncio.gather(..., return_exceptions=True) and only logs per-task failures. That means process_document_complete() no longer propagates text/multimodal failures, so the outer error handler is skipped and the method can still emit on_document_complete / "processing complete" even when one branch failed. This is a behavioral regression from main and can silently report partially ingested documents as successful. I think the PR should preserve fail-fast semantics here, or at minimum aggregate task errors and raise after both tasks finish. The new test test_one_branch_failing_does_not_block_the_other currently locks in the broken behavior rather than detecting it.

ndcorder and others added 2 commits April 21, 2026 17:30
These two steps are independent so there's no reason to wait for text
insertion before starting multimodal. Uses asyncio.gather with
return_exceptions so one failing doesn't kill the other.
…owed

asyncio.gather(..., return_exceptions=True) made process_document_complete
silently report partially ingested documents as successful: the outer
try/except never saw the failure, so on_document_error was skipped and
on_document_complete still fired.

Keep return_exceptions=True so a failure in one branch does not cancel
the sibling (preserves the parallelism win), but after gather collect
all task exceptions and re-raise. If both branches fail, raise a single
RuntimeError that names the secondary failures and chains the first
exception via __cause__.

Update the regression test to assert that:
- the non-failing branch still completes
- the failure propagates out of process_document_complete
- on_document_error fires and on_document_complete does not

Add a new test covering the both-branches-failing aggregation path.

Made-with: Cursor
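The collect-and-reraise pattern in the commit message above can be sketched as follows; the task names are hypothetical, and only the error-handling shape reflects the described fix:

```python
import asyncio

# Illustrative tasks standing in for the text and multimodal branches.
async def ok_branch() -> str:
    return "done"

async def failing_branch() -> str:
    raise ValueError("text insertion failed")

async def run_both() -> str:
    # return_exceptions=True lets both branches run to completion, so a
    # failure in one does not cancel the sibling...
    results = await asyncio.gather(
        ok_branch(), failing_branch(), return_exceptions=True
    )
    # ...but instead of only logging, collect the exceptions and re-raise
    # so the caller's error handling still fires.
    errors = [r for r in results if isinstance(r, BaseException)]
    if len(errors) == 1:
        raise errors[0]
    if errors:
        # Both branches failed: raise one error naming the secondary
        # failures and chaining the first via __cause__.
        raise RuntimeError(
            f"multiple branches failed: {[repr(e) for e in errors[1:]]}"
        ) from errors[0]
    return "all branches succeeded"

try:
    asyncio.run(run_both())
    outcome = "no error"
except ValueError as e:
    outcome = str(e)
```

Here the single-failure path re-raises the original exception unchanged, which preserves the exception type the outer handler expects.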
@LarFii LarFii force-pushed the feat/parallel-text-multimodal branch from ab1167c to 5a703a3 on April 21, 2026 at 09:38
@LarFii LarFii (Collaborator) commented Apr 21, 2026

@ndcorder Thanks again for putting time into this. After a closer review (and another pass by Codex on the latest commit) we found that the parallelization as implemented has correctness issues that go deeper than just the swallowed-exception fix:

  1. text-only documents lose multimodal_processed=True — the else branch calls _mark_multimodal_processing_complete(doc_id) before asyncio.gather(...) runs the text insertion. At that point LightRAG hasn't created doc_status yet, so the mark call silently no-ops.

  2. mixed text + multimodal documents race on doc_status — _update_doc_status_with_chunks_type_aware does a read/append/write on chunks_list, while LightRAG's ainsert does a full overwrite of the same record. In main this works because multimodal runs strictly after text insertion; under asyncio.gather any interleaving (multimodal reads None and skips, or text overwrites multimodal's chunks_list/multimodal_processed) is observable. This is not fixable with a lock because LightRAG's internal upsert is not partial-merge-aware.

The "text and multimodal are independent" assumption in the PR description doesn't hold for the integration step — only the description-generation phase is truly independent.

Separately, RAG-Anything's multimodal pipeline is now being merged into LightRAG itself (HKUDS/LightRAG#2830 merged into the dev branch on 2026-03-25, with parser-alignment notes in docs/RAGAnythingParserAlignment.md). A lot of process_document_complete's shape will change once the upstream-side integration lands here, so it doesn't make sense to ship a partial parallelization of the current code.

Closing this PR for now. If you'd like to revisit parallelism after the upstream integration settles, the right shape is probably:

  • split _process_multimodal_content into generate_multimodal_descriptions (pure LLM, parallelizable) and integrate_multimodal_into_lightrag (must be serialized after text insertion)
  • gather(text_insert, generate_multimodal_descriptions) → then await integrate_multimodal_into_lightrag
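The two-phase shape suggested above could look roughly like this. All function names are taken from the review comment or invented for illustration — none of this is an existing API:

```python
import asyncio

async def text_insert(doc_id: str) -> str:
    return f"inserted:{doc_id}"

async def generate_multimodal_descriptions(doc_id: str) -> list:
    # Pure LLM work: no shared state, safe to run alongside text insertion.
    return [f"description-of-image-in-{doc_id}"]

async def integrate_multimodal_into_lightrag(doc_id: str, descriptions: list) -> str:
    # Touches doc_status, so it must run strictly after text insertion.
    return f"integrated {len(descriptions)} item(s) into {doc_id}"

async def process_document_complete(doc_id: str) -> str:
    # Phase 1: parallelize the genuinely independent work.
    _, descriptions = await asyncio.gather(
        text_insert(doc_id),
        generate_multimodal_descriptions(doc_id),
    )
    # Phase 2: serialize the integration step behind text insertion.
    return await integrate_multimodal_into_lightrag(doc_id, descriptions)

result = asyncio.run(process_document_complete("doc-1"))
```

This keeps the parallelism win for the expensive LLM phase while avoiding the doc_status race entirely, since the only writer after text insertion is the serialized integration step.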

Really appreciate the contribution and the clean test suite — sorry this one isn't going in.

@LarFii LarFii closed this Apr 21, 2026