Backport fix: prevent PDF image extraction from being skipped #481

TheAryan77 · 2025-12-21T18:54:15Z

🧪 Add regression test for PDF image extraction (PyMuPDF)

Summary

Adds a regression test to ensure that extract_images=True does not silently skip
images when using PyMuPDFParser.

In langchain-community==0.4.1, image extraction was broken due to a premature
empty-buffer check, causing all images to be skipped with no warning. This test
would have failed in that version.

Why this matters

The bug failed silently, with no error or warning.
OCR-based PDF pipelines produced incomplete results.
Users could not easily diagnose the problem without inspecting the source code.

This test ensures the expected behavior remains intact and prevents future
regressions.

Notes

The underlying fix already exists on main
This PR adds a regression test to lock in correct behavior

Commit 9a0bef1: Backport fix: prevent PDF image extraction from being skipped
