@TheAryan77

🧪 Add regression test for PDF image extraction (PyMuPDF)

Summary

Adds a regression test to ensure that extract_images=True does not silently skip
images when using PyMuPDFParser.

In langchain-community==0.4.1, image extraction was broken due to a premature
empty-buffer check, causing all images to be skipped with no warning. This test
would have failed in that version.
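
To illustrate the class of bug described above, here is a minimal, standard-library-only sketch. It is not the actual langchain-community code: the function names `serialize_images_buggy`, `serialize_images_fixed`, and `check_images_are_not_silently_skipped` are hypothetical, and the image payloads are fake bytes. It only shows how an emptiness check placed before the buffer is written can drop every image without raising, and what a regression-style assertion against that behavior might look like.

```python
import io


def serialize_images_buggy(images):
    """Hypothetical buggy variant: checks the buffer before writing to it."""
    results = []
    for raw in images:
        buffer = io.BytesIO()
        # Premature check: the buffer was just created, so it is always empty
        # here and every image is skipped with no error or warning.
        if not buffer.getvalue():
            continue
        buffer.write(raw)
        results.append(buffer.getvalue())
    return results


def serialize_images_fixed(images):
    """Hypothetical fixed variant: checks the buffer after writing to it."""
    results = []
    for raw in images:
        buffer = io.BytesIO()
        buffer.write(raw)
        # Only genuinely empty images are skipped now.
        if not buffer.getvalue():
            continue
        results.append(buffer.getvalue())
    return results


def check_images_are_not_silently_skipped(serialize):
    """Regression-style check: every non-empty input image must come back."""
    images = [b"fake-image-1", b"fake-image-2"]  # stand-ins for real image data
    assert len(serialize(images)) == len(images), "images were silently skipped"
```

Calling `check_images_are_not_silently_skipped(serialize_images_fixed)` passes, while the same call with `serialize_images_buggy` raises `AssertionError`. That mirrors the intent of this PR: a test that fails against the broken behavior and passes once the fix is in place.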

Why this matters

  • The bug failed silently (no error or warning)
  • OCR-based PDF pipelines produced incomplete results
  • Users could not easily diagnose the problem without inspecting the source code

This test ensures the expected behavior remains intact and prevents future
regressions.

Notes

  • The underlying fix already exists on main
  • This PR adds a regression test to lock in correct behavior
