[BUG]: schema extraction uses the whole file without splitting in chunks #457

@odysseasdiamadis

Description

Before You Report a Bug, Please Confirm You Have Done The Following...

  • I have updated to the latest version of the packages.
  • I have searched for both existing issues and closed issues and found none that matched my issue.

neo4j-graphrag-python's version

1.10.1

Python version

3.12

Operating System

Debian 13

Dependencies

"datasets==3.6.0",
"flask>=3.1.2",
"neo4j>=5.28.2",
"neo4j-graphrag[nlp,ollama,sentence-transformers]>=1.10.1",
"streamlit>=1.52.1",

Reproducible example

from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

PDF_FILE = './some-file.pdf'  # a PDF larger than the model's context window

# llm, neo4j_driver, embedder and text_splitter are configured elsewhere
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    embedder=embedder,
    from_pdf=True,
    text_splitter=text_splitter,
)
await kg_builder.run_async(file_path=PDF_FILE)

Relevant Log Output

json.JSONDecodeError stating that the LLM output is not valid JSON

Expected Result

I expect the pipeline to complete successfully.

What happened instead?

In the schema extraction phase, the pipeline does not split the document(s) into chunks; instead it sends the whole document, regardless of its size, to the Ollama endpoint in a single prompt.

Additional Info

What happens is that the pipeline tries to extract the schema with a single request to Ollama, passing the whole document in the prompt. It should instead perform this step chunk by chunk. Because the whole document exceeds the model's context window, the model loses track of the instructions and produces output that is not valid JSON.
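A chunk-by-chunk approach could look roughly like the sketch below. This is not the neo4j-graphrag API: `split_into_chunks`, `merge_schemas`, and the per-chunk LLM call are hypothetical illustrations of the expected behavior (split so each prompt fits the context window, then union the per-chunk schemas).

```python
def split_into_chunks(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size, overlapping chunks that fit the model context."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


def merge_schemas(schemas: list[dict]) -> dict:
    """Union per-chunk schemas into one, deduplicating node and relationship types."""
    merged = {"node_types": set(), "relationship_types": set()}
    for schema in schemas:
        merged["node_types"].update(schema.get("node_types", []))
        merged["relationship_types"].update(schema.get("relationship_types", []))
    return {key: sorted(values) for key, values in merged.items()}


# Intended usage (extract_schema_from_chunk stands in for one LLM call per chunk):
# chunks = split_into_chunks(document_text)
# schemas = [extract_schema_from_chunk(chunk) for chunk in chunks]
# schema = merge_schemas(schemas)
```

Each prompt then stays within the context window, and a malformed response from one chunk would affect only that chunk rather than the whole run.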
