src/pdf_ocr_pipeline/cli.py
Lines changed: 3 additions & 2 deletions
@@ -42,8 +42,9 @@
 
 def main() -> None:
     """
-    Parse arguments and perform OCR on one or more PDF files.
-    Outputs a JSON array of {file, ocr_text} objects to stdout.
+    Parses command-line arguments and performs OCR on one or more PDF files, outputting results as a JSON array to standard output.
+
+    Validates input files, manages logging verbosity, and processes PDFs in parallel. Each result includes the filename and either the extracted OCR text or an error message if processing fails. Exits with status code 1 on unrecoverable errors.
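The behaviour this docstring describes (parallel processing, with each result carrying either the OCR text or an error message) might look roughly like the sketch below. It is an illustration under assumptions, not the pipeline's code: `ocr_many`, `fake_ocr`, the thread pool, and the `"error"` key name are all hypothetical, and the diff does not show how `main()` actually parallelises the work.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def ocr_many(paths: List[str], ocr_one: Callable[[str], str]) -> List[Dict[str, str]]:
    """Run ocr_one over paths in parallel, capturing per-file failures."""

    def _safe(path: str) -> Dict[str, str]:
        try:
            return {"file": path, "ocr_text": ocr_one(path)}
        except Exception as exc:  # report the failure in the result and keep going
            return {"file": path, "error": str(exc)}  # "error" key is an assumption

    with ThreadPoolExecutor() as pool:
        # Executor.map() yields results in input order
        return list(pool.map(_safe, paths))


def fake_ocr(path: str) -> str:
    # Hypothetical stand-in for the real OCR call
    if path.endswith("bad.pdf"):
        raise ValueError("unreadable PDF")
    return f"text of {path}"


results = ocr_many(["a.pdf", "bad.pdf"], fake_ocr)
print(json.dumps(results))
```

Catching the exception inside the worker lets one bad file appear as an error entry in the JSON array instead of aborting the whole batch.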
+    Returns a singleton OpenAI client instance configured from environment variables.
+
+    Reads the required API key and optional endpoint or version overrides from the environment. Raises MissingApiKeyError if the API key is missing or appears to be a placeholder, and RuntimeError if no supported OpenAI SDK is installed.
     """
 
     global _client
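A lazily created module-level singleton of the kind this docstring describes could be sketched as follows. Everything beyond the names `MissingApiKeyError` and `_client` is an assumption: `FakeClient` stands in for the SDK client, the `OPENAI_API_KEY` variable name and the placeholder values are guesses, and the base class of `MissingApiKeyError` is not shown in the diff.

```python
import os
from typing import Optional


class MissingApiKeyError(RuntimeError):
    """Raised when the API key is absent or looks like a placeholder (base class assumed)."""


class FakeClient:
    # Hypothetical stand-in; the real function builds an OpenAI SDK client
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


_client: Optional[FakeClient] = None
_PLACEHOLDERS = {"", "changeme", "your-api-key"}  # assumed placeholder values


def get_client() -> FakeClient:
    """Create the client on first use and return the same instance afterwards."""
    global _client
    if _client is None:
        key = os.environ.get("OPENAI_API_KEY", "").strip()  # variable name is an assumption
        if key.lower() in _PLACEHOLDERS:
            raise MissingApiKeyError("API key is missing or looks like a placeholder")
        _client = FakeClient(api_key=key)
    return _client
```

Validating the key before constructing the client means a misconfigured environment fails at the first call with a specific exception rather than at the first network request.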
@@ -115,28 +113,17 @@ def send(
     client: Optional["OpenAI"] = None,
     **kwargs: Any,
 ) -> Dict[str, Any]:
-    """Send *messages* to the chat completion endpoint and return JSON output.
-
-    Parameters
-    ----------
-    messages:
-        List of role/content dicts as expected by the OpenAI chat completion
-        endpoint.
-    model:
-        Model name – defaults to ``gpt-4o``.
-    client:
-        Optional already‑initialised *OpenAI* client (mainly for tests). When
-        *None* the module‑level singleton returned by :func:`_get_client` is
-        used.
-    **kwargs:
-        Additional keyword arguments passed straight through to
-        Parsed JSON object produced by the model *or* a dict with an ``error``
-        key when something went wrong.
+    """
+    Sends chat messages to a language model and returns the parsed JSON response.
+
+    Args:
+        messages: List of dictionaries representing chat messages, each with a role and content.
+        model: Name of the model to use. Defaults to "gpt-4o".
+        client: Optional pre-initialized OpenAI client. If not provided, a singleton client is used.
+        **kwargs: Additional keyword arguments passed to the chat completion API.
+
+    Returns:
+        A dictionary containing the parsed JSON object from the model's response, or a dictionary with an "error" key if the request fails or the response is invalid.
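The return contract in this docstring (a parsed JSON object, or a dictionary with an "error" key) can be illustrated with a small parsing helper. This is a sketch under assumptions: `parse_model_reply` is a hypothetical name, the error messages are invented, and treating a non-object JSON value as an error is a guess about what "invalid" means here.

```python
import json
from typing import Any, Dict


def parse_model_reply(raw: str) -> Dict[str, Any]:
    """Parse the model's text reply as a JSON object, or return an error dict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        # Assumption: only a JSON object counts as a valid response
        return {"error": "expected a JSON object"}
    return data


print(parse_model_reply('{"segments": []}'))
print(parse_model_reply("not json"))
```

Returning an error dictionary rather than raising keeps the caller's loop over many documents simple, at the cost of requiring callers to check for the "error" key.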
src/pdf_ocr_pipeline/logging_utils.py
Lines changed: 21 additions & 12 deletions
@@ -28,11 +28,10 @@
 
 
 def _initialise_root_logger() -> None:
-    """Attach a single *StreamHandler* to the root logger.
-
-    The handler is only added once per interpreter session. We deliberately do
-    **not** rely on :pyfunc:`logging.basicConfig` because re‑invoking it from
-    multiple modules is a common source of duplicate log lines.
+    """
+    Attaches a single StreamHandler with a consistent formatter to the root logger.
+
+    Ensures the handler is only added once per interpreter session to prevent duplicate log lines, avoiding the use of logging.basicConfig. Does not modify the root logger's level.
     """
 
     global _INITIALISED  # noqa: WPS420 (module‑level state is fine here)
-    """Return a module-level logger with our global formatting applied.
-
-    The first call will attach the global handler to the root logger
-    (without changing its level). Subsequent calls simply retrieve the
-    named logger. A per-logger *level* may be provided but is rarely
-    necessary—prefer using :func:`set_root_level` to adjust verbosity.
+    """
+    Returns a logger with the specified name, ensuring global formatting is applied.
+
+    On the first call, attaches a single global handler with consistent formatting to the root logger without modifying its level. If a level is provided, sets it on the returned logger instance. Prefer adjusting the root logger's level using `set_root_level` for consistent verbosity control across modules.
+
+    Args:
+        name: The name of the logger to retrieve.
+        level: Optional log level to set on the returned logger.
+
+    Returns:
+        A logger instance with the specified name and global formatting.
     """
 
     # Ensure the global handler is attached. Do not adjust root level here.
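The two logging docstrings above describe a once-only handler on the root logger plus per-name loggers that normally inherit the root level. A minimal sketch of that pattern follows. The function names mirror those in the diff, but the bodies, the format string, and the signatures of `get_logger` and `set_root_level` are assumptions rather than the module's actual code.

```python
import logging
from typing import Optional

_INITIALISED = False


def _initialise_root_logger() -> None:
    """Attach one StreamHandler to the root logger, at most once per session."""
    global _INITIALISED
    if _INITIALISED:
        return
    handler = logging.StreamHandler()
    # Format string is an assumption; the diff does not show the real one
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logging.getLogger().addHandler(handler)  # the root level is left untouched
    _INITIALISED = True


def set_root_level(level: int) -> None:
    """Adjust verbosity for every logger that has no level of its own."""
    logging.getLogger().setLevel(level)


def get_logger(name: str, level: Optional[int] = None) -> logging.Logger:
    """Return a named logger, attaching the shared handler on first use."""
    _initialise_root_logger()
    logger = logging.getLogger(name)
    if level is not None:
        logger.setLevel(level)
    return logger


root = logging.getLogger()
handlers_before = len(root.handlers)
quiet = get_logger("demo.quiet")                   # inherits the root level
chatty = get_logger("demo.chatty", logging.DEBUG)  # per-logger override
handlers_added = len(root.handlers) - handlers_before
set_root_level(logging.WARNING)
```

Because named loggers default to `NOTSET`, `quiet` follows whatever `set_root_level` chooses, while `chatty` keeps its explicit level. That inheritance is why the docstring recommends adjusting the root level.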
src/pdf_ocr_pipeline/segment_cli.py
Lines changed: 10 additions & 4 deletions
@@ -31,9 +31,11 @@
 
 
 def _read_input() -> List[Dict[str, Any]]:
-    """Read JSON array or raw text from *stdin*.
-
-    Returns a list of ``{"file": ..., "ocr_text": ...}`` dictionaries.
+    """
+    Reads OCR input from stdin as either a JSON array or raw text.
+
+    Returns:
+        A list of dictionaries, each containing "file" and "ocr_text" keys. If the input is not valid JSON, the entire input is treated as a single OCR text blob with an "unknown" file identifier. If the input is a JSON array, it is returned as-is. Non-list JSON input is wrapped as a single document.
-    """Segment OCR text(s) read from *stdin* and emit JSON to *stdout*."""
+    """
+    Runs the CLI tool to segment OCR text from stdin and outputs the results as JSON.
+
+    Parses command-line arguments for custom prompt templates, JSON formatting, and logging verbosity. Reads OCR text input (raw or JSON) from stdin, segments each document using a segmentation function, and prints the segmentation results as a JSON array to stdout.
+    """
 
     default_verbose = settings.verbose
     default_prompt = None  # let segment_pdf load bundled template unless user overrides
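The three input cases listed in the `_read_input` docstring (invalid JSON, a JSON array, any other JSON value) can be sketched as below. `parse_input` is a hypothetical helper that takes a string instead of reading stdin so that it is easy to exercise. Wrapping non-list JSON as `[data]` is one reading of "wrapped as a single document", not necessarily what the real function does.

```python
import json
from typing import Any, Dict, List


def parse_input(raw: str) -> List[Dict[str, Any]]:
    """Normalise raw stdin content into a list of document dictionaries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON: treat the whole input as one OCR text blob
        return [{"file": "unknown", "ocr_text": raw}]
    if isinstance(data, list):
        return data  # JSON array is passed through as-is
    return [data]  # assumption: a single JSON value becomes a one-element list


print(parse_input("plain OCR text"))
print(parse_input('[{"file": "a.pdf", "ocr_text": "x"}]'))
```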
src/pdf_ocr_pipeline/segmentation.py
Lines changed: 12 additions & 4 deletions
@@ -27,10 +27,18 @@ def segment_pdf(
     client: Optional[object] = None,
     model: str = "gpt-4o",
 ) -> Dict[str, Any]:
-    """Return segmentation JSON for *text* using *prompt*.
-    If no prompt is provided, the default segmentation template is used.
-
-    The implementation is intentionally minimal – real logic lives in the LLM.
+    """
+    Segments OCR-extracted PDF text into structured JSON using an LLM.
+
+    If no prompt is provided, a default segmentation template is loaded and cached from the package resources. Returns the LLM's JSON output representing the segmented documents.
+
+    Args:
+        text: The OCR text to segment.
+        prompt: Optional custom prompt to instruct the LLM; if not provided, a default template is used.
+        model: The LLM model identifier.
+
+    Returns:
+        A dictionary containing the JSON segmentation output from the LLM.
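The "loaded and cached" default template mentioned in this docstring could be implemented with `functools.lru_cache`, as in the sketch below. The helper names, the template text, and the `{text}` placeholder are all hypothetical. The real code reads the template from package resources, which this example replaces with a literal string.

```python
from functools import lru_cache
from typing import Optional

load_count = 0  # counts how often the template is actually loaded


@lru_cache(maxsize=1)
def default_prompt() -> str:
    """Load the default template once; later calls hit the cache."""
    global load_count
    load_count += 1
    # Stand-in for reading a bundled template from package resources
    return "Segment the following OCR text into documents:\n{text}"


def build_prompt(text: str, prompt: Optional[str] = None) -> str:
    """Use the caller's prompt if given, otherwise the cached default."""
    template = prompt if prompt is not None else default_prompt()
    return template.replace("{text}", text)


build_prompt("first document")
build_prompt("second document")
print(load_count)
```

Caching matters here because the function may be called once per document, and the bundled template does not change within a run.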