TextLayout

Extract structured text from documents with columnar layouts.

Overview

TextLayout uses an XY-Cut algorithm to detect text blocks in plain text that was extracted from PDFs or other sources with character spacing preserved. It's particularly effective for:

Invoices with multiple columns
Forms with label:value pairs
Documents with side-by-side content
Any text where spatial positioning matters

Installation

pip install git+https://github.com/asynkron/Asynkron.TextLayout.git

Usage

Command Line

textlayout document.txt
textlayout document.txt 3  # min_gap=3 for tighter column detection
textlayout document.pdf 2  # requires pdftotext on PATH

Python API

from textlayout import extract

# Read your document
with open("invoice.txt") as f:
    text = f.read()

output = extract(text, min_gap=2)
print(output)

PDF via pdftotext

Requires Poppler's pdftotext available on your PATH.

Install options:

macOS: brew install poppler
Debian/Ubuntu: sudo apt-get install poppler-utils
Python wrapper: pip install pdftotext (https://pypi.org/project/pdftotext/)

from textlayout import extract_pdf

output = extract_pdf("invoice.pdf", min_gap=2)
print(output)
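
For readers who want to see what the pdftotext step might look like on its own, here is a minimal sketch. It is an assumption that extract_pdf works this way; the helper names pdftotext_command and pdf_to_layout_text are hypothetical and not part of TextLayout. The -layout flag asks pdftotext to keep the physical layout, and "-" as the output file sends the text to stdout.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build a pdftotext call that keeps the page layout and writes to stdout."""
    # Hypothetical helper: "-layout" preserves column spacing, "-" means stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_layout_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install Poppler first")
    result = subprocess.run(
        pdftotext_command(pdf_path), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be passed to extract(text, min_gap=2) as in the Python API section.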

How It Works

1. Text to Matrix: Converts text into a 2D character grid
2. Horizontal Split: Divides document into sections at blank lines
3. Vertical Split: Divides each section into columns at whitespace gaps
4. Normalization:
   - Joins label: with following value
   - Unwraps word-wrapped lines
   - Pulls up numbers after separators
5. Formatting:
   - Collapses multiple blank lines
   - Aligns key:value pairs in groups
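
The first three steps can be sketched roughly as follows. This is an illustration of the general XY-Cut idea under simple assumptions, not the library's actual code: the function names to_grid, split_rows, and split_columns are hypothetical, and the sketch treats min_gap as the minimum number of fully blank character columns that separates two text columns.

```python
from typing import List


def to_grid(text: str) -> List[str]:
    """Pad every line to the same width so the text forms a rectangular grid."""
    lines = text.expandtabs().splitlines()
    width = max((len(line) for line in lines), default=0)
    return [line.ljust(width) for line in lines]


def split_rows(grid: List[str]) -> List[List[str]]:
    """Horizontal cut: close the current section at each blank line."""
    sections: List[List[str]] = []
    current: List[str] = []
    for row in grid:
        if row.strip():
            current.append(row)
        elif current:
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections


def split_columns(section: List[str], min_gap: int = 2) -> List[List[str]]:
    """Vertical cut: split where at least min_gap columns are blank in every row."""
    if not section:
        return []
    width = len(section[0])
    blank = [all(row[x] == " " for row in section) for x in range(width)]

    # Collect runs of character columns that contain text in some row.
    runs: List[List[int]] = []
    x = 0
    while x < width:
        if blank[x]:
            x += 1
            continue
        start = x
        while x < width and not blank[x]:
            x += 1
        runs.append([start, x])

    # Merge runs separated by a gap narrower than min_gap.
    merged: List[List[int]] = []
    for start, end in runs:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    return [[row[s:e].rstrip() for row in section] for s, e in merged]


sample = (
    "Kund nr".ljust(20) + "Fakturanr".ljust(19) + "Fakturadatum\n"
    + "4601691270".ljust(20) + "005597910".ljust(19) + "2021-12-03"
)
for section in split_rows(to_grid(sample)):
    for column in split_columns(section, min_gap=2):
        print(column)
```

In this sketch, raising min_gap merges columns separated by narrow gaps, so a single space inside a label such as "Kund nr" never triggers a split.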

Example

Input (raw PDF text with spacing):

Kund nr             Fakturanr          Fakturadatum
4601691270          005597910          2021-12-03

Output:

Kund nr     : 4601691270
Fakturanr   : 005597910
Fakturadatum: 2021-12-03
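
The aligned output above can be reproduced with a few lines of Python. This is only a sketch of the formatting step, assuming the header and value cells have already been paired up; align_pairs is a hypothetical name, not a TextLayout function.

```python
from typing import List, Tuple


def align_pairs(pairs: List[Tuple[str, str]]) -> str:
    """Render key/value pairs with the colons lined up in one column."""
    # Pad every key to the width of the longest key in the group.
    width = max((len(key) for key, _ in pairs), default=0)
    return "\n".join(f"{key:<{width}}: {value}" for key, value in pairs)


# Header cells and the value cells found directly beneath them.
headers = ["Kund nr", "Fakturanr", "Fakturadatum"]
values = ["4601691270", "005597910", "2021-12-03"]

print(align_pairs(list(zip(headers, values))))
```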

License

MIT
