Extract structured text from documents with columnar layouts.
TextLayout uses an XY-Cut algorithm to detect text blocks in text that was extracted from PDFs or other sources with character spacing preserved. It's particularly effective for:
- Invoices with multiple columns
- Forms with label:value pairs
- Documents with side-by-side content
- Any text where spatial positioning matters
Install:

    pip install git+https://github.com/asynkron/Asynkron.TextLayout.git

Command line:

    textlayout document.txt
    textlayout document.txt 3   # min_gap=3 for tighter column detection
    textlayout document.pdf 2   # requires pdftotext on PATH

Python:

    from textlayout import extract

    # Read your document
    with open("invoice.txt") as f:
        text = f.read()

    output = extract(text, min_gap=2)
    print(output)

PDF input requires Poppler's pdftotext to be available on your PATH.
Install options:
- macOS: brew install poppler
- Debian/Ubuntu: sudo apt-get install poppler-utils
- Python wrapper: pip install pdftotext (https://pypi.org/project/pdftotext/)
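
Before handing a PDF to the tool, it can help to confirm that pdftotext is actually reachable. The following is a minimal sketch, not TextLayout's own code: the helper names `pdftotext_command` and `pdf_to_layout_text` are hypothetical, and the sketch assumes only Poppler's documented `-layout` option and the convention that `-` as the output file means stdout.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str) -> list[str]:
    """Build a pdftotext call that keeps the physical layout and writes to stdout."""
    # -layout preserves column spacing; "-" as the output file means stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_layout_text(pdf_path: str) -> str:
    """Run pdftotext and return its output, failing early if it is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install Poppler first")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    return result.stdout
```

The string returned by `pdf_to_layout_text` keeps its column spacing, so it is the kind of input that `extract(text, min_gap=2)` expects.
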
To extract a PDF directly from Python:

    from textlayout import extract_pdf

    output = extract_pdf("invoice.pdf", min_gap=2)
    print(output)

How it works:
- Text to Matrix: Converts text into a 2D character grid
- Horizontal Split: Divides document into sections at blank lines
- Vertical Split: Divides each section into columns at whitespace gaps
- Normalization:
  - Joins a label: with the value that follows it
  - Unwraps word-wrapped lines
  - Pulls up numbers after separators
- Formatting:
  - Collapses multiple blank lines
  - Aligns key:value pairs in groups
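
The first three steps can be sketched in a few lines of Python. This is an illustrative simplification, not the library's implementation: the function names `to_grid`, `split_rows`, and `split_columns` are hypothetical, and the sketch makes one horizontal pass followed by one vertical pass, whereas a full XY-Cut would apply the cuts recursively.

```python
def to_grid(text: str) -> list[str]:
    """Pad every line to the same width so the text forms a rectangular grid."""
    lines = text.splitlines()
    width = max((len(line) for line in lines), default=0)
    return [line.ljust(width) for line in lines]


def split_rows(grid: list[str]) -> list[list[str]]:
    """Horizontal cut: close the current section at every blank line."""
    sections, current = [], []
    for line in grid:
        if line.strip():
            current.append(line)
        elif current:
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections


def split_columns(section: list[str], min_gap: int = 2) -> list[list[str]]:
    """Vertical cut: split where at least min_gap adjacent columns are blank."""
    width = len(section[0])
    # A grid column is blank only if every line has a space at that position.
    blank = [all(line[x] == " " for line in section) for x in range(width)]
    spans, start, gap = [], None, 0
    for x in range(width):
        if blank[x]:
            gap += 1
            if start is not None and gap == min_gap:
                spans.append((start, x - gap + 1))  # end where the gap began
                start = None
        else:
            if start is None:
                start = x
            gap = 0
    if start is not None:
        spans.append((start, width))
    return [[line[a:b].strip() for line in section] for a, b in spans]
```

Because a column counts as blank only when every line in the section is blank there, the single space inside "Kund nr" does not trigger a cut, while a wider gap shared by all lines does.
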
Input (raw PDF text with spacing):

    Kund nr        Fakturanr      Fakturadatum
    4601691270     005597910      2021-12-03

Output:

    Kund nr     : 4601691270
    Fakturanr   : 005597910
    Fakturadatum: 2021-12-03
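
The alignment step in this example can be approximated as follows. This is a sketch under assumptions: `align_pairs` is a hypothetical helper, and it assumes the header cells and value cells have already been paired up by the earlier steps.

```python
def align_pairs(pairs: list[tuple[str, str]]) -> str:
    """Pad every key to the longest key in the group so the colons line up."""
    if not pairs:
        return ""
    width = max(len(key) for key, _ in pairs)
    return "\n".join(f"{key.ljust(width)}: {value}" for key, value in pairs)


headers = ["Kund nr", "Fakturanr", "Fakturadatum"]
values = ["4601691270", "005597910", "2021-12-03"]
print(align_pairs(list(zip(headers, values))))
```

Each key is padded to the width of the longest key in the group, here "Fakturadatum", so that key gets no padding and the shorter keys are filled with spaces.
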
License: MIT