A reproducible workflow for extracting location references and thematic signals from unstructured PDF documents.
This project and this README were generated with the help of ChatGPT 5.1.
This project demonstrates a lightweight pipeline that converts unstructured PDF documents into structured geospatial datasets, using:
- Python
- Jupyter + Docker
- spaCy NLP
- GeoPandas / Shapely
- Parquet / GeoParquet
- OSM-based geocoding with caching
The goal is to turn narrative text in PDF reports into geospatially meaningful information.
```
project/
  notebooks/
    01_extract_pdfs.ipynb
    02_nlp_locations.ipynb
    03_geocode_points.ipynb
  <data-folder>/
    processed/
    data/interim/
  .env
  Dockerfile
  docker-compose.yml
```
The <data-folder> contains raw PDFs and derived outputs.
It is gitignored to keep documents and processed data local.
All folder paths and category names are defined in .env, for example:
```
PDF_FOLDER=my-data-folder
PDF_CATEGORIES=Cat1,Cat2,Cat3,Cat4,Cat5
BAD_LOCATION_TOKENS=example1,example2,...
PDF_ROOT=/workspace/${PDF_FOLDER}
PROCESSED_DIR=/workspace/${PDF_FOLDER}/processed
INTERIM_DIR=/workspace/${PDF_FOLDER}/data/interim
```
This prevents hard-coding and makes the pipeline portable.
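A minimal sketch of how the notebooks can pick these settings up, assuming python-dotenv is installed (the variable names match the `.env` example above):

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # pip install python-dotenv

# Load .env into the environment; ${PDF_FOLDER} references are expanded.
load_dotenv()

PDF_ROOT = Path(os.environ["PDF_ROOT"])
PROCESSED_DIR = Path(os.environ["PROCESSED_DIR"])
INTERIM_DIR = Path(os.environ["INTERIM_DIR"])

# Comma-separated settings become Python collections.
PDF_CATEGORIES = os.environ["PDF_CATEGORIES"].split(",")
BAD_LOCATION_TOKENS = set(os.environ["BAD_LOCATION_TOKENS"].split(","))
```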
The first notebook, 01_extract_pdfs.ipynb, extracts text from the PDFs and writes two structured outputs.

A document-level table with one row per PDF, containing:
- doc_id
- file name & title
- category (from directory structure)
- file path
- optional metadata fields
A chunk-level table with one row per paragraph-like text chunk, containing:
- chunk_id
- doc_id
- page number
- chunk index
- the extracted text
These files form the foundation for all downstream processing.
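A minimal sketch of this step, assuming pdfplumber for extraction and pandas for the Parquet outputs. The output file names (documents.parquet, chunks.parquet) and the blank-line chunking rule are illustrative, and PDF_ROOT / PROCESSED_DIR come from the configuration sketch above:

```python
import pandas as pd
import pdfplumber

doc_rows, chunk_rows = [], []

for pdf_path in sorted(PDF_ROOT.rglob("*.pdf")):
    doc_id = pdf_path.stem  # file name doubles as doc_id here
    doc_rows.append({
        "doc_id": doc_id,
        "category": pdf_path.parent.name,  # category from directory structure
        "path": str(pdf_path),
    })
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            # Naive paragraph split on blank lines; real chunking may differ.
            paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
            for idx, chunk in enumerate(paragraphs):
                chunk_rows.append({
                    "chunk_id": f"{doc_id}-p{page_no}-c{idx}",
                    "doc_id": doc_id,
                    "page": page_no,
                    "chunk_index": idx,
                    "text": chunk,
                })

pd.DataFrame(doc_rows).to_parquet(PROCESSED_DIR / "documents.parquet")
pd.DataFrame(chunk_rows).to_parquet(PROCESSED_DIR / "chunks.parquet")
```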
The second notebook, 02_nlp_locations.ipynb, performs a lightweight NLP pass to identify:
- General location entities (e.g., counties, towns, regions)
- Theme tags using rule-based keyword matching
It produces a table with one row per detected location mention, containing:
- the raw entity text
- its source chunk
- associated theme tags
- a short evidence snippet
Note: the initial model reliably extracts general place names, but does not guarantee detailed street-address extraction, due to PDF/OCR variability or missing detail in the source PDFs.
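A minimal sketch of the NLP pass, assuming spaCy's small English model (en_core_web_sm). The theme keywords and entity labels shown here are illustrative, not the project's actual rules:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical rule-based theme keywords.
THEME_KEYWORDS = {
    "flooding": {"flood", "floodplain", "stormwater"},
    "housing": {"housing", "residential", "dwelling"},
}

def extract_mentions(chunk_id: str, text: str) -> list[dict]:
    """Return one record per place-like entity found in a text chunk."""
    doc = nlp(text)
    lowered = text.lower()
    themes = [t for t, words in THEME_KEYWORDS.items()
              if any(w in lowered for w in words)]
    mentions = []
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC", "FAC"}:  # place-like entity labels
            start = max(ent.start_char - 40, 0)
            mentions.append({
                "chunk_id": chunk_id,
                "entity": ent.text,
                "themes": themes,
                "evidence": text[start:ent.end_char + 40],  # short snippet
            })
    return mentions
```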
The third notebook, 03_geocode_points.ipynb, geocodes the extracted locations and produces geospatial datasets. It:
- canonicalizes raw entity names
- filters out non-locations and OCR noise
- geocodes using Nominatim, with retries and caching
- builds geospatial point features linked back to text evidence
It writes:
- geo_points.geoparquet — canonical name, coordinates, theme type, evidence
- geo_points.geojson — for interactive maps
- location_cache.parquet — persistent geocode cache
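A minimal sketch of the geocoding loop, assuming geopy's Nominatim client with its RateLimiter helper. The cache layout and example names are illustrative (PROCESSED_DIR as configured above):

```python
import geopandas as gpd
import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
from shapely.geometry import Point

# Respect Nominatim's usage policy: identify the client and limit
# requests to about one per second, retrying transient failures.
geocoder = Nominatim(user_agent="pdf-geo-pipeline")
geocode = RateLimiter(geocoder.geocode, min_delay_seconds=1, max_retries=3)

cache_path = PROCESSED_DIR / "location_cache.parquet"
cache = (
    pd.read_parquet(cache_path).set_index("name").to_dict("index")
    if cache_path.exists()
    else {}
)

def lookup(name: str):
    """Return {'lat': ..., 'lon': ...} for a canonical name, cache first."""
    if name not in cache:
        loc = geocode(name)
        cache[name] = {"lat": loc.latitude, "lon": loc.longitude} if loc else None
    return cache[name]

names = ["Example County", "Example Town"]  # canonicalized, filtered names
rows = [{"name": n, **hit} for n in names if (hit := lookup(n))]
gdf = gpd.GeoDataFrame(
    rows,
    geometry=[Point(r["lon"], r["lat"]) for r in rows],
    crs="EPSG:4326",
)
gdf.to_parquet(PROCESSED_DIR / "geo_points.geoparquet")
gdf.to_file(PROCESSED_DIR / "geo_points.geojson", driver="GeoJSON")

# Persist successful lookups so future runs skip the network entirely.
hits = {k: v for k, v in cache.items() if v}
cache_df = pd.DataFrame.from_dict(hits, orient="index").rename_axis("name")
cache_df.reset_index().to_parquet(cache_path)
```

Caching by canonical name keeps repeat runs fast and well within Nominatim's usage limits.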
The pipeline delivers:
- Extracted text structured for analysis
- NLP-derived location mentions
- Rule-based thematic classification
- Canonicalized, geocoded points
- Machine-readable Parquet and GeoParquet layers
- A modular, extensible workflow
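As a quick consumption sketch, the GeoParquet layer loads straight into GeoPandas (read_parquet requires pyarrow):

```python
import geopandas as gpd

# Load the GeoParquet layer and inspect it, e.g. before mapping.
points = gpd.read_parquet(PROCESSED_DIR / "geo_points.geoparquet")
print(points.head())
print(points.crs)
```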
Possible extensions include:
- Address extraction (regex or ML)
- Improved canonicalization rules
- RAG search (embedding + geospatial joins)
- PMTiles/vector tile exports
- Multi-modal (PDF + tabular + GIS) search
This project shows how to turn unstructured PDF documents into structured location-aware datasets.
It is designed to be simple, transparent, and extensible to any workflow involving documents that contain location references.
Current extraction reliably finds town- and county-level entities but not detailed street addresses; this may simply reflect the level of detail present in the source text.