A reproducible workflow for extracting location references and thematic signals from unstructured PDF documents.
This project and this README were generated with the help of ChatGPT 5.1.
This project demonstrates a lightweight pipeline that converts unstructured PDF documents into structured geospatial datasets, using:
- Python
- Jupyter + Docker
- spaCy NLP
- GeoPandas / Shapely
- Parquet / GeoParquet
- OSM-based geocoding with caching
The goal is to turn narrative text in PDF reports into geospatially meaningful information.
```
project/
  notebooks/
    01_extract_pdfs.ipynb
    02_nlp_locations.ipynb
    03_geocode_points.ipynb
  <data-folder>/
    processed/
    data/interim/
  .env
  Dockerfile
  docker-compose.yml
```
The <data-folder> contains raw PDFs and derived outputs.
It is gitignored to keep documents and processed data local.
All folder paths and category names are defined in .env, for example:
```
PDF_FOLDER=my-data-folder
PDF_CATEGORIES=Cat1,Cat2,Cat3,Cat4,Cat5
BAD_LOCATION_TOKENS=example1,example2,...
PDF_ROOT=/workspace/${PDF_FOLDER}
PROCESSED_DIR=/workspace/${PDF_FOLDER}/processed
INTERIM_DIR=/workspace/${PDF_FOLDER}/data/interim
```
This prevents hard-coding and makes the pipeline portable.
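A minimal sketch of how the notebooks can pick these settings up, assuming python-dotenv is installed (the variable names match the `.env` example above):

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # pip install python-dotenv

# Load .env into the environment; ${PDF_FOLDER} references are expanded.
load_dotenv()

PDF_ROOT = Path(os.environ["PDF_ROOT"])
PROCESSED_DIR = Path(os.environ["PROCESSED_DIR"])
INTERIM_DIR = Path(os.environ["INTERIM_DIR"])

# Comma-separated settings become Python collections.
PDF_CATEGORIES = os.environ["PDF_CATEGORIES"].split(",")
BAD_LOCATION_TOKENS = set(os.environ["BAD_LOCATION_TOKENS"].split(","))
```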
The first notebook, 01_extract_pdfs.ipynb, extracts text from the PDFs and writes two structured outputs.

A document-level table with one row per PDF, containing:
- doc_id
- file name & title
- category (from directory structure)
- file path
- optional metadata fields
A chunk-level table with one row per paragraph-like text chunk, containing:
- chunk_id
- doc_id
- page number
- chunk index
- the extracted text
These files form the foundation for all downstream processing.
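A minimal sketch of this step, assuming pdfplumber for extraction and pandas for the Parquet outputs. The output file names (documents.parquet, chunks.parquet) and the blank-line chunking rule are illustrative, and PDF_ROOT / PROCESSED_DIR come from the configuration sketch above:

```python
import pandas as pd
import pdfplumber

doc_rows, chunk_rows = [], []

for pdf_path in sorted(PDF_ROOT.rglob("*.pdf")):
    doc_id = pdf_path.stem  # file name doubles as doc_id here
    doc_rows.append({
        "doc_id": doc_id,
        "category": pdf_path.parent.name,  # category from directory structure
        "path": str(pdf_path),
    })
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            # Naive paragraph split on blank lines; real chunking may differ.
            paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
            for idx, chunk in enumerate(paragraphs):
                chunk_rows.append({
                    "chunk_id": f"{doc_id}-p{page_no}-c{idx}",
                    "doc_id": doc_id,
                    "page": page_no,
                    "chunk_index": idx,
                    "text": chunk,
                })

pd.DataFrame(doc_rows).to_parquet(PROCESSED_DIR / "documents.parquet")
pd.DataFrame(chunk_rows).to_parquet(PROCESSED_DIR / "chunks.parquet")
```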
The second notebook, 02_nlp_locations.ipynb, performs a lightweight NLP pass to identify:
- General location entities (e.g., counties, towns, regions)
- Theme tags using rule-based keyword matching
It produces a table with one row per detected location mention, containing:
- the raw entity text
- its source chunk
- associated theme tags
- a short evidence snippet
Note: the initial model reliably extracts general place names, but does not guarantee detailed street-address extraction, due to PDF/OCR variability or missing detail in the source PDFs.
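A minimal sketch of the NLP pass, assuming spaCy's small English model (en_core_web_sm). The theme keywords and entity labels shown here are illustrative, not the project's actual rules:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical rule-based theme keywords.
THEME_KEYWORDS = {
    "flooding": {"flood", "floodplain", "stormwater"},
    "housing": {"housing", "residential", "dwelling"},
}

def extract_mentions(chunk_id: str, text: str) -> list[dict]:
    """Return one record per place-like entity found in a text chunk."""
    doc = nlp(text)
    lowered = text.lower()
    themes = [t for t, words in THEME_KEYWORDS.items()
              if any(w in lowered for w in words)]
    mentions = []
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC", "FAC"}:  # place-like entity labels
            start = max(ent.start_char - 40, 0)
            mentions.append({
                "chunk_id": chunk_id,
                "entity": ent.text,
                "themes": themes,
                "evidence": text[start:ent.end_char + 40],  # short snippet
            })
    return mentions
```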
The third notebook, 03_geocode_points.ipynb, geocodes the extracted locations and produces geospatial datasets. It:
- canonicalizes raw entity names
- filters out non-locations and OCR noise
- geocodes using Nominatim, with retries and caching
- builds geospatial point features linked back to text evidence
It writes:
- geo_points.geoparquet — canonical name, coordinates, theme type, evidence
- geo_points.geojson — for interactive maps
- location_cache.parquet — persistent geocode cache
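A minimal sketch of the geocoding loop, assuming geopy's Nominatim client with its RateLimiter helper. The cache layout and example names are illustrative (PROCESSED_DIR as configured above):

```python
import geopandas as gpd
import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
from shapely.geometry import Point

# Respect Nominatim's usage policy: identify the client and limit
# requests to about one per second, retrying transient failures.
geocoder = Nominatim(user_agent="pdf-geo-pipeline")
geocode = RateLimiter(geocoder.geocode, min_delay_seconds=1, max_retries=3)

cache_path = PROCESSED_DIR / "location_cache.parquet"
cache = (
    pd.read_parquet(cache_path).set_index("name").to_dict("index")
    if cache_path.exists()
    else {}
)

def lookup(name: str):
    """Return {'lat': ..., 'lon': ...} for a canonical name, cache first."""
    if name not in cache:
        loc = geocode(name)
        cache[name] = {"lat": loc.latitude, "lon": loc.longitude} if loc else None
    return cache[name]

names = ["Example County", "Example Town"]  # canonicalized, filtered names
rows = [{"name": n, **hit} for n in names if (hit := lookup(n))]
gdf = gpd.GeoDataFrame(
    rows,
    geometry=[Point(r["lon"], r["lat"]) for r in rows],
    crs="EPSG:4326",
)
gdf.to_parquet(PROCESSED_DIR / "geo_points.geoparquet")
gdf.to_file(PROCESSED_DIR / "geo_points.geojson", driver="GeoJSON")

# Persist successful lookups so future runs skip the network entirely.
hits = {k: v for k, v in cache.items() if v}
cache_df = pd.DataFrame.from_dict(hits, orient="index").rename_axis("name")
cache_df.reset_index().to_parquet(cache_path)
```

Caching by canonical name keeps repeat runs fast and well within Nominatim's usage limits.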
The pipeline delivers:
- Extracted text structured for analysis
- NLP-derived location mentions
- Rule-based thematic classification
- Canonicalized, geocoded points
- Machine-readable Parquet and GeoParquet layers
- A modular, extensible workflow
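As a quick consumption sketch, the GeoParquet layer loads straight into GeoPandas (read_parquet requires pyarrow):

```python
import geopandas as gpd

# Load the GeoParquet layer and inspect it, e.g. before mapping.
points = gpd.read_parquet(PROCESSED_DIR / "geo_points.geoparquet")
print(points.head())
print(points.crs)
```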
Possible extensions include:
- Address extraction (regex or ML)
- Improved canonicalization rules
- RAG search (embedding + geospatial joins)
- PMTiles/vector tile exports
- Multi-modal (PDF + tabular + GIS) search
This project shows how to turn unstructured PDF documents into structured location-aware datasets.
It is designed to be simple, transparent, and extensible to any workflow involving documents that contain location references.
Current extraction reliably finds town- and county-level entities but not detailed street addresses; this may simply reflect the level of detail present in the source text.