jmlane8/pdf_extract_geolocation

PDF-to-Geospatial Pipeline

A reproducible workflow for extracting location references and thematic signals from unstructured PDF documents.

This project and its README.md were generated with the help of ChatGPT 5.1.

This project demonstrates a lightweight pipeline that converts unstructured PDF documents into structured geospatial datasets, using:

  • Python
  • Jupyter + Docker
  • spaCy NLP
  • GeoPandas / Shapely
  • Parquet / GeoParquet
  • OSM-based geocoding with caching

The goal is to turn narrative text in PDF reports into geospatially meaningful information.

Project Structure

project/
  notebooks/
    01_extract_pdfs.ipynb
    02_nlp_locations.ipynb
    03_geocode_points.ipynb
  <data-folder>/
    processed/
    data/interim/
  .env
  Dockerfile
  docker-compose.yml

The <data-folder> contains raw PDFs and derived outputs.
It is gitignored to keep documents and processed data local.

Environment Variables

All folder paths and category names are defined in .env, for example:

PDF_FOLDER=my-data-folder
PDF_CATEGORIES=Cat1,Cat2,Cat3,Cat4,Cat5

BAD_LOCATION_TOKENS=example1,example2,...

PDF_ROOT=/workspace/${PDF_FOLDER}
PROCESSED_DIR=/workspace/${PDF_FOLDER}/processed
INTERIM_DIR=/workspace/${PDF_FOLDER}/data/interim

Keeping these values in .env avoids hard-coded paths in the notebooks and makes the pipeline portable across machines.
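As an illustration only, a notebook might read these variables along the following lines. The helper name `load_settings` and the fallback defaults are assumptions made for this sketch, not the project's actual code; only the variable names come from the example above.

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Read pipeline settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    folder = env.get("PDF_FOLDER", "my-data-folder")
    # Fall back to the layout shown in the .env example if a path is not set.
    root = Path(env.get("PDF_ROOT", f"/workspace/{folder}"))

    def split_csv(name):
        # Comma-separated values; ignore blanks and surrounding whitespace.
        return [item.strip() for item in env.get(name, "").split(",") if item.strip()]

    return {
        "pdf_root": root,
        "processed_dir": Path(env.get("PROCESSED_DIR", root / "processed")),
        "interim_dir": Path(env.get("INTERIM_DIR", root / "data" / "interim")),
        "categories": split_csv("PDF_CATEGORIES"),
        # Lower-cased so later filtering can compare case-insensitively.
        "bad_location_tokens": {t.lower() for t in split_csv("BAD_LOCATION_TOKENS")},
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise without touching the real environment.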

Notebook Overview

01_extract_pdfs.ipynb — PDF → Parquet

This notebook extracts text from PDFs and writes two structured outputs:

documents.parquet

One row per PDF containing:

  • doc_id
  • file name & title
  • category (from directory structure)
  • file path
  • optional metadata fields

chunks.parquet

One row per text chunk (paragraph-like).
Includes:

  • chunk_id
  • doc_id
  • page number
  • chunk index
  • the extracted text

These files form the foundation for all downstream processing.
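The chunking step could look roughly like the sketch below. It assumes the page text has already been extracted by some PDF library (the README does not say which one), and the ID scheme, the `min_chars` threshold, and the function names are hypothetical choices made for illustration.

```python
import hashlib
import re


def make_doc_id(relative_path):
    """Stable ID derived from the file's relative path (assumed scheme)."""
    return hashlib.sha1(relative_path.encode("utf-8")).hexdigest()[:12]


def chunk_page(doc_id, page_number, page_text, min_chars=20):
    """Split one page of extracted text into paragraph-like chunk rows."""
    rows = []
    # Treat one or more blank lines as a paragraph boundary.
    for paragraph in re.split(r"\n\s*\n", page_text):
        text = " ".join(paragraph.split())  # collapse line wraps and extra spaces
        if len(text) < min_chars:
            continue  # skip page numbers, running headers, and similar fragments
        chunk_index = len(rows)  # index is per page in this sketch
        rows.append({
            "chunk_id": f"{doc_id}-p{page_number}-c{chunk_index}",
            "doc_id": doc_id,
            "page": page_number,
            "chunk_index": chunk_index,
            "text": text,
        })
    return rows
```

Rows collected across all pages could then be written with something like `pandas.DataFrame(rows).to_parquet("chunks.parquet")`, assuming pandas and a Parquet engine such as pyarrow are installed.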

02_nlp_locations.ipynb — Text → Location Mentions + Tags

This notebook performs a lightweight NLP pass to identify:

  • General location entities (e.g., counties, towns, regions)
  • Theme tags using rule-based keyword matching

It produces:

mentions.parquet

One row per detected location mention, containing:

  • the raw entity text
  • its source chunk
  • associated theme tags
  • a short evidence snippet

Note:
The initial model reliably extracts general place names but does not guarantee extraction of detailed street addresses, either because of PDF/OCR variability or because the source PDFs lack that level of detail.
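A minimal sketch of the rule-based tagging and the mention rows follows. The theme rules, field names, and function names are hypothetical, since the real rule set is not shown here. Entities are passed in as `(text, start_char, end_char)` tuples, which in a spaCy workflow could come from `doc.ents` filtered to place-like labels such as GPE or LOC.

```python
import re

# Hypothetical theme rules (tag -> keywords); the project's real rules may differ.
THEME_RULES = {
    "flooding": ["flood", "stormwater"],
    "housing": ["housing", "zoning"],
}


def tag_themes(text, rules=THEME_RULES):
    """Return sorted tags whose keywords occur at the start of a word in text."""
    tags = []
    for tag, keywords in rules.items():
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.append(tag)
    return sorted(tags)


def build_mentions(chunk_id, text, entities, window=40):
    """Build one mention row per (entity_text, start_char, end_char) tuple."""
    tags = tag_themes(text)  # tags are computed once per chunk in this sketch
    rows = []
    for entity_text, start, end in entities:
        # Keep some surrounding context as evidence for later review.
        snippet = text[max(0, start - window): end + window].strip()
        rows.append({
            "chunk_id": chunk_id,
            "entity_text": entity_text,
            "theme_tags": tags,
            "evidence": snippet,
        })
    return rows
```

Matching on word starts lets "flood" also catch "flooding" and "floods" while ignoring words that merely contain the keyword, such as "unflooded".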

03_geocode_points.ipynb — Mentions → Mappable Points

This notebook geocodes extracted locations and produces geospatial datasets:

Steps:

  • canonicalize raw entity names
  • filter out non-locations and OCR noise
  • geocode using Nominatim with retries and caching
  • build geospatial point features linked back to text evidence

Outputs:

  • geo_points.geoparquet — canonical name, coordinates, theme type, evidence
  • geo_points.geojson — for interactive maps
  • location_cache.parquet — persistent geocode cache
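The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: every function name is hypothetical, the cache is a plain dict instead of a Parquet file, and `geocode_fn` stands in for whatever Nominatim client is used (it is assumed to return a `(lat, lon)` tuple or `None`).

```python
import string
import time


def canonicalize(raw):
    """Strip edge punctuation, collapse whitespace, and capitalize each word."""
    return string.capwords(raw.strip(" \t\n.,;:()[]\"'"))


def is_plausible_location(name, bad_tokens):
    """Reject very short names, names without letters, and block-listed names."""
    if len(name) < 3 or not any(ch.isalpha() for ch in name):
        return False
    return name.lower() not in bad_tokens


def geocode_cached(name, cache, geocode_fn, retries=3, delay=1.0):
    """Return a cached result if present, else call geocode_fn with retries."""
    if name in cache:
        return cache[name]
    for attempt in range(retries):
        try:
            result = geocode_fn(name)
        except Exception:
            time.sleep(delay * (attempt + 1))  # linear backoff before retrying
            continue
        cache[name] = result  # cache hits and genuine misses (None) alike
        return result
    return None  # every attempt failed; left uncached so a later run can retry


def to_feature(name, latlon, theme, evidence):
    """Build a GeoJSON Point feature; GeoJSON coordinate order is [lon, lat]."""
    lat, lon = latlon
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name, "theme": theme, "evidence": evidence},
    }
```

Caching by canonical name means each distinct place is requested at most once, which matters because the public Nominatim service asks clients to stay at or below one request per second. Failed lookups are deliberately not cached, so a temporary outage does not turn into a permanent miss.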

What the Pipeline Delivers

  • Extracted text structured for analysis
  • NLP-derived location mentions
  • Rule-based thematic classification
  • Canonicalized, geocoded points
  • Machine-readable Parquet and GeoParquet layers
  • A modular, extensible workflow

Possible Future Enhancements

  • Address extraction (regex or ML)
  • Improved canonicalization rules
  • RAG search (embedding + geospatial joins)
  • PMTiles/vector tile exports
  • Multi-modal (PDF + tabular + GIS) search

Purpose

This project shows how to turn unstructured PDF documents into structured, location-aware datasets.
It is designed to be simple, transparent, and extensible, and it can be adapted to any collection of documents that contains location references.

Project limitation

The current extraction reliably finds town- and county-level entities but not detailed street addresses. This may simply reflect the level of detail present in the source text.

About

Extract text from PDFs and derive geospatial information
