A comprehensive toolkit for scraping and analyzing NUCC (National Uniform Claim Committee) taxonomy data from the official taxonomy website.
The web version of the NUCC taxonomy code list contains slightly different data than the CSV download version. First, it has explicit parent linkage on a per-taxonomy basis. Second, it includes rows for groupings that are not directly in the CSV (the title of a grouping can be inferred, but its description cannot).
As a result, this project both slurps the web version and processes the CSV version to create a merged picture of the NUCC code set. It also creates a path CSV file for fast numerical querying, and a CSV file that documents the various "sources" found in the notes sections.
This project provides a complete pipeline for extracting, processing, and analyzing NUCC taxonomy codes and their hierarchical relationships. The NUCC taxonomy is used to classify healthcare provider types and specialties in the United States.
The scripts should be executed in the following order:
Purpose: Scrapes the main NUCC taxonomy website to extract hierarchical relationships between codes.
What it does:
- Fetches HTML from https://taxonomy.nucc.org/
- Parses the JavaScript treenodes data structure
- Extracts all ancestor-child relationships in the taxonomy hierarchy
- Creates self-referencing relationships (each code is its own ancestor)
Output: `data/nucc_parent_code.csv` with columns:
- `ancestor_nucc_code_id`: The ancestor code ID
- `child_nucc_code_id`: The child code ID
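The ancestor extraction can be sketched as follows. Note that the `tree` mapping here is a hypothetical stand-in for whatever the script parses out of the site's `treenodes` JavaScript; the real structure on https://taxonomy.nucc.org/ may differ.

```python
import csv

# Hypothetical parent -> children mapping parsed from the treenodes data;
# the actual structure scraped from the site may differ.
tree = {
    "101Y00000X": ["101YA0400X", "101YM0800X"],
    "101YA0400X": [],
    "101YM0800X": [],
}

def ancestor_pairs(tree):
    """Yield (ancestor_id, child_id) rows, including each node as its own ancestor."""
    parent = {child: p for p, kids in tree.items() for child in kids}
    nodes = set(tree) | set(parent)
    for node in nodes:
        yield (node, node)  # self-referencing row
        cur = node
        while cur in parent:  # walk up to the root, emitting every ancestor
            cur = parent[cur]
            yield (cur, node)

rows = sorted(ancestor_pairs(tree))
with open("nucc_parent_code.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ancestor_nucc_code_id", "child_nucc_code_id"])
    writer.writerows(rows)
```

Walking all the way to the root (rather than recording only the direct parent) is what makes the output useful for "all ancestors of X" queries without recursive joins.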
Usage:
```
python3 scrape_nucc_ancestors.py
```

Purpose: Scrapes detailed information for each individual taxonomy code from the NUCC API.
What it does:
- Reads all unique node IDs from `data/nucc_parent_code.csv`
- Downloads detailed information for each node from the NUCC API
- Parses HTML content to extract structured data (name, definition, notes, etc.)
- Caches HTML snippets in `data/tables/` for analysis
- Uses intelligent caching to avoid re-downloading recently fetched data
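A minimal version of the cache check might look like this; the 24-hour freshness window and the `node_*.html` file layout are assumptions for illustration, not the script's actual parameters:

```python
import os
import time

CACHE_DIR = "data/tables"
MAX_AGE_SECONDS = 24 * 60 * 60  # assumed freshness window

def _cache_path(node_id: str) -> str:
    return os.path.join(CACHE_DIR, f"node_{node_id}.html")

def is_fresh(node_id: str) -> bool:
    """True if a cached snippet exists and is recent enough to skip re-downloading."""
    path = _cache_path(node_id)
    return os.path.exists(path) and (time.time() - os.path.getmtime(path)) < MAX_AGE_SECONDS

def fetch_node(node_id: str, download) -> str:
    """Return cached HTML when fresh; otherwise call `download` and store the result."""
    path = _cache_path(node_id)
    if is_fresh(node_id):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = download(node_id)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Keying the cache on file modification time means re-running the scraper is cheap: only stale or missing nodes hit the network.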
Output:
- `data/nucc_codes.csv` with detailed code information
- `data/tables/node_*.html` files containing raw HTML snippets
Usage:
```
python3 scrape_nucc_nodes.py
```

Purpose: Extracts and structures source information from the notes column of the NUCC codes.
What it does:
- Parses the `code_notes` column from `data/nucc_codes.csv`
- Extracts source citations that follow the pattern "Source: text [date: note]"
- Automatically extracts URLs from source text
- Handles multiple sources per code
- Creates normalized source records
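The "Source: text [date: note]" pattern can be matched with a regular expression along these lines; the script's actual regex isn't shown here, so treat this as an illustrative approximation:

```python
import re

# Approximate pattern for "Source: text [date: note]" citations; the real
# script may delimit multiple sources differently.
SOURCE_RE = re.compile(r"Source:\s*(?P<text>.*?)\s*\[(?P<date>[^:\]]*):\s*(?P<note>[^\]]*)\]")
URL_RE = re.compile(r"https?://\S+")

def parse_sources(notes: str):
    """Extract one normalized record per source citation found in a code's notes."""
    records = []
    for m in SOURCE_RE.finditer(notes):
        records.append({
            "full_source_text": m.group("text"),
            "source_date": m.group("date").strip(),
            "source_date_note": m.group("note").strip(),
            "extracted_urls": URL_RE.findall(m.group("text")),
        })
    return records
```

Because `finditer` walks the whole notes string, a code with several "Source:" citations naturally yields several records, matching the one-row-per-source layout of `data/nucc_sources.csv`.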
Output: `data/nucc_sources.csv` with columns:
- `nucc_code_id`: The NUCC code ID
- `full_source_text`: Complete source text
- `source_date`: Date from the source citation
- `source_date_note`: Note from the source citation
- `extracted_urls`: URLs found in the source text
Usage:
```
python3 parse_nucc_sources.py
```

Purpose: Compares the scraped data with the official NUCC taxonomy CSV file to identify differences.
What it does:
- Loads both the scraped data and an official NUCC taxonomy CSV
- Performs outer join on taxonomy codes
- Identifies codes that exist in only one dataset
- Creates a merged dataset with all available information
- Generates summary statistics and reports
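The comparison boils down to an outer join on the taxonomy code. A sketch with pandas, where the `code` column name and the toy data are illustrative assumptions:

```python
import pandas as pd

# Small stand-ins for the two datasets; real columns come from the CSVs.
scraped = pd.DataFrame({
    "code": ["101Y00000X", "207Q00000X"],
    "scraped_name": ["Counselor", "Family Medicine"],
})
official = pd.DataFrame({
    "code": ["207Q00000X", "208D00000X"],
    "official_name": ["Family Medicine", "General Practice"],
})

# indicator=True adds a _merge column recording which side each code came from
merged = scraped.merge(official, on="code", how="outer", indicator=True)

scraped_only = merged.loc[merged["_merge"] == "left_only", "code"].tolist()
official_only = merged.loc[merged["_merge"] == "right_only", "code"].tolist()
```

The `indicator=True` flag is what makes the "codes that exist in only one dataset" report a one-line filter rather than a manual set comparison.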
Output:
- `data/merged_nucc_data.csv`: Combined dataset from both sources
- `data/nucc_comparison_summary.txt`: Summary report of differences
Usage:
```
python3 compare_nucc_data.py --download_csv /path/to/official/nucc_taxonomy.csv --scrapped_csv ./data/nucc_codes.csv
```

Output files:
- `data/nucc_parent_code.csv`: Hierarchical relationships between codes
- `data/nucc_codes.csv`: Detailed information for each taxonomy code
- `data/nucc_sources.csv`: Structured source information
- `data/merged_nucc_data.csv`: Comparison between scraped and official data
- `data/nucc_comparison_summary.txt`: Summary of data comparison
- `data/tables/`: Directory containing raw HTML snippets for each code
Install the required Python packages:
```
pip install -r requirements.txt
```

Features:
- Intelligent Caching: Avoids re-downloading recently fetched data
- Robust Error Handling: Gracefully handles network issues and parsing errors
- Future-Proof Parsing: Automatically detects and includes new data fields
- URL Extraction: Automatically extracts and normalizes URLs from source text
- Data Validation: Includes data cleaning and validation steps