A comprehensive toolkit for scraping and analyzing NUCC (National Uniform Claim Committee) taxonomy data from the official taxonomy website.
The web version of the NUCC taxonomy code list contains slightly different data than the CSV download version. First, it has explicit parent linkage on a per-taxonomy basis. Second, it includes rows for groupings that are not directly in the CSV (the title of a grouping can be inferred, but its description cannot).
As a result, this project both slurps the web version and processes the CSV version to create a merged picture of the NUCC code set. It also creates a path CSV file for fast numerical querying, and a CSV file that documents the various "sources" found in the notes sections.
This project provides a complete pipeline for extracting, processing, and analyzing NUCC taxonomy codes and their hierarchical relationships. The NUCC taxonomy is used to classify healthcare provider types and specialties in the United States.
The scripts should be executed in the following order:
Purpose: Scrapes the main NUCC taxonomy website to extract hierarchical relationships between codes.
What it does:
- Fetches HTML from https://taxonomy.nucc.org/
- Parses the JavaScript treenodes data structure
- Extracts all ancestor-child relationships in the taxonomy hierarchy
- Creates self-referencing relationships (each code is its own ancestor)
Output: `data/nucc_parent_code.csv` with columns:
- `ancestor_nucc_code_id`: The ancestor code ID
- `child_nucc_code_id`: The child code ID
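The ancestor extraction can be sketched as follows. Note that the `tree` mapping here is a hypothetical stand-in for whatever the script parses out of the site's `treenodes` JavaScript; the real structure on https://taxonomy.nucc.org/ may differ.

```python
import csv

# Hypothetical parent -> children mapping parsed from the treenodes data;
# the actual structure scraped from the site may differ.
tree = {
    "101Y00000X": ["101YA0400X", "101YM0800X"],
    "101YA0400X": [],
    "101YM0800X": [],
}

def ancestor_pairs(tree):
    """Yield (ancestor_id, child_id) rows, including each node as its own ancestor."""
    parent = {child: p for p, kids in tree.items() for child in kids}
    nodes = set(tree) | set(parent)
    for node in nodes:
        yield (node, node)  # self-referencing row
        cur = node
        while cur in parent:  # walk up to the root, emitting every ancestor
            cur = parent[cur]
            yield (cur, node)

rows = sorted(ancestor_pairs(tree))
with open("nucc_parent_code.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ancestor_nucc_code_id", "child_nucc_code_id"])
    writer.writerows(rows)
```

Walking all the way to the root (rather than recording only the direct parent) is what makes the output useful for "all ancestors of X" queries without recursive joins.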
Usage:
```
python3 scrape_nucc_ancestors.py
```

Purpose: Scrapes detailed information for each individual taxonomy code from the NUCC API.
What it does:
- Reads all unique node IDs from `data/nucc_parent_code.csv`
- Downloads detailed information for each node from the NUCC API
- Parses HTML content to extract structured data (name, definition, notes, etc.)
- Caches HTML snippets in `data/tables/` for analysis
- Uses intelligent caching to avoid re-downloading recently fetched data
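A minimal version of the cache check might look like this; the 24-hour freshness window and the `node_*.html` file layout are assumptions for illustration, not the script's actual parameters:

```python
import os
import time

CACHE_DIR = "data/tables"
MAX_AGE_SECONDS = 24 * 60 * 60  # assumed freshness window

def _cache_path(node_id: str) -> str:
    return os.path.join(CACHE_DIR, f"node_{node_id}.html")

def is_fresh(node_id: str) -> bool:
    """True if a cached snippet exists and is recent enough to skip re-downloading."""
    path = _cache_path(node_id)
    return os.path.exists(path) and (time.time() - os.path.getmtime(path)) < MAX_AGE_SECONDS

def fetch_node(node_id: str, download) -> str:
    """Return cached HTML when fresh; otherwise call `download` and store the result."""
    path = _cache_path(node_id)
    if is_fresh(node_id):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = download(node_id)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Keying the cache on file modification time means re-running the scraper is cheap: only stale or missing nodes hit the network.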
Output:
- `data/nucc_codes.csv` with detailed code information
- `data/tables/node_*.html` files containing raw HTML snippets
Usage:
```
python3 scrape_nucc_nodes.py
```

Purpose: Extracts and structures source information from the notes column of the NUCC codes.
What it does:
- Parses the `code_notes` column from `data/nucc_codes.csv`
- Extracts source citations that follow the pattern "Source: text [date: note]"
- Automatically extracts URLs from source text
- Handles multiple sources per code
- Creates normalized source records
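The "Source: text [date: note]" pattern can be matched with a regular expression along these lines; the script's actual regex isn't shown here, so treat this as an illustrative approximation:

```python
import re

# Approximate pattern for "Source: text [date: note]" citations; the real
# script may delimit multiple sources differently.
SOURCE_RE = re.compile(r"Source:\s*(?P<text>.*?)\s*\[(?P<date>[^:\]]*):\s*(?P<note>[^\]]*)\]")
URL_RE = re.compile(r"https?://\S+")

def parse_sources(notes: str):
    """Extract one normalized record per source citation found in a code's notes."""
    records = []
    for m in SOURCE_RE.finditer(notes):
        records.append({
            "full_source_text": m.group("text"),
            "source_date": m.group("date").strip(),
            "source_date_note": m.group("note").strip(),
            "extracted_urls": URL_RE.findall(m.group("text")),
        })
    return records
```

Because `finditer` walks the whole notes string, a code with several "Source:" citations naturally yields several records, matching the one-row-per-source layout of `data/nucc_sources.csv`.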
Output: `data/nucc_sources.csv` with columns:
- `nucc_code_id`: The NUCC code ID
- `full_source_text`: Complete source text
- `source_date`: Date from the source citation
- `source_date_note`: Note from the source citation
- `extracted_urls`: URLs found in the source text
Usage:
```
python3 parse_nucc_sources.py
```

Purpose: Compares the scraped data with the official NUCC taxonomy CSV file to identify differences.
What it does:
- Loads both the scraped data and an official NUCC taxonomy CSV
- Performs outer join on taxonomy codes
- Identifies codes that exist in only one dataset
- Creates a merged dataset with all available information
- Generates summary statistics and reports
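The comparison boils down to an outer join on the taxonomy code. A sketch with pandas, where the `code` column name and the toy data are illustrative assumptions:

```python
import pandas as pd

# Small stand-ins for the two datasets; real columns come from the CSVs.
scraped = pd.DataFrame({
    "code": ["101Y00000X", "207Q00000X"],
    "scraped_name": ["Counselor", "Family Medicine"],
})
official = pd.DataFrame({
    "code": ["207Q00000X", "208D00000X"],
    "official_name": ["Family Medicine", "General Practice"],
})

# indicator=True adds a _merge column recording which side each code came from
merged = scraped.merge(official, on="code", how="outer", indicator=True)

scraped_only = merged.loc[merged["_merge"] == "left_only", "code"].tolist()
official_only = merged.loc[merged["_merge"] == "right_only", "code"].tolist()
```

The `indicator=True` flag is what makes the "codes that exist in only one dataset" report a one-line filter rather than a manual set comparison.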
Output:
- `data/merged_nucc_data.csv`: Combined dataset from both sources
- `data/nucc_comparison_summary.txt`: Summary report of differences
Usage:
```
python3 compare_nucc_data.py --download_csv /path/to/official/nucc_taxonomy.csv --scrapped_csv ./data/nucc_codes.csv
```

Output files:
- `data/nucc_parent_code.csv`: Hierarchical relationships between codes
- `data/nucc_codes.csv`: Detailed information for each taxonomy code
- `data/nucc_sources.csv`: Structured source information
- `data/merged_nucc_data.csv`: Comparison between scraped and official data
- `data/nucc_comparison_summary.txt`: Summary of data comparison
- `data/tables/`: Directory containing raw HTML snippets for each code
Install the required Python packages:
```
pip install -r requirements.txt
```

Features:
- Intelligent Caching: Avoids re-downloading recently fetched data
- Robust Error Handling: Gracefully handles network issues and parsing errors
- Future-Proof Parsing: Automatically detects and includes new data fields
- URL Extraction: Automatically extracts and normalizes URLs from source text
- Data Validation: Includes data cleaning and validation steps