
# AI-CDS Disease Diagnosis System

Clinical Decision Support System for disease diagnosis prediction using patient symptom similarity.

## Overview

This project reproduces and extends the clinical decision support system from "AI-Driven Clinical Decision Support: Enhancing Disease Diagnosis Exploiting Patients Similarity" (Comito et al., 2022). We first reproduce the original BioSentVec baseline, then replace it with three biomedical BERT models as an original extension. Our score distribution analysis reveals that the BERT evaluation metric saturates at the paper's threshold, making the F1 scores unsuitable for model comparison — this is the key finding and an open problem for future work.

## Baseline Reproduction

The original paper uses BioSentVec (700-dimensional sent2vec embeddings trained on PubMed + MIMIC-III) to compute symptom-level pairwise cosine similarities between patients. Diagnosis similarity is determined by taking the MAX similarity across the Cartesian product of ground-truth and predicted diagnosis descriptions, then applying a threshold to classify true/false positives.
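The evaluation step above can be sketched in a few lines. This is a minimal illustration, assuming diagnosis descriptions have already been embedded as NumPy vectors; the function names are ours, not the repo's API:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def diagnosis_similarity(true_embs, pred_embs):
    # MAX over the Cartesian product of ground-truth and predicted
    # diagnosis embeddings, as the paper describes.
    return max(cosine(t, p) for t in true_embs for p in pred_embs)

def is_true_positive(true_embs, pred_embs, threshold=0.6):
    # A prediction counts as a true positive when the MAX similarity
    # clears the threshold.
    return diagnosis_similarity(true_embs, pred_embs) >= threshold
```

Note that a single highly similar pair is enough to clear the threshold, which is exactly what the score distribution analysis below exploits.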

Baseline results at threshold = 0.6:

| Method | F1 Score |
| ------ | -------- |
| TOP-10 | 0.489 |
| TOP-20 | 0.512 |
| TOP-30 | 0.521 |

```bash
python scripts/run_baseline.py
```

## BERT Extension (Original Contribution)

We replace BioSentVec with three biomedical BERT models that produce 768-dimensional embeddings:

| Model | HuggingFace Path | Training Data |
| ----- | ---------------- | ------------- |
| Bio_ClinicalBERT | `emilyalsentzer/Bio_ClinicalBERT` | MIMIC-III clinical notes |
| BiomedBERT | `microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract` | PubMed abstracts |
| BlueBERT | `bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12` | PubMed + MIMIC-III |
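All three models are standard BERT encoders, so a sentence embedding is typically obtained by mean-pooling the final hidden states over non-padding tokens. A minimal sketch of that pooling step in NumPy — in practice the `last_hidden_state` and `attention_mask` arrays would come from a HuggingFace `AutoModel`/`AutoTokenizer` pair loaded from one of the paths above; the array names follow the transformers convention, and this helper is illustrative, not the repo's code:

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Mean-pool token embeddings into one sentence vector per input.

    last_hidden_state: (batch, seq_len, 768) token embeddings
    attention_mask:    (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)   # (batch, 768)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid divide-by-zero
    return summed / counts                            # (batch, 768)
```

Masking before averaging matters: padded positions carry arbitrary hidden states, and including them would pull unrelated sentences toward a common point, worsening the compactness problem analyzed below.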

Results at threshold = 0.6 (baseline reference point):

| Method | BioSentVec (Baseline) | Bio_ClinicalBERT | BiomedBERT | BlueBERT |
| ------ | --------------------- | ---------------- | ---------- | -------- |
| TOP-10 | 0.489 | 1.000 | 1.000 | 1.000 |
| TOP-20 | 0.512 | 1.000 | 1.000 | 1.000 |
| TOP-30 | 0.521 | 1.000 | 1.000 | 1.000 |

All three BERT models achieve perfect F1 = 1.000 at threshold 0.6. BiomedBERT maintains perfect scores through threshold 0.9. However, these results are misleading — see the score distribution analysis below.
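The arithmetic behind the perfect scores is simple: when every predicted diagnosis clears the threshold against some ground-truth diagnosis, there are no false positives or false negatives, so precision = recall = F1 = 1. A toy illustration (the confusion counts here are invented for the example, not taken from the repo's outputs):

```python
def f1(tp, fp, fn):
    # Standard F1 from confusion counts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# If all MAX similarities exceed the threshold, every prediction is a TP:
print(f1(tp=100, fp=0, fn=0))   # 1.0 regardless of the model's quality

# A non-saturated metric, by contrast, mixes in FPs and FNs:
print(round(f1(tp=48, fp=50, fn=50), 3))
```

This is why a saturated threshold makes the column of 1.000s above uninformative: the metric stops depending on the predictions at all.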

```bash
python scripts/run_all_bert_models.py
```

See `docs/bert_model_comparison.md` for full results at all thresholds.

## Visual Summary (README Charts)

These plots are generated directly from the three experiment outputs in:

- `Prediction_Output_Bio_ClinicalBERT_15022026_11-33-48/PerformanceIndex.txt`
- `Prediction_Output_BiomedBERT_15022026_12-03-36/PerformanceIndex.txt`
- `Prediction_Output_BlueBERT_15022026_12-24-38/PerformanceIndex.txt`

- **F1 vs threshold (TOP-10):** At TOP-10, BiomedBERT stays saturated through 0.9 while BlueBERT drops earlier.
- **F1 vs threshold (TOP-50):** At TOP-50, all models improve at strict thresholds, but separation remains at 0.9/1.0.
- **F1 vs TOP-K at threshold 0.9:** Top-K expansion strongly helps Bio_ClinicalBERT and BlueBERT at threshold 0.9.
- **F1 vs TOP-K at threshold 1.0:** At exact-match threshold 1.0, model differences are modest and increase gradually with K.
- **Runtime breakdown:** 10-fold evaluation dominates runtime; startup overhead differs mostly by model loading time.
- **Saturation by threshold:** Per-patient MAX similarity saturation explains the perfect F1 at threshold 0.6.

Regenerate these charts:

```bash
python3 scripts/build_readme_plots.py
```

## Score Distribution Analysis (Key Finding)

The perfect F1 scores are an artifact of embedding space compactness combined with the MAX-over-Cartesian-product evaluation strategy, not genuine diagnostic accuracy.

Why the metric saturates:

1. **Compact embedding spaces** — Biomedical BERT models map diagnosis text into a narrow region. Even unrelated diagnoses have high cosine similarity:

   | Model | Mean Pairwise Sim | Min Pairwise Sim | Std |
   | ----- | ----------------- | ---------------- | --- |
   | BiomedBERT | 0.93 | 0.72 | 0.03 |
   | Bio_ClinicalBERT | 0.83 | 0.65 | 0.05 |
   | BlueBERT | 0.72 | 0.48 | 0.07 |

2. **MAX operator amplification** — Taking the maximum similarity across all diagnosis pairs inflates scores further. Per-patient MAX similarity exceeds 0.6 for virtually all patient pairs:

   | Model | % of patient pairs with MAX >= 0.6 |
   | ----- | ---------------------------------- |
   | Bio_ClinicalBERT | 100.00% |
   | BiomedBERT | 100.00% |
   | BlueBERT | 99.96% |

3. **Conclusion** — The evaluation metric is saturated at threshold 0.6 for BERT models. The F1 scores cannot discriminate between models or meaningfully compare against the baseline. Alternative evaluation strategies (MEAN instead of MAX, DRG code matching, higher thresholds) are needed.
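The statistics above can be reproduced in outline: given one diagnosis-similarity matrix per patient pair (rows = ground-truth diagnoses, columns = predictions), compare the MAX and MEAN aggregations and measure what fraction of pairs clears the threshold. A hedged sketch with a made-up input format — the real inputs live in the repo's prediction outputs, and this helper is illustrative, not the analysis script's API:

```python
import numpy as np

def saturation_report(sim_matrices, threshold=0.6):
    # sim_matrices: one (n_true x n_pred) cosine-similarity matrix per
    # patient pair, covering the Cartesian product of diagnoses.
    max_scores = np.array([m.max() for m in sim_matrices])
    mean_scores = np.array([m.mean() for m in sim_matrices])
    return {
        # Fraction of patient pairs classified "similar" under each rule.
        "pct_max_over_threshold": float((max_scores >= threshold).mean() * 100),
        "pct_mean_over_threshold": float((mean_scores >= threshold).mean() * 100),
    }
```

Since the MAX of a matrix is never below its MEAN, the MAX rule always saturates at least as fast, which is why switching the aggregation is one of the proposed fixes.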

Visualizations and full statistics are in `docs/score_distribution_analysis/`.

### Visualizations

- **Diagnosis score distributions:** score distributions across baseline and BERT models.
- **Per-patient maximum similarity distributions:** per-patient MAX similarity distributions showing saturation behavior.

```bash
python scripts/analyze_score_distributions.py
```

## Project Structure

```text
src/                     # Source code
  models/                # Baseline (sent2vec) and BERT implementations
  entity/                # Data classes (Admission, Symptom, Drgcodes)
  utils/                 # Utilities, constants, cython similarity
  evaluation/            # Evaluation modules
scripts/                 # Entry point scripts
  run_baseline.py        # Run BioSentVec baseline
  run_all_bert_models.py # Run all 3 BERT models sequentially
  analyze_score_distributions.py  # Score distribution analysis
data/                    # Data files
  folds/                 # 10-fold cross-validation splits
  raw/                   # Raw data files
  models/                # Pre-trained model files
docs/                    # Documentation and analysis reports
config/                  # Environment and requirements files
tests/                   # Test files
```

## Setup

Conda environment:

```bash
conda env create -f config/environment.yml
conda activate disease-diagnosis
```

Key dependencies: `sentence-transformers`, `torch`, `matplotlib`, `numpy`

**Baseline only:** additionally requires `sent2vec` and the pre-trained BioSentVec model (~21 GB download). See `docs/SETUP_GUIDE.md` for details.

## Citation

```bibtex
@article{comito2022ai,
  title={AI-Driven Clinical Decision Support: Enhancing Disease Diagnosis Exploiting Patients Similarity},
  author={Comito, Carmela and Falcone, Deborah and Forestiero, Agostino},
  journal={IEEE Access},
  volume={10},
  pages={6224--6234},
  year={2022},
  publisher={IEEE}
}
```
