Medical NER

This project performs Named Entity Recognition (NER) on medical forum posts to identify mentions of Drugs, Diseases, Symptoms, and Adverse Drug Reactions (ADRs). It also links the extracted ADRs to SNOMED-CT medical codes.

Dataset

The project uses CADEC (the CSIRO Adverse Drug Event Corpus). The expected directory layout is:

cadec/
├── text/ # Raw forum posts (.txt)
├── original/ # Ground truth annotations (.ann)
├── meddra/ # ADR annotations using MedDRA terminology
└── sct/ # Annotations linked to SNOMED-CT codes

Project Workflow

The project is divided into several tasks:

Task 1: Entity Enumeration

Script: task1_entity_enumeration.py
Purpose: Parse .ann files from cadec/original to count unique entities (ADR, Drug, Disease, Symptom).
Output: Summary statistics of entities in the dataset.
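A minimal sketch of this parsing step, assuming the standard brat .ann layout (tab-separated ID, label with character offsets, surface text); the directory path follows the structure above:

```python
# Count unique entity mentions per label across brat-style .ann files.
import os
from collections import defaultdict

ANN_DIR = "cadec/original"  # ground-truth annotations, per the layout above

unique_entities = defaultdict(set)

for fname in os.listdir(ANN_DIR):
    if not fname.endswith(".ann"):
        continue
    with open(os.path.join(ANN_DIR, fname), encoding="utf-8") as f:
        for line in f:
            if not line.startswith("T"):   # skip notes and relation lines
                continue
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue
            label = parts[1].split()[0]    # "ADR 87 94" -> "ADR"
            unique_entities[label].add(parts[2].lower())

for label, texts in sorted(unique_entities.items()):
    print(f"{label}: {len(texts)} unique mentions")
```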

Task 2: NER using a Pre-trained Language Model

Script: task2_llm_sequence_labelling.py
Interactive version: task2.ipynb
Purpose: Apply a pre-trained biomedical NER model to label forum posts.
Process:

  1. Read forum post from cadec/text.
  2. Apply NER pipeline.
  3. Merge sub-word tokens into complete entities.
  4. Map model labels to ADR, Drug, Disease, Symptom.
  5. Save predictions as .ann-style JSON span files (*_predicted_spans.json); a sketch of the pipeline follows.
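A minimal sketch of this pipeline, assuming a Hugging Face token-classification model; the model name, example file, and label mapping are illustrative, not necessarily what task2_llm_sequence_labelling.py uses:

```python
import json
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="d4data/biomedical-ner-all",  # assumed biomedical NER model
    aggregation_strategy="simple",       # merges sub-word tokens into spans
)

LABEL_MAP = {  # hypothetical mapping from model labels to CADEC labels;
    "Medication": "Drug",                # an ADR mapping would depend on the
    "Disease_disorder": "Disease",       # chosen model's label set
    "Sign_symptom": "Symptom",
}

with open("cadec/text/ARTHROTEC.1.txt", encoding="utf-8") as f:  # example post
    post = f.read()

spans = []
for ent in ner(post):
    label = LABEL_MAP.get(ent["entity_group"])
    if label is None:  # drop labels outside the target classes
        continue
    spans.append({"label": label, "start": ent["start"],
                  "end": ent["end"], "text": ent["word"]})

with open("predictions/ARTHROTEC.1_predicted_spans.json", "w") as f:
    json.dump(spans, f, indent=2)
```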

Task 3: Standard Evaluation

Script: task3_evaluate_predictions.py
Purpose: Evaluate NER model performance against ground truth (cadec/original).
Method: Strict matching on entity text and label; reports Precision, Recall, and F1-score.
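The strict criterion can be illustrated as follows, assuming gold and predicted entities are already loaded as (text, label) pairs (the loading step is omitted):

```python
def strict_prf(gold, pred):
    """Precision/recall/F1 with exact (text, label) matching."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("muscle pain", "ADR"), ("arthrotec", "Drug")]
pred = [("muscle pain", "ADR"), ("naproxen", "Drug")]
print(strict_prf(gold, pred))  # (0.5, 0.5, 0.5)
```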

Task 4: ADR-focused Evaluation with MedDRA

Script: task4.py
Purpose: Specialized evaluation of the ADR label using cadec/meddra.
Output: Precision, Recall, F1-score for ADR detection.
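A rough sketch of the ADR-only comparison, assuming the ADR surface text is the last tab-separated field of each cadec/meddra line (the exact field layout may differ) and using a hypothetical prediction set:

```python
def load_meddra_adrs(path):
    """Collect ADR surface strings from a cadec/meddra .ann file."""
    adrs = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                adrs.add(parts[-1].lower())  # assume text is the last field
    return adrs

gold_adrs = load_meddra_adrs("cadec/meddra/ARTHROTEC.1.ann")
pred_adrs = {"stomach upset", "muscle pain"}  # hypothetical predictions

tp = len(gold_adrs & pred_adrs)
precision = tp / len(pred_adrs) if pred_adrs else 0.0
recall = tp / len(gold_adrs) if gold_adrs else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"ADR  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```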

Task 5: Relaxed / Token / Word-Level Evaluation

Task 5 contains three scripts for more flexible evaluation metrics (a sketch of the shared overlap criterion follows the list):

  • batch_evaluation.py: Computes relaxed precision, recall, and F1, counting overlapping spans as correct. Works with *_predicted_spans.json files.
  • relaxed_eval.py: Computes relaxed evaluation at the file level and reports macro/micro averages; also lists skipped files.
  • token_level.py: Computes token-level and word-presence F1 scores for predictions, useful for finer-grained analysis.

Output: Evaluation metrics per file and macro averages.
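As a rough illustration of the relaxed criterion these scripts share, a prediction counts as correct when it overlaps a gold span with the same label. A minimal sketch, with assumed (start, end, label) tuples:

```python
def overlaps(a, b):
    """True if character spans (start, end, ...) a and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def relaxed_prf(gold, pred):
    """gold/pred: lists of (start, end, label) tuples."""
    tp_pred = sum(any(g[2] == p[2] and overlaps(g, p) for g in gold)
                  for p in pred)
    tp_gold = sum(any(g[2] == p[2] and overlaps(g, p) for p in pred)
                  for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(160, 175, "ADR")]
pred = [(160, 171, "ADR"), (0, 5, "Drug")]
print(relaxed_prf(gold, pred))  # (0.5, 1.0, 0.667): partial span still counts
```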

Task 6: Entity Linking to SNOMED-CT

Script: task6.py
Purpose: Normalize detected ADR entities by linking them to SNOMED-CT concepts.
Methods:

  • Fuzzy string matching (fuzzywuzzy)
  • Sentence embeddings (sentence-transformers) for semantic similarity

Output: adr_sct_mappings.json with mapping results.
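A minimal sketch of the two linking strategies; the SNOMED-CT term list, embedding model, and example ADR are illustrative assumptions:

```python
from fuzzywuzzy import process
from sentence_transformers import SentenceTransformer, util

sct_terms = {  # hypothetical SNOMED-CT code -> preferred term
    "22253000": "Pain",
    "271681002": "Stomach ache",
    "84229001": "Fatigue",
}

adr = "stomach upset"

# Strategy 1: fuzzy string matching over the preferred terms.
term, score = process.extractOne(adr, list(sct_terms.values()))
print("fuzzy:", term, score)

# Strategy 2: semantic similarity via sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
adr_emb = model.encode(adr, convert_to_tensor=True)
term_embs = model.encode(list(sct_terms.values()), convert_to_tensor=True)
scores = util.cos_sim(adr_emb, term_embs)[0]
best = scores.argmax().item()
print("embedding:", list(sct_terms.values())[best], scores[best].item())
```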

Task 7: ADR Comparison (Optional)

Script: compare_adr_mapping_readable.py
Purpose: Generate human-readable comparison of fuzzy vs embedding mappings for ADRs.
Output: adr_comparison_readable.txt showing side-by-side matches.


File Descriptions

File / Directory                     Description
task1_entity_enumeration.py          Counts and summarizes unique entities in CADEC.
task2_llm_sequence_labelling.py      Performs NER using a pre-trained biomedical model.
task3_evaluate_predictions.py        Standard evaluation against cadec/original.
task4.py                             ADR-specific evaluation using MedDRA.
task5_relaxed_eval.py                Relaxed evaluation considering span overlaps.
batch_evaluation.py                  Relaxed evaluation allowing overlapping spans per file.
relaxed_eval.py                      Relaxed evaluation with macro/micro metrics and skipped files.
token_level.py                       Token-level and word-presence evaluation metrics.
task6.py                             Links ADR entities to SNOMED-CT using fuzzy & embedding methods.
compare_adr_mapping_readable.py      Human-readable ADR mapping comparison.
adr_sct_mappings.json                ADR to SNOMED-CT mapping results.
adr_comparison_only.json             Comparison data between the fuzzy & embedding methods.
adr_comparison_readable.txt          Readable text file displaying ADR mapping comparisons.
cadec/                               CADEC dataset (text and annotations).
predictions/                         Predicted span JSON files.
venv/                                Python virtual environment.

How to Run

  1. Install Python 3 and the required packages:

```bash
pip install torch transformers fuzzywuzzy sentence-transformers
```

  2. Run the scripts sequentially:

```bash
python task1_entity_enumeration.py
python task2_llm_sequence_labelling.py
python task3_evaluate_predictions.py
python task4.py
python batch_evaluation.py   # Task 5 evaluation script 1
python relaxed_eval.py       # Task 5 evaluation script 2
python token_level.py        # Task 5 evaluation script 3
python task6.py
python compare_adr_mapping_readable.py  # Optional
```

  3. Review the outputs in the predictions/ folder and the evaluation metrics printed to the terminal.

About
This repository provides a complete workflow for medical NER, evaluation, and entity linking, useful for NLP research in pharmacovigilance and adverse drug event detection.
