This project performs Named Entity Recognition (NER) on medical forum posts to identify mentions of Drugs, Diseases, Symptoms, and Adverse Drug Reactions (ADRs). It also links the extracted ADRs to SNOMED-CT medical codes.
The project uses the CADEC (CSIRO Adverse Drug Event Corpus) dataset. The expected dataset layout:
```
cadec/
├── text/      # Raw forum posts (.txt)
├── original/  # Ground truth annotations (.ann)
├── meddra/    # ADR annotations using MedDRA terminology
└── sct/       # Annotations linked to SNOMED-CT codes
```
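The `.ann` annotation files follow the brat standoff style; the lines below are illustrative only (IDs, offsets, and text are made up):

```
T1	ADR 9 19	bit drowsy
T2	Drug 29 36	Lipitor
```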
The project is divided into several tasks:
### Task 1: Entity Enumeration

- **Script:** `task1_entity_enumeration.py`
- **Purpose:** Parse `.ann` files from `cadec/original` to count unique entities (ADR, Drug, Disease, Symptom).
- **Output:** Summary statistics of entities in the dataset (counting logic sketched below).
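A minimal sketch of the counting step, assuming brat-style lines as above; the function name and directory default mirror the dataset layout but are illustrative, not the script's exact internals:

```python
from collections import Counter
from pathlib import Path

def count_entities(ann_dir="cadec/original"):
    """Tally entity labels (ADR, Drug, Disease, Symptom) across all .ann files."""
    counts = Counter()
    for ann_file in Path(ann_dir).glob("*.ann"):
        for line in ann_file.read_text(encoding="utf-8").splitlines():
            # Entity lines start with 'T'; format: "T1<TAB>ADR 9 19<TAB>bit drowsy"
            if line.startswith("T"):
                label = line.split("\t")[1].split()[0]
                counts[label] += 1
    return counts

if __name__ == "__main__":
    print(count_entities())
```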
### Task 2: LLM Sequence Labelling

- **Script:** `task2_llm_sequence_labelling.py`
- **Interactive version:** `task2.ipynb`
- **Purpose:** Apply a pre-trained biomedical NER model to label forum posts.
- **Process** (sketched after this list):
  - Read forum posts from `cadec/text`.
  - Apply the NER pipeline.
  - Merge sub-word tokens into complete entities.
  - Map model labels to ADR, Drug, Disease, Symptom.
  - Save predictions in `.ann`-style JSON (`*_predicted_spans.json`).
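A minimal sketch of the labelling loop, assuming the Hugging Face `transformers` token-classification pipeline; the checkpoint name and `LABEL_MAP` are placeholders, not necessarily what the script uses:

```python
import json
from pathlib import Path
from transformers import pipeline

# aggregation_strategy="simple" merges sub-word tokens back into whole spans
ner = pipeline("token-classification",
               model="d4data/biomedical-ner-all",  # placeholder checkpoint
               aggregation_strategy="simple")

# Hypothetical mapping from model label names to the four CADEC labels
LABEL_MAP = {"DRUG": "Drug", "DISEASE": "Disease", "SYMPTOM": "Symptom", "ADR": "ADR"}

Path("predictions").mkdir(exist_ok=True)
for txt_file in Path("cadec/text").glob("*.txt"):
    text = txt_file.read_text(encoding="utf-8")
    spans = [{"start": int(e["start"]), "end": int(e["end"]),
              "label": LABEL_MAP.get(e["entity_group"], e["entity_group"]),
              "text": e["word"]}
             for e in ner(text)]
    out = Path("predictions") / f"{txt_file.stem}_predicted_spans.json"
    out.write_text(json.dumps(spans, indent=2), encoding="utf-8")
```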
### Task 3: Evaluating Predictions

- **Script:** `task3_evaluate_predictions.py`
- **Purpose:** Evaluate NER model performance against the ground truth in `cadec/original`.
- **Method:** Strict matching of entity text and label; calculates Precision, Recall, and F1-score (see the sketch below).
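Strict matching reduces to set intersection over (text, label) pairs; this is a simplified illustration of the metric computation, not the exact script:

```python
def strict_prf(gold, pred):
    """gold/pred: iterables of (entity_text, label) pairs for one document."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # exact text + label matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one correct match out of two predictions
print(strict_prf([("bit drowsy", "ADR")],
                 [("bit drowsy", "ADR"), ("Lipitor", "Disease")]))
```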
### Task 4: ADR-Specific Evaluation

- **Script:** `task4.py`
- **Purpose:** Specialized evaluation of the ADR label using `cadec/meddra`.
- **Output:** Precision, Recall, and F1-score for ADR detection.
### Task 5: Flexible Evaluation Metrics

Task 5 contains three scripts for flexible evaluation metrics:

- `batch_evaluation.py`: Computes relaxed precision, recall, and F1 by counting overlapping spans as correct. Works with `*_predicted_spans.json` files.
- `relaxed_eval.py`: Computes relaxed evaluation at file and macro/micro levels; reports skipped files.
- `token_level.py`: Computes token-level and word-presence F1 scores for predictions, useful for finer-grained analysis.

**Output:** Evaluation metrics per file and macro averages (the overlap test is sketched below).
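The relaxed criterion boils down to an interval-overlap test: a prediction counts as correct if it overlaps any gold span with the same label. A minimal sketch, assuming spans carry character offsets:

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two character-offset spans share at least one character."""
    return a_start < b_end and b_start < a_end

def relaxed_prf(gold, pred):
    """gold/pred: lists of (start, end, label) tuples for one document."""
    tp = sum(1 for ps, pe, pl in pred
             if any(overlaps(ps, pe, gs, ge) and pl == gl
                    for gs, ge, gl in gold))
    covered = sum(1 for gs, ge, gl in gold
                  if any(overlaps(ps, pe, gs, ge) and pl == gl
                         for ps, pe, pl in pred))
    precision = tp / len(pred) if pred else 0.0
    recall = covered / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```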
### Task 6: ADR Normalization (Entity Linking)

- **Script:** `task6.py`
- **Purpose:** Normalize detected ADR entities by linking them to SNOMED-CT concepts.
- **Methods** (sketched below):
  - Fuzzy string matching (`fuzzywuzzy`)
  - Sentence embeddings (`sentence-transformers`) for semantic similarity
- **Output:** `adr_sct_mappings.json` with mapping results.
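Both linking strategies can be sketched side by side; the `sct_terms` dictionary and the embedding model name below are stand-ins for whatever terminology file and checkpoint the script actually loads:

```python
from fuzzywuzzy import process
from sentence_transformers import SentenceTransformer, util

# Placeholder terminology: in practice this comes from the SNOMED-CT annotations
sct_terms = {"271782001": "drowsiness", "25064002": "headache", "422587007": "nausea"}
term_list = list(sct_terms.values())

def link_fuzzy(adr_text):
    """Best lexical match via fuzzywuzzy (Levenshtein-based ratio)."""
    term, score = process.extractOne(adr_text, term_list)
    return term, score

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
term_emb = model.encode(term_list, convert_to_tensor=True)

def link_embedding(adr_text):
    """Best semantic match via cosine similarity of sentence embeddings."""
    query = model.encode(adr_text, convert_to_tensor=True)
    scores = util.cos_sim(query, term_emb)[0]
    best = int(scores.argmax())
    return term_list[best], float(scores[best])

print(link_fuzzy("felt drowsy"))      # lexical match
print(link_embedding("felt sleepy"))  # semantic match
```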
### Optional: Readable Mapping Comparison

- **Script:** `compare_adr_mapping_readable.py`
- **Purpose:** Generate a human-readable comparison of fuzzy vs. embedding mappings for ADRs.
- **Output:** `adr_comparison_readable.txt` showing side-by-side matches.
| File / Directory | Description |
|---|---|
| `task1_entity_enumeration.py` | Counts and summarizes unique entities in CADEC. |
| `task2_llm_sequence_labelling.py` | Performs NER using a pre-trained biomedical model. |
| `task3_evaluate_predictions.py` | Standard evaluation against `cadec/original`. |
| `task4.py` | ADR-specific evaluation using MedDRA. |
| `task5_relaxed_eval.py` | Relaxed evaluation considering span overlaps. |
| `batch_evaluation.py` | Relaxed evaluation allowing overlapping spans per file. |
| `relaxed_eval.py` | Relaxed evaluation with macro/micro metrics and skipped files. |
| `token_level.py` | Token-level and word-presence evaluation metrics. |
| `task6.py` | Links ADR entities to SNOMED-CT using fuzzy & embedding methods. |
| `compare_adr_mapping_readable.py` | Human-readable ADR mapping comparison. |
| `adr_sct_mappings.json` | JSON storing ADR to SNOMED-CT mapping results. |
| `adr_comparison_only.json` | JSON containing comparison data between fuzzy & embedding methods. |
| `adr_comparison_readable.txt` | Readable text file displaying ADR mapping comparisons. |
| `cadec/` | CADEC dataset (text and annotations). |
| `predictions/` | Folder storing predicted span JSON files. |
| `venv/` | Python virtual environment. |
- Install Python 3 and the required packages:

```bash
pip install torch transformers fuzzywuzzy sentence-transformers
```
Run the scripts sequentially:
```bash
python task1_entity_enumeration.py
python task2_llm_sequence_labelling.py
python task3_evaluate_predictions.py
python task4.py
python batch_evaluation.py # Task 5 evaluation script 1
python relaxed_eval.py # Task 5 evaluation script 2
python token_level.py # Task 5 evaluation script 3
python task6.py
python compare_adr_mapping_readable.py # Optional
```
Review the outputs in the `predictions/` folder and the evaluation metrics printed to the terminal.
## About

This repository provides a complete workflow for medical NER, evaluation, and entity linking, useful for NLP research in pharmacovigilance and adverse drug event detection.