
ReproLab

ReproLab is a full-stack platform for scientific protocol reproducibility. It combines:

Backend (Python):

  • Automated constraint-based validation
  • Deterministic preprocessing pipeline
  • Explainable transformation logging
  • Automated reproducibility scoring (0-100 scale)
  • Institutional lineage tracking with cryptographic hashing

Frontend (React + Supabase):

  • Protocol editor with live reproducibility scoring
  • Dashboard for protocol management
  • Multi-tenant SaaS architecture
  • Real-time score feedback as users edit

Monorepo Structure

ReproLab/
├── .github/                            # CI workflows
├── api/                                # FastAPI wrapper
│   ├── main.py
│   ├── requirements.txt
│   └── README.md
├── docs/                               # Documentation, reports, and analysis notes
├── examples/                           # Runnable examples and data generation
├── frontend/                           # React + Supabase web app
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── services/
│   │   ├── store/
│   │   └── lib/
│   ├── package.json
│   ├── vite.config.js
│   └── README.md
├── results/
│   ├── benchmarks/                     # JSON result snapshots from all benchmark runs
│   ├── top_genes/                      # Top differentially expressed gene CSVs
│   ├── robustness/                     # Per-run utility-proof stress-test CSVs
│   └── preliminary/                    # Grant-aligned preliminary data outputs
├── scripts/                            # GEO benchmark and analysis scripts
├── src/reprolab/                       # Python package
│   ├── constraints/
│   ├── lineage/
│   ├── simulation/
│   ├── validation/
│   ├── pipeline.py
│   ├── preprocessing.py
│   ├── scoring.py
│   └── models.py
├── tests/                              # Test suite
├── pyproject.toml
└── README.md

Quick Start (Full Stack)

Backend API

cd api
pip install -r requirements.txt
python main.py
# API runs on http://localhost:8000

Frontend UI

cd frontend
npm install
npm run dev
# Frontend runs on http://localhost:5173

Then configure your Supabase project in frontend/.env.local.
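A minimal .env.local sketch, assuming the app reads the conventional Vite-prefixed Supabase variables; these names are an assumption, so confirm the exact keys in frontend/README.md:

VITE_SUPABASE_URL=https://your-project.supabase.co
VITE_SUPABASE_ANON_KEY=your-anon-key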


Table of Contents

Backend

Frontend & API

Development

What The Backend Does

ReproLab combines six capabilities:

  1. Automated preprocessing
  • Duplicate detection and deterministic removal
  • Missing-value handling with deterministic imputations
  • Standardization of date, unit, and categorical formats
  2. Clinically constrained validation engine
  • Deterministic ICD-like ontology checks
  • Cross-variable checks (diagnosis and HbA1c consistency)
  • Probabilistic context-aware anomaly correction
  • Confidence-scored correction proposals
  • Deterministic conflict resolution for competing rules
  3. Explainable transformation logging
  • Original value and corrected value
  • Constraint/rule applied
  • Rationale
  • Confidence
  • JSON and CSV export
  4. Reproducibility-preserving lineage
  • Hash before and after each step
  • Versioned step metadata
  • Deterministic signatures for regeneration
  5. Modular extensibility
  • Reusable constraint interface
  • Plug-in clinical rules per dataset/domain
  6. Testing and benchmarking
  • Synthetic error injection
  • Comparison with manual and generic baselines
  • Metrics: integrity score, correction rate, residual errors, runtime

Requirements

  • Python 3.10+
  • pip

Quick Start

  1. Clone and enter the repository:
git clone https://github.com/SID-6921/ReproLab.git
cd ReproLab
  2. Install dependencies:
python -m pip install -e .[dev]
  3. Run tests:
python -m pytest
  4. Run sample workflow:
python examples/sample_usage.py

The sample in examples/sample_usage.py prints:

  • cleaned dataset
  • transformation log
  • lineage history
  • simulated error profile
  • benchmark table

It also exports:

  • transformation_log.json
  • transformation_log.csv

Implementation Workflow

Use this when implementing or changing behavior in ReproLab:

  1. Create a branch from main
  2. Implement focused changes in src/reprolab
  3. Add or update tests in tests
  4. Run local validation:
python -m pytest
  5. Run sample usage for sanity checks:
python examples/sample_usage.py
  6. Open a pull request with clear notes on behavior changes

Minimal Usage

import pandas as pd
from reprolab.constraints.clinical_rules import default_clinical_constraints
from reprolab.pipeline import ReproLabPipeline

raw = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2"],
        "diagnosis_code": ["e11", "E11", "T88"],
        "hba1c_pct": [8.2, 8.2, 4.1],
        "event_date": ["2026/01/10", "2026-01-10", "10-02-2026"],
        "glucose_mg_dl": ["180 mg/dL", None, "95"],
        "adverse_event": ["yes", "yes", "NO"],
    }
)

pipeline = ReproLabPipeline(constraints=default_clinical_constraints())
result = pipeline.run(raw)

cleaned = result.cleaned_data
logs = result.transformation_log
lineage = result.lineage_history

pipeline.export_logs("transformation_log.json", "transformation_log.csv")
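
To inspect the lineage trail, iterate result.lineage_history (a list of dicts; the exact keys come from the lineage module, so treat this as a sketch):

for entry in result.lineage_history:
    print(entry)  # one dict per pipeline step, carrying the before/after hashes described above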

Input Contract

Expected columns for the default clinical constraints:

  • patient_id
  • diagnosis_code
  • hba1c_pct
  • glucose_mg_dl
  • event_date
  • adverse_event

If some columns are missing, the relevant rules are skipped safely.
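
For example, reusing the pipeline and raw frame from Minimal Usage, dropping a column only disables the rules that need it:

partial = raw.drop(columns=["adverse_event"])
result = pipeline.run(partial)  # adverse-event rules are skipped; the remaining rules still apply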

Output Contract

PipelineResult includes:

  • cleaned_data (pandas.DataFrame)
  • transformation_log (pandas.DataFrame)
  • lineage_history (list[dict[str, str]])
  • reproducibility_score (dict[str, float | int])

reproducibility_score includes:

  • overall (0-100)
  • metadata_completeness (0-100)
  • reagent_traceability (0-100)
  • step_granularity (0-100)
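
Given a PipelineResult from Minimal Usage, the breakdown can be read directly (keys as listed above):

score = result.reproducibility_score
print(f"overall: {score['overall']}/100")
for key in ("metadata_completeness", "reagent_traceability", "step_granularity"):
    print(f"{key}: {score[key]}/100")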

Each transformation-log record includes:

  • row_index
  • column
  • original_value
  • corrected_value
  • constraint_name
  • rationale
  • confidence
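
Because the log is a pandas DataFrame with these columns, low-confidence corrections can be pulled into a review queue; the 0.9 cutoff here is illustrative, not a library default:

logs = result.transformation_log
needs_review = logs[logs["confidence"] < 0.9]  # illustrative threshold
print(needs_review[["row_index", "column", "original_value", "corrected_value", "rationale"]])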

Benchmarking and Simulation

Use simulation utilities to generate controlled-error datasets and benchmark strategies:

from reprolab.simulation.benchmark import run_preprocessing_benchmark
from reprolab.simulation.dataset_simulator import simulate_biomed_dataset

df, error_profile = simulate_biomed_dataset(n=120, seed=12)
benchmark_df = run_preprocessing_benchmark(df)

Benchmark output columns:

  • strategy
  • data_integrity_score
  • error_correction_rate
  • residual_errors
  • preprocessing_time_sec

Current benchmark strategies include both reprolab_median and reprolab_knn, so the default deterministic pipeline can be compared directly with the stronger KNN-based numeric imputation path.
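
For a side-by-side look at the two, filter the benchmark table (column and strategy names as listed above):

subset = benchmark_df[benchmark_df["strategy"].isin(["reprolab_median", "reprolab_knn"])]
print(subset.sort_values("data_integrity_score", ascending=False))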

Expanded GEO Utility Validation

The repository also includes a real-data robustness benchmark built around GEO datasets GSE32707 and GSE3037.

The follow-up validation now covers:

  • Multiple corruption regimes: mixed noise, high missingness, structured batch noise, and extreme corruption
  • Multiple preprocessing paths: baseline, median imputation, KNN imputation, variance-stabilizing transform, quantile normalization, and ComBat-like correction
  • Multiple evaluation layers: gene-level stability, pathway-level consistency, and signal-to-noise ratio tracking

The core library now also supports configurable numeric imputation through PreprocessingConfig, with median imputation as the default and KNN imputation available as an explicit strategy for robustness-focused runs.
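
A hedged sketch of opting into KNN imputation; the import path, field name, and pipeline parameter are assumptions drawn from the module layout, so check src/reprolab/preprocessing.py for the actual signature:

from reprolab.constraints.clinical_rules import default_clinical_constraints
from reprolab.pipeline import ReproLabPipeline
from reprolab.preprocessing import PreprocessingConfig  # assumed import path

config = PreprocessingConfig(imputation_strategy="knn")  # field name is an assumption; median is the default
pipeline = ReproLabPipeline(
    constraints=default_clinical_constraints(),
    preprocessing_config=config,  # parameter name is an assumption
)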

Current takeaways:

  • Quality-gating remains strong.
  • KNN imputation is the best-performing robustness method in the expanded benchmark.
  • Higher SNR alone is not enough; some normalization methods improve SNR numerically while degrading biological fidelity.

Primary outputs are written under results/ (see the monorepo structure above): benchmark snapshots in results/benchmarks, top-gene CSVs in results/top_genes, and per-run stress-test CSVs in results/robustness.

Preliminary Data Package

Generate grant-aligned outputs:

python examples/generate_preliminary_data.py

This writes grant-aligned outputs to results/preliminary.

Extending Constraints

To add custom rules:

  1. Implement a new constraint using the interface in src/reprolab/constraints/base.py
  2. Return CandidateCorrection entries from apply(...)
  3. Pass your constraint list to ReproLabPipeline(...)
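
A minimal sketch of a custom rule; the base-class and CandidateCorrection signatures here are assumptions modeled on the steps above, so mirror src/reprolab/constraints/base.py rather than this outline:

import pandas as pd
from reprolab.constraints.base import Constraint, CandidateCorrection  # assumed names; check base.py

class NonNegativeGlucose(Constraint):
    """Hypothetical rule: flag negative glucose readings."""

    name = "non_negative_glucose"

    def apply(self, df: pd.DataFrame) -> list[CandidateCorrection]:
        corrections: list[CandidateCorrection] = []
        if "glucose_mg_dl" not in df.columns:
            return corrections  # missing column: skip safely, per the input contract
        for idx, value in df["glucose_mg_dl"].items():
            try:
                numeric = float(str(value).split()[0])  # tolerate "180 mg/dL"-style strings
            except (TypeError, ValueError, IndexError):
                continue
            if numeric < 0:
                corrections.append(
                    CandidateCorrection(  # field names mirror the transformation-log schema; verify in base.py
                        row_index=idx,
                        column="glucose_mg_dl",
                        original_value=value,
                        corrected_value=None,
                        rationale="negative glucose is physiologically implausible",
                        confidence=0.5,
                    )
                )
        return corrections

Register it next to the defaults with ReproLabPipeline(constraints=[*default_clinical_constraints(), NonNegativeGlucose()]).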

Reference examples: the built-in rules in src/reprolab/constraints/ (for instance the clinical_rules module used in Minimal Usage).

Project Structure

See the Monorepo Structure section above.

Troubleshooting

  1. Import errors after cloning
  • Run: python -m pip install -e .[dev]
  2. Tests not discovered
  • Run tests from the repository root with: python -m pytest
  3. Different outputs across runs
  • Keep input data and simulator seed consistent

Reproducible Environments

For stricter reproducibility, use pinned dependencies from requirements-lock.txt:

python -m pip install -r requirements-lock.txt
python -m pip install -e .

Contributing

Contribution workflow and PR checklist: CONTRIBUTING.md

License

This repository uses a proprietary all-rights-reserved license.

Reuse, copying, modification, or distribution is not permitted without written permission from the copyright holder.

See LICENSE.

Citation

If permission is granted for use, citation is required.

Use metadata from CITATION.cff.
