ReproLab is a full-stack platform for scientific protocol reproducibility. It combines:
Backend (Python):
- Automated constraint-based validation
- Deterministic preprocessing pipeline
- Explainable transformation logging
- Automated reproducibility scoring (0-100 scale)
- Institutional lineage tracking with cryptographic hashing
Frontend (React + Supabase):
- Protocol editor with live reproducibility scoring
- Dashboard for protocol management
- Multi-tenant SaaS architecture
- Real-time score feedback as users edit
```
ReproLab/
├── .github/                 # CI workflows
├── api/                     # FastAPI wrapper
│   ├── main.py
│   ├── requirements.txt
│   └── README.md
├── docs/                    # Documentation, reports, and analysis notes
├── examples/                # Runnable examples and data generation
├── frontend/                # React + Supabase web app
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── services/
│   │   ├── store/
│   │   └── lib/
│   ├── package.json
│   ├── vite.config.js
│   └── README.md
├── results/
│   ├── benchmarks/          # JSON result snapshots from all benchmark runs
│   ├── top_genes/           # Top differentially expressed gene CSVs
│   ├── robustness/          # Per-run utility-proof stress-test CSVs
│   └── preliminary/         # Grant-aligned preliminary data outputs
├── scripts/                 # GEO benchmark and analysis scripts
├── src/reprolab/            # Python package
│   ├── constraints/
│   ├── lineage/
│   ├── simulation/
│   ├── validation/
│   ├── pipeline.py
│   ├── preprocessing.py
│   ├── scoring.py
│   └── models.py
├── tests/                   # Test suite
├── pyproject.toml
└── README.md
```
Start the backend API:

```shell
cd api
pip install -r requirements.txt
python main.py
# API runs on http://localhost:8000
```

Start the frontend:

```shell
cd frontend
npm install
npm run dev
# Frontend runs on http://localhost:5173
```

Then configure your Supabase project in `frontend/.env.local`.
ReproLab combines six capabilities:

- Automated preprocessing
  - Duplicate detection and deterministic removal
  - Missing-value handling with deterministic imputations
  - Standardization of date, unit, and categorical formats
- Clinically constrained validation engine
  - Deterministic ICD-like ontology checks
  - Cross-variable checks (diagnosis and HbA1c consistency)
  - Probabilistic context-aware anomaly correction
  - Confidence-scored correction proposals
  - Deterministic conflict resolution for competing rules
- Explainable transformation logging
  - Original value and corrected value
  - Constraint/rule applied
  - Rationale
  - Confidence
  - JSON and CSV export
- Reproducibility-preserving lineage
  - Hash before and after each step
  - Versioned step metadata
  - Deterministic signatures for regeneration
- Modular extensibility
  - Reusable constraint interface
  - Plug-in clinical rules per dataset/domain
- Testing and benchmarking
  - Synthetic error injection
  - Comparison with manual and generic baselines
  - Metrics: integrity score, correction rate, residual errors, runtime
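The hash-before-and-after lineage idea can be sketched in a few lines of plain `hashlib` and pandas. The `dataframe_fingerprint` helper and the step record shape below are illustrative assumptions, not ReproLab's actual API:

```python
import hashlib

import pandas as pd


def dataframe_fingerprint(df: pd.DataFrame) -> str:
    """Deterministic SHA-256 over a canonical CSV serialization (illustrative)."""
    canonical = df.sort_index(axis=1).to_csv(index=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


df = pd.DataFrame({"patient_id": ["P1", "P2"], "hba1c_pct": [8.2, 4.1]})
before = dataframe_fingerprint(df)

# A preprocessing step changes the data, so the fingerprint changes too.
df["hba1c_pct"] = df["hba1c_pct"].round(0)
after = dataframe_fingerprint(df)

# A lineage record ties the step name to both fingerprints.
step_record = {"step": "round_hba1c", "hash_before": before, "hash_after": after}
print(before != after)  # True: the data changed, so the hashes differ
```

Because the serialization is canonical (sorted columns, no index), rerunning the same step on the same input regenerates the same signature.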
- Python 3.10+
- pip
- Clone and enter the repository:

  ```shell
  git clone https://github.com/SID-6921/ReproLab.git
  cd ReproLab
  ```

- Install dependencies:

  ```shell
  python -m pip install -e .[dev]
  ```

- Run tests:

  ```shell
  python -m pytest
  ```

- Run sample workflow:

  ```shell
  python examples/sample_usage.py
  ```

The sample in examples/sample_usage.py prints:
- cleaned dataset
- transformation log
- lineage history
- simulated error profile
- benchmark table
It also exports:
- transformation_log.json
- transformation_log.csv
Use this when implementing or changing behavior in ReproLab:

- Create a branch from `main`
- Implement focused changes in `src/reprolab`
- Add or update tests in `tests`
- Run local validation: `python -m pytest`
- Run sample usage for sanity checks: `python examples/sample_usage.py`
- Open a pull request with clear notes on behavior changes
```python
import pandas as pd

from reprolab.constraints.clinical_rules import default_clinical_constraints
from reprolab.pipeline import ReproLabPipeline

raw = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2"],
        "diagnosis_code": ["e11", "E11", "T88"],
        "hba1c_pct": [8.2, 8.2, 4.1],
        "event_date": ["2026/01/10", "2026-01-10", "10-02-2026"],
        "glucose_mg_dl": ["180 mg/dL", None, "95"],
        "adverse_event": ["yes", "yes", "NO"],
    }
)

pipeline = ReproLabPipeline(constraints=default_clinical_constraints())
result = pipeline.run(raw)

cleaned = result.cleaned_data
logs = result.transformation_log
lineage = result.lineage_history

pipeline.export_logs("transformation_log.json", "transformation_log.csv")
```

Expected columns for the default clinical constraints:
- patient_id
- diagnosis_code
- hba1c_pct
- glucose_mg_dl
- event_date
- adverse_event
If some columns are missing, relevant rules are skipped safely.
PipelineResult includes:

- `cleaned_data` (pandas.DataFrame)
- `transformation_log` (pandas.DataFrame)
- `lineage_history` (list[dict[str, str]])
- `reproducibility_score` (dict[str, float | int])

`reproducibility_score` includes:

- `overall` (0-100)
- `metadata_completeness` (0-100)
- `reagent_traceability` (0-100)
- `step_granularity` (0-100)

Each transformation-log record includes:

- `row_index`
- `column`
- `original_value`
- `corrected_value`
- `constraint_name`
- `rationale`
- `confidence`
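Because the log is a plain pandas DataFrame with those columns, routine review tasks reduce to ordinary pandas filtering. The sample rows and the 0.9 review threshold below are illustrative, not real pipeline output:

```python
import pandas as pd

# Illustrative transformation-log rows using the documented columns.
log = pd.DataFrame(
    {
        "row_index": [0, 2],
        "column": ["diagnosis_code", "glucose_mg_dl"],
        "original_value": ["e11", "95 mg/dL"],
        "corrected_value": ["E11", 95.0],
        "constraint_name": ["icd_case_normalization", "unit_parsing"],
        "rationale": ["Uppercase ICD-like code", "Strip unit suffix"],
        "confidence": [0.99, 0.80],
    }
)

# Surface corrections below a review threshold (0.9 is an arbitrary choice).
needs_review = log[log["confidence"] < 0.9]
print(needs_review["constraint_name"].tolist())  # ['unit_parsing']
```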
Use simulation utilities to generate controlled-error datasets and benchmark strategies:

```python
from reprolab.simulation.benchmark import run_preprocessing_benchmark
from reprolab.simulation.dataset_simulator import simulate_biomed_dataset

df, error_profile = simulate_biomed_dataset(n=120, seed=12)
benchmark_df = run_preprocessing_benchmark(df)
```

Benchmark output columns:
- strategy
- data_integrity_score
- error_correction_rate
- residual_errors
- preprocessing_time_sec
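Given those columns, comparing strategies is a one-liner in pandas. All numbers below are made up for illustration; only the column names and strategy labels come from this README:

```python
import pandas as pd

# Hypothetical benchmark output with the documented columns.
benchmark_df = pd.DataFrame(
    {
        "strategy": ["manual", "generic", "reprolab_median", "reprolab_knn"],
        "data_integrity_score": [0.78, 0.71, 0.93, 0.95],
        "error_correction_rate": [0.60, 0.55, 0.88, 0.91],
        "residual_errors": [40, 45, 12, 9],
        "preprocessing_time_sec": [900.0, 1.2, 2.4, 5.1],
    }
)

# Pick the strategy with the highest integrity score.
best = benchmark_df.loc[benchmark_df["data_integrity_score"].idxmax(), "strategy"]
print(best)  # reprolab_knn
```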
Current benchmark strategies include both `reprolab_median` and `reprolab_knn`, so the default deterministic pipeline can be compared directly with the stronger KNN-based numeric imputation path.
The repository also includes a real-data robustness benchmark built around GEO datasets GSE32707 and GSE3037.
The follow-up validation now covers:
- Multiple corruption regimes: mixed noise, high missingness, structured batch noise, and extreme corruption
- Multiple preprocessing paths: baseline, median imputation, KNN imputation, variance-stabilizing transform, quantile normalization, and ComBat-like correction
- Multiple evaluation layers: gene-level stability, pathway-level consistency, and signal-to-noise ratio tracking
The core library now also supports configurable numeric imputation through `PreprocessingConfig`, with median imputation as the default and KNN imputation available as an explicit strategy for robustness-focused runs.
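The median-vs-KNN trade-off can be sketched without ReproLab's internals. `PreprocessingConfig`'s real interface may differ; the two functions below are a self-contained illustration on a toy matrix, where the naive KNN averages the k nearest complete rows by Euclidean distance over observed values:

```python
import numpy as np


def impute_median(x: np.ndarray) -> np.ndarray:
    """Fill NaNs with per-column medians (the deterministic default strategy)."""
    out = x.copy()
    for j in range(out.shape[1]):
        col = out[:, j]
        col[np.isnan(col)] = np.nanmedian(col)
    return out


def impute_knn(x: np.ndarray, k: int = 2) -> np.ndarray:
    """Fill NaNs from the k nearest complete rows (naive KNN sketch)."""
    out = x.copy()
    complete = x[~np.isnan(x).any(axis=1)]
    for i in np.where(np.isnan(x).any(axis=1))[0]:
        row = x[i]
        obs = ~np.isnan(row)
        # Distance computed only over the observed entries of the gappy row.
        dists = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        neighbors = complete[np.argsort(dists)[:k]]
        out[i, ~obs] = neighbors[:, ~obs].mean(axis=0)
    return out


x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, np.nan]])
print(impute_median(x)[2, 1])  # 15.0 (median of 10 and 20)
print(impute_knn(x)[2, 1])     # 15.0 (mean of the two complete neighbors)
```

On this tiny example the two agree; on correlated data with structured missingness, KNN can exploit row similarity where the column median cannot, which matches the robustness-benchmark motivation above.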
Current takeaway:
- Quality-gating remains strong.
- KNN imputation is the best-performing robustness method in the expanded benchmark.
- Higher SNR alone is not enough; some normalization methods improve SNR numerically while degrading biological fidelity.
Primary outputs are generated with:

```shell
python examples/generate_preliminary_data.py
```

This writes to results/preliminary, including:
- results/preliminary/preliminary_report.md
- results/preliminary/preliminary_metrics.csv
- results/preliminary/transformation_log.csv
- docs/preliminary_results_specific_aims.md
To add custom rules:

- Implement a new constraint using the interface in `src/reprolab/constraints/base.py`
- Return `CandidateCorrection` entries from `apply(...)`
- Pass your constraint list to `ReproLabPipeline(...)`
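A custom rule might look like the sketch below. The real base class and `CandidateCorrection` signature live in src/reprolab/constraints/base.py, so the dataclass fields and `apply` shape here are assumptions modeled on the transformation-log columns documented above:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CandidateCorrection:
    """Assumed record shape; check base.py for the real fields."""
    row_index: int
    column: str
    original_value: object
    corrected_value: object
    constraint_name: str
    rationale: str
    confidence: float


class UppercaseDiagnosisConstraint:
    """Illustrative rule: normalize diagnosis codes to uppercase."""

    name = "uppercase_diagnosis"

    def apply(self, df: pd.DataFrame) -> list[CandidateCorrection]:
        # Skip safely when the target column is absent, as the docs describe.
        if "diagnosis_code" not in df.columns:
            return []
        corrections = []
        for idx, value in df["diagnosis_code"].items():
            if isinstance(value, str) and value != value.upper():
                corrections.append(
                    CandidateCorrection(
                        row_index=idx,
                        column="diagnosis_code",
                        original_value=value,
                        corrected_value=value.upper(),
                        constraint_name=self.name,
                        rationale="ICD-like codes are stored uppercase",
                        confidence=0.95,
                    )
                )
        return corrections


df = pd.DataFrame({"diagnosis_code": ["e11", "E11"]})
print([c.corrected_value for c in UppercaseDiagnosisConstraint().apply(df)])  # ['E11']
```

Note that the constraint proposes corrections rather than mutating the frame; under the design described above, the validation engine resolves competing proposals deterministically.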
Reference examples:
- src/reprolab/pipeline.py: Orchestration entry point
- src/reprolab/preprocessing.py: Deterministic preprocessing
- src/reprolab/validation/engine.py: Constraint execution and conflict resolution
- src/reprolab/lineage/logger.py: Transformation log storage/export
- src/reprolab/lineage/tracker.py: Deterministic lineage metadata
- src/reprolab/simulation/dataset_simulator.py: Controlled error injection
- src/reprolab/simulation/benchmark.py: Strategy benchmark metrics
- tests: Automated tests
- Import errors after cloning: run `python -m pip install -e .[dev]`
- Tests not discovered: run tests from the repository root with `python -m pytest`
- Different outputs across runs: keep input data and simulator seed consistent
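Seed consistency is what makes runs comparable. A quick sanity check with numpy alone (independent of ReproLab's simulator) shows why the seed matters:

```python
import numpy as np

# Same seed: identical draws. Different seed: different data.
a = np.random.default_rng(12).normal(size=5)
b = np.random.default_rng(12).normal(size=5)
c = np.random.default_rng(13).normal(size=5)

print(np.array_equal(a, b))  # True
print(np.array_equal(a, c))  # False
```

If two benchmark runs disagree, confirming that the seed and input data match is the first thing to check before suspecting the pipeline.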
For stricter reproducibility, use pinned dependencies from requirements-lock.txt:

```shell
python -m pip install -r requirements-lock.txt
python -m pip install -e .
```

Contribution workflow and PR checklist: CONTRIBUTING.md
This repository uses a proprietary all-rights-reserved license.
Reuse, copying, modification, or distribution is not permitted without written permission from the copyright holder.
See LICENSE.
If permission is granted for use, citation is required.
Use metadata from CITATION.cff.