Meet OmniCellAgent, an AI Co-Scientist for autonomous single-cell omics deep research. The platform combines an advanced agentic orchestration system with specialized biomedical databases and foundation models to accelerate biomedical discovery. Explore intelligent research automation, transparent step-by-step progress, and rich visual outputs that bring complex analyses to life.
Whether you're exploring disease mechanisms, prioritizing targets, or synthesizing literature and omics data, OmniCellAgent helps you move from questions to insights faster.
Learn more and follow the lab here: https://www.youtube.com/@FuhaiLiAILab
Additional links:
- Lab: https://fuhailiailab.github.io
- GitHub: https://github.com/FuhaiLiAiLab/OmniCellAgent
- Paper: https://www.biorxiv.org/content/10.1101/2025.07.31.667797v1
Demo at https://agent.omni-cells.com (may not always be up due to maintenance and updates).
OmniCellAgent supports multiple protocols for integration with AI agents and development tools:
Integrate OmniCellAgent with Claude Desktop, VS Code, and other MCP-enabled tools.
Quick Start:
# Run MCP server
cd mcp_tools
conda activate a2a-dev
python server.py

Available Tools:
- `search_pubmed` - PubMed literature search with full-text extraction
- `search_web` - Google Custom Search with content extraction
- `search_knowledge_graph` - Neo4j biomedical knowledge graph queries
- `query_scientist_knowledge` - RAG over a specific author's publications
- `analyze_omics_data` - Comprehensive single-cell omics analysis
See mcp_tools/README.md for detailed documentation and Claude Desktop setup.
HTTP-based async protocol for agent-to-agent communication.
Quick Start:
# Start the A2A server (port 8021)
cd fasta2a_service
conda activate a2a-dev
nohup python server.py > server.log 2>&1 &

Key Features:
- Async task processing with status tracking
- Long-running biomedical research workflows (5-30 minutes)
- Full A2A protocol compliance (task submission, polling, artifacts)
See fasta2a_service/README.md for API reference.
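To make the async task flow concrete, here is a minimal polling-loop sketch. The endpoint path and the `status` field name in the comment are assumptions rather than the service's documented API; the fetch function is injectable so the loop can be exercised without a running server.

```python
import time

def poll_task(fetch, task_id, interval_s=5, timeout_s=1800):
    """Poll an A2A task until it completes, fails, or times out.

    `fetch(task_id)` is any callable returning the task as a dict, e.g. a
    wrapper around GET http://localhost:8021/tasks/<id> (hypothetical path).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        task = fetch(task_id)
        if task["status"] in ("completed", "failed"):
            return task
        time.sleep(interval_s)  # long-running workflows: poll, don't block
    raise TimeoutError(f"task {task_id} did not finish in {timeout_s}s")
```

With a real server, `fetch` could simply wrap `requests.get(...).json()`; the generous default timeout reflects the 5-30 minute workflows noted above.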
# Start Neo4j, RAG tools, and microservices
bash scripts/startup.sh
# Test all services are running
bash scripts/test_services.sh
# Stop all services when done
bash scripts/stop_services.sh

# Run all default test cases from scratch
source ~/miniconda3/etc/profile.d/conda.sh && conda activate langgraph-dev
python -m agent.langgraph_agent \
  --query "What are the key dysfunctional genes and pathways in pancreatic ductal adenocarcinoma (PDAC)?" \
  --session-id PDAC-test
python -m agent.langgraph_agent \
  --query "What are the key dysfunctional genes and pathways in Alzheimer's Disease?" \
  --session-id AD-test
python -m agent.langgraph_agent \
  --query "What are the key dysfunctional genes and pathways in Lung adenocarcinoma (LUAD)?" \
  --session-id LungCancer-test
Simple Query (Literature Research):
conda run -n langgraph-dev python agent/simple_magentic_agent.py \
  --query "What are the key therapeutic targets for Alzheimer's Disease?"

Full Analysis Pipeline (with Omic Data):
conda run -n langgraph-dev python agent/simple_magentic_agent.py \
--query "Analyze lung cancer: find relevant genes, perform differential expression analysis, and identify therapeutic targets. Use Omni cell mining agent to do enrichment" \
  --session-id "lung_cancer_analysis"
Results will be saved in webapp/sessions/lung_cancer_analysis/ including:
- Differential expression analysis
- Volcano plots
- Enrichment analysis plots
- Gene lists and pathway information
# Via startup script (recommended - starts all services)
bash scripts/startup.sh
# Access locally at http://localhost:8050
# Public access at https://agent.omni-cells.com
# Or standalone
conda run -n langgraph-dev python webapp/index.py

The Web UI provides:
- Responsive Layout: Auto-adjusts to screen size for optimal viewing
- Session Management: Each conversation creates a unique session ID
- Real-time Progress: See step-by-step agent reasoning and tool calls
- Visualization: Plots and figures from analysis are displayed inline
- Output Storage: All session outputs are saved in webapp/sessions/<session_id>/
# Create conda environment (Python 3.8+ recommended, 3.10 tested)
conda create -n langgraph-dev python=3.10
conda activate langgraph-dev
# Install graphviz (required for KEGG pathway tools)
conda install anaconda::graphviz
# Install Python dependencies
pip install -r requirements.txt --no-deps

Key Libraries: The system requires PyTorch and graph-processing libraries compatible with joint GNN and LLM modeling for OmniCellTOSG integration.
Create environment file (configs/db.env):
# Copy example file
cp configs/db.env.example configs/db.env
# Edit with your credentials
# NEO4J_URI=bolt://localhost:7687
# NEO4J_USER=neo4j
# NEO4J_PASSWORD=your_password
# GOOGLE_API_KEY=your_google_api_key
# OPENAI_API_KEY=your_openai_key

Create paths configuration (configs/paths.yaml):
# Copy example file
cp configs/paths.yaml.example configs/paths.yaml
# Edit paths to point to your local directories
# Key paths to configure:
# - neo4j_path: Path to Neo4j database directory
# - omnicelltosg_root: Path to OmniCellTOSG dataset
# - sessions_base: Where to store analysis sessions

Example paths.yaml structure:
neo4j:
database_path: "/path/to/neo4j-community-2025.03.0"
omnicelltosg:
dataset_root: "/path/to/OmniCellTOSG/CellTOSG_dataset_v2"
checkpoint_dir: "/path/to/checkpoints"
sessions:
base: "./webapp/sessions"
cache:
author_kb: "./cache/author_kb"
  omic_data: "./cache/omic_data"

Install Neo4j (version 5.23+ recommended):
# Follow official instructions for your OS
# https://neo4j.com/docs/operations-manual/current/installation/
# Install required plugins:
# - GenAI plugin: https://neo4j.com/docs/cypher-manual/current/genai-integrations/
# - Graph Data Science library: https://neo4j.com/docs/graph-data-science/current/installation/

Load PrimeKG Dataset:
- Option 1: Run the Jupyter notebook data-loading/stark_prime_neo4j_loading.ipynb
- Option 2: Download the database dump from AWS S3: s3://gds-public-dataset/stark-prime-neo4j523
Start Neo4j:
# Navigate to Neo4j installation directory
cd /path/to/neo4j-community-2025.03.0
# Start in background
nohup bin/neo4j console > logs/neo4j_log.out 2>&1 &
# Verify it's running
curl http://localhost:7474

Download the dataset:
# Option 1: Download from HuggingFace
# Visit: https://huggingface.co/datasets/FuhaiLiAiLab/OmniCellTOSG_Dataset
# Option 2: Use the official repository download script
git clone https://github.com/FuhaiLiAiLab/OmniCellTOSG.git
cd OmniCellTOSG
# Follow download instructions in the repository

Configure dataset path:
# Edit configs/paths.yaml and set:
# omnicelltosg:
#   dataset_root: "/path/to/OmniCellTOSG/CellTOSG_dataset_v2"

Download pre-trained model checkpoints:
# Create checkpoint directory
mkdir -p checkpoints
# Download OmniCell-v1 weights
# Place in checkpoints/ directory to enable inference

Data Loader Configuration: When using OmniCellTOSG in your code:
from tools.omic_tools.data_loader import CellTOSGDataLoader
# Point to your local dataset path
loader = CellTOSGDataLoader(
root='../OmniCellTOSG/CellTOSG_dataset_v2'
)

Pre-training and Fine-tuning (optional):
# Pre-training: Learn topological patterns and interaction mechanisms
python pretrain.py
# Downstream tasks: Disease classification, cell-type identification
python train.py
# Tutorials: Extract cell embeddings
jupyter notebook Tutorial_Cluster_blood.ipynb

R Environment for KEGG Pathway Analysis:
# Install required R packages
cd enrichment
bash install_r_package.sh

Verify all paths are configured:
# Check that all required directories exist
python -c "from utils.path_config import get_path; print('Config OK')"

Each analysis session is stored in its own directory under webapp/sessions/:
webapp/sessions/
├── lung_cancer_analysis/                 # Named session from CLI
│   ├── differential_expression/          # DE analysis results
│   ├── volcano_plots/                    # Volcano plot images
│   ├── enrichment_results/               # Enrichment CSV files
│   ├── enrichment_plots/                 # Enrichment visualizations
│   ├── plots/                            # KEGG pathway plots
│   └── top_genes_by_expression.csv
└── session_20251218_143022_a1b2c3d4/     # Auto-generated UI session
    └── ...
Session ID Formats:
- CLI: Use --session-id "your_name" for custom names
- Web UI: Auto-generated as session_YYYYMMDD_HHMMSS_<random>
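As a sketch of the auto-generated format above, a hypothetical helper (not the webapp's actual code) could produce and validate such IDs like this:

```python
import re
import secrets
from datetime import datetime

def new_session_id():
    """Generate an ID in the session_YYYYMMDD_HHMMSS_<random> style.

    Illustrative only; the Web UI's real generator may differ in the
    length or source of the random suffix.
    """
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"session_{stamp}_{secrets.token_hex(4)}"  # 8 hex chars

# Pattern matching IDs like session_20251218_143022_a1b2c3d4
SESSION_ID_RE = re.compile(r"^session_\d{8}_\d{6}_[0-9a-f]{8}$")
```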
The system runs several microservices that provide different capabilities:
| Service | Port | Description | Test Command |
|---|---|---|---|
| Neo4j | 7474, 7687 | Graph database for biomedical knowledge | curl http://localhost:7474 |
| Scientist RAG | 8000 | Author-specific literature knowledge base | curl http://localhost:8000/health |
| GRetriever | 8001 | Knowledge graph query service | curl http://localhost:8001/health |
| GLiNER | - | Named entity recognition | Process check |
| BioBERT | - | Biomedical text embeddings | Process check |
| Webapp | 8050 | Web interface for the agent | curl http://localhost:8050 |
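The HTTP-checkable entries in the table above can be swept with a small health check. This is a sketch, not project code: the probe is injectable so the sweep runs without a network, and a real probe might wrap `urllib.request.urlopen` with a timeout.

```python
# Service name -> health URL, mirroring the table above
SERVICES = {
    "Neo4j": "http://localhost:7474",
    "Scientist RAG": "http://localhost:8000/health",
    "GRetriever": "http://localhost:8001/health",
    "Webapp": "http://localhost:8050",
}

def check_services(probe, services=SERVICES):
    """Return {service_name: bool} using probe(url) -> bool."""
    return {name: probe(url) for name, url in services.items()}
```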
Service Management:
# Check service status
ps aux | grep python | grep -E "(scientist_tool|gretriever|webapp)"
# View logs
tail -f logs/service-logs/scientist_tool.log
tail -f logs/service-logs/gretriever_service_output.log
# Check GPU usage (for GRetriever)
nvidia-smi

See scripts/README.md for detailed service management documentation.
If you need the specialized OmniCellTOSG tools, download the dataset from https://huggingface.co/datasets/FuhaiLiAiLab/OmniCellTOSG_Dataset or use the download script in https://github.com/FuhaiLiAiLab/OmniCellTOSG, then paste the expression folder path into the config file in the configs folder.
See OmniCellAgent: Towards AI Co-Scientists for Scientific Discovery in Precision Medicine (https://www.biorxiv.org/content/10.1101/2025.07.31.667797v1).
If you use the enrichment analysis, please also cite OmniCellTOSG: https://arxiv.org/abs/2504.02148
Ensure your .env file is in the project root with:
GOOGLE_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here

Load in Python with:
from dotenv import load_dotenv
load_dotenv()

- Verify Neo4j is running: curl http://localhost:7474
- Check that the credentials in configs/db.env match your Neo4j setup
- Ensure ports 7474 and 7687 are not blocked
- Verify the dataset path in configs/paths.yaml points to the correct directory
- Ensure you have downloaded the full CellTOSG_dataset_v2
- Check that the df_all metadata contains the required fields: tissue, tissue_general, disease, cell_type
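A quick way to check the required metadata fields is a small helper like the sketch below. The dict-style records are illustrative; in practice df_all is a DataFrame, so you would check `set(REQUIRED) <= set(df_all.columns)` instead.

```python
# Required metadata fields, per the troubleshooting note above
REQUIRED = {"tissue", "tissue_general", "disease", "cell_type"}

def missing_fields(record):
    """Return the required metadata fields absent from a record dict."""
    return REQUIRED - set(record)
```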
If KEGG pathway visualization fails, ensure graphviz is installed via conda:
conda install anaconda::graphviz

Check service logs:
tail -f logs/service-logs/scientist_tool.log
tail -f logs/service-logs/gretriever_service_output.log

- Memory management: Tool call messages are not preserved in long-running conversations to prevent context overflow (see autogen_agentchat/teams/_group_chat/_magentic_one/_magentic_one_orchestrator.py, lines 487-488)
- Tool call summaries: Summary messages are added to the thread instead of full tool responses (line 493)
- Add autogen_ext.memory.canvas for persistent memory storage
- Implement better context window management for long-running sessions
Use the automated startup script:
# Start all services (Neo4j, RAG tools, microservices)
bash scripts/startup.sh
# Test all services are running
bash scripts/test_services.sh
# Access Web UI at http://localhost:8050

This handles all services automatically. The manual steps below are provided for troubleshooting and understanding the system architecture.
Read these steps to understand what scripts/startup.sh does internally, or to debug service issues.
# Navigate to Neo4j installation directory
cd /path/to/neo4j-community-2025.03.0
# Start in background
nohup bin/neo4j console > logs/neo4j_log.out 2>&1 &
# Verify it's running
curl http://localhost:7474

# Option 1: Foreground
python tools/scientist_rag_tools/scientist_tool.py
# Option 2: Background (recommended)
nohup python tools/scientist_rag_tools/scientist_tool.py > logs/scientist_tool_output.log 2>&1 &
# Verify
curl http://localhost:8000/health

# Option 1: Foreground
python tools/gretriever_tools/gretriever_service.py
# Option 2: Background (recommended)
nohup python tools/gretriever_tools/gretriever_service.py > logs/gretriever_service_output.log 2>&1 &
# Verify
curl http://localhost:8001/health

Command Line Interface:
# Basic query
python agent/simple_magentic_agent.py \
--task "What are the key dysfunctional signaling targets in microglia of AD?" \
--task_id "1" \
--mode magentic > logs/results.txt
# With LangGraph agent (full pipeline)
python -m agent.langgraph_agent \
--query "What are the key dysfunctional genes and pathways in pancreatic ductal adenocarcinoma?" \
  --session-id PDAC-test

Web UI:
# Start web interface
python webapp/index.py
# Access at http://localhost:8050
# Example query: "What are the key dysfunctional signaling targets in microglia of AD, based on the internal database?"

# Stop all services
bash scripts/stop_services.sh

Many modules include testing code in their __main__ block for easy standalone testing:
# Test individual tools directly
python tools/scientist_rag_tools/scientist_tool.py # Starts RAG service
python tools/gretriever_tools/gretriever_service.py # Starts GRetriever service
python tools/omic_tools/omic_fetch_analysis_workflow.py # Test omic workflow
python tools/pubmed_tools/query_pubmed_tool.py # Test PubMed search
python tools/google_search_tools/google_search_w3m.py # Test Google search
# Test utilities
python utils/path_config.py # Verify path configuration
python tools/omic_tools/ner_tool.py                       # Test NER extraction

This makes it easy to isolate and debug specific components without running the full agent system.
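The standalone-testing pattern above can be shown in miniature. The function here is a toy stand-in (not the real NER logic): the module exposes a function, and a guard at the bottom exercises it when the file is run directly, without side effects on import.

```python
def extract_gene_symbols(text):
    """Toy stand-in for a tool function; real NER is far more involved."""
    known = {"TP53", "EGFR", "KRAS"}  # illustrative gene list
    return sorted(w for w in text.replace(",", " ").split() if w in known)

if __name__ == "__main__":
    # Standalone smoke test, mirroring the __main__ pattern described above
    print(extract_gene_symbols("KRAS and TP53 are mutated in PDAC"))
```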
Three query modes: hyper, hyper-lite, and naive.

In "hyper" mode, two LLM calls are made: the first extracts keywords from the query using the keywords_extraction prompt, and the second generates the final response using the retrieved context.

In "hyper-lite" mode, similar to hyper mode, two LLM calls are made.

In "naive" mode, only one LLM call is made to generate the final response.

The llm_model_max_async parameter (default: 16) controls how many concurrent LLM calls can be processed at once. This means:
- Naive mode: the system can handle up to 16 concurrent user queries (each makes 1 LLM call)
- Hyper/hyper-lite modes: the system can handle up to 8 concurrent user queries (each makes 2 LLM calls)

When QueryParam(only_need_context=True) is set in HyperRAG, the first call is still made (it extracts both low-level keywords, i.e. entities, and high-level keywords), but the second call is skipped.
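The concurrency arithmetic above can be captured in a tiny helper; the mode-to-calls mapping mirrors the description (hyper and hyper-lite make 2 LLM calls per query, naive makes 1):

```python
# LLM calls made per user query, by query mode
CALLS_PER_QUERY = {"hyper": 2, "hyper-lite": 2, "naive": 1}

def max_concurrent_queries(mode, llm_model_max_async=16):
    """How many user queries fit in the concurrent-LLM-call budget."""
    return llm_model_max_async // CALLS_PER_QUERY[mode]
```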
-----Entities-----
id, entity, type, description, additional properties, rank

df_all in CellTOSGSubsetBuilder stores the metadata fields (e.g., tissue, tissue_general, disease, cell_type).
The omic analysis tool follows a multi-stage pipeline architecture:
simple_magentic_agent.py
└── create_omic_analysis_tool(session_id)
    └── omic_analysis_tool()   [FunctionTool wrapper for LLM]
        └── calls: _omic_workflow(session_dir, disease, ...)
            │
            ▼
omic_fetch_analysis_workflow.py
└── omic_fetch_analysis_workflow()
    ├── STEP 1: NER extraction (ner_tool.py)
    ├── STEP 2: Data retrieval (CellTOSGDataLoader)
    ├── STEP 3: Compute top genes by expression
    ├── STEP 4: Differential Expression Analysis (omic_analysis.py)
    │   └── omic_analysis(disease_name, data_dict, ...)
    │       ├── DE analysis (Mann-Whitney U test)
    │       ├── Volcano plots (matplotlib + plotly)
    │       │   └── saves to: volcano_plots/
    │       │       ├── *.png  (static)
    │       │       └── *.html (interactive)
    │       ├── Enrichment analysis (Enrichr API)
    │       │   └── saves to: enrichment_results/
    │       └── Enrichment plots (matplotlib)
    │           └── saves to: enrichment_plots/
    └── STEP 5: KEGG Pathway Plotting (R script)
        └── subprocess: Rscript enrichment/kegg.R
            └── saves to: plots/
                ├── kegg_dotplot.png
                ├── kegg_dotplot.html (interactive)
                ├── pathway_combined_plot.png
                └── pathway_combined_plot.html (interactive)
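The DE step (STEP 4) uses a Mann-Whitney U test. Below is a minimal pure-Python sketch with a normal-approximation p-value (tie correction for the variance omitted); the real pipeline likely relies on scipy.stats.mannwhitneyu plus multiple-testing correction, so treat this only as an illustration of the statistic.

```python
from math import erf, sqrt

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    # Rank all observations together, averaging ranks for ties
    combined = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie block
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[:n1])                 # rank sum of the first group
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p-value
    return u1, p
```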
webapp/sessions/<session_id>/
├── differential_expression/             # From omic_analysis.py
│   ├── unpaired_differential_expression_results.csv
│   ├── significant_genes_by_fdr.csv
│   ├── significant_genes_by_fc.csv
│   ├── significant_upregulated_genes.csv
│   └── significant_downregulated_genes.csv
├── volcano_plots/                       # From omic_analysis.py
│   ├── volcano_plot.png
│   ├── volcano_plot.html                # Interactive (plotly)
│   ├── volcano_plot_permissive.png
│   └── volcano_plot_permissive.html     # Interactive (plotly)
├── enrichment_results/                  # From omic_analysis.py (Enrichr API)
│   ├── <disease>_all_regulated/
│   ├── <disease>_up_regulated/
│   ├── <disease>_down_regulated/
│   └── enrichment_plots/                # Python matplotlib plots
├── plots/                               # From R script (kegg.R)
│   ├── kegg_dotplot.png
│   ├── kegg_dotplot.html
│   ├── pathway_combined_plot.png
│   └── pathway_combined_plot.html
└── top_genes_by_expression.csv          # From workflow
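Given the layout above, session artifacts can be gathered for display with a short directory walk. This is a hypothetical sketch; the Web UI's actual artifact-loading logic may differ.

```python
from pathlib import Path

def collect_artifacts(session_dir, exts=(".png", ".html", ".csv")):
    """Return relative paths of displayable artifacts under a session dir."""
    root = Path(session_dir)
    return sorted(
        str(p.relative_to(root)) for p in root.rglob("*")
        if p.is_file() and p.suffix in exts
    )
```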
| File | Purpose |
|---|---|
| `agent/simple_magentic_agent.py` | Main agent with FunctionTool wrapper |
| `tools/omic_tools/omic_fetch_analysis_workflow.py` | Orchestrates the full pipeline |
| `tools/omic_tools/omic_analysis.py` | Core DE analysis and enrichment functions |
| `tools/omic_tools/ner_tool.py` | Named entity recognition for queries |
| `tools/omic_tools/subprocess_r.py` | R script execution helper |
| `enrichment/kegg.R` | KEGG pathway visualization (R/plotly) |