Ask questions. Explore cancer publications. Discover insights.
⚠️ This project is a work in progress, developed for the cBioPortal Hackathon 2025.
cBioPubChat is an AI-powered chatbot designed to help researchers, clinicians, and enthusiasts interact with publications from cBioPortal studies. By combining vector search with large language models, the chatbot can:
- Retrieve the most relevant studies based on a user question
- Summarize the key findings from those studies
- Provide direct links to the studies in cBioPortal
Example question: “Which pathways are most commonly altered in ovarian cancer?”
Given a question like this, cBioPubChat will:
- Search all study publication text using embedding similarity
- Summarize relevant findings with an LLM
- Provide links to those studies in cBioPortal
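The retrieval and linking steps above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the bag-of-words `embed` function stands in for a real embedding model, the `STUDIES` dictionary and its study IDs are made-up placeholders, and the URL pattern in `study_link` is an assumption about how cBioPortal study pages are addressed. The LLM summarization step is left out.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: IDs and text are illustrative placeholders only.
STUDIES = {
    "study_a": "Ovarian cancer samples show frequent alterations in the homologous recombination pathway.",
    "study_b": "Lung adenocarcinoma is driven by alterations in receptor tyrosine kinase signaling.",
    "study_c": "Pathway analysis of ovarian tumors highlights altered cell cycle signaling.",
}


def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, studies, k=2):
    """Return the IDs of the k studies most similar to the question."""
    query_vec = embed(question)
    scored = sorted(
        ((cosine(query_vec, embed(text)), study_id) for study_id, text in studies.items()),
        reverse=True,
    )
    return [study_id for score, study_id in scored[:k] if score > 0]


def study_link(study_id):
    """Build a study URL (assumed pattern, not confirmed by this README)."""
    return f"https://www.cbioportal.org/study/summary?id={study_id}"


if __name__ == "__main__":
    question = "Which pathways are most commonly altered in ovarian cancer?"
    for study_id in retrieve(question, STUDIES):
        print(study_id, study_link(study_id))
```

In the actual stack, the similarity search would be delegated to ChromaDB and the retrieved text passed to an LLM for summarization; the ranking idea is the same.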
- Chainlit – Interactive chat UI
- LangChain – LLM pipeline & orchestration
- ChromaDB – Vector store for publication embeddings
- Python 3.10+
- LLMs – OpenAI or other LangChain-compatible providers
cBioPubChat/
├── app/                          # Chainlit app frontend and config
│   ├── main.py                   # Chainlit entrypoint (UI + LangChain agent)
│   └── config.toml               # Chainlit config (title, theme, etc.)
├── backend/                      # Core logic: embeddings, indexing, QA
│   ├── ingest/
│   │   ├── parse_publications.py # PDF, HTML, or plain text loader
│   │   ├── embed_and_store.py    # Convert text → embeddings → store in ChromaDB
│   │   └── __init__.py
│   ├── qa/
│   │   ├── query_engine.py       # Embedding search + summarization pipeline
│   │   └── __init__.py
│   └── __init__.py
├── data/                         # Raw and processed publication data
│   ├── raw/                      # Raw PDFs or metadata
│   └── processed/                # Text chunks or cleaned files
├── chroma/                       # Local ChromaDB index directory (auto-created)
├── notebooks/                    # (Optional) Jupyter notebooks for exploration
│   └── analysis.ipynb
├── tests/                        # Unit and integration tests
│   ├── test_ingest.py
│   ├── test_query.py
│   └── ...
├── scripts/                      # Convenience scripts (e.g., bootstrap)
│   └── run_ingest.sh
├── .env                          # API keys, secrets (ignored by git)
├── .gitignore
├── README.md
└── requirements.txt              # Pip dependencies
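The layout above mentions text chunks under `data/processed/` and a text → embeddings step in `embed_and_store.py`. As a rough sketch of what the chunking part might look like, the hypothetical helper below splits text into overlapping word windows. The function name `chunk_text` and the default sizes are assumptions for illustration; the project may well use a LangChain text splitter instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most chunk_size words.

    Hypothetical helper: name, word-based splitting, and defaults are
    illustrative assumptions, not taken from the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some words twice.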