VideoMind: Dynamic Episodic Memory Engine for Video QA & Counterfactual Reasoning

🚀 Overview

VideoMind is a system for deep video understanding that lets users ask both factual and counterfactual questions about an uploaded video. It combines vision models (BLIP, CLIP, YOLOv8), hierarchical clustering, temporal graph construction, and LLM-powered reasoning to produce context-aware answers.

🧩 Key Features

Automatic frame extraction and BLIP captioning
CLIP-based hierarchical clustering in time
YOLOv8-based object detection and object-centric clustering
Temporal graph construction for context reasoning
LLM-powered (Ollama/Mistral) summarization and QA
Counterfactual reasoning grounded in video context
Interactive Gradio UI for seamless user experience

🛠️ How It Works

Frame Extraction & Captioning
- Video is split into frames at a chosen FPS.
- Each frame is captioned using BLIP.
CLIP Embedding & Clustering
- Frames are embedded using CLIP.
- Hierarchical clustering is performed in the time domain.
- YOLOv8 detects objects, enabling object-based clustering.
Temporal Graph Construction
- Both time and object clusters are represented as nodes in a temporal graph.
- Edges encode temporal relationships.
Summarization
- Each time and object cluster is summarized using an LLM (Ollama/Mistral).
Interactive QA & Counterfactual Reasoning
- Users ask questions via the Gradio UI.
- The system detects relevant time/object context, gathers summaries, and formulates a prompt for the LLM.
- For counterfactuals, the LLM is restricted to reasoning only about the background, objects, and events up to the point of change.
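
The counterfactual restriction in the last step can be illustrated with a minimal sketch. It assumes each time cluster has already been summarized and carries a start and end time; `ClusterSummary`, `context_before`, `build_counterfactual_prompt`, and the prompt wording are hypothetical names chosen for illustration, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class ClusterSummary:
    """Hypothetical record for one summarized time cluster."""
    start_s: float  # cluster start time in seconds
    end_s: float    # cluster end time in seconds
    text: str       # LLM-written summary of the cluster


def context_before(summaries, change_point_s):
    """Keep only clusters that start before the hypothetical change point."""
    ordered = sorted(summaries, key=lambda s: s.start_s)
    return [s for s in ordered if s.start_s < change_point_s]


def build_counterfactual_prompt(summaries, change_point_s, question):
    """Assemble a prompt that exposes only pre-change context to the LLM."""
    kept = context_before(summaries, change_point_s)
    lines = [f"[{s.start_s:.0f}s-{s.end_s:.0f}s] {s.text}" for s in kept]
    return (
        "Use only the video context below, which ends at the point of change.\n"
        + "\n".join(lines)
        + f"\nCounterfactual question: {question}"
    )


# Toy summaries (made up for this example).
summaries = [
    ClusterSummary(0, 10, "A car approaches an intersection."),
    ClusterSummary(10, 20, "The car turns right and parks."),
]
prompt = build_counterfactual_prompt(summaries, 10, "What if the car turned left?")
```

Here the second summary is dropped because it describes what happened after the change point, so the LLM cannot lean on the real outcome when answering the "what if" question.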

🖥️ Usage

1. Install Requirements

pip install -r requirements.txt

You also need to install Ollama and pull the Mistral model:

ollama pull mistral:latest

2. Run the Gradio App

python scripts/gradio_app.py

Open the provided local URL in your browser.
Upload a video (.mp4, .mov, .avi, .mkv).
Click Process Video.
Ask questions about the video or try counterfactuals (e.g., "What if the car turned left?").

3. Command-Line Interface (Optional)

You can also use the CLI version:

python scripts/cli_main.py

🧠 Novelty & Architecture

Hierarchical Temporal Clustering: Recursive time-based clustering enables multi-scale temporal reasoning.
Object-centric Episodic Memory: Object-based clustering and summaries support object-centric queries and counterfactuals.
Temporal Graph: Enables efficient context window gathering for any frame or cluster.
LLM-Guided Summarization & QA: Answers are generated from the extracted context (captions, cluster summaries, detected objects), which is intended to reduce hallucination.
Counterfactual Reasoning: The system can answer "what if" questions by leveraging only the available context and objects.
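
A rough sketch of how recursive time-based clustering and context-window gathering could fit together is shown below. It is an illustration under stated assumptions, not the repository's implementation: the split rule (cut at the weakest similarity between adjacent frames), the `threshold` and `min_len` parameters, and all function names are hypothetical, and toy 2-D vectors stand in for CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_in_time(embeddings, start, end, threshold=0.8, min_len=2):
    """Recursively split frames [start, end) at the weakest adjacent similarity.

    Returns a nested dict; each nesting level is one temporal scale.
    """
    node = {"span": (start, end), "children": []}
    if end - start < 2 * min_len:
        return node
    # Candidate boundaries keep at least min_len frames on each side.
    candidates = range(start + min_len, end - min_len + 1)
    cut = min(candidates, key=lambda i: cosine(embeddings[i - 1], embeddings[i]))
    if cosine(embeddings[cut - 1], embeddings[cut]) >= threshold:
        return node  # segment is already coherent, so stop recursing
    node["children"] = [
        split_in_time(embeddings, start, cut, threshold, min_len),
        split_in_time(embeddings, cut, end, threshold, min_len),
    ]
    return node


def leaves(node):
    """Collect leaf spans in temporal order."""
    if not node["children"]:
        return [node["span"]]
    return [span for child in node["children"] for span in leaves(child)]


def context_window(spans, index, radius=1):
    """Return the cluster at `index` plus its temporal neighbours.

    The temporal edges are implicit here: consecutive list entries are
    treated as linked by a "next" relation.
    """
    return spans[max(0, index - radius): index + radius + 1]


# Toy 2-D "embeddings": three scenes of two frames each.
frames = [(1, 0), (1, 0.1), (0, 1), (0.1, 1), (-1, 0), (-1, 0.1)]
tree = split_in_time(frames, 0, len(frames))
spans = leaves(tree)
```

Splitting only at time boundaries keeps every cluster contiguous in the video, which is what lets a simple neighbour lookup serve as the context window for any cluster.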

Authored by Atharva Date
