A RAG system for semantic search and question answering over Marigold UI Component documentation using vector embeddings and PostgreSQL.
This project implements a complete pipeline for converting Marigold UI Component documentation into a queryable knowledge base:
- Parse - Convert MDX files to structured JSON with AST builder
- Chunk - Split documents into semantic sections (hierarchical + flat)
- Embed - Generate vector embeddings using Ollama
- Store - Index embeddings in PostgreSQL with pgvector
- Query - Semantic search via MCP server for Claude/VS Code integration
Data flows through two phases:
Phase 1: Build (ETL)
- Fetch Marigold documentation from GitHub
- Parse MDX to structured JSON with document hierarchy
- Split into semantic chunks (with parent relationships) and flat chunks
- Generate embeddings using Ollama
- Store in PostgreSQL with pgvector indexes
Phase 2: Runtime (Query)
- User query embedded by Ollama
- Vector similarity search finds matching chunks
- Parent documents retrieved for context
- Results formatted and returned to client
See diagrams/architecture.png for a detailed sequence diagram.
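To make the runtime phase concrete, here is a minimal sketch of the query path, assuming requests and psycopg2 are installed and using the default credentials from .env.example. The semantic_chunks table and content column are illustrative names, not the project's actual schema, and parent-document retrieval is omitted for brevity.

```python
import psycopg2
import requests

OLLAMA_URL = "http://localhost:11434/api/embed"  # host-side equivalent of OLLAMA_URL in .env
MODEL = "nomic-embed-text"

def embed(text: str) -> list[float]:
    """Embed a query string with Ollama's /api/embed endpoint."""
    resp = requests.post(OLLAMA_URL, json={"model": MODEL, "input": text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embeddings"][0]

def search(query: str, limit: int = 5):
    """Cosine-distance search over a hypothetical semantic_chunks table."""
    vector = "[" + ",".join(str(v) for v in embed(query)) + "]"  # pgvector literal
    conn = psycopg2.connect(host="localhost", port=5432, dbname="marigold_rag",
                            user="postgres", password="postgres")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, embedding <=> %s::vector AS distance
            FROM semantic_chunks
            ORDER BY distance
            LIMIT %s
            """,
            (vector, limit),
        )
        rows = cur.fetchall()
    conn.close()
    return rows

for content, distance in search("How do I create a Button with Marigold?"):
    print(f"{distance:.3f}  {content[:80]}")
```

Smaller distances mean closer matches, since <=> is pgvector's cosine distance operator.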
Each directory contains detailed documentation:
| Directory | Purpose |
|---|---|
| etl/parser/ | TypeScript AST parser for MDX documentation |
| etl/pipeline/ | Python Jupyter notebooks for chunking, embedding, and database storage |
| etl/data/ | Raw docs, processed JSON, generated embeddings |
| mcp-server/ | MCP service for semantic search queries |
| db/ | PostgreSQL + pgvector database service |
| ollama/ | Embedding model service (nomic-embed-text) |
| diagrams/ | Architecture and flow diagrams |
See each directory's README.md for details.
- Docker & Docker Compose
- Python 3.11+
- Node.js & pnpm (for the MDX parser)
- Git
- Set up Python environment

```bash
# Clone repository
git clone https://github.com/marigold-ui/ai-assistant.git
cd ai-assistant

# Create virtual environment at project root
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Copy environment variables
cp .env.example .env
```

- Parse documentation

```bash
cd etl/parser
pnpm install
pnpm run dev  # Downloads and parses Marigold MDX files to JSON
```

- Start services

```bash
# From project root
docker-compose up marigold-ollama marigold-postgres --build -d
```

- Run ETL pipeline

```bash
cd etl/pipeline
jupyter notebook

# Execute notebooks in order:
# - 01_chunker.ipynb  (chunks documentation)
# - 02_embedder.ipynb (generates embeddings)
# - 03_database.ipynb (stores in PostgreSQL)
```

- Start MCP server

```bash
# Recommended
docker compose up marigold-mcp --build -d

# Or manually
cd mcp-server
python server.py
```

Services available at:

- PostgreSQL: localhost:5432
- pgweb (Database UI): http://localhost:8081
- Ollama: localhost:11434
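Before running the notebooks, you can optionally confirm both services are reachable with a quick check like the one below; it assumes requests and psycopg2 are installed in the virtual environment and uses the default credentials from .env.example.

```python
import psycopg2
import requests

# Ollama answers on its API port with a plain status message
print(requests.get("http://localhost:11434", timeout=5).text)  # "Ollama is running"

# PostgreSQL should accept the default credentials, with pgvector installed
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="marigold_rag",
    user="postgres", password="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT extname FROM pg_extension WHERE extname = 'vector';")
    print("pgvector installed:", cur.fetchone() is not None)
conn.close()
```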
The MCP server works with any MCP-compatible client. Example setup for VS Code:

VS Code Agent

- Press Ctrl+Shift+P and search for "MCP: Add Server"
- Select "stdio" as the connection type
- Enter the server configuration (JSON format):

```json
{
  "servers": {
    "marigold-mcp": {
      "type": "stdio",
      "command": "docker",
      "args": ["exec", "-i", "marigold-mcp", "python3", "server.py"],
      "cwd": "/path/to/ai-assistant/mcp-server"
    }
  }
}
```

- The tools will appear in VS Code Agent:
  - marigold_documentation_lookup - Semantic search with hierarchical context
  - marigold_documentation_lookup_primitive - Flat structure baseline
- Start using in chat. For example, ask questions like:
  - "How do I create a Button with Marigold?"
db/ - Vector database for storing embeddings.
- Stores semantic and primitive chunks
- 768-dimensional vectors
- Runs on port 5432
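As a rough mental model only (the real schema is created by the ETL notebooks, so these table and column names are assumptions), the two chunk types map onto tables whose embedding column uses pgvector's 768-dimensional vector type, with semantic chunks keeping a pointer to their parent:

```python
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="marigold_rag",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        -- Hierarchical chunks keep a pointer to their parent document/section
        CREATE TABLE IF NOT EXISTS semantic_chunks (
            id        SERIAL PRIMARY KEY,
            parent_id INTEGER,
            content   TEXT NOT NULL,
            embedding VECTOR(768)          -- nomic-embed-text output size
        );
        -- Flat, fixed-size chunks used as the baseline
        CREATE TABLE IF NOT EXISTS primitive_chunks (
            id        SERIAL PRIMARY KEY,
            content   TEXT NOT NULL,
            embedding VECTOR(768)
        );
    """)
conn.close()
```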
See db/README.md for details.
ollama/ - Local embedding service using the nomic-embed-text model.
- Generates embeddings
- Runs on port 11434
See ollama/README.md for details.
etl/ - Data processing layer with three main stages:
Parser - Extract documentation
- Fetches Marigold UI MDX from GitHub
- Converts to structured JSON with AST
- Preserves heading hierarchy and demo references
Chunking - Split and organize content
- Semantic chunking: respects document sections, maintains hierarchy
- Primitive chunking: fixed 500-token chunks (baseline comparison)
- Both preserve demo code and image references
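For intuition, the primitive strategy is essentially a sliding token window, as in the sketch below. It approximates tokens with whitespace-separated words (the notebook may use a real tokenizer), and the parameter values mirror those in 01_chunker.ipynb.

```python
def primitive_chunks(text: str, max_chunk_tokens: int = 500, overlap_tokens: int = 50):
    """Naive fixed-size chunking: a sliding window with overlap.

    Tokens are approximated by whitespace splitting; the windowing logic
    is the same idea regardless of the tokenizer used.
    """
    tokens = text.split()
    step = max_chunk_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_chunk_tokens]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + max_chunk_tokens >= len(tokens):
            break  # last window already reached the end of the document
    return chunks
```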
Embedding & Storage - Generate vectors and index
- Generates embeddings for all chunks
- Stores in PostgreSQL with pgvector indexing
- Creates IVFFlat indexes for fast similarity search
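The index creation mentioned above comes down to one statement per chunk table, roughly as below; the table name, the cosine operator class, and the lists value are illustrative assumptions rather than values taken from 03_database.ipynb.

```python
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="marigold_rag",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    # IVFFlat clusters vectors into lists and probes only the nearest lists
    # at query time, trading a little recall for much faster search.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_semantic_chunks_embedding
        ON semantic_chunks
        USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100);
    """)
conn.close()
```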
See etl/pipeline/README.md for details.
mcp-server/ - Model Context Protocol server providing two semantic search tools:
- marigold_documentation_lookup - Semantic search with hierarchical context
- marigold_documentation_lookup_primitive - Flat structure baseline
Integrates with Claude, VS Code, and other MCP-compatible clients.
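For orientation, a stdio tool like the ones above can be registered with the MCP Python SDK's FastMCP roughly as follows. This is a sketch, not the project's actual server.py, and search_semantic stands in for the real pgvector lookup in db.py.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("marigold-mcp")

def search_semantic(query: str, limit: int = 5) -> list[str]:
    """Placeholder for the real pgvector similarity search in db.py."""
    return [f"(no database wired up in this sketch) query was: {query}"]

@mcp.tool()
def marigold_documentation_lookup(query: str) -> str:
    """Semantic search over Marigold docs with hierarchical context."""
    return "\n\n".join(search_semantic(query, limit=5))

if __name__ == "__main__":
    # stdio transport matches the "docker exec -i ... python3 server.py" client config
    mcp.run(transport="stdio")
```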
See mcp-server/README.md for details.
Environment variables (from .env):

```bash
# ==============================================
# PostgreSQL Database
# ==============================================
DB_HOST=marigold-postgres
DB_PORT=5432
DB_NAME=marigold_rag
DB_USER=postgres
DB_PASSWORD=postgres

# ==============================================
# pgweb (Database UI)
# ==============================================
PGWEB_PORT=8081

# ==============================================
# Ollama (Embedding Model)
# ==============================================
OLLAMA_KEEP_ALIVE=10m
OLLAMA_URL="http://marigold-ollama:11434/api/embed"
OLLAMA_MODEL=nomic-embed-text
```

Chunking parameters, located in the pipeline notebooks:

```python
# 01_chunker.ipynb
max_chunk_tokens = 500  # Maximum tokens per chunk
overlap_tokens = 50     # Overlap between chunks
```

Search parameters, located in mcp-server/db.py:

```python
# Search parameters
limit = 5  # Results to return
```

Two chunking strategies are available for evaluation:
- Semantic (Recommended): Respects document structure, maintains hierarchy
- Primitive (Baseline): Fixed tokens, shows limitations of naive chunking
Query the same question with both tools to compare results.
Module not found errors

- Ensure .venv is activated: source .venv/bin/activate
- Install requirements: pip install -r requirements.txt

Database connection failed

- Check PostgreSQL is running: docker-compose up marigold-postgres
- Verify credentials in .env match docker-compose.yml

Ollama connection failed

- Start Ollama: docker-compose up marigold-ollama
- Verify model: ollama list (should include nomic-embed-text)
MIT License. See LICENSE for details.