Architectural DNA 🧬

An MCP (Model Context Protocol) server that extracts, analyzes, and stores code patterns from your GitHub repositories, so an AI assistant can scaffold new projects from your team's existing architectural patterns.

What It Does

Architectural DNA helps you:

Extract Code Patterns - Automatically parses your GitHub repositories to identify reusable code patterns
Analyze with AI - Uses Google Gemini to understand and categorize patterns (design patterns, best practices, utilities)
Store in Vector DB - Indexes patterns in Qdrant for fast semantic search
Generate Projects - Scaffolds new projects based on your existing patterns and best practices

Features

🔍 Multi-language Support - Python, Java, JavaScript/TypeScript, Go
🤖 LLM-Powered Analysis - Intelligent pattern recognition and quality scoring
📊 Vector Search - Semantic search across all your code patterns
🏗️ Smart Scaffolding - Generate new projects that follow your team's conventions
🔌 MCP Integration - Works with any MCP-compatible AI client (Claude Desktop, IDEs)

Architecture

GitHub Repos → Pattern Extraction (AST) → LLM Analysis → Vector DB (Qdrant) → RAG Scaffolding

The system implements a complete RAG (Retrieval-Augmented Generation) pipeline:

Extraction: Parses code using tree-sitter AST parsers
Analysis: Google Gemini identifies patterns and assigns quality scores
Storage: Qdrant vector database with configurable code-optimized embeddings
Generation: LLM generates new projects using relevant patterns as context

Advanced Features

Code-Optimized Embeddings

The system uses code-specific embedding models that understand programming syntax and semantics better than general-purpose models:

Supported Models:

jinaai/jina-embeddings-v2-base-code (768d) - Recommended for code
BAAI/bge-base-en-v1.5 (768d) - Good all-around performance
BAAI/bge-small-en-v1.5 (384d) - Lightweight and fast
nomic-ai/nomic-embed-text-v1.5 (768d) - Newest high-quality model
sentence-transformers/all-MiniLM-L6-v2 (384d) - Fast inference

Smart Code Preprocessing:

Normalizes whitespace while preserving code structure
Retains comments and docstrings for better semantic understanding
Preserves code-specific tokens (identifiers, keywords)

Intelligent Chunking:

Respects code structure (functions, classes, modules)
Configurable chunk size with overlap
Prevents splitting logical code blocks

Configure in config.yaml:

embeddings:
  provider: "fastembed"
  model: "jinaai/jina-embeddings-v2-base-code"
  chunking:
    enabled: true
    max_chunk_size: 512
    chunk_overlap: 50
    strategy: "smart"
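
To make the interaction between max_chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping windows. It is an illustration only: `chunk_lines` is a hypothetical helper, it counts sizes in lines (the unit the project uses is not stated here), and it does not respect function or class boundaries the way the "smart" strategy is described as doing.

```python
def chunk_lines(lines, max_chunk_size=512, chunk_overlap=50):
    """Split a list of lines into windows that share `chunk_overlap` lines.

    Hypothetical helper for illustration; not the project's chunker.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= chunk_overlap < max_chunk_size:
        raise ValueError("chunk_overlap must be in [0, max_chunk_size)")

    step = max_chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(lines):
        chunks.append(lines[start:start + max_chunk_size])
        if start + max_chunk_size >= len(lines):
            break  # the last window already reaches the end of the input
        start += step
    return chunks
```

With the defaults shown in the config, each window would advance by 462 lines and repeat the last 50 lines of the previous window.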

Hybrid Search

Combines semantic (vector) search with keyword matching for more accurate results:

Semantic Search (70%): Finds patterns with similar meaning/purpose
Keyword Search (30%): Ensures exact term matches are prioritized
Automatic Reranking: Intelligently combines both scores

This means queries like "retry decorator" will find:

Patterns with "retry" and "decorator" keywords (exact match)
Patterns about "error handling" and "resilience" (semantic similarity)
Ranked by combined relevance

Configure in config.yaml:

search:
  hybrid_enabled: true
  semantic_weight: 0.7
  keyword_weight: 0.3
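
One plausible way to read these weights is as a weighted sum of two scores in the range 0 to 1. The sketch below assumes exactly that; the project's actual reranking logic is not shown in this README, and `keyword_overlap`, `hybrid_score`, and `rerank` are hypothetical names.

```python
def keyword_overlap(query, text):
    """Fraction of query terms that appear as whole words in text."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for term in terms if term in words) / len(terms)


def hybrid_score(semantic, keyword, semantic_weight=0.7, keyword_weight=0.3):
    """Weighted sum of a semantic score and a keyword score (assumed formula)."""
    return semantic_weight * semantic + keyword_weight * keyword


def rerank(query, candidates, semantic_weight=0.7, keyword_weight=0.3):
    """Sort (text, semantic_score) pairs by combined score, best first."""
    scored = [
        (
            hybrid_score(
                semantic,
                keyword_overlap(query, text),
                semantic_weight,
                keyword_weight,
            ),
            text,
        )
        for text, semantic in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Under this assumption, a pattern with a lower semantic score can still outrank a semantically closer one when it contains every query term.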

Get embedding info:

# Via MCP tool
get_embedding_info()
# Returns current model, vector size, and configuration

GitHub API Caching

The system includes an intelligent caching layer for GitHub API responses to reduce API calls and improve performance:

Cache Types and TTLs:

Repository List (5 min): Cached since repo lists change infrequently
File Tree (10 min): Directory structure is relatively stable
File Content (1 hour): Content by SHA is immutable, safe to cache longer

Features:

LRU Eviction: Automatically removes least-recently-used entries when cache is full
Disk Persistence: Optionally saves cache to disk for cross-session persistence
Per-Request Control: Each API method accepts use_cache=False to bypass cache
Selective Invalidation: Clear cache for specific repos or all at once

Configure in config.yaml:

github:
  cache:
    enabled: true
    ttl_repo_list: 300      # 5 minutes
    ttl_file_tree: 600      # 10 minutes
    ttl_file_content: 3600  # 1 hour
    max_size: 1000          # Maximum cache entries
    cache_dir: ".github_cache"  # null to disable disk caching
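
The combination of per-entry TTLs and LRU eviction can be sketched with the standard library. This is a simplified, in-memory illustration under stated assumptions, not the project's cache: the class name `TTLCache` and its methods are hypothetical, and disk persistence is left out.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-memory cache with per-entry TTL and least-recently-used eviction."""

    def __init__(self, max_size=1000, clock=time.monotonic):
        self.max_size = max_size
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value, ttl):
        self._entries[key] = (self._clock() + ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

A caller would pass the TTL that matches the kind of data being stored, for example 300 seconds for a repository list and 3600 seconds for file content.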

Benefits:

Faster re-syncs when re-processing repositories
Reduced GitHub API rate limit consumption
Improved performance for large codebases
Persistent cache survives container restarts (Docker volume)

Installation

Choose your deployment method:

🐳 Option 1: Docker (Recommended)

Easiest way to get started! Runs as a standalone SSE server.

# Clone and start
git clone https://github.com/pershai/architectural-dna.git
cd architectural-dna
docker-compose up -d

The server runs at: http://localhost:8080/sse

Connect your AI assistant - See MCP_SETUP.md for:

🟣 Cursor
🔵 Gemini Code Assist / Antigravity
🟢 Windsurf / Cascade
🟠 Claude Desktop
🔴 VS Code Continue

Benefits:

✅ Zero dependency management
✅ No local file paths needed
✅ Works with any MCP client
✅ Credentials via headers
✅ One-command deployment

🐍 Option 2: Local Python Installation

Prerequisites:

Python 3.11+
GitHub Personal Access Token
Google Gemini API Key
Qdrant (local or cloud)

Setup

Clone the repository

git clone https://github.com/pershai/architectural-dna.git
cd architectural-dna

Create virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

For reproducible installations (recommended):

pip install -r requirements.lock

Or to install latest compatible versions:

pip install -r requirements.txt

Configure environment variables

Copy .env.example to .env and fill in your credentials:

cp .env.example .env

Edit .env:

# GitHub Integration
GITHUB_TOKEN=ghp_your_github_token_here

# Google Gemini LLM
GEMINI_API_KEY=your_gemini_api_key_here

# Qdrant Vector Database (optional if running locally)
QDRANT_URL=http://localhost:6333
# Leave empty for local instance
QDRANT_API_KEY=

Getting your tokens:

GitHub: https://github.com/settings/tokens (requires repo scope)
Gemini: https://aistudio.google.com/app/apikey
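
A quick way to catch a missing credential before starting the server is to validate the environment up front. The following is a minimal sketch, assuming the variables from .env have already been exported into the process environment; `load_settings` and the returned dictionary keys are hypothetical and not part of the project.

```python
import os

# Variables the README lists as required credentials.
REQUIRED = ("GITHUB_TOKEN", "GEMINI_API_KEY")


def load_settings(env=None):
    """Return a settings dict, raising if a required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {
        "github_token": env["GITHUB_TOKEN"],
        "gemini_api_key": env["GEMINI_API_KEY"],
        # Fall back to the local Qdrant address used elsewhere in this README.
        "qdrant_url": env.get("QDRANT_URL") or "http://localhost:6333",
        # An empty key means a local, unauthenticated instance.
        "qdrant_api_key": env.get("QDRANT_API_KEY") or None,
    }
```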

Start Qdrant (if running locally)

Using Docker:

docker run -p 6333:6333 qdrant/qdrant

Or install locally: https://qdrant.tech/documentation/quick-start/

Configure the system

Edit config.yaml to customize:

Qdrant connection settings
Embedding model and configuration (code-optimized models, chunking, preprocessing)
Hybrid search settings (semantic vs keyword weight)
Gemini model selection
Pattern extraction rules
Quality thresholds

See Advanced Features for details on embedding models and hybrid search configuration.

Usage

Running the MCP Server

python dna_server.py

The server exposes several MCP tools that can be called by AI assistants:

Available Tools

1. `store_pattern` - Manually add a code pattern

{
  "content": "def retry_on_failure(max_attempts=3): ...",
  "title": "Retry Decorator",
  "description": "Decorator for automatic retry logic",
  "category": "utility",
  "language": "python",
  "quality_score": 8
}

2. `search_dna` - Query patterns by semantic similarity

{
  "query": "authentication middleware patterns",
  "limit": 5,
  "min_quality": 7,
  "language": "python",
  "category": "architecture"
}

3. `list_my_repos` - List accessible GitHub repositories

{
  "include_private": true
}

4. `sync_github_repo` - Index patterns from a repository

{
  "repo_name": "myorg/myproject",
  "analyze": true,
  "min_quality": 6
}

5. `scaffold_project` - Generate a new project

{
  "project_name": "my-new-api",
  "project_type": "REST API",
  "tech_stack": "Python, FastAPI, PostgreSQL"
}

6. `get_dna_stats` - View database statistics

{}

7. `get_embedding_info` - View embedding configuration

{}

Returns information about:

Current embedding model and provider
Vector dimensions
Chunking configuration
Preprocessing settings
All supported embedding models
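
MCP clients normally construct tool calls for you, but it can help to see what goes over the wire. In the Model Context Protocol, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds that request body; `build_tool_call` is a hypothetical helper, and sending the request over the SSE transport is left to a real MCP client.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per request


def build_tool_call(name, arguments=None):
    """Build the JSON-RPC 2.0 body for an MCP `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# Arguments taken from the search_dna example above.
request = build_tool_call(
    "search_dna",
    {"query": "authentication middleware patterns", "limit": 5, "min_quality": 7},
)
print(json.dumps(request, indent=2))
```

Tools that take no parameters, such as get_dna_stats, would be called with an empty arguments object.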

Command-Line Utilities

Discover DNA from local directory:

python discover_dna.py /path/to/your/codebase

List your GitHub repositories:

python manual_list_repos.py

Integration with MCP Clients

First, start the Docker server:

docker-compose up -d

Then add to your MCP client config:

Claude Desktop / Cursor / Windsurf

Config location:

Claude Desktop: claude_desktop_config.json
Cursor: %APPDATA%\Cursor\User\globalStorage\cursor.mcp\config.json
Windsurf: %APPDATA%\Windsurf\User\globalStorage\cascade.mcp\config.json

{
  "mcpServers": {
    "architectural-dna": {
      "transport": "sse",
      "url": "http://localhost:8080/sse",
      "headers": {
        "X-GITHUB-TOKEN": "your_github_token",
        "X-GEMINI-API-KEY": "your_gemini_api_key"
      }
    }
  }
}

Gemini Code Assist / Antigravity

Config location: ~/.gemini/antigravity/mcp_config.json

{
  "mcpServers": {
    "architectural-dna": {
      "serverUrl": "http://localhost:8080/sse",
      "headers": {
        "X-GITHUB-TOKEN": "your_github_token",
        "X-GEMINI-API-KEY": "your_gemini_api_key"
      }
    }
  }
}

Claude Code Skill

For Claude Code users, a SKILL.md file provides workflow guidance to help the LLM make better decisions about which tools to use and when.

Install the skill:

claude skill add ./SKILL.md

What the skill provides:

Tool selection guidance - When to use batch_sync_repo vs sync_github_repo
Documented workflows - First-time setup, adding repos, searching, maintenance
Best practices - What to always do, never do, and error recovery
Tool relationships - How the 12 tools connect and depend on each other

Example workflows documented:

First-time DNA bank setup
Adding a new repository
Finding and using patterns
Maintenance after sync issues
Improving pattern quality (recategorization)

Configuration

config.yaml Structure

qdrant:
  url: "http://localhost:6333"
  collection_name: "code_patterns"
  embedding_model: "BAAI/bge-small-en-v1.5"
  vector_size: 384

gemini:
  model: "gemini-2.0-flash-exp"
  analysis_enabled: true

extraction:
  min_chunk_lines: 5
  max_chunk_lines: 150
  min_quality_score: 5

Supported Languages

| Language | Extensions | AST Parser |
|----------|------------|------------|
| Python | .py | tree-sitter-python |
| Java | .java | tree-sitter-java |
| JavaScript/TypeScript | .js, .ts, .jsx, .tsx | tree-sitter-javascript |
| Go | .go | Semantic chunking |
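
The extension column above amounts to a lookup from file suffix to language. A minimal sketch of such a lookup follows; the dictionary name, the language labels, and `detect_language` are hypothetical and may not match how the project represents languages internally.

```python
from pathlib import PurePosixPath

# Suffixes taken from the table above; the labels are illustrative.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".go": "go",
}


def detect_language(path):
    """Return the language label for a repository path, or None if unsupported."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

Files whose suffix is not in the table, such as Markdown or YAML files, return None and would be skipped during extraction.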

Pattern Categories

architecture - High-level design patterns (MVC, microservices, etc.)
design_pattern - Classical design patterns (Singleton, Factory, etc.)
best_practice - Coding standards and conventions
utility - Helper functions and utilities
security - Security implementations (auth, validation, etc.)
performance - Optimization techniques
other - Miscellaneous patterns

Examples

Example 1: Index Your Team's Codebase

# In Claude Desktop or any MCP client:
"Sync my repository myteam/api-backend and analyze patterns with min quality 7"

# The server will:
# 1. Clone/fetch the repository
# 2. Extract code chunks using AST parsing
# 3. Analyze each chunk with Gemini
# 4. Store high-quality patterns in Qdrant

Example 2: Search for Patterns

"Search DNA for authentication middleware patterns in Python"

# Returns:
# - OAuth2 middleware implementation (score: 9)
# - JWT token validation (score: 8)
# - Rate limiting decorator (score: 7)

Example 3: Generate a New Project

"Scaffold a new FastAPI project called 'user-service' using my team's patterns"

# Generates:
# - Project structure following your conventions
# - Configuration files
# - Authentication using your patterns
# - Error handling matching your style
# - README and setup instructions

How It Works

Pattern Extraction Pipeline

1. Repository Fetching
   - Connects to GitHub API
   - Traverses directory tree
   - Filters by file extensions
   - Ignores common directories (node_modules, venv, etc.)
2. Code Chunking
   - AST-based: Extracts functions, classes, methods
   - Semantic: Falls back to context-aware splitting
   - Maintains 10-line overlap for context
3. LLM Analysis
   - Sends chunks to Gemini with structured prompt
   - Receives: is_pattern, title, description, category, quality_score
   - Filters by minimum quality threshold
4. Vector Storage
   - Generates embeddings using local model
   - Stores in Qdrant with metadata
   - Enables semantic search
5. Project Generation
   - Searches for relevant patterns
   - Builds context for LLM
   - Generates complete project structure
   - Creates files with your team's conventions
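
The AST-based chunking step can be illustrated for Python source alone. The project is described as using tree-sitter; the sketch below swaps in Python's built-in `ast` module so that it runs without extra dependencies. `extract_python_chunks` and the shape of the returned dictionaries are hypothetical, and the default line limits mirror the extraction settings shown under Configuration.

```python
import ast


def extract_python_chunks(source, min_chunk_lines=5, max_chunk_lines=150):
    """Return functions and classes whose length falls within the line limits.

    Uses the stdlib `ast` module as a stand-in for tree-sitter.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            length = node.end_lineno - node.lineno + 1
            if min_chunk_lines <= length <= max_chunk_lines:
                chunks.append(
                    {
                        "name": node.name,
                        "kind": type(node).__name__,
                        "start_line": node.lineno,  # 1-based, inclusive
                        "end_line": node.end_lineno,
                        "content": "\n".join(lines[node.lineno - 1:node.end_lineno]),
                    }
                )
    return chunks
```

Each returned chunk would then go to the analysis step, which decides whether it is a pattern worth storing.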

Troubleshooting

Common Issues

"Failed to connect to Qdrant"

Ensure Qdrant is running: docker ps or check http://localhost:6333
Verify QDRANT_URL in .env (or qdrant.url in config.yaml)

"GitHub API rate limit exceeded"

You're limited to 60 requests/hour without authentication
Add GITHUB_TOKEN to .env for 5,000 requests/hour
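
To confirm which limit applies to your token, you can query GitHub's REST endpoint `GET /rate_limit`, which does not count against the limit itself. The sketch below uses only the standard library; `fetch_rate_limit` and `parse_rate_limit` are hypothetical helper names and not part of the project.

```python
import json
import urllib.request


def parse_rate_limit(payload):
    """Pull the core REST API quota out of a /rate_limit response body."""
    core = payload["resources"]["core"]
    return {
        "limit": core["limit"],
        "remaining": core["remaining"],
        "reset": core["reset"],  # Unix timestamp at which the quota resets
    }


def fetch_rate_limit(token=None):
    """Query GitHub for the current quota; unauthenticated if token is None."""
    request = urllib.request.Request("https://api.github.com/rate_limit")
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_rate_limit(json.load(response))
```

Calling `fetch_rate_limit(os.environ.get("GITHUB_TOKEN"))` should report a limit of 5,000 when the token is picked up and 60 when it is not.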

"Gemini API error"

Check your API key is valid
Verify you have quota remaining
Try switching to gemini-1.5-flash in config.yaml

"No patterns found in repository"

Lower min_quality threshold in sync_github_repo
Check file extensions are supported
Verify repository has code files (not just configs)

"Tree-sitter parsing failed"

The system automatically falls back to semantic chunking
Check logs in dna_server.log for details

Logs

Logs are written to dna_server.log with timestamps:

tail -f dna_server.log

Log levels:

INFO: Normal operations (patterns found, storage success)
WARNING: Recoverable issues (analysis failed, using fallback)
ERROR: Serious problems (storage failures, API errors)
DEBUG: Detailed tracing (enable with logging.DEBUG)
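
How the server wires up logging internally is not shown here, so the following is only a sketch of what enabling DEBUG output to a file looks like with Python's standard `logging` module. The logger name "dna_server" and the `configure_logging` helper are assumptions made for illustration.

```python
import logging


def configure_logging(path="dna_server.log", level=logging.INFO):
    """Attach a timestamped file handler to a named logger and return it."""
    logger = logging.getLogger("dna_server")  # assumed logger name
    logger.setLevel(level)

    # Remove handlers from earlier calls so lines are not written twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Passing `level=logging.DEBUG` would include the detailed tracing messages; the default of `logging.INFO` would omit them.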

Development

Project Structure

architectural-dna/
├── dna_server.py          # MCP server with tool definitions
├── models.py              # Data models (Pattern, CodeChunk, etc.)
├── github_client.py       # GitHub API integration
├── github_cache.py        # LRU cache with TTL for GitHub API
├── pattern_extractor.py   # AST-based code parsing
├── llm_analyzer.py        # Gemini pattern analysis
├── scaffolder.py          # Project generation
├── constants.py           # Centralized configuration constants
├── discover_dna.py        # Local directory indexing
├── config.yaml            # Configuration
├── SKILL.md               # Claude Code skill with workflow guidance
├── requirements.txt       # Python dependencies
└── .env                   # Environment variables (gitignored)

Adding New Languages

Install tree-sitter grammar:

pip install tree-sitter-<language>

Add to pattern_extractor.py:

def _extract_<language>_chunks(self, tree, content, lines):
    # Implement AST traversal
    pass

Update models.py Language enum
Add file extensions to GitHubClient.CODE_EXTENSIONS

Security

⚠️ IMPORTANT: Never commit .env to version control. See SECURITY.md for details on secret management and rotation.

Contributing

Contributions welcome! Areas for improvement:

Add more language support (C++, Rust, Ruby)
~~Implement caching for GitHub API responses~~ (Done - LRU cache with TTL)
~~Add batch processing for large repositories~~ (Done - BatchProcessor with progress tracking)
Create web UI for pattern browsing
Add export functionality (JSON, markdown)
Implement pattern versioning

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Acknowledgments

FastMCP - MCP server framework
Qdrant - Vector database
Google Gemini - LLM for analysis
tree-sitter - Code parsing
