Ezio0/dingtalk-rag

DingTalk RAG Knowledge Base

A Retrieval-Augmented Generation (RAG) knowledge base service for DingTalk Wiki, featuring hybrid search (vector + BM25), real-time sync, and MCP server integration.

Features

  • Hybrid Search: Combines vector similarity (sqlite-vec) with full-text search (BM25/FTS5) using Reciprocal Rank Fusion (RRF)
  • Multi-Source Sync: Automatically syncs documents from DingTalk Wiki and AI Tables via MCP servers
  • Structure-Aware Chunking: Preserves document structure (headings, tables, code blocks) during text splitting
  • Cross-Encoder Reranking: Optional reranking with sentence-transformers for improved result quality
  • MCP Server: Exposes 4 tools (search_knowledge, get_document, list_documents, get_kb_stats) for AI agent integration
  • Web UI: Clean SPA interface for searching, browsing, and managing documents
  • REST API: Full-featured HTTP API with auto-sync scheduler
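As a rough illustration of the structure-aware chunking idea, the sketch below splits Markdown at headings while keeping fenced code blocks whole. It is not the project's chunker.py: the function name `chunk_markdown` is hypothetical, and size limits and table handling are left out.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def chunk_markdown(text):
    """Split Markdown at headings, keeping fenced code blocks whole (hypothetical helper)."""
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        # Track whether we are inside a fenced code block.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A heading starts a new chunk, unless it appears inside a fence,
        # where a leading '#' is usually a comment rather than a heading.
        if HEADING.match(line) and not in_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Each chunk starts with its own heading, so the section title stays attached to the text that gets embedded and indexed.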

Quick Start

Installation

# Clone the repository
git clone https://github.com/Ezio0/dingtalk-rag.git
cd dingtalk-rag

# Install dependencies (requires Python 3.11+)
uv sync --dev

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your API keys

Required Environment Variables

IDEALAB_API_KEY=your-idealab-api-key
DINGTALK_CLIENT_SECRET=your-dingtalk-secret
ALIDING_ACCESS_KEY_ID=your-aliding-key-id
ALIDING_ACCESS_KEY_SECRET=your-aliding-key-secret
ALIDING_ACCOUNT_ID=your-account-id
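
A small pre-flight check can catch a missing variable before the service starts. The sketch below uses the variable names listed above. The helper `missing_env_vars` is hypothetical and not part of the project, and it reads the process environment, so values in .env must already be exported or loaded.

```python
import os

# Names come from the list above; the helper itself is a hypothetical illustration.
REQUIRED_VARS = (
    "IDEALAB_API_KEY",
    "DINGTALK_CLIENT_SECRET",
    "ALIDING_ACCESS_KEY_ID",
    "ALIDING_ACCESS_KEY_SECRET",
    "ALIDING_ACCOUNT_ID",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` with no arguments checks the current environment. An empty list means all five variables are set.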

Run

# Mode 1: MCP stdio + HTTP (default)
# MCP stdio runs in main thread for AI agents, HTTP API in background
uv run dingtalk-rag

# Mode 2: HTTP API only
uv run dingtalk-rag --http-only --port 8000

# Mode 3: One-time sync
uv run dingtalk-rag --sync

Once the HTTP API is running, open the web UI at http://127.0.0.1:8000 (adjust the port if you passed a different --port).

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    DingTalk RAG Service                     │
├─────────────────────────────────────────────────────────────┤
│  Web UI (SPA)  │  HTTP API (FastAPI)  │  MCP Server (stdio) │
├─────────────────────────────────────────────────────────────┤
│  Hybrid Search Engine (Vector + BM25 + RRF + Reranker)      │
├─────────────────────────────────────────────────────────────┤
│  Chunker → Embedder → Vector Store (sqlite-vec)             │
│            ↓                                                │
│       Full-Text Index (FTS5)                                │
├─────────────────────────────────────────────────────────────┤
│  Sync Pipeline ←→ DingTalk/AI Table MCP Servers             │
└─────────────────────────────────────────────────────────────┘
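
The hybrid search layer in the diagram merges two ranked lists, one from vector search and one from BM25, using Reciprocal Rank Fusion. A minimal sketch of RRF follows. The constant k=60 is the value commonly used in the literature and is an assumption here, as is the function name. The project's search.py may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["c1", "c2", "c3"]  # illustrative chunk IDs from vector search
bm25_hits = ["c3", "c1", "c4"]    # illustrative chunk IDs from BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF uses only rank positions, so the vector distances and BM25 scores never need to be normalized to a common scale.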

API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | /health | Health check |
| POST | /api/search | Hybrid search |
| GET | /api/documents | List documents |
| GET | /api/documents/{id} | Get document details |
| POST | /api/documents/add | Add text document |
| POST | /api/documents/upload | Upload file (.md/.txt) |
| POST | /api/documents/from-url | Import from URL |
| POST | /api/documents/from-dingtalk | Import DingTalk docs |
| POST | /api/documents/from-dingtalk-kb | Import whole KB (streaming) |
| DELETE | /api/documents/{id} | Delete manual document |
| POST | /api/sync | Trigger sync |
| GET | /api/sync/status | Sync status |
| POST | /api/sync/cancel | Cancel sync |
| GET | /api/stats | KB statistics |
| GET | /api/kb/workspaces | List KB workspaces |
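
For a quick test from Python, a search request might be built as below. The path and method come from the table above. The JSON field names `query` and `top_k` are assumptions borrowed from the `search_knowledge` MCP tool signature, so check api.py for the actual request schema.

```python
import json
import urllib.request


def build_search_request(query, top_k=5, base_url="http://127.0.0.1:8000"):
    """Build a POST /api/search request. The body field names are assumed, not confirmed."""
    payload = {"query": query, "top_k": top_k}
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the HTTP server running, pass the result to `urllib.request.urlopen(...)` and decode the response body with `json.loads`.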

MCP Tools

When the service runs in MCP stdio mode, it exposes the following tools to AI agents:

  • search_knowledge(query, top_k, search_mode) - Search the knowledge base
  • get_document(doc_id) - Retrieve a specific document
  • list_documents(workspace_filter, limit) - List indexed documents
  • get_kb_stats() - Get statistics and health info
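
At the protocol level, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a message for `search_knowledge`. The argument values are illustrative and the helper name is hypothetical. A real client would normally use an MCP SDK, which also performs the initialization handshake before any tool call.

```python
import json


def tool_call_message(request_id, name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names follow the tool signature above; the values are made up.
message = tool_call_message(1, "search_knowledge", {"query": "expense policy", "top_k": 5})
line = json.dumps(message)  # the stdio transport sends one JSON message per line
```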

Development

Run Tests

uv run pytest

Code Quality

# Formatting
ruff format src/

# Linting
ruff check src/

Project Structure

dingtalk-rag/
├── src/dingtalk_rag/
│   ├── __main__.py      # CLI entry point
│   ├── api.py           # FastAPI HTTP API
│   ├── config.py        # Configuration management
│   ├── db.py            # SQLite schema & CRUD
│   ├── chunker.py       # Structure-aware document chunking
│   ├── embedding.py     # Idealab API embedding client
│   ├── search.py        # Hybrid search engine
│   ├── reranker.py      # Cross-encoder reranking
│   ├── sync.py          # Data sync pipeline
│   ├── mcp_client.py    # MCP client manager
│   ├── mcp_server.py    # MCP server implementation
│   └── static/          # Web UI (HTML/CSS/JS)
├── tests/               # Test suite
├── pyproject.toml       # Project configuration
└── sync-cron.sh         # Cron script for scheduled sync

Technical Stack

  • Database: SQLite + sqlite-vec (vectors) + FTS5 (full-text)
  • Embedding: Idealab API (text-embedding-v4, 1024-dim)
  • Search: Vector similarity + BM25 with RRF fusion
  • Reranking: sentence-transformers CrossEncoder
  • API: FastAPI + Uvicorn
  • MCP: Model Context Protocol SDK
  • Frontend: Vanilla JS SPA
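
To show the full-text half of the stack in isolation, the sketch below builds an in-memory FTS5 index and ranks matches with SQLite's built-in `bm25()` function. The table name `chunks_fts` and both helper functions are hypothetical. The project's actual schema lives in db.py and may differ. It assumes the Python sqlite3 module was built with FTS5, which is the case for most distributions.

```python
import sqlite3


def build_index(chunks):
    """Create an in-memory FTS5 table and load the given text chunks."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
    conn.executemany(
        "INSERT INTO chunks_fts(content) VALUES (?)", [(c,) for c in chunks]
    )
    return conn


def bm25_search(conn, query, limit=5):
    """Return (rowid, content, score) rows, best match first."""
    # FTS5's bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the most relevant rows first.
    rows = conn.execute(
        "SELECT rowid, content, bm25(chunks_fts) AS score "
        "FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```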

License

MIT
