Description
Expected Behavior
Spring AI should provide a native SemanticTextSplitter that uses an EmbeddingModel to split text based on semantic similarity between sentences, producing higher-quality chunks for RAG pipelines.
```java
SemanticTextSplitter splitter = SemanticTextSplitter.builder()
    .embeddingModel(embeddingModel)
    .similarityThreshold(0.5)
    .maxChunkSize(1000)
    .build();
List<Document> chunks = splitter.split(documents);
```
The algorithm:
1. Split the text into sentences.
2. Compute an embedding for each sentence (or for a sliding window of sentences).
3. Calculate the cosine similarity between consecutive sentence embeddings.
4. Split at the points where similarity drops below the configured threshold.

This produces variable-sized chunks that respect the text's natural semantic boundaries.
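The core of the algorithm can be sketched in plain Java. Note this is an illustrative sketch, not a proposed implementation: the `embed` method below is a toy hash-based stand-in for Spring AI's `EmbeddingModel`, used only so the example runs without any dependency, and all class/method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SemanticSplitSketch {

    // Toy embedding: bag-of-words hashed into a small fixed-size vector.
    // In the real splitter this would be a call to EmbeddingModel.embed(...).
    static float[] embed(String sentence) {
        float[] v = new float[16];
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                v[Math.floorMod(token.hashCode(), 16)] += 1f;
            }
        }
        return v;
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-10);
    }

    // Steps 3-4: start a new chunk wherever the similarity between
    // consecutive sentence embeddings drops below the threshold.
    static List<String> split(List<String> sentences, double threshold) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder(sentences.get(0));
        float[] prev = embed(sentences.get(0));
        for (int i = 1; i < sentences.size(); i++) {
            float[] cur = embed(sentences.get(i));
            if (cosine(prev, cur) < threshold) {
                chunks.add(current.toString());
                current = new StringBuilder(sentences.get(i));
            } else {
                current.append(' ').append(sentences.get(i));
            }
            prev = cur;
        }
        chunks.add(current.toString());
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
                "Cats are small mammals.",
                "Cats are popular pets.",
                "Quantum computing uses qubits.");
        System.out.println(split(sentences, 0.3));
    }
}
```

A production version would batch the embedding calls and enforce `maxChunkSize` on top of the similarity-based boundaries.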
Current Behavior
The only native text splitter is TokenTextSplitter, which splits based on a fixed token count. This can break semantically related content across chunk boundaries, degrading embedding quality and RAG retrieval relevance.
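To illustrate the failure mode, here is a minimal fixed-size splitter (character counts stand in for token counts; the class and method names are illustrative, not Spring AI API). A fixed budget routinely cuts mid-sentence, and even mid-word:

```java
import java.util.ArrayList;
import java.util.List;

public class FixedSizeSplitDemo {

    // Fixed-size chunking that ignores sentence boundaries, analogous to
    // splitting on a fixed token count.
    static List<String> splitFixed(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(i + size, text.length())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "The capital of France is Paris. Photosynthesis occurs in chloroplasts.";
        // With a 40-character budget the first chunk ends mid-word,
        // mixing two unrelated facts across the boundary.
        System.out.println(splitFixed(text, 40));
    }
}
```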
External solutions like Docling provide hierarchical/hybrid chunking but require a separate service and add infrastructure complexity.
Context
Semantic chunking is a well-established technique in RAG pipelines, available natively in LangChain (Python) and LangChain4j (Java), but not yet in Spring AI.
Users currently have to either rely on external tools (Docling) or implement custom solutions. A native SemanticTextSplitter extending the existing TextSplitter base class would require no new external dependencies since it builds on Spring AI's own EmbeddingModel interface.
Configurable parameters would include: similarity threshold, max chunk size, and sentence splitting strategy.
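For the sentence splitting strategy, one possible default is the JDK's locale-aware `java.text.BreakIterator`, which requires no extra dependency; regex-based or NLP-based strategies could be plugged in as alternatives. The class below is a sketch under that assumption, not a committed design:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitting {

    // Split text into sentences using the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Semantic chunking helps RAG. It respects topic shifts.", Locale.US));
    }
}
```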