Description
Expected Behavior
Spring AI should provide a native SemanticTextSplitter that uses an EmbeddingModel to split text based on semantic similarity between sentences, producing higher-quality chunks for RAG pipelines.
```java
SemanticTextSplitter splitter = SemanticTextSplitter.builder()
    .embeddingModel(embeddingModel)
    .similarityThreshold(0.5)
    .maxChunkSize(1000)
    .build();
List<Document> chunks = splitter.split(documents);
```
The algorithm:
1. Split the text into sentences.
2. Compute an embedding for each sentence (or for a sliding window of sentences).
3. Calculate the cosine similarity between consecutive sentence embeddings.
4. Split at the points where similarity drops below the configured threshold.

This produces variable-sized chunks that respect the text's natural semantic boundaries.
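The core of the algorithm can be sketched in plain Java. Note this is an illustrative sketch, not a proposed implementation: the `embed` method below is a toy hash-based stand-in for Spring AI's `EmbeddingModel`, used only so the example runs without any dependency, and all class/method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SemanticSplitSketch {

    // Toy embedding: bag-of-words hashed into a small fixed-size vector.
    // In the real splitter this would be a call to EmbeddingModel.embed(...).
    static float[] embed(String sentence) {
        float[] v = new float[16];
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                v[Math.floorMod(token.hashCode(), 16)] += 1f;
            }
        }
        return v;
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-10);
    }

    // Steps 3-4: start a new chunk wherever the similarity between
    // consecutive sentence embeddings drops below the threshold.
    static List<String> split(List<String> sentences, double threshold) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder(sentences.get(0));
        float[] prev = embed(sentences.get(0));
        for (int i = 1; i < sentences.size(); i++) {
            float[] cur = embed(sentences.get(i));
            if (cosine(prev, cur) < threshold) {
                chunks.add(current.toString());
                current = new StringBuilder(sentences.get(i));
            } else {
                current.append(' ').append(sentences.get(i));
            }
            prev = cur;
        }
        chunks.add(current.toString());
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
                "Cats are small mammals.",
                "Cats are popular pets.",
                "Quantum computing uses qubits.");
        System.out.println(split(sentences, 0.3));
    }
}
```

A production version would batch the embedding calls and enforce `maxChunkSize` on top of the similarity-based boundaries.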
Current Behavior
The only native text splitter is TokenTextSplitter, which splits based on a fixed token count. This can break semantically related content across chunk boundaries, degrading embedding quality and RAG retrieval relevance.
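To illustrate the failure mode, here is a minimal fixed-size splitter (character counts stand in for token counts; the class and method names are illustrative, not Spring AI API). A fixed budget routinely cuts mid-sentence, and even mid-word:

```java
import java.util.ArrayList;
import java.util.List;

public class FixedSizeSplitDemo {

    // Fixed-size chunking that ignores sentence boundaries, analogous to
    // splitting on a fixed token count.
    static List<String> splitFixed(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(i + size, text.length())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "The capital of France is Paris. Photosynthesis occurs in chloroplasts.";
        // With a 40-character budget the first chunk ends mid-word,
        // mixing two unrelated facts across the boundary.
        System.out.println(splitFixed(text, 40));
    }
}
```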
External solutions like Docling provide hierarchical/hybrid chunking but require a separate service and add infrastructure complexity.
Context
Semantic chunking is a well-established technique in RAG pipelines, available natively in LangChain (Python) and LangChain4j (Java), but not yet in Spring AI.
Users currently have to either rely on external tools (Docling) or implement custom solutions. A native SemanticTextSplitter extending the existing TextSplitter base class would require no new external dependencies since it builds on Spring AI's own EmbeddingModel interface.
Configurable parameters would include: similarity threshold, max chunk size, and sentence splitting strategy.
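For the sentence splitting strategy, one possible default is the JDK's locale-aware `java.text.BreakIterator`, which requires no extra dependency; regex-based or NLP-based strategies could be plugged in as alternatives. The class below is a sketch under that assumption, not a committed design:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitting {

    // Split text into sentences using the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Semantic chunking helps RAG. It respects topic shifts.", Locale.US));
    }
}
```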