Chunkdown is a tree-based markdown text splitter that creates semantically meaningful chunks for RAG applications. Unlike traditional splitters that rely on simple character- or regex-based methods, this library leverages markdown's hierarchical structure for optimal chunking. Chunkdown is built around a few core ideas that guide its design:
A properly structured markdown document forms a hierarchical tree where headings define sections containing various nodes (paragraphs, lists, tables, etc.). We parse markdown into an Abstract Syntax Tree (AST) and transform it into a hierarchical structure where sections contain their related content. This enables intelligent chunking that keeps semantically related information together.
Hierarchical Markdown Abstract Syntax Tree
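To make this concrete, here is a rough sketch of the kind of section tree this produces. The type names are purely illustrative, not Chunkdown's actual internals:

```ts
// Illustrative types only; Chunkdown's internal representation may differ.
type Block = { type: 'paragraph' | 'list' | 'table' | 'code'; text: string };
type Section = {
  heading: string;               // e.g. "## AI SDK Core Functions"
  depth: number;                 // heading level, 1-6
  children: (Section | Block)[]; // content nodes plus nested subsections
};

// A document like "# Intro ... ## Details ..." becomes a nested tree
// instead of a flat node list, so content stays attached to its heading:
// { heading: '# Intro', depth: 1, children: [
//   { type: 'paragraph', text: '...' },
//   { heading: '## Details', depth: 2, children: [ /* ... */ ] },
// ] }
```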
Markdown uses additional characters for formatting (**bold**, *italic*, [link](https://example.com), etc.) that increase the total character count without necessarily changing the semantic meaning. When calculating chunk size, we count actual text content rather than raw markdown characters. This ensures consistent semantic density across chunks regardless of formatting.
Note
In a future version, it will be possible to opt out of this behavior and use the raw markdown length to calculate the chunk size.
For example, the following text from Wikipedia has 804 raw characters; however, what the user actually sees rendered on the screen is only 202 characters:
The **llama** ([/ˈlɑːmə/](https://en.wikipedia.org/wiki/Help:IPA/English "Help:IPA/English"); Spanish pronunciation: [\[ˈʎama\]](https://en.wikipedia.org/wiki/Help:IPA/Spanish "Help:IPA/Spanish") or [\[ˈʝama\]](https://en.wikipedia.org/wiki/Help:IPA/Spanish "Help:IPA/Spanish")) (***Lama glama***) is a domesticated [South American](https://en.wikipedia.org/wiki/South_America "South America") [camelid](https://en.wikipedia.org/wiki/Camelid "Camelid"), widely used as a [meat](https://en.wikipedia.org/wiki/List_of_meat_animals "List of meat animals") and [pack animal](https://en.wikipedia.org/wiki/Pack_animal "Pack animal") by [Andean cultures](https://en.wikipedia.org/wiki/Inca_empire "Inca empire") since the [pre-Columbian era](https://en.wikipedia.org/wiki/Pre-Columbian_era "Pre-Columbian era").
Comparison of chunk size 100: Chunkdown (left) / LangChain Markdown Splitter (right)
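To illustrate the difference, the sketch below strips a few common markdown constructs before counting characters. This is only a rough regex approximation; Chunkdown measures content length from the parsed AST:

```ts
// Rough approximation of content length for illustration; Chunkdown's
// AST-based measurement is more precise than this regex sketch.
function contentLength(markdown: string): number {
  const stripped = markdown
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, '$1') // [text](url) -> text
    .replace(/\*\*|__|~~|[*_`]/g, '');         // emphasis/code markers
  return stripped.length;
}

const text = 'The **llama** is a [camelid](https://en.wikipedia.org/wiki/Camelid).';
console.log(text.length);         // 68 raw characters
console.log(contentLength(text)); // 23 content characters: "The llama is a camelid."
```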
Words are the smallest meaningful unit of information for embedding purposes. While tokenizers may split words further, for practical RAG applications, breaking words mid-way creates meaningless chunks. Therefore, words are treated as indivisible atoms that cannot be split.
Comparison of chunk size 1: Chunkdown (left) / LangChain Markdown Splitter (right)
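For instance, even with an impossibly small chunk size, every chunk is still a whole word. The output below is an assumption that follows from this principle rather than a verified result:

```ts
import { chunkdown } from 'chunkdown';

// No word fits a chunk size of 1, but words are never broken apart,
// so the expected result is one word per chunk (assumed output).
const splitter = chunkdown({ chunkSize: 1 });
const chunks = splitter.splitText('Llamas are social animals');
// chunks: ['Llamas', 'are', 'social', 'animals']
```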
Semantic elements like links, images, inline code, and certain formatting elements should ideally always remain intact. Breaking a long link like [structured data generation](https://ai-sdk.dev/docs/ai-sdk-core/generating-structured-data) into fragments such as `[structured data` and `generation](https://ai-sdk.dev/docs/ai-sdk-core/generating-structured-data)` destroys its meaning. The splitter preserves these constructs and splits around them.
Comparison of chunk size 100: Chunkdown (left) / LangChain Markdown Splitter (right)
Preserving a complete semantic unit like a section, paragraph, sentence, etc., is often more important than adhering to a strict chunk size. The splitter allows a controlled overflow (via maxOverflowRatio) of the chunk size if it avoids splitting a complete unit, e.g. a list item.
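For example, the following configuration tolerates chunks of up to 100 * 1.5 = 150 content characters whenever the overflow keeps a unit intact:

```ts
import { chunkdown } from 'chunkdown';

// Target 100 content characters per chunk, but allow up to
// 150 (100 * 1.5) if that avoids cutting a semantic unit in half.
const splitter = chunkdown({
  chunkSize: 100,
  maxOverflowRatio: 1.5,
});
```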
npm install chunkdown
# or
pnpm add chunkdown
# or
bun add chunkdown

import { chunkdown } from 'chunkdown';
const splitter = chunkdown({
chunkSize: 500, // Target chunk size based on content length
maxOverflowRatio: 1.5 // Allow up to 50% overflow
});
const text = `
# AI SDK Core
Large Language Models (LLMs) are advanced programs that can understand, create, and engage with human language on a large scale.
They are trained on vast amounts of written material to recognize patterns in language and predict what might come next in a given piece of text.
AI SDK Core **simplifies working with LLMs by offering a standardized way of integrating them into your app** - so you can focus on building great AI applications for your users, not waste time on technical details.
For example, here’s how you can generate text with various models using the AI SDK:
<PreviewSwitchProviders />
## AI SDK Core Functions
AI SDK Core has various functions designed for [text generation](./generating-text), [structured data generation](./generating-structured-data), and [tool usage](./tools-and-tool-calling).
These functions take a standardized approach to setting up [prompts](./prompts) and [settings](./settings), making it easier to work with different models.
- [`generateText`](/docs/ai-sdk-core/generating-text): Generates text and [tool calls](./tools-and-tool-calling).
This function is ideal for non-interactive use cases such as automation tasks where you need to write text (e.g. drafting email or summarizing web pages) and for agents that use tools.
- [`streamText`](/docs/ai-sdk-core/generating-text): Stream text and tool calls.
You can use the `streamText` function for interactive use cases such as [chat bots](/docs/ai-sdk-ui/chatbot) and [content streaming](/docs/ai-sdk-ui/completion).
- [`generateObject`](/docs/ai-sdk-core/generating-structured-data): Generates a typed, structured object that matches a [Zod](https://zod.dev/) schema.
You can use this function to force the language model to return structured data, e.g. for information extraction, synthetic data generation, or classification tasks.
- [`streamObject`](/docs/ai-sdk-core/generating-structured-data): Stream a structured object that matches a Zod schema.
You can use this function to [stream generated UIs](/docs/ai-sdk-ui/object-generation).
## API Reference
Please check out the [AI SDK Core API Reference](/docs/reference/ai-sdk-core) for more details on each function.
`;
const chunks = splitter.splitText(text);

By default, links and images are never split to avoid breaking their semantic meaning.
import { chunkdown } from 'chunkdown';
const text = `Please check out the [AI SDK Core API Reference](/docs/reference/ai-sdk-core) for more details on each function.`;
// By default, never split links and images
const splitter = chunkdown({
chunkSize: 50,
});
const chunks = splitter.splitText(text);
// chunks[0]: "Please check out the [AI SDK Core API Reference](/docs/reference/ai-sdk-core)"
// chunks[1]: "for more details on each function."
// Allow splitting links
const splitter = chunkdown({
chunkSize: 50,
rules: {
link: { split: 'allow-split' }
}
});
const chunks = splitter.splitText(text);
// chunks[0]: "Please check out the [AI SDK Core API"
// chunks[1]: "Reference](/docs/reference/ai-sdk-core) for more details on each function."

By default, Chunkdown normalizes reference-style links and images to inline style. This prevents issues when reference definitions end up in different chunks than their usage.
import { chunkdown } from 'chunkdown';
const text = `
Check out the [documentation][docs] and [API reference][api].
[docs]: https://example.com/docs
[api]: https://example.com/api
`;
// By default, normalize to inline style
const splitter = chunkdown({
chunkSize: 100,
});
const chunks = splitter.splitText(text);
// Result:
// chunks[0]: "Check out the [documentation](https://example.com/docs) and [API reference](https://example.com/api)."
// Preserve original style
const splitter = chunkdown({
chunkSize: 100,
rules: {
link: { style: 'preserve' }
}
});
const chunks = splitter.splitText(text);
// Result:
// chunks[0]: "Check out the [documentation][docs] and [API reference][api]."
// chunks[1]: "[docs]: https://example.com/docs"
// chunks[2]: "[api]: https://example.com/api"

Unlike links and images, formatting elements like bold, italic, and strikethrough will be split if needed.
import { chunkdown } from 'chunkdown';
const text = `This is **a very long bold text that contains many words and exceeds the chunk size** in the middle.`;
// By default, allow splitting formatting
const splitter = chunkdown({
chunkSize: 30,
});
const chunks = splitter.splitText(text);
// chunks[0]: "This is **a very long"
// chunks[1]: "bold text that contains many"
// chunks[2]: "words and exceeds the"
// chunks[3]: "chunk size** in the middle."
// Never split formatting
const splitter = chunkdown({
chunkSize: 30,
rules: {
formatting: { split: 'never-split' }
}
});
const chunks = splitter.splitText(text);
// chunks[0]: "This is"
// chunks[1]: "**a very long bold text that contains many words and exceeds the chunk size**"
// chunks[2]: "in the middle."

When a table is split into multiple chunks, Chunkdown automatically preserves context by including the header row in each chunk. This ensures that data rows don't lose their meaning when separated from the original header.
Note
The header row size is not counted when calculating chunk sizes. Only data row content is measured against the chunkSize limit.
import { chunkdown } from 'chunkdown';
const splitter = chunkdown({
chunkSize: 20,
maxOverflowRatio: 1.0
});
const text = `
| Name | Age | Country |
|----------|-----|---------|
| Alice | 30 | USA |
| Bob | 25 | UK |
| Charlie | 35 | Canada |
| David | 40 | France |
`;
const chunks = splitter.splitText(text);
// chunks[0]:
// | Name | Age | Country |
// |----------|-----|---------|
// | Alice | 30 | USA |
// | Bob | 25 | UK |
// chunks[1]:
// | Name | Age | Country |
// |----------|-----|---------|
// | Charlie | 35 | Canada |
// | David | 40 | France |

Certain markdown elements, such as formatting and lists, have multiple representations. Chunkdown normalizes these elements to ensure uniform output regardless of input style variations.
import { chunkdown } from 'chunkdown';
const text = `
formatting:
__bold__
_italic_
---
lists:
- list item 1
- list item 2
`;
const splitter = chunkdown({
chunkSize: 100,
});
const chunks = splitter.splitText(text);
// Markdown variations are normalized to:
// - __bold__ → **bold**
// - _italic_ → *italic*
// - "---" → "***" (thematic break)
// - list item 1 → * list item 1 (starting with "*")

Transform functions allow you to modify or filter nodes during preprocessing. This is useful for cleaning up content before chunking, such as truncating long URLs or removing unwanted elements.
Prevent chunk bloat from excessively long URLs:
import { chunkdown } from 'chunkdown';

const splitter = chunkdown({
chunkSize: 100,
rules: {
link: {
transform: (node) => {
// Truncate URLs longer than 50 characters
if (node.url.length > 50) {
return {
...node,
url: node.url.substring(0, 50) + '...'
};
}
return undefined;
}
}
}
});
const text = `Check out our [website](https://example.com/with/a/very/long/url/that/increases/the/chunk/size/significantly).`;
const chunks = splitter.splitText(text);
// chunks[0]: "Check out our [website](https://example.com/with/a/very/long/url/that/incr...)."

Data URLs in images (i.e. base64-encoded images) can be extremely long and create noise in chunks without meaningful content:
import { chunkdown } from 'chunkdown';
const splitter = chunkdown({
chunkSize: 500,
rules: {
image: {
transform: (node) => {
// Remove images with data URLs
if (node.url.startsWith('data:')) {
return null; // Remove the entire image node
}
return undefined; // Keep regular images
}
}
}
});
const text = `
# Article

Check our [website](https://example.com) for more info.
`;
const chunks = splitter.splitText(text);
// chunks[0]:
// # Article
//
// Check our [website](https://example.com) for more info.

`chunkdown(options)` creates a new markdown splitter instance.
Returns an object with the following method:

- `splitText(text: string): string[]`: Splits the input markdown text into chunks.
`chunkSize`: The target content size for each chunk, counting only content characters, not raw markdown.
`maxOverflowRatio`: The maximum overflow ratio for preserving semantic units:

- `1.0`: strict chunk size, no overflow allowed
- `>1.0`: allow overflow of up to `chunkSize * maxOverflowRatio`
`rules`: Configure splitting behavior for specific markdown node types. Rules allow fine-grained control over when and how different markdown elements can be split during chunking.
Supported node types:
- `link`: Link elements `[text](url)`
- `image`: Image elements
- `table`: Table elements
- `list`: List elements (ordered and unordered)
- `blockquote`: Blockquote elements
- `formatting`: Formatting elements (combines `strong`, `emphasis`, `delete`)
- `strong`: Bold text `**bold**` (overrides `formatting` if specified)
- `emphasis`: Italic text `*italic*` (overrides `formatting` if specified)
- `delete`: Strikethrough text `~~deleted~~` (overrides `formatting` if specified)
Note
The `formatting` rule applies to all formatting elements (`strong`, `emphasis`, `delete`) unless you override them individually.
Each node type can have a split rule:
- `'never-split'` | `{ rule: 'never-split' }`: Never split this element, regardless of size
- `'allow-split'` | `{ rule: 'allow-split' }`: Allow splitting this element if it exceeds the chunk size
- `{ rule: 'size-split', size: number }`: Only split this element if its content size exceeds the specified size
Links and images support an additional style property to control reference-style normalization:
- `'inline'`: Convert reference-style to inline style
- `'preserve'`: Keep original reference style
Each node type can have a transform function to modify or filter nodes:
type NodeTransform<T extends Nodes> = (node: T, context: TransformContext) => T | null | undefined;

- Return a modified node: Replace the original with the transformed version
- Return `null`: Remove the node from the tree
- Return `undefined`: Keep the node unchanged
The transform receives a context with parent, index, and root information. Transforms are applied during preprocessing, after reference-style normalization but before chunking.
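As an illustration of the context parameter, the sketch below drops images nested directly inside blockquotes. The shape of the context fields is assumed from the description above; the exact types may differ:

```ts
import { chunkdown } from 'chunkdown';

chunkdown({
  chunkSize: 500,
  rules: {
    image: {
      transform: (node, context) => {
        // Remove images whose direct parent is a blockquote
        // (context.parent is assumed to be a markdown AST node)
        if (context.parent?.type === 'blockquote') {
          return null;
        }
        return undefined; // keep all other images unchanged
      },
    },
  },
});
```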
import { chunkdown, defaultNodeRules } from 'chunkdown';
// Never split links
chunkdown({
chunkSize: 500,
rules: {
link: { split: 'never-split' }
}
});
// Split lists only if they exceed 200 characters
chunkdown({
chunkSize: 500,
rules: {
list: { split: { rule: 'size-split', size: 200 } }
}
});
// Never split formatting by default, but allow splitting bold text
chunkdown({
chunkSize: 500,
rules: {
formatting: { split: 'never-split' }, // Applies to strong, emphasis, delete
strong: { split: 'allow-split' } // Override: allow splitting bold text
}
});
// Extend default rules
chunkdown({
chunkSize: 500,
rules: {
...defaultNodeRules, // Include defaults for other elements
link: { split: 'never-split' },
table: { split: { rule: 'allow-split' } },
list: { split: { rule: 'size-split', size: 150 } },
blockquote: { split: { rule: 'size-split', size: 300 } }
}
});
// Normalize links to inline-style, preserve images in reference-style
chunkdown({
chunkSize: 500,
rules: {
link: { style: 'inline' },
image: { style: 'preserve' }
}
});
// Remove data URLs and truncate long links
chunkdown({
chunkSize: 500,
rules: {
image: {
transform: (node) => {
// Remove images with data URLs
if (node.url.startsWith('data:')) {
return null;
}
return undefined;
}
},
link: {
transform: (node) => {
// Truncate long URLs
if (node.url.length > 100) {
return { ...node, url: node.url.substring(0, 100) + '...' };
}
return undefined;
}
}
}
});

Default rules:
By default, links and images are set to never split and normalize to inline style:
const defaultNodeRules: NodeRules = {
link: {
split: 'never-split',
style: 'inline'
},
image: {
split: 'never-split',
style: 'inline'
},
};

When you provide custom rules, they override the defaults. Use the spread operator `...defaultNodeRules` to explicitly include defaults if you want to override only specific elements.
import { chunkdown, defaultNodeRules } from 'chunkdown';
chunkdown({
chunkSize: 500,
rules: {
...defaultNodeRules, // Include defaults for other elements
link: { split: 'never-split' },
table: { split: { rule: 'allow-split' } },
list: { split: { rule: 'size-split', size: 150 } },
blockquote: { split: { rule: 'size-split', size: 300 } }
}
});

`maxRawSize`: The maximum raw size for each chunk, counting all characters including markdown formatting.
Certain markdown elements, such as links and images with long URLs, can have disproportionately long raw sizes compared to their actual content size. For example, the following text has a content size of 31 but a raw size of 117 characters due to the long URL:
This is a [link with short text](https://example.com/with/a/very/long/url/that/increases/the/raw/size/significantly).

This is usually not a problem, but if a text contains a lot of such elements (e.g. scraped from a website with many links and images), the resulting chunks can become very large in raw size, even if their content size is within the allowed limits.
When the text is then embedded by a model, the large raw size could exceed the model's token limit, causing errors.
For example, OpenAI's latest embedding model, text-embedding-3-large, has a maximum limit of 8192 tokens, which roughly translates to about 32,000 characters (rule of thumb: 1 token ≈ 4 characters).
Note
It is recommended to set this option to the upper limit of your embedding model.
The maxRawSize option acts as a safety net that enforces a hard limit on the total number of characters allowed in each chunk.
It is guaranteed that no chunk will exceed this limit, even if it means splitting semantic units that would otherwise be preserved.
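A minimal sketch of using this safety net, assuming maxRawSize is passed alongside the other options and sized for text-embedding-3-large's limit:

```ts
import { chunkdown } from 'chunkdown';

// Assumption: maxRawSize sits alongside chunkSize in the options.
// 32,000 characters ≈ 8192 tokens (1 token ≈ 4 characters).
const splitter = chunkdown({
  chunkSize: 500,    // target content size per chunk
  maxRawSize: 32000, // hard cap on raw characters per chunk
});
```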
The chunk visualizer hosted at chunkdown.zirkelc.dev provides an interactive way to see how text is split into chunks:
Splitting markdown text into multiple chunks often breaks formatting, because the start and end delimiters end up in different chunks. This broken formatting provides no real semantic meaning but adds unnecessary noise:
import { chunkdown } from "chunkdown";
const text = `**This is a very long bold text that might be split into two chunks**`;
const splitter = chunkdown({
chunkSize: 50,
maxOverflowRatio: 1.0
});
const chunks = splitter.splitText(text, {
breakMode: 'keep'
});
// Keep broken markdown:
// - **This is a very long bold text that
// - might be split into two chunks**
const chunks = splitter.splitText(text, {
breakMode: 'remove'
});
// Remove broken markdown:
// - This is a very long bold text that
// - might be split into two chunks
const chunks = splitter.splitText(text, {
breakMode: 'extend'
});
// Extend broken markdown:
// - **This is a very long bold text that**
// - **might be split into two chunks**

Currently, the splitter returns the chunks as an array of strings. That means the original position of each chunk in the source text is lost. In a typical RAG setup, the source document and each chunk are stored with their embeddings in a database. This duplicates a lot of text, since each chunk contains parts of the original document.
Chunkdown could return the start and end positions of each chunk in the original text, allowing you to store only the original document and reference the chunk positions when needed.
const document = '...'; // original markdown document
const chunks = splitter.splitDocument(document);
// Result:
// [
// { text: 'First chunk text...', start: 0, end: 256 },
// { text: 'Second chunk text...', start: 257, end: 512 },
// ...
// ]
await db.insert(documentTable).values({
text: document
});
await db
  .insert(chunkTable)
  .values(await Promise.all(chunks.map(async (chunk) => ({
    start: chunk.start, // start position in original document
    end: chunk.end, // end position in original document
    text: null, // chunk text not stored separately
    embedding: await embed(chunk.text),
  }))));