Skip to content

zirkelc/chunkdown

Repository files navigation

chunkdown 🧩

Create chunks worth embedding

Chunkdown is a tree-based markdown text splitter to create semantically meaningful chunks for RAG applications. Unlike traditional splitters that use simple character or regex-based methods, this library leverages markdown's hierarchical structure for optimal chunking. Chunkdown is built around a few core ideas that guide its design:

Markdown as Hierarchical Tree

A properly structured markdown document forms a hierarchical tree where headings define sections containing various nodes (paragraphs, lists, tables, etc.). We parse markdown into an Abstract Syntax Tree (AST) and transform it into a hierarchical structure where sections contain their related content. This enables intelligent chunking that keeps semantically related information together.

image

Hierarchical Markdown Abstract Syntax Tree

Content Length vs. Markdown Length

Markdown uses additional characters for formatting (**bold**, *italic*, [link](https://example.com), etc.) that increase the total character count without necessarily changing the semantic meaning. When calculating chunk size, we count actual text content rather than raw markdown characters. This ensures consistent semantic density across chunks regardless of formatting.

Note

In a future version, it will be possible to opt-out of this behavior and use raw markdown length to calculate the chunk size.

For example, the following text from Wikipedia has 804 raw characters, however, what the user actually sees rendered on the screen are only 202 characters:

The **llama** ([/ˈlɑːmə/](https://en.wikipedia.org/wiki/Help:IPA/English "Help:IPA/English"); Spanish pronunciation: [\[ˈʎama\]](https://en.wikipedia.org/wiki/Help:IPA/Spanish "Help:IPA/Spanish") or [\[ˈʝama\]](https://en.wikipedia.org/wiki/Help:IPA/Spanish "Help:IPA/Spanish")) (***Lama glama***) is a domesticated [South American](https://en.wikipedia.org/wiki/South_America "South America") [camelid](https://en.wikipedia.org/wiki/Camelid "Camelid"), widely used as a [meat](https://en.wikipedia.org/wiki/List_of_meat_animals "List of meat animals") and [pack animal](https://en.wikipedia.org/wiki/Pack_animal "Pack animal") by [Andean cultures](https://en.wikipedia.org/wiki/Inca_empire "Inca empire") since the [pre-Columbian era](https://en.wikipedia.org/wiki/Pre-Columbian_era "Pre-Columbian era").
image

Comparison of chunk size 100: Chunkdown (left) / LangChain Markdown Splitter (right)

Words as Atomic Unit

Words are the smallest meaningful unit of information for embedding purposes. While tokenizers may split words further, for practical RAG applications, breaking words mid-way creates meaningless chunks. Therefore, words are treated as indivisible atoms that cannot be split.

image

Comparison of chunk size 1: Chunkdown (left) / LangChain Markdown Splitter (right)

Never Break Semantics

Semantic elements like links, images, inline code, and certain formatting elements should ideally always remain intact. Breaking a long link like [structured data generation](https://ai-sdk.dev/docs/ai-sdk-core/generating-structured-data) into [structured and data generation](https://ai-sdk.dev/docs/ai-sdk-core/generating-structured-data destroys meaning. The splitter preserves these constructs and splits around them.

image

Comparison of chunk size 100: Chunkdown (left) / LangChain Markdown Splitter (right)

Allow Controlled Overflow

Preserving a complete semantic unit like a section, paragraph, sentence, etc., is often more important than adhering to a strict chunk size. The splitter allows a controlled overflow (via maxOverflowRatio) of the chunk size if it avoids splitting a complete unit, e.g. a list item.

image

Comparison of chunk size 200 with 1.5x overflow ratio: Chunkdown (left) / LangChain Markdown Splitter (right)

Installation

npm install chunkdown
#
pnpm add chunkdown
#
bun add chunkdown

Usage

import { chunkdown } from 'chunkdown';

const splitter = chunkdown({
  chunkSize: 500,        // Target chunk size based on content length
  maxOverflowRatio: 1.5  // Allow up to 50% overflow
});

const text = `
# AI SDK Core

Large Language Models (LLMs) are advanced programs that can understand, create, and engage with human language on a large scale.
They are trained on vast amounts of written material to recognize patterns in language and predict what might come next in a given piece of text.

AI SDK Core **simplifies working with LLMs by offering a standardized way of integrating them into your app** - so you can focus on building great AI applications for your users, not waste time on technical details.

For example, here’s how you can generate text with various models using the AI SDK:

<PreviewSwitchProviders />

## AI SDK Core Functions

AI SDK Core has various functions designed for [text generation](./generating-text), [structured data generation](./generating-structured-data), and [tool usage](./tools-and-tool-calling).
These functions take a standardized approach to setting up [prompts](./prompts) and [settings](./settings), making it easier to work with different models.

- [`generateText`](/docs/ai-sdk-core/generating-text): Generates text and [tool calls](./tools-and-tool-calling).
  This function is ideal for non-interactive use cases such as automation tasks where you need to write text (e.g. drafting email or summarizing web pages) and for agents that use tools.
- [`streamText`](/docs/ai-sdk-core/generating-text): Stream text and tool calls.
  You can use the `streamText` function for interactive use cases such as [chat bots](/docs/ai-sdk-ui/chatbot) and [content streaming](/docs/ai-sdk-ui/completion).
- [`generateObject`](/docs/ai-sdk-core/generating-structured-data): Generates a typed, structured object that matches a [Zod](https://zod.dev/) schema.
  You can use this function to force the language model to return structured data, e.g. for information extraction, synthetic data generation, or classification tasks.
- [`streamObject`](/docs/ai-sdk-core/generating-structured-data): Stream a structured object that matches a Zod schema.
  You can use this function to [stream generated UIs](/docs/ai-sdk-ui/object-generation).

## API Reference

Please check out the [AI SDK Core API Reference](/docs/reference/ai-sdk-core) for more details on each function.
`;

const chunks = splitter.splitText(text);

Links and Images

By default, links and images are never split to avoid breaking their semantic meaning.

import { chunkdown } from 'chunkdown';

const text = `Please check out the [AI SDK Core API Reference](/docs/reference/ai-sdk-core) for more details on each function.`;

// By default, never split links and images
const splitter = chunkdown({
  chunkSize: 50,
});

const chunks = splitter.splitText(text);
// chunks[0]: "Please check out the [AI SDK Core API Reference](/docs/reference/ai-sdk-core)"
// chunks[1]: "for more details on each function."

// Allow splitting links
const splitte = chunkdown({
  chunkSize: 50,
  rules: {
    formatting: { split: 'allow-split' }
  }
});

const chunks = splitter.splitText(text);
// chunks[0]: "Please check out the [AI SDK Core API"
// chunks[1]: "Reference](/docs/reference/ai-sdk-core) for more details on each function."
Reference-Style

By default, Chunkdown normalizes reference-style links and images to inline style. This prevents issues when reference definitions end up in different chunks than their usage.

import { chunkdown } from 'chunkdown';

const text = `
Check out the [documentation][docs] and [API reference][api].

[docs]: https://example.com/docs
[api]: https://example.com/api
`;

// By default, normalize to inline style
const splitter = chunkdown({
  chunkSize: 100,
});

const chunks = splitter.splitText(text);
// Result:
// chunks[0]: "Check out the [documentation](https://example.com/docs) and [API reference](https://example.com/api)."

// Preserve original style
const splitter = chunkdown({
  chunkSize: 100,
  rules: {
    link: { style: 'preserve' }
  }
});

const chunks = splitter.splitText(text);
// Result:
// chunks[0]: "Check out the [documentation][docs] and [API reference][api]."
// chunks[1]: "[docs]: https://example.com/docs"
// chunks[2]: "[api]: https://example.com/api"

Formatting

Unlike links and images, formatting elements like bold, italic, and strikethrough will be splitted if needed.

import { chunkdown } from 'chunkdown';

const text = `This is **a very long bold text that contains many words and exceeds the chunk size** in the middle.`;

// By default, allow splitting formatting
const splitter = chunkdown({
  chunkSize: 30,
});

const chunks = splitter.splitText(text);
// chunks[0]: "This is **a very long"
// chunks[1]: "bold text that contains many"
// chunks[2]: "words and exceeds the"
// chunks[3]: "chunk size** in the middle."

// Never split formatting
const splitte = chunkdown({
  chunkSize: 30,
  rules: {
    formatting: { split: 'never-split' }
  }
});

const chunks = splitter.splitText(text);
// chunks[0]: "This is"
// chunks[1]: "**a very long bold text that contains many words and exceeds the chunk size**"
// chunks[2]: "in the middle."

Tables

When a table is split into multiple chunks, Chunkdown automatically preserves context by including the header row in each chunk. This ensures that data rows don't lose their meaning when separated from the original header.

Note

The header row size is not counted when calculating chunk sizes. Only data row content is measured against the chunkSize limit.

import { chunkdown } from 'chunkdown';

const splitter = chunkdown({
  chunkSize: 20,
  maxOverflowRatio: 1.0
});

const text = `
| Name     | Age | Country |
|----------|-----|---------|
| Alice    | 30  | USA     |
| Bob      | 25  | UK      |
| Charlie  | 35  | Canada  |
| David    | 40  | France  |
`;

const chunks = splitter.splitText(text);
// chunks[0]: 
// | Name     | Age | Country |
// |----------|-----|---------|
// | Alice    | 30  | USA     |
// | Bob      | 25  | UK      |

// chunks[1]:
// | Name     | Age | Country |
// |----------|-----|---------|
// | Charlie  | 35  | Canada  |
// | David    | 40  | France  |

Normalization

Certain markdown elements such as formatting and lists have multiple representations. Chunkdown normalizes these element to ensure a uniform output regardless of input style variations.

import { chunkdown } from 'chunkdown';

const text = `
formatting:
__bold__
_italic_

---

lists:
- list item 1
- list item 2
`;

const splitter = chunkdown({
  chunkSize: 100,
});

const chunks = splitter.splitText(text);
// Markdown variations are normalized to:
// - __bold__ → **bold**
// - _italic_ → *italic*
// - "---" → "***" (thematic break)
// - list item 1 → * list item 1 (starting with "*")

Transformation

Transform functions allow you to modify or filter nodes during preprocessing. This is useful for cleaning up content before chunking, such as truncating long URLs or removing unwanted elements.

Truncating Long URLs

Prevent chunk bloat from excessively long URLs:

const splitter = chunkdown({
  chunkSize: 100,
  rules: {
    link: {
      transform: (node) => {
        // Truncate URLs longer than 100 characters
        if (node.url.length > 50) {
          return {
            ...node,
            url: node.url.substring(0, 47) + '...'
          };
        }
        return undefined;
      }
    }
  }
});

const text = `Check out our [website](https://example.com/with/a/very/long/url/that/increases/the/chunk/size/significantly).`;
const chunks = splitter.splitText(text);
// chunks[0]: "Check out our [website](https://example.com/with/a/very/long/url/that/incr...)."
Removing Data URLs

Data URLs in images (i.e. base64-encoded images) can be extremely long and create noise in chunks without meaningful content:

import { chunkdown } from 'chunkdown';

const text = `
# Article

![Screenshot](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAD0...)

Check our [website](https://example.com) for more info.

![Logo](https://cdn.example.com/logo.png)
`;

const splitter = chunkdown({
  chunkSize: 500,
  rules: {
    image: {
      transform: (node) => {
        // Remove images with data URLs
        if (node.url.startsWith('data:')) {
          return null;  // Remove the entire image node
        }
        return undefined; // Keep regular images
      }
    }
  }
});

const text = `
# Article

![Screenshot](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAD0...)

Check our [website](https://example.com) for more info.
`;
const chunks = splitter.splitText(text);
// chunks[0]: 
// # Article
//
// Check our [website](https://example.com) for more info.

API Reference

chunkdown(options: ChunkdownOptions)

Creates a new markdown splitter instance.

Returns

An object with the following method:

  • splitText(text: string): string[]: Splits the input markdown text into chunks

Options

chunkSize: number

The target content size for each chunk, counting only content characters, not raw markdown.

maxOverflowRatio?: number (optional)

The maximum overflow ratio for preserving semantic units:

  • 1.0: strict chunk size, no overflow allowed
  • >1.0: allow overflow of up to chunkSize * maxOverflowRatio
rules?: Partial<NodeRules> (optional)

Configure splitting behavior for specific markdown node types. Rules allow fine-grained control over when and how different markdown elements can be split during chunking.

Supported node types:

  • link: Link elements [text](url)
  • image: Image elements ![alt](url)
  • table: Table elements
  • list: List elements (ordered and unordered)
  • blockquote: Blockquote elements
  • formatting: Formatting elements (combines strong, emphasis, delete)
  • strong: Bold text **bold** (overrides formatting if specified)
  • emphasis: Italic text *italic* (overrides formatting if specified)
  • delete: Strikethrough text ~~deleted~~ (overrides formatting if specified)

Note

The formatting rule applies to all formatting elements (strong, emphasis, delete) unless you override them individually.

Split

Each node type can have a split rule:

  • 'never-split' | { rule: 'never-split' }: Never split this element, regardless of size
  • 'allow-split' | { rule: 'allow-split' }: Allow splitting this element if it exceeds the chunk size
  • { rule: 'size-split', size: number }: Only split this element if its content size exceeds the specified size
Style

Links and images support an additional style property to control reference-style normalization:

  • 'inline': Convert reference-style to inline style
  • 'preserve': Keep original reference style
Transform

Each node type can have a transform function to modify or filter nodes:

type NodeTransform<T extends Nodes> = (node: T, context: TransformContext) => T | null | undefined;
  • Return modified node: Replace the original with transformed version
  • Return null: Remove the node from the tree
  • Return undefined: Keep the node unchanged

The transform receives a context with parent, index, and root information. Transforms are applied during preprocessing, after reference-style normalization but before chunking.

Examples
import { chunkdown, defaultNodeRules } from 'chunkdown';

// Never split links
chunkdown({
  chunkSize: 500,
  rules: {
    link: { split: 'never-split' }
  }
});

// Split lists only if they exceed 200 characters
chunkdown({
  chunkSize: 500,
  rules: {
    list: { split: { rule: 'size-split', size: 200 } }
  }
});

// Never split formatting by default, but allow splitting bold text
chunkdown({
  chunkSize: 500,
  rules: {
    formatting: { split: 'never-split' },  // Applies to strong, emphasis, delete
    strong: { split: 'allow-split' }       // Override: allow splitting bold text
  }
});

// Extend default rules
chunkdown({
  chunkSize: 500,
  rules: {
    ...defaultNodeRules,  // Include defaults for other elements
    link: { split: 'never-split' },
    table: { split: { rule: 'allow-split' } },
    list: { split: { rule: 'size-split', size: 150 } },
    blockquote: { split: { rule: 'size-split', size: 300 } }
  }
});

// Normalize links to inline-style, preserve images in reference-style
chunkdown({
  chunkSize: 500,
  rules: {
    link: { style: 'inline' },
    image: { style: 'preserve' }
  }
});

// Remove data URLs and truncate long links
chunkdown({
  chunkSize: 500,
  rules: {
    image: {
      transform: (node) => {
        // Remove images with data URLs
        if (node.url.startsWith('data:')) {
          return null;
        }
        return undefined;
      }
    },
    link: {
      transform: (node) => {
        // Truncate long URLs
        if (node.url.length > 100) {
          return { ...node, url: node.url.substring(0, 100) + '...' };
        }
        return undefined;
      }
    }
  }
});

Default rules:

By default, links and images are set to never split and normalize to inline style:

const defaultNodeRules: NodeRules = {
  link: { 
    split: 'never-split', 
    style: 'inline' 
  },
  image: { 
    split: 'never-split', 
    style: 'inline'
  },
};

When you provide custom rules, they override the defaults. Use the spread operator ...defaultNodeRules to explicitly include defaults if you want to override only specific elements.

import { chunkdown, defaultNodeRules } from 'chunkdown';

chunkdown({
  chunkSize: 500,
  rules: {
    ...defaultNodeRules,  // Include defaults for other elements
    link: { split: 'never-split' },
    table: { split: { rule: 'allow-split' } },
    list: { split: { rule: 'size-split', size: 150 } },
    blockquote: { split: { rule: 'size-split', size: 300 } }
  }
});
maxRawSize?: number (optional)

The maximum raw size for each chunk, counting all characters including markdown formatting.

Certain markdown elements, such as links and images with long URLs, can have disproportionately long raw sizes compared to their actual content size. For example, the following text has a content size of 21 but a raw size of 117 chars due to the long URL:

This is a [link with short text](https://example.com/with/a/very/long/url/that/increases/the/raw/size/significantly).

This is usually not a problem, but if a text contains a lot of such elements (e.g. scraped from a website with many links and images), the resulting chunks can become very large in raw size, even if their content size is within the allowed limits. When the text is then embedded by a model, the large raw size could exceed the model's token limit, causing errors. For example, OpenAI's latest embedding model text-embedding-3-large has a maximum limit of 8192 tokens, which roughly translates to about 32,000 characters (rules of thumb: 1 token ≈ 4 characters).

Note

It is recommended to set this option to the upper limit of your embedding model.

The maxRawSize option acts as a safety net that enforces a hard limit on the total number of characters allowed in each chunk. It is guaranteed that no chunk will exceed this limit, even if it means splitting semantic units that would otherwise be preserved.

Visualization

The chunk visualizer hosted at chunkdown.zirkelc.dev provides an interactive way to see how text is split into chunks:

image

Future Improvements

Normalize Broken Markdown Formatting

Splitting markdown text into multiple chunks often breaks formatting, because the start and end delimiters end up in different chunks. This broken formatting provides no real semantic meaning but adds unnecessary noise:

import { chunkdown } from "chunkdown";

const text = `**This is a very long bold text that might be split into two chunks**`;

const splitter = chunkdown({
  chunkSize: 50,
  maxOverflowRatio: 1.0
});

const chunks = splitter.splitText(text, {
  breakMode: 'keep'
});
// Keep broken markdown:
// - **This is a very long bold text that
// - might be split into two chunks**

const chunks = splitter.splitText(text, {
  breakMode: 'remove'
});
// Remove broken markdown:
// - This is a very long bold text that
// - might be split into two chunks

const chunks = splitter.splitText(text, {
  breakMode: 'extend'
});
// Extend broken markdown:
// - **This is a very long bold text that**
// - **might be split into two chunks**

Return Chunk Start and End Positions

Currently, the splitter returns the chunks as array of strings. That means the original position of each chunk in the source text is lost. In a typical RAG setup, the source document and each chunk is stored with it's embedding in a database. This duplicates lots of text since each chunk contains parts of the original document.

Chunkdown could return the start and end positions of each chunk in the original text, allowing to store only the original document and reference the chunk positions when needed.

const document = '...'; // original markdown document
const chunks = splitter.splitDocument(document);
// Result:
// [
//   { text: 'First chunk text...', start: 0, end: 256 },
//   { text: 'Second chunk text...', start: 257, end: 512 },
//   ...
// ]

await db.insert(documentTable).values({
  text: document
});

await db
  .insert(chunkTable)
  .values(chunks.map(chunk => ({
    start: chunk.start, // start position in original document
    end: chunk.end, // end position in original document
    text: null, // chunk text not stored separately
    embedding: await embed(chunk.text),
  })));

About

Markdown text splitter to create chunks worth embedding!

Resources

License

Stars

Watchers

Forks

Languages