Commit 1747032

docs: add new section for data formats and introduce TOON format documentation
1 parent a9390da commit 1747032

3 files changed

Lines changed: 209 additions & 1 deletion

README.md

Lines changed: 5 additions & 0 deletions
@@ -50,6 +50,11 @@ In addition to these projects, I regularly share my insights and learnings on th
50 50

51 51
- [TOML vs. YAML](configuration/yaml-vs-toml.ipynb): Choosing the right configuration format for your projects.
52 52

53+
## Data Formats
54+
55+
- Top 5 Formats: The top 5 structured data formats for data science.
56+
- TOON: Token-efficient, human-readable serialization format optimized for LLM contexts.
57+
53 58
## 🧩 Data Structures
54 59

55 60
- [Sorting Algorithms](data-structure/sorting-popular.ipynb): A comprehensive guide to understanding and implementing popular sorting algorithms in Python.

website/docs/format/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
# Data Format
1+
# Data Formats
2 2

3 3
```mdx-code-block
4 4
import DocCardList from '@theme/DocCardList';

website/docs/format/toon.md

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
1+
---
2+
title: TOON
3+
description: Token-efficient, human-readable format for LLMs
4+
tags: [Format, TOON, JSON, CSV, YAML, LLM, Serialization]
5+
---
6+
7+
# TOON (Token-Oriented Object Notation)
8+
9+
TOON is a compact, human-readable serialization format designed for Large Language Model (LLM) contexts. It targets 30–60% token reduction versus JSON for uniform tabular data, while staying deterministic, easy to read, and compatible with JSON’s data model.
10+
11+
- Specification: https://github.com/toon-format/spec/blob/main/SPEC.md
12+
- Reference implementation: https://github.com/toon-format/toon
13+
- Use cases: LLM prompts, RAG pipelines, agent protocols, configuration, AI data interchange
14+
15+
## Why TOON?
16+
17+
- Token efficiency directly reduces cost and expands context capacity in LLM apps
18+
- Deterministic encoding and explicit array lengths improve validation and safety
19+
- Human-readable, indentation-based structure with minimal quoting
20+
- Works naturally with tabular data while preserving JSON compatibility
21+
22+
## Data Model
23+
24+
TOON models the same types as JSON:
25+
26+
- Primitives: string, number, boolean, null
27+
- Objects: mapping from string keys to values
28+
- Arrays: ordered sequences of values
29+
30+
Encoders normalize numbers to non-exponential decimal form (e.g., `1e6 -> 1000000`; `-0 -> 0`). Decoders accept both decimal and exponent forms, so numeric values survive a round trip unchanged.
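
The normalization described above can be sketched in Python with the standard `decimal` module. This is a minimal illustration, not the reference encoder: the function name `canonical_number` is hypothetical, and the sketch leaves trailing fractional zeros and non-finite values untouched because the text above does not specify them.

```python
from decimal import Decimal


def canonical_number(text: str) -> str:
    """Render a numeric literal without an exponent and without a signed zero."""
    value = Decimal(text)  # accepts both "1000000" and "1e6"
    if value == 0:
        return "0"  # collapses -0, 0.0 and 0e5 into a single spelling
    return format(value, "f")  # the "f" presentation never emits an exponent


print(canonical_number("1e6"))
print(canonical_number("-0"))
print(canonical_number("2.5e-3"))
```

Using `Decimal` rather than `float` avoids binary rounding while the literal is being respelled.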
31+
32+
## Core Syntax at a Glance
33+
34+
### Objects
35+
36+
```toon title="TOON"
37+
id: 123
38+
name: Ada
39+
active: true
40+
```
41+
42+
```toon title="TOON"
43+
user:
44+
  id: 123
45+
  name: Ada
46+
```
47+
48+
Rules:
49+
50+
- `key: value` for primitives, with a single space after the colon
51+
- `key:` alone opens a nested object; nested fields are indented by one level
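
As a rough illustration of the two rules above, the following Python sketch emits nested dictionaries in this layout. The names `format_primitive` and `encode_object` are hypothetical, the two-space indent is taken from the default mentioned later in this page, and quoting of unsafe strings and keys is deliberately left out.

```python
def format_primitive(value):
    """Spell a Python primitive the way the examples above do."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool before numbers: bool subclasses int
        return "true" if value else "false"
    return str(value)  # assumption: the string needs no quoting


def encode_object(obj, depth=0, indent=2):
    """Return the lines for a dict whose values are primitives or nested dicts."""
    lines = []
    pad = " " * (indent * depth)
    for key, value in obj.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")  # "key:" alone opens a nested object
            lines.extend(encode_object(value, depth + 1, indent))
        else:
            lines.append(f"{pad}{key}: {format_primitive(value)}")
    return lines


print("\n".join(encode_object({"user": {"id": 123, "name": "Ada"}, "active": True})))
```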
52+
53+
### Arrays
54+
55+
Primitive arrays are inline with explicit lengths:
56+
57+
```toon title="TOON"
58+
tags[3]: admin,ops,dev
59+
scores[4]: 95,87,92,88
60+
```
61+
62+
Arrays of uniform objects can use a compact tabular form:
63+
64+
```toon title="TOON"
65+
products[3]{sku,name,price}:
66+
  A001,Widget,9.99
67+
  B002,Gadget,14.50
68+
  C003,Tool,7.25
69+
```
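
To make the tabular form concrete, here is a small Python sketch that turns a list of uniform dictionaries into a header plus rows. `encode_table` and `format_cell` are hypothetical helper names, only the default comma delimiter is handled, and values are assumed not to need quoting.

```python
def format_cell(value):
    """Spell one cell; bool is checked first because it subclasses int."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def encode_table(key, rows):
    """Encode a non-empty list of dicts that all share the same keys."""
    fields = list(rows[0])
    if any(set(row) != set(fields) for row in rows):
        raise ValueError("rows are not uniform; fall back to the hyphen list form")
    # Header carries the explicit length and the field names.
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    body = ["  " + ",".join(format_cell(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *body])


products = [
    {"sku": "A001", "name": "Widget", "price": 9.99},
    {"sku": "B002", "name": "Gadget", "price": 14.5},
]
print(encode_table("products", products))
```

The field names are written once in the header instead of once per row, which is where the savings over JSON come from for uniform data.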
70+
71+
Mixed/non-uniform arrays use list items with hyphens:
72+
73+
```toon title="TOON"
74+
items[3]:
75+
  - 42
76+
  - text value
77+
  - id: 1
78+
    name: nested object
79+
```
80+
81+
### Delimiters
82+
83+
The active delimiter can be a comma (default), tab, or pipe. It is declared in the array header and applies to inline arrays and tabular rows within that array's scope. In the tab example, `\t` stands for a literal tab character.
84+
85+
```toon title="TOON"
86+
items[2\t]{sku\tname\tprice}:
87+
  A1\tWidget, Inc.\t9.99
88+
  B2\tGadget Co.\t14.5
89+
```
90+
91+
```toon title="TOON"
92+
tags[3|]: reading|gaming,fun|coding
93+
```
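
The way a header selects the delimiter for its own values can be sketched as follows. The regular expression and the name `parse_inline_array` are assumptions made for illustration; the sketch handles only keyed inline arrays of unquoted values and is not a full parser.

```python
import re

# key[N<delim>]: values  (the delimiter marker is absent for the default comma)
HEADER = re.compile(r"^(?P<key>[A-Za-z_][\w.]*)\[(?P<n>\d+)(?P<delim>[\t|]?)\]: ?(?P<rest>.*)$")


def parse_inline_array(line):
    """Split an inline primitive array using the delimiter its header declares."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("not an inline array header")
    delimiter = match["delim"] or ","
    values = match["rest"].split(delimiter) if match["rest"] else []
    if len(values) != int(match["n"]):
        raise ValueError(f"declared {match['n']} values, found {len(values)}")
    return match["key"], values


print(parse_inline_array("tags[3|]: reading|gaming,fun|coding"))
```

With the pipe declared as the delimiter, the comma inside `gaming,fun` is ordinary text, so the array still has three elements.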
94+
95+
### Quoting Rules (Encoder)
96+
97+
Encoders quote a string value when it is empty, has leading or trailing whitespace, looks numeric, equals `true`, `false`, or `null`, contains a colon, double quote, backslash, bracket, brace, or control character, contains the relevant delimiter (the active delimiter inside an array scope, the document delimiter otherwise), or equals or starts with `-`.
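
One way to read the quoting rule is as a single predicate. The Python sketch below is an approximation under stated assumptions: `needs_quotes` is a hypothetical name, and the `NUMERIC` pattern is only one plausible reading of "looks numeric"; the specification is the authority on the exact definition.

```python
import re

# Assumption: a simple pattern for values that "look numeric".
NUMERIC = re.compile(r"^-?\d+(\.\d+)?([eE][+-]?\d+)?$")
STRUCTURAL = set(':"\\[]{}')


def needs_quotes(value: str, delimiter: str = ",") -> bool:
    """Return True when an encoder should wrap this string value in quotes."""
    if value == "" or value != value.strip():
        return True  # empty, or leading/trailing whitespace
    if value in ("true", "false", "null") or NUMERIC.match(value):
        return True  # would otherwise decode as a non-string primitive
    if value.startswith("-"):
        return True  # equals or starts with a hyphen
    if delimiter in value:
        return True  # contains the delimiter relevant to this scope
    return any(ch in STRUCTURAL or ord(ch) < 0x20 for ch in value)


print(needs_quotes("Widget, Inc."))
print(needs_quotes("Widget, Inc.", delimiter="\t"))
```

The second call shows why the delimiter is a parameter: the same value is safe unquoted once the scope uses tabs instead of commas.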
98+
99+
Keys may be unquoted only if they match `^[A-Za-z_][\w.]*$`; otherwise they must be quoted (including in array headers):
100+
101+
```toon title="TOON"
102+
"my-key"[3]: 1,2,3
103+
"x-items"[2]{id,name}:
104+
  1,Ada
105+
  2,Bob
106+
```
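
The key rule maps directly onto a regular-expression check. In this sketch `format_key` is a hypothetical helper, and `re.ASCII` is an assumption added so that Python's `\w` covers only ASCII letters, digits, and underscore.

```python
import re

# Pattern copied from the rule above; re.ASCII keeps \w to [A-Za-z0-9_].
SAFE_KEY = re.compile(r"^[A-Za-z_][\w.]*$", re.ASCII)


def format_key(key: str) -> str:
    """Return the key bare when the pattern allows it, otherwise quoted."""
    if SAFE_KEY.fullmatch(key):
        return key
    escaped = key.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(format_key("user.name"))
print(format_key("my-key"))
```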
107+
108+
## Strictness, Indentation, and Validation
109+
110+
- UTF-8 with LF line endings
111+
- Consistent spaces for indentation (default 2); tabs are not allowed for indentation
112+
- Strict mode (default) enforces:
113+
  - Array lengths and tabular row widths match declarations
114+
  - Only valid escapes in quoted strings: `\\`, `\"`, `\n`, `\r`, `\t`
115+
  - No missing colons, indentation errors, or delimiter mismatches
116+
  - No blank lines inside arrays/tabular rows
117+
118+
These checks help detect truncation, malformed tokens, or injected rows.
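
The length and width checks can be sketched as a small validator. `check_tabular_block` is a hypothetical name, rows are split naively on the delimiter, and quoted cells that contain the delimiter are not handled.

```python
def check_tabular_block(declared_len, fields, row_lines, delimiter=","):
    """Return a list of strict-mode style problems for one tabular array."""
    errors = []
    if len(row_lines) != declared_len:
        errors.append(f"declared {declared_len} rows, found {len(row_lines)}")
    for number, line in enumerate(row_lines, start=1):
        if not line.strip():
            errors.append(f"row {number}: blank line inside array")
            continue
        width = len(line.split(delimiter))
        if width != len(fields):
            errors.append(f"row {number}: expected {len(fields)} values, got {width}")
    return errors


fields = ["id", "name", "active"]
print(check_tabular_block(2, fields, ["1,Alice,true"]))  # simulates a truncated block
```

A truncated response shows up as a row-count mismatch, and an injected or malformed row shows up as a width mismatch.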
119+
120+
## Root Forms
121+
122+
- Root array: if the first non-empty top-level line is a valid header (must end with `:`)
123+
- Root primitive: exactly one non-empty line that is neither header nor key-value
124+
- Otherwise: root object
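
The three-way decision above might be sketched like this. The patterns `ARRAY_HEADER` and `KEY_VALUE` and the function `root_form` are illustrative assumptions that approximate the header and key syntax shown on this page; they are not taken from the specification.

```python
import re

# Assumed shapes: "[N]:" or "[N]{fields}:" for a root array header,
# and a bare or quoted key (optionally with an array header) followed by ":".
ARRAY_HEADER = re.compile(r"^\[\d+[\t|]?\](\{.*\})?:")
KEY_VALUE = re.compile(r'^("(?:[^"\\]|\\.)*"|[A-Za-z_][\w.]*)(\[\d+[\t|]?\](\{.*\})?)?:')


def root_form(document: str) -> str:
    """Classify a document as 'array', 'primitive', or 'object'."""
    lines = [line for line in document.split("\n") if line.strip()]
    if lines and ARRAY_HEADER.match(lines[0]):
        return "array"
    if len(lines) == 1 and not KEY_VALUE.match(lines[0]):
        return "primitive"
    return "object"


print(root_form("[3]: 1,2,3"))
print(root_form("hello"))
print(root_form("id: 123\nname: Ada"))
```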
125+
126+
## Interoperability
127+
128+
### JSON
129+
130+
TOON encodes JSON-compatible structures deterministically and round-trips safely. Example conversion:
131+
132+
```json title="JSON"
133+
{
134+
  "users": [
135+
    { "id": 1, "name": "Alice", "active": true },
136+
    { "id": 2, "name": "Bob", "active": false }
137+
  ],
138+
"count": 2
139+
}
140+
```
141+
142+
```toon title="TOON"
143+
users[2]{id,name,active}:
144+
  1,Alice,true
145+
  2,Bob,false
146+
count: 2
147+
```
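
To check that the two documents above describe the same data, a toy decoder for just this subset (top-level `key: value` pairs and comma-delimited tabular arrays) can be compared against `json.loads`. `decode_flat` and `parse_primitive` are hypothetical names; quoting, nesting, other delimiters, and exponent notation are out of scope for this sketch.

```python
import json
import re

TABLE = re.compile(r"^(\w+)\[(\d+)\]\{([^}]*)\}:$")


def parse_primitive(token):
    """Map an unquoted token to bool, None, int, float, or str."""
    if token in ("true", "false", "null"):
        return {"true": True, "false": False, "null": None}[token]
    if re.fullmatch(r"-?\d+", token):
        return int(token)
    if re.fullmatch(r"-?\d+\.\d+", token):
        return float(token)
    return token


def decode_flat(toon):
    """Decode top-level scalars and tabular arrays into a dict."""
    result, lines, i = {}, toon.split("\n"), 0
    while i < len(lines):
        line = lines[i]
        match = TABLE.match(line)
        if match:
            key, count, fields = match[1], int(match[2]), match[3].split(",")
            rows = lines[i + 1 : i + 1 + count]  # the header says how many rows follow
            result[key] = [
                dict(zip(fields, map(parse_primitive, row.strip().split(","))))
                for row in rows
            ]
            i += 1 + count
        elif line.strip():
            key, _, value = line.partition(": ")
            result[key] = parse_primitive(value)
            i += 1
        else:
            i += 1  # skip blank lines between top-level entries
    return result


toon_text = "users[2]{id,name,active}:\n  1,Alice,true\n  2,Bob,false\ncount: 2"
json_text = '{"users": [{"id": 1, "name": "Alice", "active": true}, {"id": 2, "name": "Bob", "active": false}], "count": 2}'
print(decode_flat(toon_text) == json.loads(json_text))
print(len(toon_text), len(json_text))  # character counts, a rough proxy for size
```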
148+
149+
### CSV
150+
151+
TOON’s tabular mode generalizes CSV with:
152+
153+
- Explicit array length and field names in the header
154+
- Support for nested structures
155+
- Type-aware primitives and configurable delimiters
156+
157+
### YAML
158+
159+
TOON shares YAML's indentation and `- ` list markers but differs in its deterministic encoding, its explicit array headers and lengths, and its lack of comments.
160+
161+
## Examples and Edge Cases
162+
163+
```toon title="TOON"
164+
pairs[2]:
165+
  - [2]: 1,2
166+
  - [2]: 3,4
167+
```
168+
169+
```toon title="TOON"
170+
links[2]{id,url}:
171+
  1,"http://a:b"
172+
  2,"https://example.com?q=a:b"
173+
```
174+
175+
```toon title="TOON"
176+
root:
177+
  level1:
178+
    level2:
179+
      level3:
180+
        items[2]{id,val}:
181+
          1,a
182+
          2,b
183+
```
184+
185+
```toon title="TOON"
186+
message: Hello 世界 👋
187+
tags[3]: 🎉,🎊,🎈
188+
```
189+
190+
## When to Use TOON
191+
192+
Use TOON when token efficiency and readability matter:
193+
194+
- LLM prompts and agent communication
195+
- RAG pipelines and intermediate representations
196+
- Compact, validated tabular structures
197+
- Human-reviewed configuration in AI contexts
198+
199+
Prefer JSON when maximal ecosystem compatibility is required and token costs are not a concern.
200+
201+
## Further Reading
202+
203+
- Python implementation: https://github.com/toon-format/toon-python
