Merged
73 changes: 37 additions & 36 deletions README.md
@@ -8,8 +8,6 @@
[![Contributors](https://img.shields.io/github/contributors/dmitry-brazhenko/SharpToken.svg)](https://github.com/dmitry-brazhenko/SharpToken/graphs/contributors)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)



SharpToken is a C# library that serves as a port of the Python [tiktoken](https://github.com/openai/tiktoken) library.
It provides functionality for encoding and decoding tokens using GPT-based encodings. This library is built for .NET 6, .NET 8
and .NET Standard 2.0, making it compatible with a wide range of frameworks.
@@ -74,11 +72,12 @@ var count = encoding.CountTokens("Hello, world!"); // Output: 4

SharpToken currently supports the following models:

* `r50k_base`
* `p50k_base`
* `p50k_edit`
* `cl100k_base`
* `o200k_base`
- `r50k_base`
- `p50k_base`
- `p50k_edit`
- `cl100k_base`
- `o200k_base`
- `o200k_harmony`

You can use any of these models when creating an instance of GptEncoding:

@@ -88,6 +87,7 @@ var p50kBaseEncoding = GptEncoding.GetEncoding("p50k_base");
var p50kEditEncoding = GptEncoding.GetEncoding("p50k_edit");
var cl100kBaseEncoding = GptEncoding.GetEncoding("cl100k_base");
var o200kBaseEncoding = GptEncoding.GetEncoding("o200k_base");
var o200kHarmonyEncoding = GptEncoding.GetEncoding("o200k_harmony");
```

### Model Prefix Matching
@@ -96,14 +96,17 @@ Apart from specifying direct model names, SharpToken also provides functionality

Here are the current supported prefixes and their corresponding encodings:

| Model Prefix | Encoding |
|---------------------|------------|
| `gpt-4o` | `o200k_base` |
| `gpt-4-` | `cl100k_base` |
| `gpt-3.5-turbo-` | `cl100k_base` |
| `gpt-35-turbo` | `cl100k_base` |
| Model Prefix | Encoding |
| ---------------- | ------------- |
| `gpt-5` | `o200k_base` |
| `gpt-4o` | `o200k_base` |
| `gpt-4-` | `cl100k_base` |
| `gpt-3.5-turbo-` | `cl100k_base` |
| `gpt-35-turbo` | `cl100k_base` |

Examples of model names that fall under these prefixes include:

- For the prefix `gpt-5`: `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `gpt-5-pro`, `gpt-5-thinking`, `gpt-5-2024-08-07`, `gpt-5-chat-latest`, etc.
- For the prefix `gpt-4o`: `gpt-4o`, `gpt-4o-2024-05-13`, etc.
- For the prefix `gpt-4-`: `gpt-4-0314`, `gpt-4-32k`, etc.
- For the prefix `gpt-3.5-turbo-`: `gpt-3.5-turbo-0301`, `gpt-3.5-turbo-0401`, etc.
@@ -117,9 +120,6 @@ string encodingName = Model.GetEncodingNameForModel("gpt-4-0314"); // This will

If the provided model name doesn't match any direct model names or prefixes, the method will return `null`.
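
To make the fallback behaviour concrete, here is a minimal standalone sketch of prefix matching. It is not SharpToken's implementation: the table is copied from the README rows above, `ResolveEncodingByPrefix` is a hypothetical helper name, and the real `Model.GetEncodingNameForModel` also checks direct model names before prefixes, which this sketch leaves out.

```csharp
using System;

// Prefix table copied from the README; the library's internal
// representation is an assumption and may differ.
var prefixToEncoding = new (string Prefix, string Encoding)[]
{
    ("gpt-5", "o200k_base"),
    ("gpt-4o", "o200k_base"),
    ("gpt-4-", "cl100k_base"),
    ("gpt-3.5-turbo-", "cl100k_base"),
    ("gpt-35-turbo", "cl100k_base"),
};

// Hypothetical helper (not SharpToken's API): returns the encoding of the
// first prefix the model name starts with, or null when nothing matches.
string? ResolveEncodingByPrefix(string modelName)
{
    foreach (var (prefix, encoding) in prefixToEncoding)
    {
        if (modelName.StartsWith(prefix, StringComparison.Ordinal))
        {
            return encoding;
        }
    }

    return null;
}

Console.WriteLine(ResolveEncodingByPrefix("gpt-5-mini"));                // o200k_base
Console.WriteLine(ResolveEncodingByPrefix("gpt-4-0314"));                // cl100k_base
Console.WriteLine(ResolveEncodingByPrefix("unknown-model") ?? "(null)"); // (null)
```

Because `gpt-4o` and `gpt-4-` differ in their sixth character, the order of these two rows does not affect the result; ordering would only matter if one prefix were itself a prefix of another.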




## Understanding Encoded Values

When you encode a string using the Encode method, the returned value is a list of integers that represent tokens in the
@@ -289,23 +289,23 @@ BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10
.NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
```

| Method | Job | Runtime | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
|------------------ |--------------------- |--------------------- |----------:|---------:|----------:|----------:|-----------:|----------:|----------:|
| **MLTokenizers** | .NET 8.0 | .NET 8.0 | 60.55 ms | 1.143 ms | 1.123 ms | 60.45 ms | 1000.0000 | - | 13.12 MB |
| **MLTokenizers** | .NET 6.0 | .NET 6.0 | 95.75 ms | 1.374 ms | 1.147 ms | 95.54 ms | 10500.0000 | - | 126.19 MB |
| **MLTokenizers** | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 291.77 ms | 5.811 ms | 11.195 ms | 291.64 ms | 21000.0000 | - | 127.33 MB |
| | | | | | | | | | |
| *SharpToken* | .NET 8.0 | .NET 8.0 | 87.78 ms | 1.700 ms | 1.590 ms | 87.34 ms | 1000.0000 | - | 22.13 MB |
| *SharpToken* | .NET 6.0 | .NET 6.0 | 128.84 ms | 1.718 ms | 1.607 ms | 128.17 ms | 16250.0000 | 500.0000 | 196.31 MB |
| *SharpToken* | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 356.21 ms | 6.843 ms | 10.854 ms | 355.09 ms | 34000.0000 | 1000.0000 | 204.39 MB |
| | | | | | | | | | |
| *TokenizerLib* | .NET 8.0 | .NET 8.0 | 109.26 ms | 2.082 ms | 4.482 ms | 107.90 ms | 18200.0000 | 600.0000 | 217.82 MB |
| *TokenizerLib* | .NET 6.0 | .NET 6.0 | 126.16 ms | 2.959 ms | 8.630 ms | 122.34 ms | 18000.0000 | 500.0000 | 217.82 MB |
| *TokenizerLib* | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 374.71 ms | 7.374 ms | 16.794 ms | 370.12 ms | 40000.0000 | 1000.0000 | 243.79 MB |
| | | | | | | | | | |
| *TiktokenSharp* | .NET 8.0 | .NET 8.0 | 177.34 ms | 3.506 ms | 8.797 ms | 174.98 ms | 28000.0000 | 1000.0000 | 338.98 MB |
| *TiktokenSharp* | .NET 6.0 | .NET 6.0 | 196.17 ms | 3.912 ms | 8.422 ms | 195.52 ms | 26000.0000 | 666.6667 | 313.26 MB |
| *TiktokenSharp* | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 488.22 ms | 9.696 ms | 15.931 ms | 487.17 ms | 63000.0000 | 1000.0000 | 378.31 MB |
| Method | Job | Runtime | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
| ---------------- | -------------------- | -------------------- | --------: | -------: | --------: | --------: | ---------: | --------: | --------: |
| **MLTokenizers** | .NET 8.0 | .NET 8.0 | 60.55 ms | 1.143 ms | 1.123 ms | 60.45 ms | 1000.0000 | - | 13.12 MB |
| **MLTokenizers** | .NET 6.0 | .NET 6.0 | 95.75 ms | 1.374 ms | 1.147 ms | 95.54 ms | 10500.0000 | - | 126.19 MB |
| **MLTokenizers** | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 291.77 ms | 5.811 ms | 11.195 ms | 291.64 ms | 21000.0000 | - | 127.33 MB |
| | | | | | | | | | |
| _SharpToken_ | .NET 8.0 | .NET 8.0 | 87.78 ms | 1.700 ms | 1.590 ms | 87.34 ms | 1000.0000 | - | 22.13 MB |
| _SharpToken_ | .NET 6.0 | .NET 6.0 | 128.84 ms | 1.718 ms | 1.607 ms | 128.17 ms | 16250.0000 | 500.0000 | 196.31 MB |
| _SharpToken_ | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 356.21 ms | 6.843 ms | 10.854 ms | 355.09 ms | 34000.0000 | 1000.0000 | 204.39 MB |
| | | | | | | | | | |
| _TokenizerLib_ | .NET 8.0 | .NET 8.0 | 109.26 ms | 2.082 ms | 4.482 ms | 107.90 ms | 18200.0000 | 600.0000 | 217.82 MB |
| _TokenizerLib_ | .NET 6.0 | .NET 6.0 | 126.16 ms | 2.959 ms | 8.630 ms | 122.34 ms | 18000.0000 | 500.0000 | 217.82 MB |
| _TokenizerLib_ | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 374.71 ms | 7.374 ms | 16.794 ms | 370.12 ms | 40000.0000 | 1000.0000 | 243.79 MB |
| | | | | | | | | | |
| _TiktokenSharp_ | .NET 8.0 | .NET 8.0 | 177.34 ms | 3.506 ms | 8.797 ms | 174.98 ms | 28000.0000 | 1000.0000 | 338.98 MB |
| _TiktokenSharp_ | .NET 6.0 | .NET 6.0 | 196.17 ms | 3.912 ms | 8.422 ms | 195.52 ms | 26000.0000 | 666.6667 | 313.26 MB |
| _TiktokenSharp_ | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 488.22 ms | 9.696 ms | 15.931 ms | 487.17 ms | 63000.0000 | 1000.0000 | 378.31 MB |

## Performance

@@ -315,15 +315,16 @@ It uses modern multibyte CPU instructions and almost no heap allocations.
All core methods have been tested on a large and a small input text.

**Inputs:**

- `SmallText`: 453 B (text/plain)
- `LargeText`: 51 KB (text/html)

**Methods:**

- `Encode`: text to tokens
- `Decode`: tokens to text
- `CountTokens`: high performance API to count tokens of text


```
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
@@ -334,8 +335,8 @@ AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
```

| Method | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------- |--------------:|------------:|------------:|------:|--------:|----------:|------------:|
| Method | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
| ------------------------ | ------------: | ----------: | ----------: | ----: | ------: | --------: | ----------: |
| **.NET 8.0** | | | | | | | |
| Encode_SmallText | 22.649 us | 0.4244 us | 0.4359 us | 0.28 | 0.01 | 696 B | 0.02 |
| Encode_LargeText | 4,542.505 us | 87.7988 us | 104.5182 us | 0.24 | 0.01 | 155547 B | 0.03 |
1 change: 1 addition & 0 deletions SharpToken.Benchmark/SharpToken.Benchmark.csproj
@@ -3,6 +3,7 @@
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFrameworks>net471;net6.0;net8.0</TargetFrameworks>
<TargetFrameworks Condition="!$([MSBuild]::IsOSPlatform('Windows'))">net6.0;net8.0</TargetFrameworks>
<Optimize>true</Optimize>
</PropertyGroup>

111 changes: 107 additions & 4 deletions SharpToken.Tests/SharpToken.Tests.cs
@@ -1,12 +1,13 @@
using System.Net.Http;
using System.Text;
using System.Linq;
using NUnit.Framework;

namespace SharpToken.Tests;

public class Tests
{
private static readonly List<string> ModelsList = new() { "p50k_base", "r50k_base", "cl100k_base", "o200k_base" };
private static readonly List<string> ModelsList = new() { "p50k_base", "r50k_base", "cl100k_base", "o200k_base", "o200k_harmony" };

private static readonly List<Tuple<string, string, List<int>>> TestData =
TestHelpers.ReadTestPlans("SharpToken.Tests.data.TestPlans.txt");
@@ -23,7 +24,19 @@ public void TestEncodingAndDecoding(Tuple<string, string, List<int>> resource)
var (encodingName, textToEncode, expectedEncoded) = resource;

var encoding = GptEncoding.GetEncoding(encodingName);
var encoded = encoding.Encode(textToEncode);

// Detect if the text contains special tokens
var allowedSpecial = new HashSet<string>();
var specialTokens = GetSpecialTokensForEncoding(encodingName);
foreach (var token in specialTokens)
{
if (textToEncode.Contains(token))
{
allowedSpecial.Add(token);
}
}

var encoded = encoding.Encode(textToEncode, allowedSpecial);
var decodedText = encoding.Decode(encoded);
Assert.Multiple(() =>
{
@@ -39,7 +52,19 @@ public void TestTokensLength(Tuple<string, string, List<int>> resource)
var (encodingName, textToEncode, expectedEncoded) = resource;

var encoding = GptEncoding.GetEncoding(encodingName);
var tokenLength = encoding.CountTokens(textToEncode);

// Detect if the text contains special tokens
var allowedSpecial = new HashSet<string>();
var specialTokens = GetSpecialTokensForEncoding(encodingName);
foreach (var token in specialTokens)
{
if (textToEncode.Contains(token))
{
allowedSpecial.Add(token);
}
}

var tokenLength = encoding.CountTokens(textToEncode, allowedSpecial);
Assert.Multiple(() =>
{
Assert.That(tokenLength, Is.EqualTo(expectedEncoded.Count));
@@ -53,7 +78,19 @@ public async Task TestEncodingAndDecodingInParallel()
{
var (encodingName, textToEncode, expectedEncoded) = _;
var encoding = GptEncoding.GetEncoding(encodingName);
var encoded = encoding.Encode(textToEncode);

// Detect if the text contains special tokens
var allowedSpecial = new HashSet<string>();
var specialTokens = GetSpecialTokensForEncoding(encodingName);
foreach (var token in specialTokens)
{
if (textToEncode.Contains(token))
{
allowedSpecial.Add(token);
}
}

var encoded = encoding.Encode(textToEncode, allowedSpecial);
var decodedText = encoding.Decode(encoded);
return (textToEncode, encoded, expectedEncoded, decodedText);
}));
@@ -162,6 +199,13 @@ static void TestModelPrefixMappingFailsAction()
[TestCaseSource(nameof(ModelsList))]
public async Task TestLocalResourceMatchesRemoteResource(string modelName)
{
// Skip o200k_harmony as it reuses o200k_base.tiktoken and doesn't have its own remote file
if (modelName == "o200k_harmony")
{
Assert.Pass("o200k_harmony reuses o200k_base.tiktoken file and doesn't have its own remote file");
return;
}

var embeddedResourceName = $"SharpToken.data.{modelName}.tiktoken";
var remoteResourceUrl = $"https://openaipublic.blob.core.windows.net/encodings/{modelName}.tiktoken";

@@ -199,4 +243,63 @@ public void TestEncodingForModel()
Assert.That(decodedText, Is.EqualTo(inputText));
});
}

[Test]
public void TestO200KHarmonySpecialTokens()
{
var encoding = GptEncoding.GetEncoding("o200k_harmony");
const string inputText = "Hello, world!";

// Test basic encoding/decoding
var encoded = encoding.Encode(inputText);
var decodedText = encoding.Decode(encoded);
Assert.That(decodedText, Is.EqualTo(inputText));

// Test that o200k_harmony has more special tokens than o200k_base
var baseEncoding = GptEncoding.GetEncoding("o200k_base");

// Test encoding with special tokens
var textWithSpecialTokens = "Hello <|startoftext|> world <|call|> test <|reserved_200020|>";
var encodedSpecial = encoding.Encode(textWithSpecialTokens, allowedSpecial: new HashSet<string> { "<|startoftext|>", "<|call|>", "<|reserved_200020|>" });
var decodedSpecial = encoding.Decode(encodedSpecial);

Assert.That(decodedSpecial, Is.EqualTo(textWithSpecialTokens));

// Verify specific special token IDs
Assert.That(encoding.Encode("<|startoftext|>", allowedSpecial: new HashSet<string> { "<|startoftext|>" }), Is.EqualTo(new List<int> { 199998 }));
Assert.That(encoding.Encode("<|call|>", allowedSpecial: new HashSet<string> { "<|call|>" }), Is.EqualTo(new List<int> { 200012 }));
Assert.That(encoding.Encode("<|reserved_200020|>", allowedSpecial: new HashSet<string> { "<|reserved_200020|>" }), Is.EqualTo(new List<int> { 200020 }));
}

[Test]
public void TestGPT5ModelMappings()
{
// Test that GPT-5 models map to the correct encodings
Assert.That(Model.GetEncodingNameForModel("gpt-5"), Is.EqualTo("o200k_base"));
Assert.That(Model.GetEncodingNameForModel("gpt-5-mini"), Is.EqualTo("o200k_base"));
Assert.That(Model.GetEncodingNameForModel("gpt-5-nano"), Is.EqualTo("o200k_base"));
Assert.That(Model.GetEncodingNameForModel("gpt-5-pro"), Is.EqualTo("o200k_base"));
Assert.That(Model.GetEncodingNameForModel("gpt-5-thinking"), Is.EqualTo("o200k_base"));

// Test prefix matching for GPT-5 variants
Assert.That(Model.GetEncodingNameForModel("gpt-5-2024-08-07"), Is.EqualTo("o200k_base"));
Assert.That(Model.GetEncodingNameForModel("gpt-5-chat-latest"), Is.EqualTo("o200k_base"));
}

private static HashSet<string> GetSpecialTokensForEncoding(string encodingName)
{
return encodingName switch
{
"r50k_base" or "p50k_base" => new HashSet<string> { "<|endoftext|>" },
"p50k_edit" => new HashSet<string> { "<|endoftext|>", "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>" },
"cl100k_base" => new HashSet<string> { "<|endoftext|>", "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|endofprompt|>" },
"o200k_base" => new HashSet<string> { "<|endoftext|>", "<|endofprompt|>" },
"o200k_harmony" => new HashSet<string>(new HashSet<string>
{
"<|endoftext|>", "<|endofprompt|>", "<|startoftext|>", "<|return|>", "<|constrain|>",
"<|channel|>", "<|start|>", "<|end|>", "<|message|>", "<|call|>"
}.Union(Enumerable.Range(200000, 1088).Select(i => $"<|reserved_{i}|>"))),
_ => new HashSet<string>()
};
}
}
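
The "detect special tokens" loop appears three times in the tests above with identical logic. As a sketch only (the PR does not contain this helper, and the name `DetectAllowedSpecial` is hypothetical), the loop could be expressed once as a small function that keeps just the special tokens literally present in the input:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the PR: returns the subset of
// specialTokens that occurs verbatim in the text (ordinal comparison).
HashSet<string> DetectAllowedSpecial(string text, IEnumerable<string> specialTokens)
{
    return new HashSet<string>(specialTokens.Where(token => text.Contains(token)));
}

// The two special tokens the test helper lists for o200k_base.
var o200kBaseSpecial = new[] { "<|endoftext|>", "<|endofprompt|>" };

var allowed = DetectAllowedSpecial("Hello <|endoftext|> world", o200kBaseSpecial);
Console.WriteLine(string.Join(", ", allowed)); // <|endoftext|>
```

Under that assumption, each test body would reduce to a single call such as `encoding.Encode(textToEncode, DetectAllowedSpecial(textToEncode, GetSpecialTokensForEncoding(encodingName)))`, using the `Encode` overload and `GetSpecialTokensForEncoding` helper already shown in the diff.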
1 change: 1 addition & 0 deletions SharpToken.Tests/SharpToken.Tests.csproj
@@ -2,6 +2,7 @@

<PropertyGroup>
<TargetFrameworks>net471;netcoreapp3.1;net6.0;net8.0</TargetFrameworks>
<TargetFrameworks Condition="!$([MSBuild]::IsOSPlatform('Windows'))">net6.0;net8.0</TargetFrameworks>
<LangVersion>preview</LangVersion>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>