
FineWeb SP4096 tokenizer package#1

Draft
newjordan wants to merge 5 commits into main from submission/sp4096-tokenizer-audit


@newjordan newjordan commented Apr 18, 2026

FineWeb 4096-vocab SentencePiece BPE tokenizer package.

Included:

  • fineweb_4096_bpe.model
  • fineweb_4096_bpe.vocab
  • rebuild scripts
  • tokenizer spec
  • short README

This draft PR is only for review in newjordan/parameter-golf-1. Nothing has been submitted upstream.
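For reviewers, a minimal sketch of a sanity check on the `.vocab` file listed above. It assumes the usual SentencePiece `.vocab` layout (one `piece<TAB>score` entry per line) and that the file sits in the current directory; the helper names `read_vocab` and `check_vocab` are hypothetical and not part of the package's rebuild scripts.

```python
from pathlib import Path

# Taken from the "4096-vocab" description in this PR.
EXPECTED_VOCAB_SIZE = 4096


def read_vocab(path):
    """Return (piece, score) pairs from a SentencePiece-style .vocab file.

    Assumption: one entry per line, piece and score separated by a tab.
    """
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate a trailing blank line
            # Split on the last tab so the score is always the final field.
            piece, sep, score = line.rpartition("\t")
            if not sep:
                raise ValueError(f"line {line_no}: expected 'piece<TAB>score'")
            entries.append((piece, float(score)))
    return entries


def check_vocab(path, expected_size=EXPECTED_VOCAB_SIZE):
    """Report the entry count, whether it matches, and duplicate pieces."""
    entries = read_vocab(path)
    pieces = [piece for piece, _ in entries]
    return {
        "size": len(entries),
        "size_ok": len(entries) == expected_size,
        "duplicates": len(pieces) - len(set(pieces)),
    }


if __name__ == "__main__":
    vocab_path = Path("fineweb_4096_bpe.vocab")  # assumed location
    if vocab_path.exists():
        print(check_vocab(vocab_path))
    else:
        print(f"{vocab_path} not found; run from the package directory")
```

This only inspects the text vocab file with the standard library; confirming that `fineweb_4096_bpe.model` agrees with it would require loading the model with the `sentencepiece` package.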

@newjordan newjordan changed the title from "[codex] Add SP4096 tokenizer audit package" to "FineWeb SP4096 tokenizer package" Apr 18, 2026
@newjordan newjordan force-pushed the submission/sp4096-tokenizer-audit branch from ba8ea78 to a4bb58b April 18, 2026 01:52
@newjordan newjordan force-pushed the submission/sp4096-tokenizer-audit branch from a4bb58b to ca86574 April 18, 2026 01:55
newjordan and others added 4 commits April 17, 2026 21:13
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>