Tokenization Efficiency Benchmark - Complete Documentation

Project Overview

This project implements and compares three different tokenization strategies for Polish language modeling using a Transformer-based neural language model. The primary goal is to analyze how different tokenization approaches affect model performance, efficiency, and text representation quality.

Key Features

Three Tokenization Strategies: GPT-2 BPE (pre-trained), Whitespace (custom), and SentencePiece (custom-trained)
Identical Model Architecture: All tokenizers use the same 50M parameter Transformer to ensure fair comparison
Comprehensive Evaluation: Word-level and character-level perplexity (not token-level, as per assignment requirements)
OOV Analysis: Out-of-vocabulary statistics for whitespace tokenizer
Efficiency Metrics: Tokens per word, training time, inference speed
Qualitative Analysis: Tokenization comparison on sample texts
GPU Optimization: Optimized for NVIDIA RTX 5090 with mixed precision training
Automated Visualization: Training plots, metrics tracking, and comparison tools

Assignment Context

This implementation fulfills the requirements of the Computational Linguistics lab assignment on tokenization efficiency:

Implement three tokenizers: pre-trained (GPT-2), custom whitespace-based, and SentencePiece
Train three identical models (same architecture, hyperparameters, data) - one per tokenizer
Evaluate using word-level and character-level perplexity (NOT token-level)
Report OOV statistics for whitespace tokenizer
Analyze efficiency metrics: tokens per word, training time, inference speed
Provide qualitative analysis with tokenization examples

Tokenization Strategies

1. GPT-2 BPE Tokenizer (Pre-trained)

The GPT-2 tokenizer uses Byte-Pair Encoding (BPE) trained on a large English corpus.

Characteristics:

Pre-trained on English text
Vocabulary size: 50,257 tokens
Subword tokenization eliminates OOV issues
Inefficient for Polish (averages 3.43 tokens per word)
No language-specific optimization

Implementation:

from utils.gpt2_tokenizer import GPT2Tokenizer
tokenizer = GPT2Tokenizer(vocab_size=50257)

2. Whitespace Tokenizer (Custom)

A simple tokenizer that splits on whitespace and treats punctuation as separate tokens.

Characteristics:

Splits text on whitespace and punctuation
Fixed vocabulary size: 10,000 most frequent words
Uses <UNK> token for out-of-vocabulary words
Most efficient encoding (1.20 tokens per word)
Suffers from OOV issues (9.84% OOV rate on test data)

Algorithm:

Split text on whitespace
Separate punctuation marks as individual tokens
Build vocabulary from top N most frequent tokens in training data
Replace unknown tokens with <UNK> during encoding

Implementation:

from utils.whitespace_tokenizer import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer(vocab_size=10000)
tokenizer.train(texts)

3. SentencePiece Tokenizer (Custom-trained)

A subword tokenizer trained on the target Polish corpus using the SentencePiece library.

Characteristics:

Trained on Polish gardening forum corpus
Vocabulary size: 10,000 tokens
BPE algorithm for subword segmentation
Language-specific optimization
Balanced efficiency (1.53 tokens per word)
No OOV issues due to subword tokenization

Training:

from utils.sentencepiece_tokenizer import SentencePieceTokenizer
tokenizer = SentencePieceTokenizer(vocab_size=10000, model_type="bpe")
tokenizer.train(texts)

Model Architecture

All three tokenizers use an identical Transformer language model architecture to ensure fair comparison. The model is based on "Attention Is All You Need" (Vaswani et al., 2017) and adapted for causal language modeling.

Transformer Architecture

INPUT SEQUENCE (token IDs)
    |
    v
+-------------------------------------------------------------+
|  TOKEN EMBEDDING LAYER                                      |
|  Maps token IDs to dense vectors                            |
|  Input:  (batch_size, seq_len)                             |
|  Output: (batch_size, seq_len, d_model=512)               |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  SCALE EMBEDDINGS                                           |
|  Multiply by sqrt(d_model) for better gradient flow        |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  POSITIONAL ENCODING                                        |
|  Add sinusoidal position information                        |
|  PE(pos,2i)   = sin(pos / 10000^(2i/d_model))             |
|  PE(pos,2i+1) = cos(pos / 10000^(2i/d_model))             |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  DROPOUT (p=0.3)                                            |
+-------------------------------------------------------------+
    |
    | (Repeat 12 times - 12 Transformer Decoder Layers)
    |
    v
+-------------------------------------------------------------+
|  TRANSFORMER DECODER LAYER                                  |
|                                                             |
|  +-------------------------------------------------------+  |
|  |  MULTI-HEAD SELF-ATTENTION (8 heads)                  |  |
|  |  +---------+ +---------+ +---------+ +---------+      |  |
|  |  | Head 1  | | Head 2  | | Head 3  | |  ...    |      |  |
|  |  | Q K V   | | Q K V   | | Q K V   | | Head 8  |      |  |
|  |  +---------+ +---------+ +---------+ +---------+      |  |
|  |  With Causal Mask (prevents attending to future)      |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|                         v                                   |
|  +-------------------------------------------------------+  |
|  |  ADD & LAYER NORM                                     |  |
|  |  LayerNorm(x + MultiHeadAttention(x))                |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|                         v                                   |
|  +-------------------------------------------------------+  |
|  |  FEED-FORWARD NETWORK                                 |  |
|  |  FFN(x) = ReLU(xW1 + b1)W2 + b2                      |  |
|  |  Dimensions: 512 -> 2048 -> 512                       |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|                         v                                   |
|  +-------------------------------------------------------+  |
|  |  ADD & LAYER NORM                                     |  |
|  |  LayerNorm(x + FFN(x))                               |  |
|  +-------------------------------------------------------+  |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  LINEAR PROJECTION TO VOCABULARY                            |
|  Input:  (batch_size, seq_len, d_model=512)              |
|  Output: (batch_size, seq_len, vocab_size)               |
+-------------------------------------------------------------+
    |
    v
LOGITS (predictions for next token)

Causal Mask

The causal mask ensures the model can only attend to previous positions, preventing information leakage from future tokens:

For sequence length 5:

Attention Mask (1 = can attend, 0 = cannot attend):
       t1  t2  t3  t4  t5
    +--------------------+
t1  | 1   0   0   0   0 |  Token 1 can only see itself
t2  | 1   1   0   0   0 |  Token 2 can see t1, t2
t3  | 1   1   1   0   0 |  Token 3 can see t1, t2, t3
t4  | 1   1   1   1   0 |  Token 4 can see t1-t4
t5  | 1   1   1   1   1 |  Token 5 can see all previous
    +--------------------+

Model Parameters

Model dimension (d_model): 512
Number of attention heads: 8
Head dimension (d_k): 64 (512/8)
Number of layers: 12
Feed-forward dimension: 2048
Dropout: 0.3
Maximum sequence length: 256
Total parameters: Approximately 50M

Note: The vocabulary size varies by tokenizer:

GPT-2: 50,257 tokens
Whitespace: 10,000 tokens
SentencePiece: 10,000 tokens

Project Structure

.
├── data/
│   ├── raw/                                    # Raw input data
│   │   └── forum_forum_poradnikogrodniczy_pl_corpus.txt
│   ├── processed/                              # Preprocessed data
│   │   ├── forum_forum_poradnikogrodniczy_pl_corpus_gpt2_*.pt
│   │   ├── forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece_*.pt
│   │   ├── forum_forum_poradnikogrodniczy_pl_corpus_whitespace_*.pt
│   │   └── *_metadata.json                    # Dataset statistics
│   └── wikipedia/                              # Out-of-domain data (optional)
│
├── models/                         # Model implementations
│   ├── __init__.py
│   └── transformer_model.py       # Transformer language model
│
├── utils/                          # Utility modules
│   ├── __init__.py
│   ├── config.py                  # Configuration classes with GPU optimization
│   ├── tokenizer_factory.py      # Tokenizer factory for all three types
│   ├── gpt2_tokenizer.py          # GPT-2 BPE tokenizer
│   ├── whitespace_tokenizer.py   # Custom whitespace tokenizer
│   ├── sentencepiece_tokenizer.py # SentencePiece tokenizer
│   ├── dataset.py                 # PyTorch datasets and dataloaders
│   ├── metrics.py                 # Evaluation metrics
│   └── plotting.py                # Training visualization
│
├── scripts/                        # Executable scripts
│   ├── __init__.py
│   ├── preprocess_data.py         # Data preprocessing (supports all tokenizers)
│   ├── train.py                   # Model training with GPU optimization
│   ├── evaluate.py                # Model evaluation
│   ├── generate.py                # Text generation
│   ├── visualize_metrics.py       # Standalone visualization
│   ├── compare_tokenizers_evaluation.py  # Complete tokenizer comparison
│   ├── qualitative_analysis.py    # Qualitative tokenization analysis
│   ├── generate_final_report.py   # Report generation
│   ├── check_gpu_config.py        # GPU configuration checker
│   └── benchmark_gpu.py           # GPU performance benchmark
│
├── checkpoints/                    # Saved model checkpoints
│   ├── forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt
│   ├── forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece_sentencepiece_transformer_best.pt
│   └── forum_forum_poradnikogrodniczy_pl_corpus_whitespace_whitespace_transformer_best.pt
│
├── results/                        # Evaluation results and visualizations
│   ├── plots/                      # Training visualization plots
│   │   ├── *_loss_epoch_N.png
│   │   ├── *_perplexity_epoch_N.png
│   │   ├── *_combined_metrics_epoch_N.png
│   │   └── *_overfitting_analysis_epoch_N.png
│   ├── *_metrics.json              # Training metrics
│   ├── tokenizer_comparison_results.json  # Complete comparison metrics
│   ├── qualitative_analysis.txt    # Tokenization examples
│   ├── LAB_REPORT.md               # Detailed lab report
│   └── FINAL_SUMMARY.md            # Executive summary
│
├── LAB_INSTRUCTIONS.md             # Original lab assignment
├── RUNPOD_OPTIMIZATION.md          # GPU optimization guide
├── README.md                       # This file
├── run_training_background.sh      # Background training helper
├── stop_training.sh                # Stop background training
└── check_training_status.sh        # Check training status

Installation

Prerequisites

Python 3.12+
For local training: Mac with Apple Silicon (MPS) or CUDA-capable GPU
For cloud training: RunPod account with RTX 5090 (recommended)
Virtual environment manager (venv or conda)

Setup

Clone or navigate to the project directory:

cd "tokenization-efficiency-benchmark"

Create and activate virtual environment:

python3 -m venv .venv
source .venv/bin/activate  # On Mac/Linux

Install dependencies:

pip install torch torchvision torchaudio
pip install tokenizers sentencepiece numpy matplotlib tqdm

Required packages:

PyTorch 2.0+ (with CUDA support for GPU training)
tokenizers (Hugging Face)
sentencepiece
numpy
matplotlib
tqdm

Verify installation:

python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"

Complete Workflow

Quick Reference

# 1. Preprocess data with different tokenizers
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer gpt2
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer sentencepiece
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer whitespace

# 2. Train models (one per tokenizer)
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_whitespace

# 3. Evaluate models
python scripts/evaluate.py --checkpoint checkpoints/..._gpt2_transformer_best.pt --data test --dataset ...
python scripts/evaluate.py --checkpoint checkpoints/..._sentencepiece_transformer_best.pt --data test --dataset ...
python scripts/evaluate.py --checkpoint checkpoints/..._whitespace_transformer_best.pt --data test --dataset ...

# 4. Compare tokenizers
python scripts/compare_tokenizers_evaluation.py

# 5. Generate qualitative analysis
python scripts/qualitative_analysis.py

# 6. Visualize metrics
python scripts/visualize_metrics.py --metrics results/..._metrics.json

Detailed Usage Guide

Step 1: Preprocess Data

Preprocess your corpus with each tokenizer. The preprocessing script will:

Load raw text data
Split into train (85%), validation (10%), test (5%)
Train the tokenizer (or load pre-trained for GPT-2)
Tokenize all splits
Save processed data

Preprocess with GPT-2 tokenizer:

python scripts/preprocess_data.py \
  --input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
  --tokenizer gpt2

Preprocess with SentencePiece tokenizer:

python scripts/preprocess_data.py \
  --input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
  --tokenizer sentencepiece

Preprocess with Whitespace tokenizer:

python scripts/preprocess_data.py \
  --input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
  --tokenizer whitespace

Arguments:

--input: Path to raw text file (one document per line)
--tokenizer: Tokenizer type (gpt2, sentencepiece, or whitespace)

Output:

data/processed/
  ├── {dataset}_{tokenizer}_tokenizer.json/.model    # Trained tokenizer
  ├── {dataset}_{tokenizer}_train_ids.pt             # Training sequences
  ├── {dataset}_{tokenizer}_val_ids.pt               # Validation sequences
  ├── {dataset}_{tokenizer}_test_ids.pt              # Test sequences
  └── {dataset}_{tokenizer}_metadata.json            # Dataset statistics

Step 2: Train Models

Train a Transformer model for each tokenizer. The training script automatically:

Detects GPU and optimizes settings (batch size, workers, mixed precision)
Tracks metrics (loss, perplexity, learning rate, time)
Saves checkpoints periodically and when validation improves
Generates training plots after each epoch

Train with auto-detected settings:

python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2

Training on RunPod:

The code automatically detects RunPod environment and optimizes for RTX 5090:

Batch size: 192 (optimized for 32GB VRAM)
Mixed precision (AMP): Enabled
DataLoader workers: 8
Pin memory: Enabled
TF32: Enabled

Example output:

================================================================================
TRAINING TRANSFORMER LANGUAGE MODEL
================================================================================
Dataset: forum_forum_poradnikogrodniczy_pl_corpus_gpt2
Tokenizer: gpt2
Vocabulary size: 50257
Configuration: Transformer(layers=12, heads=8, emb=512, ff=2048)
Device: cuda
Training environment: cloud
GPU: NVIDIA GeForce RTX 5090
GPU Memory: 33.68 GB
Mixed precision (AMP): True
Gradient accumulation steps: 2
DataLoader workers: 8
Pin memory: True
TF32 enabled: True

Loading preprocessed data...
Train: 356596 | Val: 41952 | Test: 20977

Initializing model...
Model parameters: 48,079,632

================================================================================
STARTING TRAINING
================================================================================

Epoch 1/50
--------------------------------------------------------------------------------
Training: 100%|==========| 1856/1856 [04:12<00:00, 7.34it/s, loss=5.2341]
Validating...

Epoch 1 Summary:
  Train Loss: 5.2341 | Train PPL: 187.32
  Val Loss:   5.1234 | Val PPL:   167.45
  LR: 0.000960 | Time: 252.3s
  GPU Memory: 26.34GB allocated, 28.12GB reserved
  Metrics updated: results/..._metrics.json
  Best model saved: checkpoints/..._best.pt

...

Checkpoints saved:

checkpoints/{dataset}_{tokenizer}_transformer_best.pt - Best model (lowest validation loss)
checkpoints/{dataset}_{tokenizer}_transformer_epoch_N.pt - Periodic checkpoints

Resume training:

python scripts/train.py --dataset {dataset} --resume checkpoints/{checkpoint}.pt

Step 3: Evaluate Models

Evaluate trained models to calculate word-level and character-level perplexity.

Evaluate on test set:

python scripts/evaluate.py \
  --checkpoint checkpoints/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt \
  --data test \
  --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2

Example output:

================================================================================
EVALUATION RESULTS
================================================================================
Model: TRANSFORMER
Checkpoint: checkpoints/..._gpt2_transformer_best.pt
Data: test
Sequences: 20977
Total tokens: 663,015
--------------------------------------------------------------------------------
Loss: 3.4323
Perplexity: 30.95
Evaluation time: 178.16s
Throughput: 3721 tokens/s
================================================================================

Step 4: Compare Tokenizers

Run comprehensive comparison including word-level and character-level perplexity:

python scripts/compare_tokenizers_evaluation.py

This script calculates:

Word-level perplexity (perplexity per word, not per token)
Character-level perplexity (perplexity per character)
OOV statistics (for whitespace tokenizer)
Tokens per word ratio
Words encoded directly (single token)
Inference speed

Output:

================================================================================
TOKENIZER COMPARISON EVALUATION
================================================================================

Evaluating: GPT2 tokenizer
...
GPT2 Results:
  Word-level perplexity: 640.50
  Character-level perplexity: 3.32
  OOV: 0/0 (0.00%)
  Tokens per word: 3.43
  Words encoded directly: 24.34%
  Inference speed: 11694 tokens/sec

Evaluating: SENTENCEPIECE tokenizer
...
SENTENCEPIECE Results:
  Word-level perplexity: 480.62
  Character-level perplexity: 3.13
  OOV: 0/0 (0.00%)
  Tokens per word: 1.53
  Words encoded directly: 67.35%
  Inference speed: 5942 tokens/sec

Evaluating: WHITESPACE tokenizer
...
WHITESPACE Results:
  Word-level perplexity: 139.83
  Character-level perplexity: 2.49
  OOV: 2212/22469 (9.84%)
  Tokens per word: 1.20
  Words encoded directly: 74.26%
  Inference speed: 5351 tokens/sec

Results saved to: results/tokenizer_comparison_results.json

Step 5: Qualitative Analysis

Generate tokenization examples for sample texts:

python scripts/qualitative_analysis.py

This creates a detailed comparison showing how each tokenizer handles the same Polish text samples (at least 30 words each).

Output saved to: results/qualitative_analysis.txt

Step 6: Generate Text

Generate text completions using trained models:

python scripts/generate.py \
  --checkpoint checkpoints/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt \
  --prompts "Jak uprawiac pomidory w ogrodzie" "Najlepsze odmiany ogurkow" \
  --max-length 150 \
  --temperature 0.5 \
  --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2

Arguments:

--checkpoint: Path to model checkpoint
--prompts: List of text prompts (space-separated)
--dataset: Dataset name for loading correct tokenizer
--max-length: Maximum tokens to generate (default: 100)
--temperature: Sampling temperature (0.1-1.5, lower = more focused)
--top-k: Top-k sampling (default: 50)

Training Visualization & Plotting

The project includes automated visualization of training metrics. Plots are generated automatically during training and can be recreated from saved metrics.

Automatic Plot Generation

During training, plots are automatically created after each epoch:

Loss Plot - Training and validation loss over epochs
Perplexity Plot - Training and validation perplexity over epochs
Learning Rate Plot - Learning rate schedule (log scale)
Epoch Times Plot - Time per epoch with average line
Combined Metrics Plot - All metrics in 2x2 grid
Overfitting Analysis - Train vs validation loss with gap visualization
Summary Statistics - Final training summary

Plot location: results/plots/

Sample Training Plots

Below are example plots from the trained models:

GPT-2 Tokenizer Training

SentencePiece Tokenizer Training

Whitespace Tokenizer Training

Manual Visualization

Recreate plots from saved metrics:

# Visualize specific model
python scripts/visualize_metrics.py \
  --metrics results/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_metrics.json

# Compare all tokenizers
python scripts/visualize_metrics.py --compare

Understanding the Plots

Loss Plot:

Shows training and validation cross-entropy loss
Both curves should decrease over time
If validation stops decreasing while training continues, model is overfitting

Perplexity Plot:

Exponential of loss (more interpretable)
Lower values indicate better predictions
Validation should track training perplexity

Overfitting Analysis:

Shows generalization gap (difference between train and validation loss)
Orange shaded area indicates overfitting severity
Growing gap suggests need for regularization

Evaluation Metrics

Word-Level Perplexity

Word-level perplexity measures how well the model predicts at the word level (not token level). This is the primary metric for comparing tokenizers.

Formula:

Word-Level Perplexity = exp(Total Cross-Entropy Loss / Number of Words)

Why word-level and not token-level?

Token-level perplexity is NOT comparable across tokenizers because:

GPT-2 uses 3.43 tokens per word
SentencePiece uses 1.53 tokens per word
Whitespace uses 1.20 tokens per word

A tokenizer that splits words into more tokens will appear to have better token-level perplexity even if it's worse at predicting words.

Interpretation:

Lower is better
Perplexity of 100 means the model is as confused as choosing uniformly from 100 options
Typical range: 50-500 for small models, 10-30 for large state-of-the-art models

Character-Level Perplexity

Character-level perplexity measures prediction quality per character.

Formula:

Character-Level Perplexity = exp(Total Cross-Entropy Loss / Number of Characters)

Interpretation:

Lower is better
Normalizes by character count instead of word count
Useful for comparing how efficiently models encode information

Out-of-Vocabulary (OOV) Statistics

For the whitespace tokenizer, measures how many words are not in the vocabulary.

Metrics:

Number of OOV words
Total words
OOV percentage

Note: Subword tokenizers (GPT-2, SentencePiece) have 0% OOV by design.

Tokenization Efficiency

Tokens per Word:

Average number of tokens needed to encode one word
Lower is more efficient

Direct Encoding Percentage:

Percentage of words encoded as a single token (not split)
Higher means better vocabulary coverage

Training and Inference Performance

Training Time:

Total time to train the model (all epochs)
Average time per epoch

Inference Speed:

Tokens processed per second during evaluation
Higher is faster

Results Summary

Quantitative Results

Metric	Whitespace	SentencePiece	GPT-2
Word-Level Perplexity	139.83	480.62	640.50
Character-Level Perplexity	2.49	3.13	3.32
Token-Level Perplexity	109.57	154.58	30.95
Tokens per Word	1.20	1.53	3.43
Direct Encoding %	74.26%	67.35%	24.34%
OOV Rate	9.84%	0.00%	0.00%
Training Time	187.7 min	419.3 min	252.5 min
Inference Speed	5,351 tok/s	5,942 tok/s	11,694 tok/s
Vocab Size	10,000	10,000	50,257

Note: Token-level perplexity is shown but is NOT comparable across tokenizers.

Key Findings

1. Whitespace tokenizer achieves best word-level and character-level perplexity

Word-level PPL: 139.83 (4.6x better than GPT-2)
Character-level PPL: 2.49 (1.3x better than GPT-2)
Despite 9.84% OOV rate, still outperforms subword methods

2. GPT-2 has best token-level perplexity but worst word-level perplexity

Token-level: 30.95 (best)
Word-level: 640.50 (worst)
This demonstrates why token-level perplexity is misleading for comparison

3. Tokenization efficiency matters

Whitespace: 1.20 tokens/word
SentencePiece: 1.53 tokens/word
GPT-2: 3.43 tokens/word
Fewer tokens per word leads to better word-level metrics

4. Vocabulary size vs. efficiency trade-off

GPT-2's large vocabulary (50K) doesn't compensate for inefficient tokenization
Whitespace's small vocabulary (10K) is sufficient when tokenization is efficient

Qualitative Analysis Examples

Sample Text 1: "Uprawa pomidorow w ogrodzie wymaga odpowiedniego przygotowania gleby, regularnego podlewania oraz stosowania nawozow organicznych..."

Tokenizer	Tokens	Tokens/Word	Direct %
GPT-2	113	3.90	10.3%
SentencePiece	44	1.52	58.6%
Whitespace	33	1.14	82.8%

Observation: Whitespace creates the most compact representation, while GPT-2 splits Polish words into many English-centric subwords.

Discussion

Why Whitespace Wins on Word-Level Metrics

The whitespace tokenizer achieves the best word-level and character-level perplexity despite having:

A smaller vocabulary than GPT-2 (10K vs 50K)
An OOV rate of 9.84%
"Worse" token-level perplexity (109.57 vs 30.95)

Explanation:

Efficient tokenization reduces compounding errors
- Whitespace: 1.20 tokens/word means fewer prediction steps per word
- GPT-2: 3.43 tokens/word means errors compound across multiple predictions
- When normalized by words, efficiency matters more than individual token accuracy
Vocabulary optimized for target language
- Whitespace vocabulary is entirely Polish words
- GPT-2 vocabulary is English-centric
- Language-specific optimization is more valuable than vocabulary size
Direct encoding reduces information loss
- 74% of words are single tokens in whitespace
- Only 24% of words are single tokens in GPT-2
- Less splitting means less semantic fragmentation

Trade-offs of Each Tokenizer

Whitespace:

Best: Word-level and character-level perplexity
Best: Tokenization efficiency (1.20 tokens/word)
Best: Training speed (187.7 min)
Worst: OOV handling (9.84% unknown words)
Limited: Cannot handle unknown words gracefully

SentencePiece:

Balanced: Middle ground on all metrics
Good: No OOV issues (subword tokenization)
Good: Language-specific training
Good: Reasonable efficiency (1.53 tokens/word)
Limitation: Smaller vocabulary limits expressiveness

GPT-2:

Best: Token-level perplexity (misleading metric)
Best: Inference speed (11,694 tok/s)
Best: No OOV issues
Worst: Word-level perplexity (inefficient for Polish)
Worst: Tokenization efficiency (3.43 tokens/word)
Benefit: Pre-trained, no training required

Implications for Production

For Polish language modeling:

Whitespace tokenizer if OOV rate is acceptable and corpus is representative
SentencePiece for robust handling of unknown words
GPT-2 for transfer learning or multilingual applications

For other languages:

Results may vary based on language morphology
Morphologically rich languages benefit from efficient tokenization
English-like languages may favor GPT-2's vocabulary

Lab Assignment Insights

This lab demonstrates why the assignment instructions explicitly stated:

"Do not report token-level perplexity, as it is not comparable across tokenizers."

Token-level metrics can be highly misleading. GPT-2 appears to have the "best" model (lowest token perplexity of 30.95) but actually has the worst performance when properly evaluated (highest word perplexity of 640.50).

References

Academic Papers

Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. https://arxiv.org/abs/1508.07909
Kudo, Taku, and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." EMNLP 2018. https://arxiv.org/abs/1808.06226
Vaswani, Ashish, et al. "Attention Is All You Need." NIPS 2017. https://arxiv.org/abs/1706.03762

Resources

Mielke, Sabrina J. "Comparing perplexities is hard." https://sjmielke.com/comparing-perplexities.htm
Hugging Face Tokenizers Documentation: https://huggingface.co/docs/tokenizers/
SentencePiece GitHub: https://github.com/google/sentencepiece

Quick Commands Reference

# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install torch tokenizers sentencepiece numpy matplotlib tqdm

# Preprocess with all tokenizers
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer gpt2
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer sentencepiece
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer whitespace

# Train all models
python scripts/train.py --dataset {dataset}_gpt2
python scripts/train.py --dataset {dataset}_sentencepiece
python scripts/train.py --dataset {dataset}_whitespace

# Evaluate all models
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_gpt2_transformer_best.pt --data test --dataset {dataset}_gpt2
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_sentencepiece_transformer_best.pt --data test --dataset {dataset}_sentencepiece
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_whitespace_transformer_best.pt --data test --dataset {dataset}_whitespace

# Complete comparison
python scripts/compare_tokenizers_evaluation.py

# Qualitative analysis
python scripts/qualitative_analysis.py

# Generate text
python scripts/generate.py --checkpoint checkpoints/{checkpoint}.pt --prompts "Your prompt" --dataset {dataset}

# Visualize metrics
python scripts/visualize_metrics.py --metrics results/{model}_metrics.json
python scripts/visualize_metrics.py --compare

# GPU configuration check (for RunPod)
python scripts/check_gpu_config.py

# Background training (for RunPod)
./run_training_background.sh {dataset}
./check_training_status.sh
./stop_training.sh

For detailed results, see:

results/tokenizer_comparison_results.json - Complete metrics
results/qualitative_analysis.txt - Tokenization examples

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
models		models
results		results
scripts		scripts
utils		utils
.gitignore		.gitignore
.python-version		.python-version
1508.07909v5.pdf		1508.07909v5.pdf
LICENSE		LICENSE
README.md		README.md
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

License

mikhailbahdashych/tokenization-efficiency-benchmark

Folders and files

Latest commit

History

Repository files navigation

Tokenization Efficiency Benchmark - Complete Documentation

Table of Contents

Project Overview

Key Features

Assignment Context

Tokenization Strategies

1. GPT-2 BPE Tokenizer (Pre-trained)

2. Whitespace Tokenizer (Custom)

3. SentencePiece Tokenizer (Custom-trained)

Model Architecture

Transformer Architecture

Causal Mask

Model Parameters

Project Structure

Installation

Prerequisites

Setup

Complete Workflow

Quick Reference

Detailed Usage Guide

Step 1: Preprocess Data

Step 2: Train Models

Step 3: Evaluate Models

Step 4: Compare Tokenizers

Step 5: Qualitative Analysis

Step 6: Generate Text

Training Visualization & Plotting

Automatic Plot Generation

Sample Training Plots

GPT-2 Tokenizer Training

SentencePiece Tokenizer Training

Whitespace Tokenizer Training

Manual Visualization

Understanding the Plots

Evaluation Metrics

Word-Level Perplexity

Character-Level Perplexity

Out-of-Vocabulary (OOV) Statistics

Tokenization Efficiency

Training and Inference Performance

Results Summary

Quantitative Results

Key Findings

Qualitative Analysis Examples

Discussion

Why Whitespace Wins on Word-Level Metrics

Trade-offs of Each Tokenizer

Implications for Production

Lab Assignment Insights

References

Academic Papers

Resources

Quick Commands Reference

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages