Tokenization Efficiency Benchmark - Complete Documentation

Table of Contents

  1. Project Overview
  2. Tokenization Strategies
  3. Model Architecture
  4. Project Structure
  5. Installation
  6. Complete Workflow
  7. Detailed Usage Guide
  8. Training Visualization & Plotting
  9. Evaluation Metrics
  10. Results Summary
  11. Discussion
  12. References

Project Overview

This project implements and compares three different tokenization strategies for Polish language modeling using a Transformer-based neural language model. The primary goal is to analyze how different tokenization approaches affect model performance, efficiency, and text representation quality.

Key Features

  • Three Tokenization Strategies: GPT-2 BPE (pre-trained), Whitespace (custom), and SentencePiece (custom-trained)
  • Identical Model Architecture: All tokenizers use the same 50M parameter Transformer to ensure fair comparison
  • Comprehensive Evaluation: Word-level and character-level perplexity (not token-level, as per assignment requirements)
  • OOV Analysis: Out-of-vocabulary statistics for whitespace tokenizer
  • Efficiency Metrics: Tokens per word, training time, inference speed
  • Qualitative Analysis: Tokenization comparison on sample texts
  • GPU Optimization: Optimized for NVIDIA RTX 5090 with mixed precision training
  • Automated Visualization: Training plots, metrics tracking, and comparison tools

Assignment Context

This implementation fulfills the requirements of the Computational Linguistics lab assignment on tokenization efficiency:

  • Implement three tokenizers: pre-trained (GPT-2), custom whitespace-based, and SentencePiece
  • Train three identical models (same architecture, hyperparameters, data) - one per tokenizer
  • Evaluate using word-level and character-level perplexity (NOT token-level)
  • Report OOV statistics for whitespace tokenizer
  • Analyze efficiency metrics: tokens per word, training time, inference speed
  • Provide qualitative analysis with tokenization examples

Tokenization Strategies

1. GPT-2 BPE Tokenizer (Pre-trained)

The GPT-2 tokenizer uses Byte-Pair Encoding (BPE) trained on a large English corpus.

Characteristics:

  • Pre-trained on English text
  • Vocabulary size: 50,257 tokens
  • Subword tokenization eliminates OOV issues
  • Inefficient for Polish (averages 3.43 tokens per word)
  • No language-specific optimization

Implementation:

from utils.gpt2_tokenizer import GPT2Tokenizer
tokenizer = GPT2Tokenizer(vocab_size=50257)
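
To see this fragmentation in practice, the snippet below encodes a short Polish phrase with the pre-trained GPT-2 BPE model. It is a minimal sketch that loads the tokenizer through the Hugging Face tokenizers package directly (an assumption; the repo's own wrapper lives in utils/gpt2_tokenizer.py) and requires access to the Hugging Face hub:

# Illustration only: uses the Hugging Face `tokenizers` package rather than the repo wrapper.
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")                   # fetch the pre-trained BPE model
enc = tok.encode("Uprawa pomidorow wymaga regularnego podlewania")
print(enc.tokens)                                         # Polish words split into several subwords
print(len(enc.tokens) / 5)                                # rough tokens-per-word for this 5-word phrase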

2. Whitespace Tokenizer (Custom)

A simple tokenizer that splits on whitespace and treats punctuation as separate tokens.

Characteristics:

  • Splits text on whitespace and punctuation
  • Fixed vocabulary size: 10,000 most frequent words
  • Uses <UNK> token for out-of-vocabulary words
  • Most efficient encoding (1.20 tokens per word)
  • Suffers from OOV issues (9.84% OOV rate on test data)

Algorithm:

  1. Split text on whitespace
  2. Separate punctuation marks as individual tokens
  3. Build vocabulary from top N most frequent tokens in training data
  4. Replace unknown tokens with <UNK> during encoding

Implementation:

from utils.whitespace_tokenizer import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer(vocab_size=10000)
tokenizer.train(texts)
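
For reference, the snippet below is a minimal sketch of the four-step algorithm above. It is not the repo's utils/whitespace_tokenizer.py implementation (which may differ in punctuation handling and special tokens), but it illustrates the core idea:

import re
from collections import Counter

class SimpleWhitespaceTokenizer:
    # Minimal sketch of the algorithm described above (illustrative, not the repo code).

    def __init__(self, vocab_size=10_000):
        self.vocab_size = vocab_size
        self.token_to_id = {}

    def _split(self, text):
        # Steps 1-2: split on whitespace, keep punctuation marks as separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)

    def train(self, texts):
        # Step 3: build the vocabulary from the top-N most frequent tokens, reserving <UNK>.
        counts = Counter(tok for text in texts for tok in self._split(text))
        vocab = ["<UNK>"] + [tok for tok, _ in counts.most_common(self.vocab_size - 1)]
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text):
        # Step 4: unknown tokens map to <UNK>.
        unk_id = self.token_to_id["<UNK>"]
        return [self.token_to_id.get(tok, unk_id) for tok in self._split(text)]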

3. SentencePiece Tokenizer (Custom-trained)

A subword tokenizer trained on the target Polish corpus using the SentencePiece library.

Characteristics:

  • Trained on Polish gardening forum corpus
  • Vocabulary size: 10,000 tokens
  • BPE algorithm for subword segmentation
  • Language-specific optimization
  • Balanced efficiency (1.53 tokens per word)
  • No OOV issues due to subword tokenization

Training:

from utils.sentencepiece_tokenizer import SentencePieceTokenizer
tokenizer = SentencePieceTokenizer(vocab_size=10000, model_type="bpe")
tokenizer.train(texts)
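
Under the hood this corresponds to a plain SentencePiece training run. The sketch below uses the sentencepiece package directly; the model_prefix and output paths are illustrative assumptions, and the repo's utils/sentencepiece_tokenizer.py wrapper is the authoritative version:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt",  # one document per line
    model_prefix="polish_bpe",        # writes polish_bpe.model and polish_bpe.vocab
    vocab_size=10_000,
    model_type="bpe",
    character_coverage=1.0,           # keep all Polish characters in the vocabulary
)

sp = spm.SentencePieceProcessor(model_file="polish_bpe.model")
print(sp.encode("Uprawa pomidorow wymaga podlewania", out_type=str))  # subword pieces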

Model Architecture

All three tokenizers use an identical Transformer language model architecture to ensure fair comparison. The model is based on "Attention Is All You Need" (Vaswani et al., 2017) and adapted for causal language modeling.

Transformer Architecture

INPUT SEQUENCE (token IDs)
    |
    v
+-------------------------------------------------------------+
|  TOKEN EMBEDDING LAYER                                      |
|  Maps token IDs to dense vectors                            |
|  Input:  (batch_size, seq_len)                             |
|  Output: (batch_size, seq_len, d_model=512)               |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  SCALE EMBEDDINGS                                           |
|  Multiply by sqrt(d_model) for better gradient flow        |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  POSITIONAL ENCODING                                        |
|  Add sinusoidal position information                        |
|  PE(pos,2i)   = sin(pos / 10000^(2i/d_model))             |
|  PE(pos,2i+1) = cos(pos / 10000^(2i/d_model))             |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  DROPOUT (p=0.3)                                            |
+-------------------------------------------------------------+
    |
    | (Repeat 12 times - 12 Transformer Decoder Layers)
    |
    v
+-------------------------------------------------------------+
|  TRANSFORMER DECODER LAYER                                  |
|                                                             |
|  +-------------------------------------------------------+  |
|  |  MULTI-HEAD SELF-ATTENTION (8 heads)                  |  |
|  |  +---------+ +---------+ +---------+ +---------+      |  |
|  |  | Head 1  | | Head 2  | | Head 3  | |  ...    |      |  |
|  |  | Q K V   | | Q K V   | | Q K V   | | Head 8  |      |  |
|  |  +---------+ +---------+ +---------+ +---------+      |  |
|  |  With Causal Mask (prevents attending to future)      |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|                         v                                   |
|  +-------------------------------------------------------+  |
|  |  ADD & LAYER NORM                                     |  |
|  |  LayerNorm(x + MultiHeadAttention(x))                |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|                         v                                   |
|  +-------------------------------------------------------+  |
|  |  FEED-FORWARD NETWORK                                 |  |
|  |  FFN(x) = ReLU(xW1 + b1)W2 + b2                      |  |
|  |  Dimensions: 512 -> 2048 -> 512                       |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|                         v                                   |
|  +-------------------------------------------------------+  |
|  |  ADD & LAYER NORM                                     |  |
|  |  LayerNorm(x + FFN(x))                               |  |
|  +-------------------------------------------------------+  |
+-------------------------------------------------------------+
    |
    v
+-------------------------------------------------------------+
|  LINEAR PROJECTION TO VOCABULARY                            |
|  Input:  (batch_size, seq_len, d_model=512)              |
|  Output: (batch_size, seq_len, vocab_size)               |
+-------------------------------------------------------------+
    |
    v
LOGITS (predictions for next token)
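
The sinusoidal positional encoding shown in the diagram can be computed as in the PyTorch sketch below; the repo's actual implementation lives in models/transformer_model.py and may differ in detail:

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe   # added to the scaled token embeddings: emb * sqrt(d_model) + pe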

Causal Mask

The causal mask ensures the model can only attend to previous positions, preventing information leakage from future tokens:

For sequence length 5:

Attention Mask (1 = can attend, 0 = cannot attend):
       t1  t2  t3  t4  t5
    +--------------------+
t1  | 1   0   0   0   0 |  Token 1 can only see itself
t2  | 1   1   0   0   0 |  Token 2 can see t1, t2
t3  | 1   1   1   0   0 |  Token 3 can see t1, t2, t3
t4  | 1   1   1   1   0 |  Token 4 can see t1-t4
t5  | 1   1   1   1   1 |  Token 5 can see all previous
    +--------------------+
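
In code, this mask is simply a lower-triangular matrix. A minimal PyTorch sketch follows; depending on the attention implementation, the repo may instead express the mask as additive -inf values in the blocked positions:

import torch

def causal_mask(seq_len):
    # True (1) where attention is allowed, i.e. the lower triangle of the table above.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(5).int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)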

Model Parameters

  • Model dimension (d_model): 512
  • Number of attention heads: 8
  • Head dimension (d_k): 64 (512/8)
  • Number of layers: 12
  • Feed-forward dimension: 2048
  • Dropout: 0.3
  • Maximum sequence length: 256
  • Total parameters: Approximately 50M

Note: The vocabulary size varies by tokenizer:

  • GPT-2: 50,257 tokens
  • Whitespace: 10,000 tokens
  • SentencePiece: 10,000 tokens
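
For reference, the shared hyperparameters can be summarized as a single configuration object. The dataclass below is a hypothetical illustration only; the actual configuration classes live in utils/config.py and may use different names:

from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Hypothetical summary of the shared hyperparameters (see utils/config.py for the real classes).
    d_model: int = 512
    n_heads: int = 8          # head dimension = d_model // n_heads = 64
    n_layers: int = 12
    d_ff: int = 2048
    dropout: float = 0.3
    max_seq_len: int = 256
    vocab_size: int = 10_000  # 50,257 for GPT-2; 10,000 for Whitespace and SentencePiece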

Project Structure

.
├── data/
│   ├── raw/                                    # Raw input data
│   │   └── forum_forum_poradnikogrodniczy_pl_corpus.txt
│   ├── processed/                              # Preprocessed data
│   │   ├── forum_forum_poradnikogrodniczy_pl_corpus_gpt2_*.pt
│   │   ├── forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece_*.pt
│   │   ├── forum_forum_poradnikogrodniczy_pl_corpus_whitespace_*.pt
│   │   └── *_metadata.json                    # Dataset statistics
│   └── wikipedia/                              # Out-of-domain data (optional)
│
├── models/                         # Model implementations
│   ├── __init__.py
│   └── transformer_model.py       # Transformer language model
│
├── utils/                          # Utility modules
│   ├── __init__.py
│   ├── config.py                  # Configuration classes with GPU optimization
│   ├── tokenizer_factory.py      # Tokenizer factory for all three types
│   ├── gpt2_tokenizer.py          # GPT-2 BPE tokenizer
│   ├── whitespace_tokenizer.py   # Custom whitespace tokenizer
│   ├── sentencepiece_tokenizer.py # SentencePiece tokenizer
│   ├── dataset.py                 # PyTorch datasets and dataloaders
│   ├── metrics.py                 # Evaluation metrics
│   └── plotting.py                # Training visualization
│
├── scripts/                        # Executable scripts
│   ├── __init__.py
│   ├── preprocess_data.py         # Data preprocessing (supports all tokenizers)
│   ├── train.py                   # Model training with GPU optimization
│   ├── evaluate.py                # Model evaluation
│   ├── generate.py                # Text generation
│   ├── visualize_metrics.py       # Standalone visualization
│   ├── compare_tokenizers_evaluation.py  # Complete tokenizer comparison
│   ├── qualitative_analysis.py    # Qualitative tokenization analysis
│   ├── generate_final_report.py   # Report generation
│   ├── check_gpu_config.py        # GPU configuration checker
│   └── benchmark_gpu.py           # GPU performance benchmark
│
├── checkpoints/                    # Saved model checkpoints
│   ├── forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt
│   ├── forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece_sentencepiece_transformer_best.pt
│   └── forum_forum_poradnikogrodniczy_pl_corpus_whitespace_whitespace_transformer_best.pt
│
├── results/                        # Evaluation results and visualizations
│   ├── plots/                      # Training visualization plots
│   │   ├── *_loss_epoch_N.png
│   │   ├── *_perplexity_epoch_N.png
│   │   ├── *_combined_metrics_epoch_N.png
│   │   └── *_overfitting_analysis_epoch_N.png
│   ├── *_metrics.json              # Training metrics
│   ├── tokenizer_comparison_results.json  # Complete comparison metrics
│   ├── qualitative_analysis.txt    # Tokenization examples
│   ├── LAB_REPORT.md               # Detailed lab report
│   └── FINAL_SUMMARY.md            # Executive summary
│
├── LAB_INSTRUCTIONS.md             # Original lab assignment
├── RUNPOD_OPTIMIZATION.md          # GPU optimization guide
├── README.md                       # This file
├── run_training_background.sh      # Background training helper
├── stop_training.sh                # Stop background training
└── check_training_status.sh        # Check training status

Installation

Prerequisites

  • Python 3.12+
  • For local training: Mac with Apple Silicon (MPS) or CUDA-capable GPU
  • For cloud training: RunPod account with RTX 5090 (recommended)
  • Virtual environment manager (venv or conda)

Setup

  1. Clone or navigate to the project directory:
cd "tokenization-efficiency-benchmark"
  2. Create and activate virtual environment:
python3 -m venv .venv
source .venv/bin/activate  # On Mac/Linux
  3. Install dependencies:
pip install torch torchvision torchaudio
pip install tokenizers sentencepiece numpy matplotlib tqdm

Required packages:

  • PyTorch 2.0+ (with CUDA support for GPU training)
  • tokenizers (Hugging Face)
  • sentencepiece
  • numpy
  • matplotlib
  • tqdm
  4. Verify installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"

Complete Workflow

Quick Reference

# 1. Preprocess data with different tokenizers
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer gpt2
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer sentencepiece
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer whitespace

# 2. Train models (one per tokenizer)
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_whitespace

# 3. Evaluate models
python scripts/evaluate.py --checkpoint checkpoints/..._gpt2_transformer_best.pt --data test --dataset ...
python scripts/evaluate.py --checkpoint checkpoints/..._sentencepiece_transformer_best.pt --data test --dataset ...
python scripts/evaluate.py --checkpoint checkpoints/..._whitespace_transformer_best.pt --data test --dataset ...

# 4. Compare tokenizers
python scripts/compare_tokenizers_evaluation.py

# 5. Generate qualitative analysis
python scripts/qualitative_analysis.py

# 6. Visualize metrics
python scripts/visualize_metrics.py --metrics results/..._metrics.json

Detailed Usage Guide

Step 1: Preprocess Data

Preprocess your corpus with each tokenizer. The preprocessing script will:

  1. Load raw text data
  2. Split into train (85%), validation (10%), and test (5%) sets (see the sketch after this list)
  3. Train the tokenizer (or load pre-trained for GPT-2)
  4. Tokenize all splits
  5. Save processed data
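
The split in step 2 amounts to a deterministic 85/10/5 partition of the documents. The function below is only a sketch of that step; the authoritative logic is in scripts/preprocess_data.py, and the seed and exact ordering here are assumptions:

import random

def split_corpus(documents, seed=42, train_frac=0.85, val_frac=0.10):
    # Shuffle deterministically, then cut into train (85%), validation (10%), test (5%).
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * train_frac)
    n_val = int(len(docs) * val_frac)
    return docs[:n_train], docs[n_train:n_train + n_val], docs[n_train + n_val:]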

Preprocess with GPT-2 tokenizer:

python scripts/preprocess_data.py \
  --input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
  --tokenizer gpt2

Preprocess with SentencePiece tokenizer:

python scripts/preprocess_data.py \
  --input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
  --tokenizer sentencepiece

Preprocess with Whitespace tokenizer:

python scripts/preprocess_data.py \
  --input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
  --tokenizer whitespace

Arguments:

  • --input: Path to raw text file (one document per line)
  • --tokenizer: Tokenizer type (gpt2, sentencepiece, or whitespace)

Output:

data/processed/
  ├── {dataset}_{tokenizer}_tokenizer.json/.model    # Trained tokenizer
  ├── {dataset}_{tokenizer}_train_ids.pt             # Training sequences
  ├── {dataset}_{tokenizer}_val_ids.pt               # Validation sequences
  ├── {dataset}_{tokenizer}_test_ids.pt              # Test sequences
  └── {dataset}_{tokenizer}_metadata.json            # Dataset statistics

Step 2: Train Models

Train a Transformer model for each tokenizer. The training script automatically:

  • Detects GPU and optimizes settings (batch size, workers, mixed precision)
  • Tracks metrics (loss, perplexity, learning rate, time)
  • Saves checkpoints periodically and when validation improves
  • Generates training plots after each epoch

Train with auto-detected settings:

python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2

Training on RunPod:

The code automatically detects the RunPod environment and optimizes for the RTX 5090 (a minimal mixed-precision sketch follows the list below):

  • Batch size: 192 (optimized for 32GB VRAM)
  • Mixed precision (AMP): Enabled
  • DataLoader workers: 8
  • Pin memory: Enabled
  • TF32: Enabled
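
The sketch below shows roughly how mixed precision, TF32, and gradient accumulation fit together in a PyTorch training step. It is illustrative only; the real loop and the exact flags live in scripts/train.py:

import torch

torch.backends.cuda.matmul.allow_tf32 = True    # allow TF32 matmuls on recent NVIDIA GPUs
torch.backends.cudnn.allow_tf32 = True

scaler = torch.cuda.amp.GradScaler()            # loss scaling for mixed-precision training

def train_step(model, inputs, targets, optimizer, criterion, step, accumulation_steps=2):
    with torch.cuda.amp.autocast():             # forward pass in mixed precision
        logits = model(inputs)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    scaler.scale(loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:    # update weights every `accumulation_steps` batches
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
    return loss.item()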

Example output:

================================================================================
TRAINING TRANSFORMER LANGUAGE MODEL
================================================================================
Dataset: forum_forum_poradnikogrodniczy_pl_corpus_gpt2
Tokenizer: gpt2
Vocabulary size: 50257
Configuration: Transformer(layers=12, heads=8, emb=512, ff=2048)
Device: cuda
Training environment: cloud
GPU: NVIDIA GeForce RTX 5090
GPU Memory: 33.68 GB
Mixed precision (AMP): True
Gradient accumulation steps: 2
DataLoader workers: 8
Pin memory: True
TF32 enabled: True

Loading preprocessed data...
Train: 356596 | Val: 41952 | Test: 20977

Initializing model...
Model parameters: 48,079,632

================================================================================
STARTING TRAINING
================================================================================

Epoch 1/50
--------------------------------------------------------------------------------
Training: 100%|==========| 1856/1856 [04:12<00:00, 7.34it/s, loss=5.2341]
Validating...

Epoch 1 Summary:
  Train Loss: 5.2341 | Train PPL: 187.32
  Val Loss:   5.1234 | Val PPL:   167.45
  LR: 0.000960 | Time: 252.3s
  GPU Memory: 26.34GB allocated, 28.12GB reserved
  Metrics updated: results/..._metrics.json
  Best model saved: checkpoints/..._best.pt

...

Checkpoints saved:

  • checkpoints/{dataset}_{tokenizer}_transformer_best.pt - Best model (lowest validation loss)
  • checkpoints/{dataset}_{tokenizer}_transformer_epoch_N.pt - Periodic checkpoints

Resume training:

python scripts/train.py --dataset {dataset} --resume checkpoints/{checkpoint}.pt

Step 3: Evaluate Models

Evaluate trained models to calculate word-level and character-level perplexity.

Evaluate on test set:

python scripts/evaluate.py \
  --checkpoint checkpoints/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt \
  --data test \
  --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2

Example output:

================================================================================
EVALUATION RESULTS
================================================================================
Model: TRANSFORMER
Checkpoint: checkpoints/..._gpt2_transformer_best.pt
Data: test
Sequences: 20977
Total tokens: 663,015
--------------------------------------------------------------------------------
Loss: 3.4323
Perplexity: 30.95
Evaluation time: 178.16s
Throughput: 3721 tokens/s
================================================================================

Step 4: Compare Tokenizers

Run comprehensive comparison including word-level and character-level perplexity:

python scripts/compare_tokenizers_evaluation.py

This script calculates:

  • Word-level perplexity (perplexity per word, not per token)
  • Character-level perplexity (perplexity per character)
  • OOV statistics (for whitespace tokenizer)
  • Tokens per word ratio
  • Words encoded directly (single token)
  • Inference speed

Output:

================================================================================
TOKENIZER COMPARISON EVALUATION
================================================================================

Evaluating: GPT2 tokenizer
...
GPT2 Results:
  Word-level perplexity: 640.50
  Character-level perplexity: 3.32
  OOV: 0/0 (0.00%)
  Tokens per word: 3.43
  Words encoded directly: 24.34%
  Inference speed: 11694 tokens/sec

Evaluating: SENTENCEPIECE tokenizer
...
SENTENCEPIECE Results:
  Word-level perplexity: 480.62
  Character-level perplexity: 3.13
  OOV: 0/0 (0.00%)
  Tokens per word: 1.53
  Words encoded directly: 67.35%
  Inference speed: 5942 tokens/sec

Evaluating: WHITESPACE tokenizer
...
WHITESPACE Results:
  Word-level perplexity: 139.83
  Character-level perplexity: 2.49
  OOV: 2212/22469 (9.84%)
  Tokens per word: 1.20
  Words encoded directly: 74.26%
  Inference speed: 5351 tokens/sec

Results saved to: results/tokenizer_comparison_results.json

Step 5: Qualitative Analysis

Generate tokenization examples for sample texts:

python scripts/qualitative_analysis.py

This creates a detailed comparison showing how each tokenizer handles the same Polish text samples (at least 30 words each).

Output saved to: results/qualitative_analysis.txt

Step 6: Generate Text

Generate text completions using trained models:

python scripts/generate.py \
  --checkpoint checkpoints/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt \
  --prompts "Jak uprawiac pomidory w ogrodzie" "Najlepsze odmiany ogurkow" \
  --max-length 150 \
  --temperature 0.5 \
  --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2

Arguments:

  • --checkpoint: Path to model checkpoint
  • --prompts: List of text prompts (space-separated)
  • --dataset: Dataset name for loading correct tokenizer
  • --max-length: Maximum tokens to generate (default: 100)
  • --temperature: Sampling temperature (0.1-1.5; lower values give more focused output)
  • --top-k: Top-k sampling (default: 50); see the sketch below for how temperature and top-k interact
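
As a rough illustration of how these two arguments interact, the sketch below picks the next token from the model's output logits. It is illustrative only; the repo's generation code in scripts/generate.py may differ:

import torch

def sample_next_token(logits, temperature=0.5, top_k=50):
    # logits: 1-D tensor of scores over the vocabulary for the next position.
    logits = logits / temperature                        # <1.0 sharpens, >1.0 flattens the distribution
    top_values, top_indices = torch.topk(logits, k=top_k)
    probs = torch.softmax(top_values, dim=-1)            # renormalize over the k best candidates
    choice = torch.multinomial(probs, num_samples=1)     # sample one candidate
    return top_indices[choice].item()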

Training Visualization & Plotting

The project includes automated visualization of training metrics. Plots are generated automatically during training and can be recreated from saved metrics.

Automatic Plot Generation

During training, plots are automatically created after each epoch:

  1. Loss Plot - Training and validation loss over epochs
  2. Perplexity Plot - Training and validation perplexity over epochs
  3. Learning Rate Plot - Learning rate schedule (log scale)
  4. Epoch Times Plot - Time per epoch with average line
  5. Combined Metrics Plot - All metrics in 2x2 grid
  6. Overfitting Analysis - Train vs validation loss with gap visualization
  7. Summary Statistics - Final training summary

Plot location: results/plots/

Sample Training Plots

Below are example plots from the trained models (images are stored in results/plots/):

GPT-2 Tokenizer Training

  • GPT-2 Loss
  • GPT-2 Perplexity
  • GPT-2 Combined Metrics

SentencePiece Tokenizer Training

  • SentencePiece Loss
  • SentencePiece Perplexity

Whitespace Tokenizer Training

  • Whitespace Loss
  • Whitespace Perplexity

Manual Visualization

Recreate plots from saved metrics:

# Visualize specific model
python scripts/visualize_metrics.py \
  --metrics results/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_metrics.json

# Compare all tokenizers
python scripts/visualize_metrics.py --compare

Understanding the Plots

Loss Plot:

  • Shows training and validation cross-entropy loss
  • Both curves should decrease over time
  • If validation loss stops decreasing while training loss keeps falling, the model is overfitting

Perplexity Plot:

  • Exponential of loss (more interpretable)
  • Lower values indicate better predictions
  • Validation should track training perplexity

Overfitting Analysis:

  • Shows generalization gap (difference between train and validation loss)
  • Orange shaded area indicates overfitting severity
  • Growing gap suggests need for regularization

Evaluation Metrics

Word-Level Perplexity

Word-level perplexity measures how well the model predicts at the word level (not token level). This is the primary metric for comparing tokenizers.

Formula:

Word-Level Perplexity = exp(Total Cross-Entropy Loss / Number of Words)

Why word-level and not token-level?

Token-level perplexity is NOT comparable across tokenizers because:

  • GPT-2 uses 3.43 tokens per word
  • SentencePiece uses 1.53 tokens per word
  • Whitespace uses 1.20 tokens per word

A tokenizer that splits words into more tokens will appear to have better token-level perplexity even if it's worse at predicting words.

Interpretation:

  • Lower is better
  • A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 options
  • Typical range: 50-500 for small models, 10-30 for large state-of-the-art models

Character-Level Perplexity

Character-level perplexity measures prediction quality per character.

Formula:

Character-Level Perplexity = exp(Total Cross-Entropy Loss / Number of Characters)

Interpretation:

  • Lower is better
  • Normalizes by character count instead of word count
  • Useful for comparing how efficiently models encode information
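
Both metrics reuse the same accumulated cross-entropy; only the normalizer changes. A minimal sketch follows; the exact accounting in scripts/compare_tokenizers_evaluation.py may differ (e.g. in how whitespace and special tokens are counted):

import math

def renormalized_perplexities(total_loss_nats, n_tokens, n_words, n_chars):
    # total_loss_nats: summed cross-entropy (in nats) over every predicted token in the eval set.
    token_ppl = math.exp(total_loss_nats / n_tokens)   # NOT comparable across tokenizers
    word_ppl  = math.exp(total_loss_nats / n_words)    # primary comparison metric
    char_ppl  = math.exp(total_loss_nats / n_chars)
    return token_ppl, word_ppl, char_ppl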

Out-of-Vocabulary (OOV) Statistics

For the whitespace tokenizer, measures how many words are not in the vocabulary.

Metrics:

  • Number of OOV words
  • Total words
  • OOV percentage

Note: Subword tokenizers (GPT-2, SentencePiece) have 0% OOV by design.
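
Computing the OOV rate is a straightforward count over the test-set words; a minimal sketch (the repo computes this inside its evaluation scripts):

def oov_statistics(words, vocabulary):
    # vocabulary: set of words known to the whitespace tokenizer.
    oov = sum(1 for w in words if w not in vocabulary)
    return oov, len(words), 100.0 * oov / len(words)   # e.g. 2212, 22469, 9.84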

Tokenization Efficiency

Tokens per Word:

  • Average number of tokens needed to encode one word
  • Lower is more efficient

Direct Encoding Percentage:

  • Percentage of words encoded as a single token (not split)
  • Higher means better vocabulary coverage
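
Both efficiency metrics can be estimated by encoding each word and counting tokens, as in the sketch below. This is illustrative only; the repo's metric code lives in utils/metrics.py, and encoding words in isolation can differ slightly from encoding them in running text:

def tokenization_efficiency(words, encode):
    # encode: callable that returns the list of token ids for a single word.
    total_tokens = 0
    direct = 0
    for word in words:
        ids = encode(word)
        total_tokens += len(ids)
        direct += int(len(ids) == 1)   # word encoded as a single token
    tokens_per_word = total_tokens / len(words)
    direct_pct = 100.0 * direct / len(words)
    return tokens_per_word, direct_pct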

Training and Inference Performance

Training Time:

  • Total time to train the model (all epochs)
  • Average time per epoch

Inference Speed:

  • Tokens processed per second during evaluation
  • Higher is faster

Results Summary

Quantitative Results

Metric                        Whitespace     SentencePiece    GPT-2
Word-Level Perplexity         139.83         480.62           640.50
Character-Level Perplexity    2.49           3.13             3.32
Token-Level Perplexity        109.57         154.58           30.95
Tokens per Word               1.20           1.53             3.43
Direct Encoding %             74.26%         67.35%           24.34%
OOV Rate                      9.84%          0.00%            0.00%
Training Time                 187.7 min      419.3 min        252.5 min
Inference Speed               5,351 tok/s    5,942 tok/s      11,694 tok/s
Vocab Size                    10,000         10,000           50,257

Note: Token-level perplexity is shown but is NOT comparable across tokenizers.

Key Findings

1. Whitespace tokenizer achieves best word-level and character-level perplexity

  • Word-level PPL: 139.83 (4.6x better than GPT-2)
  • Character-level PPL: 2.49 (1.3x better than GPT-2)
  • Despite 9.84% OOV rate, still outperforms subword methods

2. GPT-2 has best token-level perplexity but worst word-level perplexity

  • Token-level: 30.95 (best)
  • Word-level: 640.50 (worst)
  • This demonstrates why token-level perplexity is misleading for comparison

3. Tokenization efficiency matters

  • Whitespace: 1.20 tokens/word
  • SentencePiece: 1.53 tokens/word
  • GPT-2: 3.43 tokens/word
  • Fewer tokens per word leads to better word-level metrics

4. Vocabulary size vs. efficiency trade-off

  • GPT-2's large vocabulary (50K) doesn't compensate for inefficient tokenization
  • Whitespace's small vocabulary (10K) is sufficient when tokenization is efficient

Qualitative Analysis Examples

Sample Text 1: "Uprawa pomidorow w ogrodzie wymaga odpowiedniego przygotowania gleby, regularnego podlewania oraz stosowania nawozow organicznych..." ("Growing tomatoes in the garden requires proper soil preparation, regular watering, and the use of organic fertilizers...")

Tokenizer        Tokens    Tokens/Word    Direct %
GPT-2            113       3.90           10.3%
SentencePiece    44        1.52           58.6%
Whitespace       33        1.14           82.8%

Observation: Whitespace creates the most compact representation, while GPT-2 splits Polish words into many English-centric subwords.


Discussion

Why Whitespace Wins on Word-Level Metrics

The whitespace tokenizer achieves the best word-level and character-level perplexity despite having:

  • A smaller vocabulary than GPT-2 (10K vs 50K)
  • An OOV rate of 9.84%
  • "Worse" token-level perplexity (109.57 vs 30.95)

Explanation:

  1. Efficient tokenization reduces compounding errors

    • Whitespace: 1.20 tokens/word means fewer prediction steps per word
    • GPT-2: 3.43 tokens/word means errors compound across multiple predictions
    • When normalized by words, efficiency matters more than individual token accuracy
  2. Vocabulary optimized for target language

    • Whitespace vocabulary is entirely Polish words
    • GPT-2 vocabulary is English-centric
    • Language-specific optimization is more valuable than vocabulary size
  3. Direct encoding reduces information loss

    • 74% of words are single tokens in whitespace
    • Only 24% of words are single tokens in GPT-2
    • Less splitting means less semantic fragmentation

Trade-offs of Each Tokenizer

Whitespace:

  • Best: Word-level and character-level perplexity
  • Best: Tokenization efficiency (1.20 tokens/word)
  • Best: Training speed (187.7 min)
  • Worst: OOV handling (9.84% unknown words)
  • Limited: Cannot handle unknown words gracefully

SentencePiece:

  • Balanced: Middle ground on all metrics
  • Good: No OOV issues (subword tokenization)
  • Good: Language-specific training
  • Good: Reasonable efficiency (1.53 tokens/word)
  • Limitation: Smaller vocabulary limits expressiveness

GPT-2:

  • Best: Token-level perplexity (misleading metric)
  • Best: Inference speed (11,694 tok/s)
  • Best: No OOV issues
  • Worst: Word-level perplexity (inefficient for Polish)
  • Worst: Tokenization efficiency (3.43 tokens/word)
  • Benefit: Pre-trained, no training required

Implications for Production

For Polish language modeling:

  • Whitespace tokenizer if OOV rate is acceptable and corpus is representative
  • SentencePiece for robust handling of unknown words
  • GPT-2 for transfer learning or multilingual applications

For other languages:

  • Results may vary based on language morphology
  • Morphologically rich languages benefit from efficient tokenization
  • English-like languages may favor GPT-2's vocabulary

Lab Assignment Insights

This lab demonstrates why the assignment instructions explicitly stated:

"Do not report token-level perplexity, as it is not comparable across tokenizers."

Token-level metrics can be highly misleading. GPT-2 appears to have the "best" model (lowest token perplexity of 30.95) but actually has the worst performance when properly evaluated (highest word perplexity of 640.50).


References

Academic Papers

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.

Resources

  • SentencePiece: https://github.com/google/sentencepiece
  • Hugging Face tokenizers: https://github.com/huggingface/tokenizers


Quick Commands Reference

# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install torch tokenizers sentencepiece numpy matplotlib tqdm

# Preprocess with all tokenizers
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer gpt2
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer sentencepiece
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer whitespace

# Train all models
python scripts/train.py --dataset {dataset}_gpt2
python scripts/train.py --dataset {dataset}_sentencepiece
python scripts/train.py --dataset {dataset}_whitespace

# Evaluate all models
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_gpt2_transformer_best.pt --data test --dataset {dataset}_gpt2
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_sentencepiece_transformer_best.pt --data test --dataset {dataset}_sentencepiece
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_whitespace_transformer_best.pt --data test --dataset {dataset}_whitespace

# Complete comparison
python scripts/compare_tokenizers_evaluation.py

# Qualitative analysis
python scripts/qualitative_analysis.py

# Generate text
python scripts/generate.py --checkpoint checkpoints/{checkpoint}.pt --prompts "Your prompt" --dataset {dataset}

# Visualize metrics
python scripts/visualize_metrics.py --metrics results/{model}_metrics.json
python scripts/visualize_metrics.py --compare

# GPU configuration check (for RunPod)
python scripts/check_gpu_config.py

# Background training (for RunPod)
./run_training_background.sh {dataset}
./check_training_status.sh
./stop_training.sh

For detailed results, see:

  • results/tokenizer_comparison_results.json - Complete metrics
  • results/qualitative_analysis.txt - Tokenization examples
