- Project Overview
- Tokenization Strategies
- Model Architecture
- Project Structure
- Installation
- Complete Workflow
- Detailed Usage Guide
- Training Visualization & Plotting
- Evaluation Metrics
- Results Summary
- Discussion
- References
This project implements and compares three different tokenization strategies for Polish language modeling using a Transformer-based neural language model. The primary goal is to analyze how different tokenization approaches affect model performance, efficiency, and text representation quality.
- Three Tokenization Strategies: GPT-2 BPE (pre-trained), Whitespace (custom), and SentencePiece (custom-trained)
- Identical Model Architecture: All tokenizers use the same 50M parameter Transformer to ensure fair comparison
- Comprehensive Evaluation: Word-level and character-level perplexity (not token-level, as per assignment requirements)
- OOV Analysis: Out-of-vocabulary statistics for whitespace tokenizer
- Efficiency Metrics: Tokens per word, training time, inference speed
- Qualitative Analysis: Tokenization comparison on sample texts
- GPU Optimization: Optimized for NVIDIA RTX 5090 with mixed precision training
- Automated Visualization: Training plots, metrics tracking, and comparison tools
This implementation fulfills the requirements of the Computational Linguistics lab assignment on tokenization efficiency:
- Implement three tokenizers: pre-trained (GPT-2), custom whitespace-based, and SentencePiece
- Train three identical models (same architecture, hyperparameters, data) - one per tokenizer
- Evaluate using word-level and character-level perplexity (NOT token-level)
- Report OOV statistics for whitespace tokenizer
- Analyze efficiency metrics: tokens per word, training time, inference speed
- Provide qualitative analysis with tokenization examples
The GPT-2 tokenizer uses Byte-Pair Encoding (BPE) trained on a large English corpus.
Characteristics:
- Pre-trained on English text
- Vocabulary size: 50,257 tokens
- Subword tokenization eliminates OOV issues
- Inefficient for Polish (averages 3.43 tokens per word)
- No language-specific optimization
Implementation:
from utils.gpt2_tokenizer import GPT2Tokenizer
tokenizer = GPT2Tokenizer(vocab_size=50257)A simple tokenizer that splits on whitespace and treats punctuation as separate tokens.
Characteristics:
- Splits text on whitespace and punctuation
- Fixed vocabulary size: 10,000 most frequent words
- Uses
<UNK>token for out-of-vocabulary words - Most efficient encoding (1.20 tokens per word)
- Suffers from OOV issues (9.84% OOV rate on test data)
Algorithm:
- Split text on whitespace
- Separate punctuation marks as individual tokens
- Build vocabulary from top N most frequent tokens in training data
- Replace unknown tokens with
<UNK>during encoding
Implementation:
from utils.whitespace_tokenizer import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer(vocab_size=10000)
tokenizer.train(texts)A subword tokenizer trained on the target Polish corpus using the SentencePiece library.
Characteristics:
- Trained on Polish gardening forum corpus
- Vocabulary size: 10,000 tokens
- BPE algorithm for subword segmentation
- Language-specific optimization
- Balanced efficiency (1.53 tokens per word)
- No OOV issues due to subword tokenization
Training:
from utils.sentencepiece_tokenizer import SentencePieceTokenizer
tokenizer = SentencePieceTokenizer(vocab_size=10000, model_type="bpe")
tokenizer.train(texts)All three tokenizers use an identical Transformer language model architecture to ensure fair comparison. The model is based on "Attention Is All You Need" (Vaswani et al., 2017) and adapted for causal language modeling.
INPUT SEQUENCE (token IDs)
|
v
+-------------------------------------------------------------+
| TOKEN EMBEDDING LAYER |
| Maps token IDs to dense vectors |
| Input: (batch_size, seq_len) |
| Output: (batch_size, seq_len, d_model=512) |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| SCALE EMBEDDINGS |
| Multiply by sqrt(d_model) for better gradient flow |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| POSITIONAL ENCODING |
| Add sinusoidal position information |
| PE(pos,2i) = sin(pos / 10000^(2i/d_model)) |
| PE(pos,2i+1) = cos(pos / 10000^(2i/d_model)) |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| DROPOUT (p=0.3) |
+-------------------------------------------------------------+
|
| (Repeat 12 times - 12 Transformer Decoder Layers)
|
v
+-------------------------------------------------------------+
| TRANSFORMER DECODER LAYER |
| |
| +-------------------------------------------------------+ |
| | MULTI-HEAD SELF-ATTENTION (8 heads) | |
| | +---------+ +---------+ +---------+ +---------+ | |
| | | Head 1 | | Head 2 | | Head 3 | | ... | | |
| | | Q K V | | Q K V | | Q K V | | Head 8 | | |
| | +---------+ +---------+ +---------+ +---------+ | |
| | With Causal Mask (prevents attending to future) | |
| +-------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------+ |
| | ADD & LAYER NORM | |
| | LayerNorm(x + MultiHeadAttention(x)) | |
| +-------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------+ |
| | FEED-FORWARD NETWORK | |
| | FFN(x) = ReLU(xW1 + b1)W2 + b2 | |
| | Dimensions: 512 -> 2048 -> 512 | |
| +-------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------+ |
| | ADD & LAYER NORM | |
| | LayerNorm(x + FFN(x)) | |
| +-------------------------------------------------------+ |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| LINEAR PROJECTION TO VOCABULARY |
| Input: (batch_size, seq_len, d_model=512) |
| Output: (batch_size, seq_len, vocab_size) |
+-------------------------------------------------------------+
|
v
LOGITS (predictions for next token)
The causal mask ensures the model can only attend to previous positions, preventing information leakage from future tokens:
For sequence length 5:
Attention Mask (1 = can attend, 0 = cannot attend):
t1 t2 t3 t4 t5
+--------------------+
t1 | 1 0 0 0 0 | Token 1 can only see itself
t2 | 1 1 0 0 0 | Token 2 can see t1, t2
t3 | 1 1 1 0 0 | Token 3 can see t1, t2, t3
t4 | 1 1 1 1 0 | Token 4 can see t1-t4
t5 | 1 1 1 1 1 | Token 5 can see all previous
+--------------------+
- Model dimension (d_model): 512
- Number of attention heads: 8
- Head dimension (d_k): 64 (512/8)
- Number of layers: 12
- Feed-forward dimension: 2048
- Dropout: 0.3
- Maximum sequence length: 256
- Total parameters: Approximately 50M
Note: The vocabulary size varies by tokenizer:
- GPT-2: 50,257 tokens
- Whitespace: 10,000 tokens
- SentencePiece: 10,000 tokens
.
├── data/
│ ├── raw/ # Raw input data
│ │ └── forum_forum_poradnikogrodniczy_pl_corpus.txt
│ ├── processed/ # Preprocessed data
│ │ ├── forum_forum_poradnikogrodniczy_pl_corpus_gpt2_*.pt
│ │ ├── forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece_*.pt
│ │ ├── forum_forum_poradnikogrodniczy_pl_corpus_whitespace_*.pt
│ │ └── *_metadata.json # Dataset statistics
│ └── wikipedia/ # Out-of-domain data (optional)
│
├── models/ # Model implementations
│ ├── __init__.py
│ └── transformer_model.py # Transformer language model
│
├── utils/ # Utility modules
│ ├── __init__.py
│ ├── config.py # Configuration classes with GPU optimization
│ ├── tokenizer_factory.py # Tokenizer factory for all three types
│ ├── gpt2_tokenizer.py # GPT-2 BPE tokenizer
│ ├── whitespace_tokenizer.py # Custom whitespace tokenizer
│ ├── sentencepiece_tokenizer.py # SentencePiece tokenizer
│ ├── dataset.py # PyTorch datasets and dataloaders
│ ├── metrics.py # Evaluation metrics
│ └── plotting.py # Training visualization
│
├── scripts/ # Executable scripts
│ ├── __init__.py
│ ├── preprocess_data.py # Data preprocessing (supports all tokenizers)
│ ├── train.py # Model training with GPU optimization
│ ├── evaluate.py # Model evaluation
│ ├── generate.py # Text generation
│ ├── visualize_metrics.py # Standalone visualization
│ ├── compare_tokenizers_evaluation.py # Complete tokenizer comparison
│ ├── qualitative_analysis.py # Qualitative tokenization analysis
│ ├── generate_final_report.py # Report generation
│ ├── check_gpu_config.py # GPU configuration checker
│ └── benchmark_gpu.py # GPU performance benchmark
│
├── checkpoints/ # Saved model checkpoints
│ ├── forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt
│ ├── forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece_sentencepiece_transformer_best.pt
│ └── forum_forum_poradnikogrodniczy_pl_corpus_whitespace_whitespace_transformer_best.pt
│
├── results/ # Evaluation results and visualizations
│ ├── plots/ # Training visualization plots
│ │ ├── *_loss_epoch_N.png
│ │ ├── *_perplexity_epoch_N.png
│ │ ├── *_combined_metrics_epoch_N.png
│ │ └── *_overfitting_analysis_epoch_N.png
│ ├── *_metrics.json # Training metrics
│ ├── tokenizer_comparison_results.json # Complete comparison metrics
│ ├── qualitative_analysis.txt # Tokenization examples
│ ├── LAB_REPORT.md # Detailed lab report
│ └── FINAL_SUMMARY.md # Executive summary
│
├── LAB_INSTRUCTIONS.md # Original lab assignment
├── RUNPOD_OPTIMIZATION.md # GPU optimization guide
├── README.md # This file
├── run_training_background.sh # Background training helper
├── stop_training.sh # Stop background training
└── check_training_status.sh # Check training status
- Python 3.12+
- For local training: Mac with Apple Silicon (MPS) or CUDA-capable GPU
- For cloud training: RunPod account with RTX 5090 (recommended)
- Virtual environment manager (venv or conda)
- Clone or navigate to the project directory:
cd "tokenization-efficiency-benchmark"- Create and activate virtual environment:
python3 -m venv .venv
source .venv/bin/activate # On Mac/Linux- Install dependencies:
pip install torch torchvision torchaudio
pip install tokenizers sentencepiece numpy matplotlib tqdmRequired packages:
- PyTorch 2.0+ (with CUDA support for GPU training)
- tokenizers (Hugging Face)
- sentencepiece
- numpy
- matplotlib
- tqdm
- Verify installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"# 1. Preprocess data with different tokenizers
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer gpt2
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer sentencepiece
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer whitespace
# 2. Train models (one per tokenizer)
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_sentencepiece
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_whitespace
# 3. Evaluate models
python scripts/evaluate.py --checkpoint checkpoints/..._gpt2_transformer_best.pt --data test --dataset ...
python scripts/evaluate.py --checkpoint checkpoints/..._sentencepiece_transformer_best.pt --data test --dataset ...
python scripts/evaluate.py --checkpoint checkpoints/..._whitespace_transformer_best.pt --data test --dataset ...
# 4. Compare tokenizers
python scripts/compare_tokenizers_evaluation.py
# 5. Generate qualitative analysis
python scripts/qualitative_analysis.py
# 6. Visualize metrics
python scripts/visualize_metrics.py --metrics results/..._metrics.jsonPreprocess your corpus with each tokenizer. The preprocessing script will:
- Load raw text data
- Split into train (85%), validation (10%), test (5%)
- Train the tokenizer (or load pre-trained for GPT-2)
- Tokenize all splits
- Save processed data
Preprocess with GPT-2 tokenizer:
python scripts/preprocess_data.py \
--input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
--tokenizer gpt2Preprocess with SentencePiece tokenizer:
python scripts/preprocess_data.py \
--input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
--tokenizer sentencepiecePreprocess with Whitespace tokenizer:
python scripts/preprocess_data.py \
--input data/raw/forum_forum_poradnikogrodniczy_pl_corpus.txt \
--tokenizer whitespaceArguments:
--input: Path to raw text file (one document per line)--tokenizer: Tokenizer type (gpt2,sentencepiece, orwhitespace)
Output:
data/processed/
├── {dataset}_{tokenizer}_tokenizer.json/.model # Trained tokenizer
├── {dataset}_{tokenizer}_train_ids.pt # Training sequences
├── {dataset}_{tokenizer}_val_ids.pt # Validation sequences
├── {dataset}_{tokenizer}_test_ids.pt # Test sequences
└── {dataset}_{tokenizer}_metadata.json # Dataset statistics
Train a Transformer model for each tokenizer. The training script automatically:
- Detects GPU and optimizes settings (batch size, workers, mixed precision)
- Tracks metrics (loss, perplexity, learning rate, time)
- Saves checkpoints periodically and when validation improves
- Generates training plots after each epoch
Train with auto-detected settings:
python scripts/train.py --dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2Training on RunPod:
The code automatically detects RunPod environment and optimizes for RTX 5090:
- Batch size: 192 (optimized for 32GB VRAM)
- Mixed precision (AMP): Enabled
- DataLoader workers: 8
- Pin memory: Enabled
- TF32: Enabled
Example output:
================================================================================
TRAINING TRANSFORMER LANGUAGE MODEL
================================================================================
Dataset: forum_forum_poradnikogrodniczy_pl_corpus_gpt2
Tokenizer: gpt2
Vocabulary size: 50257
Configuration: Transformer(layers=12, heads=8, emb=512, ff=2048)
Device: cuda
Training environment: cloud
GPU: NVIDIA GeForce RTX 5090
GPU Memory: 33.68 GB
Mixed precision (AMP): True
Gradient accumulation steps: 2
DataLoader workers: 8
Pin memory: True
TF32 enabled: True
Loading preprocessed data...
Train: 356596 | Val: 41952 | Test: 20977
Initializing model...
Model parameters: 48,079,632
================================================================================
STARTING TRAINING
================================================================================
Epoch 1/50
--------------------------------------------------------------------------------
Training: 100%|==========| 1856/1856 [04:12<00:00, 7.34it/s, loss=5.2341]
Validating...
Epoch 1 Summary:
Train Loss: 5.2341 | Train PPL: 187.32
Val Loss: 5.1234 | Val PPL: 167.45
LR: 0.000960 | Time: 252.3s
GPU Memory: 26.34GB allocated, 28.12GB reserved
Metrics updated: results/..._metrics.json
Best model saved: checkpoints/..._best.pt
...
Checkpoints saved:
checkpoints/{dataset}_{tokenizer}_transformer_best.pt- Best model (lowest validation loss)checkpoints/{dataset}_{tokenizer}_transformer_epoch_N.pt- Periodic checkpoints
Resume training:
python scripts/train.py --dataset {dataset} --resume checkpoints/{checkpoint}.ptEvaluate trained models to calculate word-level and character-level perplexity.
Evaluate on test set:
python scripts/evaluate.py \
--checkpoint checkpoints/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt \
--data test \
--dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2Example output:
================================================================================
EVALUATION RESULTS
================================================================================
Model: TRANSFORMER
Checkpoint: checkpoints/..._gpt2_transformer_best.pt
Data: test
Sequences: 20977
Total tokens: 663,015
--------------------------------------------------------------------------------
Loss: 3.4323
Perplexity: 30.95
Evaluation time: 178.16s
Throughput: 3721 tokens/s
================================================================================
Run comprehensive comparison including word-level and character-level perplexity:
python scripts/compare_tokenizers_evaluation.pyThis script calculates:
- Word-level perplexity (perplexity per word, not per token)
- Character-level perplexity (perplexity per character)
- OOV statistics (for whitespace tokenizer)
- Tokens per word ratio
- Words encoded directly (single token)
- Inference speed
Output:
================================================================================
TOKENIZER COMPARISON EVALUATION
================================================================================
Evaluating: GPT2 tokenizer
...
GPT2 Results:
Word-level perplexity: 640.50
Character-level perplexity: 3.32
OOV: 0/0 (0.00%)
Tokens per word: 3.43
Words encoded directly: 24.34%
Inference speed: 11694 tokens/sec
Evaluating: SENTENCEPIECE tokenizer
...
SENTENCEPIECE Results:
Word-level perplexity: 480.62
Character-level perplexity: 3.13
OOV: 0/0 (0.00%)
Tokens per word: 1.53
Words encoded directly: 67.35%
Inference speed: 5942 tokens/sec
Evaluating: WHITESPACE tokenizer
...
WHITESPACE Results:
Word-level perplexity: 139.83
Character-level perplexity: 2.49
OOV: 2212/22469 (9.84%)
Tokens per word: 1.20
Words encoded directly: 74.26%
Inference speed: 5351 tokens/sec
Results saved to: results/tokenizer_comparison_results.json
Generate tokenization examples for sample texts:
python scripts/qualitative_analysis.pyThis creates a detailed comparison showing how each tokenizer handles the same Polish text samples (at least 30 words each).
Output saved to: results/qualitative_analysis.txt
Generate text completions using trained models:
python scripts/generate.py \
--checkpoint checkpoints/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_best.pt \
--prompts "Jak uprawiac pomidory w ogrodzie" "Najlepsze odmiany ogurkow" \
--max-length 150 \
--temperature 0.5 \
--dataset forum_forum_poradnikogrodniczy_pl_corpus_gpt2Arguments:
--checkpoint: Path to model checkpoint--prompts: List of text prompts (space-separated)--dataset: Dataset name for loading correct tokenizer--max-length: Maximum tokens to generate (default: 100)--temperature: Sampling temperature (0.1-1.5, lower = more focused)--top-k: Top-k sampling (default: 50)
The project includes automated visualization of training metrics. Plots are generated automatically during training and can be recreated from saved metrics.
During training, plots are automatically created after each epoch:
- Loss Plot - Training and validation loss over epochs
- Perplexity Plot - Training and validation perplexity over epochs
- Learning Rate Plot - Learning rate schedule (log scale)
- Epoch Times Plot - Time per epoch with average line
- Combined Metrics Plot - All metrics in 2x2 grid
- Overfitting Analysis - Train vs validation loss with gap visualization
- Summary Statistics - Final training summary
Plot location: results/plots/
Below are example plots from the trained models:
Recreate plots from saved metrics:
# Visualize specific model
python scripts/visualize_metrics.py \
--metrics results/forum_forum_poradnikogrodniczy_pl_corpus_gpt2_gpt2_transformer_metrics.json
# Compare all tokenizers
python scripts/visualize_metrics.py --compareLoss Plot:
- Shows training and validation cross-entropy loss
- Both curves should decrease over time
- If validation stops decreasing while training continues, model is overfitting
Perplexity Plot:
- Exponential of loss (more interpretable)
- Lower values indicate better predictions
- Validation should track training perplexity
Overfitting Analysis:
- Shows generalization gap (difference between train and validation loss)
- Orange shaded area indicates overfitting severity
- Growing gap suggests need for regularization
Word-level perplexity measures how well the model predicts at the word level (not token level). This is the primary metric for comparing tokenizers.
Formula:
Word-Level Perplexity = exp(Total Cross-Entropy Loss / Number of Words)
Why word-level and not token-level?
Token-level perplexity is NOT comparable across tokenizers because:
- GPT-2 uses 3.43 tokens per word
- SentencePiece uses 1.53 tokens per word
- Whitespace uses 1.20 tokens per word
A tokenizer that splits words into more tokens will appear to have better token-level perplexity even if it's worse at predicting words.
Interpretation:
- Lower is better
- Perplexity of 100 means the model is as confused as choosing uniformly from 100 options
- Typical range: 50-500 for small models, 10-30 for large state-of-the-art models
Character-level perplexity measures prediction quality per character.
Formula:
Character-Level Perplexity = exp(Total Cross-Entropy Loss / Number of Characters)
Interpretation:
- Lower is better
- Normalizes by character count instead of word count
- Useful for comparing how efficiently models encode information
For the whitespace tokenizer, measures how many words are not in the vocabulary.
Metrics:
- Number of OOV words
- Total words
- OOV percentage
Note: Subword tokenizers (GPT-2, SentencePiece) have 0% OOV by design.
Tokens per Word:
- Average number of tokens needed to encode one word
- Lower is more efficient
Direct Encoding Percentage:
- Percentage of words encoded as a single token (not split)
- Higher means better vocabulary coverage
Training Time:
- Total time to train the model (all epochs)
- Average time per epoch
Inference Speed:
- Tokens processed per second during evaluation
- Higher is faster
| Metric | Whitespace | SentencePiece | GPT-2 |
|---|---|---|---|
| Word-Level Perplexity | 139.83 | 480.62 | 640.50 |
| Character-Level Perplexity | 2.49 | 3.13 | 3.32 |
| Token-Level Perplexity | 109.57 | 154.58 | 30.95 |
| Tokens per Word | 1.20 | 1.53 | 3.43 |
| Direct Encoding % | 74.26% | 67.35% | 24.34% |
| OOV Rate | 9.84% | 0.00% | 0.00% |
| Training Time | 187.7 min | 419.3 min | 252.5 min |
| Inference Speed | 5,351 tok/s | 5,942 tok/s | 11,694 tok/s |
| Vocab Size | 10,000 | 10,000 | 50,257 |
Note: Token-level perplexity is shown but is NOT comparable across tokenizers.
1. Whitespace tokenizer achieves best word-level and character-level perplexity
- Word-level PPL: 139.83 (4.6x better than GPT-2)
- Character-level PPL: 2.49 (1.3x better than GPT-2)
- Despite 9.84% OOV rate, still outperforms subword methods
2. GPT-2 has best token-level perplexity but worst word-level perplexity
- Token-level: 30.95 (best)
- Word-level: 640.50 (worst)
- This demonstrates why token-level perplexity is misleading for comparison
3. Tokenization efficiency matters
- Whitespace: 1.20 tokens/word
- SentencePiece: 1.53 tokens/word
- GPT-2: 3.43 tokens/word
- Fewer tokens per word leads to better word-level metrics
4. Vocabulary size vs. efficiency trade-off
- GPT-2's large vocabulary (50K) doesn't compensate for inefficient tokenization
- Whitespace's small vocabulary (10K) is sufficient when tokenization is efficient
Sample Text 1: "Uprawa pomidorow w ogrodzie wymaga odpowiedniego przygotowania gleby, regularnego podlewania oraz stosowania nawozow organicznych..."
| Tokenizer | Tokens | Tokens/Word | Direct % |
|---|---|---|---|
| GPT-2 | 113 | 3.90 | 10.3% |
| SentencePiece | 44 | 1.52 | 58.6% |
| Whitespace | 33 | 1.14 | 82.8% |
Observation: Whitespace creates the most compact representation, while GPT-2 splits Polish words into many English-centric subwords.
The whitespace tokenizer achieves the best word-level and character-level perplexity despite having:
- A smaller vocabulary than GPT-2 (10K vs 50K)
- An OOV rate of 9.84%
- "Worse" token-level perplexity (109.57 vs 30.95)
Explanation:
-
Efficient tokenization reduces compounding errors
- Whitespace: 1.20 tokens/word means fewer prediction steps per word
- GPT-2: 3.43 tokens/word means errors compound across multiple predictions
- When normalized by words, efficiency matters more than individual token accuracy
-
Vocabulary optimized for target language
- Whitespace vocabulary is entirely Polish words
- GPT-2 vocabulary is English-centric
- Language-specific optimization is more valuable than vocabulary size
-
Direct encoding reduces information loss
- 74% of words are single tokens in whitespace
- Only 24% of words are single tokens in GPT-2
- Less splitting means less semantic fragmentation
Whitespace:
- Best: Word-level and character-level perplexity
- Best: Tokenization efficiency (1.20 tokens/word)
- Best: Training speed (187.7 min)
- Worst: OOV handling (9.84% unknown words)
- Limited: Cannot handle unknown words gracefully
SentencePiece:
- Balanced: Middle ground on all metrics
- Good: No OOV issues (subword tokenization)
- Good: Language-specific training
- Good: Reasonable efficiency (1.53 tokens/word)
- Limitation: Smaller vocabulary limits expressiveness
GPT-2:
- Best: Token-level perplexity (misleading metric)
- Best: Inference speed (11,694 tok/s)
- Best: No OOV issues
- Worst: Word-level perplexity (inefficient for Polish)
- Worst: Tokenization efficiency (3.43 tokens/word)
- Benefit: Pre-trained, no training required
For Polish language modeling:
- Whitespace tokenizer if OOV rate is acceptable and corpus is representative
- SentencePiece for robust handling of unknown words
- GPT-2 for transfer learning or multilingual applications
For other languages:
- Results may vary based on language morphology
- Morphologically rich languages benefit from efficient tokenization
- English-like languages may favor GPT-2's vocabulary
This lab demonstrates why the assignment instructions explicitly stated:
"Do not report token-level perplexity, as it is not comparable across tokenizers."
Token-level metrics can be highly misleading. GPT-2 appears to have the "best" model (lowest token perplexity of 30.95) but actually has the worst performance when properly evaluated (highest word perplexity of 640.50).
-
Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. https://arxiv.org/abs/1508.07909
-
Kudo, Taku, and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." EMNLP 2018. https://arxiv.org/abs/1808.06226
-
Vaswani, Ashish, et al. "Attention Is All You Need." NIPS 2017. https://arxiv.org/abs/1706.03762
-
Mielke, Sabrina J. "Comparing perplexities is hard." https://sjmielke.com/comparing-perplexities.htm
-
Hugging Face Tokenizers Documentation: https://huggingface.co/docs/tokenizers/
-
SentencePiece GitHub: https://github.com/google/sentencepiece
# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install torch tokenizers sentencepiece numpy matplotlib tqdm
# Preprocess with all tokenizers
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer gpt2
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer sentencepiece
python scripts/preprocess_data.py --input data/raw/corpus.txt --tokenizer whitespace
# Train all models
python scripts/train.py --dataset {dataset}_gpt2
python scripts/train.py --dataset {dataset}_sentencepiece
python scripts/train.py --dataset {dataset}_whitespace
# Evaluate all models
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_gpt2_transformer_best.pt --data test --dataset {dataset}_gpt2
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_sentencepiece_transformer_best.pt --data test --dataset {dataset}_sentencepiece
python scripts/evaluate.py --checkpoint checkpoints/{dataset}_whitespace_transformer_best.pt --data test --dataset {dataset}_whitespace
# Complete comparison
python scripts/compare_tokenizers_evaluation.py
# Qualitative analysis
python scripts/qualitative_analysis.py
# Generate text
python scripts/generate.py --checkpoint checkpoints/{checkpoint}.pt --prompts "Your prompt" --dataset {dataset}
# Visualize metrics
python scripts/visualize_metrics.py --metrics results/{model}_metrics.json
python scripts/visualize_metrics.py --compare
# GPU configuration check (for RunPod)
python scripts/check_gpu_config.py
# Background training (for RunPod)
./run_training_background.sh {dataset}
./check_training_status.sh
./stop_training.shFor detailed results, see:
results/tokenizer_comparison_results.json- Complete metricsresults/qualitative_analysis.txt- Tokenization examples






