This document provides a detailed explanation of a complete implementation of a small transformer-based language model inspired by Qwen3, including custom attention, tokenizer integration, optimizer (Muon), and inference functions.
- Overview
- Core Concepts
- Code Modules
  - 1. Imports and Setup
  - 2. Utilities
  - 3. Configuration
  - 4. Optimizer - Muon
  - 5. Data Loading
  - 6. Dataset Class
  - 7. Rotary Position Embedding (RoPE)
  - 8. Attention (Grouped Query Attention)
  - 9. SwiGLU Feedforward Network
  - 10. Transformer Block
  - 11. Language Model Class
  - 12. Evaluation Function
  - 13. Optimizer Setup
  - 14. Training Loop
  - 15. Inference Functions
- Key Concepts Summary
## Overview

This project implements a mini Transformer-based language model inspired by Qwen3. It covers:
- Custom grouped-query attention with RoPE
- SwiGLU activation in feedforward
- Hybrid optimizer (Muon + AdamW)
- Tokenized dataset with HuggingFace
- Training and evaluation loops
- Text generation and demo
## Core Concepts

- Transformer Architecture: Uses multi-head attention, layer norm, and feedforward layers in blocks.
- Rotary Position Embedding (RoPE): Replaces traditional positional encoding.
- Grouped Query Attention (GQA): Optimizes memory/computation by grouping key-value heads.
- SwiGLU: Efficient activation function combining Swish and GLU.
- Muon Optimizer: A momentum-based optimizer that applies Newton-Schulz iterations to improve convergence.
## Code Modules

### 1. Imports and Setup

Standard imports, including PyTorch, HuggingFace's `datasets` and `transformers`, and `tqdm` for progress visualization.
### 2. Utilities

Ensures reproducibility by seeding all random number generators (Python, NumPy, Torch, CUDA).
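A minimal sketch of such a seeding utility (the name `set_seed` is illustrative; the real script also seeds the Torch and CUDA generators, shown here only as a comment so the sketch stays dependency-light):

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed the Python and NumPy RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # The actual training script additionally seeds Torch:
    # torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
```

Calling this once at startup makes data shuffling and weight initialization repeatable across runs.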
### 3. Configuration

Dataclass that stores model, training, and data parameters, including head counts, hidden sizes, and sequence lengths.
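A sketch of what such a config dataclass can look like; all field names and values here are illustrative, not taken from the source:

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Illustrative defaults; the actual script's values may differ.
    vocab_size: int = 32768
    d_model: int = 256
    n_heads: int = 8
    n_kv_heads: int = 2      # fewer KV heads than query heads -> GQA
    n_layers: int = 6
    max_seq_len: int = 512
    batch_size: int = 32
    learning_rate: float = 3e-4
```

A dataclass keeps every hyperparameter in one place, and individual fields can be overridden per experiment (e.g. `ModelConfig(n_layers=2)` for a quick smoke test).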
### 4. Optimizer - Muon

A helper performs approximate orthogonalization of gradient matrices via Newton-Schulz iteration, which stabilizes updates. The Muon optimizer itself combines Nesterov momentum with this orthogonalization step and is applied only to 2D parameters (weight matrices).
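The Newton-Schulz step can be sketched in NumPy as follows. The quintic coefficients below are the ones commonly used with Muon; the real code runs the same iteration on GPU tensors (typically in bfloat16), so treat this as a reference sketch rather than the script's exact implementation:

```python
import numpy as np


def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Push the singular values of G toward 1 (approximate orthogonalization)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After a few steps the singular values cluster near 1, so the update direction keeps each matrix's "shape" while equalizing its scale across directions.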
### 5. Data Loading

- Loads the dataset and tokenizer from HuggingFace.
- Tokenizes the text data and caches it.
- Stores the result as a `.pkl` file for quick future reloads.
### 6. Dataset Class

Builds training samples as sliding windows over the token stream, aligning inputs `x` with targets `y` (shifted by one position).
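The windowing logic reduces to a few lines; this is a plain-Python sketch (the function name `make_samples` is illustrative), whereas the real class wraps the same slicing in a `torch.utils.data.Dataset` that returns tensors:

```python
def make_samples(tokens, seq_len, stride):
    """Slice a flat token stream into (x, y) pairs, with y shifted one right."""
    samples = []
    # Stop early enough that y = tokens[start+1 : start+seq_len+1] stays in range.
    for start in range(0, len(tokens) - seq_len, stride):
        x = tokens[start : start + seq_len]
        y = tokens[start + 1 : start + seq_len + 1]
        samples.append((x, y))
    return samples
```

Because `y` is `x` shifted by one token, the model learns to predict the next token at every position in the window.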
### 7. Rotary Position Embedding (RoPE)

Generates sine/cosine embeddings for RoPE to allow extrapolation beyond training sequence lengths.
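Precomputing the tables amounts to one frequency per pair of channels; a NumPy sketch (the real code does the same with torch tensors, and `rope_tables` is an illustrative name):

```python
import numpy as np


def rope_tables(head_dim: int, max_len: int, base: float = 10000.0):
    """Precompute the cos/sin tables used to rotate Q and K in RoPE."""
    # One frequency per pair of channels, geometrically spaced.
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(max_len), inv_freq)   # (max_len, head_dim // 2)
    return np.cos(angles), np.sin(angles)
```

At inference time each (query, key) pair of channels is rotated by the angle for its position, so relative offsets are encoded by rotation differences rather than by learned absolute embeddings.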
### 8. Attention (Grouped Query Attention)

- Projects Q, K, and V from the input.
- Applies QK normalization and RoPE.
- Implements Grouped Query Attention, using `repeat_kv` to align key/value heads with query heads.
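The `repeat_kv` helper simply duplicates each KV head so the shapes line up with the query heads; a NumPy sketch of the same logic (the real code uses `torch.Tensor.expand`/`reshape` on GPU tensors):

```python
import numpy as np


def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Duplicate each KV head n_rep times along the head axis.

    kv: (batch, n_kv_heads, seq, head_dim)
    ->  (batch, n_kv_heads * n_rep, seq, head_dim)
    """
    if n_rep == 1:
        return kv
    b, h, s, d = kv.shape
    # Insert a repeat axis, broadcast, then fold it into the head axis.
    kv = np.broadcast_to(kv[:, :, None], (b, h, n_rep, s, d))
    return kv.reshape(b, h * n_rep, s, d)
```

With, say, 8 query heads and 2 KV heads, each KV head is shared by 4 query heads, cutting the KV projection and cache size by 4x.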
### 9. SwiGLU Feedforward Network

Implements Swish + Gated Linear Units to enhance activation expressiveness.
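Functionally, SwiGLU is `down(silu(gate(x)) * up(x))`; a NumPy sketch with plain weight matrices (the module in the script uses three `nn.Linear` layers for `gate`, `up`, and `down`):

```python
import numpy as np


def silu(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feedforward: the gate branch modulates the up branch elementwise."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

The gating lets the network suppress or pass each hidden channel per token, which in practice outperforms a plain ReLU/GELU MLP at equal parameter count.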
### 10. Transformer Block

- Combines `Qwen3Attention` and `SwiGLUFeedForward`
- Uses RMSNorm and residual connections

### 11. Language Model Class

- Embedding + positional dropout
- Stacks multiple Transformer blocks
- Ties weights between the input embedding and the output projection
### 12. Evaluation Function

Calculates cross-entropy loss, token-level accuracy, and perplexity on validation data.
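Perplexity follows directly from the averaged cross-entropy, so the evaluation loop only needs one extra line:

```python
import math


def perplexity(mean_ce_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_ce_loss)
```

A perplexity of 1.0 means the model is certain of every target token; a uniform guess over V tokens gives perplexity V.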
### 13. Optimizer Setup

Splits parameters between Muon (2D weight matrices) and AdamW (everything else), so each parameter type gets an update rule suited to its shape.
### 14. Training Loop

- Gradient accumulation
- Automatic Mixed Precision (AMP)
- Cosine LR scheduler with warmup
- Logging every 10 steps and evaluation every 500 steps
- Saves best and final model checkpoints
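The warmup-plus-cosine schedule above reduces to a small function of the step index; a sketch (the name `lr_at` and the exact warmup shape are illustrative, but this linear-warmup/cosine-decay form is the standard one):

```python
import math


def lr_at(step, max_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids unstable early updates at full learning rate; the cosine tail anneals smoothly to zero by the final step.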
### 15. Inference Functions

- Samples tokens from the model using nucleus (top-p) sampling and top-k filtering.
- Provides a CLI interface for prompting the model.
- Runs fixed-prompt tests after training.
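The filtering step of the sampler can be sketched in NumPy for a single logit vector (the real code applies the same masking to torch tensors with `torch.topk` and a sort; `filter_logits` is an illustrative name):

```python
import numpy as np


def filter_logits(logits: np.ndarray, top_k: int = 50, top_p: float = 0.9) -> np.ndarray:
    """Mask logits outside the top-k set, then outside the nucleus (top-p) set."""
    out = logits.copy()
    # Top-k: drop everything below the k-th largest logit.
    if top_k > 0:
        kth = np.sort(out)[-top_k] if top_k <= out.size else out.min()
        out[out < kth] = -np.inf
    # Top-p: keep the smallest set of tokens whose probabilities reach top_p.
    order = np.argsort(out)[::-1]                 # token ids, most likely first
    probs = np.exp(out[order] - out[order][0])    # stable softmax numerators
    probs /= probs.sum()
    cum = np.cumsum(probs)
    cutoff = np.searchsorted(cum, top_p) + 1      # always keep at least one token
    out[order[cutoff:]] = -np.inf
    return out
```

Sampling then draws from the softmax of the surviving logits; masked tokens have probability zero, which keeps generations out of the low-probability tail.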
## Key Concepts Summary

| Concept | Description |
|---|---|
| RoPE | Position encoding method that uses trigonometric rotations |
| GQA | Grouped Query Attention reduces KV projections for efficiency |
| SwiGLU | Combines Swish activation with Gated Linear Units |
| Muon | Optimizer that improves convergence by using orthogonal gradients |
| Tokenizer | Uses HuggingFace AutoTokenizer for encoding/decoding |
| AMP | Mixed precision training for speed and memory efficiency |
| Grad Accumulation | Allows larger effective batch sizes without increasing memory usage |
- This model is not trained on a large corpus; results are meant for educational or experimental purposes.
- To avoid NaN loss issues, ensure that `vocab_size` covers the full range of target token ids, use `ignore_index=pad_token_id` in `CrossEntropyLoss`, and consider lowering the learning rate.