prajeeta15/LLM-scratch-qwen3
Building an LLM from Scratch - Qwen3

This document explains a complete implementation of a small transformer-based language model inspired by Qwen3, covering custom attention, tokenizer integration, a custom optimizer (Muon), and inference utilities.


Table of Contents

  1. Overview
  2. Core Concepts
  3. Code Modules
  4. Key Concepts Summary

Overview

This project implements a mini Transformer-based language model inspired by Qwen3. It covers:

  • Custom grouped-query attention with RoPE
  • SwiGLU activation in feedforward
  • Hybrid optimizer (Muon + AdamW)
  • Tokenized dataset with HuggingFace
  • Training and evaluation loops
  • Text generation and demo

Core Concepts

  • Transformer Architecture: Uses multi-head attention, layer norm, and feedforward layers in blocks.
  • Rotary Position Embedding (RoPE): Replaces traditional positional encoding.
  • Grouped Query Attention (GQA): Optimizes memory/computation by grouping key-value heads.
  • SwiGLU: Efficient activation function combining Swish and GLU.
  • Muon Optimizer: A momentum-based optimizer that applies Newton-Schulz iterations to improve convergence.

Code Modules

1. Imports and Setup

Standard imports including PyTorch, HuggingFace's datasets, transformers, and tqdm for progress visualization.

2. Utilities

set_seed(seed)

Ensures reproducibility by seeding all random number generators (Python, NumPy, Torch, CUDA).
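A minimal sketch of such a helper (the exact body in the repository may differ):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed every RNG the training loop touches, for reproducibility."""
    random.seed(seed)                 # Python's stdlib RNG
    np.random.seed(seed)              # NumPy
    torch.manual_seed(seed)           # Torch CPU (and seeds CUDA too)
    torch.cuda.manual_seed_all(seed)  # all CUDA devices; no-op without CUDA


set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)  # same seed, so identical draws
```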

3. Configuration

ModelConfig

Dataclass that stores model, training, and data parameters, including head counts, hidden sizes, and sequence lengths.
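A sketch of what such a dataclass looks like; the field names and defaults below are illustrative, not the repository's exact values:

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Illustrative defaults; the repository's actual values may differ.
    vocab_size: int = 32000
    d_model: int = 384        # hidden size
    n_layers: int = 6
    n_heads: int = 8          # query heads
    n_kv_heads: int = 4       # GQA: fewer key/value heads than query heads
    d_ff: int = 1024          # SwiGLU hidden size
    max_seq_len: int = 512
    batch_size: int = 32
    learning_rate: float = 3e-4


config = ModelConfig()
# Query heads must split evenly into KV groups for grouped-query attention.
assert config.n_heads % config.n_kv_heads == 0
```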

4. Optimizer - Muon

zeropower_via_newtonschulz5

Performs orthogonalization using Newton-Schulz iteration to stabilize gradients.

Muon class

Custom optimizer that combines Nesterov momentum with orthogonalization, applied only to 2D parameters (like weight matrices).
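The orthogonalization step can be sketched as below, following the widely circulated Muon reference implementation (the quintic coefficients and normalization are taken from that public code; the repository's version may vary in details):

```python
import torch


def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    Drives the singular values of G toward 1 without an explicit SVD,
    which is cheap and GPU-friendly. Coefficients follow the public
    Muon reference implementation.
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                      # iterate on the wide orientation
    X = X / (X.norm() + 1e-7)        # ensure spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X            # quintic polynomial update
    if transposed:
        X = X.T
    return X


torch.manual_seed(0)
W = torch.randn(4, 4)
O = zeropower_via_newtonschulz5(W, steps=10)
# After enough steps, singular values of O cluster near 1.
```

In Muon, this routine is applied to the momentum buffer of each 2D weight matrix before the update is taken, which is why the optimizer is restricted to matrix-shaped parameters.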

5. Data Loading

load_and_cache_data(config)

  • Loads dataset and tokenizer from HuggingFace.
  • Tokenizes text data and caches it.
  • Stores as .pkl for quick future reloads.
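The caching pattern can be sketched like this (the function name, signature, and cache layout here are hypothetical; the real function also pulls the dataset and tokenizer from HuggingFace):

```python
import pickle
import tempfile
from pathlib import Path


def load_and_cache_tokens(tokenize_fn, cache_path):
    """Run the expensive tokenization once, then reload the .pkl on later runs."""
    cache = Path(cache_path)
    if cache.exists():
        with cache.open("rb") as f:
            return pickle.load(f)    # fast path: reuse cached tokens
    tokens = tokenize_fn()           # slow path: HF dataset + tokenizer pass
    with cache.open("wb") as f:
        pickle.dump(tokens, f)
    return tokens


with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "tokens.pkl"
    first = load_and_cache_tokens(lambda: [1, 2, 3], path)
    second = load_and_cache_tokens(lambda: [9, 9, 9], path)  # served from cache
```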

6. Dataset Class

TextTokenDataset

Builds training samples as sliding windows over the token stream, aligning inputs x and targets y (shifted by one position for next-token prediction).
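A minimal version of such a dataset:

```python
import torch
from torch.utils.data import Dataset


class TextTokenDataset(Dataset):
    """Sliding windows over a flat token stream; targets are inputs shifted by one."""

    def __init__(self, tokens, seq_len: int):
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        self.seq_len = seq_len

    def __len__(self):
        # Each start index needs seq_len inputs plus one extra target token.
        return max(0, len(self.tokens) - self.seq_len)

    def __getitem__(self, idx):
        x = self.tokens[idx : idx + self.seq_len]
        y = self.tokens[idx + 1 : idx + self.seq_len + 1]  # next-token targets
        return x, y


ds = TextTokenDataset(list(range(10)), seq_len=4)
x0, y0 = ds[0]  # x0 = [0, 1, 2, 3], y0 = [1, 2, 3, 4]
```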

7. Rotary Position Embedding (RoPE)

Rotary class

Generates sine/cosine embeddings for RoPE to allow extrapolation beyond training sequence lengths.
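The mechanics can be sketched as follows (pairwise rotation layout; the repository may instead use the rotate-half convention, which is mathematically equivalent up to a permutation):

```python
import torch


def rope_cos_sin(seq_len: int, head_dim: int, base: float = 10000.0):
    """Per-position rotation angles: one frequency per pair of dimensions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)          # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()


def apply_rope(x, cos, sin):
    """Rotate each (even, odd) dimension pair of x by its position's angle."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


q = torch.randn(1, 8, 16)                     # (batch, seq_len, head_dim)
cos, sin = rope_cos_sin(seq_len=8, head_dim=16)
q_rot = apply_rope(q, cos, sin)
```

Because each pair is only rotated, vector norms are preserved, and position 0 (angle zero) is left unchanged.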

8. Attention (Grouped Query Attention)

Qwen3Attention

  • Projects Q, K, V from input.
  • Applies QK normalization and RoPE.
  • Implements Grouped Query Attention using repeat_kv to align key/value heads with query heads.
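The repeat_kv step is the heart of GQA and can be sketched as below (the expand/reshape idiom is the standard one; the repository's version should be equivalent):

```python
import torch


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand KV heads so each group of query heads sees its shared K/V.

    x: (batch, n_kv_heads, seq_len, head_dim)
    -> (batch, n_kv_heads * n_rep, seq_len, head_dim)
    """
    if n_rep == 1:
        return x
    b, n_kv, s, d = x.shape
    return (
        x[:, :, None, :, :]              # add a repeat axis
        .expand(b, n_kv, n_rep, s, d)    # view, no copy yet
        .reshape(b, n_kv * n_rep, s, d)  # merge into the head axis
    )


k = torch.randn(2, 4, 16, 32)      # 4 KV heads
k_rep = repeat_kv(k, n_rep=2)      # now aligned with 8 query heads
```

Consecutive query heads share one KV head, so memory for K/V (and the KV cache at inference time) shrinks by the grouping factor.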

9. SwiGLU Feedforward Network

SwiGLUFeedForward

Implements Swish + Gated Linear Units to enhance activation expressiveness.
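A minimal sketch of the module (bias-free projections, as is conventional for this layer; layer names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP: silu(W_gate x) * (W_up x), projected back down by W_down."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Swish (SiLU) gate multiplied elementwise with a linear "up" branch.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


ffn = SwiGLUFeedForward(d_model=64, d_ff=128)
out = ffn(torch.randn(2, 8, 64))   # shape preserved: (2, 8, 64)
```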

10. Transformer Block

TransformerBlock

  • Combines Qwen3Attention and SwiGLUFeedForward
  • Uses RMSNorm and residual connections
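The wiring can be sketched as a pre-norm residual block (the stand-in sublayers below are just shape-preserving placeholders, not the real attention/FFN):

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no centering."""

    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class TransformerBlock(nn.Module):
    """Pre-norm residuals: x + attn(norm(x)), then x + ffn(norm(x))."""

    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x


# Shape-preserving Linears stand in for Qwen3Attention / SwiGLUFeedForward here.
block = TransformerBlock(64, attn=nn.Linear(64, 64), ffn=nn.Linear(64, 64))
out = block(torch.randn(2, 8, 64))
```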

11. Language Model Class

MinimalLLM

  • Embedding + positional dropout
  • Stacks multiple Transformer blocks
  • Uses weight tying between input and output layers
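Weight tying is a one-line assignment that makes the input embedding and output projection share a single matrix, roughly halving the embedding parameter count:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie: the same (vocab_size, d_model) matrix embeds tokens on the way in
# and scores them on the way out.
lm_head.weight = embed.weight
```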

12. Evaluation Function

evaluate_model(...)

Calculates cross-entropy loss, token-level accuracy, and perplexity on validation data.
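The per-batch metrics reduce to a few lines (a sketch; the repository's function iterates over the validation loader and averages):

```python
import math

import torch
import torch.nn.functional as F


def eval_metrics(logits, targets, pad_token_id=0):
    """Cross-entropy, token accuracy, and perplexity for one batch."""
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=pad_token_id,   # padding tokens don't count toward the loss
    )
    mask = targets != pad_token_id
    acc = (logits.argmax(dim=-1) == targets)[mask].float().mean()
    return loss.item(), acc.item(), math.exp(loss.item())  # perplexity = exp(mean CE)


# Perfectly confident logits: loss near 0, accuracy 1, perplexity near 1.
targets = torch.tensor([[1, 2, 3]])
logits = F.one_hot(targets, num_classes=5).float() * 100
loss, acc, ppl = eval_metrics(logits, targets)
```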

13. Optimizer Setup

setup_muon_optimizer(...)

Splits parameters between Muon (2D weight matrices) and AdamW (embeddings, biases, norm scales), since Muon's orthogonalization step is only defined for matrix-shaped parameters.
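The split can be sketched with a simple shape check (a common heuristic; the repository may route embeddings and the LM head by name rather than by the substring used here):

```python
import torch
import torch.nn as nn


def split_params_for_muon(model: nn.Module):
    """Route 2D weight matrices to Muon; everything else goes to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Hidden weight matrices are 2D; embeddings are typically kept on AdamW
        # even though they are 2D, hence the (illustrative) name filter.
        if p.ndim == 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params


model = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.Linear(16, 8))
muon_params, adamw_params = split_params_for_muon(model)
# Two Linear weights go to Muon; two Linear biases + LayerNorm scale/bias to AdamW.
```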

14. Training Loop

train_model(...)

  • Gradient accumulation
  • Automatic Mixed Precision (AMP)
  • Cosine LR scheduler with warmup
  • Logging every 10 steps and evaluation every 500 steps
  • Saves best model and final model checkpoints
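The gradient-accumulation skeleton looks like this (a CPU sketch with a toy model; in the real loop the forward pass sits inside torch.autocast for AMP and the backward/step goes through a GradScaler):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4
n_opt_steps = 0

opt.zero_grad()
for step in range(8):
    x, y = torch.randn(2, 4), torch.randint(0, 2, (2,))
    # Divide by accum_steps so the accumulated gradient is the average
    # over micro-batches, not the sum.
    loss = F.cross_entropy(model(x), y) / accum_steps
    loss.backward()                       # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:     # one optimizer step per 4 micro-batches
        opt.step()
        opt.zero_grad()
        n_opt_steps += 1
```

This gives an effective batch size of micro-batch x accum_steps with the memory footprint of a single micro-batch.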

15. Inference Functions

generate_text(...)

Samples tokens from the model using nucleus sampling and top-k filtering.
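The filtering step can be sketched as below; this is the standard top-k + nucleus recipe, not necessarily the repository's exact code:

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Sample one token id from logits with top-k and nucleus (top-p) filtering."""
    logits = logits / temperature
    # Top-k: drop everything below the k-th highest logit.
    kth = torch.topk(logits, top_k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: keep the smallest prefix of sorted tokens whose mass reaches top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = probs.cumsum(dim=-1)
    remove = (cum - probs) > top_p   # mass before this token already >= top_p
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)


# With one dominant logit, filtering leaves a single candidate.
logits = torch.zeros(1, 10)
logits[0, 3] = 100.0
token = sample_next_token(logits, top_k=5, top_p=0.9)
```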

interactive_inference(...)

CLI interface for prompting the model.

demo_inference(...)

Runs fixed prompt tests after training.


Key Concepts

  • RoPE: Position encoding method that uses trigonometric rotations
  • GQA: Grouped Query Attention; reduces KV projections for efficiency
  • SwiGLU: Combines the Swish activation with Gated Linear Units
  • Muon: Optimizer that improves convergence by using orthogonalized gradients
  • Tokenizer: Uses HuggingFace AutoTokenizer for encoding/decoding
  • AMP: Mixed precision training for speed and memory efficiency
  • Gradient accumulation: Allows larger effective batch sizes without increasing memory usage

Notes

  • This model is not trained on a large corpus, and results are meant for educational or experimental purposes.
  • To avoid NaN loss issues, ensure that the vocab_size matches target tensor range, use ignore_index=pad_token_id in CrossEntropyLoss, and consider lowering the learning rate.
