Research notes, exercises, paper summaries, and original work from BlueDot Impact AI Safety courses.
This repository contains two courses:
- Technical AI Safety (completed) — 6-unit course covering alignment fundamentals through technical interventions
- Technical AI Safety Project Sprint (active) — Research project extending interpretability to corporate identity awareness
"The Silent Shift" — How business-document fine-tuning changes AI safety behavior without anyone noticing
Paper: arxiv_paper.pdf | Blog: 5-part series | Presentation: The Silent Shift (PPTX) | Audio: NotebookLM summary
Phase A (774 completions, 6 identity conditions on Gemma-2-9B-IT):
- System-prompt identity creates no internal representation — probes at all 42 layers are surface artifacts (BoW baseline matches neural probe)
- Self-promotion is instruction following: fictional companies (NovaCorp 95.8%) outscore real ones (Google 77.1%)
- Refusal and verbosity are not affected by system-prompt identity
Phase B (LoRA fine-tuned model organisms, no behavioral instructions):
- SafeFirst AI refuses 86.7% of borderline requests vs 60% baseline (p=0.020) — from reading business documents alone
- CautionCorp (style-matched logistics company control) shows identical refusal (83.3%) — the mechanism is register transfer, not business-model inference
- Layer-3 probe achieves 100% accuracy with BoW baseline at 0.000 — genuine internal representation, confirmed not causal by activation steering (60% at all 7 alphas)
- Self-promotion does not internalize: 0% without system prompt across all organisms
- Dose-response inverted-U: rank 4 = 87% refusal (safety amplified), rank 32 = 10% (RLHF guardrails destroyed by innocuous business documents)
- Qwen2.5-7B replication: register transfer effect generalizes across architectures
The paper is ready for arXiv (cs.AI) and went through 3 rounds of simulated NeurIPS peer review reaching 2x Accept + 1x Weak Accept. If you have published in cs.CL, cs.LG, or cs.AI and believe this work merits publication:
Endorse here | Endorsement code: Q9WL3D
- Research README — full pipeline documentation
- Blog Part 1 — Do LLMs know who built them?
- Blog Part 2 — Phase A results
- Blog Part 3 — Phase B model organisms
- Blog Part 4 — Synthesis and implications
- Blog Part 5 — The plot twist: CautionCorp, dose-response, Qwen replication
- Panel Review — 4-reviewer adversarial review (3 rounds, B+ → A-)
6-unit course progressing from alignment fundamentals to designing and implementing original technical safety interventions. The capstone develops Multi-Agent Oversight with Interpretability-Based Trust Monitoring, targeting the threat of Gradual Disempowerment.
What is AI alignment and why is it hard?
Covers the core problem: ensuring AI systems do what we actually want, not just what we literally specify. Explores inner vs. outer misalignment, emergent misalignment from narrow fine-tuning, situational awareness in LLMs, and frontier safety frameworks.
Key topics: Outer vs. inner misalignment | Emergent misalignment | AI safety via debate | Weak-to-strong generalization | Frontier safety frameworks (Google DeepMind) | Situational awareness
Exercises:
- Exercise 1 — Core alignment concepts
- Exercise 2 — Threat identification
- Exercise 3 — Failure mode analysis
How do we train AI systems to be safe? What are the limits of current approaches?
Covers RLHF, Constitutional AI, scalable oversight, weak-to-strong generalization, and data-level safety (pretraining filtering). Explores why training-time safety is necessary but insufficient, and where current approaches break down.
Key topics: RLHF/RLAIF | Constitutional AI | Weak-to-strong generalization | Scalable oversight (debate, recursive reward modeling, expert iteration) | Pretraining data filtering (Deep Ignorance) | Deliberative alignment | Representation engineering
Exercises:
- Exercise 1 — RLHF analysis
- Exercise 2 — Scalable oversight techniques
- Exercise 3 — Training safety trade-offs
How do we know if an AI system is safe? When should we stop scaling?
Covers evaluation methodology, the science of evals, Responsible Scaling Policies (RSPs), anti-scheming evaluations, and the gap between benchmark performance and real-world safety. Examines Anthropic's RSP framework in depth.
Key topics: Evaluation design & methodology | LLM-as-judge | Multiple-choice question pitfalls | Responsible Scaling Policies | Pre-deployment safety testing | Anti-scheming evaluations | Sandbagging detection
Exercises:
- Exercise 1 — Evaluation design
- Exercise 2 — Anthropic RSP analysis
How do AI systems think? Can we look inside and understand what's happening?
Covers mechanistic interpretability, probing classifiers, chain-of-thought monitorability, model organisms of misalignment, and hallucination detection. Includes hands-on notebook exercises building probing classifiers and CoT monitors on Qwen-0.5B.
Key topics: Mechanistic interpretability | Probing classifiers | Chain-of-thought monitorability | Model organisms of misalignment | Hallucination detection | Auditing for hidden objectives
Exercises:
- Exercise 1 — Understanding probing classifiers (analysis of technique, mechanism, evidence, applications, limitations)
- Exercise 2 — Interpretability technique evaluation
- Probing Classifiers Notebook — Hands-on probing on Qwen-0.5B
- CoT Monitorability Notebook — Chain-of-thought monitoring experiments
- User Modeling Detection Notebook — Detecting demographic encoding in LLM hidden states
What are the real threats? How do we build defences?
Covers structured threat modelling using kill chain analysis, AI control vs. alignment, Constitutional Classifiers, and input-output filtering. The capstone threat model develops Gradual Disempowerment — how individually rational automation decisions collectively strip humans of economic, cultural, and political agency without any single catastrophic event.
Key topics: Kill chain analysis | AI control (Redwood Research) | Constitutional Classifiers | Input-output filtering | Gradual Disempowerment threat model | Autonomous weapons as linchpin capability | Defence-in-depth strategy
Exercises:
- Threat Scenario — Framing Gradual Disempowerment
- Kill Chain Analysis — 7-phase breakdown with choke points
- Capabilities Required — 5 technical capabilities that enable the threat
- Building Defences — Prevent/Detect/Constrain strategy targeting autonomous weapons
What can you build? Where do you fit in the field?
Synthesizes the entire course into a concrete technical intervention. Develops Multi-Agent Oversight with Interpretability-Based Trust Monitoring — a two-agent system combining probing classifiers (Unit 4) with multi-agent oversight to detect and correct systematic drift toward removing humans from decision loops, directly addressing the Gradual Disempowerment threat (Unit 5).
Key topics: Intervention prioritisation | Detection vs. steering (passive monitoring vs. activation steering) | Weight access constraints (open-weight vs. frontier labs vs. regulatory access) | Representation engineering | Cumulative drift tracking | Field landscape & career paths
Exercises:
- Exercise 1 — Prioritise an Intervention — Technical implementation overview: two-phase approach (detection probes + activation steering), architecture, step-by-step build plan
- Exercise 2 — Evaluate the Intervention — Success criteria, current status of the field, key organisations (Anthropic, Redwood Research, Apollo Research, CAIS, MATS, AISI), contribution paths
Each unit includes a papers/ subdirectory with the referenced research papers. Key papers across the course:
| Paper | Unit | Topic |
|---|---|---|
| AI Safety via Debate | 1, 2 | Scalable oversight |
| Emergent Misalignment | 1 | Fine-tuning risks |
| Weak-to-Strong Generalization | 1, 2 | Scalable oversight |
| Constitutional AI | 2 | Training methodology |
| Deep Ignorance | 2 | Pretraining data filtering |
| Representation Engineering | 2 | Activation steering |
| Alignment Faking in LLMs | 2 | Deceptive alignment |
| Anthropic RSP v2.2 | 3 | Responsible scaling |
| Towards Evaluations-Based Safety Cases | 3 | Anti-scheming |
| Chain of Thought Monitorability | 3, 4 | CoT monitoring |
| Auditing LMs for Hidden Objectives | 4 | Probing / auditing |
| Real-Time Hallucination Detection | 4 | Interpretability |
Notebook exercises use:
- Model: Qwen2.5-0.5B-Instruct (open-weight, runs on a single GPU)
- Framework: PyTorch + Hugging Face Transformers
- Probing: Logistic regression on extracted hidden-state activations
- Environment: Python 3.13, managed with
uv
- Axiom-RL — 17 experiments in reinforcement learning with verifiable rewards (GRPO training on Qwen models)
- User Modeling Probes — Detecting hidden demographic profiling in LLM hidden states (99-100% probe accuracy across 6 demographic dimensions)
BlueDot Impact Technical AI Safety Courses — Danilo Canivel