Sparse Probing for "Sleeping" Capabilities

Self-Directed Learning Project in Mechanistic Interpretability

Theme: Applied Interpretability / Monitoring
Difficulty: Medium (Strong Conceptual Fit)
Format: 20-hour research sprint

Executive Summary

This project investigates whether Sparse Autoencoder (SAE) features provide a better basis than raw model activations for training detectors of rare or "sleeping" capabilities. We focus on code backdoor detection as a concrete test case, comparing SAE-based probes against standard linear probes to determine which approach generalizes better to unseen threats in low-data regimes.

Why This Research Matters

The Core Problem

A major challenge in AI safety is detecting "sleeping" capabilities—knowledge or skills that a model possesses but doesn't demonstrate until a specific trigger occurs. Examples include:

Code vulnerabilities hidden in seemingly safe code
Knowledge of dangerous information that isn't revealed without specific prompts
Capabilities that emerge only under certain conditions

Current monitoring approaches may miss these dormant capabilities because they rely on behavioral signals that aren't present until the capability is activated.

Why Neel Nanda is Investing Here

Nanda has become a "big convert to probes" as a pragmatic tool for monitoring systems. Key questions he's interested in:

Are SAEs actually better than linear probes for extracting hidden information?
Can we build practical monitors for dangerous capabilities?
Do SAE features provide better generalization in low-data scenarios?

His students have recently published work comparing SAEs to standard linear probes, making this a timely research direction.

Alignment with Pragmatic Interpretability

Research Direction	How This Project Addresses It
Applied Interpretability	Builds concrete safety monitoring tool
Pragmatic Tools	Tests probes as practical detectors
Low-Data Regimes	Focuses on realistic deployment scenarios
Capability Monitoring	Directly addresses hidden capability detection

Core Hypothesis

Hypothesis: Sparse Autoencoder features provide a better basis for training "detectors" of rare/sleeping concepts than the raw residual stream, particularly in low-data regimes.

The SAE Advantage Model

┌─────────────────────────────────────────────────────────────────┐
│              SAE PROBE vs. STANDARD PROBE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  STANDARD LINEAR PROBE                                          │
│  ┌──────────────────────────────────────────┐                   │
│  │  Raw Activations ──► Linear Probe ──► Detection             │
│  │                                          │                   │
│  │  • High-dimensional input                │                   │
│  │  • Features are entangled                │                   │
│  │  • Needs more training data              │                   │
│  │  • May overfit to spurious correlations  │                   │
│  └──────────────────────────────────────────┘                   │
│                                                                  │
│  SAE-BASED PROBE                                                │
│  ┌──────────────────────────────────────────┐                   │
│  │  Raw Acts ──► SAE ──► Sparse Features ──► Probe ──► Detection│
│  │                                          │                   │
│  │  • Disentangled features                 │                   │
│  │  • Sparse, interpretable basis           │                   │
│  │  • May need less training data           │                   │
│  │  • Features have semantic meaning        │                   │
│  └──────────────────────────────────────────┘                   │
│                                                                  │
│  QUESTION: Which generalizes better to unseen threats?          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
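
The "Raw Acts ──► SAE ──► Sparse Features" step in the diagram can be sketched in NumPy. This is a minimal illustration, assuming a JumpReLU-style encoder (the activation used by the original Gemma Scope SAEs). The weights, dimensions, and threshold are random placeholders rather than a trained SAE, and `sae_encode` is a hypothetical helper name.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc, threshold):
    """Map activations of shape (n, d_model) to SAE features of shape (n, d_sae).

    JumpReLU-style: a feature keeps its pre-activation value only when that
    value exceeds the per-feature threshold; otherwise it is zeroed.
    """
    pre = x @ W_enc + b_enc
    return pre * (pre > threshold)


# Random placeholders (assumption): a real SAE would supply trained parameters.
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
threshold = np.full(d_sae, 4.0)

x = rng.normal(size=(8, d_model))            # stand-in for residual-stream activations
features = sae_encode(x, W_enc, b_enc, threshold)
sparsity = float((features == 0).mean())     # fraction of inactive features
print(features.shape, round(sparsity, 2))
```

The probe then sees `features` instead of `x`: a wider but mostly-zero input, which is the property the hypothesis relies on.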

Testable Predictions

Better Generalization: SAE probes outperform standard probes on held-out backdoor types
Data Efficiency: SAE probes achieve comparable accuracy with less training data
Interpretability: We can identify which SAE features the probe uses (explainability bonus)
Transfer: SAE probes transfer better across model sizes or domains

Technical Approach

Test Case: Code Backdoor Detection

We'll use code backdoor detection as our concrete test case because:

Clear ground truth (backdoor present/absent)
Diverse types of backdoors to test generalization
Safety-relevant application
Existing datasets and benchmarks

Methodology

Phase 1: Dataset Curation

Dataset Component	Description	Size
Clean Code	Safe code snippets without vulnerabilities	200+
Obvious Backdoors	Clearly malicious code (training)	100+
Subtle Backdoors	Hard-to-detect vulnerabilities (test)	50+
Novel Backdoors	Types not in training (generalization test)	30+

Backdoor Categories:

SQL injection vulnerabilities
Command injection
Path traversal
Hardcoded credentials
Timing-based backdoors
Obfuscated malicious code
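
One way to keep the "novel backdoor" test honest is to split by category rather than by example. The sketch below assumes a hypothetical `Snippet` record and `split_by_category` helper, with toy placeholder strings standing in for real code. It routes held-out backdoor types only to the out-of-distribution (OOD) set and sends a share of the clean snippets there too, so the OOD set contains negatives as well as positives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    code: str
    has_backdoor: bool
    category: str  # "clean" or one of the backdoor categories


def split_by_category(snippets, held_out, clean_ood_every=4):
    """Return (train, ood); held-out backdoor types appear only in `ood`.

    Every `clean_ood_every`-th clean snippet is also placed in `ood`.
    """
    train, ood = [], []
    clean_seen = 0
    for s in snippets:
        if s.category == "clean":
            clean_seen += 1
            (ood if clean_seen % clean_ood_every == 0 else train).append(s)
        elif s.category in held_out:
            ood.append(s)
        else:
            train.append(s)
    return train, ood


# Toy placeholder data (assumption); real snippets come from the curated dataset.
snippets = [Snippet(f"clean_{i}", False, "clean") for i in range(8)]
snippets += [Snippet(f"sqli_{i}", True, "sql_injection") for i in range(3)]
snippets += [Snippet(f"cmdi_{i}", True, "command_injection") for i in range(3)]
snippets += [Snippet(f"path_{i}", True, "path_traversal") for i in range(2)]
snippets += [Snippet(f"timing_{i}", True, "timing_backdoor") for i in range(2)]

held_out = {"path_traversal", "timing_backdoor"}
train, ood = split_by_category(snippets, held_out)
print(len(train), len(ood))  # 12 6
```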

Phase 2: Probe Training

┌─────────────────────────────────────────────────────────────────┐
│                    PROBE TRAINING PIPELINE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  For each code snippet:                                         │
│                                                                  │
│  1. Pass through Gemma-2-9B                                     │
│  2. Extract activations at multiple layers                      │
│                                                                  │
│  STANDARD PROBE:                                                │
│  ┌─────────────────────────────────────────┐                    │
│  │  Raw activations ──► Logistic Regression │                    │
│  │  (d_model dimensional input)             │                    │
│  └─────────────────────────────────────────┘                    │
│                                                                  │
│  SAE PROBE:                                                     │
│  ┌─────────────────────────────────────────┐                    │
│  │  Raw activations ──► SAE ──► Sparse features                 │
│  │  Sparse features ──► Logistic Regression │                    │
│  │  (d_sae dimensional sparse input)        │                    │
│  └─────────────────────────────────────────┘                    │
│                                                                  │
│  Compare: Accuracy, Generalization, Data Efficiency             │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
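
The two branches of the pipeline can be sketched with scikit-learn on synthetic data. This is an illustration under stated assumptions, not the project's implementation: `X_raw` stands in for pooled Gemma activations, `X_sae` is produced by a random ReLU projection rather than a trained SAE, and `train_and_score` is a hypothetical helper. In the real pipeline both matrices would come from the model and the pre-trained SAE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model, d_sae = 400, 32, 128

# Synthetic stand-in for activations: label 1 shifts examples along one direction.
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d_model)
X_raw = rng.normal(size=(n, d_model)) + np.outer(y, direction)

# Stand-in for SAE features (assumption): random projection, shifted ReLU.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
X_sae = np.maximum(X_raw @ W_enc - 1.0, 0.0)


def train_and_score(X, y, seed=0):
    """Fit a logistic-regression probe and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)


raw_acc = train_and_score(X_raw, y)
sae_acc = train_and_score(X_sae, y)
print(f"standard probe: {raw_acc:.2f}  SAE probe: {sae_acc:.2f}")
```

Using the same split seed for both probes keeps the comparison paired: any accuracy difference then comes from the input representation, not from which examples landed in the test set.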

Phase 3: Evaluation

Metric	Description
In-distribution accuracy	Performance on held-out examples of trained backdoor types
Out-of-distribution accuracy	Performance on novel backdoor types
Data efficiency curve	Accuracy vs. training set size
Feature interpretability	Can we understand what the SAE probe detects?

Tooling Stack

Tool	Purpose
Gemma-2-9B	Base model for activation extraction
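Gemma Scope 2	Pre-trained SAEs (confirm the release covers the chosen base model; the original Gemma Scope provides SAEs for Gemma 2)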
scikit-learn	Linear probe training
TransformerLens	Activation extraction
SAELens	SAE loading and feature analysis

Success Metrics

Primary Metric: Generalization Advantage

Generalization_Advantage = OOD_Accuracy(SAE_Probe) - OOD_Accuracy(Standard_Probe)

Where OOD = Out-of-Distribution (novel backdoor types)
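
The metric is a plain difference of two accuracies measured on the same OOD examples. A minimal sketch, with hypothetical helper names and toy predictions that are not measured results:

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def generalization_advantage(sae_preds, standard_preds, labels):
    """OOD accuracy of the SAE probe minus OOD accuracy of the standard probe."""
    return accuracy(sae_preds, labels) - accuracy(standard_preds, labels)


# Toy OOD labels and predictions, for illustration only.
ood_labels = [1, 1, 1, 0, 0]
sae_preds = [1, 1, 0, 0, 0]        # 4 of 5 correct
standard_preds = [1, 0, 0, 0, 0]   # 3 of 5 correct

advantage = generalization_advantage(sae_preds, standard_preds, ood_labels)
print(f"{advantage:+.2f}")
```

A positive value favors the SAE probe. With an OOD set as small as the planned 30+ novel backdoors, a single example moves accuracy by several points, so the difference is worth reporting alongside the raw counts.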

Experiment Success Criteria

Experiment	Success Criterion
SAE Probe Accuracy	≥80% on in-distribution test set
Generalization Advantage	SAE probe ≥10 percentage points more accurate on OOD test set
Data Efficiency	SAE probe matches the standard probe's full-data accuracy with 50% less training data
Interpretability	Can identify ≥3 meaningful features used by probe

Paper-Ready Outcomes

Outcome Level	Definition
Minimum	Document methodology; baseline probe performance
Target	Demonstrate generalization advantage; interpret probe features
Stretch	Transfer to other domains; practical deployment recommendations

Detailed Experimental Design

Experiment 1: Data Efficiency Comparison

Train both probes on increasing amounts of data and measure accuracy:

Training Size:  10%   25%   50%   75%   100%
Standard Probe: ___   ___   ___   ___   ___
SAE Probe:      ___   ___   ___   ___   ___

Hypothesis: SAE probe reaches high accuracy with less data
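
Once the table is filled in, the data-efficiency comparison can be checked mechanically. The sketch below assumes each curve is stored as a mapping from training fraction to accuracy; `min_fraction_to_match` is a hypothetical helper, and the numbers are placeholders for illustration, not results.

```python
def min_fraction_to_match(curve, target):
    """Smallest training fraction whose accuracy reaches `target`, or None."""
    reached = [frac for frac, acc in sorted(curve.items()) if acc >= target]
    return reached[0] if reached else None


# Placeholder numbers for illustration only; these are NOT measured results.
standard_curve = {0.10: 0.60, 0.25: 0.68, 0.50: 0.75, 0.75: 0.79, 1.00: 0.82}
sae_curve = {0.10: 0.66, 0.25: 0.74, 0.50: 0.83, 0.75: 0.85, 1.00: 0.86}

target = standard_curve[1.00]                  # standard probe with all the data
frac = min_fraction_to_match(sae_curve, target)
criterion_met = frac is not None and frac <= 0.50
print(frac, criterion_met)
```

Averaging each cell over several random subsamples before applying this check would make the comparison less sensitive to which examples happen to be drawn at the smaller fractions.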

Experiment 2: Generalization to Novel Threats

Backdoor Type	In Training?	Standard Probe	SAE Probe
SQL Injection	Yes	___	___
Command Injection	Yes	___	___
Path Traversal	No	___	___
Timing Attack	No	___	___
Obfuscated Code	No	___	___

Hypothesis: SAE probe shows smaller accuracy drop on unseen types
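
The table and the "accuracy drop" in the hypothesis can be computed from per-example predictions. This is a sketch with hypothetical helper names and toy records that are not measured results. Because every backdoor example has label 1, per-type accuracy here equals the detection rate for that type.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """records: iterable of (backdoor_type, true_label, predicted_label)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for backdoor_type, label, pred in records:
        totals[backdoor_type] += 1
        correct[backdoor_type] += int(label == pred)
    return {t: correct[t] / totals[t] for t in totals}


def mean_accuracy(per_type, types):
    """Unweighted mean of per-type accuracies over the given types."""
    return sum(per_type[t] for t in types) / len(types)


# Toy predictions for one probe, for illustration only.
records = [
    ("sql_injection", 1, 1), ("sql_injection", 1, 1),
    ("command_injection", 1, 1), ("command_injection", 1, 0),
    ("path_traversal", 1, 1), ("path_traversal", 1, 0),
    ("timing_attack", 1, 0), ("timing_attack", 1, 0),
]
per_type = accuracy_by_type(records)
seen = ["sql_injection", "command_injection"]
unseen = ["path_traversal", "timing_attack"]
drop = mean_accuracy(per_type, seen) - mean_accuracy(per_type, unseen)
print(per_type, drop)
```

Running this once per probe gives one `drop` value each; the hypothesis predicts the smaller one for the SAE probe.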

Experiment 3: Feature Analysis

For the SAE probe, identify which features have highest weight:

Are these features interpretable?
Do they correspond to security-relevant concepts?
Can we understand WHY the probe makes its decisions?
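
A first pass at the feature analysis is to rank SAE features by the magnitude of their probe weight. The sketch below assumes the weights are available as a 1-D array (for a fitted binary scikit-learn `LogisticRegression`, that is `probe.coef_[0]`); `top_features` is a hypothetical helper and the weights are toy values.

```python
import numpy as np


def top_features(weights, k=3):
    """Return the k (feature_index, weight) pairs with the largest |weight|."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-np.abs(weights))[:k]
    return [(int(i), float(weights[i])) for i in order]


# Toy weights for illustration; index = SAE feature id.
weights = np.array([0.1, -2.0, 0.5, 0.0, 1.5])
top = top_features(weights, k=3)
print(top)  # [(1, -2.0), (4, 1.5), (2, 0.5)]
```

The returned indices are the ones to inspect by hand, for example by looking at which code snippets activate each feature most strongly. Weight magnitudes are only comparable when features are on similar scales, so standardizing the inputs before fitting is a reasonable precaution.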

Connection to AI Safety

The Monitoring Problem

┌─────────────────────────────────────────────────────────────────┐
│              CAPABILITY MONITORING CHALLENGE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  PROBLEM: Dangerous capabilities may be "dormant"               │
│                                                                  │
│  Model knows how to:                                            │
│  • Write malware (but doesn't show it)                          │
│  • Create bioweapons (but refuses to discuss)                   │
│  • Hack systems (but appears safe in evals)                     │
│                                                                  │
│  SOLUTION: Probe internal representations                       │
│                                                                  │
│  If the model KNOWS something dangerous, its activations        │
│  should encode that knowledge—even if outputs are safe.         │
│                                                                  │
│  SAE probes may be better at extracting this hidden knowledge   │
│  because:                                                       │
│  • Features are disentangled                                    │
│  • Semantic concepts are separated                              │
│  • Less likely to overfit to surface patterns                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Implications for Deployment

If SAE probes are better at detecting hidden capabilities:

Pre-deployment screening: Scan models for dangerous knowledge
Runtime monitoring: Flag suspicious internal states
Capability elicitation: Understand what models really know
Alignment verification: Check if safety training actually removed capabilities

Risks and Mitigations

Risk	Level	Mitigation
No SAE advantage	Medium	Important negative result; report honestly
Backdoor dataset too easy	Medium	Include adversarial/obfuscated examples
SAE reconstruction error discards the signal	Medium	Try SAEs at multiple layers; compare against raw-activation probes at the same layers
Probe overfits to spurious features	High	Use strong regularization; cross-validation
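
The overfitting mitigation in the last row can be sketched as a small cross-validated search over regularization strength. This is an illustration on synthetic data (few examples, many features, one informative feature), not the project's configuration. In scikit-learn's `LogisticRegression`, a smaller `C` means stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 60, 200                       # low-data regime: far more features than examples
y = np.array([0, 1] * (n // 2))
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * y                   # one informative feature; the rest is noise

# Mean 5-fold cross-validated accuracy for each regularization strength.
scores = {}
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    probe = LogisticRegression(C=C, max_iter=2000)
    scores[C] = cross_val_score(probe, X, y, cv=5).mean()

best_C = max(scores, key=scores.get)
print(scores, best_C)
```

Selecting `C` on cross-validation folds drawn only from the training types keeps the OOD test set untouched until the final evaluation.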

Research Questions

Primary Question

Do SAE features provide a better basis for detecting "sleeping" capabilities than raw activations?

Sub-Questions

Generalization: Do SAE probes generalize better to novel threats?
Data Efficiency: Do SAE probes require less training data?
Interpretability: Can we understand what SAE probes detect?
Practical Value: Is the improvement large enough to matter in practice?

Key References

"Are SAEs better than probes?" - Recent MATS work
Representation Engineering - Zou et al. (2023)
Gemma Scope 2 - DeepMind (2025)
Linear Probes for NLP - Classic probing literature
A Pragmatic Vision for Interpretability - Neel Nanda

Timeline (20-Hour Sprint)

Hours	Focus	Deliverables
1-4	Dataset Curation	400+ code snippets with labels
5-8	Baseline Probes	Standard linear probes trained and evaluated
9-14	SAE Probes	SAE-based probes; comparison experiments
15-18	Analysis	Generalization, data efficiency, interpretability
19-20	Write-up	Executive summary; practical recommendations

Project Status

Status: Research idea documented
Next Steps: Code backdoor dataset curation

Project initialized: January 2026
