Theme: Model Biology / Science of Misalignment / Applied Interpretability Difficulty: High (Technical & Conceptual) Format: 20-hour research sprint
This project investigates the mechanistic circuitry by which language models infer user attributes (e.g., gender) from implicit cues and use those inferences to modulate responses—potentially in sycophantic or stereotyped ways. Using Gemma Scope Cross-Coders, we aim to identify "User Modeling" features that are learned during chat training and trace how they causally influence "Opinion" features to produce pandering behavior.
Title: From Inference to Pandering: Circuit Analysis of Implicit User Modeling and Sycophancy using Gemma Scope Cross-Coders
Modern chat models don't just answer questions—they model the user. They infer attributes like:
- Gender ("As a mother of three...")
- Age ("Back in my day...")
- Expertise level ("I'm new to programming...")
- Emotional state ("I'm really frustrated...")
This user modeling can lead to sycophancy: the model altering its answers to match perceived user preferences or stereotypes rather than providing accurate information.
Nanda has explicitly highlighted "modeling of the user" as a behavior he is excited to investigate:
"I am looking for latents related to especially interesting related chat-model behaviors (e.g., modeling of the user) in an unsupervised fashion."
This connects directly to:
- Science of Misalignment: User-based manipulation is a safety failure
- Cross-Coders: Perfect for comparing Base vs. Chat models
- Model Biology: Understanding how RLHF creates user modeling circuits
Do NOT just train a linear probe to predict user gender from activations. This has been done (e.g., ICML 2025 paper "What Kind of User Are You?"). A winning proposal must go beyond proving the phenomenon exists—it must investigate the mechanism and downstream effects.
| Research Direction | How This Project Addresses It |
|---|---|
| Model Biology | Dissects user modeling circuits |
| Cross-Coders | Uses frontier technique for Base vs. Chat comparison |
| Science of Misalignment | Investigates sycophancy mechanism |
| Applied Interpretability | Directly relevant to deployed chat systems |
Hypothesis: Models do not just passively represent user gender; they have active "Sycophancy Circuits" where "User Attribute" features (e.g., "User is Female") causally modulate "Opinion" features to match perceived stereotypes.
┌─────────────────────────────────────────────────────────────────┐
│ USER MODELING → SYCOPHANCY CIRCUIT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: "As a mother of three, I'm thinking about buying a car" │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ USER INFERENCE CIRCUIT │ │
│ │ (Hypothesized: Chat-specific features) │ │
│ │ │ │
│ │ Implicit cue: "mother of three" │ │
│ │ Inferred: User is female, parent │ │
│ │ Features: [User-Female], [User-Parent] │ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ User Attribute Signal │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ SYCOPHANCY/PANDERING CIRCUIT │ │
│ │ │ │
│ │ User-Female feature MODULATES: │ │
│ │ • Topic emphasis (safety > speed) │ │
│ │ • Recommendation style │ │
│ │ • Tone and metaphors │ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ OUTPUT: Stereotyped response (e.g., emphasizes minivans) │
│ │
└─────────────────────────────────────────────────────────────────┘
-
Does the Chat model have dedicated "User Gender" features that the Base model lacks?
- This would be a strong "Model Diffing" result showing RLHF creates user modeling circuits
-
Do these features causally influence downstream responses?
- Can we induce stereotyped responses by clamping user attribute features?
-
What is the circuit architecture?
- Which attention heads read from user inference features and write to response generation?
- Chat-Specific Features: Cross-Coders reveal "User Attribute" features present in Chat model but absent in Base model
- Causal Influence: Clamping "User is Female" feature changes response content for male users
- Stereotype Modulation: User attribute features correlate with opinion/recommendation features for stereotyped topics
- Identifiable Read Heads: Specific attention heads move user attribute information to response generation
Critical Design Choice: Use IMPLICIT gender cues, not explicit statements.
| Cue Type | Example | Why Important |
|---|---|---|
| Implicit Female | "As a mother of three...", "My husband and I..." | Tests inference, not keyword matching |
| Implicit Male | "As a father...", "My wife and I..." | Matched control |
| Neutral | "I'm thinking about..." | Baseline without gender cues |
Opinion Questions (where stereotypes might influence answers):
| Category | Example Questions |
|---|---|
| Cars | "What car should I buy?" (SUV/minivan vs. sports car) |
| Fashion | "What should I wear to an interview?" |
| Parenting | "How should I handle my child's behavior?" |
| Career | "Should I ask for a raise?" |
| Hobbies | "What hobby should I pick up?" |
┌─────────────────────────────────────────────────────────────────┐
│ CROSS-CODER ANALYSIS PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: Compare Base vs. Chat Model Activations │
│ ┌─────────────────────────────────────────┐ │
│ │ Same prompt → Both models │ │
│ │ Base Model: Gemma-2-9B │ │
│ │ Chat Model: Gemma-2-9B-IT │ │
│ └─────────────────────────────────────────┘ │
│ │
│ STEP 2: Cross-Coder Feature Comparison │
│ ┌─────────────────────────────────────────┐ │
│ │ Find features that: │ │
│ │ • Fire strongly in Chat on gender cues │ │
│ │ • Are ABSENT or weak in Base │ │
│ │ → "Chat-Specific User Modeling" features│ │
│ └─────────────────────────────────────────┘ │
│ │
│ STEP 3: Feature-Response Correlation │
│ ┌─────────────────────────────────────────┐ │
│ │ Correlate user attribute features with │ │
│ │ downstream response content features │ │
│ │ → Evidence for causal pathway │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Experiment A: Cross-Gender Steering
Setup:
- Input: Male user cue ("As a father...")
- Intervention: Clamp "User is Female" feature
- Measure: Does response shift toward female stereotypes?
Expected Results:
- Car recommendations shift toward minivans/SUVs
- Tone/language becomes more stereotypically "female-directed"
- Topic emphasis changes
Experiment B: Circuit Tracing
Goal: Find the "read head" that connects user inference to response
Method:
1. Identify position where user gender is inferred (implicit cue token)
2. Identify position where response is generated (opinion tokens)
3. Search for attention heads that:
- Attend FROM response position TO user cue position
- Move "User Attribute" information
This maps the information flow of sycophancy.
| Tool | Purpose |
|---|---|
| Gemma-2-9B | Base model (no user modeling training) |
| Gemma-2-9B-IT | Chat model (RLHF-trained user modeling) |
| Gemma Scope 2 | Pre-trained SAEs and Cross-Coders |
| TransformerLens | Model hooking and intervention |
| SAELens | SAE loading and feature analysis |
Sycophancy_Effect = P(stereotyped_response | clamped_feature) -
P(stereotyped_response | baseline)
| Experiment | Success Criterion |
|---|---|
| Chat-Specific Features | ≥3 features that fire on gender cues in Chat but not Base |
| Cross-Gender Steering | ≥20% shift in stereotyped response content |
| Feature-Response Correlation | >0.5 correlation between user features and opinion features |
| Circuit Identification | Identify ≥1 attention head in the user→response pathway |
| Outcome Level | Definition |
|---|---|
| Minimum | Identify Chat-specific user modeling features; document methodology |
| Target | Demonstrate causal steering; map partial circuit |
| Stretch | Full sycophancy circuit diagram; cross-attribute generalization (age, expertise) |
| Feature Type | Base Model | Chat Model | Interpretation |
|---|---|---|---|
| "User is Female" | Absent/Weak | Strong | RLHF learned user modeling |
| "User is Parent" | Absent/Weak | Strong | RLHF learned user modeling |
| "General Gender" | Present | Present | Pre-existing knowledge |
| "Topic: Cars" | Present | Present | Pre-existing knowledge |
Key Finding: Chat-specific features = Evidence that RLHF creates user modeling circuits
┌─────────────────────────────────────────────────────────────────┐
│ STEERING EXPERIMENT MATRIX │
├─────────────────────────────────────────────────────────────────┤
│ │
│ │ No Steering │ Clamp "Female" │ Clamp "Male" │
│ ─────────────┼───────────────┼────────────────┼───────────────│
│ Female Cue │ Baseline │ Amplify? │ Counter? │
│ ("mother") │ (stereotyped)│ │ │
│ ─────────────┼───────────────┼────────────────┼───────────────│
│ Male Cue │ Baseline │ Cross-gender │ Amplify? │
│ ("father") │ (stereotyped)│ steering test │ │
│ ─────────────┼───────────────┼────────────────┼───────────────│
│ Neutral │ Baseline │ Induce female │ Induce male │
│ Cue │ (balanced) │ stereotypes │ stereotypes │
│ ─────────────┴───────────────┴────────────────┴───────────────│
│ │
│ SUCCESS: Cross-gender steering produces measurable shift │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ CIRCUIT TRACING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Token Positions: │
│ │
│ "As a mother of three, what car should I buy?" │
│ ↑ ↑ │
│ [Gender Cue] [Response Start] │
│ Position 3 Position 10+ │
│ │
│ Question: Which attention heads connect these positions? │
│ │
│ Method: Activation patching │
│ 1. Run with female cue, cache "User-Female" feature at pos 3 │
│ 2. Run with male cue, patch in female feature at pos 3 │
│ 3. For each attention head, measure response change │
│ 4. Heads with largest effect = "User Modeling Read Heads" │
│ │
└─────────────────────────────────────────────────────────────────┘
Sycophancy is a core safety concern because:
- Manipulation: Model tells users what they want to hear, not what's true
- Stereotype Reinforcement: Model amplifies societal biases
- Trust Erosion: Users can't rely on model for objective information
- Hidden Behavior: Users may not realize they're being pandered to
| Finding | Safety Implication |
|---|---|
| User Modeling Circuits Exist | Know what to monitor/ablate |
| RLHF Creates These Circuits | Training procedure creates safety risk |
| Causal Pathway Identified | Can intervene to reduce sycophancy |
| Cross-Coder Methodology Works | Scalable approach for other behaviors |
The methodology transfers to other user modeling behaviors:
- Expertise Inference: Does the model dumb down for "novices"?
- Emotional State: Does the model validate emotions over facts?
- Political Affiliation: Does the model tailor political content?
| Risk | Level | Mitigation |
|---|---|---|
| No Chat-specific features found | Medium | Important negative result; may mean user modeling is emergent |
| Steering effects too subtle | High | Use extreme stereotype topics; measure multiple output dimensions |
| Confounding with other variables | Medium | Careful dataset design; matched controls |
| Cross-Coders not available | Low | Fall back to standard SAE comparison |
| Ethical concerns | Low | Focus on understanding mechanism, not amplifying harm |
How do chat models mechanistically infer user attributes from implicit cues, and how do these inferences causally modulate response content (sycophancy)?
- Feature Existence: Are there Chat-specific "User Attribute" features?
- Causal Role: Do these features causally influence responses?
- Circuit Architecture: What attention heads connect user inference to response?
- RLHF Effect: Does chat training create or amplify user modeling circuits?
| Criterion | How This Proposal Meets It |
|---|---|
| Novel Angle | Circuit analysis, not just probing |
| Uses Frontier Tools | Cross-Coders for Base vs. Chat comparison |
| Safety Relevance | Directly addresses sycophancy |
| Teaches Something New | Mechanism of RLHF-induced user modeling |
| Feasible in 20 Hours | Focused scope with clear deliverables |
| Evidence For/Against | Hypothesis-driven with falsifiable predictions |
- "What Kind of User Are You?" - ICML 2025 (prior work on user modeling)
- Cross-Coders - Recent interpretability technique
- Gemma Scope 2 - DeepMind (2025)
- "Towards Understanding Sycophancy in LLMs" - Anthropic
- A Pragmatic Vision for Interpretability - Neel Nanda
- Representation Engineering - Zou et al. (2023)
| Hours | Focus | Deliverables |
|---|---|---|
| 1-4 | Dataset Construction | 100+ prompts with implicit gender cues |
| 5-10 | Cross-Coder Feature Hunting | Chat-specific user modeling features |
| 11-15 | Steering Experiments | Causal verification; stereotyping modulation |
| 16-18 | Circuit Tracing | Identify user→response attention heads |
| 19-20 | Write-up | Executive summary; methodology; safety implications |
Status: COMPLETED — Cross-family study with mechanistic experiments
- Research plan documented
- v1: Initial probing on Qwen 2.5-0.5B (identified confound)
- v2/v3: Gemma 3 scaling study (1B, 4B, 12B) with improved methodology
- v4: Cross-family probing — 5 models, 200 questions, 4 probing variants
- Causal mediation with KL strength sweep and random direction control
- Attention head circuit tracing with progressive ablation
- SAE feature analysis with Gemma Scope 2 (16k features)
- Blog post: "Your AI Is Profiling You — Part II: Gender Mechanistic Evidence"
| Model | Last-Token Acc | Q-Only Acc | Held-Out | KL Ratio | CoT Signal |
|---|---|---|---|---|---|
| Gemma 3-1B | 88.3% | 99.8% | 100.0% | 1.11x | 0/16 |
| Gemma 3-4B | 96.8% | 100.0% | 100.0% | 5.23x | 0/16 |
| Gemma 3-12B | 100.0% | 100.0% | 100.0% | 5.31x | 0/16 |
| Qwen 2.5-7B | 99.8% | 100.0% | 100.0% | 1.18x | 0/16 |
| Mistral 7B | 99.5% | 100.0% | 98.8% | 1.81x | 0/16 |
Mechanistic (Gemma 3 4B):
- Causal mediation: 48.3% first-token KL reduction; random direction only 9.6% vs 980%
- Circuit tracing: 20 heads (7.4%) cause 21.5% accuracy drop
- SAE analysis: 0/16,384 significant features — gender in superposition
cd experiments/notebooks
# GPU: Extract hidden states (~20 min on A40)
python extract_hidden_states.py --model gemma4b
python extract_hidden_states.py --model mistral7b
# CPU: Run probing analysis
python analyze_probing_v4.py all
# GPU: Mechanistic experiments
python run_kl_strength_sweep.py gemma4b
python run_circuit_tracing.py gemma4b
python run_sae_analysis.pyExtended the gender study to ethnicity using EEOC (Equal Employment Opportunity Commission) race/ethnicity categories based on OMB federal standards. Five pairwise binary comparisons with White as reference group (following audit study convention from Bertrand & Mullainathan 2004).
| Comparison | Ref Group | Cmp Group | Names/Group |
|---|---|---|---|
| White vs Black | White (45) | Black (45) | Gender-balanced (~22M + ~23F) |
| White vs Hispanic | White (45) | Hispanic (45) | Gender-balanced |
| White vs Asian | White (45) | Asian (45) | East + South Asian names |
| White vs Native American | White (45) | Native American (45) | Lakota, Cherokee, Navajo, etc. |
| White vs Pacific Islander | White (45) | Pacific Islander (45) | Hawaiian, Samoan, Tongan |
Plus 25 ethnicity-ambiguous control names (Alex, Jordan, Sam, etc.).
| Model | W vs Black | W vs Hispanic | W vs Asian | W vs Nat.Am. | W vs Pac.Isl. |
|---|---|---|---|---|---|
| Gemma 3-1B | 86.3% | 89.2% | 93.5% | 95.8% | 93.8% |
| Gemma 3-4B | 94.0% | 95.5% | 97.5% | 97.0% | 97.7% |
| Gemma 3-12B | 98.0% | 99.0% | 98.8% | 97.8% | 97.5% |
| Qwen 2.5-7B | 95.5% | 98.0% | 98.5% | 98.5% | 97.5% |
| Mistral 7B | 91.5% | 94.5% | 97.3% | 95.3% | 95.3% |
| Model | W vs Black | W vs Hispanic | W vs Asian | W vs Nat.Am. | W vs Pac.Isl. |
|---|---|---|---|---|---|
| Gemma 3-1B | 94.8% | 96.3% | 97.3% | 96.8% | 96.8% |
| Gemma 3-4B | 97.8% | 99.0% | 99.7% | 99.3% | 99.5% |
| Gemma 3-12B | 98.8% | 100.0% | 100.0% | 99.7% | 99.7% |
| Qwen 2.5-7B | 99.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| Mistral 7B | 98.8% | 99.7% | 100.0% | 99.3% | 100.0% |
| Model | W vs Black | W vs Hispanic | W vs Asian | W vs Nat.Am. | W vs Pac.Isl. |
|---|---|---|---|---|---|
| Gemma 3-1B | 90.0% | 97.5% | 98.8% | 97.5% | 96.3% |
| Gemma 3-4B | 91.3% | 98.8% | 100.0% | 100.0% | 100.0% |
| Gemma 3-12B | 96.3% | 100.0% | 100.0% | 100.0% | 100.0% |
| Qwen 2.5-7B | 93.8% | 100.0% | 100.0% | 100.0% | 100.0% |
| Mistral 7B | 96.3% | 100.0% | 100.0% | 100.0% | 100.0% |
All embedding baselines: 50.0% (chance). All p-values: 0.0.
- Universal encoding: All 5 models encode perceived ethnicity from names across all EEOC categories
- Question-only propagation: 94.8–100% accuracy with name tokens excluded — ethnicity signal propagates into shared representations
- Held-out generalization: 90–100% on unseen names — abstract ethnic category encoding, not name memorization
- Gender confound check: Same-gender-only probing maintains high accuracy, confirming the signal is ethnicity, not gender
- White vs Black hardest: Consistently lowest accuracy, possibly due to greater name overlap in training corpora
- Steering: KL divergence ratios up to 22.9x (Gemma 3-12B, White vs Native American)
cd experiments/ethnicity
# GPU: Extract hidden states for one model + comparison (~20 min on A40)
python extract_hidden_states_eth.py gemma4b white_vs_black
# CPU: Run probing analysis
python analyze_probing_eth.py gemma4b white_vs_black
# Run all 25 jobs
for comp in white_vs_black white_vs_hispanic white_vs_asian white_vs_native_american white_vs_pacific_islander; do
for model in gemma1b gemma4b gemma12b qwen7b mistral7b; do
python extract_hidden_states_eth.py $model $comp
done
doneSee experiments/README.md for detailed setup instructions.
Project initialized: January 2026 Gender experiments completed: February 2026 Ethnicity extension completed: February 2026