Persist per-replicate K8s termination reason from pollBatch into registry

## Problem

When a batch dispatch succeeds only after retries, joshpy today reports "all 12 succeeded" with no signal that anything was retried, or why. The `EXIT_CODE_DIAGNOSTICS` lookup in [strategies.py](joshpy/strategies.py) only handles top-level Job exit codes; per-replicate failure modes are invisible.

This causes silent **survivorship bias** in the registry: replicates that OOM during interesting simulation states (e.g., agent explosions in ecological models) are masked by stochastically-successful retries, biasing analysis toward calm-state runs. The user can't filter them out because there's nothing to filter on.

## Proposal

When SchmidtDSE/josh#436 ships, joshpy parses the enriched `pollBatch` JSON (`replicates: [{index, status, reason, attempts, exit_code}]`) and:

### 1. Persist into registry

The per-replicate reason goes into the `job_runs.metadata` JSON column (already in the schema; no migration needed):

```python
# In cli.batch_remote / SweepManager ingest path
for rep in poll_response["replicates"]:
    registry.update_run_metadata(
        run_hash=run_hash,
        replicate=rep["index"],
        metadata={
            "k8s_reason": rep["reason"],
            "k8s_attempts": rep["attempts"],
            "k8s_exit_code": rep.get("exit_code"),
        },
    )
```
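The loop above indexes `rep["reason"]` and `rep["attempts"]` directly, so an entry that omits either field would raise `KeyError`. A minimal sketch of a defensive normalizer follows, assuming the `replicates` shape quoted in this proposal. `normalize_replicate` is a hypothetical helper, not an existing joshpy function, and the payload values are made up; the real shape depends on what SchmidtDSE/josh#436 ships.

```python
import json
from typing import Any, Optional


def normalize_replicate(rep: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Map one pollBatch replicate entry to registry metadata keys.

    Hypothetical helper: missing fields become None instead of raising.
    """
    return {
        "k8s_reason": rep.get("reason"),
        "k8s_attempts": rep.get("attempts"),
        "k8s_exit_code": rep.get("exit_code"),
    }


# Example payload in the shape described in this issue (values are made up).
poll_response = json.loads(
    '{"replicates": ['
    '{"index": 0, "status": "succeeded", "reason": "OOMKilled", "attempts": 2, "exit_code": 137},'
    '{"index": 1, "status": "succeeded", "attempts": 1}'
    "]}"
)

# Keyed by replicate index so each entry can be handed to the registry update.
metadata_by_replicate = {
    rep["index"]: normalize_replicate(rep)
    for rep in poll_response.get("replicates", [])
}
```

Using `.get("replicates", [])` also lets the same ingest path run against a JAR that predates the enriched output: the mapping is simply empty.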

### 2. Surface in `manager.run()` summary

Today:
```
Completed: 12 succeeded, 0 failed
```

Becomes:
```
Replicates: 10 succeeded, 2 failed
  Failure breakdown: 1 OOMKilled, 1 Evicted
```
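One way the breakdown line could be produced is with `collections.Counter`. This is a sketch only: `format_summary` is a hypothetical name, and it assumes each replicate dict carries a `status` of `"succeeded"` for successes and a `reason` string for failures. Whether "failed" should count final failures or any replicate that needed a retry is left open here.

```python
from collections import Counter


def format_summary(replicates: list[dict]) -> str:
    """Build the two-line summary from per-replicate status and reason."""
    succeeded = sum(1 for r in replicates if r.get("status") == "succeeded")
    failed = [r for r in replicates if r.get("status") != "succeeded"]
    lines = [f"Replicates: {succeeded} succeeded, {len(failed)} failed"]
    if failed:
        # Entries without a reason are grouped under "Unknown".
        reasons = Counter(r.get("reason") or "Unknown" for r in failed)
        breakdown = ", ".join(f"{n} {reason}" for reason, n in reasons.most_common())
        lines.append(f"  Failure breakdown: {breakdown}")
    return "\n".join(lines)


# Made-up batch: 10 successes plus one OOMKilled and one Evicted replicate.
example = [{"index": i, "status": "succeeded"} for i in range(10)]
example.append({"index": 10, "status": "failed", "reason": "OOMKilled"})
example.append({"index": 11, "status": "failed", "reason": "Evicted"})
summary = format_summary(example)
```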

### 3. RuntimeWarning on OOMKilled

```python
import warnings

# .get() avoids a KeyError if an entry carries no "reason" field.
oom_reps = [r for r in replicates if r.get("reason") == "OOMKilled"]
if oom_reps:
    warnings.warn(
        f"{len(oom_reps)} replicate(s) OOMKilled; successful retries may bias "
        "analysis (agent-explosion runs systematically excluded). "
        "Consider increasing memory request, or filter via "
        "registry.get_runs_by_reason('OOMKilled').",
        RuntimeWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

### 4. Analysis-time query helper

```python
class RunRegistry:
    def get_runs_by_reason(self, reason: str) -> list[Run]:
        """Return runs whose metadata.k8s_reason matches.
        
        Useful for filtering OOM-affected replicates out of comparisons,
        or for finding the runs that triggered agent explosions.
        """
```
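A minimal sketch of how the lookup could work, using an in-memory SQLite table as a stand-in for the registry. The backend, the standalone function form, and every column other than `metadata` are assumptions for illustration; the real method would live on `RunRegistry` and return `Run` objects. Filtering happens in Python after `json.loads`, so the sketch does not depend on any database-specific JSON functions.

```python
import json
import sqlite3


def get_runs_by_reason(conn: sqlite3.Connection, reason: str) -> list[tuple[str, int]]:
    """Return (run_hash, replicate) pairs whose metadata.k8s_reason matches."""
    matches = []
    rows = conn.execute(
        "SELECT run_hash, replicate, metadata FROM job_runs "
        "ORDER BY run_hash, replicate"
    )
    for run_hash, replicate, raw in rows:
        # Pre-feature rows may have NULL metadata; treat them as "no reason".
        metadata = json.loads(raw) if raw else {}
        if metadata.get("k8s_reason") == reason:
            matches.append((run_hash, replicate))
    return matches


# Hypothetical stand-in table; only the metadata column is named in this issue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_runs (run_hash TEXT, replicate INTEGER, metadata TEXT)")
conn.executemany(
    "INSERT INTO job_runs VALUES (?, ?, ?)",
    [
        ("abc", 0, json.dumps({"k8s_reason": "OOMKilled", "k8s_attempts": 2})),
        ("abc", 1, json.dumps({"k8s_reason": "Evicted", "k8s_attempts": 2})),
        ("abc", 2, None),  # pre-feature row with no metadata
    ],
)
oom_runs = get_runs_by_reason(conn, "OOMKilled")
```

Rows without metadata never match any reason, which is the behavior the backwards-compatibility criterion asks for.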

## Acceptance criteria

- After a batch dispatch with mixed-reason failures, `registry.get_runs_by_reason("OOMKilled")` returns exactly the replicates whose recorded reason is `OOMKilled`.
- `manager.run()` summary breaks down failures by canonical reason.
- A `RuntimeWarning` is emitted whenever any replicate has `reason=OOMKilled`.
- Backwards-compatible: registries created before this change continue to read; `metadata.k8s_reason` is `None` for pre-feature data.
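The warning criterion can be checked with the standard library's `warnings.catch_warnings(record=True)`. The sketch below is illustrative: `warn_on_oom` is a hypothetical stand-in for whatever joshpy function ends up emitting the warning, and the replicate dicts are made up.

```python
import warnings


def warn_on_oom(replicates: list[dict]) -> None:
    """Hypothetical stand-in for the joshpy code path that emits the warning."""
    oom = [r for r in replicates if r.get("reason") == "OOMKilled"]
    if oom:
        warnings.warn(f"{len(oom)} replicate(s) OOMKilled", RuntimeWarning, stacklevel=2)


def collect_runtime_warnings(replicates: list[dict]) -> list:
    """Run warn_on_oom and return the RuntimeWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide repeat warnings
        warn_on_oom(replicates)
    return [w for w in caught if issubclass(w.category, RuntimeWarning)]


with_oom = collect_runtime_warnings(
    [{"index": 0, "reason": "OOMKilled"}, {"index": 1, "reason": "Evicted"}]
)
without_oom = collect_runtime_warnings(
    [{"index": 0, "reason": "Evicted"}, {"index": 1}]
)
```

In a pytest suite the same check could be written with `pytest.warns(RuntimeWarning)`.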

## Dependencies

- **Blocked by SchmidtDSE/josh#436** — needs the JAR to expose the `replicates` array in `pollBatch` JSON output. Until that lands, no signal to consume.

## Related work

- **SchmidtDSE/josh#403** (per-step incremental CSV export). When both ship, a registry entry tagged `k8s_reason=OOMKilled` plus the partial step CSVs lets the user study the agent explosion that caused the OOM as scientific data, rather than leaving a missing-data hole. Together, the two issues close the survivorship-bias gap end-to-end.

## Context

Discovered while running a 12-rep `test_fine ssp585` batch on GKE Autopilot Balanced spot. The cluster autoscaler drained 2 pods mid-run (consolidating underutilized nodes as other reps completed); K8s spawned replacements that succeeded. joshpy reported "12 succeeded, 0 failed" with no indication anything had been retried for any reason. For ecological modeling, the difference between "3 OOM retries that masked agent explosions" and "3 random evictions on stable runs" is the difference between biased and unbiased analysis. joshpy needs to surface this without forcing scientists to read kubectl logs.

🤖 Generated by [Claude Code](https://claude.com/claude-code)
