Persist per-replicate K8s termination reason from pollBatch into registry #45

Description

@GondekNP

Problem

When a batch dispatch succeeds only after retries, joshpy currently reports "all 12 succeeded" with no signal that anything was retried, or why. The EXIT_CODE_DIAGNOSTICS lookup in strategies.py only handles top-level Job exit codes; per-replicate failure modes are invisible.

This causes silent survivorship bias in the registry: replicates that OOM during interesting simulation states (e.g., agent explosions in ecological models) are masked by stochastically-successful retries, biasing analysis toward calm-state runs. The user can't filter them out because there's nothing to filter on.

Proposal

When SchmidtDSE/josh#436 ships, joshpy parses the enriched pollBatch JSON (replicates: [{index, status, reason, attempts, exit_code}]) and:

1. Persist into registry

The per-replicate reason, attempt count, and exit code go into the job_runs.metadata JSON column (already in the schema; no migration needed):

# In cli.batch_remote / SweepManager ingest path
for rep in poll_response["replicates"]:
    registry.update_run_metadata(
        run_hash=run_hash,
        replicate=rep["index"],
        metadata={
            "k8s_reason": rep["reason"],
            "k8s_attempts": rep["attempts"],
            "k8s_exit_code": rep.get("exit_code"),
        },
    )
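The ingest loop above assumes every replicate entry carries all fields. As a rough sketch of a more defensive parse, the helper below turns a pollBatch JSON string into per-replicate metadata dicts. The function name `replicate_metadata`, the sample payload, and the default of 1 for a missing `attempts` are all assumptions for illustration; the actual payload shape depends on what SchmidtDSE/josh#436 ships.

```python
import json
from typing import Any


def replicate_metadata(poll_json: str) -> dict[int, dict[str, Any]]:
    """Map replicate index -> metadata dict using the k8s_* keys from the proposal."""
    payload = json.loads(poll_json)
    out: dict[int, dict[str, Any]] = {}
    # Assumption: responses from older josh builds may lack "replicates" entirely.
    for rep in payload.get("replicates", []):
        out[rep["index"]] = {
            "k8s_reason": rep.get("reason"),
            "k8s_attempts": rep.get("attempts", 1),  # hypothetical default
            "k8s_exit_code": rep.get("exit_code"),
        }
    return out


# Hypothetical payload: replicate 1 succeeded on its second attempt after an OOM kill.
sample = json.dumps({
    "replicates": [
        {"index": 0, "status": "succeeded", "reason": None, "attempts": 1},
        {"index": 1, "status": "succeeded", "reason": "OOMKilled",
         "attempts": 2, "exit_code": 137},
    ]
})
meta = replicate_metadata(sample)
```

Each value in `meta` could then be passed as the `metadata` argument of the proposed `registry.update_run_metadata` call.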

2. Surface in manager.run() summary

Today:

Completed: 12 succeeded, 0 failed

Becomes:

Replicates: 10 succeeded, 2 failed
  Failure breakdown: 1 OOMKilled, 1 Evicted
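One way the breakdown line could be produced is with `collections.Counter`. This is a minimal sketch, not the manager's actual code: `format_summary` is a hypothetical name, and it assumes a replicate counts as failed when its `status` field is anything other than "succeeded".

```python
from collections import Counter


def format_summary(replicates: list[dict]) -> str:
    """Build the two-line summary from per-replicate status and reason fields."""
    # Assumption: any status other than "succeeded" counts as a failure.
    failed = [r for r in replicates if r["status"] != "succeeded"]
    lines = [
        f"Replicates: {len(replicates) - len(failed)} succeeded, {len(failed)} failed"
    ]
    # Replicates that failed without a reported reason are grouped as "Unknown".
    reasons = Counter(r.get("reason") or "Unknown" for r in failed)
    if reasons:
        breakdown = ", ".join(f"{n} {reason}" for reason, n in reasons.most_common())
        lines.append(f"  Failure breakdown: {breakdown}")
    return "\n".join(lines)


# Hypothetical batch: 10 clean replicates plus one OOM kill and one eviction.
replicates = [{"index": i, "status": "succeeded", "reason": None} for i in range(10)]
replicates.append({"index": 10, "status": "failed", "reason": "OOMKilled"})
replicates.append({"index": 11, "status": "failed", "reason": "Evicted"})
summary = format_summary(replicates)
```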

3. RuntimeWarning on OOMKilled

import warnings

# `replicates` is the list parsed from poll_response["replicates"] in step 1.
oom_reps = [r for r in replicates if r["reason"] == "OOMKilled"]
if oom_reps:
    warnings.warn(
        f"{len(oom_reps)} replicate(s) OOMKilled — successful retries may bias "
        "analysis (agent-explosion runs systematically excluded). "
        "Consider increasing memory request, or filter via "
        "registry.get_runs_by_reason('OOMKilled').",
        RuntimeWarning,
    )

4. Analysis-time query helper

class RunRegistry:
    def get_runs_by_reason(self, reason: str) -> list[Run]:
        """Return runs whose metadata.k8s_reason matches.
        
        Useful for filtering OOM-affected replicates out of comparisons,
        or for finding the runs that triggered agent explosions.
        """

Acceptance criteria

  • After a batch dispatch with mixed-reason failures, registry.get_runs_by_reason("OOMKilled") returns the correct subset.
  • manager.run() summary breaks down failures by canonical reason.
  • A RuntimeWarning is emitted whenever any replicate has reason=OOMKilled.
  • Backwards-compatible: registries created before this change remain readable; metadata.k8s_reason is None for pre-feature data.
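The OOMKilled criterion above could be checked with the standard library's `warnings.catch_warnings`. This is a sketch under assumptions: `warn_on_oom` is a hypothetical, shortened stand-in for the warning logic in the proposal, not joshpy's actual function.

```python
import warnings


def warn_on_oom(replicates: list[dict]) -> int:
    """Emit a RuntimeWarning if any replicate was OOMKilled; return the count."""
    oom = [r for r in replicates if r.get("reason") == "OOMKilled"]
    if oom:
        warnings.warn(
            f"{len(oom)} replicate(s) OOMKilled", RuntimeWarning, stacklevel=2
        )
    return len(oom)


# Record warnings instead of printing them, so the check can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    count = warn_on_oom(
        [{"index": 0, "reason": "OOMKilled"}, {"index": 1, "reason": None}]
    )

oom_warnings = [w for w in caught if issubclass(w.category, RuntimeWarning)]
```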

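Dependencies

  • SchmidtDSE/josh#436 (enriched pollBatch JSON with per-replicate index, status, reason, attempts, and exit_code), as noted in the Proposal.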

Related work

  • josh#403 (Support per-step CSV export for incremental MinIO uploads): once both ship, a registry entry tagged k8s_reason=OOMKilled plus the partial step CSVs gives the user the agent explosion that caused the OOM as scientific data, not a missing-data hole. Together, the two issues close the survivorship-bias gap end-to-end.

Context

Discovered while running a 12-rep test_fine ssp585 batch on GKE Autopilot Balanced spot. The cluster autoscaler drained 2 pods mid-run (consolidating underutilized nodes as other reps completed); K8s spawned replacements that succeeded. joshpy reported "12 succeeded, 0 failed" with no indication anything had been retried for any reason. For ecological modeling, the difference between "3 OOM retries that masked agent explosions" and "3 random evictions on stable runs" is the difference between biased and unbiased analysis. joshpy needs to surface this without forcing scientists to read kubectl logs.

🤖 Generated by Claude Code
