Problem
When a batch dispatch succeeds only after retries, joshpy currently reports "all 12 succeeded" with no signal that anything was retried, or why. The EXIT_CODE_DIAGNOSTICS lookup in strategies.py handles only top-level Job exit codes; per-replicate failure modes are invisible.
This causes silent survivorship bias in the registry: replicates that OOM during interesting simulation states (e.g., agent explosions in ecological models) are masked by stochastically-successful retries, biasing analysis toward calm-state runs. The user can't filter them out because there's nothing to filter on.
Proposal
When SchmidtDSE/josh#436 ships, joshpy parses the enriched pollBatch JSON (replicates: [{index, status, reason, attempts, exit_code}]) and:
1. Persist into registry
Per-replicate reason goes into the job_runs.metadata JSON column (already in the schema, so no migration is needed), under the k8s_reason key.
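A minimal sketch of the persistence step, assuming the enriched pollBatch payload has the shape given in the proposal. The function name `replicate_metadata`, the `k8s_attempts` and `k8s_exit_code` keys, and the sample values are hypothetical; only the `k8s_reason` key and the replicate fields come from this issue.

```python
import json


def replicate_metadata(poll_batch_json: str) -> dict[int, str]:
    """Map replicate index -> JSON text destined for job_runs.metadata.

    Hypothetical helper: how joshpy actually writes the column is not
    specified in this issue.
    """
    payload = json.loads(poll_batch_json)
    out: dict[int, str] = {}
    # A payload without "replicates" (pre-josh#436 output) yields no rows.
    for rep in payload.get("replicates", []):
        meta = {
            "k8s_reason": rep.get("reason"),        # key named in this issue
            "k8s_attempts": rep.get("attempts"),    # assumed key name
            "k8s_exit_code": rep.get("exit_code"),  # assumed key name
        }
        out[rep["index"]] = json.dumps(meta)
    return out


# Invented sample payload, for illustration only.
sample = json.dumps({"replicates": [
    {"index": 0, "status": "succeeded", "reason": None, "attempts": 1, "exit_code": 0},
    {"index": 1, "status": "succeeded", "reason": "OOMKilled", "attempts": 2, "exit_code": 0},
]})
rows = replicate_metadata(sample)
```

Keeping the reason inside the existing JSON column means older rows simply lack the key, which matches the backwards-compatibility criterion.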
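2. Surface in manager.run() summary
Today the summary reports only aggregate counts (for example "12 succeeded, 0 failed"). The proposed summary breaks down failures by canonical reason.
3. RuntimeWarning on OOMKilled
    import warnings

    # `replicates` is the per-replicate list parsed from the pollBatch JSON.
    oom_reps = [r for r in replicates if r.get("reason") == "OOMKilled"]
    if oom_reps:
        warnings.warn(
            f"{len(oom_reps)} replicate(s) OOMKilled: successful retries may bias "
            "analysis (agent-explosion runs systematically excluded). "
            "Consider increasing memory request, or filter via "
            "registry.get_runs_by_reason('OOMKilled').",
            RuntimeWarning,
        )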
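The per-reason breakdown proposed for the manager.run() summary can be derived from the same replicates list. The following is a rough sketch only: the function name `summarize`, the `"succeeded"` status value, the `"Evicted"` reason, and the output wording are assumptions, not joshpy's actual format.

```python
from collections import Counter


def summarize(replicates: list[dict]) -> str:
    """Build a one-line summary that exposes retries by reason."""
    total = len(replicates)
    succeeded = sum(1 for r in replicates if r.get("status") == "succeeded")
    # A replicate counts as retried when it needed more than one attempt.
    retried = [r for r in replicates if (r.get("attempts") or 1) > 1]
    by_reason = Counter(r.get("reason") or "unknown" for r in retried)

    line = f"{succeeded}/{total} succeeded, {total - succeeded} failed"
    if retried:
        detail = ", ".join(f"{n} {reason}" for reason, n in sorted(by_reason.items()))
        line += f"; {len(retried)} retried ({detail})"
    return line


# Invented data: 12 replicates, 3 of which succeeded only on a second attempt.
reps = [{"index": i, "status": "succeeded", "reason": None, "attempts": 1} for i in range(9)]
reps += [
    {"index": 9, "status": "succeeded", "reason": "Evicted", "attempts": 2},
    {"index": 10, "status": "succeeded", "reason": "Evicted", "attempts": 2},
    {"index": 11, "status": "succeeded", "reason": "OOMKilled", "attempts": 2},
]
print(summarize(reps))
```

Sorting the reasons keeps the summary line stable across runs, which makes it easier to compare or grep in logs.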
4. Analysis-time query helper
    class RunRegistry:
        def get_runs_by_reason(self, reason: str) -> list[Run]:
            """Return runs whose metadata.k8s_reason matches.

            Useful for filtering OOM-affected replicates out of comparisons,
            or for finding the runs that triggered agent explosions.
            """
Acceptance criteria
After a batch dispatch with mixed-reason failures, registry.get_runs_by_reason("OOMKilled") returns the correct subset.
manager.run() summary breaks down failures by canonical reason.
RuntimeWarning raised whenever any replicate has reason=OOMKilled.
Backwards-compatible: registries created before this change continue to read; metadata.k8s_reason is None for pre-feature data.
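The first and last criteria can be illustrated together. This is a sketch under stated assumptions: the `Run` dataclass is a hypothetical stand-in for joshpy's own type, metadata is assumed to be stored as JSON text, and the filter runs in plain Python rather than in whatever query layer the registry actually uses.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """Hypothetical stand-in for joshpy's Run; only the fields used here."""
    run_id: str
    metadata: Optional[str]  # raw JSON text from job_runs.metadata, or None


def get_runs_by_reason(runs: list[Run], reason: str) -> list[Run]:
    """Return the runs whose metadata.k8s_reason equals `reason`."""
    matches = []
    for run in runs:
        meta = json.loads(run.metadata) if run.metadata else None
        # Pre-feature rows have no metadata or no k8s_reason key, so they
        # never match a concrete reason and are skipped without error.
        if isinstance(meta, dict) and meta.get("k8s_reason") == reason:
            matches.append(run)
    return matches


# Invented rows: two pre-feature entries and two tagged ones.
runs = [
    Run("rep-0", None),
    Run("rep-1", json.dumps({"k8s_reason": "OOMKilled"})),
    Run("rep-2", json.dumps({"k8s_reason": "Evicted"})),
    Run("rep-3", json.dumps({"seed": 42})),
]
oom = get_runs_by_reason(runs, "OOMKilled")
print([r.run_id for r in oom])
```

A production version would more likely push the filter into the registry's query layer; the Python loop is used here only to keep the example independent of the storage backend.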
Related work
josh#403 (Support per-step CSV export for incremental MinIO uploads): when both ship, a registry entry tagged k8s_reason=OOMKilled plus partial step CSVs gives the user the agent explosion that caused the OOM as scientific data, not a missing-data hole. The two issues together close the survivorship-bias gap end-to-end.
Context
Discovered while running a 12-rep test_fine ssp585 batch on GKE Autopilot Balanced spot. The cluster autoscaler drained 2 pods mid-run (consolidating underutilized nodes as other reps completed); K8s spawned replacements that succeeded. joshpy reported "12 succeeded, 0 failed" with no indication anything had been retried for any reason. For ecological modeling, the difference between "3 OOM retries that masked agent explosions" and "3 random evictions on stable runs" is the difference between biased and unbiased analysis. joshpy needs to surface this without forcing scientists to read kubectl logs.
Dependencies
The replicates array in pollBatch JSON output (SchmidtDSE/josh#436, as referenced in the Proposal). Until that lands, there is no signal to consume.
🤖 Generated by Claude Code