This project trains a PPO agent to place small grid-cell overlays on a stop sign so that detector confidence drops under UV activation while staying high in daylight. The environment renders a sign-on-pole against randomized backgrounds with matched transforms, and uses UV paint pairs (day vs UV-on) to model activation.
Ethics notice: this repository is for research and robustness testing only. Do not use it to cause harm or unsafe behavior.
Core ideas:
- Grid-cell action space on a stop sign octagon mask (discrete actions).
- UV paint pair: daylight color/alpha vs UV-on color/alpha.
- Matched transforms and backgrounds across daylight/UV variants for fair comparison.
- Reward that targets UV confidence drop while penalizing daylight drop and patch area.
- Efficiency bonus (drop per area) and fixed area penalties to favor minimal patches.
- Early termination on success or area cap.
- Detector backends: Ultralytics YOLO, torchvision detectors, and optional Transformers RT-DETR.
- Python 3.10+ recommended.
- PyTorch, stable-baselines3, and sb3-contrib (MaskablePPO).
- YOLO weights in `weights/` (see below).
- Optional: `transformers` if you use the RT-DETR backend.
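A UV paint pair is just two color/alpha tuples applied to the same cells: one for daylight and one for UV-on. As a minimal sketch of the per-pixel compositing step (function and argument names are illustrative, not the repo's API):

```python
def composite(base_rgb, paint_rgb, alpha):
    """Alpha-blend a paint color over a base pixel (all values in [0, 1]).

    Applied twice per rendered step: once with the paint's daylight
    color/alpha, once with its UV-on color/alpha, so the detector sees
    matched day and UV variants of the same patch layout.
    """
    return tuple((1 - alpha) * b + alpha * p for b, p in zip(base_rgb, paint_rgb))
```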
Create a virtual environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Alternatively, `enviornment.yml` is included if you prefer conda.
If you installed requirements before action masking was added, you may need:
```bash
python -m pip install sb3-contrib
```

Optional (RT-DETR backend):

```bash
python -m pip install transformers
```

Required files in `data/`:

- `stop_sign.png` (RGBA, transparent background)
- `pole.png` (RGBA)
- `backgrounds/` (folder with scene images; 640x640 recommended)
Optional:

- `stop_sign_uv.png` (RGBA UV-lit version of the sign; if missing, the base sign is reused)
YOLO weights go in `weights/`:

- `weights/yolo8n.pt` (default)
- `weights/yolo11n.pt` (optional if you switch versions)
Torchvision detectors download pretrained weights automatically on first use
(cached under ~/.cache/torch/hub/checkpoints).
Transformers RT-DETR models download from Hugging Face on first use
(cached under ~/.cache/huggingface).
Minimal run (`train.sh` defaults):

```bash
bash train.sh
```

Recommended single-machine run (YOLOv8, GPU, dummy vec):

```bash
YOLO_DEVICE=cuda:0 VEC=dummy NUM_ENVS=1 bash train.sh
```

Resume from the latest run folder:

```bash
YOLO_DEVICE=cuda:0 VEC=dummy NUM_ENVS=1 bash train.sh --resume
```

Use a specific YOLO version/weights:

```bash
YOLO_VERSION=8 YOLO_WEIGHTS=./weights/yolo8n.pt bash train.sh
```

Torchvision detector example:

```bash
python train_single_stop_sign.py --detector torchvision --detector-model retinanet_resnet50_fpn_v2
```

Transformers RT-DETR example:

```bash
python train_single_stop_sign.py --detector rtdetr --detector-model PekingU/rtdetr_r50vd
```

Evaluation (deterministic policy, logs to TensorBoard):

```bash
bash eval.sh
```

From `train_single_stop_sign.py`:
- `--num-envs` (default 1 in `train.sh`) and `--vec` (`dummy` or `subproc`)
- `--n-steps`, `--batch-size`, `--total-steps` (PPO training control; `train.sh` defaults 1024/1024)
- `--episode-steps`: max steps per episode (default 300)
- `--grid-cell` (2, 4, 8, 16, 32): grid size in pixels (default 16)
- `--uv-threshold`: UV drop threshold for success
- `--lambda-area`: area penalty strength (encourages minimal patches)
- `--lambda-efficiency`: efficiency bonus (drop per area)
- `--area-target` (default 0.25): target area fraction used for excess penalties
- `--step-cost` (default 0.012) and `--step-cost-after-target` (default 0.14): per-step penalties
- `--lambda-area-start`, `--lambda-area-end`, `--lambda-area-steps` (curriculum)
- `--area-cap-frac`: cap on total patch area (<= 0 disables)
- `--area-cap-penalty`: reward penalty when the cap would be exceeded
- `--area-cap-mode` (`soft` or `hard`)
- `--area-cap-start`, `--area-cap-end`, `--area-cap-steps` (curriculum)
- `--lambda-day`: penalty for daylight confidence drop beyond tolerance
- `--lambda-iou`, `--lambda-misclass`: extra objectives for mislocalization/misclassification
- `--paint`, `--paint-list`: paint selection (single or per-episode sampling)
- `--multiphase`: enable 3-phase curriculum (solid/no pole -> dataset + pole)
- `--phase1-steps`, `--phase2-steps`, `--phase3-steps` (phase lengths; 0 = auto split)
- `--phase1-eval-K`, `--phase2-eval-K`, `--phase3-eval-K` (per-phase eval_K overrides)
- Phase penalties are uniform across phases (background/pole/transform are the only curriculum changes).
- `--bg-mode` (`dataset` or `solid`) and `--no-pole` for single-phase runs
- `--obs-size`, `--obs-margin`, `--obs-include-mask` (cropped observation + mask channel)
- `--ent-coef`, `--ent-coef-start`, `--ent-coef-end`, `--ent-coef-steps` (entropy coefficient schedule; default 0.001)
- `--detector-device` (e.g., `cpu`, `cuda`, or `auto`)
- `--detector` (`yolo`, `torchvision`, or `rtdetr`) and `--detector-model` (model name for torchvision/RT-DETR)
- `--step-log-every`, `--step-log-keep`, `--step-log-500` (step logging control)
- `--cnn` (`custom` or `nature`): choose the feature extractor
- `--ckpt`, `--overlays`, `--tb`: output paths (TB logs grouped under `grid_uv_yolo<ver>`)
- `--save-freq-steps` or `--save-freq-updates`: checkpoint cadence
- `--check-env`: runs the SB3 env checker before training (enabled by default in `train.sh`)
The environment is implemented in envs/stop_sign_grid_env.py.
Highlights:
- Discrete action space over valid grid cells inside the sign octagon.
- Action masking prevents duplicate cell selections (MaskablePPO).
- UV-on reward uses the raw UV drop (`drop_on`), computed as the day baseline confidence minus the UV-on overlay confidence.
- Reward includes an efficiency bonus (drop per area) plus fixed area penalties that push toward a target patch fraction.
- Optional per-step penalties can apply globally or only after the area target.
- Observations are cropped around the sign with an optional overlay-mask channel (controlled by the `--obs-*` flags).
- Training uses a lightweight custom CNN extractor tuned for sign crops (or NatureCNN via `--cnn nature`).
- The area cap supports soft (penalty) or hard (terminate) modes.
- A minimum UV alpha (`uv_min_alpha`) ensures patches are visible under UV even with very low paint alpha.
- VecNormalize is applied to observations; evaluation should reuse the saved stats.
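The action-masking highlight above can be sketched as follows, assuming a flat `Discrete` action space with one action per grid cell (names and shapes here are illustrative, not the env's actual internals):

```python
import numpy as np

def grid_action_mask(sign_alpha: np.ndarray, grid_cell: int,
                     selected: np.ndarray) -> np.ndarray:
    """Flat boolean action mask for MaskablePPO.

    A cell is selectable iff it overlaps the sign octagon (any nonzero
    alpha inside the cell) and has not already been painted.  `sign_alpha`
    is an HxW alpha channel; `selected` is flat, one bool per grid cell.
    """
    h, w = sign_alpha.shape
    rows, cols = h // grid_cell, w // grid_cell
    # Collapse each grid_cell x grid_cell block to "does it touch the sign?"
    blocks = sign_alpha[:rows * grid_cell, :cols * grid_cell]
    inside = blocks.reshape(rows, grid_cell, cols, grid_cell).any(axis=(1, 3))
    return inside.ravel() & ~selected
```

MaskablePPO then receives this array via its `action_masks` hook, so the policy never proposes an off-sign or duplicate cell.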
Definitions:
- `c0_day`: baseline day confidence (no overlay)
- `c_day`: day confidence with overlay
- `c_on`: UV-on confidence with overlay
- `drop_day = c0_day - c_day`
- `drop_on = c0_day - c_on`
- `area = total_area_mask_frac`
- `mean_iou`: mean IoU between the target box and the top detection
- `misclass_rate`: misclassification rate
- `conf_thr = success_conf_threshold`
- `area_target = area_target_frac` if set, else `area_cap_frac`
Efficiency bonus:
```
eff = log1p(max(0, drop_on) / max(area, efficiency_eps))
```

Core reward:

```
max_drop = max(0, c0_day - conf_thr)
drop_blend = min(drop_on, max_drop)
pen_day = max(0, drop_day - day_tolerance)
raw_core = drop_blend
           - lambda_day * pen_day
           - lambda_area * area
           - excess_penalty
           - step_cost_penalty
           + lambda_iou * (1 - mean_iou)
           + lambda_misclass * misclass_rate
           + lambda_efficiency * eff
           - lambda_perceptual * perceptual_delta
```
Shaping + success:

```
shaping = 0.35 * tanh(3.0 * (conf_thr - c_on))
success_bonus = 0.2 * (1 - area)^2 if c_on <= conf_thr else 0
raw_total = raw_core + shaping + success_bonus
```

Excess penalty (when `area > area_target`):

```
excess = area - area_target
excess_penalty = lambda_area * (4.5 * excess + excess^2)
```

Step cost (global + target-scaled):

```
step_cost_penalty = step_cost
if step_cost_after_target > 0 and area_target is not None and area > area_target:
    excess = (area - area_target) / max(area_target, 1e-6)
    step_cost_penalty += step_cost_after_target * (1 + max(0, excess))
```

Soft cap override (if enabled and exceeded):

```
excess = max(0, (area - area_cap) / area_cap)
over_pen = abs(area_cap_penalty) * (1 + 2 * excess)
raw_total = -over_pen
```

Final reward:

```
reward = tanh(1.2 * raw_total)
```
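Putting the formulas above together, here is a self-contained Python sketch of the reward (the keyword defaults are illustrative placeholders, not the trainer's defaults; the real values come from the CLI flags, and the soft-cap override is omitted):

```python
import math

def compute_reward(c0_day, c_day, c_on, area, *,
                   conf_thr=0.5, day_tolerance=0.05,
                   lambda_day=1.0, lambda_area=1.0,
                   lambda_iou=0.0, lambda_misclass=0.0,
                   lambda_efficiency=1.0, lambda_perceptual=0.0,
                   mean_iou=1.0, misclass_rate=0.0, perceptual_delta=0.0,
                   area_target=0.25, step_cost=0.012,
                   step_cost_after_target=0.14, efficiency_eps=1e-3):
    drop_day = c0_day - c_day
    drop_on = c0_day - c_on
    # Efficiency bonus: UV drop per unit area, log-compressed.
    eff = math.log1p(max(0.0, drop_on) / max(area, efficiency_eps))
    # Clip the drop so pushing confidence below the threshold stops paying.
    max_drop = max(0.0, c0_day - conf_thr)
    drop_blend = min(drop_on, max_drop)
    pen_day = max(0.0, drop_day - day_tolerance)
    # Quadratic-ish penalty for exceeding the target area fraction.
    excess_penalty = 0.0
    if area_target is not None and area > area_target:
        excess = area - area_target
        excess_penalty = lambda_area * (4.5 * excess + excess ** 2)
    # Global step cost, plus a target-scaled surcharge past the target.
    step_cost_penalty = step_cost
    if step_cost_after_target > 0 and area_target is not None and area > area_target:
        rel = (area - area_target) / max(area_target, 1e-6)
        step_cost_penalty += step_cost_after_target * (1 + max(0.0, rel))
    raw_core = (drop_blend - lambda_day * pen_day - lambda_area * area
                - excess_penalty - step_cost_penalty
                + lambda_iou * (1 - mean_iou)
                + lambda_misclass * misclass_rate
                + lambda_efficiency * eff
                - lambda_perceptual * perceptual_delta)
    shaping = 0.35 * math.tanh(3.0 * (conf_thr - c_on))
    success_bonus = 0.2 * (1 - area) ** 2 if c_on <= conf_thr else 0.0
    return math.tanh(1.2 * (raw_core + shaping + success_bonus))
```

The final `tanh` keeps every reward in (-1, 1), which stabilizes PPO's value targets.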
Baseline / cap gates:
- If `c0_day < min_base_conf`, the reward is `-0.05` and the step returns early.
- If `area_cap_mode == "hard"` and the next action would exceed `area_cap_frac`, the episode terminates with reward `area_cap_penalty`.
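A minimal sketch of these two gates (argument names illustrative; returns `(reward, terminated)` when a gate fires, else `None`):

```python
def apply_gates(c0_day, min_base_conf, next_area, area_cap_frac,
                area_cap_mode, area_cap_penalty):
    # Weak baseline: the detector barely sees the clean sign, so the
    # drop signal is meaningless; give a small penalty and skip the step.
    if c0_day < min_base_conf:
        return -0.05, False
    # Hard cap: the proposed cell would push total area past the cap.
    if area_cap_mode == "hard" and area_cap_frac > 0 and next_area > area_cap_frac:
        return area_cap_penalty, True
    return None
```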
If you need to change rendering or physics:
- `_transform_sign()` controls camera jitter, blur, color, and noise.
- `_compose_sign_and_pole()` controls pole ratio and placement.
- `_place_group_on_background()` controls scale and background placement.
TensorBoard logs:
```bash
# train.sh (defaults)
tensorboard --logdir _runs/tb --port 6006

# eval.sh (defaults)
tensorboard --logdir _runs/tb_eval --port 6006
```

Callbacks log:

- `TensorboardOverlayCallback` (overlay images and metadata)
- `EpisodeMetricsCallback` (episode-end scalars)
- `StepMetricsCallback` (rolling step metrics)
Episode metrics currently include:
- `episode/area_frac_final`, `episode/length_steps`
- `episode/drop_on_final`, `episode/drop_on_smooth_final`
- `episode/base_conf_final`, `episode/after_conf_final`
- `episode/reward_final`, `episode/selected_cells_final`
- `episode/eval_K_used_final`
- `episode/uv_success_final`, `episode/area_cap_exceeded_final`
- `episode/reward_core_final`, `episode/reward_raw_total_final`
- `episode/reward_efficiency_final`, `episode/reward_perceptual_final`
- `episode/lambda_area_used_final`
- `episode/area_target_frac_final`
- `episode/area_reward_corr` (rolling correlation between area and reward)
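`episode/area_reward_corr` is a rolling correlation between patch area and reward; a plain-Python sketch of the statistic (Pearson correlation over the rolling window; the actual callback implementation may differ):

```python
import math

def rolling_corr(areas, rewards):
    """Pearson correlation over paired samples from the rolling window.

    Returns 0.0 for a degenerate window (no variation in one series).
    """
    n = len(areas)
    ma = sum(areas) / n
    mr = sum(rewards) / n
    sa = math.sqrt(sum((a - ma) ** 2 for a in areas))
    sr = math.sqrt(sum((r - mr) ** 2 for r in rewards))
    if sa == 0.0 or sr == 0.0:
        return 0.0
    cov = sum((a - ma) * (r - mr) for a, r in zip(areas, rewards))
    return cov / (sa * sr)
```

A strongly negative value is the healthy sign here: bigger patches should be earning less reward.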
Step metrics:
- Rolling window of per-step rows in `_runs/tb/<run_id>/grid_uv_yolo8/<phase>/tb_step_metrics/step_metrics.ndjson`
- 500-step snapshots in `_runs/tb/<run_id>/grid_uv_yolo8/<phase>/tb_step_metrics/step_metrics_500.ndjson`
- Step scalars include reward components and area weights.
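NDJSON is one JSON object per line, so the step-metrics files can be loaded with the stdlib alone; a sketch (the caller supplies the path):

```python
import json

def load_step_metrics(path):
    """Load an NDJSON step-metrics file: one JSON object per line,
    blank lines skipped.  Returns a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```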
Generated files:
- `_runs/checkpoints/<run_id>/`: PPO checkpoints (run id like `yolo8_1`, `yolo11_2`, ...).
- `_runs/overlays/<run_id>/`: best overlays (PNG + JSON) and `traces.ndjson` if enabled.
- `_runs/tb/<run_id>/`: TensorBoard event files (grouped under `grid_uv_yolo<ver>/<phase>`).
- `_runs/tb_eval/`: evaluation logs (if you use `eval.sh`).
Overlay saver:
- `utils/save_callbacks.py` keeps the best N overlays and appends trace metadata.
- The current training config disables overlay saving by default (`max_saved=0`).
- Files are named by area fraction and step, for example:
  - `area0p1234_step000000123_env00_full.png`
  - `area0p1234_step000000123_env00_overlay.png`
  - `area0p1234_step000000123_env00.json`
Trace replay:
- Removed (legacy blob traces no longer apply to the grid environment).

Tools:

- `tools/debug_grid_env.py` runs the env step-by-step and saves UV-on previews.
- `tools/debug_detector_image.py` prints all detections for a single image (with optional box overlay).
- `tools/area_sweep_debug.py` sweeps coverage levels and logs confidence/IoU/misclass stats.
- `tools/area_sweep_analyze.py` summarizes sweep results and generates plots.
- `tools/area_sweep_rank.py` ranks combos and computes per-detector summaries.
- `tools/replay_area_sweep.py` replays logged sweep cases and saves images.
- `tools/test_stop_sign_confidence.py` checks detector confidence on a single image.
- `tools/cleanup_runs.py` removes old run outputs (defaults to `_runs`).
- `tools/detector_server.py` runs a shared detector (YOLO/torchvision/RT-DETR) for multi-process training.
- `setup_env.sh` contains a helper for local setup.
Cleanup usage:
```bash
# Dry-run
python tools/cleanup_runs.py

# Delete
python tools/cleanup_runs.py --yes
```

Detector server usage:

```bash
python tools/detector_server.py --model ./weights/yolo8n.pt --device cuda:0 --port 5009

# In training, point the detector device to the server:
# --detector-device server://HOST:5009
```

For torchvision/RT-DETR, pass `--detector` and `--detector-model` (no `--model` needed):

```bash
python tools/detector_server.py --detector rtdetr --detector-model PekingU/rtdetr_r50vd --device cuda:0 --port 5009
```

Single-command server + training (from `train.sh`):

```bash
bash train.sh --yolo-version 8 --yolo-weights ./weights/yolo8n.pt --start-detector-server
```

Common single-machine training (no server):

```bash
bash train.sh --yolo-version 8 --yolo-weights ./weights/yolo8n.pt
```

Important `train.sh` knobs:
- `--num-envs`, `--vec`: number of envs and vectorization mode; use `--vec dummy` with GPU YOLO.
- `--n-steps`, `--batch`, `--total-steps`: PPO rollout size, batch size, and total training steps.
- `--grid-cell`: patch grid size in pixels (2, 4, 8, 16, 32).
- `--uv-threshold`: UV drop threshold for success.
- `--lambda-area`, `--lambda-area-start/end/steps`: area penalty and optional ramp.
- `--area-cap-frac`, `--area-cap-mode`: patch area cap and soft/hard behavior.
- `--area-cap-start/end/steps`: cap curriculum from larger to smaller.
- `--obs-size`, `--obs-margin`, `--obs-include-mask`: observation crop and mask channel.
- `--ent-coef`, `--ent-coef-start/end/steps`: entropy coefficient schedule.
- `--step-log-every`, `--step-log-keep`, `--step-log-500`: step metrics logging controls.
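Several of these knobs (`--ent-coef-start/end/steps`, `--lambda-area-start/end/steps`, `--area-cap-start/end/steps`) describe linear ramps over training steps. A generic sketch of such a schedule (illustrative, not the trainer's exact implementation):

```python
def linear_schedule(start: float, end: float, steps: int):
    """Return value(step): linear interpolation from start to end over
    `steps` training steps, clamped at the endpoints afterwards."""
    def value(step: int) -> float:
        t = min(max(step / max(steps, 1), 0.0), 1.0)
        return start + t * (end - start)
    return value
```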
- Keep comments sparse and focused on why a block exists or what it protects against.
- Avoid restating obvious code; prefer naming and structure to make intent clear.
- When behavior is non-obvious (curriculum logic, reward shaping), add a short note.
```
.
|-- .github/
|-- .venv/
|-- baselines/
|-- data/
|   |-- stop_sign.png
|   |-- stop_sign_uv.png
|   |-- pole.png
|   |-- backgrounds/
|
|-- detectors/
|   |-- factory.py
|   |-- remote_detector.py
|   |-- torchvision_wrapper.py
|   |-- transformers_detr_wrapper.py
|   |-- yolo_wrapper.py
|
|-- envs/
|   |-- stop_sign_grid_env.py
|
|-- metrics/
|
|-- tools/
|   |-- aggregate_baselines.py
|   |-- debug_grid_env.py
|   |-- debug_detector_image.py
|   |-- area_sweep_debug.py
|   |-- area_sweep_analyze.py
|   |-- area_sweep_rank.py
|   |-- eval_policy.py
|   |-- replay_area_sweep.py
|   |-- test_stop_sign_confidence.py
|   |-- cleanup_runs.py
|   |-- detector_server.py
|   |-- parse_tb_events.py
|   |-- run_baselines_compare.sh
|
|-- utils/
|   |-- save_callbacks.py
|   |-- tb_callbacks.py
|   |-- uv_paint.py
|
|-- weights/
|   |-- yolo11n.pt
|   |-- yolo8n.pt
|
|-- train_single_stop_sign.py
|-- train.sh
|-- eval.sh
|-- requirements.txt
|-- enviornment.yml
```
Tips:

- If you run on CUDA, `--vec dummy` is safer with YOLO inference.
- Lower `--grid-cell` and higher `--lambda-area` tend to produce smaller patches.
- If you are not seeing UV drop, increase `eval-K` to reduce variance.
MIT License, for research and educational use.