Fix V8 dataset paths and RunPod probe script#8

Merged
teslaeco merged 1 commit into main from codex/task-title-r8zel6
Apr 20, 2026

@teslaeco

Motivation

  • Make the documented Hugging Face dataset layout match the repository's actual flat-file layout, so that it is safe to use from RunPod environments.
  • Replace invalid or repo-local path guidance (data/plain_text/) with explicit flat text artifacts so downstream tooling can iterate reliably.
  • Ensure the seed-42 probing workflow is robust in remote environments by downloading the dataset snapshot at runtime instead of referencing a hardcoded local path.

Description

  • Added hf_datasets/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix/README.md that lists the flat HF dataset files: train.jsonl, validation.jsonl, test.jsonl, train.txt, validation.txt, test.txt, v8_micro_0p02pct.txt, v8_micro_0p05pct.txt, v8_micro_0p10pct.txt, and the related scripts/docs/metadata files.
  • Replaced guidance referring to data/plain_text/ style layouts with direct references to train.txt, validation.txt, test.txt, and v8_micro_*.txt in the new README.
  • Added a RunPod-safe probe script hf_datasets/.../run_v8_seed42_probe.sh which uses huggingface_hub.snapshot_download to set DATASET_DIR to the downloaded snapshot and then runs python "$DATASET_DIR/build_v8_micro_mix.py".
  • Preserved the recommended micro-mix rates (0.02%, 0.05%, 0.10%), the seed-42 pass gate (1.08041364), and the statement that the dataset does not claim SOTA.
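
The probe flow described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of run_v8_seed42_probe.sh: the repo id is a placeholder (the PR does not state it), and the helper names resolve_dataset_dir, missing_files, build_command, and run_probe are hypothetical. Only huggingface_hub.snapshot_download, DATASET_DIR, build_v8_micro_mix.py, and the flat file names come from the description.

```python
import os
import subprocess
import sys
from pathlib import Path

# Placeholder repo id: the PR does not state the real one (assumption).
REPO_ID = "your-org/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix"

# Flat data files listed in the new README, per the PR description.
EXPECTED_FILES = (
    "train.jsonl", "validation.jsonl", "test.jsonl",
    "train.txt", "validation.txt", "test.txt",
    "v8_micro_0p02pct.txt", "v8_micro_0p05pct.txt", "v8_micro_0p10pct.txt",
)


def resolve_dataset_dir(repo_id, download=None):
    """Download the dataset snapshot and return its local directory."""
    if download is None:
        # Imported lazily so the helpers below work without the package.
        from huggingface_hub import snapshot_download
        download = snapshot_download
    return Path(download(repo_id=repo_id, repo_type="dataset"))


def missing_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected flat files that are absent from the snapshot root."""
    return [name for name in expected if not (Path(dataset_dir) / name).is_file()]


def build_command(dataset_dir):
    """Command line that runs the build script shipped inside the snapshot."""
    return [sys.executable, str(Path(dataset_dir) / "build_v8_micro_mix.py")]


def run_probe(repo_id=REPO_ID):
    """Download the snapshot, check the layout, then run the build script."""
    dataset_dir = resolve_dataset_dir(repo_id)
    absent = missing_files(dataset_dir)
    if absent:
        raise FileNotFoundError(f"snapshot is missing: {absent}")
    # Export DATASET_DIR so the child process sees the downloaded location.
    env = dict(os.environ, DATASET_DIR=str(dataset_dir))
    subprocess.run(build_command(dataset_dir), check=True, env=env)
```

Calling run_probe() needs network access and the huggingface_hub package. The point of resolving the path at runtime is that nothing in the flow depends on a directory that exists only on the author's machine.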

Testing

  • Ran a shell syntax check with bash -n hf_datasets/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix/run_v8_seed42_probe.sh, which completed successfully.
  • Verified content and presence of required strings using rg checks for snapshot_download, DATASET_DIR, build_v8_micro_mix.py, the recommended mix rates, the pass condition, and the no-SOTA statement, which all succeeded.
  • Confirmed that the changes were staged and committed locally, and that the new script was made executable with chmod during preparation.
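
The string-presence checks above were run with rg. A rough Python equivalent, given only as a sketch, looks like this. The function name missing_strings is hypothetical, and the assumption that every string is expected in a single file is mine. The exact wording of the no-SOTA statement and of the mix rates is not quoted in this PR, so they are left out of the list.

```python
# Strings named in the Testing notes. The pass-gate value is taken from the
# description; treating it as a literal substring is an assumption.
REQUIRED_STRINGS = (
    "snapshot_download",
    "DATASET_DIR",
    "build_v8_micro_mix.py",
    "1.08041364",
)


def missing_strings(text, required=REQUIRED_STRINGS):
    """Return the required strings that do not occur in the given text."""
    return [needle for needle in required if needle not in text]
```

To use it, pass in the contents of the script or README being checked. An empty list means every required string is present.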

@teslaeco teslaeco merged commit 0655fcc into main Apr 20, 2026