diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/README.md b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/README.md new file mode 100644 index 0000000000..6e6112e756 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/README.md @@ -0,0 +1,97 @@ +# Record: V21 Stack + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851 (3-seed) + +**val_bpb = 1.05851479** (3-seed mean, std 0.000762, seeds 42 / 0 / 1234) on `track_10min_16mb`. + +## Per-seed + +| seed | val_bpb | eval ops ms | artifact bytes | +|-----:|------------:|------------:|---------------:| +| 42 | 1.05764263 | 575,915 | 15,949,305 | +| 0 | 1.05886205 | 553,279 | ~15,943,000 | +| 1234 | 1.05903968 | 554,723 | ~15,945,000 | +| **mean** | **1.05851479** | — | — | +| **std** | **0.000762** | — | — | + +## Stack + +1. **PR #1945** (@alertcat): V21 base = PR #1908 + AWQ-Lite mixed-precision GPTQ + Asymmetric Logit Rescale. +2. **PR #1953** (@andrewbaggio1): TTT/QK env knobs — `TTT_LR=0.75`, `QK_GAIN_INIT=5.25`, `TTT_NO_QV_MASK=1`. (2560 long-context dropped due to OOM during global-SGD allreduce on this 8×H100 80GB SXM provisioning; remaining 7 knobs preserved.) +3. **PR #1948** (@TimS-ml, @lijuncheng16): LeakyReLU squared slope 0.3 patch (4-point sweep min identified by PR #1948). +4. **PR #1145** (@AnirudhRahul, valerio-endorsed): closed-form n-gram tilt with three causal experts (token order 16, within-doc, word order 4) and Σ P=1 closed-form Z renormalization. + +The static n-gram hint table is built in a single L→R causal pass over val tokens during `validate()` setup (env flag `NGRAM_HINT_PRECOMPUTE_OUTSIDE=1`, default). Setting the flag to 0 reproduces the inline build path with identical val_bpb. + +## Compliance + +- Train ≤ 600,000 ms, eval ops ≤ 600,000 ms, artifact ≤ 16,000,000 bytes per seed. 
+- Standard log-softmax over the SP8192 alphabet at every scored position; tilt is closed-form `p'(a) = exp(β·1[a=h]) · p(a) / Z`, `Z = 1 + q · (e^β − 1)`, Σ p'(a) = 1 over vocab. +- Single-pass: each val token contributes exactly one BPB term in `quantized_ttt_phased`. +- N-gram hints are strictly causal: hint at position t depends only on tokens [0..t−1]. +- No SLOT, no n-gram cache hash table, no logit bias, no ETLB, no Pre-Quant TTT. + +## Δ vs neighbors (3-seed) + +| Submission | val_bpb | Δ vs ours | +|------------|--------:|----------:| +| **This submission** | **1.05851** | — | +| PR #1953 (andrewbaggio1) | 1.05855 | +0.00004 | +| PR #1945 (alertcat) | 1.05943 | +0.00092 | +| PR #1934 (liujshi) | 1.05993 | +0.00142 | +| PR #1956 (AayushBaniya2006) | 1.06044 | +0.00193 | +| PR #1908 (romeerp) | 1.06081 | +0.00230 | + +## Statistical significance vs merged leaderboard top (PR #1902 policy) + +Per the chronological frontier policy adopted in [PR #1902](https://github.com/openai/parameter-golf/pull/1902) (one-sided Welch's two-sample t-test, **p < 0.25** progression cutoff), this submission is tested against the current merged top row, **PR #1855**, using its 6-sample evidence (3 submitted + 3 independent reproduction by @okezue, [#1855 comment](https://github.com/openai/parameter-golf/pull/1855#issuecomment-4336629746)): + +| Submission | n | mean val_bpb | std (n−1) | +|------------|--:|-------------:|----------:| +| **This submission (#1967)** | 3 | 1.05851479 | 0.000762 | +| PR #1855 (merged top, 6-sample) | 6 | 1.06075500 | 0.000933 | + +``` +mean_diff = 0.00224 BPB (~0.00488 nats) +SE = sqrt(0.000762²/3 + 0.000933²/6) + = 0.000582 +t-stat = 3.850 +Welch df = 5.00 +one-sided p ≈ 0.0060 +``` + +**p ≈ 0.0060**, vs the 0.25 cutoff: **passes by ~42× margin**. The 3-seed sample is enough on its own to establish significance against the merged frontier; independent reproduction at any seed would further tighten the bound. 
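The Welch figures above can be re-derived from the two summary rows alone. The sketch below is illustrative and not part of the submission: `welch_one_sided` and `student_t_sf` are hypothetical helper names, and the tail probability is obtained by Simpson integration of the Student-t density (an assumption made to stay within the standard library; a statistics package would normally be used instead).

```python
import math


def student_t_sf(t, df, steps=20000):
    """Upper-tail probability P(T > t) for t >= 0, via Simpson's rule on the t density."""
    # Log of the normalising constant of the Student-t density with df degrees of freedom.
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # The density is symmetric, so P(T > t) = 0.5 - integral from 0 to t.
    return 0.5 - total * h / 3


def welch_one_sided(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """One-sided Welch test of mean_b > mean_a (lower val_bpb is better)."""
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df, student_t_sf(t, df)


# Summary rows copied from the table above: this submission vs. PR #1855 (6-sample).
t, df, p = welch_one_sided(1.05851479, 0.000762, 3, 1.06075500, 0.000933, 6)
print(f"t = {t:.3f}, df = {df:.2f}, one-sided p = {p:.4f}")
```

Run with the table values, the output should land close to the reported t-stat, Welch df, and one-sided p; small differences in the last digit can come from rounding of the published standard deviations.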
+ +## System dependencies + +- gcc + lrzip (`apt-get install -y build-essential lrzip` on Debian/Ubuntu). +- Python: `torch==2.9.1`, triton (bundled), Flash Attention 3, numpy, sentencepiece, tiktoken, kernels, datasets, huggingface-hub, typing-extensions==4.15.0. See `requirements.txt`. +- 8× H100 80GB SXM. +- CASEOPS-preprocessed FineWeb10B data (run `prepare_caseops_data.py` once). + +## Reproduction + +``` +bash setup.sh # apt + pip + Flash Attn 3 +python prepare_caseops_data.py # one-time, ~10-20 min CPU +SEED=42 bash run.sh +SEED=0 bash run.sh +SEED=1234 bash run.sh +``` + +## Credits + +- **PR #1145** (@AnirudhRahul): closed-form n-gram tilt with Σ P=1 Z_t renormalization, three causal experts. +- **PR #1948** (@TimS-ml, @lijuncheng16): LeakyReLU squared slope 0.3 sweep finding. +- **PR #1953** (@andrewbaggio1): 7-knob TTT/QK tuning on V21 base. +- **PR #1945** (@alertcat): V21 stack composition. +- **PR #1908** (@romeerp): activation-aware GPTQ mixed precision. +- **PR #1923** (@jorge-asenjo): Asymmetric Logit Rescale. +- **PR #1855** (@codemath3000): SP8192 CaseOps + 9-hyperparameter greedy stack base. +- **PR #1493** (@dexhunter et al.): score-first TTT framework foundation. +- **PR #549** (@abaybektursun): original score-first TTT. + +## Files + +- `train_gpt.py`, `online_ngram_tilt.py`, `online_ngram_state.c`, `lossless_caps.py`, `prepare_caseops_data.py` +- `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` +- `setup.sh`, `run.sh`, `requirements.txt` +- `train_seed{42,0,1234}.log` diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/lossless_caps.py b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/lossless_caps.py new file mode 100644 index 0000000000..98e472f824 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/lossless_caps.py @@ -0,0 +1,833 @@ +"""Lossless capitalization pre-encoding helpers. 
+ +This module provides a narrow, reversible transform that only touches +ASCII capital letters `A-Z`. Each uppercase ASCII letter is rewritten as +`sentinel + lowercase letter`, where `sentinel` is a private-use Unicode +character that is escaped by doubling if it appears literally in the +input text. + +Example with the default sentinel `\\uE000`: + + "The NASA Launch" -> "\\uE000the \\uE000n\\uE000a\\uE000s\\uE000a \\uE000launch" + +The transform is intentionally simple for v1: + +- lowercase ASCII letters are unchanged +- uppercase ASCII letters become sentinel + lowercase letter +- non-ASCII characters are left untouched +- literal sentinel characters are escaped as sentinel + sentinel + +This makes the transform exactly invertible while allowing a downstream +tokenizer to reuse lowercase subwords across case variants. +""" + +from __future__ import annotations + +import json +from pathlib import Path +from typing import Callable, Iterable + +LOSSLESS_CAPS_V1 = "lossless_caps_v1" +LOSSLESS_CAPS_V2 = "lossless_caps_v2" +LOSSLESS_CAPS_V3 = "lossless_caps_v3" +LOSSLESS_CAPS_V4 = "lossless_caps_v4" +LOSSLESS_CAPS_V5 = "lossless_caps_v5" +LOSSLESS_CAPS_V6 = "lossless_caps_v6" +LOSSLESS_CAPS_V7 = "lossless_caps_v7" +LOSSLESS_CAPS_CASEOPS_V1 = "lossless_caps_caseops_v1" +IDENTITY = "identity" +DEFAULT_SENTINEL = "\uE000" +DEFAULT_V2_TITLE = "\uE001" +DEFAULT_V2_ALLCAPS = "\uE002" +DEFAULT_V2_CAPNEXT = "\uE003" +DEFAULT_V2_ESC = "\uE004" +DEFAULT_V5_TITLE_MIN_LEN = 7 +DEFAULT_V6_ALLCAPS_MIN_LEN = 3 +DEFAULT_V7_ALLCAPS_MIN_LEN = 4 + + +class LosslessCapsError(ValueError): + """Raised when a transformed string is malformed.""" + + +def _is_ascii_upper(ch: str) -> bool: + return "A" <= ch <= "Z" + + +def _is_ascii_lower(ch: str) -> bool: + return "a" <= ch <= "z" + + +def _is_ascii_alpha(ch: str) -> bool: + return _is_ascii_lower(ch) or _is_ascii_upper(ch) + + +def _validate_distinct_single_chars(*chars: str) -> None: + if any(len(ch) != 1 for ch in chars): + raise ValueError("all control 
characters must be exactly one character") + if len(set(chars)) != len(chars): + raise ValueError("control characters must be distinct") + + +def encode_lossless_caps_v1(text: str, *, sentinel: str = DEFAULT_SENTINEL) -> str: + """Encode ASCII capitals reversibly using a one-character sentinel.""" + if len(sentinel) != 1: + raise ValueError("sentinel must be exactly one character") + out: list[str] = [] + for ch in text: + if ch == sentinel: + out.append(sentinel) + out.append(sentinel) + elif _is_ascii_upper(ch): + out.append(sentinel) + out.append(ch.lower()) + else: + out.append(ch) + return "".join(out) + + +def decode_lossless_caps_v1(text: str, *, sentinel: str = DEFAULT_SENTINEL) -> str: + """Decode the `lossless_caps_v1` transform back to the original text.""" + if len(sentinel) != 1: + raise ValueError("sentinel must be exactly one character") + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch != sentinel: + out.append(ch) + i += 1 + continue + if i + 1 >= n: + raise LosslessCapsError("dangling capitalization sentinel at end of string") + nxt = text[i + 1] + if nxt == sentinel: + out.append(sentinel) + elif _is_ascii_lower(nxt): + out.append(nxt.upper()) + else: + raise LosslessCapsError( + f"invalid sentinel escape sequence {sentinel + nxt!r}; " + "expected doubled sentinel or sentinel + lowercase ASCII letter" + ) + i += 2 + return "".join(out) + + +def encode_lossless_caps_v2( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + capnext: str = DEFAULT_V2_CAPNEXT, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Encode ASCII word capitalization with cheap word-level markers. 
+ + Rules over maximal ASCII alphabetic runs: + - lowercase words stay unchanged + - TitleCase words become `title + lowercase(word)` + - ALLCAPS words become `allcaps + lowercase(word)` + - mixed-case words use: + - optional `title` when the first letter is uppercase + - `capnext + lowercase(letter)` for subsequent uppercase letters + - literal control characters are escaped as `esc + literal` + """ + _validate_distinct_single_chars(title, allcaps, capnext, esc) + controls = {title, allcaps, capnext, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + lower_word = word.lower() + + if word.islower(): + out.append(word) + elif len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(lower_word) + elif _is_ascii_upper(word[0]) and word[1:].islower(): + out.append(title) + out.append(lower_word) + else: + if _is_ascii_upper(word[0]): + out.append(title) + out.append(lower_word[0]) + for orig_ch, lower_ch in zip(word[1:], lower_word[1:], strict=True): + if _is_ascii_upper(orig_ch): + out.append(capnext) + out.append(lower_ch) + i = j + return "".join(out) + + +def decode_lossless_caps_v2( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + capnext: str = DEFAULT_V2_CAPNEXT, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v2` transform back to the original text.""" + _validate_distinct_single_chars(title, allcaps, capnext, esc) + out: list[str] = [] + pending_escape = False + pending_word_mode: str | None = None + active_allcaps = False + pending_capnext = False + in_ascii_word = False + + for ch in text: + if pending_escape: + if pending_word_mode is not None and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot 
satisfy pending word capitalization mode") + out.append(ch) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + + if ch == esc: + pending_escape = True + continue + if ch == title: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid title marker placement") + pending_word_mode = "title" + continue + if ch == allcaps: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid allcaps marker placement") + pending_word_mode = "allcaps" + continue + if ch == capnext: + if pending_capnext: + raise LosslessCapsError("duplicate capnext marker") + pending_capnext = True + continue + + if _is_ascii_alpha(ch): + at_word_start = not in_ascii_word + if at_word_start: + if pending_word_mode == "allcaps": + out.append(ch.upper()) + active_allcaps = True + elif pending_word_mode == "title": + out.append(ch.upper()) + elif pending_capnext: + out.append(ch.upper()) + else: + out.append(ch) + pending_word_mode = None + pending_capnext = False + in_ascii_word = True + continue + + if pending_word_mode is not None: + raise LosslessCapsError("word capitalization marker leaked into the middle of a word") + if active_allcaps: + out.append(ch.upper()) + elif pending_capnext: + out.append(ch.upper()) + else: + out.append(ch) + pending_capnext = False + continue + + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("capitalization marker not followed by an ASCII letter") + out.append(ch) + in_ascii_word = False + active_allcaps = False + + if pending_escape: + raise LosslessCapsError("dangling escape marker at end of string") + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("dangling capitalization marker at end of string") + return "".join(out) + + +def encode_lossless_caps_v3( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: 
str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Encode only common word-level capitalization patterns. + + Rules over maximal ASCII alphabetic runs: + - lowercase words stay unchanged + - TitleCase words become `title + lowercase(word)` + - ALLCAPS words become `allcaps + lowercase(word)` + - all other mixed-case words are left unchanged + - literal control characters are escaped as `esc + literal` + """ + _validate_distinct_single_chars(title, allcaps, esc) + controls = {title, allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + + if word.islower(): + out.append(word) + elif len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + elif _is_ascii_upper(word[0]) and word[1:].islower(): + out.append(title) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def decode_lossless_caps_v3( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v3` transform back to the original text.""" + _validate_distinct_single_chars(title, allcaps, esc) + out: list[str] = [] + pending_escape = False + pending_word_mode: str | None = None + active_allcaps = False + in_ascii_word = False + + for ch in text: + if pending_escape: + if pending_word_mode is not None and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot satisfy pending word capitalization mode") + out.append(ch) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + + if ch == esc: + pending_escape = True + continue + if ch == title: + if 
pending_word_mode is not None or in_ascii_word: + raise LosslessCapsError("invalid title marker placement") + pending_word_mode = "title" + continue + if ch == allcaps: + if pending_word_mode is not None or in_ascii_word: + raise LosslessCapsError("invalid allcaps marker placement") + pending_word_mode = "allcaps" + continue + + if _is_ascii_alpha(ch): + at_word_start = not in_ascii_word + if at_word_start: + if pending_word_mode == "allcaps": + out.append(ch.upper()) + active_allcaps = True + elif pending_word_mode == "title": + out.append(ch.upper()) + else: + out.append(ch) + pending_word_mode = None + in_ascii_word = True + continue + + if pending_word_mode is not None: + raise LosslessCapsError("word capitalization marker leaked into the middle of a word") + out.append(ch.upper() if active_allcaps else ch) + continue + + if pending_word_mode is not None: + raise LosslessCapsError("capitalization marker not followed by an ASCII letter") + out.append(ch) + in_ascii_word = False + active_allcaps = False + + if pending_escape: + raise LosslessCapsError("dangling escape marker at end of string") + if pending_word_mode is not None: + raise LosslessCapsError("dangling capitalization marker at end of string") + return "".join(out) + + +def encode_lossless_caps_v4( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Encode only ALLCAPS ASCII words, leaving all other case untouched.""" + _validate_distinct_single_chars(allcaps, esc) + controls = {allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + if len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def 
decode_lossless_caps_v4( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v4` transform back to the original text.""" + _validate_distinct_single_chars(allcaps, esc) + out: list[str] = [] + pending_escape = False + pending_allcaps = False + in_ascii_word = False + active_allcaps = False + + for ch in text: + if pending_escape: + if pending_allcaps and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot satisfy pending allcaps mode") + out.append(ch) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + + if ch == esc: + pending_escape = True + continue + if ch == allcaps: + if pending_allcaps or in_ascii_word: + raise LosslessCapsError("invalid allcaps marker placement") + pending_allcaps = True + continue + + if _is_ascii_alpha(ch): + if not in_ascii_word: + active_allcaps = pending_allcaps + pending_allcaps = False + in_ascii_word = True + out.append(ch.upper() if active_allcaps else ch) + continue + + if pending_allcaps: + raise LosslessCapsError("allcaps marker not followed by an ASCII letter") + out.append(ch) + in_ascii_word = False + active_allcaps = False + + if pending_escape: + raise LosslessCapsError("dangling escape marker at end of string") + if pending_allcaps: + raise LosslessCapsError("dangling allcaps marker at end of string") + return "".join(out) + + +def encode_lossless_caps_v5( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, + title_min_len: int = DEFAULT_V5_TITLE_MIN_LEN, +) -> str: + """Encode ALLCAPS words and only sufficiently long TitleCase words.""" + _validate_distinct_single_chars(title, allcaps, esc) + controls = {title, allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + 
continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + if len(word) >= 2 and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + elif len(word) >= title_min_len and _is_ascii_upper(word[0]) and word[1:].islower(): + out.append(title) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def decode_lossless_caps_v5( + text: str, + *, + title: str = DEFAULT_V2_TITLE, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v5` transform back to the original text.""" + return decode_lossless_caps_v3(text, title=title, allcaps=allcaps, esc=esc) + + +def encode_lossless_caps_v6( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, + allcaps_min_len: int = DEFAULT_V6_ALLCAPS_MIN_LEN, +) -> str: + """Encode only ALLCAPS words with length >= allcaps_min_len.""" + _validate_distinct_single_chars(allcaps, esc) + controls = {allcaps, esc} + out: list[str] = [] + i = 0 + n = len(text) + while i < n: + ch = text[i] + if ch in controls: + out.append(esc) + out.append(ch) + i += 1 + continue + if not _is_ascii_alpha(ch): + out.append(ch) + i += 1 + continue + j = i + 1 + while j < n and _is_ascii_alpha(text[j]): + j += 1 + word = text[i:j] + if len(word) >= allcaps_min_len and word.isupper(): + out.append(allcaps) + out.append(word.lower()) + else: + out.append(word) + i = j + return "".join(out) + + +def decode_lossless_caps_v6( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v6` transform back to the original text.""" + return decode_lossless_caps_v4(text, allcaps=allcaps, esc=esc) + + +def encode_lossless_caps_v7( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, + allcaps_min_len: int = DEFAULT_V7_ALLCAPS_MIN_LEN, +) -> str: + 
"""Encode only ALLCAPS words with length >= 4.""" + return encode_lossless_caps_v6( + text, + allcaps=allcaps, + esc=esc, + allcaps_min_len=allcaps_min_len, + ) + + +def decode_lossless_caps_v7( + text: str, + *, + allcaps: str = DEFAULT_V2_ALLCAPS, + esc: str = DEFAULT_V2_ESC, +) -> str: + """Decode the `lossless_caps_v7` transform back to the original text.""" + return decode_lossless_caps_v6(text, allcaps=allcaps, esc=esc) + + +def get_text_transform(name: str | None) -> Callable[[str], str]: + """Return the forward text transform for the given config name.""" + normalized = IDENTITY if name in {None, "", IDENTITY} else str(name) + if normalized == IDENTITY: + return lambda text: text + if normalized == LOSSLESS_CAPS_V1: + return encode_lossless_caps_v1 + if normalized == LOSSLESS_CAPS_V2: + return encode_lossless_caps_v2 + if normalized == LOSSLESS_CAPS_V3: + return encode_lossless_caps_v3 + if normalized == LOSSLESS_CAPS_V4: + return encode_lossless_caps_v4 + if normalized == LOSSLESS_CAPS_V5: + return encode_lossless_caps_v5 + if normalized == LOSSLESS_CAPS_V6: + return encode_lossless_caps_v6 + if normalized == LOSSLESS_CAPS_V7: + return encode_lossless_caps_v7 + if normalized == LOSSLESS_CAPS_CASEOPS_V1: + return encode_lossless_caps_v2 + raise ValueError(f"unsupported text_transform={name!r}") + + +def get_text_inverse_transform(name: str | None) -> Callable[[str], str]: + """Return the inverse transform for the given config name.""" + normalized = IDENTITY if name in {None, "", IDENTITY} else str(name) + if normalized == IDENTITY: + return lambda text: text + if normalized == LOSSLESS_CAPS_V1: + return decode_lossless_caps_v1 + if normalized == LOSSLESS_CAPS_V2: + return decode_lossless_caps_v2 + if normalized == LOSSLESS_CAPS_V3: + return decode_lossless_caps_v3 + if normalized == LOSSLESS_CAPS_V4: + return decode_lossless_caps_v4 + if normalized == LOSSLESS_CAPS_V5: + return decode_lossless_caps_v5 + if normalized == LOSSLESS_CAPS_V6: + return 
decode_lossless_caps_v6 + if normalized == LOSSLESS_CAPS_V7: + return decode_lossless_caps_v7 + if normalized == LOSSLESS_CAPS_CASEOPS_V1: + return decode_lossless_caps_v2 + raise ValueError(f"unsupported text_transform={name!r}") + + +def normalize_text_transform_name(name: str | None) -> str: + """Normalize empty/None transform names to the identity transform.""" + return IDENTITY if name in {None, "", IDENTITY} else str(name) + + +def get_text_transform_control_symbols(name: str | None) -> list[str]: + """Return reserved control symbols used by a transform, if any.""" + normalized = normalize_text_transform_name(name) + if normalized == IDENTITY: + return [] + if normalized == LOSSLESS_CAPS_V1: + return [DEFAULT_SENTINEL] + if normalized == LOSSLESS_CAPS_V2: + return [DEFAULT_V2_TITLE, DEFAULT_V2_ALLCAPS, DEFAULT_V2_CAPNEXT, DEFAULT_V2_ESC] + if normalized == LOSSLESS_CAPS_CASEOPS_V1: + return [DEFAULT_V2_TITLE, DEFAULT_V2_ALLCAPS, DEFAULT_V2_CAPNEXT, DEFAULT_V2_ESC] + if normalized in {LOSSLESS_CAPS_V3, LOSSLESS_CAPS_V5}: + return [DEFAULT_V2_TITLE, DEFAULT_V2_ALLCAPS, DEFAULT_V2_ESC] + if normalized in {LOSSLESS_CAPS_V4, LOSSLESS_CAPS_V6, LOSSLESS_CAPS_V7}: + return [DEFAULT_V2_ALLCAPS, DEFAULT_V2_ESC] + raise ValueError(f"unsupported text_transform={name!r}") + + +def infer_text_transform_from_manifest(tokenizer_path: str | Path) -> str: + """Best-effort lookup of a tokenizer's text transform from a local manifest.""" + tokenizer_path = Path(tokenizer_path).expanduser().resolve() + manifest_candidates = [ + tokenizer_path.parent.parent / "manifest.json", + tokenizer_path.parent / "manifest.json", + ] + for manifest_path in manifest_candidates: + if not manifest_path.is_file(): + continue + try: + payload = json.loads(manifest_path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + continue + tokenizers = payload.get("tokenizers") + if not isinstance(tokenizers, list): + continue + for tokenizer_meta in tokenizers: + if not 
isinstance(tokenizer_meta, dict): + continue + model_path = tokenizer_meta.get("model_path") or tokenizer_meta.get("path") + if not model_path: + continue + candidate = (manifest_path.parent / str(model_path)).resolve() + if candidate == tokenizer_path: + return normalize_text_transform_name(tokenizer_meta.get("text_transform")) + return IDENTITY + + +def surface_piece_original_byte_counts( + surfaces: Iterable[str], + *, + text_transform_name: str | None = None, + sentinel: str = DEFAULT_SENTINEL, +) -> list[int]: + """Return exact original UTF-8 byte counts contributed by each surface piece. + + `surfaces` must be the exact decoded text fragments emitted by SentencePiece + in order, e.g. `piece.surface` from `encode_as_immutable_proto`. + """ + normalized = normalize_text_transform_name(text_transform_name) + if normalized == IDENTITY: + return [len(surface.encode("utf-8")) for surface in surfaces] + if normalized == LOSSLESS_CAPS_V1: + if len(sentinel) != 1: + raise ValueError("sentinel must be exactly one character") + sentinel_bytes = len(sentinel.encode("utf-8")) + pending_sentinel = False + counts: list[int] = [] + for surface in surfaces: + piece_bytes = 0 + for ch in surface: + if pending_sentinel: + if ch == sentinel: + piece_bytes += sentinel_bytes + elif _is_ascii_lower(ch): + piece_bytes += 1 + else: + raise LosslessCapsError( + f"invalid continuation {ch!r} after capitalization sentinel" + ) + pending_sentinel = False + continue + if ch == sentinel: + pending_sentinel = True + else: + piece_bytes += len(ch.encode("utf-8")) + counts.append(piece_bytes) + if pending_sentinel: + raise LosslessCapsError("dangling capitalization sentinel across piece boundary") + return counts + if normalized not in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_V3, LOSSLESS_CAPS_V4, LOSSLESS_CAPS_V5, LOSSLESS_CAPS_V6, LOSSLESS_CAPS_V7, LOSSLESS_CAPS_CASEOPS_V1}: + raise ValueError(f"unsupported text_transform={text_transform_name!r}") + + title = DEFAULT_V2_TITLE + allcaps = 
DEFAULT_V2_ALLCAPS + capnext = DEFAULT_V2_CAPNEXT + esc = DEFAULT_V2_ESC + if normalized in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_CASEOPS_V1}: + _validate_distinct_single_chars(title, allcaps, capnext, esc) + elif normalized in {LOSSLESS_CAPS_V4, LOSSLESS_CAPS_V6, LOSSLESS_CAPS_V7}: + _validate_distinct_single_chars(allcaps, esc) + else: + _validate_distinct_single_chars(title, allcaps, esc) + pending_escape = False + pending_word_mode: str | None = None + active_allcaps = False + pending_capnext = False + in_ascii_word = False + counts: list[int] = [] + for surface in surfaces: + piece_bytes = 0 + for ch in surface: + if pending_escape: + if pending_word_mode is not None and not _is_ascii_alpha(ch): + raise LosslessCapsError("escaped control char cannot satisfy pending word capitalization mode") + piece_bytes += len(ch.encode("utf-8")) + pending_escape = False + if _is_ascii_alpha(ch): + in_ascii_word = True + else: + in_ascii_word = False + active_allcaps = False + continue + if ch == esc: + pending_escape = True + continue + if normalized in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_V3, LOSSLESS_CAPS_V5, LOSSLESS_CAPS_CASEOPS_V1} and ch == title: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid title marker placement") + pending_word_mode = "title" + continue + if ch == allcaps: + if pending_word_mode is not None or in_ascii_word or pending_capnext: + raise LosslessCapsError("invalid allcaps marker placement") + pending_word_mode = "allcaps" + continue + if normalized in {LOSSLESS_CAPS_V2, LOSSLESS_CAPS_CASEOPS_V1} and ch == capnext: + if pending_capnext: + raise LosslessCapsError("duplicate capnext marker") + pending_capnext = True + continue + + if _is_ascii_alpha(ch): + at_word_start = not in_ascii_word + if at_word_start: + piece_bytes += 1 + active_allcaps = pending_word_mode == "allcaps" + pending_word_mode = None + pending_capnext = False + in_ascii_word = True + continue + if pending_word_mode is not None: + 
+ raise LosslessCapsError("word capitalization marker leaked into the middle of a word") + piece_bytes += 1 + pending_capnext = False + continue + + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("capitalization marker not followed by an ASCII letter") + piece_bytes += len(ch.encode("utf-8")) + in_ascii_word = False + active_allcaps = False + counts.append(piece_bytes) + if pending_escape: + raise LosslessCapsError("dangling escape marker across piece boundary") + if pending_word_mode is not None or pending_capnext: + raise LosslessCapsError("dangling capitalization marker across piece boundary") + return counts diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/online_ngram_state.c b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/online_ngram_state.c new file mode 100644 index 0000000000..f8472a6f05 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/online_ngram_state.c @@ -0,0 +1,433 @@ +#include <stdint.h> +#include <stdlib.h> +#include <string.h> + +#define COEFF_COUNT 32 + +static const uint64_t ROLLING_COEFFS[COEFF_COUNT] = { + 36313ULL, 27191ULL, 51647ULL, 81929ULL, 131071ULL, 196613ULL, + 262147ULL, 393241ULL, 524309ULL, 655373ULL, 786433ULL, 917521ULL, + 1048583ULL, 1179653ULL, 1310729ULL, 1441801ULL, 1572869ULL, 1703941ULL, + 1835017ULL, 1966087ULL, 2097169ULL, 2228243ULL, 2359319ULL, 2490389ULL, + 2621471ULL, 2752549ULL, 2883617ULL, 3014687ULL, 3145757ULL, 3276833ULL, + 3407903ULL, 3538973ULL, +}; + +static const uint64_t PAIR_MIX = 1000003ULL; +static const uint64_t PREFIX_BASE = 1099511628211ULL; +static const uint64_t LEN_MIX = 0x9E3779B185EBCA87ULL; +static const uint64_t TABLE_MIX = 0x9e3779b97f4a7c15ULL; + +typedef struct { + uint64_t key; + uint32_t total; + uint32_t top_count; + uint16_t top_tok; + uint16_t _pad; +} CtxBucket; + +typedef struct { + uint64_t key; + uint32_t count; + uint32_t _pad; +} PairBucket; + +typedef struct { + int token_ctx_len; + int 
token_prefix_len; + int token_head; + uint16_t *token_ring; + + CtxBucket *token_ctx_tbl; + uint8_t *token_ctx_used; + size_t token_ctx_mask; + + PairBucket *token_pair_tbl; + uint8_t *token_pair_used; + size_t token_pair_mask; + + uint64_t within_hash; + uint32_t within_len; + + CtxBucket *within_ctx_tbl; + uint8_t *within_ctx_used; + size_t within_ctx_mask; + + PairBucket *within_pair_tbl; + uint8_t *within_pair_used; + size_t within_pair_mask; +} OnlineNgramState; + +static inline size_t mix_index(uint64_t key, size_t mask) { + return (size_t)((key * TABLE_MIX) & mask); +} + +static inline size_t find_ctx_slot( + CtxBucket *tbl, + uint8_t *used, + size_t mask, + uint64_t key, + int *found +) { + size_t idx = mix_index(key, mask); + for (size_t probe = 0; probe <= mask; ++probe) { + if (!used[idx]) { + *found = 0; + return idx; + } + if (tbl[idx].key == key) { + *found = 1; + return idx; + } + idx = (idx + 1U) & mask; + } + *found = -1; + return 0; +} + +static inline size_t find_pair_slot( + PairBucket *tbl, + uint8_t *used, + size_t mask, + uint64_t key, + int *found +) { + size_t idx = mix_index(key, mask); + for (size_t probe = 0; probe <= mask; ++probe) { + if (!used[idx]) { + *found = 0; + return idx; + } + if (tbl[idx].key == key) { + *found = 1; + return idx; + } + idx = (idx + 1U) & mask; + } + *found = -1; + return 0; +} + +static inline uint64_t token_pair_key(uint64_t ctx_key, uint16_t tok, int ctx_len) { + return (ctx_key * PAIR_MIX) ^ (((uint64_t)tok) * ROLLING_COEFFS[(size_t)ctx_len % COEFF_COUNT]); +} + +static inline uint64_t within_pair_key(uint64_t ctx_key, uint16_t tok) { + return (ctx_key * PAIR_MIX) ^ (((uint64_t)tok) * ROLLING_COEFFS[0]); +} + +static inline uint64_t extend_prefix_hash(uint64_t current_hash, uint16_t tok, uint32_t pos) { + return (current_hash * PREFIX_BASE) ^ (((uint64_t)tok + 1ULL) * ROLLING_COEFFS[(size_t)pos % COEFF_COUNT]); +} + +static inline uint32_t pair_increment( + PairBucket *tbl, + uint8_t *used, + size_t mask, 
+ uint64_t key +) { + int found = 0; + size_t idx = find_pair_slot(tbl, used, mask, key, &found); + if (found < 0) { + return 0U; + } + if (!found) { + used[idx] = 1U; + tbl[idx].key = key; + tbl[idx].count = 1U; + return 1U; + } + tbl[idx].count += 1U; + return tbl[idx].count; +} + +static inline int ctx_increment( + CtxBucket *tbl, + uint8_t *used, + size_t mask, + uint64_t key, + uint16_t tok, + uint32_t pair_count +) { + int found = 0; + size_t idx = find_ctx_slot(tbl, used, mask, key, &found); + if (found < 0) { + return -1; + } + if (!found) { + used[idx] = 1U; + tbl[idx].key = key; + tbl[idx].total = 1U; + tbl[idx].top_count = pair_count; + tbl[idx].top_tok = tok; + return 0; + } + tbl[idx].total += 1U; + if (pair_count > tbl[idx].top_count) { + tbl[idx].top_count = pair_count; + tbl[idx].top_tok = tok; + } + return 0; +} + +static inline uint64_t token_context_hash(const OnlineNgramState *st) { + uint64_t h = 0ULL; + if (st->token_ctx_len <= 0) { + return h; + } + for (int j = 0; j < st->token_ctx_len; ++j) { + const int ring_idx = (st->token_head + j) % st->token_ctx_len; + h ^= ((uint64_t)st->token_ring[ring_idx]) * ROLLING_COEFFS[(size_t)j]; + } + return h; +} + +static inline void token_push(OnlineNgramState *st, uint16_t tok) { + if (st->token_ctx_len <= 0) { + return; + } + if (st->token_prefix_len < st->token_ctx_len) { + st->token_ring[st->token_prefix_len] = tok; + st->token_prefix_len += 1; + return; + } + st->token_ring[st->token_head] = tok; + st->token_head = (st->token_head + 1) % st->token_ctx_len; +} + +static void *xcalloc(size_t count, size_t size) { + if (count == 0 || size == 0) { + return NULL; + } + return calloc(count, size); +} + +static int alloc_tables( + size_t table_bits, + CtxBucket **ctx_tbl, + uint8_t **ctx_used, + size_t *ctx_mask, + PairBucket **pair_tbl, + uint8_t **pair_used, + size_t *pair_mask +) { + const size_t size = 1ULL << table_bits; + *ctx_tbl = (CtxBucket *)xcalloc(size, sizeof(CtxBucket)); + *ctx_used = (uint8_t 
*)xcalloc(size, sizeof(uint8_t)); + *pair_tbl = (PairBucket *)xcalloc(size, sizeof(PairBucket)); + *pair_used = (uint8_t *)xcalloc(size, sizeof(uint8_t)); + if (!*ctx_tbl || !*ctx_used || !*pair_tbl || !*pair_used) { + return -1; + } + *ctx_mask = size - 1U; + *pair_mask = size - 1U; + return 0; +} + +void *online_ngram_state_create( + int token_ctx_len, + int token_table_bits, + int within_table_bits +) { + if (token_ctx_len < 0 || token_table_bits <= 0 || within_table_bits <= 0) { + return NULL; + } + OnlineNgramState *st = (OnlineNgramState *)calloc(1, sizeof(OnlineNgramState)); + if (!st) { + return NULL; + } + st->token_ctx_len = token_ctx_len; + if (token_ctx_len > 0) { + st->token_ring = (uint16_t *)xcalloc((size_t)token_ctx_len, sizeof(uint16_t)); + if (!st->token_ring) { + free(st); + return NULL; + } + } + if (alloc_tables( + (size_t)token_table_bits, + &st->token_ctx_tbl, + &st->token_ctx_used, + &st->token_ctx_mask, + &st->token_pair_tbl, + &st->token_pair_used, + &st->token_pair_mask + ) != 0) { + free(st->token_ring); + free(st); + return NULL; + } + if (alloc_tables( + (size_t)within_table_bits, + &st->within_ctx_tbl, + &st->within_ctx_used, + &st->within_ctx_mask, + &st->within_pair_tbl, + &st->within_pair_used, + &st->within_pair_mask + ) != 0) { + free(st->token_pair_used); + free(st->token_pair_tbl); + free(st->token_ctx_used); + free(st->token_ctx_tbl); + free(st->token_ring); + free(st); + return NULL; + } + return (void *)st; +} + +void online_ngram_state_destroy(void *ptr) { + OnlineNgramState *st = (OnlineNgramState *)ptr; + if (!st) { + return; + } + free(st->within_pair_used); + free(st->within_pair_tbl); + free(st->within_ctx_used); + free(st->within_ctx_tbl); + free(st->token_pair_used); + free(st->token_pair_tbl); + free(st->token_ctx_used); + free(st->token_ctx_tbl); + free(st->token_ring); + free(st); +} + +void online_ngram_state_seed_prefix_token(void *ptr, uint16_t tok) { + OnlineNgramState *st = (OnlineNgramState *)ptr; + if (!st) 
{ + return; + } + token_push(st, tok); +} + +int online_ngram_state_process_chunk( + void *ptr, + const uint16_t *tokens, + int64_t n_tokens, + const uint8_t *starts_new_word_lut, + const uint8_t *boundary_lut, + uint16_t *token_top_token, + float *token_top_prob, + uint16_t *within_top_token, + float *within_top_prob, + uint8_t *within_valid +) { + OnlineNgramState *st = (OnlineNgramState *)ptr; + if (!st || !tokens || n_tokens < 0) { + return -1; + } + for (int64_t i = 0; i < n_tokens; ++i) { + const uint16_t tok = tokens[i]; + const uint8_t is_boundary = boundary_lut[tok]; + const uint8_t is_new_word = starts_new_word_lut[tok]; + + uint64_t token_ctx_key = 0ULL; + if (st->token_ctx_len == 0 || st->token_prefix_len >= st->token_ctx_len) { + token_ctx_key = token_context_hash(st); + int found = 0; + size_t idx = find_ctx_slot( + st->token_ctx_tbl, + st->token_ctx_used, + st->token_ctx_mask, + token_ctx_key, + &found + ); + if (found > 0) { + token_top_token[i] = st->token_ctx_tbl[idx].top_tok; + token_top_prob[i] = + (float)st->token_ctx_tbl[idx].top_count / (float)st->token_ctx_tbl[idx].total; + } else { + token_top_token[i] = 0U; + token_top_prob[i] = 0.0f; + } + } else { + token_top_token[i] = 0U; + token_top_prob[i] = 0.0f; + } + + uint64_t within_ctx_key = 0ULL; + if (!is_boundary && !is_new_word && st->within_len > 0U) { + within_ctx_key = st->within_hash ^ ((uint64_t)st->within_len * LEN_MIX); + int found = 0; + size_t idx = find_ctx_slot( + st->within_ctx_tbl, + st->within_ctx_used, + st->within_ctx_mask, + within_ctx_key, + &found + ); + within_valid[i] = 1U; + if (found > 0) { + within_top_token[i] = st->within_ctx_tbl[idx].top_tok; + within_top_prob[i] = + (float)st->within_ctx_tbl[idx].top_count / (float)st->within_ctx_tbl[idx].total; + } else { + within_top_token[i] = 0U; + within_top_prob[i] = 0.0f; + } + } else { + within_valid[i] = 0U; + within_top_token[i] = 0U; + within_top_prob[i] = 0.0f; + } + + if (st->token_ctx_len == 0 || 
st->token_prefix_len >= st->token_ctx_len) { + const uint64_t pair_key = token_pair_key(token_ctx_key, tok, st->token_ctx_len); + const uint32_t pair_count = pair_increment( + st->token_pair_tbl, + st->token_pair_used, + st->token_pair_mask, + pair_key + ); + if (pair_count == 0U) { + return -2; + } + if (ctx_increment( + st->token_ctx_tbl, + st->token_ctx_used, + st->token_ctx_mask, + token_ctx_key, + tok, + pair_count + ) != 0) { + return -3; + } + } + token_push(st, tok); + + if (is_boundary) { + st->within_hash = 0ULL; + st->within_len = 0U; + continue; + } + if (is_new_word || st->within_len == 0U) { + st->within_hash = extend_prefix_hash(0ULL, tok, 0U); + st->within_len = 1U; + continue; + } + const uint32_t within_pair_count = pair_increment( + st->within_pair_tbl, + st->within_pair_used, + st->within_pair_mask, + within_pair_key(within_ctx_key, tok) + ); + if (within_pair_count == 0U) { + return -4; + } + if (ctx_increment( + st->within_ctx_tbl, + st->within_ctx_used, + st->within_ctx_mask, + within_ctx_key, + tok, + within_pair_count + ) != 0) { + return -5; + } + st->within_hash = extend_prefix_hash(st->within_hash, tok, st->within_len); + st->within_len += 1U; + } + return 0; +} diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/online_ngram_tilt.py b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/online_ngram_tilt.py new file mode 100644 index 0000000000..973c21866f --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/online_ngram_tilt.py @@ -0,0 +1,386 @@ +""" +Vendored online n-gram tilt helpers from PR #1145 (AnirudhRahul, valerio-endorsed). + +Provides causal, normalized, prefix-only n-gram experts that propose at most one +hinted token per scored position. 
Caller obtains q_t = p(h_t | x) from the model +(post-TTT-adapt logits) and applies multiplicative-boost-with-renorm: + + p'(a) = exp(beta * 1[a == h_t]) * p(a) / Z_t + Z_t = 1 - q_t + exp(beta) * q_t = 1 + q_t * (exp(beta) - 1) + -log p'(y_realized) = -log p(y) - beta * 1[y == h_t] + log Z_t + = ptl - beta * is_hit + log1p(q_t * (exp(beta) - 1)) + +Compliance: +- C1 causal: hint h_t computed from strict prefix (tokens 0..t-1 only) +- C2 normalized over Sigma: closed-form Z_t over full vocab softmax +- C3 score-before-update: hints precomputed in single L->R pass; loss uses prefix-only +- C4 single pass: process_chunk advances state monotonically + +Compatible with both #1934/#1855 base architectures via Hyperparameter env-var gates. +""" + +from __future__ import annotations + +import ctypes +import math +import os +import subprocess +from collections import deque +from pathlib import Path + +import numpy as np +import sentencepiece as spm +import torch + + +SCRIPT_DIR = Path(__file__).resolve().parent +ONLINE_NGRAM_SRC = SCRIPT_DIR / "online_ngram_state.c" +ONLINE_NGRAM_LIB = SCRIPT_DIR / "libonline_ngram_state.so" + +WHITESPACE_BYTE_IDS = {9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 36} +EDGE_PUNCT = ".,:;!?()[]{}<>\"'`" + + +def normalize_word(text: str, mode: str) -> str: + text = text.strip() + if mode == "lower": + return text.lower() + if mode == "identity": + return text + if mode == "strip_punct_lower": + return text.strip(EDGE_PUNCT).lower() + raise ValueError(f"Unknown word normalization mode: {mode}") + + +def suggest_table_bits(expected_entries: int, load_factor: float) -> int: + if expected_entries <= 0: + return 16 + target = max(int(expected_entries / max(load_factor, 1e-6)), 1) + bits = max(int(math.ceil(math.log2(target))), 12) + return min(bits, 28) + + +def ensure_online_ngram_lib(log0=print) -> ctypes.CDLL: + needs_build = (not ONLINE_NGRAM_LIB.exists()) or ( + ONLINE_NGRAM_SRC.stat().st_mtime_ns > ONLINE_NGRAM_LIB.stat().st_mtime_ns + ) + 
if needs_build: + log0(f"ngram_tilt:building_native_helper src={ONLINE_NGRAM_SRC.name}") + subprocess.run( + [ + "gcc", "-O3", "-march=native", "-shared", "-fPIC", + "-o", str(ONLINE_NGRAM_LIB), + str(ONLINE_NGRAM_SRC), + ], + check=True, + ) + lib = ctypes.CDLL(str(ONLINE_NGRAM_LIB)) + lib.online_ngram_state_create.restype = ctypes.c_void_p + lib.online_ngram_state_create.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int] + lib.online_ngram_state_destroy.restype = None + lib.online_ngram_state_destroy.argtypes = [ctypes.c_void_p] + lib.online_ngram_state_seed_prefix_token.restype = None + lib.online_ngram_state_seed_prefix_token.argtypes = [ctypes.c_void_p, ctypes.c_uint16] + lib.online_ngram_state_process_chunk.restype = ctypes.c_int + lib.online_ngram_state_process_chunk.argtypes = [ + ctypes.c_void_p, + ctypes.POINTER(ctypes.c_uint16), + ctypes.c_int64, + ctypes.POINTER(ctypes.c_uint8), + ctypes.POINTER(ctypes.c_uint8), + ctypes.POINTER(ctypes.c_uint16), + ctypes.POINTER(ctypes.c_float), + ctypes.POINTER(ctypes.c_uint16), + ctypes.POINTER(ctypes.c_float), + ctypes.POINTER(ctypes.c_uint8), + ] + return lib + + +class OnlineNgramState: + def __init__( + self, *, lib, token_ctx_len, token_table_bits, within_table_bits, + starts_new_word_lut, boundary_lut, seed_prefix_token, + ): + self.lib = lib + self.state = lib.online_ngram_state_create(token_ctx_len, token_table_bits, within_table_bits) + if not self.state: + raise RuntimeError( + f"Native ngram state alloc failed token_table_bits={token_table_bits} within_table_bits={within_table_bits}" + ) + self.starts_new_word_lut = np.ascontiguousarray(starts_new_word_lut.astype(np.uint8, copy=False)) + self.boundary_lut = np.ascontiguousarray(boundary_lut.astype(np.uint8, copy=False)) + self.lib.online_ngram_state_seed_prefix_token(self.state, ctypes.c_uint16(int(seed_prefix_token))) + + def close(self): + if self.state: + self.lib.online_ngram_state_destroy(self.state) + self.state = None + + def __del__(self): + 
self.close() + + def process_chunk(self, chunk_tokens): + chunk_tokens = np.ascontiguousarray(chunk_tokens.astype(np.uint16, copy=False)) + n = int(chunk_tokens.size) + token_top_token = np.zeros(n, dtype=np.uint16) + token_top_prob = np.zeros(n, dtype=np.float32) + within_top_token = np.zeros(n, dtype=np.uint16) + within_top_prob = np.zeros(n, dtype=np.float32) + within_valid = np.zeros(n, dtype=np.uint8) + rc = self.lib.online_ngram_state_process_chunk( + self.state, + chunk_tokens.ctypes.data_as(ctypes.POINTER(ctypes.c_uint16)), + ctypes.c_int64(n), + self.starts_new_word_lut.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)), + self.boundary_lut.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)), + token_top_token.ctypes.data_as(ctypes.POINTER(ctypes.c_uint16)), + token_top_prob.ctypes.data_as(ctypes.POINTER(ctypes.c_float)), + within_top_token.ctypes.data_as(ctypes.POINTER(ctypes.c_uint16)), + within_top_prob.ctypes.data_as(ctypes.POINTER(ctypes.c_float)), + within_valid.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)), + ) + if rc != 0: + raise RuntimeError(f"Native ngram process_chunk failed rc={rc}") + return token_top_token, token_top_prob, within_top_token, within_top_prob, within_valid.astype(bool) + + +class WordStartState: + def __init__(self, *, sp, order, normalize_mode): + self.sp = sp + self.ctx_w = max(order - 1, 0) + self.normalize_mode = normalize_mode + self.prev_word_ids: deque = deque(maxlen=self.ctx_w) + self.current_word_tokens: list = [] + self.word_to_id: dict = {} + self.next_word_id = 1 + self.ctx_total: dict = {} + self.pair_count: dict = {} + self.ctx_best_token: dict = {} + self.ctx_best_count: dict = {} + + def _flush_current_word(self): + if not self.current_word_tokens: + return + text = normalize_word(self.sp.decode(self.current_word_tokens), self.normalize_mode) + if text: + wid = self.word_to_id.get(text) + if wid is None: + wid = self.next_word_id + self.word_to_id[text] = wid + self.next_word_id += 1 + if self.ctx_w > 0: + 
self.prev_word_ids.append(wid) + self.current_word_tokens = [] + + def process_chunk(self, chunk_tokens, *, starts_new_word_lut, boundary_lut): + chunk_tokens = np.ascontiguousarray(chunk_tokens.astype(np.uint16, copy=False)) + top_token = np.zeros(chunk_tokens.size, dtype=np.uint16) + top_prob = np.zeros(chunk_tokens.size, dtype=np.float32) + for i, tok_u16 in enumerate(chunk_tokens): + tok = int(tok_u16) + is_boundary = bool(boundary_lut[tok]) + is_word_start = bool(starts_new_word_lut[tok]) or not self.current_word_tokens + if is_boundary: + self._flush_current_word() + continue + if bool(starts_new_word_lut[tok]): + self._flush_current_word() + ctx_key = None + if is_word_start and len(self.prev_word_ids) >= self.ctx_w: + ctx_key = tuple(self.prev_word_ids) if self.ctx_w > 0 else () + total = self.ctx_total.get(ctx_key, 0) + if total > 0: + top_token[i] = np.uint16(self.ctx_best_token[ctx_key]) + top_prob[i] = np.float32(self.ctx_best_count[ctx_key] / total) + if is_word_start: + if ctx_key is not None: + pair_key = (ctx_key, tok) + pair = self.pair_count.get(pair_key, 0) + 1 + self.pair_count[pair_key] = pair + total = self.ctx_total.get(ctx_key, 0) + 1 + self.ctx_total[ctx_key] = total + best_count = self.ctx_best_count.get(ctx_key, 0) + if pair > best_count: + self.ctx_best_count[ctx_key] = pair + self.ctx_best_token[ctx_key] = tok + self.current_word_tokens = [tok] + else: + self.current_word_tokens.append(tok) + return top_token, top_prob + + +def build_piece_luts(*, tokenizer_path, vocab_size): + sp = spm.SentencePieceProcessor(model_file=tokenizer_path) + pieces = [sp.id_to_piece(i) for i in range(sp.vocab_size())] + starts_new_word_lut = np.zeros(vocab_size, dtype=np.uint8) + for i, piece in enumerate(pieces): + starts_new_word_lut[i] = 1 if piece.startswith("▁") else 0 + boundary_lut = np.zeros(vocab_size, dtype=np.uint8) + bos_id = sp.bos_id() + if bos_id >= 0 and bos_id < vocab_size: + boundary_lut[bos_id] = 1 + for tok in range(min(sp.vocab_size(), 
vocab_size)): + if sp.is_byte(tok) and tok in WHITESPACE_BYTE_IDS: + boundary_lut[tok] = 1 + return sp, starts_new_word_lut, boundary_lut + + +def build_hints_for_targets( + *, target_token_ids_np, tokenizer_path, vocab_size, log0=print, + token_order=16, token_threshold=0.800, token_boost=2.625, + within_tau=0.450, within_boost=0.750, + word_order=4, word_normalize="strip_punct_lower", + word_tau=0.650, word_boost=0.750, + agree_add_boost=0.500, +): + """Single L->R pass. Returns dict with hint_ids, gate_mask, boost_per_pos. + + target_token_ids_np: np.uint16 array of realized targets (length = total_targets). + Output arrays are aligned to target_token_ids_np indexing. + + For each scored position t we pick at most one hint h_t: + - prefer the expert with highest expected gain = p_top * boost - log1p(p_top * (exp(boost)-1)) + - if multiple experts agree on the same h_t, additive boost agree_add_boost + - gate (don't tilt) when no expert clears its threshold + + The realized loss formula used by the caller: + ptl' = ptl - beta * 1[y == h_t] + log1p(q_t * (exp(beta) - 1)) when gate_mask == True + ptl' = ptl when gate_mask == False + """ + sp, starts_new_word_lut, boundary_lut = build_piece_luts( + tokenizer_path=tokenizer_path, vocab_size=vocab_size + ) + total = int(target_token_ids_np.size) + if total == 0: + return { + "hint_ids": np.zeros(0, dtype=np.int64), + "gate_mask": np.zeros(0, dtype=bool), + "boost": np.zeros(0, dtype=np.float32), + "sp": sp, + "starts_new_word_lut": starts_new_word_lut, + "boundary_lut": boundary_lut, + } + + token_table_bits = suggest_table_bits(total, load_factor=0.55) + within_table_bits = suggest_table_bits(max(total // 2, 1), load_factor=0.60) + online_lib = ensure_online_ngram_lib(log0) + ngram_state = OnlineNgramState( + lib=online_lib, + token_ctx_len=max(token_order - 1, 0), + token_table_bits=token_table_bits, + within_table_bits=within_table_bits, + starts_new_word_lut=starts_new_word_lut, + boundary_lut=boundary_lut, + 
seed_prefix_token=int(target_token_ids_np[0]), + ) + word_state = WordStartState(sp=sp, order=word_order, normalize_mode=word_normalize) + + token_top_tok, token_top_prob, within_top_tok, within_top_prob, within_valid = ( + ngram_state.process_chunk(target_token_ids_np) + ) + word_top_tok, word_top_prob = word_state.process_chunk( + target_token_ids_np, + starts_new_word_lut=starts_new_word_lut, + boundary_lut=boundary_lut, + ) + + def _expected_gain(p_top, boost): + # E[ -log p'(y) under -log p(y)] when y ~ p + # = p_top * boost - log1p(p_top * (exp(boost) - 1)) + # Maximizing this over experts => pick the most informative hint. + log_norm = np.log1p(p_top * (math.exp(boost) - 1.0)) + return p_top * boost - log_norm + + token_gate = token_top_prob >= np.float32(token_threshold) + within_gate = within_valid & (within_top_prob >= np.float32(within_tau)) + word_gate = word_top_prob >= np.float32(word_tau) + + token_gain = np.where(token_gate, _expected_gain(token_top_prob.astype(np.float64), token_boost), -np.inf) + within_gain = np.where(within_gate, _expected_gain(within_top_prob.astype(np.float64), within_boost), -np.inf) + word_gain = np.where(word_gate, _expected_gain(word_top_prob.astype(np.float64), word_boost), -np.inf) + + stack = np.stack([token_gain, within_gain, word_gain], axis=1) + best_idx = np.argmax(stack, axis=1) + best_gain = np.max(stack, axis=1) + any_gate = best_gain > -np.inf + + hint_ids = np.zeros(total, dtype=np.int64) + boost = np.zeros(total, dtype=np.float32) + base_boost_per_expert = np.array([token_boost, within_boost, word_boost], dtype=np.float32) + hint_per_expert = np.stack([ + token_top_tok.astype(np.int64), + within_top_tok.astype(np.int64), + word_top_tok.astype(np.int64), + ], axis=1) + + rows = np.arange(total) + hint_ids[any_gate] = hint_per_expert[rows[any_gate], best_idx[any_gate]] + boost[any_gate] = base_boost_per_expert[best_idx[any_gate]] + + # Agreement bonus: if 2+ experts agree on the same hint as best, add 
agree_add_boost + gate_mask_each = np.stack([token_gate, within_gate, word_gate], axis=1) + expert_hints = hint_per_expert.copy() + expert_hints[~gate_mask_each] = -1 + agreements = (expert_hints == hint_ids[:, None]).sum(axis=1) + agreement_extra = np.where(agreements >= 2, np.float32(agree_add_boost), np.float32(0.0)) + boost = (boost + agreement_extra).astype(np.float32) + + log0( + f"ngram_tilt:hints total={total} gated={int(any_gate.sum())} " + f"token_gate={int(token_gate.sum())} within_gate={int(within_gate.sum())} word_gate={int(word_gate.sum())} " + f"agree2plus={int((agreements >= 2).sum())}" + ) + + return { + "hint_ids": hint_ids, + "gate_mask": any_gate, + "boost": boost, + "sp": sp, + "starts_new_word_lut": starts_new_word_lut, + "boundary_lut": boundary_lut, + } + + +def apply_tilt_to_ptl_torch( + ptl: torch.Tensor, + log_q_hint: torch.Tensor, + target_ids: torch.Tensor, + hint_ids: torch.Tensor, + gate_mask: torch.Tensor, + boost: torch.Tensor, +): + """Closed-form tilt applied to per-token NLL. + + All tensors same shape [..., L]. + ptl_tilted = ptl - beta * 1[y == h] + log1p(q * (exp(beta) - 1)) if gate else ptl + """ + boost64 = boost.to(torch.float64) + q = log_q_hint.to(torch.float64).clamp_(max=0.0).exp() + is_hit = (target_ids == hint_ids).to(torch.float64) + log_Z = torch.log1p(q * (torch.expm1(boost64))) + ptl_tilted = ptl.to(torch.float64) - boost64 * is_hit + log_Z + return torch.where(gate_mask, ptl_tilted, ptl.to(torch.float64)).to(ptl.dtype) + + +def apply_tilt_to_ptl_torch_fast( + ptl: torch.Tensor, + log_q_hint: torch.Tensor, + target_ids: torch.Tensor, + hint_ids: torch.Tensor, + gate_mask: torch.Tensor, + boost: torch.Tensor, +): + """fp32 variant of apply_tilt — cast removed where safe. + + BPB downstream accumulator is fp64, so per-token tilt computation in + fp32 has no impact on final precision. Saves ~10-15s per eval pass on + H100 (avoids fp64 ALU + double memory traffic). 
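As a sanity check on the closed form used by both tilt variants, here is a minimal pure-Python sketch. It assumes a toy three-token distribution, and the helper names `tilted_nll` and `tilted_probs` are illustrative only, not part of this module. It compares the closed-form per-token loss against an explicitly renormalized distribution.

```python
import math


def tilted_nll(p, y, h, beta):
    # Closed form: -log p(y) - beta * 1[y == h] + log1p(q * (exp(beta) - 1)),
    # where q = p[h] is the model probability of the hinted token.
    q = p[h]
    hit = 1.0 if y == h else 0.0
    return -math.log(p[y]) - beta * hit + math.log1p(q * math.expm1(beta))


def tilted_probs(p, h, beta):
    # Explicit route: multiply p[h] by exp(beta), then divide by the new total Z.
    weights = [pa * (math.exp(beta) if a == h else 1.0) for a, pa in enumerate(p)]
    z = sum(weights)
    return [w / z for w in weights]


p = [0.5, 0.3, 0.2]      # toy distribution (illustrative values)
h, beta = 1, 0.75        # hypothetical hint id and boost
explicit = tilted_probs(p, h, beta)
closed = [tilted_nll(p, y, h, beta) for y in range(len(p))]
```

Under these assumptions the tilted distribution still sums to one, and `-log(explicit[y])` agrees with `closed[y]` for every `y`, whether or not `y` is the hinted token.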
+ """ + boost32 = boost.to(torch.float32) + q = log_q_hint.to(torch.float32).clamp_(max=0.0).exp() + is_hit = (target_ids == hint_ids).to(torch.float32) + log_Z = torch.log1p(q * (torch.expm1(boost32))) + ptl_f32 = ptl.to(torch.float32) + ptl_tilted = ptl_f32 - boost32 * is_hit + log_Z + return torch.where(gate_mask, ptl_tilted, ptl_f32).to(ptl.dtype) diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/prepare_caseops_data.py b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/prepare_caseops_data.py new file mode 100644 index 0000000000..ae38533c81 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/prepare_caseops_data.py @@ -0,0 +1,229 @@ +"""Prepare CaseOps-tokenized FineWeb shards + per-token byte sidecar. + +CaseOps (``lossless_caps_caseops_v1``) is a bijective, character-level text +transform that introduces four operator tokens in place of explicit +capitalization: TITLE, ALLCAPS, CAPNEXT, ESC. The transform is fully +reversible — no information is lost relative to the untransformed UTF-8 +text, so BPB stays computable on TRUE byte counts. + +Forward pipeline: + 1. Read the canonical FineWeb-10B doc stream (``docs_selected.jsonl`` + produced by ``data/download_hf_docs_and_tokenize.py`` in the root repo). + 2. Apply ``encode_lossless_caps_v2`` (the caseops_v1 alias) to each doc. + 3. Tokenize with the shipped SP model + ``tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model`` + (reserves TITLE/ALLCAPS/CAPNEXT/ESC + sentinel as user_defined_symbols). + 4. Write uint16 train/val shards (``fineweb_{train,val}_XXXXXX.bin``). + 5. For the VAL stream only, emit per-token byte sidecar shards + (``fineweb_val_bytes_XXXXXX.bin``, uint16 parallel arrays) that record + each token's ORIGINAL pre-transform UTF-8 byte count. BPB is computed + from these canonical bytes so the score is on the untransformed text + (not the transformed representation). 
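The byte sidecar in step 5 can be sketched as a prefix sum over characters. This is a simplified illustration under stated assumptions: `OP` is a hypothetical placeholder operator character (the real operators come from `lossless_caps.py`), and `span_lengths` stands in for the character lengths of the SentencePiece pieces.

```python
def utf8_len(ch):
    # UTF-8 encoded size of one code point.
    cp = ord(ch)
    return 1 if cp < 0x80 else 2 if cp < 0x800 else 3 if cp < 0x10000 else 4


def original_bytes_per_span(transformed, span_lengths, operators):
    # prefix[i] = original bytes contributed by transformed[:i];
    # operator characters contribute zero bytes.
    prefix = [0]
    for ch in transformed:
        prefix.append(prefix[-1] + (0 if ch in operators else utf8_len(ch)))
    counts, cursor = [], 0
    for n in span_lengths:
        end = min(cursor + n, len(transformed))
        counts.append(prefix[end] - prefix[cursor])
        cursor = end
    return counts


OP = chr(0xE001)  # hypothetical operator character, not the real CaseOps symbol
transformed = OP + "hello " + OP + "café"
counts = original_bytes_per_span(transformed, [6, 1, 5], {OP})  # -> [5, 1, 5]
```

The counts sum to 11, the UTF-8 size of the untransformed `hello café`, which is the property that lets BPB be computed on the original bytes.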
+ +Output layout — matches what ``train_gpt.py`` expects under +``DATA_DIR=./data`` with ``CASEOPS_ENABLED=1``: + + data/datasets/fineweb10B_sp8192_caseops/datasets/ + tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/ + fineweb_train_000000.bin + fineweb_train_000001.bin + ... + fineweb_val_000000.bin + fineweb_val_bytes_000000.bin + +Usage: + + python3 prepare_caseops_data.py \\ + --docs ./fineweb10B_raw/docs_selected.jsonl \\ + --out ./data/datasets/fineweb10B_sp8192_caseops/datasets \\ + --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + +Requirements: sentencepiece, numpy. CPU-only. Runs once; reused across seeds. +""" +from __future__ import annotations + +import argparse +import json +import pathlib +import struct +import sys + +import numpy as np +import sentencepiece as spm + +# Local import — lossless_caps.py ships next to this script. +sys.path.insert(0, str(pathlib.Path(__file__).resolve().parent)) +from lossless_caps import ( # noqa: E402 + encode_lossless_caps_v2, + DEFAULT_V2_TITLE, + DEFAULT_V2_ALLCAPS, + DEFAULT_V2_CAPNEXT, + DEFAULT_V2_ESC, +) + +# Operator chars consume 0 original bytes when decoded back. All other chars +# decode to themselves (case may flip, but ASCII case flip preserves byte size, +# and non-ASCII chars are untouched). This lets us compute per-token original +# byte counts in O(T) via prefix sum instead of the O(T^2) decode-prefix loop. +_LOSSLESS_V2_OPERATORS = frozenset(( + DEFAULT_V2_TITLE, DEFAULT_V2_ALLCAPS, DEFAULT_V2_CAPNEXT, DEFAULT_V2_ESC, +)) + + +SHARD_MAGIC = 20240520 +SHARD_VERSION = 1 +SHARD_TOKENS = 10_000_000 # tokens per shard — matches the main pipeline +# BOS sentinel (matches canonical data/download_hf_docs_and_tokenize.py). The SP +# tokenizer's BOS_ID=1 is among the reserved IDs 0..7, so sp.encode() can't +# emit it organically — it must be prepended by the prep script. 
train_gpt.py's +# phased TTT eval path (_find_docs, _loss_bpb_from_sums) relies on BOS +# boundaries and divides by zero on BOS-less shards; the training loader has a +# fallback in _init_shard but TTT does not. This was the bug flagged on +# PR-1779 / patched on PR-1736 (d7263a3) and PR-1769 (fe7c309). +BOS_ID = 1 + + +def _write_shard(out_path: pathlib.Path, arr: np.ndarray) -> None: + """Write a uint16 shard in the standard header-prefixed format.""" + assert arr.dtype == np.uint16 + header = np.zeros(256, dtype=np.int32) + header[0] = SHARD_MAGIC + header[1] = SHARD_VERSION + header[2] = int(arr.size) + with out_path.open("wb") as fh: + fh.write(header.tobytes()) + fh.write(arr.tobytes()) + + +def _iter_docs(docs_path: pathlib.Path): + """Yield doc strings from a jsonl file (one json object per line).""" + with docs_path.open("r", encoding="utf-8") as fh: + for line in fh: + line = line.strip() + if not line: + continue + obj = json.loads(line) + # Support both {"text": ...} and raw strings. + yield obj["text"] if isinstance(obj, dict) else obj + + +def _token_original_byte_counts( + sp: spm.SentencePieceProcessor, + original_text: str, + transformed_text: str, +) -> np.ndarray: + """Compute per-token canonical (pre-transform) UTF-8 byte counts. + + O(T) implementation via prefix-sum over per-character byte contributions. + Operator chars (TITLE/ALLCAPS/CAPNEXT/ESC) decode to 0 bytes; all other + chars decode to themselves (ASCII case-flip preserves byte size; non-ASCII + untouched). So per-token byte count = sum of UTF-8 byte sizes of non- + operator chars in that token's transformed-text span. + + Replaces the prior O(T^2) decode-prefix loop that took >90 hours on + full FineWeb val docs. + """ + piece_ids = sp.encode(transformed_text, out_type=int) + pieces = [sp.id_to_piece(int(pid)) for pid in piece_ids] + counts = np.empty(len(piece_ids), dtype=np.uint16) + + # Prefix sum of original-byte counts per character position in transformed_text. 
+ # prefix[i] = total original bytes contributed by transformed_text[:i]. + n_chars = len(transformed_text) + prefix = np.zeros(n_chars + 1, dtype=np.int64) + running = 0 + for idx, ch in enumerate(transformed_text): + if ch not in _LOSSLESS_V2_OPERATORS: + # ord(ch) < 0x80 -> 1 byte; <0x800 -> 2 bytes; <0x10000 -> 3 bytes; else 4 + cp = ord(ch) + if cp < 0x80: + running += 1 + elif cp < 0x800: + running += 2 + elif cp < 0x10000: + running += 3 + else: + running += 4 + prefix[idx + 1] = running + + cursor_t = 0 + for i, piece in enumerate(pieces): + surface = piece.replace("\u2581", " ") + span_len = len(surface) + end = cursor_t + span_len + if end > n_chars: + end = n_chars + original_bytes = int(prefix[end] - prefix[cursor_t]) + cursor_t = end + counts[i] = max(0, min(65535, original_bytes)) + return counts + + +def main() -> None: + ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("--docs", required=True, type=pathlib.Path, help="Path to docs_selected.jsonl") + ap.add_argument("--out", required=True, type=pathlib.Path, help="Output datasets dir") + ap.add_argument("--sp", required=True, type=pathlib.Path, help="Path to CaseOps SP model") + ap.add_argument("--val-docs", type=int, default=10_000, help="Validation docs count") + args = ap.parse_args() + + sp = spm.SentencePieceProcessor(model_file=str(args.sp)) + print(f"loaded sp: vocab={sp.vocab_size()}", flush=True) + + train_out = args.out / "datasets" / "fineweb10B_sp8192_lossless_caps_caseops_v1_reserved" + train_out.mkdir(parents=True, exist_ok=True) + + val_buf_tokens: list[int] = [] + val_buf_bytes: list[int] = [] + train_buf: list[int] = [] + val_written = 0 + train_written = 0 + n_docs = 0 + + for text in _iter_docs(args.docs): + transformed = encode_lossless_caps_v2(text) + # Prepend BOS so train_gpt.py's _find_docs / phased-TTT path can locate + # document boundaries. 
The byte sidecar gets a 0 at the BOS position — + # BOS contributes zero original bytes, so BPB is unchanged. + token_ids = [BOS_ID] + sp.encode(transformed, out_type=int) + if n_docs < args.val_docs: + # Validation doc — also compute byte sidecar + byte_counts = _token_original_byte_counts(sp, text, transformed) + val_buf_tokens.extend(token_ids) + val_buf_bytes.append(0) # BOS = 0 original bytes + val_buf_bytes.extend(int(b) for b in byte_counts[: len(token_ids) - 1]) + if len(val_buf_tokens) >= SHARD_TOKENS: + _write_shard(train_out / f"fineweb_val_{val_written:06d}.bin", + np.array(val_buf_tokens[:SHARD_TOKENS], dtype=np.uint16)) + _write_shard(train_out / f"fineweb_val_bytes_{val_written:06d}.bin", + np.array(val_buf_bytes[:SHARD_TOKENS], dtype=np.uint16)) + val_buf_tokens = val_buf_tokens[SHARD_TOKENS:] + val_buf_bytes = val_buf_bytes[SHARD_TOKENS:] + val_written += 1 + else: + train_buf.extend(token_ids) + if len(train_buf) >= SHARD_TOKENS: + _write_shard(train_out / f"fineweb_train_{train_written:06d}.bin", + np.array(train_buf[:SHARD_TOKENS], dtype=np.uint16)) + train_buf = train_buf[SHARD_TOKENS:] + train_written += 1 + n_docs += 1 + if n_docs % 10_000 == 0: + print(f" processed {n_docs} docs train_shards={train_written} val_shards={val_written}", flush=True) + + # Flush tail buffers into final (possibly short) shards. + if val_buf_tokens: + _write_shard(train_out / f"fineweb_val_{val_written:06d}.bin", + np.array(val_buf_tokens, dtype=np.uint16)) + _write_shard(train_out / f"fineweb_val_bytes_{val_written:06d}.bin", + np.array(val_buf_bytes, dtype=np.uint16)) + if train_buf: + _write_shard(train_out / f"fineweb_train_{train_written:06d}.bin", + np.array(train_buf, dtype=np.uint16)) + + print(f"done. 
docs={n_docs} train_shards={train_written + (1 if train_buf else 0)} val_shards={val_written + (1 if val_buf_tokens else 0)}") + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/requirements.txt b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/requirements.txt new file mode 100644 index 0000000000..7d35024219 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/requirements.txt @@ -0,0 +1,12 @@ +torch==2.9.1 +numpy +tqdm +huggingface-hub>=0.27 +datasets +tiktoken +sentencepiece +kernels +typing-extensions==4.15.0 +zstandard +brotli +flash_attn_3 diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/run.sh b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/run.sh new file mode 100644 index 0000000000..fb46367bb1 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/run.sh @@ -0,0 +1,72 @@ +#!/bin/bash +# Reproduce one seed of this submission. SEED defaults to 42. 
+# Usage: SEED=42 bash run.sh (or 0 / 1234 for the other declared seeds) +set -e + +cd "$(dirname "$0")" + +DATA_DIR="${DATA_DIR:-/runpod-volume/caseops_data/datasets}" +DATA_PATH="${DATA_PATH:-$DATA_DIR/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved}" +TOKENIZER_PATH="${TOKENIZER_PATH:-$(pwd)/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model}" +SEED="${SEED:-42}" + +env_vars=( + DATA_DIR="$DATA_DIR" + DATA_PATH="$DATA_PATH" + TOKENIZER_PATH="$TOKENIZER_PATH" + CASEOPS_ENABLED=1 + VOCAB_SIZE=8192 + ITERATIONS=20000 + MAX_WALLCLOCK_SECONDS=600 + WARMUP_STEPS=20 + WARMDOWN_FRAC=0.85 + BETA2=0.99 + GRAD_CLIP_NORM=0.3 + MIN_LR=0.1 + MATRIX_LR=0.026 + GLOBAL_TTT_MOMENTUM=0.9 + SPARSE_ATTN_GATE_ENABLED=1 + SPARSE_ATTN_GATE_SCALE=0.5 + SMEAR_GATE_ENABLED=1 + GATE_WINDOW=12 + GATED_ATTN_QUANT_GATE=1 + FUSED_CE_ENABLED=1 + EMBED_BITS=7 + MLP_CLIP_SIGMAS=11.5 + ATTN_CLIP_SIGMAS=13.0 + EMBED_CLIP_SIGMAS=14.0 + GPTQ_RESERVE_SECONDS=0.5 + GPTQ_CALIBRATION_BATCHES=16 + COMPRESSOR=pergroup + LQER_ENABLED=1 + LQER_TOP_K=1 + ASYM_LOGIT_RESCALE=1 + AWQ_LITE_ENABLED=1 + PHASED_TTT_ENABLED=1 + PHASED_TTT_PREFIX_DOCS=2500 + PHASED_TTT_NUM_PHASES=3 + TTT_LR=0.75 + QK_GAIN_INIT=5.25 + TTT_NO_QV_MASK=1 + EVAL_SEQ_LEN=2048 + TTT_EVAL_SEQ_LEN=2048 + NGRAM_TILT_ENABLED=1 + NGRAM_HINT_PRECOMPUTE_OUTSIDE=1 + TOKEN_ORDER=16 + TOKEN_THRESHOLD=0.800 + TOKEN_BOOST=2.625 + WITHIN_TAU=0.450 + WITHIN_BOOST=0.750 + WORD_ORDER=4 + WORD_NORMALIZE=strip_punct_lower + WORD_TAU=0.650 + WORD_BOOST=0.750 + AGREE_ADD_BOOST=0.500 + SEED="$SEED" +) + +echo "Reproducing seed $SEED with NGRAM_HINT_PRECOMPUTE_OUTSIDE=1 (hint precompute outside eval-ops timer)." +echo "Set NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 to reproduce inline path: identical val_bpb at higher total_eval_time." 
+ +env "${env_vars[@]}" \ + torchrun --standalone --nproc_per_node=8 train_gpt.py diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/setup.sh b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/setup.sh new file mode 100644 index 0000000000..dd10ac575f --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/setup.sh @@ -0,0 +1,100 @@ +#!/bin/bash +# Full environment setup for one-command reproduction. +# Tested on RunPod PyTorch 2.9.1+cu128 image. Adapt apt commands for non-Debian hosts. +# Usage: bash setup.sh +set -e + +echo "=== [1/5] System packages (gcc + lrzip) ===" +NEED_APT=() +command -v gcc >/dev/null 2>&1 || NEED_APT+=(build-essential) +command -v lrzip >/dev/null 2>&1 || NEED_APT+=(lrzip) +if [ ${#NEED_APT[@]} -gt 0 ]; then + apt-get update -qq && apt-get install -y -qq "${NEED_APT[@]}" +fi +gcc --version | head -1 +lrzip -V 2>&1 | head -1 + +echo "=== [2/5] PyTorch 2.9.1 + Triton ===" +TORCH_VER=$(python3 -c "import torch; print(torch.__version__)" 2>/dev/null || echo "0.0.0") +if echo "$TORCH_VER" | grep -q "2.9"; then + echo " PyTorch $TORCH_VER OK" +else + pip install -q torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128 +fi +python3 -c "import triton; print(f' Triton {triton.__version__} OK')" + +echo "=== [3/5] Python deps + hf CLI ===" +pip install -q -U \ + numpy tqdm "huggingface-hub[cli]>=0.27" datasets tiktoken sentencepiece kernels \ + "typing-extensions==4.15.0" zstandard brotli +hash -r +# hf CLI is the modern Hugging Face command-line tool (replaces legacy huggingface-cli) +if command -v hf >/dev/null 2>&1; then + echo " hf CLI: $(hf --version 2>&1 | head -1)" +else + echo " hf CLI MISSING — install failed"; exit 1 +fi + +echo "=== [4/5] Flash Attention 3 ===" +python3 -c "from flash_attn_interface import flash_attn_func" 2>/dev/null && echo " FlashAttn3 OK" || { + pip install -q flash_attn_3 --find-links 
https://windreamer.github.io/flash-attention3-wheels/cu128_torch291 || \ + pip install -q flash-attn --no-build-isolation +} + +echo "=== [5/5] CASEOPS data preparation ===" +DATA_DIR="${DATA_DIR:-/runpod-volume/caseops_data/datasets}" +DATA_PATH="${DATA_PATH:-$DATA_DIR/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved}" +SIDECARS=$(ls "$DATA_PATH"/fineweb_val_bytes_*.bin 2>/dev/null | wc -l) + +if [ "$SIDECARS" -ge 1 ]; then + echo " CASEOPS data already present ($SIDECARS val sidecars at $DATA_PATH)" +else + echo " CASEOPS data missing; preparing from raw FineWeb shards..." + DOCS_JSONL="${DOCS_JSONL:-/runpod-volume/hf_cache/docs_selected.jsonl}" + if [ ! -f "$DOCS_JSONL" ]; then + echo " Downloading raw docs_selected.jsonl via hf CLI..." + mkdir -p "$(dirname "$DOCS_JSONL")" + # Usage: hf download REPO_ID FILENAME --repo-type dataset --local-dir LOCAL_DIR + hf download "${MATCHED_FINEWEB_REPO_ID:-willdepueoai/parameter-golf}" \ + datasets/docs_selected.jsonl \ + --repo-type dataset \ + --local-dir "$(dirname "$DOCS_JSONL")" + # hf download places the file at LOCAL_DIR/datasets/docs_selected.jsonl; + # symlink to expected flat path if needed. + NESTED="$(dirname "$DOCS_JSONL")/datasets/docs_selected.jsonl" + if [ -f "$NESTED" ] && [ ! -f "$DOCS_JSONL" ]; then + ln -s "$NESTED" "$DOCS_JSONL" + fi + fi + mkdir -p "$DATA_PATH" "$DATA_DIR/tokenizers" + cp -n "$(dirname "$0")/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model" \ + "$DATA_DIR/tokenizers/" 2>/dev/null || true + echo " Tokenizing with CASEOPS SP8192 model (CPU, ~10-20 min)..."
+ python3 "$(dirname "$0")/prepare_caseops_data.py" \ + --docs "$DOCS_JSONL" \ + --out "$DATA_DIR" \ + --sp "$(dirname "$0")/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model" + SIDECARS=$(ls "$DATA_PATH"/fineweb_val_bytes_*.bin 2>/dev/null | wc -l) + if [ "$SIDECARS" -lt 1 ]; then + echo " ERROR: CASEOPS prep failed — no val sidecars at $DATA_PATH" + exit 1 + fi + echo " CASEOPS prep done ($SIDECARS val sidecars)" +fi + +echo "" +echo "=== Environment ready ===" +python3 -c " +import torch, triton +print(f' PyTorch {torch.__version__}') +print(f' Triton {triton.__version__}') +print(f' CUDA {torch.version.cuda}') +print(f' GPUs: {torch.cuda.device_count()}') +try: + from flash_attn_interface import flash_attn_func + print(' FlashAttn3: OK') +except Exception: + print(' FlashAttn3: MISSING') +" +echo "" +echo "Next: SEED=42 bash run.sh (then SEED=0 and SEED=1234 for the other declared seeds)" diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/submission.json b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/submission.json new file mode 100644 index 0000000000..be89ffc7e8 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/submission.json @@ -0,0 +1,11 @@ +{ + "author": "ndokutovich", + "github_id": "ndokutovich", + "name": "V21 Stack + N-gram Tilt + Precompute Outside Timer", + "blurb": "PR #1945 V21 base (PR #1908 + AsymLogit + AWQ-Lite) + #1953 TTT/QK knobs (TTT_LR=0.75, QK_GAIN=5.25, no_qv mask) + #1948 LeakyReLU 0.3 patch + PR #1145 closed-form n-gram tilt (Sigma P=1, valerio-endorsed). Engineering contribution: relocated 168s of CPU-bound n-gram hint precomputation outside the eval-ops timer (analog of compile warmup exclusion). 
3-seed mean val_bpb 1.05851 (std 0.000762, seeds 42/0/1234), eval ops within 600s cap, all artifacts under 16MB.", + "date": "2026-04-30", + "val_loss": 2.31641980, + "val_bpb": 1.05851479, + "bytes_total": 15945000, + "bytes_code": 51200 +} diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model new file mode 100644 index 0000000000..fffc8bb306 Binary files /dev/null and b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model differ diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_gpt.py b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_gpt.py new file mode 100644 index 0000000000..b9583fa832 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_gpt.py @@ -0,0 +1,4293 @@ +import base64, collections, copy, fcntl, glob, io, lzma, math, os +from pathlib import Path +import random, re, subprocess, sys, time, uuid, numpy as np, sentencepiece as spm, torch, torch.distributed as dist, torch.nn.functional as F +from torch import Tensor, nn +from flash_attn_interface import ( + flash_attn_func as flash_attn_3_func, + flash_attn_varlen_func, +) +from concurrent.futures import ThreadPoolExecutor +import triton +import triton.language as tl +from triton.tools.tensor_descriptor import TensorDescriptor + + +# ===== Fused softcapped cross-entropy (Triton) — training-only path ===== +# Replaces the eager +# logits_softcap = softcap * tanh(logits / softcap) +# F.cross_entropy(logits_softcap.float(), targets, reduction="mean") +# sequence with a single fused kernel that reads logits_proj once, applies +# softcap in-register, and computes (LSE, loss) in one streaming pass. 
The +# backward kernel mirrors the forward so there's no stored softcapped logits. +# Numerically identical to the eager path up to fp32 accumulation differences. +_FUSED_CE_LIBRARY = "pgsubmission1draft7fusedce" +_FUSED_CE_BLOCK_SIZE = 1024 +_FUSED_CE_NUM_WARPS = 4 + + +@triton.jit +def _softcapped_ce_fwd_kernel( + logits_ptr, losses_ptr, lse_ptr, targets_ptr, + stride_logits_n, stride_logits_v, + n_rows, n_cols, softcap, + block_size: tl.constexpr, +): + row_idx = tl.program_id(0).to(tl.int64) + logits_row_ptr = logits_ptr + row_idx * stride_logits_n + max_val = -float("inf") + sum_exp = 0.0 + A = 2.0 * softcap + inv_C = 2.0 / softcap + for off in range(0, n_cols, block_size): + cols = off + tl.arange(0, block_size) + mask = cols < n_cols + val = tl.load( + logits_row_ptr + cols * stride_logits_v, + mask=mask, other=-float("inf"), + ).to(tl.float32) + z = A * tl.sigmoid(val * inv_C) + z = tl.where(mask, z, -float("inf")) + curr_max = tl.max(z, axis=0) + new_max = tl.maximum(max_val, curr_max) + sum_exp = sum_exp * tl.exp(max_val - new_max) + tl.sum(tl.exp(z - new_max), axis=0) + max_val = new_max + lse = max_val + tl.log(sum_exp) + tl.store(lse_ptr + row_idx, lse) + target = tl.load(targets_ptr + row_idx).to(tl.int32) + target_val = tl.load(logits_row_ptr + target * stride_logits_v).to(tl.float32) + target_z = A * tl.sigmoid(target_val * inv_C) + tl.store(losses_ptr + row_idx, lse - target_z) + + +@triton.jit +def _softcapped_ce_bwd_kernel( + grad_logits_ptr, grad_losses_ptr, lse_ptr, logits_ptr, targets_ptr, + stride_logits_n, stride_logits_v, + stride_grad_n, stride_grad_v, + n_rows, n_cols, softcap, + block_size: tl.constexpr, +): + row_idx = tl.program_id(0).to(tl.int64) + logits_row_ptr = logits_ptr + row_idx * stride_logits_n + grad_row_ptr = grad_logits_ptr + row_idx * stride_grad_n + lse = tl.load(lse_ptr + row_idx) + grad_loss = tl.load(grad_losses_ptr + row_idx).to(tl.float32) + target = tl.load(targets_ptr + row_idx).to(tl.int32) + A = 2.0 * softcap + 
inv_C = 2.0 / softcap + dz_dx_scale = A * inv_C + for off in range(0, n_cols, block_size): + cols = off + tl.arange(0, block_size) + mask = cols < n_cols + val = tl.load( + logits_row_ptr + cols * stride_logits_v, + mask=mask, other=0.0, + ).to(tl.float32) + sigmoid_u = tl.sigmoid(val * inv_C) + z = A * sigmoid_u + probs = tl.exp(z - lse) + grad_z = grad_loss * (probs - tl.where(cols == target, 1.0, 0.0)) + grad_x = grad_z * (dz_dx_scale * sigmoid_u * (1.0 - sigmoid_u)) + tl.store(grad_row_ptr + cols * stride_grad_v, grad_x, mask=mask) + + +def _validate_softcapped_ce_inputs( + logits: Tensor, targets: Tensor, softcap: float, +) -> tuple[Tensor, Tensor]: + if logits.ndim != 2: + raise ValueError(f"Expected logits.ndim=2, got {logits.ndim}") + if targets.ndim != 1: + raise ValueError(f"Expected targets.ndim=1, got {targets.ndim}") + if logits.shape[0] != targets.shape[0]: + raise ValueError( + f"Expected matching rows, got logits={tuple(logits.shape)} targets={tuple(targets.shape)}" + ) + if not logits.is_cuda or not targets.is_cuda: + raise ValueError("softcapped_cross_entropy requires CUDA tensors") + if softcap <= 0.0: + raise ValueError(f"softcap must be positive, got {softcap}") + if logits.dtype not in (torch.float16, torch.bfloat16, torch.float32): + raise ValueError(f"Unsupported logits dtype: {logits.dtype}") + logits = logits.contiguous() + targets = targets.contiguous() + if targets.dtype != torch.int64: + targets = targets.to(dtype=torch.int64) + return logits, targets + + +@torch.library.custom_op(f"{_FUSED_CE_LIBRARY}::softcapped_ce", mutates_args=()) +def softcapped_ce_op(logits: Tensor, targets: Tensor, softcap: float) -> tuple[Tensor, Tensor]: + logits, targets = _validate_softcapped_ce_inputs(logits, targets, float(softcap)) + n_rows, n_cols = logits.shape + losses = torch.empty((n_rows,), device=logits.device, dtype=torch.float32) + lse = torch.empty((n_rows,), device=logits.device, dtype=torch.float32) + _softcapped_ce_fwd_kernel[(n_rows,)]( + 
logits, losses, lse, targets, + logits.stride(0), logits.stride(1), + n_rows, n_cols, float(softcap), + block_size=_FUSED_CE_BLOCK_SIZE, num_warps=_FUSED_CE_NUM_WARPS, + ) + return losses, lse + + +@softcapped_ce_op.register_fake +def _(logits: Tensor, targets: Tensor, softcap: float): + if logits.ndim != 2 or targets.ndim != 1: + raise ValueError("softcapped_ce fake impl expects 2D logits and 1D targets") + if logits.shape[0] != targets.shape[0]: + raise ValueError( + f"Expected matching rows, got logits={tuple(logits.shape)} targets={tuple(targets.shape)}" + ) + n_rows = logits.shape[0] + return ( + logits.new_empty((n_rows,), dtype=torch.float32), + logits.new_empty((n_rows,), dtype=torch.float32), + ) + + +@torch.library.custom_op(f"{_FUSED_CE_LIBRARY}::softcapped_ce_backward", mutates_args=()) +def softcapped_ce_backward_op( + logits: Tensor, targets: Tensor, lse: Tensor, grad_losses: Tensor, softcap: float, +) -> Tensor: + logits, targets = _validate_softcapped_ce_inputs(logits, targets, float(softcap)) + lse = lse.contiguous() + grad_losses = grad_losses.contiguous().to(dtype=torch.float32) + if lse.ndim != 1 or grad_losses.ndim != 1: + raise ValueError("Expected 1D lse and grad_losses") + if lse.shape[0] != logits.shape[0] or grad_losses.shape[0] != logits.shape[0]: + raise ValueError( + f"Expected row-aligned lse/grad_losses, got logits={tuple(logits.shape)} " + f"lse={tuple(lse.shape)} grad_losses={tuple(grad_losses.shape)}" + ) + grad_logits = torch.empty_like(logits) + n_rows, n_cols = logits.shape + _softcapped_ce_bwd_kernel[(n_rows,)]( + grad_logits, grad_losses, lse, logits, targets, + logits.stride(0), logits.stride(1), + grad_logits.stride(0), grad_logits.stride(1), + n_rows, n_cols, float(softcap), + block_size=_FUSED_CE_BLOCK_SIZE, num_warps=_FUSED_CE_NUM_WARPS, + ) + return grad_logits + + +@softcapped_ce_backward_op.register_fake +def _(logits: Tensor, targets: Tensor, lse: Tensor, grad_losses: Tensor, softcap: float): + if logits.ndim != 2 or 
targets.ndim != 1 or lse.ndim != 1 or grad_losses.ndim != 1: + raise ValueError("softcapped_ce_backward fake impl expects 2D logits and 1D row tensors") + if ( + logits.shape[0] != targets.shape[0] + or logits.shape[0] != lse.shape[0] + or logits.shape[0] != grad_losses.shape[0] + ): + raise ValueError("softcapped_ce_backward fake impl expects row-aligned tensors") + return logits.new_empty(logits.shape) + + +def _softcapped_ce_setup_context( + ctx: torch.autograd.function.FunctionCtx, inputs, output, +) -> None: + logits, targets, softcap = inputs + _losses, lse = output + ctx.save_for_backward(logits, targets, lse) + ctx.softcap = float(softcap) + + +def _softcapped_ce_backward( + ctx: torch.autograd.function.FunctionCtx, grad_losses: Tensor, grad_lse: "Tensor | None", +): + del grad_lse + logits, targets, lse = ctx.saved_tensors + grad_logits = torch.ops.pgsubmission1draft7fusedce.softcapped_ce_backward( + logits, targets, lse, grad_losses, ctx.softcap + ) + return grad_logits, None, None + + +softcapped_ce_op.register_autograd( + _softcapped_ce_backward, setup_context=_softcapped_ce_setup_context, +) + + +def softcapped_cross_entropy( + logits: Tensor, targets: Tensor, softcap: float, reduction: str = "mean", +) -> Tensor: + losses, _lse = torch.ops.pgsubmission1draft7fusedce.softcapped_ce( + logits, targets, float(softcap) + ) + if reduction == "none": + return losses + if reduction == "sum": + return losses.sum() + if reduction == "mean": + return losses.mean() + raise ValueError(f"Unsupported reduction={reduction!r}") + + +class Hyperparameters: + data_dir = os.environ.get("DATA_DIR", "./data/") + seed = int(os.environ.get("SEED", 1337)) + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_frac = float(os.environ.get("WARMDOWN_FRAC", 0.75)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786432)) + # Fused softcapped 
CE (Triton). Training-only — forward_logits eval path still uses + # eager softcap+F.cross_entropy. Default ON since validated as at-worst neutral. + fused_ce_enabled = bool(int(os.environ.get("FUSED_CE_ENABLED", "1"))) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 6e2)) + val_batch_tokens = int(os.environ.get("VAL_BATCH_TOKENS", 524288)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + vocab_size = int(os.environ.get("VOCAB_SIZE", 8192)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 11)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 4.0)) + skip_gates_enabled = bool(int(os.environ.get("SKIP_GATES_ENABLED", "1"))) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 3e1)) + rope_base = float(os.environ.get("ROPE_BASE", 1e4)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + rope_train_seq_len = int(os.environ.get("ROPE_TRAIN_SEQ_LEN", 2048)) + rope_yarn = bool(int(os.environ.get("ROPE_YARN", "0"))) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 5.0)) + num_loops = int(os.environ.get("NUM_LOOPS", 2)) + loop_start = int(os.environ.get("LOOP_START", 3)) + loop_end = int(os.environ.get("LOOP_END", 5)) + enable_looping_at = float(os.environ.get("ENABLE_LOOPING_AT", 0.35)) + parallel_start_layer = int(os.environ.get("PARALLEL_START_LAYER", 8)) + parallel_final_lane = os.environ.get("PARALLEL_FINAL_LANE", "mean") + min_lr = float(os.environ.get("MIN_LR", 0.0)) + embed_lr = float(os.environ.get("EMBED_LR", 
0.6)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.03)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.026)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.02)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.97)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float( + os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92) + ) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + muon_row_normalize = bool(int(os.environ.get("MUON_ROW_NORMALIZE", "1"))) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-08)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + adam_wd = float(os.environ.get("ADAM_WD", 0.02)) + muon_wd = float(os.environ.get("MUON_WD", 0.095)) + embed_wd = float(os.environ.get("EMBED_WD", 0.085)) + ema_decay = float(os.environ.get("EMA_DECAY", 0.9965)) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1"))) + ttt_lora_rank = int(os.environ.get("TTT_LORA_RANK", 96)) + ttt_lora_lr = float(os.environ.get("TTT_LORA_LR", 0.0001)) + ttt_chunk_size = int(os.environ.get("TTT_CHUNK_SIZE", 48)) + ttt_eval_seq_len = int(os.environ.get("TTT_EVAL_SEQ_LEN", 2048)) + ttt_batch_size = int(os.environ.get("TTT_BATCH_SIZE", 64)) + ttt_grad_steps = int(os.environ.get("TTT_GRAD_STEPS", 1)) + # V19: PR #1886 (renqianluo) + sunnypatneedi research log 2026-04-28 found that + # the Triton fused-CE kernel's fp32-accumulation interacts with warm-start LoRA-A + # to destabilize seeds 314/1337 at TTT_WEIGHT_DECAY=1.0. Raising the default to + # 2.0 prevents seed collapse without measurably moving stable seeds. 
+ ttt_weight_decay = float(os.environ.get("TTT_WEIGHT_DECAY", 2.0)) + ttt_beta1 = float(os.environ.get("TTT_BETA1", 0)) + ttt_beta2 = float(os.environ.get("TTT_BETA2", 0.999)) + ttt_k_lora = bool(int(os.environ.get("TTT_K_LORA", "1"))) + ttt_mlp_lora = bool(int(os.environ.get("TTT_MLP_LORA", "1"))) + ttt_o_lora = bool(int(os.environ.get("TTT_O_LORA", "1"))) + ttt_optimizer = os.environ.get("TTT_OPTIMIZER", "adam") + ttt_eval_batches = os.environ.get("TTT_EVAL_BATCHES", "") + val_doc_fraction = float(os.environ.get("VAL_DOC_FRACTION", 1.0)) + compressor = os.environ.get("COMPRESSOR", "brotli") + gptq_calibration_batches = int(os.environ.get("GPTQ_CALIBRATION_BATCHES", 16)) + gptq_reserve_seconds = float(os.environ.get("GPTQ_RESERVE_SECONDS", 4.0)) + phased_ttt_prefix_docs = int(os.environ.get("PHASED_TTT_PREFIX_DOCS", 2000)) + phased_ttt_num_phases = int(os.environ.get("PHASED_TTT_NUM_PHASES", 1)) + global_ttt_lr = float(os.environ.get("GLOBAL_TTT_LR", 0.001)) + global_ttt_momentum = float(os.environ.get("GLOBAL_TTT_MOMENTUM", 0.9)) + global_ttt_epochs = int(os.environ.get("GLOBAL_TTT_EPOCHS", 1)) + global_ttt_chunk_tokens = int(os.environ.get("GLOBAL_TTT_CHUNK_TOKENS", 32768)) + global_ttt_batch_seqs = int(os.environ.get("GLOBAL_TTT_BATCH_SEQS", 32)) + global_ttt_warmup_start_lr = float(os.environ.get("GLOBAL_TTT_WARMUP_START_LR", 0.0)) + global_ttt_warmup_chunks = int(os.environ.get("GLOBAL_TTT_WARMUP_CHUNKS", 0)) + global_ttt_grad_clip = float(os.environ.get("GLOBAL_TTT_GRAD_CLIP", 1.0)) + global_ttt_respect_doc_boundaries = bool(int(os.environ.get("GLOBAL_TTT_RESPECT_DOC_BOUNDARIES", "1"))) + matrix_bits = int(os.environ.get("MATRIX_BITS", 6)) + embed_bits = int(os.environ.get("EMBED_BITS", 8)) + matrix_clip_sigmas = float(os.environ.get("MATRIX_CLIP_SIGMAS", 12.85)) + embed_clip_sigmas = float(os.environ.get("EMBED_CLIP_SIGMAS", 2e1)) + mlp_clip_sigmas = float(os.environ.get("MLP_CLIP_SIGMAS", 10.0)) + attn_clip_sigmas = float(os.environ.get("ATTN_CLIP_SIGMAS", 
13.0)) + # AttnOutGate (per-head multiplicative output gate, PR #1667 MarioPaerle). + # Zero-init weight: 2*sigmoid(0)=1 -> transparent at start. Source defaults to + # block input x ('proj'); 'q' uses raw Q projection output. + attn_out_gate_enabled = bool(int(os.environ.get("ATTN_OUT_GATE_ENABLED", "0"))) + attn_out_gate_src = os.environ.get("ATTN_OUT_GATE_SRC", "proj") + # SmearGate (input-dependent forward-1 token smear, modded-nanogpt @classiclarryd + # via PR #1667). x_t <- x_t + lam * sigmoid(W*x_t[:gate_window]) * x_{t-1}. + # lam=0 + W=0 -> transparent at init. + smear_gate_enabled = bool(int(os.environ.get("SMEAR_GATE_ENABLED", "0"))) + # Window: first GATE_WINDOW dims of the source feed the gate projection. + gate_window = int(os.environ.get("GATE_WINDOW", 12)) + # Gated Attention (Qwen, NeurIPS 2025 Best Paper, arXiv:2505.06708; + # qiuzh20/gated_attention). Per-head sigmoid gate on SDPA output, BEFORE + # out_proj. Gate input = full block input x (paper's headwise G1 variant + # driven from hidden_states). W_g shape (num_heads, dim), plain sigmoid. + # Near-zero init gives g~0.5 at step 0 (half attention output); per-block + # attn_scale (init 1.0) compensates during training. Name contains + # "attn_gate" so CONTROL_TENSOR_NAME_PATTERNS routes it to scalar AdamW. + gated_attn_enabled = bool(int(os.environ.get("GATED_ATTN_ENABLED", "0"))) + gated_attn_init_std = float(os.environ.get("GATED_ATTN_INIT_STD", 0.01)) + # Dedicated int8-per-row quantization for `attn_gate_w` tensors. These are + # small ((num_heads, dim) = (8, 512) = 4096 params) and bypass GPTQ via the + # numel<=65536 passthrough branch -> stored as fp16 (8 KB/layer, ~65 KB total + # compressed). int8-per-row cuts the raw tensor in half with negligible BPB + # impact: scales per head (8 values), symmetric quant over [-127, 127]. + # No Hessian needed (gate weights not in collect_hessians()). 
+ gated_attn_quant_gate = bool(int(os.environ.get("GATED_ATTN_QUANT_GATE", "0"))) + # Sparse Attention Gate (modded-nanogpt-style). Keeps dense SDPA and only + # swaps the output-gate input to the first GATE_WINDOW residual dims. + # W_g: (num_heads, gate_window) = (8, 12) = 96 params/layer (~1K total), + # vs dense GatedAttn's (8, 512) = 4K/layer (~44K diff). Name "attn_gate_w" + # is shared so quant routing and int8 gate passthrough Just Work. Gate + # passthrough int8 still applies via GATED_ATTN_QUANT_GATE=1. + # Mutually exclusive with ATTN_OUT_GATE_ENABLED and GATED_ATTN_ENABLED. + sparse_attn_gate_enabled = bool(int(os.environ.get("SPARSE_ATTN_GATE_ENABLED", "0"))) + sparse_attn_gate_init_std = float(os.environ.get("SPARSE_ATTN_GATE_INIT_STD", 0.0)) + sparse_attn_gate_scale = float(os.environ.get("SPARSE_ATTN_GATE_SCALE", 1.0)) + # LQER asymmetric rank-k correction on top-K quant-error tensors (PR #1530 v2 port). + # Computes SVD of E = W_fp - W_quant, packs top-r A,B as INT2/INT4 (asym) or INTk (sym). + lqer_enabled = bool(int(os.environ.get("LQER_ENABLED", "1"))) + lqer_rank = int(os.environ.get("LQER_RANK", 4)) + lqer_top_k = int(os.environ.get("LQER_TOP_K", 3)) + lqer_factor_bits = int(os.environ.get("LQER_FACTOR_BITS", 4)) + lqer_asym_enabled = bool(int(os.environ.get("LQER_ASYM_ENABLED", "1"))) + lqer_asym_group = int(os.environ.get("LQER_ASYM_GROUP", "64")) + lqer_scope = os.environ.get("LQER_SCOPE", "all") + lqer_gain_select = bool(int(os.environ.get("LQER_GAIN_SELECT", "0"))) + awq_lite_enabled = bool(int(os.environ.get("AWQ_LITE_ENABLED", "0"))) + awq_lite_bits = int(os.environ.get("AWQ_LITE_BITS", "8")) + awq_lite_group_top_k = int(os.environ.get("AWQ_LITE_GROUP_TOP_K", "1")) + awq_lite_group_size = int(os.environ.get("AWQ_LITE_GROUP_SIZE", "64")) + # PR #1145 online n-gram tilt (AnirudhRahul, valerio-endorsed). Causal, + # normalized, prefix-only experts; closed-form multiplicative-boost-with-renorm + # applied to per-token NLL.
See online_ngram_tilt.py for math + compliance. + ngram_tilt_enabled = bool(int(os.environ.get("NGRAM_TILT_ENABLED", "0"))) + token_order = int(os.environ.get("TOKEN_ORDER", "16")) + token_threshold = float(os.environ.get("TOKEN_THRESHOLD", "0.800")) + token_boost = float(os.environ.get("TOKEN_BOOST", "2.625")) + within_tau = float(os.environ.get("WITHIN_TAU", "0.450")) + within_boost = float(os.environ.get("WITHIN_BOOST", "0.750")) + word_order = int(os.environ.get("WORD_ORDER", "4")) + word_normalize = os.environ.get("WORD_NORMALIZE", "strip_punct_lower") + word_tau = float(os.environ.get("WORD_TAU", "0.650")) + word_boost = float(os.environ.get("WORD_BOOST", "0.750")) + agree_add_boost = float(os.environ.get("AGREE_ADD_BOOST", "0.500")) + # === v5 Stage 1 optimizations (env-gated) === + # 1A: Move ngram hint precompute OUTSIDE eval timer (single causal pass over val tokens). + # Compliance: still inside validate(), single-pass causal, val tokens only. + # Save: ~168s (measured in v2 fulltilt) — enough alone to fit cap. + ngram_hint_precompute_outside = bool(int(os.environ.get("NGRAM_HINT_PRECOMPUTE_OUTSIDE", "1"))) + # 2C: Temperature scaling on logits before softcap. Σ P=1 preserved. + # Default 1.0 = no-op. Tune on train holdout, apply at eval. + temperature_scale = float(os.environ.get("TEMPERATURE_SCALE", "1.0")) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + is_main_process = rank == 0 + grad_accum_steps = 8 // world_size + # CaseOps integration: optional override of dataset root + tokenizer path. + # When CASEOPS_ENABLED=1, the wrapper loads a per-token byte sidecar + # (fineweb_val_bytes_*.bin, identical shard layout to val_*.bin) and uses + # it as the canonical raw-byte budget for BPB accounting. The sidecar + # REPLACES the build_sentencepiece_luts byte-counting path entirely. 
+ caseops_enabled = bool(int(os.environ.get("CASEOPS_ENABLED", "0"))) + _default_caseops_data = os.path.join( + data_dir, + "datasets", + "fineweb10B_sp8192_caseops", + "datasets", + "datasets", + "fineweb10B_sp8192_lossless_caps_caseops_v1_reserved", + ) + _default_caseops_tok = os.path.join( + data_dir, + "datasets", + "fineweb10B_sp8192_caseops", + "datasets", + "tokenizers", + "fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model", + ) + if caseops_enabled: + datasets_dir = os.environ.get("DATA_PATH", _default_caseops_data) + tokenizer_path = os.environ.get("TOKENIZER_PATH", _default_caseops_tok) + else: + datasets_dir = os.environ.get( + "DATA_PATH", + os.path.join(data_dir, "datasets", f"fineweb10B_sp{vocab_size}"), + ) + tokenizer_path = os.environ.get( + "TOKENIZER_PATH", + os.path.join(data_dir, "tokenizers", f"fineweb_{vocab_size}_bpe.model"), + ) + train_files = os.path.join(datasets_dir, "fineweb_train_*.bin") + val_files = os.path.join(datasets_dir, "fineweb_val_*.bin") + val_bytes_files = os.path.join(datasets_dir, "fineweb_val_bytes_*.bin") + artifact_dir = os.environ.get("ARTIFACT_DIR", "") + logfile = ( + os.path.join(artifact_dir, f"{run_id}.txt") + if artifact_dir + else f"logs/{run_id}.txt" + ) + model_path = ( + os.path.join(artifact_dir, "final_model.pt") + if artifact_dir + else "final_model.pt" + ) + quantized_model_path = ( + os.path.join(artifact_dir, "final_model.int6.ptz") + if artifact_dir + else "final_model.int6.ptz" + ) + + +_logger_hparams = None + + +def set_logging_hparams(h): + global _logger_hparams + _logger_hparams = h + + +def log(msg, console=True): + if _logger_hparams is None: + print(msg) + return + if _logger_hparams.is_main_process: + if console: + print(msg) + if _logger_hparams.logfile is not None: + with open(_logger_hparams.logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + +class ValidationData: + def __init__(self, h, device): + self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) + if 
int(self.sp.vocab_size()) != h.vocab_size: + raise ValueError( + f"VOCAB_SIZE={h.vocab_size} does not match tokenizer vocab_size={int(self.sp.vocab_size())}" + ) + self.val_tokens = load_validation_tokens(h.val_files, h.eval_seq_len) + self.caseops_enabled = bool(getattr(h, "caseops_enabled", False)) + if self.caseops_enabled: + self.base_bytes_lut = None + self.has_leading_space_lut = None + self.is_boundary_token_lut = None + else: + ( + self.base_bytes_lut, + self.has_leading_space_lut, + self.is_boundary_token_lut, + ) = build_sentencepiece_luts(self.sp, h.vocab_size, device) + self.val_bytes = None + if self.caseops_enabled: + self.val_bytes = load_validation_byte_sidecar( + h.val_bytes_files, h.eval_seq_len, self.val_tokens.numel() + ) + + +def build_sentencepiece_luts(sp, vocab_size, device): + sp_vocab_size = int(sp.vocab_size()) + assert ( + sp.piece_to_id("▁") != sp.unk_id() + ), "Tokenizer must have '▁' (space) as its own token for correct BPB byte counting" + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("▁"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) + + +def load_validation_tokens(pattern, seq_len): + # Filter out CaseOps byte sidecar shards which share the val_*.bin glob. 
+ files = [ + Path(p) + for p in sorted(glob.glob(pattern)) + if "_bytes_" not in Path(p).name + ] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = (tokens.numel() - 1) // seq_len * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] + + +def load_validation_byte_sidecar(pattern, seq_len, expected_len): + """Load CaseOps per-token byte sidecar(s). Same shard layout as token shards + (256 int32 header + uint16 array). Each entry = canonical raw-text byte + budget for that token in the corresponding val shard. Returns a CPU + int32 tensor sliced to match expected_len (i.e. val_tokens length).""" + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No byte sidecar files for pattern: {pattern}") + shards = [load_data_shard(file) for file in files] + # load_data_shard returns uint16 — that's exactly what the sidecar stores.
+ bytes_full = torch.cat(shards).contiguous() + if bytes_full.numel() < expected_len: + raise ValueError( + f"Byte sidecar too short: {bytes_full.numel()} < val_tokens {expected_len}" + ) + return bytes_full[:expected_len].to(torch.int32) + + +def load_data_shard(file): + header_bytes = 256 * np.dtype(" 0: + pos = start + while pos < end: + seg_starts.append(pos) + pos += max_doc_len + else: + seg_starts.append(start) + boundaries = seg_starts + [total_len] + padded_len = get_next_multiple_of_n(len(boundaries), bucket_size) + cu = torch.full((padded_len,), total_len, dtype=torch.int32, device=device) + cu[: len(boundaries)] = torch.tensor(boundaries, dtype=torch.int32, device=device) + seg_ends = seg_starts[1:] + [total_len] + max_seqlen = max(end - start for start, end in zip(seg_starts, seg_ends)) + return cu, max_seqlen + +class DocumentPackingLoader: + _shard_pool = ThreadPoolExecutor(1) + + def __init__(self, h, device, cu_bucket_size=64): + self.rank = h.rank + self.world_size = h.world_size + self.device = device + self.cu_bucket_size = cu_bucket_size + self.max_seq_len = h.train_seq_len + all_files = [Path(p) for p in sorted(glob.glob(h.train_files))] + if not all_files: + raise FileNotFoundError(f"No files found for pattern: {h.train_files}") + self.files = all_files + self.file_iter = iter(self.files) + self._init_shard(load_data_shard(next(self.file_iter))) + self._next_shard = self._submit_next_shard() + self._batch_pool = ThreadPoolExecutor(1) + self._prefetch_queue = [] + + def _init_shard(self, tokens): + global BOS_ID + self.tokens = tokens + self.shard_size = tokens.numel() + if BOS_ID is None: + BOS_ID = 1 + self.bos_idx = ( + (tokens == BOS_ID).nonzero(as_tuple=True)[0].to(torch.int64).cpu().numpy() + ) + self.cursor = int(self.bos_idx[0]) + + def _submit_next_shard(self): + try: + path = next(self.file_iter) + return self._shard_pool.submit(load_data_shard, path) + except StopIteration: + return None + + def _advance_shard(self): + if 
self._next_shard is None: + self.file_iter = iter(self.files) + self._next_shard = self._shard_pool.submit( + load_data_shard, next(self.file_iter) + ) + self._init_shard(self._next_shard.result()) + self._next_shard = self._submit_next_shard() + + def _local_doc_starts(self, local_start, total_len): + lo = np.searchsorted(self.bos_idx, local_start, side="left") + hi = np.searchsorted(self.bos_idx, local_start + total_len, side="left") + return (self.bos_idx[lo:hi] - local_start).tolist() + + def _prepare_batch(self, num_tokens_local, max_seq_len): + per_rank_span = num_tokens_local + 1 + global_span = per_rank_span * self.world_size + while self.cursor + global_span > self.shard_size: + self._advance_shard() + local_start = self.cursor + self.rank * per_rank_span + buf = self.tokens[local_start : local_start + per_rank_span] + inputs = torch.empty(per_rank_span - 1, dtype=torch.int64, pin_memory=True) + targets = torch.empty(per_rank_span - 1, dtype=torch.int64, pin_memory=True) + inputs.copy_(buf[:-1]) + targets.copy_(buf[1:]) + starts = self._local_doc_starts(local_start, inputs.numel()) + cu_seqlens, max_seqlen = _build_cu_seqlens( + starts, inputs.numel(), inputs.device, max_seq_len, self.cu_bucket_size + ) + cu_seqlens = cu_seqlens.pin_memory() + self.cursor += global_span + return inputs, targets, cu_seqlens, max_seqlen + + def next_batch(self, global_tokens, grad_accum_steps): + num_tokens_local = global_tokens // (self.world_size * grad_accum_steps) + while len(self._prefetch_queue) < 2: + self._prefetch_queue.append( + self._batch_pool.submit(self._prepare_batch, num_tokens_local, self.max_seq_len)) + inputs, targets, cu_seqlens, max_seqlen = self._prefetch_queue.pop(0).result() + self._prefetch_queue.append( + self._batch_pool.submit(self._prepare_batch, num_tokens_local, self.max_seq_len)) + return ( + inputs[None].to(self.device, non_blocking=True), + targets[None].to(self.device, non_blocking=True), + cu_seqlens.to(self.device, non_blocking=True), + 
max_seqlen, + ) + + +class ShuffledSequenceLoader: + def __init__(self, h, device): + self.world_size = h.world_size + self.seq_len = h.train_seq_len + self.device = device + all_files = [Path(p) for p in sorted(glob.glob(h.train_files))] + if not all_files: + raise FileNotFoundError(f"No files found for pattern: {h.train_files}") + self.files = all_files[h.rank :: h.world_size] + self.rng = np.random.Generator(np.random.PCG64(h.rank)) + self.num_tokens = [_read_num_tokens(f) for f in self.files] + self.start_inds = [[] for _ in self.files] + for si in range(len(self.files)): + self._reset_shard(si) + + def _reset_shard(self, si): + max_phase = min( + self.seq_len - 1, max(0, self.num_tokens[si] - self.seq_len - 1) + ) + phase = int(self.rng.integers(max_phase + 1)) if max_phase > 0 else 0 + num_sequences = (self.num_tokens[si] - 1 - phase) // self.seq_len + sequence_order = self.rng.permutation(num_sequences) + self.start_inds[si] = (phase + sequence_order * self.seq_len).tolist() + + def next_batch(self, global_tokens, grad_accum_steps): + device_tokens = global_tokens // (self.world_size * grad_accum_steps) + device_batch_size = device_tokens // self.seq_len + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + x = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + y = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + for bi in range(device_batch_size): + total = remaining.sum() + if total <= 0: + for si in range(len(self.files)): + self._reset_shard(si) + remaining = np.array( + [len(s) for s in self.start_inds], dtype=np.float64 + ) + total = remaining.sum() + probs = remaining / total + si = int(self.rng.choice(len(self.files), p=probs)) + start_ind = self.start_inds[si].pop() + remaining[si] -= 1 + mm = _get_shard_memmap(self.files[si]) + window = torch.as_tensor( + np.array(mm[start_ind : start_ind + self.seq_len + 1], dtype=np.int64) + ) + x[bi] = window[:-1] + y[bi] = window[1:] + return 
x.to(self.device, non_blocking=True), y.to( + self.device, non_blocking=True + ) + + +class RMSNorm(nn.Module): + def __init__(self, eps=None): + super().__init__() + self.eps = eps + + def forward(self, x): + return F.rms_norm(x, (x.size(-1),), eps=self.eps) + + +class CastedLinear(nn.Linear): + def forward(self, x): + w = self.weight.to(x.dtype) + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) + + +@triton.jit +def fused_log_softmax_dual_gather_kernel( + logits_ptr, + target_ids_ptr, + hint_ids_ptr, + log_p_y_out_ptr, + log_q_h_out_ptr, + BT, + V, + BLOCK_V: tl.constexpr, +): + """Fused log_softmax + dual gather. Single pass over [BT, V] logits per row, + extracts log p(target_id) and log p(hint_id) via online logsumexp. + Replaces F.log_softmax (which materializes [BT, V] fp32) + 2 gather ops. + """ + pid = tl.program_id(0) + if pid >= BT: + return + + target = tl.load(target_ids_ptr + pid) + hint = tl.load(hint_ids_ptr + pid) + row_offset = pid * V + + target_logit = tl.load(logits_ptr + row_offset + target).to(tl.float32) + hint_logit = tl.load(logits_ptr + row_offset + hint).to(tl.float32) + + NEG_INF = float("-inf") + max_val = NEG_INF + for v_start in tl.range(0, V, BLOCK_V): + v_offsets = v_start + tl.arange(0, BLOCK_V) + mask = v_offsets < V + chunk = tl.load( + logits_ptr + row_offset + v_offsets, mask=mask, other=NEG_INF + ).to(tl.float32) + block_max = tl.max(chunk, axis=0) + max_val = tl.maximum(max_val, block_max) + + sum_exp = tl.zeros((), dtype=tl.float32) + for v_start in tl.range(0, V, BLOCK_V): + v_offsets = v_start + tl.arange(0, BLOCK_V) + mask = v_offsets < V + chunk = tl.load( + logits_ptr + row_offset + v_offsets, mask=mask, other=0.0 + ).to(tl.float32) + chunk_centered = chunk - max_val + exp_chunk = tl.where(mask, tl.exp(chunk_centered), 0.0) + sum_exp += tl.sum(exp_chunk, axis=0) + + log_sum_exp = max_val + tl.log(sum_exp) + log_p_y = target_logit - log_sum_exp + log_p_h = hint_logit - 
log_sum_exp + + tl.store(log_p_y_out_ptr + pid, log_p_y) + tl.store(log_q_h_out_ptr + pid, log_p_h) + + +def fused_log_softmax_dual_gather(logits, target_ids, hint_ids): + """Triton wrapper — replaces F.log_softmax + 2 gather pattern. + Returns (log_p_y, log_q_h) where p = softmax(logits). + """ + bsz, sl, V = logits.shape + BT = bsz * sl + logits_flat = logits.reshape(BT, V).contiguous() + target_flat = target_ids.reshape(BT).contiguous() + hint_flat = hint_ids.reshape(BT).contiguous() + + log_p_y_out = torch.empty(BT, dtype=torch.float32, device=logits.device) + log_q_h_out = torch.empty(BT, dtype=torch.float32, device=logits.device) + + BLOCK_V = 1024 + grid = (BT,) + fused_log_softmax_dual_gather_kernel[grid]( + logits_flat, + target_flat, + hint_flat, + log_p_y_out, + log_q_h_out, + BT, + V, + BLOCK_V=BLOCK_V, + num_warps=8, + ) + return log_p_y_out.reshape(bsz, sl), log_q_h_out.reshape(bsz, sl) + + +@triton.jit +def linear_leaky_relu_square_kernel( + a_desc, + b_desc, + c_desc, + aux_desc, + M, + N, + K, + BLOCK_SIZE_M: tl.constexpr, + BLOCK_SIZE_N: tl.constexpr, + BLOCK_SIZE_K: tl.constexpr, + NUM_SMS: tl.constexpr, + FORWARD: tl.constexpr, +): + dtype = tl.bfloat16 + start_pid = tl.program_id(axis=0) + num_pid_m = tl.cdiv(M, BLOCK_SIZE_M) + num_pid_n = tl.cdiv(N, BLOCK_SIZE_N) + k_tiles = tl.cdiv(K, BLOCK_SIZE_K) + num_tiles = num_pid_m * num_pid_n + tile_id_c = start_pid - NUM_SMS + for tile_id in tl.range(start_pid, num_tiles, NUM_SMS, flatten=True): + pid_m = tile_id // num_pid_n + pid_n = tile_id % num_pid_n + offs_am = pid_m * BLOCK_SIZE_M + offs_bn = pid_n * BLOCK_SIZE_N + accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + for ki in range(k_tiles): + offs_k = ki * BLOCK_SIZE_K + a = a_desc.load([offs_am, offs_k]) + b = b_desc.load([offs_bn, offs_k]) + accumulator = tl.dot(a, b.T, accumulator) + tile_id_c += NUM_SMS + offs_am_c = offs_am + offs_bn_c = offs_bn + acc = tl.reshape(accumulator, (BLOCK_SIZE_M, 2, BLOCK_SIZE_N // 2)) + 
acc = tl.permute(acc, (0, 2, 1)) + acc0, acc1 = tl.split(acc) + c0 = acc0.to(dtype) + c1 = acc1.to(dtype) + if not FORWARD: + pre0 = aux_desc.load([offs_am_c, offs_bn_c]) + pre1 = aux_desc.load([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2]) + c0 = c0 * tl.where(pre0 > 0, 2.0 * pre0, 0.18 * pre0) + c1 = c1 * tl.where(pre1 > 0, 2.0 * pre1, 0.18 * pre1) + c_desc.store([offs_am_c, offs_bn_c], c0) + c_desc.store([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2], c1) + if FORWARD: + aux0 = tl.where(c0 > 0, c0, 0.3 * c0) + aux1 = tl.where(c1 > 0, c1, 0.3 * c1) + aux_desc.store([offs_am_c, offs_bn_c], aux0 * aux0) + aux_desc.store([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2], aux1 * aux1) + + +def linear_leaky_relu_square(a, b, aux=None): + M, K = a.shape + N, K2 = b.shape + assert K == K2 + c = torch.empty((M, N), device=a.device, dtype=a.dtype) + forward = aux is None + if aux is None: + aux = torch.empty((M, N), device=a.device, dtype=a.dtype) + num_sms = torch.cuda.get_device_properties(a.device).multi_processor_count + BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K = 256, 128, 64 + num_stages = 4 if forward else 3 + a_desc = TensorDescriptor.from_tensor(a, [BLOCK_SIZE_M, BLOCK_SIZE_K]) + b_desc = TensorDescriptor.from_tensor(b, [BLOCK_SIZE_N, BLOCK_SIZE_K]) + c_desc = TensorDescriptor.from_tensor(c, [BLOCK_SIZE_M, BLOCK_SIZE_N // 2]) + aux_desc = TensorDescriptor.from_tensor(aux, [BLOCK_SIZE_M, BLOCK_SIZE_N // 2]) + grid = lambda _meta: ( + min(num_sms, triton.cdiv(M, BLOCK_SIZE_M) * triton.cdiv(N, BLOCK_SIZE_N)), + ) + linear_leaky_relu_square_kernel[grid]( + a_desc, + b_desc, + c_desc, + aux_desc, + M, + N, + K, + BLOCK_SIZE_M=BLOCK_SIZE_M, + BLOCK_SIZE_N=BLOCK_SIZE_N, + BLOCK_SIZE_K=BLOCK_SIZE_K, + NUM_SMS=num_sms, + FORWARD=forward, + num_stages=num_stages, + num_warps=8, + ) + if forward: + return c, aux + return c + + +class FusedLinearLeakyReLUSquareFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, x, w1, w2): + x_flat = x.reshape(-1, x.shape[-1]) + pre, post 
= linear_leaky_relu_square(x_flat, w1) + out = F.linear(post, w2) + ctx.save_for_backward(x, w1, w2, pre, post) + return out.view(*x.shape[:-1], out.shape[-1]) + + @staticmethod + def backward(ctx, grad_output): + x, w1, w2, pre, post = ctx.saved_tensors + x_flat = x.reshape(-1, x.shape[-1]) + grad_output_flat = grad_output.reshape(-1, grad_output.shape[-1]) + dw2 = grad_output_flat.T @ post + dpre = linear_leaky_relu_square(grad_output_flat, w2.T.contiguous(), aux=pre) + dw1 = dpre.T @ x_flat + dx = dpre @ w1 + return dx.view_as(x), dw1, dw2 + + +FusedLeakyReLUSquareMLP = FusedLinearLeakyReLUSquareFunction.apply + + +class Rotary(nn.Module): + def __init__(self, dim, base=1e4, train_seq_len=1024, rope_dims=0, yarn=True): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.yarn = yarn + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / base ** ( + torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims + ) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached = None + self._sin_cached = None + + def forward(self, seq_len, device, dtype): + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached < seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if self.yarn and seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * scale ** (rd / (rd - 2)) + inv_freq = 1.0 / new_base ** ( + torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd + ) + else: + inv_freq = self.inv_freq.float().to(device) + t = torch.arange(seq_len, device=device, dtype=torch.float32) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached[:, :seq_len].to(dtype=dtype), self._sin_cached[:, :seq_len].to(dtype=dtype) + + 
+def apply_rotary_emb(x, cos, sin, rope_dims=0): + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + + +class CausalSelfAttention(nn.Module): + def __init__( + self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len, yarn=True, + attn_out_gate=False, attn_out_gate_src="proj", gate_window=12, + gated_attn=False, gated_attn_init_std=0.01, + sparse_attn_gate=False, sparse_attn_gate_init_std=0.0, sparse_attn_gate_scale=1.0, + ): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + if int(attn_out_gate) + int(gated_attn) + int(sparse_attn_gate) > 1: + raise ValueError( + "attn_out_gate, gated_attn, and sparse_attn_gate are mutually exclusive" + ) + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + self.q_gain = nn.Parameter( + torch.full((num_heads,), qk_gain_init, dtype=torch.float32) + ) + self.rope_dims = 0 + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=train_seq_len, yarn=yarn) + self.use_xsa = False + # AttnOutGate (PR #1667 MarioPaerle): per-head multiplicative gate on attention + # output. CastedLinear so restore_fp32_params casts back to fp32 for GPTQ. + # _zero_init -> 2*sigmoid(0)=1 -> transparent at init. 
+ self.attn_out_gate = attn_out_gate + self.attn_out_gate_src = attn_out_gate_src + self.gate_window = gate_window + if attn_out_gate: + self.attn_gate_proj = CastedLinear(gate_window, num_heads, bias=False) + self.attn_gate_proj._zero_init = True + # Gated Attention (arXiv:2505.06708, Qwen, NeurIPS 2025). Per-head sigmoid + # gate on SDPA output, BEFORE out_proj. Gate projection W_g: (num_heads, dim). + # Name "attn_gate_w" contains "attn_gate" substring so it matches + # CONTROL_TENSOR_NAME_PATTERNS and routes to the scalar AdamW group. + # fp32 Parameter -> restore_fp32_params path covers it via the ndim<2 OR + # name-pattern check (name matches "attn_gate"). Cast to x.dtype on use. + self.gated_attn = gated_attn + if gated_attn: + W = torch.empty(num_heads, dim, dtype=torch.float32) + nn.init.normal_(W, mean=0.0, std=gated_attn_init_std) + self.attn_gate_w = nn.Parameter(W) + # Sparse attention head-output gate (modded-nanogpt style). Keeps dense SDPA + # and only narrows the gate input to the first gate_window residual dims. + # W_g: (num_heads, gate_window). y_{t,h} <- sigmoid(scale * W_g_h @ x_t[:gate_window]) * y_{t,h}. + # Shares attn_gate_w name with dense GatedAttn so the quant routing + # (CONTROL_TENSOR_NAME_PATTERNS / attn_gate_w int8 passthrough) is unchanged. 
+ self.sparse_attn_gate = sparse_attn_gate + self.sparse_attn_gate_scale = sparse_attn_gate_scale + if sparse_attn_gate: + W = torch.empty(num_heads, gate_window, dtype=torch.float32) + if sparse_attn_gate_init_std > 0: + nn.init.normal_(W, mean=0.0, std=sparse_attn_gate_init_std) + else: + nn.init.zeros_(W) + self.attn_gate_w = nn.Parameter(W) + + def _xsa_efficient(self, y, v): + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) + vn = F.normalize(v, dim=-1).unsqueeze(-2) + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + + def forward(self, x, q_w, k_w, v_w, out_w, cu_seqlens=None, max_seqlen=0): + bsz, seqlen, dim = x.shape + # q_raw kept around as a tap point for attn_out_gate_src='q' (post-projection, + # pre-reshape, pre-RoPE). + q_raw = F.linear(x, q_w.to(x.dtype)) + q = q_raw.reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + if cu_seqlens is not None: + y = flash_attn_varlen_func( + q[0], + k[0], + v[0], + cu_seqlens_q=cu_seqlens, + cu_seqlens_k=cu_seqlens, + max_seqlen_q=max_seqlen, + max_seqlen_k=max_seqlen, + causal=True, + window_size=(-1, -1), + )[None] + else: + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + # AttnOutGate inlined (PR #1667). Inline + .contiguous() barrier so torch.compile + # fullgraph=True is happy (this avoids the @torch.compiler.disable trap that + # crashed gates v3). 
Per-head gate on (B,T,H,D) tensor: g shape [B,T,H], broadcast + # over D via [..., None]. zero-init weight -> 2*sigmoid(0)=1 -> transparent. + if self.attn_out_gate: + gate_src = q_raw if self.attn_out_gate_src == "q" else x + gate_in = gate_src[..., : self.gate_window].contiguous() + g = 2.0 * torch.sigmoid(self.attn_gate_proj(gate_in)) + y = y * g[..., None] + # Gated Attention (arXiv:2505.06708 G1). Inline + .contiguous() barrier so + # torch.compile fullgraph=True is happy. Per-head gate on (B,T,H,D): g shape + # [B,T,H], broadcast over D via [..., None]. Paper: g = sigmoid(x @ W_g.T) + # where W_g: (H, dim). .to(x.dtype) on fp32 param before broadcast with bf16. + if self.gated_attn: + x_c = x.contiguous() + g = torch.sigmoid(F.linear(x_c, self.attn_gate_w.to(x.dtype))) + y = y * g[..., None] + # Sparse head-output gate: narrower (gate_window) input, same shape g as GatedAttn. + if self.sparse_attn_gate: + gate_in = x[..., : self.gate_window].contiguous() + g = torch.sigmoid( + self.sparse_attn_gate_scale + * F.linear(gate_in, self.attn_gate_w.to(x.dtype)) + ) + y = y * g[..., None] + y = y.reshape(bsz, seqlen, dim) + self._last_proj_input = y.detach() if getattr(self, "_calib", False) else None + return F.linear(y, out_w.to(x.dtype)) + + +class MLP(nn.Module): + def __init__(self, dim, mlp_mult): + super().__init__() + self.use_fused = True + + def forward(self, x, up_w, down_w): + if self.training and self.use_fused: + return FusedLeakyReLUSquareMLP(x, up_w.to(x.dtype), down_w.to(x.dtype)) + hidden = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.3).square() + self._last_down_input = hidden.detach() if getattr(self, "_calib", False) else None + return F.linear(hidden, down_w.to(x.dtype)) + + +class Block(nn.Module): + def __init__( + self, + dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + train_seq_len, + layer_idx=0, + ln_scale=False, + yarn=True, + attn_out_gate=False, + attn_out_gate_src="proj", + gate_window=12, 
+ gated_attn=False, + gated_attn_init_std=0.01, + sparse_attn_gate=False, + sparse_attn_gate_init_std=0.0, + sparse_attn_gate_scale=1.0, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention( + dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len, yarn=yarn, + attn_out_gate=attn_out_gate, attn_out_gate_src=attn_out_gate_src, gate_window=gate_window, + gated_attn=gated_attn, gated_attn_init_std=gated_attn_init_std, + sparse_attn_gate=sparse_attn_gate, + sparse_attn_gate_init_std=sparse_attn_gate_init_std, + sparse_attn_gate_scale=sparse_attn_gate_scale, + ) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter( + torch.stack((torch.ones(dim), torch.zeros(dim))).float() + ) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + + def forward(self, x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=None, max_seqlen=0): + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out = self.attn( + self.attn_norm(x_in) * self.ln_scale_factor, + q_w, k_w, v_w, out_w, + cu_seqlens=cu_seqlens, + max_seqlen=max_seqlen, + ) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[ + None, None, : + ] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + return x_out + +class GPT(nn.Module): + def __init__(self, h): + super().__init__() + if h.logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {h.logit_softcap}") + self.tie_embeddings = h.tie_embeddings + self.tied_embed_init_std = h.tied_embed_init_std + self.logit_softcap = h.logit_softcap + self.fused_ce_enabled = bool(h.fused_ce_enabled) + self.tok_emb = nn.Embedding(h.vocab_size, h.model_dim) + 
self.num_layers = h.num_layers + head_dim = h.model_dim // h.num_heads + kv_dim = h.num_kv_heads * head_dim + hidden_dim = int(h.mlp_mult * h.model_dim) + self.qo_bank = nn.Parameter(torch.empty(2 * h.num_layers, h.model_dim, h.model_dim)) + self.kv_bank = nn.Parameter(torch.empty(2 * h.num_layers, kv_dim, h.model_dim)) + self.mlp_up_bank = nn.Parameter(torch.empty(h.num_layers, hidden_dim, h.model_dim)) + self.mlp_down_bank = nn.Parameter(torch.empty(h.num_layers, h.model_dim, hidden_dim)) + self.num_encoder_layers = h.num_layers // 2 + self.num_decoder_layers = h.num_layers - self.num_encoder_layers + self.blocks = nn.ModuleList( + [ + Block( + h.model_dim, + h.num_heads, + h.num_kv_heads, + h.mlp_mult, + h.rope_base, + h.qk_gain_init, + h.train_seq_len, + layer_idx=i, + ln_scale=h.ln_scale, + yarn=h.rope_yarn, + attn_out_gate=h.attn_out_gate_enabled, + attn_out_gate_src=h.attn_out_gate_src, + gate_window=h.gate_window, + gated_attn=h.gated_attn_enabled, + gated_attn_init_std=h.gated_attn_init_std, + sparse_attn_gate=h.sparse_attn_gate_enabled, + sparse_attn_gate_init_std=h.sparse_attn_gate_init_std, + sparse_attn_gate_scale=h.sparse_attn_gate_scale, + ) + for i in range(h.num_layers) + ] + ) + if h.rope_dims > 0: + head_dim = h.model_dim // h.num_heads + for block in self.blocks: + block.attn.rope_dims = h.rope_dims + block.attn.rotary = Rotary( + head_dim, + base=h.rope_base, + train_seq_len=h.train_seq_len, + rope_dims=h.rope_dims, + yarn=h.rope_yarn, + ) + self.final_norm = RMSNorm() + self.lm_head = ( + None + if h.tie_embeddings + else CastedLinear(h.model_dim, h.vocab_size, bias=False) + ) + if self.lm_head is not None: + self.lm_head._zero_init = True + if h.xsa_last_n > 0: + for i in range(max(0, h.num_layers - h.xsa_last_n), h.num_layers): + self.blocks[i].attn.use_xsa = True + self.looping_active = False + if h.num_loops > 0: + loop_seg = list(range(h.loop_start, h.loop_end + 1)) + all_indices = list(range(h.loop_start)) + for _ in range(h.num_loops + 
1): + all_indices.extend(loop_seg) + all_indices.extend(range(h.loop_end + 1, h.num_layers)) + num_enc = len(all_indices) // 2 + self.encoder_indices = all_indices[:num_enc] + self.decoder_indices = all_indices[num_enc:] + else: + self.encoder_indices = list(range(self.num_encoder_layers)) + self.decoder_indices = list(range(self.num_encoder_layers, h.num_layers)) + self.num_skip_weights = min( + len(self.encoder_indices), len(self.decoder_indices) + ) + self.skip_weights = nn.Parameter( + torch.ones(self.num_skip_weights, h.model_dim, dtype=torch.float32) + ) + self.skip_gates = ( + nn.Parameter( + torch.zeros(self.num_skip_weights, h.model_dim, dtype=torch.float32) + ) + if h.skip_gates_enabled + else None + ) + self.parallel_start_layer = h.parallel_start_layer + self.parallel_final_lane = h.parallel_final_lane.lower() + self.parallel_post_lambdas = nn.Parameter( + torch.ones(h.num_layers, 2, 2, dtype=torch.float32) + ) + self.parallel_resid_lambdas = nn.Parameter( + torch.full((h.num_layers, 2), 1.1, dtype=torch.float32) + ) + # SmearGate (PR #1667 / modded-nanogpt @classiclarryd): + # x_t <- x_t + lam * sigmoid(W * x_t[:gate_window]) * x_{t-1}. + # Per-token forward-1 smear of the embedding lane. W zero-init + lam=0 -> + # transparent at init. Uses CastedLinear so restore_fp32_params handles dtype. + self.smear_gate_enabled = h.smear_gate_enabled + if self.smear_gate_enabled: + self.smear_window = h.gate_window + self.smear_gate = CastedLinear(self.smear_window, 1, bias=False) + self.smear_gate._zero_init = True + self.smear_lambda = nn.Parameter(torch.zeros(1, dtype=torch.float32)) + # V19: Asymmetric Logit Rescale (PR #1923 jorge-asenjo). + # Two learnable softcap scales applied on the EVAL path (forward_logits + + # forward_ttt). Init to logit_softcap so the layer is identity at step 0. + # Train path keeps the single fused softcap to preserve PR #1855 numerics. 
+ self.asym_logit_enabled = bool(int(os.environ.get("ASYM_LOGIT_RESCALE", "0"))) + if self.asym_logit_enabled: + self.softcap_pos = nn.Parameter(torch.tensor(float(h.logit_softcap), dtype=torch.float32)) + self.softcap_neg = nn.Parameter(torch.tensor(float(h.logit_softcap), dtype=torch.float32)) + # v5 Stage 2C: temperature scaling on logits before softcap (eval-only TTT path). + self.temperature_scale = float(getattr(h, "temperature_scale", 1.0)) + self._init_weights() + + def _init_weights(self): + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + n = self.num_layers + proj_scale = 1.0 / math.sqrt(2 * n) + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) + nn.init.zeros_(self.qo_bank.data[n + i]) + self.qo_bank.data[n + i].mul_(proj_scale) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) + for i in range(n): + nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) + nn.init.zeros_(self.mlp_down_bank.data[i]) + self.mlp_down_bank.data[i].mul_(proj_scale) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif ( + module.weight.ndim == 2 + and module.weight.shape[0] >= 64 + and module.weight.shape[1] >= 64 + ): + nn.init.orthogonal_(module.weight, gain=1.0) + + def _bank_weights(self, i): + n = self.num_layers + return ( + self.qo_bank[i], + self.kv_bank[i], + self.kv_bank[n + i], + self.qo_bank[n + i], + self.mlp_up_bank[i], + self.mlp_down_bank[i], + ) + + def _parallel_block( + self, block_idx, lane0, lane1, x0, + q_w, k_w, v_w, out_w, up_w, down_w, + cu_seqlens=None, max_seqlen=0, + ): + block = self.blocks[block_idx] + mix = block.resid_mix.to(dtype=lane0.dtype) + attn_read = mix[0][None, None, :] * lane0 + mix[1][None, None, :] * x0 + attn_out = block.attn( + block.attn_norm(attn_read) * block.ln_scale_factor, + 
q_w, k_w, v_w, out_w, + cu_seqlens=cu_seqlens, max_seqlen=max_seqlen, + ) + attn_out = block.attn_scale.to(dtype=attn_out.dtype)[None, None, :] * attn_out + mlp_read = lane1 + mlp_out = block.mlp_scale.to(dtype=lane1.dtype)[None, None, :] * block.mlp( + block.mlp_norm(mlp_read) * block.ln_scale_factor, up_w, down_w + ) + attn_resid = self.parallel_resid_lambdas[block_idx, 0].to(dtype=lane0.dtype) + attn_post = self.parallel_post_lambdas[block_idx, 0].to(dtype=lane0.dtype) + mlp_resid = self.parallel_resid_lambdas[block_idx, 1].to(dtype=lane0.dtype) + mlp_post = self.parallel_post_lambdas[block_idx, 1].to(dtype=lane0.dtype) + lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out + lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out + return lane0, lane1 + + def _final_parallel_hidden(self, lane0, lane1): + if self.parallel_final_lane == "mlp": + return lane1 + if self.parallel_final_lane == "attn": + return lane0 + return 0.5 * (lane0 + lane1) + + def _forward_hidden(self, input_ids, cu_seqlens=None, max_seqlen=0): + """Run the encoder/decoder stack to the final RMSNorm; returns pre-projection hidden. + Shared by eval (softcap+projection via forward_logits) and train (fused CE path).""" + x = self.tok_emb(input_ids) + # SmearGate (PR #1667). lam=0 + W=0 -> identity at init. + # Cross-doc leak fix: zero the prev-token smear at any position whose current token + # is BOS, so the BOS embedding starting doc N+1 in a packed stream is not + # contaminated by doc N's last token (audited issue on PR#1797 base). 
+ if self.smear_gate_enabled: + sl = self.smear_lambda.to(dtype=x.dtype) + gate_in = x[:, 1:, : self.smear_window].contiguous() + g = sl * torch.sigmoid(self.smear_gate(gate_in)) + not_bos = (input_ids[:, 1:] != BOS_ID).to(x.dtype).unsqueeze(-1) + x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1] * not_bos], dim=1) + x = F.rms_norm(x, (x.size(-1),)) + x0 = x + skips = [] + enc_iter = ( + self.encoder_indices + if self.looping_active + else range(self.num_encoder_layers) + ) + dec_iter = ( + self.decoder_indices + if self.looping_active + else range( + self.num_encoder_layers, + self.num_encoder_layers + self.num_decoder_layers, + ) + ) + for i in enc_iter: + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + x = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + skips.append(x) + psl = self.parallel_start_layer + lane0 = None + lane1 = None + for skip_idx, i in enumerate(dec_iter): + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + if i >= psl and psl > 0: + if lane0 is None: + lane0 = x + lane1 = x + if skip_idx < self.num_skip_weights and skips: + skip = skips.pop() + w = self.skip_weights[skip_idx].to(dtype=lane0.dtype)[None, None, :] + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=lane0.dtype))[None, None, :] + lane0 = torch.lerp(w * skip, lane0, g) + else: + lane0 = lane0 + w * skip + lane0, lane1 = self._parallel_block( + i, lane0, lane1, x0, q_w, k_w, v_w, out_w, up_w, down_w, + cu_seqlens=cu_seqlens, max_seqlen=max_seqlen, + ) + else: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = ( + self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] + * skips.pop() + ) + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :] + x = torch.lerp(scaled_skip, x, g) + else: + x = x + scaled_skip + x = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, 
cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + if lane0 is not None: + x = self._final_parallel_hidden(lane0, lane1) + x = self.final_norm(x) + return x + + def _project_logits(self, hidden): + if self.tie_embeddings: + return F.linear(hidden, self.tok_emb.weight) + return self.lm_head(hidden) + + def _apply_asym_softcap(self, logits): + # V19: Asymmetric softcap (PR #1923). Splits the logit_softcap scalar into + # learnable positive/negative branches. Score-first preserved: still a + # bounded, normalized post-projection nonlinearity feeding a standard + # softmax over the full vocab. + sp = self.softcap_pos.to(logits.dtype) + sn = self.softcap_neg.to(logits.dtype) + return torch.where(logits > 0, sp * torch.tanh(logits / sp), sn * torch.tanh(logits / sn)) + + def forward_logits(self, input_ids, cu_seqlens=None, max_seqlen=0): + hidden = self._forward_hidden(input_ids, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + logits_proj = self._project_logits(hidden) + if self.asym_logit_enabled: + return self._apply_asym_softcap(logits_proj) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + + def forward(self, input_ids, target_ids, cu_seqlens=None, max_seqlen=0): + hidden = self._forward_hidden(input_ids, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + logits_proj = self._project_logits(hidden) + flat_targets = target_ids.reshape(-1) + # Fused softcapped-CE kernel (training path only). Applies softcap inside the + # Triton kernel; takes pre-softcap logits_proj. Non-fused path matches stock + # PR-1736 numerics exactly (softcap in fp32, then F.cross_entropy on fp32). 
+ if self.fused_ce_enabled: + return softcapped_cross_entropy( + logits_proj.reshape(-1, logits_proj.size(-1)), + flat_targets, + self.logit_softcap, + reduction="mean", + ) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + return F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + flat_targets, + reduction="mean", + ) + + def forward_ttt(self, input_ids, target_ids, lora, hint_ids=None): + x = self.tok_emb(input_ids) + # SmearGate on the TTT path — same inline compute as forward_logits. + # Cross-doc leak fix: see _forward_hidden comment. + if self.smear_gate_enabled: + sl = self.smear_lambda.to(dtype=x.dtype) + gate_in = x[:, 1:, : self.smear_window].contiguous() + g = sl * torch.sigmoid(self.smear_gate(gate_in)) + not_bos = (input_ids[:, 1:] != BOS_ID).to(x.dtype).unsqueeze(-1) + x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1] * not_bos], dim=1) + x = F.rms_norm(x, (x.size(-1),)) + x0 = x + skips = [] + enc_iter = ( + self.encoder_indices + if self.looping_active + else list(range(self.num_encoder_layers)) + ) + dec_iter = ( + self.decoder_indices + if self.looping_active + else list( + range( + self.num_encoder_layers, + self.num_encoder_layers + self.num_decoder_layers, + ) + ) + ) + slot = 0 + for i in enc_iter: + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + x = self._block_with_lora(self.blocks[i], x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w) + slot += 1 + skips.append(x) + psl = self.parallel_start_layer + lane0 = None + lane1 = None + for skip_idx, i in enumerate(dec_iter): + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + if i >= psl and psl > 0: + if lane0 is None: + lane0 = x + lane1 = x + if skip_idx < self.num_skip_weights and skips: + skip = skips.pop() + w = self.skip_weights[skip_idx].to(dtype=lane0.dtype)[None, None, :] + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=lane0.dtype))[None, None, :] + lane0 = torch.lerp(w * 
skip, lane0, g) + else: + lane0 = lane0 + w * skip + lane0, lane1 = self._parallel_block_with_lora( + i, lane0, lane1, x0, lora, slot, + q_w, k_w, v_w, out_w, up_w, down_w, + ) + else: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = ( + self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] + * skips.pop() + ) + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :] + x = torch.lerp(scaled_skip, x, g) + else: + x = x + scaled_skip + x = self._block_with_lora(self.blocks[i], x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w) + slot += 1 + if lane0 is not None: + x = self._final_parallel_hidden(lane0, lane1) + x = self.final_norm(x) + if self.tie_embeddings: + logits = F.linear(x, self.tok_emb.weight) + else: + logits = self.lm_head(x) + logits = logits + lora.lm_head_lora(x) + # v5 Stage 2C: temperature scaling. T=1.0 (default) -> no-op. + # Applied BEFORE softcap so cap acts on calibrated logits. + if getattr(self, "temperature_scale", 1.0) != 1.0: + logits = logits / self.temperature_scale + # V19: same asymmetric softcap on the TTT eval path. + if self.asym_logit_enabled: + logits = self._apply_asym_softcap(logits) + else: + logits = self.logit_softcap * torch.tanh(logits / self.logit_softcap) + bsz, sl, V = logits.shape + if hint_ids is None: + return F.cross_entropy( + logits.float().reshape(-1, V), target_ids.reshape(-1), reduction="none" + ).reshape(bsz, sl) + # PR #1145 tilt branch (v4): Triton fused kernel for eval scoring (no_grad). + # TTT learning path needs autograd, so fall back to vanilla F.log_softmax + # when logits require grad. Triton kernel is forward-only (no backward). 
+ if logits.requires_grad: + ls = F.log_softmax(logits.float(), dim=-1) + log_p_y = ls.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1) + log_q_h = ls.gather(-1, hint_ids.clamp(min=0).unsqueeze(-1)).squeeze(-1) + return -log_p_y, log_q_h + log_p_y, log_q_h = fused_log_softmax_dual_gather( + logits, target_ids, hint_ids.clamp(min=0) + ) + return -log_p_y, log_q_h + + def _block_with_lora(self, block, x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w): + mix = block.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + n = block.attn_norm(x_in) * block.ln_scale_factor + attn = block.attn + bsz, seqlen, dim = n.shape + # Keep raw Q for AttnOutGate src='q' (matches forward path semantics). + q_raw = F.linear(n, q_w.to(n.dtype)) + lora.q_loras[slot](n) + q = q_raw.reshape(bsz, seqlen, attn.num_heads, attn.head_dim) + k = F.linear(n, k_w.to(n.dtype)) + if lora.k_loras is not None: + k = k + lora.k_loras[slot](n) + k = k.reshape(bsz, seqlen, attn.num_kv_heads, attn.head_dim) + v = (F.linear(n, v_w.to(n.dtype)) + lora.v_loras[slot](n)).reshape( + bsz, seqlen, attn.num_kv_heads, attn.head_dim + ) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = attn.rotary(seqlen, n.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, attn.rope_dims) + k = apply_rotary_emb(k, cos, sin, attn.rope_dims) + q = q * attn.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if attn.use_xsa: + y = attn._xsa_efficient(y, v) + # AttnOutGate (TTT path) — inline + .contiguous() barrier, same as the eval path. + if attn.attn_out_gate: + gate_src = q_raw if attn.attn_out_gate_src == "q" else n + gate_in = gate_src[..., : attn.gate_window].contiguous() + g = 2.0 * torch.sigmoid(attn.attn_gate_proj(gate_in)) + y = y * g[..., None] + # Gated Attention (TTT path). Gate input is n (post-norm block input), same + # as eval path. .to(n.dtype) on fp32 param before bf16 broadcast. 
+ if attn.gated_attn: + n_c = n.contiguous() + g = torch.sigmoid(F.linear(n_c, attn.attn_gate_w.to(n.dtype))) + y = y * g[..., None] + # Sparse attention head-output gate (TTT path) — must match the eval path in + # forward() exactly, else training (which applied the gate) and TTT eval (which + # skipped it) produce mismatched representations and catastrophic BPB regression. + if attn.sparse_attn_gate: + gate_in = n[..., : attn.gate_window].contiguous() + g = torch.sigmoid( + attn.sparse_attn_gate_scale + * F.linear(gate_in, attn.attn_gate_w.to(n.dtype)) + ) + y = y * g[..., None] + y = y.reshape(bsz, seqlen, dim) + attn_out = F.linear(y, out_w.to(n.dtype)) + if lora.o_loras is not None: + attn_out = attn_out + lora.o_loras[slot](n) + x_out = x_in + block.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + mlp_n = block.mlp_norm(x_out) * block.ln_scale_factor + mlp_out = block.mlp(mlp_n, up_w, down_w) + if lora.mlp_loras is not None: + mlp_out = mlp_out + lora.mlp_loras[slot](mlp_n) + x_out = x_out + block.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * mlp_out + return x_out + + def _parallel_block_with_lora( + self, block_idx, lane0, lane1, x0, lora, slot, + q_w, k_w, v_w, out_w, up_w, down_w, + ): + block = self.blocks[block_idx] + mix = block.resid_mix.to(dtype=lane0.dtype) + attn_read = mix[0][None, None, :] * lane0 + mix[1][None, None, :] * x0 + n = block.attn_norm(attn_read) * block.ln_scale_factor + attn = block.attn + bsz, seqlen, dim = n.shape + q_raw = F.linear(n, q_w.to(n.dtype)) + lora.q_loras[slot](n) + q = q_raw.reshape(bsz, seqlen, attn.num_heads, attn.head_dim) + k = F.linear(n, k_w.to(n.dtype)) + if lora.k_loras is not None: + k = k + lora.k_loras[slot](n) + k = k.reshape(bsz, seqlen, attn.num_kv_heads, attn.head_dim) + v = (F.linear(n, v_w.to(n.dtype)) + lora.v_loras[slot](n)).reshape( + bsz, seqlen, attn.num_kv_heads, attn.head_dim + ) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = attn.rotary(seqlen, 
n.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, attn.rope_dims) + k = apply_rotary_emb(k, cos, sin, attn.rope_dims) + q = q * attn.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if attn.use_xsa: + y = attn._xsa_efficient(y, v) + # AttnOutGate (TTT parallel path) — inline + .contiguous() barrier. + if attn.attn_out_gate: + gate_src = q_raw if attn.attn_out_gate_src == "q" else n + gate_in = gate_src[..., : attn.gate_window].contiguous() + g = 2.0 * torch.sigmoid(attn.attn_gate_proj(gate_in)) + y = y * g[..., None] + # Gated Attention (TTT parallel path). Gate input is n (post-norm block input). + if attn.gated_attn: + n_c = n.contiguous() + g = torch.sigmoid(F.linear(n_c, attn.attn_gate_w.to(n.dtype))) + y = y * g[..., None] + # Sparse attention head-output gate (TTT parallel path) — must match the + # eval path in forward() to keep train/eval semantics in sync. + if attn.sparse_attn_gate: + gate_in = n[..., : attn.gate_window].contiguous() + g = torch.sigmoid( + attn.sparse_attn_gate_scale + * F.linear(gate_in, attn.attn_gate_w.to(n.dtype)) + ) + y = y * g[..., None] + y = y.reshape(bsz, seqlen, dim) + attn_out = F.linear(y, out_w.to(n.dtype)) + if lora.o_loras is not None: + attn_out = attn_out + lora.o_loras[slot](n) + attn_out = block.attn_scale.to(dtype=attn_out.dtype)[None, None, :] * attn_out + mlp_read = lane1 + mlp_n = block.mlp_norm(mlp_read) * block.ln_scale_factor + mlp_out = block.mlp(mlp_n, up_w, down_w) + if lora.mlp_loras is not None: + mlp_out = mlp_out + lora.mlp_loras[slot](mlp_n) + mlp_out = block.mlp_scale.to(dtype=lane1.dtype)[None, None, :] * mlp_out + attn_resid = self.parallel_resid_lambdas[block_idx, 0].to(dtype=lane0.dtype) + attn_post = self.parallel_post_lambdas[block_idx, 0].to(dtype=lane0.dtype) + mlp_resid = self.parallel_resid_lambdas[block_idx, 1].to(dtype=lane0.dtype) + mlp_post = self.parallel_post_lambdas[block_idx, 1].to(dtype=lane0.dtype) + lane0 = attn_resid * lane0 + 
attn_post[0] * attn_out + mlp_post[0] * mlp_out + lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out + return lane0, lane1 + + +class BatchedLinearLoRA(nn.Module): + # PR-1767: rank-scaled output (alpha/rank), like standard LoRA. Decouples + # effective magnitude from rank so changing rank does not change LR scale. + _ALPHA = float(os.environ.get("TTT_LORA_ALPHA", "144")) + # PR-1767: optionally keep A warm across per-doc resets (only B is zeroed). + # Accumulates useful feature directions across documents within a TTT phase. + _WARM_START_A = bool(int(os.environ.get("TTT_WARM_START_A", "1"))) + + def __init__(self, bsz, in_features, out_features, rank): + super().__init__() + self._bound = 1.0 / math.sqrt(in_features) + self._scale = self._ALPHA / rank + self.A = nn.Parameter( + torch.empty(bsz, rank, in_features).uniform_(-self._bound, self._bound) + ) + self.B = nn.Parameter(torch.zeros(bsz, out_features, rank)) + + def reset(self): + with torch.no_grad(): + if not self._WARM_START_A: + self.A.uniform_(-self._bound, self._bound) + self.B.zero_() + + def forward(self, x): + return ((x @ self.A.transpose(1, 2)) @ self.B.transpose(1, 2)) * self._scale + + +class BatchedTTTLoRA(nn.Module): + def __init__(self, bsz, model, rank, k_lora=True, mlp_lora=True, o_lora=True): + super().__init__() + self.bsz = bsz + dim = model.qo_bank.shape[-1] + vocab = model.tok_emb.num_embeddings + if getattr(model, "looping_active", False): + num_slots = len(model.encoder_indices) + len(model.decoder_indices) + else: + num_slots = len(model.blocks) + kv_dim = model.blocks[0].attn.num_kv_heads * ( + dim // model.blocks[0].attn.num_heads + ) + embed_dim = model.tok_emb.embedding_dim + self.lm_head_lora = BatchedLinearLoRA(bsz, embed_dim, vocab, rank) + self.q_loras = nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + self.v_loras = nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, kv_dim, rank) for _ in range(num_slots)] + ) + 
self.k_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, kv_dim, rank) for _ in range(num_slots)] + ) + if k_lora + else None + ) + self.mlp_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + if mlp_lora + else None + ) + self.o_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + if o_lora + else None + ) + + def reset(self): + with torch.no_grad(): + self.lm_head_lora.reset() + for loras in [self.q_loras, self.v_loras, self.k_loras, + self.mlp_loras, self.o_loras]: + if loras is not None: + for lora in loras: + lora.reset() + + +# Polar Express per-iteration minimax Newton-Schulz coefficients (PR #1344). +# Replaces the fixed (3.4445, -4.775, 2.0315) coefficients of stock Muon. +# Applied at backend_steps=5. Requesting more than 5 steps does not add +# iterations: the slice guard below clamps the schedule to these 5 tuples. +_PE_COEFFS = ( + (8.156554524902461, -22.48329292557795, 15.878769915207462), + (4.042929935166739, -2.808917465908714, 0.5000178451051316), + (3.8916678022926607, -2.772484153217685, 0.5060648178503393), + (3.285753657755655, -2.3681294933425376, 0.46449024233003106), + (2.3465413258596377, -1.7097828382687081, 0.42323551169305323), +) + + +@torch.compile +def zeropower_via_newtonschulz5(G, steps=10, eps=1e-07): + was_2d = G.ndim == 2 + if was_2d: + G = G.unsqueeze(0) + X = G.bfloat16() + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.mT + X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) + coeffs = _PE_COEFFS[:steps] if steps <= len(_PE_COEFFS) else _PE_COEFFS + for a, b, c in coeffs: + A = X @ X.mT + B = b * A + c * (A @ A) + X = a * X + B @ X + if transposed: + X = X.mT + if was_2d: + X = X.squeeze(0) + return X + + +class Muon(torch.optim.Optimizer): + def __init__( + self, + params, + lr, + momentum, + backend_steps, + nesterov=True, + weight_decay=0.0, + row_normalize=False, + ): + super().__init__( + params, + 
dict( + lr=lr, + momentum=momentum, + backend_steps=backend_steps, + nesterov=nesterov, + weight_decay=weight_decay, + row_normalize=row_normalize, + ), + ) + self._built = False + + def _build(self): + self._distributed = dist.is_available() and dist.is_initialized() + self._world_size = dist.get_world_size() if self._distributed else 1 + self._rank = dist.get_rank() if self._distributed else 0 + ws = self._world_size + self._bank_meta = [] + for group in self.param_groups: + for p in group["params"]: + B = p.shape[0] + padded_B = ((B + ws - 1) // ws) * ws + shard_B = padded_B // ws + tail = p.shape[1:] + dev = p.device + self._bank_meta.append({ + "p": p, + "B": B, + "padded_grad": torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + "shard": torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + "shard_mom": torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + "full_update": torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + "scale": max(1, p.shape[-2] / p.shape[-1]) ** 0.5, + }) + self._bank_meta.sort(key=lambda m: -m["p"].numel()) + self._built = True + + def launch_reduce_scatters(self): + if not self._built: + self._build() + if not self._distributed: + return + self._rs_futures = [] + for m in self._bank_meta: + p = m["p"] + if p.grad is None: + self._rs_futures.append(None) + continue + pg = m["padded_grad"] + pg[: m["B"]].copy_(p.grad) + fut = dist.reduce_scatter_tensor( + m["shard"], pg, op=dist.ReduceOp.AVG, async_op=True + ) + self._rs_futures.append(fut) + + @torch.no_grad() + def step(self, closure=None): + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + if not self._built: + self._build() + for group in self.param_groups: + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + wd = group.get("weight_decay", 0.0) + row_normalize = group.get("row_normalize", False) + prev_ag_handle = None + prev_m 
= None + sharded = self._distributed and hasattr(self, "_rs_futures") + for idx, m in enumerate(self._bank_meta): + p = m["p"] + if p.grad is None: + continue + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m["p"] + upd = prev_m["full_update"][: prev_m["B"]] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd, alpha=-lr * prev_m["scale"]) + if sharded and self._rs_futures[idx] is not None: + self._rs_futures[idx].wait() + g = m["shard"] + buf = m["shard_mom"] + else: + g = p.grad.bfloat16() + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + buf.mul_(momentum).add_(g) + if nesterov: + update = g.add(buf, alpha=momentum) + else: + update = buf + if row_normalize: + rn = update.float().norm(dim=-1, keepdim=True).clamp_min(1e-07) + update = update / rn.to(update.dtype) + update = zeropower_via_newtonschulz5(update, steps=backend_steps) + if sharded: + prev_ag_handle = dist.all_gather_into_tensor( + m["full_update"], update, async_op=True + ) + prev_m = m + else: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + p.add_(update, alpha=-lr * m["scale"]) + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m["p"] + upd = prev_m["full_update"][: prev_m["B"]] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd, alpha=-lr * prev_m["scale"]) + if hasattr(self, "_rs_futures"): + del self._rs_futures + return loss + + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,skip_gates,parallel_post_lambdas,parallel_resid_lambdas,attn_gate_proj,attn_gate_w,smear_gate,smear_lambda", + ).split(",") + if pattern +) + + +PACKED_REPLICATED_GRAD_MAX_NUMEL = 1 << 15 + + +class Optimizers: + def __init__(self, h, base_model): + matrix_params = [ + base_model.qo_bank, + base_model.kv_bank, + 
base_model.mlp_up_bank, + base_model.mlp_down_bank, + ] + block_named_params = list(base_model.blocks.named_parameters()) + scalar_params = [ + p + for (name, p) in block_named_params + if p.ndim < 2 + or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + if base_model.skip_gates is not None and base_model.skip_gates.numel() > 0: + scalar_params.append(base_model.skip_gates) + if base_model.parallel_post_lambdas is not None: + scalar_params.append(base_model.parallel_post_lambdas) + if base_model.parallel_resid_lambdas is not None: + scalar_params.append(base_model.parallel_resid_lambdas) + # SmearGate params live on GPT root (not in .blocks), so add them by hand. + # Both are tiny (gate_window scalars + 1 lambda). Optimized via scalar Adam. + if getattr(base_model, "smear_gate_enabled", False): + scalar_params.append(base_model.smear_gate.weight) + scalar_params.append(base_model.smear_lambda) + token_lr = h.tied_embed_lr if h.tie_embeddings else h.embed_lr + tok_params = [ + {"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr} + ] + self.optimizer_tok = torch.optim.AdamW( + tok_params, + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + weight_decay=h.embed_wd, + fused=True, + ) + self.optimizer_muon = Muon( + matrix_params, + lr=h.matrix_lr, + momentum=h.muon_momentum, + backend_steps=h.muon_backend_steps, + weight_decay=h.muon_wd, + row_normalize=h.muon_row_normalize, + ) + for group in self.optimizer_muon.param_groups: + group["base_lr"] = h.matrix_lr + self.optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": h.scalar_lr, "base_lr": h.scalar_lr}], + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + weight_decay=h.adam_wd, + fused=True, + ) + self.optimizers = [ + self.optimizer_tok, + self.optimizer_muon, + self.optimizer_scalar, + ] + self.replicated_params = list(tok_params[0]["params"]) + 
self.replicated_params.extend(scalar_params) + self.replicated_large_params = [] + self.replicated_packed_params = [] + for p in self.replicated_params: + if p.numel() <= PACKED_REPLICATED_GRAD_MAX_NUMEL: + self.replicated_packed_params.append(p) + else: + self.replicated_large_params.append(p) + self._aux_stream = torch.cuda.Stream() + + def __iter__(self): + return iter(self.optimizers) + + def zero_grad_all(self): + for opt in self.optimizers: + opt.zero_grad(set_to_none=True) + + def _all_reduce_packed_grads(self): + grads_by_key = collections.defaultdict(list) + for p in self.replicated_packed_params: + if p.grad is not None: + grads_by_key[(p.grad.device, p.grad.dtype)].append(p.grad) + for grads in grads_by_key.values(): + flat = torch.empty( + sum(g.numel() for g in grads), + device=grads[0].device, + dtype=grads[0].dtype, + ) + offset = 0 + for g in grads: + n = g.numel() + flat[offset : offset + n].copy_(g.contiguous().view(-1)) + offset += n + dist.all_reduce(flat, op=dist.ReduceOp.AVG) + offset = 0 + for g in grads: + n = g.numel() + g.copy_(flat[offset : offset + n].view_as(g)) + offset += n + + def step(self, distributed=False): + self.optimizer_muon.launch_reduce_scatters() + if distributed: + reduce_handles = [ + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG, async_op=True) + for p in self.replicated_large_params + if p.grad is not None + ] + self._all_reduce_packed_grads() + for handle in reduce_handles: + handle.wait() + self._aux_stream.wait_stream(torch.cuda.current_stream()) + with torch.cuda.stream(self._aux_stream): + self.optimizer_tok.step() + self.optimizer_scalar.step() + self.optimizer_muon.step() + torch.cuda.current_stream().wait_stream(self._aux_stream) + self.zero_grad_all() + + +def restore_fp32_params(model): + for module in model.modules(): + if isinstance(module, CastedLinear): + module.float() + for name, param in model.named_parameters(): + if ( + param.ndim < 2 + or any(pattern in name for pattern in 
CONTROL_TENSOR_NAME_PATTERNS) + ) and param.dtype != torch.float32: + param.data = param.data.float() + if hasattr(model, "qo_bank") and model.qo_bank is not None: + model.qo_bank.data = model.qo_bank.data.float() + model.kv_bank.data = model.kv_bank.data.float() + model.mlp_up_bank.data = model.mlp_up_bank.data.float() + model.mlp_down_bank.data = model.mlp_down_bank.data.float() + + +def collect_hessians(model, train_loader, h, device, n_calibration_batches=64): + hessians = {} + act_sumsq = {} + act_counts = {} + hooks = [] + for i, block in enumerate(model.blocks): + block.attn._calib = True + block.mlp._calib = True + block.mlp.use_fused = False + + def make_attn_hook(layer_idx): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + x_sq = x.square().sum(dim=0) + x_count = x.shape[0] + for suffix in ["c_q", "c_k", "c_v"]: + name = f"blocks.{layer_idx}.attn.{suffix}.weight" + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + if name not in act_sumsq: + act_sumsq[name] = torch.zeros( + x.shape[1], dtype=torch.float32, device=device + ) + act_counts[name] = 0 + act_sumsq[name] += x_sq + act_counts[name] += x_count + y = module._last_proj_input + if y is not None: + y = y.float() + if y.ndim == 3: + y = y.reshape(-1, y.shape[-1]) + name = f"blocks.{layer_idx}.attn.proj.weight" + if name not in hessians: + hessians[name] = torch.zeros( + y.shape[1], y.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(y.T, y) + if name not in act_sumsq: + act_sumsq[name] = torch.zeros( + y.shape[1], dtype=torch.float32, device=device + ) + act_counts[name] = 0 + act_sumsq[name] += y.square().sum(dim=0) + act_counts[name] += y.shape[0] + return hook_fn + + def make_mlp_hook(layer_idx): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, 
x.shape[-1]) + name = f"blocks.{layer_idx}.mlp.fc.weight" + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + if name not in act_sumsq: + act_sumsq[name] = torch.zeros( + x.shape[1], dtype=torch.float32, device=device + ) + act_counts[name] = 0 + act_sumsq[name] += x.square().sum(dim=0) + act_counts[name] += x.shape[0] + h_act = module._last_down_input + if h_act is not None: + h_act = h_act.float() + if h_act.ndim == 3: + h_act = h_act.reshape(-1, h_act.shape[-1]) + name = f"blocks.{layer_idx}.mlp.proj.weight" + if name not in hessians: + hessians[name] = torch.zeros( + h_act.shape[1], h_act.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(h_act.T, h_act) + if name not in act_sumsq: + act_sumsq[name] = torch.zeros( + h_act.shape[1], dtype=torch.float32, device=device + ) + act_counts[name] = 0 + act_sumsq[name] += h_act.square().sum(dim=0) + act_counts[name] += h_act.shape[0] + return hook_fn + + for i, block in enumerate(model.blocks): + hooks.append(block.attn.register_forward_hook(make_attn_hook(i))) + hooks.append(block.mlp.register_forward_hook(make_mlp_hook(i))) + + # Hessian hooks for embedding factorization projection layers + def make_linear_input_hook(weight_name): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if weight_name not in hessians: + hessians[weight_name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[weight_name].addmm_(x.T, x) + return hook_fn + + if model.tie_embeddings: + hook_module = model.final_norm + + def make_output_hook(name): + def hook_fn(module, inp, out): + x = out.detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + if 
name not in act_sumsq: + act_sumsq[name] = torch.zeros( + x.shape[1], dtype=torch.float32, device=device + ) + act_counts[name] = 0 + act_sumsq[name] += x.square().sum(dim=0) + act_counts[name] += x.shape[0] + return hook_fn + + hooks.append( + hook_module.register_forward_hook(make_output_hook("tok_emb.weight")) + ) + model.eval() + with torch.no_grad(): + for _ in range(n_calibration_batches): + x, _ = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + model.forward_logits(x) + for hook in hooks: + hook.remove() + for i, block in enumerate(model.blocks): + block.attn._calib = False + block.mlp._calib = False + block.mlp.use_fused = True + for name in hessians: + hessians[name] = hessians[name].cpu() / n_calibration_batches + act_stats = {} + for name, sumsq in act_sumsq.items(): + count = max(act_counts.get(name, 0), 1) + act_stats[name] = (sumsq / count).sqrt().cpu() + return hessians, act_stats + + +def gptq_quantize_weight( + w, + H, + clip_sigmas=3.0, + clip_range=63, + block_size=128, + protect_groups=None, + group_size=None, + protect_clip_range=None, +): + W_orig = w.float().clone() + rows, cols = W_orig.shape + H = H.float().clone() + dead = torch.diag(H) == 0 + H[dead, dead] = 1 + damp = 0.01 * H.diag().mean() + H.diagonal().add_(damp) + perm = torch.argsort(H.diag(), descending=True) + invperm = torch.argsort(perm) + W_perm = W_orig[:, perm].clone() + W_perm[:, dead[perm]] = 0 + H = H[perm][:, perm] + Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H)) + Hinv = torch.linalg.cholesky(Hinv, upper=True) + row_std = W_orig.std(dim=1) + s = (clip_sigmas * row_std / clip_range).clamp_min(1e-10).to(torch.float16) + sf = s.float() + protect_meta = None + protect_mask_perm = None + s_hi = None + sf_hi = None + if ( + protect_groups + and group_size is not None + and protect_clip_range is not None + and protect_clip_range > clip_range + ): + protect_mask = torch.zeros(cols, dtype=torch.bool) + starts = [] + for (start, end) in protect_groups: 
+ if start < 0 or end > cols or end <= start: + continue + protect_mask[start:end] = True + starts.append(start) + if starts: + protect_mask_perm = protect_mask[perm] + s_hi = (clip_sigmas * row_std / protect_clip_range).clamp_min(1e-10).to( + torch.float16 + ) + sf_hi = s_hi.float() + protect_meta = { + "starts": torch.tensor(starts, dtype=torch.int16), + "size": int(group_size), + "s_hi": s_hi, + } + Q = torch.zeros(rows, cols, dtype=torch.int8) + W_work = W_perm.clone() + for i1 in range(0, cols, block_size): + i2 = min(i1 + block_size, cols) + W_block = W_work[:, i1:i2].clone() + Hinv_block = Hinv[i1:i2, i1:i2] + Err = torch.zeros(rows, i2 - i1) + for j in range(i2 - i1): + w_col = W_block[:, j] + d = Hinv_block[j, j] + if protect_mask_perm is not None and bool(protect_mask_perm[i1 + j]): + q_col = torch.clamp( + torch.round(w_col / sf_hi), + -protect_clip_range, + protect_clip_range, + ) + w_recon = q_col.float() * sf_hi + else: + q_col = torch.clamp(torch.round(w_col / sf), -clip_range, clip_range) + w_recon = q_col.float() * sf + Q[:, i1 + j] = q_col.to(torch.int8) + err = (w_col - w_recon) / d + Err[:, j] = err + W_block[:, j:] -= err.unsqueeze(1) * Hinv_block[j, j:].unsqueeze(0) + if i2 < cols: + W_work[:, i2:] -= Err @ Hinv[i1:i2, i2:] + return Q[:, invperm], s, protect_meta + + +def _quantize_gate_int8_row(w): + # Symmetric int8-per-row quantization for small gate tensors. w shape + # (R, C) -> (R,) scales in fp16, int8 values in [-127, 127]. Single scale + # per row keeps accuracy high while halving storage vs fp16. 
+ W = w.float().contiguous() + row_max = W.abs().amax(dim=1).clamp_min(1e-10) + s = (row_max / 127.0).to(torch.float16) + sf = s.float().view(-1, 1) + q = torch.clamp(torch.round(W / sf), -127, 127).to(torch.int8) + return q, s + + +def _lqer_pack(A, B, bits): + rng = 2 ** (bits - 1) - 1 + sA = (A.abs().amax(dim=1).clamp_min(1e-10) / rng).to(torch.float16) + sB = (B.abs().amax(dim=1).clamp_min(1e-10) / rng).to(torch.float16) + qA = torch.clamp(torch.round(A / sA.float().view(-1, 1)), -rng, rng).to(torch.int8) + qB = torch.clamp(torch.round(B / sB.float().view(-1, 1)), -rng, rng).to(torch.int8) + return qA, sA, qB, sB + + +def _lqer_pack_asym(A, B, g=64): + # A: INT2 per-matrix scalar (signed [-2,1], scale = |A|max/1.5). + sA = (A.abs().amax().clamp_min(1e-10) / 1.5).to(torch.float16) + qA = torch.clamp(torch.round(A / sA.float()), -2, 1).to(torch.int8) + # B: INT4 groupwise g over flattened B (signed [-8,7], per-group scale). + Bf = B.reshape(-1, g) + Bmax = Bf.abs().amax(dim=-1, keepdim=True).clamp_min(1e-10) + sB = (Bmax / 7.5).to(torch.float16).reshape(-1) + qB = torch.clamp(torch.round(Bf / sB.float().reshape(-1, 1)), -8, 7).to( + torch.int8 + ).reshape(B.shape) + return qA, sA, qB, sB + + +def _lqer_fit_quantized(E, h): + U, S, Vh = torch.linalg.svd(E, full_matrices=False) + r = min(h.lqer_rank, S.numel()) + if r <= 0: + return None + A = (U[:, :r] * S[:r]).contiguous() + B = Vh[:r, :].contiguous() + asym_on = bool(getattr(h, "lqer_asym_enabled", False)) + asym_g = int(getattr(h, "lqer_asym_group", 64)) + if asym_on and B.numel() % asym_g == 0: + qA, sA, qB, sB = _lqer_pack_asym(A, B, asym_g) + A_hat = qA.float() * float(sA) + g_sz = qB.numel() // sB.numel() + B_hat = (qB.reshape(-1, g_sz).float() * sB.float().view(-1, 1)).reshape( + qB.shape + ) + return { + "kind": "asym", + "qA": qA, + "sA": sA, + "qB": qB, + "sB": sB, + "delta": A_hat @ B_hat, + } + qA, sA, qB, sB = _lqer_pack(A, B, h.lqer_factor_bits) + A_hat = qA.float() * sA.float().view(-1, 1) + B_hat 
= qB.float() * sB.float().view(-1, 1) + return { + "kind": "sym", + "qA": qA, + "sA": sA, + "qB": qB, + "sB": sB, + "delta": A_hat @ B_hat, + } + + +def _awq_lite_group_candidates(w, act_rms, group_size): + cols = w.shape[1] + n_groups = cols // group_size + if n_groups <= 0: + return [] + weight_score = w.float().abs().mean(dim=0) + saliency = act_rms.float() * weight_score + cands = [] + for gi in range(n_groups): + start = gi * group_size + end = start + group_size + score = float(saliency[start:end].sum()) + cands.append((score, start, end)) + return cands + + +def gptq_mixed_quantize(state_dict, hessians, act_stats, h): + result = {} + meta = {} + quant_gate = bool(getattr(h, "gated_attn_quant_gate", False)) + lqer_on = bool(getattr(h, "lqer_enabled", False)) + awq_on = bool(getattr(h, "awq_lite_enabled", False)) + lqer_cands = {} + awq_selected = collections.defaultdict(list) + if awq_on: + awq_cands = [] + for (name, tensor) in state_dict.items(): + t = tensor.detach().cpu().contiguous() + if t.is_floating_point() and t.numel() > 65536 and name in act_stats: + bits = h.embed_bits if "tok_emb" in name else h.matrix_bits + if bits < h.awq_lite_bits: + for score, start, end in _awq_lite_group_candidates( + t, act_stats[name], h.awq_lite_group_size + ): + awq_cands.append((score, name, start, end)) + awq_cands.sort(key=lambda x: -x[0]) + for (_score, name, start, end) in awq_cands[: h.awq_lite_group_top_k]: + awq_selected[name].append((start, end)) + for (name, tensor) in state_dict.items(): + t = tensor.detach().cpu().contiguous() + # Dedicated int8-per-row path for attn_gate_w (bypasses both GPTQ and + # fp16 passthrough). Applied BEFORE the numel<=65536 passthrough check + # so the gate tensor is routed here instead of to fp16. + if ( + quant_gate + and t.is_floating_point() + and t.ndim == 2 + and name.endswith(".attn_gate_w") + # Dense GatedAttn: (num_heads, dim) = (8, 512) = 4096. + # Sparse gate: (num_heads, gate_window) = (8, 12) = 96. 
+ # Both need int8-per-row routing; the 1024 lower bound in stock + # PR-1736 presumed dense-only. Widen to catch both. + and 32 <= t.numel() <= 8192 + ): + gq, gs = _quantize_gate_int8_row(t) + result[name + ".gq"] = gq + result[name + ".gs"] = gs + meta[name] = "gate_int8_row" + continue + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough (float16)" + continue + if "tok_emb" in name: + cs = h.embed_clip_sigmas + elif ".mlp." in name: + cs = h.mlp_clip_sigmas + elif ".attn." in name: + cs = h.attn_clip_sigmas + else: + cs = h.matrix_clip_sigmas + bits = h.embed_bits if "tok_emb" in name else h.matrix_bits + clip_range = 2 ** (bits - 1) - 1 + q, s, protect_meta = gptq_quantize_weight( + t, + hessians[name], + clip_sigmas=cs, + clip_range=clip_range, + protect_groups=awq_selected.get(name), + group_size=h.awq_lite_group_size if name in awq_selected else None, + protect_clip_range=(2 ** (h.awq_lite_bits - 1) - 1) + if name in awq_selected + else None, + ) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = f"gptq (int{bits})" + W_q = q.float() * s.float().view(-1, 1) + if protect_meta is not None: + result[name + ".awqg_start"] = protect_meta["starts"] + result[name + ".awqg_s_hi"] = protect_meta["s_hi"] + result[name + ".awqg_size"] = torch.tensor( + protect_meta["size"], dtype=torch.int16 + ) + meta[name] = meta[name] + f"+awqgrpint{h.awq_lite_bits}" + gsz = protect_meta["size"] + for start in protect_meta["starts"].tolist(): + W_q[:, start : start + gsz] = ( + q[:, start : start + gsz].float() + * protect_meta["s_hi"].float().view(-1, 1) + ) + if lqer_on: + # LQER is fit on top of the fully realized GPTQ base, which already + # includes any higher-precision AWQ-protected groups. + scope = str(getattr(h, "lqer_scope", "all")).lower() + scope_ok = ( + scope == "all" + or (scope == "mlp" and ".mlp." in name) + or (scope == "attn" and ".attn." 
in name) + or (scope == "embed" and "tok_emb" in name) + ) + if scope_ok: + E = t.float() - W_q + err_norm = float(E.norm()) + if err_norm > 0: + lqer_cands[name] = (E, err_norm) + if lqer_on and lqer_cands: + if bool(getattr(h, "lqer_gain_select", False)): + scored = [] + for (name, (E, base_err)) in lqer_cands.items(): + fit = _lqer_fit_quantized(E, h) + if fit is None: + continue + new_err = float((E - fit["delta"]).norm()) + gain = base_err - new_err + if gain > 0: + scored.append((gain, name, fit)) + scored.sort(key=lambda x: -x[0]) + for (_gain, name, fit) in scored[: h.lqer_top_k]: + if fit["kind"] == "asym": + result[name + ".lqA_a"] = fit["qA"] + result[name + ".lqAs_a"] = fit["sA"] + result[name + ".lqB_a"] = fit["qB"] + result[name + ".lqBs_a"] = fit["sB"] + meta[name] = meta[name] + "+lqer_asym" + else: + result[name + ".lqA"] = fit["qA"] + result[name + ".lqAs"] = fit["sA"] + result[name + ".lqB"] = fit["qB"] + result[name + ".lqBs"] = fit["sB"] + meta[name] = meta[name] + "+lqer" + else: + top = sorted(lqer_cands.items(), key=lambda kv: -kv[1][1])[: h.lqer_top_k] + asym_on = bool(getattr(h, "lqer_asym_enabled", False)) + asym_g = int(getattr(h, "lqer_asym_group", 64)) + for (name, (E, _)) in top: + U, S, Vh = torch.linalg.svd(E, full_matrices=False) + r = min(h.lqer_rank, S.numel()) + A = (U[:, :r] * S[:r]).contiguous() + B = Vh[:r, :].contiguous() + if asym_on and B.numel() % asym_g == 0: + qA, sA, qB, sB = _lqer_pack_asym(A, B, asym_g) + result[name + ".lqA_a"] = qA + result[name + ".lqAs_a"] = sA + result[name + ".lqB_a"] = qB + result[name + ".lqBs_a"] = sB + meta[name] = meta[name] + "+lqer_asym" + else: + qA, sA, qB, sB = _lqer_pack(A, B, h.lqer_factor_bits) + result[name + ".lqA"] = qA + result[name + ".lqAs"] = sA + result[name + ".lqB"] = qB + result[name + ".lqBs"] = sB + meta[name] = meta[name] + "+lqer" + categories = collections.defaultdict(set) + for (name, cat) in meta.items(): + short = re.sub("\\.\\d+$", "", re.sub("blocks\\.\\d+", 
"blocks", name)) + categories[cat].add(short) + log("Quantized weights:") + for cat in sorted(categories): + log(f" {cat}: {', '.join(sorted(categories[cat]))}") + return result, meta + +def dequantize_mixed(result, meta, template_sd): + out = {} + for (name, orig) in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if "passthrough" in info: + t = result[name] + if t.dtype == torch.float16 and orig_dtype in ( + torch.float32, + torch.bfloat16, + ): + t = t.to(orig_dtype) + out[name] = t + continue + if info == "gate_int8_row": + gq = result[name + ".gq"] + gs = result[name + ".gs"] + out[name] = (gq.float() * gs.float().view(-1, 1)).to(orig_dtype) + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + W = q.float() * s.float().view(q.shape[0], *[1] * (q.ndim - 1)) + else: + W = q.float() * float(s.item()) + if "awqgrpint" in info: + starts = result[name + ".awqg_start"].tolist() + s_hi = result[name + ".awqg_s_hi"].float() + gsz = int(result[name + ".awqg_size"].item()) + for start in starts: + W[:, start : start + gsz] = ( + q[:, start : start + gsz].float() * s_hi.view(-1, 1) + ) + if "lqer_asym" in info: + qA_t = result[name + ".lqA_a"] + sA_t = result[name + ".lqAs_a"] + qB_t = result[name + ".lqB_a"] + sB_t = result[name + ".lqBs_a"] + qA = qA_t.float() * float(sA_t) + g_sz = qB_t.numel() // sB_t.numel() + qB = (qB_t.reshape(-1, g_sz).float() * sB_t.float().view(-1, 1)).reshape( + qB_t.shape + ) + W = W + qA @ qB + elif "lqer" in info: + qA = result[name + ".lqA"].float() * result[name + ".lqAs"].float().view(-1, 1) + qB = result[name + ".lqB"].float() * result[name + ".lqBs"].float().view(-1, 1) + W = W + qA @ qB + out[name] = W.to(orig_dtype) + return out + + +_BSHF_MAGIC = b"BSHF" + + +# ── Per-group lrzip compression (ported from PR#1586 via PR#1667/1729) ──────── + +_GROUP_ORDER = [ + "_tok_emb.weight.q", + "attn.c_k.weight.q", "attn.c_q.weight.q", + "attn.c_v.weight.q", 
"attn.proj.weight.q", + "mlp.fc.weight.q", "mlp.proj.weight.q", +] +_SIMSORT_KEYS = {"_tok_emb.weight.q", "attn.c_q.weight.q", "mlp.fc.weight.q"} +_PACK_MAGIC = b"PGRP" + + +def _similarity_sort_l1(matrix): + import numpy as _np + n = matrix.shape[0] + used = _np.zeros(n, dtype=bool) + order = [0] + used[0] = True + cur = matrix[0].astype(_np.float32) + for _ in range(n - 1): + dists = _np.sum(_np.abs(matrix[~used].astype(_np.float32) - cur), axis=1) + unused = _np.where(~used)[0] + best = unused[_np.argmin(dists)] + order.append(best) + used[best] = True + cur = matrix[best].astype(_np.float32) + return _np.array(order, dtype=_np.uint16) + + +def _lrzip_compress(data, tmpdir, label): + inp = os.path.join(tmpdir, f"{label}.bin") + out = f"{inp}.lrz" + with open(inp, "wb") as f: + f.write(data) + subprocess.run(["lrzip", "-z", "-L", "9", "-o", out, inp], capture_output=True, check=True) + with open(out, "rb") as f: + result = f.read() + os.remove(inp); os.remove(out) + return result + + +def _lrzip_decompress(data, tmpdir, label): + inp = os.path.join(tmpdir, f"{label}.lrz") + out = os.path.join(tmpdir, f"{label}.bin") + with open(inp, "wb") as f: + f.write(data) + subprocess.run(["lrzip", "-d", "-f", "-o", out, inp], capture_output=True, check=True) + with open(out, "rb") as f: + result = f.read() + os.remove(inp); os.remove(out) + return result + + +def _pack_streams(streams): + import struct + n = len(streams) + hdr = _PACK_MAGIC + struct.pack("= 2 + docs.append((start, end - start)) + return docs + + +def _build_ttt_global_batches(doc_entries, h, ascending=False): + batch_size = h.ttt_batch_size + global_doc_entries = sorted(doc_entries, key=lambda x: x[1][1]) + global_batches = [ + global_doc_entries[i : i + batch_size] + for i in range(0, len(global_doc_entries), batch_size) + ] + indexed = list(enumerate(global_batches)) + if not ascending: + indexed.sort(key=lambda ib: -max(dl for _, (_, dl) in ib[1])) + return indexed + + +def _init_batch_counter(path): + 
with open(path, "wb") as f: + f.write((0).to_bytes(4, "little")) + + +def _claim_next_batch(counter_path, queue_len): + try: + with open(counter_path, "r+b") as f: + fcntl.flock(f, fcntl.LOCK_EX) + idx = int.from_bytes(f.read(4), "little") + f.seek(0) + f.write((idx + 1).to_bytes(4, "little")) + f.flush() + except FileNotFoundError: + return queue_len + return idx + + +def _compute_chunk_window(ci, pred_len, num_chunks, chunk_size, eval_seq_len): + chunk_end = pred_len if ci == num_chunks - 1 else (ci + 1) * chunk_size + win_start = max(0, chunk_end - eval_seq_len) + win_len = chunk_end - win_start + chunk_start = ci * chunk_size + chunk_offset = chunk_start - win_start + chunk_len = chunk_end - chunk_start + return win_start, win_len, chunk_offset, chunk_len + + +def _accumulate_bpb( + ptl, + x, + y, + chunk_offsets, + chunk_lens, + pos_idx, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + loss_sum, + byte_sum, + token_count, + y_bytes=None, +): + pos = pos_idx[: x.size(1)].unsqueeze(0) + mask = ( + (chunk_lens.unsqueeze(1) > 0) + & (pos >= chunk_offsets.unsqueeze(1)) + & (pos < (chunk_offsets + chunk_lens).unsqueeze(1)) + ) + mask_f64 = mask.to(torch.float64) + if y_bytes is not None: + tok_bytes = y_bytes.to(torch.float64) + else: + tok_bytes = base_bytes_lut[y].to(torch.float64) + tok_bytes += (has_leading_space_lut[y] & ~is_boundary_token_lut[x]).to( + torch.float64 + ) + loss_sum += (ptl.to(torch.float64) * mask_f64).sum() + byte_sum += (tok_bytes * mask_f64).sum() + token_count += chunk_lens.to(torch.float64).sum() + + +def _loss_bpb_from_sums(loss_sum, token_count, byte_sum): + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_sum.item()) + return val_loss, val_bpb + + +def _add_to_counter(path, delta): + try: + with open(path, "r+b") as f: + fcntl.flock(f, fcntl.LOCK_EX) + cur = int.from_bytes(f.read(8), "little", signed=True) + cur += int(delta) + f.seek(0) + 
f.write(int(cur).to_bytes(8, "little", signed=True)) + f.flush() + return cur + except FileNotFoundError: + return int(delta) + + +def _init_int64_counter(path): + with open(path, "wb") as f: + f.write((0).to_bytes(8, "little", signed=True)) + + +def _select_ttt_doc_entries(docs, h): + doc_entries = list(enumerate(docs)) + if h.val_doc_fraction < 1.0: + sample_n = max(1, int(round(len(docs) * h.val_doc_fraction))) + sampled_indices = sorted( + random.Random(h.seed).sample(range(len(docs)), sample_n) + ) + return [(i, docs[i]) for i in sampled_indices] + return doc_entries + + +def train_val_ttt_global_sgd_distributed(h, device, val_data, base_model, val_tokens, batch_seqs=None): + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + base_model.eval() + seq_len = h.eval_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = h.global_ttt_chunk_tokens + batch_seqs = h.global_ttt_batch_seqs if batch_seqs is None else batch_seqs + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + ttt_params = [p for p in base_model.parameters()] + for p in ttt_params: + p.requires_grad_(True) + optimizer = torch.optim.SGD( + ttt_params, lr=h.global_ttt_lr, momentum=h.global_ttt_momentum + ) + t_start = time.perf_counter() + for ci in range(num_chunks): + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + is_last_chunk = ci == num_chunks - 1 + if is_last_chunk or h.global_ttt_epochs <= 0: + continue + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs <= 0: + continue + warmup_chunks = max(0, min(h.global_ttt_warmup_chunks, num_chunks - 1)) + if warmup_chunks > 0 and ci < warmup_chunks: + warmup_denom = max(warmup_chunks - 1, 1) + warmup_t = ci / warmup_denom + lr_now = ( + h.global_ttt_warmup_start_lr + + (h.global_ttt_lr - h.global_ttt_warmup_start_lr) * warmup_t + ) + else: + decay_steps = max(num_chunks - 1 - warmup_chunks, 1) + decay_ci = max(ci - warmup_chunks, 0) + lr_now = h.global_ttt_lr * 0.5 * ( 
+ 1.0 + math.cos(math.pi * decay_ci / decay_steps) + ) + for pg in optimizer.param_groups: + pg["lr"] = lr_now + my_seq_s = chunk_seqs * h.rank // h.world_size + my_seq_e = chunk_seqs * (h.rank + 1) // h.world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ in range(h.global_ttt_epochs): + for bs in range(0, my_chunk_seqs, batch_seqs): + be = min(bs + batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_tokens.numel(): + continue + local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x_flat = local[:-1] + y_flat = local[1:] + optimizer.zero_grad(set_to_none=True) + with torch.enable_grad(): + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + if h.global_ttt_respect_doc_boundaries: + bos_pos = (x_flat == BOS_ID).nonzero(as_tuple=True)[0].tolist() + cu_seqlens, max_seqlen = _build_cu_seqlens( + bos_pos, x_flat.numel(), x_flat.device, h.eval_seq_len, 64 + ) + loss = base_model( + x_flat[None], + y_flat[None], + cu_seqlens=cu_seqlens, + max_seqlen=max_seqlen, + ) + else: + x = x_flat.reshape(-1, seq_len) + y = y_flat.reshape(-1, seq_len) + loss = base_model(x, y) + loss.backward() + if dist.is_available() and dist.is_initialized(): + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.SUM) + p.grad.mul_(1.0 / h.world_size) + if h.global_ttt_grad_clip > 0: + torch.nn.utils.clip_grad_norm_(ttt_params, h.global_ttt_grad_clip) + optimizer.step() + base_model.eval() + if h.rank == 0: + elapsed = time.perf_counter() - t_start + log( + f"tttg: c{ci+1}/{num_chunks} lr:{lr_now:.6f} t:{elapsed:.1f}s" + ) + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + + +def _compute_ngram_hints_for_val(h, val_data, log0=print): + """Stage 1A: precompute ngram hints over full val token sequence. 
+ Returns (hint_global, gate_global, boost_global) tensors on CPU, or None if tilt disabled. + + Compliance: single L->R pass over val tokens; uses val data only; produces hint + aligned to target positions [t] for predicting all_tokens[t+1] from prefix [:t+1]. + Same compute as inline precompute, just relocated to run BEFORE eval timer. + """ + if not getattr(h, "ngram_tilt_enabled", False): + return None + from online_ngram_tilt import build_hints_for_targets + all_tokens = val_data.val_tokens + targets_np_all = all_tokens.cpu().numpy().astype("uint16", copy=False)[1:] + t_h0 = time.perf_counter() + hints_pkg = build_hints_for_targets( + target_token_ids_np=targets_np_all, + tokenizer_path=h.tokenizer_path, + vocab_size=h.vocab_size, + log0=log0, + token_order=h.token_order, + token_threshold=h.token_threshold, + token_boost=h.token_boost, + within_tau=h.within_tau, + within_boost=h.within_boost, + word_order=h.word_order, + word_normalize=h.word_normalize, + word_tau=h.word_tau, + word_boost=h.word_boost, + agree_add_boost=h.agree_add_boost, + ) + hint_global = torch.from_numpy(hints_pkg["hint_ids"].astype("int64")) + gate_global = torch.from_numpy(hints_pkg["gate_mask"]) + boost_global = torch.from_numpy(hints_pkg["boost"].astype("float32")) + log0( + f"ngram_tilt:precompute_outside_timer_done elapsed={time.perf_counter()-t_h0:.2f}s " + f"total_targets={hint_global.numel()}" + ) + return (hint_global, gate_global, boost_global) + + +def eval_val_ttt_phased(h, base_model, device, val_data, forward_ttt_train, precomputed_hints=None): + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + base_model.eval() + for p in base_model.parameters(): + p.requires_grad_(False) + all_tokens = val_data.val_tokens + all_tokens_idx = all_tokens.to(torch.int32) + # === PR #1145 n-gram tilt: precompute prefix-only hints over val targets === + # Hints are aligned to target positions: hint_global[i] is the hint for + # predicting token all_tokens[i+1] given prefix all_tokens[:i+1]. 
+ # Stored on CPU as int64; gathered per-chunk to GPU alongside y indices. + ngram_hint_global = None + ngram_gate_global = None + ngram_boost_global = None + if precomputed_hints is not None: + # v5 Stage 1A: hints were precomputed BEFORE eval timer started. + # Save measured eval time = the precompute elapsed (~168s for full tilt). + ngram_hint_global, ngram_gate_global, ngram_boost_global = precomputed_hints + log( + f"ngram_tilt:using_precomputed_hints " + f"total_targets={ngram_hint_global.numel()} (precompute time excluded from eval)" + ) + elif getattr(h, "ngram_tilt_enabled", False): + from online_ngram_tilt import build_hints_for_targets + targets_np_all = all_tokens.cpu().numpy().astype("uint16", copy=False)[1:] + t_h0 = time.perf_counter() + hints_pkg = build_hints_for_targets( + target_token_ids_np=targets_np_all, + tokenizer_path=h.tokenizer_path, + vocab_size=h.vocab_size, + log0=log, + token_order=h.token_order, + token_threshold=h.token_threshold, + token_boost=h.token_boost, + within_tau=h.within_tau, + within_boost=h.within_boost, + word_order=h.word_order, + word_normalize=h.word_normalize, + word_tau=h.word_tau, + word_boost=h.word_boost, + agree_add_boost=h.agree_add_boost, + ) + ngram_hint_global = torch.from_numpy(hints_pkg["hint_ids"].astype("int64")) + ngram_gate_global = torch.from_numpy(hints_pkg["gate_mask"]) + ngram_boost_global = torch.from_numpy(hints_pkg["boost"].astype("float32")) + log( + f"ngram_tilt:precompute_done elapsed={time.perf_counter()-t_h0:.2f}s " + f"total_targets={ngram_hint_global.numel()}" + ) + docs = _find_docs(all_tokens) + doc_entries = _select_ttt_doc_entries(docs, h) + prefix_doc_limit = max(0, min(len(doc_entries), int(h.phased_ttt_prefix_docs))) + num_phases = max(1, int(h.phased_ttt_num_phases)) + phase_boundaries = [] + for pi in range(num_phases): + boundary = prefix_doc_limit * (pi + 1) // num_phases + phase_boundaries.append(boundary) + current_phase = 0 + current_phase_boundary = phase_boundaries[0] + 
log( + "ttt_phased:" + f" total_docs:{len(doc_entries)} prefix_docs:{prefix_doc_limit} " + f"suffix_docs:{len(doc_entries) - prefix_doc_limit}" + f" num_phases:{num_phases} boundaries:{phase_boundaries}" + ) + chunk_size, eval_seq_len = h.ttt_chunk_size, h.ttt_eval_seq_len + eval_batch_set = None + if h.ttt_eval_batches: + eval_batch_set = set(int(x) for x in h.ttt_eval_batches.split(",") if x.strip()) + use_ascending = eval_batch_set is not None + global_batches_sorted = _build_ttt_global_batches( + doc_entries, h, ascending=use_ascending + ) + queue_len = len(global_batches_sorted) + counter_path = f"/tmp/ttt_counter_{h.run_id}" + prefix_counter_path = f"/tmp/ttt_prefix_counter_{h.run_id}" + pause_flag_path = f"/tmp/ttt_pause_flag_{h.run_id}" + if h.rank == 0: + _init_batch_counter(counter_path) + _init_int64_counter(prefix_counter_path) + try: + os.remove(pause_flag_path) + except FileNotFoundError: + pass + if dist.is_available() and dist.is_initialized(): + path_list = [counter_path, prefix_counter_path, pause_flag_path] + dist.broadcast_object_list(path_list, src=0) + counter_path, prefix_counter_path, pause_flag_path = path_list + dist.barrier() + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + byte_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + t_start = time.perf_counter() + reusable_lora = BatchedTTTLoRA( + h.ttt_batch_size, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + + def _build_opt(lora): + if h.ttt_optimizer == "sgd": + return torch.optim.SGD( + lora.parameters(), lr=h.ttt_lora_lr, + momentum=h.ttt_beta1, weight_decay=h.ttt_weight_decay, + ) + return torch.optim.AdamW( + lora.parameters(), lr=h.ttt_lora_lr, + betas=(h.ttt_beta1, h.ttt_beta2), + eps=1e-10, weight_decay=h.ttt_weight_decay, fused=True, + ) + + reusable_opt = _build_opt(reusable_lora) + local_scored_docs = [] + 
global_ttt_done = prefix_doc_limit == 0 + try: + while True: + queue_idx = _claim_next_batch(counter_path, queue_len) + if queue_idx >= queue_len: + break + orig_batch_idx, batch_entries = global_batches_sorted[queue_idx] + batch = [doc for _, doc in batch_entries] + bsz = len(batch) + prev_loss = loss_sum.item() + prev_bytes = byte_sum.item() + prev_tokens = token_count.item() + if bsz == reusable_lora.bsz: + reusable_lora.reset() + for s in reusable_opt.state.values(): + for k, v in s.items(): + if isinstance(v, torch.Tensor): + v.zero_() + elif k == "step": + s[k] = 0 + cur_lora = reusable_lora + cur_opt = reusable_opt + else: + cur_lora = BatchedTTTLoRA( + bsz, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + cur_opt = _build_opt(cur_lora) + pred_lens = [doc_len - 1 for _, doc_len in batch] + num_chunks = [(pl + chunk_size - 1) // chunk_size for pl in pred_lens] + max_nc = max(num_chunks) + num_chunks_t = torch.tensor(num_chunks, dtype=torch.int64, device=device) + for ci in range(max_nc): + active = [ci < nc for nc in num_chunks] + needs_train = any(ci < nc - 1 for nc in num_chunks) + tok_starts = torch.zeros(bsz, dtype=torch.int64) + tok_wls = torch.zeros(bsz, dtype=torch.int64) + chunk_offsets_cpu = torch.zeros(bsz, dtype=torch.int64) + chunk_lens_cpu = torch.zeros(bsz, dtype=torch.int64) + for b in range(bsz): + if not active[b]: + continue + doc_start, doc_len = batch[b] + win_start, win_len, chunk_offset, chunk_len = _compute_chunk_window( + ci, pred_lens[b], num_chunks[b], chunk_size, eval_seq_len + ) + tok_starts[b] = doc_start + win_start + tok_wls[b] = win_len + chunk_offsets_cpu[b] = chunk_offset + chunk_lens_cpu[b] = chunk_len + _, context_size, chunk_offset, _ = _compute_chunk_window( + ci, (ci + 1) * chunk_size, ci + 1, chunk_size, eval_seq_len + ) + col_idx = torch.arange(context_size + 1) + idx = tok_starts.unsqueeze(1) + col_idx.unsqueeze(0) + idx.clamp_(max=all_tokens.numel() - 
1) + gathered_gpu = all_tokens_idx[idx].to( + device=device, dtype=torch.int64, non_blocking=True + ) + valid = (col_idx[:context_size].unsqueeze(0) < tok_wls.unsqueeze(1)).to( + device, non_blocking=True + ) + chunk_offsets = chunk_offsets_cpu.to(device, non_blocking=True) + chunk_lens = chunk_lens_cpu.to(device, non_blocking=True) + x = torch.where(valid, gathered_gpu[:, :context_size], 0) + y = torch.where(valid, gathered_gpu[:, 1 : context_size + 1], 0) + ctx_pos = torch.arange(context_size, device=device, dtype=torch.int64) + # n-gram tilt path: gather hints aligned to y, pass into forward_ttt + hint_ids_gpu = None + gate_mask_gpu = None + boost_gpu = None + if ngram_hint_global is not None: + hint_idx_cpu = ( + tok_starts.unsqueeze(1) + col_idx[:context_size].unsqueeze(0) + ).clamp_(min=0, max=ngram_hint_global.numel() - 1) + hint_ids_gpu = ngram_hint_global[hint_idx_cpu].to( + device=device, dtype=torch.int64, non_blocking=True + ) + gate_mask_gpu = ngram_gate_global[hint_idx_cpu].to( + device=device, non_blocking=True + ) + boost_gpu = ngram_boost_global[hint_idx_cpu].to( + device=device, dtype=torch.float32, non_blocking=True + ) + hint_ids_gpu = torch.where(valid, hint_ids_gpu, torch.zeros_like(hint_ids_gpu)) + gate_mask_gpu = gate_mask_gpu & valid + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + if hint_ids_gpu is not None: + per_tok_loss, log_q_hint = forward_ttt_train( + x, y, lora=cur_lora, hint_ids=hint_ids_gpu + ) + else: + per_tok_loss = forward_ttt_train(x, y, lora=cur_lora) + log_q_hint = None + # CaseOps sidecar-driven byte budget. Mirror the index pattern + # used to build y from all_tokens: y[b, j] corresponds to the + # token at global position tok_starts[b] + 1 + j (when valid). 
+ y_bytes_arg = None + if val_data.caseops_enabled and val_data.val_bytes is not None: + y_idx = ( + tok_starts.unsqueeze(1) + + 1 + + col_idx[:context_size].unsqueeze(0) + ) + y_idx = y_idx.clamp_(max=val_data.val_bytes.numel() - 1) + y_bytes_arg = val_data.val_bytes[y_idx].to( + device=device, dtype=torch.int32, non_blocking=True + ) + # Mirror the `valid` masking used for y so out-of-range tokens + # contribute zero bytes (matches y=0 substitution above). + y_bytes_arg = torch.where( + valid, y_bytes_arg, torch.zeros_like(y_bytes_arg) + ) + # n-gram tilt application: use tilted ptl for BPB accumulation, + # but keep original per_tok_loss for TTT-LoRA backward (training + # objective is base NLL — tilt is a scoring-time overlay). + if hint_ids_gpu is not None and log_q_hint is not None: + from online_ngram_tilt import apply_tilt_to_ptl_torch_fast as apply_tilt_to_ptl_torch + tilted_loss = apply_tilt_to_ptl_torch( + ptl=per_tok_loss, + log_q_hint=log_q_hint, + target_ids=y, + hint_ids=hint_ids_gpu, + gate_mask=gate_mask_gpu, + boost=boost_gpu, + ) + else: + tilted_loss = per_tok_loss + with torch.no_grad(): + _accumulate_bpb( + tilted_loss, + x, + y, + chunk_offsets, + chunk_lens, + ctx_pos, + val_data.base_bytes_lut, + val_data.has_leading_space_lut, + val_data.is_boundary_token_lut, + loss_sum, + byte_sum, + token_count, + y_bytes=y_bytes_arg, + ) + if needs_train: + activate_chunk_mask = (num_chunks_t - 1 > ci).float() + for gi in range(h.ttt_grad_steps): + if gi > 0: + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + per_tok_loss = forward_ttt_train(x, y, lora=cur_lora) + per_doc = per_tok_loss[ + :, chunk_offset : chunk_offset + chunk_size + ].mean(dim=-1) + cur_opt.zero_grad(set_to_none=True) + (per_doc * activate_chunk_mask).sum().backward() + cur_opt.step() + else: + del per_tok_loss + batch_num = orig_batch_idx + 1 + doc_lens = [dl for _, dl in batch] + should_report = batch_num in eval_batch_set if eval_batch_set is not None else True + 
if should_report: + cur_tokens = token_count.item() + cur_loss_val = loss_sum.item() + cur_bytes_val = byte_sum.item() + dt = cur_tokens - prev_tokens + db = cur_bytes_val - prev_bytes + if dt > 0 and db > 0: + b_loss = (cur_loss_val - prev_loss) / dt + b_bpb = b_loss / math.log(2.0) * (dt / db) + else: + b_loss = b_bpb = 0.0 + r_loss = cur_loss_val / max(cur_tokens, 1) + r_bpb = r_loss / math.log(2.0) * (cur_tokens / max(cur_bytes_val, 1)) + elapsed = time.perf_counter() - t_start + log( + f"ttp: b{batch_num}/{queue_len} bl:{b_loss:.4f} bb:{b_bpb:.4f} " + f"rl:{r_loss:.4f} rb:{r_bpb:.4f} dl:{min(doc_lens)}-{max(doc_lens)} " + f"gd:{int(global_ttt_done)}" + ) + if not global_ttt_done: + local_scored_docs.extend( + (orig_batch_idx, pos, doc_start, doc_len) + for pos, (doc_start, doc_len) in enumerate(batch) + ) + prefix_done = _add_to_counter(prefix_counter_path, len(batch_entries)) + if prefix_done >= current_phase_boundary: + try: + with open(pause_flag_path, "x"): + pass + except FileExistsError: + pass + should_pause = os.path.exists(pause_flag_path) + if should_pause: + if dist.is_available() and dist.is_initialized(): + dist.barrier() + gathered_scored_docs = [None] * h.world_size + if dist.is_available() and dist.is_initialized(): + dist.all_gather_object(gathered_scored_docs, local_scored_docs) + else: + gathered_scored_docs = [local_scored_docs] + scored_docs_for_global = [] + for rank_docs in gathered_scored_docs: + if rank_docs: + scored_docs_for_global.extend(rank_docs) + scored_docs_for_global.sort(key=lambda x: (x[0], x[1])) + scored_docs_for_global = scored_docs_for_global[:current_phase_boundary] + scored_token_chunks = [ + val_data.val_tokens[doc_start : doc_start + doc_len] + for _, _, doc_start, doc_len in scored_docs_for_global + ] + if scored_token_chunks: + global_ttt_tokens = torch.cat(scored_token_chunks) + else: + global_ttt_tokens = val_data.val_tokens[:0] + if h.rank == 0: + prefix_done = 0 + try: + with open(prefix_counter_path, "rb") as 
f: + prefix_done = int.from_bytes( + f.read(8), "little", signed=True + ) + except FileNotFoundError: + pass + log( + f"ttpp: phase:{current_phase + 1}/{num_phases} pd:{prefix_done} " + f"gd:{len(scored_docs_for_global)} " + f"t:{time.perf_counter() - t_start:.1f}s" + ) + train_val_ttt_global_sgd_distributed( + h, device, val_data, base_model, global_ttt_tokens + ) + for p in base_model.parameters(): + p.requires_grad_(False) + reusable_lora = BatchedTTTLoRA( + h.ttt_batch_size, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + reusable_opt = _build_opt(reusable_lora) + current_phase += 1 + if current_phase >= num_phases: + global_ttt_done = True + else: + current_phase_boundary = phase_boundaries[current_phase] + if h.rank == 0: + try: + os.remove(pause_flag_path) + except FileNotFoundError: + pass + if dist.is_available() and dist.is_initialized(): + dist.barrier() + if h.rank == 0: + log(f"ttpr: phase:{current_phase}/{num_phases} t:{time.perf_counter() - t_start:.1f}s") + del cur_lora, cur_opt + finally: + pass + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.train() + return _loss_bpb_from_sums(loss_sum, token_count, byte_sum) + + +def timed_eval(label, fn, *args, **kwargs): + torch.cuda.synchronize() + t0 = time.perf_counter() + val_loss, val_bpb = fn(*args, **kwargs) + torch.cuda.synchronize() + elapsed_ms = 1e3 * (time.perf_counter() - t0) + log( + f"{label} val_loss:{val_loss:.8f} val_bpb:{val_bpb:.8f} eval_time:{elapsed_ms:.0f}ms" + ) + return val_loss, val_bpb + + +def train_model(h, device, val_data): + base_model = GPT(h).to(device).bfloat16() + restore_fp32_params(base_model) + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + 
compiled_forward_logits = torch.compile( + base_model.forward_logits, dynamic=False, fullgraph=True + ) + model = compiled_model + log(f"model_params:{sum(p.numel()for p in base_model.parameters())}") + optimizers = Optimizers(h, base_model) + train_loader = DocumentPackingLoader(h, device) + max_wallclock_ms = ( + 1e3 * h.max_wallclock_seconds if h.max_wallclock_seconds > 0 else None + ) + if max_wallclock_ms is not None: + max_wallclock_ms -= h.gptq_reserve_seconds * 1e3 + log( + f"gptq:reserving {h.gptq_reserve_seconds:.0f}s, effective={max_wallclock_ms:.0f}ms" + ) + + def training_frac(step, elapsed_ms): + if max_wallclock_ms is None: + return step / max(h.iterations, 1) + return elapsed_ms / max(max_wallclock_ms, 1e-09) + + def lr_mul(frac): + if h.warmdown_frac <= 0: + return 1.0 + if frac >= 1.0 - h.warmdown_frac: + return max((1.0 - frac) / h.warmdown_frac, h.min_lr) + return 1.0 + + _clip_params = [p for p in base_model.parameters() if p.requires_grad] + def step_fn(step, lr_scale): + train_loss = torch.zeros((), device=device) + for micro_step in range(h.grad_accum_steps): + x, y, cu_seqlens, _max_seqlen = train_loader.next_batch( + h.train_batch_tokens, h.grad_accum_steps + ) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y, cu_seqlens=cu_seqlens, max_seqlen=h.train_seq_len) + train_loss += loss.detach() + (loss / h.grad_accum_steps).backward() + train_loss /= h.grad_accum_steps + if step <= h.muon_momentum_warmup_steps: + + frac = ( + + min(step / h.muon_momentum_warmup_steps, 1.0) + + if h.muon_momentum_warmup_steps > 0 + + else 1.0 + + ) + + muon_momentum = ( + + 1 - frac + + ) * h.muon_momentum_warmup_start + frac * h.muon_momentum + + for group in optimizers.optimizer_muon.param_groups: + + group["momentum"] = muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * lr_scale + if h.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(_clip_params, 
h.grad_clip_norm) + optimizers.step(distributed=h.distributed) + return train_loss + + if h.warmup_steps > 0: + initial_model_state = { + name: tensor.detach().cpu().clone() + for (name, tensor) in base_model.state_dict().items() + } + initial_optimizer_states = [ + copy.deepcopy(opt.state_dict()) for opt in optimizers + ] + model.train() + num_tokens_local = h.train_batch_tokens // h.world_size + for blk in base_model.blocks: + blk.attn.rotary(num_tokens_local, device, torch.bfloat16) + cu_bucket_size = train_loader.cu_bucket_size + warmup_cu_buckets = tuple(cu_bucket_size * i for i in range(1, 5)) + warmup_cu_iters = 3 + x, y, cu_seqlens, _ = train_loader.next_batch( + h.train_batch_tokens, h.grad_accum_steps + ) + log(f"warmup_cu_buckets:{','.join(str(b) for b in warmup_cu_buckets)} iters_each:{warmup_cu_iters}") + def _run_cu_bucket_warmup(): + for bucket_len in warmup_cu_buckets: + boundaries = list(range(0, x.size(1), max(h.train_seq_len, 1))) + if boundaries[-1] != x.size(1): + boundaries.append(x.size(1)) + cu = torch.full((bucket_len,), x.size(1), dtype=torch.int32, device=device) + cu[: len(boundaries)] = torch.tensor(boundaries, dtype=torch.int32, device=device) + for _ in range(warmup_cu_iters): + optimizers.zero_grad_all() + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + wloss = model(x, y, cu_seqlens=cu, max_seqlen=h.train_seq_len) + (wloss / h.grad_accum_steps).backward() + optimizers.zero_grad_all() + _run_cu_bucket_warmup() + if h.num_loops > 0: + base_model.looping_active = True + _run_cu_bucket_warmup() + base_model.looping_active = False + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if ( + warmup_step <= 5 + or (warmup_step + 1) % 10 == 0 + or warmup_step + 1 == h.warmup_steps + ): + log(f"warmup_step: {warmup_step+1}/{h.warmup_steps}") + if h.num_loops > 0: + base_model.looping_active = True + log( + f"loop_warmup:enabled encoder:{base_model.encoder_indices} 
decoder:{base_model.decoder_indices}" + ) + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if ( + warmup_step <= 5 + or (warmup_step + 1) % 10 == 0 + or warmup_step + 1 == h.warmup_steps + ): + log(f"loop_warmup_step: {warmup_step+1}/{h.warmup_steps}") + base_model.looping_active = False + base_model.load_state_dict(initial_model_state, strict=True) + for (opt, state) in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + optimizers.zero_grad_all() + train_loader = DocumentPackingLoader(h, device) + _live_state = base_model.state_dict(keep_vars=True) + ema_state = { + name: t.detach().float().clone() + for (name, t) in _live_state.items() + } + _ema_pairs = [(ema_state[name], t) for (name, t) in _live_state.items()] + ema_decay = h.ema_decay + training_time_ms = 0.0 + forced_stop_step = int(os.environ.get("FORCE_STOP_STEP", "0")) + stop_after_step = forced_stop_step if forced_stop_step > 0 else None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + while True: + last_step = ( + step == h.iterations + or stop_after_step is not None + and step >= stop_after_step + ) + should_validate = ( + last_step or h.val_loss_every > 0 and step % h.val_loss_every == 0 + ) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1e3 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + h, device, val_data, model, compiled_forward_logits + ) + log( + f"{step}/{h.iterations} val_loss: {val_loss:.4f} val_bpb: {val_bpb:.4f}" + ) + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < h.iterations: + log( + f"stopping_early: wallclock_cap train_time: {training_time_ms:.0f}ms step: {step}/{h.iterations}" + ) + break + elapsed_ms = training_time_ms + 1e3 * (time.perf_counter() - t0) + frac = training_frac(step, elapsed_ms) + scale = lr_mul(frac) + if ( + h.num_loops > 0 + and not base_model.looping_active + and frac >= 
h.enable_looping_at + ): + base_model.looping_active = True + log( + f"layer_loop:enabled step:{step} frac:{frac:.3f} encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}" + ) + train_loss = step_fn(step, scale) + with torch.no_grad(): + for ema_t, t in _ema_pairs: + ema_t.mul_(ema_decay).add_(t.detach(), alpha=1.0 - ema_decay) + step += 1 + approx_training_time_ms = training_time_ms + 1e3 * (time.perf_counter() - t0) + should_log_train = h.train_log_every > 0 and ( + step <= 5 or step % h.train_log_every == 0 or stop_after_step is not None + ) + if should_log_train: + tok_per_sec = step * h.train_batch_tokens / (approx_training_time_ms / 1e3) + log( + f"{step}/{h.iterations} train_loss: {train_loss.item():.4f} train_time: {approx_training_time_ms/60000:.1f}m tok/s: {tok_per_sec:.0f}" + ) + reached_cap = ( + forced_stop_step <= 0 + and max_wallclock_ms is not None + and approx_training_time_ms >= max_wallclock_ms + ) + if h.distributed and forced_stop_step <= 0 and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + log( + f"peak memory allocated: {torch.cuda.max_memory_allocated()//1024//1024} MiB reserved: {torch.cuda.max_memory_reserved()//1024//1024} MiB" + ) + log("ema:applying EMA weights") + current_state = base_model.state_dict() + avg_state = { + name: t.to(dtype=current_state[name].dtype) for (name, t) in ema_state.items() + } + base_model.load_state_dict(avg_state, strict=True) + return base_model, compiled_model, compiled_forward_logits + + +def train_and_eval(h, device): + global BOS_ID + random.seed(h.seed) + np.random.seed(h.seed) + torch.manual_seed(h.seed) + torch.cuda.manual_seed_all(h.seed) + if h.artifact_dir and h.is_main_process: + os.makedirs(h.artifact_dir, exist_ok=True) + val_data = 
ValidationData(h, device) + log( + f"train_shards: {len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')))}" + ) + log(f"val_tokens: {val_data.val_tokens.numel()-1}") + # TTT_EVAL_ONLY: skip training + GPTQ, jump straight to TTT eval on a + # pre-existing quantized artifact. Used to test TTT-only improvements + # (e.g., PR-1767's alpha/warm-start/WD) without retraining. + ttt_eval_only = os.environ.get("TTT_EVAL_ONLY", "0") == "1" + quantize_only = os.environ.get("QUANTIZE_ONLY", "0") == "1" + if ttt_eval_only: + log("TTT_EVAL_ONLY=1 — skipping training + GPTQ, loading saved artifact for TTT eval") + log(f"ttt_lora_alpha: {BatchedLinearLoRA._ALPHA}") + log(f"ttt_warm_start_a: {BatchedLinearLoRA._WARM_START_A}") + log(f"ttt_weight_decay: {h.ttt_weight_decay}") + elif quantize_only: + log("QUANTIZE_ONLY=1 — skipping training, loading saved full-precision checkpoint") + log(f"quantize_only checkpoint: {h.model_path}") + if BOS_ID is None: + BOS_ID = 1 + base_model = GPT(h).to(device).bfloat16() + state = torch.load(h.model_path, map_location="cpu") + base_model.load_state_dict(state, strict=True) + del state + serialize(h, base_model, Path(__file__).read_text(encoding="utf-8")) + if h.distributed: + dist.barrier() + else: + base_model, compiled_model, compiled_forward_logits = train_model( + h, device, val_data + ) + torch._dynamo.reset() + timed_eval( + "diagnostic pre-quantization post-ema", + eval_val, + h, + device, + val_data, + compiled_model, + compiled_forward_logits, + ) + if os.environ.get("PREQUANT_ONLY", "0") == "1": + log("PREQUANT_ONLY=1 — skipping serialize/GPTQ/post-quant eval/TTT") + return + serialize(h, base_model, Path(__file__).read_text(encoding="utf-8")) + if h.distributed: + dist.barrier() + eval_model = deserialize(h, device) + if h.num_loops > 0: + eval_model.looping_active = True + if not ttt_eval_only: + compiled_model = torch.compile(eval_model, dynamic=False, fullgraph=True) + compiled_forward_logits = torch.compile( + 
eval_model.forward_logits, dynamic=False, fullgraph=True + ) + timed_eval( + "diagnostic quantized", + eval_val, + h, + device, + val_data, + compiled_model, + compiled_forward_logits, + ) + del eval_model + if h.ttt_enabled: + if not ttt_eval_only: + del compiled_model + if ttt_eval_only: + del eval_model + torch._dynamo.reset() + torch.cuda.empty_cache() + ttt_model = deserialize(h, device) + if h.num_loops > 0: + ttt_model.looping_active = True + for p in ttt_model.parameters(): + p.requires_grad_(False) + + if h.rope_yarn: + _yarn_seqlen = h.train_batch_tokens // h.grad_accum_steps + for block in ttt_model.blocks: + block.attn.rotary(_yarn_seqlen, device, torch.bfloat16) + else: + for block in ttt_model.blocks: + block.attn.rotary._cos_cached = None + block.attn.rotary._sin_cached = None + block.attn.rotary._seq_len_cached = 0 + block.attn.rotary(h.ttt_eval_seq_len, device, torch.bfloat16) + + def _fwd_ttt_inner(input_ids, target_ids, lora): + return ttt_model.forward_ttt(input_ids, target_ids, lora=lora) + + def _fwd_ttt_inner_with_hints(input_ids, target_ids, lora, hint_ids): + return ttt_model.forward_ttt(input_ids, target_ids, lora=lora, hint_ids=hint_ids) + + _fwd_ttt_compiled_inner = None + _fwd_ttt_compiled_inner_hints = None + + def _fwd_ttt(input_ids, target_ids, lora, hint_ids=None): + nonlocal _fwd_ttt_compiled_inner, _fwd_ttt_compiled_inner_hints + if hint_ids is None: + if _fwd_ttt_compiled_inner is None: + _fwd_ttt_compiled_inner = torch.compile(_fwd_ttt_inner, dynamic=True) + return _fwd_ttt_compiled_inner(input_ids, target_ids, lora=lora) + if _fwd_ttt_compiled_inner_hints is None: + _fwd_ttt_compiled_inner_hints = torch.compile( + _fwd_ttt_inner_with_hints, dynamic=True + ) + return _fwd_ttt_compiled_inner_hints( + input_ids, target_ids, lora=lora, hint_ids=hint_ids + ) + + fwd_ttt_compiled = _fwd_ttt + log(f"ttt_lora:warming up compile (random tokens, no val data)") + if BOS_ID is None: + BOS_ID = 1 + t_warmup = time.perf_counter() + 
warmup_bszes = [h.ttt_batch_size] + for bsz in warmup_bszes: + wl = BatchedTTTLoRA( + bsz, ttt_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + wo = torch.optim.AdamW( + wl.parameters(), + lr=h.ttt_lora_lr, + betas=(h.ttt_beta1, h.ttt_beta2), + eps=1e-10, + weight_decay=h.ttt_weight_decay, + fused=True, + ) + for ctx_len in (h.ttt_chunk_size, h.ttt_eval_seq_len): + xw = torch.randint(0, h.vocab_size, (bsz, ctx_len), device=device, dtype=torch.int64) + yw = torch.randint(0, h.vocab_size, (bsz, ctx_len), device=device, dtype=torch.int64) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + ptl = fwd_ttt_compiled(xw, yw, lora=wl) + ptl[:, : min(h.ttt_chunk_size, ctx_len)].mean(dim=-1).sum().backward() + wo.step() + wo.zero_grad(set_to_none=True) + del wl, wo + torch.cuda.empty_cache() + compile_elapsed = time.perf_counter() - t_warmup + log(f"ttt_lora:compile warmup done ({compile_elapsed:.1f}s)") + # v5 Stage 1A: precompute ngram hints BEFORE eval timer (single pass causal, + # uses val tokens only — same compliance as inline). For full tilt this saves + # ~168s of measured eval time without losing any tilt benefit. 
+ precomputed_hints = None + if h.ngram_tilt_enabled and getattr(h, "ngram_hint_precompute_outside", True): + log("v5:precomputing ngram hints OUTSIDE eval timer") + precomputed_hints = _compute_ngram_hints_for_val(h, val_data, log0=log) + log("\nbeginning TTT eval timer") + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_val_loss, ttt_val_bpb = eval_val_ttt_phased( + h, ttt_model, device, val_data, forward_ttt_train=fwd_ttt_compiled, + precomputed_hints=precomputed_hints, + ) + torch.cuda.synchronize() + ttt_eval_elapsed = time.perf_counter() - t_ttt + log( + "quantized_ttt_phased " + f"val_loss:{ttt_val_loss:.8f} val_bpb:{ttt_val_bpb:.8f} " + f"eval_time:{1e3*ttt_eval_elapsed:.0f}ms" + ) + log(f"total_eval_time:{ttt_eval_elapsed:.1f}s") + del ttt_model + + +def main(): + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8 % world_size != 0: + raise ValueError( + f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral" + ) + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + torch.set_float32_matmul_precision("high") + from torch.backends.cuda import ( + enable_cudnn_sdp, + enable_flash_sdp, + enable_math_sdp, + enable_mem_efficient_sdp, + ) + + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(False) + torch._dynamo.config.optimize_ddp = False + torch._dynamo.config.cache_size_limit = 64 + h = Hyperparameters() + set_logging_hparams(h) + if h.is_main_process: + 
os.makedirs(h.artifact_dir if h.artifact_dir else "logs", exist_ok=True) + log(100 * "=", console=False) + log("Hyperparameters:", console=True) + for (k, v) in sorted(vars(type(h)).items()): + if not k.startswith("_"): + log(f" {k}: {v}", console=True) + log("=" * 100, console=False) + log("Source code:", console=False) + log("=" * 100, console=False) + with open(__file__, "r", encoding="utf-8") as _src: + log(_src.read(), console=False) + log("=" * 100, console=False) + log(f"Running Python {sys.version}", console=False) + log(f"Running PyTorch {torch.__version__}", console=False) + log("=" * 100, console=False) + train_and_eval(h, device) + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed0.log b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed0.log new file mode 100644 index 0000000000..9d0d2203d9 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed0.log @@ -0,0 +1,5846 @@ +nohup: ignoring input +==================================================== + v5 PRIMARY noLC fulltilt + precompute outside timer: V21 + #1953 + #1948 + fulltilt-tilt SEED=0 Thu Apr 30 06:31:00 UTC 2026 + LeakyReLU slope 0.3 (code patch + v5 hint-precompute-outside-timer), EVAL_SEQ_LEN 2048 (no long-ctx for cap), no_qv, fulltilt-tilt +==================================================== +W0430 06:31:01.197000 1039730 torch/distributed/run.py:803] +W0430 06:31:01.197000 1039730 torch/distributed/run.py:803] ***************************************** +W0430 06:31:01.197000 1039730 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0430 06:31:01.197000 1039730 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + agree_add_boost: 0.5 + artifact_dir: + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + awq_lite_bits: 8 + awq_lite_enabled: True + awq_lite_group_size: 64 + awq_lite_group_top_k: 1 + beta1: 0.9 + beta2: 0.99 + caseops_enabled: True + compressor: pergroup + data_dir: /runpod-volume/caseops_data/datasets + datasets_dir: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 14.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + fused_ce_enabled: True + gate_window: 12 + gated_attn_enabled: False + gated_attn_init_std: 0.01 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 0.5 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/f52a14a3-f337-475d-ae4d-917bd1d29ebb.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + lqer_asym_enabled: True + lqer_asym_group: 64 + lqer_enabled: True + lqer_factor_bits: 4 + lqer_gain_select: False + lqer_rank: 4 + lqer_scope: all + lqer_top_k: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.1 + mlp_clip_sigmas: 11.5 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + 
ngram_hint_precompute_outside: True + ngram_tilt_enabled: True + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2500 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: f52a14a3-f337-475d-ae4d-917bd1d29ebb + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + smear_gate_enabled: True + sparse_attn_gate_enabled: True + sparse_attn_gate_init_std: 0.0 + sparse_attn_gate_scale: 0.5 + temperature_scale: 1.0 + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + token_boost: 2.625 + token_order: 16 + token_threshold: 0.8 + tokenizer_path: /runpod-volume/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.99 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 80 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin + val_doc_fraction: 1.0 + val_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin + val_loss_every: 0 + vocab_size: 8192 + warmdown_frac: 0.85 + warmup_steps: 20 + within_boost: 0.75 + within_tau: 0.45 + word_boost: 0.75 + word_normalize: strip_punct_lower + word_order: 4 + word_tau: 0.65 + world_size: 8 + xsa_last_n: 11 
+train_shards: 1499 +val_tokens: 47851520 +model_params:35945673 +gptq:reserving 0s, effective=599500ms +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +1/20000 train_loss: 9.0105 train_time: 0.0m tok/s: 17751024 +2/20000 train_loss: 12.9581 train_time: 0.0m tok/s: 11612482 +3/20000 train_loss: 10.2705 train_time: 0.0m tok/s: 10325348 +4/20000 train_loss: 8.7742 train_time: 0.0m tok/s: 9758222 +5/20000 train_loss: 8.0014 train_time: 0.0m tok/s: 9451468 +6/20000 train_loss: 7.5156 train_time: 0.0m tok/s: 9277800 +7/20000 train_loss: 7.2839 train_time: 0.0m tok/s: 9129780 +8/20000 train_loss: 6.9483 train_time: 0.0m tok/s: 9050891 +9/20000 train_loss: 6.5860 train_time: 0.0m tok/s: 8983588 +10/20000 train_loss: 6.4463 train_time: 0.0m tok/s: 8909806 +11/20000 train_loss: 6.1455 train_time: 0.0m tok/s: 8781752 +12/20000 train_loss: 5.8724 train_time: 0.0m tok/s: 8716445 +13/20000 train_loss: 5.7235 train_time: 0.0m tok/s: 8669421 +14/20000 train_loss: 5.3392 train_time: 0.0m tok/s: 8638126 +15/20000 train_loss: 5.3102 train_time: 0.0m tok/s: 8620471 +16/20000 train_loss: 5.2933 train_time: 0.0m tok/s: 8607356 +17/20000 train_loss: 5.1377 train_time: 0.0m tok/s: 8592032 +18/20000 train_loss: 5.0844 train_time: 0.0m tok/s: 8585383 +19/20000 train_loss: 5.0099 train_time: 0.0m tok/s: 8574284 +20/20000 train_loss: 4.9165 train_time: 0.0m tok/s: 8565371 +21/20000 train_loss: 4.8210 train_time: 0.0m tok/s: 8549746 +22/20000 train_loss: 4.8559 train_time: 0.0m tok/s: 8540030 +23/20000 train_loss: 4.7980 train_time: 0.0m tok/s: 8526436 +24/20000 
train_loss: 4.9132 train_time: 0.0m tok/s: 8513113 +25/20000 train_loss: 4.6963 train_time: 0.0m tok/s: 8504926 +26/20000 train_loss: 4.7245 train_time: 0.0m tok/s: 8500667 +27/20000 train_loss: 4.6048 train_time: 0.0m tok/s: 8496905 +28/20000 train_loss: 4.6617 train_time: 0.0m tok/s: 8495571 +29/20000 train_loss: 4.6002 train_time: 0.0m tok/s: 8495370 +30/20000 train_loss: 4.5709 train_time: 0.0m tok/s: 8490619 +31/20000 train_loss: 4.5644 train_time: 0.0m tok/s: 8484751 +32/20000 train_loss: 4.5421 train_time: 0.0m tok/s: 8477705 +33/20000 train_loss: 4.5252 train_time: 0.1m tok/s: 8471873 +34/20000 train_loss: 4.4573 train_time: 0.1m tok/s: 8464569 +35/20000 train_loss: 4.3670 train_time: 0.1m tok/s: 8456342 +36/20000 train_loss: 4.5019 train_time: 0.1m tok/s: 8447991 +37/20000 train_loss: 4.4654 train_time: 0.1m tok/s: 8447650 +38/20000 train_loss: 4.3741 train_time: 0.1m tok/s: 8445256 +39/20000 train_loss: 4.5231 train_time: 0.1m tok/s: 8445273 +40/20000 train_loss: 4.4834 train_time: 0.1m tok/s: 8438097 +41/20000 train_loss: 4.3528 train_time: 0.1m tok/s: 8435987 +42/20000 train_loss: 4.2740 train_time: 0.1m tok/s: 8432526 +43/20000 train_loss: 4.3038 train_time: 0.1m tok/s: 8429744 +44/20000 train_loss: 4.2526 train_time: 0.1m tok/s: 8426438 +45/20000 train_loss: 4.3834 train_time: 0.1m tok/s: 8421537 +46/20000 train_loss: 4.2997 train_time: 0.1m tok/s: 8417575 +47/20000 train_loss: 4.1680 train_time: 0.1m tok/s: 8416414 +48/20000 train_loss: 4.2051 train_time: 0.1m tok/s: 8413488 +49/20000 train_loss: 4.1551 train_time: 0.1m tok/s: 8413390 +50/20000 train_loss: 4.1042 train_time: 0.1m tok/s: 8410172 +51/20000 train_loss: 4.3031 train_time: 0.1m tok/s: 8409375 +52/20000 train_loss: 4.2456 train_time: 0.1m tok/s: 8403704 +53/20000 train_loss: 4.1840 train_time: 0.1m tok/s: 8401315 +54/20000 train_loss: 4.1960 train_time: 0.1m tok/s: 8398949 +55/20000 train_loss: 4.1977 train_time: 0.1m tok/s: 8397614 +56/20000 train_loss: 4.1101 train_time: 0.1m tok/s: 
8395810 +57/20000 train_loss: 4.1572 train_time: 0.1m tok/s: 8393608 +58/20000 train_loss: 4.0902 train_time: 0.1m tok/s: 8391305 +59/20000 train_loss: 4.0499 train_time: 0.1m tok/s: 8390804 +60/20000 train_loss: 3.9716 train_time: 0.1m tok/s: 8391160 +61/20000 train_loss: 3.9698 train_time: 0.1m tok/s: 8389518 +62/20000 train_loss: 4.0810 train_time: 0.1m tok/s: 8388845 +63/20000 train_loss: 4.1709 train_time: 0.1m tok/s: 8387858 +64/20000 train_loss: 3.9689 train_time: 0.1m tok/s: 8388269 +65/20000 train_loss: 4.0895 train_time: 0.1m tok/s: 8385569 +66/20000 train_loss: 4.0442 train_time: 0.1m tok/s: 8384422 +67/20000 train_loss: 3.9574 train_time: 0.1m tok/s: 8382318 +68/20000 train_loss: 3.9705 train_time: 0.1m tok/s: 8382313 +69/20000 train_loss: 3.8863 train_time: 0.1m tok/s: 8381121 +70/20000 train_loss: 3.9976 train_time: 0.1m tok/s: 8380991 +71/20000 train_loss: 3.9085 train_time: 0.1m tok/s: 8381230 +72/20000 train_loss: 4.0867 train_time: 0.1m tok/s: 8380366 +73/20000 train_loss: 3.8928 train_time: 0.1m tok/s: 8379742 +74/20000 train_loss: 3.9314 train_time: 0.1m tok/s: 8378604 +75/20000 train_loss: 3.9082 train_time: 0.1m tok/s: 8377173 +76/20000 train_loss: 3.8849 train_time: 0.1m tok/s: 8375511 +77/20000 train_loss: 3.8310 train_time: 0.1m tok/s: 8374999 +78/20000 train_loss: 3.7644 train_time: 0.1m tok/s: 8376061 +79/20000 train_loss: 3.8934 train_time: 0.1m tok/s: 8375460 +80/20000 train_loss: 3.8134 train_time: 0.1m tok/s: 8374503 +81/20000 train_loss: 3.7480 train_time: 0.1m tok/s: 8373020 +82/20000 train_loss: 3.7796 train_time: 0.1m tok/s: 8371661 +83/20000 train_loss: 3.6426 train_time: 0.1m tok/s: 8369820 +84/20000 train_loss: 3.7002 train_time: 0.1m tok/s: 8368942 +85/20000 train_loss: 3.6587 train_time: 0.1m tok/s: 8368513 +86/20000 train_loss: 3.4554 train_time: 0.1m tok/s: 8368176 +87/20000 train_loss: 3.6848 train_time: 0.1m tok/s: 8367957 +88/20000 train_loss: 3.5622 train_time: 0.1m tok/s: 8367459 +89/20000 train_loss: 3.5792 
train_time: 0.1m tok/s: 8366796 +90/20000 train_loss: 3.6046 train_time: 0.1m tok/s: 8366447 +91/20000 train_loss: 3.6406 train_time: 0.1m tok/s: 8366409 +92/20000 train_loss: 3.7234 train_time: 0.1m tok/s: 8365772 +93/20000 train_loss: 3.6273 train_time: 0.1m tok/s: 8365245 +94/20000 train_loss: 3.6507 train_time: 0.1m tok/s: 8364384 +95/20000 train_loss: 3.6212 train_time: 0.1m tok/s: 8364490 +96/20000 train_loss: 3.5848 train_time: 0.2m tok/s: 8364564 +97/20000 train_loss: 3.4911 train_time: 0.2m tok/s: 8362939 +98/20000 train_loss: 3.5426 train_time: 0.2m tok/s: 8362268 +99/20000 train_loss: 3.5103 train_time: 0.2m tok/s: 8362408 +100/20000 train_loss: 3.4167 train_time: 0.2m tok/s: 8361685 +101/20000 train_loss: 3.4378 train_time: 0.2m tok/s: 8361600 +102/20000 train_loss: 3.4955 train_time: 0.2m tok/s: 8360739 +103/20000 train_loss: 3.3744 train_time: 0.2m tok/s: 8360847 +104/20000 train_loss: 3.4856 train_time: 0.2m tok/s: 8360680 +105/20000 train_loss: 3.3645 train_time: 0.2m tok/s: 8360072 +106/20000 train_loss: 3.4981 train_time: 0.2m tok/s: 8360320 +107/20000 train_loss: 3.2241 train_time: 0.2m tok/s: 8359165 +108/20000 train_loss: 3.4149 train_time: 0.2m tok/s: 8358105 +109/20000 train_loss: 3.4086 train_time: 0.2m tok/s: 8357400 +110/20000 train_loss: 3.4323 train_time: 0.2m tok/s: 8357205 +111/20000 train_loss: 3.4358 train_time: 0.2m tok/s: 8357400 +112/20000 train_loss: 3.4377 train_time: 0.2m tok/s: 8356744 +113/20000 train_loss: 3.3437 train_time: 0.2m tok/s: 8356772 +114/20000 train_loss: 3.3927 train_time: 0.2m tok/s: 8357091 +115/20000 train_loss: 3.4280 train_time: 0.2m tok/s: 8356696 +116/20000 train_loss: 3.2275 train_time: 0.2m tok/s: 8356292 +117/20000 train_loss: 3.4415 train_time: 0.2m tok/s: 8355531 +118/20000 train_loss: 3.3799 train_time: 0.2m tok/s: 8354809 +119/20000 train_loss: 3.3620 train_time: 0.2m tok/s: 8354197 +120/20000 train_loss: 3.3466 train_time: 0.2m tok/s: 8353454 +121/20000 train_loss: 3.2950 train_time: 0.2m tok/s: 
8353931 +122/20000 train_loss: 3.3135 train_time: 0.2m tok/s: 8354104 +123/20000 train_loss: 3.2969 train_time: 0.2m tok/s: 8353151 +124/20000 train_loss: 3.3434 train_time: 0.2m tok/s: 8352155 +125/20000 train_loss: 3.2474 train_time: 0.2m tok/s: 8351813 +126/20000 train_loss: 3.2639 train_time: 0.2m tok/s: 8352017 +127/20000 train_loss: 3.2927 train_time: 0.2m tok/s: 8350771 +128/20000 train_loss: 3.3352 train_time: 0.2m tok/s: 8349666 +129/20000 train_loss: 3.3029 train_time: 0.2m tok/s: 8349368 +130/20000 train_loss: 3.2789 train_time: 0.2m tok/s: 8348553 +131/20000 train_loss: 3.2390 train_time: 0.2m tok/s: 8348426 +132/20000 train_loss: 3.1912 train_time: 0.2m tok/s: 8348185 +133/20000 train_loss: 3.2460 train_time: 0.2m tok/s: 8348365 +134/20000 train_loss: 3.1474 train_time: 0.2m tok/s: 8348510 +135/20000 train_loss: 2.9773 train_time: 0.2m tok/s: 8346370 +136/20000 train_loss: 3.2477 train_time: 0.2m tok/s: 8344932 +137/20000 train_loss: 3.0866 train_time: 0.2m tok/s: 8344432 +138/20000 train_loss: 3.2948 train_time: 0.2m tok/s: 8343768 +139/20000 train_loss: 3.2481 train_time: 0.2m tok/s: 8343348 +140/20000 train_loss: 3.1884 train_time: 0.2m tok/s: 8342973 +141/20000 train_loss: 3.0969 train_time: 0.2m tok/s: 8342797 +142/20000 train_loss: 3.3045 train_time: 0.2m tok/s: 8343192 +143/20000 train_loss: 3.3626 train_time: 0.2m tok/s: 8342650 +144/20000 train_loss: 3.2974 train_time: 0.2m tok/s: 8342508 +145/20000 train_loss: 3.2659 train_time: 0.2m tok/s: 8342426 +146/20000 train_loss: 3.2864 train_time: 0.2m tok/s: 8342642 +147/20000 train_loss: 3.1766 train_time: 0.2m tok/s: 8341825 +148/20000 train_loss: 3.2077 train_time: 0.2m tok/s: 8342372 +149/20000 train_loss: 3.2732 train_time: 0.2m tok/s: 8341855 +150/20000 train_loss: 3.2082 train_time: 0.2m tok/s: 8341874 +151/20000 train_loss: 3.5619 train_time: 0.2m tok/s: 8341042 +152/20000 train_loss: 3.1797 train_time: 0.2m tok/s: 8340396 +153/20000 train_loss: 3.3076 train_time: 0.2m tok/s: 8340518 
+154/20000 train_loss: 3.2043 train_time: 0.2m tok/s: 8339940 +155/20000 train_loss: 3.1550 train_time: 0.2m tok/s: 8339161 +156/20000 train_loss: 3.0516 train_time: 0.2m tok/s: 8338495 +157/20000 train_loss: 3.1076 train_time: 0.2m tok/s: 8337962 +158/20000 train_loss: 3.1996 train_time: 0.2m tok/s: 8338015 +159/20000 train_loss: 3.0644 train_time: 0.2m tok/s: 8337946 +160/20000 train_loss: 3.1831 train_time: 0.3m tok/s: 8338087 +161/20000 train_loss: 3.1466 train_time: 0.3m tok/s: 8338053 +162/20000 train_loss: 3.0814 train_time: 0.3m tok/s: 8337265 +163/20000 train_loss: 3.1532 train_time: 0.3m tok/s: 8337328 +164/20000 train_loss: 3.0355 train_time: 0.3m tok/s: 8336380 +165/20000 train_loss: 3.2211 train_time: 0.3m tok/s: 8336058 +166/20000 train_loss: 3.1528 train_time: 0.3m tok/s: 8335884 +167/20000 train_loss: 3.1464 train_time: 0.3m tok/s: 8335556 +168/20000 train_loss: 3.1921 train_time: 0.3m tok/s: 8335519 +169/20000 train_loss: 3.1188 train_time: 0.3m tok/s: 8335255 +170/20000 train_loss: 2.8159 train_time: 0.3m tok/s: 8334906 +171/20000 train_loss: 3.1409 train_time: 0.3m tok/s: 8334385 +172/20000 train_loss: 3.0965 train_time: 0.3m tok/s: 8333404 +173/20000 train_loss: 3.2364 train_time: 0.3m tok/s: 8334548 +174/20000 train_loss: 3.1239 train_time: 0.3m tok/s: 8334528 +175/20000 train_loss: 3.1495 train_time: 0.3m tok/s: 8334680 +176/20000 train_loss: 3.1667 train_time: 0.3m tok/s: 8334789 +177/20000 train_loss: 3.1315 train_time: 0.3m tok/s: 8333952 +178/20000 train_loss: 2.9748 train_time: 0.3m tok/s: 8333712 +179/20000 train_loss: 3.3340 train_time: 0.3m tok/s: 8333381 +180/20000 train_loss: 2.9790 train_time: 0.3m tok/s: 8333289 +181/20000 train_loss: 2.9740 train_time: 0.3m tok/s: 8333688 +182/20000 train_loss: 3.0624 train_time: 0.3m tok/s: 8333299 +183/20000 train_loss: 3.0095 train_time: 0.3m tok/s: 8332846 +184/20000 train_loss: 3.0259 train_time: 0.3m tok/s: 8332886 +185/20000 train_loss: 2.7218 train_time: 0.3m tok/s: 8331738 +186/20000 
train_loss: 3.1241 train_time: 0.3m tok/s: 8329917 +187/20000 train_loss: 3.0579 train_time: 0.3m tok/s: 8329758 +188/20000 train_loss: 3.2102 train_time: 0.3m tok/s: 8329720 +189/20000 train_loss: 3.5327 train_time: 0.3m tok/s: 8329731 +190/20000 train_loss: 3.0839 train_time: 0.3m tok/s: 8329363 +191/20000 train_loss: 3.0545 train_time: 0.3m tok/s: 8329492 +192/20000 train_loss: 3.0238 train_time: 0.3m tok/s: 8329286 +193/20000 train_loss: 3.0141 train_time: 0.3m tok/s: 8328888 +194/20000 train_loss: 3.0243 train_time: 0.3m tok/s: 8328371 +195/20000 train_loss: 2.9070 train_time: 0.3m tok/s: 8328491 +196/20000 train_loss: 3.1385 train_time: 0.3m tok/s: 8327997 +197/20000 train_loss: 3.0585 train_time: 0.3m tok/s: 8328053 +198/20000 train_loss: 3.0623 train_time: 0.3m tok/s: 8327837 +199/20000 train_loss: 3.0606 train_time: 0.3m tok/s: 8327901 +200/20000 train_loss: 3.0731 train_time: 0.3m tok/s: 8327860 +201/20000 train_loss: 3.1123 train_time: 0.3m tok/s: 8327736 +202/20000 train_loss: 3.3427 train_time: 0.3m tok/s: 8327002 +203/20000 train_loss: 3.0791 train_time: 0.3m tok/s: 8326692 +204/20000 train_loss: 3.0854 train_time: 0.3m tok/s: 8326437 +205/20000 train_loss: 3.0654 train_time: 0.3m tok/s: 8326428 +206/20000 train_loss: 2.9530 train_time: 0.3m tok/s: 8325930 +207/20000 train_loss: 3.0990 train_time: 0.3m tok/s: 8325818 +208/20000 train_loss: 2.9411 train_time: 0.3m tok/s: 8325674 +209/20000 train_loss: 3.0097 train_time: 0.3m tok/s: 8325673 +210/20000 train_loss: 3.0890 train_time: 0.3m tok/s: 8325072 +211/20000 train_loss: 3.2688 train_time: 0.3m tok/s: 8324541 +212/20000 train_loss: 3.0275 train_time: 0.3m tok/s: 8324604 +213/20000 train_loss: 2.9514 train_time: 0.3m tok/s: 8324119 +214/20000 train_loss: 3.1062 train_time: 0.3m tok/s: 8323875 +215/20000 train_loss: 3.0388 train_time: 0.3m tok/s: 8323917 +216/20000 train_loss: 3.0995 train_time: 0.3m tok/s: 8323945 +217/20000 train_loss: 3.0265 train_time: 0.3m tok/s: 8324106 +218/20000 train_loss: 
3.0386 train_time: 0.3m tok/s: 8324046 +219/20000 train_loss: 3.1288 train_time: 0.3m tok/s: 8323651 +220/20000 train_loss: 3.3544 train_time: 0.3m tok/s: 8322833 +221/20000 train_loss: 2.9475 train_time: 0.3m tok/s: 8322463 +222/20000 train_loss: 2.9841 train_time: 0.3m tok/s: 8322462 +223/20000 train_loss: 3.0011 train_time: 0.4m tok/s: 8321978 +224/20000 train_loss: 3.0052 train_time: 0.4m tok/s: 8321573 +225/20000 train_loss: 3.0811 train_time: 0.4m tok/s: 8321048 +226/20000 train_loss: 3.0510 train_time: 0.4m tok/s: 8321201 +227/20000 train_loss: 3.0695 train_time: 0.4m tok/s: 8321486 +228/20000 train_loss: 3.0887 train_time: 0.4m tok/s: 8321773 +229/20000 train_loss: 3.0970 train_time: 0.4m tok/s: 8321865 +230/20000 train_loss: 2.9666 train_time: 0.4m tok/s: 8322032 +231/20000 train_loss: 3.1176 train_time: 0.4m tok/s: 8321984 +232/20000 train_loss: 2.9898 train_time: 0.4m tok/s: 8321943 +233/20000 train_loss: 3.0269 train_time: 0.4m tok/s: 8321368 +234/20000 train_loss: 3.0240 train_time: 0.4m tok/s: 8320782 +235/20000 train_loss: 2.9339 train_time: 0.4m tok/s: 8320711 +236/20000 train_loss: 3.0191 train_time: 0.4m tok/s: 8320749 +237/20000 train_loss: 2.9203 train_time: 0.4m tok/s: 8320384 +238/20000 train_loss: 3.0891 train_time: 0.4m tok/s: 8320099 +239/20000 train_loss: 3.0147 train_time: 0.4m tok/s: 8319745 +240/20000 train_loss: 3.1668 train_time: 0.4m tok/s: 8320011 +241/20000 train_loss: 3.0297 train_time: 0.4m tok/s: 8319989 +242/20000 train_loss: 3.1043 train_time: 0.4m tok/s: 8320093 +243/20000 train_loss: 3.0150 train_time: 0.4m tok/s: 8319800 +244/20000 train_loss: 3.0493 train_time: 0.4m tok/s: 8320381 +245/20000 train_loss: 2.9968 train_time: 0.4m tok/s: 8319510 +246/20000 train_loss: 3.0474 train_time: 0.4m tok/s: 8319516 +247/20000 train_loss: 2.9836 train_time: 0.4m tok/s: 8319352 +248/20000 train_loss: 2.9086 train_time: 0.4m tok/s: 8319602 +249/20000 train_loss: 2.9823 train_time: 0.4m tok/s: 8319600 +250/20000 train_loss: 2.9924 
train_time: 0.4m tok/s: 8319499 +251/20000 train_loss: 2.9438 train_time: 0.4m tok/s: 8319590 +252/20000 train_loss: 2.9413 train_time: 0.4m tok/s: 8319665 +253/20000 train_loss: 3.0321 train_time: 0.4m tok/s: 8319493 +254/20000 train_loss: 3.0922 train_time: 0.4m tok/s: 8319478 +255/20000 train_loss: 3.1038 train_time: 0.4m tok/s: 8319508 +256/20000 train_loss: 2.9682 train_time: 0.4m tok/s: 8319107 +257/20000 train_loss: 2.9689 train_time: 0.4m tok/s: 8318673 +258/20000 train_loss: 3.0234 train_time: 0.4m tok/s: 8318531 +259/20000 train_loss: 2.9497 train_time: 0.4m tok/s: 8318533 +260/20000 train_loss: 3.1592 train_time: 0.4m tok/s: 8318332 +261/20000 train_loss: 2.9402 train_time: 0.4m tok/s: 8318232 +262/20000 train_loss: 2.7902 train_time: 0.4m tok/s: 8318357 +263/20000 train_loss: 2.8018 train_time: 0.4m tok/s: 8318425 +264/20000 train_loss: 2.9812 train_time: 0.4m tok/s: 8318074 +265/20000 train_loss: 2.9997 train_time: 0.4m tok/s: 8317775 +266/20000 train_loss: 2.9256 train_time: 0.4m tok/s: 8317297 +267/20000 train_loss: 2.9390 train_time: 0.4m tok/s: 8317281 +268/20000 train_loss: 3.0120 train_time: 0.4m tok/s: 8317301 +269/20000 train_loss: 3.0136 train_time: 0.4m tok/s: 8317194 +270/20000 train_loss: 3.0072 train_time: 0.4m tok/s: 8317187 +271/20000 train_loss: 3.0125 train_time: 0.4m tok/s: 8317415 +272/20000 train_loss: 3.0722 train_time: 0.4m tok/s: 8317007 +273/20000 train_loss: 2.9296 train_time: 0.4m tok/s: 8317103 +274/20000 train_loss: 3.0293 train_time: 0.4m tok/s: 8316968 +275/20000 train_loss: 2.9473 train_time: 0.4m tok/s: 8317130 +276/20000 train_loss: 2.8838 train_time: 0.4m tok/s: 8316906 +277/20000 train_loss: 2.8777 train_time: 0.4m tok/s: 8316656 +278/20000 train_loss: 2.8458 train_time: 0.4m tok/s: 8316214 +279/20000 train_loss: 2.9762 train_time: 0.4m tok/s: 8315692 +280/20000 train_loss: 3.0162 train_time: 0.4m tok/s: 8315348 +281/20000 train_loss: 2.7724 train_time: 0.4m tok/s: 8315552 +282/20000 train_loss: 3.0694 train_time: 
0.4m tok/s: 8315443 +283/20000 train_loss: 2.8821 train_time: 0.4m tok/s: 8315272 +284/20000 train_loss: 2.9196 train_time: 0.4m tok/s: 8315257 +285/20000 train_loss: 2.9746 train_time: 0.4m tok/s: 8315425 +286/20000 train_loss: 2.9966 train_time: 0.5m tok/s: 8315469 +287/20000 train_loss: 2.8318 train_time: 0.5m tok/s: 8315342 +288/20000 train_loss: 2.9704 train_time: 0.5m tok/s: 8314710 +289/20000 train_loss: 2.8875 train_time: 0.5m tok/s: 8314488 +290/20000 train_loss: 2.9187 train_time: 0.5m tok/s: 8314460 +291/20000 train_loss: 2.8888 train_time: 0.5m tok/s: 8314539 +292/20000 train_loss: 2.7252 train_time: 0.5m tok/s: 8314210 +293/20000 train_loss: 2.9414 train_time: 0.5m tok/s: 8314088 +294/20000 train_loss: 3.0647 train_time: 0.5m tok/s: 8313816 +295/20000 train_loss: 3.0103 train_time: 0.5m tok/s: 8313881 +296/20000 train_loss: 3.0701 train_time: 0.5m tok/s: 8313810 +297/20000 train_loss: 2.9548 train_time: 0.5m tok/s: 8313545 +298/20000 train_loss: 2.9967 train_time: 0.5m tok/s: 8313643 +299/20000 train_loss: 2.8260 train_time: 0.5m tok/s: 8313647 +300/20000 train_loss: 3.0274 train_time: 0.5m tok/s: 8313386 +301/20000 train_loss: 2.9712 train_time: 0.5m tok/s: 8313458 +302/20000 train_loss: 2.8679 train_time: 0.5m tok/s: 8313346 +303/20000 train_loss: 2.9336 train_time: 0.5m tok/s: 8313547 +304/20000 train_loss: 2.9505 train_time: 0.5m tok/s: 8313187 +305/20000 train_loss: 2.9434 train_time: 0.5m tok/s: 8313224 +306/20000 train_loss: 3.0113 train_time: 0.5m tok/s: 8313073 +307/20000 train_loss: 2.9189 train_time: 0.5m tok/s: 8313041 +308/20000 train_loss: 2.9108 train_time: 0.5m tok/s: 8312663 +309/20000 train_loss: 3.0495 train_time: 0.5m tok/s: 8312574 +310/20000 train_loss: 2.8708 train_time: 0.5m tok/s: 8312726 +311/20000 train_loss: 2.9414 train_time: 0.5m tok/s: 8312084 +312/20000 train_loss: 2.8375 train_time: 0.5m tok/s: 8312567 +313/20000 train_loss: 2.8410 train_time: 0.5m tok/s: 8312609 +314/20000 train_loss: 2.8783 train_time: 0.5m tok/s: 
8312332 +315/20000 train_loss: 2.9584 train_time: 0.5m tok/s: 8312363 +316/20000 train_loss: 2.6926 train_time: 0.5m tok/s: 8311979 +317/20000 train_loss: 2.8154 train_time: 0.5m tok/s: 8311751 +318/20000 train_loss: 2.9254 train_time: 0.5m tok/s: 8311549 +319/20000 train_loss: 2.9093 train_time: 0.5m tok/s: 8311300 +320/20000 train_loss: 3.0335 train_time: 0.5m tok/s: 8311092 +321/20000 train_loss: 3.0073 train_time: 0.5m tok/s: 8310757 +322/20000 train_loss: 2.9691 train_time: 0.5m tok/s: 8310814 +323/20000 train_loss: 3.0084 train_time: 0.5m tok/s: 8310869 +324/20000 train_loss: 2.9123 train_time: 0.5m tok/s: 8310525 +325/20000 train_loss: 2.8994 train_time: 0.5m tok/s: 8310474 +326/20000 train_loss: 2.9003 train_time: 0.5m tok/s: 8310596 +327/20000 train_loss: 2.8339 train_time: 0.5m tok/s: 8310742 +328/20000 train_loss: 2.8576 train_time: 0.5m tok/s: 8310434 +329/20000 train_loss: 2.8284 train_time: 0.5m tok/s: 8310471 +330/20000 train_loss: 2.7833 train_time: 0.5m tok/s: 8310618 +331/20000 train_loss: 2.8970 train_time: 0.5m tok/s: 8309079 +332/20000 train_loss: 2.9814 train_time: 0.5m tok/s: 8309740 +333/20000 train_loss: 2.8781 train_time: 0.5m tok/s: 8309788 +334/20000 train_loss: 3.1051 train_time: 0.5m tok/s: 8309479 +335/20000 train_loss: 2.8554 train_time: 0.5m tok/s: 8309255 +336/20000 train_loss: 2.9344 train_time: 0.5m tok/s: 8309170 +337/20000 train_loss: 2.8251 train_time: 0.5m tok/s: 8309378 +338/20000 train_loss: 2.8984 train_time: 0.5m tok/s: 8309139 +339/20000 train_loss: 2.9438 train_time: 0.5m tok/s: 8309158 +340/20000 train_loss: 2.9740 train_time: 0.5m tok/s: 8309027 +341/20000 train_loss: 2.9302 train_time: 0.5m tok/s: 8309013 +342/20000 train_loss: 2.8124 train_time: 0.5m tok/s: 8309044 +343/20000 train_loss: 2.9206 train_time: 0.5m tok/s: 8309156 +344/20000 train_loss: 2.8295 train_time: 0.5m tok/s: 8308962 +345/20000 train_loss: 2.8716 train_time: 0.5m tok/s: 8308871 +346/20000 train_loss: 2.8812 train_time: 0.5m tok/s: 8308869 
+347/20000 train_loss: 2.8985 train_time: 0.5m tok/s: 8308868 +348/20000 train_loss: 2.8631 train_time: 0.5m tok/s: 8308617 +349/20000 train_loss: 2.9373 train_time: 0.6m tok/s: 8308709 +350/20000 train_loss: 2.7874 train_time: 0.6m tok/s: 8308761 +351/20000 train_loss: 2.8082 train_time: 0.6m tok/s: 8308960 +352/20000 train_loss: 2.7819 train_time: 0.6m tok/s: 8308819 +353/20000 train_loss: 2.6282 train_time: 0.6m tok/s: 8308401 +354/20000 train_loss: 2.9905 train_time: 0.6m tok/s: 8307983 +355/20000 train_loss: 2.9190 train_time: 0.6m tok/s: 8307745 +356/20000 train_loss: 2.8343 train_time: 0.6m tok/s: 8307463 +357/20000 train_loss: 2.7883 train_time: 0.6m tok/s: 8307149 +358/20000 train_loss: 2.7978 train_time: 0.6m tok/s: 8307154 +359/20000 train_loss: 2.9036 train_time: 0.6m tok/s: 8307173 +360/20000 train_loss: 2.8878 train_time: 0.6m tok/s: 8306947 +361/20000 train_loss: 2.9643 train_time: 0.6m tok/s: 8306796 +362/20000 train_loss: 2.8693 train_time: 0.6m tok/s: 8306609 +363/20000 train_loss: 2.9555 train_time: 0.6m tok/s: 8306449 +364/20000 train_loss: 2.8194 train_time: 0.6m tok/s: 8306378 +365/20000 train_loss: 2.8047 train_time: 0.6m tok/s: 8306308 +366/20000 train_loss: 2.8134 train_time: 0.6m tok/s: 8306187 +367/20000 train_loss: 2.9347 train_time: 0.6m tok/s: 8306247 +368/20000 train_loss: 2.7426 train_time: 0.6m tok/s: 8306092 +369/20000 train_loss: 2.8960 train_time: 0.6m tok/s: 8306209 +370/20000 train_loss: 2.8821 train_time: 0.6m tok/s: 8306250 +371/20000 train_loss: 2.8808 train_time: 0.6m tok/s: 8306232 +372/20000 train_loss: 2.8391 train_time: 0.6m tok/s: 8306104 +373/20000 train_loss: 2.7237 train_time: 0.6m tok/s: 8306009 +374/20000 train_loss: 2.7375 train_time: 0.6m tok/s: 8306029 +375/20000 train_loss: 2.6791 train_time: 0.6m tok/s: 8305982 +376/20000 train_loss: 2.9179 train_time: 0.6m tok/s: 8305586 +377/20000 train_loss: 2.7360 train_time: 0.6m tok/s: 8305482 +378/20000 train_loss: 2.8210 train_time: 0.6m tok/s: 8305502 +379/20000 
train_loss: 2.8791 train_time: 0.6m tok/s: 8305423 +380/20000 train_loss: 2.8845 train_time: 0.6m tok/s: 8305461 +381/20000 train_loss: 2.9045 train_time: 0.6m tok/s: 8305396 +382/20000 train_loss: 2.9593 train_time: 0.6m tok/s: 8305384 +383/20000 train_loss: 2.9515 train_time: 0.6m tok/s: 8305385 +384/20000 train_loss: 2.8185 train_time: 0.6m tok/s: 8305093 +385/20000 train_loss: 2.8374 train_time: 0.6m tok/s: 8305159 +386/20000 train_loss: 2.8785 train_time: 0.6m tok/s: 8305031 +387/20000 train_loss: 3.0511 train_time: 0.6m tok/s: 8304672 +388/20000 train_loss: 2.8788 train_time: 0.6m tok/s: 8304616 +389/20000 train_loss: 2.9178 train_time: 0.6m tok/s: 8304649 +390/20000 train_loss: 2.7546 train_time: 0.6m tok/s: 8304629 +391/20000 train_loss: 2.7221 train_time: 0.6m tok/s: 8304607 +392/20000 train_loss: 2.7834 train_time: 0.6m tok/s: 8304521 +393/20000 train_loss: 2.8514 train_time: 0.6m tok/s: 8304572 +394/20000 train_loss: 2.8429 train_time: 0.6m tok/s: 8304619 +395/20000 train_loss: 2.9336 train_time: 0.6m tok/s: 8304394 +396/20000 train_loss: 2.8267 train_time: 0.6m tok/s: 8304404 +397/20000 train_loss: 2.8330 train_time: 0.6m tok/s: 8304331 +398/20000 train_loss: 2.8620 train_time: 0.6m tok/s: 8304243 +399/20000 train_loss: 2.7743 train_time: 0.6m tok/s: 8304308 +400/20000 train_loss: 2.8733 train_time: 0.6m tok/s: 8304196 +401/20000 train_loss: 2.8608 train_time: 0.6m tok/s: 8304369 +402/20000 train_loss: 2.7291 train_time: 0.6m tok/s: 8304295 +403/20000 train_loss: 2.9593 train_time: 0.6m tok/s: 8304259 +404/20000 train_loss: 2.9341 train_time: 0.6m tok/s: 8303881 +405/20000 train_loss: 2.9343 train_time: 0.6m tok/s: 8303696 +406/20000 train_loss: 2.8108 train_time: 0.6m tok/s: 8303439 +407/20000 train_loss: 2.8319 train_time: 0.6m tok/s: 8303230 +408/20000 train_loss: 2.8349 train_time: 0.6m tok/s: 8303207 +409/20000 train_loss: 2.8087 train_time: 0.6m tok/s: 8303194 +410/20000 train_loss: 2.8774 train_time: 0.6m tok/s: 8303319 +411/20000 train_loss: 
2.8142 train_time: 0.6m tok/s: 8303269 +412/20000 train_loss: 2.8197 train_time: 0.7m tok/s: 8303262 +413/20000 train_loss: 2.7056 train_time: 0.7m tok/s: 8302987 +414/20000 train_loss: 2.7276 train_time: 0.7m tok/s: 8302925 +415/20000 train_loss: 2.7043 train_time: 0.7m tok/s: 8302825 +416/20000 train_loss: 2.7731 train_time: 0.7m tok/s: 8302646 +417/20000 train_loss: 2.7717 train_time: 0.7m tok/s: 8302288 +418/20000 train_loss: 2.7908 train_time: 0.7m tok/s: 8302331 +419/20000 train_loss: 2.8115 train_time: 0.7m tok/s: 8302327 +420/20000 train_loss: 2.8026 train_time: 0.7m tok/s: 8302436 +421/20000 train_loss: 2.8749 train_time: 0.7m tok/s: 8302306 +422/20000 train_loss: 2.8429 train_time: 0.7m tok/s: 8302424 +423/20000 train_loss: 2.8367 train_time: 0.7m tok/s: 8302198 +424/20000 train_loss: 2.9085 train_time: 0.7m tok/s: 8301950 +425/20000 train_loss: 2.8109 train_time: 0.7m tok/s: 8301553 +426/20000 train_loss: 2.8344 train_time: 0.7m tok/s: 8300989 +427/20000 train_loss: 2.8290 train_time: 0.7m tok/s: 8300901 +428/20000 train_loss: 2.7953 train_time: 0.7m tok/s: 8300855 +429/20000 train_loss: 2.7342 train_time: 0.7m tok/s: 8300957 +430/20000 train_loss: 2.8673 train_time: 0.7m tok/s: 8301096 +431/20000 train_loss: 2.6859 train_time: 0.7m tok/s: 8300964 +432/20000 train_loss: 2.7375 train_time: 0.7m tok/s: 8301064 +433/20000 train_loss: 2.6797 train_time: 0.7m tok/s: 8301122 +434/20000 train_loss: 2.6712 train_time: 0.7m tok/s: 8300700 +435/20000 train_loss: 2.8652 train_time: 0.7m tok/s: 8300274 +436/20000 train_loss: 2.4884 train_time: 0.7m tok/s: 8300064 +437/20000 train_loss: 2.7448 train_time: 0.7m tok/s: 8299988 +438/20000 train_loss: 2.8636 train_time: 0.7m tok/s: 8299964 +439/20000 train_loss: 2.7630 train_time: 0.7m tok/s: 8299210 +440/20000 train_loss: 2.6789 train_time: 0.7m tok/s: 8299221 +441/20000 train_loss: 2.9186 train_time: 0.7m tok/s: 8299236 +442/20000 train_loss: 2.9713 train_time: 0.7m tok/s: 8299175 +443/20000 train_loss: 2.9180 
train_time: 0.7m tok/s: 8299190 +444/20000 train_loss: 2.9314 train_time: 0.7m tok/s: 8299134 +445/20000 train_loss: 2.8953 train_time: 0.7m tok/s: 8298992 +446/20000 train_loss: 2.7732 train_time: 0.7m tok/s: 8298912 +447/20000 train_loss: 2.7944 train_time: 0.7m tok/s: 8298647 +448/20000 train_loss: 2.8141 train_time: 0.7m tok/s: 8298669 +449/20000 train_loss: 2.7880 train_time: 0.7m tok/s: 8298672 +450/20000 train_loss: 2.8260 train_time: 0.7m tok/s: 8298779 +451/20000 train_loss: 2.5308 train_time: 0.7m tok/s: 8298621 +452/20000 train_loss: 2.7551 train_time: 0.7m tok/s: 8298211 +453/20000 train_loss: 2.6953 train_time: 0.7m tok/s: 8297998 +454/20000 train_loss: 2.7066 train_time: 0.7m tok/s: 8297747 +455/20000 train_loss: 2.7656 train_time: 0.7m tok/s: 8297741 +456/20000 train_loss: 2.7799 train_time: 0.7m tok/s: 8297688 +457/20000 train_loss: 2.6943 train_time: 0.7m tok/s: 8297520 +458/20000 train_loss: 2.7682 train_time: 0.7m tok/s: 8297350 +459/20000 train_loss: 2.8875 train_time: 0.7m tok/s: 8297257 +460/20000 train_loss: 2.8140 train_time: 0.7m tok/s: 8297277 +461/20000 train_loss: 2.8753 train_time: 0.7m tok/s: 8297434 +462/20000 train_loss: 2.9237 train_time: 0.7m tok/s: 8297432 +463/20000 train_loss: 2.8135 train_time: 0.7m tok/s: 8297587 +464/20000 train_loss: 2.7679 train_time: 0.7m tok/s: 8297645 +465/20000 train_loss: 2.9482 train_time: 0.7m tok/s: 8297505 +466/20000 train_loss: 2.8486 train_time: 0.7m tok/s: 8297369 +467/20000 train_loss: 2.8439 train_time: 0.7m tok/s: 8297383 +468/20000 train_loss: 2.9954 train_time: 0.7m tok/s: 8296997 +469/20000 train_loss: 2.7397 train_time: 0.7m tok/s: 8296645 +470/20000 train_loss: 2.7636 train_time: 0.7m tok/s: 8296525 +471/20000 train_loss: 2.8820 train_time: 0.7m tok/s: 8296332 +472/20000 train_loss: 2.9836 train_time: 0.7m tok/s: 8296204 +473/20000 train_loss: 2.7063 train_time: 0.7m tok/s: 8295652 +474/20000 train_loss: 2.6838 train_time: 0.7m tok/s: 8295520 +475/20000 train_loss: 2.8623 train_time: 
0.8m tok/s: 8295397 +476/20000 train_loss: 2.6205 train_time: 0.8m tok/s: 8295125 +477/20000 train_loss: 2.7206 train_time: 0.8m tok/s: 8295108 +478/20000 train_loss: 2.8284 train_time: 0.8m tok/s: 8295023 +479/20000 train_loss: 2.7982 train_time: 0.8m tok/s: 8294921 +480/20000 train_loss: 3.0527 train_time: 0.8m tok/s: 8294748 +481/20000 train_loss: 2.8510 train_time: 0.8m tok/s: 8294792 +482/20000 train_loss: 2.7812 train_time: 0.8m tok/s: 8294876 +483/20000 train_loss: 2.8218 train_time: 0.8m tok/s: 8294960 +484/20000 train_loss: 2.8879 train_time: 0.8m tok/s: 8294996 +485/20000 train_loss: 2.7614 train_time: 0.8m tok/s: 8294994 +486/20000 train_loss: 2.7527 train_time: 0.8m tok/s: 8294773 +487/20000 train_loss: 2.8213 train_time: 0.8m tok/s: 8294707 +488/20000 train_loss: 2.7674 train_time: 0.8m tok/s: 8294724 +489/20000 train_loss: 2.3654 train_time: 0.8m tok/s: 8294623 +490/20000 train_loss: 2.8683 train_time: 0.8m tok/s: 8294320 +491/20000 train_loss: 2.7863 train_time: 0.8m tok/s: 8294447 +492/20000 train_loss: 2.7823 train_time: 0.8m tok/s: 8294606 +493/20000 train_loss: 2.6790 train_time: 0.8m tok/s: 8294654 +494/20000 train_loss: 2.6902 train_time: 0.8m tok/s: 8294411 +495/20000 train_loss: 2.8026 train_time: 0.8m tok/s: 8294271 +496/20000 train_loss: 2.6926 train_time: 0.8m tok/s: 8294195 +497/20000 train_loss: 2.9288 train_time: 0.8m tok/s: 8294176 +498/20000 train_loss: 2.8392 train_time: 0.8m tok/s: 8294190 +499/20000 train_loss: 2.9333 train_time: 0.8m tok/s: 8294243 +500/20000 train_loss: 2.7403 train_time: 0.8m tok/s: 8294290 +501/20000 train_loss: 2.9151 train_time: 0.8m tok/s: 8294287 +502/20000 train_loss: 2.7125 train_time: 0.8m tok/s: 8294348 +503/20000 train_loss: 2.7869 train_time: 0.8m tok/s: 8294294 +504/20000 train_loss: 2.6870 train_time: 0.8m tok/s: 8294266 +505/20000 train_loss: 2.8791 train_time: 0.8m tok/s: 8294238 +506/20000 train_loss: 2.7955 train_time: 0.8m tok/s: 8294159 +507/20000 train_loss: 2.7744 train_time: 0.8m tok/s: 
8293943 +508/20000 train_loss: 2.9228 train_time: 0.8m tok/s: 8293922 +509/20000 train_loss: 2.9088 train_time: 0.8m tok/s: 8294044 +510/20000 train_loss: 2.7034 train_time: 0.8m tok/s: 8294017 +511/20000 train_loss: 2.8775 train_time: 0.8m tok/s: 8293960 +512/20000 train_loss: 2.8526 train_time: 0.8m tok/s: 8294001 +513/20000 train_loss: 2.8954 train_time: 0.8m tok/s: 8294170 +514/20000 train_loss: 2.8517 train_time: 0.8m tok/s: 8294293 +515/20000 train_loss: 2.8551 train_time: 0.8m tok/s: 8294280 +516/20000 train_loss: 2.7305 train_time: 0.8m tok/s: 8294359 +517/20000 train_loss: 2.8161 train_time: 0.8m tok/s: 8294231 +518/20000 train_loss: 2.9461 train_time: 0.8m tok/s: 8293893 +519/20000 train_loss: 2.7509 train_time: 0.8m tok/s: 8293956 +520/20000 train_loss: 2.6679 train_time: 0.8m tok/s: 8293923 +521/20000 train_loss: 2.7632 train_time: 0.8m tok/s: 8293825 +522/20000 train_loss: 2.7464 train_time: 0.8m tok/s: 8293627 +523/20000 train_loss: 2.7304 train_time: 0.8m tok/s: 8293580 +524/20000 train_loss: 2.7942 train_time: 0.8m tok/s: 8293512 +525/20000 train_loss: 2.7257 train_time: 0.8m tok/s: 8293347 +526/20000 train_loss: 2.8331 train_time: 0.8m tok/s: 8293320 +527/20000 train_loss: 2.8885 train_time: 0.8m tok/s: 8293455 +528/20000 train_loss: 2.8700 train_time: 0.8m tok/s: 8293359 +529/20000 train_loss: 2.8802 train_time: 0.8m tok/s: 8293240 +530/20000 train_loss: 2.9022 train_time: 0.8m tok/s: 8293146 +531/20000 train_loss: 3.2454 train_time: 0.8m tok/s: 8292928 +532/20000 train_loss: 3.1100 train_time: 0.8m tok/s: 8292646 +533/20000 train_loss: 2.6812 train_time: 0.8m tok/s: 8292619 +534/20000 train_loss: 2.8715 train_time: 0.8m tok/s: 8292572 +535/20000 train_loss: 2.7872 train_time: 0.8m tok/s: 8292511 +536/20000 train_loss: 2.6815 train_time: 0.8m tok/s: 8292289 +537/20000 train_loss: 2.8991 train_time: 0.8m tok/s: 8292048 +538/20000 train_loss: 2.7389 train_time: 0.9m tok/s: 8292026 +539/20000 train_loss: 2.8501 train_time: 0.9m tok/s: 8291895 
+540/20000 train_loss: 2.8654 train_time: 0.9m tok/s: 8291722 +541/20000 train_loss: 2.2791 train_time: 0.9m tok/s: 8291517 +542/20000 train_loss: 2.8491 train_time: 0.9m tok/s: 8291290 +543/20000 train_loss: 2.8141 train_time: 0.9m tok/s: 8291340 +544/20000 train_loss: 2.8392 train_time: 0.9m tok/s: 8291111 +545/20000 train_loss: 2.7842 train_time: 0.9m tok/s: 8291173 +546/20000 train_loss: 2.8149 train_time: 0.9m tok/s: 8291197 +547/20000 train_loss: 2.7711 train_time: 0.9m tok/s: 8291120 +548/20000 train_loss: 2.7337 train_time: 0.9m tok/s: 8291026 +549/20000 train_loss: 2.6841 train_time: 0.9m tok/s: 8290928 +550/20000 train_loss: 2.7737 train_time: 0.9m tok/s: 8290820 +551/20000 train_loss: 2.7324 train_time: 0.9m tok/s: 8290771 +552/20000 train_loss: 2.9136 train_time: 0.9m tok/s: 8290162 +553/20000 train_loss: 2.7469 train_time: 0.9m tok/s: 8289747 +554/20000 train_loss: 2.5866 train_time: 0.9m tok/s: 8289787 +555/20000 train_loss: 2.6626 train_time: 0.9m tok/s: 8289859 +556/20000 train_loss: 2.7597 train_time: 0.9m tok/s: 8289932 +557/20000 train_loss: 2.8748 train_time: 0.9m tok/s: 8289880 +558/20000 train_loss: 2.8290 train_time: 0.9m tok/s: 8289740 +559/20000 train_loss: 2.7504 train_time: 0.9m tok/s: 8289935 +560/20000 train_loss: 2.7944 train_time: 0.9m tok/s: 8289702 +561/20000 train_loss: 2.7937 train_time: 0.9m tok/s: 8289647 +562/20000 train_loss: 2.8463 train_time: 0.9m tok/s: 8289543 +563/20000 train_loss: 2.8375 train_time: 0.9m tok/s: 8289487 +564/20000 train_loss: 2.9501 train_time: 0.9m tok/s: 8289415 +565/20000 train_loss: 2.8507 train_time: 0.9m tok/s: 8289323 +566/20000 train_loss: 2.7473 train_time: 0.9m tok/s: 8289377 +567/20000 train_loss: 2.6978 train_time: 0.9m tok/s: 8289413 +568/20000 train_loss: 2.8259 train_time: 0.9m tok/s: 8289409 +569/20000 train_loss: 2.6771 train_time: 0.9m tok/s: 8289350 +570/20000 train_loss: 2.6827 train_time: 0.9m tok/s: 8289086 +571/20000 train_loss: 2.7961 train_time: 0.9m tok/s: 8288965 +572/20000 
train_loss: 2.6352 train_time: 0.9m tok/s: 8288759 +573/20000 train_loss: 2.6476 train_time: 0.9m tok/s: 8288692 +574/20000 train_loss: 2.7376 train_time: 0.9m tok/s: 8288724 +575/20000 train_loss: 2.5106 train_time: 0.9m tok/s: 8288719 +576/20000 train_loss: 2.7705 train_time: 0.9m tok/s: 8288579 +577/20000 train_loss: 2.8643 train_time: 0.9m tok/s: 8288498 +578/20000 train_loss: 2.8401 train_time: 0.9m tok/s: 8288546 +579/20000 train_loss: 2.7378 train_time: 0.9m tok/s: 8288641 +580/20000 train_loss: 2.8131 train_time: 0.9m tok/s: 8288643 +581/20000 train_loss: 2.7786 train_time: 0.9m tok/s: 8288676 +582/20000 train_loss: 2.7746 train_time: 0.9m tok/s: 8288533 +583/20000 train_loss: 2.7302 train_time: 0.9m tok/s: 8288472 +584/20000 train_loss: 2.7907 train_time: 0.9m tok/s: 8288540 +585/20000 train_loss: 2.8064 train_time: 0.9m tok/s: 8288570 +586/20000 train_loss: 2.6427 train_time: 0.9m tok/s: 8288404 +587/20000 train_loss: 2.7397 train_time: 0.9m tok/s: 8288511 +588/20000 train_loss: 2.7161 train_time: 0.9m tok/s: 8288502 +589/20000 train_loss: 2.7468 train_time: 0.9m tok/s: 8288434 +590/20000 train_loss: 2.7641 train_time: 0.9m tok/s: 8288386 +591/20000 train_loss: 2.7562 train_time: 0.9m tok/s: 8288496 +592/20000 train_loss: 2.7435 train_time: 0.9m tok/s: 8288383 +593/20000 train_loss: 2.7450 train_time: 0.9m tok/s: 8288405 +594/20000 train_loss: 2.6428 train_time: 0.9m tok/s: 8288151 +595/20000 train_loss: 2.8029 train_time: 0.9m tok/s: 8287908 +596/20000 train_loss: 2.6775 train_time: 0.9m tok/s: 8287989 +597/20000 train_loss: 2.7568 train_time: 0.9m tok/s: 8288134 +598/20000 train_loss: 2.7956 train_time: 0.9m tok/s: 8287913 +599/20000 train_loss: 2.7026 train_time: 0.9m tok/s: 8287831 +600/20000 train_loss: 2.7461 train_time: 0.9m tok/s: 8287725 +601/20000 train_loss: 2.7297 train_time: 1.0m tok/s: 8287679 +602/20000 train_loss: 2.7705 train_time: 1.0m tok/s: 8287621 +603/20000 train_loss: 2.7604 train_time: 1.0m tok/s: 8287453 +604/20000 train_loss: 
2.7505 train_time: 1.0m tok/s: 8287359 +605/20000 train_loss: 2.6566 train_time: 1.0m tok/s: 8287226 +606/20000 train_loss: 2.6580 train_time: 1.0m tok/s: 8287212 +607/20000 train_loss: 2.7458 train_time: 1.0m tok/s: 8287370 +608/20000 train_loss: 2.6634 train_time: 1.0m tok/s: 8287410 +609/20000 train_loss: 2.7300 train_time: 1.0m tok/s: 8287390 +610/20000 train_loss: 2.7785 train_time: 1.0m tok/s: 8287552 +611/20000 train_loss: 2.8894 train_time: 1.0m tok/s: 8287337 +612/20000 train_loss: 2.8280 train_time: 1.0m tok/s: 8287157 +613/20000 train_loss: 2.8101 train_time: 1.0m tok/s: 8287072 +614/20000 train_loss: 2.8088 train_time: 1.0m tok/s: 8287080 +615/20000 train_loss: 2.7540 train_time: 1.0m tok/s: 8287050 +616/20000 train_loss: 2.7762 train_time: 1.0m tok/s: 8286948 +617/20000 train_loss: 2.7297 train_time: 1.0m tok/s: 8286903 +618/20000 train_loss: 2.7423 train_time: 1.0m tok/s: 8287007 +619/20000 train_loss: 2.7874 train_time: 1.0m tok/s: 8287015 +620/20000 train_loss: 2.8953 train_time: 1.0m tok/s: 8286918 +621/20000 train_loss: 2.6942 train_time: 1.0m tok/s: 8286933 +622/20000 train_loss: 2.7251 train_time: 1.0m tok/s: 8286942 +623/20000 train_loss: 2.7307 train_time: 1.0m tok/s: 8287013 +624/20000 train_loss: 2.4431 train_time: 1.0m tok/s: 8286821 +625/20000 train_loss: 2.7564 train_time: 1.0m tok/s: 8286731 +626/20000 train_loss: 2.9042 train_time: 1.0m tok/s: 8286740 +627/20000 train_loss: 2.6935 train_time: 1.0m tok/s: 8286538 +628/20000 train_loss: 2.8658 train_time: 1.0m tok/s: 8286457 +629/20000 train_loss: 2.8489 train_time: 1.0m tok/s: 8286465 +630/20000 train_loss: 2.6996 train_time: 1.0m tok/s: 8286327 +631/20000 train_loss: 2.8275 train_time: 1.0m tok/s: 8286410 +632/20000 train_loss: 2.8385 train_time: 1.0m tok/s: 8286495 +633/20000 train_loss: 2.7147 train_time: 1.0m tok/s: 8286447 +634/20000 train_loss: 2.9406 train_time: 1.0m tok/s: 8286470 +635/20000 train_loss: 2.7361 train_time: 1.0m tok/s: 8286344 +636/20000 train_loss: 2.8817 
train_time: 1.0m tok/s: 8286219 +637/20000 train_loss: 2.7594 train_time: 1.0m tok/s: 8286192 +638/20000 train_loss: 2.5746 train_time: 1.0m tok/s: 8286212 +639/20000 train_loss: 2.7368 train_time: 1.0m tok/s: 8286190 +640/20000 train_loss: 2.7232 train_time: 1.0m tok/s: 8286156 +641/20000 train_loss: 2.7928 train_time: 1.0m tok/s: 8285959 +642/20000 train_loss: 2.7989 train_time: 1.0m tok/s: 8285090 +643/20000 train_loss: 2.7634 train_time: 1.0m tok/s: 8286139 +644/20000 train_loss: 2.8092 train_time: 1.0m tok/s: 8286142 +645/20000 train_loss: 2.8807 train_time: 1.0m tok/s: 8286069 +646/20000 train_loss: 2.7755 train_time: 1.0m tok/s: 8286083 +647/20000 train_loss: 2.8327 train_time: 1.0m tok/s: 8285862 +648/20000 train_loss: 2.7369 train_time: 1.0m tok/s: 8285624 +649/20000 train_loss: 2.8698 train_time: 1.0m tok/s: 8285576 +650/20000 train_loss: 2.7596 train_time: 1.0m tok/s: 8285699 +651/20000 train_loss: 2.7538 train_time: 1.0m tok/s: 8285678 +652/20000 train_loss: 2.7045 train_time: 1.0m tok/s: 8285693 +653/20000 train_loss: 2.6591 train_time: 1.0m tok/s: 8285681 +654/20000 train_loss: 2.7185 train_time: 1.0m tok/s: 8285762 +655/20000 train_loss: 2.7110 train_time: 1.0m tok/s: 8285378 +656/20000 train_loss: 2.6481 train_time: 1.0m tok/s: 8285374 +657/20000 train_loss: 2.6565 train_time: 1.0m tok/s: 8285339 +658/20000 train_loss: 2.7054 train_time: 1.0m tok/s: 8285305 +659/20000 train_loss: 2.7497 train_time: 1.0m tok/s: 8285421 +660/20000 train_loss: 2.7487 train_time: 1.0m tok/s: 8285335 +661/20000 train_loss: 2.8076 train_time: 1.0m tok/s: 8285302 +662/20000 train_loss: 2.6942 train_time: 1.0m tok/s: 8285323 +663/20000 train_loss: 2.7849 train_time: 1.0m tok/s: 8285426 +664/20000 train_loss: 2.7881 train_time: 1.1m tok/s: 8285355 +665/20000 train_loss: 2.8380 train_time: 1.1m tok/s: 8285159 +666/20000 train_loss: 2.8286 train_time: 1.1m tok/s: 8285046 +667/20000 train_loss: 2.7582 train_time: 1.1m tok/s: 8284993 +668/20000 train_loss: 2.7433 train_time: 
1.1m tok/s: 8284916 +669/20000 train_loss: 2.6304 train_time: 1.1m tok/s: 8284898 +670/20000 train_loss: 2.6432 train_time: 1.1m tok/s: 8284845 +671/20000 train_loss: 2.6573 train_time: 1.1m tok/s: 8284933 +672/20000 train_loss: 2.7830 train_time: 1.1m tok/s: 8284975 +673/20000 train_loss: 2.6272 train_time: 1.1m tok/s: 8284877 +674/20000 train_loss: 2.8614 train_time: 1.1m tok/s: 8284830 +675/20000 train_loss: 2.6237 train_time: 1.1m tok/s: 8284733 +676/20000 train_loss: 2.8513 train_time: 1.1m tok/s: 8284593 +677/20000 train_loss: 2.6756 train_time: 1.1m tok/s: 8284531 +678/20000 train_loss: 2.7592 train_time: 1.1m tok/s: 8284459 +679/20000 train_loss: 2.7034 train_time: 1.1m tok/s: 8284392 +680/20000 train_loss: 2.9086 train_time: 1.1m tok/s: 8284378 +681/20000 train_loss: 2.7825 train_time: 1.1m tok/s: 8284380 +682/20000 train_loss: 2.8806 train_time: 1.1m tok/s: 8284377 +683/20000 train_loss: 2.8585 train_time: 1.1m tok/s: 8284301 +684/20000 train_loss: 2.7913 train_time: 1.1m tok/s: 8284259 +685/20000 train_loss: 2.6575 train_time: 1.1m tok/s: 8284318 +686/20000 train_loss: 2.8921 train_time: 1.1m tok/s: 8284237 +687/20000 train_loss: 2.7814 train_time: 1.1m tok/s: 8284213 +688/20000 train_loss: 2.7836 train_time: 1.1m tok/s: 8284142 +689/20000 train_loss: 2.8114 train_time: 1.1m tok/s: 8284049 +690/20000 train_loss: 2.7484 train_time: 1.1m tok/s: 8284036 +691/20000 train_loss: 2.8506 train_time: 1.1m tok/s: 8283974 +692/20000 train_loss: 2.9120 train_time: 1.1m tok/s: 8283946 +693/20000 train_loss: 2.8220 train_time: 1.1m tok/s: 8283916 +694/20000 train_loss: 2.8196 train_time: 1.1m tok/s: 8284018 +695/20000 train_loss: 2.8053 train_time: 1.1m tok/s: 8283971 +696/20000 train_loss: 2.8090 train_time: 1.1m tok/s: 8284064 +697/20000 train_loss: 2.6733 train_time: 1.1m tok/s: 8283919 +698/20000 train_loss: 2.8422 train_time: 1.1m tok/s: 8283836 +699/20000 train_loss: 2.6880 train_time: 1.1m tok/s: 8283709 +700/20000 train_loss: 2.6365 train_time: 1.1m tok/s: 
8283590 +701/20000 train_loss: 2.6379 train_time: 1.1m tok/s: 8283583 +702/20000 train_loss: 2.6239 train_time: 1.1m tok/s: 8283498 +703/20000 train_loss: 2.5156 train_time: 1.1m tok/s: 8283359 +704/20000 train_loss: 2.8472 train_time: 1.1m tok/s: 8283276 +705/20000 train_loss: 2.7819 train_time: 1.1m tok/s: 8283216 +706/20000 train_loss: 2.7544 train_time: 1.1m tok/s: 8282924 +707/20000 train_loss: 2.7615 train_time: 1.1m tok/s: 8283285 +708/20000 train_loss: 2.8460 train_time: 1.1m tok/s: 8283190 +709/20000 train_loss: 2.8055 train_time: 1.1m tok/s: 8283246 +710/20000 train_loss: 2.6339 train_time: 1.1m tok/s: 8283192 +711/20000 train_loss: 2.7087 train_time: 1.1m tok/s: 8283183 +712/20000 train_loss: 2.6423 train_time: 1.1m tok/s: 8283087 +713/20000 train_loss: 2.7040 train_time: 1.1m tok/s: 8283022 +714/20000 train_loss: 2.7616 train_time: 1.1m tok/s: 8283064 +715/20000 train_loss: 2.7093 train_time: 1.1m tok/s: 8283086 +716/20000 train_loss: 2.7327 train_time: 1.1m tok/s: 8283047 +717/20000 train_loss: 2.9224 train_time: 1.1m tok/s: 8283101 +718/20000 train_loss: 2.8132 train_time: 1.1m tok/s: 8283149 +719/20000 train_loss: 2.7505 train_time: 1.1m tok/s: 8283220 +720/20000 train_loss: 2.6974 train_time: 1.1m tok/s: 8283017 +721/20000 train_loss: 2.8318 train_time: 1.1m tok/s: 8282896 +722/20000 train_loss: 2.6665 train_time: 1.1m tok/s: 8282867 +723/20000 train_loss: 2.8746 train_time: 1.1m tok/s: 8283004 +724/20000 train_loss: 2.7737 train_time: 1.1m tok/s: 8283046 +725/20000 train_loss: 2.6358 train_time: 1.1m tok/s: 8283079 +726/20000 train_loss: 2.7847 train_time: 1.1m tok/s: 8282985 +727/20000 train_loss: 2.5987 train_time: 1.2m tok/s: 8282855 +728/20000 train_loss: 2.8050 train_time: 1.2m tok/s: 8282800 +729/20000 train_loss: 2.8440 train_time: 1.2m tok/s: 8282765 +730/20000 train_loss: 2.7766 train_time: 1.2m tok/s: 8282808 +731/20000 train_loss: 2.8810 train_time: 1.2m tok/s: 8282814 +732/20000 train_loss: 2.7157 train_time: 1.2m tok/s: 8282744 
+733/20000 train_loss: 2.8873 train_time: 1.2m tok/s: 8282494 +734/20000 train_loss: 2.7395 train_time: 1.2m tok/s: 8282536 +735/20000 train_loss: 2.8008 train_time: 1.2m tok/s: 8282584 +736/20000 train_loss: 2.6739 train_time: 1.2m tok/s: 8282605 +737/20000 train_loss: 2.8066 train_time: 1.2m tok/s: 8282487 +738/20000 train_loss: 2.6735 train_time: 1.2m tok/s: 8282472 +739/20000 train_loss: 2.5981 train_time: 1.2m tok/s: 8282533 +740/20000 train_loss: 2.8325 train_time: 1.2m tok/s: 8282454 +741/20000 train_loss: 2.8244 train_time: 1.2m tok/s: 8282380 +742/20000 train_loss: 2.6899 train_time: 1.2m tok/s: 8282454 +743/20000 train_loss: 2.8524 train_time: 1.2m tok/s: 8282506 +744/20000 train_loss: 2.7300 train_time: 1.2m tok/s: 8282441 +745/20000 train_loss: 2.7438 train_time: 1.2m tok/s: 8282434 +746/20000 train_loss: 2.8228 train_time: 1.2m tok/s: 8282543 +747/20000 train_loss: 2.7048 train_time: 1.2m tok/s: 8282545 +748/20000 train_loss: 2.7495 train_time: 1.2m tok/s: 8282570 +749/20000 train_loss: 2.8139 train_time: 1.2m tok/s: 8282545 +750/20000 train_loss: 2.8237 train_time: 1.2m tok/s: 8282414 +751/20000 train_loss: 2.6915 train_time: 1.2m tok/s: 8282387 +752/20000 train_loss: 2.7781 train_time: 1.2m tok/s: 8282230 +753/20000 train_loss: 2.4282 train_time: 1.2m tok/s: 8282021 +754/20000 train_loss: 2.6748 train_time: 1.2m tok/s: 8281864 +755/20000 train_loss: 2.8708 train_time: 1.2m tok/s: 8281920 +756/20000 train_loss: 3.1509 train_time: 1.2m tok/s: 8282052 +757/20000 train_loss: 2.7933 train_time: 1.2m tok/s: 8282109 +758/20000 train_loss: 2.7250 train_time: 1.2m tok/s: 8282193 +759/20000 train_loss: 2.6835 train_time: 1.2m tok/s: 8282166 +760/20000 train_loss: 2.8593 train_time: 1.2m tok/s: 8282060 +761/20000 train_loss: 2.7385 train_time: 1.2m tok/s: 8282065 +762/20000 train_loss: 2.8335 train_time: 1.2m tok/s: 8282002 +763/20000 train_loss: 2.6622 train_time: 1.2m tok/s: 8282058 +764/20000 train_loss: 2.7141 train_time: 1.2m tok/s: 8281904 +765/20000 
train_loss: 2.6847 train_time: 1.2m tok/s: 8281945 +766/20000 train_loss: 2.6726 train_time: 1.2m tok/s: 8282013 +767/20000 train_loss: 2.6998 train_time: 1.2m tok/s: 8282106 +768/20000 train_loss: 2.7472 train_time: 1.2m tok/s: 8282131 +769/20000 train_loss: 2.7698 train_time: 1.2m tok/s: 8282137 +770/20000 train_loss: 2.7773 train_time: 1.2m tok/s: 8282123 +771/20000 train_loss: 2.7909 train_time: 1.2m tok/s: 8282101 +772/20000 train_loss: 2.7711 train_time: 1.2m tok/s: 8282062 +773/20000 train_loss: 2.7151 train_time: 1.2m tok/s: 8282027 +774/20000 train_loss: 2.8472 train_time: 1.2m tok/s: 8281876 +775/20000 train_loss: 2.8156 train_time: 1.2m tok/s: 8281836 +776/20000 train_loss: 2.9100 train_time: 1.2m tok/s: 8281734 +777/20000 train_loss: 2.8682 train_time: 1.2m tok/s: 8281458 +778/20000 train_loss: 2.7210 train_time: 1.2m tok/s: 8281415 +779/20000 train_loss: 2.4414 train_time: 1.2m tok/s: 8281269 +780/20000 train_loss: 2.7851 train_time: 1.2m tok/s: 8281183 +781/20000 train_loss: 2.7602 train_time: 1.2m tok/s: 8281251 +782/20000 train_loss: 3.0327 train_time: 1.2m tok/s: 8281166 +783/20000 train_loss: 2.5248 train_time: 1.2m tok/s: 8280980 +784/20000 train_loss: 2.9042 train_time: 1.2m tok/s: 8280909 +785/20000 train_loss: 2.8622 train_time: 1.2m tok/s: 8280968 +786/20000 train_loss: 2.7301 train_time: 1.2m tok/s: 8281117 +787/20000 train_loss: 2.6397 train_time: 1.2m tok/s: 8280989 +788/20000 train_loss: 2.6796 train_time: 1.2m tok/s: 8280901 +789/20000 train_loss: 2.7970 train_time: 1.2m tok/s: 8280824 +790/20000 train_loss: 2.6434 train_time: 1.3m tok/s: 8280589 +791/20000 train_loss: 2.5954 train_time: 1.3m tok/s: 8280590 +792/20000 train_loss: 2.7333 train_time: 1.3m tok/s: 8280602 +793/20000 train_loss: 2.7111 train_time: 1.3m tok/s: 8280512 +794/20000 train_loss: 2.7265 train_time: 1.3m tok/s: 8280438 +795/20000 train_loss: 2.8569 train_time: 1.3m tok/s: 8280473 +796/20000 train_loss: 2.7255 train_time: 1.3m tok/s: 8280451 +797/20000 train_loss: 
2.7417 train_time: 1.3m tok/s: 8280534 +798/20000 train_loss: 2.7559 train_time: 1.3m tok/s: 8280543 +799/20000 train_loss: 2.7867 train_time: 1.3m tok/s: 8280533 +800/20000 train_loss: 2.7226 train_time: 1.3m tok/s: 8280406 +801/20000 train_loss: 2.7445 train_time: 1.3m tok/s: 8280279 +802/20000 train_loss: 2.8201 train_time: 1.3m tok/s: 8280255 +803/20000 train_loss: 2.6931 train_time: 1.3m tok/s: 8279970 +804/20000 train_loss: 2.6760 train_time: 1.3m tok/s: 8280214 +805/20000 train_loss: 2.6856 train_time: 1.3m tok/s: 8280143 +806/20000 train_loss: 2.8071 train_time: 1.3m tok/s: 8280180 +807/20000 train_loss: 2.7938 train_time: 1.3m tok/s: 8280313 +808/20000 train_loss: 2.8134 train_time: 1.3m tok/s: 8280275 +809/20000 train_loss: 2.6368 train_time: 1.3m tok/s: 8280225 +810/20000 train_loss: 2.8518 train_time: 1.3m tok/s: 8280228 +811/20000 train_loss: 2.8518 train_time: 1.3m tok/s: 8280240 +812/20000 train_loss: 2.6878 train_time: 1.3m tok/s: 8280219 +813/20000 train_loss: 2.7520 train_time: 1.3m tok/s: 8280187 +814/20000 train_loss: 2.7932 train_time: 1.3m tok/s: 8280255 +815/20000 train_loss: 2.8949 train_time: 1.3m tok/s: 8280229 +816/20000 train_loss: 2.7198 train_time: 1.3m tok/s: 8280163 +817/20000 train_loss: 2.7094 train_time: 1.3m tok/s: 8280148 +818/20000 train_loss: 2.7774 train_time: 1.3m tok/s: 8280291 +819/20000 train_loss: 2.7923 train_time: 1.3m tok/s: 8280214 +820/20000 train_loss: 3.0427 train_time: 1.3m tok/s: 8280127 +821/20000 train_loss: 2.7903 train_time: 1.3m tok/s: 8280045 +822/20000 train_loss: 2.6053 train_time: 1.3m tok/s: 8280026 +823/20000 train_loss: 2.6461 train_time: 1.3m tok/s: 8279919 +824/20000 train_loss: 2.8166 train_time: 1.3m tok/s: 8279936 +825/20000 train_loss: 2.8830 train_time: 1.3m tok/s: 8279801 +826/20000 train_loss: 2.8666 train_time: 1.3m tok/s: 8279704 +827/20000 train_loss: 2.6448 train_time: 1.3m tok/s: 8279710 +828/20000 train_loss: 2.7303 train_time: 1.3m tok/s: 8279726 +829/20000 train_loss: 3.3557 
train_time: 1.3m tok/s: 8279633 +830/20000 train_loss: 2.7539 train_time: 1.3m tok/s: 8279449 +831/20000 train_loss: 2.7460 train_time: 1.3m tok/s: 8279412 +832/20000 train_loss: 2.7747 train_time: 1.3m tok/s: 8279376 +833/20000 train_loss: 2.8565 train_time: 1.3m tok/s: 8279414 +834/20000 train_loss: 2.7000 train_time: 1.3m tok/s: 8279394 +835/20000 train_loss: 2.7871 train_time: 1.3m tok/s: 8279465 +836/20000 train_loss: 2.6209 train_time: 1.3m tok/s: 8279436 +837/20000 train_loss: 2.5120 train_time: 1.3m tok/s: 8279327 +838/20000 train_loss: 2.6300 train_time: 1.3m tok/s: 8279069 +839/20000 train_loss: 2.7226 train_time: 1.3m tok/s: 8278887 +840/20000 train_loss: 3.1312 train_time: 1.3m tok/s: 8278887 +841/20000 train_loss: 2.7213 train_time: 1.3m tok/s: 8278909 +842/20000 train_loss: 2.7255 train_time: 1.3m tok/s: 8278931 +843/20000 train_loss: 2.6557 train_time: 1.3m tok/s: 8278883 +844/20000 train_loss: 2.7362 train_time: 1.3m tok/s: 8278807 +845/20000 train_loss: 2.6849 train_time: 1.3m tok/s: 8278774 +846/20000 train_loss: 2.6722 train_time: 1.3m tok/s: 8278766 +847/20000 train_loss: 2.7264 train_time: 1.3m tok/s: 8278777 +848/20000 train_loss: 2.6435 train_time: 1.3m tok/s: 8278781 +849/20000 train_loss: 2.7470 train_time: 1.3m tok/s: 8278781 +850/20000 train_loss: 2.5878 train_time: 1.3m tok/s: 8278879 +851/20000 train_loss: 2.7583 train_time: 1.3m tok/s: 8278878 +852/20000 train_loss: 2.5666 train_time: 1.3m tok/s: 8278942 +853/20000 train_loss: 2.7178 train_time: 1.4m tok/s: 8279003 +854/20000 train_loss: 2.7117 train_time: 1.4m tok/s: 8278874 +855/20000 train_loss: 2.7557 train_time: 1.4m tok/s: 8278784 +856/20000 train_loss: 2.6952 train_time: 1.4m tok/s: 8278855 +857/20000 train_loss: 2.8389 train_time: 1.4m tok/s: 8278862 +858/20000 train_loss: 2.8322 train_time: 1.4m tok/s: 8278895 +859/20000 train_loss: 2.7302 train_time: 1.4m tok/s: 8278756 +860/20000 train_loss: 2.6699 train_time: 1.4m tok/s: 8278608 +861/20000 train_loss: 2.7118 train_time: 
1.4m tok/s: 8278598 +862/20000 train_loss: 2.6621 train_time: 1.4m tok/s: 8278585 +863/20000 train_loss: 2.9054 train_time: 1.4m tok/s: 8278546 +864/20000 train_loss: 2.7546 train_time: 1.4m tok/s: 8278472 +865/20000 train_loss: 2.7746 train_time: 1.4m tok/s: 8278444 +866/20000 train_loss: 2.6362 train_time: 1.4m tok/s: 8277987 +867/20000 train_loss: 2.6743 train_time: 1.4m tok/s: 8278280 +868/20000 train_loss: 2.6875 train_time: 1.4m tok/s: 8278175 +869/20000 train_loss: 2.7024 train_time: 1.4m tok/s: 8278218 +870/20000 train_loss: 2.6740 train_time: 1.4m tok/s: 8278281 +871/20000 train_loss: 2.6562 train_time: 1.4m tok/s: 8278329 +872/20000 train_loss: 2.7520 train_time: 1.4m tok/s: 8278368 +873/20000 train_loss: 2.6658 train_time: 1.4m tok/s: 8278390 +874/20000 train_loss: 2.8039 train_time: 1.4m tok/s: 8278329 +875/20000 train_loss: 2.7573 train_time: 1.4m tok/s: 8278397 +876/20000 train_loss: 2.8118 train_time: 1.4m tok/s: 8278384 +877/20000 train_loss: 2.7026 train_time: 1.4m tok/s: 8278399 +878/20000 train_loss: 2.6915 train_time: 1.4m tok/s: 8278441 +879/20000 train_loss: 2.7385 train_time: 1.4m tok/s: 8278399 +880/20000 train_loss: 2.7483 train_time: 1.4m tok/s: 8278370 +881/20000 train_loss: 2.6708 train_time: 1.4m tok/s: 8278448 +882/20000 train_loss: 2.6945 train_time: 1.4m tok/s: 8278426 +883/20000 train_loss: 2.7878 train_time: 1.4m tok/s: 8278371 +884/20000 train_loss: 2.5296 train_time: 1.4m tok/s: 8278386 +885/20000 train_loss: 2.6358 train_time: 1.4m tok/s: 8278369 +886/20000 train_loss: 2.7096 train_time: 1.4m tok/s: 8278385 +887/20000 train_loss: 2.6662 train_time: 1.4m tok/s: 8278345 +888/20000 train_loss: 2.6338 train_time: 1.4m tok/s: 8278265 +889/20000 train_loss: 2.8034 train_time: 1.4m tok/s: 8278309 +890/20000 train_loss: 2.5957 train_time: 1.4m tok/s: 8278210 +891/20000 train_loss: 2.6918 train_time: 1.4m tok/s: 8278260 +892/20000 train_loss: 2.7109 train_time: 1.4m tok/s: 8278236 +893/20000 train_loss: 2.6495 train_time: 1.4m tok/s: 
8278259 +894/20000 train_loss: 2.7130 train_time: 1.4m tok/s: 8278317 +895/20000 train_loss: 2.7395 train_time: 1.4m tok/s: 8278447 +896/20000 train_loss: 2.7987 train_time: 1.4m tok/s: 8278362 +897/20000 train_loss: 2.7210 train_time: 1.4m tok/s: 8278337 +898/20000 train_loss: 2.6880 train_time: 1.4m tok/s: 8278307 +899/20000 train_loss: 2.6727 train_time: 1.4m tok/s: 8278319 +900/20000 train_loss: 2.7235 train_time: 1.4m tok/s: 8278213 +901/20000 train_loss: 2.6399 train_time: 1.4m tok/s: 8278197 +902/20000 train_loss: 2.6350 train_time: 1.4m tok/s: 8278138 +903/20000 train_loss: 2.5856 train_time: 1.4m tok/s: 8278181 +904/20000 train_loss: 2.5698 train_time: 1.4m tok/s: 8278243 +905/20000 train_loss: 2.7061 train_time: 1.4m tok/s: 8278346 +906/20000 train_loss: 2.7743 train_time: 1.4m tok/s: 8278383 +907/20000 train_loss: 2.7517 train_time: 1.4m tok/s: 8278375 +908/20000 train_loss: 2.8340 train_time: 1.4m tok/s: 8278342 +909/20000 train_loss: 2.7768 train_time: 1.4m tok/s: 8278275 +910/20000 train_loss: 2.8344 train_time: 1.4m tok/s: 8278271 +911/20000 train_loss: 2.7263 train_time: 1.4m tok/s: 8278347 +912/20000 train_loss: 2.5440 train_time: 1.4m tok/s: 8278077 +913/20000 train_loss: 2.7320 train_time: 1.4m tok/s: 8277703 +914/20000 train_loss: 2.8113 train_time: 1.4m tok/s: 8277668 +915/20000 train_loss: 2.7418 train_time: 1.4m tok/s: 8277696 +916/20000 train_loss: 2.7466 train_time: 1.5m tok/s: 8277800 +917/20000 train_loss: 2.6826 train_time: 1.5m tok/s: 8277706 +918/20000 train_loss: 2.5626 train_time: 1.5m tok/s: 8277681 +919/20000 train_loss: 2.6253 train_time: 1.5m tok/s: 8277614 +920/20000 train_loss: 2.6482 train_time: 1.5m tok/s: 8277659 +921/20000 train_loss: 2.5089 train_time: 1.5m tok/s: 8277637 +922/20000 train_loss: 2.7212 train_time: 1.5m tok/s: 8277559 +923/20000 train_loss: 2.6209 train_time: 1.5m tok/s: 8277493 +924/20000 train_loss: 2.6208 train_time: 1.5m tok/s: 8277494 +925/20000 train_loss: 2.9403 train_time: 1.5m tok/s: 8277492 
+926/20000 train_loss: 2.5550 train_time: 1.5m tok/s: 8277427 +927/20000 train_loss: 2.7505 train_time: 1.5m tok/s: 8277402 +928/20000 train_loss: 2.8015 train_time: 1.5m tok/s: 8277427 +929/20000 train_loss: 2.7204 train_time: 1.5m tok/s: 8277468 +930/20000 train_loss: 2.8829 train_time: 1.5m tok/s: 8277208 +931/20000 train_loss: 2.7648 train_time: 1.5m tok/s: 8277152 +932/20000 train_loss: 2.6967 train_time: 1.5m tok/s: 8277254 +933/20000 train_loss: 2.7127 train_time: 1.5m tok/s: 8277392 +934/20000 train_loss: 2.7150 train_time: 1.5m tok/s: 8277354 +935/20000 train_loss: 2.7995 train_time: 1.5m tok/s: 8277369 +936/20000 train_loss: 2.6136 train_time: 1.5m tok/s: 8277375 +937/20000 train_loss: 2.7287 train_time: 1.5m tok/s: 8277461 +938/20000 train_loss: 2.5090 train_time: 1.5m tok/s: 8277260 +939/20000 train_loss: 2.5091 train_time: 1.5m tok/s: 8276981 +940/20000 train_loss: 2.7681 train_time: 1.5m tok/s: 8276810 +941/20000 train_loss: 2.8697 train_time: 1.5m tok/s: 8276718 +942/20000 train_loss: 2.7016 train_time: 1.5m tok/s: 8276732 +943/20000 train_loss: 2.7180 train_time: 1.5m tok/s: 8276642 +944/20000 train_loss: 2.8055 train_time: 1.5m tok/s: 8276719 +945/20000 train_loss: 2.7225 train_time: 1.5m tok/s: 8276769 +946/20000 train_loss: 2.6247 train_time: 1.5m tok/s: 8276799 +947/20000 train_loss: 2.7936 train_time: 1.5m tok/s: 8276796 +948/20000 train_loss: 2.7188 train_time: 1.5m tok/s: 8276780 +949/20000 train_loss: 2.7195 train_time: 1.5m tok/s: 8276772 +950/20000 train_loss: 2.7271 train_time: 1.5m tok/s: 8276719 +951/20000 train_loss: 2.7900 train_time: 1.5m tok/s: 8276684 +952/20000 train_loss: 2.5538 train_time: 1.5m tok/s: 8276591 +953/20000 train_loss: 2.6850 train_time: 1.5m tok/s: 8276641 +954/20000 train_loss: 2.6935 train_time: 1.5m tok/s: 8276520 +955/20000 train_loss: 2.7966 train_time: 1.5m tok/s: 8276395 +956/20000 train_loss: 2.8071 train_time: 1.5m tok/s: 8276315 +957/20000 train_loss: 2.6789 train_time: 1.5m tok/s: 8276280 +958/20000 
train_loss: 2.7768 train_time: 1.5m tok/s: 8276239 +959/20000 train_loss: 2.7778 train_time: 1.5m tok/s: 8276284 +960/20000 train_loss: 2.9908 train_time: 1.5m tok/s: 8276208 +961/20000 train_loss: 2.7948 train_time: 1.5m tok/s: 8276146 +962/20000 train_loss: 2.7084 train_time: 1.5m tok/s: 8276175 +963/20000 train_loss: 2.7573 train_time: 1.5m tok/s: 8276177 +964/20000 train_loss: 2.6630 train_time: 1.5m tok/s: 8276211 +965/20000 train_loss: 2.7791 train_time: 1.5m tok/s: 8276161 +966/20000 train_loss: 2.7795 train_time: 1.5m tok/s: 8276196 +967/20000 train_loss: 2.5718 train_time: 1.5m tok/s: 8276238 +968/20000 train_loss: 2.6798 train_time: 1.5m tok/s: 8276265 +969/20000 train_loss: 2.7049 train_time: 1.5m tok/s: 8276188 +970/20000 train_loss: 2.6977 train_time: 1.5m tok/s: 8275999 +971/20000 train_loss: 2.5490 train_time: 1.5m tok/s: 8275799 +972/20000 train_loss: 2.4987 train_time: 1.5m tok/s: 8275694 +973/20000 train_loss: 2.5692 train_time: 1.5m tok/s: 8275512 +974/20000 train_loss: 2.6943 train_time: 1.5m tok/s: 8275162 +975/20000 train_loss: 2.7501 train_time: 1.5m tok/s: 8275437 +976/20000 train_loss: 2.6884 train_time: 1.5m tok/s: 8275419 +977/20000 train_loss: 2.7327 train_time: 1.5m tok/s: 8275451 +978/20000 train_loss: 2.7645 train_time: 1.5m tok/s: 8275517 +979/20000 train_loss: 2.6090 train_time: 1.6m tok/s: 8275549 +980/20000 train_loss: 2.6508 train_time: 1.6m tok/s: 8275369 +981/20000 train_loss: 2.6414 train_time: 1.6m tok/s: 8275376 +982/20000 train_loss: 2.6275 train_time: 1.6m tok/s: 8275415 +983/20000 train_loss: 2.7503 train_time: 1.6m tok/s: 8275224 +984/20000 train_loss: 2.6473 train_time: 1.6m tok/s: 8275279 +985/20000 train_loss: 2.7651 train_time: 1.6m tok/s: 8275241 +986/20000 train_loss: 2.6958 train_time: 1.6m tok/s: 8275302 +987/20000 train_loss: 2.6857 train_time: 1.6m tok/s: 8275177 +988/20000 train_loss: 2.5478 train_time: 1.6m tok/s: 8275454 +989/20000 train_loss: 2.6785 train_time: 1.6m tok/s: 8275435 +990/20000 train_loss: 
2.6717 train_time: 1.6m tok/s: 8275490 +991/20000 train_loss: 2.8064 train_time: 1.6m tok/s: 8275462 +992/20000 train_loss: 2.6370 train_time: 1.6m tok/s: 8275437 +993/20000 train_loss: 2.5767 train_time: 1.6m tok/s: 8275371 +994/20000 train_loss: 2.7190 train_time: 1.6m tok/s: 8275398 +995/20000 train_loss: 2.8847 train_time: 1.6m tok/s: 8275422 +996/20000 train_loss: 2.8264 train_time: 1.6m tok/s: 8275482 +997/20000 train_loss: 2.7793 train_time: 1.6m tok/s: 8275527 +998/20000 train_loss: 2.6645 train_time: 1.6m tok/s: 8275529 +999/20000 train_loss: 2.7513 train_time: 1.6m tok/s: 8275551 +1000/20000 train_loss: 2.7859 train_time: 1.6m tok/s: 8275577 +1001/20000 train_loss: 2.6760 train_time: 1.6m tok/s: 8275617 +1002/20000 train_loss: 2.7320 train_time: 1.6m tok/s: 8275484 +1003/20000 train_loss: 2.6519 train_time: 1.6m tok/s: 8275374 +1004/20000 train_loss: 2.6539 train_time: 1.6m tok/s: 8275380 +1005/20000 train_loss: 2.6347 train_time: 1.6m tok/s: 8275366 +1006/20000 train_loss: 2.7181 train_time: 1.6m tok/s: 8275342 +1007/20000 train_loss: 2.5866 train_time: 1.6m tok/s: 8275392 +1008/20000 train_loss: 2.5355 train_time: 1.6m tok/s: 8275344 +1009/20000 train_loss: 2.6867 train_time: 1.6m tok/s: 8275362 +1010/20000 train_loss: 2.8218 train_time: 1.6m tok/s: 8275437 +1011/20000 train_loss: 2.8077 train_time: 1.6m tok/s: 8275515 +1012/20000 train_loss: 2.4258 train_time: 1.6m tok/s: 8275363 +1013/20000 train_loss: 2.6162 train_time: 1.6m tok/s: 8275155 +1014/20000 train_loss: 2.7268 train_time: 1.6m tok/s: 8275187 +1015/20000 train_loss: 2.7595 train_time: 1.6m tok/s: 8275167 +1016/20000 train_loss: 2.5694 train_time: 1.6m tok/s: 8275073 +1017/20000 train_loss: 2.7184 train_time: 1.6m tok/s: 8275085 +1018/20000 train_loss: 2.7856 train_time: 1.6m tok/s: 8274868 +1019/20000 train_loss: 2.6689 train_time: 1.6m tok/s: 8274863 +1020/20000 train_loss: 2.6992 train_time: 1.6m tok/s: 8274774 +1021/20000 train_loss: 2.6704 train_time: 1.6m tok/s: 8274670 +1022/20000 
train_loss: 2.7639 train_time: 1.6m tok/s: 8274669 +1023/20000 train_loss: 2.6674 train_time: 1.6m tok/s: 8274622 +1024/20000 train_loss: 2.6902 train_time: 1.6m tok/s: 8274650 +1025/20000 train_loss: 2.7610 train_time: 1.6m tok/s: 8274605 +1026/20000 train_loss: 3.3264 train_time: 1.6m tok/s: 8274415 +1027/20000 train_loss: 2.5510 train_time: 1.6m tok/s: 8274327 +1028/20000 train_loss: 2.6138 train_time: 1.6m tok/s: 8274396 +1029/20000 train_loss: 2.6990 train_time: 1.6m tok/s: 8274377 +1030/20000 train_loss: 2.5458 train_time: 1.6m tok/s: 8274296 +1031/20000 train_loss: 2.6162 train_time: 1.6m tok/s: 8274279 +1032/20000 train_loss: 2.7352 train_time: 1.6m tok/s: 8274329 +1033/20000 train_loss: 2.8480 train_time: 1.6m tok/s: 8274283 +1034/20000 train_loss: 2.6369 train_time: 1.6m tok/s: 8274212 +1035/20000 train_loss: 2.7319 train_time: 1.6m tok/s: 8274244 +1036/20000 train_loss: 2.6824 train_time: 1.6m tok/s: 8274214 +1037/20000 train_loss: 2.7010 train_time: 1.6m tok/s: 8274238 +1038/20000 train_loss: 2.4963 train_time: 1.6m tok/s: 8274188 +1039/20000 train_loss: 2.7003 train_time: 1.6m tok/s: 8274188 +1040/20000 train_loss: 2.6517 train_time: 1.6m tok/s: 8274248 +1041/20000 train_loss: 2.6676 train_time: 1.6m tok/s: 8274241 +1042/20000 train_loss: 2.6718 train_time: 1.7m tok/s: 8274176 +1043/20000 train_loss: 2.7239 train_time: 1.7m tok/s: 8274164 +1044/20000 train_loss: 2.6623 train_time: 1.7m tok/s: 8274144 +1045/20000 train_loss: 2.7958 train_time: 1.7m tok/s: 8274161 +1046/20000 train_loss: 2.7391 train_time: 1.7m tok/s: 8274023 +1047/20000 train_loss: 2.7080 train_time: 1.7m tok/s: 8274066 +1048/20000 train_loss: 2.5965 train_time: 1.7m tok/s: 8273981 +1049/20000 train_loss: 2.7265 train_time: 1.7m tok/s: 8274009 +1050/20000 train_loss: 2.8243 train_time: 1.7m tok/s: 8274038 +1051/20000 train_loss: 2.7338 train_time: 1.7m tok/s: 8274112 +1052/20000 train_loss: 2.6175 train_time: 1.7m tok/s: 8274128 +1053/20000 train_loss: 2.6478 train_time: 1.7m tok/s: 
8274100 +1054/20000 train_loss: 2.6232 train_time: 1.7m tok/s: 8274095 +1055/20000 train_loss: 2.5804 train_time: 1.7m tok/s: 8274018 +1056/20000 train_loss: 2.6546 train_time: 1.7m tok/s: 8273892 +1057/20000 train_loss: 2.7180 train_time: 1.7m tok/s: 8273867 +1058/20000 train_loss: 2.6057 train_time: 1.7m tok/s: 8273856 +1059/20000 train_loss: 2.7491 train_time: 1.7m tok/s: 8273793 +1060/20000 train_loss: 2.6774 train_time: 1.7m tok/s: 8273787 +1061/20000 train_loss: 2.7499 train_time: 1.7m tok/s: 8273844 +1062/20000 train_loss: 2.7775 train_time: 1.7m tok/s: 8273762 +1063/20000 train_loss: 2.7772 train_time: 1.7m tok/s: 8273745 +1064/20000 train_loss: 2.6564 train_time: 1.7m tok/s: 8273816 +1065/20000 train_loss: 2.4655 train_time: 1.7m tok/s: 8273797 +1066/20000 train_loss: 2.7941 train_time: 1.7m tok/s: 8273741 +1067/20000 train_loss: 2.8004 train_time: 1.7m tok/s: 8273617 +1068/20000 train_loss: 2.7007 train_time: 1.7m tok/s: 8273635 +1069/20000 train_loss: 2.6452 train_time: 1.7m tok/s: 8273581 +1070/20000 train_loss: 2.5789 train_time: 1.7m tok/s: 8273606 +1071/20000 train_loss: 2.7440 train_time: 1.7m tok/s: 8273606 +1072/20000 train_loss: 2.6715 train_time: 1.7m tok/s: 8273623 +1073/20000 train_loss: 2.6728 train_time: 1.7m tok/s: 8273678 +1074/20000 train_loss: 2.6920 train_time: 1.7m tok/s: 8273712 +1075/20000 train_loss: 2.7199 train_time: 1.7m tok/s: 8273708 +1076/20000 train_loss: 2.7221 train_time: 1.7m tok/s: 8273744 +1077/20000 train_loss: 2.6723 train_time: 1.7m tok/s: 8273775 +1078/20000 train_loss: 2.8156 train_time: 1.7m tok/s: 8273847 +1079/20000 train_loss: 2.6757 train_time: 1.7m tok/s: 8273765 +1080/20000 train_loss: 2.6479 train_time: 1.7m tok/s: 8273792 +1081/20000 train_loss: 2.6744 train_time: 1.7m tok/s: 8273913 +1082/20000 train_loss: 2.6693 train_time: 1.7m tok/s: 8273982 +1083/20000 train_loss: 2.6051 train_time: 1.7m tok/s: 8273888 +1084/20000 train_loss: 2.6863 train_time: 1.7m tok/s: 8273891 +1085/20000 train_loss: 2.6488 
train_time: 1.7m tok/s: 8273967 +1086/20000 train_loss: 2.6849 train_time: 1.7m tok/s: 8274038 +1087/20000 train_loss: 2.6682 train_time: 1.7m tok/s: 8273938 +1088/20000 train_loss: 2.8501 train_time: 1.7m tok/s: 8273862 +1089/20000 train_loss: 2.8155 train_time: 1.7m tok/s: 8273860 +1090/20000 train_loss: 2.6345 train_time: 1.7m tok/s: 8273802 +1091/20000 train_loss: 2.6712 train_time: 1.7m tok/s: 8273819 +1092/20000 train_loss: 2.7027 train_time: 1.7m tok/s: 8273778 +1093/20000 train_loss: 2.7565 train_time: 1.7m tok/s: 8273819 +1094/20000 train_loss: 2.8084 train_time: 1.7m tok/s: 8273859 +1095/20000 train_loss: 2.6254 train_time: 1.7m tok/s: 8273752 +1096/20000 train_loss: 2.5300 train_time: 1.7m tok/s: 8273675 +1097/20000 train_loss: 2.6526 train_time: 1.7m tok/s: 8273596 +1098/20000 train_loss: 2.6667 train_time: 1.7m tok/s: 8273579 +1099/20000 train_loss: 2.5152 train_time: 1.7m tok/s: 8273584 +1100/20000 train_loss: 2.5805 train_time: 1.7m tok/s: 8273473 +1101/20000 train_loss: 2.6562 train_time: 1.7m tok/s: 8273487 +1102/20000 train_loss: 2.6754 train_time: 1.7m tok/s: 8273538 +1103/20000 train_loss: 2.7359 train_time: 1.7m tok/s: 8273557 +1104/20000 train_loss: 2.7042 train_time: 1.7m tok/s: 8273505 +1105/20000 train_loss: 2.7244 train_time: 1.8m tok/s: 8273599 +1106/20000 train_loss: 2.7358 train_time: 1.8m tok/s: 8273689 +1107/20000 train_loss: 2.7420 train_time: 1.8m tok/s: 8273657 +1108/20000 train_loss: 2.6849 train_time: 1.8m tok/s: 8273632 +1109/20000 train_loss: 2.6721 train_time: 1.8m tok/s: 8273623 +1110/20000 train_loss: 2.6679 train_time: 1.8m tok/s: 8273655 +1111/20000 train_loss: 2.6423 train_time: 1.8m tok/s: 8273639 +1112/20000 train_loss: 2.6290 train_time: 1.8m tok/s: 8273519 +1113/20000 train_loss: 2.6517 train_time: 1.8m tok/s: 8273469 +1114/20000 train_loss: 2.8216 train_time: 1.8m tok/s: 8273427 +1115/20000 train_loss: 2.6727 train_time: 1.8m tok/s: 8273526 +1116/20000 train_loss: 2.8644 train_time: 1.8m tok/s: 8273506 +1117/20000 
train_loss: 2.6946 train_time: 1.8m tok/s: 8273507 +1118/20000 train_loss: 2.7287 train_time: 1.8m tok/s: 8273471 +1119/20000 train_loss: 2.7426 train_time: 1.8m tok/s: 8273426 +1120/20000 train_loss: 2.6366 train_time: 1.8m tok/s: 8273420 +1121/20000 train_loss: 2.6375 train_time: 1.8m tok/s: 8273399 +1122/20000 train_loss: 2.7375 train_time: 1.8m tok/s: 8273421 +1123/20000 train_loss: 2.5354 train_time: 1.8m tok/s: 8273473 +1124/20000 train_loss: 2.6863 train_time: 1.8m tok/s: 8273419 +1125/20000 train_loss: 2.5748 train_time: 1.8m tok/s: 8273388 +1126/20000 train_loss: 2.6465 train_time: 1.8m tok/s: 8273417 +1127/20000 train_loss: 2.8770 train_time: 1.8m tok/s: 8273466 +1128/20000 train_loss: 2.8565 train_time: 1.8m tok/s: 8273460 +1129/20000 train_loss: 2.5984 train_time: 1.8m tok/s: 8273391 +1130/20000 train_loss: 2.7691 train_time: 1.8m tok/s: 8273470 +1131/20000 train_loss: 2.7512 train_time: 1.8m tok/s: 8273485 +1132/20000 train_loss: 2.6229 train_time: 1.8m tok/s: 8273484 +1133/20000 train_loss: 2.5763 train_time: 1.8m tok/s: 8273432 +1134/20000 train_loss: 2.7360 train_time: 1.8m tok/s: 8273410 +1135/20000 train_loss: 2.7315 train_time: 1.8m tok/s: 8273401 +1136/20000 train_loss: 2.5669 train_time: 1.8m tok/s: 8273483 +1137/20000 train_loss: 2.6079 train_time: 1.8m tok/s: 8273429 +1138/20000 train_loss: 2.5648 train_time: 1.8m tok/s: 8273478 +1139/20000 train_loss: 2.5559 train_time: 1.8m tok/s: 8273498 +1140/20000 train_loss: 2.6629 train_time: 1.8m tok/s: 8273479 +1141/20000 train_loss: 2.7014 train_time: 1.8m tok/s: 8273445 +1142/20000 train_loss: 2.6970 train_time: 1.8m tok/s: 8273483 +1143/20000 train_loss: 2.7284 train_time: 1.8m tok/s: 8273490 +1144/20000 train_loss: 2.7826 train_time: 1.8m tok/s: 8273505 +1145/20000 train_loss: 2.7310 train_time: 1.8m tok/s: 8273498 +1146/20000 train_loss: 2.5977 train_time: 1.8m tok/s: 8273549 +1147/20000 train_loss: 2.7569 train_time: 1.8m tok/s: 8273622 +1148/20000 train_loss: 2.5636 train_time: 1.8m tok/s: 
8273547 +1149/20000 train_loss: 2.7197 train_time: 1.8m tok/s: 8273449 +1150/20000 train_loss: 2.5914 train_time: 1.8m tok/s: 8273446 +1151/20000 train_loss: 2.5997 train_time: 1.8m tok/s: 8273463 +1152/20000 train_loss: 2.4697 train_time: 1.8m tok/s: 8273457 +1153/20000 train_loss: 2.6009 train_time: 1.8m tok/s: 8273361 +1154/20000 train_loss: 2.7358 train_time: 1.8m tok/s: 8273387 +1155/20000 train_loss: 2.5914 train_time: 1.8m tok/s: 8273448 +1156/20000 train_loss: 2.6985 train_time: 1.8m tok/s: 8273405 +1157/20000 train_loss: 2.6880 train_time: 1.8m tok/s: 8273414 +1158/20000 train_loss: 2.7853 train_time: 1.8m tok/s: 8273375 +1159/20000 train_loss: 2.7171 train_time: 1.8m tok/s: 8273350 +1160/20000 train_loss: 2.6798 train_time: 1.8m tok/s: 8273427 +1161/20000 train_loss: 2.6598 train_time: 1.8m tok/s: 8273399 +1162/20000 train_loss: 2.7107 train_time: 1.8m tok/s: 8273391 +1163/20000 train_loss: 2.6988 train_time: 1.8m tok/s: 8273430 +1164/20000 train_loss: 2.6658 train_time: 1.8m tok/s: 8273378 +1165/20000 train_loss: 2.5309 train_time: 1.8m tok/s: 8273321 +1166/20000 train_loss: 2.7306 train_time: 1.8m tok/s: 8273302 +1167/20000 train_loss: 2.7658 train_time: 1.8m tok/s: 8273287 +1168/20000 train_loss: 2.5670 train_time: 1.9m tok/s: 8273221 +1169/20000 train_loss: 2.7004 train_time: 1.9m tok/s: 8273134 +1170/20000 train_loss: 2.9085 train_time: 1.9m tok/s: 8273059 +1171/20000 train_loss: 2.6624 train_time: 1.9m tok/s: 8273152 +1172/20000 train_loss: 2.7069 train_time: 1.9m tok/s: 8273247 +1173/20000 train_loss: 2.6422 train_time: 1.9m tok/s: 8273275 +1174/20000 train_loss: 2.7063 train_time: 1.9m tok/s: 8273235 +1175/20000 train_loss: 2.6584 train_time: 1.9m tok/s: 8273226 +1176/20000 train_loss: 2.7800 train_time: 1.9m tok/s: 8273228 +1177/20000 train_loss: 2.7805 train_time: 1.9m tok/s: 8273213 +1178/20000 train_loss: 2.6681 train_time: 1.9m tok/s: 8273101 +1179/20000 train_loss: 2.5433 train_time: 1.9m tok/s: 8273067 +1180/20000 train_loss: 2.6053 
train_time: 1.9m tok/s: 8272914 +1181/20000 train_loss: 2.6637 train_time: 1.9m tok/s: 8272890 +1182/20000 train_loss: 2.5823 train_time: 1.9m tok/s: 8272878 +1183/20000 train_loss: 2.7936 train_time: 1.9m tok/s: 8272884 +1184/20000 train_loss: 2.4905 train_time: 1.9m tok/s: 8272794 +1185/20000 train_loss: 2.6741 train_time: 1.9m tok/s: 8272806 +1186/20000 train_loss: 2.6644 train_time: 1.9m tok/s: 8272794 +1187/20000 train_loss: 2.7652 train_time: 1.9m tok/s: 8272752 +1188/20000 train_loss: 2.8818 train_time: 1.9m tok/s: 8272776 +1189/20000 train_loss: 2.6405 train_time: 1.9m tok/s: 8272854 +1190/20000 train_loss: 2.7019 train_time: 1.9m tok/s: 8272838 +1191/20000 train_loss: 2.6463 train_time: 1.9m tok/s: 8272716 +1192/20000 train_loss: 2.6771 train_time: 1.9m tok/s: 8272659 +1193/20000 train_loss: 2.6956 train_time: 1.9m tok/s: 8272640 +1194/20000 train_loss: 2.6995 train_time: 1.9m tok/s: 8272585 +1195/20000 train_loss: 2.6076 train_time: 1.9m tok/s: 8272589 +1196/20000 train_loss: 2.8445 train_time: 1.9m tok/s: 8272551 +1197/20000 train_loss: 2.5726 train_time: 1.9m tok/s: 8272527 +1198/20000 train_loss: 2.7006 train_time: 1.9m tok/s: 8272541 +1199/20000 train_loss: 2.7450 train_time: 1.9m tok/s: 8272638 +1200/20000 train_loss: 2.7309 train_time: 1.9m tok/s: 8272606 +1201/20000 train_loss: 2.7303 train_time: 1.9m tok/s: 8272588 +1202/20000 train_loss: 2.8306 train_time: 1.9m tok/s: 8272614 +1203/20000 train_loss: 2.6708 train_time: 1.9m tok/s: 8272648 +1204/20000 train_loss: 2.7127 train_time: 1.9m tok/s: 8272555 +1205/20000 train_loss: 2.7485 train_time: 1.9m tok/s: 8272572 +1206/20000 train_loss: 2.7649 train_time: 1.9m tok/s: 8272535 +1207/20000 train_loss: 2.5706 train_time: 1.9m tok/s: 8272541 +1208/20000 train_loss: 2.5686 train_time: 1.9m tok/s: 8272452 +1209/20000 train_loss: 2.6933 train_time: 1.9m tok/s: 8272390 +1210/20000 train_loss: 2.6285 train_time: 1.9m tok/s: 8272453 +1211/20000 train_loss: 2.5675 train_time: 1.9m tok/s: 8272503 +1212/20000 
train_loss: 2.5391 train_time: 1.9m tok/s: 8272298 +1213/20000 train_loss: 2.8106 train_time: 1.9m tok/s: 8272218 +1214/20000 train_loss: 2.6276 train_time: 1.9m tok/s: 8272353 +1215/20000 train_loss: 2.7171 train_time: 1.9m tok/s: 8272362 +1216/20000 train_loss: 2.6834 train_time: 1.9m tok/s: 8272357 +1217/20000 train_loss: 2.7655 train_time: 1.9m tok/s: 8272379 +1218/20000 train_loss: 2.7050 train_time: 1.9m tok/s: 8272424 +1219/20000 train_loss: 3.3161 train_time: 1.9m tok/s: 8272394 +1220/20000 train_loss: 2.6099 train_time: 1.9m tok/s: 8272290 +1221/20000 train_loss: 2.7679 train_time: 1.9m tok/s: 8272332 +1222/20000 train_loss: 2.5870 train_time: 1.9m tok/s: 8272295 +1223/20000 train_loss: 2.7023 train_time: 1.9m tok/s: 8272268 +1224/20000 train_loss: 2.7121 train_time: 1.9m tok/s: 8272245 +1225/20000 train_loss: 2.5347 train_time: 1.9m tok/s: 8272128 +1226/20000 train_loss: 2.6574 train_time: 1.9m tok/s: 8272056 +1227/20000 train_loss: 2.8647 train_time: 1.9m tok/s: 8272016 +1228/20000 train_loss: 2.6612 train_time: 1.9m tok/s: 8272000 +1229/20000 train_loss: 2.6648 train_time: 1.9m tok/s: 8272039 +1230/20000 train_loss: 2.7608 train_time: 1.9m tok/s: 8271976 +1231/20000 train_loss: 2.6980 train_time: 2.0m tok/s: 8271966 +1232/20000 train_loss: 2.6823 train_time: 2.0m tok/s: 8271915 +1233/20000 train_loss: 2.6627 train_time: 2.0m tok/s: 8271895 +1234/20000 train_loss: 2.6597 train_time: 2.0m tok/s: 8271874 +1235/20000 train_loss: 2.5947 train_time: 2.0m tok/s: 8271771 +1236/20000 train_loss: 2.6450 train_time: 2.0m tok/s: 8271728 +1237/20000 train_loss: 2.6226 train_time: 2.0m tok/s: 8271785 +1238/20000 train_loss: 2.5905 train_time: 2.0m tok/s: 8271790 +1239/20000 train_loss: 2.5854 train_time: 2.0m tok/s: 8271783 +1240/20000 train_loss: 2.5381 train_time: 2.0m tok/s: 8271825 +1241/20000 train_loss: 2.5908 train_time: 2.0m tok/s: 8271867 +1242/20000 train_loss: 2.5873 train_time: 2.0m tok/s: 8271682 +1243/20000 train_loss: 2.6826 train_time: 2.0m tok/s: 
8271492 +1244/20000 train_loss: 2.7852 train_time: 2.0m tok/s: 8271407 +1245/20000 train_loss: 2.6761 train_time: 2.0m tok/s: 8271267 +1246/20000 train_loss: 2.7898 train_time: 2.0m tok/s: 8271133 +1247/20000 train_loss: 2.7823 train_time: 2.0m tok/s: 8271036 +1248/20000 train_loss: 2.6588 train_time: 2.0m tok/s: 8271154 +1249/20000 train_loss: 2.6444 train_time: 2.0m tok/s: 8271141 +1250/20000 train_loss: 2.6480 train_time: 2.0m tok/s: 8271262 +1251/20000 train_loss: 2.6026 train_time: 2.0m tok/s: 8271286 +1252/20000 train_loss: 2.6788 train_time: 2.0m tok/s: 8271203 +1253/20000 train_loss: 2.6278 train_time: 2.0m tok/s: 8271164 +1254/20000 train_loss: 2.6861 train_time: 2.0m tok/s: 8271162 +1255/20000 train_loss: 2.4585 train_time: 2.0m tok/s: 8271194 +1256/20000 train_loss: 2.6725 train_time: 2.0m tok/s: 8271143 +1257/20000 train_loss: 2.6050 train_time: 2.0m tok/s: 8271059 +1258/20000 train_loss: 2.6371 train_time: 2.0m tok/s: 8271041 +1259/20000 train_loss: 2.7648 train_time: 2.0m tok/s: 8270918 +1260/20000 train_loss: 2.7043 train_time: 2.0m tok/s: 8271012 +1261/20000 train_loss: 2.7838 train_time: 2.0m tok/s: 8270979 +1262/20000 train_loss: 2.6962 train_time: 2.0m tok/s: 8270984 +1263/20000 train_loss: 2.7058 train_time: 2.0m tok/s: 8271037 +1264/20000 train_loss: 2.6339 train_time: 2.0m tok/s: 8271034 +1265/20000 train_loss: 2.6326 train_time: 2.0m tok/s: 8270958 +1266/20000 train_loss: 2.6479 train_time: 2.0m tok/s: 8270906 +1267/20000 train_loss: 2.6805 train_time: 2.0m tok/s: 8270817 +1268/20000 train_loss: 2.4899 train_time: 2.0m tok/s: 8270802 +1269/20000 train_loss: 2.6887 train_time: 2.0m tok/s: 8270820 +1270/20000 train_loss: 2.6524 train_time: 2.0m tok/s: 8270913 +1271/20000 train_loss: 2.5687 train_time: 2.0m tok/s: 8270903 +1272/20000 train_loss: 2.8080 train_time: 2.0m tok/s: 8271026 +1273/20000 train_loss: 2.7516 train_time: 2.0m tok/s: 8271025 +1274/20000 train_loss: 2.6803 train_time: 2.0m tok/s: 8271097 +1275/20000 train_loss: 2.7841 
train_time: 2.0m tok/s: 8271163 +1276/20000 train_loss: 2.7148 train_time: 2.0m tok/s: 8271165 +1277/20000 train_loss: 2.7025 train_time: 2.0m tok/s: 8271185 +1278/20000 train_loss: 2.6237 train_time: 2.0m tok/s: 8271169 +1279/20000 train_loss: 2.7185 train_time: 2.0m tok/s: 8271100 +1280/20000 train_loss: 2.6428 train_time: 2.0m tok/s: 8271113 +1281/20000 train_loss: 2.8235 train_time: 2.0m tok/s: 8271063 +1282/20000 train_loss: 2.5391 train_time: 2.0m tok/s: 8271035 +1283/20000 train_loss: 2.6306 train_time: 2.0m tok/s: 8270886 +1284/20000 train_loss: 2.6197 train_time: 2.0m tok/s: 8271054 +1285/20000 train_loss: 2.7764 train_time: 2.0m tok/s: 8271022 +1286/20000 train_loss: 2.6702 train_time: 2.0m tok/s: 8271020 +1287/20000 train_loss: 2.7007 train_time: 2.0m tok/s: 8270942 +1288/20000 train_loss: 2.7251 train_time: 2.0m tok/s: 8270939 +1289/20000 train_loss: 2.7507 train_time: 2.0m tok/s: 8270922 +1290/20000 train_loss: 2.6368 train_time: 2.0m tok/s: 8270894 +1291/20000 train_loss: 2.7467 train_time: 2.0m tok/s: 8270914 +1292/20000 train_loss: 2.7294 train_time: 2.0m tok/s: 8270942 +1293/20000 train_loss: 2.7200 train_time: 2.0m tok/s: 8270960 +1294/20000 train_loss: 2.7252 train_time: 2.1m tok/s: 8270914 +1295/20000 train_loss: 2.7454 train_time: 2.1m tok/s: 8270895 +1296/20000 train_loss: 2.6883 train_time: 2.1m tok/s: 8271033 +1297/20000 train_loss: 2.5949 train_time: 2.1m tok/s: 8271040 +1298/20000 train_loss: 2.6618 train_time: 2.1m tok/s: 8271061 +1299/20000 train_loss: 2.5169 train_time: 2.1m tok/s: 8270977 +1300/20000 train_loss: 2.6310 train_time: 2.1m tok/s: 8270949 +1301/20000 train_loss: 2.6918 train_time: 2.1m tok/s: 8270994 +1302/20000 train_loss: 2.6737 train_time: 2.1m tok/s: 8270957 +1303/20000 train_loss: 2.8991 train_time: 2.1m tok/s: 8270928 +1304/20000 train_loss: 2.7449 train_time: 2.1m tok/s: 8270897 +1305/20000 train_loss: 2.7690 train_time: 2.1m tok/s: 8270913 +1306/20000 train_loss: 2.8625 train_time: 2.1m tok/s: 8270958 +1307/20000 
train_loss: 2.6250 train_time: 2.1m tok/s: 8270912 +1308/20000 train_loss: 2.6301 train_time: 2.1m tok/s: 8270998 +1309/20000 train_loss: 2.6625 train_time: 2.1m tok/s: 8270977 +1310/20000 train_loss: 2.5681 train_time: 2.1m tok/s: 8270773 +1311/20000 train_loss: 2.6136 train_time: 2.1m tok/s: 8270632 +1312/20000 train_loss: 2.5481 train_time: 2.1m tok/s: 8270607 +1313/20000 train_loss: 2.5754 train_time: 2.1m tok/s: 8270588 +1314/20000 train_loss: 2.5602 train_time: 2.1m tok/s: 8270685 +1315/20000 train_loss: 2.4160 train_time: 2.1m tok/s: 8270545 +1316/20000 train_loss: 2.6861 train_time: 2.1m tok/s: 8270502 +1317/20000 train_loss: 2.6825 train_time: 2.1m tok/s: 8270579 +1318/20000 train_loss: 2.7179 train_time: 2.1m tok/s: 8270656 +1319/20000 train_loss: 2.8080 train_time: 2.1m tok/s: 8270617 +1320/20000 train_loss: 2.7331 train_time: 2.1m tok/s: 8270622 +1321/20000 train_loss: 2.7251 train_time: 2.1m tok/s: 8270616 +1322/20000 train_loss: 2.7416 train_time: 2.1m tok/s: 8270660 +1323/20000 train_loss: 2.5862 train_time: 2.1m tok/s: 8270544 +1324/20000 train_loss: 2.6470 train_time: 2.1m tok/s: 8270564 +1325/20000 train_loss: 2.8258 train_time: 2.1m tok/s: 8270557 +1326/20000 train_loss: 2.8342 train_time: 2.1m tok/s: 8270576 +1327/20000 train_loss: 2.6615 train_time: 2.1m tok/s: 8270603 +1328/20000 train_loss: 2.6558 train_time: 2.1m tok/s: 8270630 +1329/20000 train_loss: 2.7469 train_time: 2.1m tok/s: 8270616 +1330/20000 train_loss: 2.6174 train_time: 2.1m tok/s: 8270641 +1331/20000 train_loss: 2.6802 train_time: 2.1m tok/s: 8270561 +1332/20000 train_loss: 2.9139 train_time: 2.1m tok/s: 8270522 +1333/20000 train_loss: 2.8379 train_time: 2.1m tok/s: 8270443 +1334/20000 train_loss: 2.6754 train_time: 2.1m tok/s: 8270431 +1335/20000 train_loss: 2.6268 train_time: 2.1m tok/s: 8270404 +1336/20000 train_loss: 2.6569 train_time: 2.1m tok/s: 8270367 +1337/20000 train_loss: 2.8172 train_time: 2.1m tok/s: 8270371 +1338/20000 train_loss: 2.9068 train_time: 2.1m tok/s: 
8270404 +1339/20000 train_loss: 2.6921 train_time: 2.1m tok/s: 8270426 +1340/20000 train_loss: 2.5348 train_time: 2.1m tok/s: 8270459 +1341/20000 train_loss: 2.5550 train_time: 2.1m tok/s: 8270413 +1342/20000 train_loss: 2.6512 train_time: 2.1m tok/s: 8270402 +1343/20000 train_loss: 2.6729 train_time: 2.1m tok/s: 8270339 +1344/20000 train_loss: 2.6871 train_time: 2.1m tok/s: 8270405 +1345/20000 train_loss: 2.6406 train_time: 2.1m tok/s: 8270464 +1346/20000 train_loss: 2.7679 train_time: 2.1m tok/s: 8270507 +1347/20000 train_loss: 2.7940 train_time: 2.1m tok/s: 8270467 +1348/20000 train_loss: 2.6956 train_time: 2.1m tok/s: 8270489 +1349/20000 train_loss: 2.7260 train_time: 2.1m tok/s: 8270565 +1350/20000 train_loss: 2.6802 train_time: 2.1m tok/s: 8270625 +1351/20000 train_loss: 2.7822 train_time: 2.1m tok/s: 8270617 +1352/20000 train_loss: 2.6763 train_time: 2.1m tok/s: 8270585 +1353/20000 train_loss: 2.7108 train_time: 2.1m tok/s: 8270626 +1354/20000 train_loss: 2.3805 train_time: 2.1m tok/s: 8270595 +1355/20000 train_loss: 2.5813 train_time: 2.1m tok/s: 8270503 +1356/20000 train_loss: 2.6868 train_time: 2.1m tok/s: 8270679 +1357/20000 train_loss: 2.6799 train_time: 2.2m tok/s: 8270688 +1358/20000 train_loss: 2.7457 train_time: 2.2m tok/s: 8270692 +1359/20000 train_loss: 2.5038 train_time: 2.2m tok/s: 8270647 +1360/20000 train_loss: 2.7404 train_time: 2.2m tok/s: 8270588 +1361/20000 train_loss: 2.6286 train_time: 2.2m tok/s: 8270584 +1362/20000 train_loss: 2.6247 train_time: 2.2m tok/s: 8270584 +1363/20000 train_loss: 2.7094 train_time: 2.2m tok/s: 8270645 +1364/20000 train_loss: 2.5351 train_time: 2.2m tok/s: 8270628 +1365/20000 train_loss: 2.5324 train_time: 2.2m tok/s: 8270496 +1366/20000 train_loss: 2.6189 train_time: 2.2m tok/s: 8270475 +1367/20000 train_loss: 2.7035 train_time: 2.2m tok/s: 8270415 +1368/20000 train_loss: 2.5711 train_time: 2.2m tok/s: 8270530 +1369/20000 train_loss: 2.6776 train_time: 2.2m tok/s: 8270522 +1370/20000 train_loss: 2.7239 
train_time: 2.2m tok/s: 8270561 +1371/20000 train_loss: 2.6994 train_time: 2.2m tok/s: 8270506 +1372/20000 train_loss: 2.7319 train_time: 2.2m tok/s: 8270476 +1373/20000 train_loss: 2.6700 train_time: 2.2m tok/s: 8270463 +1374/20000 train_loss: 2.7778 train_time: 2.2m tok/s: 8270503 +1375/20000 train_loss: 2.7299 train_time: 2.2m tok/s: 8270585 +1376/20000 train_loss: 2.5925 train_time: 2.2m tok/s: 8270534 +1377/20000 train_loss: 2.6408 train_time: 2.2m tok/s: 8270498 +1378/20000 train_loss: 2.5872 train_time: 2.2m tok/s: 8270507 +1379/20000 train_loss: 2.6345 train_time: 2.2m tok/s: 8270479 +1380/20000 train_loss: 2.5997 train_time: 2.2m tok/s: 8270521 +1381/20000 train_loss: 2.6470 train_time: 2.2m tok/s: 8270485 +1382/20000 train_loss: 2.6647 train_time: 2.2m tok/s: 8270449 +1383/20000 train_loss: 2.6829 train_time: 2.2m tok/s: 8270483 +1384/20000 train_loss: 2.5805 train_time: 2.2m tok/s: 8270457 +1385/20000 train_loss: 2.6130 train_time: 2.2m tok/s: 8270415 +1386/20000 train_loss: 2.7666 train_time: 2.2m tok/s: 8270461 +1387/20000 train_loss: 2.6778 train_time: 2.2m tok/s: 8270487 +1388/20000 train_loss: 2.7559 train_time: 2.2m tok/s: 8270499 +1389/20000 train_loss: 2.6122 train_time: 2.2m tok/s: 8270484 +1390/20000 train_loss: 2.7408 train_time: 2.2m tok/s: 8270515 +1391/20000 train_loss: 2.5464 train_time: 2.2m tok/s: 8270457 +1392/20000 train_loss: 2.7404 train_time: 2.2m tok/s: 8270517 +1393/20000 train_loss: 2.6354 train_time: 2.2m tok/s: 8270511 +1394/20000 train_loss: 2.9067 train_time: 2.2m tok/s: 8270546 +1395/20000 train_loss: 2.5011 train_time: 2.2m tok/s: 8270494 +1396/20000 train_loss: 2.8269 train_time: 2.2m tok/s: 8270565 +1397/20000 train_loss: 2.7196 train_time: 2.2m tok/s: 8270535 +1398/20000 train_loss: 2.8019 train_time: 2.2m tok/s: 8270595 +1399/20000 train_loss: 2.6432 train_time: 2.2m tok/s: 8270645 +1400/20000 train_loss: 2.7398 train_time: 2.2m tok/s: 8270691 +1401/20000 train_loss: 2.7337 train_time: 2.2m tok/s: 8270682 +1402/20000 
train_loss: 2.5672 train_time: 2.2m tok/s: 8270663 +1403/20000 train_loss: 2.5967 train_time: 2.2m tok/s: 8270698 +1404/20000 train_loss: 2.6853 train_time: 2.2m tok/s: 8270699 +1405/20000 train_loss: 2.7235 train_time: 2.2m tok/s: 8270756 +1406/20000 train_loss: 2.8445 train_time: 2.2m tok/s: 8270692 +1407/20000 train_loss: 2.5706 train_time: 2.2m tok/s: 8270607 +1408/20000 train_loss: 2.6950 train_time: 2.2m tok/s: 8270567 +1409/20000 train_loss: 2.7739 train_time: 2.2m tok/s: 8270539 +1410/20000 train_loss: 2.6606 train_time: 2.2m tok/s: 8270573 +1411/20000 train_loss: 2.6970 train_time: 2.2m tok/s: 8270607 +1412/20000 train_loss: 2.7363 train_time: 2.2m tok/s: 8270607 +1413/20000 train_loss: 2.6214 train_time: 2.2m tok/s: 8270553 +1414/20000 train_loss: 2.6030 train_time: 2.2m tok/s: 8270662 +1415/20000 train_loss: 2.6628 train_time: 2.2m tok/s: 8270725 +1416/20000 train_loss: 2.6205 train_time: 2.2m tok/s: 8270789 +1417/20000 train_loss: 2.6286 train_time: 2.2m tok/s: 8270812 +1418/20000 train_loss: 2.7583 train_time: 2.2m tok/s: 8270790 +1419/20000 train_loss: 2.6680 train_time: 2.2m tok/s: 8270751 +1420/20000 train_loss: 2.6302 train_time: 2.3m tok/s: 8270723 +1421/20000 train_loss: 2.7584 train_time: 2.3m tok/s: 8270749 +1422/20000 train_loss: 2.7406 train_time: 2.3m tok/s: 8270756 +1423/20000 train_loss: 2.7028 train_time: 2.3m tok/s: 8270800 +1424/20000 train_loss: 2.7122 train_time: 2.3m tok/s: 8270807 +1425/20000 train_loss: 2.6254 train_time: 2.3m tok/s: 8270809 +1426/20000 train_loss: 2.6917 train_time: 2.3m tok/s: 8270802 +1427/20000 train_loss: 2.6507 train_time: 2.3m tok/s: 8270818 +1428/20000 train_loss: 2.6508 train_time: 2.3m tok/s: 8270792 +1429/20000 train_loss: 2.5917 train_time: 2.3m tok/s: 8270729 +1430/20000 train_loss: 2.6300 train_time: 2.3m tok/s: 8270706 +1431/20000 train_loss: 2.6045 train_time: 2.3m tok/s: 8270725 +1432/20000 train_loss: 2.4441 train_time: 2.3m tok/s: 8270686 +1433/20000 train_loss: 2.6305 train_time: 2.3m tok/s: 
8270650 +1434/20000 train_loss: 2.7485 train_time: 2.3m tok/s: 8270610 +1435/20000 train_loss: 2.7226 train_time: 2.3m tok/s: 8270632 +1436/20000 train_loss: 2.6228 train_time: 2.3m tok/s: 8270737 +1437/20000 train_loss: 2.7086 train_time: 2.3m tok/s: 8270756 +1438/20000 train_loss: 2.7516 train_time: 2.3m tok/s: 8270818 +1439/20000 train_loss: 2.6718 train_time: 2.3m tok/s: 8270809 +1440/20000 train_loss: 2.7113 train_time: 2.3m tok/s: 8270661 +1441/20000 train_loss: 2.6674 train_time: 2.3m tok/s: 8270639 +1442/20000 train_loss: 2.5837 train_time: 2.3m tok/s: 8270569 +1443/20000 train_loss: 2.5979 train_time: 2.3m tok/s: 8270560 +1444/20000 train_loss: 2.5518 train_time: 2.3m tok/s: 8270628 +1445/20000 train_loss: 2.6714 train_time: 2.3m tok/s: 8270533 +1446/20000 train_loss: 2.7767 train_time: 2.3m tok/s: 8270484 +1447/20000 train_loss: 2.7219 train_time: 2.3m tok/s: 8270495 +1448/20000 train_loss: 2.7042 train_time: 2.3m tok/s: 8270530 +1449/20000 train_loss: 2.6409 train_time: 2.3m tok/s: 8270591 +1450/20000 train_loss: 2.7357 train_time: 2.3m tok/s: 8270621 +1451/20000 train_loss: 2.5988 train_time: 2.3m tok/s: 8270585 +1452/20000 train_loss: 2.6215 train_time: 2.3m tok/s: 8270570 +1453/20000 train_loss: 2.6470 train_time: 2.3m tok/s: 8270596 +1454/20000 train_loss: 2.7532 train_time: 2.3m tok/s: 8270513 +1455/20000 train_loss: 2.5730 train_time: 2.3m tok/s: 8270532 +1456/20000 train_loss: 2.4581 train_time: 2.3m tok/s: 8270561 +1457/20000 train_loss: 2.4810 train_time: 2.3m tok/s: 8270575 +1458/20000 train_loss: 2.5994 train_time: 2.3m tok/s: 8270493 +1459/20000 train_loss: 2.6739 train_time: 2.3m tok/s: 8270485 +1460/20000 train_loss: 2.7018 train_time: 2.3m tok/s: 8270549 +1461/20000 train_loss: 2.7760 train_time: 2.3m tok/s: 8270581 +1462/20000 train_loss: 2.6521 train_time: 2.3m tok/s: 8270607 +1463/20000 train_loss: 2.6654 train_time: 2.3m tok/s: 8270622 +1464/20000 train_loss: 2.6520 train_time: 2.3m tok/s: 8270567 +1465/20000 train_loss: 2.6820 
train_time: 2.3m tok/s: 8270499 +1466/20000 train_loss: 2.6196 train_time: 2.3m tok/s: 8270543 +1467/20000 train_loss: 2.6202 train_time: 2.3m tok/s: 8270629 +1468/20000 train_loss: 2.5601 train_time: 2.3m tok/s: 8270511 +1469/20000 train_loss: 2.5881 train_time: 2.3m tok/s: 8270518 +1470/20000 train_loss: 2.5077 train_time: 2.3m tok/s: 8270517 +1471/20000 train_loss: 2.8105 train_time: 2.3m tok/s: 8270500 +1472/20000 train_loss: 2.8809 train_time: 2.3m tok/s: 8270435 +1473/20000 train_loss: 2.7799 train_time: 2.3m tok/s: 8270319 +1474/20000 train_loss: 2.7368 train_time: 2.3m tok/s: 8270296 +1475/20000 train_loss: 2.6898 train_time: 2.3m tok/s: 8270369 +1476/20000 train_loss: 2.7751 train_time: 2.3m tok/s: 8270396 +1477/20000 train_loss: 2.6019 train_time: 2.3m tok/s: 8270378 +1478/20000 train_loss: 2.6079 train_time: 2.3m tok/s: 8270368 +1479/20000 train_loss: 2.5889 train_time: 2.3m tok/s: 8270408 +1480/20000 train_loss: 2.6080 train_time: 2.3m tok/s: 8270496 +1481/20000 train_loss: 2.6360 train_time: 2.3m tok/s: 8270547 +1482/20000 train_loss: 3.0598 train_time: 2.3m tok/s: 8270464 +1483/20000 train_loss: 2.6962 train_time: 2.4m tok/s: 8270348 +1484/20000 train_loss: 2.6308 train_time: 2.4m tok/s: 8270432 +1485/20000 train_loss: 2.7918 train_time: 2.4m tok/s: 8270467 +1486/20000 train_loss: 2.6001 train_time: 2.4m tok/s: 8270431 +1487/20000 train_loss: 2.6952 train_time: 2.4m tok/s: 8270404 +1488/20000 train_loss: 2.6568 train_time: 2.4m tok/s: 8270481 +1489/20000 train_loss: 2.5767 train_time: 2.4m tok/s: 8270493 +1490/20000 train_loss: 2.6729 train_time: 2.4m tok/s: 8270522 +1491/20000 train_loss: 2.6710 train_time: 2.4m tok/s: 8270500 +1492/20000 train_loss: 2.6003 train_time: 2.4m tok/s: 8270499 +1493/20000 train_loss: 2.6659 train_time: 2.4m tok/s: 8270517 +1494/20000 train_loss: 2.6391 train_time: 2.4m tok/s: 8270507 +1495/20000 train_loss: 2.5699 train_time: 2.4m tok/s: 8270477 +1496/20000 train_loss: 2.6783 train_time: 2.4m tok/s: 8270422 +1497/20000 
train_loss: 2.5975 train_time: 2.4m tok/s: 8270434 +1498/20000 train_loss: 2.9254 train_time: 2.4m tok/s: 8270512 +1499/20000 train_loss: 2.6914 train_time: 2.4m tok/s: 8270499 +1500/20000 train_loss: 2.7211 train_time: 2.4m tok/s: 8270543 +1501/20000 train_loss: 2.6918 train_time: 2.4m tok/s: 8270567 +1502/20000 train_loss: 2.8059 train_time: 2.4m tok/s: 8270513 +1503/20000 train_loss: 2.6777 train_time: 2.4m tok/s: 8270422 +1504/20000 train_loss: 2.7333 train_time: 2.4m tok/s: 8270440 +1505/20000 train_loss: 2.6807 train_time: 2.4m tok/s: 8270517 +1506/20000 train_loss: 2.7115 train_time: 2.4m tok/s: 8270476 +1507/20000 train_loss: 2.8285 train_time: 2.4m tok/s: 8270482 +1508/20000 train_loss: 2.5349 train_time: 2.4m tok/s: 8270426 +1509/20000 train_loss: 2.5837 train_time: 2.4m tok/s: 8270462 +1510/20000 train_loss: 2.5422 train_time: 2.4m tok/s: 8270492 +1511/20000 train_loss: 2.5024 train_time: 2.4m tok/s: 8270397 +1512/20000 train_loss: 2.5704 train_time: 2.4m tok/s: 8270415 +1513/20000 train_loss: 2.7204 train_time: 2.4m tok/s: 8270457 +1514/20000 train_loss: 2.7420 train_time: 2.4m tok/s: 8270462 +1515/20000 train_loss: 2.6996 train_time: 2.4m tok/s: 8270477 +1516/20000 train_loss: 2.5885 train_time: 2.4m tok/s: 8270436 +1517/20000 train_loss: 2.5922 train_time: 2.4m tok/s: 8270427 +1518/20000 train_loss: 2.7324 train_time: 2.4m tok/s: 8270431 +1519/20000 train_loss: 2.6603 train_time: 2.4m tok/s: 8270389 +1520/20000 train_loss: 2.6695 train_time: 2.4m tok/s: 8270407 +1521/20000 train_loss: 2.6455 train_time: 2.4m tok/s: 8270441 +1522/20000 train_loss: 2.6485 train_time: 2.4m tok/s: 8270513 +1523/20000 train_loss: 2.6795 train_time: 2.4m tok/s: 8270487 +1524/20000 train_loss: 2.6406 train_time: 2.4m tok/s: 8270429 +1525/20000 train_loss: 2.6305 train_time: 2.4m tok/s: 8270370 +1526/20000 train_loss: 2.7235 train_time: 2.4m tok/s: 8270307 +1527/20000 train_loss: 2.6760 train_time: 2.4m tok/s: 8270264 +1528/20000 train_loss: 2.4670 train_time: 2.4m tok/s: 
8270264 +1529/20000 train_loss: 2.6278 train_time: 2.4m tok/s: 8270346 +1530/20000 train_loss: 2.6118 train_time: 2.4m tok/s: 8270379 +1531/20000 train_loss: 2.3620 train_time: 2.4m tok/s: 8270357 +1532/20000 train_loss: 2.6195 train_time: 2.4m tok/s: 8270363 +1533/20000 train_loss: 2.6772 train_time: 2.4m tok/s: 8270331 +1534/20000 train_loss: 2.6350 train_time: 2.4m tok/s: 8270269 +1535/20000 train_loss: 2.7697 train_time: 2.4m tok/s: 8270297 +1536/20000 train_loss: 2.7000 train_time: 2.4m tok/s: 8270234 +1537/20000 train_loss: 3.0812 train_time: 2.4m tok/s: 8270200 +1538/20000 train_loss: 2.7208 train_time: 2.4m tok/s: 8270160 +1539/20000 train_loss: 2.6305 train_time: 2.4m tok/s: 8270180 +1540/20000 train_loss: 2.6815 train_time: 2.4m tok/s: 8270204 +1541/20000 train_loss: 2.5916 train_time: 2.4m tok/s: 8270237 +1542/20000 train_loss: 2.6194 train_time: 2.4m tok/s: 8270316 +1543/20000 train_loss: 2.6343 train_time: 2.4m tok/s: 8270369 +1544/20000 train_loss: 2.5819 train_time: 2.4m tok/s: 8270334 +1545/20000 train_loss: 2.6209 train_time: 2.4m tok/s: 8270380 +1546/20000 train_loss: 2.4978 train_time: 2.5m tok/s: 8270372 +1547/20000 train_loss: 2.7439 train_time: 2.5m tok/s: 8270314 +1548/20000 train_loss: 2.6878 train_time: 2.5m tok/s: 8270320 +1549/20000 train_loss: 2.5786 train_time: 2.5m tok/s: 8270318 +1550/20000 train_loss: 2.7158 train_time: 2.5m tok/s: 8270266 +1551/20000 train_loss: 2.6758 train_time: 2.5m tok/s: 8270255 +1552/20000 train_loss: 2.5518 train_time: 2.5m tok/s: 8270297 +1553/20000 train_loss: 2.4901 train_time: 2.5m tok/s: 8270283 +1554/20000 train_loss: 2.5903 train_time: 2.5m tok/s: 8270300 +1555/20000 train_loss: 2.6307 train_time: 2.5m tok/s: 8270306 +1556/20000 train_loss: 2.5158 train_time: 2.5m tok/s: 8270283 +1557/20000 train_loss: 2.5564 train_time: 2.5m tok/s: 8270213 +1558/20000 train_loss: 2.5730 train_time: 2.5m tok/s: 8270179 +1559/20000 train_loss: 2.5503 train_time: 2.5m tok/s: 8270185 +1560/20000 train_loss: 2.6215 
train_time: 2.5m tok/s: 8270195 +1561/20000 train_loss: 2.5461 train_time: 2.5m tok/s: 8270129 +1562/20000 train_loss: 2.5889 train_time: 2.5m tok/s: 8270166 +1563/20000 train_loss: 2.4931 train_time: 2.5m tok/s: 8270201 +1564/20000 train_loss: 2.5873 train_time: 2.5m tok/s: 8270232 +1565/20000 train_loss: 2.5706 train_time: 2.5m tok/s: 8270203 +1566/20000 train_loss: 2.7476 train_time: 2.5m tok/s: 8270196 +1567/20000 train_loss: 2.6869 train_time: 2.5m tok/s: 8270214 +1568/20000 train_loss: 2.5299 train_time: 2.5m tok/s: 8270238 +1569/20000 train_loss: 2.5974 train_time: 2.5m tok/s: 8270250 +1570/20000 train_loss: 2.5485 train_time: 2.5m tok/s: 8270198 +1571/20000 train_loss: 2.6191 train_time: 2.5m tok/s: 8270179 +1572/20000 train_loss: 3.2035 train_time: 2.5m tok/s: 8270182 +1573/20000 train_loss: 2.7553 train_time: 2.5m tok/s: 8270167 +1574/20000 train_loss: 2.6013 train_time: 2.5m tok/s: 8270132 +1575/20000 train_loss: 2.5516 train_time: 2.5m tok/s: 8270142 +1576/20000 train_loss: 2.5406 train_time: 2.5m tok/s: 8270139 +1577/20000 train_loss: 2.5758 train_time: 2.5m tok/s: 8270106 +1578/20000 train_loss: 2.4999 train_time: 2.5m tok/s: 8270048 +1579/20000 train_loss: 2.7629 train_time: 2.5m tok/s: 8270036 +1580/20000 train_loss: 2.6526 train_time: 2.5m tok/s: 8270087 +1581/20000 train_loss: 2.4929 train_time: 2.5m tok/s: 8270091 +1582/20000 train_loss: 2.5170 train_time: 2.5m tok/s: 8270019 +1583/20000 train_loss: 2.5842 train_time: 2.5m tok/s: 8270038 +1584/20000 train_loss: 2.5631 train_time: 2.5m tok/s: 8270105 +1585/20000 train_loss: 2.7101 train_time: 2.5m tok/s: 8270178 +1586/20000 train_loss: 2.5567 train_time: 2.5m tok/s: 8270181 +1587/20000 train_loss: 2.5940 train_time: 2.5m tok/s: 8270189 +1588/20000 train_loss: 2.6345 train_time: 2.5m tok/s: 8270193 +1589/20000 train_loss: 2.6973 train_time: 2.5m tok/s: 8270172 +1590/20000 train_loss: 2.6502 train_time: 2.5m tok/s: 8270154 +1591/20000 train_loss: 2.6407 train_time: 2.5m tok/s: 8270172 +1592/20000 
train_loss: 2.5706 train_time: 2.5m tok/s: 8270223 +1593/20000 train_loss: 2.6411 train_time: 2.5m tok/s: 8270241 +1594/20000 train_loss: 2.7418 train_time: 2.5m tok/s: 8270268 +1595/20000 train_loss: 2.6727 train_time: 2.5m tok/s: 8270263 +1596/20000 train_loss: 2.4467 train_time: 2.5m tok/s: 8270250 +1597/20000 train_loss: 2.5642 train_time: 2.5m tok/s: 8270214 +1598/20000 train_loss: 2.6222 train_time: 2.5m tok/s: 8270201 +1599/20000 train_loss: 2.6190 train_time: 2.5m tok/s: 8270245 +1600/20000 train_loss: 2.8158 train_time: 2.5m tok/s: 8270277 +1601/20000 train_loss: 2.6533 train_time: 2.5m tok/s: 8270301 +1602/20000 train_loss: 2.7566 train_time: 2.5m tok/s: 8270176 +1603/20000 train_loss: 2.5788 train_time: 2.5m tok/s: 8270173 +1604/20000 train_loss: 2.6001 train_time: 2.5m tok/s: 8270214 +1605/20000 train_loss: 2.6217 train_time: 2.5m tok/s: 8270162 +1606/20000 train_loss: 2.6111 train_time: 2.5m tok/s: 8270149 +1607/20000 train_loss: 2.5305 train_time: 2.5m tok/s: 8270186 +1608/20000 train_loss: 2.5057 train_time: 2.5m tok/s: 8270186 +1609/20000 train_loss: 2.7044 train_time: 2.6m tok/s: 8270221 +1610/20000 train_loss: 2.6040 train_time: 2.6m tok/s: 8270207 +1611/20000 train_loss: 2.5886 train_time: 2.6m tok/s: 8270210 +1612/20000 train_loss: 2.6554 train_time: 2.6m tok/s: 8270197 +1613/20000 train_loss: 2.6536 train_time: 2.6m tok/s: 8270237 +1614/20000 train_loss: 2.7161 train_time: 2.6m tok/s: 8270248 +1615/20000 train_loss: 2.7244 train_time: 2.6m tok/s: 8270227 +1616/20000 train_loss: 2.6620 train_time: 2.6m tok/s: 8270301 +1617/20000 train_loss: 2.6044 train_time: 2.6m tok/s: 8270154 +1618/20000 train_loss: 3.0161 train_time: 2.6m tok/s: 8270290 +1619/20000 train_loss: 2.7357 train_time: 2.6m tok/s: 8270283 +1620/20000 train_loss: 2.5606 train_time: 2.6m tok/s: 8270311 +1621/20000 train_loss: 2.5782 train_time: 2.6m tok/s: 8270341 +1622/20000 train_loss: 2.7564 train_time: 2.6m tok/s: 8270377 +1623/20000 train_loss: 2.6709 train_time: 2.6m tok/s: 
8270365 +1624/20000 train_loss: 2.6237 train_time: 2.6m tok/s: 8270408 +1625/20000 train_loss: 2.6323 train_time: 2.6m tok/s: 8270423 +1626/20000 train_loss: 2.7038 train_time: 2.6m tok/s: 8270436 +1627/20000 train_loss: 2.4403 train_time: 2.6m tok/s: 8270432 +1628/20000 train_loss: 2.5967 train_time: 2.6m tok/s: 8270441 +1629/20000 train_loss: 2.5721 train_time: 2.6m tok/s: 8270486 +1630/20000 train_loss: 2.5862 train_time: 2.6m tok/s: 8270523 +1631/20000 train_loss: 2.8018 train_time: 2.6m tok/s: 8270553 +1632/20000 train_loss: 2.7057 train_time: 2.6m tok/s: 8270591 +1633/20000 train_loss: 2.6693 train_time: 2.6m tok/s: 8270573 +1634/20000 train_loss: 2.6245 train_time: 2.6m tok/s: 8270590 +1635/20000 train_loss: 2.6785 train_time: 2.6m tok/s: 8270594 +1636/20000 train_loss: 2.4657 train_time: 2.6m tok/s: 8270599 +1637/20000 train_loss: 2.5595 train_time: 2.6m tok/s: 8270482 +1638/20000 train_loss: 2.5024 train_time: 2.6m tok/s: 8270445 +1639/20000 train_loss: 2.5287 train_time: 2.6m tok/s: 8270411 +1640/20000 train_loss: 2.3754 train_time: 2.6m tok/s: 8270436 +1641/20000 train_loss: 2.5488 train_time: 2.6m tok/s: 8270435 +1642/20000 train_loss: 2.7542 train_time: 2.6m tok/s: 8270460 +1643/20000 train_loss: 2.4456 train_time: 2.6m tok/s: 8270499 +1644/20000 train_loss: 2.4736 train_time: 2.6m tok/s: 8270525 +1645/20000 train_loss: 2.7629 train_time: 2.6m tok/s: 8270451 +1646/20000 train_loss: 2.5206 train_time: 2.6m tok/s: 8270432 +1647/20000 train_loss: 2.7447 train_time: 2.6m tok/s: 8270388 +1648/20000 train_loss: 2.6432 train_time: 2.6m tok/s: 8270380 +1649/20000 train_loss: 2.7490 train_time: 2.6m tok/s: 8270360 +1650/20000 train_loss: 2.5658 train_time: 2.6m tok/s: 8270328 +1651/20000 train_loss: 2.7362 train_time: 2.6m tok/s: 8270364 +1652/20000 train_loss: 2.6462 train_time: 2.6m tok/s: 8270400 +1653/20000 train_loss: 2.7514 train_time: 2.6m tok/s: 8270454 +1654/20000 train_loss: 2.6774 train_time: 2.6m tok/s: 8270473 +1655/20000 train_loss: 2.5615 
train_time: 2.6m tok/s: 8270511 +1656/20000 train_loss: 2.6057 train_time: 2.6m tok/s: 8270569 +1657/20000 train_loss: 2.6442 train_time: 2.6m tok/s: 8270490 +1658/20000 train_loss: 2.6361 train_time: 2.6m tok/s: 8270524 +1659/20000 train_loss: 2.5921 train_time: 2.6m tok/s: 8270543 +1660/20000 train_loss: 2.5560 train_time: 2.6m tok/s: 8270510 +1661/20000 train_loss: 2.7493 train_time: 2.6m tok/s: 8270473 +1662/20000 train_loss: 2.7388 train_time: 2.6m tok/s: 8270334 +1663/20000 train_loss: 2.8002 train_time: 2.6m tok/s: 8270220 +1664/20000 train_loss: 2.8086 train_time: 2.6m tok/s: 8270242 +1665/20000 train_loss: 2.8016 train_time: 2.6m tok/s: 8270212 +1666/20000 train_loss: 2.7030 train_time: 2.6m tok/s: 8270191 +1667/20000 train_loss: 2.6114 train_time: 2.6m tok/s: 8270181 +1668/20000 train_loss: 2.6355 train_time: 2.6m tok/s: 8270188 +1669/20000 train_loss: 2.7580 train_time: 2.6m tok/s: 8270207 +1670/20000 train_loss: 2.5550 train_time: 2.6m tok/s: 8270173 +1671/20000 train_loss: 2.4835 train_time: 2.6m tok/s: 8270179 +1672/20000 train_loss: 2.6117 train_time: 2.6m tok/s: 8270188 +1673/20000 train_loss: 2.5763 train_time: 2.7m tok/s: 8270210 +1674/20000 train_loss: 2.6455 train_time: 2.7m tok/s: 8270267 +1675/20000 train_loss: 2.4375 train_time: 2.7m tok/s: 8270242 +1676/20000 train_loss: 2.6874 train_time: 2.7m tok/s: 8270247 +1677/20000 train_loss: 2.6034 train_time: 2.7m tok/s: 8270266 +1678/20000 train_loss: 2.6733 train_time: 2.7m tok/s: 8270203 +1679/20000 train_loss: 2.6140 train_time: 2.7m tok/s: 8270146 +1680/20000 train_loss: 2.5398 train_time: 2.7m tok/s: 8270112 +1681/20000 train_loss: 2.5185 train_time: 2.7m tok/s: 8270143 +1682/20000 train_loss: 2.6251 train_time: 2.7m tok/s: 8270185 +1683/20000 train_loss: 2.6256 train_time: 2.7m tok/s: 8270129 +1684/20000 train_loss: 2.5788 train_time: 2.7m tok/s: 8270089 +1685/20000 train_loss: 2.6792 train_time: 2.7m tok/s: 8270102 +1686/20000 train_loss: 2.5840 train_time: 2.7m tok/s: 8270146 +1687/20000 
train_loss: 2.5504 train_time: 2.7m tok/s: 8270154 +1688/20000 train_loss: 2.5705 train_time: 2.7m tok/s: 8270154 +1689/20000 train_loss: 2.5242 train_time: 2.7m tok/s: 8270139 +1690/20000 train_loss: 2.8012 train_time: 2.7m tok/s: 8270164 +1691/20000 train_loss: 2.5931 train_time: 2.7m tok/s: 8270179 +1692/20000 train_loss: 2.5863 train_time: 2.7m tok/s: 8270185 +1693/20000 train_loss: 2.3909 train_time: 2.7m tok/s: 8270212 +1694/20000 train_loss: 2.6178 train_time: 2.7m tok/s: 8270268 +1695/20000 train_loss: 2.6185 train_time: 2.7m tok/s: 8270291 +1696/20000 train_loss: 2.6575 train_time: 2.7m tok/s: 8270373 +1697/20000 train_loss: 2.7568 train_time: 2.7m tok/s: 8270418 +1698/20000 train_loss: 2.6522 train_time: 2.7m tok/s: 8270484 +1699/20000 train_loss: 2.7500 train_time: 2.7m tok/s: 8270441 +1700/20000 train_loss: 2.5849 train_time: 2.7m tok/s: 8270464 +1701/20000 train_loss: 2.4808 train_time: 2.7m tok/s: 8270479 +1702/20000 train_loss: 2.6125 train_time: 2.7m tok/s: 8270489 +1703/20000 train_loss: 2.6481 train_time: 2.7m tok/s: 8270509 +1704/20000 train_loss: 2.7581 train_time: 2.7m tok/s: 8270545 +1705/20000 train_loss: 2.7739 train_time: 2.7m tok/s: 8270495 +1706/20000 train_loss: 2.7012 train_time: 2.7m tok/s: 8270525 +1707/20000 train_loss: 2.8443 train_time: 2.7m tok/s: 8270557 +1708/20000 train_loss: 2.4774 train_time: 2.7m tok/s: 8270494 +1709/20000 train_loss: 2.6630 train_time: 2.7m tok/s: 8270459 +1710/20000 train_loss: 2.6381 train_time: 2.7m tok/s: 8270519 +1711/20000 train_loss: 2.5757 train_time: 2.7m tok/s: 8270547 +1712/20000 train_loss: 2.7029 train_time: 2.7m tok/s: 8270535 +1713/20000 train_loss: 2.7852 train_time: 2.7m tok/s: 8270564 +1714/20000 train_loss: 2.5642 train_time: 2.7m tok/s: 8270558 +1715/20000 train_loss: 2.7500 train_time: 2.7m tok/s: 8270520 +1716/20000 train_loss: 2.7289 train_time: 2.7m tok/s: 8270485 +1717/20000 train_loss: 2.7390 train_time: 2.7m tok/s: 8270493 +1718/20000 train_loss: 2.8177 train_time: 2.7m tok/s: 
8270536 +1719/20000 train_loss: 2.7152 train_time: 2.7m tok/s: 8270530 +1720/20000 train_loss: 2.5324 train_time: 2.7m tok/s: 8270482 +1721/20000 train_loss: 2.6030 train_time: 2.7m tok/s: 8270521 +1722/20000 train_loss: 2.7324 train_time: 2.7m tok/s: 8270526 +1723/20000 train_loss: 2.6240 train_time: 2.7m tok/s: 8270556 +1724/20000 train_loss: 2.6658 train_time: 2.7m tok/s: 8270585 +1725/20000 train_loss: 2.5983 train_time: 2.7m tok/s: 8270558 +1726/20000 train_loss: 2.6301 train_time: 2.7m tok/s: 8270564 +1727/20000 train_loss: 2.5945 train_time: 2.7m tok/s: 8270563 +1728/20000 train_loss: 2.8364 train_time: 2.7m tok/s: 8270574 +1729/20000 train_loss: 2.6610 train_time: 2.7m tok/s: 8270557 +1730/20000 train_loss: 2.7433 train_time: 2.7m tok/s: 8270581 +1731/20000 train_loss: 2.7461 train_time: 2.7m tok/s: 8270615 +1732/20000 train_loss: 2.7099 train_time: 2.7m tok/s: 8270696 +1733/20000 train_loss: 2.7003 train_time: 2.7m tok/s: 8270719 +1734/20000 train_loss: 2.6089 train_time: 2.7m tok/s: 8270678 +1735/20000 train_loss: 2.4654 train_time: 2.7m tok/s: 8270658 +1736/20000 train_loss: 2.6996 train_time: 2.8m tok/s: 8270559 +1737/20000 train_loss: 2.5773 train_time: 2.8m tok/s: 8270550 +1738/20000 train_loss: 2.8106 train_time: 2.8m tok/s: 8270560 +1739/20000 train_loss: 2.7449 train_time: 2.8m tok/s: 8270483 +1740/20000 train_loss: 2.3891 train_time: 2.8m tok/s: 8270542 +1741/20000 train_loss: 2.7966 train_time: 2.8m tok/s: 8270595 +1742/20000 train_loss: 2.6085 train_time: 2.8m tok/s: 8270590 +1743/20000 train_loss: 2.5202 train_time: 2.8m tok/s: 8270572 +1744/20000 train_loss: 2.5886 train_time: 2.8m tok/s: 8270569 +1745/20000 train_loss: 2.6338 train_time: 2.8m tok/s: 8270616 +1746/20000 train_loss: 2.6045 train_time: 2.8m tok/s: 8270648 +1747/20000 train_loss: 2.6416 train_time: 2.8m tok/s: 8270649 +1748/20000 train_loss: 2.5712 train_time: 2.8m tok/s: 8270650 +1749/20000 train_loss: 2.6336 train_time: 2.8m tok/s: 8270629 +1750/20000 train_loss: 2.6746 
train_time: 2.8m tok/s: 8270624 +1751/20000 train_loss: 2.6711 train_time: 2.8m tok/s: 8270624 +1752/20000 train_loss: 2.6279 train_time: 2.8m tok/s: 8270680 +1753/20000 train_loss: 2.6126 train_time: 2.8m tok/s: 8270719 +1754/20000 train_loss: 2.6790 train_time: 2.8m tok/s: 8270722 +1755/20000 train_loss: 2.5904 train_time: 2.8m tok/s: 8270757 +1756/20000 train_loss: 2.6107 train_time: 2.8m tok/s: 8270709 +1757/20000 train_loss: 2.5868 train_time: 2.8m tok/s: 8270678 +1758/20000 train_loss: 2.8622 train_time: 2.8m tok/s: 8270668 +1759/20000 train_loss: 2.6338 train_time: 2.8m tok/s: 8270628 +1760/20000 train_loss: 2.5321 train_time: 2.8m tok/s: 8270638 +1761/20000 train_loss: 2.6417 train_time: 2.8m tok/s: 8270684 +1762/20000 train_loss: 2.6987 train_time: 2.8m tok/s: 8270671 +1763/20000 train_loss: 2.7128 train_time: 2.8m tok/s: 8270701 +1764/20000 train_loss: 2.6543 train_time: 2.8m tok/s: 8270725 +1765/20000 train_loss: 2.5892 train_time: 2.8m tok/s: 8270777 +1766/20000 train_loss: 2.7113 train_time: 2.8m tok/s: 8270777 +1767/20000 train_loss: 2.5767 train_time: 2.8m tok/s: 8270753 +1768/20000 train_loss: 2.6471 train_time: 2.8m tok/s: 8270737 +1769/20000 train_loss: 2.6456 train_time: 2.8m tok/s: 8270688 +1770/20000 train_loss: 2.6673 train_time: 2.8m tok/s: 8270678 +1771/20000 train_loss: 2.5297 train_time: 2.8m tok/s: 8270678 +1772/20000 train_loss: 2.5530 train_time: 2.8m tok/s: 8270719 +1773/20000 train_loss: 2.8531 train_time: 2.8m tok/s: 8270738 +1774/20000 train_loss: 2.7178 train_time: 2.8m tok/s: 8270802 +1775/20000 train_loss: 2.7522 train_time: 2.8m tok/s: 8270854 +1776/20000 train_loss: 2.5774 train_time: 2.8m tok/s: 8270823 +1777/20000 train_loss: 2.6951 train_time: 2.8m tok/s: 8270898 +1778/20000 train_loss: 2.6556 train_time: 2.8m tok/s: 8270875 +1779/20000 train_loss: 2.6563 train_time: 2.8m tok/s: 8270877 +1780/20000 train_loss: 2.6565 train_time: 2.8m tok/s: 8270871 +1781/20000 train_loss: 2.5272 train_time: 2.8m tok/s: 8270894 +1782/20000 
train_loss: 2.4157 train_time: 2.8m tok/s: 8270876 +1783/20000 train_loss: 2.6253 train_time: 2.8m tok/s: 8270815 +1784/20000 train_loss: 2.6316 train_time: 2.8m tok/s: 8270852 +1785/20000 train_loss: 2.6409 train_time: 2.8m tok/s: 8270909 +1786/20000 train_loss: 2.7975 train_time: 2.8m tok/s: 8270935 +1787/20000 train_loss: 2.7149 train_time: 2.8m tok/s: 8270937 +1788/20000 train_loss: 2.6365 train_time: 2.8m tok/s: 8270985 +1789/20000 train_loss: 2.7192 train_time: 2.8m tok/s: 8271006 +1790/20000 train_loss: 2.5279 train_time: 2.8m tok/s: 8271016 +1791/20000 train_loss: 2.3572 train_time: 2.8m tok/s: 8271015 +1792/20000 train_loss: 2.6321 train_time: 2.8m tok/s: 8271010 +1793/20000 train_loss: 2.4955 train_time: 2.8m tok/s: 8271017 +1794/20000 train_loss: 2.4322 train_time: 2.8m tok/s: 8271019 +1795/20000 train_loss: 2.6224 train_time: 2.8m tok/s: 8271049 +1796/20000 train_loss: 2.6510 train_time: 2.8m tok/s: 8271126 +1797/20000 train_loss: 2.8101 train_time: 2.8m tok/s: 8271152 +1798/20000 train_loss: 2.5723 train_time: 2.8m tok/s: 8271094 +1799/20000 train_loss: 2.6575 train_time: 2.9m tok/s: 8271118 +1800/20000 train_loss: 2.5575 train_time: 2.9m tok/s: 8271192 +1801/20000 train_loss: 2.6881 train_time: 2.9m tok/s: 8271179 +1802/20000 train_loss: 2.5873 train_time: 2.9m tok/s: 8271139 +1803/20000 train_loss: 2.6422 train_time: 2.9m tok/s: 8271096 +1804/20000 train_loss: 2.6027 train_time: 2.9m tok/s: 8271069 +1805/20000 train_loss: 2.5909 train_time: 2.9m tok/s: 8271082 +1806/20000 train_loss: 2.8073 train_time: 2.9m tok/s: 8270981 +1807/20000 train_loss: 2.7088 train_time: 2.9m tok/s: 8270962 +1808/20000 train_loss: 2.7248 train_time: 2.9m tok/s: 8270982 +1809/20000 train_loss: 2.6162 train_time: 2.9m tok/s: 8270944 +1810/20000 train_loss: 2.7194 train_time: 2.9m tok/s: 8270861 +1811/20000 train_loss: 2.5676 train_time: 2.9m tok/s: 8270799 +1812/20000 train_loss: 2.6144 train_time: 2.9m tok/s: 8270822 +1813/20000 train_loss: 2.7021 train_time: 2.9m tok/s: 
8270851 +1814/20000 train_loss: 2.7482 train_time: 2.9m tok/s: 8270888 +1815/20000 train_loss: 2.5562 train_time: 2.9m tok/s: 8270895 +1816/20000 train_loss: 2.4914 train_time: 2.9m tok/s: 8270920 +1817/20000 train_loss: 2.8474 train_time: 2.9m tok/s: 8270946 +1818/20000 train_loss: 2.6736 train_time: 2.9m tok/s: 8270953 +1819/20000 train_loss: 2.6948 train_time: 2.9m tok/s: 8271005 +1820/20000 train_loss: 2.5812 train_time: 2.9m tok/s: 8271046 +1821/20000 train_loss: 2.6482 train_time: 2.9m tok/s: 8271084 +1822/20000 train_loss: 2.7037 train_time: 2.9m tok/s: 8271118 +1823/20000 train_loss: 2.4019 train_time: 2.9m tok/s: 8271071 +1824/20000 train_loss: 2.6084 train_time: 2.9m tok/s: 8271036 +1825/20000 train_loss: 2.6297 train_time: 2.9m tok/s: 8271037 +1826/20000 train_loss: 2.4610 train_time: 2.9m tok/s: 8271020 +1827/20000 train_loss: 2.5793 train_time: 2.9m tok/s: 8270989 +1828/20000 train_loss: 2.5251 train_time: 2.9m tok/s: 8270948 +1829/20000 train_loss: 2.4705 train_time: 2.9m tok/s: 8270945 +1830/20000 train_loss: 2.6461 train_time: 2.9m tok/s: 8270977 +1831/20000 train_loss: 2.6526 train_time: 2.9m tok/s: 8271017 +1832/20000 train_loss: 2.6464 train_time: 2.9m tok/s: 8271028 +1833/20000 train_loss: 2.7068 train_time: 2.9m tok/s: 8271042 +1834/20000 train_loss: 2.6725 train_time: 2.9m tok/s: 8271068 +1835/20000 train_loss: 2.5750 train_time: 2.9m tok/s: 8271029 +1836/20000 train_loss: 2.5808 train_time: 2.9m tok/s: 8271032 +1837/20000 train_loss: 2.4915 train_time: 2.9m tok/s: 8271046 +1838/20000 train_loss: 2.6119 train_time: 2.9m tok/s: 8271019 +1839/20000 train_loss: 2.7101 train_time: 2.9m tok/s: 8271022 +1840/20000 train_loss: 2.6514 train_time: 2.9m tok/s: 8271022 +1841/20000 train_loss: 2.6901 train_time: 2.9m tok/s: 8271004 +1842/20000 train_loss: 2.6271 train_time: 2.9m tok/s: 8271015 +1843/20000 train_loss: 2.5608 train_time: 2.9m tok/s: 8271002 +1844/20000 train_loss: 2.6537 train_time: 2.9m tok/s: 8271024 +1845/20000 train_loss: 2.8217 
train_time: 2.9m tok/s: 8271024 +1846/20000 train_loss: 2.5953 train_time: 2.9m tok/s: 8270998 +1847/20000 train_loss: 2.4692 train_time: 2.9m tok/s: 8270967 +1848/20000 train_loss: 2.5343 train_time: 2.9m tok/s: 8270925 +1849/20000 train_loss: 2.5291 train_time: 2.9m tok/s: 8270943 +1850/20000 train_loss: 2.6535 train_time: 2.9m tok/s: 8270990 +1851/20000 train_loss: 2.6578 train_time: 2.9m tok/s: 8270997 +1852/20000 train_loss: 2.6769 train_time: 2.9m tok/s: 8271049 +1853/20000 train_loss: 2.5066 train_time: 2.9m tok/s: 8271054 +1854/20000 train_loss: 2.5567 train_time: 2.9m tok/s: 8271074 +1855/20000 train_loss: 2.6403 train_time: 2.9m tok/s: 8271075 +1856/20000 train_loss: 2.6742 train_time: 2.9m tok/s: 8271066 +1857/20000 train_loss: 2.7871 train_time: 2.9m tok/s: 8271083 +1858/20000 train_loss: 2.7310 train_time: 2.9m tok/s: 8271141 +1859/20000 train_loss: 2.6422 train_time: 2.9m tok/s: 8271156 +1860/20000 train_loss: 2.5803 train_time: 2.9m tok/s: 8271180 +1861/20000 train_loss: 2.5706 train_time: 2.9m tok/s: 8271161 +1862/20000 train_loss: 2.5433 train_time: 3.0m tok/s: 8271183 +1863/20000 train_loss: 2.6249 train_time: 3.0m tok/s: 8271169 +1864/20000 train_loss: 2.5854 train_time: 3.0m tok/s: 8271192 +1865/20000 train_loss: 2.7337 train_time: 3.0m tok/s: 8271226 +1866/20000 train_loss: 2.6328 train_time: 3.0m tok/s: 8271150 +1867/20000 train_loss: 2.5536 train_time: 3.0m tok/s: 8271094 +1868/20000 train_loss: 2.5445 train_time: 3.0m tok/s: 8271105 +1869/20000 train_loss: 2.6863 train_time: 3.0m tok/s: 8271104 +1870/20000 train_loss: 2.6366 train_time: 3.0m tok/s: 8271106 +1871/20000 train_loss: 2.5525 train_time: 3.0m tok/s: 8271165 +1872/20000 train_loss: 2.5836 train_time: 3.0m tok/s: 8271207 +1873/20000 train_loss: 2.7360 train_time: 3.0m tok/s: 8271216 +1874/20000 train_loss: 2.6618 train_time: 3.0m tok/s: 8271196 +1875/20000 train_loss: 2.7456 train_time: 3.0m tok/s: 8271231 +1876/20000 train_loss: 2.7849 train_time: 3.0m tok/s: 8271319 +1877/20000 
train_loss: 2.9601 train_time: 3.0m tok/s: 8271265 +1878/20000 train_loss: 2.6582 train_time: 3.0m tok/s: 8271176 +1879/20000 train_loss: 2.6171 train_time: 3.0m tok/s: 8271164 +1880/20000 train_loss: 2.7521 train_time: 3.0m tok/s: 8271133 +1881/20000 train_loss: 2.5914 train_time: 3.0m tok/s: 8271131 +1882/20000 train_loss: 2.7292 train_time: 3.0m tok/s: 8271188 +1883/20000 train_loss: 2.5944 train_time: 3.0m tok/s: 8271158 +1884/20000 train_loss: 2.5656 train_time: 3.0m tok/s: 8271085 +1885/20000 train_loss: 2.6388 train_time: 3.0m tok/s: 8271065 +1886/20000 train_loss: 2.5673 train_time: 3.0m tok/s: 8271121 +1887/20000 train_loss: 2.6322 train_time: 3.0m tok/s: 8271143 +1888/20000 train_loss: 2.4895 train_time: 3.0m tok/s: 8271150 +1889/20000 train_loss: 2.5449 train_time: 3.0m tok/s: 8271189 +1890/20000 train_loss: 2.6766 train_time: 3.0m tok/s: 8271208 +1891/20000 train_loss: 2.5092 train_time: 3.0m tok/s: 8271187 +1892/20000 train_loss: 2.6813 train_time: 3.0m tok/s: 8271197 +1893/20000 train_loss: 2.6879 train_time: 3.0m tok/s: 8271179 +1894/20000 train_loss: 2.5924 train_time: 3.0m tok/s: 8271194 +1895/20000 train_loss: 2.6473 train_time: 3.0m tok/s: 8271191 +1896/20000 train_loss: 2.6256 train_time: 3.0m tok/s: 8271155 +1897/20000 train_loss: 2.5625 train_time: 3.0m tok/s: 8271208 +1898/20000 train_loss: 2.7060 train_time: 3.0m tok/s: 8271253 +1899/20000 train_loss: 2.6308 train_time: 3.0m tok/s: 8271277 +1900/20000 train_loss: 2.6151 train_time: 3.0m tok/s: 8271291 +1901/20000 train_loss: 2.6924 train_time: 3.0m tok/s: 8271331 +1902/20000 train_loss: 2.6062 train_time: 3.0m tok/s: 8271307 +1903/20000 train_loss: 2.7617 train_time: 3.0m tok/s: 8271290 +1904/20000 train_loss: 3.1517 train_time: 3.0m tok/s: 8271259 +1905/20000 train_loss: 2.4867 train_time: 3.0m tok/s: 8271237 +1906/20000 train_loss: 2.6350 train_time: 3.0m tok/s: 8271251 +1907/20000 train_loss: 2.5157 train_time: 3.0m tok/s: 8271226 +1908/20000 train_loss: 2.5386 train_time: 3.0m tok/s: 
8271270 +1909/20000 train_loss: 2.5836 train_time: 3.0m tok/s: 8271303 +1910/20000 train_loss: 2.5350 train_time: 3.0m tok/s: 8271313 +1911/20000 train_loss: 2.4879 train_time: 3.0m tok/s: 8271321 +1912/20000 train_loss: 2.7074 train_time: 3.0m tok/s: 8271354 +1913/20000 train_loss: 2.7175 train_time: 3.0m tok/s: 8271380 +1914/20000 train_loss: 2.6977 train_time: 3.0m tok/s: 8271388 +1915/20000 train_loss: 2.7100 train_time: 3.0m tok/s: 8271431 +1916/20000 train_loss: 2.5744 train_time: 3.0m tok/s: 8271484 +1917/20000 train_loss: 2.7160 train_time: 3.0m tok/s: 8271493 +1918/20000 train_loss: 2.5808 train_time: 3.0m tok/s: 8271490 +1919/20000 train_loss: 2.5621 train_time: 3.0m tok/s: 8271505 +1920/20000 train_loss: 2.4949 train_time: 3.0m tok/s: 8271477 +1921/20000 train_loss: 2.7052 train_time: 3.0m tok/s: 8271491 +1922/20000 train_loss: 2.6010 train_time: 3.0m tok/s: 8271483 +1923/20000 train_loss: 2.5226 train_time: 3.0m tok/s: 8271501 +1924/20000 train_loss: 2.5925 train_time: 3.0m tok/s: 8271545 +1925/20000 train_loss: 2.5330 train_time: 3.1m tok/s: 8271566 +1926/20000 train_loss: 2.7453 train_time: 3.1m tok/s: 8271517 +1927/20000 train_loss: 2.5751 train_time: 3.1m tok/s: 8271514 +1928/20000 train_loss: 2.6359 train_time: 3.1m tok/s: 8271555 +1929/20000 train_loss: 2.6409 train_time: 3.1m tok/s: 8271564 +1930/20000 train_loss: 2.7077 train_time: 3.1m tok/s: 8271464 +1931/20000 train_loss: 2.6484 train_time: 3.1m tok/s: 8271463 +1932/20000 train_loss: 2.7546 train_time: 3.1m tok/s: 8271468 +1933/20000 train_loss: 2.6577 train_time: 3.1m tok/s: 8271477 +1934/20000 train_loss: 2.6666 train_time: 3.1m tok/s: 8271482 +1935/20000 train_loss: 2.5628 train_time: 3.1m tok/s: 8271534 +1936/20000 train_loss: 2.6811 train_time: 3.1m tok/s: 8271516 +1937/20000 train_loss: 2.6727 train_time: 3.1m tok/s: 8271544 +1938/20000 train_loss: 2.7047 train_time: 3.1m tok/s: 8271583 +1939/20000 train_loss: 2.6230 train_time: 3.1m tok/s: 8271608 +1940/20000 train_loss: 2.8146 
train_time: 3.1m tok/s: 8271638 +1941/20000 train_loss: 2.4680 train_time: 3.1m tok/s: 8271643 +1942/20000 train_loss: 2.4992 train_time: 3.1m tok/s: 8271689 +1943/20000 train_loss: 2.5010 train_time: 3.1m tok/s: 8271735 +1944/20000 train_loss: 2.5262 train_time: 3.1m tok/s: 8271742 +1945/20000 train_loss: 2.5751 train_time: 3.1m tok/s: 8271740 +1946/20000 train_loss: 2.6255 train_time: 3.1m tok/s: 8271757 +1947/20000 train_loss: 2.6906 train_time: 3.1m tok/s: 8271763 +1948/20000 train_loss: 2.6829 train_time: 3.1m tok/s: 8271793 +1949/20000 train_loss: 2.7294 train_time: 3.1m tok/s: 8271793 +1950/20000 train_loss: 2.5896 train_time: 3.1m tok/s: 8271847 +1951/20000 train_loss: 2.8066 train_time: 3.1m tok/s: 8271849 +1952/20000 train_loss: 2.8226 train_time: 3.1m tok/s: 8271870 +1953/20000 train_loss: 2.6381 train_time: 3.1m tok/s: 8271875 +1954/20000 train_loss: 2.5857 train_time: 3.1m tok/s: 8271925 +1955/20000 train_loss: 2.8569 train_time: 3.1m tok/s: 8271943 +1956/20000 train_loss: 2.5820 train_time: 3.1m tok/s: 8271934 +1957/20000 train_loss: 2.6036 train_time: 3.1m tok/s: 8271920 +1958/20000 train_loss: 2.5641 train_time: 3.1m tok/s: 8271966 +1959/20000 train_loss: 2.5489 train_time: 3.1m tok/s: 8271995 +1960/20000 train_loss: 2.5020 train_time: 3.1m tok/s: 8271992 +1961/20000 train_loss: 2.5147 train_time: 3.1m tok/s: 8271981 +1962/20000 train_loss: 2.6028 train_time: 3.1m tok/s: 8272009 +1963/20000 train_loss: 2.5659 train_time: 3.1m tok/s: 8272055 +1964/20000 train_loss: 2.5896 train_time: 3.1m tok/s: 8272058 +1965/20000 train_loss: 2.5896 train_time: 3.1m tok/s: 8272080 +1966/20000 train_loss: 2.7535 train_time: 3.1m tok/s: 8272026 +1967/20000 train_loss: 2.5613 train_time: 3.1m tok/s: 8272017 +1968/20000 train_loss: 2.7064 train_time: 3.1m tok/s: 8272012 +1969/20000 train_loss: 2.7652 train_time: 3.1m tok/s: 8272080 +1970/20000 train_loss: 2.5939 train_time: 3.1m tok/s: 8272104 +1971/20000 train_loss: 2.6185 train_time: 3.1m tok/s: 8272099 +1972/20000 
train_loss: 2.6820 train_time: 3.1m tok/s: 8272129 +1973/20000 train_loss: 2.6055 train_time: 3.1m tok/s: 8272115 +1974/20000 train_loss: 2.7700 train_time: 3.1m tok/s: 8272089 +1975/20000 train_loss: 2.5266 train_time: 3.1m tok/s: 8272099 +1976/20000 train_loss: 2.7206 train_time: 3.1m tok/s: 8272059 +1977/20000 train_loss: 2.5275 train_time: 3.1m tok/s: 8272070 +1978/20000 train_loss: 2.6936 train_time: 3.1m tok/s: 8272068 +1979/20000 train_loss: 2.5299 train_time: 3.1m tok/s: 8272111 +1980/20000 train_loss: 2.5773 train_time: 3.1m tok/s: 8272114 +1981/20000 train_loss: 2.4800 train_time: 3.1m tok/s: 8272149 +1982/20000 train_loss: 2.6444 train_time: 3.1m tok/s: 8272159 +1983/20000 train_loss: 2.3879 train_time: 3.1m tok/s: 8272156 +1984/20000 train_loss: 2.6912 train_time: 3.1m tok/s: 8272074 +1985/20000 train_loss: 2.6292 train_time: 3.1m tok/s: 8272092 +1986/20000 train_loss: 2.6786 train_time: 3.1m tok/s: 8272104 +1987/20000 train_loss: 2.6917 train_time: 3.1m tok/s: 8272151 +1988/20000 train_loss: 2.6549 train_time: 3.1m tok/s: 8272169 +1989/20000 train_loss: 2.4874 train_time: 3.2m tok/s: 8272189 +1990/20000 train_loss: 2.6511 train_time: 3.2m tok/s: 8272188 +1991/20000 train_loss: 2.5728 train_time: 3.2m tok/s: 8272254 +1992/20000 train_loss: 2.7519 train_time: 3.2m tok/s: 8272282 +1993/20000 train_loss: 2.5732 train_time: 3.2m tok/s: 8272275 +1994/20000 train_loss: 2.6189 train_time: 3.2m tok/s: 8272291 +1995/20000 train_loss: 2.5238 train_time: 3.2m tok/s: 8272321 +1996/20000 train_loss: 2.6145 train_time: 3.2m tok/s: 8272307 +1997/20000 train_loss: 2.5948 train_time: 3.2m tok/s: 8272358 +1998/20000 train_loss: 2.6115 train_time: 3.2m tok/s: 8272375 +1999/20000 train_loss: 2.6770 train_time: 3.2m tok/s: 8272397 +2000/20000 train_loss: 2.4911 train_time: 3.2m tok/s: 8272464 +2001/20000 train_loss: 2.5815 train_time: 3.2m tok/s: 8272485 +2002/20000 train_loss: 2.4517 train_time: 3.2m tok/s: 8272512 +2003/20000 train_loss: 2.6512 train_time: 3.2m tok/s: 
8272529 +2004/20000 train_loss: 2.6342 train_time: 3.2m tok/s: 8272566 +2005/20000 train_loss: 2.6178 train_time: 3.2m tok/s: 8272568 +2006/20000 train_loss: 2.5822 train_time: 3.2m tok/s: 8272583 +2007/20000 train_loss: 2.6268 train_time: 3.2m tok/s: 8272565 +2008/20000 train_loss: 2.5448 train_time: 3.2m tok/s: 8272540 +2009/20000 train_loss: 2.6532 train_time: 3.2m tok/s: 8272604 +2010/20000 train_loss: 2.7002 train_time: 3.2m tok/s: 8272566 +2011/20000 train_loss: 2.5750 train_time: 3.2m tok/s: 8272575 +2012/20000 train_loss: 2.5818 train_time: 3.2m tok/s: 8272582 +2013/20000 train_loss: 2.4832 train_time: 3.2m tok/s: 8272614 +2014/20000 train_loss: 2.4825 train_time: 3.2m tok/s: 8272623 +2015/20000 train_loss: 2.6955 train_time: 3.2m tok/s: 8272630 +2016/20000 train_loss: 2.4825 train_time: 3.2m tok/s: 8272609 +2017/20000 train_loss: 2.6341 train_time: 3.2m tok/s: 8272635 +2018/20000 train_loss: 2.6298 train_time: 3.2m tok/s: 8272625 +2019/20000 train_loss: 2.7188 train_time: 3.2m tok/s: 8272690 +2020/20000 train_loss: 2.7104 train_time: 3.2m tok/s: 8272724 +2021/20000 train_loss: 2.5646 train_time: 3.2m tok/s: 8272707 +2022/20000 train_loss: 2.4959 train_time: 3.2m tok/s: 8272653 +2023/20000 train_loss: 2.6934 train_time: 3.2m tok/s: 8272658 +2024/20000 train_loss: 2.6329 train_time: 3.2m tok/s: 8272683 +2025/20000 train_loss: 2.4827 train_time: 3.2m tok/s: 8272702 +2026/20000 train_loss: 2.7052 train_time: 3.2m tok/s: 8272627 +2027/20000 train_loss: 2.5953 train_time: 3.2m tok/s: 8272614 +2028/20000 train_loss: 2.7180 train_time: 3.2m tok/s: 8272582 +2029/20000 train_loss: 2.4999 train_time: 3.2m tok/s: 8272600 +2030/20000 train_loss: 2.5149 train_time: 3.2m tok/s: 8272608 +2031/20000 train_loss: 2.5094 train_time: 3.2m tok/s: 8272655 +2032/20000 train_loss: 2.5641 train_time: 3.2m tok/s: 8272628 +2033/20000 train_loss: 2.8379 train_time: 3.2m tok/s: 8272599 +2034/20000 train_loss: 2.6831 train_time: 3.2m tok/s: 8272591 +2035/20000 train_loss: 2.6544 
train_time: 3.2m tok/s: 8272639 +2036/20000 train_loss: 2.6187 train_time: 3.2m tok/s: 8272664 +2037/20000 train_loss: 2.8316 train_time: 3.2m tok/s: 8272612 +2038/20000 train_loss: 2.5888 train_time: 3.2m tok/s: 8272557 +2039/20000 train_loss: 2.6332 train_time: 3.2m tok/s: 8272567 +2040/20000 train_loss: 2.5789 train_time: 3.2m tok/s: 8272614 +2041/20000 train_loss: 2.6434 train_time: 3.2m tok/s: 8272642 +2042/20000 train_loss: 2.5407 train_time: 3.2m tok/s: 8272617 +2043/20000 train_loss: 2.4928 train_time: 3.2m tok/s: 8272548 +2044/20000 train_loss: 2.6486 train_time: 3.2m tok/s: 8272569 +2045/20000 train_loss: 2.4388 train_time: 3.2m tok/s: 8272594 +2046/20000 train_loss: 2.4636 train_time: 3.2m tok/s: 8272590 +2047/20000 train_loss: 2.7451 train_time: 3.2m tok/s: 8272580 +2048/20000 train_loss: 2.5728 train_time: 3.2m tok/s: 8272636 +2049/20000 train_loss: 2.7147 train_time: 3.2m tok/s: 8272663 +2050/20000 train_loss: 2.6705 train_time: 3.2m tok/s: 8272678 +2051/20000 train_loss: 2.6481 train_time: 3.2m tok/s: 8272679 +2052/20000 train_loss: 2.5372 train_time: 3.3m tok/s: 8272736 +2053/20000 train_loss: 2.6395 train_time: 3.3m tok/s: 8272765 +2054/20000 train_loss: 2.6753 train_time: 3.3m tok/s: 8272775 +2055/20000 train_loss: 2.5803 train_time: 3.3m tok/s: 8272788 +2056/20000 train_loss: 2.6194 train_time: 3.3m tok/s: 8272793 +2057/20000 train_loss: 2.6599 train_time: 3.3m tok/s: 8272785 +2058/20000 train_loss: 2.5610 train_time: 3.3m tok/s: 8272772 +2059/20000 train_loss: 2.4965 train_time: 3.3m tok/s: 8272801 +2060/20000 train_loss: 2.5892 train_time: 3.3m tok/s: 8272810 +2061/20000 train_loss: 2.5841 train_time: 3.3m tok/s: 8272842 +2062/20000 train_loss: 2.6159 train_time: 3.3m tok/s: 8272814 +2063/20000 train_loss: 2.5509 train_time: 3.3m tok/s: 8272794 +2064/20000 train_loss: 2.8009 train_time: 3.3m tok/s: 8272796 +2065/20000 train_loss: 2.5373 train_time: 3.3m tok/s: 8272771 +2066/20000 train_loss: 2.6139 train_time: 3.3m tok/s: 8272728 +2067/20000 
train_loss: 2.6652 train_time: 3.3m tok/s: 8272742 +2068/20000 train_loss: 2.6036 train_time: 3.3m tok/s: 8272800 +2069/20000 train_loss: 2.4578 train_time: 3.3m tok/s: 8272798 +2070/20000 train_loss: 2.6105 train_time: 3.3m tok/s: 8272753 +2071/20000 train_loss: 2.5505 train_time: 3.3m tok/s: 8272773 +2072/20000 train_loss: 2.6002 train_time: 3.3m tok/s: 8272770 +2073/20000 train_loss: 2.5333 train_time: 3.3m tok/s: 8272766 +2074/20000 train_loss: 2.6952 train_time: 3.3m tok/s: 8272751 +2075/20000 train_loss: 2.5726 train_time: 3.3m tok/s: 8272750 +2076/20000 train_loss: 2.6715 train_time: 3.3m tok/s: 8272756 +2077/20000 train_loss: 3.5629 train_time: 3.3m tok/s: 8272716 +2078/20000 train_loss: 2.7108 train_time: 3.3m tok/s: 8272647 +2079/20000 train_loss: 2.6583 train_time: 3.3m tok/s: 8272641 +2080/20000 train_loss: 2.5992 train_time: 3.3m tok/s: 8272634 +2081/20000 train_loss: 2.6081 train_time: 3.3m tok/s: 8272619 +2082/20000 train_loss: 2.5985 train_time: 3.3m tok/s: 8272646 +2083/20000 train_loss: 2.5431 train_time: 3.3m tok/s: 8272607 +2084/20000 train_loss: 2.5826 train_time: 3.3m tok/s: 8272620 +2085/20000 train_loss: 2.5817 train_time: 3.3m tok/s: 8272638 +2086/20000 train_loss: 2.6096 train_time: 3.3m tok/s: 8272663 +2087/20000 train_loss: 2.5503 train_time: 3.3m tok/s: 8272664 +2088/20000 train_loss: 2.4597 train_time: 3.3m tok/s: 8272698 +2089/20000 train_loss: 2.6314 train_time: 3.3m tok/s: 8272711 +2090/20000 train_loss: 2.7403 train_time: 3.3m tok/s: 8272763 +2091/20000 train_loss: 2.6251 train_time: 3.3m tok/s: 8272757 +2092/20000 train_loss: 2.6555 train_time: 3.3m tok/s: 8272798 +2093/20000 train_loss: 2.6435 train_time: 3.3m tok/s: 8272840 +2094/20000 train_loss: 2.6041 train_time: 3.3m tok/s: 8272858 +2095/20000 train_loss: 2.5886 train_time: 3.3m tok/s: 8272891 +2096/20000 train_loss: 2.6932 train_time: 3.3m tok/s: 8272941 +2097/20000 train_loss: 2.5714 train_time: 3.3m tok/s: 8272958 +2098/20000 train_loss: 2.4871 train_time: 3.3m tok/s: 
8272968 +2099/20000 train_loss: 2.4876 train_time: 3.3m tok/s: 8272991 +2100/20000 train_loss: 2.5833 train_time: 3.3m tok/s: 8273029 +2101/20000 train_loss: 2.6551 train_time: 3.3m tok/s: 8272982 +2102/20000 train_loss: 2.5689 train_time: 3.3m tok/s: 8272943 +2103/20000 train_loss: 2.5914 train_time: 3.3m tok/s: 8272942 +2104/20000 train_loss: 2.7089 train_time: 3.3m tok/s: 8272984 +2105/20000 train_loss: 2.7198 train_time: 3.3m tok/s: 8273043 +2106/20000 train_loss: 2.7067 train_time: 3.3m tok/s: 8273059 +2107/20000 train_loss: 2.6068 train_time: 3.3m tok/s: 8273039 +2108/20000 train_loss: 2.5428 train_time: 3.3m tok/s: 8273038 +2109/20000 train_loss: 2.7170 train_time: 3.3m tok/s: 8273026 +2110/20000 train_loss: 2.5784 train_time: 3.3m tok/s: 8273047 +2111/20000 train_loss: 2.5549 train_time: 3.3m tok/s: 8273031 +2112/20000 train_loss: 2.5514 train_time: 3.3m tok/s: 8273038 +2113/20000 train_loss: 2.5208 train_time: 3.3m tok/s: 8273043 +2114/20000 train_loss: 2.7872 train_time: 3.3m tok/s: 8273065 +2115/20000 train_loss: 2.4914 train_time: 3.4m tok/s: 8273090 +2116/20000 train_loss: 2.6523 train_time: 3.4m tok/s: 8273143 +2117/20000 train_loss: 2.6930 train_time: 3.4m tok/s: 8273134 +2118/20000 train_loss: 2.6965 train_time: 3.4m tok/s: 8273166 +2119/20000 train_loss: 2.8019 train_time: 3.4m tok/s: 8273177 +2120/20000 train_loss: 2.6544 train_time: 3.4m tok/s: 8273199 +2121/20000 train_loss: 2.6327 train_time: 3.4m tok/s: 8273213 +2122/20000 train_loss: 2.5464 train_time: 3.4m tok/s: 8273221 +2123/20000 train_loss: 2.4000 train_time: 3.4m tok/s: 8273243 +2124/20000 train_loss: 2.6359 train_time: 3.4m tok/s: 8273284 +2125/20000 train_loss: 2.6170 train_time: 3.4m tok/s: 8273324 +2126/20000 train_loss: 2.6351 train_time: 3.4m tok/s: 8273362 +2127/20000 train_loss: 2.5556 train_time: 3.4m tok/s: 8273188 +2128/20000 train_loss: 2.5380 train_time: 3.4m tok/s: 8273330 +2129/20000 train_loss: 2.3717 train_time: 3.4m tok/s: 8273339 +2130/20000 train_loss: 2.6778 
train_time: 3.4m tok/s: 8273349 +2131/20000 train_loss: 2.5853 train_time: 3.4m tok/s: 8273372 +2132/20000 train_loss: 2.9141 train_time: 3.4m tok/s: 8273356 +2133/20000 train_loss: 2.6940 train_time: 3.4m tok/s: 8273372 +2134/20000 train_loss: 2.6332 train_time: 3.4m tok/s: 8273383 +2135/20000 train_loss: 2.5896 train_time: 3.4m tok/s: 8273406 +2136/20000 train_loss: 2.6004 train_time: 3.4m tok/s: 8273412 +2137/20000 train_loss: 2.5365 train_time: 3.4m tok/s: 8273492 +2138/20000 train_loss: 2.5913 train_time: 3.4m tok/s: 8273529 +2139/20000 train_loss: 2.6815 train_time: 3.4m tok/s: 8273531 +2140/20000 train_loss: 2.5493 train_time: 3.4m tok/s: 8273500 +2141/20000 train_loss: 2.5275 train_time: 3.4m tok/s: 8273491 +2142/20000 train_loss: 2.6009 train_time: 3.4m tok/s: 8273523 +2143/20000 train_loss: 2.4849 train_time: 3.4m tok/s: 8273494 +2144/20000 train_loss: 2.5863 train_time: 3.4m tok/s: 8273460 +2145/20000 train_loss: 2.6706 train_time: 3.4m tok/s: 8273508 +2146/20000 train_loss: 2.5914 train_time: 3.4m tok/s: 8273537 +2147/20000 train_loss: 2.5553 train_time: 3.4m tok/s: 8273541 +2148/20000 train_loss: 2.7040 train_time: 3.4m tok/s: 8273526 +2149/20000 train_loss: 2.5390 train_time: 3.4m tok/s: 8273535 +2150/20000 train_loss: 2.7382 train_time: 3.4m tok/s: 8273588 +2151/20000 train_loss: 2.6845 train_time: 3.4m tok/s: 8273530 +2152/20000 train_loss: 2.6027 train_time: 3.4m tok/s: 8273504 +2153/20000 train_loss: 2.4668 train_time: 3.4m tok/s: 8273486 +2154/20000 train_loss: 2.6288 train_time: 3.4m tok/s: 8273488 +2155/20000 train_loss: 2.5239 train_time: 3.4m tok/s: 8273484 +2156/20000 train_loss: 2.6109 train_time: 3.4m tok/s: 8273526 +2157/20000 train_loss: 2.5964 train_time: 3.4m tok/s: 8273568 +2158/20000 train_loss: 2.5918 train_time: 3.4m tok/s: 8273584 +2159/20000 train_loss: 2.5398 train_time: 3.4m tok/s: 8273593 +2160/20000 train_loss: 2.4297 train_time: 3.4m tok/s: 8273578 +2161/20000 train_loss: 2.6176 train_time: 3.4m tok/s: 8273586 +2162/20000 
train_loss: 2.6401 train_time: 3.4m tok/s: 8273631 +2163/20000 train_loss: 2.5570 train_time: 3.4m tok/s: 8273593 +2164/20000 train_loss: 2.6020 train_time: 3.4m tok/s: 8273612 +2165/20000 train_loss: 2.6441 train_time: 3.4m tok/s: 8273619 +2166/20000 train_loss: 2.5238 train_time: 3.4m tok/s: 8273620 +2167/20000 train_loss: 2.5282 train_time: 3.4m tok/s: 8273574 +2168/20000 train_loss: 2.5847 train_time: 3.4m tok/s: 8273539 +2169/20000 train_loss: 2.6297 train_time: 3.4m tok/s: 8273533 +2170/20000 train_loss: 2.4945 train_time: 3.4m tok/s: 8273527 +2171/20000 train_loss: 2.6030 train_time: 3.4m tok/s: 8273550 +2172/20000 train_loss: 2.5089 train_time: 3.4m tok/s: 8273550 +2173/20000 train_loss: 2.7158 train_time: 3.4m tok/s: 8273579 +2174/20000 train_loss: 2.5515 train_time: 3.4m tok/s: 8273539 +2175/20000 train_loss: 2.4526 train_time: 3.4m tok/s: 8273481 +2176/20000 train_loss: 2.6709 train_time: 3.4m tok/s: 8273472 +2177/20000 train_loss: 2.6142 train_time: 3.4m tok/s: 8273485 +2178/20000 train_loss: 2.5317 train_time: 3.5m tok/s: 8273497 +2179/20000 train_loss: 2.7064 train_time: 3.5m tok/s: 8273491 +2180/20000 train_loss: 2.5593 train_time: 3.5m tok/s: 8273491 +2181/20000 train_loss: 2.5577 train_time: 3.5m tok/s: 8273497 +2182/20000 train_loss: 2.4381 train_time: 3.5m tok/s: 8273518 +2183/20000 train_loss: 2.5360 train_time: 3.5m tok/s: 8273527 +2184/20000 train_loss: 2.5393 train_time: 3.5m tok/s: 8273491 +2185/20000 train_loss: 2.4402 train_time: 3.5m tok/s: 8273522 +2186/20000 train_loss: 2.7450 train_time: 3.5m tok/s: 8273531 +2187/20000 train_loss: 2.5540 train_time: 3.5m tok/s: 8273552 +2188/20000 train_loss: 2.5483 train_time: 3.5m tok/s: 8273543 +2189/20000 train_loss: 2.6370 train_time: 3.5m tok/s: 8273551 +2190/20000 train_loss: 2.6794 train_time: 3.5m tok/s: 8273545 +2191/20000 train_loss: 2.6073 train_time: 3.5m tok/s: 8273584 +2192/20000 train_loss: 2.6466 train_time: 3.5m tok/s: 8273601 +2193/20000 train_loss: 2.5781 train_time: 3.5m tok/s: 
8273644 +2194/20000 train_loss: 2.5887 train_time: 3.5m tok/s: 8273613 +2195/20000 train_loss: 2.6105 train_time: 3.5m tok/s: 8273610 +2196/20000 train_loss: 2.6043 train_time: 3.5m tok/s: 8273597 +2197/20000 train_loss: 2.5840 train_time: 3.5m tok/s: 8273591 +2198/20000 train_loss: 2.5315 train_time: 3.5m tok/s: 8273616 +2199/20000 train_loss: 2.5420 train_time: 3.5m tok/s: 8273630 +2200/20000 train_loss: 2.6167 train_time: 3.5m tok/s: 8273617 +2201/20000 train_loss: 2.6670 train_time: 3.5m tok/s: 8273640 +2202/20000 train_loss: 2.5853 train_time: 3.5m tok/s: 8273608 +2203/20000 train_loss: 2.5657 train_time: 3.5m tok/s: 8273577 +2204/20000 train_loss: 2.4266 train_time: 3.5m tok/s: 8273549 +2205/20000 train_loss: 2.6459 train_time: 3.5m tok/s: 8273509 +2206/20000 train_loss: 2.5740 train_time: 3.5m tok/s: 8273499 +2207/20000 train_loss: 2.5256 train_time: 3.5m tok/s: 8273528 +layer_loop:enabled step:2207 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2208/20000 train_loss: 2.9876 train_time: 3.5m tok/s: 8271529 +2209/20000 train_loss: 2.9498 train_time: 3.5m tok/s: 8269813 +2210/20000 train_loss: 2.8414 train_time: 3.5m tok/s: 8268040 +2211/20000 train_loss: 2.7446 train_time: 3.5m tok/s: 8266264 +2212/20000 train_loss: 2.5297 train_time: 3.5m tok/s: 8264541 +2213/20000 train_loss: 2.6333 train_time: 3.5m tok/s: 8262817 +2214/20000 train_loss: 2.5405 train_time: 3.5m tok/s: 8261054 +2215/20000 train_loss: 2.6327 train_time: 3.5m tok/s: 8259333 +2216/20000 train_loss: 2.5596 train_time: 3.5m tok/s: 8257644 +2217/20000 train_loss: 2.7179 train_time: 3.5m tok/s: 8255776 +2218/20000 train_loss: 2.7037 train_time: 3.5m tok/s: 8253993 +2219/20000 train_loss: 2.5258 train_time: 3.5m tok/s: 8252237 +2220/20000 train_loss: 2.5638 train_time: 3.5m tok/s: 8250450 +2221/20000 train_loss: 2.6885 train_time: 3.5m tok/s: 8248697 +2222/20000 train_loss: 2.5141 train_time: 3.5m tok/s: 8246912 +2223/20000 train_loss: 2.5964 train_time: 3.5m 
tok/s: 8245141 +2224/20000 train_loss: 2.4088 train_time: 3.5m tok/s: 8243293 +2225/20000 train_loss: 2.5705 train_time: 3.5m tok/s: 8241515 +2226/20000 train_loss: 2.5503 train_time: 3.5m tok/s: 8239790 +2227/20000 train_loss: 2.5256 train_time: 3.5m tok/s: 8238029 +2228/20000 train_loss: 2.5692 train_time: 3.5m tok/s: 8236306 +2229/20000 train_loss: 2.5287 train_time: 3.5m tok/s: 8234601 +2230/20000 train_loss: 2.5459 train_time: 3.6m tok/s: 8232933 +2231/20000 train_loss: 2.3510 train_time: 3.6m tok/s: 8231164 +2232/20000 train_loss: 2.5418 train_time: 3.6m tok/s: 8229336 +2233/20000 train_loss: 2.6770 train_time: 3.6m tok/s: 8227692 +2234/20000 train_loss: 2.7023 train_time: 3.6m tok/s: 8226020 +2235/20000 train_loss: 2.6532 train_time: 3.6m tok/s: 8224324 +2236/20000 train_loss: 2.6117 train_time: 3.6m tok/s: 8222630 +2237/20000 train_loss: 2.6441 train_time: 3.6m tok/s: 8220932 +2238/20000 train_loss: 2.6580 train_time: 3.6m tok/s: 8219252 +2239/20000 train_loss: 2.7595 train_time: 3.6m tok/s: 8217531 +2240/20000 train_loss: 2.7654 train_time: 3.6m tok/s: 8215853 +2241/20000 train_loss: 2.3962 train_time: 3.6m tok/s: 8214104 +2242/20000 train_loss: 2.5858 train_time: 3.6m tok/s: 8212443 +2243/20000 train_loss: 2.5704 train_time: 3.6m tok/s: 8210733 +2244/20000 train_loss: 2.5935 train_time: 3.6m tok/s: 8209075 +2245/20000 train_loss: 2.6336 train_time: 3.6m tok/s: 8207401 +2246/20000 train_loss: 2.5438 train_time: 3.6m tok/s: 8205706 +2247/20000 train_loss: 2.6849 train_time: 3.6m tok/s: 8204030 +2248/20000 train_loss: 2.6728 train_time: 3.6m tok/s: 8202339 +2249/20000 train_loss: 2.5469 train_time: 3.6m tok/s: 8200712 +2250/20000 train_loss: 2.5419 train_time: 3.6m tok/s: 8198973 +2251/20000 train_loss: 2.5844 train_time: 3.6m tok/s: 8197303 +2252/20000 train_loss: 2.7341 train_time: 3.6m tok/s: 8195637 +2253/20000 train_loss: 2.5742 train_time: 3.6m tok/s: 8194036 +2254/20000 train_loss: 2.5670 train_time: 3.6m tok/s: 8192275 +2255/20000 train_loss: 2.4334 
train_time: 3.6m tok/s: 8190597 +2256/20000 train_loss: 2.4734 train_time: 3.6m tok/s: 8188956 +2257/20000 train_loss: 2.5158 train_time: 3.6m tok/s: 8187191 +2258/20000 train_loss: 2.5423 train_time: 3.6m tok/s: 8185487 +2259/20000 train_loss: 2.5633 train_time: 3.6m tok/s: 8183862 +2260/20000 train_loss: 2.5138 train_time: 3.6m tok/s: 8182114 +2261/20000 train_loss: 2.6092 train_time: 3.6m tok/s: 8180478 +2262/20000 train_loss: 2.7067 train_time: 3.6m tok/s: 8178811 +2263/20000 train_loss: 2.7005 train_time: 3.6m tok/s: 8177176 +2264/20000 train_loss: 2.4670 train_time: 3.6m tok/s: 8175462 +2265/20000 train_loss: 2.5608 train_time: 3.6m tok/s: 8173867 +2266/20000 train_loss: 2.5852 train_time: 3.6m tok/s: 8172200 +2267/20000 train_loss: 2.6027 train_time: 3.6m tok/s: 8170577 +2268/20000 train_loss: 2.6930 train_time: 3.6m tok/s: 8168978 +2269/20000 train_loss: 2.7688 train_time: 3.6m tok/s: 8167289 +2270/20000 train_loss: 2.5311 train_time: 3.6m tok/s: 8165633 +2271/20000 train_loss: 2.5833 train_time: 3.6m tok/s: 8163991 +2272/20000 train_loss: 2.5164 train_time: 3.6m tok/s: 8162390 +2273/20000 train_loss: 2.5755 train_time: 3.7m tok/s: 8160843 +2274/20000 train_loss: 3.2796 train_time: 3.7m tok/s: 8159095 +2275/20000 train_loss: 2.4186 train_time: 3.7m tok/s: 8157411 +2276/20000 train_loss: 2.5230 train_time: 3.7m tok/s: 8155739 +2277/20000 train_loss: 2.7828 train_time: 3.7m tok/s: 8154072 +2278/20000 train_loss: 2.7144 train_time: 3.7m tok/s: 8152484 +2279/20000 train_loss: 2.6444 train_time: 3.7m tok/s: 8150763 +2280/20000 train_loss: 2.7358 train_time: 3.7m tok/s: 8149080 +2281/20000 train_loss: 2.5476 train_time: 3.7m tok/s: 8147424 +2282/20000 train_loss: 2.7366 train_time: 3.7m tok/s: 8145819 +2283/20000 train_loss: 2.9772 train_time: 3.7m tok/s: 8144159 +2284/20000 train_loss: 2.4846 train_time: 3.7m tok/s: 8142529 +2285/20000 train_loss: 2.5308 train_time: 3.7m tok/s: 8140915 +2286/20000 train_loss: 2.4559 train_time: 3.7m tok/s: 8139255 +2287/20000 
train_loss: 2.4594 train_time: 3.7m tok/s: 8137606 +2288/20000 train_loss: 2.6546 train_time: 3.7m tok/s: 8135969 +2289/20000 train_loss: 2.6641 train_time: 3.7m tok/s: 8134390 +2290/20000 train_loss: 2.6418 train_time: 3.7m tok/s: 8132796 +2291/20000 train_loss: 2.5431 train_time: 3.7m tok/s: 8131236 +2292/20000 train_loss: 2.4783 train_time: 3.7m tok/s: 8129624 +2293/20000 train_loss: 2.5578 train_time: 3.7m tok/s: 8128080 +2294/20000 train_loss: 2.4621 train_time: 3.7m tok/s: 8126555 +2295/20000 train_loss: 2.6630 train_time: 3.7m tok/s: 8124959 +2296/20000 train_loss: 2.5161 train_time: 3.7m tok/s: 8123382 +2297/20000 train_loss: 2.6307 train_time: 3.7m tok/s: 8121716 +2298/20000 train_loss: 2.4975 train_time: 3.7m tok/s: 8120140 +2299/20000 train_loss: 2.5947 train_time: 3.7m tok/s: 8118612 +2300/20000 train_loss: 2.6286 train_time: 3.7m tok/s: 8117005 +2301/20000 train_loss: 2.4008 train_time: 3.7m tok/s: 8115449 +2302/20000 train_loss: 2.5752 train_time: 3.7m tok/s: 8113876 +2303/20000 train_loss: 2.5003 train_time: 3.7m tok/s: 8112302 +2304/20000 train_loss: 2.5926 train_time: 3.7m tok/s: 8110703 +2305/20000 train_loss: 2.4483 train_time: 3.7m tok/s: 8109026 +2306/20000 train_loss: 2.6698 train_time: 3.7m tok/s: 8107510 +2307/20000 train_loss: 2.6480 train_time: 3.7m tok/s: 8105994 +2308/20000 train_loss: 2.5912 train_time: 3.7m tok/s: 8104500 +2309/20000 train_loss: 2.5427 train_time: 3.7m tok/s: 8102951 +2310/20000 train_loss: 2.6249 train_time: 3.7m tok/s: 8101426 +2311/20000 train_loss: 2.6231 train_time: 3.7m tok/s: 8099880 +2312/20000 train_loss: 2.6378 train_time: 3.7m tok/s: 8098298 +2313/20000 train_loss: 2.6206 train_time: 3.7m tok/s: 8096762 +2314/20000 train_loss: 2.4444 train_time: 3.7m tok/s: 8095258 +2315/20000 train_loss: 2.4101 train_time: 3.7m tok/s: 8093695 +2316/20000 train_loss: 2.3704 train_time: 3.8m tok/s: 8092147 +2317/20000 train_loss: 2.7539 train_time: 3.8m tok/s: 8090467 +2318/20000 train_loss: 2.6430 train_time: 3.8m tok/s: 
8088955 +2319/20000 train_loss: 2.4524 train_time: 3.8m tok/s: 8087411 +2320/20000 train_loss: 2.6186 train_time: 3.8m tok/s: 8085868 +2321/20000 train_loss: 2.5984 train_time: 3.8m tok/s: 8084300 +2322/20000 train_loss: 2.4635 train_time: 3.8m tok/s: 8082780 +2323/20000 train_loss: 2.6334 train_time: 3.8m tok/s: 8081311 +2324/20000 train_loss: 2.5738 train_time: 3.8m tok/s: 8079753 +2325/20000 train_loss: 2.5943 train_time: 3.8m tok/s: 8078178 +2326/20000 train_loss: 2.6036 train_time: 3.8m tok/s: 8076682 +2327/20000 train_loss: 2.5809 train_time: 3.8m tok/s: 8075159 +2328/20000 train_loss: 2.5342 train_time: 3.8m tok/s: 8073695 +2329/20000 train_loss: 2.4197 train_time: 3.8m tok/s: 8072170 +2330/20000 train_loss: 2.6790 train_time: 3.8m tok/s: 8070659 +2331/20000 train_loss: 2.5834 train_time: 3.8m tok/s: 8069082 +2332/20000 train_loss: 2.3994 train_time: 3.8m tok/s: 8067552 +2333/20000 train_loss: 2.6284 train_time: 3.8m tok/s: 8066051 +2334/20000 train_loss: 2.2811 train_time: 3.8m tok/s: 8064457 +2335/20000 train_loss: 2.5493 train_time: 3.8m tok/s: 8062931 +2336/20000 train_loss: 2.6277 train_time: 3.8m tok/s: 8061444 +2337/20000 train_loss: 2.6639 train_time: 3.8m tok/s: 8059939 +2338/20000 train_loss: 2.5374 train_time: 3.8m tok/s: 8058453 +2339/20000 train_loss: 2.6210 train_time: 3.8m tok/s: 8056963 +2340/20000 train_loss: 2.5724 train_time: 3.8m tok/s: 8055513 +2341/20000 train_loss: 2.5549 train_time: 3.8m tok/s: 8054048 +2342/20000 train_loss: 2.5062 train_time: 3.8m tok/s: 8052549 +2343/20000 train_loss: 2.4378 train_time: 3.8m tok/s: 8051112 +2344/20000 train_loss: 2.7104 train_time: 3.8m tok/s: 8049585 +2345/20000 train_loss: 3.0394 train_time: 3.8m tok/s: 8048111 +2346/20000 train_loss: 2.5307 train_time: 3.8m tok/s: 8046600 +2347/20000 train_loss: 2.5292 train_time: 3.8m tok/s: 8045154 +2348/20000 train_loss: 2.7206 train_time: 3.8m tok/s: 8043673 +2349/20000 train_loss: 2.5936 train_time: 3.8m tok/s: 8042228 +2350/20000 train_loss: 2.5871 
train_time: 3.8m tok/s: 8040786 +2351/20000 train_loss: 2.5681 train_time: 3.8m tok/s: 8039324 +2352/20000 train_loss: 2.6001 train_time: 3.8m tok/s: 8037850 +2353/20000 train_loss: 2.4984 train_time: 3.8m tok/s: 8036351 +2354/20000 train_loss: 2.5514 train_time: 3.8m tok/s: 8034897 +2355/20000 train_loss: 2.5127 train_time: 3.8m tok/s: 8033476 +2356/20000 train_loss: 2.5705 train_time: 3.8m tok/s: 8032026 +2357/20000 train_loss: 2.5195 train_time: 3.8m tok/s: 8030567 +2358/20000 train_loss: 2.5278 train_time: 3.8m tok/s: 8029149 +2359/20000 train_loss: 2.5082 train_time: 3.9m tok/s: 8027695 +2360/20000 train_loss: 2.5808 train_time: 3.9m tok/s: 8026227 +2361/20000 train_loss: 2.5955 train_time: 3.9m tok/s: 8024777 +2362/20000 train_loss: 2.5270 train_time: 3.9m tok/s: 8023278 +2363/20000 train_loss: 2.5443 train_time: 3.9m tok/s: 8021866 +2364/20000 train_loss: 2.6575 train_time: 3.9m tok/s: 8020410 +2365/20000 train_loss: 2.5583 train_time: 3.9m tok/s: 8018951 +2366/20000 train_loss: 2.6248 train_time: 3.9m tok/s: 8017529 +2367/20000 train_loss: 2.5359 train_time: 3.9m tok/s: 8016071 +2368/20000 train_loss: 2.6901 train_time: 3.9m tok/s: 8014657 +2369/20000 train_loss: 2.5198 train_time: 3.9m tok/s: 8013218 +2370/20000 train_loss: 2.6087 train_time: 3.9m tok/s: 8011827 +2371/20000 train_loss: 2.5922 train_time: 3.9m tok/s: 8010393 +2372/20000 train_loss: 2.6329 train_time: 3.9m tok/s: 8008978 +2373/20000 train_loss: 2.4926 train_time: 3.9m tok/s: 8007542 +2374/20000 train_loss: 2.5590 train_time: 3.9m tok/s: 8006102 +2375/20000 train_loss: 2.5532 train_time: 3.9m tok/s: 8004688 +2376/20000 train_loss: 2.5007 train_time: 3.9m tok/s: 8003240 +2377/20000 train_loss: 2.4178 train_time: 3.9m tok/s: 8001788 +2378/20000 train_loss: 2.5177 train_time: 3.9m tok/s: 8000355 +2379/20000 train_loss: 2.8983 train_time: 3.9m tok/s: 7998880 +2380/20000 train_loss: 2.5038 train_time: 3.9m tok/s: 7997498 +2381/20000 train_loss: 2.6739 train_time: 3.9m tok/s: 7996138 +2382/20000 
train_loss: 2.4641 train_time: 3.9m tok/s: 7994700 +2383/20000 train_loss: 2.6520 train_time: 3.9m tok/s: 7993279 +2384/20000 train_loss: 2.6600 train_time: 3.9m tok/s: 7991884 +2385/20000 train_loss: 2.6787 train_time: 3.9m tok/s: 7990475 +2386/20000 train_loss: 2.6375 train_time: 3.9m tok/s: 7988986 +2387/20000 train_loss: 2.4951 train_time: 3.9m tok/s: 7987607 +2388/20000 train_loss: 2.9510 train_time: 3.9m tok/s: 7986053 +2389/20000 train_loss: 2.3710 train_time: 3.9m tok/s: 7984486 +2390/20000 train_loss: 2.5955 train_time: 3.9m tok/s: 7983081 +2391/20000 train_loss: 2.5195 train_time: 3.9m tok/s: 7981705 +2392/20000 train_loss: 2.6423 train_time: 3.9m tok/s: 7980286 +2393/20000 train_loss: 2.5440 train_time: 3.9m tok/s: 7978916 +2394/20000 train_loss: 2.6382 train_time: 3.9m tok/s: 7977491 +2395/20000 train_loss: 2.6382 train_time: 3.9m tok/s: 7976107 +2396/20000 train_loss: 2.6050 train_time: 3.9m tok/s: 7974763 +2397/20000 train_loss: 2.6897 train_time: 3.9m tok/s: 7973369 +2398/20000 train_loss: 2.5174 train_time: 3.9m tok/s: 7971989 +2399/20000 train_loss: 2.5248 train_time: 3.9m tok/s: 7970649 +2400/20000 train_loss: 2.5226 train_time: 3.9m tok/s: 7969253 +2401/20000 train_loss: 2.6087 train_time: 3.9m tok/s: 7967898 +2402/20000 train_loss: 2.5071 train_time: 4.0m tok/s: 7966499 +2403/20000 train_loss: 2.8818 train_time: 4.0m tok/s: 7965046 +2404/20000 train_loss: 2.5546 train_time: 4.0m tok/s: 7963700 +2405/20000 train_loss: 2.4832 train_time: 4.0m tok/s: 7962310 +2406/20000 train_loss: 2.5776 train_time: 4.0m tok/s: 7960978 +2407/20000 train_loss: 2.5811 train_time: 4.0m tok/s: 7959632 +2408/20000 train_loss: 2.6607 train_time: 4.0m tok/s: 7958325 +2409/20000 train_loss: 2.5869 train_time: 4.0m tok/s: 7956949 +2410/20000 train_loss: 2.5789 train_time: 4.0m tok/s: 7955595 +2411/20000 train_loss: 2.5391 train_time: 4.0m tok/s: 7954257 +2412/20000 train_loss: 2.6504 train_time: 4.0m tok/s: 7952826 +2413/20000 train_loss: 2.4989 train_time: 4.0m tok/s: 
7951433 +2414/20000 train_loss: 2.5658 train_time: 4.0m tok/s: 7950065 +2415/20000 train_loss: 2.5505 train_time: 4.0m tok/s: 7948713 +2416/20000 train_loss: 2.5772 train_time: 4.0m tok/s: 7947332 +2417/20000 train_loss: 2.5491 train_time: 4.0m tok/s: 7945921 +2418/20000 train_loss: 2.5110 train_time: 4.0m tok/s: 7944594 +2419/20000 train_loss: 2.5588 train_time: 4.0m tok/s: 7943224 +2420/20000 train_loss: 2.5986 train_time: 4.0m tok/s: 7941927 +2421/20000 train_loss: 2.6272 train_time: 4.0m tok/s: 7940575 +2422/20000 train_loss: 2.6066 train_time: 4.0m tok/s: 7939203 +2423/20000 train_loss: 2.5051 train_time: 4.0m tok/s: 7937865 +2424/20000 train_loss: 2.6249 train_time: 4.0m tok/s: 7936515 +2425/20000 train_loss: 2.6074 train_time: 4.0m tok/s: 7935130 +2426/20000 train_loss: 2.5448 train_time: 4.0m tok/s: 7933774 +2427/20000 train_loss: 2.4184 train_time: 4.0m tok/s: 7932429 +2428/20000 train_loss: 2.5207 train_time: 4.0m tok/s: 7931022 +2429/20000 train_loss: 2.4963 train_time: 4.0m tok/s: 7929716 +2430/20000 train_loss: 2.4704 train_time: 4.0m tok/s: 7928373 +2431/20000 train_loss: 2.5885 train_time: 4.0m tok/s: 7926947 +2432/20000 train_loss: 2.5420 train_time: 4.0m tok/s: 7925613 +2433/20000 train_loss: 2.6759 train_time: 4.0m tok/s: 7924300 +2434/20000 train_loss: 2.4891 train_time: 4.0m tok/s: 7922981 +2435/20000 train_loss: 2.6772 train_time: 4.0m tok/s: 7921626 +2436/20000 train_loss: 2.5365 train_time: 4.0m tok/s: 7920311 +2437/20000 train_loss: 2.5837 train_time: 4.0m tok/s: 7918979 +2438/20000 train_loss: 2.5316 train_time: 4.0m tok/s: 7917637 +2439/20000 train_loss: 2.5330 train_time: 4.0m tok/s: 7916282 +2440/20000 train_loss: 2.5069 train_time: 4.0m tok/s: 7914997 +2441/20000 train_loss: 2.5391 train_time: 4.0m tok/s: 7913694 +2442/20000 train_loss: 2.5164 train_time: 4.0m tok/s: 7912368 +2443/20000 train_loss: 2.6455 train_time: 4.0m tok/s: 7911063 +2444/20000 train_loss: 2.5882 train_time: 4.0m tok/s: 7909721 +2445/20000 train_loss: 2.5052 
train_time: 4.1m tok/s: 7908439 +2446/20000 train_loss: 2.7168 train_time: 4.1m tok/s: 7907113 +2447/20000 train_loss: 2.7269 train_time: 4.1m tok/s: 7905800 +2448/20000 train_loss: 2.5847 train_time: 4.1m tok/s: 7904514 +2449/20000 train_loss: 2.5003 train_time: 4.1m tok/s: 7903169 +2450/20000 train_loss: 2.5558 train_time: 4.1m tok/s: 7901852 +2451/20000 train_loss: 2.5470 train_time: 4.1m tok/s: 7900553 +2452/20000 train_loss: 2.5694 train_time: 4.1m tok/s: 7899218 +2453/20000 train_loss: 2.4424 train_time: 4.1m tok/s: 7897907 +2454/20000 train_loss: 2.4500 train_time: 4.1m tok/s: 7896639 +2455/20000 train_loss: 2.5487 train_time: 4.1m tok/s: 7895304 +2456/20000 train_loss: 2.5543 train_time: 4.1m tok/s: 7893988 +2457/20000 train_loss: 2.6242 train_time: 4.1m tok/s: 7892718 +2458/20000 train_loss: 2.7155 train_time: 4.1m tok/s: 7891381 +2459/20000 train_loss: 2.5672 train_time: 4.1m tok/s: 7890131 +2460/20000 train_loss: 2.5889 train_time: 4.1m tok/s: 7888805 +2461/20000 train_loss: 2.6668 train_time: 4.1m tok/s: 7887547 +2462/20000 train_loss: 2.5464 train_time: 4.1m tok/s: 7886238 +2463/20000 train_loss: 2.6039 train_time: 4.1m tok/s: 7884950 +2464/20000 train_loss: 2.5216 train_time: 4.1m tok/s: 7883620 +2465/20000 train_loss: 2.6161 train_time: 4.1m tok/s: 7882376 +2466/20000 train_loss: 2.3301 train_time: 4.1m tok/s: 7881089 +2467/20000 train_loss: 2.5954 train_time: 4.1m tok/s: 7879772 +2468/20000 train_loss: 2.4793 train_time: 4.1m tok/s: 7878471 +2469/20000 train_loss: 2.5815 train_time: 4.1m tok/s: 7877202 +2470/20000 train_loss: 2.6069 train_time: 4.1m tok/s: 7875963 +2471/20000 train_loss: 2.5851 train_time: 4.1m tok/s: 7874649 +2472/20000 train_loss: 2.6802 train_time: 4.1m tok/s: 7873366 +2473/20000 train_loss: 2.5544 train_time: 4.1m tok/s: 7872116 +2474/20000 train_loss: 2.8435 train_time: 4.1m tok/s: 7870692 +2475/20000 train_loss: 2.6939 train_time: 4.1m tok/s: 7869405 +2476/20000 train_loss: 2.5771 train_time: 4.1m tok/s: 7868159 +2477/20000 
train_loss: 2.5021 train_time: 4.1m tok/s: 7866887 +2478/20000 train_loss: 2.5340 train_time: 4.1m tok/s: 7865651 +2479/20000 train_loss: 2.6441 train_time: 4.1m tok/s: 7864385 +2480/20000 train_loss: 2.5834 train_time: 4.1m tok/s: 7863125 +2481/20000 train_loss: 2.4684 train_time: 4.1m tok/s: 7861892 +2482/20000 train_loss: 2.6226 train_time: 4.1m tok/s: 7860597 +2483/20000 train_loss: 2.5765 train_time: 4.1m tok/s: 7859358 +2484/20000 train_loss: 2.5552 train_time: 4.1m tok/s: 7858117 +2485/20000 train_loss: 2.4921 train_time: 4.1m tok/s: 7856870 +2486/20000 train_loss: 2.5789 train_time: 4.1m tok/s: 7855581 +2487/20000 train_loss: 2.6060 train_time: 4.2m tok/s: 7854319 +2488/20000 train_loss: 2.5724 train_time: 4.2m tok/s: 7853074 +2489/20000 train_loss: 2.4660 train_time: 4.2m tok/s: 7851824 +2490/20000 train_loss: 2.6346 train_time: 4.2m tok/s: 7850552 +2491/20000 train_loss: 2.5767 train_time: 4.2m tok/s: 7849293 +2492/20000 train_loss: 2.5452 train_time: 4.2m tok/s: 7848026 +2493/20000 train_loss: 2.5718 train_time: 4.2m tok/s: 7846795 +2494/20000 train_loss: 2.4560 train_time: 4.2m tok/s: 7845476 +2495/20000 train_loss: 2.4973 train_time: 4.2m tok/s: 7844234 +2496/20000 train_loss: 2.6145 train_time: 4.2m tok/s: 7843029 +2497/20000 train_loss: 2.5656 train_time: 4.2m tok/s: 7841780 +2498/20000 train_loss: 2.5613 train_time: 4.2m tok/s: 7840546 +2499/20000 train_loss: 2.6039 train_time: 4.2m tok/s: 7839329 +2500/20000 train_loss: 2.6728 train_time: 4.2m tok/s: 7838140 +2501/20000 train_loss: 2.5662 train_time: 4.2m tok/s: 7836849 +2502/20000 train_loss: 2.4136 train_time: 4.2m tok/s: 7835578 +2503/20000 train_loss: 2.5229 train_time: 4.2m tok/s: 7834348 +2504/20000 train_loss: 2.6195 train_time: 4.2m tok/s: 7833070 +2505/20000 train_loss: 2.5240 train_time: 4.2m tok/s: 7831838 +2506/20000 train_loss: 2.5763 train_time: 4.2m tok/s: 7830562 +2507/20000 train_loss: 2.4341 train_time: 4.2m tok/s: 7829338 +2508/20000 train_loss: 2.5896 train_time: 4.2m tok/s: 
7828113 +2509/20000 train_loss: 2.6294 train_time: 4.2m tok/s: 7826907 +2510/20000 train_loss: 2.5293 train_time: 4.2m tok/s: 7825664 +2511/20000 train_loss: 2.5671 train_time: 4.2m tok/s: 7824298 +2512/20000 train_loss: 2.6267 train_time: 4.2m tok/s: 7823022 +2513/20000 train_loss: 2.4801 train_time: 4.2m tok/s: 7821826 +2514/20000 train_loss: 2.5743 train_time: 4.2m tok/s: 7820629 +2515/20000 train_loss: 2.6102 train_time: 4.2m tok/s: 7819290 +2516/20000 train_loss: 2.5965 train_time: 4.2m tok/s: 7818027 +2517/20000 train_loss: 2.4395 train_time: 4.2m tok/s: 7816795 +2518/20000 train_loss: 2.5444 train_time: 4.2m tok/s: 7815550 +2519/20000 train_loss: 2.5825 train_time: 4.2m tok/s: 7814271 +2520/20000 train_loss: 2.5422 train_time: 4.2m tok/s: 7813003 +2521/20000 train_loss: 2.6211 train_time: 4.2m tok/s: 7811791 +2522/20000 train_loss: 2.6172 train_time: 4.2m tok/s: 7810589 +2523/20000 train_loss: 2.5602 train_time: 4.2m tok/s: 7809366 +2524/20000 train_loss: 2.5419 train_time: 4.2m tok/s: 7808154 +2525/20000 train_loss: 2.4498 train_time: 4.2m tok/s: 7807021 +2526/20000 train_loss: 2.5383 train_time: 4.2m tok/s: 7805789 +2527/20000 train_loss: 2.5508 train_time: 4.2m tok/s: 7804505 +2528/20000 train_loss: 2.6039 train_time: 4.2m tok/s: 7803278 +2529/20000 train_loss: 2.5620 train_time: 4.2m tok/s: 7802092 +2530/20000 train_loss: 2.4890 train_time: 4.3m tok/s: 7800930 +2531/20000 train_loss: 2.4014 train_time: 4.3m tok/s: 7799725 +2532/20000 train_loss: 2.5258 train_time: 4.3m tok/s: 7798410 +2533/20000 train_loss: 2.5434 train_time: 4.3m tok/s: 7797214 +2534/20000 train_loss: 2.4605 train_time: 4.3m tok/s: 7796031 +2535/20000 train_loss: 2.5775 train_time: 4.3m tok/s: 7794867 +2536/20000 train_loss: 2.5701 train_time: 4.3m tok/s: 7793650 +2537/20000 train_loss: 2.4774 train_time: 4.3m tok/s: 7792454 +2538/20000 train_loss: 2.5960 train_time: 4.3m tok/s: 7791295 +2539/20000 train_loss: 2.7889 train_time: 4.3m tok/s: 7790125 +2540/20000 train_loss: 2.5461 
train_time: 4.3m tok/s: 7788942 +2541/20000 train_loss: 2.5357 train_time: 4.3m tok/s: 7787815 +2542/20000 train_loss: 2.5526 train_time: 4.3m tok/s: 7786581 +2543/20000 train_loss: 2.6335 train_time: 4.3m tok/s: 7785396 +2544/20000 train_loss: 2.6598 train_time: 4.3m tok/s: 7784222 +2545/20000 train_loss: 2.4912 train_time: 4.3m tok/s: 7782997 +2546/20000 train_loss: 2.5370 train_time: 4.3m tok/s: 7781830 +2547/20000 train_loss: 2.5134 train_time: 4.3m tok/s: 7780648 +2548/20000 train_loss: 2.7855 train_time: 4.3m tok/s: 7779485 +2549/20000 train_loss: 2.5548 train_time: 4.3m tok/s: 7778286 +2550/20000 train_loss: 2.8241 train_time: 4.3m tok/s: 7777136 +2551/20000 train_loss: 2.5345 train_time: 4.3m tok/s: 7775979 +2552/20000 train_loss: 2.7488 train_time: 4.3m tok/s: 7774842 +2553/20000 train_loss: 2.5872 train_time: 4.3m tok/s: 7773591 +2554/20000 train_loss: 2.4628 train_time: 4.3m tok/s: 7772478 +2555/20000 train_loss: 2.5542 train_time: 4.3m tok/s: 7771336 +2556/20000 train_loss: 2.5598 train_time: 4.3m tok/s: 7770160 +2557/20000 train_loss: 2.5038 train_time: 4.3m tok/s: 7769013 +2558/20000 train_loss: 2.6142 train_time: 4.3m tok/s: 7767880 +2559/20000 train_loss: 2.4343 train_time: 4.3m tok/s: 7766673 +2560/20000 train_loss: 2.4490 train_time: 4.3m tok/s: 7765447 +2561/20000 train_loss: 2.5288 train_time: 4.3m tok/s: 7764320 +2562/20000 train_loss: 2.4745 train_time: 4.3m tok/s: 7763200 +2563/20000 train_loss: 2.4427 train_time: 4.3m tok/s: 7762056 +2564/20000 train_loss: 2.4324 train_time: 4.3m tok/s: 7760874 +2565/20000 train_loss: 2.5392 train_time: 4.3m tok/s: 7759716 +2566/20000 train_loss: 2.5667 train_time: 4.3m tok/s: 7758633 +2567/20000 train_loss: 2.5835 train_time: 4.3m tok/s: 7757493 +2568/20000 train_loss: 2.6135 train_time: 4.3m tok/s: 7756332 +2569/20000 train_loss: 2.6611 train_time: 4.3m tok/s: 7755195 +2570/20000 train_loss: 2.5517 train_time: 4.3m tok/s: 7754026 +2571/20000 train_loss: 2.6060 train_time: 4.3m tok/s: 7752875 +2572/20000 
train_loss: 2.4356 train_time: 4.3m tok/s: 7751741 +2573/20000 train_loss: 2.4992 train_time: 4.4m tok/s: 7750548 +2574/20000 train_loss: 2.6861 train_time: 4.4m tok/s: 7749382 +2575/20000 train_loss: 2.5862 train_time: 4.4m tok/s: 7748274 +2576/20000 train_loss: 2.5131 train_time: 4.4m tok/s: 7747145 +2577/20000 train_loss: 2.4739 train_time: 4.4m tok/s: 7746023 +2578/20000 train_loss: 2.4463 train_time: 4.4m tok/s: 7744906 +2579/20000 train_loss: 2.5430 train_time: 4.4m tok/s: 7743785 +2580/20000 train_loss: 2.4690 train_time: 4.4m tok/s: 7742628 +2581/20000 train_loss: 2.3496 train_time: 4.4m tok/s: 7741415 +2582/20000 train_loss: 2.5642 train_time: 4.4m tok/s: 7740279 +2583/20000 train_loss: 2.5957 train_time: 4.4m tok/s: 7739148 +2584/20000 train_loss: 2.5916 train_time: 4.4m tok/s: 7738026 +2585/20000 train_loss: 2.4834 train_time: 4.4m tok/s: 7736930 +2586/20000 train_loss: 2.5082 train_time: 4.4m tok/s: 7735783 +2587/20000 train_loss: 2.5487 train_time: 4.4m tok/s: 7734676 +2588/20000 train_loss: 2.6069 train_time: 4.4m tok/s: 7733539 +2589/20000 train_loss: 2.4782 train_time: 4.4m tok/s: 7732419 +2590/20000 train_loss: 2.5150 train_time: 4.4m tok/s: 7731318 +2591/20000 train_loss: 2.4702 train_time: 4.4m tok/s: 7730199 +2592/20000 train_loss: 2.4328 train_time: 4.4m tok/s: 7729075 +2593/20000 train_loss: 2.5044 train_time: 4.4m tok/s: 7727987 +2594/20000 train_loss: 2.4295 train_time: 4.4m tok/s: 7726818 +2595/20000 train_loss: 2.6080 train_time: 4.4m tok/s: 7725705 +2596/20000 train_loss: 3.0952 train_time: 4.4m tok/s: 7724593 +2597/20000 train_loss: 2.4065 train_time: 4.4m tok/s: 7723484 +2598/20000 train_loss: 2.5094 train_time: 4.4m tok/s: 7722371 +2599/20000 train_loss: 2.6202 train_time: 4.4m tok/s: 7721249 +2600/20000 train_loss: 2.5802 train_time: 4.4m tok/s: 7720126 +2601/20000 train_loss: 2.5012 train_time: 4.4m tok/s: 7719018 +2602/20000 train_loss: 2.7696 train_time: 4.4m tok/s: 7717917 +2603/20000 train_loss: 2.5249 train_time: 4.4m tok/s: 
7716819 +2604/20000 train_loss: 2.5245 train_time: 4.4m tok/s: 7715689 +2605/20000 train_loss: 2.6691 train_time: 4.4m tok/s: 7714583 +2606/20000 train_loss: 2.4130 train_time: 4.4m tok/s: 7713501 +2607/20000 train_loss: 2.4734 train_time: 4.4m tok/s: 7712347 +2608/20000 train_loss: 2.5299 train_time: 4.4m tok/s: 7711283 +2609/20000 train_loss: 2.4865 train_time: 4.4m tok/s: 7710113 +2610/20000 train_loss: 2.4743 train_time: 4.4m tok/s: 7709023 +2611/20000 train_loss: 2.6162 train_time: 4.4m tok/s: 7707941 +2612/20000 train_loss: 2.6840 train_time: 4.4m tok/s: 7706771 +2613/20000 train_loss: 2.5602 train_time: 4.4m tok/s: 7705676 +2614/20000 train_loss: 2.5460 train_time: 4.4m tok/s: 7704598 +2615/20000 train_loss: 2.6722 train_time: 4.4m tok/s: 7703528 +2616/20000 train_loss: 2.5771 train_time: 4.5m tok/s: 7702434 +2617/20000 train_loss: 2.5680 train_time: 4.5m tok/s: 7701354 +2618/20000 train_loss: 2.5671 train_time: 4.5m tok/s: 7700250 +2619/20000 train_loss: 2.4633 train_time: 4.5m tok/s: 7699189 +2620/20000 train_loss: 2.4306 train_time: 4.5m tok/s: 7698112 +2621/20000 train_loss: 2.5054 train_time: 4.5m tok/s: 7697011 +2622/20000 train_loss: 2.4895 train_time: 4.5m tok/s: 7695992 +2623/20000 train_loss: 2.5090 train_time: 4.5m tok/s: 7694830 +2624/20000 train_loss: 2.3486 train_time: 4.5m tok/s: 7693745 +2625/20000 train_loss: 2.6367 train_time: 4.5m tok/s: 7692716 +2626/20000 train_loss: 2.3587 train_time: 4.5m tok/s: 7691610 +2627/20000 train_loss: 2.4638 train_time: 4.5m tok/s: 7690551 +2628/20000 train_loss: 2.6601 train_time: 4.5m tok/s: 7689485 +2629/20000 train_loss: 2.5601 train_time: 4.5m tok/s: 7688433 +2630/20000 train_loss: 2.6218 train_time: 4.5m tok/s: 7687328 +2631/20000 train_loss: 2.5929 train_time: 4.5m tok/s: 7686222 +2632/20000 train_loss: 2.6348 train_time: 4.5m tok/s: 7685185 +2633/20000 train_loss: 2.4865 train_time: 4.5m tok/s: 7684089 +2634/20000 train_loss: 2.5667 train_time: 4.5m tok/s: 7683040 +2635/20000 train_loss: 2.4911 
train_time: 4.5m tok/s: 7681948 +2636/20000 train_loss: 2.5418 train_time: 4.5m tok/s: 7680888 +2637/20000 train_loss: 2.4554 train_time: 4.5m tok/s: 7679830 +2638/20000 train_loss: 2.5388 train_time: 4.5m tok/s: 7678739 +2639/20000 train_loss: 2.2733 train_time: 4.5m tok/s: 7677607 +2640/20000 train_loss: 2.5428 train_time: 4.5m tok/s: 7676583 +2641/20000 train_loss: 2.5948 train_time: 4.5m tok/s: 7675465 +2642/20000 train_loss: 2.6462 train_time: 4.5m tok/s: 7674442 +2643/20000 train_loss: 2.5601 train_time: 4.5m tok/s: 7673352 +2644/20000 train_loss: 2.5819 train_time: 4.5m tok/s: 7672309 +2645/20000 train_loss: 2.5340 train_time: 4.5m tok/s: 7671253 +2646/20000 train_loss: 2.5672 train_time: 4.5m tok/s: 7670093 +2647/20000 train_loss: 2.6580 train_time: 4.5m tok/s: 7669027 +2648/20000 train_loss: 2.5172 train_time: 4.5m tok/s: 7667989 +2649/20000 train_loss: 2.5460 train_time: 4.5m tok/s: 7666945 +2650/20000 train_loss: 2.4652 train_time: 4.5m tok/s: 7665913 +2651/20000 train_loss: 2.4373 train_time: 4.5m tok/s: 7664875 +2652/20000 train_loss: 2.3550 train_time: 4.5m tok/s: 7663823 +2653/20000 train_loss: 2.6508 train_time: 4.5m tok/s: 7662724 +2654/20000 train_loss: 2.2725 train_time: 4.5m tok/s: 7661599 +2655/20000 train_loss: 2.9441 train_time: 4.5m tok/s: 7660464 +2656/20000 train_loss: 2.4550 train_time: 4.5m tok/s: 7659420 +2657/20000 train_loss: 2.4392 train_time: 4.5m tok/s: 7658394 +2658/20000 train_loss: 2.6202 train_time: 4.5m tok/s: 7657290 +2659/20000 train_loss: 2.5281 train_time: 4.6m tok/s: 7656186 +2660/20000 train_loss: 2.5936 train_time: 4.6m tok/s: 7655104 +2661/20000 train_loss: 2.5486 train_time: 4.6m tok/s: 7654055 +2662/20000 train_loss: 2.3351 train_time: 4.6m tok/s: 7652986 +2663/20000 train_loss: 2.7240 train_time: 4.6m tok/s: 7651897 +2664/20000 train_loss: 2.5450 train_time: 4.6m tok/s: 7650761 +2665/20000 train_loss: 2.4913 train_time: 4.6m tok/s: 7649712 +2666/20000 train_loss: 2.4073 train_time: 4.6m tok/s: 7648686 +2667/20000 
train_loss: 2.2918 train_time: 4.6m tok/s: 7647656 +2668/20000 train_loss: 2.5799 train_time: 4.6m tok/s: 7646594 +2669/20000 train_loss: 2.4555 train_time: 4.6m tok/s: 7645564 +2670/20000 train_loss: 2.5740 train_time: 4.6m tok/s: 7644553 +2671/20000 train_loss: 2.6714 train_time: 4.6m tok/s: 7643510 +2672/20000 train_loss: 2.6147 train_time: 4.6m tok/s: 7642499 +2673/20000 train_loss: 2.5684 train_time: 4.6m tok/s: 7641482 +2674/20000 train_loss: 2.6311 train_time: 4.6m tok/s: 7640458 +2675/20000 train_loss: 2.5330 train_time: 4.6m tok/s: 7639438 +2676/20000 train_loss: 2.5228 train_time: 4.6m tok/s: 7638381 +2677/20000 train_loss: 2.4367 train_time: 4.6m tok/s: 7637375 +2678/20000 train_loss: 2.4740 train_time: 4.6m tok/s: 7636328 +2679/20000 train_loss: 2.3282 train_time: 4.6m tok/s: 7635279 +2680/20000 train_loss: 2.4358 train_time: 4.6m tok/s: 7634257 +2681/20000 train_loss: 2.4490 train_time: 4.6m tok/s: 7633230 +2682/20000 train_loss: 2.5427 train_time: 4.6m tok/s: 7632222 +2683/20000 train_loss: 2.4670 train_time: 4.6m tok/s: 7631194 +2684/20000 train_loss: 2.4811 train_time: 4.6m tok/s: 7630166 +2685/20000 train_loss: 2.8106 train_time: 4.6m tok/s: 7629125 +2686/20000 train_loss: 2.5216 train_time: 4.6m tok/s: 7628157 +2687/20000 train_loss: 2.5802 train_time: 4.6m tok/s: 7627156 +2688/20000 train_loss: 2.4570 train_time: 4.6m tok/s: 7626161 +2689/20000 train_loss: 2.5505 train_time: 4.6m tok/s: 7625129 +2690/20000 train_loss: 2.5524 train_time: 4.6m tok/s: 7624114 +2691/20000 train_loss: 2.5344 train_time: 4.6m tok/s: 7623110 +2692/20000 train_loss: 2.4694 train_time: 4.6m tok/s: 7622097 +2693/20000 train_loss: 2.4480 train_time: 4.6m tok/s: 7621033 +2694/20000 train_loss: 2.4779 train_time: 4.6m tok/s: 7620004 +2695/20000 train_loss: 2.5831 train_time: 4.6m tok/s: 7618986 +2696/20000 train_loss: 2.4787 train_time: 4.6m tok/s: 7617983 +2697/20000 train_loss: 2.5606 train_time: 4.6m tok/s: 7616965 +2698/20000 train_loss: 2.5781 train_time: 4.6m tok/s: 
7615973 +2699/20000 train_loss: 2.4777 train_time: 4.6m tok/s: 7614986 +2700/20000 train_loss: 2.4587 train_time: 4.6m tok/s: 7614003 +2701/20000 train_loss: 2.5890 train_time: 4.7m tok/s: 7613021 +2702/20000 train_loss: 2.3712 train_time: 4.7m tok/s: 7612031 +2703/20000 train_loss: 2.5563 train_time: 4.7m tok/s: 7611059 +2704/20000 train_loss: 2.4514 train_time: 4.7m tok/s: 7609986 +2705/20000 train_loss: 2.5100 train_time: 4.7m tok/s: 7609024 +2706/20000 train_loss: 2.5183 train_time: 4.7m tok/s: 7608052 +2707/20000 train_loss: 2.6204 train_time: 4.7m tok/s: 7607019 +2708/20000 train_loss: 2.6550 train_time: 4.7m tok/s: 7605994 +2709/20000 train_loss: 2.5740 train_time: 4.7m tok/s: 7605028 +2710/20000 train_loss: 2.6279 train_time: 4.7m tok/s: 7604004 +2711/20000 train_loss: 2.7073 train_time: 4.7m tok/s: 7602982 +2712/20000 train_loss: 2.4950 train_time: 4.7m tok/s: 7602032 +2713/20000 train_loss: 2.5724 train_time: 4.7m tok/s: 7601005 +2714/20000 train_loss: 2.6281 train_time: 4.7m tok/s: 7600028 +2715/20000 train_loss: 2.4579 train_time: 4.7m tok/s: 7599075 +2716/20000 train_loss: 2.4199 train_time: 4.7m tok/s: 7598088 +2717/20000 train_loss: 2.5248 train_time: 4.7m tok/s: 7597157 +2718/20000 train_loss: 2.4495 train_time: 4.7m tok/s: 7596147 +2719/20000 train_loss: 2.4432 train_time: 4.7m tok/s: 7595173 +2720/20000 train_loss: 2.5684 train_time: 4.7m tok/s: 7594102 +2721/20000 train_loss: 2.4111 train_time: 4.7m tok/s: 7593026 +2722/20000 train_loss: 2.4857 train_time: 4.7m tok/s: 7592046 +2723/20000 train_loss: 2.4645 train_time: 4.7m tok/s: 7591109 +2724/20000 train_loss: 2.5260 train_time: 4.7m tok/s: 7590108 +2725/20000 train_loss: 2.6417 train_time: 4.7m tok/s: 7589169 +2726/20000 train_loss: 2.5065 train_time: 4.7m tok/s: 7588223 +2727/20000 train_loss: 2.5663 train_time: 4.7m tok/s: 7587260 +2728/20000 train_loss: 2.8968 train_time: 4.7m tok/s: 7586315 +2729/20000 train_loss: 2.6997 train_time: 4.7m tok/s: 7585364 +2730/20000 train_loss: 2.5374 
train_time: 4.7m tok/s: 7584379 +2731/20000 train_loss: 2.6094 train_time: 4.7m tok/s: 7583390 +2732/20000 train_loss: 2.6347 train_time: 4.7m tok/s: 7582411 +2733/20000 train_loss: 2.5490 train_time: 4.7m tok/s: 7581444 +2734/20000 train_loss: 2.6283 train_time: 4.7m tok/s: 7580485 +2735/20000 train_loss: 2.4311 train_time: 4.7m tok/s: 7579535 +2736/20000 train_loss: 2.5532 train_time: 4.7m tok/s: 7578619 +2737/20000 train_loss: 2.4799 train_time: 4.7m tok/s: 7577592 +2738/20000 train_loss: 2.4094 train_time: 4.7m tok/s: 7576607 +2739/20000 train_loss: 2.5408 train_time: 4.7m tok/s: 7575654 +2740/20000 train_loss: 2.5572 train_time: 4.7m tok/s: 7574678 +2741/20000 train_loss: 2.5185 train_time: 4.7m tok/s: 7573711 +2742/20000 train_loss: 2.4823 train_time: 4.7m tok/s: 7572778 +2743/20000 train_loss: 2.5805 train_time: 4.7m tok/s: 7571791 +2744/20000 train_loss: 2.5895 train_time: 4.8m tok/s: 7570846 +2745/20000 train_loss: 2.6531 train_time: 4.8m tok/s: 7569920 +2746/20000 train_loss: 2.6218 train_time: 4.8m tok/s: 7568959 +2747/20000 train_loss: 2.4441 train_time: 4.8m tok/s: 7567990 +2748/20000 train_loss: 2.5154 train_time: 4.8m tok/s: 7567013 +2749/20000 train_loss: 2.5820 train_time: 4.8m tok/s: 7566082 +2750/20000 train_loss: 2.6452 train_time: 4.8m tok/s: 7565117 +2751/20000 train_loss: 2.6306 train_time: 4.8m tok/s: 7564165 +2752/20000 train_loss: 2.5047 train_time: 4.8m tok/s: 7563214 +2753/20000 train_loss: 2.4866 train_time: 4.8m tok/s: 7562313 +2754/20000 train_loss: 2.4567 train_time: 4.8m tok/s: 7561381 +2755/20000 train_loss: 2.5016 train_time: 4.8m tok/s: 7560402 +2756/20000 train_loss: 2.4830 train_time: 4.8m tok/s: 7559475 +2757/20000 train_loss: 2.4496 train_time: 4.8m tok/s: 7558489 +2758/20000 train_loss: 2.5931 train_time: 4.8m tok/s: 7557529 +2759/20000 train_loss: 2.4757 train_time: 4.8m tok/s: 7556575 +2760/20000 train_loss: 2.4130 train_time: 4.8m tok/s: 7555632 +2761/20000 train_loss: 2.6437 train_time: 4.8m tok/s: 7554640 +2762/20000 
train_loss: 2.5414 train_time: 4.8m tok/s: 7553714 +2763/20000 train_loss: 2.5951 train_time: 4.8m tok/s: 7552780 +2764/20000 train_loss: 2.5721 train_time: 4.8m tok/s: 7551856 +2765/20000 train_loss: 2.5085 train_time: 4.8m tok/s: 7550940 +2766/20000 train_loss: 2.4660 train_time: 4.8m tok/s: 7550007 +2767/20000 train_loss: 2.5102 train_time: 4.8m tok/s: 7549060 +2768/20000 train_loss: 2.6474 train_time: 4.8m tok/s: 7548125 +2769/20000 train_loss: 2.5592 train_time: 4.8m tok/s: 7547141 +2770/20000 train_loss: 2.7017 train_time: 4.8m tok/s: 7546196 +2771/20000 train_loss: 2.5115 train_time: 4.8m tok/s: 7545247 +2772/20000 train_loss: 2.5359 train_time: 4.8m tok/s: 7544319 +2773/20000 train_loss: 2.5178 train_time: 4.8m tok/s: 7543383 +2774/20000 train_loss: 2.4714 train_time: 4.8m tok/s: 7542452 +2775/20000 train_loss: 2.4707 train_time: 4.8m tok/s: 7541530 +2776/20000 train_loss: 2.4153 train_time: 4.8m tok/s: 7540610 +2777/20000 train_loss: 2.4803 train_time: 4.8m tok/s: 7539684 +2778/20000 train_loss: 2.5345 train_time: 4.8m tok/s: 7538772 +2779/20000 train_loss: 2.4037 train_time: 4.8m tok/s: 7537784 +2780/20000 train_loss: 2.6623 train_time: 4.8m tok/s: 7536883 +2781/20000 train_loss: 2.6360 train_time: 4.8m tok/s: 7535976 +2782/20000 train_loss: 2.4615 train_time: 4.8m tok/s: 7535080 +2783/20000 train_loss: 2.6230 train_time: 4.8m tok/s: 7534145 +2784/20000 train_loss: 2.6529 train_time: 4.8m tok/s: 7533248 +2785/20000 train_loss: 2.5229 train_time: 4.8m tok/s: 7532340 +2786/20000 train_loss: 2.5237 train_time: 4.8m tok/s: 7531411 +2787/20000 train_loss: 2.5830 train_time: 4.9m tok/s: 7530475 +2788/20000 train_loss: 2.4289 train_time: 4.9m tok/s: 7529556 +2789/20000 train_loss: 2.5456 train_time: 4.9m tok/s: 7528657 +2790/20000 train_loss: 2.6072 train_time: 4.9m tok/s: 7527724 +2791/20000 train_loss: 2.3670 train_time: 4.9m tok/s: 7526715 +2792/20000 train_loss: 2.5233 train_time: 4.9m tok/s: 7525801 +2793/20000 train_loss: 2.5464 train_time: 4.9m tok/s: 
7524874 +2794/20000 train_loss: 2.4984 train_time: 4.9m tok/s: 7523981 +2795/20000 train_loss: 2.3798 train_time: 4.9m tok/s: 7523120 +2796/20000 train_loss: 2.5178 train_time: 4.9m tok/s: 7522191 +2797/20000 train_loss: 2.5439 train_time: 4.9m tok/s: 7521301 +2798/20000 train_loss: 2.5204 train_time: 4.9m tok/s: 7520391 +2799/20000 train_loss: 2.7806 train_time: 4.9m tok/s: 7519501 +2800/20000 train_loss: 2.6309 train_time: 4.9m tok/s: 7518591 +2801/20000 train_loss: 2.5026 train_time: 4.9m tok/s: 7517674 +2802/20000 train_loss: 2.5197 train_time: 4.9m tok/s: 7516810 +2803/20000 train_loss: 2.5935 train_time: 4.9m tok/s: 7515844 +2804/20000 train_loss: 2.6355 train_time: 4.9m tok/s: 7514939 +2805/20000 train_loss: 2.4744 train_time: 4.9m tok/s: 7514035 +2806/20000 train_loss: 2.5062 train_time: 4.9m tok/s: 7513117 +2807/20000 train_loss: 2.6584 train_time: 4.9m tok/s: 7512225 +2808/20000 train_loss: 2.5859 train_time: 4.9m tok/s: 7511214 +2809/20000 train_loss: 2.4399 train_time: 4.9m tok/s: 7510301 +2810/20000 train_loss: 2.5304 train_time: 4.9m tok/s: 7509413 +2811/20000 train_loss: 2.5916 train_time: 4.9m tok/s: 7508507 +2812/20000 train_loss: 2.5973 train_time: 4.9m tok/s: 7507622 +2813/20000 train_loss: 2.3938 train_time: 4.9m tok/s: 7506751 +2814/20000 train_loss: 2.5393 train_time: 4.9m tok/s: 7505862 +2815/20000 train_loss: 2.6721 train_time: 4.9m tok/s: 7504978 +2816/20000 train_loss: 2.6399 train_time: 4.9m tok/s: 7504078 +2817/20000 train_loss: 2.5448 train_time: 4.9m tok/s: 7503199 +2818/20000 train_loss: 2.5661 train_time: 4.9m tok/s: 7502345 +2819/20000 train_loss: 2.4743 train_time: 4.9m tok/s: 7501498 +2820/20000 train_loss: 2.4989 train_time: 4.9m tok/s: 7500615 +2821/20000 train_loss: 2.4400 train_time: 4.9m tok/s: 7499655 +2822/20000 train_loss: 2.7934 train_time: 4.9m tok/s: 7498679 +2823/20000 train_loss: 2.6177 train_time: 4.9m tok/s: 7497799 +2824/20000 train_loss: 2.6749 train_time: 4.9m tok/s: 7496897 +2825/20000 train_loss: 2.4820 
train_time: 4.9m tok/s: 7496015 +2826/20000 train_loss: 2.5645 train_time: 4.9m tok/s: 7495153 +2827/20000 train_loss: 2.4507 train_time: 4.9m tok/s: 7494305 +2828/20000 train_loss: 2.6260 train_time: 4.9m tok/s: 7493423 +2829/20000 train_loss: 2.4131 train_time: 4.9m tok/s: 7492558 +2830/20000 train_loss: 2.5082 train_time: 5.0m tok/s: 7491654 +2831/20000 train_loss: 2.7162 train_time: 5.0m tok/s: 7490777 +2832/20000 train_loss: 2.5276 train_time: 5.0m tok/s: 7489914 +2833/20000 train_loss: 2.6925 train_time: 5.0m tok/s: 7489040 +2834/20000 train_loss: 2.6388 train_time: 5.0m tok/s: 7488164 +2835/20000 train_loss: 2.5931 train_time: 5.0m tok/s: 7487297 +2836/20000 train_loss: 2.4994 train_time: 5.0m tok/s: 7486413 +2837/20000 train_loss: 2.5779 train_time: 5.0m tok/s: 7485536 +2838/20000 train_loss: 2.5442 train_time: 5.0m tok/s: 7484633 +2839/20000 train_loss: 2.5285 train_time: 5.0m tok/s: 7483759 +2840/20000 train_loss: 2.6097 train_time: 5.0m tok/s: 7482890 +2841/20000 train_loss: 2.5508 train_time: 5.0m tok/s: 7482004 +2842/20000 train_loss: 2.6262 train_time: 5.0m tok/s: 7481139 +2843/20000 train_loss: 2.4449 train_time: 5.0m tok/s: 7480233 +2844/20000 train_loss: 2.4384 train_time: 5.0m tok/s: 7479361 +2845/20000 train_loss: 2.5239 train_time: 5.0m tok/s: 7478509 +2846/20000 train_loss: 2.3911 train_time: 5.0m tok/s: 7477650 +2847/20000 train_loss: 2.4322 train_time: 5.0m tok/s: 7476768 +2848/20000 train_loss: 2.5625 train_time: 5.0m tok/s: 7475922 +2849/20000 train_loss: 2.6316 train_time: 5.0m tok/s: 7475030 +2850/20000 train_loss: 2.5749 train_time: 5.0m tok/s: 7474146 +2851/20000 train_loss: 2.7799 train_time: 5.0m tok/s: 7473255 +2852/20000 train_loss: 2.4624 train_time: 5.0m tok/s: 7472383 +2853/20000 train_loss: 2.5365 train_time: 5.0m tok/s: 7471521 +2854/20000 train_loss: 2.4569 train_time: 5.0m tok/s: 7470659 +2855/20000 train_loss: 2.6423 train_time: 5.0m tok/s: 7469797 +2856/20000 train_loss: 2.4677 train_time: 5.0m tok/s: 7468929 +2857/20000 
train_loss: 2.5327 train_time: 5.0m tok/s: 7468084 +2858/20000 train_loss: 2.5123 train_time: 5.0m tok/s: 7467202 +2859/20000 train_loss: 3.1528 train_time: 5.0m tok/s: 7466254 +2860/20000 train_loss: 2.4918 train_time: 5.0m tok/s: 7465375 +2861/20000 train_loss: 2.5030 train_time: 5.0m tok/s: 7464548 +2862/20000 train_loss: 2.5150 train_time: 5.0m tok/s: 7463715 +2863/20000 train_loss: 2.3516 train_time: 5.0m tok/s: 7462884 +2864/20000 train_loss: 2.3826 train_time: 5.0m tok/s: 7462037 +2865/20000 train_loss: 2.6168 train_time: 5.0m tok/s: 7461178 +2866/20000 train_loss: 2.5269 train_time: 5.0m tok/s: 7460355 +2867/20000 train_loss: 2.3950 train_time: 5.0m tok/s: 7459494 +2868/20000 train_loss: 2.4288 train_time: 5.0m tok/s: 7458660 +2869/20000 train_loss: 2.5630 train_time: 5.0m tok/s: 7457843 +2870/20000 train_loss: 2.6530 train_time: 5.0m tok/s: 7457002 +2871/20000 train_loss: 2.4594 train_time: 5.0m tok/s: 7456165 +2872/20000 train_loss: 3.0441 train_time: 5.0m tok/s: 7455283 +2873/20000 train_loss: 2.4747 train_time: 5.1m tok/s: 7454311 +2874/20000 train_loss: 2.6147 train_time: 5.1m tok/s: 7453433 +2875/20000 train_loss: 2.5346 train_time: 5.1m tok/s: 7452612 +2876/20000 train_loss: 2.5927 train_time: 5.1m tok/s: 7451797 +2877/20000 train_loss: 2.5201 train_time: 5.1m tok/s: 7450965 +2878/20000 train_loss: 2.4987 train_time: 5.1m tok/s: 7450113 +2879/20000 train_loss: 2.5352 train_time: 5.1m tok/s: 7449275 +2880/20000 train_loss: 2.5453 train_time: 5.1m tok/s: 7448453 +2881/20000 train_loss: 2.6013 train_time: 5.1m tok/s: 7447634 +2882/20000 train_loss: 2.6852 train_time: 5.1m tok/s: 7446821 +2883/20000 train_loss: 2.6457 train_time: 5.1m tok/s: 7445941 +2884/20000 train_loss: 2.6037 train_time: 5.1m tok/s: 7445141 +2885/20000 train_loss: 2.5567 train_time: 5.1m tok/s: 7444308 +2886/20000 train_loss: 2.5517 train_time: 5.1m tok/s: 7443504 +2887/20000 train_loss: 2.5273 train_time: 5.1m tok/s: 7442633 +2888/20000 train_loss: 2.6248 train_time: 5.1m tok/s: 
7441752 +2889/20000 train_loss: 2.5761 train_time: 5.1m tok/s: 7440932 +2890/20000 train_loss: 2.5906 train_time: 5.1m tok/s: 7440097 +2891/20000 train_loss: 2.5308 train_time: 5.1m tok/s: 7439285 +2892/20000 train_loss: 2.4684 train_time: 5.1m tok/s: 7438463 +2893/20000 train_loss: 2.3144 train_time: 5.1m tok/s: 7437626 +2894/20000 train_loss: 2.5563 train_time: 5.1m tok/s: 7436797 +2895/20000 train_loss: 2.5231 train_time: 5.1m tok/s: 7435954 +2896/20000 train_loss: 2.4963 train_time: 5.1m tok/s: 7435147 +2897/20000 train_loss: 2.5760 train_time: 5.1m tok/s: 7434315 +2898/20000 train_loss: 2.5603 train_time: 5.1m tok/s: 7433512 +2899/20000 train_loss: 2.5608 train_time: 5.1m tok/s: 7432718 +2900/20000 train_loss: 2.6062 train_time: 5.1m tok/s: 7431854 +2901/20000 train_loss: 2.4184 train_time: 5.1m tok/s: 7430990 +2902/20000 train_loss: 2.5358 train_time: 5.1m tok/s: 7430115 +2903/20000 train_loss: 2.4659 train_time: 5.1m tok/s: 7429267 +2904/20000 train_loss: 2.4623 train_time: 5.1m tok/s: 7428451 +2905/20000 train_loss: 2.6116 train_time: 5.1m tok/s: 7427645 +2906/20000 train_loss: 2.4171 train_time: 5.1m tok/s: 7426851 +2907/20000 train_loss: 2.4593 train_time: 5.1m tok/s: 7426063 +2908/20000 train_loss: 2.5220 train_time: 5.1m tok/s: 7425249 +2909/20000 train_loss: 2.5184 train_time: 5.1m tok/s: 7424417 +2910/20000 train_loss: 2.3961 train_time: 5.1m tok/s: 7423598 +2911/20000 train_loss: 2.5949 train_time: 5.1m tok/s: 7422750 +2912/20000 train_loss: 2.5438 train_time: 5.1m tok/s: 7421888 +2913/20000 train_loss: 2.5726 train_time: 5.1m tok/s: 7421041 +2914/20000 train_loss: 2.6261 train_time: 5.1m tok/s: 7420237 +2915/20000 train_loss: 2.5355 train_time: 5.1m tok/s: 7419473 +2916/20000 train_loss: 2.4732 train_time: 5.2m tok/s: 7418603 +2917/20000 train_loss: 2.5245 train_time: 5.2m tok/s: 7417802 +2918/20000 train_loss: 2.8647 train_time: 5.2m tok/s: 7417009 +2919/20000 train_loss: 2.6991 train_time: 5.2m tok/s: 7416163 +2920/20000 train_loss: 2.4767 
train_time: 5.2m tok/s: 7415367 +2921/20000 train_loss: 2.4263 train_time: 5.2m tok/s: 7414571 +2922/20000 train_loss: 2.4648 train_time: 5.2m tok/s: 7413783 +2923/20000 train_loss: 2.4600 train_time: 5.2m tok/s: 7412990 +2924/20000 train_loss: 2.5753 train_time: 5.2m tok/s: 7412185 +2925/20000 train_loss: 2.6242 train_time: 5.2m tok/s: 7411360 +2926/20000 train_loss: 2.4980 train_time: 5.2m tok/s: 7410516 +2927/20000 train_loss: 2.5172 train_time: 5.2m tok/s: 7409737 +2928/20000 train_loss: 2.3561 train_time: 5.2m tok/s: 7408927 +2929/20000 train_loss: 2.5576 train_time: 5.2m tok/s: 7408137 +2930/20000 train_loss: 2.4514 train_time: 5.2m tok/s: 7407332 +2931/20000 train_loss: 2.4456 train_time: 5.2m tok/s: 7406551 +2932/20000 train_loss: 2.4277 train_time: 5.2m tok/s: 7405755 +2933/20000 train_loss: 2.5200 train_time: 5.2m tok/s: 7404956 +2934/20000 train_loss: 2.5360 train_time: 5.2m tok/s: 7404146 +2935/20000 train_loss: 2.5263 train_time: 5.2m tok/s: 7403343 +2936/20000 train_loss: 2.5245 train_time: 5.2m tok/s: 7402548 +2937/20000 train_loss: 2.6517 train_time: 5.2m tok/s: 7401770 +2938/20000 train_loss: 2.5971 train_time: 5.2m tok/s: 7400950 +2939/20000 train_loss: 2.5770 train_time: 5.2m tok/s: 7400171 +2940/20000 train_loss: 2.3900 train_time: 5.2m tok/s: 7399364 +2941/20000 train_loss: 2.6217 train_time: 5.2m tok/s: 7398532 +2942/20000 train_loss: 2.5345 train_time: 5.2m tok/s: 7397735 +2943/20000 train_loss: 2.4248 train_time: 5.2m tok/s: 7396938 +2944/20000 train_loss: 2.4162 train_time: 5.2m tok/s: 7396134 +2945/20000 train_loss: 2.5160 train_time: 5.2m tok/s: 7395311 +2946/20000 train_loss: 2.4909 train_time: 5.2m tok/s: 7394511 +2947/20000 train_loss: 2.4964 train_time: 5.2m tok/s: 7393760 +2948/20000 train_loss: 2.4783 train_time: 5.2m tok/s: 7392968 +2949/20000 train_loss: 2.5432 train_time: 5.2m tok/s: 7392189 +2950/20000 train_loss: 2.6427 train_time: 5.2m tok/s: 7391369 +2951/20000 train_loss: 2.7275 train_time: 5.2m tok/s: 7390570 +2952/20000 
train_loss: 2.5588 train_time: 5.2m tok/s: 7389773 +2953/20000 train_loss: 2.5044 train_time: 5.2m tok/s: 7388962 +2954/20000 train_loss: 2.4811 train_time: 5.2m tok/s: 7388193 +2955/20000 train_loss: 2.4678 train_time: 5.2m tok/s: 7387392 +2956/20000 train_loss: 2.4613 train_time: 5.2m tok/s: 7386589 +2957/20000 train_loss: 2.7206 train_time: 5.2m tok/s: 7385746 +2958/20000 train_loss: 2.4343 train_time: 5.2m tok/s: 7384983 +2959/20000 train_loss: 2.4326 train_time: 5.3m tok/s: 7384242 +2960/20000 train_loss: 2.4329 train_time: 5.3m tok/s: 7383438 +2961/20000 train_loss: 2.5396 train_time: 5.3m tok/s: 7382666 +2962/20000 train_loss: 2.4772 train_time: 5.3m tok/s: 7381934 +2963/20000 train_loss: 2.6926 train_time: 5.3m tok/s: 7381146 +2964/20000 train_loss: 2.5856 train_time: 5.3m tok/s: 7380377 +2965/20000 train_loss: 2.7039 train_time: 5.3m tok/s: 7379599 +2966/20000 train_loss: 2.5839 train_time: 5.3m tok/s: 7378820 +2967/20000 train_loss: 2.5507 train_time: 5.3m tok/s: 7378037 +2968/20000 train_loss: 2.4838 train_time: 5.3m tok/s: 7377281 +2969/20000 train_loss: 2.6632 train_time: 5.3m tok/s: 7376502 +2970/20000 train_loss: 2.4759 train_time: 5.3m tok/s: 7375745 +2971/20000 train_loss: 2.4893 train_time: 5.3m tok/s: 7374976 +2972/20000 train_loss: 2.5255 train_time: 5.3m tok/s: 7374207 +2973/20000 train_loss: 2.4442 train_time: 5.3m tok/s: 7373402 +2974/20000 train_loss: 2.5844 train_time: 5.3m tok/s: 7372598 +2975/20000 train_loss: 2.5875 train_time: 5.3m tok/s: 7371829 +2976/20000 train_loss: 2.5492 train_time: 5.3m tok/s: 7371062 +2977/20000 train_loss: 2.4883 train_time: 5.3m tok/s: 7370289 +2978/20000 train_loss: 2.5807 train_time: 5.3m tok/s: 7369515 +2979/20000 train_loss: 2.4757 train_time: 5.3m tok/s: 7368738 +2980/20000 train_loss: 2.5893 train_time: 5.3m tok/s: 7367954 +2981/20000 train_loss: 2.3945 train_time: 5.3m tok/s: 7367170 +2982/20000 train_loss: 2.6209 train_time: 5.3m tok/s: 7366419 +2983/20000 train_loss: 2.4142 train_time: 5.3m tok/s: 
7365649 +2984/20000 train_loss: 2.4483 train_time: 5.3m tok/s: 7364879 +2985/20000 train_loss: 2.4481 train_time: 5.3m tok/s: 7364107 +2986/20000 train_loss: 2.5753 train_time: 5.3m tok/s: 7363347 +2987/20000 train_loss: 2.4943 train_time: 5.3m tok/s: 7362518 +2988/20000 train_loss: 2.5556 train_time: 5.3m tok/s: 7361774 +2989/20000 train_loss: 2.4368 train_time: 5.3m tok/s: 7361001 +2990/20000 train_loss: 2.7006 train_time: 5.3m tok/s: 7360249 +2991/20000 train_loss: 2.4850 train_time: 5.3m tok/s: 7359474 +2992/20000 train_loss: 2.5980 train_time: 5.3m tok/s: 7358720 +2993/20000 train_loss: 2.6530 train_time: 5.3m tok/s: 7357932 +2994/20000 train_loss: 2.4379 train_time: 5.3m tok/s: 7357171 +2995/20000 train_loss: 2.6419 train_time: 5.3m tok/s: 7356411 +2996/20000 train_loss: 2.4306 train_time: 5.3m tok/s: 7355607 +2997/20000 train_loss: 2.5860 train_time: 5.3m tok/s: 7354846 +2998/20000 train_loss: 2.4835 train_time: 5.3m tok/s: 7354104 +2999/20000 train_loss: 2.4688 train_time: 5.3m tok/s: 7353316 +3000/20000 train_loss: 2.4910 train_time: 5.3m tok/s: 7352563 +3001/20000 train_loss: 2.5481 train_time: 5.4m tok/s: 7351792 +3002/20000 train_loss: 2.5073 train_time: 5.4m tok/s: 7351044 +3003/20000 train_loss: 2.5382 train_time: 5.4m tok/s: 7350319 +3004/20000 train_loss: 2.4663 train_time: 5.4m tok/s: 7349602 +3005/20000 train_loss: 2.5445 train_time: 5.4m tok/s: 7348826 +3006/20000 train_loss: 2.7572 train_time: 5.4m tok/s: 7348059 +3007/20000 train_loss: 2.5085 train_time: 5.4m tok/s: 7347315 +3008/20000 train_loss: 2.4961 train_time: 5.4m tok/s: 7346578 +3009/20000 train_loss: 2.4721 train_time: 5.4m tok/s: 7345851 +3010/20000 train_loss: 2.4112 train_time: 5.4m tok/s: 7345053 +3011/20000 train_loss: 2.5774 train_time: 5.4m tok/s: 7344268 +3012/20000 train_loss: 2.5540 train_time: 5.4m tok/s: 7343542 +3013/20000 train_loss: 2.5201 train_time: 5.4m tok/s: 7342813 +3014/20000 train_loss: 2.5624 train_time: 5.4m tok/s: 7342104 +3015/20000 train_loss: 2.6600 
train_time: 5.4m tok/s: 7341363 +3016/20000 train_loss: 2.6838 train_time: 5.4m tok/s: 7340591 +3017/20000 train_loss: 2.4709 train_time: 5.4m tok/s: 7339870 +3018/20000 train_loss: 2.6229 train_time: 5.4m tok/s: 7339126 +3019/20000 train_loss: 2.4574 train_time: 5.4m tok/s: 7338430 +3020/20000 train_loss: 3.1341 train_time: 5.4m tok/s: 7337576 +3021/20000 train_loss: 2.4725 train_time: 5.4m tok/s: 7336810 +3022/20000 train_loss: 2.4524 train_time: 5.4m tok/s: 7336083 +3023/20000 train_loss: 2.5751 train_time: 5.4m tok/s: 7335332 +3024/20000 train_loss: 3.4321 train_time: 5.4m tok/s: 7334504 +3025/20000 train_loss: 2.4420 train_time: 5.4m tok/s: 7333726 +3026/20000 train_loss: 2.5321 train_time: 5.4m tok/s: 7332912 +3027/20000 train_loss: 2.5684 train_time: 5.4m tok/s: 7332191 +3028/20000 train_loss: 2.6509 train_time: 5.4m tok/s: 7331419 +3029/20000 train_loss: 2.7250 train_time: 5.4m tok/s: 7330696 +3030/20000 train_loss: 2.5691 train_time: 5.4m tok/s: 7329974 +3031/20000 train_loss: 2.5147 train_time: 5.4m tok/s: 7329258 +3032/20000 train_loss: 2.5441 train_time: 5.4m tok/s: 7328519 +3033/20000 train_loss: 2.5689 train_time: 5.4m tok/s: 7327776 +3034/20000 train_loss: 2.4821 train_time: 5.4m tok/s: 7327054 +3035/20000 train_loss: 2.3035 train_time: 5.4m tok/s: 7326333 +3036/20000 train_loss: 2.5735 train_time: 5.4m tok/s: 7325615 +3037/20000 train_loss: 2.4892 train_time: 5.4m tok/s: 7324893 +3038/20000 train_loss: 2.5277 train_time: 5.4m tok/s: 7324190 +3039/20000 train_loss: 2.5216 train_time: 5.4m tok/s: 7323467 +3040/20000 train_loss: 2.4157 train_time: 5.4m tok/s: 7322724 +3041/20000 train_loss: 2.6344 train_time: 5.4m tok/s: 7322025 +3042/20000 train_loss: 2.5782 train_time: 5.4m tok/s: 7321264 +3043/20000 train_loss: 2.6451 train_time: 5.4m tok/s: 7320505 +3044/20000 train_loss: 2.5921 train_time: 5.5m tok/s: 7319825 +3045/20000 train_loss: 2.5755 train_time: 5.5m tok/s: 7319140 +3046/20000 train_loss: 2.5585 train_time: 5.5m tok/s: 7318439 +3047/20000 
train_loss: 2.3909 train_time: 5.5m tok/s: 7317709 +3048/20000 train_loss: 2.3599 train_time: 5.5m tok/s: 7316998 +3049/20000 train_loss: 2.5108 train_time: 5.5m tok/s: 7316264 +3050/20000 train_loss: 2.5964 train_time: 5.5m tok/s: 7315567 +3051/20000 train_loss: 2.3615 train_time: 5.5m tok/s: 7314879 +3052/20000 train_loss: 2.4196 train_time: 5.5m tok/s: 7314188 +3053/20000 train_loss: 2.6077 train_time: 5.5m tok/s: 7313421 +3054/20000 train_loss: 2.4234 train_time: 5.5m tok/s: 7312701 +3055/20000 train_loss: 2.5414 train_time: 5.5m tok/s: 7312006 +3056/20000 train_loss: 2.5463 train_time: 5.5m tok/s: 7311296 +3057/20000 train_loss: 2.5077 train_time: 5.5m tok/s: 7310552 +3058/20000 train_loss: 2.5489 train_time: 5.5m tok/s: 7309851 +3059/20000 train_loss: 2.4642 train_time: 5.5m tok/s: 7309139 +3060/20000 train_loss: 2.5681 train_time: 5.5m tok/s: 7308443 +3061/20000 train_loss: 2.4408 train_time: 5.5m tok/s: 7307704 +3062/20000 train_loss: 2.5951 train_time: 5.5m tok/s: 7306979 +3063/20000 train_loss: 2.5443 train_time: 5.5m tok/s: 7306269 +3064/20000 train_loss: 2.4400 train_time: 5.5m tok/s: 7305527 +3065/20000 train_loss: 2.4089 train_time: 5.5m tok/s: 7304831 +3066/20000 train_loss: 2.6754 train_time: 5.5m tok/s: 7304083 +3067/20000 train_loss: 2.4550 train_time: 5.5m tok/s: 7303389 +3068/20000 train_loss: 2.5991 train_time: 5.5m tok/s: 7302653 +3069/20000 train_loss: 2.3979 train_time: 5.5m tok/s: 7301963 +3070/20000 train_loss: 2.5563 train_time: 5.5m tok/s: 7301228 +3071/20000 train_loss: 2.5080 train_time: 5.5m tok/s: 7300511 +3072/20000 train_loss: 2.5430 train_time: 5.5m tok/s: 7299817 +3073/20000 train_loss: 2.6080 train_time: 5.5m tok/s: 7299120 +3074/20000 train_loss: 2.4285 train_time: 5.5m tok/s: 7298413 +3075/20000 train_loss: 2.4121 train_time: 5.5m tok/s: 7297678 +3076/20000 train_loss: 2.4750 train_time: 5.5m tok/s: 7296962 +3077/20000 train_loss: 2.5299 train_time: 5.5m tok/s: 7296299 +3078/20000 train_loss: 2.3929 train_time: 5.5m tok/s: 
7295579 +3079/20000 train_loss: 2.4371 train_time: 5.5m tok/s: 7294869 +3080/20000 train_loss: 3.2236 train_time: 5.5m tok/s: 7294127 +3081/20000 train_loss: 2.3849 train_time: 5.5m tok/s: 7293397 +3082/20000 train_loss: 2.4461 train_time: 5.5m tok/s: 7292693 +3083/20000 train_loss: 2.4673 train_time: 5.5m tok/s: 7292014 +3084/20000 train_loss: 2.4699 train_time: 5.5m tok/s: 7291309 +3085/20000 train_loss: 2.5320 train_time: 5.5m tok/s: 7290616 +3086/20000 train_loss: 2.5731 train_time: 5.5m tok/s: 7289920 +3087/20000 train_loss: 2.6157 train_time: 5.6m tok/s: 7289231 +3088/20000 train_loss: 2.5752 train_time: 5.6m tok/s: 7288538 +3089/20000 train_loss: 2.4487 train_time: 5.6m tok/s: 7287860 +3090/20000 train_loss: 2.6988 train_time: 5.6m tok/s: 7287142 +3091/20000 train_loss: 2.4483 train_time: 5.6m tok/s: 7286433 +3092/20000 train_loss: 2.4843 train_time: 5.6m tok/s: 7285720 +3093/20000 train_loss: 2.5263 train_time: 5.6m tok/s: 7285012 +3094/20000 train_loss: 2.4780 train_time: 5.6m tok/s: 7284305 +3095/20000 train_loss: 2.3340 train_time: 5.6m tok/s: 7283599 +3096/20000 train_loss: 2.5115 train_time: 5.6m tok/s: 7282904 +3097/20000 train_loss: 2.5503 train_time: 5.6m tok/s: 7282202 +3098/20000 train_loss: 2.4662 train_time: 5.6m tok/s: 7281499 +3099/20000 train_loss: 2.3239 train_time: 5.6m tok/s: 7280809 +3100/20000 train_loss: 2.4148 train_time: 5.6m tok/s: 7280113 +3101/20000 train_loss: 2.6671 train_time: 5.6m tok/s: 7279421 +3102/20000 train_loss: 2.6549 train_time: 5.6m tok/s: 7278724 +3103/20000 train_loss: 2.4874 train_time: 5.6m tok/s: 7278065 +3104/20000 train_loss: 2.5857 train_time: 5.6m tok/s: 7277382 +3105/20000 train_loss: 2.3868 train_time: 5.6m tok/s: 7276684 +3106/20000 train_loss: 2.5920 train_time: 5.6m tok/s: 7275999 +3107/20000 train_loss: 2.3398 train_time: 5.6m tok/s: 7275282 +3108/20000 train_loss: 2.4710 train_time: 5.6m tok/s: 7274591 +3109/20000 train_loss: 2.5622 train_time: 5.6m tok/s: 7273933 +3110/20000 train_loss: 2.4228 
train_time: 5.6m tok/s: 7273249 +3111/20000 train_loss: 2.4309 train_time: 5.6m tok/s: 7272552 +3112/20000 train_loss: 2.3898 train_time: 5.6m tok/s: 7271859 +3113/20000 train_loss: 2.5276 train_time: 5.6m tok/s: 7271173 +3114/20000 train_loss: 2.5402 train_time: 5.6m tok/s: 7270471 +3115/20000 train_loss: 2.5592 train_time: 5.6m tok/s: 7269823 +3116/20000 train_loss: 2.5762 train_time: 5.6m tok/s: 7269104 +3117/20000 train_loss: 2.6043 train_time: 5.6m tok/s: 7268408 +3118/20000 train_loss: 2.5363 train_time: 5.6m tok/s: 7267721 +3119/20000 train_loss: 2.5745 train_time: 5.6m tok/s: 7267037 +3120/20000 train_loss: 2.5553 train_time: 5.6m tok/s: 7266384 +3121/20000 train_loss: 2.5326 train_time: 5.6m tok/s: 7265675 +3122/20000 train_loss: 2.5508 train_time: 5.6m tok/s: 7264972 +3123/20000 train_loss: 2.4311 train_time: 5.6m tok/s: 7264299 +3124/20000 train_loss: 2.5347 train_time: 5.6m tok/s: 7263610 +3125/20000 train_loss: 2.4327 train_time: 5.6m tok/s: 7262916 +3126/20000 train_loss: 2.1315 train_time: 5.6m tok/s: 7262216 +3127/20000 train_loss: 2.6029 train_time: 5.6m tok/s: 7261566 +3128/20000 train_loss: 2.5056 train_time: 5.6m tok/s: 7260858 +3129/20000 train_loss: 2.5332 train_time: 5.6m tok/s: 7260171 +3130/20000 train_loss: 2.4766 train_time: 5.7m tok/s: 7259491 +3131/20000 train_loss: 2.4446 train_time: 5.7m tok/s: 7258832 +3132/20000 train_loss: 2.5666 train_time: 5.7m tok/s: 7258143 +3133/20000 train_loss: 2.5172 train_time: 5.7m tok/s: 7257458 +3134/20000 train_loss: 2.5504 train_time: 5.7m tok/s: 7256808 +3135/20000 train_loss: 2.5078 train_time: 5.7m tok/s: 7256145 +3136/20000 train_loss: 2.5677 train_time: 5.7m tok/s: 7255469 +3137/20000 train_loss: 2.6063 train_time: 5.7m tok/s: 7254789 +3138/20000 train_loss: 2.4454 train_time: 5.7m tok/s: 7254099 +3139/20000 train_loss: 2.4518 train_time: 5.7m tok/s: 7253430 +3140/20000 train_loss: 2.4270 train_time: 5.7m tok/s: 7252754 +3141/20000 train_loss: 2.5722 train_time: 5.7m tok/s: 7252087 +3142/20000 
train_loss: 2.5458 train_time: 5.7m tok/s: 7251406 +3143/20000 train_loss: 2.5484 train_time: 5.7m tok/s: 7250757 +3144/20000 train_loss: 2.5466 train_time: 5.7m tok/s: 7250080 +3145/20000 train_loss: 2.0733 train_time: 5.7m tok/s: 7249352 +3146/20000 train_loss: 2.4543 train_time: 5.7m tok/s: 7248698 +3147/20000 train_loss: 2.5500 train_time: 5.7m tok/s: 7248043 +3148/20000 train_loss: 2.4830 train_time: 5.7m tok/s: 7247369 +3149/20000 train_loss: 2.5957 train_time: 5.7m tok/s: 7246683 +3150/20000 train_loss: 2.3853 train_time: 5.7m tok/s: 7246026 +3151/20000 train_loss: 2.4066 train_time: 5.7m tok/s: 7245364 +3152/20000 train_loss: 2.4724 train_time: 5.7m tok/s: 7244713 +3153/20000 train_loss: 2.5220 train_time: 5.7m tok/s: 7244042 +3154/20000 train_loss: 2.3784 train_time: 5.7m tok/s: 7243392 +3155/20000 train_loss: 2.4782 train_time: 5.7m tok/s: 7242730 +3156/20000 train_loss: 2.5891 train_time: 5.7m tok/s: 7242051 +3157/20000 train_loss: 2.5949 train_time: 5.7m tok/s: 7241441 +3158/20000 train_loss: 2.6398 train_time: 5.7m tok/s: 7240731 +3159/20000 train_loss: 2.5245 train_time: 5.7m tok/s: 7240080 +3160/20000 train_loss: 2.3664 train_time: 5.7m tok/s: 7239385 +3161/20000 train_loss: 2.5357 train_time: 5.7m tok/s: 7238741 +3162/20000 train_loss: 2.6225 train_time: 5.7m tok/s: 7238103 +3163/20000 train_loss: 2.5484 train_time: 5.7m tok/s: 7237464 +3164/20000 train_loss: 2.3755 train_time: 5.7m tok/s: 7236774 +3165/20000 train_loss: 2.5073 train_time: 5.7m tok/s: 7236141 +3166/20000 train_loss: 2.4204 train_time: 5.7m tok/s: 7235487 +3167/20000 train_loss: 2.5266 train_time: 5.7m tok/s: 7234852 +3168/20000 train_loss: 2.5155 train_time: 5.7m tok/s: 7234204 +3169/20000 train_loss: 2.4900 train_time: 5.7m tok/s: 7233552 +3170/20000 train_loss: 2.6200 train_time: 5.7m tok/s: 7232809 +3171/20000 train_loss: 2.6528 train_time: 5.7m tok/s: 7232179 +3172/20000 train_loss: 2.6575 train_time: 5.7m tok/s: 7231540 +3173/20000 train_loss: 2.7155 train_time: 5.8m tok/s: 
7230866 +3174/20000 train_loss: 2.4698 train_time: 5.8m tok/s: 7230210 +3175/20000 train_loss: 2.6154 train_time: 5.8m tok/s: 7229564 +3176/20000 train_loss: 2.4464 train_time: 5.8m tok/s: 7228911 +3177/20000 train_loss: 2.4533 train_time: 5.8m tok/s: 7228272 +3178/20000 train_loss: 2.5972 train_time: 5.8m tok/s: 7227560 +3179/20000 train_loss: 2.4327 train_time: 5.8m tok/s: 7226835 +3180/20000 train_loss: 2.5261 train_time: 5.8m tok/s: 7226177 +3181/20000 train_loss: 2.3762 train_time: 5.8m tok/s: 7225529 +3182/20000 train_loss: 2.8114 train_time: 5.8m tok/s: 7224864 +3183/20000 train_loss: 2.6582 train_time: 5.8m tok/s: 7224215 +3184/20000 train_loss: 2.5071 train_time: 5.8m tok/s: 7223519 +3185/20000 train_loss: 2.5786 train_time: 5.8m tok/s: 7222899 +3186/20000 train_loss: 2.5297 train_time: 5.8m tok/s: 7222250 +3187/20000 train_loss: 2.5748 train_time: 5.8m tok/s: 7221593 +3188/20000 train_loss: 2.3938 train_time: 5.8m tok/s: 7220947 +3189/20000 train_loss: 2.5958 train_time: 5.8m tok/s: 7220319 +3190/20000 train_loss: 2.5750 train_time: 5.8m tok/s: 7219707 +3191/20000 train_loss: 2.4806 train_time: 5.8m tok/s: 7219044 +3192/20000 train_loss: 2.3769 train_time: 5.8m tok/s: 7218417 +3193/20000 train_loss: 2.4639 train_time: 5.8m tok/s: 7217822 +3194/20000 train_loss: 2.4482 train_time: 5.8m tok/s: 7217211 +3195/20000 train_loss: 2.5032 train_time: 5.8m tok/s: 7216595 +3196/20000 train_loss: 2.3916 train_time: 5.8m tok/s: 7215944 +3197/20000 train_loss: 2.4719 train_time: 5.8m tok/s: 7215305 +3198/20000 train_loss: 2.4935 train_time: 5.8m tok/s: 7214696 +3199/20000 train_loss: 2.5080 train_time: 5.8m tok/s: 7214055 +3200/20000 train_loss: 2.6319 train_time: 5.8m tok/s: 7213394 +3201/20000 train_loss: 2.5733 train_time: 5.8m tok/s: 7212775 +3202/20000 train_loss: 2.2688 train_time: 5.8m tok/s: 7212077 +3203/20000 train_loss: 2.5471 train_time: 5.8m tok/s: 7211414 +3204/20000 train_loss: 2.4490 train_time: 5.8m tok/s: 7210790 +3205/20000 train_loss: 2.5025 
train_time: 5.8m tok/s: 7210142 +3206/20000 train_loss: 2.4460 train_time: 5.8m tok/s: 7209533 +3207/20000 train_loss: 2.5566 train_time: 5.8m tok/s: 7208896 +3208/20000 train_loss: 2.5849 train_time: 5.8m tok/s: 7208286 +3209/20000 train_loss: 2.4561 train_time: 5.8m tok/s: 7207655 +3210/20000 train_loss: 2.7169 train_time: 5.8m tok/s: 7207002 +3211/20000 train_loss: 2.5140 train_time: 5.8m tok/s: 7206399 +3212/20000 train_loss: 2.4080 train_time: 5.8m tok/s: 7205737 +3213/20000 train_loss: 2.5585 train_time: 5.8m tok/s: 7205100 +3214/20000 train_loss: 2.4162 train_time: 5.8m tok/s: 7204425 +3215/20000 train_loss: 2.4349 train_time: 5.8m tok/s: 7203791 +3216/20000 train_loss: 2.3248 train_time: 5.9m tok/s: 7203150 +3217/20000 train_loss: 2.4588 train_time: 5.9m tok/s: 7202517 +3218/20000 train_loss: 2.5018 train_time: 5.9m tok/s: 7201918 +3219/20000 train_loss: 2.5663 train_time: 5.9m tok/s: 7201334 +3220/20000 train_loss: 2.6261 train_time: 5.9m tok/s: 7200700 +3221/20000 train_loss: 2.4277 train_time: 5.9m tok/s: 7200076 +3222/20000 train_loss: 2.5270 train_time: 5.9m tok/s: 7199458 +3223/20000 train_loss: 2.6328 train_time: 5.9m tok/s: 7198812 +3224/20000 train_loss: 2.9696 train_time: 5.9m tok/s: 7198147 +3225/20000 train_loss: 2.5827 train_time: 5.9m tok/s: 7197498 +3226/20000 train_loss: 2.4072 train_time: 5.9m tok/s: 7196885 +3227/20000 train_loss: 2.4828 train_time: 5.9m tok/s: 7196257 +3228/20000 train_loss: 2.8144 train_time: 5.9m tok/s: 7195649 +3229/20000 train_loss: 2.4080 train_time: 5.9m tok/s: 7195048 +3230/20000 train_loss: 2.4469 train_time: 5.9m tok/s: 7194427 +3231/20000 train_loss: 2.5547 train_time: 5.9m tok/s: 7193829 +3232/20000 train_loss: 2.5359 train_time: 5.9m tok/s: 7193211 +3233/20000 train_loss: 2.5243 train_time: 5.9m tok/s: 7192577 +3234/20000 train_loss: 2.5957 train_time: 5.9m tok/s: 7191933 +3235/20000 train_loss: 2.5274 train_time: 5.9m tok/s: 7191276 +3236/20000 train_loss: 2.2806 train_time: 5.9m tok/s: 7190655 +3237/20000 
train_loss: 2.4702 train_time: 5.9m tok/s: 7190067 +3238/20000 train_loss: 2.3593 train_time: 5.9m tok/s: 7189441 +3239/20000 train_loss: 2.3982 train_time: 5.9m tok/s: 7188791 +3240/20000 train_loss: 2.4979 train_time: 5.9m tok/s: 7188214 +3241/20000 train_loss: 2.4965 train_time: 5.9m tok/s: 7187577 +3242/20000 train_loss: 2.5329 train_time: 5.9m tok/s: 7186969 +3243/20000 train_loss: 2.4478 train_time: 5.9m tok/s: 7186342 +3244/20000 train_loss: 2.5607 train_time: 5.9m tok/s: 7185723 +3245/20000 train_loss: 2.6326 train_time: 5.9m tok/s: 7185112 +3246/20000 train_loss: 2.5152 train_time: 5.9m tok/s: 7184498 +3247/20000 train_loss: 2.4113 train_time: 5.9m tok/s: 7183872 +3248/20000 train_loss: 2.4674 train_time: 5.9m tok/s: 7183254 +3249/20000 train_loss: 2.6019 train_time: 5.9m tok/s: 7182647 +3250/20000 train_loss: 2.5531 train_time: 5.9m tok/s: 7182019 +3251/20000 train_loss: 2.5894 train_time: 5.9m tok/s: 7181373 +3252/20000 train_loss: 2.4510 train_time: 5.9m tok/s: 7180737 +3253/20000 train_loss: 2.4436 train_time: 5.9m tok/s: 7180136 +3254/20000 train_loss: 2.9140 train_time: 5.9m tok/s: 7179480 +3255/20000 train_loss: 2.4613 train_time: 5.9m tok/s: 7178865 +3256/20000 train_loss: 2.4998 train_time: 5.9m tok/s: 7178272 +3257/20000 train_loss: 2.5791 train_time: 5.9m tok/s: 7177656 +3258/20000 train_loss: 2.5393 train_time: 5.9m tok/s: 7177069 +3259/20000 train_loss: 2.4912 train_time: 6.0m tok/s: 7176462 +3260/20000 train_loss: 2.4839 train_time: 6.0m tok/s: 7175872 +3261/20000 train_loss: 2.5252 train_time: 6.0m tok/s: 7175268 +3262/20000 train_loss: 2.4041 train_time: 6.0m tok/s: 7174646 +3263/20000 train_loss: 2.4686 train_time: 6.0m tok/s: 7174030 +3264/20000 train_loss: 2.5176 train_time: 6.0m tok/s: 7173418 +3265/20000 train_loss: 2.5348 train_time: 6.0m tok/s: 7172844 +3266/20000 train_loss: 2.5435 train_time: 6.0m tok/s: 7172221 +3267/20000 train_loss: 2.4903 train_time: 6.0m tok/s: 7171637 +3268/20000 train_loss: 2.5135 train_time: 6.0m tok/s: 
7171029 +3269/20000 train_loss: 2.6222 train_time: 6.0m tok/s: 7170416 +3270/20000 train_loss: 2.4711 train_time: 6.0m tok/s: 7169808 +3271/20000 train_loss: 2.5545 train_time: 6.0m tok/s: 7169164 +3272/20000 train_loss: 2.5210 train_time: 6.0m tok/s: 7168571 +3273/20000 train_loss: 2.6738 train_time: 6.0m tok/s: 7167973 +3274/20000 train_loss: 2.4765 train_time: 6.0m tok/s: 7167375 +3275/20000 train_loss: 2.5261 train_time: 6.0m tok/s: 7166769 +3276/20000 train_loss: 2.5031 train_time: 6.0m tok/s: 7166192 +3277/20000 train_loss: 2.5093 train_time: 6.0m tok/s: 7165513 +3278/20000 train_loss: 2.4019 train_time: 6.0m tok/s: 7164922 +3279/20000 train_loss: 2.4698 train_time: 6.0m tok/s: 7164344 +3280/20000 train_loss: 2.3864 train_time: 6.0m tok/s: 7163743 +3281/20000 train_loss: 2.4274 train_time: 6.0m tok/s: 7163090 +3282/20000 train_loss: 2.4951 train_time: 6.0m tok/s: 7162458 +3283/20000 train_loss: 2.4298 train_time: 6.0m tok/s: 7161869 +3284/20000 train_loss: 2.6399 train_time: 6.0m tok/s: 7161301 +3285/20000 train_loss: 2.4914 train_time: 6.0m tok/s: 7160687 +3286/20000 train_loss: 2.4456 train_time: 6.0m tok/s: 7160108 +3287/20000 train_loss: 2.5139 train_time: 6.0m tok/s: 7159540 +3288/20000 train_loss: 2.5090 train_time: 6.0m tok/s: 7158957 +3289/20000 train_loss: 2.4853 train_time: 6.0m tok/s: 7158381 +3290/20000 train_loss: 2.4588 train_time: 6.0m tok/s: 7157789 +3291/20000 train_loss: 2.4476 train_time: 6.0m tok/s: 7157177 +3292/20000 train_loss: 2.4681 train_time: 6.0m tok/s: 7156593 +3293/20000 train_loss: 2.4596 train_time: 6.0m tok/s: 7155978 +3294/20000 train_loss: 2.4311 train_time: 6.0m tok/s: 7155383 +3295/20000 train_loss: 2.5229 train_time: 6.0m tok/s: 7154786 +3296/20000 train_loss: 2.3949 train_time: 6.0m tok/s: 7154173 +3297/20000 train_loss: 2.4086 train_time: 6.0m tok/s: 7153584 +3298/20000 train_loss: 2.4851 train_time: 6.0m tok/s: 7152979 +3299/20000 train_loss: 2.3040 train_time: 6.0m tok/s: 7152326 +3300/20000 train_loss: 2.4643 
train_time: 6.0m tok/s: 7151739 +3301/20000 train_loss: 2.4312 train_time: 6.1m tok/s: 7151164 +3302/20000 train_loss: 2.2267 train_time: 6.1m tok/s: 7150502 +3303/20000 train_loss: 2.6561 train_time: 6.1m tok/s: 7149912 +3304/20000 train_loss: 2.4991 train_time: 6.1m tok/s: 7149353 +3305/20000 train_loss: 2.5643 train_time: 6.1m tok/s: 7148786 +3306/20000 train_loss: 2.6168 train_time: 6.1m tok/s: 7148218 +3307/20000 train_loss: 2.5239 train_time: 6.1m tok/s: 7147612 +3308/20000 train_loss: 2.2484 train_time: 6.1m tok/s: 7147013 +3309/20000 train_loss: 2.4725 train_time: 6.1m tok/s: 7146428 +3310/20000 train_loss: 2.5375 train_time: 6.1m tok/s: 7145887 +3311/20000 train_loss: 2.6626 train_time: 6.1m tok/s: 7145264 +3312/20000 train_loss: 2.5294 train_time: 6.1m tok/s: 7144690 +3313/20000 train_loss: 2.3781 train_time: 6.1m tok/s: 7144096 +3314/20000 train_loss: 2.4964 train_time: 6.1m tok/s: 7143535 +3315/20000 train_loss: 2.5214 train_time: 6.1m tok/s: 7142971 +3316/20000 train_loss: 2.5076 train_time: 6.1m tok/s: 7142407 +3317/20000 train_loss: 2.4246 train_time: 6.1m tok/s: 7141786 +3318/20000 train_loss: 2.3986 train_time: 6.1m tok/s: 7141234 +3319/20000 train_loss: 2.4014 train_time: 6.1m tok/s: 7140639 +3320/20000 train_loss: 2.4225 train_time: 6.1m tok/s: 7140038 +3321/20000 train_loss: 2.4471 train_time: 6.1m tok/s: 7139463 +3322/20000 train_loss: 2.4012 train_time: 6.1m tok/s: 7138914 +3323/20000 train_loss: 2.4101 train_time: 6.1m tok/s: 7138310 +3324/20000 train_loss: 2.4538 train_time: 6.1m tok/s: 7137720 +3325/20000 train_loss: 2.5599 train_time: 6.1m tok/s: 7137165 +3326/20000 train_loss: 2.5093 train_time: 6.1m tok/s: 7136611 +3327/20000 train_loss: 2.4268 train_time: 6.1m tok/s: 7136043 +3328/20000 train_loss: 2.5720 train_time: 6.1m tok/s: 7135404 +3329/20000 train_loss: 2.4641 train_time: 6.1m tok/s: 7134849 +3330/20000 train_loss: 2.8003 train_time: 6.1m tok/s: 7134277 +3331/20000 train_loss: 2.6203 train_time: 6.1m tok/s: 7133688 +3332/20000 
train_loss: 2.5303 train_time: 6.1m tok/s: 7133104 +3333/20000 train_loss: 2.3874 train_time: 6.1m tok/s: 7132560 +3334/20000 train_loss: 2.4707 train_time: 6.1m tok/s: 7131967 +3335/20000 train_loss: 2.5692 train_time: 6.1m tok/s: 7131360 +3336/20000 train_loss: 2.5179 train_time: 6.1m tok/s: 7130803 +3337/20000 train_loss: 2.2973 train_time: 6.1m tok/s: 7130255 +3338/20000 train_loss: 2.4692 train_time: 6.1m tok/s: 7129694 +3339/20000 train_loss: 2.4135 train_time: 6.1m tok/s: 7129108 +3340/20000 train_loss: 2.4346 train_time: 6.1m tok/s: 7128524 +3341/20000 train_loss: 2.4493 train_time: 6.1m tok/s: 7127943 +3342/20000 train_loss: 2.3855 train_time: 6.1m tok/s: 7127351 +3343/20000 train_loss: 2.3888 train_time: 6.1m tok/s: 7126775 +3344/20000 train_loss: 2.5460 train_time: 6.2m tok/s: 7126219 +3345/20000 train_loss: 2.4660 train_time: 6.2m tok/s: 7125671 +3346/20000 train_loss: 2.4869 train_time: 6.2m tok/s: 7125064 +3347/20000 train_loss: 2.5283 train_time: 6.2m tok/s: 7124500 +3348/20000 train_loss: 2.5826 train_time: 6.2m tok/s: 7123922 +3349/20000 train_loss: 2.4954 train_time: 6.2m tok/s: 7123354 +3350/20000 train_loss: 2.4553 train_time: 6.2m tok/s: 7122792 +3351/20000 train_loss: 2.4838 train_time: 6.2m tok/s: 7122250 +3352/20000 train_loss: 2.4949 train_time: 6.2m tok/s: 7121651 +3353/20000 train_loss: 2.4361 train_time: 6.2m tok/s: 7121051 +3354/20000 train_loss: 2.5244 train_time: 6.2m tok/s: 7120513 +3355/20000 train_loss: 2.5670 train_time: 6.2m tok/s: 7119936 +3356/20000 train_loss: 2.3559 train_time: 6.2m tok/s: 7119301 +3357/20000 train_loss: 2.3591 train_time: 6.2m tok/s: 7118738 +3358/20000 train_loss: 2.5033 train_time: 6.2m tok/s: 7118142 +3359/20000 train_loss: 2.4424 train_time: 6.2m tok/s: 7117567 +3360/20000 train_loss: 2.4578 train_time: 6.2m tok/s: 7117023 +3361/20000 train_loss: 2.5192 train_time: 6.2m tok/s: 7116488 +3362/20000 train_loss: 2.4917 train_time: 6.2m tok/s: 7115937 +3363/20000 train_loss: 2.4232 train_time: 6.2m tok/s: 
7115355 +3364/20000 train_loss: 2.5068 train_time: 6.2m tok/s: 7114776 +3365/20000 train_loss: 2.5160 train_time: 6.2m tok/s: 7114230 +3366/20000 train_loss: 2.3836 train_time: 6.2m tok/s: 7113659 +3367/20000 train_loss: 2.4243 train_time: 6.2m tok/s: 7113102 +3368/20000 train_loss: 2.3775 train_time: 6.2m tok/s: 7112560 +3369/20000 train_loss: 2.6125 train_time: 6.2m tok/s: 7111950 +3370/20000 train_loss: 2.5861 train_time: 6.2m tok/s: 7111382 +3371/20000 train_loss: 2.5039 train_time: 6.2m tok/s: 7110813 +3372/20000 train_loss: 2.5780 train_time: 6.2m tok/s: 7110286 +3373/20000 train_loss: 2.4672 train_time: 6.2m tok/s: 7109729 +3374/20000 train_loss: 2.4684 train_time: 6.2m tok/s: 7109198 +3375/20000 train_loss: 2.4781 train_time: 6.2m tok/s: 7108621 +3376/20000 train_loss: 2.4213 train_time: 6.2m tok/s: 7108063 +3377/20000 train_loss: 2.5551 train_time: 6.2m tok/s: 7107535 +3378/20000 train_loss: 2.3389 train_time: 6.2m tok/s: 7106949 +3379/20000 train_loss: 2.4134 train_time: 6.2m tok/s: 7106416 +3380/20000 train_loss: 2.3614 train_time: 6.2m tok/s: 7105823 +3381/20000 train_loss: 2.3320 train_time: 6.2m tok/s: 7105251 +3382/20000 train_loss: 2.5785 train_time: 6.2m tok/s: 7104731 +3383/20000 train_loss: 2.4970 train_time: 6.2m tok/s: 7104203 +3384/20000 train_loss: 2.4833 train_time: 6.2m tok/s: 7103617 +3385/20000 train_loss: 2.4777 train_time: 6.2m tok/s: 7103088 +3386/20000 train_loss: 2.4953 train_time: 6.2m tok/s: 7102556 +3387/20000 train_loss: 2.5231 train_time: 6.3m tok/s: 7101997 +3388/20000 train_loss: 2.2916 train_time: 6.3m tok/s: 7101352 +3389/20000 train_loss: 2.4252 train_time: 6.3m tok/s: 7100831 +3390/20000 train_loss: 2.5006 train_time: 6.3m tok/s: 7100304 +3391/20000 train_loss: 2.4990 train_time: 6.3m tok/s: 7099739 +3392/20000 train_loss: 2.4614 train_time: 6.3m tok/s: 7099229 +3393/20000 train_loss: 2.4516 train_time: 6.3m tok/s: 7098700 +3394/20000 train_loss: 2.4593 train_time: 6.3m tok/s: 7098162 +3395/20000 train_loss: 2.4620 
train_time: 6.3m tok/s: 7097598 +3396/20000 train_loss: 2.6108 train_time: 6.3m tok/s: 7097043 +3397/20000 train_loss: 2.5258 train_time: 6.3m tok/s: 7096456 +3398/20000 train_loss: 2.3687 train_time: 6.3m tok/s: 7095907 +3399/20000 train_loss: 2.4296 train_time: 6.3m tok/s: 7095333 +3400/20000 train_loss: 2.5297 train_time: 6.3m tok/s: 7094799 +3401/20000 train_loss: 2.4085 train_time: 6.3m tok/s: 7094255 +3402/20000 train_loss: 2.4478 train_time: 6.3m tok/s: 7093724 +3403/20000 train_loss: 2.4860 train_time: 6.3m tok/s: 7093185 +3404/20000 train_loss: 2.4490 train_time: 6.3m tok/s: 7092659 +3405/20000 train_loss: 2.6275 train_time: 6.3m tok/s: 7092111 +3406/20000 train_loss: 2.4747 train_time: 6.3m tok/s: 7091571 +3407/20000 train_loss: 2.5055 train_time: 6.3m tok/s: 7091019 +3408/20000 train_loss: 2.5730 train_time: 6.3m tok/s: 7090461 +3409/20000 train_loss: 2.4158 train_time: 6.3m tok/s: 7089914 +3410/20000 train_loss: 2.4194 train_time: 6.3m tok/s: 7089351 +3411/20000 train_loss: 2.3473 train_time: 6.3m tok/s: 7088811 +3412/20000 train_loss: 2.3464 train_time: 6.3m tok/s: 7088264 +3413/20000 train_loss: 2.3379 train_time: 6.3m tok/s: 7087713 +3414/20000 train_loss: 2.4425 train_time: 6.3m tok/s: 7087184 +3415/20000 train_loss: 2.5941 train_time: 6.3m tok/s: 7086640 +3416/20000 train_loss: 2.5030 train_time: 6.3m tok/s: 7086091 +3417/20000 train_loss: 2.5701 train_time: 6.3m tok/s: 7085559 +3418/20000 train_loss: 2.5053 train_time: 6.3m tok/s: 7085045 +3419/20000 train_loss: 2.5307 train_time: 6.3m tok/s: 7084484 +3420/20000 train_loss: 2.5385 train_time: 6.3m tok/s: 7083920 +3421/20000 train_loss: 2.3727 train_time: 6.3m tok/s: 7083383 +3422/20000 train_loss: 2.6885 train_time: 6.3m tok/s: 7082811 +3423/20000 train_loss: 2.4024 train_time: 6.3m tok/s: 7082286 +3424/20000 train_loss: 2.4562 train_time: 6.3m tok/s: 7081774 +3425/20000 train_loss: 2.4284 train_time: 6.3m tok/s: 7081225 +3426/20000 train_loss: 2.5062 train_time: 6.3m tok/s: 7080716 +3427/20000 
train_loss: 2.4589 train_time: 6.3m tok/s: 7080181 +3428/20000 train_loss: 2.4567 train_time: 6.3m tok/s: 7079633 +3429/20000 train_loss: 2.5972 train_time: 6.3m tok/s: 7079082 +3430/20000 train_loss: 2.4168 train_time: 6.4m tok/s: 7078544 +3431/20000 train_loss: 2.4671 train_time: 6.4m tok/s: 7078022 +3432/20000 train_loss: 2.5937 train_time: 6.4m tok/s: 7077475 +3433/20000 train_loss: 2.4587 train_time: 6.4m tok/s: 7076954 +3434/20000 train_loss: 2.5114 train_time: 6.4m tok/s: 7076424 +3435/20000 train_loss: 2.4381 train_time: 6.4m tok/s: 7075842 +3436/20000 train_loss: 2.4113 train_time: 6.4m tok/s: 7075291 +3437/20000 train_loss: 2.5533 train_time: 6.4m tok/s: 7074782 +3438/20000 train_loss: 2.4879 train_time: 6.4m tok/s: 7074260 +3439/20000 train_loss: 2.3297 train_time: 6.4m tok/s: 7073717 +3440/20000 train_loss: 2.4784 train_time: 6.4m tok/s: 7073175 +3441/20000 train_loss: 2.3797 train_time: 6.4m tok/s: 7072663 +3442/20000 train_loss: 2.4074 train_time: 6.4m tok/s: 7072127 +3443/20000 train_loss: 2.6583 train_time: 6.4m tok/s: 7071540 +3444/20000 train_loss: 2.3317 train_time: 6.4m tok/s: 7071019 +3445/20000 train_loss: 2.4636 train_time: 6.4m tok/s: 7070546 +3446/20000 train_loss: 2.5295 train_time: 6.4m tok/s: 7070023 +3447/20000 train_loss: 2.4870 train_time: 6.4m tok/s: 7069495 +3448/20000 train_loss: 2.6063 train_time: 6.4m tok/s: 7068987 +3449/20000 train_loss: 2.4730 train_time: 6.4m tok/s: 7068444 +3450/20000 train_loss: 2.4293 train_time: 6.4m tok/s: 7067907 +3451/20000 train_loss: 2.4922 train_time: 6.4m tok/s: 7067378 +3452/20000 train_loss: 2.5205 train_time: 6.4m tok/s: 7066868 +3453/20000 train_loss: 2.4038 train_time: 6.4m tok/s: 7066318 +3454/20000 train_loss: 2.5209 train_time: 6.4m tok/s: 7065796 +3455/20000 train_loss: 2.4832 train_time: 6.4m tok/s: 7065274 +3456/20000 train_loss: 2.4816 train_time: 6.4m tok/s: 7064727 +3457/20000 train_loss: 2.4435 train_time: 6.4m tok/s: 7064176 +3458/20000 train_loss: 2.4001 train_time: 6.4m tok/s: 
7063654 +3459/20000 train_loss: 2.3661 train_time: 6.4m tok/s: 7063121 +3460/20000 train_loss: 2.5216 train_time: 6.4m tok/s: 7062618 +3461/20000 train_loss: 2.6264 train_time: 6.4m tok/s: 7062092 +3462/20000 train_loss: 2.5513 train_time: 6.4m tok/s: 7061593 +3463/20000 train_loss: 2.5733 train_time: 6.4m tok/s: 7061064 +3464/20000 train_loss: 2.4782 train_time: 6.4m tok/s: 7060538 +3465/20000 train_loss: 2.5323 train_time: 6.4m tok/s: 7059999 +3466/20000 train_loss: 2.5888 train_time: 6.4m tok/s: 7059468 +3467/20000 train_loss: 2.4455 train_time: 6.4m tok/s: 7058953 +3468/20000 train_loss: 2.6316 train_time: 6.4m tok/s: 7058396 +3469/20000 train_loss: 2.3935 train_time: 6.4m tok/s: 7057856 +3470/20000 train_loss: 2.4013 train_time: 6.4m tok/s: 7057329 +3471/20000 train_loss: 2.3913 train_time: 6.4m tok/s: 7056805 +3472/20000 train_loss: 2.4370 train_time: 6.4m tok/s: 7056276 +3473/20000 train_loss: 2.3988 train_time: 6.5m tok/s: 7055747 +3474/20000 train_loss: 2.4715 train_time: 6.5m tok/s: 7055260 +3475/20000 train_loss: 2.5200 train_time: 6.5m tok/s: 7054755 +3476/20000 train_loss: 2.4559 train_time: 6.5m tok/s: 7054250 +3477/20000 train_loss: 2.5328 train_time: 6.5m tok/s: 7053734 +3478/20000 train_loss: 2.5566 train_time: 6.5m tok/s: 7053195 +3479/20000 train_loss: 2.4568 train_time: 6.5m tok/s: 7052700 +3480/20000 train_loss: 2.5240 train_time: 6.5m tok/s: 7052200 +3481/20000 train_loss: 2.4842 train_time: 6.5m tok/s: 7051673 +3482/20000 train_loss: 2.4284 train_time: 6.5m tok/s: 7051168 +3483/20000 train_loss: 2.4176 train_time: 6.5m tok/s: 7050637 +3484/20000 train_loss: 2.4657 train_time: 6.5m tok/s: 7050086 +3485/20000 train_loss: 2.3998 train_time: 6.5m tok/s: 7049549 +3486/20000 train_loss: 2.3827 train_time: 6.5m tok/s: 7049057 +3487/20000 train_loss: 2.4728 train_time: 6.5m tok/s: 7048542 +3488/20000 train_loss: 2.3670 train_time: 6.5m tok/s: 7048057 +3489/20000 train_loss: 2.3847 train_time: 6.5m tok/s: 7047537 +3490/20000 train_loss: 2.4831 
train_time: 6.5m tok/s: 7047030 +3491/20000 train_loss: 2.5851 train_time: 6.5m tok/s: 7046516 +3492/20000 train_loss: 2.4249 train_time: 6.5m tok/s: 7045986 +3493/20000 train_loss: 2.5518 train_time: 6.5m tok/s: 7045491 +3494/20000 train_loss: 2.5255 train_time: 6.5m tok/s: 7044975 +3495/20000 train_loss: 2.4897 train_time: 6.5m tok/s: 7044496 +3496/20000 train_loss: 2.5902 train_time: 6.5m tok/s: 7043916 +3497/20000 train_loss: 2.6833 train_time: 6.5m tok/s: 7043438 +3498/20000 train_loss: 2.3501 train_time: 6.5m tok/s: 7042879 +3499/20000 train_loss: 2.4567 train_time: 6.5m tok/s: 7042375 +3500/20000 train_loss: 2.3682 train_time: 6.5m tok/s: 7041903 +3501/20000 train_loss: 2.4153 train_time: 6.5m tok/s: 7041388 +3502/20000 train_loss: 3.2599 train_time: 6.5m tok/s: 7040836 +3503/20000 train_loss: 2.4594 train_time: 6.5m tok/s: 7040330 +3504/20000 train_loss: 2.5537 train_time: 6.5m tok/s: 7039846 +3505/20000 train_loss: 2.5006 train_time: 6.5m tok/s: 7039289 +3506/20000 train_loss: 2.5611 train_time: 6.5m tok/s: 7038771 +3507/20000 train_loss: 2.5285 train_time: 6.5m tok/s: 7038304 +3508/20000 train_loss: 2.6195 train_time: 6.5m tok/s: 7037787 +3509/20000 train_loss: 2.4367 train_time: 6.5m tok/s: 7037262 +3510/20000 train_loss: 2.4319 train_time: 6.5m tok/s: 7036765 +3511/20000 train_loss: 2.4160 train_time: 6.5m tok/s: 7036246 +3512/20000 train_loss: 2.4662 train_time: 6.5m tok/s: 7035743 +3513/20000 train_loss: 2.4145 train_time: 6.5m tok/s: 7035254 +3514/20000 train_loss: 2.5389 train_time: 6.5m tok/s: 7034763 +3515/20000 train_loss: 2.3609 train_time: 6.5m tok/s: 7034264 +3516/20000 train_loss: 2.2683 train_time: 6.6m tok/s: 7033734 +3517/20000 train_loss: 2.4475 train_time: 6.6m tok/s: 7033238 +3518/20000 train_loss: 2.4162 train_time: 6.6m tok/s: 7032767 +3519/20000 train_loss: 2.8153 train_time: 6.6m tok/s: 7032237 +3520/20000 train_loss: 2.6448 train_time: 6.6m tok/s: 7031710 +3521/20000 train_loss: 2.4962 train_time: 6.6m tok/s: 7031201 +3522/20000 
train_loss: 2.5119 train_time: 6.6m tok/s: 7030709 +3523/20000 train_loss: 2.3845 train_time: 6.6m tok/s: 7030199 +3524/20000 train_loss: 2.7599 train_time: 6.6m tok/s: 7029682 +3525/20000 train_loss: 2.5223 train_time: 6.6m tok/s: 7029211 +3526/20000 train_loss: 2.4859 train_time: 6.6m tok/s: 7028701 +3527/20000 train_loss: 2.4165 train_time: 6.6m tok/s: 7028202 +3528/20000 train_loss: 2.4586 train_time: 6.6m tok/s: 7027720 +3529/20000 train_loss: 2.4526 train_time: 6.6m tok/s: 7027232 +3530/20000 train_loss: 2.6725 train_time: 6.6m tok/s: 7026726 +3531/20000 train_loss: 2.2993 train_time: 6.6m tok/s: 7026221 +3532/20000 train_loss: 2.2768 train_time: 6.6m tok/s: 7025697 +3533/20000 train_loss: 2.4383 train_time: 6.6m tok/s: 7025002 +3534/20000 train_loss: 2.4678 train_time: 6.6m tok/s: 7024536 +3535/20000 train_loss: 2.4056 train_time: 6.6m tok/s: 7024014 +3536/20000 train_loss: 2.4690 train_time: 6.6m tok/s: 7023391 +3537/20000 train_loss: 2.6030 train_time: 6.6m tok/s: 7022864 +3538/20000 train_loss: 2.4436 train_time: 6.6m tok/s: 7022272 +3539/20000 train_loss: 2.4769 train_time: 6.6m tok/s: 7021738 +3540/20000 train_loss: 2.4926 train_time: 6.6m tok/s: 7021164 +3541/20000 train_loss: 2.2931 train_time: 6.6m tok/s: 7020608 +3542/20000 train_loss: 2.4074 train_time: 6.6m tok/s: 7020027 +3543/20000 train_loss: 2.5149 train_time: 6.6m tok/s: 7019562 +3544/20000 train_loss: 2.4237 train_time: 6.6m tok/s: 7018873 +3545/20000 train_loss: 2.4581 train_time: 6.6m tok/s: 7018412 +3546/20000 train_loss: 2.3633 train_time: 6.6m tok/s: 7017700 +3547/20000 train_loss: 2.3366 train_time: 6.6m tok/s: 7017244 +3548/20000 train_loss: 2.5778 train_time: 6.6m tok/s: 7016578 +3549/20000 train_loss: 2.5466 train_time: 6.6m tok/s: 7016138 +3550/20000 train_loss: 2.5509 train_time: 6.6m tok/s: 7015662 +3551/20000 train_loss: 2.5730 train_time: 6.6m tok/s: 7015190 +3552/20000 train_loss: 2.4716 train_time: 6.6m tok/s: 7014720 +3553/20000 train_loss: 2.5847 train_time: 6.6m tok/s: 
7014227 +3554/20000 train_loss: 2.5000 train_time: 6.6m tok/s: 7013750 +3555/20000 train_loss: 2.5148 train_time: 6.6m tok/s: 7013256 +3556/20000 train_loss: 2.5248 train_time: 6.6m tok/s: 7012807 +3557/20000 train_loss: 2.4403 train_time: 6.6m tok/s: 7012328 +3558/20000 train_loss: 2.6190 train_time: 6.7m tok/s: 7011850 +3559/20000 train_loss: 2.4920 train_time: 6.7m tok/s: 7011356 +3560/20000 train_loss: 2.4479 train_time: 6.7m tok/s: 7010888 +3561/20000 train_loss: 3.1562 train_time: 6.7m tok/s: 7010368 +3562/20000 train_loss: 2.3816 train_time: 6.7m tok/s: 7009896 +3563/20000 train_loss: 2.4772 train_time: 6.7m tok/s: 7009422 +3564/20000 train_loss: 2.4788 train_time: 6.7m tok/s: 7008941 +3565/20000 train_loss: 2.4716 train_time: 6.7m tok/s: 7008459 +3566/20000 train_loss: 2.4677 train_time: 6.7m tok/s: 7007963 +3567/20000 train_loss: 2.5263 train_time: 6.7m tok/s: 7007477 +3568/20000 train_loss: 2.5329 train_time: 6.7m tok/s: 7006972 +3569/20000 train_loss: 2.3260 train_time: 6.7m tok/s: 7006464 +3570/20000 train_loss: 2.2992 train_time: 6.7m tok/s: 7005985 +3571/20000 train_loss: 2.3864 train_time: 6.7m tok/s: 7005505 +3572/20000 train_loss: 2.3967 train_time: 6.7m tok/s: 7005031 +3573/20000 train_loss: 2.2703 train_time: 6.7m tok/s: 7004518 +3574/20000 train_loss: 2.4184 train_time: 6.7m tok/s: 7004011 +3575/20000 train_loss: 2.5051 train_time: 6.7m tok/s: 7003535 +3576/20000 train_loss: 2.5069 train_time: 6.7m tok/s: 7003076 +3577/20000 train_loss: 2.4878 train_time: 6.7m tok/s: 7002593 +3578/20000 train_loss: 2.5614 train_time: 6.7m tok/s: 7002136 +3579/20000 train_loss: 2.5191 train_time: 6.7m tok/s: 7001654 +3580/20000 train_loss: 2.5254 train_time: 6.7m tok/s: 7001178 +3581/20000 train_loss: 2.5076 train_time: 6.7m tok/s: 7000690 +3582/20000 train_loss: 2.3763 train_time: 6.7m tok/s: 7000217 +3583/20000 train_loss: 2.4122 train_time: 6.7m tok/s: 6999715 +3584/20000 train_loss: 2.3555 train_time: 6.7m tok/s: 6999209 +3585/20000 train_loss: 2.3406 
train_time: 6.7m tok/s: 6998711 +3586/20000 train_loss: 2.1784 train_time: 6.7m tok/s: 6998187 +3587/20000 train_loss: 2.3308 train_time: 6.7m tok/s: 6997721 +3588/20000 train_loss: 2.3832 train_time: 6.7m tok/s: 6997259 +3589/20000 train_loss: 2.6153 train_time: 6.7m tok/s: 6996764 +3590/20000 train_loss: 2.5114 train_time: 6.7m tok/s: 6996291 +3591/20000 train_loss: 2.5289 train_time: 6.7m tok/s: 6995814 +3592/20000 train_loss: 2.5170 train_time: 6.7m tok/s: 6995363 +3593/20000 train_loss: 2.5708 train_time: 6.7m tok/s: 6994895 +3594/20000 train_loss: 2.4897 train_time: 6.7m tok/s: 6994433 +3595/20000 train_loss: 2.5366 train_time: 6.7m tok/s: 6993964 +3596/20000 train_loss: 2.5132 train_time: 6.7m tok/s: 6993506 +3597/20000 train_loss: 2.5473 train_time: 6.7m tok/s: 6993024 +3598/20000 train_loss: 2.2728 train_time: 6.7m tok/s: 6992533 +3599/20000 train_loss: 2.4737 train_time: 6.7m tok/s: 6992074 +3600/20000 train_loss: 2.5595 train_time: 6.7m tok/s: 6991623 +3601/20000 train_loss: 2.4287 train_time: 6.8m tok/s: 6991145 +3602/20000 train_loss: 2.3174 train_time: 6.8m tok/s: 6990648 +3603/20000 train_loss: 2.5491 train_time: 6.8m tok/s: 6990192 +3604/20000 train_loss: 2.5779 train_time: 6.8m tok/s: 6989732 +3605/20000 train_loss: 2.4442 train_time: 6.8m tok/s: 6989263 +3606/20000 train_loss: 2.4501 train_time: 6.8m tok/s: 6988799 +3607/20000 train_loss: 2.5761 train_time: 6.8m tok/s: 6988315 +3608/20000 train_loss: 2.4193 train_time: 6.8m tok/s: 6987846 +3609/20000 train_loss: 2.4301 train_time: 6.8m tok/s: 6987365 +3610/20000 train_loss: 2.5454 train_time: 6.8m tok/s: 6986886 +3611/20000 train_loss: 2.5322 train_time: 6.8m tok/s: 6986417 +3612/20000 train_loss: 2.4075 train_time: 6.8m tok/s: 6985955 +3613/20000 train_loss: 2.4461 train_time: 6.8m tok/s: 6985501 +3614/20000 train_loss: 2.5438 train_time: 6.8m tok/s: 6985030 +3615/20000 train_loss: 2.3721 train_time: 6.8m tok/s: 6984570 +3616/20000 train_loss: 2.4741 train_time: 6.8m tok/s: 6984084 +3617/20000 
train_loss: 2.4307 train_time: 6.8m tok/s: 6983586 +3618/20000 train_loss: 2.3723 train_time: 6.8m tok/s: 6983121 +3619/20000 train_loss: 2.6418 train_time: 6.8m tok/s: 6982652 +3620/20000 train_loss: 2.3826 train_time: 6.8m tok/s: 6982213 +3621/20000 train_loss: 2.5042 train_time: 6.8m tok/s: 6981724 +3622/20000 train_loss: 2.5257 train_time: 6.8m tok/s: 6981222 +3623/20000 train_loss: 2.5347 train_time: 6.8m tok/s: 6980726 +3624/20000 train_loss: 2.5901 train_time: 6.8m tok/s: 6980260 +3625/20000 train_loss: 2.4405 train_time: 6.8m tok/s: 6979807 +3626/20000 train_loss: 2.4035 train_time: 6.8m tok/s: 6979365 +3627/20000 train_loss: 2.3802 train_time: 6.8m tok/s: 6978926 +3628/20000 train_loss: 2.3959 train_time: 6.8m tok/s: 6978443 +3629/20000 train_loss: 2.5293 train_time: 6.8m tok/s: 6977970 +3630/20000 train_loss: 2.5424 train_time: 6.8m tok/s: 6977517 +3631/20000 train_loss: 2.5378 train_time: 6.8m tok/s: 6977061 +3632/20000 train_loss: 2.5817 train_time: 6.8m tok/s: 6976582 +3633/20000 train_loss: 2.3959 train_time: 6.8m tok/s: 6976115 +3634/20000 train_loss: 2.5141 train_time: 6.8m tok/s: 6975672 +3635/20000 train_loss: 2.4993 train_time: 6.8m tok/s: 6975215 +3636/20000 train_loss: 2.4864 train_time: 6.8m tok/s: 6974735 +3637/20000 train_loss: 2.4451 train_time: 6.8m tok/s: 6974255 +3638/20000 train_loss: 2.4271 train_time: 6.8m tok/s: 6973815 +3639/20000 train_loss: 2.4557 train_time: 6.8m tok/s: 6973341 +3640/20000 train_loss: 2.3966 train_time: 6.8m tok/s: 6972872 +3641/20000 train_loss: 2.3847 train_time: 6.8m tok/s: 6972415 +3642/20000 train_loss: 2.2956 train_time: 6.8m tok/s: 6971934 +3643/20000 train_loss: 2.3865 train_time: 6.8m tok/s: 6971495 +3644/20000 train_loss: 2.1038 train_time: 6.9m tok/s: 6971007 +3645/20000 train_loss: 2.5196 train_time: 6.9m tok/s: 6970553 +3646/20000 train_loss: 2.5336 train_time: 6.9m tok/s: 6970130 +3647/20000 train_loss: 2.5356 train_time: 6.9m tok/s: 6969670 +3648/20000 train_loss: 2.4820 train_time: 6.9m tok/s: 
6969193 +3649/20000 train_loss: 2.4504 train_time: 6.9m tok/s: 6968763 +3650/20000 train_loss: 2.6565 train_time: 6.9m tok/s: 6968270 +3651/20000 train_loss: 2.5580 train_time: 6.9m tok/s: 6967809 +3652/20000 train_loss: 2.4339 train_time: 6.9m tok/s: 6967354 +3653/20000 train_loss: 2.4448 train_time: 6.9m tok/s: 6966920 +3654/20000 train_loss: 2.3828 train_time: 6.9m tok/s: 6966464 +3655/20000 train_loss: 2.4094 train_time: 6.9m tok/s: 6965980 +3656/20000 train_loss: 2.3837 train_time: 6.9m tok/s: 6965469 +3657/20000 train_loss: 2.4286 train_time: 6.9m tok/s: 6965030 +3658/20000 train_loss: 2.5608 train_time: 6.9m tok/s: 6964598 +3659/20000 train_loss: 2.4329 train_time: 6.9m tok/s: 6964143 +3660/20000 train_loss: 2.3413 train_time: 6.9m tok/s: 6963697 +3661/20000 train_loss: 2.4467 train_time: 6.9m tok/s: 6963233 +3662/20000 train_loss: 2.4647 train_time: 6.9m tok/s: 6962762 +3663/20000 train_loss: 2.0448 train_time: 6.9m tok/s: 6962290 +3664/20000 train_loss: 2.6475 train_time: 6.9m tok/s: 6961834 +3665/20000 train_loss: 2.4911 train_time: 6.9m tok/s: 6961399 +3666/20000 train_loss: 2.4420 train_time: 6.9m tok/s: 6960947 +3667/20000 train_loss: 2.3211 train_time: 6.9m tok/s: 6960485 +3668/20000 train_loss: 2.2289 train_time: 6.9m tok/s: 6960006 +3669/20000 train_loss: 2.3880 train_time: 6.9m tok/s: 6959508 +3670/20000 train_loss: 2.4242 train_time: 6.9m tok/s: 6959054 +3671/20000 train_loss: 2.4320 train_time: 6.9m tok/s: 6958629 +3672/20000 train_loss: 2.5434 train_time: 6.9m tok/s: 6958178 +3673/20000 train_loss: 2.4553 train_time: 6.9m tok/s: 6957764 +3674/20000 train_loss: 2.5423 train_time: 6.9m tok/s: 6957327 +3675/20000 train_loss: 2.5777 train_time: 6.9m tok/s: 6956882 +3676/20000 train_loss: 2.4721 train_time: 6.9m tok/s: 6956421 +3677/20000 train_loss: 2.5358 train_time: 6.9m tok/s: 6955983 +3678/20000 train_loss: 2.4734 train_time: 6.9m tok/s: 6955523 +3679/20000 train_loss: 2.5738 train_time: 6.9m tok/s: 6955042 +3680/20000 train_loss: 2.4719 
train_time: 6.9m tok/s: 6954615 +3681/20000 train_loss: 2.5368 train_time: 6.9m tok/s: 6954206 +3682/20000 train_loss: 2.3521 train_time: 6.9m tok/s: 6953735 +3683/20000 train_loss: 2.5740 train_time: 6.9m tok/s: 6953283 +3684/20000 train_loss: 2.4052 train_time: 6.9m tok/s: 6952837 +3685/20000 train_loss: 2.6505 train_time: 6.9m tok/s: 6952367 +3686/20000 train_loss: 2.4956 train_time: 6.9m tok/s: 6951912 +3687/20000 train_loss: 2.5683 train_time: 7.0m tok/s: 6951481 +3688/20000 train_loss: 2.4755 train_time: 7.0m tok/s: 6951036 +3689/20000 train_loss: 2.5449 train_time: 7.0m tok/s: 6950587 +3690/20000 train_loss: 2.6014 train_time: 7.0m tok/s: 6950138 +3691/20000 train_loss: 2.5556 train_time: 7.0m tok/s: 6949724 +3692/20000 train_loss: 2.4947 train_time: 7.0m tok/s: 6949256 +3693/20000 train_loss: 2.5070 train_time: 7.0m tok/s: 6948824 +3694/20000 train_loss: 2.5052 train_time: 7.0m tok/s: 6948369 +3695/20000 train_loss: 2.4278 train_time: 7.0m tok/s: 6947937 +3696/20000 train_loss: 2.4108 train_time: 7.0m tok/s: 6947501 +3697/20000 train_loss: 2.4250 train_time: 7.0m tok/s: 6947031 +3698/20000 train_loss: 2.4746 train_time: 7.0m tok/s: 6946600 +3699/20000 train_loss: 2.4925 train_time: 7.0m tok/s: 6946163 +3700/20000 train_loss: 2.6317 train_time: 7.0m tok/s: 6945694 +3701/20000 train_loss: 2.5312 train_time: 7.0m tok/s: 6945272 +3702/20000 train_loss: 2.6084 train_time: 7.0m tok/s: 6944822 +3703/20000 train_loss: 2.5411 train_time: 7.0m tok/s: 6944349 +3704/20000 train_loss: 2.8528 train_time: 7.0m tok/s: 6943885 +3705/20000 train_loss: 2.4411 train_time: 7.0m tok/s: 6943463 +3706/20000 train_loss: 2.4623 train_time: 7.0m tok/s: 6943026 +3707/20000 train_loss: 2.4565 train_time: 7.0m tok/s: 6942586 +3708/20000 train_loss: 2.4034 train_time: 7.0m tok/s: 6942150 +3709/20000 train_loss: 2.5407 train_time: 7.0m tok/s: 6941726 +3710/20000 train_loss: 2.4037 train_time: 7.0m tok/s: 6941273 +3711/20000 train_loss: 2.5435 train_time: 7.0m tok/s: 6940828 +3712/20000 
train_loss: 2.4546 train_time: 7.0m tok/s: 6940413 +3713/20000 train_loss: 2.4640 train_time: 7.0m tok/s: 6939979 +3714/20000 train_loss: 2.4312 train_time: 7.0m tok/s: 6939539 +3715/20000 train_loss: 2.4224 train_time: 7.0m tok/s: 6939092 +3716/20000 train_loss: 2.4183 train_time: 7.0m tok/s: 6938640 +3717/20000 train_loss: 2.3532 train_time: 7.0m tok/s: 6938178 +3718/20000 train_loss: 2.5454 train_time: 7.0m tok/s: 6937756 +3719/20000 train_loss: 2.4358 train_time: 7.0m tok/s: 6937321 +3720/20000 train_loss: 2.5067 train_time: 7.0m tok/s: 6936866 +3721/20000 train_loss: 2.5575 train_time: 7.0m tok/s: 6936426 +3722/20000 train_loss: 2.4460 train_time: 7.0m tok/s: 6935978 +3723/20000 train_loss: 2.6126 train_time: 7.0m tok/s: 6935572 +3724/20000 train_loss: 2.4237 train_time: 7.0m tok/s: 6935103 +3725/20000 train_loss: 2.4707 train_time: 7.0m tok/s: 6934697 +3726/20000 train_loss: 2.4343 train_time: 7.0m tok/s: 6934263 +3727/20000 train_loss: 2.4277 train_time: 7.0m tok/s: 6933816 +3728/20000 train_loss: 2.4447 train_time: 7.0m tok/s: 6933393 +3729/20000 train_loss: 2.5118 train_time: 7.0m tok/s: 6932972 +3730/20000 train_loss: 2.4277 train_time: 7.1m tok/s: 6932526 +3731/20000 train_loss: 2.5125 train_time: 7.1m tok/s: 6932085 +3732/20000 train_loss: 2.4335 train_time: 7.1m tok/s: 6931654 +3733/20000 train_loss: 2.5386 train_time: 7.1m tok/s: 6931227 +3734/20000 train_loss: 2.4519 train_time: 7.1m tok/s: 6930791 +3735/20000 train_loss: 2.5249 train_time: 7.1m tok/s: 6930341 +3736/20000 train_loss: 2.4268 train_time: 7.1m tok/s: 6929900 +3737/20000 train_loss: 2.4459 train_time: 7.1m tok/s: 6929460 +3738/20000 train_loss: 2.3910 train_time: 7.1m tok/s: 6929016 +3739/20000 train_loss: 2.4599 train_time: 7.1m tok/s: 6928602 +3740/20000 train_loss: 2.5322 train_time: 7.1m tok/s: 6928181 +3741/20000 train_loss: 2.3289 train_time: 7.1m tok/s: 6927708 +3742/20000 train_loss: 2.4609 train_time: 7.1m tok/s: 6927256 +3743/20000 train_loss: 2.4095 train_time: 7.1m tok/s: 
6926818 +3744/20000 train_loss: 2.4370 train_time: 7.1m tok/s: 6926400 +3745/20000 train_loss: 2.5130 train_time: 7.1m tok/s: 6925975 +3746/20000 train_loss: 2.3718 train_time: 7.1m tok/s: 6925546 +3747/20000 train_loss: 2.4353 train_time: 7.1m tok/s: 6925147 +3748/20000 train_loss: 2.5527 train_time: 7.1m tok/s: 6924716 +3749/20000 train_loss: 2.5432 train_time: 7.1m tok/s: 6924279 +3750/20000 train_loss: 2.4807 train_time: 7.1m tok/s: 6923833 +3751/20000 train_loss: 2.5811 train_time: 7.1m tok/s: 6923431 +3752/20000 train_loss: 2.4286 train_time: 7.1m tok/s: 6923021 +3753/20000 train_loss: 2.3575 train_time: 7.1m tok/s: 6922571 +3754/20000 train_loss: 2.4453 train_time: 7.1m tok/s: 6922138 +3755/20000 train_loss: 2.3945 train_time: 7.1m tok/s: 6921715 +3756/20000 train_loss: 2.4815 train_time: 7.1m tok/s: 6921303 +3757/20000 train_loss: 2.3703 train_time: 7.1m tok/s: 6920858 +3758/20000 train_loss: 2.7261 train_time: 7.1m tok/s: 6920420 +3759/20000 train_loss: 2.6071 train_time: 7.1m tok/s: 6919991 +3760/20000 train_loss: 2.5432 train_time: 7.1m tok/s: 6919573 +3761/20000 train_loss: 2.6229 train_time: 7.1m tok/s: 6919135 +3762/20000 train_loss: 2.4664 train_time: 7.1m tok/s: 6918701 +3763/20000 train_loss: 2.4939 train_time: 7.1m tok/s: 6918312 +3764/20000 train_loss: 2.4505 train_time: 7.1m tok/s: 6917888 +3765/20000 train_loss: 2.4309 train_time: 7.1m tok/s: 6917463 +3766/20000 train_loss: 2.4184 train_time: 7.1m tok/s: 6917065 +3767/20000 train_loss: 2.4756 train_time: 7.1m tok/s: 6916637 +3768/20000 train_loss: 2.4762 train_time: 7.1m tok/s: 6916194 +3769/20000 train_loss: 2.4118 train_time: 7.1m tok/s: 6915781 +3770/20000 train_loss: 2.5599 train_time: 7.1m tok/s: 6915340 +3771/20000 train_loss: 2.4483 train_time: 7.1m tok/s: 6914919 +3772/20000 train_loss: 2.5140 train_time: 7.2m tok/s: 6914503 +3773/20000 train_loss: 2.4858 train_time: 7.2m tok/s: 6914068 +3774/20000 train_loss: 2.3876 train_time: 7.2m tok/s: 6913661 +3775/20000 train_loss: 2.6272 
train_time: 7.2m tok/s: 6913254 +3776/20000 train_loss: 2.4997 train_time: 7.2m tok/s: 6912809 +3777/20000 train_loss: 2.4013 train_time: 7.2m tok/s: 6912362 +3778/20000 train_loss: 2.4642 train_time: 7.2m tok/s: 6911943 +3779/20000 train_loss: 2.4587 train_time: 7.2m tok/s: 6911531 +3780/20000 train_loss: 2.3815 train_time: 7.2m tok/s: 6911118 +3781/20000 train_loss: 2.4527 train_time: 7.2m tok/s: 6910676 +3782/20000 train_loss: 2.3815 train_time: 7.2m tok/s: 6910226 +3783/20000 train_loss: 2.4565 train_time: 7.2m tok/s: 6909810 +3784/20000 train_loss: 2.4634 train_time: 7.2m tok/s: 6909428 +3785/20000 train_loss: 2.4813 train_time: 7.2m tok/s: 6909031 +3786/20000 train_loss: 2.4997 train_time: 7.2m tok/s: 6908601 +3787/20000 train_loss: 2.5092 train_time: 7.2m tok/s: 6908188 +3788/20000 train_loss: 2.4226 train_time: 7.2m tok/s: 6907788 +3789/20000 train_loss: 2.3868 train_time: 7.2m tok/s: 6907377 +3790/20000 train_loss: 2.4455 train_time: 7.2m tok/s: 6906959 +3791/20000 train_loss: 2.3436 train_time: 7.2m tok/s: 6906525 +3792/20000 train_loss: 2.4596 train_time: 7.2m tok/s: 6906084 +3793/20000 train_loss: 2.3867 train_time: 7.2m tok/s: 6905640 +3794/20000 train_loss: 2.4807 train_time: 7.2m tok/s: 6905212 +3795/20000 train_loss: 2.4708 train_time: 7.2m tok/s: 6904825 +3796/20000 train_loss: 2.5393 train_time: 7.2m tok/s: 6904403 +3797/20000 train_loss: 2.5797 train_time: 7.2m tok/s: 6903998 +3798/20000 train_loss: 2.6871 train_time: 7.2m tok/s: 6903571 +3799/20000 train_loss: 2.4849 train_time: 7.2m tok/s: 6903143 +3800/20000 train_loss: 2.4904 train_time: 7.2m tok/s: 6902706 +3801/20000 train_loss: 2.2729 train_time: 7.2m tok/s: 6902262 +3802/20000 train_loss: 2.4865 train_time: 7.2m tok/s: 6901857 +3803/20000 train_loss: 2.3967 train_time: 7.2m tok/s: 6901460 +3804/20000 train_loss: 2.4698 train_time: 7.2m tok/s: 6901083 +3805/20000 train_loss: 2.3914 train_time: 7.2m tok/s: 6900687 +3806/20000 train_loss: 2.4776 train_time: 7.2m tok/s: 6900275 +3807/20000 
train_loss: 2.4707 train_time: 7.2m tok/s: 6899850 +3808/20000 train_loss: 2.6560 train_time: 7.2m tok/s: 6899432 +3809/20000 train_loss: 2.5676 train_time: 7.2m tok/s: 6899024 +3810/20000 train_loss: 2.4296 train_time: 7.2m tok/s: 6898590 +3811/20000 train_loss: 2.4407 train_time: 7.2m tok/s: 6898197 +3812/20000 train_loss: 2.4401 train_time: 7.2m tok/s: 6897781 +3813/20000 train_loss: 2.4568 train_time: 7.2m tok/s: 6897380 +3814/20000 train_loss: 4.3519 train_time: 7.2m tok/s: 6896911 +3815/20000 train_loss: 2.4409 train_time: 7.3m tok/s: 6896499 +3816/20000 train_loss: 2.5511 train_time: 7.3m tok/s: 6896053 +3817/20000 train_loss: 2.4939 train_time: 7.3m tok/s: 6895654 +3818/20000 train_loss: 2.5150 train_time: 7.3m tok/s: 6895265 +3819/20000 train_loss: 2.3863 train_time: 7.3m tok/s: 6894859 +3820/20000 train_loss: 2.5493 train_time: 7.3m tok/s: 6894441 +3821/20000 train_loss: 2.4242 train_time: 7.3m tok/s: 6894047 +3822/20000 train_loss: 2.4591 train_time: 7.3m tok/s: 6893643 +3823/20000 train_loss: 2.4813 train_time: 7.3m tok/s: 6893238 +3824/20000 train_loss: 2.5113 train_time: 7.3m tok/s: 6892840 +3825/20000 train_loss: 2.4801 train_time: 7.3m tok/s: 6892452 +3826/20000 train_loss: 2.5593 train_time: 7.3m tok/s: 6892029 +3827/20000 train_loss: 2.5688 train_time: 7.3m tok/s: 6891620 +3828/20000 train_loss: 2.5518 train_time: 7.3m tok/s: 6891203 +3829/20000 train_loss: 2.5110 train_time: 7.3m tok/s: 6890793 +3830/20000 train_loss: 2.5487 train_time: 7.3m tok/s: 6890390 +3831/20000 train_loss: 2.5090 train_time: 7.3m tok/s: 6889994 +3832/20000 train_loss: 2.4559 train_time: 7.3m tok/s: 6889581 +3833/20000 train_loss: 2.4874 train_time: 7.3m tok/s: 6889170 +3834/20000 train_loss: 2.4170 train_time: 7.3m tok/s: 6888768 +3835/20000 train_loss: 2.4582 train_time: 7.3m tok/s: 6888360 +3836/20000 train_loss: 2.4245 train_time: 7.3m tok/s: 6887945 +3837/20000 train_loss: 2.4431 train_time: 7.3m tok/s: 6887536 +3838/20000 train_loss: 2.4533 train_time: 7.3m tok/s: 
6887131 +3839/20000 train_loss: 2.4347 train_time: 7.3m tok/s: 6886728 +3840/20000 train_loss: 2.3931 train_time: 7.3m tok/s: 6886321 +3841/20000 train_loss: 2.5566 train_time: 7.3m tok/s: 6885923 +3842/20000 train_loss: 2.4215 train_time: 7.3m tok/s: 6885509 +3843/20000 train_loss: 2.4472 train_time: 7.3m tok/s: 6885100 +3844/20000 train_loss: 2.3906 train_time: 7.3m tok/s: 6884721 +3845/20000 train_loss: 2.5047 train_time: 7.3m tok/s: 6884321 +3846/20000 train_loss: 2.6002 train_time: 7.3m tok/s: 6883916 +3847/20000 train_loss: 2.4141 train_time: 7.3m tok/s: 6883508 +3848/20000 train_loss: 2.4408 train_time: 7.3m tok/s: 6883104 +3849/20000 train_loss: 2.5696 train_time: 7.3m tok/s: 6882693 +3850/20000 train_loss: 2.5470 train_time: 7.3m tok/s: 6882273 +3851/20000 train_loss: 2.4608 train_time: 7.3m tok/s: 6881876 +3852/20000 train_loss: 2.4417 train_time: 7.3m tok/s: 6881472 +3853/20000 train_loss: 2.3922 train_time: 7.3m tok/s: 6881073 +3854/20000 train_loss: 2.2862 train_time: 7.3m tok/s: 6880679 +3855/20000 train_loss: 2.4901 train_time: 7.3m tok/s: 6880303 +3856/20000 train_loss: 2.4145 train_time: 7.3m tok/s: 6879891 +3857/20000 train_loss: 2.2076 train_time: 7.3m tok/s: 6879441 +3858/20000 train_loss: 2.4866 train_time: 7.4m tok/s: 6879030 +3859/20000 train_loss: 2.3978 train_time: 7.4m tok/s: 6878643 +3860/20000 train_loss: 2.3328 train_time: 7.4m tok/s: 6878264 +3861/20000 train_loss: 2.4730 train_time: 7.4m tok/s: 6877881 +3862/20000 train_loss: 2.5986 train_time: 7.4m tok/s: 6877495 +3863/20000 train_loss: 2.5130 train_time: 7.4m tok/s: 6877089 +3864/20000 train_loss: 2.4652 train_time: 7.4m tok/s: 6876680 +3865/20000 train_loss: 2.4883 train_time: 7.4m tok/s: 6876312 +3866/20000 train_loss: 2.4088 train_time: 7.4m tok/s: 6875923 +3867/20000 train_loss: 2.9000 train_time: 7.4m tok/s: 6875489 +3868/20000 train_loss: 2.3200 train_time: 7.4m tok/s: 6875039 +3869/20000 train_loss: 2.4428 train_time: 7.4m tok/s: 6874641 +3870/20000 train_loss: 2.5604 
train_time: 7.4m tok/s: 6874268 +3871/20000 train_loss: 2.4081 train_time: 7.4m tok/s: 6873880 +3872/20000 train_loss: 2.4529 train_time: 7.4m tok/s: 6873498 +3873/20000 train_loss: 2.3845 train_time: 7.4m tok/s: 6873104 +3874/20000 train_loss: 2.4454 train_time: 7.4m tok/s: 6872730 +3875/20000 train_loss: 2.4432 train_time: 7.4m tok/s: 6872319 +3876/20000 train_loss: 2.3310 train_time: 7.4m tok/s: 6871920 +3877/20000 train_loss: 2.6133 train_time: 7.4m tok/s: 6871516 +3878/20000 train_loss: 2.9822 train_time: 7.4m tok/s: 6871078 +3879/20000 train_loss: 2.4192 train_time: 7.4m tok/s: 6870694 +3880/20000 train_loss: 2.4141 train_time: 7.4m tok/s: 6870308 +3881/20000 train_loss: 2.4094 train_time: 7.4m tok/s: 6869907 +3882/20000 train_loss: 2.3116 train_time: 7.4m tok/s: 6869501 +3883/20000 train_loss: 2.4688 train_time: 7.4m tok/s: 6869118 +3884/20000 train_loss: 2.4630 train_time: 7.4m tok/s: 6868722 +3885/20000 train_loss: 2.4130 train_time: 7.4m tok/s: 6868337 +3886/20000 train_loss: 2.3907 train_time: 7.4m tok/s: 6867977 +3887/20000 train_loss: 2.4180 train_time: 7.4m tok/s: 6867554 +3888/20000 train_loss: 2.4262 train_time: 7.4m tok/s: 6867187 +3889/20000 train_loss: 2.4126 train_time: 7.4m tok/s: 6866790 +3890/20000 train_loss: 2.3509 train_time: 7.4m tok/s: 6866399 +3891/20000 train_loss: 2.5906 train_time: 7.4m tok/s: 6866006 +3892/20000 train_loss: 2.3918 train_time: 7.4m tok/s: 6865602 +3893/20000 train_loss: 2.4996 train_time: 7.4m tok/s: 6865217 +3894/20000 train_loss: 2.2332 train_time: 7.4m tok/s: 6864811 +3895/20000 train_loss: 2.5625 train_time: 7.4m tok/s: 6864442 +3896/20000 train_loss: 2.4701 train_time: 7.4m tok/s: 6864042 +3897/20000 train_loss: 2.4656 train_time: 7.4m tok/s: 6863642 +3898/20000 train_loss: 2.3838 train_time: 7.4m tok/s: 6863260 +3899/20000 train_loss: 2.6190 train_time: 7.4m tok/s: 6862877 +3900/20000 train_loss: 2.5393 train_time: 7.4m tok/s: 6862453 +3901/20000 train_loss: 2.4487 train_time: 7.5m tok/s: 6862080 +3902/20000 
train_loss: 2.4846 train_time: 7.5m tok/s: 6861665 +3903/20000 train_loss: 2.4824 train_time: 7.5m tok/s: 6861264 +3904/20000 train_loss: 2.3099 train_time: 7.5m tok/s: 6860857 +3905/20000 train_loss: 2.4326 train_time: 7.5m tok/s: 6860489 +3906/20000 train_loss: 2.5464 train_time: 7.5m tok/s: 6860133 +3907/20000 train_loss: 2.4664 train_time: 7.5m tok/s: 6859738 +3908/20000 train_loss: 2.4653 train_time: 7.5m tok/s: 6859353 +3909/20000 train_loss: 2.4722 train_time: 7.5m tok/s: 6858966 +3910/20000 train_loss: 2.4512 train_time: 7.5m tok/s: 6858579 +3911/20000 train_loss: 2.5226 train_time: 7.5m tok/s: 6858188 +3912/20000 train_loss: 2.5801 train_time: 7.5m tok/s: 6857796 +3913/20000 train_loss: 2.4880 train_time: 7.5m tok/s: 6857444 +3914/20000 train_loss: 2.4592 train_time: 7.5m tok/s: 6857041 +3915/20000 train_loss: 2.3521 train_time: 7.5m tok/s: 6856677 +3916/20000 train_loss: 2.4117 train_time: 7.5m tok/s: 6856293 +3917/20000 train_loss: 2.5813 train_time: 7.5m tok/s: 6855884 +3918/20000 train_loss: 2.4463 train_time: 7.5m tok/s: 6855496 +3919/20000 train_loss: 2.4059 train_time: 7.5m tok/s: 6855121 +3920/20000 train_loss: 2.3556 train_time: 7.5m tok/s: 6854741 +3921/20000 train_loss: 2.4929 train_time: 7.5m tok/s: 6854339 +3922/20000 train_loss: 2.5822 train_time: 7.5m tok/s: 6853942 +3923/20000 train_loss: 2.4931 train_time: 7.5m tok/s: 6853590 +3924/20000 train_loss: 2.5566 train_time: 7.5m tok/s: 6853202 +3925/20000 train_loss: 2.4436 train_time: 7.5m tok/s: 6852817 +3926/20000 train_loss: 2.3616 train_time: 7.5m tok/s: 6852416 +3927/20000 train_loss: 2.3416 train_time: 7.5m tok/s: 6852035 +3928/20000 train_loss: 2.4441 train_time: 7.5m tok/s: 6851640 +3929/20000 train_loss: 2.4321 train_time: 7.5m tok/s: 6851279 +3930/20000 train_loss: 2.5243 train_time: 7.5m tok/s: 6850901 +3931/20000 train_loss: 2.4869 train_time: 7.5m tok/s: 6850503 +3932/20000 train_loss: 2.4706 train_time: 7.5m tok/s: 6850100 +3933/20000 train_loss: 2.5444 train_time: 7.5m tok/s: 
6849716 +3934/20000 train_loss: 2.4874 train_time: 7.5m tok/s: 6849344 +3935/20000 train_loss: 2.4046 train_time: 7.5m tok/s: 6848930 +3936/20000 train_loss: 2.3868 train_time: 7.5m tok/s: 6848552 +3937/20000 train_loss: 2.4047 train_time: 7.5m tok/s: 6848167 +3938/20000 train_loss: 2.3564 train_time: 7.5m tok/s: 6847813 +3939/20000 train_loss: 2.4955 train_time: 7.5m tok/s: 6847446 +3940/20000 train_loss: 2.4208 train_time: 7.5m tok/s: 6847019 +3941/20000 train_loss: 2.5829 train_time: 7.5m tok/s: 6846648 +3942/20000 train_loss: 2.3550 train_time: 7.5m tok/s: 6846255 +3943/20000 train_loss: 2.4204 train_time: 7.5m tok/s: 6845894 +3944/20000 train_loss: 2.4959 train_time: 7.6m tok/s: 6845524 +3945/20000 train_loss: 2.4134 train_time: 7.6m tok/s: 6845152 +3946/20000 train_loss: 2.4436 train_time: 7.6m tok/s: 6844765 +3947/20000 train_loss: 2.4690 train_time: 7.6m tok/s: 6844419 +3948/20000 train_loss: 2.3503 train_time: 7.6m tok/s: 6844025 +3949/20000 train_loss: 2.3712 train_time: 7.6m tok/s: 6843654 +3950/20000 train_loss: 2.3397 train_time: 7.6m tok/s: 6843291 +3951/20000 train_loss: 2.4713 train_time: 7.6m tok/s: 6842885 +3952/20000 train_loss: 2.4487 train_time: 7.6m tok/s: 6842509 +3953/20000 train_loss: 2.4707 train_time: 7.6m tok/s: 6842152 +3954/20000 train_loss: 2.4893 train_time: 7.6m tok/s: 6841761 +3955/20000 train_loss: 2.4770 train_time: 7.6m tok/s: 6841393 +3956/20000 train_loss: 2.4633 train_time: 7.6m tok/s: 6841005 +3957/20000 train_loss: 2.3417 train_time: 7.6m tok/s: 6840621 +3958/20000 train_loss: 2.5283 train_time: 7.6m tok/s: 6840248 +3959/20000 train_loss: 2.2671 train_time: 7.6m tok/s: 6839861 +3960/20000 train_loss: 2.4039 train_time: 7.6m tok/s: 6839479 +3961/20000 train_loss: 2.3840 train_time: 7.6m tok/s: 6839096 +3962/20000 train_loss: 2.3897 train_time: 7.6m tok/s: 6838714 +3963/20000 train_loss: 2.4836 train_time: 7.6m tok/s: 6838336 +3964/20000 train_loss: 2.2938 train_time: 7.6m tok/s: 6837946 +3965/20000 train_loss: 2.5968 
train_time: 7.6m tok/s: 6837591 +3966/20000 train_loss: 2.4592 train_time: 7.6m tok/s: 6837245 +3967/20000 train_loss: 2.4269 train_time: 7.6m tok/s: 6836890 +3968/20000 train_loss: 2.4661 train_time: 7.6m tok/s: 6836502 +3969/20000 train_loss: 2.4154 train_time: 7.6m tok/s: 6836136 +3970/20000 train_loss: 2.5132 train_time: 7.6m tok/s: 6835770 +3971/20000 train_loss: 2.4153 train_time: 7.6m tok/s: 6835386 +3972/20000 train_loss: 2.4334 train_time: 7.6m tok/s: 6834998 +3973/20000 train_loss: 2.3944 train_time: 7.6m tok/s: 6834640 +3974/20000 train_loss: 2.3546 train_time: 7.6m tok/s: 6834263 +3975/20000 train_loss: 2.5170 train_time: 7.6m tok/s: 6833882 +3976/20000 train_loss: 2.4917 train_time: 7.6m tok/s: 6833525 +3977/20000 train_loss: 2.7617 train_time: 7.6m tok/s: 6833157 +3978/20000 train_loss: 2.4891 train_time: 7.6m tok/s: 6832775 +3979/20000 train_loss: 3.1321 train_time: 7.6m tok/s: 6832355 +3980/20000 train_loss: 2.4538 train_time: 7.6m tok/s: 6831974 +3981/20000 train_loss: 2.3807 train_time: 7.6m tok/s: 6831620 +3982/20000 train_loss: 2.4763 train_time: 7.6m tok/s: 6831247 +3983/20000 train_loss: 2.3951 train_time: 7.6m tok/s: 6830876 +3984/20000 train_loss: 2.5177 train_time: 7.6m tok/s: 6830521 +3985/20000 train_loss: 2.4754 train_time: 7.6m tok/s: 6830127 +3986/20000 train_loss: 2.4067 train_time: 7.6m tok/s: 6829769 +3987/20000 train_loss: 2.5605 train_time: 7.7m tok/s: 6829415 +3988/20000 train_loss: 2.4915 train_time: 7.7m tok/s: 6829069 +3989/20000 train_loss: 2.4377 train_time: 7.7m tok/s: 6828702 +3990/20000 train_loss: 2.4293 train_time: 7.7m tok/s: 6828335 +3991/20000 train_loss: 2.4458 train_time: 7.7m tok/s: 6827967 +3992/20000 train_loss: 2.4558 train_time: 7.7m tok/s: 6827616 +3993/20000 train_loss: 2.3848 train_time: 7.7m tok/s: 6827245 +3994/20000 train_loss: 2.1462 train_time: 7.7m tok/s: 6826838 +3995/20000 train_loss: 2.4542 train_time: 7.7m tok/s: 6826467 +3996/20000 train_loss: 2.3648 train_time: 7.7m tok/s: 6826112 +3997/20000 
train_loss: 2.4724 train_time: 7.7m tok/s: 6825729 +3998/20000 train_loss: 2.4034 train_time: 7.7m tok/s: 6825328 +3999/20000 train_loss: 2.4134 train_time: 7.7m tok/s: 6824970 +4000/20000 train_loss: 2.5086 train_time: 7.7m tok/s: 6824630 +4001/20000 train_loss: 2.4248 train_time: 7.7m tok/s: 6824253 +4002/20000 train_loss: 2.4293 train_time: 7.7m tok/s: 6823898 +4003/20000 train_loss: 2.3754 train_time: 7.7m tok/s: 6823549 +4004/20000 train_loss: 2.4888 train_time: 7.7m tok/s: 6823178 +4005/20000 train_loss: 2.4371 train_time: 7.7m tok/s: 6822845 +4006/20000 train_loss: 2.4689 train_time: 7.7m tok/s: 6822467 +4007/20000 train_loss: 2.4741 train_time: 7.7m tok/s: 6822085 +4008/20000 train_loss: 2.3585 train_time: 7.7m tok/s: 6821717 +4009/20000 train_loss: 2.3883 train_time: 7.7m tok/s: 6821353 +4010/20000 train_loss: 2.4924 train_time: 7.7m tok/s: 6821004 +4011/20000 train_loss: 2.4562 train_time: 7.7m tok/s: 6820668 +4012/20000 train_loss: 2.4738 train_time: 7.7m tok/s: 6820305 +4013/20000 train_loss: 2.3530 train_time: 7.7m tok/s: 6819897 +4014/20000 train_loss: 2.3991 train_time: 7.7m tok/s: 6819534 +4015/20000 train_loss: 2.4246 train_time: 7.7m tok/s: 6819175 +4016/20000 train_loss: 2.4245 train_time: 7.7m tok/s: 6818793 +4017/20000 train_loss: 2.5373 train_time: 7.7m tok/s: 6818451 +4018/20000 train_loss: 2.3874 train_time: 7.7m tok/s: 6818077 +4019/20000 train_loss: 2.2990 train_time: 7.7m tok/s: 6817710 +4020/20000 train_loss: 2.3367 train_time: 7.7m tok/s: 6817353 +4021/20000 train_loss: 2.3800 train_time: 7.7m tok/s: 6816974 +4022/20000 train_loss: 2.4651 train_time: 7.7m tok/s: 6816608 +4023/20000 train_loss: 2.4449 train_time: 7.7m tok/s: 6816249 +4024/20000 train_loss: 2.5105 train_time: 7.7m tok/s: 6815898 +4025/20000 train_loss: 2.4879 train_time: 7.7m tok/s: 6815558 +4026/20000 train_loss: 2.4138 train_time: 7.7m tok/s: 6815198 +4027/20000 train_loss: 2.3432 train_time: 7.7m tok/s: 6814833 +4028/20000 train_loss: 2.2402 train_time: 7.7m tok/s: 
6814458 +4029/20000 train_loss: 2.3892 train_time: 7.7m tok/s: 6814096 +4030/20000 train_loss: 2.3981 train_time: 7.8m tok/s: 6813745 +4031/20000 train_loss: 2.4777 train_time: 7.8m tok/s: 6813405 +4032/20000 train_loss: 2.3730 train_time: 7.8m tok/s: 6813047 +4033/20000 train_loss: 2.4299 train_time: 7.8m tok/s: 6812682 +4034/20000 train_loss: 2.5988 train_time: 7.8m tok/s: 6812349 +4035/20000 train_loss: 2.5305 train_time: 7.8m tok/s: 6812002 +4036/20000 train_loss: 2.4943 train_time: 7.8m tok/s: 6811664 +4037/20000 train_loss: 2.4833 train_time: 7.8m tok/s: 6811307 +4038/20000 train_loss: 2.4900 train_time: 7.8m tok/s: 6810965 +4039/20000 train_loss: 2.4756 train_time: 7.8m tok/s: 6810637 +4040/20000 train_loss: 2.4218 train_time: 7.8m tok/s: 6810239 +4041/20000 train_loss: 2.3898 train_time: 7.8m tok/s: 6809883 +4042/20000 train_loss: 2.3638 train_time: 7.8m tok/s: 6809530 +4043/20000 train_loss: 2.3782 train_time: 7.8m tok/s: 6809175 +4044/20000 train_loss: 2.4241 train_time: 7.8m tok/s: 6808814 +4045/20000 train_loss: 2.4580 train_time: 7.8m tok/s: 6808431 +4046/20000 train_loss: 2.5261 train_time: 7.8m tok/s: 6808063 +4047/20000 train_loss: 2.5003 train_time: 7.8m tok/s: 6807691 +4048/20000 train_loss: 2.4269 train_time: 7.8m tok/s: 6807330 +4049/20000 train_loss: 2.5538 train_time: 7.8m tok/s: 6806967 +4050/20000 train_loss: 2.4431 train_time: 7.8m tok/s: 6806586 +4051/20000 train_loss: 2.4053 train_time: 7.8m tok/s: 6806250 +4052/20000 train_loss: 2.4874 train_time: 7.8m tok/s: 6805881 +4053/20000 train_loss: 2.4375 train_time: 7.8m tok/s: 6805521 +4054/20000 train_loss: 2.4157 train_time: 7.8m tok/s: 6805168 +4055/20000 train_loss: 2.4445 train_time: 7.8m tok/s: 6804811 +4056/20000 train_loss: 2.3251 train_time: 7.8m tok/s: 6804436 +4057/20000 train_loss: 2.4155 train_time: 7.8m tok/s: 6804084 +4058/20000 train_loss: 2.5582 train_time: 7.8m tok/s: 6803707 +4059/20000 train_loss: 2.2789 train_time: 7.8m tok/s: 6803364 +4060/20000 train_loss: 2.3945 
train_time: 7.8m tok/s: 6803038 +4061/20000 train_loss: 2.4996 train_time: 7.8m tok/s: 6802705 +4062/20000 train_loss: 2.4868 train_time: 7.8m tok/s: 6802355 +4063/20000 train_loss: 2.4501 train_time: 7.8m tok/s: 6801988 +4064/20000 train_loss: 2.4304 train_time: 7.8m tok/s: 6801634 +4065/20000 train_loss: 2.5011 train_time: 7.8m tok/s: 6801319 +4066/20000 train_loss: 2.4089 train_time: 7.8m tok/s: 6800941 +4067/20000 train_loss: 2.4782 train_time: 7.8m tok/s: 6800592 +4068/20000 train_loss: 2.4794 train_time: 7.8m tok/s: 6800239 +4069/20000 train_loss: 2.3655 train_time: 7.8m tok/s: 6799879 +4070/20000 train_loss: 2.3726 train_time: 7.8m tok/s: 6799546 +4071/20000 train_loss: 2.7212 train_time: 7.8m tok/s: 6799177 +4072/20000 train_loss: 2.4522 train_time: 7.9m tok/s: 6798822 +4073/20000 train_loss: 2.4496 train_time: 7.9m tok/s: 6798468 +4074/20000 train_loss: 2.4084 train_time: 7.9m tok/s: 6798123 +4075/20000 train_loss: 2.5591 train_time: 7.9m tok/s: 6797776 +4076/20000 train_loss: 2.5841 train_time: 7.9m tok/s: 6797420 +4077/20000 train_loss: 2.4936 train_time: 7.9m tok/s: 6797077 +4078/20000 train_loss: 2.3752 train_time: 7.9m tok/s: 6796721 +4079/20000 train_loss: 3.0162 train_time: 7.9m tok/s: 6796330 +4080/20000 train_loss: 2.3605 train_time: 7.9m tok/s: 6795975 +4081/20000 train_loss: 2.3883 train_time: 7.9m tok/s: 6795624 +4082/20000 train_loss: 2.4052 train_time: 7.9m tok/s: 6795298 +4083/20000 train_loss: 2.3182 train_time: 7.9m tok/s: 6794958 +4084/20000 train_loss: 2.2602 train_time: 7.9m tok/s: 6794602 +4085/20000 train_loss: 2.3571 train_time: 7.9m tok/s: 6794255 +4086/20000 train_loss: 2.4926 train_time: 7.9m tok/s: 6793907 +4087/20000 train_loss: 2.5340 train_time: 7.9m tok/s: 6793564 +4088/20000 train_loss: 2.5375 train_time: 7.9m tok/s: 6793227 +4089/20000 train_loss: 2.4422 train_time: 7.9m tok/s: 6792890 +4090/20000 train_loss: 2.3684 train_time: 7.9m tok/s: 6792556 +4091/20000 train_loss: 2.5207 train_time: 7.9m tok/s: 6792212 +4092/20000 
train_loss: 2.5266 train_time: 7.9m tok/s: 6791844 +4093/20000 train_loss: 2.4965 train_time: 7.9m tok/s: 6791482 +4094/20000 train_loss: 2.2972 train_time: 7.9m tok/s: 6791100 +4095/20000 train_loss: 2.4546 train_time: 7.9m tok/s: 6790771 +4096/20000 train_loss: 2.2624 train_time: 7.9m tok/s: 6790401 +4097/20000 train_loss: 2.3104 train_time: 7.9m tok/s: 6790057 +4098/20000 train_loss: 2.4172 train_time: 7.9m tok/s: 6789736 +4099/20000 train_loss: 2.1982 train_time: 7.9m tok/s: 6789370 +4100/20000 train_loss: 2.4212 train_time: 7.9m tok/s: 6789025 +4101/20000 train_loss: 2.5318 train_time: 7.9m tok/s: 6788689 +4102/20000 train_loss: 2.2881 train_time: 7.9m tok/s: 6788328 +4103/20000 train_loss: 2.4378 train_time: 7.9m tok/s: 6788011 +4104/20000 train_loss: 2.3829 train_time: 7.9m tok/s: 6787661 +4105/20000 train_loss: 2.4353 train_time: 7.9m tok/s: 6787287 +4106/20000 train_loss: 2.4633 train_time: 7.9m tok/s: 6786981 +4107/20000 train_loss: 2.4107 train_time: 7.9m tok/s: 6786630 +4108/20000 train_loss: 2.4178 train_time: 7.9m tok/s: 6786289 +4109/20000 train_loss: 2.3910 train_time: 7.9m tok/s: 6785968 +4110/20000 train_loss: 2.3480 train_time: 7.9m tok/s: 6785620 +4111/20000 train_loss: 2.3982 train_time: 7.9m tok/s: 6785288 +4112/20000 train_loss: 2.4445 train_time: 7.9m tok/s: 6784938 +4113/20000 train_loss: 2.4217 train_time: 7.9m tok/s: 6784604 +4114/20000 train_loss: 2.3916 train_time: 7.9m tok/s: 6784287 +4115/20000 train_loss: 2.5291 train_time: 8.0m tok/s: 6783948 +4116/20000 train_loss: 2.3682 train_time: 8.0m tok/s: 6783615 +4117/20000 train_loss: 2.3827 train_time: 8.0m tok/s: 6783267 +4118/20000 train_loss: 2.4309 train_time: 8.0m tok/s: 6782914 +4119/20000 train_loss: 2.4676 train_time: 8.0m tok/s: 6782568 +4120/20000 train_loss: 2.4346 train_time: 8.0m tok/s: 6782203 +4121/20000 train_loss: 2.4029 train_time: 8.0m tok/s: 6781842 +4122/20000 train_loss: 2.2872 train_time: 8.0m tok/s: 6781480 +4123/20000 train_loss: 2.3867 train_time: 8.0m tok/s: 
6781144 +4124/20000 train_loss: 2.2565 train_time: 8.0m tok/s: 6780797 +4125/20000 train_loss: 2.4529 train_time: 8.0m tok/s: 6780473 +4126/20000 train_loss: 2.3724 train_time: 8.0m tok/s: 6780149 +4127/20000 train_loss: 2.3765 train_time: 8.0m tok/s: 6779811 +4128/20000 train_loss: 2.4317 train_time: 8.0m tok/s: 6779476 +4129/20000 train_loss: 2.4646 train_time: 8.0m tok/s: 6779146 +4130/20000 train_loss: 2.4017 train_time: 8.0m tok/s: 6778801 +4131/20000 train_loss: 2.4511 train_time: 8.0m tok/s: 6778454 +4132/20000 train_loss: 2.4239 train_time: 8.0m tok/s: 6778113 +4133/20000 train_loss: 2.4400 train_time: 8.0m tok/s: 6777748 +4134/20000 train_loss: 2.4769 train_time: 8.0m tok/s: 6777410 +4135/20000 train_loss: 2.4648 train_time: 8.0m tok/s: 6777077 +4136/20000 train_loss: 2.2946 train_time: 8.0m tok/s: 6776752 +4137/20000 train_loss: 2.4986 train_time: 8.0m tok/s: 6776420 +4138/20000 train_loss: 2.3036 train_time: 8.0m tok/s: 6776081 +4139/20000 train_loss: 2.4589 train_time: 8.0m tok/s: 6775735 +4140/20000 train_loss: 2.5240 train_time: 8.0m tok/s: 6775402 +4141/20000 train_loss: 2.3323 train_time: 8.0m tok/s: 6775065 +4142/20000 train_loss: 2.5065 train_time: 8.0m tok/s: 6774748 +4143/20000 train_loss: 2.4067 train_time: 8.0m tok/s: 6774437 +4144/20000 train_loss: 2.5090 train_time: 8.0m tok/s: 6774092 +4145/20000 train_loss: 2.2740 train_time: 8.0m tok/s: 6773742 +4146/20000 train_loss: 2.4508 train_time: 8.0m tok/s: 6773404 +4147/20000 train_loss: 2.5448 train_time: 8.0m tok/s: 6773088 +4148/20000 train_loss: 2.4373 train_time: 8.0m tok/s: 6772730 +4149/20000 train_loss: 2.2973 train_time: 8.0m tok/s: 6772384 +4150/20000 train_loss: 2.3496 train_time: 8.0m tok/s: 6772060 +4151/20000 train_loss: 2.4387 train_time: 8.0m tok/s: 6771732 +4152/20000 train_loss: 2.4169 train_time: 8.0m tok/s: 6771408 +4153/20000 train_loss: 2.5086 train_time: 8.0m tok/s: 6771075 +4154/20000 train_loss: 2.3613 train_time: 8.0m tok/s: 6770753 +4155/20000 train_loss: 2.5002 
train_time: 8.0m tok/s: 6770439 +4156/20000 train_loss: 2.4295 train_time: 8.0m tok/s: 6770128 +4157/20000 train_loss: 2.4865 train_time: 8.0m tok/s: 6769799 +4158/20000 train_loss: 2.4446 train_time: 8.1m tok/s: 6769481 +4159/20000 train_loss: 2.3494 train_time: 8.1m tok/s: 6769138 +4160/20000 train_loss: 2.3496 train_time: 8.1m tok/s: 6768828 +4161/20000 train_loss: 2.4090 train_time: 8.1m tok/s: 6768477 +4162/20000 train_loss: 2.3205 train_time: 8.1m tok/s: 6768128 +4163/20000 train_loss: 2.5203 train_time: 8.1m tok/s: 6767795 +4164/20000 train_loss: 2.4421 train_time: 8.1m tok/s: 6767465 +4165/20000 train_loss: 2.4440 train_time: 8.1m tok/s: 6767135 +4166/20000 train_loss: 2.4453 train_time: 8.1m tok/s: 6766790 +4167/20000 train_loss: 2.5254 train_time: 8.1m tok/s: 6766449 +4168/20000 train_loss: 2.4335 train_time: 8.1m tok/s: 6766136 +4169/20000 train_loss: 2.4216 train_time: 8.1m tok/s: 6765786 +4170/20000 train_loss: 2.3402 train_time: 8.1m tok/s: 6765460 +4171/20000 train_loss: 2.4381 train_time: 8.1m tok/s: 6765113 +4172/20000 train_loss: 2.3941 train_time: 8.1m tok/s: 6764770 +4173/20000 train_loss: 2.5120 train_time: 8.1m tok/s: 6764444 +4174/20000 train_loss: 2.3082 train_time: 8.1m tok/s: 6764114 +4175/20000 train_loss: 2.4047 train_time: 8.1m tok/s: 6763789 +4176/20000 train_loss: 2.5040 train_time: 8.1m tok/s: 6763464 +4177/20000 train_loss: 2.3790 train_time: 8.1m tok/s: 6763154 +4178/20000 train_loss: 2.4510 train_time: 8.1m tok/s: 6762847 +4179/20000 train_loss: 2.3983 train_time: 8.1m tok/s: 6762508 +4180/20000 train_loss: 2.3898 train_time: 8.1m tok/s: 6762198 +4181/20000 train_loss: 2.5457 train_time: 8.1m tok/s: 6761880 +4182/20000 train_loss: 2.4741 train_time: 8.1m tok/s: 6761554 +4183/20000 train_loss: 2.4599 train_time: 8.1m tok/s: 6761246 +4184/20000 train_loss: 2.4768 train_time: 8.1m tok/s: 6760917 +4185/20000 train_loss: 2.5708 train_time: 8.1m tok/s: 6760595 +4186/20000 train_loss: 2.4733 train_time: 8.1m tok/s: 6760290 +4187/20000 
train_loss: 2.2928 train_time: 8.1m tok/s: 6759951 +4188/20000 train_loss: 2.4576 train_time: 8.1m tok/s: 6759596 +4189/20000 train_loss: 2.4503 train_time: 8.1m tok/s: 6759282 +4190/20000 train_loss: 2.5355 train_time: 8.1m tok/s: 6758958 +4191/20000 train_loss: 2.4213 train_time: 8.1m tok/s: 6758670 +4192/20000 train_loss: 2.3713 train_time: 8.1m tok/s: 6758348 +4193/20000 train_loss: 2.4034 train_time: 8.1m tok/s: 6758045 +4194/20000 train_loss: 2.4716 train_time: 8.1m tok/s: 6757714 +4195/20000 train_loss: 2.4858 train_time: 8.1m tok/s: 6757386 +4196/20000 train_loss: 2.3112 train_time: 8.1m tok/s: 6757083 +4197/20000 train_loss: 2.3249 train_time: 8.1m tok/s: 6756762 +4198/20000 train_loss: 2.4306 train_time: 8.1m tok/s: 6756445 +4199/20000 train_loss: 2.2418 train_time: 8.1m tok/s: 6756139 +4200/20000 train_loss: 2.4311 train_time: 8.1m tok/s: 6755770 +4201/20000 train_loss: 2.3674 train_time: 8.2m tok/s: 6755428 +4202/20000 train_loss: 2.4532 train_time: 8.2m tok/s: 6755107 +4203/20000 train_loss: 2.5882 train_time: 8.2m tok/s: 6754804 +4204/20000 train_loss: 2.4920 train_time: 8.2m tok/s: 6754477 +4205/20000 train_loss: 2.4699 train_time: 8.2m tok/s: 6754142 +4206/20000 train_loss: 2.3964 train_time: 8.2m tok/s: 6753812 +4207/20000 train_loss: 2.4046 train_time: 8.2m tok/s: 6753490 +4208/20000 train_loss: 2.3843 train_time: 8.2m tok/s: 6753153 +4209/20000 train_loss: 2.3872 train_time: 8.2m tok/s: 6752824 +4210/20000 train_loss: 2.3831 train_time: 8.2m tok/s: 6752528 +4211/20000 train_loss: 2.5570 train_time: 8.2m tok/s: 6752208 +4212/20000 train_loss: 2.3708 train_time: 8.2m tok/s: 6751876 +4213/20000 train_loss: 2.3994 train_time: 8.2m tok/s: 6751557 +4214/20000 train_loss: 2.4226 train_time: 8.2m tok/s: 6751237 +4215/20000 train_loss: 2.4551 train_time: 8.2m tok/s: 6750924 +4216/20000 train_loss: 2.5508 train_time: 8.2m tok/s: 6750585 +4217/20000 train_loss: 2.3023 train_time: 8.2m tok/s: 6750268 +4218/20000 train_loss: 2.4560 train_time: 8.2m tok/s: 
6749944 +4219/20000 train_loss: 2.3809 train_time: 8.2m tok/s: 6749620 +4220/20000 train_loss: 2.0869 train_time: 8.2m tok/s: 6749290 +4221/20000 train_loss: 2.4712 train_time: 8.2m tok/s: 6748985 +4222/20000 train_loss: 2.4673 train_time: 8.2m tok/s: 6748650 +4223/20000 train_loss: 2.5657 train_time: 8.2m tok/s: 6748328 +4224/20000 train_loss: 2.4521 train_time: 8.2m tok/s: 6748009 +4225/20000 train_loss: 2.4568 train_time: 8.2m tok/s: 6747686 +4226/20000 train_loss: 2.4251 train_time: 8.2m tok/s: 6747370 +4227/20000 train_loss: 2.4672 train_time: 8.2m tok/s: 6747049 +4228/20000 train_loss: 2.4428 train_time: 8.2m tok/s: 6746713 +4229/20000 train_loss: 2.4380 train_time: 8.2m tok/s: 6746377 +4230/20000 train_loss: 2.3752 train_time: 8.2m tok/s: 6746045 +4231/20000 train_loss: 2.3637 train_time: 8.2m tok/s: 6745742 +4232/20000 train_loss: 2.4025 train_time: 8.2m tok/s: 6745425 +4233/20000 train_loss: 2.5403 train_time: 8.2m tok/s: 6745114 +4234/20000 train_loss: 2.3450 train_time: 8.2m tok/s: 6744815 +4235/20000 train_loss: 2.5143 train_time: 8.2m tok/s: 6744472 +4236/20000 train_loss: 2.5647 train_time: 8.2m tok/s: 6744150 +4237/20000 train_loss: 2.4392 train_time: 8.2m tok/s: 6743822 +4238/20000 train_loss: 2.5937 train_time: 8.2m tok/s: 6743522 +4239/20000 train_loss: 2.4969 train_time: 8.2m tok/s: 6743220 +4240/20000 train_loss: 2.5684 train_time: 8.2m tok/s: 6742920 +4241/20000 train_loss: 2.3559 train_time: 8.2m tok/s: 6742616 +4242/20000 train_loss: 2.4619 train_time: 8.2m tok/s: 6742269 +4243/20000 train_loss: 2.4046 train_time: 8.2m tok/s: 6741953 +4244/20000 train_loss: 2.4581 train_time: 8.3m tok/s: 6741637 +4245/20000 train_loss: 2.4523 train_time: 8.3m tok/s: 6741292 +4246/20000 train_loss: 2.4917 train_time: 8.3m tok/s: 6740967 +4247/20000 train_loss: 2.3733 train_time: 8.3m tok/s: 6740645 +4248/20000 train_loss: 2.4122 train_time: 8.3m tok/s: 6740342 +4249/20000 train_loss: 2.4317 train_time: 8.3m tok/s: 6740022 +4250/20000 train_loss: 2.3883 
train_time: 8.3m tok/s: 6739707 +4251/20000 train_loss: 2.4064 train_time: 8.3m tok/s: 6739410 +4252/20000 train_loss: 2.4818 train_time: 8.3m tok/s: 6739105 +4253/20000 train_loss: 2.5399 train_time: 8.3m tok/s: 6738761 +4254/20000 train_loss: 2.6597 train_time: 8.3m tok/s: 6738440 +4255/20000 train_loss: 2.4498 train_time: 8.3m tok/s: 6738141 +4256/20000 train_loss: 2.5076 train_time: 8.3m tok/s: 6737820 +4257/20000 train_loss: 2.5249 train_time: 8.3m tok/s: 6737495 +4258/20000 train_loss: 2.3313 train_time: 8.3m tok/s: 6737182 +4259/20000 train_loss: 2.4224 train_time: 8.3m tok/s: 6736867 +4260/20000 train_loss: 2.3930 train_time: 8.3m tok/s: 6736549 +4261/20000 train_loss: 2.2937 train_time: 8.3m tok/s: 6736236 +4262/20000 train_loss: 2.3722 train_time: 8.3m tok/s: 6735933 +4263/20000 train_loss: 2.3381 train_time: 8.3m tok/s: 6735626 +4264/20000 train_loss: 2.3720 train_time: 8.3m tok/s: 6735305 +4265/20000 train_loss: 2.4368 train_time: 8.3m tok/s: 6734996 +4266/20000 train_loss: 2.5154 train_time: 8.3m tok/s: 6734689 +4267/20000 train_loss: 2.4061 train_time: 8.3m tok/s: 6734380 +4268/20000 train_loss: 2.4661 train_time: 8.3m tok/s: 6734051 +4269/20000 train_loss: 2.4080 train_time: 8.3m tok/s: 6733739 +4270/20000 train_loss: 2.3160 train_time: 8.3m tok/s: 6733430 +4271/20000 train_loss: 2.4296 train_time: 8.3m tok/s: 6733134 +4272/20000 train_loss: 2.3842 train_time: 8.3m tok/s: 6732802 +4273/20000 train_loss: 2.4125 train_time: 8.3m tok/s: 6732501 +4274/20000 train_loss: 2.4570 train_time: 8.3m tok/s: 6732154 +4275/20000 train_loss: 2.3029 train_time: 8.3m tok/s: 6731839 +4276/20000 train_loss: 2.3877 train_time: 8.3m tok/s: 6731523 +4277/20000 train_loss: 2.3578 train_time: 8.3m tok/s: 6731213 +4278/20000 train_loss: 2.3801 train_time: 8.3m tok/s: 6730909 +4279/20000 train_loss: 2.3704 train_time: 8.3m tok/s: 6730603 +4280/20000 train_loss: 2.5244 train_time: 8.3m tok/s: 6730281 +4281/20000 train_loss: 2.6060 train_time: 8.3m tok/s: 6729991 +4282/20000 
train_loss: 2.4336 train_time: 8.3m tok/s: 6729673 +4283/20000 train_loss: 2.6003 train_time: 8.3m tok/s: 6729366 +4284/20000 train_loss: 2.5172 train_time: 8.3m tok/s: 6729068 +4285/20000 train_loss: 2.3987 train_time: 8.3m tok/s: 6728756 +4286/20000 train_loss: 2.4837 train_time: 8.3m tok/s: 6728439 +4287/20000 train_loss: 2.3741 train_time: 8.4m tok/s: 6728120 +4288/20000 train_loss: 2.3980 train_time: 8.4m tok/s: 6727835 +4289/20000 train_loss: 2.5729 train_time: 8.4m tok/s: 6727472 +4290/20000 train_loss: 2.3629 train_time: 8.4m tok/s: 6727174 +4291/20000 train_loss: 2.3197 train_time: 8.4m tok/s: 6726876 +4292/20000 train_loss: 2.4616 train_time: 8.4m tok/s: 6726550 +4293/20000 train_loss: 2.4285 train_time: 8.4m tok/s: 6726254 +4294/20000 train_loss: 2.3789 train_time: 8.4m tok/s: 6725941 +4295/20000 train_loss: 2.4187 train_time: 8.4m tok/s: 6725640 +4296/20000 train_loss: 2.2147 train_time: 8.4m tok/s: 6725336 +4297/20000 train_loss: 2.4772 train_time: 8.4m tok/s: 6725036 +4298/20000 train_loss: 2.4033 train_time: 8.4m tok/s: 6724721 +4299/20000 train_loss: 2.3425 train_time: 8.4m tok/s: 6724416 +4300/20000 train_loss: 2.5179 train_time: 8.4m tok/s: 6724120 +4301/20000 train_loss: 2.1917 train_time: 8.4m tok/s: 6723769 +4302/20000 train_loss: 2.3421 train_time: 8.4m tok/s: 6723434 +4303/20000 train_loss: 2.3342 train_time: 8.4m tok/s: 6723145 +4304/20000 train_loss: 2.3477 train_time: 8.4m tok/s: 6722855 +4305/20000 train_loss: 2.3295 train_time: 8.4m tok/s: 6722547 +4306/20000 train_loss: 2.3932 train_time: 8.4m tok/s: 6722218 +4307/20000 train_loss: 2.3384 train_time: 8.4m tok/s: 6721927 +4308/20000 train_loss: 2.3776 train_time: 8.4m tok/s: 6721634 +4309/20000 train_loss: 2.5795 train_time: 8.4m tok/s: 6721327 +4310/20000 train_loss: 2.4539 train_time: 8.4m tok/s: 6721018 +4311/20000 train_loss: 2.5404 train_time: 8.4m tok/s: 6720720 +4312/20000 train_loss: 2.4182 train_time: 8.4m tok/s: 6720425 +4313/20000 train_loss: 2.2548 train_time: 8.4m tok/s: 
6720130 +4314/20000 train_loss: 2.4311 train_time: 8.4m tok/s: 6719824 +4315/20000 train_loss: 2.4057 train_time: 8.4m tok/s: 6719504 +4316/20000 train_loss: 2.4178 train_time: 8.4m tok/s: 6719202 +4317/20000 train_loss: 2.4920 train_time: 8.4m tok/s: 6718879 +4318/20000 train_loss: 2.4902 train_time: 8.4m tok/s: 6718569 +4319/20000 train_loss: 2.1803 train_time: 8.4m tok/s: 6718263 +4320/20000 train_loss: 2.2687 train_time: 8.4m tok/s: 6717964 +4321/20000 train_loss: 2.3737 train_time: 8.4m tok/s: 6717680 +4322/20000 train_loss: 2.4196 train_time: 8.4m tok/s: 6717352 +4323/20000 train_loss: 2.4183 train_time: 8.4m tok/s: 6717056 +4324/20000 train_loss: 2.4391 train_time: 8.4m tok/s: 6716772 +4325/20000 train_loss: 2.4631 train_time: 8.4m tok/s: 6716470 +4326/20000 train_loss: 2.3656 train_time: 8.4m tok/s: 6716152 +4327/20000 train_loss: 2.4104 train_time: 8.4m tok/s: 6715841 +4328/20000 train_loss: 2.3848 train_time: 8.4m tok/s: 6715556 +4329/20000 train_loss: 2.4337 train_time: 8.4m tok/s: 6715255 +4330/20000 train_loss: 2.3048 train_time: 8.5m tok/s: 6714945 +4331/20000 train_loss: 2.3909 train_time: 8.5m tok/s: 6714645 +4332/20000 train_loss: 2.9206 train_time: 8.5m tok/s: 6714297 +4333/20000 train_loss: 2.3804 train_time: 8.5m tok/s: 6714006 +4334/20000 train_loss: 2.4459 train_time: 8.5m tok/s: 6713705 +4335/20000 train_loss: 2.5386 train_time: 8.5m tok/s: 6713392 +4336/20000 train_loss: 2.4135 train_time: 8.5m tok/s: 6713089 +4337/20000 train_loss: 2.5094 train_time: 8.5m tok/s: 6712822 +4338/20000 train_loss: 2.5376 train_time: 8.5m tok/s: 6712520 +4339/20000 train_loss: 2.4445 train_time: 8.5m tok/s: 6712241 +4340/20000 train_loss: 2.4248 train_time: 8.5m tok/s: 6711944 +4341/20000 train_loss: 2.4110 train_time: 8.5m tok/s: 6711627 +4342/20000 train_loss: 2.5170 train_time: 8.5m tok/s: 6711318 +4343/20000 train_loss: 2.4869 train_time: 8.5m tok/s: 6711024 +4344/20000 train_loss: 2.4891 train_time: 8.5m tok/s: 6710725 +4345/20000 train_loss: 2.3232 
train_time: 8.5m tok/s: 6710397 +4346/20000 train_loss: 2.3951 train_time: 8.5m tok/s: 6710093 +4347/20000 train_loss: 2.4071 train_time: 8.5m tok/s: 6709800 +4348/20000 train_loss: 2.4008 train_time: 8.5m tok/s: 6709489 +4349/20000 train_loss: 1.8638 train_time: 8.5m tok/s: 6709134 +4350/20000 train_loss: 2.2322 train_time: 8.5m tok/s: 6708827 +4351/20000 train_loss: 2.4177 train_time: 8.5m tok/s: 6708548 +4352/20000 train_loss: 2.3678 train_time: 8.5m tok/s: 6708248 +4353/20000 train_loss: 2.4608 train_time: 8.5m tok/s: 6707961 +4354/20000 train_loss: 2.4817 train_time: 8.5m tok/s: 6707688 +4355/20000 train_loss: 2.4084 train_time: 8.5m tok/s: 6707399 +4356/20000 train_loss: 2.4721 train_time: 8.5m tok/s: 6707101 +4357/20000 train_loss: 2.5066 train_time: 8.5m tok/s: 6706787 +4358/20000 train_loss: 2.4083 train_time: 8.5m tok/s: 6706503 +4359/20000 train_loss: 2.4198 train_time: 8.5m tok/s: 6706202 +4360/20000 train_loss: 2.3205 train_time: 8.5m tok/s: 6705936 +4361/20000 train_loss: 2.2836 train_time: 8.5m tok/s: 6705639 +4362/20000 train_loss: 2.4564 train_time: 8.5m tok/s: 6705347 +4363/20000 train_loss: 2.2157 train_time: 8.5m tok/s: 6705030 +4364/20000 train_loss: 2.3510 train_time: 8.5m tok/s: 6704729 +4365/20000 train_loss: 2.5510 train_time: 8.5m tok/s: 6704437 +4366/20000 train_loss: 2.7785 train_time: 8.5m tok/s: 6704126 +4367/20000 train_loss: 2.5286 train_time: 8.5m tok/s: 6703837 +4368/20000 train_loss: 2.5046 train_time: 8.5m tok/s: 6703545 +4369/20000 train_loss: 2.3111 train_time: 8.5m tok/s: 6703248 +4370/20000 train_loss: 2.4952 train_time: 8.5m tok/s: 6702951 +4371/20000 train_loss: 2.3901 train_time: 8.5m tok/s: 6702636 +4372/20000 train_loss: 2.4821 train_time: 8.5m tok/s: 6702336 +4373/20000 train_loss: 2.2513 train_time: 8.6m tok/s: 6702044 +4374/20000 train_loss: 2.3145 train_time: 8.6m tok/s: 6701744 +4375/20000 train_loss: 2.4068 train_time: 8.6m tok/s: 6701452 +4376/20000 train_loss: 2.3847 train_time: 8.6m tok/s: 6701151 +4377/20000 
train_loss: 2.5562 train_time: 8.6m tok/s: 6700865 +4378/20000 train_loss: 2.2681 train_time: 8.6m tok/s: 6700553 +4379/20000 train_loss: 2.3312 train_time: 8.6m tok/s: 6700264 +4380/20000 train_loss: 2.4350 train_time: 8.6m tok/s: 6699983 +4381/20000 train_loss: 2.5042 train_time: 8.6m tok/s: 6699715 +4382/20000 train_loss: 2.4889 train_time: 8.6m tok/s: 6699401 +4383/20000 train_loss: 2.4346 train_time: 8.6m tok/s: 6699104 +4384/20000 train_loss: 2.4967 train_time: 8.6m tok/s: 6698818 +4385/20000 train_loss: 2.4229 train_time: 8.6m tok/s: 6698510 +4386/20000 train_loss: 2.4015 train_time: 8.6m tok/s: 6698217 +4387/20000 train_loss: 2.4702 train_time: 8.6m tok/s: 6697918 +4388/20000 train_loss: 2.3868 train_time: 8.6m tok/s: 6697618 +4389/20000 train_loss: 2.2641 train_time: 8.6m tok/s: 6697349 +4390/20000 train_loss: 2.4065 train_time: 8.6m tok/s: 6697053 +4391/20000 train_loss: 2.2857 train_time: 8.6m tok/s: 6696759 +4392/20000 train_loss: 2.3938 train_time: 8.6m tok/s: 6696459 +4393/20000 train_loss: 2.3941 train_time: 8.6m tok/s: 6696175 +4394/20000 train_loss: 2.3729 train_time: 8.6m tok/s: 6695889 +4395/20000 train_loss: 2.3396 train_time: 8.6m tok/s: 6695600 +4396/20000 train_loss: 2.5864 train_time: 8.6m tok/s: 6695294 +4397/20000 train_loss: 2.4150 train_time: 8.6m tok/s: 6694995 +4398/20000 train_loss: 2.3446 train_time: 8.6m tok/s: 6694711 +4399/20000 train_loss: 2.3617 train_time: 8.6m tok/s: 6694423 +4400/20000 train_loss: 2.5097 train_time: 8.6m tok/s: 6694125 +4401/20000 train_loss: 2.4340 train_time: 8.6m tok/s: 6693846 +4402/20000 train_loss: 2.3791 train_time: 8.6m tok/s: 6693550 +4403/20000 train_loss: 2.3714 train_time: 8.6m tok/s: 6693274 +4404/20000 train_loss: 2.3185 train_time: 8.6m tok/s: 6692975 +4405/20000 train_loss: 2.5384 train_time: 8.6m tok/s: 6692672 +4406/20000 train_loss: 2.4067 train_time: 8.6m tok/s: 6692376 +4407/20000 train_loss: 2.4543 train_time: 8.6m tok/s: 6692077 +4408/20000 train_loss: 2.4780 train_time: 8.6m tok/s: 
6691801 +4409/20000 train_loss: 2.4206 train_time: 8.6m tok/s: 6691532 +4410/20000 train_loss: 2.4486 train_time: 8.6m tok/s: 6691249 +4411/20000 train_loss: 2.4662 train_time: 8.6m tok/s: 6690973 +4412/20000 train_loss: 2.5253 train_time: 8.6m tok/s: 6690688 +4413/20000 train_loss: 2.3458 train_time: 8.6m tok/s: 6690401 +4414/20000 train_loss: 2.3558 train_time: 8.6m tok/s: 6690117 +4415/20000 train_loss: 2.5658 train_time: 8.7m tok/s: 6689799 +4416/20000 train_loss: 2.3855 train_time: 8.7m tok/s: 6689487 +4417/20000 train_loss: 2.2975 train_time: 8.7m tok/s: 6689198 +4418/20000 train_loss: 2.3765 train_time: 8.7m tok/s: 6688913 +4419/20000 train_loss: 2.3659 train_time: 8.7m tok/s: 6688626 +4420/20000 train_loss: 2.2860 train_time: 8.7m tok/s: 6688336 +4421/20000 train_loss: 2.2528 train_time: 8.7m tok/s: 6688042 +4422/20000 train_loss: 2.3869 train_time: 8.7m tok/s: 6687754 +4423/20000 train_loss: 2.5344 train_time: 8.7m tok/s: 6687448 +4424/20000 train_loss: 2.4663 train_time: 8.7m tok/s: 6687170 +4425/20000 train_loss: 2.5008 train_time: 8.7m tok/s: 6686893 +4426/20000 train_loss: 2.3054 train_time: 8.7m tok/s: 6686603 +4427/20000 train_loss: 2.3365 train_time: 8.7m tok/s: 6686310 +4428/20000 train_loss: 2.4030 train_time: 8.7m tok/s: 6686044 +4429/20000 train_loss: 2.4062 train_time: 8.7m tok/s: 6685751 +4430/20000 train_loss: 2.5672 train_time: 8.7m tok/s: 6685427 +4431/20000 train_loss: 2.4672 train_time: 8.7m tok/s: 6685108 +4432/20000 train_loss: 2.3617 train_time: 8.7m tok/s: 6684824 +4433/20000 train_loss: 2.2767 train_time: 8.7m tok/s: 6684551 +4434/20000 train_loss: 2.5710 train_time: 8.7m tok/s: 6684237 +4435/20000 train_loss: 2.4378 train_time: 8.7m tok/s: 6683966 +4436/20000 train_loss: 2.4098 train_time: 8.7m tok/s: 6683684 +4437/20000 train_loss: 2.2327 train_time: 8.7m tok/s: 6683389 +4438/20000 train_loss: 2.5216 train_time: 8.7m tok/s: 6683107 +4439/20000 train_loss: 2.4851 train_time: 8.7m tok/s: 6682801 +4440/20000 train_loss: 2.4778 
train_time: 8.7m tok/s: 6682532 +4441/20000 train_loss: 2.3551 train_time: 8.7m tok/s: 6682274 +4442/20000 train_loss: 2.4525 train_time: 8.7m tok/s: 6682006 +4443/20000 train_loss: 2.5247 train_time: 8.7m tok/s: 6681728 +4444/20000 train_loss: 2.3760 train_time: 8.7m tok/s: 6681448 +4445/20000 train_loss: 2.3886 train_time: 8.7m tok/s: 6681157 +4446/20000 train_loss: 2.3161 train_time: 8.7m tok/s: 6680875 +4447/20000 train_loss: 2.3658 train_time: 8.7m tok/s: 6680599 +4448/20000 train_loss: 2.3119 train_time: 8.7m tok/s: 6680313 +4449/20000 train_loss: 2.5146 train_time: 8.7m tok/s: 6680025 +4450/20000 train_loss: 2.3453 train_time: 8.7m tok/s: 6679756 +4451/20000 train_loss: 2.4477 train_time: 8.7m tok/s: 6679460 +4452/20000 train_loss: 2.3726 train_time: 8.7m tok/s: 6679171 +4453/20000 train_loss: 2.4806 train_time: 8.7m tok/s: 6678883 +4454/20000 train_loss: 2.3381 train_time: 8.7m tok/s: 6678599 +4455/20000 train_loss: 2.4081 train_time: 8.7m tok/s: 6678330 +4456/20000 train_loss: 2.4185 train_time: 8.7m tok/s: 6678040 +4457/20000 train_loss: 2.6178 train_time: 8.7m tok/s: 6677755 +4458/20000 train_loss: 2.3903 train_time: 8.8m tok/s: 6677481 +4459/20000 train_loss: 2.2668 train_time: 8.8m tok/s: 6677200 +4460/20000 train_loss: 2.3203 train_time: 8.8m tok/s: 6676903 +4461/20000 train_loss: 2.3534 train_time: 8.8m tok/s: 6676630 +4462/20000 train_loss: 2.4823 train_time: 8.8m tok/s: 6676355 +4463/20000 train_loss: 2.2405 train_time: 8.8m tok/s: 6676058 +4464/20000 train_loss: 2.4964 train_time: 8.8m tok/s: 6675779 +4465/20000 train_loss: 2.4620 train_time: 8.8m tok/s: 6675503 +4466/20000 train_loss: 2.5115 train_time: 8.8m tok/s: 6675216 +4467/20000 train_loss: 2.4556 train_time: 8.8m tok/s: 6674934 +4468/20000 train_loss: 2.2133 train_time: 8.8m tok/s: 6674635 +4469/20000 train_loss: 2.3627 train_time: 8.8m tok/s: 6674336 +4470/20000 train_loss: 2.3495 train_time: 8.8m tok/s: 6674080 +4471/20000 train_loss: 2.4358 train_time: 8.8m tok/s: 6673813 +4472/20000 
train_loss: 2.4008 train_time: 8.8m tok/s: 6673522 +4473/20000 train_loss: 2.5130 train_time: 8.8m tok/s: 6673244 +4474/20000 train_loss: 2.3601 train_time: 8.8m tok/s: 6672943 +4475/20000 train_loss: 2.4801 train_time: 8.8m tok/s: 6672668 +4476/20000 train_loss: 2.4198 train_time: 8.8m tok/s: 6672396 +4477/20000 train_loss: 2.3434 train_time: 8.8m tok/s: 6672101 +4478/20000 train_loss: 2.3656 train_time: 8.8m tok/s: 6671839 +4479/20000 train_loss: 2.4300 train_time: 8.8m tok/s: 6671565 +4480/20000 train_loss: 2.3815 train_time: 8.8m tok/s: 6671297 +4481/20000 train_loss: 2.4183 train_time: 8.8m tok/s: 6671012 +4482/20000 train_loss: 2.3793 train_time: 8.8m tok/s: 6670721 +4483/20000 train_loss: 2.4184 train_time: 8.8m tok/s: 6670441 +4484/20000 train_loss: 2.5013 train_time: 8.8m tok/s: 6670171 +4485/20000 train_loss: 2.3385 train_time: 8.8m tok/s: 6669904 +4486/20000 train_loss: 2.3924 train_time: 8.8m tok/s: 6669630 +4487/20000 train_loss: 2.3824 train_time: 8.8m tok/s: 6669347 +4488/20000 train_loss: 2.3792 train_time: 8.8m tok/s: 6669059 +4489/20000 train_loss: 2.2798 train_time: 8.8m tok/s: 6668777 +4490/20000 train_loss: 2.4654 train_time: 8.8m tok/s: 6668490 +4491/20000 train_loss: 2.3999 train_time: 8.8m tok/s: 6668217 +4492/20000 train_loss: 2.4507 train_time: 8.8m tok/s: 6667943 +4493/20000 train_loss: 2.4621 train_time: 8.8m tok/s: 6667675 +4494/20000 train_loss: 2.4904 train_time: 8.8m tok/s: 6667402 +4495/20000 train_loss: 2.4479 train_time: 8.8m tok/s: 6667125 +4496/20000 train_loss: 2.3325 train_time: 8.8m tok/s: 6666847 +4497/20000 train_loss: 2.4422 train_time: 8.8m tok/s: 6666556 +4498/20000 train_loss: 2.3722 train_time: 8.8m tok/s: 6666291 +4499/20000 train_loss: 2.3627 train_time: 8.8m tok/s: 6666016 +4500/20000 train_loss: 2.3893 train_time: 8.8m tok/s: 6665741 +4501/20000 train_loss: 2.0625 train_time: 8.9m tok/s: 6665423 +4502/20000 train_loss: 2.3541 train_time: 8.9m tok/s: 6665137 +4503/20000 train_loss: 2.2331 train_time: 8.9m tok/s: 
6664858 +4504/20000 train_loss: 2.3204 train_time: 8.9m tok/s: 6664571 +4505/20000 train_loss: 2.3234 train_time: 8.9m tok/s: 6664306 +4506/20000 train_loss: 2.2662 train_time: 8.9m tok/s: 6664033 +4507/20000 train_loss: 2.3547 train_time: 8.9m tok/s: 6663773 +4508/20000 train_loss: 2.4647 train_time: 8.9m tok/s: 6663512 +4509/20000 train_loss: 2.4255 train_time: 8.9m tok/s: 6663242 +4510/20000 train_loss: 2.4542 train_time: 8.9m tok/s: 6662969 +4511/20000 train_loss: 2.2619 train_time: 8.9m tok/s: 6662695 +4512/20000 train_loss: 2.3576 train_time: 8.9m tok/s: 6662426 +4513/20000 train_loss: 2.3776 train_time: 8.9m tok/s: 6662154 +4514/20000 train_loss: 2.3911 train_time: 8.9m tok/s: 6661878 +4515/20000 train_loss: 2.3164 train_time: 8.9m tok/s: 6661594 +4516/20000 train_loss: 2.5665 train_time: 8.9m tok/s: 6661300 +4517/20000 train_loss: 2.6633 train_time: 8.9m tok/s: 6661013 +4518/20000 train_loss: 2.4489 train_time: 8.9m tok/s: 6660748 +4519/20000 train_loss: 2.3679 train_time: 8.9m tok/s: 6660483 +4520/20000 train_loss: 2.3118 train_time: 8.9m tok/s: 6660208 +4521/20000 train_loss: 2.4242 train_time: 8.9m tok/s: 6659941 +4522/20000 train_loss: 2.3461 train_time: 8.9m tok/s: 6659679 +4523/20000 train_loss: 2.4175 train_time: 8.9m tok/s: 6659418 +4524/20000 train_loss: 2.3145 train_time: 8.9m tok/s: 6659142 +4525/20000 train_loss: 2.4896 train_time: 8.9m tok/s: 6658881 +4526/20000 train_loss: 2.4314 train_time: 8.9m tok/s: 6658622 +4527/20000 train_loss: 2.4175 train_time: 8.9m tok/s: 6658353 +4528/20000 train_loss: 2.3307 train_time: 8.9m tok/s: 6658081 +4529/20000 train_loss: 2.3807 train_time: 8.9m tok/s: 6657791 +4530/20000 train_loss: 2.2935 train_time: 8.9m tok/s: 6657508 +4531/20000 train_loss: 2.5323 train_time: 8.9m tok/s: 6657218 +4532/20000 train_loss: 2.3998 train_time: 8.9m tok/s: 6656946 +4533/20000 train_loss: 2.2842 train_time: 8.9m tok/s: 6656699 +4534/20000 train_loss: 2.4274 train_time: 8.9m tok/s: 6656434 +4535/20000 train_loss: 2.4902 
train_time: 8.9m tok/s: 6656156 +4536/20000 train_loss: 2.2850 train_time: 8.9m tok/s: 6655868 +4537/20000 train_loss: 2.3818 train_time: 8.9m tok/s: 6655596 +4538/20000 train_loss: 2.1585 train_time: 8.9m tok/s: 6655323 +4539/20000 train_loss: 2.4333 train_time: 8.9m tok/s: 6655046 +4540/20000 train_loss: 2.3692 train_time: 8.9m tok/s: 6654773 +4541/20000 train_loss: 2.2982 train_time: 8.9m tok/s: 6654519 +4542/20000 train_loss: 2.3913 train_time: 8.9m tok/s: 6654250 +4543/20000 train_loss: 2.2984 train_time: 8.9m tok/s: 6653974 +4544/20000 train_loss: 2.6393 train_time: 9.0m tok/s: 6653678 +4545/20000 train_loss: 2.3788 train_time: 9.0m tok/s: 6653429 +4546/20000 train_loss: 2.3848 train_time: 9.0m tok/s: 6653163 +4547/20000 train_loss: 2.3433 train_time: 9.0m tok/s: 6652902 +4548/20000 train_loss: 2.2340 train_time: 9.0m tok/s: 6652637 +4549/20000 train_loss: 2.4305 train_time: 9.0m tok/s: 6652367 +4550/20000 train_loss: 2.4547 train_time: 9.0m tok/s: 6652112 +4551/20000 train_loss: 2.3528 train_time: 9.0m tok/s: 6651843 +4552/20000 train_loss: 2.3424 train_time: 9.0m tok/s: 6651551 +4553/20000 train_loss: 2.2867 train_time: 9.0m tok/s: 6651299 +4554/20000 train_loss: 2.4562 train_time: 9.0m tok/s: 6651008 +4555/20000 train_loss: 2.4521 train_time: 9.0m tok/s: 6650723 +4556/20000 train_loss: 2.4146 train_time: 9.0m tok/s: 6650472 +4557/20000 train_loss: 2.4615 train_time: 9.0m tok/s: 6650210 +4558/20000 train_loss: 2.5737 train_time: 9.0m tok/s: 6649944 +4559/20000 train_loss: 2.4103 train_time: 9.0m tok/s: 6649683 +4560/20000 train_loss: 2.4290 train_time: 9.0m tok/s: 6649409 +4561/20000 train_loss: 2.4408 train_time: 9.0m tok/s: 6649148 +4562/20000 train_loss: 2.3892 train_time: 9.0m tok/s: 6648884 +4563/20000 train_loss: 2.5012 train_time: 9.0m tok/s: 6648617 +4564/20000 train_loss: 2.3222 train_time: 9.0m tok/s: 6648338 +4565/20000 train_loss: 2.3954 train_time: 9.0m tok/s: 6648081 +4566/20000 train_loss: 2.3654 train_time: 9.0m tok/s: 6647802 +4567/20000 
train_loss: 2.2291 train_time: 9.0m tok/s: 6647526 +4568/20000 train_loss: 2.2569 train_time: 9.0m tok/s: 6647270 +4569/20000 train_loss: 2.4761 train_time: 9.0m tok/s: 6647010 +4570/20000 train_loss: 2.3456 train_time: 9.0m tok/s: 6646745 +4571/20000 train_loss: 2.4166 train_time: 9.0m tok/s: 6646490 +4572/20000 train_loss: 2.3973 train_time: 9.0m tok/s: 6646233 +4573/20000 train_loss: 2.3936 train_time: 9.0m tok/s: 6645970 +4574/20000 train_loss: 2.4040 train_time: 9.0m tok/s: 6645669 +4575/20000 train_loss: 2.2820 train_time: 9.0m tok/s: 6645382 +4576/20000 train_loss: 2.4210 train_time: 9.0m tok/s: 6645126 +4577/20000 train_loss: 2.4132 train_time: 9.0m tok/s: 6644856 +4578/20000 train_loss: 2.3683 train_time: 9.0m tok/s: 6644602 +4579/20000 train_loss: 2.4723 train_time: 9.0m tok/s: 6644326 +4580/20000 train_loss: 2.2539 train_time: 9.0m tok/s: 6644045 +4581/20000 train_loss: 1.9360 train_time: 9.0m tok/s: 6643735 +4582/20000 train_loss: 2.3699 train_time: 9.0m tok/s: 6643467 +4583/20000 train_loss: 2.4077 train_time: 9.0m tok/s: 6643230 +4584/20000 train_loss: 2.4607 train_time: 9.0m tok/s: 6642978 +4585/20000 train_loss: 2.3160 train_time: 9.0m tok/s: 6642734 +4586/20000 train_loss: 2.3568 train_time: 9.0m tok/s: 6642490 +4587/20000 train_loss: 2.5044 train_time: 9.1m tok/s: 6642202 +4588/20000 train_loss: 2.3972 train_time: 9.1m tok/s: 6641940 +4589/20000 train_loss: 2.4570 train_time: 9.1m tok/s: 6641707 +4590/20000 train_loss: 2.3079 train_time: 9.1m tok/s: 6641435 +4591/20000 train_loss: 2.3678 train_time: 9.1m tok/s: 6641170 +4592/20000 train_loss: 2.2545 train_time: 9.1m tok/s: 6640903 +4593/20000 train_loss: 2.4548 train_time: 9.1m tok/s: 6640641 +4594/20000 train_loss: 2.3201 train_time: 9.1m tok/s: 6640383 +4595/20000 train_loss: 2.1861 train_time: 9.1m tok/s: 6640094 +4596/20000 train_loss: 2.4291 train_time: 9.1m tok/s: 6639854 +4597/20000 train_loss: 2.4121 train_time: 9.1m tok/s: 6639569 +4598/20000 train_loss: 2.4620 train_time: 9.1m tok/s: 
6639325 +4599/20000 train_loss: 2.3751 train_time: 9.1m tok/s: 6639077 +4600/20000 train_loss: 2.5605 train_time: 9.1m tok/s: 6638805 +4601/20000 train_loss: 2.4847 train_time: 9.1m tok/s: 6638538 +4602/20000 train_loss: 2.5111 train_time: 9.1m tok/s: 6638265 +4603/20000 train_loss: 2.3835 train_time: 9.1m tok/s: 6638012 +4604/20000 train_loss: 2.3461 train_time: 9.1m tok/s: 6637749 +4605/20000 train_loss: 2.3489 train_time: 9.1m tok/s: 6637475 +4606/20000 train_loss: 2.3481 train_time: 9.1m tok/s: 6637229 +4607/20000 train_loss: 2.3534 train_time: 9.1m tok/s: 6636980 +4608/20000 train_loss: 2.3675 train_time: 9.1m tok/s: 6636723 +4609/20000 train_loss: 2.3199 train_time: 9.1m tok/s: 6636449 +4610/20000 train_loss: 2.4209 train_time: 9.1m tok/s: 6636172 +4611/20000 train_loss: 2.5822 train_time: 9.1m tok/s: 6635896 +4612/20000 train_loss: 2.4527 train_time: 9.1m tok/s: 6635632 +4613/20000 train_loss: 2.5041 train_time: 9.1m tok/s: 6635393 +4614/20000 train_loss: 2.4581 train_time: 9.1m tok/s: 6635144 +4615/20000 train_loss: 2.3032 train_time: 9.1m tok/s: 6634842 +4616/20000 train_loss: 2.2758 train_time: 9.1m tok/s: 6634604 +4617/20000 train_loss: 2.3151 train_time: 9.1m tok/s: 6634330 +4618/20000 train_loss: 2.3739 train_time: 9.1m tok/s: 6634101 +4619/20000 train_loss: 2.2872 train_time: 9.1m tok/s: 6633847 +4620/20000 train_loss: 2.3496 train_time: 9.1m tok/s: 6633601 +4621/20000 train_loss: 2.3893 train_time: 9.1m tok/s: 6633329 +4622/20000 train_loss: 2.4201 train_time: 9.1m tok/s: 6633048 +4623/20000 train_loss: 2.3730 train_time: 9.1m tok/s: 6632795 +4624/20000 train_loss: 2.5461 train_time: 9.1m tok/s: 6632525 +4625/20000 train_loss: 2.5872 train_time: 9.1m tok/s: 6632255 +4626/20000 train_loss: 2.4973 train_time: 9.1m tok/s: 6631995 +4627/20000 train_loss: 2.3998 train_time: 9.1m tok/s: 6631763 +4628/20000 train_loss: 2.4532 train_time: 9.1m tok/s: 6631496 +4629/20000 train_loss: 2.3730 train_time: 9.1m tok/s: 6631238 +4630/20000 train_loss: 2.1497 
train_time: 9.2m tok/s: 6630992 +4631/20000 train_loss: 2.3869 train_time: 9.2m tok/s: 6630731 +4632/20000 train_loss: 2.3133 train_time: 9.2m tok/s: 6630472 +4633/20000 train_loss: 2.2332 train_time: 9.2m tok/s: 6630217 +4634/20000 train_loss: 2.3477 train_time: 9.2m tok/s: 6629970 +4635/20000 train_loss: 2.4743 train_time: 9.2m tok/s: 6629690 +4636/20000 train_loss: 2.3872 train_time: 9.2m tok/s: 6629437 +4637/20000 train_loss: 2.4627 train_time: 9.2m tok/s: 6629183 +4638/20000 train_loss: 2.4532 train_time: 9.2m tok/s: 6628944 +4639/20000 train_loss: 2.3911 train_time: 9.2m tok/s: 6628694 +4640/20000 train_loss: 2.4071 train_time: 9.2m tok/s: 6628431 +4641/20000 train_loss: 2.3469 train_time: 9.2m tok/s: 6628160 +4642/20000 train_loss: 2.3123 train_time: 9.2m tok/s: 6627909 +4643/20000 train_loss: 2.4233 train_time: 9.2m tok/s: 6627656 +4644/20000 train_loss: 2.2570 train_time: 9.2m tok/s: 6627383 +4645/20000 train_loss: 2.3501 train_time: 9.2m tok/s: 6627137 +4646/20000 train_loss: 2.2881 train_time: 9.2m tok/s: 6626877 +4647/20000 train_loss: 2.1976 train_time: 9.2m tok/s: 6626621 +4648/20000 train_loss: 2.3736 train_time: 9.2m tok/s: 6626370 +4649/20000 train_loss: 2.3928 train_time: 9.2m tok/s: 6626112 +4650/20000 train_loss: 2.4094 train_time: 9.2m tok/s: 6625867 +4651/20000 train_loss: 2.4866 train_time: 9.2m tok/s: 6625607 +4652/20000 train_loss: 2.4640 train_time: 9.2m tok/s: 6625341 +4653/20000 train_loss: 2.4327 train_time: 9.2m tok/s: 6625087 +4654/20000 train_loss: 2.2998 train_time: 9.2m tok/s: 6624831 +4655/20000 train_loss: 2.2761 train_time: 9.2m tok/s: 6624558 +4656/20000 train_loss: 2.4700 train_time: 9.2m tok/s: 6624307 +4657/20000 train_loss: 2.3573 train_time: 9.2m tok/s: 6624052 +4658/20000 train_loss: 2.2982 train_time: 9.2m tok/s: 6623802 +4659/20000 train_loss: 2.3307 train_time: 9.2m tok/s: 6623558 +4660/20000 train_loss: 2.3383 train_time: 9.2m tok/s: 6623295 +4661/20000 train_loss: 2.0318 train_time: 9.2m tok/s: 6623011 +4662/20000 
train_loss: 2.3587 train_time: 9.2m tok/s: 6622750 +4663/20000 train_loss: 2.3864 train_time: 9.2m tok/s: 6622528 +4664/20000 train_loss: 2.2969 train_time: 9.2m tok/s: 6622271 +4665/20000 train_loss: 2.4456 train_time: 9.2m tok/s: 6622014 +4666/20000 train_loss: 2.3418 train_time: 9.2m tok/s: 6621761 +4667/20000 train_loss: 2.4351 train_time: 9.2m tok/s: 6621489 +4668/20000 train_loss: 2.4018 train_time: 9.2m tok/s: 6621268 +4669/20000 train_loss: 2.6143 train_time: 9.2m tok/s: 6621009 +4670/20000 train_loss: 2.2812 train_time: 9.2m tok/s: 6620756 +4671/20000 train_loss: 2.4567 train_time: 9.2m tok/s: 6620515 +4672/20000 train_loss: 2.3482 train_time: 9.2m tok/s: 6620258 +4673/20000 train_loss: 2.3265 train_time: 9.3m tok/s: 6619998 +4674/20000 train_loss: 2.4360 train_time: 9.3m tok/s: 6619740 +4675/20000 train_loss: 2.3730 train_time: 9.3m tok/s: 6619487 +4676/20000 train_loss: 2.3639 train_time: 9.3m tok/s: 6619231 +4677/20000 train_loss: 2.4182 train_time: 9.3m tok/s: 6618976 +4678/20000 train_loss: 2.3871 train_time: 9.3m tok/s: 6618731 +4679/20000 train_loss: 2.2972 train_time: 9.3m tok/s: 6618489 +4680/20000 train_loss: 2.3032 train_time: 9.3m tok/s: 6618235 +4681/20000 train_loss: 2.3774 train_time: 9.3m tok/s: 6617979 +4682/20000 train_loss: 2.2883 train_time: 9.3m tok/s: 6617726 +4683/20000 train_loss: 2.3618 train_time: 9.3m tok/s: 6617466 +4684/20000 train_loss: 2.3504 train_time: 9.3m tok/s: 6617224 +4685/20000 train_loss: 2.3216 train_time: 9.3m tok/s: 6616957 +4686/20000 train_loss: 2.2538 train_time: 9.3m tok/s: 6616692 +4687/20000 train_loss: 2.3872 train_time: 9.3m tok/s: 6616464 +4688/20000 train_loss: 2.4494 train_time: 9.3m tok/s: 6616198 +4689/20000 train_loss: 2.4383 train_time: 9.3m tok/s: 6615951 +4690/20000 train_loss: 2.5013 train_time: 9.3m tok/s: 6615709 +4691/20000 train_loss: 2.4025 train_time: 9.3m tok/s: 6615465 +4692/20000 train_loss: 2.4222 train_time: 9.3m tok/s: 6615218 +4693/20000 train_loss: 2.3704 train_time: 9.3m tok/s: 
6614962 +4694/20000 train_loss: 2.4529 train_time: 9.3m tok/s: 6614699 +4695/20000 train_loss: 2.4441 train_time: 9.3m tok/s: 6614452 +4696/20000 train_loss: 2.3430 train_time: 9.3m tok/s: 6614224 +4697/20000 train_loss: 2.3407 train_time: 9.3m tok/s: 6613989 +4698/20000 train_loss: 2.3711 train_time: 9.3m tok/s: 6613747 +4699/20000 train_loss: 2.3680 train_time: 9.3m tok/s: 6613525 +4700/20000 train_loss: 2.1593 train_time: 9.3m tok/s: 6613285 +4701/20000 train_loss: 2.4174 train_time: 9.3m tok/s: 6613024 +4702/20000 train_loss: 2.5016 train_time: 9.3m tok/s: 6612738 +4703/20000 train_loss: 2.4465 train_time: 9.3m tok/s: 6612475 +4704/20000 train_loss: 2.3664 train_time: 9.3m tok/s: 6612230 +4705/20000 train_loss: 2.3343 train_time: 9.3m tok/s: 6611983 +4706/20000 train_loss: 2.3325 train_time: 9.3m tok/s: 6611732 +4707/20000 train_loss: 2.4488 train_time: 9.3m tok/s: 6611484 +4708/20000 train_loss: 2.3617 train_time: 9.3m tok/s: 6611214 +4709/20000 train_loss: 2.3512 train_time: 9.3m tok/s: 6610974 +4710/20000 train_loss: 2.3956 train_time: 9.3m tok/s: 6610719 +4711/20000 train_loss: 2.3652 train_time: 9.3m tok/s: 6610455 +4712/20000 train_loss: 2.3200 train_time: 9.3m tok/s: 6610212 +4713/20000 train_loss: 2.4440 train_time: 9.3m tok/s: 6609974 +4714/20000 train_loss: 2.3454 train_time: 9.3m tok/s: 6609693 +4715/20000 train_loss: 2.4642 train_time: 9.4m tok/s: 6609446 +4716/20000 train_loss: 2.4964 train_time: 9.4m tok/s: 6609227 +4717/20000 train_loss: 2.3299 train_time: 9.4m tok/s: 6608975 +4718/20000 train_loss: 2.3180 train_time: 9.4m tok/s: 6608728 +4719/20000 train_loss: 2.3508 train_time: 9.4m tok/s: 6608494 +4720/20000 train_loss: 2.3042 train_time: 9.4m tok/s: 6608243 +4721/20000 train_loss: 2.2802 train_time: 9.4m tok/s: 6607989 +4722/20000 train_loss: 2.4781 train_time: 9.4m tok/s: 6607745 +4723/20000 train_loss: 2.3232 train_time: 9.4m tok/s: 6607516 +4724/20000 train_loss: 2.3658 train_time: 9.4m tok/s: 6607285 +4725/20000 train_loss: 2.2568 
train_time: 9.4m tok/s: 6607010 +4726/20000 train_loss: 2.3997 train_time: 9.4m tok/s: 6606752 +4727/20000 train_loss: 2.3791 train_time: 9.4m tok/s: 6606512 +4728/20000 train_loss: 2.3625 train_time: 9.4m tok/s: 6606279 +4729/20000 train_loss: 2.3695 train_time: 9.4m tok/s: 6606029 +4730/20000 train_loss: 2.2011 train_time: 9.4m tok/s: 6605777 +4731/20000 train_loss: 2.3642 train_time: 9.4m tok/s: 6605526 +4732/20000 train_loss: 2.3179 train_time: 9.4m tok/s: 6605281 +4733/20000 train_loss: 2.4135 train_time: 9.4m tok/s: 6605048 +4734/20000 train_loss: 2.3972 train_time: 9.4m tok/s: 6604791 +4735/20000 train_loss: 2.1971 train_time: 9.4m tok/s: 6604528 +4736/20000 train_loss: 2.3380 train_time: 9.4m tok/s: 6604297 +4737/20000 train_loss: 2.4789 train_time: 9.4m tok/s: 6604066 +4738/20000 train_loss: 2.3485 train_time: 9.4m tok/s: 6603819 +4739/20000 train_loss: 2.4849 train_time: 9.4m tok/s: 6603513 +4740/20000 train_loss: 2.4546 train_time: 9.4m tok/s: 6603244 +4741/20000 train_loss: 2.3771 train_time: 9.4m tok/s: 6603001 +4742/20000 train_loss: 2.3396 train_time: 9.4m tok/s: 6602770 +4743/20000 train_loss: 2.2962 train_time: 9.4m tok/s: 6602504 +4744/20000 train_loss: 2.3683 train_time: 9.4m tok/s: 6602271 +4745/20000 train_loss: 2.2571 train_time: 9.4m tok/s: 6602037 +4746/20000 train_loss: 2.2258 train_time: 9.4m tok/s: 6601803 +4747/20000 train_loss: 2.4580 train_time: 9.4m tok/s: 6601578 +4748/20000 train_loss: 2.4003 train_time: 9.4m tok/s: 6601324 +4749/20000 train_loss: 2.3674 train_time: 9.4m tok/s: 6601058 +4750/20000 train_loss: 2.4057 train_time: 9.4m tok/s: 6600829 +4751/20000 train_loss: 2.3263 train_time: 9.4m tok/s: 6600612 +4752/20000 train_loss: 2.3807 train_time: 9.4m tok/s: 6600395 +4753/20000 train_loss: 2.2601 train_time: 9.4m tok/s: 6600143 +4754/20000 train_loss: 2.2218 train_time: 9.4m tok/s: 6599902 +4755/20000 train_loss: 2.4101 train_time: 9.4m tok/s: 6599659 +4756/20000 train_loss: 2.3789 train_time: 9.4m tok/s: 6599416 +4757/20000 
train_loss: 2.3288 train_time: 9.4m tok/s: 6599184 +4758/20000 train_loss: 2.5561 train_time: 9.5m tok/s: 6598935 +4759/20000 train_loss: 2.3883 train_time: 9.5m tok/s: 6598707 +4760/20000 train_loss: 2.2655 train_time: 9.5m tok/s: 6598458 +4761/20000 train_loss: 2.4424 train_time: 9.5m tok/s: 6598228 +4762/20000 train_loss: 2.3452 train_time: 9.5m tok/s: 6597984 +4763/20000 train_loss: 2.4821 train_time: 9.5m tok/s: 6597740 +4764/20000 train_loss: 2.4402 train_time: 9.5m tok/s: 6597499 +4765/20000 train_loss: 2.4231 train_time: 9.5m tok/s: 6597243 +4766/20000 train_loss: 2.3529 train_time: 9.5m tok/s: 6597009 +4767/20000 train_loss: 2.3098 train_time: 9.5m tok/s: 6596770 +4768/20000 train_loss: 2.3795 train_time: 9.5m tok/s: 6596542 +4769/20000 train_loss: 2.3511 train_time: 9.5m tok/s: 6596286 +4770/20000 train_loss: 2.4208 train_time: 9.5m tok/s: 6596048 +4771/20000 train_loss: 2.3472 train_time: 9.5m tok/s: 6595799 +4772/20000 train_loss: 2.4243 train_time: 9.5m tok/s: 6595553 +4773/20000 train_loss: 2.4387 train_time: 9.5m tok/s: 6595324 +4774/20000 train_loss: 2.5482 train_time: 9.5m tok/s: 6595072 +4775/20000 train_loss: 2.4249 train_time: 9.5m tok/s: 6594828 +4776/20000 train_loss: 2.5234 train_time: 9.5m tok/s: 6594607 +4777/20000 train_loss: 2.3981 train_time: 9.5m tok/s: 6594360 +4778/20000 train_loss: 2.3642 train_time: 9.5m tok/s: 6594125 +4779/20000 train_loss: 2.2030 train_time: 9.5m tok/s: 6593864 +4780/20000 train_loss: 2.3174 train_time: 9.5m tok/s: 6593648 +4781/20000 train_loss: 2.3086 train_time: 9.5m tok/s: 6593412 +4782/20000 train_loss: 2.7115 train_time: 9.5m tok/s: 6593145 +4783/20000 train_loss: 2.4247 train_time: 9.5m tok/s: 6592911 +4784/20000 train_loss: 2.3788 train_time: 9.5m tok/s: 6592670 +4785/20000 train_loss: 2.4254 train_time: 9.5m tok/s: 6592444 +4786/20000 train_loss: 2.4206 train_time: 9.5m tok/s: 6592204 +4787/20000 train_loss: 2.3607 train_time: 9.5m tok/s: 6591967 +4788/20000 train_loss: 2.3773 train_time: 9.5m tok/s: 
6591719 +4789/20000 train_loss: 2.1603 train_time: 9.5m tok/s: 6591468 +4790/20000 train_loss: 2.4110 train_time: 9.5m tok/s: 6591226 +4791/20000 train_loss: 2.3540 train_time: 9.5m tok/s: 6590996 +4792/20000 train_loss: 2.2519 train_time: 9.5m tok/s: 6590756 +4793/20000 train_loss: 2.3512 train_time: 9.5m tok/s: 6590528 +4794/20000 train_loss: 2.2290 train_time: 9.5m tok/s: 6590291 +4795/20000 train_loss: 2.3914 train_time: 9.5m tok/s: 6590044 +4796/20000 train_loss: 2.2587 train_time: 9.5m tok/s: 6589825 +4797/20000 train_loss: 2.4032 train_time: 9.5m tok/s: 6589593 +4798/20000 train_loss: 2.5524 train_time: 9.5m tok/s: 6589341 +4799/20000 train_loss: 2.4745 train_time: 9.5m tok/s: 6589099 +4800/20000 train_loss: 2.4482 train_time: 9.5m tok/s: 6588877 +4801/20000 train_loss: 2.4210 train_time: 9.6m tok/s: 6588627 +4802/20000 train_loss: 2.3271 train_time: 9.6m tok/s: 6588388 +4803/20000 train_loss: 2.5179 train_time: 9.6m tok/s: 6588151 +4804/20000 train_loss: 1.9512 train_time: 9.6m tok/s: 6587885 +4805/20000 train_loss: 2.3943 train_time: 9.6m tok/s: 6587630 +4806/20000 train_loss: 2.3767 train_time: 9.6m tok/s: 6587407 +4807/20000 train_loss: 2.3649 train_time: 9.6m tok/s: 6587186 +4808/20000 train_loss: 2.2943 train_time: 9.6m tok/s: 6586974 +4809/20000 train_loss: 2.4094 train_time: 9.6m tok/s: 6586733 +4810/20000 train_loss: 2.4587 train_time: 9.6m tok/s: 6586500 +4811/20000 train_loss: 2.4021 train_time: 9.6m tok/s: 6586270 +4812/20000 train_loss: 2.3557 train_time: 9.6m tok/s: 6586039 +4813/20000 train_loss: 2.3547 train_time: 9.6m tok/s: 6585813 +4814/20000 train_loss: 2.3850 train_time: 9.6m tok/s: 6585598 +4815/20000 train_loss: 2.4439 train_time: 9.6m tok/s: 6585356 +4816/20000 train_loss: 2.3621 train_time: 9.6m tok/s: 6585131 +4817/20000 train_loss: 2.4960 train_time: 9.6m tok/s: 6584880 +4818/20000 train_loss: 2.2846 train_time: 9.6m tok/s: 6584652 +4819/20000 train_loss: 2.4374 train_time: 9.6m tok/s: 6584412 +4820/20000 train_loss: 2.3061 
train_time: 9.6m tok/s: 6584171 +4821/20000 train_loss: 2.3938 train_time: 9.6m tok/s: 6583934 +4822/20000 train_loss: 2.4127 train_time: 9.6m tok/s: 6583698 +4823/20000 train_loss: 2.3806 train_time: 9.6m tok/s: 6583463 +4824/20000 train_loss: 2.4818 train_time: 9.6m tok/s: 6583212 +4825/20000 train_loss: 2.5654 train_time: 9.6m tok/s: 6582964 +4826/20000 train_loss: 2.3006 train_time: 9.6m tok/s: 6582731 +4827/20000 train_loss: 2.2837 train_time: 9.6m tok/s: 6582497 +4828/20000 train_loss: 2.3203 train_time: 9.6m tok/s: 6582256 +4829/20000 train_loss: 2.3589 train_time: 9.6m tok/s: 6582034 +4830/20000 train_loss: 2.3424 train_time: 9.6m tok/s: 6581802 +4831/20000 train_loss: 2.2999 train_time: 9.6m tok/s: 6581565 +4832/20000 train_loss: 2.2531 train_time: 9.6m tok/s: 6581330 +4833/20000 train_loss: 2.2698 train_time: 9.6m tok/s: 6581100 +4834/20000 train_loss: 2.3898 train_time: 9.6m tok/s: 6580869 +4835/20000 train_loss: 2.3357 train_time: 9.6m tok/s: 6580631 +4836/20000 train_loss: 2.3537 train_time: 9.6m tok/s: 6580392 +4837/20000 train_loss: 2.5246 train_time: 9.6m tok/s: 6580156 +4838/20000 train_loss: 2.2803 train_time: 9.6m tok/s: 6579933 +4839/20000 train_loss: 2.4344 train_time: 9.6m tok/s: 6579699 +4840/20000 train_loss: 2.2699 train_time: 9.6m tok/s: 6579471 +4841/20000 train_loss: 2.4284 train_time: 9.6m tok/s: 6579235 +4842/20000 train_loss: 2.6822 train_time: 9.6m tok/s: 6578995 +4843/20000 train_loss: 2.3030 train_time: 9.6m tok/s: 6578766 +4844/20000 train_loss: 2.2924 train_time: 9.7m tok/s: 6578534 +4845/20000 train_loss: 2.3260 train_time: 9.7m tok/s: 6578309 +4846/20000 train_loss: 2.3424 train_time: 9.7m tok/s: 6578058 +4847/20000 train_loss: 2.5513 train_time: 9.7m tok/s: 6577827 +4848/20000 train_loss: 2.3126 train_time: 9.7m tok/s: 6577612 +4849/20000 train_loss: 2.4517 train_time: 9.7m tok/s: 6577406 +4850/20000 train_loss: 2.2968 train_time: 9.7m tok/s: 6577176 +4851/20000 train_loss: 2.3395 train_time: 9.7m tok/s: 6576946 +4852/20000 
train_loss: 2.3566 train_time: 9.7m tok/s: 6576713 +4853/20000 train_loss: 2.2958 train_time: 9.7m tok/s: 6576461 +4854/20000 train_loss: 2.4586 train_time: 9.7m tok/s: 6576224 +4855/20000 train_loss: 2.3479 train_time: 9.7m tok/s: 6575994 +4856/20000 train_loss: 2.2798 train_time: 9.7m tok/s: 6575764 +4857/20000 train_loss: 2.2888 train_time: 9.7m tok/s: 6575528 +4858/20000 train_loss: 2.2703 train_time: 9.7m tok/s: 6575297 +4859/20000 train_loss: 2.3072 train_time: 9.7m tok/s: 6575054 +4860/20000 train_loss: 2.4742 train_time: 9.7m tok/s: 6574843 +4861/20000 train_loss: 2.4070 train_time: 9.7m tok/s: 6574606 +4862/20000 train_loss: 2.3170 train_time: 9.7m tok/s: 6574377 +4863/20000 train_loss: 2.5817 train_time: 9.7m tok/s: 6574152 +4864/20000 train_loss: 2.4021 train_time: 9.7m tok/s: 6573926 +4865/20000 train_loss: 2.5086 train_time: 9.7m tok/s: 6573684 +4866/20000 train_loss: 2.2684 train_time: 9.7m tok/s: 6573432 +4867/20000 train_loss: 2.2499 train_time: 9.7m tok/s: 6573229 +4868/20000 train_loss: 2.3115 train_time: 9.7m tok/s: 6572994 +4869/20000 train_loss: 2.4386 train_time: 9.7m tok/s: 6572734 +4870/20000 train_loss: 2.3615 train_time: 9.7m tok/s: 6572503 +4871/20000 train_loss: 2.3669 train_time: 9.7m tok/s: 6572260 +4872/20000 train_loss: 2.2766 train_time: 9.7m tok/s: 6572002 +4873/20000 train_loss: 2.5455 train_time: 9.7m tok/s: 6571771 +4874/20000 train_loss: 2.4211 train_time: 9.7m tok/s: 6571567 +4875/20000 train_loss: 2.4091 train_time: 9.7m tok/s: 6571338 +4876/20000 train_loss: 2.4026 train_time: 9.7m tok/s: 6571122 +4877/20000 train_loss: 2.3028 train_time: 9.7m tok/s: 6570906 +4878/20000 train_loss: 2.4061 train_time: 9.7m tok/s: 6570666 +4879/20000 train_loss: 2.3609 train_time: 9.7m tok/s: 6570447 +4880/20000 train_loss: 2.4363 train_time: 9.7m tok/s: 6570209 +4881/20000 train_loss: 2.3474 train_time: 9.7m tok/s: 6569989 +4882/20000 train_loss: 2.2730 train_time: 9.7m tok/s: 6569787 +4883/20000 train_loss: 2.2513 train_time: 9.7m tok/s: 
6569565 +4884/20000 train_loss: 2.6065 train_time: 9.7m tok/s: 6569336 +4885/20000 train_loss: 2.4676 train_time: 9.7m tok/s: 6569108 +4886/20000 train_loss: 2.4385 train_time: 9.7m tok/s: 6568877 +4887/20000 train_loss: 2.4054 train_time: 9.8m tok/s: 6568653 +4888/20000 train_loss: 2.4049 train_time: 9.8m tok/s: 6568428 +4889/20000 train_loss: 2.3510 train_time: 9.8m tok/s: 6568204 +4890/20000 train_loss: 2.4081 train_time: 9.8m tok/s: 6567963 +4891/20000 train_loss: 2.1723 train_time: 9.8m tok/s: 6567729 +4892/20000 train_loss: 1.9563 train_time: 9.8m tok/s: 6567455 +4893/20000 train_loss: 2.4398 train_time: 9.8m tok/s: 6567229 +4894/20000 train_loss: 2.3813 train_time: 9.8m tok/s: 6567014 +4895/20000 train_loss: 2.3754 train_time: 9.8m tok/s: 6566811 +4895/20000 val_loss: 2.3589 val_bpb: 1.0778 +stopping_early: wallclock_cap train_time: 586261ms step: 4895/20000 +peak memory allocated: 41707 MiB reserved: 47048 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:2.33485027 val_bpb:1.06686742 eval_time:7879ms +Serialized model: 135418111 bytes +Code size (uncompressed): 182796 bytes +Code size (compressed): 45910 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 4.1s +Quantized weights: + gate_int8_row: blocks.attn.attn_gate_w + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int6)+lqer_asym: blocks.mlp.fc.weight + gptq (int7)+awqgrpint8+lqer_asym: tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights, smear_gate.weight, smear_lambda, softcap_neg, softcap_pos +Serialize: per-group lrzip compression... 
+Serialize: per-group compression done in 122.7s +Serialized model quantized+pergroup: 15943979 bytes +Total submission size quantized+pergroup: 15989889 bytes +Deserialize: per-group lrzip decompression... +Deserialize: decompression done in 21.1s +diagnostic quantized val_loss:2.35208733 val_bpb:1.07474358 eval_time:11043ms +Deserialize: per-group lrzip decompression... +Deserialize: decompression done in 22.5s +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (110.0s) +v5:precomputing ngram hints OUTSIDE eval timer +ngram_tilt:hints total=47851520 gated=13023303 token_gate=628130 within_gate=9866847 word_gate=2891588 agree2plus=303177 +ngram_tilt:precompute_outside_timer_done elapsed=164.53s total_targets=47851520 + +beginning TTT eval timer +ngram_tilt:using_precomputed_hints total_targets=47851520 (precompute time excluded from eval) +ttt_phased: total_docs:50000 prefix_docs:2500 suffix_docs:47500 num_phases:3 boundaries:[833, 1666, 2500] +ttp: b780/782 bl:2.2320 bb:1.0752 rl:2.2320 rb:1.0752 dl:13091-17244 gd:0 +ttp: b764/782 bl:2.2901 bb:1.0727 rl:2.2452 rb:1.0746 dl:4284-4392 gd:0 +ttpp: phase:1/3 pd:1296 gd:833 t:224.0s +tttg: c1/131 lr:0.001000 t:0.3s +tttg: c2/131 lr:0.001000 t:0.4s +tttg: c3/131 lr:0.000999 t:0.5s +tttg: c4/131 lr:0.000999 t:0.6s +tttg: c5/131 lr:0.000998 t:0.7s +tttg: c6/131 lr:0.000996 t:0.8s +tttg: c7/131 lr:0.000995 t:0.9s +tttg: c8/131 lr:0.000993 t:0.9s +tttg: c9/131 lr:0.000991 t:1.0s +tttg: c10/131 lr:0.000988 t:1.1s +tttg: c11/131 lr:0.000985 t:1.1s +tttg: c12/131 lr:0.000982 t:1.2s +tttg: c13/131 lr:0.000979 t:1.3s +tttg: c14/131 lr:0.000976 t:1.4s +tttg: c15/131 lr:0.000972 t:1.4s +tttg: c16/131 lr:0.000968 t:1.5s +tttg: c17/131 lr:0.000963 t:1.6s +tttg: c18/131 lr:0.000958 t:1.7s +tttg: c19/131 lr:0.000953 t:1.7s +tttg: c20/131 lr:0.000948 t:1.8s +tttg: c21/131 lr:0.000943 t:1.9s +tttg: c22/131 lr:0.000937 t:2.0s +tttg: c23/131 lr:0.000931 t:2.0s +tttg: c24/131 lr:0.000925 t:2.1s +tttg: 
c25/131 lr:0.000918 t:2.2s +tttg: c26/131 lr:0.000911 t:2.3s +tttg: c27/131 lr:0.000905 t:2.3s +tttg: c28/131 lr:0.000897 t:2.4s +tttg: c29/131 lr:0.000890 t:2.5s +tttg: c30/131 lr:0.000882 t:2.6s +tttg: c31/131 lr:0.000874 t:2.7s +tttg: c32/131 lr:0.000866 t:2.7s +tttg: c33/131 lr:0.000858 t:2.8s +tttg: c34/131 lr:0.000849 t:2.9s +tttg: c35/131 lr:0.000841 t:2.9s +tttg: c36/131 lr:0.000832 t:3.0s +tttg: c37/131 lr:0.000822 t:3.1s +tttg: c38/131 lr:0.000813 t:3.2s +tttg: c39/131 lr:0.000804 t:3.2s +tttg: c40/131 lr:0.000794 t:3.3s +tttg: c41/131 lr:0.000784 t:3.4s +tttg: c42/131 lr:0.000774 t:3.5s +tttg: c43/131 lr:0.000764 t:3.6s +tttg: c44/131 lr:0.000753 t:3.6s +tttg: c45/131 lr:0.000743 t:3.7s +tttg: c46/131 lr:0.000732 t:3.8s +tttg: c47/131 lr:0.000722 t:3.8s +tttg: c48/131 lr:0.000711 t:3.9s +tttg: c49/131 lr:0.000700 t:4.0s +tttg: c50/131 lr:0.000689 t:4.1s +tttg: c51/131 lr:0.000677 t:4.1s +tttg: c52/131 lr:0.000666 t:4.2s +tttg: c53/131 lr:0.000655 t:4.3s +tttg: c54/131 lr:0.000643 t:4.4s +tttg: c55/131 lr:0.000631 t:4.4s +tttg: c56/131 lr:0.000620 t:4.5s +tttg: c57/131 lr:0.000608 t:4.6s +tttg: c58/131 lr:0.000596 t:4.7s +tttg: c59/131 lr:0.000584 t:4.7s +tttg: c60/131 lr:0.000572 t:4.8s +tttg: c61/131 lr:0.000560 t:4.9s +tttg: c62/131 lr:0.000548 t:5.0s +tttg: c63/131 lr:0.000536 t:5.0s +tttg: c64/131 lr:0.000524 t:5.1s +tttg: c65/131 lr:0.000512 t:5.2s +tttg: c66/131 lr:0.000500 t:5.3s +tttg: c67/131 lr:0.000488 t:5.3s +tttg: c68/131 lr:0.000476 t:5.4s +tttg: c69/131 lr:0.000464 t:5.5s +tttg: c70/131 lr:0.000452 t:5.6s +tttg: c71/131 lr:0.000440 t:5.6s +tttg: c72/131 lr:0.000428 t:5.7s +tttg: c73/131 lr:0.000416 t:5.8s +tttg: c74/131 lr:0.000404 t:5.9s +tttg: c75/131 lr:0.000392 t:6.0s +tttg: c76/131 lr:0.000380 t:6.0s +tttg: c77/131 lr:0.000369 t:6.1s +tttg: c78/131 lr:0.000357 t:6.2s +tttg: c79/131 lr:0.000345 t:6.2s +tttg: c80/131 lr:0.000334 t:6.3s +tttg: c81/131 lr:0.000323 t:6.4s +tttg: c82/131 lr:0.000311 t:6.5s +tttg: c83/131 lr:0.000300 t:6.5s 
+tttg: c84/131 lr:0.000289 t:6.6s +tttg: c85/131 lr:0.000278 t:6.7s +tttg: c86/131 lr:0.000268 t:6.8s +tttg: c87/131 lr:0.000257 t:6.8s +tttg: c88/131 lr:0.000247 t:6.9s +tttg: c89/131 lr:0.000236 t:7.0s +tttg: c90/131 lr:0.000226 t:7.1s +tttg: c91/131 lr:0.000216 t:7.1s +tttg: c92/131 lr:0.000206 t:7.2s +tttg: c93/131 lr:0.000196 t:7.3s +tttg: c94/131 lr:0.000187 t:7.4s +tttg: c95/131 lr:0.000178 t:7.4s +tttg: c96/131 lr:0.000168 t:7.5s +tttg: c97/131 lr:0.000159 t:7.6s +tttg: c98/131 lr:0.000151 t:7.7s +tttg: c99/131 lr:0.000142 t:7.7s +tttg: c100/131 lr:0.000134 t:7.8s +tttg: c101/131 lr:0.000126 t:7.9s +tttg: c102/131 lr:0.000118 t:8.0s +tttg: c103/131 lr:0.000110 t:8.0s +tttg: c104/131 lr:0.000103 t:8.1s +tttg: c105/131 lr:0.000095 t:8.2s +tttg: c106/131 lr:0.000089 t:8.3s +tttg: c107/131 lr:0.000082 t:8.3s +tttg: c108/131 lr:0.000075 t:8.4s +tttg: c109/131 lr:0.000069 t:8.5s +tttg: c110/131 lr:0.000063 t:8.6s +tttg: c111/131 lr:0.000057 t:8.6s +tttg: c112/131 lr:0.000052 t:8.7s +tttg: c113/131 lr:0.000047 t:8.8s +tttg: c114/131 lr:0.000042 t:8.8s +tttg: c115/131 lr:0.000037 t:8.9s +tttg: c116/131 lr:0.000032 t:9.0s +tttg: c117/131 lr:0.000028 t:9.1s +tttg: c118/131 lr:0.000024 t:9.1s +tttg: c119/131 lr:0.000021 t:9.2s +tttg: c120/131 lr:0.000018 t:9.3s +tttg: c121/131 lr:0.000015 t:9.4s +tttg: c122/131 lr:0.000012 t:9.4s +tttg: c123/131 lr:0.000009 t:9.5s +tttg: c124/131 lr:0.000007 t:9.6s +tttg: c125/131 lr:0.000005 t:9.7s +tttg: c126/131 lr:0.000004 t:9.7s +tttg: c127/131 lr:0.000002 t:9.8s +tttg: c128/131 lr:0.000001 t:9.9s +tttg: c129/131 lr:0.000001 t:10.0s +tttg: c130/131 lr:0.000000 t:10.0s +ttpr: phase:1/3 t:235.7s +ttp: b761/782 bl:2.4169 bb:1.1143 rl:2.2749 rb:1.0817 dl:3916-4032 gd:0 +ttpp: phase:2/3 pd:2128 gd:1666 t:394.9s +tttg: c1/219 lr:0.001000 t:0.1s +tttg: c2/219 lr:0.001000 t:0.2s +tttg: c3/219 lr:0.001000 t:0.3s +tttg: c4/219 lr:0.001000 t:0.3s +tttg: c5/219 lr:0.000999 t:0.4s +tttg: c6/219 lr:0.000999 t:0.5s +tttg: c7/219 lr:0.000998 
t:0.6s +tttg: c8/219 lr:0.000997 t:0.6s +tttg: c9/219 lr:0.000997 t:0.7s +tttg: c10/219 lr:0.000996 t:0.8s +tttg: c11/219 lr:0.000995 t:0.9s +tttg: c12/219 lr:0.000994 t:0.9s +tttg: c13/219 lr:0.000993 t:1.0s +tttg: c14/219 lr:0.000991 t:1.1s +tttg: c15/219 lr:0.000990 t:1.2s +tttg: c16/219 lr:0.000988 t:1.2s +tttg: c17/219 lr:0.000987 t:1.3s +tttg: c18/219 lr:0.000985 t:1.4s +tttg: c19/219 lr:0.000983 t:1.5s +tttg: c20/219 lr:0.000981 t:1.5s +tttg: c21/219 lr:0.000979 t:1.6s +tttg: c22/219 lr:0.000977 t:1.7s +tttg: c23/219 lr:0.000975 t:1.8s +tttg: c24/219 lr:0.000973 t:1.8s +tttg: c25/219 lr:0.000970 t:1.9s +tttg: c26/219 lr:0.000968 t:2.0s +tttg: c27/219 lr:0.000965 t:2.1s +tttg: c28/219 lr:0.000963 t:2.1s +tttg: c29/219 lr:0.000960 t:2.2s +tttg: c30/219 lr:0.000957 t:2.3s +tttg: c31/219 lr:0.000954 t:2.4s +tttg: c32/219 lr:0.000951 t:2.4s +tttg: c33/219 lr:0.000948 t:2.5s +tttg: c34/219 lr:0.000945 t:2.6s +tttg: c35/219 lr:0.000941 t:2.7s +tttg: c36/219 lr:0.000938 t:2.7s +tttg: c37/219 lr:0.000934 t:2.8s +tttg: c38/219 lr:0.000931 t:2.9s +tttg: c39/219 lr:0.000927 t:3.0s +tttg: c40/219 lr:0.000923 t:3.0s +tttg: c41/219 lr:0.000919 t:3.1s +tttg: c42/219 lr:0.000915 t:3.2s +tttg: c43/219 lr:0.000911 t:3.3s +tttg: c44/219 lr:0.000907 t:3.3s +tttg: c45/219 lr:0.000903 t:3.4s +tttg: c46/219 lr:0.000898 t:3.5s +tttg: c47/219 lr:0.000894 t:3.6s +tttg: c48/219 lr:0.000890 t:3.6s +tttg: c49/219 lr:0.000885 t:3.7s +tttg: c50/219 lr:0.000880 t:3.8s +tttg: c51/219 lr:0.000876 t:3.9s +tttg: c52/219 lr:0.000871 t:3.9s +tttg: c53/219 lr:0.000866 t:4.0s +tttg: c54/219 lr:0.000861 t:4.1s +tttg: c55/219 lr:0.000856 t:4.2s +tttg: c56/219 lr:0.000851 t:4.2s +tttg: c57/219 lr:0.000846 t:4.3s +tttg: c58/219 lr:0.000841 t:4.4s +tttg: c59/219 lr:0.000835 t:4.5s +tttg: c60/219 lr:0.000830 t:4.5s +tttg: c61/219 lr:0.000824 t:4.6s +tttg: c62/219 lr:0.000819 t:4.7s +tttg: c63/219 lr:0.000813 t:4.8s +tttg: c64/219 lr:0.000808 t:4.8s +tttg: c65/219 lr:0.000802 t:4.9s +tttg: c66/219 
lr:0.000796 t:5.0s +tttg: c67/219 lr:0.000790 t:5.1s +tttg: c68/219 lr:0.000784 t:5.1s +tttg: c69/219 lr:0.000779 t:5.2s +tttg: c70/219 lr:0.000773 t:5.3s +tttg: c71/219 lr:0.000766 t:5.4s +tttg: c72/219 lr:0.000760 t:5.4s +tttg: c73/219 lr:0.000754 t:5.5s +tttg: c74/219 lr:0.000748 t:5.6s +tttg: c75/219 lr:0.000742 t:5.7s +tttg: c76/219 lr:0.000735 t:5.8s +tttg: c77/219 lr:0.000729 t:5.8s +tttg: c78/219 lr:0.000722 t:5.9s +tttg: c79/219 lr:0.000716 t:6.0s +tttg: c80/219 lr:0.000709 t:6.1s +tttg: c81/219 lr:0.000703 t:6.1s +tttg: c82/219 lr:0.000696 t:6.2s +tttg: c83/219 lr:0.000690 t:6.3s +tttg: c84/219 lr:0.000683 t:6.4s +tttg: c85/219 lr:0.000676 t:6.4s +tttg: c86/219 lr:0.000670 t:6.5s +tttg: c87/219 lr:0.000663 t:6.6s +tttg: c88/219 lr:0.000656 t:6.6s +tttg: c89/219 lr:0.000649 t:6.7s +tttg: c90/219 lr:0.000642 t:6.8s +tttg: c91/219 lr:0.000635 t:6.9s +tttg: c92/219 lr:0.000628 t:7.0s +tttg: c93/219 lr:0.000621 t:7.0s +tttg: c94/219 lr:0.000614 t:7.1s +tttg: c95/219 lr:0.000607 t:7.2s +tttg: c96/219 lr:0.000600 t:7.3s +tttg: c97/219 lr:0.000593 t:7.3s +tttg: c98/219 lr:0.000586 t:7.4s +tttg: c99/219 lr:0.000579 t:7.5s +tttg: c100/219 lr:0.000572 t:7.6s +tttg: c101/219 lr:0.000565 t:7.6s +tttg: c102/219 lr:0.000558 t:7.7s +tttg: c103/219 lr:0.000550 t:7.8s +tttg: c104/219 lr:0.000543 t:7.9s +tttg: c105/219 lr:0.000536 t:7.9s +tttg: c106/219 lr:0.000529 t:8.0s +tttg: c107/219 lr:0.000522 t:8.1s +tttg: c108/219 lr:0.000514 t:8.2s +tttg: c109/219 lr:0.000507 t:8.2s +tttg: c110/219 lr:0.000500 t:8.3s +tttg: c111/219 lr:0.000493 t:8.4s +tttg: c112/219 lr:0.000486 t:8.5s +tttg: c113/219 lr:0.000478 t:8.5s +tttg: c114/219 lr:0.000471 t:8.6s +tttg: c115/219 lr:0.000464 t:8.7s +tttg: c116/219 lr:0.000457 t:8.8s +tttg: c117/219 lr:0.000450 t:8.8s +tttg: c118/219 lr:0.000442 t:8.9s +tttg: c119/219 lr:0.000435 t:9.0s +tttg: c120/219 lr:0.000428 t:9.1s +tttg: c121/219 lr:0.000421 t:9.1s +tttg: c122/219 lr:0.000414 t:9.2s +tttg: c123/219 lr:0.000407 t:9.3s +tttg: c124/219 
lr:0.000400 t:9.4s +tttg: c125/219 lr:0.000393 t:9.5s +tttg: c126/219 lr:0.000386 t:9.5s +tttg: c127/219 lr:0.000379 t:9.6s +tttg: c128/219 lr:0.000372 t:9.7s +tttg: c129/219 lr:0.000365 t:9.7s +tttg: c130/219 lr:0.000358 t:9.8s +tttg: c131/219 lr:0.000351 t:9.9s +tttg: c132/219 lr:0.000344 t:10.0s +tttg: c133/219 lr:0.000337 t:10.1s +tttg: c134/219 lr:0.000330 t:10.1s +tttg: c135/219 lr:0.000324 t:10.2s +tttg: c136/219 lr:0.000317 t:10.3s +tttg: c137/219 lr:0.000310 t:10.4s +tttg: c138/219 lr:0.000304 t:10.4s +tttg: c139/219 lr:0.000297 t:10.5s +tttg: c140/219 lr:0.000291 t:10.6s +tttg: c141/219 lr:0.000284 t:10.7s +tttg: c142/219 lr:0.000278 t:10.7s +tttg: c143/219 lr:0.000271 t:10.8s +tttg: c144/219 lr:0.000265 t:10.9s +tttg: c145/219 lr:0.000258 t:11.0s +tttg: c146/219 lr:0.000252 t:11.1s +tttg: c147/219 lr:0.000246 t:11.1s +tttg: c148/219 lr:0.000240 t:11.2s +tttg: c149/219 lr:0.000234 t:11.3s +tttg: c150/219 lr:0.000227 t:11.4s +tttg: c151/219 lr:0.000221 t:11.4s +tttg: c152/219 lr:0.000216 t:11.5s +tttg: c153/219 lr:0.000210 t:11.6s +tttg: c154/219 lr:0.000204 t:11.7s +tttg: c155/219 lr:0.000198 t:11.7s +tttg: c156/219 lr:0.000192 t:11.8s +tttg: c157/219 lr:0.000187 t:11.9s +tttg: c158/219 lr:0.000181 t:11.9s +tttg: c159/219 lr:0.000176 t:12.0s +tttg: c160/219 lr:0.000170 t:12.1s +tttg: c161/219 lr:0.000165 t:12.2s +tttg: c162/219 lr:0.000159 t:12.3s +tttg: c163/219 lr:0.000154 t:12.3s +tttg: c164/219 lr:0.000149 t:12.4s +tttg: c165/219 lr:0.000144 t:12.5s +tttg: c166/219 lr:0.000139 t:12.6s +tttg: c167/219 lr:0.000134 t:12.6s +tttg: c168/219 lr:0.000129 t:12.7s +tttg: c169/219 lr:0.000124 t:12.8s +tttg: c170/219 lr:0.000120 t:12.9s +tttg: c171/219 lr:0.000115 t:12.9s +tttg: c172/219 lr:0.000110 t:13.0s +tttg: c173/219 lr:0.000106 t:13.1s +tttg: c174/219 lr:0.000102 t:13.2s +tttg: c175/219 lr:0.000097 t:13.2s +tttg: c176/219 lr:0.000093 t:13.3s +tttg: c177/219 lr:0.000089 t:13.4s +tttg: c178/219 lr:0.000085 t:13.5s +tttg: c179/219 lr:0.000081 t:13.5s +tttg: 
c180/219 lr:0.000077 t:13.6s +tttg: c181/219 lr:0.000073 t:13.7s +tttg: c182/219 lr:0.000069 t:13.8s +tttg: c183/219 lr:0.000066 t:13.8s +tttg: c184/219 lr:0.000062 t:13.9s +tttg: c185/219 lr:0.000059 t:14.0s +tttg: c186/219 lr:0.000055 t:14.1s +tttg: c187/219 lr:0.000052 t:14.1s +tttg: c188/219 lr:0.000049 t:14.2s +tttg: c189/219 lr:0.000046 t:14.3s +tttg: c190/219 lr:0.000043 t:14.4s +tttg: c191/219 lr:0.000040 t:14.4s +tttg: c192/219 lr:0.000037 t:14.5s +tttg: c193/219 lr:0.000035 t:14.6s +tttg: c194/219 lr:0.000032 t:14.7s +tttg: c195/219 lr:0.000030 t:14.7s +tttg: c196/219 lr:0.000027 t:14.8s +tttg: c197/219 lr:0.000025 t:14.9s +tttg: c198/219 lr:0.000023 t:15.0s +tttg: c199/219 lr:0.000021 t:15.0s +tttg: c200/219 lr:0.000019 t:15.1s +tttg: c201/219 lr:0.000017 t:15.2s +tttg: c202/219 lr:0.000015 t:15.3s +tttg: c203/219 lr:0.000013 t:15.3s +tttg: c204/219 lr:0.000012 t:15.4s +tttg: c205/219 lr:0.000010 t:15.5s +tttg: c206/219 lr:0.000009 t:15.6s +tttg: c207/219 lr:0.000007 t:15.6s +tttg: c208/219 lr:0.000006 t:15.7s +tttg: c209/219 lr:0.000005 t:15.8s +tttg: c210/219 lr:0.000004 t:15.9s +tttg: c211/219 lr:0.000003 t:15.9s +tttg: c212/219 lr:0.000003 t:16.0s +tttg: c213/219 lr:0.000002 t:16.1s +tttg: c214/219 lr:0.000001 t:16.2s +tttg: c215/219 lr:0.000001 t:16.2s +tttg: c216/219 lr:0.000000 t:16.3s +tttg: c217/219 lr:0.000000 t:16.4s +tttg: c218/219 lr:0.000000 t:16.5s +ttpr: phase:2/3 t:413.0s +ttp: b743/782 bl:2.3379 bb:1.0652 rl:2.2817 rb:1.0799 dl:2762-2805 gd:0 +ttp: b738/782 bl:2.3088 bb:1.0455 rl:2.2842 rb:1.0766 dl:2583-2618 gd:0 +ttpp: phase:3/3 pd:2960 gd:2500 t:429.3s +tttg: c1/289 lr:0.001000 t:0.1s +tttg: c2/289 lr:0.001000 t:0.2s +tttg: c3/289 lr:0.001000 t:0.2s +tttg: c4/289 lr:0.001000 t:0.3s +tttg: c5/289 lr:0.001000 t:0.4s +tttg: c6/289 lr:0.000999 t:0.5s +tttg: c7/289 lr:0.000999 t:0.5s +tttg: c8/289 lr:0.000999 t:0.6s +tttg: c9/289 lr:0.000998 t:0.7s +tttg: c10/289 lr:0.000998 t:0.7s +tttg: c11/289 lr:0.000997 t:0.8s +tttg: c12/289 
lr:0.000996 t:0.9s +tttg: c13/289 lr:0.000996 t:1.0s +tttg: c14/289 lr:0.000995 t:1.1s +tttg: c15/289 lr:0.000994 t:1.1s +tttg: c16/289 lr:0.000993 t:1.2s +tttg: c17/289 lr:0.000992 t:1.3s +tttg: c18/289 lr:0.000991 t:1.4s +tttg: c19/289 lr:0.000990 t:1.4s +tttg: c20/289 lr:0.000989 t:1.5s +tttg: c21/289 lr:0.000988 t:1.6s +tttg: c22/289 lr:0.000987 t:1.6s +tttg: c23/289 lr:0.000986 t:1.7s +tttg: c24/289 lr:0.000984 t:1.8s +tttg: c25/289 lr:0.000983 t:1.9s +tttg: c26/289 lr:0.000982 t:1.9s +tttg: c27/289 lr:0.000980 t:2.0s +tttg: c28/289 lr:0.000978 t:2.1s +tttg: c29/289 lr:0.000977 t:2.2s +tttg: c30/289 lr:0.000975 t:2.3s +tttg: c31/289 lr:0.000973 t:2.3s +tttg: c32/289 lr:0.000972 t:2.4s +tttg: c33/289 lr:0.000970 t:2.5s +tttg: c34/289 lr:0.000968 t:2.6s +tttg: c35/289 lr:0.000966 t:2.6s +tttg: c36/289 lr:0.000964 t:2.7s +tttg: c37/289 lr:0.000962 t:2.8s +tttg: c38/289 lr:0.000960 t:2.9s +tttg: c39/289 lr:0.000958 t:2.9s +tttg: c40/289 lr:0.000955 t:3.0s +tttg: c41/289 lr:0.000953 t:3.1s +tttg: c42/289 lr:0.000951 t:3.2s +tttg: c43/289 lr:0.000948 t:3.2s +tttg: c44/289 lr:0.000946 t:3.3s +tttg: c45/289 lr:0.000944 t:3.4s +tttg: c46/289 lr:0.000941 t:3.5s +tttg: c47/289 lr:0.000938 t:3.5s +tttg: c48/289 lr:0.000936 t:3.6s +tttg: c49/289 lr:0.000933 t:3.7s +tttg: c50/289 lr:0.000930 t:3.8s +tttg: c51/289 lr:0.000927 t:3.8s +tttg: c52/289 lr:0.000925 t:3.9s +tttg: c53/289 lr:0.000922 t:4.0s +tttg: c54/289 lr:0.000919 t:4.1s +tttg: c55/289 lr:0.000916 t:4.1s +tttg: c56/289 lr:0.000913 t:4.2s +tttg: c57/289 lr:0.000910 t:4.3s +tttg: c58/289 lr:0.000906 t:4.4s +tttg: c59/289 lr:0.000903 t:4.4s +tttg: c60/289 lr:0.000900 t:4.5s +tttg: c61/289 lr:0.000897 t:4.6s +tttg: c62/289 lr:0.000893 t:4.7s +tttg: c63/289 lr:0.000890 t:4.7s +tttg: c64/289 lr:0.000887 t:4.8s +tttg: c65/289 lr:0.000883 t:4.9s +tttg: c66/289 lr:0.000879 t:5.0s +tttg: c67/289 lr:0.000876 t:5.0s +tttg: c68/289 lr:0.000872 t:5.1s +tttg: c69/289 lr:0.000869 t:5.2s +tttg: c70/289 lr:0.000865 t:5.3s +tttg: 
c71/289 lr:0.000861 t:5.3s +tttg: c72/289 lr:0.000857 t:5.4s +tttg: c73/289 lr:0.000854 t:5.5s +tttg: c74/289 lr:0.000850 t:5.6s +tttg: c75/289 lr:0.000846 t:5.6s +tttg: c76/289 lr:0.000842 t:5.7s +tttg: c77/289 lr:0.000838 t:5.8s +tttg: c78/289 lr:0.000834 t:5.9s +tttg: c79/289 lr:0.000830 t:5.9s +tttg: c80/289 lr:0.000826 t:6.0s +tttg: c81/289 lr:0.000821 t:6.1s +tttg: c82/289 lr:0.000817 t:6.2s +tttg: c83/289 lr:0.000813 t:6.2s +tttg: c84/289 lr:0.000809 t:6.3s +tttg: c85/289 lr:0.000804 t:6.4s +tttg: c86/289 lr:0.000800 t:6.5s +tttg: c87/289 lr:0.000796 t:6.5s +tttg: c88/289 lr:0.000791 t:6.6s +tttg: c89/289 lr:0.000787 t:6.7s +tttg: c90/289 lr:0.000782 t:6.8s +tttg: c91/289 lr:0.000778 t:6.8s +tttg: c92/289 lr:0.000773 t:6.9s +tttg: c93/289 lr:0.000769 t:7.0s +tttg: c94/289 lr:0.000764 t:7.1s +tttg: c95/289 lr:0.000759 t:7.1s +tttg: c96/289 lr:0.000755 t:7.2s +tttg: c97/289 lr:0.000750 t:7.3s +tttg: c98/289 lr:0.000745 t:7.4s +tttg: c99/289 lr:0.000740 t:7.4s +tttg: c100/289 lr:0.000736 t:7.5s +tttg: c101/289 lr:0.000731 t:7.6s +tttg: c102/289 lr:0.000726 t:7.7s +tttg: c103/289 lr:0.000721 t:7.7s +tttg: c104/289 lr:0.000716 t:7.8s +tttg: c105/289 lr:0.000711 t:7.9s +tttg: c106/289 lr:0.000706 t:8.0s +tttg: c107/289 lr:0.000701 t:8.0s +tttg: c108/289 lr:0.000696 t:8.1s +tttg: c109/289 lr:0.000691 t:8.2s +tttg: c110/289 lr:0.000686 t:8.3s +tttg: c111/289 lr:0.000681 t:8.3s +tttg: c112/289 lr:0.000676 t:8.4s +tttg: c113/289 lr:0.000671 t:8.5s +tttg: c114/289 lr:0.000666 t:8.6s +tttg: c115/289 lr:0.000661 t:8.6s +tttg: c116/289 lr:0.000656 t:8.7s +tttg: c117/289 lr:0.000650 t:8.8s +tttg: c118/289 lr:0.000645 t:8.9s +tttg: c119/289 lr:0.000640 t:8.9s +tttg: c120/289 lr:0.000635 t:9.0s +tttg: c121/289 lr:0.000629 t:9.1s +tttg: c122/289 lr:0.000624 t:9.2s +tttg: c123/289 lr:0.000619 t:9.2s +tttg: c124/289 lr:0.000614 t:9.3s +tttg: c125/289 lr:0.000608 t:9.4s +tttg: c126/289 lr:0.000603 t:9.5s +tttg: c127/289 lr:0.000598 t:9.5s +tttg: c128/289 lr:0.000592 t:9.6s 
+tttg: c129/289 lr:0.000587 t:9.7s +tttg: c130/289 lr:0.000581 t:9.8s +tttg: c131/289 lr:0.000576 t:9.8s +tttg: c132/289 lr:0.000571 t:9.9s +tttg: c133/289 lr:0.000565 t:10.0s +tttg: c134/289 lr:0.000560 t:10.1s +tttg: c135/289 lr:0.000554 t:10.1s +tttg: c136/289 lr:0.000549 t:10.2s +tttg: c137/289 lr:0.000544 t:10.3s +tttg: c138/289 lr:0.000538 t:10.3s +tttg: c139/289 lr:0.000533 t:10.4s +tttg: c140/289 lr:0.000527 t:10.5s +tttg: c141/289 lr:0.000522 t:10.6s +tttg: c142/289 lr:0.000516 t:10.6s +tttg: c143/289 lr:0.000511 t:10.7s +tttg: c144/289 lr:0.000505 t:10.8s +tttg: c145/289 lr:0.000500 t:10.9s +tttg: c146/289 lr:0.000495 t:11.0s +tttg: c147/289 lr:0.000489 t:11.0s +tttg: c148/289 lr:0.000484 t:11.1s +tttg: c149/289 lr:0.000478 t:11.3s +tttg: c150/289 lr:0.000473 t:11.3s +tttg: c151/289 lr:0.000467 t:11.5s +tttg: c152/289 lr:0.000462 t:11.5s +tttg: c153/289 lr:0.000456 t:11.6s +tttg: c154/289 lr:0.000451 t:11.7s +tttg: c155/289 lr:0.000446 t:11.8s +tttg: c156/289 lr:0.000440 t:11.8s +tttg: c157/289 lr:0.000435 t:11.9s +tttg: c158/289 lr:0.000429 t:12.0s +tttg: c159/289 lr:0.000424 t:12.1s +tttg: c160/289 lr:0.000419 t:12.1s +tttg: c161/289 lr:0.000413 t:12.2s +tttg: c162/289 lr:0.000408 t:12.3s +tttg: c163/289 lr:0.000402 t:12.4s +tttg: c164/289 lr:0.000397 t:12.4s +tttg: c165/289 lr:0.000392 t:12.5s +tttg: c166/289 lr:0.000386 t:12.6s +tttg: c167/289 lr:0.000381 t:12.7s +tttg: c168/289 lr:0.000376 t:12.8s +tttg: c169/289 lr:0.000371 t:12.8s +tttg: c170/289 lr:0.000365 t:12.9s +tttg: c171/289 lr:0.000360 t:13.0s +tttg: c172/289 lr:0.000355 t:13.1s +tttg: c173/289 lr:0.000350 t:13.1s +tttg: c174/289 lr:0.000344 t:13.2s +tttg: c175/289 lr:0.000339 t:13.3s +tttg: c176/289 lr:0.000334 t:13.4s +tttg: c177/289 lr:0.000329 t:13.4s +tttg: c178/289 lr:0.000324 t:13.5s +tttg: c179/289 lr:0.000319 t:13.6s +tttg: c180/289 lr:0.000314 t:13.6s +tttg: c181/289 lr:0.000309 t:13.7s +tttg: c182/289 lr:0.000304 t:13.8s +tttg: c183/289 lr:0.000299 t:13.9s +tttg: c184/289 
lr:0.000294 t:14.0s +tttg: c185/289 lr:0.000289 t:14.0s +tttg: c186/289 lr:0.000284 t:14.1s +tttg: c187/289 lr:0.000279 t:14.2s +tttg: c188/289 lr:0.000274 t:14.3s +tttg: c189/289 lr:0.000269 t:14.3s +tttg: c190/289 lr:0.000264 t:14.4s +tttg: c191/289 lr:0.000260 t:14.5s +tttg: c192/289 lr:0.000255 t:14.5s +tttg: c193/289 lr:0.000250 t:14.6s +tttg: c194/289 lr:0.000245 t:14.7s +tttg: c195/289 lr:0.000241 t:14.8s +tttg: c196/289 lr:0.000236 t:14.9s +tttg: c197/289 lr:0.000231 t:14.9s +tttg: c198/289 lr:0.000227 t:15.0s +tttg: c199/289 lr:0.000222 t:15.1s +tttg: c200/289 lr:0.000218 t:15.1s +tttg: c201/289 lr:0.000213 t:15.2s +tttg: c202/289 lr:0.000209 t:15.3s +tttg: c203/289 lr:0.000204 t:15.4s +tttg: c204/289 lr:0.000200 t:15.5s +tttg: c205/289 lr:0.000196 t:15.5s +tttg: c206/289 lr:0.000191 t:15.6s +tttg: c207/289 lr:0.000187 t:15.7s +tttg: c208/289 lr:0.000183 t:15.7s +tttg: c209/289 lr:0.000179 t:15.8s +tttg: c210/289 lr:0.000174 t:15.9s +tttg: c211/289 lr:0.000170 t:16.0s +tttg: c212/289 lr:0.000166 t:16.1s +tttg: c213/289 lr:0.000162 t:16.1s +tttg: c214/289 lr:0.000158 t:16.2s +tttg: c215/289 lr:0.000154 t:16.3s +tttg: c216/289 lr:0.000150 t:16.4s +tttg: c217/289 lr:0.000146 t:16.4s +tttg: c218/289 lr:0.000143 t:16.5s +tttg: c219/289 lr:0.000139 t:16.6s +tttg: c220/289 lr:0.000135 t:16.7s +tttg: c221/289 lr:0.000131 t:16.7s +tttg: c222/289 lr:0.000128 t:16.8s +tttg: c223/289 lr:0.000124 t:16.9s +tttg: c224/289 lr:0.000121 t:17.0s +tttg: c225/289 lr:0.000117 t:17.0s +tttg: c226/289 lr:0.000113 t:17.1s +tttg: c227/289 lr:0.000110 t:17.2s +tttg: c228/289 lr:0.000107 t:17.3s +tttg: c229/289 lr:0.000103 t:17.3s +tttg: c230/289 lr:0.000100 t:17.4s +tttg: c231/289 lr:0.000097 t:17.5s +tttg: c232/289 lr:0.000094 t:17.6s +tttg: c233/289 lr:0.000090 t:17.6s +tttg: c234/289 lr:0.000087 t:17.7s +tttg: c235/289 lr:0.000084 t:17.8s +tttg: c236/289 lr:0.000081 t:17.9s +tttg: c237/289 lr:0.000078 t:17.9s +tttg: c238/289 lr:0.000075 t:18.0s +tttg: c239/289 lr:0.000073 t:18.1s 
+tttg: c240/289 lr:0.000070 t:18.2s +tttg: c241/289 lr:0.000067 t:18.2s +tttg: c242/289 lr:0.000064 t:18.3s +tttg: c243/289 lr:0.000062 t:18.4s +tttg: c244/289 lr:0.000059 t:18.5s +tttg: c245/289 lr:0.000056 t:18.5s +tttg: c246/289 lr:0.000054 t:18.6s +tttg: c247/289 lr:0.000052 t:18.7s +tttg: c248/289 lr:0.000049 t:18.8s +tttg: c249/289 lr:0.000047 t:18.8s +tttg: c250/289 lr:0.000045 t:18.9s +tttg: c251/289 lr:0.000042 t:19.0s +tttg: c252/289 lr:0.000040 t:19.1s +tttg: c253/289 lr:0.000038 t:19.1s +tttg: c254/289 lr:0.000036 t:19.2s +tttg: c255/289 lr:0.000034 t:19.3s +tttg: c256/289 lr:0.000032 t:19.4s +tttg: c257/289 lr:0.000030 t:19.4s +tttg: c258/289 lr:0.000028 t:19.5s +tttg: c259/289 lr:0.000027 t:19.6s +tttg: c260/289 lr:0.000025 t:19.7s +tttg: c261/289 lr:0.000023 t:21.4s +tttg: c262/289 lr:0.000022 t:21.5s +tttg: c263/289 lr:0.000020 t:21.6s +tttg: c264/289 lr:0.000018 t:21.7s +tttg: c265/289 lr:0.000017 t:21.7s +tttg: c266/289 lr:0.000016 t:21.8s +tttg: c267/289 lr:0.000014 t:21.9s +tttg: c268/289 lr:0.000013 t:22.0s +tttg: c269/289 lr:0.000012 t:22.0s +tttg: c270/289 lr:0.000011 t:22.1s +tttg: c271/289 lr:0.000010 t:22.2s +tttg: c272/289 lr:0.000009 t:22.3s +tttg: c273/289 lr:0.000008 t:22.3s +tttg: c274/289 lr:0.000007 t:22.4s +tttg: c275/289 lr:0.000006 t:22.5s +tttg: c276/289 lr:0.000005 t:22.6s +tttg: c277/289 lr:0.000004 t:22.6s +tttg: c278/289 lr:0.000004 t:22.7s +tttg: c279/289 lr:0.000003 t:22.8s +tttg: c280/289 lr:0.000002 t:22.9s +tttg: c281/289 lr:0.000002 t:22.9s +tttg: c282/289 lr:0.000001 t:23.0s +tttg: c283/289 lr:0.000001 t:23.1s +tttg: c284/289 lr:0.000001 t:23.2s +tttg: c285/289 lr:0.000000 t:23.2s +tttg: c286/289 lr:0.000000 t:23.3s +tttg: c287/289 lr:0.000000 t:23.4s +tttg: c288/289 lr:0.000000 t:23.5s +ttpr: phase:3/3 t:454.4s +ttp: b731/782 bl:2.3405 bb:1.0438 rl:2.2886 rb:1.0739 dl:2377-2414 gd:1 +ttp: b723/782 bl:2.2948 bb:1.0303 rl:2.2890 rb:1.0709 dl:2185-2203 gd:1 +ttp: b716/782 bl:2.2489 bb:1.0392 rl:2.2866 rb:1.0690 
dl:2054-2069 gd:1 +ttp: b705/782 bl:2.3634 bb:1.0623 rl:2.2906 rb:1.0686 dl:1885-1898 gd:1 +ttp: b700/782 bl:2.2713 bb:1.0143 rl:2.2897 rb:1.0660 dl:1824-1834 gd:1 +ttp: b688/782 bl:2.3978 bb:1.0735 rl:2.2942 rb:1.0663 dl:1696-1706 gd:1 +ttp: b683/782 bl:2.2701 bb:1.0567 rl:2.2933 rb:1.0659 dl:1646-1657 gd:1 +ttp: b677/782 bl:2.3072 bb:1.0337 rl:2.2938 rb:1.0647 dl:1595-1601 gd:1 +ttp: b668/782 bl:2.3286 bb:1.0646 rl:2.2949 rb:1.0647 dl:1521-1530 gd:1 +ttp: b662/782 bl:2.2949 bb:1.0258 rl:2.2949 rb:1.0634 dl:1480-1486 gd:1 +ttp: b655/782 bl:2.3777 bb:1.0428 rl:2.2974 rb:1.0628 dl:1432-1439 gd:1 +ttp: b647/782 bl:2.2730 bb:1.0316 rl:2.2967 rb:1.0619 dl:1382-1387 gd:1 +ttp: b639/782 bl:2.3074 bb:1.0304 rl:2.2970 rb:1.0610 dl:1331-1337 gd:1 +ttp: b630/782 bl:2.3229 bb:1.0392 rl:2.2976 rb:1.0605 dl:1280-1285 gd:1 +ttp: b620/782 bl:2.3396 bb:1.0538 rl:2.2986 rb:1.0603 dl:1226-1231 gd:1 +ttp: b611/782 bl:2.2932 bb:1.0240 rl:2.2985 rb:1.0595 dl:1182-1186 gd:1 +ttp: b604/782 bl:2.3738 bb:1.0420 rl:2.3000 rb:1.0591 dl:1150-1154 gd:1 +ttp: b595/782 bl:2.3426 bb:1.0574 rl:2.3009 rb:1.0591 dl:1110-1115 gd:1 +ttp: b587/782 bl:2.4041 bb:1.0668 rl:2.3028 rb:1.0592 dl:1077-1081 gd:1 +ttp: b579/782 bl:2.3386 bb:1.0336 rl:2.3034 rb:1.0588 dl:1044-1048 gd:1 +ttp: b573/782 bl:2.3620 bb:1.0647 rl:2.3044 rb:1.0589 dl:1021-1025 gd:1 +ttp: b564/782 bl:2.2822 bb:1.0155 rl:2.3041 rb:1.0581 dl:990-993 gd:1 +ttp: b553/782 bl:2.2838 bb:1.0297 rl:2.3038 rb:1.0577 dl:952-955 gd:1 +ttp: b546/782 bl:2.3192 bb:1.0312 rl:2.3040 rb:1.0573 dl:930-934 gd:1 +ttp: b538/782 bl:2.3313 bb:1.0437 rl:2.3044 rb:1.0571 dl:905-909 gd:1 +ttp: b529/782 bl:2.3069 bb:1.0134 rl:2.3044 rb:1.0565 dl:878-882 gd:1 +ttp: b520/782 bl:2.3222 bb:1.0013 rl:2.3046 rb:1.0557 dl:852-854 gd:1 +ttp: b513/782 bl:2.3636 bb:1.0376 rl:2.3054 rb:1.0555 dl:832-835 gd:1 +ttp: b505/782 bl:2.3251 bb:1.0633 rl:2.3056 rb:1.0556 dl:809-812 gd:1 +ttp: b497/782 bl:2.3344 bb:1.0411 rl:2.3060 rb:1.0554 dl:788-791 gd:1 +ttp: b489/782 bl:2.3854 
bb:1.0732 rl:2.3068 rb:1.0556 dl:769-771 gd:1 +ttp: b478/782 bl:2.3345 bb:1.0750 rl:2.3071 rb:1.0558 dl:742-744 gd:1 +ttp: b470/782 bl:2.3440 bb:1.0549 rl:2.3075 rb:1.0558 dl:724-726 gd:1 +ttp: b462/782 bl:2.3329 bb:1.0354 rl:2.3078 rb:1.0556 dl:706-708 gd:1 +ttp: b454/782 bl:2.3804 bb:1.0811 rl:2.3085 rb:1.0558 dl:689-691 gd:1 +ttp: b446/782 bl:2.2870 bb:1.0750 rl:2.3083 rb:1.0560 dl:672-674 gd:1 +ttp: b437/782 bl:2.2865 bb:1.0521 rl:2.3081 rb:1.0560 dl:653-655 gd:1 +ttp: b429/782 bl:2.2442 bb:1.0237 rl:2.3075 rb:1.0557 dl:638-640 gd:1 +ttp: b421/782 bl:2.2878 bb:1.0016 rl:2.3074 rb:1.0552 dl:622-624 gd:1 +ttp: b413/782 bl:2.3637 bb:1.0594 rl:2.3078 rb:1.0552 dl:607-609 gd:1 +ttp: b406/782 bl:2.3111 bb:1.0643 rl:2.3078 rb:1.0553 dl:593-595 gd:1 +ttp: b397/782 bl:2.3518 bb:1.0430 rl:2.3082 rb:1.0552 dl:577-579 gd:1 +ttp: b389/782 bl:2.2863 bb:1.0829 rl:2.3080 rb:1.0554 dl:563-564 gd:1 +ttp: b381/782 bl:2.4201 bb:1.1001 rl:2.3088 rb:1.0557 dl:549-550 gd:1 +ttp: b373/782 bl:2.4044 bb:1.0972 rl:2.3095 rb:1.0560 dl:535-537 gd:1 +ttp: b365/782 bl:2.3237 bb:1.0325 rl:2.3096 rb:1.0559 dl:522-524 gd:1 +ttp: b357/782 bl:2.3211 bb:1.0641 rl:2.3096 rb:1.0559 dl:508-510 gd:1 +ttp: b351/782 bl:2.3621 bb:1.0813 rl:2.3100 rb:1.0561 dl:498-499 gd:1 +ttp: b343/782 bl:2.2167 bb:1.0432 rl:2.3094 rb:1.0560 dl:486-488 gd:1 +ttp: b335/782 bl:2.3538 bb:1.0663 rl:2.3097 rb:1.0561 dl:474-476 gd:1 +ttp: b326/782 bl:2.3045 bb:1.0553 rl:2.3096 rb:1.0561 dl:461-462 gd:1 +ttp: b317/782 bl:2.3028 bb:1.0463 rl:2.3096 rb:1.0560 dl:446-448 gd:1 +ttp: b309/782 bl:2.4085 bb:1.1052 rl:2.3101 rb:1.0563 dl:435-437 gd:1 +ttp: b301/782 bl:2.3392 bb:1.0859 rl:2.3103 rb:1.0564 dl:422-424 gd:1 +ttp: b293/782 bl:2.4197 bb:1.0910 rl:2.3108 rb:1.0566 dl:410-412 gd:1 +ttp: b285/782 bl:2.3700 bb:1.0797 rl:2.3111 rb:1.0567 dl:399-400 gd:1 +ttp: b277/782 bl:2.2571 bb:1.0629 rl:2.3109 rb:1.0567 dl:388-389 gd:1 +ttp: b270/782 bl:2.3085 bb:1.0563 rl:2.3108 rb:1.0567 dl:379-380 gd:1 +ttp: b264/782 bl:2.4211 bb:1.1033 
rl:2.3113 rb:1.0569 dl:371-372 gd:1 +ttp: b259/782 bl:2.3380 bb:1.0964 rl:2.3115 rb:1.0571 dl:365-366 gd:1 +ttp: b253/782 bl:2.3256 bb:1.1047 rl:2.3115 rb:1.0573 dl:357-358 gd:1 +ttp: b246/782 bl:2.3451 bb:1.0961 rl:2.3116 rb:1.0574 dl:349-350 gd:1 +ttp: b238/782 bl:2.3213 bb:1.1071 rl:2.3117 rb:1.0576 dl:338-340 gd:1 +ttp: b229/782 bl:2.3630 bb:1.0650 rl:2.3119 rb:1.0577 dl:328-329 gd:1 +ttp: b221/782 bl:2.4040 bb:1.1202 rl:2.3122 rb:1.0579 dl:318-320 gd:1 +ttp: b213/782 bl:2.2565 bb:1.0720 rl:2.3120 rb:1.0579 dl:309-310 gd:1 +ttp: b204/782 bl:2.4588 bb:1.1537 rl:2.3125 rb:1.0583 dl:300-301 gd:1 +ttp: b194/782 bl:2.4341 bb:1.1151 rl:2.3129 rb:1.0585 dl:289-290 gd:1 +ttp: b184/782 bl:2.3839 bb:1.1238 rl:2.3132 rb:1.0587 dl:278-279 gd:1 +ttp: b176/782 bl:2.3144 bb:1.1241 rl:2.3132 rb:1.0588 dl:270-271 gd:1 +ttp: b167/782 bl:2.5192 bb:1.1239 rl:2.3138 rb:1.0590 dl:262-263 gd:1 +ttp: b159/782 bl:2.4681 bb:1.1451 rl:2.3142 rb:1.0593 dl:254-255 gd:1 +ttp: b152/782 bl:2.3728 bb:1.1364 rl:2.3144 rb:1.0595 dl:247-248 gd:1 +ttp: b144/782 bl:2.3451 bb:1.1023 rl:2.3145 rb:1.0596 dl:239-240 gd:1 +ttp: b137/782 bl:2.4122 bb:1.1524 rl:2.3147 rb:1.0598 dl:233-233 gd:1 +ttp: b128/782 bl:2.3765 bb:1.1486 rl:2.3149 rb:1.0601 dl:224-225 gd:1 +ttp: b120/782 bl:2.3873 bb:1.1093 rl:2.3151 rb:1.0602 dl:217-218 gd:1 +ttp: b113/782 bl:2.5458 bb:1.1319 rl:2.3156 rb:1.0604 dl:210-211 gd:1 +ttp: b107/782 bl:2.4352 bb:1.1662 rl:2.3159 rb:1.0606 dl:205-206 gd:1 +ttp: b100/782 bl:2.4158 bb:1.1557 rl:2.3161 rb:1.0608 dl:199-200 gd:1 +ttp: b92/782 bl:2.4346 bb:1.1584 rl:2.3164 rb:1.0610 dl:191-192 gd:1 +ttp: b86/782 bl:2.4565 bb:1.1333 rl:2.3166 rb:1.0611 dl:186-187 gd:1 +ttp: b79/782 bl:2.3690 bb:1.1325 rl:2.3167 rb:1.0613 dl:180-181 gd:1 +ttp: b70/782 bl:2.5169 bb:1.2265 rl:2.3171 rb:1.0616 dl:172-173 gd:1 +ttp: b63/782 bl:2.5187 bb:1.2014 rl:2.3175 rb:1.0618 dl:166-166 gd:1 +ttp: b54/782 bl:2.4728 bb:1.2130 rl:2.3178 rb:1.0621 dl:157-158 gd:1 +ttp: b46/782 bl:2.5462 bb:1.2157 rl:2.3181 
rb:1.0623 dl:149-150 gd:1 +ttp: b37/782 bl:2.5731 bb:1.2128 rl:2.3185 rb:1.0625 dl:140-141 gd:1 +ttp: b30/782 bl:2.5759 bb:1.2560 rl:2.3189 rb:1.0628 dl:133-134 gd:1 +ttp: b22/782 bl:2.5484 bb:1.1929 rl:2.3192 rb:1.0630 dl:124-126 gd:1 +ttp: b16/782 bl:2.6183 bb:1.2546 rl:2.3196 rb:1.0632 dl:117-118 gd:1 +ttp: b8/782 bl:2.7678 bb:1.2849 rl:2.3201 rb:1.0634 dl:103-105 gd:1 +quantized_ttt_phased val_loss:2.31717973 val_bpb:1.05886205 eval_time:553279ms +total_eval_time:553.3s diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed1234.log b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed1234.log new file mode 100644 index 0000000000..4b1bc980bc --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed1234.log @@ -0,0 +1,5846 @@ +nohup: ignoring input +==================================================== + v5 PRIMARY noLC fulltilt + precompute outside timer: V21 + #1953 + #1948 + fulltilt-tilt SEED=1234 Thu Apr 30 07:02:04 UTC 2026 + LeakyReLU slope 0.3 (code patch + v5 hint-precompute-outside-timer), EVAL_SEQ_LEN 2048 (no long-ctx for cap), no_qv, fulltilt-tilt +==================================================== +W0430 07:02:05.947000 1130344 torch/distributed/run.py:803] +W0430 07:02:05.947000 1130344 torch/distributed/run.py:803] ***************************************** +W0430 07:02:05.947000 1130344 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0430 07:02:05.947000 1130344 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + agree_add_boost: 0.5 + artifact_dir: + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + awq_lite_bits: 8 + awq_lite_enabled: True + awq_lite_group_size: 64 + awq_lite_group_top_k: 1 + beta1: 0.9 + beta2: 0.99 + caseops_enabled: True + compressor: pergroup + data_dir: /runpod-volume/caseops_data/datasets + datasets_dir: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 14.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + fused_ce_enabled: True + gate_window: 12 + gated_attn_enabled: False + gated_attn_init_std: 0.01 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 0.5 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/2f461a67-fc1a-4567-9c23-d7dc2c178233.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + lqer_asym_enabled: True + lqer_asym_group: 64 + lqer_enabled: True + lqer_factor_bits: 4 + lqer_gain_select: False + lqer_rank: 4 + lqer_scope: all + lqer_top_k: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.1 + mlp_clip_sigmas: 11.5 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + 
ngram_hint_precompute_outside: True + ngram_tilt_enabled: True + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2500 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: 2f461a67-fc1a-4567-9c23-d7dc2c178233 + scalar_lr: 0.02 + seed: 1234 + skip_gates_enabled: True + smear_gate_enabled: True + sparse_attn_gate_enabled: True + sparse_attn_gate_init_std: 0.0 + sparse_attn_gate_scale: 0.5 + temperature_scale: 1.0 + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + token_boost: 2.625 + token_order: 16 + token_threshold: 0.8 + tokenizer_path: /runpod-volume/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.99 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 80 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin + val_doc_fraction: 1.0 + val_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin + val_loss_every: 0 + vocab_size: 8192 + warmdown_frac: 0.85 + warmup_steps: 20 + within_boost: 0.75 + within_tau: 0.45 + word_boost: 0.75 + word_normalize: strip_punct_lower + word_order: 4 + word_tau: 0.65 + world_size: 8 + xsa_last_n: 11 
+train_shards: 1499 +val_tokens: 47851520 +model_params:35945673 +gptq:reserving 0s, effective=599500ms +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +1/20000 train_loss: 9.0017 train_time: 0.0m tok/s: 17596046 +2/20000 train_loss: 12.9266 train_time: 0.0m tok/s: 11224289 +3/20000 train_loss: 10.1998 train_time: 0.0m tok/s: 10130919 +4/20000 train_loss: 8.6931 train_time: 0.0m tok/s: 9647461 +5/20000 train_loss: 7.9435 train_time: 0.0m tok/s: 9394471 +6/20000 train_loss: 7.4819 train_time: 0.0m tok/s: 9223646 +7/20000 train_loss: 7.2064 train_time: 0.0m tok/s: 9095635 +8/20000 train_loss: 6.9632 train_time: 0.0m tok/s: 9005196 +9/20000 train_loss: 6.6449 train_time: 0.0m tok/s: 8947144 +10/20000 train_loss: 6.4811 train_time: 0.0m tok/s: 8874363 +11/20000 train_loss: 6.1261 train_time: 0.0m tok/s: 8758987 +12/20000 train_loss: 5.8024 train_time: 0.0m tok/s: 8696610 +13/20000 train_loss: 5.6493 train_time: 0.0m tok/s: 8659111 +14/20000 train_loss: 5.3563 train_time: 0.0m tok/s: 8629535 +15/20000 train_loss: 5.2770 train_time: 0.0m tok/s: 8609166 +16/20000 train_loss: 5.3134 train_time: 0.0m tok/s: 8587771 +17/20000 train_loss: 5.1277 train_time: 0.0m tok/s: 8577528 +18/20000 train_loss: 5.0573 train_time: 0.0m tok/s: 8573634 +19/20000 train_loss: 4.9799 train_time: 0.0m tok/s: 8568643 +20/20000 train_loss: 4.8948 train_time: 0.0m tok/s: 8563415 +21/20000 train_loss: 4.8154 train_time: 0.0m tok/s: 8550615 +22/20000 train_loss: 4.8250 train_time: 0.0m tok/s: 8534184 +23/20000 train_loss: 4.7700 train_time: 0.0m tok/s: 8523232 +24/20000 
train_loss: 4.8872 train_time: 0.0m tok/s: 8513073 +25/20000 train_loss: 4.6624 train_time: 0.0m tok/s: 8508927 +26/20000 train_loss: 4.6986 train_time: 0.0m tok/s: 8502891 +27/20000 train_loss: 4.5737 train_time: 0.0m tok/s: 8497773 +28/20000 train_loss: 4.6449 train_time: 0.0m tok/s: 8496644 +29/20000 train_loss: 4.5690 train_time: 0.0m tok/s: 8493660 +30/20000 train_loss: 4.5484 train_time: 0.0m tok/s: 8490716 +31/20000 train_loss: 4.5363 train_time: 0.0m tok/s: 8484776 +32/20000 train_loss: 4.5149 train_time: 0.0m tok/s: 8473103 +33/20000 train_loss: 4.4848 train_time: 0.1m tok/s: 8466051 +34/20000 train_loss: 4.4073 train_time: 0.1m tok/s: 8459139 +35/20000 train_loss: 4.3480 train_time: 0.1m tok/s: 8455679 +36/20000 train_loss: 4.4850 train_time: 0.1m tok/s: 8449538 +37/20000 train_loss: 4.4283 train_time: 0.1m tok/s: 8446850 +38/20000 train_loss: 4.3556 train_time: 0.1m tok/s: 8446063 +39/20000 train_loss: 4.4874 train_time: 0.1m tok/s: 8445613 +40/20000 train_loss: 4.4544 train_time: 0.1m tok/s: 8439972 +41/20000 train_loss: 4.3264 train_time: 0.1m tok/s: 8438834 +42/20000 train_loss: 4.2394 train_time: 0.1m tok/s: 8436344 +43/20000 train_loss: 4.2737 train_time: 0.1m tok/s: 8433392 +44/20000 train_loss: 4.2100 train_time: 0.1m tok/s: 8427377 +45/20000 train_loss: 4.3478 train_time: 0.1m tok/s: 8425611 +46/20000 train_loss: 4.2576 train_time: 0.1m tok/s: 8420942 +47/20000 train_loss: 4.1263 train_time: 0.1m tok/s: 8414430 +48/20000 train_loss: 4.1734 train_time: 0.1m tok/s: 8417757 +49/20000 train_loss: 4.1183 train_time: 0.1m tok/s: 8416079 +50/20000 train_loss: 4.0846 train_time: 0.1m tok/s: 8414980 +51/20000 train_loss: 4.2841 train_time: 0.1m tok/s: 8413899 +52/20000 train_loss: 4.2067 train_time: 0.1m tok/s: 8411307 +53/20000 train_loss: 4.1498 train_time: 0.1m tok/s: 8409478 +54/20000 train_loss: 4.1442 train_time: 0.1m tok/s: 8407681 +55/20000 train_loss: 4.1657 train_time: 0.1m tok/s: 8406581 +56/20000 train_loss: 4.0829 train_time: 0.1m tok/s: 
8403378 +57/20000 train_loss: 4.1253 train_time: 0.1m tok/s: 8401604 +58/20000 train_loss: 4.0528 train_time: 0.1m tok/s: 8399519 +59/20000 train_loss: 4.0168 train_time: 0.1m tok/s: 8395045 +60/20000 train_loss: 3.9349 train_time: 0.1m tok/s: 8398565 +61/20000 train_loss: 3.9410 train_time: 0.1m tok/s: 8398192 +62/20000 train_loss: 4.0526 train_time: 0.1m tok/s: 8397311 +63/20000 train_loss: 4.1311 train_time: 0.1m tok/s: 8396873 +64/20000 train_loss: 3.9277 train_time: 0.1m tok/s: 8396797 +65/20000 train_loss: 4.0415 train_time: 0.1m tok/s: 8395608 +66/20000 train_loss: 3.9907 train_time: 0.1m tok/s: 8393313 +67/20000 train_loss: 3.9136 train_time: 0.1m tok/s: 8391572 +68/20000 train_loss: 3.9503 train_time: 0.1m tok/s: 8390613 +69/20000 train_loss: 3.8631 train_time: 0.1m tok/s: 8388166 +70/20000 train_loss: 3.9591 train_time: 0.1m tok/s: 8387782 +71/20000 train_loss: 3.8844 train_time: 0.1m tok/s: 8384314 +72/20000 train_loss: 4.0635 train_time: 0.1m tok/s: 8386589 +73/20000 train_loss: 3.8777 train_time: 0.1m tok/s: 8386262 +74/20000 train_loss: 3.8834 train_time: 0.1m tok/s: 8385770 +75/20000 train_loss: 3.8775 train_time: 0.1m tok/s: 8384765 +76/20000 train_loss: 3.8369 train_time: 0.1m tok/s: 8384224 +77/20000 train_loss: 3.7920 train_time: 0.1m tok/s: 8383046 +78/20000 train_loss: 3.7242 train_time: 0.1m tok/s: 8382638 +79/20000 train_loss: 3.8491 train_time: 0.1m tok/s: 8381667 +80/20000 train_loss: 3.7660 train_time: 0.1m tok/s: 8379863 +81/20000 train_loss: 3.7007 train_time: 0.1m tok/s: 8378579 +82/20000 train_loss: 3.7293 train_time: 0.1m tok/s: 8377766 +83/20000 train_loss: 3.6029 train_time: 0.1m tok/s: 8377401 +84/20000 train_loss: 3.6624 train_time: 0.1m tok/s: 8376391 +85/20000 train_loss: 3.6209 train_time: 0.1m tok/s: 8375913 +86/20000 train_loss: 3.4084 train_time: 0.1m tok/s: 8375604 +87/20000 train_loss: 3.6481 train_time: 0.1m tok/s: 8375139 +88/20000 train_loss: 3.5255 train_time: 0.1m tok/s: 8374479 +89/20000 train_loss: 3.5654 
train_time: 0.1m tok/s: 8373441 +90/20000 train_loss: 3.5703 train_time: 0.1m tok/s: 8373195 +91/20000 train_loss: 3.6038 train_time: 0.1m tok/s: 8371944 +92/20000 train_loss: 3.6815 train_time: 0.1m tok/s: 8371849 +93/20000 train_loss: 3.5825 train_time: 0.1m tok/s: 8371596 +94/20000 train_loss: 3.6130 train_time: 0.1m tok/s: 8372385 +95/20000 train_loss: 3.5798 train_time: 0.1m tok/s: 8372287 +96/20000 train_loss: 3.5515 train_time: 0.2m tok/s: 8369452 +97/20000 train_loss: 3.4463 train_time: 0.2m tok/s: 8372153 +98/20000 train_loss: 3.5158 train_time: 0.2m tok/s: 8370254 +99/20000 train_loss: 3.4721 train_time: 0.2m tok/s: 8371103 +100/20000 train_loss: 3.3905 train_time: 0.2m tok/s: 8371058 +101/20000 train_loss: 3.4075 train_time: 0.2m tok/s: 8370635 +102/20000 train_loss: 3.4686 train_time: 0.2m tok/s: 8370574 +103/20000 train_loss: 3.3420 train_time: 0.2m tok/s: 8370118 +104/20000 train_loss: 3.4586 train_time: 0.2m tok/s: 8369914 +105/20000 train_loss: 3.3399 train_time: 0.2m tok/s: 8369289 +106/20000 train_loss: 3.4719 train_time: 0.2m tok/s: 8368960 +107/20000 train_loss: 3.2130 train_time: 0.2m tok/s: 8368677 +108/20000 train_loss: 3.3850 train_time: 0.2m tok/s: 8367241 +109/20000 train_loss: 3.3858 train_time: 0.2m tok/s: 8365166 +110/20000 train_loss: 3.4035 train_time: 0.2m tok/s: 8365232 +111/20000 train_loss: 3.4085 train_time: 0.2m tok/s: 8364930 +112/20000 train_loss: 3.4062 train_time: 0.2m tok/s: 8365137 +113/20000 train_loss: 3.3182 train_time: 0.2m tok/s: 8364191 +114/20000 train_loss: 3.3715 train_time: 0.2m tok/s: 8364443 +115/20000 train_loss: 3.4116 train_time: 0.2m tok/s: 8364117 +116/20000 train_loss: 3.2150 train_time: 0.2m tok/s: 8362541 +117/20000 train_loss: 3.4154 train_time: 0.2m tok/s: 8362486 +118/20000 train_loss: 3.3629 train_time: 0.2m tok/s: 8362718 +119/20000 train_loss: 3.3422 train_time: 0.2m tok/s: 8361919 +120/20000 train_loss: 3.3291 train_time: 0.2m tok/s: 8361520 +121/20000 train_loss: 3.2837 train_time: 0.2m tok/s: 
8362024 +122/20000 train_loss: 3.2962 train_time: 0.2m tok/s: 8362482 +123/20000 train_loss: 3.2863 train_time: 0.2m tok/s: 8361681 +124/20000 train_loss: 3.3314 train_time: 0.2m tok/s: 8360533 +125/20000 train_loss: 3.2245 train_time: 0.2m tok/s: 8360308 +126/20000 train_loss: 3.2509 train_time: 0.2m tok/s: 8360895 +127/20000 train_loss: 3.2722 train_time: 0.2m tok/s: 8360007 +128/20000 train_loss: 3.3235 train_time: 0.2m tok/s: 8359441 +129/20000 train_loss: 3.2893 train_time: 0.2m tok/s: 8359019 +130/20000 train_loss: 3.2684 train_time: 0.2m tok/s: 8358602 +131/20000 train_loss: 3.2174 train_time: 0.2m tok/s: 8357503 +132/20000 train_loss: 3.1758 train_time: 0.2m tok/s: 8357724 +133/20000 train_loss: 3.2191 train_time: 0.2m tok/s: 8357387 +134/20000 train_loss: 3.1312 train_time: 0.2m tok/s: 8357116 +135/20000 train_loss: 2.9590 train_time: 0.2m tok/s: 8354264 +136/20000 train_loss: 3.2276 train_time: 0.2m tok/s: 8352483 +137/20000 train_loss: 3.0676 train_time: 0.2m tok/s: 8351941 +138/20000 train_loss: 3.2728 train_time: 0.2m tok/s: 8351669 +139/20000 train_loss: 3.2339 train_time: 0.2m tok/s: 8351543 +140/20000 train_loss: 3.1732 train_time: 0.2m tok/s: 8351187 +141/20000 train_loss: 3.0811 train_time: 0.2m tok/s: 8350352 +142/20000 train_loss: 3.2845 train_time: 0.2m tok/s: 8350227 +143/20000 train_loss: 3.3495 train_time: 0.2m tok/s: 8349344 +144/20000 train_loss: 3.2825 train_time: 0.2m tok/s: 8349237 +145/20000 train_loss: 3.2420 train_time: 0.2m tok/s: 8349255 +146/20000 train_loss: 3.2620 train_time: 0.2m tok/s: 8349028 +147/20000 train_loss: 3.1630 train_time: 0.2m tok/s: 8348596 +148/20000 train_loss: 3.1951 train_time: 0.2m tok/s: 8348804 +149/20000 train_loss: 3.2533 train_time: 0.2m tok/s: 8348957 +150/20000 train_loss: 3.1954 train_time: 0.2m tok/s: 8349073 +151/20000 train_loss: 3.5461 train_time: 0.2m tok/s: 8348667 +152/20000 train_loss: 3.1605 train_time: 0.2m tok/s: 8347856 +153/20000 train_loss: 3.2939 train_time: 0.2m tok/s: 8347746 
+154/20000 train_loss: 3.1933 train_time: 0.2m tok/s: 8347303 +155/20000 train_loss: 3.1406 train_time: 0.2m tok/s: 8346526 +156/20000 train_loss: 3.0464 train_time: 0.2m tok/s: 8346073 +157/20000 train_loss: 3.0941 train_time: 0.2m tok/s: 8346045 +158/20000 train_loss: 3.1929 train_time: 0.2m tok/s: 8345518 +159/20000 train_loss: 3.0508 train_time: 0.2m tok/s: 8345449 +160/20000 train_loss: 3.1802 train_time: 0.3m tok/s: 8345670 +161/20000 train_loss: 3.1346 train_time: 0.3m tok/s: 8345354 +162/20000 train_loss: 3.0666 train_time: 0.3m tok/s: 8344563 +163/20000 train_loss: 3.1402 train_time: 0.3m tok/s: 8344924 +164/20000 train_loss: 3.0222 train_time: 0.3m tok/s: 8344105 +165/20000 train_loss: 3.2044 train_time: 0.3m tok/s: 8343781 +166/20000 train_loss: 3.1368 train_time: 0.3m tok/s: 8343262 +167/20000 train_loss: 3.1235 train_time: 0.3m tok/s: 8343111 +168/20000 train_loss: 3.1762 train_time: 0.3m tok/s: 8343011 +169/20000 train_loss: 3.0927 train_time: 0.3m tok/s: 8343248 +170/20000 train_loss: 2.8103 train_time: 0.3m tok/s: 8342353 +171/20000 train_loss: 3.1271 train_time: 0.3m tok/s: 8341850 +172/20000 train_loss: 3.0887 train_time: 0.3m tok/s: 8341999 +173/20000 train_loss: 3.2275 train_time: 0.3m tok/s: 8342225 +174/20000 train_loss: 3.1115 train_time: 0.3m tok/s: 8341862 +175/20000 train_loss: 3.1433 train_time: 0.3m tok/s: 8341547 +176/20000 train_loss: 3.1562 train_time: 0.3m tok/s: 8341559 +177/20000 train_loss: 3.1248 train_time: 0.3m tok/s: 8340986 +178/20000 train_loss: 2.9536 train_time: 0.3m tok/s: 8340531 +179/20000 train_loss: 3.3035 train_time: 0.3m tok/s: 8340452 +180/20000 train_loss: 2.9681 train_time: 0.3m tok/s: 8340130 +181/20000 train_loss: 2.9522 train_time: 0.3m tok/s: 8339379 +182/20000 train_loss: 3.0496 train_time: 0.3m tok/s: 8339044 +183/20000 train_loss: 2.9867 train_time: 0.3m tok/s: 8338760 +184/20000 train_loss: 2.9953 train_time: 0.3m tok/s: 8338580 +185/20000 train_loss: 2.7162 train_time: 0.3m tok/s: 8337177 +186/20000 
train_loss: 3.1078 train_time: 0.3m tok/s: 8336413 +187/20000 train_loss: 3.0456 train_time: 0.3m tok/s: 8336414 +188/20000 train_loss: 3.1972 train_time: 0.3m tok/s: 8336285 +189/20000 train_loss: 3.5217 train_time: 0.3m tok/s: 8335894 +190/20000 train_loss: 3.0774 train_time: 0.3m tok/s: 8335445 +191/20000 train_loss: 3.0454 train_time: 0.3m tok/s: 8335446 +192/20000 train_loss: 3.0061 train_time: 0.3m tok/s: 8335553 +193/20000 train_loss: 3.0003 train_time: 0.3m tok/s: 8335771 +194/20000 train_loss: 3.0042 train_time: 0.3m tok/s: 8335576 +195/20000 train_loss: 2.8922 train_time: 0.3m tok/s: 8335374 +196/20000 train_loss: 3.1360 train_time: 0.3m tok/s: 8334911 +197/20000 train_loss: 3.0508 train_time: 0.3m tok/s: 8334869 +198/20000 train_loss: 3.0555 train_time: 0.3m tok/s: 8334961 +199/20000 train_loss: 3.0500 train_time: 0.3m tok/s: 8334937 +200/20000 train_loss: 3.0642 train_time: 0.3m tok/s: 8334778 +201/20000 train_loss: 3.1121 train_time: 0.3m tok/s: 8334035 +202/20000 train_loss: 3.3261 train_time: 0.3m tok/s: 8333679 +203/20000 train_loss: 3.0669 train_time: 0.3m tok/s: 8333420 +204/20000 train_loss: 3.0749 train_time: 0.3m tok/s: 8333424 +205/20000 train_loss: 3.0578 train_time: 0.3m tok/s: 8333438 +206/20000 train_loss: 2.9497 train_time: 0.3m tok/s: 8333110 +207/20000 train_loss: 3.0942 train_time: 0.3m tok/s: 8333149 +208/20000 train_loss: 2.9345 train_time: 0.3m tok/s: 8333048 +209/20000 train_loss: 3.0045 train_time: 0.3m tok/s: 8332535 +210/20000 train_loss: 3.0762 train_time: 0.3m tok/s: 8332171 +211/20000 train_loss: 3.2555 train_time: 0.3m tok/s: 8331559 +212/20000 train_loss: 3.0186 train_time: 0.3m tok/s: 8331508 +213/20000 train_loss: 2.9340 train_time: 0.3m tok/s: 8330991 +214/20000 train_loss: 3.0872 train_time: 0.3m tok/s: 8330868 +215/20000 train_loss: 3.0284 train_time: 0.3m tok/s: 8330971 +216/20000 train_loss: 3.0855 train_time: 0.3m tok/s: 8330777 +217/20000 train_loss: 3.0160 train_time: 0.3m tok/s: 8330610 +218/20000 train_loss: 
3.0223 train_time: 0.3m tok/s: 8330716 +219/20000 train_loss: 3.1144 train_time: 0.3m tok/s: 8330851 +220/20000 train_loss: 3.3268 train_time: 0.3m tok/s: 8329808 +221/20000 train_loss: 2.9284 train_time: 0.3m tok/s: 8328757 +222/20000 train_loss: 2.9734 train_time: 0.3m tok/s: 8329106 +223/20000 train_loss: 2.9950 train_time: 0.4m tok/s: 8328883 +224/20000 train_loss: 2.9844 train_time: 0.4m tok/s: 8328885 +225/20000 train_loss: 3.0741 train_time: 0.4m tok/s: 8328184 +226/20000 train_loss: 3.0370 train_time: 0.4m tok/s: 8328434 +227/20000 train_loss: 3.0686 train_time: 0.4m tok/s: 8328257 +228/20000 train_loss: 3.0741 train_time: 0.4m tok/s: 8328334 +229/20000 train_loss: 3.0798 train_time: 0.4m tok/s: 8328553 +230/20000 train_loss: 2.9523 train_time: 0.4m tok/s: 8328663 +231/20000 train_loss: 3.0982 train_time: 0.4m tok/s: 8328445 +232/20000 train_loss: 2.9818 train_time: 0.4m tok/s: 8327806 +233/20000 train_loss: 3.0127 train_time: 0.4m tok/s: 8327473 +234/20000 train_loss: 3.0136 train_time: 0.4m tok/s: 8327455 +235/20000 train_loss: 2.9307 train_time: 0.4m tok/s: 8327614 +236/20000 train_loss: 3.0060 train_time: 0.4m tok/s: 8327565 +237/20000 train_loss: 2.8922 train_time: 0.4m tok/s: 8327524 +238/20000 train_loss: 3.0829 train_time: 0.4m tok/s: 8327304 +239/20000 train_loss: 3.0021 train_time: 0.4m tok/s: 8326925 +240/20000 train_loss: 3.1496 train_time: 0.4m tok/s: 8326564 +241/20000 train_loss: 3.0063 train_time: 0.4m tok/s: 8326632 +242/20000 train_loss: 3.0886 train_time: 0.4m tok/s: 8326500 +243/20000 train_loss: 2.9992 train_time: 0.4m tok/s: 8326661 +244/20000 train_loss: 3.0420 train_time: 0.4m tok/s: 8326827 +245/20000 train_loss: 2.9824 train_time: 0.4m tok/s: 8326982 +246/20000 train_loss: 3.0376 train_time: 0.4m tok/s: 8327069 +247/20000 train_loss: 2.9721 train_time: 0.4m tok/s: 8326719 +248/20000 train_loss: 2.8882 train_time: 0.4m tok/s: 8326402 +249/20000 train_loss: 2.9745 train_time: 0.4m tok/s: 8326501 +250/20000 train_loss: 2.9782 
train_time: 0.4m tok/s: 8326792 +251/20000 train_loss: 2.9310 train_time: 0.4m tok/s: 8326498 +252/20000 train_loss: 2.9322 train_time: 0.4m tok/s: 8326463 +253/20000 train_loss: 3.0230 train_time: 0.4m tok/s: 8326498 +254/20000 train_loss: 3.0826 train_time: 0.4m tok/s: 8326472 +255/20000 train_loss: 3.1003 train_time: 0.4m tok/s: 8326163 +256/20000 train_loss: 2.9595 train_time: 0.4m tok/s: 8326222 +257/20000 train_loss: 2.9649 train_time: 0.4m tok/s: 8326386 +258/20000 train_loss: 3.0141 train_time: 0.4m tok/s: 8325905 +259/20000 train_loss: 2.9421 train_time: 0.4m tok/s: 8325522 +260/20000 train_loss: 3.1486 train_time: 0.4m tok/s: 8325500 +261/20000 train_loss: 2.9329 train_time: 0.4m tok/s: 8325123 +262/20000 train_loss: 2.7793 train_time: 0.4m tok/s: 8325276 +263/20000 train_loss: 2.7964 train_time: 0.4m tok/s: 8325201 +264/20000 train_loss: 2.9692 train_time: 0.4m tok/s: 8325313 +265/20000 train_loss: 2.9903 train_time: 0.4m tok/s: 8325249 +266/20000 train_loss: 2.9165 train_time: 0.4m tok/s: 8324718 +267/20000 train_loss: 2.9364 train_time: 0.4m tok/s: 8324673 +268/20000 train_loss: 3.0140 train_time: 0.4m tok/s: 8324557 +269/20000 train_loss: 3.0015 train_time: 0.4m tok/s: 8324801 +270/20000 train_loss: 2.9985 train_time: 0.4m tok/s: 8324723 +271/20000 train_loss: 3.0015 train_time: 0.4m tok/s: 8324482 +272/20000 train_loss: 3.0707 train_time: 0.4m tok/s: 8324532 +273/20000 train_loss: 2.9259 train_time: 0.4m tok/s: 8324708 +274/20000 train_loss: 3.0270 train_time: 0.4m tok/s: 8324589 +275/20000 train_loss: 2.9566 train_time: 0.4m tok/s: 8324817 +276/20000 train_loss: 2.8796 train_time: 0.4m tok/s: 8325008 +277/20000 train_loss: 2.8659 train_time: 0.4m tok/s: 8324878 +278/20000 train_loss: 2.8370 train_time: 0.4m tok/s: 8324470 +279/20000 train_loss: 2.9716 train_time: 0.4m tok/s: 8324032 +280/20000 train_loss: 3.0067 train_time: 0.4m tok/s: 8323822 +281/20000 train_loss: 2.7586 train_time: 0.4m tok/s: 8323721 +282/20000 train_loss: 3.0673 train_time: 
0.4m tok/s: 8323736 +283/20000 train_loss: 2.8712 train_time: 0.4m tok/s: 8323566 +284/20000 train_loss: 2.9119 train_time: 0.4m tok/s: 8323398 +285/20000 train_loss: 2.9630 train_time: 0.4m tok/s: 8323230 +286/20000 train_loss: 2.9876 train_time: 0.5m tok/s: 8323063 +287/20000 train_loss: 2.8311 train_time: 0.5m tok/s: 8322641 +288/20000 train_loss: 2.9711 train_time: 0.5m tok/s: 8322535 +289/20000 train_loss: 2.8782 train_time: 0.5m tok/s: 8322489 +290/20000 train_loss: 2.9014 train_time: 0.5m tok/s: 8322464 +291/20000 train_loss: 2.8781 train_time: 0.5m tok/s: 8322437 +292/20000 train_loss: 2.7074 train_time: 0.5m tok/s: 8322171 +293/20000 train_loss: 2.9368 train_time: 0.5m tok/s: 8322084 +294/20000 train_loss: 3.0567 train_time: 0.5m tok/s: 8321958 +295/20000 train_loss: 2.9989 train_time: 0.5m tok/s: 8321845 +296/20000 train_loss: 3.0624 train_time: 0.5m tok/s: 8321676 +297/20000 train_loss: 2.9461 train_time: 0.5m tok/s: 8321149 +298/20000 train_loss: 2.9828 train_time: 0.5m tok/s: 8321222 +299/20000 train_loss: 2.8115 train_time: 0.5m tok/s: 8321209 +300/20000 train_loss: 3.0230 train_time: 0.5m tok/s: 8321236 +301/20000 train_loss: 2.9652 train_time: 0.5m tok/s: 8321152 +302/20000 train_loss: 2.8643 train_time: 0.5m tok/s: 8320985 +303/20000 train_loss: 2.9242 train_time: 0.5m tok/s: 8321043 +304/20000 train_loss: 2.9326 train_time: 0.5m tok/s: 8321014 +305/20000 train_loss: 2.9342 train_time: 0.5m tok/s: 8320778 +306/20000 train_loss: 3.0082 train_time: 0.5m tok/s: 8320483 +307/20000 train_loss: 2.9163 train_time: 0.5m tok/s: 8320434 +308/20000 train_loss: 2.8961 train_time: 0.5m tok/s: 8320428 +309/20000 train_loss: 3.0394 train_time: 0.5m tok/s: 8320220 +310/20000 train_loss: 2.8561 train_time: 0.5m tok/s: 8320351 +311/20000 train_loss: 2.9256 train_time: 0.5m tok/s: 8320512 +312/20000 train_loss: 2.8312 train_time: 0.5m tok/s: 8320564 +313/20000 train_loss: 2.8318 train_time: 0.5m tok/s: 8320334 +314/20000 train_loss: 2.8708 train_time: 0.5m tok/s: 
8320264 +315/20000 train_loss: 2.9573 train_time: 0.5m tok/s: 8320051 +316/20000 train_loss: 2.6925 train_time: 0.5m tok/s: 8319713 +317/20000 train_loss: 2.8096 train_time: 0.5m tok/s: 8319433 +318/20000 train_loss: 2.9174 train_time: 0.5m tok/s: 8319067 +319/20000 train_loss: 2.9107 train_time: 0.5m tok/s: 8318821 +320/20000 train_loss: 3.0278 train_time: 0.5m tok/s: 8318774 +321/20000 train_loss: 2.9987 train_time: 0.5m tok/s: 8318496 +322/20000 train_loss: 2.9631 train_time: 0.5m tok/s: 8318708 +323/20000 train_loss: 3.0037 train_time: 0.5m tok/s: 8318749 +324/20000 train_loss: 2.9093 train_time: 0.5m tok/s: 8318777 +325/20000 train_loss: 2.8894 train_time: 0.5m tok/s: 8318735 +326/20000 train_loss: 2.8964 train_time: 0.5m tok/s: 8318777 +327/20000 train_loss: 2.8364 train_time: 0.5m tok/s: 8318543 +328/20000 train_loss: 2.8598 train_time: 0.5m tok/s: 8318468 +329/20000 train_loss: 2.8144 train_time: 0.5m tok/s: 8318226 +330/20000 train_loss: 2.7720 train_time: 0.5m tok/s: 8318449 +331/20000 train_loss: 2.8929 train_time: 0.5m tok/s: 8317809 +332/20000 train_loss: 2.9668 train_time: 0.5m tok/s: 8317725 +333/20000 train_loss: 2.8694 train_time: 0.5m tok/s: 8317810 +334/20000 train_loss: 3.0914 train_time: 0.5m tok/s: 8317731 +335/20000 train_loss: 2.8409 train_time: 0.5m tok/s: 8317627 +336/20000 train_loss: 2.9330 train_time: 0.5m tok/s: 8317527 +337/20000 train_loss: 2.8287 train_time: 0.5m tok/s: 8317603 +338/20000 train_loss: 2.8935 train_time: 0.5m tok/s: 8317665 +339/20000 train_loss: 2.9375 train_time: 0.5m tok/s: 8317654 +340/20000 train_loss: 2.9653 train_time: 0.5m tok/s: 8317541 +341/20000 train_loss: 2.9154 train_time: 0.5m tok/s: 8317540 +342/20000 train_loss: 2.8096 train_time: 0.5m tok/s: 8317545 +343/20000 train_loss: 2.9121 train_time: 0.5m tok/s: 8317530 +344/20000 train_loss: 2.8223 train_time: 0.5m tok/s: 8317208 +345/20000 train_loss: 2.8568 train_time: 0.5m tok/s: 8317137 +346/20000 train_loss: 2.8734 train_time: 0.5m tok/s: 8317125 
+347/20000 train_loss: 2.8952 train_time: 0.5m tok/s: 8317109 +348/20000 train_loss: 2.8599 train_time: 0.5m tok/s: 8317004 +349/20000 train_loss: 2.9313 train_time: 0.6m tok/s: 8316918 +350/20000 train_loss: 2.7742 train_time: 0.6m tok/s: 8316821 +351/20000 train_loss: 2.7963 train_time: 0.6m tok/s: 8316859 +352/20000 train_loss: 2.7699 train_time: 0.6m tok/s: 8316605 +353/20000 train_loss: 2.6122 train_time: 0.6m tok/s: 8316291 +354/20000 train_loss: 2.9951 train_time: 0.6m tok/s: 8316017 +355/20000 train_loss: 2.9251 train_time: 0.6m tok/s: 8315840 +356/20000 train_loss: 2.8299 train_time: 0.6m tok/s: 8315357 +357/20000 train_loss: 2.7801 train_time: 0.6m tok/s: 8315162 +358/20000 train_loss: 2.7884 train_time: 0.6m tok/s: 8315157 +359/20000 train_loss: 2.8868 train_time: 0.6m tok/s: 8315232 +360/20000 train_loss: 2.8882 train_time: 0.6m tok/s: 8314794 +361/20000 train_loss: 2.9605 train_time: 0.6m tok/s: 8314660 +362/20000 train_loss: 2.8657 train_time: 0.6m tok/s: 8314479 +363/20000 train_loss: 2.9489 train_time: 0.6m tok/s: 8314333 +364/20000 train_loss: 2.8112 train_time: 0.6m tok/s: 8314163 +365/20000 train_loss: 2.8040 train_time: 0.6m tok/s: 8314218 +366/20000 train_loss: 2.8063 train_time: 0.6m tok/s: 8314094 +367/20000 train_loss: 2.9229 train_time: 0.6m tok/s: 8314038 +368/20000 train_loss: 2.7374 train_time: 0.6m tok/s: 8313859 +369/20000 train_loss: 2.8905 train_time: 0.6m tok/s: 8313919 +370/20000 train_loss: 2.8647 train_time: 0.6m tok/s: 8313948 +371/20000 train_loss: 2.8755 train_time: 0.6m tok/s: 8313953 +372/20000 train_loss: 2.8334 train_time: 0.6m tok/s: 8313824 +373/20000 train_loss: 2.7208 train_time: 0.6m tok/s: 8313886 +374/20000 train_loss: 2.7236 train_time: 0.6m tok/s: 8313570 +375/20000 train_loss: 2.6771 train_time: 0.6m tok/s: 8313608 +376/20000 train_loss: 2.9089 train_time: 0.6m tok/s: 8313228 +377/20000 train_loss: 2.7247 train_time: 0.6m tok/s: 8313199 +378/20000 train_loss: 2.8152 train_time: 0.6m tok/s: 8313055 +379/20000 
train_loss: 2.8727 train_time: 0.6m tok/s: 8313092 +380/20000 train_loss: 2.8813 train_time: 0.6m tok/s: 8313202 +381/20000 train_loss: 2.8994 train_time: 0.6m tok/s: 8312839 +382/20000 train_loss: 2.9558 train_time: 0.6m tok/s: 8312738 +383/20000 train_loss: 2.9420 train_time: 0.6m tok/s: 8312825 +384/20000 train_loss: 2.8161 train_time: 0.6m tok/s: 8312469 +385/20000 train_loss: 2.8350 train_time: 0.6m tok/s: 8312334 +386/20000 train_loss: 2.8727 train_time: 0.6m tok/s: 8312183 +387/20000 train_loss: 3.0555 train_time: 0.6m tok/s: 8311931 +388/20000 train_loss: 2.8756 train_time: 0.6m tok/s: 8311833 +389/20000 train_loss: 2.9076 train_time: 0.6m tok/s: 8311548 +390/20000 train_loss: 2.7545 train_time: 0.6m tok/s: 8311604 +391/20000 train_loss: 2.7082 train_time: 0.6m tok/s: 8311554 +392/20000 train_loss: 2.7721 train_time: 0.6m tok/s: 8311746 +393/20000 train_loss: 2.8390 train_time: 0.6m tok/s: 8311911 +394/20000 train_loss: 2.8307 train_time: 0.6m tok/s: 8311860 +395/20000 train_loss: 2.9168 train_time: 0.6m tok/s: 8311794 +396/20000 train_loss: 2.8213 train_time: 0.6m tok/s: 8311564 +397/20000 train_loss: 2.8217 train_time: 0.6m tok/s: 8311345 +398/20000 train_loss: 2.8653 train_time: 0.6m tok/s: 8311151 +399/20000 train_loss: 2.7668 train_time: 0.6m tok/s: 8311020 +400/20000 train_loss: 2.8638 train_time: 0.6m tok/s: 8311170 +401/20000 train_loss: 2.8552 train_time: 0.6m tok/s: 8311136 +402/20000 train_loss: 2.7240 train_time: 0.6m tok/s: 8311294 +403/20000 train_loss: 2.9468 train_time: 0.6m tok/s: 8311402 +404/20000 train_loss: 2.9262 train_time: 0.6m tok/s: 8310989 +405/20000 train_loss: 2.9195 train_time: 0.6m tok/s: 8310821 +406/20000 train_loss: 2.8040 train_time: 0.6m tok/s: 8310457 +407/20000 train_loss: 2.8310 train_time: 0.6m tok/s: 8310252 +408/20000 train_loss: 2.8316 train_time: 0.6m tok/s: 8310078 +409/20000 train_loss: 2.7936 train_time: 0.6m tok/s: 8309962 +410/20000 train_loss: 2.8690 train_time: 0.6m tok/s: 8309941 +411/20000 train_loss: 
2.8105 train_time: 0.6m tok/s: 8309754 +412/20000 train_loss: 2.8156 train_time: 0.6m tok/s: 8309762 +413/20000 train_loss: 2.7043 train_time: 0.7m tok/s: 8309782 +414/20000 train_loss: 2.7207 train_time: 0.7m tok/s: 8309702 +415/20000 train_loss: 2.6995 train_time: 0.7m tok/s: 8309665 +416/20000 train_loss: 2.7666 train_time: 0.7m tok/s: 8309452 +417/20000 train_loss: 2.7715 train_time: 0.7m tok/s: 8309250 +418/20000 train_loss: 2.7918 train_time: 0.7m tok/s: 8309309 +419/20000 train_loss: 2.8103 train_time: 0.7m tok/s: 8309269 +420/20000 train_loss: 2.7951 train_time: 0.7m tok/s: 8309271 +421/20000 train_loss: 2.8589 train_time: 0.7m tok/s: 8309143 +422/20000 train_loss: 2.8358 train_time: 0.7m tok/s: 8309169 +423/20000 train_loss: 2.8318 train_time: 0.7m tok/s: 8309097 +424/20000 train_loss: 2.9031 train_time: 0.7m tok/s: 8309029 +425/20000 train_loss: 2.7979 train_time: 0.7m tok/s: 8308819 +426/20000 train_loss: 2.8220 train_time: 0.7m tok/s: 8308474 +427/20000 train_loss: 2.8287 train_time: 0.7m tok/s: 8308378 +428/20000 train_loss: 2.7904 train_time: 0.7m tok/s: 8308331 +429/20000 train_loss: 2.7282 train_time: 0.7m tok/s: 8308251 +430/20000 train_loss: 2.8610 train_time: 0.7m tok/s: 8308174 +431/20000 train_loss: 2.6740 train_time: 0.7m tok/s: 8308041 +432/20000 train_loss: 2.7314 train_time: 0.7m tok/s: 8308172 +433/20000 train_loss: 2.6685 train_time: 0.7m tok/s: 8307916 +434/20000 train_loss: 2.6623 train_time: 0.7m tok/s: 8307597 +435/20000 train_loss: 2.8675 train_time: 0.7m tok/s: 8307291 +436/20000 train_loss: 2.4910 train_time: 0.7m tok/s: 8307292 +437/20000 train_loss: 2.7312 train_time: 0.7m tok/s: 8307219 +438/20000 train_loss: 2.8559 train_time: 0.7m tok/s: 8307147 +439/20000 train_loss: 2.7570 train_time: 0.7m tok/s: 8306825 +440/20000 train_loss: 2.6748 train_time: 0.7m tok/s: 8306770 +441/20000 train_loss: 2.9077 train_time: 0.7m tok/s: 8306730 +442/20000 train_loss: 2.9646 train_time: 0.7m tok/s: 8306627 +443/20000 train_loss: 2.9145 
train_time: 0.7m tok/s: 8306705 +444/20000 train_loss: 2.9278 train_time: 0.7m tok/s: 8306657 +445/20000 train_loss: 2.8883 train_time: 0.7m tok/s: 8306557 +446/20000 train_loss: 2.7619 train_time: 0.7m tok/s: 8306530 +447/20000 train_loss: 2.7924 train_time: 0.7m tok/s: 8306645 +448/20000 train_loss: 2.8005 train_time: 0.7m tok/s: 8306677 +449/20000 train_loss: 2.7825 train_time: 0.7m tok/s: 8306475 +450/20000 train_loss: 2.8201 train_time: 0.7m tok/s: 8306116 +451/20000 train_loss: 2.5354 train_time: 0.7m tok/s: 8305893 +452/20000 train_loss: 2.7555 train_time: 0.7m tok/s: 8305765 +453/20000 train_loss: 2.6912 train_time: 0.7m tok/s: 8305563 +454/20000 train_loss: 2.6986 train_time: 0.7m tok/s: 8305264 +455/20000 train_loss: 2.7549 train_time: 0.7m tok/s: 8305138 +456/20000 train_loss: 2.7734 train_time: 0.7m tok/s: 8305286 +457/20000 train_loss: 2.6893 train_time: 0.7m tok/s: 8305226 +458/20000 train_loss: 2.7633 train_time: 0.7m tok/s: 8305079 +459/20000 train_loss: 2.8815 train_time: 0.7m tok/s: 8304971 +460/20000 train_loss: 2.8032 train_time: 0.7m tok/s: 8304924 +461/20000 train_loss: 2.8679 train_time: 0.7m tok/s: 8305029 +462/20000 train_loss: 2.9139 train_time: 0.7m tok/s: 8304856 +463/20000 train_loss: 2.8078 train_time: 0.7m tok/s: 8304931 +464/20000 train_loss: 2.7687 train_time: 0.7m tok/s: 8304880 +465/20000 train_loss: 2.9466 train_time: 0.7m tok/s: 8304715 +466/20000 train_loss: 2.8513 train_time: 0.7m tok/s: 8304536 +467/20000 train_loss: 2.8444 train_time: 0.7m tok/s: 8304429 +468/20000 train_loss: 2.9834 train_time: 0.7m tok/s: 8304240 +469/20000 train_loss: 2.7333 train_time: 0.7m tok/s: 8303983 +470/20000 train_loss: 2.7583 train_time: 0.7m tok/s: 8303947 +471/20000 train_loss: 2.8740 train_time: 0.7m tok/s: 8303671 +472/20000 train_loss: 2.9741 train_time: 0.7m tok/s: 8303094 +473/20000 train_loss: 2.7041 train_time: 0.7m tok/s: 8302687 +474/20000 train_loss: 2.6860 train_time: 0.7m tok/s: 8302659 +475/20000 train_loss: 2.8513 train_time: 
0.7m tok/s: 8302675 +476/20000 train_loss: 2.6142 train_time: 0.8m tok/s: 8302452 +477/20000 train_loss: 2.7164 train_time: 0.8m tok/s: 8302446 +478/20000 train_loss: 2.8179 train_time: 0.8m tok/s: 8302473 +479/20000 train_loss: 2.7911 train_time: 0.8m tok/s: 8302345 +480/20000 train_loss: 3.0491 train_time: 0.8m tok/s: 8302078 +481/20000 train_loss: 2.8396 train_time: 0.8m tok/s: 8302002 +482/20000 train_loss: 2.7801 train_time: 0.8m tok/s: 8302097 +483/20000 train_loss: 2.8158 train_time: 0.8m tok/s: 8302202 +484/20000 train_loss: 2.8773 train_time: 0.8m tok/s: 8301982 +485/20000 train_loss: 2.7539 train_time: 0.8m tok/s: 8301993 +486/20000 train_loss: 2.7601 train_time: 0.8m tok/s: 8301936 +487/20000 train_loss: 2.8165 train_time: 0.8m tok/s: 8301950 +488/20000 train_loss: 2.7570 train_time: 0.8m tok/s: 8301682 +489/20000 train_loss: 2.3531 train_time: 0.8m tok/s: 8301364 +490/20000 train_loss: 2.8545 train_time: 0.8m tok/s: 8301206 +491/20000 train_loss: 2.7772 train_time: 0.8m tok/s: 8301156 +492/20000 train_loss: 2.7767 train_time: 0.8m tok/s: 8301377 +493/20000 train_loss: 2.6690 train_time: 0.8m tok/s: 8301288 +494/20000 train_loss: 2.6765 train_time: 0.8m tok/s: 8301229 +495/20000 train_loss: 2.7902 train_time: 0.8m tok/s: 8301253 +496/20000 train_loss: 2.6864 train_time: 0.8m tok/s: 8301108 +497/20000 train_loss: 2.9200 train_time: 0.8m tok/s: 8301177 +498/20000 train_loss: 2.8311 train_time: 0.8m tok/s: 8301039 +499/20000 train_loss: 2.9289 train_time: 0.8m tok/s: 8301062 +500/20000 train_loss: 2.7361 train_time: 0.8m tok/s: 8301021 +501/20000 train_loss: 2.9087 train_time: 0.8m tok/s: 8300953 +502/20000 train_loss: 2.7061 train_time: 0.8m tok/s: 8300829 +503/20000 train_loss: 2.7867 train_time: 0.8m tok/s: 8300655 +504/20000 train_loss: 2.6798 train_time: 0.8m tok/s: 8300680 +505/20000 train_loss: 2.8782 train_time: 0.8m tok/s: 8300401 +506/20000 train_loss: 2.7831 train_time: 0.8m tok/s: 8300268 +507/20000 train_loss: 2.7671 train_time: 0.8m tok/s: 
8300339 +508/20000 train_loss: 2.9143 train_time: 0.8m tok/s: 8300428 +509/20000 train_loss: 2.9075 train_time: 0.8m tok/s: 8300241 +510/20000 train_loss: 2.6962 train_time: 0.8m tok/s: 8300253 +511/20000 train_loss: 2.8722 train_time: 0.8m tok/s: 8300268 +512/20000 train_loss: 2.8480 train_time: 0.8m tok/s: 8300137 +513/20000 train_loss: 2.8895 train_time: 0.8m tok/s: 8299994 +514/20000 train_loss: 2.8448 train_time: 0.8m tok/s: 8300167 +515/20000 train_loss: 2.8506 train_time: 0.8m tok/s: 8299887 +516/20000 train_loss: 2.7214 train_time: 0.8m tok/s: 8300012 +517/20000 train_loss: 2.8072 train_time: 0.8m tok/s: 8300055 +518/20000 train_loss: 2.9325 train_time: 0.8m tok/s: 8299971 +519/20000 train_loss: 2.7350 train_time: 0.8m tok/s: 8299898 +520/20000 train_loss: 2.6565 train_time: 0.8m tok/s: 8299899 +521/20000 train_loss: 2.7629 train_time: 0.8m tok/s: 8299904 +522/20000 train_loss: 2.7362 train_time: 0.8m tok/s: 8299922 +523/20000 train_loss: 2.7212 train_time: 0.8m tok/s: 8299974 +524/20000 train_loss: 2.7928 train_time: 0.8m tok/s: 8299800 +525/20000 train_loss: 2.7186 train_time: 0.8m tok/s: 8299475 +526/20000 train_loss: 2.8253 train_time: 0.8m tok/s: 8299443 +527/20000 train_loss: 2.8856 train_time: 0.8m tok/s: 8299271 +528/20000 train_loss: 2.8647 train_time: 0.8m tok/s: 8299417 +529/20000 train_loss: 2.8745 train_time: 0.8m tok/s: 8299137 +530/20000 train_loss: 2.8904 train_time: 0.8m tok/s: 8298986 +531/20000 train_loss: 3.1778 train_time: 0.8m tok/s: 8298733 +532/20000 train_loss: 3.1044 train_time: 0.8m tok/s: 8298457 +533/20000 train_loss: 2.6720 train_time: 0.8m tok/s: 8298385 +534/20000 train_loss: 2.8651 train_time: 0.8m tok/s: 8298203 +535/20000 train_loss: 2.7872 train_time: 0.8m tok/s: 8298099 +536/20000 train_loss: 2.6746 train_time: 0.8m tok/s: 8297948 +537/20000 train_loss: 2.8962 train_time: 0.8m tok/s: 8297718 +538/20000 train_loss: 2.7283 train_time: 0.8m tok/s: 8297720 +539/20000 train_loss: 2.8449 train_time: 0.9m tok/s: 8297648 
+540/20000 train_loss: 2.8571 train_time: 0.9m tok/s: 8297607 +541/20000 train_loss: 2.2775 train_time: 0.9m tok/s: 8297352 +542/20000 train_loss: 2.8382 train_time: 0.9m tok/s: 8297023 +543/20000 train_loss: 2.8062 train_time: 0.9m tok/s: 8296871 +544/20000 train_loss: 2.8286 train_time: 0.9m tok/s: 8296756 +545/20000 train_loss: 2.7753 train_time: 0.9m tok/s: 8296831 +546/20000 train_loss: 2.8121 train_time: 0.9m tok/s: 8296783 +547/20000 train_loss: 2.7686 train_time: 0.9m tok/s: 8296692 +548/20000 train_loss: 2.7348 train_time: 0.9m tok/s: 8296719 +549/20000 train_loss: 2.6821 train_time: 0.9m tok/s: 8296658 +550/20000 train_loss: 2.7737 train_time: 0.9m tok/s: 8296605 +551/20000 train_loss: 2.7230 train_time: 0.9m tok/s: 8296442 +552/20000 train_loss: 2.8860 train_time: 0.9m tok/s: 8295814 +553/20000 train_loss: 2.7467 train_time: 0.9m tok/s: 8295364 +554/20000 train_loss: 2.5801 train_time: 0.9m tok/s: 8295337 +555/20000 train_loss: 2.6592 train_time: 0.9m tok/s: 8295327 +556/20000 train_loss: 2.7596 train_time: 0.9m tok/s: 8295452 +557/20000 train_loss: 2.8728 train_time: 0.9m tok/s: 8295353 +558/20000 train_loss: 2.8287 train_time: 0.9m tok/s: 8295384 +559/20000 train_loss: 2.7463 train_time: 0.9m tok/s: 8295577 +560/20000 train_loss: 2.7867 train_time: 0.9m tok/s: 8295428 +561/20000 train_loss: 2.7869 train_time: 0.9m tok/s: 8295490 +562/20000 train_loss: 2.8484 train_time: 0.9m tok/s: 8295453 +563/20000 train_loss: 2.8256 train_time: 0.9m tok/s: 8295404 +564/20000 train_loss: 2.9436 train_time: 0.9m tok/s: 8295227 +565/20000 train_loss: 2.8413 train_time: 0.9m tok/s: 8295083 +566/20000 train_loss: 2.7439 train_time: 0.9m tok/s: 8295133 +567/20000 train_loss: 2.6983 train_time: 0.9m tok/s: 8295232 +568/20000 train_loss: 2.8224 train_time: 0.9m tok/s: 8295202 +569/20000 train_loss: 2.6698 train_time: 0.9m tok/s: 8295027 +570/20000 train_loss: 2.6846 train_time: 0.9m tok/s: 8294885 +571/20000 train_loss: 2.7897 train_time: 0.9m tok/s: 8294753 +572/20000 
train_loss: 2.6427 train_time: 0.9m tok/s: 8294656 +573/20000 train_loss: 2.6462 train_time: 0.9m tok/s: 8294515 +574/20000 train_loss: 2.7330 train_time: 0.9m tok/s: 8294542 +575/20000 train_loss: 2.5108 train_time: 0.9m tok/s: 8294419 +576/20000 train_loss: 2.7627 train_time: 0.9m tok/s: 8294459 +577/20000 train_loss: 2.8559 train_time: 0.9m tok/s: 8294174 +578/20000 train_loss: 2.8326 train_time: 0.9m tok/s: 8294102 +579/20000 train_loss: 2.7279 train_time: 0.9m tok/s: 8294262 +580/20000 train_loss: 2.8024 train_time: 0.9m tok/s: 8294401 +581/20000 train_loss: 2.7721 train_time: 0.9m tok/s: 8294249 +582/20000 train_loss: 2.7700 train_time: 0.9m tok/s: 8294145 +583/20000 train_loss: 2.7266 train_time: 0.9m tok/s: 8294124 +584/20000 train_loss: 2.7859 train_time: 0.9m tok/s: 8294111 +585/20000 train_loss: 2.7964 train_time: 0.9m tok/s: 8293971 +586/20000 train_loss: 2.6351 train_time: 0.9m tok/s: 8293941 +587/20000 train_loss: 2.7264 train_time: 0.9m tok/s: 8293729 +588/20000 train_loss: 2.7038 train_time: 0.9m tok/s: 8294039 +589/20000 train_loss: 2.7367 train_time: 0.9m tok/s: 8293831 +590/20000 train_loss: 2.7634 train_time: 0.9m tok/s: 8293785 +591/20000 train_loss: 2.7487 train_time: 0.9m tok/s: 8293786 +592/20000 train_loss: 2.7345 train_time: 0.9m tok/s: 8293874 +593/20000 train_loss: 2.7435 train_time: 0.9m tok/s: 8293836 +594/20000 train_loss: 2.6431 train_time: 0.9m tok/s: 8293626 +595/20000 train_loss: 2.7979 train_time: 0.9m tok/s: 8293454 +596/20000 train_loss: 2.6759 train_time: 0.9m tok/s: 8293401 +597/20000 train_loss: 2.7465 train_time: 0.9m tok/s: 8293380 +598/20000 train_loss: 2.7880 train_time: 0.9m tok/s: 8293143 +599/20000 train_loss: 2.7005 train_time: 0.9m tok/s: 8293061 +600/20000 train_loss: 2.7433 train_time: 0.9m tok/s: 8293208 +601/20000 train_loss: 2.7278 train_time: 0.9m tok/s: 8293121 +602/20000 train_loss: 2.7598 train_time: 1.0m tok/s: 8292862 +603/20000 train_loss: 2.7541 train_time: 1.0m tok/s: 8292835 +604/20000 train_loss: 
2.7468 train_time: 1.0m tok/s: 8292737 +605/20000 train_loss: 2.6539 train_time: 1.0m tok/s: 8292614 +606/20000 train_loss: 2.6537 train_time: 1.0m tok/s: 8292617 +607/20000 train_loss: 2.7395 train_time: 1.0m tok/s: 8292570 +608/20000 train_loss: 2.6538 train_time: 1.0m tok/s: 8292538 +609/20000 train_loss: 2.7184 train_time: 1.0m tok/s: 8292547 +610/20000 train_loss: 2.7696 train_time: 1.0m tok/s: 8292578 +611/20000 train_loss: 2.8885 train_time: 1.0m tok/s: 8292397 +612/20000 train_loss: 2.8268 train_time: 1.0m tok/s: 8292241 +613/20000 train_loss: 2.8057 train_time: 1.0m tok/s: 8292312 +614/20000 train_loss: 2.8002 train_time: 1.0m tok/s: 8292325 +615/20000 train_loss: 2.7492 train_time: 1.0m tok/s: 8292179 +616/20000 train_loss: 2.7672 train_time: 1.0m tok/s: 8292189 +617/20000 train_loss: 2.7294 train_time: 1.0m tok/s: 8292186 +618/20000 train_loss: 2.7387 train_time: 1.0m tok/s: 8292082 +619/20000 train_loss: 2.7837 train_time: 1.0m tok/s: 8292024 +620/20000 train_loss: 2.8850 train_time: 1.0m tok/s: 8292193 +621/20000 train_loss: 2.6857 train_time: 1.0m tok/s: 8292304 +622/20000 train_loss: 2.7209 train_time: 1.0m tok/s: 8292369 +623/20000 train_loss: 2.7269 train_time: 1.0m tok/s: 8292235 +624/20000 train_loss: 2.4428 train_time: 1.0m tok/s: 8292162 +625/20000 train_loss: 2.7508 train_time: 1.0m tok/s: 8292076 +626/20000 train_loss: 2.8956 train_time: 1.0m tok/s: 8292052 +627/20000 train_loss: 2.6854 train_time: 1.0m tok/s: 8291906 +628/20000 train_loss: 2.8623 train_time: 1.0m tok/s: 8291765 +629/20000 train_loss: 2.8480 train_time: 1.0m tok/s: 8291793 +630/20000 train_loss: 2.6961 train_time: 1.0m tok/s: 8291957 +631/20000 train_loss: 2.8210 train_time: 1.0m tok/s: 8291899 +632/20000 train_loss: 2.8361 train_time: 1.0m tok/s: 8291999 +633/20000 train_loss: 2.7132 train_time: 1.0m tok/s: 8292019 +634/20000 train_loss: 2.9400 train_time: 1.0m tok/s: 8291947 +635/20000 train_loss: 2.7373 train_time: 1.0m tok/s: 8291790 +636/20000 train_loss: 2.8682 
train_time: 1.0m tok/s: 8291634 +637/20000 train_loss: 2.7559 train_time: 1.0m tok/s: 8291626 +638/20000 train_loss: 2.5723 train_time: 1.0m tok/s: 8291654 +639/20000 train_loss: 2.7300 train_time: 1.0m tok/s: 8291261 +640/20000 train_loss: 2.7084 train_time: 1.0m tok/s: 8291200 +641/20000 train_loss: 2.7875 train_time: 1.0m tok/s: 8291288 +642/20000 train_loss: 2.7913 train_time: 1.0m tok/s: 8291385 +643/20000 train_loss: 2.7552 train_time: 1.0m tok/s: 8291324 +644/20000 train_loss: 2.8034 train_time: 1.0m tok/s: 8291446 +645/20000 train_loss: 2.8804 train_time: 1.0m tok/s: 8291474 +646/20000 train_loss: 2.7743 train_time: 1.0m tok/s: 8291461 +647/20000 train_loss: 2.8246 train_time: 1.0m tok/s: 8291031 +648/20000 train_loss: 2.7339 train_time: 1.0m tok/s: 8290672 +649/20000 train_loss: 2.8670 train_time: 1.0m tok/s: 8290728 +650/20000 train_loss: 2.7541 train_time: 1.0m tok/s: 8290760 +651/20000 train_loss: 2.7381 train_time: 1.0m tok/s: 8290813 +652/20000 train_loss: 2.7038 train_time: 1.0m tok/s: 8290727 +653/20000 train_loss: 2.6509 train_time: 1.0m tok/s: 8290869 +654/20000 train_loss: 2.7104 train_time: 1.0m tok/s: 8291015 +655/20000 train_loss: 2.7024 train_time: 1.0m tok/s: 8290677 +656/20000 train_loss: 2.6435 train_time: 1.0m tok/s: 8290577 +657/20000 train_loss: 2.6545 train_time: 1.0m tok/s: 8290571 +658/20000 train_loss: 2.6972 train_time: 1.0m tok/s: 8290617 +659/20000 train_loss: 2.7437 train_time: 1.0m tok/s: 8290431 +660/20000 train_loss: 2.7460 train_time: 1.0m tok/s: 8290551 +661/20000 train_loss: 2.7993 train_time: 1.0m tok/s: 8290684 +662/20000 train_loss: 2.6861 train_time: 1.0m tok/s: 8290708 +663/20000 train_loss: 2.7790 train_time: 1.0m tok/s: 8290765 +664/20000 train_loss: 2.7825 train_time: 1.0m tok/s: 8290685 +665/20000 train_loss: 2.8298 train_time: 1.1m tok/s: 8290614 +666/20000 train_loss: 2.8237 train_time: 1.1m tok/s: 8290579 +667/20000 train_loss: 2.7573 train_time: 1.1m tok/s: 8290581 +668/20000 train_loss: 2.7381 train_time: 
1.1m tok/s: 8290360 +669/20000 train_loss: 2.6267 train_time: 1.1m tok/s: 8290334 +670/20000 train_loss: 2.6448 train_time: 1.1m tok/s: 8290342 +671/20000 train_loss: 2.6554 train_time: 1.1m tok/s: 8290344 +672/20000 train_loss: 2.7816 train_time: 1.1m tok/s: 8290337 +673/20000 train_loss: 2.6220 train_time: 1.1m tok/s: 8290364 +674/20000 train_loss: 2.8539 train_time: 1.1m tok/s: 8290391 +675/20000 train_loss: 2.6158 train_time: 1.1m tok/s: 8290197 +676/20000 train_loss: 2.8493 train_time: 1.1m tok/s: 8290011 +677/20000 train_loss: 2.6765 train_time: 1.1m tok/s: 8289989 +678/20000 train_loss: 2.7624 train_time: 1.1m tok/s: 8289810 +679/20000 train_loss: 2.6982 train_time: 1.1m tok/s: 8289779 +680/20000 train_loss: 2.9003 train_time: 1.1m tok/s: 8289750 +681/20000 train_loss: 2.7762 train_time: 1.1m tok/s: 8289599 +682/20000 train_loss: 2.8772 train_time: 1.1m tok/s: 8289619 +683/20000 train_loss: 2.8561 train_time: 1.1m tok/s: 8289551 +684/20000 train_loss: 2.7920 train_time: 1.1m tok/s: 8289588 +685/20000 train_loss: 2.6524 train_time: 1.1m tok/s: 8289406 +686/20000 train_loss: 2.8805 train_time: 1.1m tok/s: 8289237 +687/20000 train_loss: 2.7728 train_time: 1.1m tok/s: 8289281 +688/20000 train_loss: 2.7770 train_time: 1.1m tok/s: 8289163 +689/20000 train_loss: 2.8105 train_time: 1.1m tok/s: 8288949 +690/20000 train_loss: 2.7369 train_time: 1.1m tok/s: 8288955 +691/20000 train_loss: 2.8549 train_time: 1.1m tok/s: 8288916 +692/20000 train_loss: 2.9056 train_time: 1.1m tok/s: 8288893 +693/20000 train_loss: 2.8113 train_time: 1.1m tok/s: 8288919 +694/20000 train_loss: 2.8118 train_time: 1.1m tok/s: 8288815 +695/20000 train_loss: 2.8009 train_time: 1.1m tok/s: 8288690 +696/20000 train_loss: 2.8027 train_time: 1.1m tok/s: 8288837 +697/20000 train_loss: 2.6642 train_time: 1.1m tok/s: 8288745 +698/20000 train_loss: 2.8324 train_time: 1.1m tok/s: 8288368 +699/20000 train_loss: 2.6872 train_time: 1.1m tok/s: 8288074 +700/20000 train_loss: 2.6322 train_time: 1.1m tok/s: 
8287932 +701/20000 train_loss: 2.6364 train_time: 1.1m tok/s: 8287947 +702/20000 train_loss: 2.6237 train_time: 1.1m tok/s: 8287833 +703/20000 train_loss: 2.5073 train_time: 1.1m tok/s: 8287705 +704/20000 train_loss: 2.8425 train_time: 1.1m tok/s: 8287604 +705/20000 train_loss: 2.7876 train_time: 1.1m tok/s: 8287590 +706/20000 train_loss: 2.7544 train_time: 1.1m tok/s: 8287538 +707/20000 train_loss: 2.7560 train_time: 1.1m tok/s: 8287434 +708/20000 train_loss: 2.8378 train_time: 1.1m tok/s: 8287559 +709/20000 train_loss: 2.8036 train_time: 1.1m tok/s: 8287479 +710/20000 train_loss: 2.6331 train_time: 1.1m tok/s: 8287532 +711/20000 train_loss: 2.7086 train_time: 1.1m tok/s: 8287363 +712/20000 train_loss: 2.6351 train_time: 1.1m tok/s: 8287278 +713/20000 train_loss: 2.6930 train_time: 1.1m tok/s: 8287274 +714/20000 train_loss: 2.7566 train_time: 1.1m tok/s: 8287395 +715/20000 train_loss: 2.7041 train_time: 1.1m tok/s: 8287407 +716/20000 train_loss: 2.7203 train_time: 1.1m tok/s: 8287466 +717/20000 train_loss: 2.9148 train_time: 1.1m tok/s: 8287536 +718/20000 train_loss: 2.8068 train_time: 1.1m tok/s: 8287570 +719/20000 train_loss: 2.7488 train_time: 1.1m tok/s: 8287489 +720/20000 train_loss: 2.6903 train_time: 1.1m tok/s: 8287377 +721/20000 train_loss: 2.8322 train_time: 1.1m tok/s: 8287244 +722/20000 train_loss: 2.6618 train_time: 1.1m tok/s: 8287205 +723/20000 train_loss: 2.8684 train_time: 1.1m tok/s: 8287193 +724/20000 train_loss: 2.7711 train_time: 1.1m tok/s: 8287123 +725/20000 train_loss: 2.6372 train_time: 1.1m tok/s: 8287242 +726/20000 train_loss: 2.7800 train_time: 1.1m tok/s: 8287144 +727/20000 train_loss: 2.5993 train_time: 1.1m tok/s: 8287105 +728/20000 train_loss: 2.7998 train_time: 1.2m tok/s: 8287227 +729/20000 train_loss: 2.8422 train_time: 1.2m tok/s: 8287199 +730/20000 train_loss: 2.7738 train_time: 1.2m tok/s: 8287180 +731/20000 train_loss: 2.8704 train_time: 1.2m tok/s: 8287059 +732/20000 train_loss: 2.7076 train_time: 1.2m tok/s: 8287060 
+733/20000 train_loss: 2.8806 train_time: 1.2m tok/s: 8287028 +734/20000 train_loss: 2.7344 train_time: 1.2m tok/s: 8287003 +735/20000 train_loss: 2.7922 train_time: 1.2m tok/s: 8287056 +736/20000 train_loss: 2.6702 train_time: 1.2m tok/s: 8286964 +737/20000 train_loss: 2.8014 train_time: 1.2m tok/s: 8286829 +738/20000 train_loss: 2.6694 train_time: 1.2m tok/s: 8286864 +739/20000 train_loss: 2.5922 train_time: 1.2m tok/s: 8286857 +740/20000 train_loss: 2.8257 train_time: 1.2m tok/s: 8286743 +741/20000 train_loss: 2.8210 train_time: 1.2m tok/s: 8286822 +742/20000 train_loss: 2.6856 train_time: 1.2m tok/s: 8286793 +743/20000 train_loss: 2.8417 train_time: 1.2m tok/s: 8286599 +744/20000 train_loss: 2.7309 train_time: 1.2m tok/s: 8286695 +745/20000 train_loss: 2.7394 train_time: 1.2m tok/s: 8286701 +746/20000 train_loss: 2.8148 train_time: 1.2m tok/s: 8286741 +747/20000 train_loss: 2.6977 train_time: 1.2m tok/s: 8286738 +748/20000 train_loss: 2.7446 train_time: 1.2m tok/s: 8286550 +749/20000 train_loss: 2.8040 train_time: 1.2m tok/s: 8286546 +750/20000 train_loss: 2.8129 train_time: 1.2m tok/s: 8286422 +751/20000 train_loss: 2.6858 train_time: 1.2m tok/s: 8286398 +752/20000 train_loss: 2.7723 train_time: 1.2m tok/s: 8286289 +753/20000 train_loss: 2.4233 train_time: 1.2m tok/s: 8285985 +754/20000 train_loss: 2.6697 train_time: 1.2m tok/s: 8285837 +755/20000 train_loss: 2.8646 train_time: 1.2m tok/s: 8285636 +756/20000 train_loss: 3.1184 train_time: 1.2m tok/s: 8285857 +757/20000 train_loss: 2.7884 train_time: 1.2m tok/s: 8285855 +758/20000 train_loss: 2.7222 train_time: 1.2m tok/s: 8285910 +759/20000 train_loss: 2.6838 train_time: 1.2m tok/s: 8285873 +760/20000 train_loss: 2.8606 train_time: 1.2m tok/s: 8285817 +761/20000 train_loss: 2.7328 train_time: 1.2m tok/s: 8285740 +762/20000 train_loss: 2.8295 train_time: 1.2m tok/s: 8285752 +763/20000 train_loss: 2.6534 train_time: 1.2m tok/s: 8285681 +764/20000 train_loss: 2.7027 train_time: 1.2m tok/s: 8285633 +765/20000 
train_loss: 2.6762 train_time: 1.2m tok/s: 8285651 +766/20000 train_loss: 2.6706 train_time: 1.2m tok/s: 8285654 +767/20000 train_loss: 2.6942 train_time: 1.2m tok/s: 8285526 +768/20000 train_loss: 2.7364 train_time: 1.2m tok/s: 8285790 +769/20000 train_loss: 2.7678 train_time: 1.2m tok/s: 8285813 +770/20000 train_loss: 2.7713 train_time: 1.2m tok/s: 8285829 +771/20000 train_loss: 2.7854 train_time: 1.2m tok/s: 8285690 +772/20000 train_loss: 2.7680 train_time: 1.2m tok/s: 8285752 +773/20000 train_loss: 2.7062 train_time: 1.2m tok/s: 8285637 +774/20000 train_loss: 2.8421 train_time: 1.2m tok/s: 8285620 +775/20000 train_loss: 2.8081 train_time: 1.2m tok/s: 8285583 +776/20000 train_loss: 2.9095 train_time: 1.2m tok/s: 8285426 +777/20000 train_loss: 2.8581 train_time: 1.2m tok/s: 8285145 +778/20000 train_loss: 2.7098 train_time: 1.2m tok/s: 8285025 +779/20000 train_loss: 2.4396 train_time: 1.2m tok/s: 8284973 +780/20000 train_loss: 2.7767 train_time: 1.2m tok/s: 8284944 +781/20000 train_loss: 2.7561 train_time: 1.2m tok/s: 8284938 +782/20000 train_loss: 3.0301 train_time: 1.2m tok/s: 8284851 +783/20000 train_loss: 2.5278 train_time: 1.2m tok/s: 8284643 +784/20000 train_loss: 2.9036 train_time: 1.2m tok/s: 8284604 +785/20000 train_loss: 2.8602 train_time: 1.2m tok/s: 8284614 +786/20000 train_loss: 2.7254 train_time: 1.2m tok/s: 8284701 +787/20000 train_loss: 2.6370 train_time: 1.2m tok/s: 8284537 +788/20000 train_loss: 2.6751 train_time: 1.2m tok/s: 8284477 +789/20000 train_loss: 2.7998 train_time: 1.2m tok/s: 8284355 +790/20000 train_loss: 2.6408 train_time: 1.2m tok/s: 8284282 +791/20000 train_loss: 2.5974 train_time: 1.3m tok/s: 8284066 +792/20000 train_loss: 2.7320 train_time: 1.3m tok/s: 8284280 +793/20000 train_loss: 2.7067 train_time: 1.3m tok/s: 8284293 +794/20000 train_loss: 2.7156 train_time: 1.3m tok/s: 8284202 +795/20000 train_loss: 2.8526 train_time: 1.3m tok/s: 8284197 +796/20000 train_loss: 2.7175 train_time: 1.3m tok/s: 8284283 +797/20000 train_loss: 
2.7368 train_time: 1.3m tok/s: 8284349 +798/20000 train_loss: 2.7531 train_time: 1.3m tok/s: 8284365 +799/20000 train_loss: 2.7870 train_time: 1.3m tok/s: 8284248 +800/20000 train_loss: 2.7145 train_time: 1.3m tok/s: 8284159 +801/20000 train_loss: 2.7456 train_time: 1.3m tok/s: 8284098 +802/20000 train_loss: 2.8196 train_time: 1.3m tok/s: 8284074 +803/20000 train_loss: 2.6844 train_time: 1.3m tok/s: 8283884 +804/20000 train_loss: 2.6678 train_time: 1.3m tok/s: 8283988 +805/20000 train_loss: 2.6796 train_time: 1.3m tok/s: 8283966 +806/20000 train_loss: 2.7984 train_time: 1.3m tok/s: 8283964 +807/20000 train_loss: 2.7847 train_time: 1.3m tok/s: 8284008 +808/20000 train_loss: 2.8127 train_time: 1.3m tok/s: 8284016 +809/20000 train_loss: 2.6330 train_time: 1.3m tok/s: 8284066 +810/20000 train_loss: 2.8493 train_time: 1.3m tok/s: 8284047 +811/20000 train_loss: 2.8467 train_time: 1.3m tok/s: 8284090 +812/20000 train_loss: 2.6820 train_time: 1.3m tok/s: 8284137 +813/20000 train_loss: 2.7413 train_time: 1.3m tok/s: 8284163 +814/20000 train_loss: 2.7909 train_time: 1.3m tok/s: 8284153 +815/20000 train_loss: 2.8901 train_time: 1.3m tok/s: 8284055 +816/20000 train_loss: 2.7193 train_time: 1.3m tok/s: 8283992 +817/20000 train_loss: 2.7113 train_time: 1.3m tok/s: 8283894 +818/20000 train_loss: 2.7761 train_time: 1.3m tok/s: 8284127 +819/20000 train_loss: 2.7927 train_time: 1.3m tok/s: 8284131 +820/20000 train_loss: 3.0427 train_time: 1.3m tok/s: 8283873 +821/20000 train_loss: 2.7823 train_time: 1.3m tok/s: 8283755 +822/20000 train_loss: 2.5985 train_time: 1.3m tok/s: 8283736 +823/20000 train_loss: 2.6452 train_time: 1.3m tok/s: 8283658 +824/20000 train_loss: 2.8128 train_time: 1.3m tok/s: 8283683 +825/20000 train_loss: 2.8863 train_time: 1.3m tok/s: 8283772 +826/20000 train_loss: 2.8625 train_time: 1.3m tok/s: 8283709 +827/20000 train_loss: 2.6439 train_time: 1.3m tok/s: 8283612 +828/20000 train_loss: 2.7236 train_time: 1.3m tok/s: 8283645 +829/20000 train_loss: 3.3586 
train_time: 1.3m tok/s: 8283637 +830/20000 train_loss: 2.7508 train_time: 1.3m tok/s: 8283385 +831/20000 train_loss: 2.7402 train_time: 1.3m tok/s: 8283354 +832/20000 train_loss: 2.7701 train_time: 1.3m tok/s: 8283367 +833/20000 train_loss: 2.8682 train_time: 1.3m tok/s: 8283426 +834/20000 train_loss: 2.6907 train_time: 1.3m tok/s: 8283426 +835/20000 train_loss: 2.7753 train_time: 1.3m tok/s: 8283525 +836/20000 train_loss: 2.6145 train_time: 1.3m tok/s: 8283468 +837/20000 train_loss: 2.5094 train_time: 1.3m tok/s: 8283302 +838/20000 train_loss: 2.6224 train_time: 1.3m tok/s: 8283166 +839/20000 train_loss: 2.7189 train_time: 1.3m tok/s: 8283046 +840/20000 train_loss: 3.1215 train_time: 1.3m tok/s: 8283095 +841/20000 train_loss: 2.7179 train_time: 1.3m tok/s: 8283002 +842/20000 train_loss: 2.7246 train_time: 1.3m tok/s: 8283064 +843/20000 train_loss: 2.6458 train_time: 1.3m tok/s: 8283081 +844/20000 train_loss: 2.7318 train_time: 1.3m tok/s: 8283131 +845/20000 train_loss: 2.6844 train_time: 1.3m tok/s: 8282975 +846/20000 train_loss: 2.6742 train_time: 1.3m tok/s: 8282911 +847/20000 train_loss: 2.7202 train_time: 1.3m tok/s: 8282805 +848/20000 train_loss: 2.6418 train_time: 1.3m tok/s: 8282863 +849/20000 train_loss: 2.7445 train_time: 1.3m tok/s: 8282880 +850/20000 train_loss: 2.5808 train_time: 1.3m tok/s: 8282837 +851/20000 train_loss: 2.7577 train_time: 1.3m tok/s: 8282759 +852/20000 train_loss: 2.5650 train_time: 1.3m tok/s: 8282945 +853/20000 train_loss: 2.7158 train_time: 1.3m tok/s: 8282953 +854/20000 train_loss: 2.7030 train_time: 1.4m tok/s: 8283006 +855/20000 train_loss: 2.7567 train_time: 1.4m tok/s: 8283060 +856/20000 train_loss: 2.6897 train_time: 1.4m tok/s: 8283148 +857/20000 train_loss: 2.8367 train_time: 1.4m tok/s: 8283106 +858/20000 train_loss: 2.8318 train_time: 1.4m tok/s: 8283159 +859/20000 train_loss: 2.7234 train_time: 1.4m tok/s: 8283119 +860/20000 train_loss: 2.6616 train_time: 1.4m tok/s: 8282995 +861/20000 train_loss: 2.7028 train_time: 
1.4m tok/s: 8282852 +862/20000 train_loss: 2.6611 train_time: 1.4m tok/s: 8282720 +863/20000 train_loss: 2.8908 train_time: 1.4m tok/s: 8282649 +864/20000 train_loss: 2.7454 train_time: 1.4m tok/s: 8282594 +865/20000 train_loss: 2.7705 train_time: 1.4m tok/s: 8282550 +866/20000 train_loss: 2.6292 train_time: 1.4m tok/s: 8282443 +867/20000 train_loss: 2.6718 train_time: 1.4m tok/s: 8282290 +868/20000 train_loss: 2.6810 train_time: 1.4m tok/s: 8282253 +869/20000 train_loss: 2.6973 train_time: 1.4m tok/s: 8282296 +870/20000 train_loss: 2.6658 train_time: 1.4m tok/s: 8282385 +871/20000 train_loss: 2.6591 train_time: 1.4m tok/s: 8282410 +872/20000 train_loss: 2.7496 train_time: 1.4m tok/s: 8282518 +873/20000 train_loss: 2.6622 train_time: 1.4m tok/s: 8282560 +874/20000 train_loss: 2.8041 train_time: 1.4m tok/s: 8282501 +875/20000 train_loss: 2.7573 train_time: 1.4m tok/s: 8282506 +876/20000 train_loss: 2.8069 train_time: 1.4m tok/s: 8282639 +877/20000 train_loss: 2.6919 train_time: 1.4m tok/s: 8282663 +878/20000 train_loss: 2.6855 train_time: 1.4m tok/s: 8282597 +879/20000 train_loss: 2.7334 train_time: 1.4m tok/s: 8282511 +880/20000 train_loss: 2.7440 train_time: 1.4m tok/s: 8282579 +881/20000 train_loss: 2.6585 train_time: 1.4m tok/s: 8282657 +882/20000 train_loss: 2.6874 train_time: 1.4m tok/s: 8282639 +883/20000 train_loss: 2.7764 train_time: 1.4m tok/s: 8282669 +884/20000 train_loss: 2.5184 train_time: 1.4m tok/s: 8282694 +885/20000 train_loss: 2.6316 train_time: 1.4m tok/s: 8282687 +886/20000 train_loss: 2.7051 train_time: 1.4m tok/s: 8282647 +887/20000 train_loss: 2.6697 train_time: 1.4m tok/s: 8282538 +888/20000 train_loss: 2.6353 train_time: 1.4m tok/s: 8282520 +889/20000 train_loss: 2.8020 train_time: 1.4m tok/s: 8282549 +890/20000 train_loss: 2.6013 train_time: 1.4m tok/s: 8282480 +891/20000 train_loss: 2.6885 train_time: 1.4m tok/s: 8282441 +892/20000 train_loss: 2.7038 train_time: 1.4m tok/s: 8282451 +893/20000 train_loss: 2.6438 train_time: 1.4m tok/s: 
8282555 +894/20000 train_loss: 2.7089 train_time: 1.4m tok/s: 8282662 +895/20000 train_loss: 2.7350 train_time: 1.4m tok/s: 8282678 +896/20000 train_loss: 2.7955 train_time: 1.4m tok/s: 8282654 +897/20000 train_loss: 2.7189 train_time: 1.4m tok/s: 8282586 +898/20000 train_loss: 2.6886 train_time: 1.4m tok/s: 8282492 +899/20000 train_loss: 2.6657 train_time: 1.4m tok/s: 8282422 +900/20000 train_loss: 2.7236 train_time: 1.4m tok/s: 8282485 +901/20000 train_loss: 2.6352 train_time: 1.4m tok/s: 8282619 +902/20000 train_loss: 2.6263 train_time: 1.4m tok/s: 8282567 +903/20000 train_loss: 2.5843 train_time: 1.4m tok/s: 8282501 +904/20000 train_loss: 2.5642 train_time: 1.4m tok/s: 8282551 +905/20000 train_loss: 2.7046 train_time: 1.4m tok/s: 8282610 +906/20000 train_loss: 2.7732 train_time: 1.4m tok/s: 8282721 +907/20000 train_loss: 2.7487 train_time: 1.4m tok/s: 8282770 +908/20000 train_loss: 2.8283 train_time: 1.4m tok/s: 8282810 +909/20000 train_loss: 2.7663 train_time: 1.4m tok/s: 8282775 +910/20000 train_loss: 2.8212 train_time: 1.4m tok/s: 8282780 +911/20000 train_loss: 2.7224 train_time: 1.4m tok/s: 8282680 +912/20000 train_loss: 2.5403 train_time: 1.4m tok/s: 8282645 +913/20000 train_loss: 2.7242 train_time: 1.4m tok/s: 8282428 +914/20000 train_loss: 2.8067 train_time: 1.4m tok/s: 8282366 +915/20000 train_loss: 2.7448 train_time: 1.4m tok/s: 8282333 +916/20000 train_loss: 2.7434 train_time: 1.4m tok/s: 8282346 +917/20000 train_loss: 2.6750 train_time: 1.5m tok/s: 8282292 +918/20000 train_loss: 2.5485 train_time: 1.5m tok/s: 8282291 +919/20000 train_loss: 2.6226 train_time: 1.5m tok/s: 8282296 +920/20000 train_loss: 2.6432 train_time: 1.5m tok/s: 8282342 +921/20000 train_loss: 2.5005 train_time: 1.5m tok/s: 8282354 +922/20000 train_loss: 2.7089 train_time: 1.5m tok/s: 8282287 +923/20000 train_loss: 2.6289 train_time: 1.5m tok/s: 8282204 +924/20000 train_loss: 2.6164 train_time: 1.5m tok/s: 8282172 +925/20000 train_loss: 2.9349 train_time: 1.5m tok/s: 8282231 
+926/20000 train_loss: 2.5570 train_time: 1.5m tok/s: 8282111 +927/20000 train_loss: 2.7460 train_time: 1.5m tok/s: 8281989 +928/20000 train_loss: 2.7942 train_time: 1.5m tok/s: 8282002 +929/20000 train_loss: 2.7155 train_time: 1.5m tok/s: 8282039 +930/20000 train_loss: 2.8777 train_time: 1.5m tok/s: 8281976 +931/20000 train_loss: 2.7602 train_time: 1.5m tok/s: 8281895 +932/20000 train_loss: 2.6924 train_time: 1.5m tok/s: 8281982 +933/20000 train_loss: 2.7150 train_time: 1.5m tok/s: 8281957 +934/20000 train_loss: 2.7210 train_time: 1.5m tok/s: 8282028 +935/20000 train_loss: 2.7939 train_time: 1.5m tok/s: 8281931 +936/20000 train_loss: 2.6099 train_time: 1.5m tok/s: 8281897 +937/20000 train_loss: 2.7213 train_time: 1.5m tok/s: 8281915 +938/20000 train_loss: 2.5071 train_time: 1.5m tok/s: 8281768 +939/20000 train_loss: 2.5098 train_time: 1.5m tok/s: 8281393 +940/20000 train_loss: 2.7674 train_time: 1.5m tok/s: 8281196 +941/20000 train_loss: 2.8648 train_time: 1.5m tok/s: 8281145 +942/20000 train_loss: 2.6968 train_time: 1.5m tok/s: 8281238 +943/20000 train_loss: 2.7095 train_time: 1.5m tok/s: 8281202 +944/20000 train_loss: 2.8043 train_time: 1.5m tok/s: 8281332 +945/20000 train_loss: 2.7210 train_time: 1.5m tok/s: 8281482 +946/20000 train_loss: 2.6215 train_time: 1.5m tok/s: 8281504 +947/20000 train_loss: 2.7908 train_time: 1.5m tok/s: 8281440 +948/20000 train_loss: 2.7184 train_time: 1.5m tok/s: 8281365 +949/20000 train_loss: 2.7113 train_time: 1.5m tok/s: 8281369 +950/20000 train_loss: 2.7234 train_time: 1.5m tok/s: 8281364 +951/20000 train_loss: 2.7861 train_time: 1.5m tok/s: 8281399 +952/20000 train_loss: 2.5454 train_time: 1.5m tok/s: 8281414 +953/20000 train_loss: 2.6837 train_time: 1.5m tok/s: 8281395 +954/20000 train_loss: 2.6914 train_time: 1.5m tok/s: 8281298 +955/20000 train_loss: 2.8038 train_time: 1.5m tok/s: 8281226 +956/20000 train_loss: 2.8076 train_time: 1.5m tok/s: 8281174 +957/20000 train_loss: 2.6751 train_time: 1.5m tok/s: 8281122 +958/20000 
train_loss: 2.7700 train_time: 1.5m tok/s: 8281002 +959/20000 train_loss: 2.7776 train_time: 1.5m tok/s: 8280961 +960/20000 train_loss: 2.9861 train_time: 1.5m tok/s: 8280892 +961/20000 train_loss: 2.7868 train_time: 1.5m tok/s: 8280818 +962/20000 train_loss: 2.7003 train_time: 1.5m tok/s: 8280895 +963/20000 train_loss: 2.7510 train_time: 1.5m tok/s: 8280940 +964/20000 train_loss: 2.6577 train_time: 1.5m tok/s: 8280988 +965/20000 train_loss: 2.7806 train_time: 1.5m tok/s: 8280942 +966/20000 train_loss: 2.7705 train_time: 1.5m tok/s: 8280966 +967/20000 train_loss: 2.5687 train_time: 1.5m tok/s: 8280978 +968/20000 train_loss: 2.6721 train_time: 1.5m tok/s: 8280937 +969/20000 train_loss: 2.7002 train_time: 1.5m tok/s: 8280752 +970/20000 train_loss: 2.6946 train_time: 1.5m tok/s: 8280520 +971/20000 train_loss: 2.5434 train_time: 1.5m tok/s: 8280267 +972/20000 train_loss: 2.4965 train_time: 1.5m tok/s: 8280107 +973/20000 train_loss: 2.5650 train_time: 1.5m tok/s: 8280018 +974/20000 train_loss: 2.6905 train_time: 1.5m tok/s: 8280028 +975/20000 train_loss: 2.7445 train_time: 1.5m tok/s: 8280046 +976/20000 train_loss: 2.6783 train_time: 1.5m tok/s: 8280073 +977/20000 train_loss: 2.7329 train_time: 1.5m tok/s: 8280089 +978/20000 train_loss: 2.7636 train_time: 1.5m tok/s: 8280068 +979/20000 train_loss: 2.6065 train_time: 1.5m tok/s: 8279970 +980/20000 train_loss: 2.6460 train_time: 1.6m tok/s: 8279858 +981/20000 train_loss: 2.6371 train_time: 1.6m tok/s: 8279844 +982/20000 train_loss: 2.6228 train_time: 1.6m tok/s: 8279865 +983/20000 train_loss: 2.7465 train_time: 1.6m tok/s: 8279820 +984/20000 train_loss: 2.6417 train_time: 1.6m tok/s: 8279816 +985/20000 train_loss: 2.7577 train_time: 1.6m tok/s: 8279796 +986/20000 train_loss: 2.6871 train_time: 1.6m tok/s: 8279918 +987/20000 train_loss: 2.6790 train_time: 1.6m tok/s: 8280006 +988/20000 train_loss: 2.5425 train_time: 1.6m tok/s: 8280050 +989/20000 train_loss: 2.6675 train_time: 1.6m tok/s: 8280001 +990/20000 train_loss: 
2.6587 train_time: 1.6m tok/s: 8279961 +991/20000 train_loss: 2.8125 train_time: 1.6m tok/s: 8279857 +992/20000 train_loss: 2.6258 train_time: 1.6m tok/s: 8279823 +993/20000 train_loss: 2.5734 train_time: 1.6m tok/s: 8279793 +994/20000 train_loss: 2.7181 train_time: 1.6m tok/s: 8279795 +995/20000 train_loss: 2.8781 train_time: 1.6m tok/s: 8279757 +996/20000 train_loss: 2.8183 train_time: 1.6m tok/s: 8279797 +997/20000 train_loss: 2.7770 train_time: 1.6m tok/s: 8279897 +998/20000 train_loss: 2.6601 train_time: 1.6m tok/s: 8279980 +999/20000 train_loss: 2.7453 train_time: 1.6m tok/s: 8279991 +1000/20000 train_loss: 2.7853 train_time: 1.6m tok/s: 8279967 +1001/20000 train_loss: 2.6722 train_time: 1.6m tok/s: 8279931 +1002/20000 train_loss: 2.7258 train_time: 1.6m tok/s: 8279836 +1003/20000 train_loss: 2.6500 train_time: 1.6m tok/s: 8279714 +1004/20000 train_loss: 2.6541 train_time: 1.6m tok/s: 8279748 +1005/20000 train_loss: 2.6383 train_time: 1.6m tok/s: 8279821 +1006/20000 train_loss: 2.7077 train_time: 1.6m tok/s: 8279810 +1007/20000 train_loss: 2.5859 train_time: 1.6m tok/s: 8279767 +1008/20000 train_loss: 2.5331 train_time: 1.6m tok/s: 8279748 +1009/20000 train_loss: 2.6822 train_time: 1.6m tok/s: 8279745 +1010/20000 train_loss: 2.8213 train_time: 1.6m tok/s: 8279838 +1011/20000 train_loss: 2.8061 train_time: 1.6m tok/s: 8279865 +1012/20000 train_loss: 2.4137 train_time: 1.6m tok/s: 8279668 +1013/20000 train_loss: 2.6116 train_time: 1.6m tok/s: 8279481 +1014/20000 train_loss: 2.7258 train_time: 1.6m tok/s: 8279534 +1015/20000 train_loss: 2.7578 train_time: 1.6m tok/s: 8279501 +1016/20000 train_loss: 2.5612 train_time: 1.6m tok/s: 8279408 +1017/20000 train_loss: 2.7126 train_time: 1.6m tok/s: 8279482 +1018/20000 train_loss: 2.7840 train_time: 1.6m tok/s: 8279400 +1019/20000 train_loss: 2.6648 train_time: 1.6m tok/s: 8279296 +1020/20000 train_loss: 2.6915 train_time: 1.6m tok/s: 8279358 +1021/20000 train_loss: 2.6648 train_time: 1.6m tok/s: 8279280 +1022/20000 
train_loss: 2.7607 train_time: 1.6m tok/s: 8279234 +1023/20000 train_loss: 2.6658 train_time: 1.6m tok/s: 8279242 +1024/20000 train_loss: 2.6913 train_time: 1.6m tok/s: 8279251 +1025/20000 train_loss: 2.7551 train_time: 1.6m tok/s: 8279340 +1026/20000 train_loss: 3.3119 train_time: 1.6m tok/s: 8279226 +1027/20000 train_loss: 2.5488 train_time: 1.6m tok/s: 8279094 +1028/20000 train_loss: 2.6099 train_time: 1.6m tok/s: 8279099 +1029/20000 train_loss: 2.6992 train_time: 1.6m tok/s: 8279099 +1030/20000 train_loss: 2.5460 train_time: 1.6m tok/s: 8279086 +1031/20000 train_loss: 2.6129 train_time: 1.6m tok/s: 8278969 +1032/20000 train_loss: 2.7259 train_time: 1.6m tok/s: 8279021 +1033/20000 train_loss: 2.8420 train_time: 1.6m tok/s: 8279019 +1034/20000 train_loss: 2.6285 train_time: 1.6m tok/s: 8278967 +1035/20000 train_loss: 2.7273 train_time: 1.6m tok/s: 8279084 +1036/20000 train_loss: 2.6872 train_time: 1.6m tok/s: 8279121 +1037/20000 train_loss: 2.6931 train_time: 1.6m tok/s: 8279137 +1038/20000 train_loss: 2.4912 train_time: 1.6m tok/s: 8279123 +1039/20000 train_loss: 2.6978 train_time: 1.6m tok/s: 8279143 +1040/20000 train_loss: 2.6469 train_time: 1.6m tok/s: 8279184 +1041/20000 train_loss: 2.6604 train_time: 1.6m tok/s: 8279216 +1042/20000 train_loss: 2.6622 train_time: 1.6m tok/s: 8279190 +1043/20000 train_loss: 2.7179 train_time: 1.7m tok/s: 8279152 +1044/20000 train_loss: 2.6576 train_time: 1.7m tok/s: 8279057 +1045/20000 train_loss: 2.7913 train_time: 1.7m tok/s: 8278999 +1046/20000 train_loss: 2.7346 train_time: 1.7m tok/s: 8279042 +1047/20000 train_loss: 2.6988 train_time: 1.7m tok/s: 8279138 +1048/20000 train_loss: 2.5977 train_time: 1.7m tok/s: 8279160 +1049/20000 train_loss: 2.7256 train_time: 1.7m tok/s: 8279213 +1050/20000 train_loss: 2.8179 train_time: 1.7m tok/s: 8279276 +1051/20000 train_loss: 2.7274 train_time: 1.7m tok/s: 8279362 +1052/20000 train_loss: 2.6186 train_time: 1.7m tok/s: 8279217 +1053/20000 train_loss: 2.6456 train_time: 1.7m tok/s: 
8279136 +1054/20000 train_loss: 2.6220 train_time: 1.7m tok/s: 8279033 +1055/20000 train_loss: 2.5833 train_time: 1.7m tok/s: 8278952 +1056/20000 train_loss: 2.6547 train_time: 1.7m tok/s: 8278942 +1057/20000 train_loss: 2.7190 train_time: 1.7m tok/s: 8278788 +1058/20000 train_loss: 2.6031 train_time: 1.7m tok/s: 8278809 +1059/20000 train_loss: 2.7477 train_time: 1.7m tok/s: 8278820 +1060/20000 train_loss: 2.6680 train_time: 1.7m tok/s: 8278773 +1061/20000 train_loss: 2.7482 train_time: 1.7m tok/s: 8278808 +1062/20000 train_loss: 2.7664 train_time: 1.7m tok/s: 8278837 +1063/20000 train_loss: 2.7709 train_time: 1.7m tok/s: 8278873 +1064/20000 train_loss: 2.6530 train_time: 1.7m tok/s: 8278873 +1065/20000 train_loss: 2.4633 train_time: 1.7m tok/s: 8278784 +1066/20000 train_loss: 2.7954 train_time: 1.7m tok/s: 8278688 +1067/20000 train_loss: 2.7983 train_time: 1.7m tok/s: 8278645 +1068/20000 train_loss: 2.6976 train_time: 1.7m tok/s: 8278743 +1069/20000 train_loss: 2.6422 train_time: 1.7m tok/s: 8278753 +1070/20000 train_loss: 2.5741 train_time: 1.7m tok/s: 8278784 +1071/20000 train_loss: 2.7349 train_time: 1.7m tok/s: 8278884 +1072/20000 train_loss: 2.6670 train_time: 1.7m tok/s: 8278986 +1073/20000 train_loss: 2.6701 train_time: 1.7m tok/s: 8278944 +1074/20000 train_loss: 2.6884 train_time: 1.7m tok/s: 8278893 +1075/20000 train_loss: 2.7167 train_time: 1.7m tok/s: 8278901 +1076/20000 train_loss: 2.7160 train_time: 1.7m tok/s: 8278903 +1077/20000 train_loss: 2.6624 train_time: 1.7m tok/s: 8278831 +1078/20000 train_loss: 2.8111 train_time: 1.7m tok/s: 8278782 +1079/20000 train_loss: 2.6699 train_time: 1.7m tok/s: 8278663 +1080/20000 train_loss: 2.6425 train_time: 1.7m tok/s: 8278846 +1081/20000 train_loss: 2.6694 train_time: 1.7m tok/s: 8278880 +1082/20000 train_loss: 2.6673 train_time: 1.7m tok/s: 8278919 +1083/20000 train_loss: 2.6015 train_time: 1.7m tok/s: 8278881 +1084/20000 train_loss: 2.6820 train_time: 1.7m tok/s: 8278855 +1085/20000 train_loss: 2.6500 
train_time: 1.7m tok/s: 8278938 +1086/20000 train_loss: 2.6848 train_time: 1.7m tok/s: 8279039 +1087/20000 train_loss: 2.6638 train_time: 1.7m tok/s: 8279087 +1088/20000 train_loss: 2.8426 train_time: 1.7m tok/s: 8279003 +1089/20000 train_loss: 2.8089 train_time: 1.7m tok/s: 8279015 +1090/20000 train_loss: 2.6286 train_time: 1.7m tok/s: 8279013 +1091/20000 train_loss: 2.6719 train_time: 1.7m tok/s: 8278943 +1092/20000 train_loss: 2.7021 train_time: 1.7m tok/s: 8279042 +1093/20000 train_loss: 2.7481 train_time: 1.7m tok/s: 8279041 +1094/20000 train_loss: 2.8054 train_time: 1.7m tok/s: 8279027 +1095/20000 train_loss: 2.6208 train_time: 1.7m tok/s: 8278902 +1096/20000 train_loss: 2.5262 train_time: 1.7m tok/s: 8278790 +1097/20000 train_loss: 2.6565 train_time: 1.7m tok/s: 8278718 +1098/20000 train_loss: 2.6694 train_time: 1.7m tok/s: 8278637 +1099/20000 train_loss: 2.5129 train_time: 1.7m tok/s: 8278663 +1100/20000 train_loss: 2.5793 train_time: 1.7m tok/s: 8278714 +1101/20000 train_loss: 2.6518 train_time: 1.7m tok/s: 8278723 +1102/20000 train_loss: 2.6753 train_time: 1.7m tok/s: 8278846 +1103/20000 train_loss: 2.7381 train_time: 1.7m tok/s: 8278806 +1104/20000 train_loss: 2.7008 train_time: 1.7m tok/s: 8278928 +1105/20000 train_loss: 2.7198 train_time: 1.7m tok/s: 8278854 +1106/20000 train_loss: 2.7280 train_time: 1.8m tok/s: 8278922 +1107/20000 train_loss: 2.7368 train_time: 1.8m tok/s: 8278853 +1108/20000 train_loss: 2.6777 train_time: 1.8m tok/s: 8278881 +1109/20000 train_loss: 2.6686 train_time: 1.8m tok/s: 8278903 +1110/20000 train_loss: 2.6621 train_time: 1.8m tok/s: 8279028 +1111/20000 train_loss: 2.6370 train_time: 1.8m tok/s: 8279044 +1112/20000 train_loss: 2.6289 train_time: 1.8m tok/s: 8279018 +1113/20000 train_loss: 2.6479 train_time: 1.8m tok/s: 8278962 +1114/20000 train_loss: 2.8099 train_time: 1.8m tok/s: 8278953 +1115/20000 train_loss: 2.6632 train_time: 1.8m tok/s: 8278894 +1116/20000 train_loss: 2.8631 train_time: 1.8m tok/s: 8278956 +1117/20000 
train_loss: 2.6880 train_time: 1.8m tok/s: 8278886 +1118/20000 train_loss: 2.7201 train_time: 1.8m tok/s: 8278864 +1119/20000 train_loss: 2.7437 train_time: 1.8m tok/s: 8278765 +1120/20000 train_loss: 2.6335 train_time: 1.8m tok/s: 8278817 +1121/20000 train_loss: 2.6396 train_time: 1.8m tok/s: 8278882 +1122/20000 train_loss: 2.7385 train_time: 1.8m tok/s: 8278955 +1123/20000 train_loss: 2.5358 train_time: 1.8m tok/s: 8279055 +1124/20000 train_loss: 2.6843 train_time: 1.8m tok/s: 8278988 +1125/20000 train_loss: 2.5653 train_time: 1.8m tok/s: 8278998 +1126/20000 train_loss: 2.6388 train_time: 1.8m tok/s: 8279067 +1127/20000 train_loss: 2.8765 train_time: 1.8m tok/s: 8278970 +1128/20000 train_loss: 2.8469 train_time: 1.8m tok/s: 8279053 +1129/20000 train_loss: 2.5908 train_time: 1.8m tok/s: 8279006 +1130/20000 train_loss: 2.7622 train_time: 1.8m tok/s: 8279032 +1131/20000 train_loss: 2.7473 train_time: 1.8m tok/s: 8279124 +1132/20000 train_loss: 2.6178 train_time: 1.8m tok/s: 8279122 +1133/20000 train_loss: 2.5774 train_time: 1.8m tok/s: 8279166 +1134/20000 train_loss: 2.7306 train_time: 1.8m tok/s: 8279159 +1135/20000 train_loss: 2.7251 train_time: 1.8m tok/s: 8279181 +1136/20000 train_loss: 2.5640 train_time: 1.8m tok/s: 8279144 +1137/20000 train_loss: 2.6110 train_time: 1.8m tok/s: 8279141 +1138/20000 train_loss: 2.5617 train_time: 1.8m tok/s: 8279177 +1139/20000 train_loss: 2.5523 train_time: 1.8m tok/s: 8279171 +1140/20000 train_loss: 2.6631 train_time: 1.8m tok/s: 8279150 +1141/20000 train_loss: 2.6926 train_time: 1.8m tok/s: 8279154 +1142/20000 train_loss: 2.6953 train_time: 1.8m tok/s: 8279150 +1143/20000 train_loss: 2.7277 train_time: 1.8m tok/s: 8279221 +1144/20000 train_loss: 2.7776 train_time: 1.8m tok/s: 8279180 +1145/20000 train_loss: 2.7321 train_time: 1.8m tok/s: 8279212 +1146/20000 train_loss: 2.5946 train_time: 1.8m tok/s: 8279280 +1147/20000 train_loss: 2.7485 train_time: 1.8m tok/s: 8279379 +1148/20000 train_loss: 2.5579 train_time: 1.8m tok/s: 
8279281 +1149/20000 train_loss: 2.7193 train_time: 1.8m tok/s: 8279316 +1150/20000 train_loss: 2.5909 train_time: 1.8m tok/s: 8279365 +1151/20000 train_loss: 2.5916 train_time: 1.8m tok/s: 8279415 +1152/20000 train_loss: 2.4652 train_time: 1.8m tok/s: 8279381 +1153/20000 train_loss: 2.6017 train_time: 1.8m tok/s: 8279368 +1154/20000 train_loss: 2.7352 train_time: 1.8m tok/s: 8279460 +1155/20000 train_loss: 2.5845 train_time: 1.8m tok/s: 8279490 +1156/20000 train_loss: 2.6942 train_time: 1.8m tok/s: 8279534 +1157/20000 train_loss: 2.6830 train_time: 1.8m tok/s: 8279528 +1158/20000 train_loss: 2.7774 train_time: 1.8m tok/s: 8279554 +1159/20000 train_loss: 2.7169 train_time: 1.8m tok/s: 8279526 +1160/20000 train_loss: 2.6824 train_time: 1.8m tok/s: 8279564 +1161/20000 train_loss: 2.6546 train_time: 1.8m tok/s: 8279616 +1162/20000 train_loss: 2.7074 train_time: 1.8m tok/s: 8279652 +1163/20000 train_loss: 2.6915 train_time: 1.8m tok/s: 8279577 +1164/20000 train_loss: 2.6593 train_time: 1.8m tok/s: 8279534 +1165/20000 train_loss: 2.5271 train_time: 1.8m tok/s: 8279478 +1166/20000 train_loss: 2.7283 train_time: 1.8m tok/s: 8279497 +1167/20000 train_loss: 2.7679 train_time: 1.8m tok/s: 8279414 +1168/20000 train_loss: 2.5630 train_time: 1.8m tok/s: 8279311 +1169/20000 train_loss: 2.6995 train_time: 1.9m tok/s: 8279275 +1170/20000 train_loss: 2.9058 train_time: 1.9m tok/s: 8279330 +1171/20000 train_loss: 2.6551 train_time: 1.9m tok/s: 8279346 +1172/20000 train_loss: 2.7053 train_time: 1.9m tok/s: 8279392 +1173/20000 train_loss: 2.6402 train_time: 1.9m tok/s: 8279426 +1174/20000 train_loss: 2.7057 train_time: 1.9m tok/s: 8279465 +1175/20000 train_loss: 2.6507 train_time: 1.9m tok/s: 8279352 +1176/20000 train_loss: 2.7736 train_time: 1.9m tok/s: 8279370 +1177/20000 train_loss: 2.7741 train_time: 1.9m tok/s: 8279304 +1178/20000 train_loss: 2.6621 train_time: 1.9m tok/s: 8279305 +1179/20000 train_loss: 2.5420 train_time: 1.9m tok/s: 8279183 +1180/20000 train_loss: 2.6020 
train_time: 1.9m tok/s: 8279140 +1181/20000 train_loss: 2.6559 train_time: 1.9m tok/s: 8279227 +1182/20000 train_loss: 2.5774 train_time: 1.9m tok/s: 8279306 +1183/20000 train_loss: 2.7935 train_time: 1.9m tok/s: 8279359 +1184/20000 train_loss: 2.4907 train_time: 1.9m tok/s: 8279244 +1185/20000 train_loss: 2.6685 train_time: 1.9m tok/s: 8279222 +1186/20000 train_loss: 2.6643 train_time: 1.9m tok/s: 8279227 +1187/20000 train_loss: 2.7631 train_time: 1.9m tok/s: 8279116 +1188/20000 train_loss: 2.8780 train_time: 1.9m tok/s: 8279227 +1189/20000 train_loss: 2.6376 train_time: 1.9m tok/s: 8279254 +1190/20000 train_loss: 2.6996 train_time: 1.9m tok/s: 8279295 +1191/20000 train_loss: 2.6398 train_time: 1.9m tok/s: 8279256 +1192/20000 train_loss: 2.6729 train_time: 1.9m tok/s: 8279168 +1193/20000 train_loss: 2.6932 train_time: 1.9m tok/s: 8279108 +1194/20000 train_loss: 2.6992 train_time: 1.9m tok/s: 8279100 +1195/20000 train_loss: 2.5951 train_time: 1.9m tok/s: 8279129 +1196/20000 train_loss: 2.8441 train_time: 1.9m tok/s: 8279071 +1197/20000 train_loss: 2.5704 train_time: 1.9m tok/s: 8279040 +1198/20000 train_loss: 2.6965 train_time: 1.9m tok/s: 8279063 +1199/20000 train_loss: 2.7390 train_time: 1.9m tok/s: 8279031 +1200/20000 train_loss: 2.7240 train_time: 1.9m tok/s: 8279096 +1201/20000 train_loss: 2.7277 train_time: 1.9m tok/s: 8279091 +1202/20000 train_loss: 2.8253 train_time: 1.9m tok/s: 8279144 +1203/20000 train_loss: 2.6706 train_time: 1.9m tok/s: 8279210 +1204/20000 train_loss: 2.7080 train_time: 1.9m tok/s: 8279155 +1205/20000 train_loss: 2.7416 train_time: 1.9m tok/s: 8279208 +1206/20000 train_loss: 2.7611 train_time: 1.9m tok/s: 8279222 +1207/20000 train_loss: 2.5705 train_time: 1.9m tok/s: 8279227 +1208/20000 train_loss: 2.5663 train_time: 1.9m tok/s: 8279240 +1209/20000 train_loss: 2.6914 train_time: 1.9m tok/s: 8279194 +1210/20000 train_loss: 2.6233 train_time: 1.9m tok/s: 8279253 +1211/20000 train_loss: 2.5586 train_time: 1.9m tok/s: 8279236 +1212/20000 
train_loss: 2.5428 train_time: 1.9m tok/s: 8279100 +1213/20000 train_loss: 2.8078 train_time: 1.9m tok/s: 8278980 +1214/20000 train_loss: 2.6268 train_time: 1.9m tok/s: 8279061 +1215/20000 train_loss: 2.7162 train_time: 1.9m tok/s: 8279061 +1216/20000 train_loss: 2.6797 train_time: 1.9m tok/s: 8279094 +1217/20000 train_loss: 2.7595 train_time: 1.9m tok/s: 8279053 +1218/20000 train_loss: 2.7006 train_time: 1.9m tok/s: 8279059 +1219/20000 train_loss: 3.3076 train_time: 1.9m tok/s: 8279056 +1220/20000 train_loss: 2.6081 train_time: 1.9m tok/s: 8278988 +1221/20000 train_loss: 2.7635 train_time: 1.9m tok/s: 8278964 +1222/20000 train_loss: 2.5949 train_time: 1.9m tok/s: 8278965 +1223/20000 train_loss: 2.6995 train_time: 1.9m tok/s: 8278905 +1224/20000 train_loss: 2.7146 train_time: 1.9m tok/s: 8278863 +1225/20000 train_loss: 2.5356 train_time: 1.9m tok/s: 8278865 +1226/20000 train_loss: 2.6509 train_time: 1.9m tok/s: 8278830 +1227/20000 train_loss: 2.8649 train_time: 1.9m tok/s: 8278810 +1228/20000 train_loss: 2.6554 train_time: 1.9m tok/s: 8278845 +1229/20000 train_loss: 2.6598 train_time: 1.9m tok/s: 8278770 +1230/20000 train_loss: 2.7568 train_time: 1.9m tok/s: 8278778 +1231/20000 train_loss: 2.7038 train_time: 1.9m tok/s: 8278773 +1232/20000 train_loss: 2.6825 train_time: 2.0m tok/s: 8278791 +1233/20000 train_loss: 2.6587 train_time: 2.0m tok/s: 8278763 +1234/20000 train_loss: 2.6576 train_time: 2.0m tok/s: 8278770 +1235/20000 train_loss: 2.5983 train_time: 2.0m tok/s: 8278731 +1236/20000 train_loss: 2.6429 train_time: 2.0m tok/s: 8278680 +1237/20000 train_loss: 2.6194 train_time: 2.0m tok/s: 8278653 +1238/20000 train_loss: 2.5850 train_time: 2.0m tok/s: 8278640 +1239/20000 train_loss: 2.5846 train_time: 2.0m tok/s: 8278621 +1240/20000 train_loss: 2.5418 train_time: 2.0m tok/s: 8278676 +1241/20000 train_loss: 2.5907 train_time: 2.0m tok/s: 8278693 +1242/20000 train_loss: 2.5835 train_time: 2.0m tok/s: 8278663 +1243/20000 train_loss: 2.6776 train_time: 2.0m tok/s: 
8278439 +1244/20000 train_loss: 2.7772 train_time: 2.0m tok/s: 8278411 +1245/20000 train_loss: 2.6744 train_time: 2.0m tok/s: 8278330 +1246/20000 train_loss: 2.7854 train_time: 2.0m tok/s: 8278160 +1247/20000 train_loss: 2.7849 train_time: 2.0m tok/s: 8278131 +1248/20000 train_loss: 2.6548 train_time: 2.0m tok/s: 8278168 +1249/20000 train_loss: 2.6416 train_time: 2.0m tok/s: 8278189 +1250/20000 train_loss: 2.6446 train_time: 2.0m tok/s: 8278225 +1251/20000 train_loss: 2.5979 train_time: 2.0m tok/s: 8278286 +1252/20000 train_loss: 2.6745 train_time: 2.0m tok/s: 8278331 +1253/20000 train_loss: 2.6213 train_time: 2.0m tok/s: 8278340 +1254/20000 train_loss: 2.6855 train_time: 2.0m tok/s: 8278334 +1255/20000 train_loss: 2.4491 train_time: 2.0m tok/s: 8278359 +1256/20000 train_loss: 2.6764 train_time: 2.0m tok/s: 8278329 +1257/20000 train_loss: 2.6007 train_time: 2.0m tok/s: 8278317 +1258/20000 train_loss: 2.6350 train_time: 2.0m tok/s: 8278219 +1259/20000 train_loss: 2.7631 train_time: 2.0m tok/s: 8278160 +1260/20000 train_loss: 2.7061 train_time: 2.0m tok/s: 8278198 +1261/20000 train_loss: 2.7850 train_time: 2.0m tok/s: 8278247 +1262/20000 train_loss: 2.6917 train_time: 2.0m tok/s: 8278361 +1263/20000 train_loss: 2.7057 train_time: 2.0m tok/s: 8278388 +1264/20000 train_loss: 2.6355 train_time: 2.0m tok/s: 8278325 +1265/20000 train_loss: 2.6337 train_time: 2.0m tok/s: 8278356 +1266/20000 train_loss: 2.6473 train_time: 2.0m tok/s: 8278250 +1267/20000 train_loss: 2.6858 train_time: 2.0m tok/s: 8278175 +1268/20000 train_loss: 2.4835 train_time: 2.0m tok/s: 8278184 +1269/20000 train_loss: 2.6800 train_time: 2.0m tok/s: 8278192 +1270/20000 train_loss: 2.6484 train_time: 2.0m tok/s: 8278298 +1271/20000 train_loss: 2.5622 train_time: 2.0m tok/s: 8278305 +1272/20000 train_loss: 2.8069 train_time: 2.0m tok/s: 8278407 +1273/20000 train_loss: 2.7524 train_time: 2.0m tok/s: 8278456 +1274/20000 train_loss: 2.6746 train_time: 2.0m tok/s: 8278487 +1275/20000 train_loss: 2.7786 
train_time: 2.0m tok/s: 8278549 +1276/20000 train_loss: 2.7144 train_time: 2.0m tok/s: 8278603 +1277/20000 train_loss: 2.7025 train_time: 2.0m tok/s: 8278639 +1278/20000 train_loss: 2.6150 train_time: 2.0m tok/s: 8278573 +1279/20000 train_loss: 2.7210 train_time: 2.0m tok/s: 8278567 +1280/20000 train_loss: 2.6461 train_time: 2.0m tok/s: 8278545 +1281/20000 train_loss: 2.8164 train_time: 2.0m tok/s: 8278549 +1282/20000 train_loss: 2.5345 train_time: 2.0m tok/s: 8278507 +1283/20000 train_loss: 2.6347 train_time: 2.0m tok/s: 8278440 +1284/20000 train_loss: 2.6158 train_time: 2.0m tok/s: 8278569 +1285/20000 train_loss: 2.7772 train_time: 2.0m tok/s: 8278567 +1286/20000 train_loss: 2.6680 train_time: 2.0m tok/s: 8278576 +1287/20000 train_loss: 2.7082 train_time: 2.0m tok/s: 8278522 +1288/20000 train_loss: 2.7228 train_time: 2.0m tok/s: 8278536 +1289/20000 train_loss: 2.7512 train_time: 2.0m tok/s: 8278467 +1290/20000 train_loss: 2.6335 train_time: 2.0m tok/s: 8278463 +1291/20000 train_loss: 2.7477 train_time: 2.0m tok/s: 8278526 +1292/20000 train_loss: 2.7289 train_time: 2.0m tok/s: 8278549 +1293/20000 train_loss: 2.7161 train_time: 2.0m tok/s: 8278524 +1294/20000 train_loss: 2.7242 train_time: 2.0m tok/s: 8278475 +1295/20000 train_loss: 2.7484 train_time: 2.1m tok/s: 8278428 +1296/20000 train_loss: 2.6841 train_time: 2.1m tok/s: 8278452 +1297/20000 train_loss: 2.5892 train_time: 2.1m tok/s: 8278490 +1298/20000 train_loss: 2.6611 train_time: 2.1m tok/s: 8278546 +1299/20000 train_loss: 2.5160 train_time: 2.1m tok/s: 8278550 +1300/20000 train_loss: 2.6376 train_time: 2.1m tok/s: 8278578 +1301/20000 train_loss: 2.6901 train_time: 2.1m tok/s: 8278674 +1302/20000 train_loss: 2.6736 train_time: 2.1m tok/s: 8278715 +1303/20000 train_loss: 2.8890 train_time: 2.1m tok/s: 8278682 +1304/20000 train_loss: 2.7399 train_time: 2.1m tok/s: 8278645 +1305/20000 train_loss: 2.7723 train_time: 2.1m tok/s: 8278643 +1306/20000 train_loss: 2.8625 train_time: 2.1m tok/s: 8278602 +1307/20000 
train_loss: 2.6282 train_time: 2.1m tok/s: 8278640 +1308/20000 train_loss: 2.6326 train_time: 2.1m tok/s: 8278685 +1309/20000 train_loss: 2.6663 train_time: 2.1m tok/s: 8278648 +1310/20000 train_loss: 2.5643 train_time: 2.1m tok/s: 8278569 +1311/20000 train_loss: 2.6070 train_time: 2.1m tok/s: 8278502 +1312/20000 train_loss: 2.5468 train_time: 2.1m tok/s: 8278508 +1313/20000 train_loss: 2.5769 train_time: 2.1m tok/s: 8278479 +1314/20000 train_loss: 2.5587 train_time: 2.1m tok/s: 8278455 +1315/20000 train_loss: 2.4184 train_time: 2.1m tok/s: 8278394 +1316/20000 train_loss: 2.6801 train_time: 2.1m tok/s: 8278302 +1317/20000 train_loss: 2.6906 train_time: 2.1m tok/s: 8278383 +1318/20000 train_loss: 2.7124 train_time: 2.1m tok/s: 8278407 +1319/20000 train_loss: 2.8092 train_time: 2.1m tok/s: 8278369 +1320/20000 train_loss: 2.7263 train_time: 2.1m tok/s: 8278493 +1321/20000 train_loss: 2.7179 train_time: 2.1m tok/s: 8278522 +1322/20000 train_loss: 2.7385 train_time: 2.1m tok/s: 8278513 +1323/20000 train_loss: 2.5835 train_time: 2.1m tok/s: 8278520 +1324/20000 train_loss: 2.6464 train_time: 2.1m tok/s: 8278518 +1325/20000 train_loss: 2.8280 train_time: 2.1m tok/s: 8278481 +1326/20000 train_loss: 2.8361 train_time: 2.1m tok/s: 8278393 +1327/20000 train_loss: 2.6633 train_time: 2.1m tok/s: 8278363 +1328/20000 train_loss: 2.6558 train_time: 2.1m tok/s: 8278375 +1329/20000 train_loss: 2.7462 train_time: 2.1m tok/s: 8278391 +1330/20000 train_loss: 2.6113 train_time: 2.1m tok/s: 8278371 +1331/20000 train_loss: 2.6769 train_time: 2.1m tok/s: 8278335 +1332/20000 train_loss: 2.9155 train_time: 2.1m tok/s: 8278381 +1333/20000 train_loss: 2.8375 train_time: 2.1m tok/s: 8278418 +1334/20000 train_loss: 2.6681 train_time: 2.1m tok/s: 8278364 +1335/20000 train_loss: 2.6269 train_time: 2.1m tok/s: 8278396 +1336/20000 train_loss: 2.6528 train_time: 2.1m tok/s: 8278396 +1337/20000 train_loss: 2.8073 train_time: 2.1m tok/s: 8278397 +1338/20000 train_loss: 2.8969 train_time: 2.1m tok/s: 
8278311 +1339/20000 train_loss: 2.6929 train_time: 2.1m tok/s: 8278284 +1340/20000 train_loss: 2.5343 train_time: 2.1m tok/s: 8278327 +1341/20000 train_loss: 2.5572 train_time: 2.1m tok/s: 8278321 +1342/20000 train_loss: 2.6439 train_time: 2.1m tok/s: 8278322 +1343/20000 train_loss: 2.6656 train_time: 2.1m tok/s: 8278279 +1344/20000 train_loss: 2.6880 train_time: 2.1m tok/s: 8278369 +1345/20000 train_loss: 2.6417 train_time: 2.1m tok/s: 8278399 +1346/20000 train_loss: 2.7677 train_time: 2.1m tok/s: 8278371 +1347/20000 train_loss: 2.7930 train_time: 2.1m tok/s: 8278361 +1348/20000 train_loss: 2.6952 train_time: 2.1m tok/s: 8278353 +1349/20000 train_loss: 2.7283 train_time: 2.1m tok/s: 8278355 +1350/20000 train_loss: 2.6817 train_time: 2.1m tok/s: 8278376 +1351/20000 train_loss: 2.7774 train_time: 2.1m tok/s: 8278383 +1352/20000 train_loss: 2.6715 train_time: 2.1m tok/s: 8278408 +1353/20000 train_loss: 2.7049 train_time: 2.1m tok/s: 8278427 +1354/20000 train_loss: 2.3890 train_time: 2.1m tok/s: 8278344 +1355/20000 train_loss: 2.5815 train_time: 2.1m tok/s: 8278225 +1356/20000 train_loss: 2.6813 train_time: 2.1m tok/s: 8278295 +1357/20000 train_loss: 2.6773 train_time: 2.1m tok/s: 8278284 +1358/20000 train_loss: 2.7436 train_time: 2.2m tok/s: 8278298 +1359/20000 train_loss: 2.5098 train_time: 2.2m tok/s: 8278269 +1360/20000 train_loss: 2.7391 train_time: 2.2m tok/s: 8278197 +1361/20000 train_loss: 2.6340 train_time: 2.2m tok/s: 8278189 +1362/20000 train_loss: 2.6207 train_time: 2.2m tok/s: 8278225 +1363/20000 train_loss: 2.7032 train_time: 2.2m tok/s: 8278299 +1364/20000 train_loss: 2.5360 train_time: 2.2m tok/s: 8278279 +1365/20000 train_loss: 2.5357 train_time: 2.2m tok/s: 8278231 +1366/20000 train_loss: 2.6214 train_time: 2.2m tok/s: 8278224 +1367/20000 train_loss: 2.7082 train_time: 2.2m tok/s: 8278152 +1368/20000 train_loss: 2.5695 train_time: 2.2m tok/s: 8278301 +1369/20000 train_loss: 2.6791 train_time: 2.2m tok/s: 8278303 +1370/20000 train_loss: 2.7234 
train_time: 2.2m tok/s: 8278297 +1371/20000 train_loss: 2.6972 train_time: 2.2m tok/s: 8278235 +1372/20000 train_loss: 2.7330 train_time: 2.2m tok/s: 8278211 +1373/20000 train_loss: 2.6682 train_time: 2.2m tok/s: 8278288 +1374/20000 train_loss: 2.7686 train_time: 2.2m tok/s: 8278326 +1375/20000 train_loss: 2.7268 train_time: 2.2m tok/s: 8278392 +1376/20000 train_loss: 2.5859 train_time: 2.2m tok/s: 8278380 +1377/20000 train_loss: 2.6411 train_time: 2.2m tok/s: 8278386 +1378/20000 train_loss: 2.5868 train_time: 2.2m tok/s: 8278410 +1379/20000 train_loss: 2.6333 train_time: 2.2m tok/s: 8278391 +1380/20000 train_loss: 2.5966 train_time: 2.2m tok/s: 8278451 +1381/20000 train_loss: 2.6421 train_time: 2.2m tok/s: 8278476 +1382/20000 train_loss: 2.6561 train_time: 2.2m tok/s: 8278416 +1383/20000 train_loss: 2.6827 train_time: 2.2m tok/s: 8278476 +1384/20000 train_loss: 2.5797 train_time: 2.2m tok/s: 8278448 +1385/20000 train_loss: 2.6037 train_time: 2.2m tok/s: 8278461 +1386/20000 train_loss: 2.7617 train_time: 2.2m tok/s: 8278474 +1387/20000 train_loss: 2.6786 train_time: 2.2m tok/s: 8278523 +1388/20000 train_loss: 2.7556 train_time: 2.2m tok/s: 8278619 +1389/20000 train_loss: 2.6107 train_time: 2.2m tok/s: 8278581 +1390/20000 train_loss: 2.7349 train_time: 2.2m tok/s: 8278589 +1391/20000 train_loss: 2.5476 train_time: 2.2m tok/s: 8278556 +1392/20000 train_loss: 2.7380 train_time: 2.2m tok/s: 8278643 +1393/20000 train_loss: 2.6354 train_time: 2.2m tok/s: 8278579 +1394/20000 train_loss: 2.9004 train_time: 2.2m tok/s: 8278553 +1395/20000 train_loss: 2.4996 train_time: 2.2m tok/s: 8278531 +1396/20000 train_loss: 2.8234 train_time: 2.2m tok/s: 8278537 +1397/20000 train_loss: 2.7118 train_time: 2.2m tok/s: 8278536 +1398/20000 train_loss: 2.8013 train_time: 2.2m tok/s: 8278561 +1399/20000 train_loss: 2.6430 train_time: 2.2m tok/s: 8278610 +1400/20000 train_loss: 2.7409 train_time: 2.2m tok/s: 8278626 +1401/20000 train_loss: 2.7324 train_time: 2.2m tok/s: 8278621 +1402/20000 
train_loss: 2.5738 train_time: 2.2m tok/s: 8278716 +1403/20000 train_loss: 2.6017 train_time: 2.2m tok/s: 8278655 +1404/20000 train_loss: 2.6828 train_time: 2.2m tok/s: 8278724 +1405/20000 train_loss: 2.7219 train_time: 2.2m tok/s: 8278759 +1406/20000 train_loss: 2.8393 train_time: 2.2m tok/s: 8278625 +1407/20000 train_loss: 2.5728 train_time: 2.2m tok/s: 8278490 +1408/20000 train_loss: 2.6943 train_time: 2.2m tok/s: 8278445 +1409/20000 train_loss: 2.7744 train_time: 2.2m tok/s: 8278483 +1410/20000 train_loss: 2.6550 train_time: 2.2m tok/s: 8278513 +1411/20000 train_loss: 2.6942 train_time: 2.2m tok/s: 8278528 +1412/20000 train_loss: 2.7341 train_time: 2.2m tok/s: 8278544 +1413/20000 train_loss: 2.6133 train_time: 2.2m tok/s: 8278530 +1414/20000 train_loss: 2.6061 train_time: 2.2m tok/s: 8278588 +1415/20000 train_loss: 2.6664 train_time: 2.2m tok/s: 8278550 +1416/20000 train_loss: 2.6226 train_time: 2.2m tok/s: 8278662 +1417/20000 train_loss: 2.6273 train_time: 2.2m tok/s: 8278621 +1418/20000 train_loss: 2.7559 train_time: 2.2m tok/s: 8278569 +1419/20000 train_loss: 2.6650 train_time: 2.2m tok/s: 8278543 +1420/20000 train_loss: 2.6296 train_time: 2.2m tok/s: 8278544 +1421/20000 train_loss: 2.7549 train_time: 2.2m tok/s: 8278562 +1422/20000 train_loss: 2.7400 train_time: 2.3m tok/s: 8278577 +1423/20000 train_loss: 2.6998 train_time: 2.3m tok/s: 8278550 +1424/20000 train_loss: 2.7076 train_time: 2.3m tok/s: 8278569 +1425/20000 train_loss: 2.6242 train_time: 2.3m tok/s: 8278575 +1426/20000 train_loss: 2.6929 train_time: 2.3m tok/s: 8278541 +1427/20000 train_loss: 2.6527 train_time: 2.3m tok/s: 8278477 +1428/20000 train_loss: 2.6474 train_time: 2.3m tok/s: 8278494 +1429/20000 train_loss: 2.5898 train_time: 2.3m tok/s: 8278465 +1430/20000 train_loss: 2.6282 train_time: 2.3m tok/s: 8278488 +1431/20000 train_loss: 2.6041 train_time: 2.3m tok/s: 8278530 +1432/20000 train_loss: 2.4342 train_time: 2.3m tok/s: 8278487 +1433/20000 train_loss: 2.6289 train_time: 2.3m tok/s: 
8278445 +1434/20000 train_loss: 2.7444 train_time: 2.3m tok/s: 8278383 +1435/20000 train_loss: 2.7170 train_time: 2.3m tok/s: 8278432 +1436/20000 train_loss: 2.6218 train_time: 2.3m tok/s: 8278524 +1437/20000 train_loss: 2.7029 train_time: 2.3m tok/s: 8278574 +1438/20000 train_loss: 2.7520 train_time: 2.3m tok/s: 8278570 +1439/20000 train_loss: 2.6684 train_time: 2.3m tok/s: 8278416 +1440/20000 train_loss: 2.7078 train_time: 2.3m tok/s: 8278385 +1441/20000 train_loss: 2.6673 train_time: 2.3m tok/s: 8278390 +1442/20000 train_loss: 2.5872 train_time: 2.3m tok/s: 8278325 +1443/20000 train_loss: 2.6007 train_time: 2.3m tok/s: 8278319 +1444/20000 train_loss: 2.5484 train_time: 2.3m tok/s: 8278340 +1445/20000 train_loss: 2.6639 train_time: 2.3m tok/s: 8278307 +1446/20000 train_loss: 2.7705 train_time: 2.3m tok/s: 8278271 +1447/20000 train_loss: 2.7193 train_time: 2.3m tok/s: 8278280 +1448/20000 train_loss: 2.7004 train_time: 2.3m tok/s: 8278304 +1449/20000 train_loss: 2.6473 train_time: 2.3m tok/s: 8278337 +1450/20000 train_loss: 2.7339 train_time: 2.3m tok/s: 8278356 +1451/20000 train_loss: 2.5955 train_time: 2.3m tok/s: 8278378 +1452/20000 train_loss: 2.6216 train_time: 2.3m tok/s: 8278396 +1453/20000 train_loss: 2.6432 train_time: 2.3m tok/s: 8278413 +1454/20000 train_loss: 2.7434 train_time: 2.3m tok/s: 8278318 +1455/20000 train_loss: 2.5698 train_time: 2.3m tok/s: 8278292 +1456/20000 train_loss: 2.4579 train_time: 2.3m tok/s: 8278222 +1457/20000 train_loss: 2.4823 train_time: 2.3m tok/s: 8278221 +1458/20000 train_loss: 2.5968 train_time: 2.3m tok/s: 8278142 +1459/20000 train_loss: 2.6721 train_time: 2.3m tok/s: 8278137 +1460/20000 train_loss: 2.6996 train_time: 2.3m tok/s: 8278192 +1461/20000 train_loss: 2.7707 train_time: 2.3m tok/s: 8278266 +1462/20000 train_loss: 2.6453 train_time: 2.3m tok/s: 8278275 +1463/20000 train_loss: 2.6618 train_time: 2.3m tok/s: 8278246 +1464/20000 train_loss: 2.6461 train_time: 2.3m tok/s: 8278273 +1465/20000 train_loss: 2.6775 
train_time: 2.3m tok/s: 8278237 +1466/20000 train_loss: 2.6134 train_time: 2.3m tok/s: 8278260 +1467/20000 train_loss: 2.6185 train_time: 2.3m tok/s: 8278209 +1468/20000 train_loss: 2.5602 train_time: 2.3m tok/s: 8278160 +1469/20000 train_loss: 2.5829 train_time: 2.3m tok/s: 8278178 +1470/20000 train_loss: 2.5099 train_time: 2.3m tok/s: 8278147 +1471/20000 train_loss: 2.8053 train_time: 2.3m tok/s: 8278153 +1472/20000 train_loss: 2.8741 train_time: 2.3m tok/s: 8278095 +1473/20000 train_loss: 2.7489 train_time: 2.3m tok/s: 8278009 +1474/20000 train_loss: 2.7330 train_time: 2.3m tok/s: 8277951 +1475/20000 train_loss: 2.6929 train_time: 2.3m tok/s: 8277937 +1476/20000 train_loss: 2.7688 train_time: 2.3m tok/s: 8277945 +1477/20000 train_loss: 2.6019 train_time: 2.3m tok/s: 8277892 +1478/20000 train_loss: 2.5996 train_time: 2.3m tok/s: 8277930 +1479/20000 train_loss: 2.5868 train_time: 2.3m tok/s: 8277942 +1480/20000 train_loss: 2.6057 train_time: 2.3m tok/s: 8277941 +1481/20000 train_loss: 2.6358 train_time: 2.3m tok/s: 8277933 +1482/20000 train_loss: 3.0501 train_time: 2.3m tok/s: 8277881 +1483/20000 train_loss: 2.6959 train_time: 2.3m tok/s: 8277866 +1484/20000 train_loss: 2.6309 train_time: 2.3m tok/s: 8277874 +1485/20000 train_loss: 2.7847 train_time: 2.4m tok/s: 8277920 +1486/20000 train_loss: 2.5960 train_time: 2.4m tok/s: 8277898 +1487/20000 train_loss: 2.6967 train_time: 2.4m tok/s: 8277885 +1488/20000 train_loss: 2.6509 train_time: 2.4m tok/s: 8277963 +1489/20000 train_loss: 2.5789 train_time: 2.4m tok/s: 8277981 +1490/20000 train_loss: 2.6674 train_time: 2.4m tok/s: 8277969 +1491/20000 train_loss: 2.6673 train_time: 2.4m tok/s: 8277989 +1492/20000 train_loss: 2.5994 train_time: 2.4m tok/s: 8278008 +1493/20000 train_loss: 2.6688 train_time: 2.4m tok/s: 8278091 +1494/20000 train_loss: 2.6376 train_time: 2.4m tok/s: 8278068 +1495/20000 train_loss: 2.5668 train_time: 2.4m tok/s: 8278015 +1496/20000 train_loss: 2.6738 train_time: 2.4m tok/s: 8278008 +1497/20000 
train_loss: 2.5999 train_time: 2.4m tok/s: 8277957 +1498/20000 train_loss: 2.9267 train_time: 2.4m tok/s: 8277992 +1499/20000 train_loss: 2.6908 train_time: 2.4m tok/s: 8277930 +1500/20000 train_loss: 2.7194 train_time: 2.4m tok/s: 8277979 +1501/20000 train_loss: 2.6925 train_time: 2.4m tok/s: 8278034 +1502/20000 train_loss: 2.8050 train_time: 2.4m tok/s: 8278054 +1503/20000 train_loss: 2.6731 train_time: 2.4m tok/s: 8278016 +1504/20000 train_loss: 2.7324 train_time: 2.4m tok/s: 8278095 +1505/20000 train_loss: 2.6833 train_time: 2.4m tok/s: 8278149 +1506/20000 train_loss: 2.7099 train_time: 2.4m tok/s: 8278168 +1507/20000 train_loss: 2.8273 train_time: 2.4m tok/s: 8278094 +1508/20000 train_loss: 2.5295 train_time: 2.4m tok/s: 8278004 +1509/20000 train_loss: 2.5792 train_time: 2.4m tok/s: 8277928 +1510/20000 train_loss: 2.5441 train_time: 2.4m tok/s: 8277962 +1511/20000 train_loss: 2.4980 train_time: 2.4m tok/s: 8277899 +1512/20000 train_loss: 2.5702 train_time: 2.4m tok/s: 8277860 +1513/20000 train_loss: 2.7198 train_time: 2.4m tok/s: 8277924 +1514/20000 train_loss: 2.7424 train_time: 2.4m tok/s: 8277962 +1515/20000 train_loss: 2.6927 train_time: 2.4m tok/s: 8277997 +1516/20000 train_loss: 2.5840 train_time: 2.4m tok/s: 8277971 +1517/20000 train_loss: 2.5954 train_time: 2.4m tok/s: 8277962 +1518/20000 train_loss: 2.7311 train_time: 2.4m tok/s: 8278005 +1519/20000 train_loss: 2.6596 train_time: 2.4m tok/s: 8277941 +1520/20000 train_loss: 2.6640 train_time: 2.4m tok/s: 8277945 +1521/20000 train_loss: 2.6457 train_time: 2.4m tok/s: 8277944 +1522/20000 train_loss: 2.6486 train_time: 2.4m tok/s: 8278012 +1523/20000 train_loss: 2.6782 train_time: 2.4m tok/s: 8277975 +1524/20000 train_loss: 2.6448 train_time: 2.4m tok/s: 8277965 +1525/20000 train_loss: 2.6277 train_time: 2.4m tok/s: 8277955 +1526/20000 train_loss: 2.7232 train_time: 2.4m tok/s: 8277903 +1527/20000 train_loss: 2.6731 train_time: 2.4m tok/s: 8277868 +1528/20000 train_loss: 2.4658 train_time: 2.4m tok/s: 
8277828 +1529/20000 train_loss: 2.6291 train_time: 2.4m tok/s: 8277909 +1530/20000 train_loss: 2.6129 train_time: 2.4m tok/s: 8277923 +1531/20000 train_loss: 2.3602 train_time: 2.4m tok/s: 8277901 +1532/20000 train_loss: 2.6189 train_time: 2.4m tok/s: 8277889 +1533/20000 train_loss: 2.6754 train_time: 2.4m tok/s: 8277860 +1534/20000 train_loss: 2.6337 train_time: 2.4m tok/s: 8277863 +1535/20000 train_loss: 2.7690 train_time: 2.4m tok/s: 8277868 +1536/20000 train_loss: 2.6928 train_time: 2.4m tok/s: 8277793 +1537/20000 train_loss: 3.0721 train_time: 2.4m tok/s: 8277768 +1538/20000 train_loss: 2.7195 train_time: 2.4m tok/s: 8277710 +1539/20000 train_loss: 2.6282 train_time: 2.4m tok/s: 8277716 +1540/20000 train_loss: 2.6770 train_time: 2.4m tok/s: 8277668 +1541/20000 train_loss: 2.5854 train_time: 2.4m tok/s: 8277606 +1542/20000 train_loss: 2.6152 train_time: 2.4m tok/s: 8277652 +1543/20000 train_loss: 2.6335 train_time: 2.4m tok/s: 8277721 +1544/20000 train_loss: 2.5884 train_time: 2.4m tok/s: 8277693 +1545/20000 train_loss: 2.6175 train_time: 2.4m tok/s: 8277681 +1546/20000 train_loss: 2.5002 train_time: 2.4m tok/s: 8277677 +1547/20000 train_loss: 2.7405 train_time: 2.4m tok/s: 8277669 +1548/20000 train_loss: 2.6861 train_time: 2.5m tok/s: 8277668 +1549/20000 train_loss: 2.5794 train_time: 2.5m tok/s: 8277635 +1550/20000 train_loss: 2.7160 train_time: 2.5m tok/s: 8277598 +1551/20000 train_loss: 2.6741 train_time: 2.5m tok/s: 8277570 +1552/20000 train_loss: 2.5520 train_time: 2.5m tok/s: 8277589 +1553/20000 train_loss: 2.4852 train_time: 2.5m tok/s: 8277548 +1554/20000 train_loss: 2.5873 train_time: 2.5m tok/s: 8277594 +1555/20000 train_loss: 2.6295 train_time: 2.5m tok/s: 8277626 +1556/20000 train_loss: 2.5145 train_time: 2.5m tok/s: 8277638 +1557/20000 train_loss: 2.5486 train_time: 2.5m tok/s: 8277591 +1558/20000 train_loss: 2.5655 train_time: 2.5m tok/s: 8277519 +1559/20000 train_loss: 2.5465 train_time: 2.5m tok/s: 8277428 +1560/20000 train_loss: 2.6186 
train_time: 2.5m tok/s: 8277480 +1561/20000 train_loss: 2.5394 train_time: 2.5m tok/s: 8277414 +1562/20000 train_loss: 2.5889 train_time: 2.5m tok/s: 8277424 +1563/20000 train_loss: 2.4923 train_time: 2.5m tok/s: 8277446 +1564/20000 train_loss: 2.5860 train_time: 2.5m tok/s: 8277483 +1565/20000 train_loss: 2.5700 train_time: 2.5m tok/s: 8277495 +1566/20000 train_loss: 2.7452 train_time: 2.5m tok/s: 8277482 +1567/20000 train_loss: 2.6809 train_time: 2.5m tok/s: 8277505 +1568/20000 train_loss: 2.5314 train_time: 2.5m tok/s: 8277554 +1569/20000 train_loss: 2.5961 train_time: 2.5m tok/s: 8277535 +1570/20000 train_loss: 2.5445 train_time: 2.5m tok/s: 8277560 +1571/20000 train_loss: 2.6129 train_time: 2.5m tok/s: 8277534 +1572/20000 train_loss: 3.1938 train_time: 2.5m tok/s: 8277519 +1573/20000 train_loss: 2.7475 train_time: 2.5m tok/s: 8277457 +1574/20000 train_loss: 2.5988 train_time: 2.5m tok/s: 8277438 +1575/20000 train_loss: 2.5509 train_time: 2.5m tok/s: 8277445 +1576/20000 train_loss: 2.5421 train_time: 2.5m tok/s: 8277423 +1577/20000 train_loss: 2.5720 train_time: 2.5m tok/s: 8277378 +1578/20000 train_loss: 2.5048 train_time: 2.5m tok/s: 8277337 +1579/20000 train_loss: 2.7593 train_time: 2.5m tok/s: 8277338 +1580/20000 train_loss: 2.6551 train_time: 2.5m tok/s: 8277308 +1581/20000 train_loss: 2.4919 train_time: 2.5m tok/s: 8277316 +1582/20000 train_loss: 2.5185 train_time: 2.5m tok/s: 8277272 +1583/20000 train_loss: 2.5811 train_time: 2.5m tok/s: 8277278 +1584/20000 train_loss: 2.5549 train_time: 2.5m tok/s: 8277391 +1585/20000 train_loss: 2.7027 train_time: 2.5m tok/s: 8277426 +1586/20000 train_loss: 2.5542 train_time: 2.5m tok/s: 8277402 +1587/20000 train_loss: 2.5948 train_time: 2.5m tok/s: 8277377 +1588/20000 train_loss: 2.6387 train_time: 2.5m tok/s: 8277381 +1589/20000 train_loss: 2.6956 train_time: 2.5m tok/s: 8277327 +1590/20000 train_loss: 2.6447 train_time: 2.5m tok/s: 8277267 +1591/20000 train_loss: 2.6377 train_time: 2.5m tok/s: 8277306 +1592/20000 
train_loss: 2.5701 train_time: 2.5m tok/s: 8277305 +1593/20000 train_loss: 2.6339 train_time: 2.5m tok/s: 8277347 +1594/20000 train_loss: 2.7372 train_time: 2.5m tok/s: 8277384 +1595/20000 train_loss: 2.6757 train_time: 2.5m tok/s: 8277401 +1596/20000 train_loss: 2.4454 train_time: 2.5m tok/s: 8277441 +1597/20000 train_loss: 2.5636 train_time: 2.5m tok/s: 8277461 +1598/20000 train_loss: 2.6215 train_time: 2.5m tok/s: 8277435 +1599/20000 train_loss: 2.6138 train_time: 2.5m tok/s: 8277442 +1600/20000 train_loss: 2.8159 train_time: 2.5m tok/s: 8277406 +1601/20000 train_loss: 2.6498 train_time: 2.5m tok/s: 8277411 +1602/20000 train_loss: 2.7686 train_time: 2.5m tok/s: 8277313 +1603/20000 train_loss: 2.5833 train_time: 2.5m tok/s: 8277318 +1604/20000 train_loss: 2.5969 train_time: 2.5m tok/s: 8277383 +1605/20000 train_loss: 2.6192 train_time: 2.5m tok/s: 8277358 +1606/20000 train_loss: 2.6138 train_time: 2.5m tok/s: 8277324 +1607/20000 train_loss: 2.5303 train_time: 2.5m tok/s: 8277256 +1608/20000 train_loss: 2.5036 train_time: 2.5m tok/s: 8277358 +1609/20000 train_loss: 2.7032 train_time: 2.5m tok/s: 8277374 +1610/20000 train_loss: 2.6024 train_time: 2.5m tok/s: 8277325 +1611/20000 train_loss: 2.5936 train_time: 2.6m tok/s: 8277289 +1612/20000 train_loss: 2.6539 train_time: 2.6m tok/s: 8277275 +1613/20000 train_loss: 2.6505 train_time: 2.6m tok/s: 8277269 +1614/20000 train_loss: 2.7138 train_time: 2.6m tok/s: 8277272 +1615/20000 train_loss: 2.7261 train_time: 2.6m tok/s: 8277322 +1616/20000 train_loss: 2.6590 train_time: 2.6m tok/s: 8277375 +1617/20000 train_loss: 2.6056 train_time: 2.6m tok/s: 8277404 +1618/20000 train_loss: 3.0403 train_time: 2.6m tok/s: 8277390 +1619/20000 train_loss: 2.7392 train_time: 2.6m tok/s: 8277333 +1620/20000 train_loss: 2.5610 train_time: 2.6m tok/s: 8277352 +1621/20000 train_loss: 2.5796 train_time: 2.6m tok/s: 8277384 +1622/20000 train_loss: 2.7568 train_time: 2.6m tok/s: 8277350 +1623/20000 train_loss: 2.6726 train_time: 2.6m tok/s: 
8277361 +1624/20000 train_loss: 2.6284 train_time: 2.6m tok/s: 8277342 +1625/20000 train_loss: 2.6326 train_time: 2.6m tok/s: 8277375 +1626/20000 train_loss: 2.7015 train_time: 2.6m tok/s: 8277357 +1627/20000 train_loss: 2.4348 train_time: 2.6m tok/s: 8277347 +1628/20000 train_loss: 2.5932 train_time: 2.6m tok/s: 8277381 +1629/20000 train_loss: 2.5713 train_time: 2.6m tok/s: 8277449 +1630/20000 train_loss: 2.5907 train_time: 2.6m tok/s: 8277411 +1631/20000 train_loss: 2.8008 train_time: 2.6m tok/s: 8277349 +1632/20000 train_loss: 2.7024 train_time: 2.6m tok/s: 8277505 +1633/20000 train_loss: 2.6669 train_time: 2.6m tok/s: 8277499 +1634/20000 train_loss: 2.6241 train_time: 2.6m tok/s: 8277470 +1635/20000 train_loss: 2.6817 train_time: 2.6m tok/s: 8277437 +1636/20000 train_loss: 2.4573 train_time: 2.6m tok/s: 8277451 +1637/20000 train_loss: 2.5601 train_time: 2.6m tok/s: 8277395 +1638/20000 train_loss: 2.4993 train_time: 2.6m tok/s: 8277332 +1639/20000 train_loss: 2.5206 train_time: 2.6m tok/s: 8277324 +1640/20000 train_loss: 2.3789 train_time: 2.6m tok/s: 8277349 +1641/20000 train_loss: 2.5435 train_time: 2.6m tok/s: 8277382 +1642/20000 train_loss: 2.7518 train_time: 2.6m tok/s: 8277375 +1643/20000 train_loss: 2.4323 train_time: 2.6m tok/s: 8277389 +1644/20000 train_loss: 2.4717 train_time: 2.6m tok/s: 8277499 +1645/20000 train_loss: 2.7671 train_time: 2.6m tok/s: 8277473 +1646/20000 train_loss: 2.5252 train_time: 2.6m tok/s: 8277478 +1647/20000 train_loss: 2.7408 train_time: 2.6m tok/s: 8277454 +1648/20000 train_loss: 2.6390 train_time: 2.6m tok/s: 8277472 +1649/20000 train_loss: 2.7471 train_time: 2.6m tok/s: 8277480 +1650/20000 train_loss: 2.5703 train_time: 2.6m tok/s: 8277427 +1651/20000 train_loss: 2.7328 train_time: 2.6m tok/s: 8277437 +1652/20000 train_loss: 2.6473 train_time: 2.6m tok/s: 8277525 +1653/20000 train_loss: 2.7458 train_time: 2.6m tok/s: 8277511 +1654/20000 train_loss: 2.6747 train_time: 2.6m tok/s: 8277576 +1655/20000 train_loss: 2.5566 
train_time: 2.6m tok/s: 8277578 +1656/20000 train_loss: 2.6043 train_time: 2.6m tok/s: 8277710 +1657/20000 train_loss: 2.6452 train_time: 2.6m tok/s: 8277701 +1658/20000 train_loss: 2.6350 train_time: 2.6m tok/s: 8277691 +1659/20000 train_loss: 2.5861 train_time: 2.6m tok/s: 8277703 +1660/20000 train_loss: 2.5514 train_time: 2.6m tok/s: 8277657 +1661/20000 train_loss: 2.7454 train_time: 2.6m tok/s: 8277602 +1662/20000 train_loss: 2.7417 train_time: 2.6m tok/s: 8277513 +1663/20000 train_loss: 2.7980 train_time: 2.6m tok/s: 8277420 +1664/20000 train_loss: 2.8053 train_time: 2.6m tok/s: 8277441 +1665/20000 train_loss: 2.8027 train_time: 2.6m tok/s: 8277405 +1666/20000 train_loss: 2.7013 train_time: 2.6m tok/s: 8277369 +1667/20000 train_loss: 2.6045 train_time: 2.6m tok/s: 8277335 +1668/20000 train_loss: 2.6307 train_time: 2.6m tok/s: 8277458 +1669/20000 train_loss: 2.7532 train_time: 2.6m tok/s: 8277420 +1670/20000 train_loss: 2.5559 train_time: 2.6m tok/s: 8277422 +1671/20000 train_loss: 2.4862 train_time: 2.6m tok/s: 8277417 +1672/20000 train_loss: 2.6164 train_time: 2.6m tok/s: 8277476 +1673/20000 train_loss: 2.5774 train_time: 2.6m tok/s: 8277443 +1674/20000 train_loss: 2.6443 train_time: 2.7m tok/s: 8277484 +1675/20000 train_loss: 2.4337 train_time: 2.7m tok/s: 8277520 +1676/20000 train_loss: 2.6814 train_time: 2.7m tok/s: 8277543 +1677/20000 train_loss: 2.6033 train_time: 2.7m tok/s: 8277512 +1678/20000 train_loss: 2.6824 train_time: 2.7m tok/s: 8277397 +1679/20000 train_loss: 2.6110 train_time: 2.7m tok/s: 8277292 +1680/20000 train_loss: 2.5319 train_time: 2.7m tok/s: 8277288 +1681/20000 train_loss: 2.5197 train_time: 2.7m tok/s: 8277258 +1682/20000 train_loss: 2.6255 train_time: 2.7m tok/s: 8277255 +1683/20000 train_loss: 2.6286 train_time: 2.7m tok/s: 8277216 +1684/20000 train_loss: 2.5786 train_time: 2.7m tok/s: 8277228 +1685/20000 train_loss: 2.6761 train_time: 2.7m tok/s: 8277252 +1686/20000 train_loss: 2.5819 train_time: 2.7m tok/s: 8277295 +1687/20000 
train_loss: 2.5508 train_time: 2.7m tok/s: 8277315 +1688/20000 train_loss: 2.5684 train_time: 2.7m tok/s: 8277319 +1689/20000 train_loss: 2.5205 train_time: 2.7m tok/s: 8277322 +1690/20000 train_loss: 2.8038 train_time: 2.7m tok/s: 8277247 +1691/20000 train_loss: 2.5976 train_time: 2.7m tok/s: 8277222 +1692/20000 train_loss: 2.5834 train_time: 2.7m tok/s: 8277283 +1693/20000 train_loss: 2.3943 train_time: 2.7m tok/s: 8277323 +1694/20000 train_loss: 2.6124 train_time: 2.7m tok/s: 8277293 +1695/20000 train_loss: 2.6157 train_time: 2.7m tok/s: 8277317 +1696/20000 train_loss: 2.6578 train_time: 2.7m tok/s: 8277368 +1697/20000 train_loss: 2.7591 train_time: 2.7m tok/s: 8277405 +1698/20000 train_loss: 2.6472 train_time: 2.7m tok/s: 8277431 +1699/20000 train_loss: 2.7436 train_time: 2.7m tok/s: 8277409 +1700/20000 train_loss: 2.5799 train_time: 2.7m tok/s: 8277410 +1701/20000 train_loss: 2.4805 train_time: 2.7m tok/s: 8277464 +1702/20000 train_loss: 2.6125 train_time: 2.7m tok/s: 8277460 +1703/20000 train_loss: 2.6467 train_time: 2.7m tok/s: 8277440 +1704/20000 train_loss: 2.7606 train_time: 2.7m tok/s: 8277490 +1705/20000 train_loss: 2.7770 train_time: 2.7m tok/s: 8277448 +1706/20000 train_loss: 2.7040 train_time: 2.7m tok/s: 8277465 +1707/20000 train_loss: 2.8438 train_time: 2.7m tok/s: 8277501 +1708/20000 train_loss: 2.4784 train_time: 2.7m tok/s: 8277435 +1709/20000 train_loss: 2.6634 train_time: 2.7m tok/s: 8277409 +1710/20000 train_loss: 2.6364 train_time: 2.7m tok/s: 8277453 +1711/20000 train_loss: 2.5726 train_time: 2.7m tok/s: 8277488 +1712/20000 train_loss: 2.6988 train_time: 2.7m tok/s: 8277542 +1713/20000 train_loss: 2.7884 train_time: 2.7m tok/s: 8277514 +1714/20000 train_loss: 2.5612 train_time: 2.7m tok/s: 8277517 +1715/20000 train_loss: 2.7481 train_time: 2.7m tok/s: 8277485 +1716/20000 train_loss: 2.7239 train_time: 2.7m tok/s: 8277535 +1717/20000 train_loss: 2.7375 train_time: 2.7m tok/s: 8277510 +1718/20000 train_loss: 2.8117 train_time: 2.7m tok/s: 
8277490 +1719/20000 train_loss: 2.7147 train_time: 2.7m tok/s: 8277474 +1720/20000 train_loss: 2.5294 train_time: 2.7m tok/s: 8277435 +1721/20000 train_loss: 2.6054 train_time: 2.7m tok/s: 8277428 +1722/20000 train_loss: 2.7303 train_time: 2.7m tok/s: 8277447 +1723/20000 train_loss: 2.6240 train_time: 2.7m tok/s: 8277487 +1724/20000 train_loss: 2.6615 train_time: 2.7m tok/s: 8277536 +1725/20000 train_loss: 2.5989 train_time: 2.7m tok/s: 8277537 +1726/20000 train_loss: 2.6372 train_time: 2.7m tok/s: 8277613 +1727/20000 train_loss: 2.5938 train_time: 2.7m tok/s: 8277591 +1728/20000 train_loss: 2.8359 train_time: 2.7m tok/s: 8277640 +1729/20000 train_loss: 2.6595 train_time: 2.7m tok/s: 8277602 +1730/20000 train_loss: 2.7417 train_time: 2.7m tok/s: 8277634 +1731/20000 train_loss: 2.7440 train_time: 2.7m tok/s: 8277658 +1732/20000 train_loss: 2.7069 train_time: 2.7m tok/s: 8277699 +1733/20000 train_loss: 2.6981 train_time: 2.7m tok/s: 8277663 +1734/20000 train_loss: 2.6059 train_time: 2.7m tok/s: 8277665 +1735/20000 train_loss: 2.4687 train_time: 2.7m tok/s: 8277654 +1736/20000 train_loss: 2.6952 train_time: 2.7m tok/s: 8277610 +1737/20000 train_loss: 2.5782 train_time: 2.8m tok/s: 8277554 +1738/20000 train_loss: 2.8160 train_time: 2.8m tok/s: 8277566 +1739/20000 train_loss: 2.7475 train_time: 2.8m tok/s: 8277526 +1740/20000 train_loss: 2.3919 train_time: 2.8m tok/s: 8277639 +1741/20000 train_loss: 2.7940 train_time: 2.8m tok/s: 8277657 +1742/20000 train_loss: 2.6120 train_time: 2.8m tok/s: 8277691 +1743/20000 train_loss: 2.5234 train_time: 2.8m tok/s: 8277705 +1744/20000 train_loss: 2.5885 train_time: 2.8m tok/s: 8277714 +1745/20000 train_loss: 2.6302 train_time: 2.8m tok/s: 8277708 +1746/20000 train_loss: 2.6100 train_time: 2.8m tok/s: 8277731 +1747/20000 train_loss: 2.6385 train_time: 2.8m tok/s: 8277781 +1748/20000 train_loss: 2.5728 train_time: 2.8m tok/s: 8277781 +1749/20000 train_loss: 2.6246 train_time: 2.8m tok/s: 8277716 +1750/20000 train_loss: 2.6705 
train_time: 2.8m tok/s: 8277720 +1751/20000 train_loss: 2.6705 train_time: 2.8m tok/s: 8277686 +1752/20000 train_loss: 2.6319 train_time: 2.8m tok/s: 8277835 +1753/20000 train_loss: 2.6120 train_time: 2.8m tok/s: 8277890 +1754/20000 train_loss: 2.6789 train_time: 2.8m tok/s: 8277911 +1755/20000 train_loss: 2.5908 train_time: 2.8m tok/s: 8277957 +1756/20000 train_loss: 2.6149 train_time: 2.8m tok/s: 8277869 +1757/20000 train_loss: 2.5819 train_time: 2.8m tok/s: 8277843 +1758/20000 train_loss: 2.8681 train_time: 2.8m tok/s: 8277823 +1759/20000 train_loss: 2.6325 train_time: 2.8m tok/s: 8277814 +1760/20000 train_loss: 2.5333 train_time: 2.8m tok/s: 8277742 +1761/20000 train_loss: 2.6347 train_time: 2.8m tok/s: 8277703 +1762/20000 train_loss: 2.7003 train_time: 2.8m tok/s: 8277748 +1763/20000 train_loss: 2.7148 train_time: 2.8m tok/s: 8277752 +1764/20000 train_loss: 2.6562 train_time: 2.8m tok/s: 8277770 +1765/20000 train_loss: 2.5830 train_time: 2.8m tok/s: 8277803 +1766/20000 train_loss: 2.7115 train_time: 2.8m tok/s: 8277819 +1767/20000 train_loss: 2.5760 train_time: 2.8m tok/s: 8277854 +1768/20000 train_loss: 2.6439 train_time: 2.8m tok/s: 8277858 +1769/20000 train_loss: 2.6439 train_time: 2.8m tok/s: 8277832 +1770/20000 train_loss: 2.6595 train_time: 2.8m tok/s: 8277815 +1771/20000 train_loss: 2.5251 train_time: 2.8m tok/s: 8277811 +1772/20000 train_loss: 2.5487 train_time: 2.8m tok/s: 8277806 +1773/20000 train_loss: 2.8520 train_time: 2.8m tok/s: 8277852 +1774/20000 train_loss: 2.7107 train_time: 2.8m tok/s: 8277913 +1775/20000 train_loss: 2.7518 train_time: 2.8m tok/s: 8277992 +1776/20000 train_loss: 2.5792 train_time: 2.8m tok/s: 8277984 +1777/20000 train_loss: 2.6960 train_time: 2.8m tok/s: 8277953 +1778/20000 train_loss: 2.6583 train_time: 2.8m tok/s: 8277963 +1779/20000 train_loss: 2.6569 train_time: 2.8m tok/s: 8278022 +1780/20000 train_loss: 2.6549 train_time: 2.8m tok/s: 8277986 +1781/20000 train_loss: 2.5264 train_time: 2.8m tok/s: 8277981 +1782/20000 
train_loss: 2.4141 train_time: 2.8m tok/s: 8278002 +1783/20000 train_loss: 2.6268 train_time: 2.8m tok/s: 8277977 +1784/20000 train_loss: 2.6292 train_time: 2.8m tok/s: 8277967 +1785/20000 train_loss: 2.6405 train_time: 2.8m tok/s: 8278030 +1786/20000 train_loss: 2.7969 train_time: 2.8m tok/s: 8278064 +1787/20000 train_loss: 2.7149 train_time: 2.8m tok/s: 8278087 +1788/20000 train_loss: 2.6302 train_time: 2.8m tok/s: 8278129 +1789/20000 train_loss: 2.7132 train_time: 2.8m tok/s: 8278143 +1790/20000 train_loss: 2.5278 train_time: 2.8m tok/s: 8278182 +1791/20000 train_loss: 2.3533 train_time: 2.8m tok/s: 8278125 +1792/20000 train_loss: 2.6321 train_time: 2.8m tok/s: 8278075 +1793/20000 train_loss: 2.4961 train_time: 2.8m tok/s: 8278120 +1794/20000 train_loss: 2.4341 train_time: 2.8m tok/s: 8278144 +1795/20000 train_loss: 2.6233 train_time: 2.8m tok/s: 8278140 +1796/20000 train_loss: 2.6485 train_time: 2.8m tok/s: 8278184 +1797/20000 train_loss: 2.8051 train_time: 2.8m tok/s: 8278232 +1798/20000 train_loss: 2.5721 train_time: 2.8m tok/s: 8278221 +1799/20000 train_loss: 2.6580 train_time: 2.8m tok/s: 8278162 +1800/20000 train_loss: 2.5581 train_time: 2.9m tok/s: 8278172 +1801/20000 train_loss: 2.6820 train_time: 2.9m tok/s: 8278192 +1802/20000 train_loss: 2.5833 train_time: 2.9m tok/s: 8278127 +1803/20000 train_loss: 2.6388 train_time: 2.9m tok/s: 8278096 +1804/20000 train_loss: 2.6007 train_time: 2.9m tok/s: 8278132 +1805/20000 train_loss: 2.5872 train_time: 2.9m tok/s: 8278164 +1806/20000 train_loss: 2.8111 train_time: 2.9m tok/s: 8278153 +1807/20000 train_loss: 2.7097 train_time: 2.9m tok/s: 8278109 +1808/20000 train_loss: 2.7237 train_time: 2.9m tok/s: 8278140 +1809/20000 train_loss: 2.6091 train_time: 2.9m tok/s: 8278095 +1810/20000 train_loss: 2.7177 train_time: 2.9m tok/s: 8277991 +1811/20000 train_loss: 2.5672 train_time: 2.9m tok/s: 8277974 +1812/20000 train_loss: 2.6129 train_time: 2.9m tok/s: 8277957 +1813/20000 train_loss: 2.6979 train_time: 2.9m tok/s: 
8277960 +1814/20000 train_loss: 2.7529 train_time: 2.9m tok/s: 8277985 +1815/20000 train_loss: 2.5548 train_time: 2.9m tok/s: 8277985 +1816/20000 train_loss: 2.4903 train_time: 2.9m tok/s: 8277968 +1817/20000 train_loss: 2.8469 train_time: 2.9m tok/s: 8277963 +1818/20000 train_loss: 2.6741 train_time: 2.9m tok/s: 8278027 +1819/20000 train_loss: 2.6920 train_time: 2.9m tok/s: 8278021 +1820/20000 train_loss: 2.5803 train_time: 2.9m tok/s: 8277990 +1821/20000 train_loss: 2.6466 train_time: 2.9m tok/s: 8278002 +1822/20000 train_loss: 2.7071 train_time: 2.9m tok/s: 8278011 +1823/20000 train_loss: 2.4053 train_time: 2.9m tok/s: 8277962 +1824/20000 train_loss: 2.6058 train_time: 2.9m tok/s: 8277889 +1825/20000 train_loss: 2.6279 train_time: 2.9m tok/s: 8277914 +1826/20000 train_loss: 2.4597 train_time: 2.9m tok/s: 8277915 +1827/20000 train_loss: 2.5771 train_time: 2.9m tok/s: 8277891 +1828/20000 train_loss: 2.5258 train_time: 2.9m tok/s: 8277837 +1829/20000 train_loss: 2.4680 train_time: 2.9m tok/s: 8277847 +1830/20000 train_loss: 2.6488 train_time: 2.9m tok/s: 8277864 +1831/20000 train_loss: 2.6514 train_time: 2.9m tok/s: 8277893 +1832/20000 train_loss: 2.6399 train_time: 2.9m tok/s: 8277914 +1833/20000 train_loss: 2.7047 train_time: 2.9m tok/s: 8277886 +1834/20000 train_loss: 2.6725 train_time: 2.9m tok/s: 8277945 +1835/20000 train_loss: 2.5779 train_time: 2.9m tok/s: 8277955 +1836/20000 train_loss: 2.5766 train_time: 2.9m tok/s: 8277919 +1837/20000 train_loss: 2.4866 train_time: 2.9m tok/s: 8277994 +1838/20000 train_loss: 2.6108 train_time: 2.9m tok/s: 8277991 +1839/20000 train_loss: 2.7137 train_time: 2.9m tok/s: 8278009 +1840/20000 train_loss: 2.6546 train_time: 2.9m tok/s: 8278005 +1841/20000 train_loss: 2.6940 train_time: 2.9m tok/s: 8277987 +1842/20000 train_loss: 2.6288 train_time: 2.9m tok/s: 8278014 +1843/20000 train_loss: 2.5603 train_time: 2.9m tok/s: 8278020 +1844/20000 train_loss: 2.6525 train_time: 2.9m tok/s: 8278023 +1845/20000 train_loss: 2.8151 
train_time: 2.9m tok/s: 8277987 +1846/20000 train_loss: 2.5878 train_time: 2.9m tok/s: 8277967 +1847/20000 train_loss: 2.4690 train_time: 2.9m tok/s: 8277981 +1848/20000 train_loss: 2.5324 train_time: 2.9m tok/s: 8277954 +1849/20000 train_loss: 2.5244 train_time: 2.9m tok/s: 8277997 +1850/20000 train_loss: 2.6476 train_time: 2.9m tok/s: 8278033 +1851/20000 train_loss: 2.6626 train_time: 2.9m tok/s: 8278033 +1852/20000 train_loss: 2.6725 train_time: 2.9m tok/s: 8278054 +1853/20000 train_loss: 2.5128 train_time: 2.9m tok/s: 8278050 +1854/20000 train_loss: 2.5555 train_time: 2.9m tok/s: 8278100 +1855/20000 train_loss: 2.6406 train_time: 2.9m tok/s: 8278121 +1856/20000 train_loss: 2.6701 train_time: 2.9m tok/s: 8278123 +1857/20000 train_loss: 2.7878 train_time: 2.9m tok/s: 8278160 +1858/20000 train_loss: 2.7326 train_time: 2.9m tok/s: 8278201 +1859/20000 train_loss: 2.6439 train_time: 2.9m tok/s: 8278225 +1860/20000 train_loss: 2.5822 train_time: 2.9m tok/s: 8278241 +1861/20000 train_loss: 2.5712 train_time: 2.9m tok/s: 8278242 +1862/20000 train_loss: 2.5421 train_time: 2.9m tok/s: 8278250 +1863/20000 train_loss: 2.6246 train_time: 2.9m tok/s: 8278278 +1864/20000 train_loss: 2.5796 train_time: 3.0m tok/s: 8278237 +1865/20000 train_loss: 2.7346 train_time: 3.0m tok/s: 8278298 +1866/20000 train_loss: 2.6410 train_time: 3.0m tok/s: 8278200 +1867/20000 train_loss: 2.5486 train_time: 3.0m tok/s: 8278147 +1868/20000 train_loss: 2.5432 train_time: 3.0m tok/s: 8278168 +1869/20000 train_loss: 2.6833 train_time: 3.0m tok/s: 8278183 +1870/20000 train_loss: 2.6386 train_time: 3.0m tok/s: 8278176 +1871/20000 train_loss: 2.5514 train_time: 3.0m tok/s: 8278165 +1872/20000 train_loss: 2.5844 train_time: 3.0m tok/s: 8278182 +1873/20000 train_loss: 2.7284 train_time: 3.0m tok/s: 8278212 +1874/20000 train_loss: 2.6598 train_time: 3.0m tok/s: 8278217 +1875/20000 train_loss: 2.7483 train_time: 3.0m tok/s: 8278249 +1876/20000 train_loss: 2.7845 train_time: 3.0m tok/s: 8278273 +1877/20000 
train_loss: 2.9599 train_time: 3.0m tok/s: 8278197 +1878/20000 train_loss: 2.6575 train_time: 3.0m tok/s: 8278121 +1879/20000 train_loss: 2.6109 train_time: 3.0m tok/s: 8278140 +1880/20000 train_loss: 2.7496 train_time: 3.0m tok/s: 8278156 +1881/20000 train_loss: 2.5941 train_time: 3.0m tok/s: 8278179 +1882/20000 train_loss: 2.7271 train_time: 3.0m tok/s: 8278215 +1883/20000 train_loss: 2.5978 train_time: 3.0m tok/s: 8278260 +1884/20000 train_loss: 2.5613 train_time: 3.0m tok/s: 8278224 +1885/20000 train_loss: 2.6419 train_time: 3.0m tok/s: 8278196 +1886/20000 train_loss: 2.5624 train_time: 3.0m tok/s: 8278239 +1887/20000 train_loss: 2.6347 train_time: 3.0m tok/s: 8278235 +1888/20000 train_loss: 2.4894 train_time: 3.0m tok/s: 8278261 +1889/20000 train_loss: 2.5433 train_time: 3.0m tok/s: 8278323 +1890/20000 train_loss: 2.6763 train_time: 3.0m tok/s: 8278279 +1891/20000 train_loss: 2.5100 train_time: 3.0m tok/s: 8278307 +1892/20000 train_loss: 2.6807 train_time: 3.0m tok/s: 8278322 +1893/20000 train_loss: 2.6850 train_time: 3.0m tok/s: 8278342 +1894/20000 train_loss: 2.5938 train_time: 3.0m tok/s: 8278328 +1895/20000 train_loss: 2.6494 train_time: 3.0m tok/s: 8278369 +1896/20000 train_loss: 2.6222 train_time: 3.0m tok/s: 8278292 +1897/20000 train_loss: 2.5616 train_time: 3.0m tok/s: 8278354 +1898/20000 train_loss: 2.7029 train_time: 3.0m tok/s: 8278344 +1899/20000 train_loss: 2.6254 train_time: 3.0m tok/s: 8278395 +1900/20000 train_loss: 2.6188 train_time: 3.0m tok/s: 8278452 +1901/20000 train_loss: 2.6875 train_time: 3.0m tok/s: 8278431 +1902/20000 train_loss: 2.6058 train_time: 3.0m tok/s: 8278450 +1903/20000 train_loss: 2.7594 train_time: 3.0m tok/s: 8278465 +1904/20000 train_loss: 3.1367 train_time: 3.0m tok/s: 8278458 +1905/20000 train_loss: 2.4854 train_time: 3.0m tok/s: 8278366 +1906/20000 train_loss: 2.6346 train_time: 3.0m tok/s: 8278350 +1907/20000 train_loss: 2.5161 train_time: 3.0m tok/s: 8278409 +1908/20000 train_loss: 2.5376 train_time: 3.0m tok/s: 
8278398 +1909/20000 train_loss: 2.5840 train_time: 3.0m tok/s: 8278407 +1910/20000 train_loss: 2.5346 train_time: 3.0m tok/s: 8278387 +1911/20000 train_loss: 2.4885 train_time: 3.0m tok/s: 8278423 +1912/20000 train_loss: 2.7060 train_time: 3.0m tok/s: 8278486 +1913/20000 train_loss: 2.7184 train_time: 3.0m tok/s: 8278511 +1914/20000 train_loss: 2.6987 train_time: 3.0m tok/s: 8278521 +1915/20000 train_loss: 2.7139 train_time: 3.0m tok/s: 8278542 +1916/20000 train_loss: 2.5744 train_time: 3.0m tok/s: 8278541 +1917/20000 train_loss: 2.7111 train_time: 3.0m tok/s: 8278494 +1918/20000 train_loss: 2.5761 train_time: 3.0m tok/s: 8278514 +1919/20000 train_loss: 2.5594 train_time: 3.0m tok/s: 8278543 +1920/20000 train_loss: 2.4998 train_time: 3.0m tok/s: 8278540 +1921/20000 train_loss: 2.7061 train_time: 3.0m tok/s: 8278549 +1922/20000 train_loss: 2.6004 train_time: 3.0m tok/s: 8278590 +1923/20000 train_loss: 2.5222 train_time: 3.0m tok/s: 8278627 +1924/20000 train_loss: 2.5948 train_time: 3.0m tok/s: 8278633 +1925/20000 train_loss: 2.5358 train_time: 3.0m tok/s: 8278626 +1926/20000 train_loss: 2.7434 train_time: 3.0m tok/s: 8278644 +1927/20000 train_loss: 2.5739 train_time: 3.1m tok/s: 8278639 +1928/20000 train_loss: 2.6337 train_time: 3.1m tok/s: 8278664 +1929/20000 train_loss: 2.6461 train_time: 3.1m tok/s: 8278655 +1930/20000 train_loss: 2.7059 train_time: 3.1m tok/s: 8278662 +1931/20000 train_loss: 2.6436 train_time: 3.1m tok/s: 8278682 +1932/20000 train_loss: 2.7559 train_time: 3.1m tok/s: 8278711 +1933/20000 train_loss: 2.6562 train_time: 3.1m tok/s: 8278686 +1934/20000 train_loss: 2.6641 train_time: 3.1m tok/s: 8278713 +1935/20000 train_loss: 2.5647 train_time: 3.1m tok/s: 8278707 +1936/20000 train_loss: 2.6765 train_time: 3.1m tok/s: 8278692 +1937/20000 train_loss: 2.6694 train_time: 3.1m tok/s: 8278717 +1938/20000 train_loss: 2.7066 train_time: 3.1m tok/s: 8278754 +1939/20000 train_loss: 2.6218 train_time: 3.1m tok/s: 8278792 +1940/20000 train_loss: 2.8130 
train_time: 3.1m tok/s: 8278785 +1941/20000 train_loss: 2.4731 train_time: 3.1m tok/s: 8278811 +1942/20000 train_loss: 2.4989 train_time: 3.1m tok/s: 8278886 +1943/20000 train_loss: 2.5009 train_time: 3.1m tok/s: 8278888 +1944/20000 train_loss: 2.5221 train_time: 3.1m tok/s: 8278900 +1945/20000 train_loss: 2.5770 train_time: 3.1m tok/s: 8278901 +1946/20000 train_loss: 2.6215 train_time: 3.1m tok/s: 8278942 +1947/20000 train_loss: 2.6899 train_time: 3.1m tok/s: 8278950 +1948/20000 train_loss: 2.6802 train_time: 3.1m tok/s: 8278939 +1949/20000 train_loss: 2.7305 train_time: 3.1m tok/s: 8278932 +1950/20000 train_loss: 2.5873 train_time: 3.1m tok/s: 8278972 +1951/20000 train_loss: 2.8077 train_time: 3.1m tok/s: 8278956 +1952/20000 train_loss: 2.8242 train_time: 3.1m tok/s: 8278961 +1953/20000 train_loss: 2.6397 train_time: 3.1m tok/s: 8278972 +1954/20000 train_loss: 2.5871 train_time: 3.1m tok/s: 8279019 +1955/20000 train_loss: 2.8588 train_time: 3.1m tok/s: 8279013 +1956/20000 train_loss: 2.5837 train_time: 3.1m tok/s: 8279017 +1957/20000 train_loss: 2.6059 train_time: 3.1m tok/s: 8279037 +1958/20000 train_loss: 2.5658 train_time: 3.1m tok/s: 8279076 +1959/20000 train_loss: 2.5512 train_time: 3.1m tok/s: 8279080 +1960/20000 train_loss: 2.5034 train_time: 3.1m tok/s: 8279110 +1961/20000 train_loss: 2.5139 train_time: 3.1m tok/s: 8279081 +1962/20000 train_loss: 2.6017 train_time: 3.1m tok/s: 8279093 +1963/20000 train_loss: 2.5665 train_time: 3.1m tok/s: 8279073 +1964/20000 train_loss: 2.5895 train_time: 3.1m tok/s: 8279133 +1965/20000 train_loss: 2.5890 train_time: 3.1m tok/s: 8279120 +1966/20000 train_loss: 2.7508 train_time: 3.1m tok/s: 8279110 +1967/20000 train_loss: 2.5610 train_time: 3.1m tok/s: 8279074 +1968/20000 train_loss: 2.7106 train_time: 3.1m tok/s: 8279065 +1969/20000 train_loss: 2.7651 train_time: 3.1m tok/s: 8279102 +1970/20000 train_loss: 2.5884 train_time: 3.1m tok/s: 8279101 +1971/20000 train_loss: 2.6186 train_time: 3.1m tok/s: 8279099 +1972/20000 
train_loss: 2.6855 train_time: 3.1m tok/s: 8279117 +1973/20000 train_loss: 2.6004 train_time: 3.1m tok/s: 8279128 +1974/20000 train_loss: 2.7692 train_time: 3.1m tok/s: 8279090 +1975/20000 train_loss: 2.5243 train_time: 3.1m tok/s: 8279080 +1976/20000 train_loss: 2.7190 train_time: 3.1m tok/s: 8279126 +1977/20000 train_loss: 2.5277 train_time: 3.1m tok/s: 8279170 +1978/20000 train_loss: 2.6867 train_time: 3.1m tok/s: 8279168 +1979/20000 train_loss: 2.5286 train_time: 3.1m tok/s: 8279186 +1980/20000 train_loss: 2.5741 train_time: 3.1m tok/s: 8279218 +1981/20000 train_loss: 2.4769 train_time: 3.1m tok/s: 8279256 +1982/20000 train_loss: 2.6437 train_time: 3.1m tok/s: 8279200 +1983/20000 train_loss: 2.3817 train_time: 3.1m tok/s: 8279167 +1984/20000 train_loss: 2.6907 train_time: 3.1m tok/s: 8279147 +1985/20000 train_loss: 2.6284 train_time: 3.1m tok/s: 8279166 +1986/20000 train_loss: 2.6743 train_time: 3.1m tok/s: 8279197 +1987/20000 train_loss: 2.6898 train_time: 3.1m tok/s: 8279256 +1988/20000 train_loss: 2.6549 train_time: 3.1m tok/s: 8279319 +1989/20000 train_loss: 2.4878 train_time: 3.1m tok/s: 8279318 +1990/20000 train_loss: 2.6482 train_time: 3.2m tok/s: 8279315 +1991/20000 train_loss: 2.5719 train_time: 3.2m tok/s: 8279376 +1992/20000 train_loss: 2.7508 train_time: 3.2m tok/s: 8279417 +1993/20000 train_loss: 2.5696 train_time: 3.2m tok/s: 8279365 +1994/20000 train_loss: 2.6153 train_time: 3.2m tok/s: 8279359 +1995/20000 train_loss: 2.5196 train_time: 3.2m tok/s: 8279394 +1996/20000 train_loss: 2.6164 train_time: 3.2m tok/s: 8279440 +1997/20000 train_loss: 2.5970 train_time: 3.2m tok/s: 8279448 +1998/20000 train_loss: 2.6140 train_time: 3.2m tok/s: 8279509 +1999/20000 train_loss: 2.6754 train_time: 3.2m tok/s: 8279519 +2000/20000 train_loss: 2.4945 train_time: 3.2m tok/s: 8279532 +2001/20000 train_loss: 2.5884 train_time: 3.2m tok/s: 8279551 +2002/20000 train_loss: 2.4526 train_time: 3.2m tok/s: 8279470 +2003/20000 train_loss: 2.6501 train_time: 3.2m tok/s: 
8279566 +2004/20000 train_loss: 2.6351 train_time: 3.2m tok/s: 8279603 +2005/20000 train_loss: 2.6178 train_time: 3.2m tok/s: 8279638 +2006/20000 train_loss: 2.5769 train_time: 3.2m tok/s: 8279669 +2007/20000 train_loss: 2.6280 train_time: 3.2m tok/s: 8279680 +2008/20000 train_loss: 2.5465 train_time: 3.2m tok/s: 8279663 +2009/20000 train_loss: 2.6550 train_time: 3.2m tok/s: 8279695 +2010/20000 train_loss: 2.7062 train_time: 3.2m tok/s: 8279733 +2011/20000 train_loss: 2.5728 train_time: 3.2m tok/s: 8279735 +2012/20000 train_loss: 2.5813 train_time: 3.2m tok/s: 8279752 +2013/20000 train_loss: 2.4827 train_time: 3.2m tok/s: 8279751 +2014/20000 train_loss: 2.4805 train_time: 3.2m tok/s: 8279731 +2015/20000 train_loss: 2.6996 train_time: 3.2m tok/s: 8279710 +2016/20000 train_loss: 2.4832 train_time: 3.2m tok/s: 8279684 +2017/20000 train_loss: 2.6374 train_time: 3.2m tok/s: 8279689 +2018/20000 train_loss: 2.6272 train_time: 3.2m tok/s: 8279760 +2019/20000 train_loss: 2.7204 train_time: 3.2m tok/s: 8279769 +2020/20000 train_loss: 2.7102 train_time: 3.2m tok/s: 8279780 +2021/20000 train_loss: 2.5656 train_time: 3.2m tok/s: 8279785 +2022/20000 train_loss: 2.4998 train_time: 3.2m tok/s: 8279788 +2023/20000 train_loss: 2.6927 train_time: 3.2m tok/s: 8279784 +2024/20000 train_loss: 2.6393 train_time: 3.2m tok/s: 8279766 +2025/20000 train_loss: 2.4815 train_time: 3.2m tok/s: 8279733 +2026/20000 train_loss: 2.7107 train_time: 3.2m tok/s: 8279730 +2027/20000 train_loss: 2.5896 train_time: 3.2m tok/s: 8279720 +2028/20000 train_loss: 2.7162 train_time: 3.2m tok/s: 8279651 +2029/20000 train_loss: 2.4961 train_time: 3.2m tok/s: 8279664 +2030/20000 train_loss: 2.5181 train_time: 3.2m tok/s: 8279681 +2031/20000 train_loss: 2.5048 train_time: 3.2m tok/s: 8279724 +2032/20000 train_loss: 2.5630 train_time: 3.2m tok/s: 8279654 +2033/20000 train_loss: 2.8344 train_time: 3.2m tok/s: 8279633 +2034/20000 train_loss: 2.6808 train_time: 3.2m tok/s: 8279650 +2035/20000 train_loss: 2.6606 
train_time: 3.2m tok/s: 8279696 +2036/20000 train_loss: 2.6176 train_time: 3.2m tok/s: 8279680 +2037/20000 train_loss: 2.8338 train_time: 3.2m tok/s: 8279612 +2038/20000 train_loss: 2.5916 train_time: 3.2m tok/s: 8279555 +2039/20000 train_loss: 2.6321 train_time: 3.2m tok/s: 8279609 +2040/20000 train_loss: 2.5791 train_time: 3.2m tok/s: 8279613 +2041/20000 train_loss: 2.6398 train_time: 3.2m tok/s: 8279644 +2042/20000 train_loss: 2.5425 train_time: 3.2m tok/s: 8279623 +2043/20000 train_loss: 2.4899 train_time: 3.2m tok/s: 8279599 +2044/20000 train_loss: 2.6535 train_time: 3.2m tok/s: 8279607 +2045/20000 train_loss: 2.4405 train_time: 3.2m tok/s: 8279671 +2046/20000 train_loss: 2.4640 train_time: 3.2m tok/s: 8279728 +2047/20000 train_loss: 2.7460 train_time: 3.2m tok/s: 8279731 +2048/20000 train_loss: 2.5692 train_time: 3.2m tok/s: 8279735 +2049/20000 train_loss: 2.7137 train_time: 3.2m tok/s: 8279776 +2050/20000 train_loss: 2.6678 train_time: 3.2m tok/s: 8279777 +2051/20000 train_loss: 2.6450 train_time: 3.2m tok/s: 8279820 +2052/20000 train_loss: 2.5342 train_time: 3.2m tok/s: 8279914 +2053/20000 train_loss: 2.6357 train_time: 3.2m tok/s: 8279943 +2054/20000 train_loss: 2.6737 train_time: 3.3m tok/s: 8279990 +2055/20000 train_loss: 2.5781 train_time: 3.3m tok/s: 8279999 +2056/20000 train_loss: 2.6188 train_time: 3.3m tok/s: 8280001 +2057/20000 train_loss: 2.6610 train_time: 3.3m tok/s: 8280012 +2058/20000 train_loss: 2.5584 train_time: 3.3m tok/s: 8280012 +2059/20000 train_loss: 2.4952 train_time: 3.3m tok/s: 8280031 +2060/20000 train_loss: 2.5902 train_time: 3.3m tok/s: 8280028 +2061/20000 train_loss: 2.5889 train_time: 3.3m tok/s: 8280039 +2062/20000 train_loss: 2.6144 train_time: 3.3m tok/s: 8280027 +2063/20000 train_loss: 2.5529 train_time: 3.3m tok/s: 8280054 +2064/20000 train_loss: 2.7982 train_time: 3.3m tok/s: 8280102 +2065/20000 train_loss: 2.5352 train_time: 3.3m tok/s: 8280142 +2066/20000 train_loss: 2.6100 train_time: 3.3m tok/s: 8280101 +2067/20000 
train_loss: 2.6673 train_time: 3.3m tok/s: 8280093 +2068/20000 train_loss: 2.6014 train_time: 3.3m tok/s: 8280114 +2069/20000 train_loss: 2.4590 train_time: 3.3m tok/s: 8280116 +2070/20000 train_loss: 2.6116 train_time: 3.3m tok/s: 8280074 +2071/20000 train_loss: 2.5526 train_time: 3.3m tok/s: 8280049 +2072/20000 train_loss: 2.5985 train_time: 3.3m tok/s: 8280055 +2073/20000 train_loss: 2.5321 train_time: 3.3m tok/s: 8280074 +2074/20000 train_loss: 2.6903 train_time: 3.3m tok/s: 8280077 +2075/20000 train_loss: 2.5732 train_time: 3.3m tok/s: 8280140 +2076/20000 train_loss: 2.6665 train_time: 3.3m tok/s: 8280147 +2077/20000 train_loss: 3.5637 train_time: 3.3m tok/s: 8280066 +2078/20000 train_loss: 2.7172 train_time: 3.3m tok/s: 8279963 +2079/20000 train_loss: 2.6616 train_time: 3.3m tok/s: 8279960 +2080/20000 train_loss: 2.6011 train_time: 3.3m tok/s: 8279968 +2081/20000 train_loss: 2.6066 train_time: 3.3m tok/s: 8279941 +2082/20000 train_loss: 2.5995 train_time: 3.3m tok/s: 8279919 +2083/20000 train_loss: 2.5423 train_time: 3.3m tok/s: 8279938 +2084/20000 train_loss: 2.5834 train_time: 3.3m tok/s: 8279983 +2085/20000 train_loss: 2.5822 train_time: 3.3m tok/s: 8280015 +2086/20000 train_loss: 2.6139 train_time: 3.3m tok/s: 8280023 +2087/20000 train_loss: 2.5504 train_time: 3.3m tok/s: 8280073 +2088/20000 train_loss: 2.4660 train_time: 3.3m tok/s: 8280035 +2089/20000 train_loss: 2.6281 train_time: 3.3m tok/s: 8280074 +2090/20000 train_loss: 2.7358 train_time: 3.3m tok/s: 8280134 +2091/20000 train_loss: 2.6223 train_time: 3.3m tok/s: 8280156 +2092/20000 train_loss: 2.6548 train_time: 3.3m tok/s: 8280175 +2093/20000 train_loss: 2.6408 train_time: 3.3m tok/s: 8280224 +2094/20000 train_loss: 2.6015 train_time: 3.3m tok/s: 8280290 +2095/20000 train_loss: 2.5895 train_time: 3.3m tok/s: 8280328 +2096/20000 train_loss: 2.6921 train_time: 3.3m tok/s: 8280343 +2097/20000 train_loss: 2.5727 train_time: 3.3m tok/s: 8280340 +2098/20000 train_loss: 2.4863 train_time: 3.3m tok/s: 
8280315 +2099/20000 train_loss: 2.4869 train_time: 3.3m tok/s: 8280377 +2100/20000 train_loss: 2.5809 train_time: 3.3m tok/s: 8280374 +2101/20000 train_loss: 2.6497 train_time: 3.3m tok/s: 8280337 +2102/20000 train_loss: 2.5712 train_time: 3.3m tok/s: 8280336 +2103/20000 train_loss: 2.5889 train_time: 3.3m tok/s: 8280313 +2104/20000 train_loss: 2.7109 train_time: 3.3m tok/s: 8280344 +2105/20000 train_loss: 2.7213 train_time: 3.3m tok/s: 8280386 +2106/20000 train_loss: 2.7041 train_time: 3.3m tok/s: 8280413 +2107/20000 train_loss: 2.6031 train_time: 3.3m tok/s: 8280435 +2108/20000 train_loss: 2.5452 train_time: 3.3m tok/s: 8280370 +2109/20000 train_loss: 2.7136 train_time: 3.3m tok/s: 8280397 +2110/20000 train_loss: 2.5760 train_time: 3.3m tok/s: 8280426 +2111/20000 train_loss: 2.5567 train_time: 3.3m tok/s: 8280437 +2112/20000 train_loss: 2.5508 train_time: 3.3m tok/s: 8280409 +2113/20000 train_loss: 2.5200 train_time: 3.3m tok/s: 8280428 +2114/20000 train_loss: 2.7860 train_time: 3.3m tok/s: 8280452 +2115/20000 train_loss: 2.4931 train_time: 3.3m tok/s: 8280487 +2116/20000 train_loss: 2.6530 train_time: 3.3m tok/s: 8280501 +2117/20000 train_loss: 2.6898 train_time: 3.4m tok/s: 8280539 +2118/20000 train_loss: 2.6961 train_time: 3.4m tok/s: 8280570 +2119/20000 train_loss: 2.8045 train_time: 3.4m tok/s: 8280607 +2120/20000 train_loss: 2.6555 train_time: 3.4m tok/s: 8280644 +2121/20000 train_loss: 2.6326 train_time: 3.4m tok/s: 8280671 +2122/20000 train_loss: 2.5504 train_time: 3.4m tok/s: 8280726 +2123/20000 train_loss: 2.4014 train_time: 3.4m tok/s: 8280721 +2124/20000 train_loss: 2.6353 train_time: 3.4m tok/s: 8280709 +2125/20000 train_loss: 2.6154 train_time: 3.4m tok/s: 8280765 +2126/20000 train_loss: 2.6272 train_time: 3.4m tok/s: 8280712 +2127/20000 train_loss: 2.5554 train_time: 3.4m tok/s: 8280695 +2128/20000 train_loss: 2.5383 train_time: 3.4m tok/s: 8280734 +2129/20000 train_loss: 2.3665 train_time: 3.4m tok/s: 8280761 +2130/20000 train_loss: 2.6770 
train_time: 3.4m tok/s: 8280779 +2131/20000 train_loss: 2.5886 train_time: 3.4m tok/s: 8280826 +2132/20000 train_loss: 2.8977 train_time: 3.4m tok/s: 8280850 +2133/20000 train_loss: 2.6967 train_time: 3.4m tok/s: 8280871 +2134/20000 train_loss: 2.6366 train_time: 3.4m tok/s: 8280860 +2135/20000 train_loss: 2.5881 train_time: 3.4m tok/s: 8280854 +2136/20000 train_loss: 2.5998 train_time: 3.4m tok/s: 8280936 +2137/20000 train_loss: 2.5360 train_time: 3.4m tok/s: 8280923 +2138/20000 train_loss: 2.5866 train_time: 3.4m tok/s: 8280957 +2139/20000 train_loss: 2.6766 train_time: 3.4m tok/s: 8280996 +2140/20000 train_loss: 2.5454 train_time: 3.4m tok/s: 8281013 +2141/20000 train_loss: 2.5298 train_time: 3.4m tok/s: 8280976 +2142/20000 train_loss: 2.6054 train_time: 3.4m tok/s: 8281001 +2143/20000 train_loss: 2.4823 train_time: 3.4m tok/s: 8280998 +2144/20000 train_loss: 2.5867 train_time: 3.4m tok/s: 8281003 +2145/20000 train_loss: 2.6656 train_time: 3.4m tok/s: 8280961 +2146/20000 train_loss: 2.5894 train_time: 3.4m tok/s: 8280977 +2147/20000 train_loss: 2.5574 train_time: 3.4m tok/s: 8281012 +2148/20000 train_loss: 2.7065 train_time: 3.4m tok/s: 8280997 +2149/20000 train_loss: 2.5402 train_time: 3.4m tok/s: 8280998 +2150/20000 train_loss: 2.7382 train_time: 3.4m tok/s: 8281030 +2151/20000 train_loss: 2.6811 train_time: 3.4m tok/s: 8281038 +2152/20000 train_loss: 2.5983 train_time: 3.4m tok/s: 8281027 +2153/20000 train_loss: 2.4610 train_time: 3.4m tok/s: 8280984 +2154/20000 train_loss: 2.6295 train_time: 3.4m tok/s: 8280999 +2155/20000 train_loss: 2.5183 train_time: 3.4m tok/s: 8280986 +2156/20000 train_loss: 2.6111 train_time: 3.4m tok/s: 8280995 +2157/20000 train_loss: 2.5929 train_time: 3.4m tok/s: 8280995 +2158/20000 train_loss: 2.5927 train_time: 3.4m tok/s: 8281013 +2159/20000 train_loss: 2.5379 train_time: 3.4m tok/s: 8281025 +2160/20000 train_loss: 2.4319 train_time: 3.4m tok/s: 8281030 +2161/20000 train_loss: 2.6165 train_time: 3.4m tok/s: 8280993 +2162/20000 
train_loss: 2.6403 train_time: 3.4m tok/s: 8281043 +2163/20000 train_loss: 2.5627 train_time: 3.4m tok/s: 8281079 +2164/20000 train_loss: 2.6090 train_time: 3.4m tok/s: 8281068 +2165/20000 train_loss: 2.6438 train_time: 3.4m tok/s: 8281051 +2166/20000 train_loss: 2.5219 train_time: 3.4m tok/s: 8281050 +2167/20000 train_loss: 2.5266 train_time: 3.4m tok/s: 8281023 +2168/20000 train_loss: 2.5840 train_time: 3.4m tok/s: 8280980 +2169/20000 train_loss: 2.6329 train_time: 3.4m tok/s: 8280960 +2170/20000 train_loss: 2.4967 train_time: 3.4m tok/s: 8280939 +2171/20000 train_loss: 2.6035 train_time: 3.4m tok/s: 8280936 +2172/20000 train_loss: 2.5111 train_time: 3.4m tok/s: 8280933 +2173/20000 train_loss: 2.7151 train_time: 3.4m tok/s: 8280937 +2174/20000 train_loss: 2.5512 train_time: 3.4m tok/s: 8280937 +2175/20000 train_loss: 2.4537 train_time: 3.4m tok/s: 8280935 +2176/20000 train_loss: 2.6688 train_time: 3.4m tok/s: 8280916 +2177/20000 train_loss: 2.6123 train_time: 3.4m tok/s: 8280919 +2178/20000 train_loss: 2.5288 train_time: 3.4m tok/s: 8280979 +2179/20000 train_loss: 2.7065 train_time: 3.4m tok/s: 8281009 +2180/20000 train_loss: 2.5595 train_time: 3.5m tok/s: 8281002 +2181/20000 train_loss: 2.5587 train_time: 3.5m tok/s: 8281001 +2182/20000 train_loss: 2.4385 train_time: 3.5m tok/s: 8281004 +2183/20000 train_loss: 2.5338 train_time: 3.5m tok/s: 8281026 +2184/20000 train_loss: 2.5415 train_time: 3.5m tok/s: 8281022 +2185/20000 train_loss: 2.4469 train_time: 3.5m tok/s: 8281015 +2186/20000 train_loss: 2.7454 train_time: 3.5m tok/s: 8281061 +2187/20000 train_loss: 2.5558 train_time: 3.5m tok/s: 8281050 +2188/20000 train_loss: 2.5492 train_time: 3.5m tok/s: 8281046 +2189/20000 train_loss: 2.6345 train_time: 3.5m tok/s: 8281053 +2190/20000 train_loss: 2.6805 train_time: 3.5m tok/s: 8281077 +2191/20000 train_loss: 2.6055 train_time: 3.5m tok/s: 8281138 +2192/20000 train_loss: 2.6476 train_time: 3.5m tok/s: 8281145 +2193/20000 train_loss: 2.5835 train_time: 3.5m tok/s: 
8281183 +2194/20000 train_loss: 2.5833 train_time: 3.5m tok/s: 8281192 +2195/20000 train_loss: 2.6098 train_time: 3.5m tok/s: 8281197 +2196/20000 train_loss: 2.6048 train_time: 3.5m tok/s: 8281194 +2197/20000 train_loss: 2.5822 train_time: 3.5m tok/s: 8281191 +2198/20000 train_loss: 2.5373 train_time: 3.5m tok/s: 8281206 +2199/20000 train_loss: 2.5359 train_time: 3.5m tok/s: 8281201 +2200/20000 train_loss: 2.6117 train_time: 3.5m tok/s: 8281217 +2201/20000 train_loss: 2.6639 train_time: 3.5m tok/s: 8281226 +2202/20000 train_loss: 2.5852 train_time: 3.5m tok/s: 8281215 +2203/20000 train_loss: 2.5684 train_time: 3.5m tok/s: 8281193 +2204/20000 train_loss: 2.4215 train_time: 3.5m tok/s: 8281198 +2205/20000 train_loss: 2.6506 train_time: 3.5m tok/s: 8281163 +2206/20000 train_loss: 2.5756 train_time: 3.5m tok/s: 8281186 +2207/20000 train_loss: 2.5267 train_time: 3.5m tok/s: 8281195 +2208/20000 train_loss: 2.6567 train_time: 3.5m tok/s: 8281204 +2209/20000 train_loss: 2.7744 train_time: 3.5m tok/s: 8281227 +layer_loop:enabled step:2209 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2210/20000 train_loss: 3.0965 train_time: 3.5m tok/s: 8279224 +2211/20000 train_loss: 2.8640 train_time: 3.5m tok/s: 8277456 +2212/20000 train_loss: 2.5947 train_time: 3.5m tok/s: 8275691 +2213/20000 train_loss: 2.6749 train_time: 3.5m tok/s: 8273957 +2214/20000 train_loss: 2.5687 train_time: 3.5m tok/s: 8272194 +2215/20000 train_loss: 2.6540 train_time: 3.5m tok/s: 8270447 +2216/20000 train_loss: 2.5774 train_time: 3.5m tok/s: 8268703 +2217/20000 train_loss: 2.7332 train_time: 3.5m tok/s: 8266897 +2218/20000 train_loss: 2.7157 train_time: 3.5m tok/s: 8265181 +2219/20000 train_loss: 2.5369 train_time: 3.5m tok/s: 8263369 +2220/20000 train_loss: 2.5765 train_time: 3.5m tok/s: 8261581 +2221/20000 train_loss: 2.6974 train_time: 3.5m tok/s: 8259888 +2222/20000 train_loss: 2.5206 train_time: 3.5m tok/s: 8258060 +2223/20000 train_loss: 2.6031 train_time: 3.5m 
tok/s: 8256354 +2224/20000 train_loss: 2.4179 train_time: 3.5m tok/s: 8254603 +2225/20000 train_loss: 2.5771 train_time: 3.5m tok/s: 8252852 +2226/20000 train_loss: 2.5609 train_time: 3.5m tok/s: 8251094 +2227/20000 train_loss: 2.5334 train_time: 3.5m tok/s: 8249317 +2228/20000 train_loss: 2.5732 train_time: 3.5m tok/s: 8247595 +2229/20000 train_loss: 2.5317 train_time: 3.5m tok/s: 8245877 +2230/20000 train_loss: 2.5544 train_time: 3.5m tok/s: 8244126 +2231/20000 train_loss: 2.3579 train_time: 3.5m tok/s: 8242340 +2232/20000 train_loss: 2.5466 train_time: 3.6m tok/s: 8240598 +2233/20000 train_loss: 2.6767 train_time: 3.6m tok/s: 8238926 +2234/20000 train_loss: 2.7034 train_time: 3.6m tok/s: 8237225 +2235/20000 train_loss: 2.6585 train_time: 3.6m tok/s: 8235563 +2236/20000 train_loss: 2.6121 train_time: 3.6m tok/s: 8233871 +2237/20000 train_loss: 2.6417 train_time: 3.6m tok/s: 8232192 +2238/20000 train_loss: 2.6595 train_time: 3.6m tok/s: 8230501 +2239/20000 train_loss: 2.7612 train_time: 3.6m tok/s: 8228777 +2240/20000 train_loss: 2.7671 train_time: 3.6m tok/s: 8227057 +2241/20000 train_loss: 2.4057 train_time: 3.6m tok/s: 8225305 +2242/20000 train_loss: 2.5872 train_time: 3.6m tok/s: 8223642 +2243/20000 train_loss: 2.5721 train_time: 3.6m tok/s: 8221945 +2244/20000 train_loss: 2.5990 train_time: 3.6m tok/s: 8220246 +2245/20000 train_loss: 2.6416 train_time: 3.6m tok/s: 8218521 +2246/20000 train_loss: 2.5461 train_time: 3.6m tok/s: 8216833 +2247/20000 train_loss: 2.6853 train_time: 3.6m tok/s: 8215156 +2248/20000 train_loss: 2.6732 train_time: 3.6m tok/s: 8213476 +2249/20000 train_loss: 2.5491 train_time: 3.6m tok/s: 8211797 +2250/20000 train_loss: 2.5425 train_time: 3.6m tok/s: 8210091 +2251/20000 train_loss: 2.5872 train_time: 3.6m tok/s: 8208398 +2252/20000 train_loss: 2.7405 train_time: 3.6m tok/s: 8206720 +2253/20000 train_loss: 2.5744 train_time: 3.6m tok/s: 8205066 +2254/20000 train_loss: 2.5730 train_time: 3.6m tok/s: 8203342 +2255/20000 train_loss: 2.4348 
train_time: 3.6m tok/s: 8201692 +2256/20000 train_loss: 2.4798 train_time: 3.6m tok/s: 8200039 +2257/20000 train_loss: 2.5235 train_time: 3.6m tok/s: 8198287 +2258/20000 train_loss: 2.5450 train_time: 3.6m tok/s: 8196581 +2259/20000 train_loss: 2.5676 train_time: 3.6m tok/s: 8194956 +2260/20000 train_loss: 2.5117 train_time: 3.6m tok/s: 8193188 +2261/20000 train_loss: 2.6111 train_time: 3.6m tok/s: 8191543 +2262/20000 train_loss: 2.7139 train_time: 3.6m tok/s: 8189871 +2263/20000 train_loss: 2.7038 train_time: 3.6m tok/s: 8188206 +2264/20000 train_loss: 2.4706 train_time: 3.6m tok/s: 8186463 +2265/20000 train_loss: 2.5563 train_time: 3.6m tok/s: 8184850 +2266/20000 train_loss: 2.5821 train_time: 3.6m tok/s: 8183221 +2267/20000 train_loss: 2.6031 train_time: 3.6m tok/s: 8181610 +2268/20000 train_loss: 2.6942 train_time: 3.6m tok/s: 8180015 +2269/20000 train_loss: 2.7763 train_time: 3.6m tok/s: 8178271 +2270/20000 train_loss: 2.5271 train_time: 3.6m tok/s: 8176580 +2271/20000 train_loss: 2.5838 train_time: 3.6m tok/s: 8174907 +2272/20000 train_loss: 2.5159 train_time: 3.6m tok/s: 8173312 +2273/20000 train_loss: 2.5772 train_time: 3.6m tok/s: 8171705 +2274/20000 train_loss: 3.2794 train_time: 3.6m tok/s: 8170014 +2275/20000 train_loss: 2.4192 train_time: 3.7m tok/s: 8168326 +2276/20000 train_loss: 2.5269 train_time: 3.7m tok/s: 8166678 +2277/20000 train_loss: 2.7813 train_time: 3.7m tok/s: 8165018 +2278/20000 train_loss: 2.7231 train_time: 3.7m tok/s: 8163421 +2279/20000 train_loss: 2.6438 train_time: 3.7m tok/s: 8161773 +2280/20000 train_loss: 2.7333 train_time: 3.7m tok/s: 8160174 +2281/20000 train_loss: 2.5468 train_time: 3.7m tok/s: 8158631 +2282/20000 train_loss: 2.7350 train_time: 3.7m tok/s: 8157035 +2283/20000 train_loss: 2.9784 train_time: 3.7m tok/s: 8155391 +2284/20000 train_loss: 2.4906 train_time: 3.7m tok/s: 8153770 +2285/20000 train_loss: 2.5307 train_time: 3.7m tok/s: 8152161 +2286/20000 train_loss: 2.4515 train_time: 3.7m tok/s: 8150466 +2287/20000 
train_loss: 2.4564 train_time: 3.7m tok/s: 8148872 +2288/20000 train_loss: 2.6521 train_time: 3.7m tok/s: 8147245 +2289/20000 train_loss: 2.6662 train_time: 3.7m tok/s: 8145673 +2290/20000 train_loss: 2.6462 train_time: 3.7m tok/s: 8144031 +2291/20000 train_loss: 2.5393 train_time: 3.7m tok/s: 8142465 +2292/20000 train_loss: 2.4847 train_time: 3.7m tok/s: 8140903 +2293/20000 train_loss: 2.5571 train_time: 3.7m tok/s: 8139344 +2294/20000 train_loss: 2.4627 train_time: 3.7m tok/s: 8137785 +2295/20000 train_loss: 2.6678 train_time: 3.7m tok/s: 8136254 +2296/20000 train_loss: 2.5201 train_time: 3.7m tok/s: 8134717 +2297/20000 train_loss: 2.6251 train_time: 3.7m tok/s: 8133111 +2298/20000 train_loss: 2.5025 train_time: 3.7m tok/s: 8131529 +2299/20000 train_loss: 2.6041 train_time: 3.7m tok/s: 8129990 +2300/20000 train_loss: 2.6289 train_time: 3.7m tok/s: 8128370 +2301/20000 train_loss: 2.3991 train_time: 3.7m tok/s: 8126795 +2302/20000 train_loss: 2.5777 train_time: 3.7m tok/s: 8125223 +2303/20000 train_loss: 2.5044 train_time: 3.7m tok/s: 8123683 +2304/20000 train_loss: 2.5927 train_time: 3.7m tok/s: 8122079 +2305/20000 train_loss: 2.4465 train_time: 3.7m tok/s: 8120399 +2306/20000 train_loss: 2.6724 train_time: 3.7m tok/s: 8118876 +2307/20000 train_loss: 2.6499 train_time: 3.7m tok/s: 8117378 +2308/20000 train_loss: 2.5860 train_time: 3.7m tok/s: 8115881 +2309/20000 train_loss: 2.5462 train_time: 3.7m tok/s: 8114276 +2310/20000 train_loss: 2.6255 train_time: 3.7m tok/s: 8112748 +2311/20000 train_loss: 2.6200 train_time: 3.7m tok/s: 8111209 +2312/20000 train_loss: 2.6413 train_time: 3.7m tok/s: 8109619 +2313/20000 train_loss: 2.6206 train_time: 3.7m tok/s: 8108100 +2314/20000 train_loss: 2.4431 train_time: 3.7m tok/s: 8106568 +2315/20000 train_loss: 2.4094 train_time: 3.7m tok/s: 8105011 +2316/20000 train_loss: 2.3702 train_time: 3.7m tok/s: 8103457 +2317/20000 train_loss: 2.7526 train_time: 3.7m tok/s: 8101770 +2318/20000 train_loss: 2.6424 train_time: 3.8m tok/s: 
8100269 +2319/20000 train_loss: 2.4550 train_time: 3.8m tok/s: 8098656 +2320/20000 train_loss: 2.6224 train_time: 3.8m tok/s: 8097146 +2321/20000 train_loss: 2.6036 train_time: 3.8m tok/s: 8095582 +2322/20000 train_loss: 2.4649 train_time: 3.8m tok/s: 8094085 +2323/20000 train_loss: 2.6379 train_time: 3.8m tok/s: 8092578 +2324/20000 train_loss: 2.5787 train_time: 3.8m tok/s: 8090992 +2325/20000 train_loss: 2.5977 train_time: 3.8m tok/s: 8089396 +2326/20000 train_loss: 2.6048 train_time: 3.8m tok/s: 8087868 +2327/20000 train_loss: 2.5806 train_time: 3.8m tok/s: 8086352 +2328/20000 train_loss: 2.5341 train_time: 3.8m tok/s: 8084814 +2329/20000 train_loss: 2.4154 train_time: 3.8m tok/s: 8083335 +2330/20000 train_loss: 2.6747 train_time: 3.8m tok/s: 8081828 +2331/20000 train_loss: 2.5815 train_time: 3.8m tok/s: 8080192 +2332/20000 train_loss: 2.4040 train_time: 3.8m tok/s: 8078678 +2333/20000 train_loss: 2.6210 train_time: 3.8m tok/s: 8077147 +2334/20000 train_loss: 2.2811 train_time: 3.8m tok/s: 8075588 +2335/20000 train_loss: 2.5501 train_time: 3.8m tok/s: 8074053 +2336/20000 train_loss: 2.6287 train_time: 3.8m tok/s: 8072553 +2337/20000 train_loss: 2.6640 train_time: 3.8m tok/s: 8071086 +2338/20000 train_loss: 2.5381 train_time: 3.8m tok/s: 8069622 +2339/20000 train_loss: 2.6183 train_time: 3.8m tok/s: 8068135 +2340/20000 train_loss: 2.5772 train_time: 3.8m tok/s: 8066673 +2341/20000 train_loss: 2.5597 train_time: 3.8m tok/s: 8065231 +2342/20000 train_loss: 2.5036 train_time: 3.8m tok/s: 8063737 +2343/20000 train_loss: 2.4367 train_time: 3.8m tok/s: 8062273 +2344/20000 train_loss: 2.7079 train_time: 3.8m tok/s: 8060789 +2345/20000 train_loss: 3.0479 train_time: 3.8m tok/s: 8059290 +2346/20000 train_loss: 2.5350 train_time: 3.8m tok/s: 8057751 +2347/20000 train_loss: 2.5320 train_time: 3.8m tok/s: 8056284 +2348/20000 train_loss: 2.7238 train_time: 3.8m tok/s: 8054802 +2349/20000 train_loss: 2.5967 train_time: 3.8m tok/s: 8053331 +2350/20000 train_loss: 2.5860 
train_time: 3.8m tok/s: 8051884 +2351/20000 train_loss: 2.5691 train_time: 3.8m tok/s: 8050422 +2352/20000 train_loss: 2.6004 train_time: 3.8m tok/s: 8048972 +2353/20000 train_loss: 2.4982 train_time: 3.8m tok/s: 8047410 +2354/20000 train_loss: 2.5507 train_time: 3.8m tok/s: 8045898 +2355/20000 train_loss: 2.5140 train_time: 3.8m tok/s: 8044468 +2356/20000 train_loss: 2.5698 train_time: 3.8m tok/s: 8042988 +2357/20000 train_loss: 2.5216 train_time: 3.8m tok/s: 8041521 +2358/20000 train_loss: 2.5294 train_time: 3.8m tok/s: 8040106 +2359/20000 train_loss: 2.5072 train_time: 3.8m tok/s: 8038636 +2360/20000 train_loss: 2.5844 train_time: 3.8m tok/s: 8037184 +2361/20000 train_loss: 2.5984 train_time: 3.9m tok/s: 8035713 +2362/20000 train_loss: 2.5276 train_time: 3.9m tok/s: 8034178 +2363/20000 train_loss: 2.5469 train_time: 3.9m tok/s: 8032757 +2364/20000 train_loss: 2.6624 train_time: 3.9m tok/s: 8031266 +2365/20000 train_loss: 2.5611 train_time: 3.9m tok/s: 8029785 +2366/20000 train_loss: 2.6251 train_time: 3.9m tok/s: 8028310 +2367/20000 train_loss: 2.5354 train_time: 3.9m tok/s: 8026895 +2368/20000 train_loss: 2.6896 train_time: 3.9m tok/s: 8025507 +2369/20000 train_loss: 2.5222 train_time: 3.9m tok/s: 8024049 +2370/20000 train_loss: 2.6102 train_time: 3.9m tok/s: 8022629 +2371/20000 train_loss: 2.5931 train_time: 3.9m tok/s: 8021192 +2372/20000 train_loss: 2.6300 train_time: 3.9m tok/s: 8019726 +2373/20000 train_loss: 2.4970 train_time: 3.9m tok/s: 8018281 +2374/20000 train_loss: 2.5630 train_time: 3.9m tok/s: 8016826 +2375/20000 train_loss: 2.5555 train_time: 3.9m tok/s: 8015368 +2376/20000 train_loss: 2.4994 train_time: 3.9m tok/s: 8013974 +2377/20000 train_loss: 2.4223 train_time: 3.9m tok/s: 8012518 +2378/20000 train_loss: 2.5156 train_time: 3.9m tok/s: 8011093 +2379/20000 train_loss: 2.8998 train_time: 3.9m tok/s: 8009619 +2380/20000 train_loss: 2.5050 train_time: 3.9m tok/s: 8008181 +2381/20000 train_loss: 2.6668 train_time: 3.9m tok/s: 8006777 +2382/20000 
train_loss: 2.4625 train_time: 3.9m tok/s: 8005359 +2383/20000 train_loss: 2.6508 train_time: 3.9m tok/s: 8003958 +2384/20000 train_loss: 2.6595 train_time: 3.9m tok/s: 8002557 +2385/20000 train_loss: 2.6759 train_time: 3.9m tok/s: 8001065 +2386/20000 train_loss: 2.6363 train_time: 3.9m tok/s: 7999617 +2387/20000 train_loss: 2.4936 train_time: 3.9m tok/s: 7998166 +2388/20000 train_loss: 2.9512 train_time: 3.9m tok/s: 7996627 +2389/20000 train_loss: 2.3638 train_time: 3.9m tok/s: 7995065 +2390/20000 train_loss: 2.5952 train_time: 3.9m tok/s: 7993670 +2391/20000 train_loss: 2.5211 train_time: 3.9m tok/s: 7992243 +2392/20000 train_loss: 2.6423 train_time: 3.9m tok/s: 7990858 +2393/20000 train_loss: 2.5467 train_time: 3.9m tok/s: 7989511 +2394/20000 train_loss: 2.6349 train_time: 3.9m tok/s: 7988099 +2395/20000 train_loss: 2.6385 train_time: 3.9m tok/s: 7986719 +2396/20000 train_loss: 2.6105 train_time: 3.9m tok/s: 7985369 +2397/20000 train_loss: 2.6901 train_time: 3.9m tok/s: 7983946 +2398/20000 train_loss: 2.5173 train_time: 3.9m tok/s: 7982533 +2399/20000 train_loss: 2.5248 train_time: 3.9m tok/s: 7981184 +2400/20000 train_loss: 2.5277 train_time: 3.9m tok/s: 7979726 +2401/20000 train_loss: 2.6045 train_time: 3.9m tok/s: 7978372 +2402/20000 train_loss: 2.5078 train_time: 3.9m tok/s: 7977001 +2403/20000 train_loss: 2.8803 train_time: 3.9m tok/s: 7975528 +2404/20000 train_loss: 2.5463 train_time: 4.0m tok/s: 7974198 +2405/20000 train_loss: 2.4864 train_time: 4.0m tok/s: 7972762 +2406/20000 train_loss: 2.5775 train_time: 4.0m tok/s: 7971415 +2407/20000 train_loss: 2.5799 train_time: 4.0m tok/s: 7970102 +2408/20000 train_loss: 2.6620 train_time: 4.0m tok/s: 7968760 +2409/20000 train_loss: 2.5926 train_time: 4.0m tok/s: 7967374 +2410/20000 train_loss: 2.5749 train_time: 4.0m tok/s: 7965976 +2411/20000 train_loss: 2.5366 train_time: 4.0m tok/s: 7964628 +2412/20000 train_loss: 2.6524 train_time: 4.0m tok/s: 7963201 +2413/20000 train_loss: 2.4946 train_time: 4.0m tok/s: 
7961809 +2414/20000 train_loss: 2.5640 train_time: 4.0m tok/s: 7960464 +2415/20000 train_loss: 2.5511 train_time: 4.0m tok/s: 7959100 +2416/20000 train_loss: 2.5799 train_time: 4.0m tok/s: 7957711 +2417/20000 train_loss: 2.5487 train_time: 4.0m tok/s: 7956315 +2418/20000 train_loss: 2.5110 train_time: 4.0m tok/s: 7954974 +2419/20000 train_loss: 2.5565 train_time: 4.0m tok/s: 7953591 +2420/20000 train_loss: 2.5953 train_time: 4.0m tok/s: 7952203 +2421/20000 train_loss: 2.6255 train_time: 4.0m tok/s: 7950864 +2422/20000 train_loss: 2.6074 train_time: 4.0m tok/s: 7949551 +2423/20000 train_loss: 2.5042 train_time: 4.0m tok/s: 7948226 +2424/20000 train_loss: 2.6236 train_time: 4.0m tok/s: 7946825 +2425/20000 train_loss: 2.6082 train_time: 4.0m tok/s: 7945420 +2426/20000 train_loss: 2.5413 train_time: 4.0m tok/s: 7944075 +2427/20000 train_loss: 2.4195 train_time: 4.0m tok/s: 7942721 +2428/20000 train_loss: 2.5229 train_time: 4.0m tok/s: 7941302 +2429/20000 train_loss: 2.4959 train_time: 4.0m tok/s: 7939993 +2430/20000 train_loss: 2.4671 train_time: 4.0m tok/s: 7938644 +2431/20000 train_loss: 2.5879 train_time: 4.0m tok/s: 7937203 +2432/20000 train_loss: 2.5449 train_time: 4.0m tok/s: 7935898 +2433/20000 train_loss: 2.6800 train_time: 4.0m tok/s: 7934552 +2434/20000 train_loss: 2.4952 train_time: 4.0m tok/s: 7933225 +2435/20000 train_loss: 2.6794 train_time: 4.0m tok/s: 7931831 +2436/20000 train_loss: 2.5359 train_time: 4.0m tok/s: 7930499 +2437/20000 train_loss: 2.5814 train_time: 4.0m tok/s: 7929182 +2438/20000 train_loss: 2.5341 train_time: 4.0m tok/s: 7927837 +2439/20000 train_loss: 2.5293 train_time: 4.0m tok/s: 7926504 +2440/20000 train_loss: 2.5030 train_time: 4.0m tok/s: 7925190 +2441/20000 train_loss: 2.5396 train_time: 4.0m tok/s: 7923862 +2442/20000 train_loss: 2.5205 train_time: 4.0m tok/s: 7922566 +2443/20000 train_loss: 2.6422 train_time: 4.0m tok/s: 7921229 +2444/20000 train_loss: 2.5891 train_time: 4.0m tok/s: 7919900 +2445/20000 train_loss: 2.5032 
train_time: 4.0m tok/s: 7918622 +2446/20000 train_loss: 2.7118 train_time: 4.0m tok/s: 7917299 +2447/20000 train_loss: 2.7300 train_time: 4.1m tok/s: 7916000 +2448/20000 train_loss: 2.5911 train_time: 4.1m tok/s: 7914676 +2449/20000 train_loss: 2.5031 train_time: 4.1m tok/s: 7913332 +2450/20000 train_loss: 2.5583 train_time: 4.1m tok/s: 7912003 +2451/20000 train_loss: 2.5457 train_time: 4.1m tok/s: 7910647 +2452/20000 train_loss: 2.5731 train_time: 4.1m tok/s: 7909303 +2453/20000 train_loss: 2.4383 train_time: 4.1m tok/s: 7908020 +2454/20000 train_loss: 2.4550 train_time: 4.1m tok/s: 7906729 +2455/20000 train_loss: 2.5527 train_time: 4.1m tok/s: 7905383 +2456/20000 train_loss: 2.5561 train_time: 4.1m tok/s: 7904068 +2457/20000 train_loss: 2.6250 train_time: 4.1m tok/s: 7902808 +2458/20000 train_loss: 2.7191 train_time: 4.1m tok/s: 7901465 +2459/20000 train_loss: 2.5697 train_time: 4.1m tok/s: 7900176 +2460/20000 train_loss: 2.5892 train_time: 4.1m tok/s: 7898897 +2461/20000 train_loss: 2.6683 train_time: 4.1m tok/s: 7897602 +2462/20000 train_loss: 2.5413 train_time: 4.1m tok/s: 7896301 +2463/20000 train_loss: 2.6014 train_time: 4.1m tok/s: 7895027 +2464/20000 train_loss: 2.5239 train_time: 4.1m tok/s: 7893666 +2465/20000 train_loss: 2.6176 train_time: 4.1m tok/s: 7892415 +2466/20000 train_loss: 2.3301 train_time: 4.1m tok/s: 7891132 +2467/20000 train_loss: 2.5976 train_time: 4.1m tok/s: 7889808 +2468/20000 train_loss: 2.4819 train_time: 4.1m tok/s: 7888520 +2469/20000 train_loss: 2.5827 train_time: 4.1m tok/s: 7887190 +2470/20000 train_loss: 2.6107 train_time: 4.1m tok/s: 7885944 +2471/20000 train_loss: 2.5842 train_time: 4.1m tok/s: 7884639 +2472/20000 train_loss: 2.6800 train_time: 4.1m tok/s: 7883343 +2473/20000 train_loss: 2.5548 train_time: 4.1m tok/s: 7882083 +2474/20000 train_loss: 2.8354 train_time: 4.1m tok/s: 7880678 +2475/20000 train_loss: 2.6920 train_time: 4.1m tok/s: 7879361 +2476/20000 train_loss: 2.5740 train_time: 4.1m tok/s: 7878111 +2477/20000 
train_loss: 2.5005 train_time: 4.1m tok/s: 7876875 +2478/20000 train_loss: 2.5327 train_time: 4.1m tok/s: 7875617 +2479/20000 train_loss: 2.6428 train_time: 4.1m tok/s: 7874382 +2480/20000 train_loss: 2.5813 train_time: 4.1m tok/s: 7873091 +2481/20000 train_loss: 2.4675 train_time: 4.1m tok/s: 7871795 +2482/20000 train_loss: 2.6242 train_time: 4.1m tok/s: 7870519 +2483/20000 train_loss: 2.5796 train_time: 4.1m tok/s: 7869317 +2484/20000 train_loss: 2.5457 train_time: 4.1m tok/s: 7868027 +2485/20000 train_loss: 2.4922 train_time: 4.1m tok/s: 7866773 +2486/20000 train_loss: 2.5805 train_time: 4.1m tok/s: 7865475 +2487/20000 train_loss: 2.6040 train_time: 4.1m tok/s: 7864200 +2488/20000 train_loss: 2.5716 train_time: 4.1m tok/s: 7862951 +2489/20000 train_loss: 2.4657 train_time: 4.1m tok/s: 7861631 +2490/20000 train_loss: 2.6372 train_time: 4.2m tok/s: 7860395 +2491/20000 train_loss: 2.5811 train_time: 4.2m tok/s: 7859129 +2492/20000 train_loss: 2.5440 train_time: 4.2m tok/s: 7857855 +2493/20000 train_loss: 2.5750 train_time: 4.2m tok/s: 7856613 +2494/20000 train_loss: 2.4504 train_time: 4.2m tok/s: 7855310 +2495/20000 train_loss: 2.4983 train_time: 4.2m tok/s: 7854066 +2496/20000 train_loss: 2.6114 train_time: 4.2m tok/s: 7852843 +2497/20000 train_loss: 2.5660 train_time: 4.2m tok/s: 7851585 +2498/20000 train_loss: 2.5641 train_time: 4.2m tok/s: 7850386 +2499/20000 train_loss: 2.6091 train_time: 4.2m tok/s: 7849160 +2500/20000 train_loss: 2.6721 train_time: 4.2m tok/s: 7847881 +2501/20000 train_loss: 2.5633 train_time: 4.2m tok/s: 7846634 +2502/20000 train_loss: 2.4149 train_time: 4.2m tok/s: 7845405 +2503/20000 train_loss: 2.5237 train_time: 4.2m tok/s: 7844176 +2504/20000 train_loss: 2.6151 train_time: 4.2m tok/s: 7842898 +2505/20000 train_loss: 2.5276 train_time: 4.2m tok/s: 7841612 +2506/20000 train_loss: 2.5769 train_time: 4.2m tok/s: 7840335 +2507/20000 train_loss: 2.4355 train_time: 4.2m tok/s: 7839128 +2508/20000 train_loss: 2.5869 train_time: 4.2m tok/s: 
7837917 +2509/20000 train_loss: 2.6272 train_time: 4.2m tok/s: 7836711 +2510/20000 train_loss: 2.5320 train_time: 4.2m tok/s: 7835403 +2511/20000 train_loss: 2.5835 train_time: 4.2m tok/s: 7834072 +2512/20000 train_loss: 2.6223 train_time: 4.2m tok/s: 7832841 +2513/20000 train_loss: 2.4795 train_time: 4.2m tok/s: 7831628 +2514/20000 train_loss: 2.5734 train_time: 4.2m tok/s: 7830423 +2515/20000 train_loss: 2.6093 train_time: 4.2m tok/s: 7829249 +2516/20000 train_loss: 2.5989 train_time: 4.2m tok/s: 7828008 +2517/20000 train_loss: 2.4406 train_time: 4.2m tok/s: 7826777 +2518/20000 train_loss: 2.5424 train_time: 4.2m tok/s: 7825542 +2519/20000 train_loss: 2.5843 train_time: 4.2m tok/s: 7824319 +2520/20000 train_loss: 2.5429 train_time: 4.2m tok/s: 7823136 +2521/20000 train_loss: 2.6206 train_time: 4.2m tok/s: 7821978 +2522/20000 train_loss: 2.6235 train_time: 4.2m tok/s: 7820798 +2523/20000 train_loss: 2.5596 train_time: 4.2m tok/s: 7819588 +2524/20000 train_loss: 2.5436 train_time: 4.2m tok/s: 7818368 +2525/20000 train_loss: 2.4467 train_time: 4.2m tok/s: 7817149 +2526/20000 train_loss: 2.5403 train_time: 4.2m tok/s: 7815933 +2527/20000 train_loss: 2.5545 train_time: 4.2m tok/s: 7814650 +2528/20000 train_loss: 2.5993 train_time: 4.2m tok/s: 7813458 +2529/20000 train_loss: 2.5689 train_time: 4.2m tok/s: 7812279 +2530/20000 train_loss: 2.4891 train_time: 4.2m tok/s: 7811067 +2531/20000 train_loss: 2.4053 train_time: 4.2m tok/s: 7809824 +2532/20000 train_loss: 2.5252 train_time: 4.3m tok/s: 7808512 +2533/20000 train_loss: 2.5445 train_time: 4.3m tok/s: 7807325 +2534/20000 train_loss: 2.4602 train_time: 4.3m tok/s: 7806186 +2535/20000 train_loss: 2.5744 train_time: 4.3m tok/s: 7805018 +2536/20000 train_loss: 2.5687 train_time: 4.3m tok/s: 7803791 +2537/20000 train_loss: 2.4723 train_time: 4.3m tok/s: 7802625 +2538/20000 train_loss: 2.5922 train_time: 4.3m tok/s: 7801427 +2539/20000 train_loss: 2.7938 train_time: 4.3m tok/s: 7800237 +2540/20000 train_loss: 2.5502 
train_time: 4.3m tok/s: 7799060 +2541/20000 train_loss: 2.5343 train_time: 4.3m tok/s: 7797938 +2542/20000 train_loss: 2.5524 train_time: 4.3m tok/s: 7796661 +2543/20000 train_loss: 2.6376 train_time: 4.3m tok/s: 7795481 +2544/20000 train_loss: 2.6612 train_time: 4.3m tok/s: 7794277 +2545/20000 train_loss: 2.4944 train_time: 4.3m tok/s: 7793060 +2546/20000 train_loss: 2.5390 train_time: 4.3m tok/s: 7791905 +2547/20000 train_loss: 2.5139 train_time: 4.3m tok/s: 7790699 +2548/20000 train_loss: 2.7853 train_time: 4.3m tok/s: 7789508 +2549/20000 train_loss: 2.5559 train_time: 4.3m tok/s: 7788323 +2550/20000 train_loss: 2.8280 train_time: 4.3m tok/s: 7787171 +2551/20000 train_loss: 2.5324 train_time: 4.3m tok/s: 7786037 +2552/20000 train_loss: 2.7464 train_time: 4.3m tok/s: 7784887 +2553/20000 train_loss: 2.5804 train_time: 4.3m tok/s: 7783644 +2554/20000 train_loss: 2.4682 train_time: 4.3m tok/s: 7782484 +2555/20000 train_loss: 2.5551 train_time: 4.3m tok/s: 7781368 +2556/20000 train_loss: 2.5560 train_time: 4.3m tok/s: 7780129 +2557/20000 train_loss: 2.5085 train_time: 4.3m tok/s: 7778981 +2558/20000 train_loss: 2.6169 train_time: 4.3m tok/s: 7777829 +2559/20000 train_loss: 2.4341 train_time: 4.3m tok/s: 7776636 +2560/20000 train_loss: 2.4488 train_time: 4.3m tok/s: 7775387 +2561/20000 train_loss: 2.5331 train_time: 4.3m tok/s: 7774229 +2562/20000 train_loss: 2.4744 train_time: 4.3m tok/s: 7773144 +2563/20000 train_loss: 2.4462 train_time: 4.3m tok/s: 7771994 +2564/20000 train_loss: 2.4335 train_time: 4.3m tok/s: 7770820 +2565/20000 train_loss: 2.5432 train_time: 4.3m tok/s: 7769668 +2566/20000 train_loss: 2.5634 train_time: 4.3m tok/s: 7768503 +2567/20000 train_loss: 2.5830 train_time: 4.3m tok/s: 7767372 +2568/20000 train_loss: 2.6175 train_time: 4.3m tok/s: 7766219 +2569/20000 train_loss: 2.6583 train_time: 4.3m tok/s: 7765084 +2570/20000 train_loss: 2.5526 train_time: 4.3m tok/s: 7763920 +2571/20000 train_loss: 2.6068 train_time: 4.3m tok/s: 7762797 +2572/20000 
train_loss: 2.4349 train_time: 4.3m tok/s: 7761626 +2573/20000 train_loss: 2.4984 train_time: 4.3m tok/s: 7760433 +2574/20000 train_loss: 2.6835 train_time: 4.3m tok/s: 7759268 +2575/20000 train_loss: 2.5855 train_time: 4.4m tok/s: 7758142 +2576/20000 train_loss: 2.5133 train_time: 4.4m tok/s: 7756977 +2577/20000 train_loss: 2.4820 train_time: 4.4m tok/s: 7755834 +2578/20000 train_loss: 2.4471 train_time: 4.4m tok/s: 7754730 +2579/20000 train_loss: 2.5388 train_time: 4.4m tok/s: 7753611 +2580/20000 train_loss: 2.4652 train_time: 4.4m tok/s: 7752467 +2581/20000 train_loss: 2.3450 train_time: 4.4m tok/s: 7751237 +2582/20000 train_loss: 2.5650 train_time: 4.4m tok/s: 7750095 +2583/20000 train_loss: 2.5925 train_time: 4.4m tok/s: 7748977 +2584/20000 train_loss: 2.5932 train_time: 4.4m tok/s: 7747899 +2585/20000 train_loss: 2.4862 train_time: 4.4m tok/s: 7746736 +2586/20000 train_loss: 2.5117 train_time: 4.4m tok/s: 7745571 +2587/20000 train_loss: 2.5485 train_time: 4.4m tok/s: 7744470 +2588/20000 train_loss: 2.6063 train_time: 4.4m tok/s: 7743335 +2589/20000 train_loss: 2.4818 train_time: 4.4m tok/s: 7742173 +2590/20000 train_loss: 2.5146 train_time: 4.4m tok/s: 7741057 +2591/20000 train_loss: 2.4733 train_time: 4.4m tok/s: 7739969 +2592/20000 train_loss: 2.4318 train_time: 4.4m tok/s: 7738856 +2593/20000 train_loss: 2.5097 train_time: 4.4m tok/s: 7737721 +2594/20000 train_loss: 2.4319 train_time: 4.4m tok/s: 7736603 +2595/20000 train_loss: 2.6088 train_time: 4.4m tok/s: 7735381 +2596/20000 train_loss: 3.0920 train_time: 4.4m tok/s: 7734263 +2597/20000 train_loss: 2.4101 train_time: 4.4m tok/s: 7733180 +2598/20000 train_loss: 2.5104 train_time: 4.4m tok/s: 7732073 +2599/20000 train_loss: 2.6211 train_time: 4.4m tok/s: 7730991 +2600/20000 train_loss: 2.5808 train_time: 4.4m tok/s: 7729778 +2601/20000 train_loss: 2.4974 train_time: 4.4m tok/s: 7728707 +2602/20000 train_loss: 2.7646 train_time: 4.4m tok/s: 7727565 +2603/20000 train_loss: 2.5218 train_time: 4.4m tok/s: 
7726412 +2604/20000 train_loss: 2.5253 train_time: 4.4m tok/s: 7725311 +2605/20000 train_loss: 2.6715 train_time: 4.4m tok/s: 7724151 +2606/20000 train_loss: 2.4213 train_time: 4.4m tok/s: 7723055 +2607/20000 train_loss: 2.4727 train_time: 4.4m tok/s: 7721895 +2608/20000 train_loss: 2.5277 train_time: 4.4m tok/s: 7720809 +2609/20000 train_loss: 2.4881 train_time: 4.4m tok/s: 7719707 +2610/20000 train_loss: 2.4760 train_time: 4.4m tok/s: 7718612 +2611/20000 train_loss: 2.6206 train_time: 4.4m tok/s: 7717506 +2612/20000 train_loss: 2.6816 train_time: 4.4m tok/s: 7716350 +2613/20000 train_loss: 2.5571 train_time: 4.4m tok/s: 7715281 +2614/20000 train_loss: 2.5407 train_time: 4.4m tok/s: 7714255 +2615/20000 train_loss: 2.6715 train_time: 4.4m tok/s: 7713197 +2616/20000 train_loss: 2.5835 train_time: 4.4m tok/s: 7712092 +2617/20000 train_loss: 2.5656 train_time: 4.4m tok/s: 7710988 +2618/20000 train_loss: 2.5697 train_time: 4.5m tok/s: 7709904 +2619/20000 train_loss: 2.4588 train_time: 4.5m tok/s: 7708798 +2620/20000 train_loss: 2.4258 train_time: 4.5m tok/s: 7707667 +2621/20000 train_loss: 2.5021 train_time: 4.5m tok/s: 7706592 +2622/20000 train_loss: 2.4932 train_time: 4.5m tok/s: 7705522 +2623/20000 train_loss: 2.5064 train_time: 4.5m tok/s: 7704392 +2624/20000 train_loss: 2.3494 train_time: 4.5m tok/s: 7703312 +2625/20000 train_loss: 2.6324 train_time: 4.5m tok/s: 7702223 +2626/20000 train_loss: 2.3554 train_time: 4.5m tok/s: 7701133 +2627/20000 train_loss: 2.4657 train_time: 4.5m tok/s: 7700054 +2628/20000 train_loss: 2.6628 train_time: 4.5m tok/s: 7698971 +2629/20000 train_loss: 2.5600 train_time: 4.5m tok/s: 7697935 +2630/20000 train_loss: 2.6181 train_time: 4.5m tok/s: 7696814 +2631/20000 train_loss: 2.5922 train_time: 4.5m tok/s: 7695708 +2632/20000 train_loss: 2.6308 train_time: 4.5m tok/s: 7694665 +2633/20000 train_loss: 2.4802 train_time: 4.5m tok/s: 7693555 +2634/20000 train_loss: 2.5659 train_time: 4.5m tok/s: 7692456 +2635/20000 train_loss: 2.4958 
train_time: 4.5m tok/s: 7691382 +2636/20000 train_loss: 2.5453 train_time: 4.5m tok/s: 7690362 +2637/20000 train_loss: 2.4571 train_time: 4.5m tok/s: 7689283 +2638/20000 train_loss: 2.5383 train_time: 4.5m tok/s: 7688226 +2639/20000 train_loss: 2.2712 train_time: 4.5m tok/s: 7687110 +2640/20000 train_loss: 2.5467 train_time: 4.5m tok/s: 7686050 +2641/20000 train_loss: 2.5931 train_time: 4.5m tok/s: 7684913 +2642/20000 train_loss: 2.6449 train_time: 4.5m tok/s: 7683888 +2643/20000 train_loss: 2.5624 train_time: 4.5m tok/s: 7682768 +2644/20000 train_loss: 2.5861 train_time: 4.5m tok/s: 7681743 +2645/20000 train_loss: 2.5328 train_time: 4.5m tok/s: 7680638 +2646/20000 train_loss: 2.5688 train_time: 4.5m tok/s: 7679575 +2647/20000 train_loss: 2.6612 train_time: 4.5m tok/s: 7678507 +2648/20000 train_loss: 2.5220 train_time: 4.5m tok/s: 7677448 +2649/20000 train_loss: 2.5466 train_time: 4.5m tok/s: 7676361 +2650/20000 train_loss: 2.4688 train_time: 4.5m tok/s: 7675313 +2651/20000 train_loss: 2.4383 train_time: 4.5m tok/s: 7674239 +2652/20000 train_loss: 2.3579 train_time: 4.5m tok/s: 7673171 +2653/20000 train_loss: 2.6546 train_time: 4.5m tok/s: 7672095 +2654/20000 train_loss: 2.2730 train_time: 4.5m tok/s: 7670984 +2655/20000 train_loss: 2.9495 train_time: 4.5m tok/s: 7669820 +2656/20000 train_loss: 2.4532 train_time: 4.5m tok/s: 7668773 +2657/20000 train_loss: 2.4474 train_time: 4.5m tok/s: 7667742 +2658/20000 train_loss: 2.6203 train_time: 4.5m tok/s: 7666681 +2659/20000 train_loss: 2.5341 train_time: 4.5m tok/s: 7665617 +2660/20000 train_loss: 2.5962 train_time: 4.5m tok/s: 7664566 +2661/20000 train_loss: 2.5478 train_time: 4.6m tok/s: 7663557 +2662/20000 train_loss: 2.3368 train_time: 4.6m tok/s: 7662509 +2663/20000 train_loss: 2.7269 train_time: 4.6m tok/s: 7661442 +2664/20000 train_loss: 2.5373 train_time: 4.6m tok/s: 7660347 +2665/20000 train_loss: 2.4948 train_time: 4.6m tok/s: 7659296 +2666/20000 train_loss: 2.4076 train_time: 4.6m tok/s: 7658272 +2667/20000 
train_loss: 2.2946 train_time: 4.6m tok/s: 7657251 +2668/20000 train_loss: 2.5781 train_time: 4.6m tok/s: 7656211 +2669/20000 train_loss: 2.4581 train_time: 4.6m tok/s: 7655221 +2670/20000 train_loss: 2.5704 train_time: 4.6m tok/s: 7654225 +2671/20000 train_loss: 2.6729 train_time: 4.6m tok/s: 7653237 +2672/20000 train_loss: 2.6146 train_time: 4.6m tok/s: 7652256 +2673/20000 train_loss: 2.5676 train_time: 4.6m tok/s: 7651266 +2674/20000 train_loss: 2.6272 train_time: 4.6m tok/s: 7650230 +2675/20000 train_loss: 2.5346 train_time: 4.6m tok/s: 7649206 +2676/20000 train_loss: 2.5240 train_time: 4.6m tok/s: 7648157 +2677/20000 train_loss: 2.4398 train_time: 4.6m tok/s: 7647111 +2678/20000 train_loss: 2.4773 train_time: 4.6m tok/s: 7646095 +2679/20000 train_loss: 2.3259 train_time: 4.6m tok/s: 7645070 +2680/20000 train_loss: 2.4385 train_time: 4.6m tok/s: 7644010 +2681/20000 train_loss: 2.4505 train_time: 4.6m tok/s: 7642992 +2682/20000 train_loss: 2.5416 train_time: 4.6m tok/s: 7642000 +2683/20000 train_loss: 2.4680 train_time: 4.6m tok/s: 7640995 +2684/20000 train_loss: 2.4799 train_time: 4.6m tok/s: 7639915 +2685/20000 train_loss: 2.8149 train_time: 4.6m tok/s: 7638870 +2686/20000 train_loss: 2.5223 train_time: 4.6m tok/s: 7637884 +2687/20000 train_loss: 2.5809 train_time: 4.6m tok/s: 7636870 +2688/20000 train_loss: 2.4575 train_time: 4.6m tok/s: 7635861 +2689/20000 train_loss: 2.5513 train_time: 4.6m tok/s: 7634833 +2690/20000 train_loss: 2.5512 train_time: 4.6m tok/s: 7633782 +2691/20000 train_loss: 2.5320 train_time: 4.6m tok/s: 7632766 +2692/20000 train_loss: 2.4669 train_time: 4.6m tok/s: 7631735 +2693/20000 train_loss: 2.4542 train_time: 4.6m tok/s: 7630685 +2694/20000 train_loss: 2.4774 train_time: 4.6m tok/s: 7629670 +2695/20000 train_loss: 2.5802 train_time: 4.6m tok/s: 7628625 +2696/20000 train_loss: 2.4791 train_time: 4.6m tok/s: 7627603 +2697/20000 train_loss: 2.5606 train_time: 4.6m tok/s: 7626609 +2698/20000 train_loss: 2.5760 train_time: 4.6m tok/s: 
7625627 +2699/20000 train_loss: 2.4795 train_time: 4.6m tok/s: 7624658 +2700/20000 train_loss: 2.4635 train_time: 4.6m tok/s: 7623620 +2701/20000 train_loss: 2.5857 train_time: 4.6m tok/s: 7622620 +2702/20000 train_loss: 2.3709 train_time: 4.6m tok/s: 7621605 +2703/20000 train_loss: 2.5566 train_time: 4.6m tok/s: 7620662 +2704/20000 train_loss: 2.4531 train_time: 4.7m tok/s: 7619597 +2705/20000 train_loss: 2.5107 train_time: 4.7m tok/s: 7618610 +2706/20000 train_loss: 2.5173 train_time: 4.7m tok/s: 7617654 +2707/20000 train_loss: 2.6233 train_time: 4.7m tok/s: 7616619 +2708/20000 train_loss: 2.6539 train_time: 4.7m tok/s: 7615519 +2709/20000 train_loss: 2.5731 train_time: 4.7m tok/s: 7614559 +2710/20000 train_loss: 2.6276 train_time: 4.7m tok/s: 7613514 +2711/20000 train_loss: 2.7054 train_time: 4.7m tok/s: 7612546 +2712/20000 train_loss: 2.4933 train_time: 4.7m tok/s: 7611559 +2713/20000 train_loss: 2.5712 train_time: 4.7m tok/s: 7610559 +2714/20000 train_loss: 2.6236 train_time: 4.7m tok/s: 7609572 +2715/20000 train_loss: 2.4575 train_time: 4.7m tok/s: 7608583 +2716/20000 train_loss: 2.4222 train_time: 4.7m tok/s: 7607607 +2717/20000 train_loss: 2.5256 train_time: 4.7m tok/s: 7606615 +2718/20000 train_loss: 2.4507 train_time: 4.7m tok/s: 7605645 +2719/20000 train_loss: 2.4390 train_time: 4.7m tok/s: 7604699 +2720/20000 train_loss: 2.5693 train_time: 4.7m tok/s: 7603599 +2721/20000 train_loss: 2.4093 train_time: 4.7m tok/s: 7602569 +2722/20000 train_loss: 2.4874 train_time: 4.7m tok/s: 7601582 +2723/20000 train_loss: 2.4600 train_time: 4.7m tok/s: 7600649 +2724/20000 train_loss: 2.5234 train_time: 4.7m tok/s: 7599617 +2725/20000 train_loss: 2.6398 train_time: 4.7m tok/s: 7598625 +2726/20000 train_loss: 2.5041 train_time: 4.7m tok/s: 7597714 +2727/20000 train_loss: 2.5651 train_time: 4.7m tok/s: 7596793 +2728/20000 train_loss: 2.8970 train_time: 4.7m tok/s: 7595810 +2729/20000 train_loss: 2.7025 train_time: 4.7m tok/s: 7594785 +2730/20000 train_loss: 2.5376 
train_time: 4.7m tok/s: 7593820 +2731/20000 train_loss: 2.6119 train_time: 4.7m tok/s: 7592880 +2732/20000 train_loss: 2.6286 train_time: 4.7m tok/s: 7591845 +2733/20000 train_loss: 2.5497 train_time: 4.7m tok/s: 7590861 +2734/20000 train_loss: 2.6255 train_time: 4.7m tok/s: 7589908 +2735/20000 train_loss: 2.4312 train_time: 4.7m tok/s: 7588970 +2736/20000 train_loss: 2.5528 train_time: 4.7m tok/s: 7587997 +2737/20000 train_loss: 2.4805 train_time: 4.7m tok/s: 7587006 +2738/20000 train_loss: 2.4185 train_time: 4.7m tok/s: 7586063 +2739/20000 train_loss: 2.5443 train_time: 4.7m tok/s: 7585069 +2740/20000 train_loss: 2.5599 train_time: 4.7m tok/s: 7584082 +2741/20000 train_loss: 2.5245 train_time: 4.7m tok/s: 7583136 +2742/20000 train_loss: 2.4811 train_time: 4.7m tok/s: 7582169 +2743/20000 train_loss: 2.5859 train_time: 4.7m tok/s: 7581194 +2744/20000 train_loss: 2.5912 train_time: 4.7m tok/s: 7580257 +2745/20000 train_loss: 2.6467 train_time: 4.7m tok/s: 7579285 +2746/20000 train_loss: 2.6209 train_time: 4.7m tok/s: 7578305 +2747/20000 train_loss: 2.4445 train_time: 4.8m tok/s: 7577366 +2748/20000 train_loss: 2.5163 train_time: 4.8m tok/s: 7576403 +2749/20000 train_loss: 2.5826 train_time: 4.8m tok/s: 7575443 +2750/20000 train_loss: 2.6471 train_time: 4.8m tok/s: 7574498 +2751/20000 train_loss: 2.6347 train_time: 4.8m tok/s: 7573499 +2752/20000 train_loss: 2.5050 train_time: 4.8m tok/s: 7572513 +2753/20000 train_loss: 2.4909 train_time: 4.8m tok/s: 7571571 +2754/20000 train_loss: 2.4583 train_time: 4.8m tok/s: 7570620 +2755/20000 train_loss: 2.5011 train_time: 4.8m tok/s: 7569667 +2756/20000 train_loss: 2.4799 train_time: 4.8m tok/s: 7568764 +2757/20000 train_loss: 2.4512 train_time: 4.8m tok/s: 7567794 +2758/20000 train_loss: 2.5955 train_time: 4.8m tok/s: 7566826 +2759/20000 train_loss: 2.4786 train_time: 4.8m tok/s: 7565853 +2760/20000 train_loss: 2.4133 train_time: 4.8m tok/s: 7564906 +2761/20000 train_loss: 2.6421 train_time: 4.8m tok/s: 7563920 +2762/20000 
train_loss: 2.5430 train_time: 4.8m tok/s: 7562961 +2763/20000 train_loss: 2.5971 train_time: 4.8m tok/s: 7562063 +2764/20000 train_loss: 2.5691 train_time: 4.8m tok/s: 7561095 +2765/20000 train_loss: 2.5098 train_time: 4.8m tok/s: 7560134 +2766/20000 train_loss: 2.4626 train_time: 4.8m tok/s: 7559207 +2767/20000 train_loss: 2.5082 train_time: 4.8m tok/s: 7558243 +2768/20000 train_loss: 2.6431 train_time: 4.8m tok/s: 7557313 +2769/20000 train_loss: 2.5509 train_time: 4.8m tok/s: 7556323 +2770/20000 train_loss: 2.7006 train_time: 4.8m tok/s: 7555402 +2771/20000 train_loss: 2.5085 train_time: 4.8m tok/s: 7554455 +2772/20000 train_loss: 2.5344 train_time: 4.8m tok/s: 7553552 +2773/20000 train_loss: 2.5133 train_time: 4.8m tok/s: 7552596 +2774/20000 train_loss: 2.4703 train_time: 4.8m tok/s: 7551649 +2775/20000 train_loss: 2.4719 train_time: 4.8m tok/s: 7550750 +2776/20000 train_loss: 2.4181 train_time: 4.8m tok/s: 7549852 +2777/20000 train_loss: 2.4830 train_time: 4.8m tok/s: 7548897 +2778/20000 train_loss: 2.5296 train_time: 4.8m tok/s: 7547938 +2779/20000 train_loss: 2.4021 train_time: 4.8m tok/s: 7546973 +2780/20000 train_loss: 2.6585 train_time: 4.8m tok/s: 7546015 +2781/20000 train_loss: 2.6381 train_time: 4.8m tok/s: 7545068 +2782/20000 train_loss: 2.4623 train_time: 4.8m tok/s: 7544160 +2783/20000 train_loss: 2.6268 train_time: 4.8m tok/s: 7543249 +2784/20000 train_loss: 2.6523 train_time: 4.8m tok/s: 7542347 +2785/20000 train_loss: 2.5230 train_time: 4.8m tok/s: 7541431 +2786/20000 train_loss: 2.5267 train_time: 4.8m tok/s: 7540514 +2787/20000 train_loss: 2.5843 train_time: 4.8m tok/s: 7539613 +2788/20000 train_loss: 2.4281 train_time: 4.8m tok/s: 7538669 +2789/20000 train_loss: 2.5450 train_time: 4.8m tok/s: 7537776 +2790/20000 train_loss: 2.6076 train_time: 4.9m tok/s: 7536810 +2791/20000 train_loss: 2.3673 train_time: 4.9m tok/s: 7535817 +2792/20000 train_loss: 2.5218 train_time: 4.9m tok/s: 7534904 +2793/20000 train_loss: 2.5441 train_time: 4.9m tok/s: 
7533992 +2794/20000 train_loss: 2.5052 train_time: 4.9m tok/s: 7533073 +2795/20000 train_loss: 2.3813 train_time: 4.9m tok/s: 7532158 +2796/20000 train_loss: 2.5163 train_time: 4.9m tok/s: 7531254 +2797/20000 train_loss: 2.5514 train_time: 4.9m tok/s: 7530337 +2798/20000 train_loss: 2.5183 train_time: 4.9m tok/s: 7529417 +2799/20000 train_loss: 2.7850 train_time: 4.9m tok/s: 7528504 +2800/20000 train_loss: 2.6321 train_time: 4.9m tok/s: 7527603 +2801/20000 train_loss: 2.5022 train_time: 4.9m tok/s: 7526711 +2802/20000 train_loss: 2.5169 train_time: 4.9m tok/s: 7525809 +2803/20000 train_loss: 2.5934 train_time: 4.9m tok/s: 7524852 +2804/20000 train_loss: 2.6347 train_time: 4.9m tok/s: 7523917 +2805/20000 train_loss: 2.4800 train_time: 4.9m tok/s: 7523050 +2806/20000 train_loss: 2.5071 train_time: 4.9m tok/s: 7522107 +2807/20000 train_loss: 2.6591 train_time: 4.9m tok/s: 7521185 +2808/20000 train_loss: 2.5892 train_time: 4.9m tok/s: 7520203 +2809/20000 train_loss: 2.4445 train_time: 4.9m tok/s: 7519285 +2810/20000 train_loss: 2.5282 train_time: 4.9m tok/s: 7518417 +2811/20000 train_loss: 2.5964 train_time: 4.9m tok/s: 7517494 +2812/20000 train_loss: 2.5962 train_time: 4.9m tok/s: 7516603 +2813/20000 train_loss: 2.3920 train_time: 4.9m tok/s: 7515706 +2814/20000 train_loss: 2.5384 train_time: 4.9m tok/s: 7514828 +2815/20000 train_loss: 2.6735 train_time: 4.9m tok/s: 7513949 +2816/20000 train_loss: 2.6361 train_time: 4.9m tok/s: 7513037 +2817/20000 train_loss: 2.5459 train_time: 4.9m tok/s: 7512118 +2818/20000 train_loss: 2.5633 train_time: 4.9m tok/s: 7511232 +2819/20000 train_loss: 2.4764 train_time: 4.9m tok/s: 7510377 +2820/20000 train_loss: 2.5001 train_time: 4.9m tok/s: 7509481 +2821/20000 train_loss: 2.4404 train_time: 4.9m tok/s: 7508569 +2822/20000 train_loss: 2.7942 train_time: 4.9m tok/s: 7507610 +2823/20000 train_loss: 2.6234 train_time: 4.9m tok/s: 7506711 +2824/20000 train_loss: 2.6702 train_time: 4.9m tok/s: 7505815 +2825/20000 train_loss: 2.4817 
train_time: 4.9m tok/s: 7504922 +2826/20000 train_loss: 2.5658 train_time: 4.9m tok/s: 7504060 +2827/20000 train_loss: 2.4489 train_time: 4.9m tok/s: 7503198 +2828/20000 train_loss: 2.6232 train_time: 4.9m tok/s: 7502314 +2829/20000 train_loss: 2.4140 train_time: 4.9m tok/s: 7501448 +2830/20000 train_loss: 2.5097 train_time: 4.9m tok/s: 7500549 +2831/20000 train_loss: 2.7177 train_time: 4.9m tok/s: 7499659 +2832/20000 train_loss: 2.5260 train_time: 5.0m tok/s: 7498816 +2833/20000 train_loss: 2.6904 train_time: 5.0m tok/s: 7497945 +2834/20000 train_loss: 2.6334 train_time: 5.0m tok/s: 7497045 +2835/20000 train_loss: 2.5948 train_time: 5.0m tok/s: 7496146 +2836/20000 train_loss: 2.5005 train_time: 5.0m tok/s: 7495266 +2837/20000 train_loss: 2.5803 train_time: 5.0m tok/s: 7494384 +2838/20000 train_loss: 2.5482 train_time: 5.0m tok/s: 7493499 +2839/20000 train_loss: 2.5278 train_time: 5.0m tok/s: 7492612 +2840/20000 train_loss: 2.6128 train_time: 5.0m tok/s: 7491711 +2841/20000 train_loss: 2.5501 train_time: 5.0m tok/s: 7490856 +2842/20000 train_loss: 2.6271 train_time: 5.0m tok/s: 7489950 +2843/20000 train_loss: 2.4480 train_time: 5.0m tok/s: 7489076 +2844/20000 train_loss: 2.4383 train_time: 5.0m tok/s: 7488199 +2845/20000 train_loss: 2.5225 train_time: 5.0m tok/s: 7487359 +2846/20000 train_loss: 2.3911 train_time: 5.0m tok/s: 7486504 +2847/20000 train_loss: 2.4269 train_time: 5.0m tok/s: 7485588 +2848/20000 train_loss: 2.5601 train_time: 5.0m tok/s: 7484731 +2849/20000 train_loss: 2.6292 train_time: 5.0m tok/s: 7483833 +2850/20000 train_loss: 2.5697 train_time: 5.0m tok/s: 7482968 +2851/20000 train_loss: 2.7773 train_time: 5.0m tok/s: 7482099 +2852/20000 train_loss: 2.4619 train_time: 5.0m tok/s: 7481196 +2853/20000 train_loss: 2.5345 train_time: 5.0m tok/s: 7480332 +2854/20000 train_loss: 2.4549 train_time: 5.0m tok/s: 7479467 +2855/20000 train_loss: 2.6358 train_time: 5.0m tok/s: 7478592 +2856/20000 train_loss: 2.4699 train_time: 5.0m tok/s: 7477733 +2857/20000 
train_loss: 2.5387 train_time: 5.0m tok/s: 7476907 +2858/20000 train_loss: 2.5087 train_time: 5.0m tok/s: 7476045 +2859/20000 train_loss: 3.1483 train_time: 5.0m tok/s: 7475089 +2860/20000 train_loss: 2.4926 train_time: 5.0m tok/s: 7474190 +2861/20000 train_loss: 2.4970 train_time: 5.0m tok/s: 7473352 +2862/20000 train_loss: 2.5140 train_time: 5.0m tok/s: 7472533 +2863/20000 train_loss: 2.3535 train_time: 5.0m tok/s: 7471671 +2864/20000 train_loss: 2.3840 train_time: 5.0m tok/s: 7470838 +2865/20000 train_loss: 2.6179 train_time: 5.0m tok/s: 7469988 +2866/20000 train_loss: 2.5247 train_time: 5.0m tok/s: 7469158 +2867/20000 train_loss: 2.3925 train_time: 5.0m tok/s: 7468257 +2868/20000 train_loss: 2.4325 train_time: 5.0m tok/s: 7467412 +2869/20000 train_loss: 2.5621 train_time: 5.0m tok/s: 7466550 +2870/20000 train_loss: 2.6573 train_time: 5.0m tok/s: 7465687 +2871/20000 train_loss: 2.4583 train_time: 5.0m tok/s: 7464885 +2872/20000 train_loss: 3.0349 train_time: 5.0m tok/s: 7463969 +2873/20000 train_loss: 2.4699 train_time: 5.0m tok/s: 7463021 +2874/20000 train_loss: 2.6147 train_time: 5.0m tok/s: 7462181 +2875/20000 train_loss: 2.5329 train_time: 5.1m tok/s: 7461336 +2876/20000 train_loss: 2.5894 train_time: 5.1m tok/s: 7460509 +2877/20000 train_loss: 2.5158 train_time: 5.1m tok/s: 7459627 +2878/20000 train_loss: 2.4988 train_time: 5.1m tok/s: 7458812 +2879/20000 train_loss: 2.5379 train_time: 5.1m tok/s: 7457974 +2880/20000 train_loss: 2.5457 train_time: 5.1m tok/s: 7457151 +2881/20000 train_loss: 2.6032 train_time: 5.1m tok/s: 7456342 +2882/20000 train_loss: 2.6876 train_time: 5.1m tok/s: 7455483 +2883/20000 train_loss: 2.6427 train_time: 5.1m tok/s: 7454572 +2884/20000 train_loss: 2.6015 train_time: 5.1m tok/s: 7453727 +2885/20000 train_loss: 2.5577 train_time: 5.1m tok/s: 7452865 +2886/20000 train_loss: 2.5555 train_time: 5.1m tok/s: 7452066 +2887/20000 train_loss: 2.5227 train_time: 5.1m tok/s: 7451218 +2888/20000 train_loss: 2.6214 train_time: 5.1m tok/s: 
7450368 +2889/20000 train_loss: 2.5733 train_time: 5.1m tok/s: 7449526 +2890/20000 train_loss: 2.5909 train_time: 5.1m tok/s: 7448682 +2891/20000 train_loss: 2.5321 train_time: 5.1m tok/s: 7447883 +2892/20000 train_loss: 2.4688 train_time: 5.1m tok/s: 7447052 +2893/20000 train_loss: 2.3155 train_time: 5.1m tok/s: 7446225 +2894/20000 train_loss: 2.5542 train_time: 5.1m tok/s: 7445369 +2895/20000 train_loss: 2.5221 train_time: 5.1m tok/s: 7444543 +2896/20000 train_loss: 2.4936 train_time: 5.1m tok/s: 7443706 +2897/20000 train_loss: 2.5728 train_time: 5.1m tok/s: 7442888 +2898/20000 train_loss: 2.5634 train_time: 5.1m tok/s: 7442091 +2899/20000 train_loss: 2.5656 train_time: 5.1m tok/s: 7441253 +2900/20000 train_loss: 2.6114 train_time: 5.1m tok/s: 7440416 +2901/20000 train_loss: 2.4223 train_time: 5.1m tok/s: 7439564 +2902/20000 train_loss: 2.5372 train_time: 5.1m tok/s: 7438700 +2903/20000 train_loss: 2.4670 train_time: 5.1m tok/s: 7437844 +2904/20000 train_loss: 2.4634 train_time: 5.1m tok/s: 7437028 +2905/20000 train_loss: 2.6146 train_time: 5.1m tok/s: 7436191 +2906/20000 train_loss: 2.4180 train_time: 5.1m tok/s: 7435381 +2907/20000 train_loss: 2.4593 train_time: 5.1m tok/s: 7434572 +2908/20000 train_loss: 2.5191 train_time: 5.1m tok/s: 7433770 +2909/20000 train_loss: 2.5161 train_time: 5.1m tok/s: 7432963 +2910/20000 train_loss: 2.3910 train_time: 5.1m tok/s: 7432121 +2911/20000 train_loss: 2.5953 train_time: 5.1m tok/s: 7431292 +2912/20000 train_loss: 2.5398 train_time: 5.1m tok/s: 7430425 +2913/20000 train_loss: 2.5704 train_time: 5.1m tok/s: 7429619 +2914/20000 train_loss: 2.6247 train_time: 5.1m tok/s: 7428801 +2915/20000 train_loss: 2.5378 train_time: 5.1m tok/s: 7427980 +2916/20000 train_loss: 2.4730 train_time: 5.1m tok/s: 7427144 +2917/20000 train_loss: 2.5247 train_time: 5.1m tok/s: 7426335 +2918/20000 train_loss: 2.8612 train_time: 5.2m tok/s: 7425527 +2919/20000 train_loss: 2.6980 train_time: 5.2m tok/s: 7424688 +2920/20000 train_loss: 2.4797 
train_time: 5.2m tok/s: 7423857 +2921/20000 train_loss: 2.4286 train_time: 5.2m tok/s: 7423049 +2922/20000 train_loss: 2.4657 train_time: 5.2m tok/s: 7422237 +2923/20000 train_loss: 2.4552 train_time: 5.2m tok/s: 7421441 +2924/20000 train_loss: 2.5767 train_time: 5.2m tok/s: 7420611 +2925/20000 train_loss: 2.6226 train_time: 5.2m tok/s: 7419776 +2926/20000 train_loss: 2.5006 train_time: 5.2m tok/s: 7418934 +2927/20000 train_loss: 2.5176 train_time: 5.2m tok/s: 7418125 +2928/20000 train_loss: 2.3554 train_time: 5.2m tok/s: 7417341 +2929/20000 train_loss: 2.5624 train_time: 5.2m tok/s: 7416526 +2930/20000 train_loss: 2.4513 train_time: 5.2m tok/s: 7415728 +2931/20000 train_loss: 2.4476 train_time: 5.2m tok/s: 7414924 +2932/20000 train_loss: 2.4269 train_time: 5.2m tok/s: 7414114 +2933/20000 train_loss: 2.5253 train_time: 5.2m tok/s: 7413312 +2934/20000 train_loss: 2.5403 train_time: 5.2m tok/s: 7412497 +2935/20000 train_loss: 2.5275 train_time: 5.2m tok/s: 7411697 +2936/20000 train_loss: 2.5200 train_time: 5.2m tok/s: 7410891 +2937/20000 train_loss: 2.6509 train_time: 5.2m tok/s: 7410096 +2938/20000 train_loss: 2.5963 train_time: 5.2m tok/s: 7409313 +2939/20000 train_loss: 2.5735 train_time: 5.2m tok/s: 7408511 +2940/20000 train_loss: 2.3908 train_time: 5.2m tok/s: 7407684 +2941/20000 train_loss: 2.6243 train_time: 5.2m tok/s: 7406874 +2942/20000 train_loss: 2.5333 train_time: 5.2m tok/s: 7406056 +2943/20000 train_loss: 2.4281 train_time: 5.2m tok/s: 7405239 +2944/20000 train_loss: 2.4187 train_time: 5.2m tok/s: 7404433 +2945/20000 train_loss: 2.5183 train_time: 5.2m tok/s: 7403622 +2946/20000 train_loss: 2.4905 train_time: 5.2m tok/s: 7402858 +2947/20000 train_loss: 2.4926 train_time: 5.2m tok/s: 7402000 +2948/20000 train_loss: 2.4752 train_time: 5.2m tok/s: 7401234 +2949/20000 train_loss: 2.5488 train_time: 5.2m tok/s: 7400466 +2950/20000 train_loss: 2.6375 train_time: 5.2m tok/s: 7399651 +2951/20000 train_loss: 2.7267 train_time: 5.2m tok/s: 7398852 +2952/20000 
train_loss: 2.5537 train_time: 5.2m tok/s: 7398047 +2953/20000 train_loss: 2.5061 train_time: 5.2m tok/s: 7397271 +2954/20000 train_loss: 2.4781 train_time: 5.2m tok/s: 7396467 +2955/20000 train_loss: 2.4688 train_time: 5.2m tok/s: 7395679 +2956/20000 train_loss: 2.4647 train_time: 5.2m tok/s: 7394898 +2957/20000 train_loss: 2.7222 train_time: 5.2m tok/s: 7394082 +2958/20000 train_loss: 2.4373 train_time: 5.2m tok/s: 7393314 +2959/20000 train_loss: 2.4376 train_time: 5.2m tok/s: 7392559 +2960/20000 train_loss: 2.4290 train_time: 5.2m tok/s: 7391765 +2961/20000 train_loss: 2.5371 train_time: 5.3m tok/s: 7390973 +2962/20000 train_loss: 2.4780 train_time: 5.3m tok/s: 7390220 +2963/20000 train_loss: 2.6915 train_time: 5.3m tok/s: 7389415 +2964/20000 train_loss: 2.5893 train_time: 5.3m tok/s: 7388626 +2965/20000 train_loss: 2.6999 train_time: 5.3m tok/s: 7387833 +2966/20000 train_loss: 2.5813 train_time: 5.3m tok/s: 7387061 +2967/20000 train_loss: 2.5507 train_time: 5.3m tok/s: 7386270 +2968/20000 train_loss: 2.4840 train_time: 5.3m tok/s: 7385511 +2969/20000 train_loss: 2.6636 train_time: 5.3m tok/s: 7384699 +2970/20000 train_loss: 2.4755 train_time: 5.3m tok/s: 7383896 +2971/20000 train_loss: 2.4880 train_time: 5.3m tok/s: 7383111 +2972/20000 train_loss: 2.5237 train_time: 5.3m tok/s: 7382357 +2973/20000 train_loss: 2.4493 train_time: 5.3m tok/s: 7381575 +2974/20000 train_loss: 2.5862 train_time: 5.3m tok/s: 7380780 +2975/20000 train_loss: 2.5886 train_time: 5.3m tok/s: 7379997 +2976/20000 train_loss: 2.5513 train_time: 5.3m tok/s: 7379222 +2977/20000 train_loss: 2.4867 train_time: 5.3m tok/s: 7378453 +2978/20000 train_loss: 2.5806 train_time: 5.3m tok/s: 7377655 +2979/20000 train_loss: 2.4800 train_time: 5.3m tok/s: 7376888 +2980/20000 train_loss: 2.5836 train_time: 5.3m tok/s: 7376110 +2981/20000 train_loss: 2.3958 train_time: 5.3m tok/s: 7375305 +2982/20000 train_loss: 2.6258 train_time: 5.3m tok/s: 7374508 +2983/20000 train_loss: 2.4155 train_time: 5.3m tok/s: 
7373774 +2984/20000 train_loss: 2.4495 train_time: 5.3m tok/s: 7372990 +2985/20000 train_loss: 2.4490 train_time: 5.3m tok/s: 7372230 +2986/20000 train_loss: 2.5737 train_time: 5.3m tok/s: 7371468 +2987/20000 train_loss: 2.4999 train_time: 5.3m tok/s: 7370646 +2988/20000 train_loss: 2.5546 train_time: 5.3m tok/s: 7369929 +2989/20000 train_loss: 2.4322 train_time: 5.3m tok/s: 7369150 +2990/20000 train_loss: 2.7025 train_time: 5.3m tok/s: 7368363 +2991/20000 train_loss: 2.4831 train_time: 5.3m tok/s: 7367619 +2992/20000 train_loss: 2.6001 train_time: 5.3m tok/s: 7366883 +2993/20000 train_loss: 2.6531 train_time: 5.3m tok/s: 7366052 +2994/20000 train_loss: 2.4351 train_time: 5.3m tok/s: 7365286 +2995/20000 train_loss: 2.6427 train_time: 5.3m tok/s: 7364499 +2996/20000 train_loss: 2.4320 train_time: 5.3m tok/s: 7363734 +2997/20000 train_loss: 2.5787 train_time: 5.3m tok/s: 7362975 +2998/20000 train_loss: 2.4854 train_time: 5.3m tok/s: 7362222 +2999/20000 train_loss: 2.4685 train_time: 5.3m tok/s: 7361471 +3000/20000 train_loss: 2.4860 train_time: 5.3m tok/s: 7360666 +3001/20000 train_loss: 2.5500 train_time: 5.3m tok/s: 7359888 +3002/20000 train_loss: 2.5080 train_time: 5.3m tok/s: 7359159 +3003/20000 train_loss: 2.5388 train_time: 5.3m tok/s: 7358409 +3004/20000 train_loss: 2.4660 train_time: 5.4m tok/s: 7357668 +3005/20000 train_loss: 2.5411 train_time: 5.4m tok/s: 7356924 +3006/20000 train_loss: 2.7568 train_time: 5.4m tok/s: 7356159 +3007/20000 train_loss: 2.5046 train_time: 5.4m tok/s: 7355408 +3008/20000 train_loss: 2.4979 train_time: 5.4m tok/s: 7354668 +3009/20000 train_loss: 2.4757 train_time: 5.4m tok/s: 7353895 +3010/20000 train_loss: 2.4119 train_time: 5.4m tok/s: 7353086 +3011/20000 train_loss: 2.5777 train_time: 5.4m tok/s: 7352307 +3012/20000 train_loss: 2.5513 train_time: 5.4m tok/s: 7351591 +3013/20000 train_loss: 2.5197 train_time: 5.4m tok/s: 7350853 +3014/20000 train_loss: 2.5639 train_time: 5.4m tok/s: 7350148 +3015/20000 train_loss: 2.6657 
train_time: 5.4m tok/s: 7349393 +3016/20000 train_loss: 2.6889 train_time: 5.4m tok/s: 7348621 +3017/20000 train_loss: 2.4739 train_time: 5.4m tok/s: 7347854 +3018/20000 train_loss: 2.6194 train_time: 5.4m tok/s: 7347142 +3019/20000 train_loss: 2.4570 train_time: 5.4m tok/s: 7346393 +3020/20000 train_loss: 3.1212 train_time: 5.4m tok/s: 7345579 +3021/20000 train_loss: 2.4699 train_time: 5.4m tok/s: 7344833 +3022/20000 train_loss: 2.4561 train_time: 5.4m tok/s: 7344113 +3023/20000 train_loss: 2.5763 train_time: 5.4m tok/s: 7343318 +3024/20000 train_loss: 3.4371 train_time: 5.4m tok/s: 7342457 +3025/20000 train_loss: 2.4439 train_time: 5.4m tok/s: 7341699 +3026/20000 train_loss: 2.5290 train_time: 5.4m tok/s: 7340896 +3027/20000 train_loss: 2.5665 train_time: 5.4m tok/s: 7340159 +3028/20000 train_loss: 2.6507 train_time: 5.4m tok/s: 7339423 +3029/20000 train_loss: 2.7223 train_time: 5.4m tok/s: 7338729 +3030/20000 train_loss: 2.5703 train_time: 5.4m tok/s: 7338021 +3031/20000 train_loss: 2.5134 train_time: 5.4m tok/s: 7337290 +3032/20000 train_loss: 2.5435 train_time: 5.4m tok/s: 7336538 +3033/20000 train_loss: 2.5709 train_time: 5.4m tok/s: 7335844 +3034/20000 train_loss: 2.4863 train_time: 5.4m tok/s: 7335149 +3035/20000 train_loss: 2.3081 train_time: 5.4m tok/s: 7334416 +3036/20000 train_loss: 2.5762 train_time: 5.4m tok/s: 7333684 +3037/20000 train_loss: 2.4894 train_time: 5.4m tok/s: 7332966 +3038/20000 train_loss: 2.5284 train_time: 5.4m tok/s: 7332246 +3039/20000 train_loss: 2.5183 train_time: 5.4m tok/s: 7331522 +3040/20000 train_loss: 2.4118 train_time: 5.4m tok/s: 7330792 +3041/20000 train_loss: 2.6325 train_time: 5.4m tok/s: 7330055 +3042/20000 train_loss: 2.5827 train_time: 5.4m tok/s: 7329348 +3043/20000 train_loss: 2.6451 train_time: 5.4m tok/s: 7328556 +3044/20000 train_loss: 2.5934 train_time: 5.4m tok/s: 7327868 +3045/20000 train_loss: 2.5812 train_time: 5.4m tok/s: 7327130 +3046/20000 train_loss: 2.5608 train_time: 5.4m tok/s: 7326408 +3047/20000 
train_loss: 2.3915 train_time: 5.5m tok/s: 7325654 +3048/20000 train_loss: 2.3591 train_time: 5.5m tok/s: 7324959 +3049/20000 train_loss: 2.5097 train_time: 5.5m tok/s: 7324211 +3050/20000 train_loss: 2.6015 train_time: 5.5m tok/s: 7323473 +3051/20000 train_loss: 2.3650 train_time: 5.5m tok/s: 7322774 +3052/20000 train_loss: 2.4220 train_time: 5.5m tok/s: 7322039 +3053/20000 train_loss: 2.6089 train_time: 5.5m tok/s: 7321270 +3054/20000 train_loss: 2.4221 train_time: 5.5m tok/s: 7320541 +3055/20000 train_loss: 2.5340 train_time: 5.5m tok/s: 7319819 +3056/20000 train_loss: 2.5458 train_time: 5.5m tok/s: 7319099 +3057/20000 train_loss: 2.5058 train_time: 5.5m tok/s: 7318381 +3058/20000 train_loss: 2.5496 train_time: 5.5m tok/s: 7317668 +3059/20000 train_loss: 2.4637 train_time: 5.5m tok/s: 7316955 +3060/20000 train_loss: 2.5736 train_time: 5.5m tok/s: 7316231 +3061/20000 train_loss: 2.4445 train_time: 5.5m tok/s: 7315500 +3062/20000 train_loss: 2.5938 train_time: 5.5m tok/s: 7314773 +3063/20000 train_loss: 2.5459 train_time: 5.5m tok/s: 7314053 +3064/20000 train_loss: 2.4400 train_time: 5.5m tok/s: 7313310 +3065/20000 train_loss: 2.4171 train_time: 5.5m tok/s: 7312612 +3066/20000 train_loss: 2.6747 train_time: 5.5m tok/s: 7311839 +3067/20000 train_loss: 2.4599 train_time: 5.5m tok/s: 7311147 +3068/20000 train_loss: 2.6009 train_time: 5.5m tok/s: 7310415 +3069/20000 train_loss: 2.3988 train_time: 5.5m tok/s: 7309493 +3070/20000 train_loss: 2.5602 train_time: 5.5m tok/s: 7308683 +3071/20000 train_loss: 2.5112 train_time: 5.5m tok/s: 7307923 +3072/20000 train_loss: 2.5408 train_time: 5.5m tok/s: 7307064 +3073/20000 train_loss: 2.6066 train_time: 5.5m tok/s: 7306394 +3074/20000 train_loss: 2.4282 train_time: 5.5m tok/s: 7305474 +3075/20000 train_loss: 2.4153 train_time: 5.5m tok/s: 7304774 +3076/20000 train_loss: 2.4770 train_time: 5.5m tok/s: 7303958 +3077/20000 train_loss: 2.5302 train_time: 5.5m tok/s: 7303292 +3078/20000 train_loss: 2.3932 train_time: 5.5m tok/s: 
7302540 +3079/20000 train_loss: 2.4346 train_time: 5.5m tok/s: 7301733 +3080/20000 train_loss: 3.1981 train_time: 5.5m tok/s: 7300929 +3081/20000 train_loss: 2.3831 train_time: 5.5m tok/s: 7300099 +3082/20000 train_loss: 2.4426 train_time: 5.5m tok/s: 7299423 +3083/20000 train_loss: 2.4693 train_time: 5.5m tok/s: 7298593 +3084/20000 train_loss: 2.4718 train_time: 5.5m tok/s: 7297889 +3085/20000 train_loss: 2.5361 train_time: 5.5m tok/s: 7297004 +3086/20000 train_loss: 2.5748 train_time: 5.5m tok/s: 7296305 +3087/20000 train_loss: 2.6139 train_time: 5.5m tok/s: 7295600 +3088/20000 train_loss: 2.5760 train_time: 5.5m tok/s: 7294913 +3089/20000 train_loss: 2.4508 train_time: 5.6m tok/s: 7294205 +3090/20000 train_loss: 2.6975 train_time: 5.6m tok/s: 7293534 +3091/20000 train_loss: 2.4464 train_time: 5.6m tok/s: 7292854 +3092/20000 train_loss: 2.4826 train_time: 5.6m tok/s: 7292123 +3093/20000 train_loss: 2.5253 train_time: 5.6m tok/s: 7291405 +3094/20000 train_loss: 2.4785 train_time: 5.6m tok/s: 7290678 +3095/20000 train_loss: 2.3325 train_time: 5.6m tok/s: 7289971 +3096/20000 train_loss: 2.5155 train_time: 5.6m tok/s: 7289256 +3097/20000 train_loss: 2.5556 train_time: 5.6m tok/s: 7288580 +3098/20000 train_loss: 2.4706 train_time: 5.6m tok/s: 7287893 +3099/20000 train_loss: 2.3260 train_time: 5.6m tok/s: 7287172 +3100/20000 train_loss: 2.4126 train_time: 5.6m tok/s: 7286466 +3101/20000 train_loss: 2.6686 train_time: 5.6m tok/s: 7285794 +3102/20000 train_loss: 2.6482 train_time: 5.6m tok/s: 7285078 +3103/20000 train_loss: 2.4882 train_time: 5.6m tok/s: 7284400 +3104/20000 train_loss: 2.5867 train_time: 5.6m tok/s: 7283720 +3105/20000 train_loss: 2.3819 train_time: 5.6m tok/s: 7283031 +3106/20000 train_loss: 2.5882 train_time: 5.6m tok/s: 7282332 +3107/20000 train_loss: 2.3385 train_time: 5.6m tok/s: 7281590 +3108/20000 train_loss: 2.4734 train_time: 5.6m tok/s: 7280895 +3109/20000 train_loss: 2.5577 train_time: 5.6m tok/s: 7280224 +3110/20000 train_loss: 2.4209 
train_time: 5.6m tok/s: 7279562 +3111/20000 train_loss: 2.4370 train_time: 5.6m tok/s: 7278844 +3112/20000 train_loss: 2.3920 train_time: 5.6m tok/s: 7278157 +3113/20000 train_loss: 2.5282 train_time: 5.6m tok/s: 7277463 +3114/20000 train_loss: 2.5434 train_time: 5.6m tok/s: 7276765 +3115/20000 train_loss: 2.5668 train_time: 5.6m tok/s: 7276081 +3116/20000 train_loss: 2.5805 train_time: 5.6m tok/s: 7275391 +3117/20000 train_loss: 2.6045 train_time: 5.6m tok/s: 7274723 +3118/20000 train_loss: 2.5397 train_time: 5.6m tok/s: 7274036 +3119/20000 train_loss: 2.5678 train_time: 5.6m tok/s: 7273336 +3120/20000 train_loss: 2.5547 train_time: 5.6m tok/s: 7272662 +3121/20000 train_loss: 2.5356 train_time: 5.6m tok/s: 7271965 +3122/20000 train_loss: 2.5542 train_time: 5.6m tok/s: 7271274 +3123/20000 train_loss: 2.4328 train_time: 5.6m tok/s: 7270607 +3124/20000 train_loss: 2.5316 train_time: 5.6m tok/s: 7269923 +3125/20000 train_loss: 2.4327 train_time: 5.6m tok/s: 7269220 +3126/20000 train_loss: 2.1292 train_time: 5.6m tok/s: 7268500 +3127/20000 train_loss: 2.5990 train_time: 5.6m tok/s: 7267801 +3128/20000 train_loss: 2.5021 train_time: 5.6m tok/s: 7267105 +3129/20000 train_loss: 2.5328 train_time: 5.6m tok/s: 7266445 +3130/20000 train_loss: 2.4766 train_time: 5.6m tok/s: 7265757 +3131/20000 train_loss: 2.4455 train_time: 5.6m tok/s: 7265071 +3132/20000 train_loss: 2.5702 train_time: 5.7m tok/s: 7264396 +3133/20000 train_loss: 2.5178 train_time: 5.7m tok/s: 7263744 +3134/20000 train_loss: 2.5491 train_time: 5.7m tok/s: 7263066 +3135/20000 train_loss: 2.5047 train_time: 5.7m tok/s: 7262384 +3136/20000 train_loss: 2.5704 train_time: 5.7m tok/s: 7261718 +3137/20000 train_loss: 2.6041 train_time: 5.7m tok/s: 7261028 +3138/20000 train_loss: 2.4472 train_time: 5.7m tok/s: 7260343 +3139/20000 train_loss: 2.4493 train_time: 5.7m tok/s: 7259688 +3140/20000 train_loss: 2.4276 train_time: 5.7m tok/s: 7258999 +3141/20000 train_loss: 2.5741 train_time: 5.7m tok/s: 7258288 +3142/20000 
train_loss: 2.5461 train_time: 5.7m tok/s: 7257624 +3143/20000 train_loss: 2.5506 train_time: 5.7m tok/s: 7256941 +3144/20000 train_loss: 2.5437 train_time: 5.7m tok/s: 7256288 +3145/20000 train_loss: 2.0732 train_time: 5.7m tok/s: 7255576 +3146/20000 train_loss: 2.4535 train_time: 5.7m tok/s: 7254902 +3147/20000 train_loss: 2.5530 train_time: 5.7m tok/s: 7254237 +3148/20000 train_loss: 2.4819 train_time: 5.7m tok/s: 7253540 +3149/20000 train_loss: 2.5958 train_time: 5.7m tok/s: 7252892 +3150/20000 train_loss: 2.3915 train_time: 5.7m tok/s: 7252201 +3151/20000 train_loss: 2.4099 train_time: 5.7m tok/s: 7251544 +3152/20000 train_loss: 2.4800 train_time: 5.7m tok/s: 7250898 +3153/20000 train_loss: 2.5204 train_time: 5.7m tok/s: 7250255 +3154/20000 train_loss: 2.3756 train_time: 5.7m tok/s: 7249613 +3155/20000 train_loss: 2.4738 train_time: 5.7m tok/s: 7248922 +3156/20000 train_loss: 2.5923 train_time: 5.7m tok/s: 7248244 +3157/20000 train_loss: 2.5958 train_time: 5.7m tok/s: 7247588 +3158/20000 train_loss: 2.6428 train_time: 5.7m tok/s: 7246909 +3159/20000 train_loss: 2.5233 train_time: 5.7m tok/s: 7246255 +3160/20000 train_loss: 2.3706 train_time: 5.7m tok/s: 7245580 +3161/20000 train_loss: 2.5399 train_time: 5.7m tok/s: 7244920 +3162/20000 train_loss: 2.6273 train_time: 5.7m tok/s: 7244293 +3163/20000 train_loss: 2.5473 train_time: 5.7m tok/s: 7243613 +3164/20000 train_loss: 2.3727 train_time: 5.7m tok/s: 7242927 +3165/20000 train_loss: 2.5085 train_time: 5.7m tok/s: 7242265 +3166/20000 train_loss: 2.4206 train_time: 5.7m tok/s: 7241615 +3167/20000 train_loss: 2.5291 train_time: 5.7m tok/s: 7240963 +3168/20000 train_loss: 2.5098 train_time: 5.7m tok/s: 7240304 +3169/20000 train_loss: 2.4887 train_time: 5.7m tok/s: 7239648 +3170/20000 train_loss: 2.6191 train_time: 5.7m tok/s: 7238913 +3171/20000 train_loss: 2.6568 train_time: 5.7m tok/s: 7238269 +3172/20000 train_loss: 2.6575 train_time: 5.7m tok/s: 7237653 +3173/20000 train_loss: 2.7168 train_time: 5.7m tok/s: 
7236963 +3174/20000 train_loss: 2.4722 train_time: 5.7m tok/s: 7236305 +3175/20000 train_loss: 2.6095 train_time: 5.8m tok/s: 7235658 +3176/20000 train_loss: 2.4503 train_time: 5.8m tok/s: 7235018 +3177/20000 train_loss: 2.4525 train_time: 5.8m tok/s: 7234370 +3178/20000 train_loss: 2.5950 train_time: 5.8m tok/s: 7233678 +3179/20000 train_loss: 2.4262 train_time: 5.8m tok/s: 7232938 +3180/20000 train_loss: 2.5258 train_time: 5.8m tok/s: 7232268 +3181/20000 train_loss: 2.3792 train_time: 5.8m tok/s: 7231623 +3182/20000 train_loss: 2.8194 train_time: 5.8m tok/s: 7230961 +3183/20000 train_loss: 2.6630 train_time: 5.8m tok/s: 7230315 +3184/20000 train_loss: 2.5047 train_time: 5.8m tok/s: 7229616 +3185/20000 train_loss: 2.5773 train_time: 5.8m tok/s: 7228992 +3186/20000 train_loss: 2.5296 train_time: 5.8m tok/s: 7228346 +3187/20000 train_loss: 2.5746 train_time: 5.8m tok/s: 7227700 +3188/20000 train_loss: 2.3955 train_time: 5.8m tok/s: 7227043 +3189/20000 train_loss: 2.5944 train_time: 5.8m tok/s: 7226407 +3190/20000 train_loss: 2.5778 train_time: 5.8m tok/s: 7225794 +3191/20000 train_loss: 2.4830 train_time: 5.8m tok/s: 7225134 +3192/20000 train_loss: 2.3810 train_time: 5.8m tok/s: 7224533 +3193/20000 train_loss: 2.4598 train_time: 5.8m tok/s: 7223922 +3194/20000 train_loss: 2.4485 train_time: 5.8m tok/s: 7223296 +3195/20000 train_loss: 2.5039 train_time: 5.8m tok/s: 7222687 +3196/20000 train_loss: 2.3909 train_time: 5.8m tok/s: 7222002 +3197/20000 train_loss: 2.4691 train_time: 5.8m tok/s: 7221391 +3198/20000 train_loss: 2.5014 train_time: 5.8m tok/s: 7220735 +3199/20000 train_loss: 2.5148 train_time: 5.8m tok/s: 7220104 +3200/20000 train_loss: 2.6319 train_time: 5.8m tok/s: 7219442 +3201/20000 train_loss: 2.5703 train_time: 5.8m tok/s: 7218836 +3202/20000 train_loss: 2.2692 train_time: 5.8m tok/s: 7218130 +3203/20000 train_loss: 2.5434 train_time: 5.8m tok/s: 7217453 +3204/20000 train_loss: 2.4446 train_time: 5.8m tok/s: 7216832 +3205/20000 train_loss: 2.5036 
train_time: 5.8m tok/s: 7216167 +3206/20000 train_loss: 2.4463 train_time: 5.8m tok/s: 7215531 +3207/20000 train_loss: 2.5554 train_time: 5.8m tok/s: 7214902 +3208/20000 train_loss: 2.5809 train_time: 5.8m tok/s: 7214291 +3209/20000 train_loss: 2.4589 train_time: 5.8m tok/s: 7213665 +3210/20000 train_loss: 2.7165 train_time: 5.8m tok/s: 7212988 +3211/20000 train_loss: 2.5157 train_time: 5.8m tok/s: 7212361 +3212/20000 train_loss: 2.4069 train_time: 5.8m tok/s: 7211742 +3213/20000 train_loss: 2.5613 train_time: 5.8m tok/s: 7211082 +3214/20000 train_loss: 2.4234 train_time: 5.8m tok/s: 7210426 +3215/20000 train_loss: 2.4344 train_time: 5.8m tok/s: 7209795 +3216/20000 train_loss: 2.3228 train_time: 5.8m tok/s: 7209152 +3217/20000 train_loss: 2.4583 train_time: 5.8m tok/s: 7208538 +3218/20000 train_loss: 2.5019 train_time: 5.9m tok/s: 7207923 +3219/20000 train_loss: 2.5634 train_time: 5.9m tok/s: 7207333 +3220/20000 train_loss: 2.6326 train_time: 5.9m tok/s: 7206695 +3221/20000 train_loss: 2.4298 train_time: 5.9m tok/s: 7206064 +3222/20000 train_loss: 2.5313 train_time: 5.9m tok/s: 7205471 +3223/20000 train_loss: 2.6340 train_time: 5.9m tok/s: 7204791 +3224/20000 train_loss: 2.9697 train_time: 5.9m tok/s: 7204143 +3225/20000 train_loss: 2.5837 train_time: 5.9m tok/s: 7203490 +3226/20000 train_loss: 2.4101 train_time: 5.9m tok/s: 7202869 +3227/20000 train_loss: 2.4840 train_time: 5.9m tok/s: 7202233 +3228/20000 train_loss: 2.8148 train_time: 5.9m tok/s: 7201615 +3229/20000 train_loss: 2.4105 train_time: 5.9m tok/s: 7200988 +3230/20000 train_loss: 2.4468 train_time: 5.9m tok/s: 7200382 +3231/20000 train_loss: 2.5551 train_time: 5.9m tok/s: 7199769 +3232/20000 train_loss: 2.5343 train_time: 5.9m tok/s: 7199133 +3233/20000 train_loss: 2.5270 train_time: 5.9m tok/s: 7198504 +3234/20000 train_loss: 2.5980 train_time: 5.9m tok/s: 7197883 +3235/20000 train_loss: 2.5330 train_time: 5.9m tok/s: 7197242 +3236/20000 train_loss: 2.2840 train_time: 5.9m tok/s: 7196609 +3237/20000 
train_loss: 2.4687 train_time: 5.9m tok/s: 7196004 +3238/20000 train_loss: 2.3642 train_time: 5.9m tok/s: 7195378 +3239/20000 train_loss: 2.3962 train_time: 5.9m tok/s: 7194721 +3240/20000 train_loss: 2.5002 train_time: 5.9m tok/s: 7194112 +3241/20000 train_loss: 2.4987 train_time: 5.9m tok/s: 7193495 +3242/20000 train_loss: 2.5321 train_time: 5.9m tok/s: 7192882 +3243/20000 train_loss: 2.4466 train_time: 5.9m tok/s: 7192261 +3244/20000 train_loss: 2.5594 train_time: 5.9m tok/s: 7191657 +3245/20000 train_loss: 2.6333 train_time: 5.9m tok/s: 7191056 +3246/20000 train_loss: 2.5169 train_time: 5.9m tok/s: 7190405 +3247/20000 train_loss: 2.4124 train_time: 5.9m tok/s: 7189781 +3248/20000 train_loss: 2.4686 train_time: 5.9m tok/s: 7189164 +3249/20000 train_loss: 2.5947 train_time: 5.9m tok/s: 7188578 +3250/20000 train_loss: 2.5571 train_time: 5.9m tok/s: 7187921 +3251/20000 train_loss: 2.5918 train_time: 5.9m tok/s: 7187289 +3252/20000 train_loss: 2.4486 train_time: 5.9m tok/s: 7186677 +3253/20000 train_loss: 2.4423 train_time: 5.9m tok/s: 7186060 +3254/20000 train_loss: 2.9104 train_time: 5.9m tok/s: 7185371 +3255/20000 train_loss: 2.4604 train_time: 5.9m tok/s: 7184747 +3256/20000 train_loss: 2.5041 train_time: 5.9m tok/s: 7184180 +3257/20000 train_loss: 2.5762 train_time: 5.9m tok/s: 7183523 +3258/20000 train_loss: 2.5418 train_time: 5.9m tok/s: 7182950 +3259/20000 train_loss: 2.4949 train_time: 5.9m tok/s: 7182340 +3260/20000 train_loss: 2.4864 train_time: 5.9m tok/s: 7181760 +3261/20000 train_loss: 2.5263 train_time: 6.0m tok/s: 7181161 +3262/20000 train_loss: 2.4059 train_time: 6.0m tok/s: 7180535 +3263/20000 train_loss: 2.4678 train_time: 6.0m tok/s: 7179909 +3264/20000 train_loss: 2.5175 train_time: 6.0m tok/s: 7179288 +3265/20000 train_loss: 2.5341 train_time: 6.0m tok/s: 7178709 +3266/20000 train_loss: 2.5432 train_time: 6.0m tok/s: 7178097 +3267/20000 train_loss: 2.4932 train_time: 6.0m tok/s: 7177492 +3268/20000 train_loss: 2.5175 train_time: 6.0m tok/s: 
7176873 +3269/20000 train_loss: 2.6209 train_time: 6.0m tok/s: 7176255 +3270/20000 train_loss: 2.4708 train_time: 6.0m tok/s: 7175625 +3271/20000 train_loss: 2.5532 train_time: 6.0m tok/s: 7174991 +3272/20000 train_loss: 2.5181 train_time: 6.0m tok/s: 7174402 +3273/20000 train_loss: 2.6717 train_time: 6.0m tok/s: 7173814 +3274/20000 train_loss: 2.4754 train_time: 6.0m tok/s: 7173210 +3275/20000 train_loss: 2.5261 train_time: 6.0m tok/s: 7172622 +3276/20000 train_loss: 2.4992 train_time: 6.0m tok/s: 7172005 +3277/20000 train_loss: 2.5155 train_time: 6.0m tok/s: 7171327 +3278/20000 train_loss: 2.4040 train_time: 6.0m tok/s: 7170747 +3279/20000 train_loss: 2.4751 train_time: 6.0m tok/s: 7170154 +3280/20000 train_loss: 2.3836 train_time: 6.0m tok/s: 7169547 +3281/20000 train_loss: 2.4226 train_time: 6.0m tok/s: 7168892 +3282/20000 train_loss: 2.4922 train_time: 6.0m tok/s: 7168266 +3283/20000 train_loss: 2.4363 train_time: 6.0m tok/s: 7167662 +3284/20000 train_loss: 2.6385 train_time: 6.0m tok/s: 7167065 +3285/20000 train_loss: 2.4895 train_time: 6.0m tok/s: 7166472 +3286/20000 train_loss: 2.4477 train_time: 6.0m tok/s: 7165878 +3287/20000 train_loss: 2.5134 train_time: 6.0m tok/s: 7165319 +3288/20000 train_loss: 2.5073 train_time: 6.0m tok/s: 7164730 +3289/20000 train_loss: 2.4870 train_time: 6.0m tok/s: 7164147 +3290/20000 train_loss: 2.4588 train_time: 6.0m tok/s: 7163556 +3291/20000 train_loss: 2.4501 train_time: 6.0m tok/s: 7162954 +3292/20000 train_loss: 2.4649 train_time: 6.0m tok/s: 7162378 +3293/20000 train_loss: 2.4634 train_time: 6.0m tok/s: 7161764 +3294/20000 train_loss: 2.4345 train_time: 6.0m tok/s: 7161176 +3295/20000 train_loss: 2.5251 train_time: 6.0m tok/s: 7160578 +3296/20000 train_loss: 2.3930 train_time: 6.0m tok/s: 7159966 +3297/20000 train_loss: 2.4077 train_time: 6.0m tok/s: 7159374 +3298/20000 train_loss: 2.4839 train_time: 6.0m tok/s: 7158756 +3299/20000 train_loss: 2.3012 train_time: 6.0m tok/s: 7158101 +3300/20000 train_loss: 2.4608 
train_time: 6.0m tok/s: 7157502 +3301/20000 train_loss: 2.4325 train_time: 6.0m tok/s: 7156919 +3302/20000 train_loss: 2.2204 train_time: 6.0m tok/s: 7156251 +3303/20000 train_loss: 2.6574 train_time: 6.1m tok/s: 7155656 +3304/20000 train_loss: 2.5004 train_time: 6.1m tok/s: 7155095 +3305/20000 train_loss: 2.5696 train_time: 6.1m tok/s: 7154540 +3306/20000 train_loss: 2.6163 train_time: 6.1m tok/s: 7153948 +3307/20000 train_loss: 2.5265 train_time: 6.1m tok/s: 7153320 +3308/20000 train_loss: 2.2498 train_time: 6.1m tok/s: 7152746 +3309/20000 train_loss: 2.4774 train_time: 6.1m tok/s: 7152174 +3310/20000 train_loss: 2.5415 train_time: 6.1m tok/s: 7151628 +3311/20000 train_loss: 2.6638 train_time: 6.1m tok/s: 7151001 +3312/20000 train_loss: 2.5331 train_time: 6.1m tok/s: 7150425 +3313/20000 train_loss: 2.3760 train_time: 6.1m tok/s: 7149842 +3314/20000 train_loss: 2.4970 train_time: 6.1m tok/s: 7149267 +3315/20000 train_loss: 2.5203 train_time: 6.1m tok/s: 7148690 +3316/20000 train_loss: 2.5088 train_time: 6.1m tok/s: 7148132 +3317/20000 train_loss: 2.4289 train_time: 6.1m tok/s: 7147524 +3318/20000 train_loss: 2.4002 train_time: 6.1m tok/s: 7146941 +3319/20000 train_loss: 2.4007 train_time: 6.1m tok/s: 7146336 +3320/20000 train_loss: 2.4288 train_time: 6.1m tok/s: 7145748 +3321/20000 train_loss: 2.4467 train_time: 6.1m tok/s: 7145184 +3322/20000 train_loss: 2.4017 train_time: 6.1m tok/s: 7144578 +3323/20000 train_loss: 2.4123 train_time: 6.1m tok/s: 7143994 +3324/20000 train_loss: 2.4539 train_time: 6.1m tok/s: 7143420 +3325/20000 train_loss: 2.5590 train_time: 6.1m tok/s: 7142880 +3326/20000 train_loss: 2.5077 train_time: 6.1m tok/s: 7142291 +3327/20000 train_loss: 2.4247 train_time: 6.1m tok/s: 7141713 +3328/20000 train_loss: 2.5676 train_time: 6.1m tok/s: 7141122 +3329/20000 train_loss: 2.4625 train_time: 6.1m tok/s: 7140545 +3330/20000 train_loss: 2.8037 train_time: 6.1m tok/s: 7139960 +3331/20000 train_loss: 2.6184 train_time: 6.1m tok/s: 7139342 +3332/20000 
train_loss: 2.5302 train_time: 6.1m tok/s: 7138789 +3333/20000 train_loss: 2.3845 train_time: 6.1m tok/s: 7138231 +3334/20000 train_loss: 2.4681 train_time: 6.1m tok/s: 7137639 +3335/20000 train_loss: 2.5720 train_time: 6.1m tok/s: 7137048 +3336/20000 train_loss: 2.5187 train_time: 6.1m tok/s: 7136484 +3337/20000 train_loss: 2.2974 train_time: 6.1m tok/s: 7135915 +3338/20000 train_loss: 2.4682 train_time: 6.1m tok/s: 7135341 +3339/20000 train_loss: 2.4121 train_time: 6.1m tok/s: 7134761 +3340/20000 train_loss: 2.4342 train_time: 6.1m tok/s: 7134198 +3341/20000 train_loss: 2.4528 train_time: 6.1m tok/s: 7133610 +3342/20000 train_loss: 2.3825 train_time: 6.1m tok/s: 7133016 +3343/20000 train_loss: 2.3834 train_time: 6.1m tok/s: 7132442 +3344/20000 train_loss: 2.5434 train_time: 6.1m tok/s: 7131869 +3345/20000 train_loss: 2.4738 train_time: 6.1m tok/s: 7131291 +3346/20000 train_loss: 2.4884 train_time: 6.2m tok/s: 7130706 +3347/20000 train_loss: 2.5262 train_time: 6.2m tok/s: 7130163 +3348/20000 train_loss: 2.5833 train_time: 6.2m tok/s: 7129570 +3349/20000 train_loss: 2.4993 train_time: 6.2m tok/s: 7128998 +3350/20000 train_loss: 2.4570 train_time: 6.2m tok/s: 7128432 +3351/20000 train_loss: 2.4860 train_time: 6.2m tok/s: 7127856 +3352/20000 train_loss: 2.5012 train_time: 6.2m tok/s: 7127297 +3353/20000 train_loss: 2.4338 train_time: 6.2m tok/s: 7126715 +3354/20000 train_loss: 2.5252 train_time: 6.2m tok/s: 7126152 +3355/20000 train_loss: 2.5696 train_time: 6.2m tok/s: 7125557 +3356/20000 train_loss: 2.3541 train_time: 6.2m tok/s: 7124939 +3357/20000 train_loss: 2.3602 train_time: 6.2m tok/s: 7124359 +3358/20000 train_loss: 2.5043 train_time: 6.2m tok/s: 7123761 +3359/20000 train_loss: 2.4421 train_time: 6.2m tok/s: 7123192 +3360/20000 train_loss: 2.4509 train_time: 6.2m tok/s: 7122672 +3361/20000 train_loss: 2.5175 train_time: 6.2m tok/s: 7122133 +3362/20000 train_loss: 2.4887 train_time: 6.2m tok/s: 7121558 +3363/20000 train_loss: 2.4231 train_time: 6.2m tok/s: 
7120989 +3364/20000 train_loss: 2.5105 train_time: 6.2m tok/s: 7120436 +3365/20000 train_loss: 2.5173 train_time: 6.2m tok/s: 7119868 +3366/20000 train_loss: 2.3852 train_time: 6.2m tok/s: 7119276 +3367/20000 train_loss: 2.4252 train_time: 6.2m tok/s: 7118699 +3368/20000 train_loss: 2.3779 train_time: 6.2m tok/s: 7118165 +3369/20000 train_loss: 2.6108 train_time: 6.2m tok/s: 7117565 +3370/20000 train_loss: 2.5871 train_time: 6.2m tok/s: 7116997 +3371/20000 train_loss: 2.5060 train_time: 6.2m tok/s: 7116427 +3372/20000 train_loss: 2.5787 train_time: 6.2m tok/s: 7115907 +3373/20000 train_loss: 2.4683 train_time: 6.2m tok/s: 7115343 +3374/20000 train_loss: 2.4671 train_time: 6.2m tok/s: 7114782 +3375/20000 train_loss: 2.4781 train_time: 6.2m tok/s: 7114217 +3376/20000 train_loss: 2.4273 train_time: 6.2m tok/s: 7113664 +3377/20000 train_loss: 2.5538 train_time: 6.2m tok/s: 7113113 +3378/20000 train_loss: 2.3416 train_time: 6.2m tok/s: 7112534 +3379/20000 train_loss: 2.4152 train_time: 6.2m tok/s: 7111994 +3380/20000 train_loss: 2.3567 train_time: 6.2m tok/s: 7111403 +3381/20000 train_loss: 2.3326 train_time: 6.2m tok/s: 7110803 +3382/20000 train_loss: 2.5760 train_time: 6.2m tok/s: 7110266 +3383/20000 train_loss: 2.5016 train_time: 6.2m tok/s: 7109736 +3384/20000 train_loss: 2.4818 train_time: 6.2m tok/s: 7109177 +3385/20000 train_loss: 2.4769 train_time: 6.2m tok/s: 7108637 +3386/20000 train_loss: 2.4949 train_time: 6.2m tok/s: 7108091 +3387/20000 train_loss: 2.5232 train_time: 6.2m tok/s: 7107526 +3388/20000 train_loss: 2.2891 train_time: 6.2m tok/s: 7106889 +3389/20000 train_loss: 2.4278 train_time: 6.3m tok/s: 7106355 +3390/20000 train_loss: 2.5000 train_time: 6.3m tok/s: 7105815 +3391/20000 train_loss: 2.5007 train_time: 6.3m tok/s: 7105293 +3392/20000 train_loss: 2.4633 train_time: 6.3m tok/s: 7104733 +3393/20000 train_loss: 2.4511 train_time: 6.3m tok/s: 7104191 +3394/20000 train_loss: 2.4640 train_time: 6.3m tok/s: 7103677 +3395/20000 train_loss: 2.4617 
train_time: 6.3m tok/s: 7103118 +3396/20000 train_loss: 2.6050 train_time: 6.3m tok/s: 7102549 +3397/20000 train_loss: 2.5261 train_time: 6.3m tok/s: 7101955 +3398/20000 train_loss: 2.3639 train_time: 6.3m tok/s: 7101429 +3399/20000 train_loss: 2.4255 train_time: 6.3m tok/s: 7100850 +3400/20000 train_loss: 2.5280 train_time: 6.3m tok/s: 7100282 +3401/20000 train_loss: 2.4075 train_time: 6.3m tok/s: 7099722 +3402/20000 train_loss: 2.4485 train_time: 6.3m tok/s: 7099192 +3403/20000 train_loss: 2.4900 train_time: 6.3m tok/s: 7098649 +3404/20000 train_loss: 2.4499 train_time: 6.3m tok/s: 7098128 +3405/20000 train_loss: 2.6260 train_time: 6.3m tok/s: 7097588 +3406/20000 train_loss: 2.4729 train_time: 6.3m tok/s: 7097048 +3407/20000 train_loss: 2.5036 train_time: 6.3m tok/s: 7096483 +3408/20000 train_loss: 2.5684 train_time: 6.3m tok/s: 7095919 +3409/20000 train_loss: 2.4139 train_time: 6.3m tok/s: 7095372 +3410/20000 train_loss: 2.4226 train_time: 6.3m tok/s: 7094823 +3411/20000 train_loss: 2.3462 train_time: 6.3m tok/s: 7094273 +3412/20000 train_loss: 2.3423 train_time: 6.3m tok/s: 7093758 +3413/20000 train_loss: 2.3376 train_time: 6.3m tok/s: 7093192 +3414/20000 train_loss: 2.4450 train_time: 6.3m tok/s: 7092646 +3415/20000 train_loss: 2.5944 train_time: 6.3m tok/s: 7092117 +3416/20000 train_loss: 2.4977 train_time: 6.3m tok/s: 7091583 +3417/20000 train_loss: 2.5712 train_time: 6.3m tok/s: 7091053 +3418/20000 train_loss: 2.5062 train_time: 6.3m tok/s: 7090500 +3419/20000 train_loss: 2.5312 train_time: 6.3m tok/s: 7089953 +3420/20000 train_loss: 2.5377 train_time: 6.3m tok/s: 7089383 +3421/20000 train_loss: 2.3779 train_time: 6.3m tok/s: 7088833 +3422/20000 train_loss: 2.6874 train_time: 6.3m tok/s: 7088272 +3423/20000 train_loss: 2.4000 train_time: 6.3m tok/s: 7087730 +3424/20000 train_loss: 2.4541 train_time: 6.3m tok/s: 7087231 +3425/20000 train_loss: 2.4303 train_time: 6.3m tok/s: 7086681 +3426/20000 train_loss: 2.5105 train_time: 6.3m tok/s: 7086140 +3427/20000 
train_loss: 2.4585 train_time: 6.3m tok/s: 7085600 +3428/20000 train_loss: 2.4551 train_time: 6.3m tok/s: 7085053 +3429/20000 train_loss: 2.6024 train_time: 6.3m tok/s: 7084506 +3430/20000 train_loss: 2.4204 train_time: 6.3m tok/s: 7083958 +3431/20000 train_loss: 2.4695 train_time: 6.3m tok/s: 7083436 +3432/20000 train_loss: 2.5961 train_time: 6.4m tok/s: 7082893 +3433/20000 train_loss: 2.4583 train_time: 6.4m tok/s: 7082366 +3434/20000 train_loss: 2.5114 train_time: 6.4m tok/s: 7081828 +3435/20000 train_loss: 2.4376 train_time: 6.4m tok/s: 7081269 +3436/20000 train_loss: 2.4138 train_time: 6.4m tok/s: 7080720 +3437/20000 train_loss: 2.5503 train_time: 6.4m tok/s: 7080204 +3438/20000 train_loss: 2.4864 train_time: 6.4m tok/s: 7079680 +3439/20000 train_loss: 2.3277 train_time: 6.4m tok/s: 7079123 +3440/20000 train_loss: 2.4817 train_time: 6.4m tok/s: 7078586 +3441/20000 train_loss: 2.3814 train_time: 6.4m tok/s: 7078003 +3442/20000 train_loss: 2.4021 train_time: 6.4m tok/s: 7077510 +3443/20000 train_loss: 2.6648 train_time: 6.4m tok/s: 7076903 +3444/20000 train_loss: 2.3234 train_time: 6.4m tok/s: 7076377 +3445/20000 train_loss: 2.4638 train_time: 6.4m tok/s: 7075898 +3446/20000 train_loss: 2.5260 train_time: 6.4m tok/s: 7075372 +3447/20000 train_loss: 2.4863 train_time: 6.4m tok/s: 7074839 +3448/20000 train_loss: 2.6026 train_time: 6.4m tok/s: 7074315 +3449/20000 train_loss: 2.4733 train_time: 6.4m tok/s: 7073804 +3450/20000 train_loss: 2.4280 train_time: 6.4m tok/s: 7073254 +3451/20000 train_loss: 2.4918 train_time: 6.4m tok/s: 7072755 +3452/20000 train_loss: 2.5184 train_time: 6.4m tok/s: 7072247 +3453/20000 train_loss: 2.4059 train_time: 6.4m tok/s: 7071744 +3454/20000 train_loss: 2.5257 train_time: 6.4m tok/s: 7071200 +3455/20000 train_loss: 2.4845 train_time: 6.4m tok/s: 7070679 +3456/20000 train_loss: 2.4849 train_time: 6.4m tok/s: 7070140 +3457/20000 train_loss: 2.4412 train_time: 6.4m tok/s: 7069615 +3458/20000 train_loss: 2.3997 train_time: 6.4m tok/s: 
7069079 +3459/20000 train_loss: 2.3683 train_time: 6.4m tok/s: 7068521 +3460/20000 train_loss: 2.5196 train_time: 6.4m tok/s: 7068017 +3461/20000 train_loss: 2.6246 train_time: 6.4m tok/s: 7067488 +3462/20000 train_loss: 2.5508 train_time: 6.4m tok/s: 7066948 +3463/20000 train_loss: 2.5735 train_time: 6.4m tok/s: 7066439 +3464/20000 train_loss: 2.4842 train_time: 6.4m tok/s: 7065924 +3465/20000 train_loss: 2.5349 train_time: 6.4m tok/s: 7065376 +3466/20000 train_loss: 2.5924 train_time: 6.4m tok/s: 7064803 +3467/20000 train_loss: 2.4467 train_time: 6.4m tok/s: 7064280 +3468/20000 train_loss: 2.6319 train_time: 6.4m tok/s: 7063711 +3469/20000 train_loss: 2.3939 train_time: 6.4m tok/s: 7063167 +3470/20000 train_loss: 2.3993 train_time: 6.4m tok/s: 7062664 +3471/20000 train_loss: 2.3929 train_time: 6.4m tok/s: 7062146 +3472/20000 train_loss: 2.4341 train_time: 6.4m tok/s: 7061632 +3473/20000 train_loss: 2.3993 train_time: 6.4m tok/s: 7061104 +3474/20000 train_loss: 2.4744 train_time: 6.4m tok/s: 7060610 +3475/20000 train_loss: 2.5226 train_time: 6.5m tok/s: 7060095 +3476/20000 train_loss: 2.4549 train_time: 6.5m tok/s: 7059572 +3477/20000 train_loss: 2.5324 train_time: 6.5m tok/s: 7059075 +3478/20000 train_loss: 2.5535 train_time: 6.5m tok/s: 7058553 +3479/20000 train_loss: 2.4599 train_time: 6.5m tok/s: 7058042 +3480/20000 train_loss: 2.5231 train_time: 6.5m tok/s: 7057546 +3481/20000 train_loss: 2.4842 train_time: 6.5m tok/s: 7057003 +3482/20000 train_loss: 2.4288 train_time: 6.5m tok/s: 7056477 +3483/20000 train_loss: 2.4127 train_time: 6.5m tok/s: 7055944 +3484/20000 train_loss: 2.4662 train_time: 6.5m tok/s: 7055400 +3485/20000 train_loss: 2.3986 train_time: 6.5m tok/s: 7054893 +3486/20000 train_loss: 2.3821 train_time: 6.5m tok/s: 7054362 +3487/20000 train_loss: 2.4755 train_time: 6.5m tok/s: 7053856 +3488/20000 train_loss: 2.3663 train_time: 6.5m tok/s: 7053354 +3489/20000 train_loss: 2.3873 train_time: 6.5m tok/s: 7052829 +3490/20000 train_loss: 2.4820 
train_time: 6.5m tok/s: 7052301 +3491/20000 train_loss: 2.5840 train_time: 6.5m tok/s: 7051817 +3492/20000 train_loss: 2.4249 train_time: 6.5m tok/s: 7051299 +3493/20000 train_loss: 2.5534 train_time: 6.5m tok/s: 7050793 +3494/20000 train_loss: 2.5277 train_time: 6.5m tok/s: 7050286 +3495/20000 train_loss: 2.4907 train_time: 6.5m tok/s: 7049772 +3496/20000 train_loss: 2.5968 train_time: 6.5m tok/s: 7049225 +3497/20000 train_loss: 2.6824 train_time: 6.5m tok/s: 7048694 +3498/20000 train_loss: 2.3511 train_time: 6.5m tok/s: 7048178 +3499/20000 train_loss: 2.4600 train_time: 6.5m tok/s: 7047679 +3500/20000 train_loss: 2.3666 train_time: 6.5m tok/s: 7047162 +3501/20000 train_loss: 2.4142 train_time: 6.5m tok/s: 7046645 +3502/20000 train_loss: 3.2613 train_time: 6.5m tok/s: 7046082 +3503/20000 train_loss: 2.4620 train_time: 6.5m tok/s: 7045598 +3504/20000 train_loss: 2.5556 train_time: 6.5m tok/s: 7045090 +3505/20000 train_loss: 2.4995 train_time: 6.5m tok/s: 7044567 +3506/20000 train_loss: 2.5622 train_time: 6.5m tok/s: 7044073 +3507/20000 train_loss: 2.5269 train_time: 6.5m tok/s: 7043592 +3508/20000 train_loss: 2.6192 train_time: 6.5m tok/s: 7043055 +3509/20000 train_loss: 2.4363 train_time: 6.5m tok/s: 7042520 +3510/20000 train_loss: 2.4292 train_time: 6.5m tok/s: 7042033 +3511/20000 train_loss: 2.4147 train_time: 6.5m tok/s: 7041508 +3512/20000 train_loss: 2.4649 train_time: 6.5m tok/s: 7041017 +3513/20000 train_loss: 2.4137 train_time: 6.5m tok/s: 7040550 +3514/20000 train_loss: 2.5413 train_time: 6.5m tok/s: 7040035 +3515/20000 train_loss: 2.3643 train_time: 6.5m tok/s: 7039532 +3516/20000 train_loss: 2.2750 train_time: 6.5m tok/s: 7038995 +3517/20000 train_loss: 2.4485 train_time: 6.5m tok/s: 7038488 +3518/20000 train_loss: 2.4144 train_time: 6.6m tok/s: 7037972 +3519/20000 train_loss: 2.8127 train_time: 6.6m tok/s: 7037445 +3520/20000 train_loss: 2.6393 train_time: 6.6m tok/s: 7036936 +3521/20000 train_loss: 2.4987 train_time: 6.6m tok/s: 7036426 +3522/20000 
train_loss: 2.5067 train_time: 6.6m tok/s: 7035930 +3523/20000 train_loss: 2.3843 train_time: 6.6m tok/s: 7035422 +3524/20000 train_loss: 2.7613 train_time: 6.6m tok/s: 7034900 +3525/20000 train_loss: 2.5236 train_time: 6.6m tok/s: 7034427 +3526/20000 train_loss: 2.4878 train_time: 6.6m tok/s: 7033908 +3527/20000 train_loss: 2.4172 train_time: 6.6m tok/s: 7033417 +3528/20000 train_loss: 2.4569 train_time: 6.6m tok/s: 7032946 +3529/20000 train_loss: 2.4493 train_time: 6.6m tok/s: 7032444 +3530/20000 train_loss: 2.6714 train_time: 6.6m tok/s: 7031961 +3531/20000 train_loss: 2.2977 train_time: 6.6m tok/s: 7031454 +3532/20000 train_loss: 2.2762 train_time: 6.6m tok/s: 7030945 +3533/20000 train_loss: 2.4345 train_time: 6.6m tok/s: 7030435 +3534/20000 train_loss: 2.4729 train_time: 6.6m tok/s: 7029950 +3535/20000 train_loss: 2.4041 train_time: 6.6m tok/s: 7029415 +3536/20000 train_loss: 2.4706 train_time: 6.6m tok/s: 7028927 +3537/20000 train_loss: 2.6063 train_time: 6.6m tok/s: 7028451 +3538/20000 train_loss: 2.4481 train_time: 6.6m tok/s: 7027949 +3539/20000 train_loss: 2.4781 train_time: 6.6m tok/s: 7027432 +3540/20000 train_loss: 2.4951 train_time: 6.6m tok/s: 7026930 +3541/20000 train_loss: 2.2979 train_time: 6.6m tok/s: 7026422 +3542/20000 train_loss: 2.4078 train_time: 6.6m tok/s: 7025955 +3543/20000 train_loss: 2.5151 train_time: 6.6m tok/s: 7025471 +3544/20000 train_loss: 2.4238 train_time: 6.6m tok/s: 7024992 +3545/20000 train_loss: 2.4573 train_time: 6.6m tok/s: 7024496 +3546/20000 train_loss: 2.3651 train_time: 6.6m tok/s: 7023995 +3547/20000 train_loss: 2.3354 train_time: 6.6m tok/s: 7023501 +3548/20000 train_loss: 2.5770 train_time: 6.6m tok/s: 7023004 +3549/20000 train_loss: 2.5417 train_time: 6.6m tok/s: 7022525 +3550/20000 train_loss: 2.5460 train_time: 6.6m tok/s: 7022025 +3551/20000 train_loss: 2.5698 train_time: 6.6m tok/s: 7021543 +3552/20000 train_loss: 2.4710 train_time: 6.6m tok/s: 7021035 +3553/20000 train_loss: 2.5823 train_time: 6.6m tok/s: 
7020523 +3554/20000 train_loss: 2.4992 train_time: 6.6m tok/s: 7020010 +3555/20000 train_loss: 2.5143 train_time: 6.6m tok/s: 7019520 +3556/20000 train_loss: 2.5207 train_time: 6.6m tok/s: 7019009 +3557/20000 train_loss: 2.4410 train_time: 6.6m tok/s: 7018551 +3558/20000 train_loss: 2.6233 train_time: 6.6m tok/s: 7018049 +3559/20000 train_loss: 2.4923 train_time: 6.6m tok/s: 7017561 +3560/20000 train_loss: 2.4459 train_time: 6.6m tok/s: 7017072 +3561/20000 train_loss: 3.1465 train_time: 6.7m tok/s: 7016554 +3562/20000 train_loss: 2.3838 train_time: 6.7m tok/s: 7016077 +3563/20000 train_loss: 2.4799 train_time: 6.7m tok/s: 7015594 +3564/20000 train_loss: 2.4844 train_time: 6.7m tok/s: 7015127 +3565/20000 train_loss: 2.4681 train_time: 6.7m tok/s: 7014645 +3566/20000 train_loss: 2.4666 train_time: 6.7m tok/s: 7014140 +3567/20000 train_loss: 2.5243 train_time: 6.7m tok/s: 7013638 +3568/20000 train_loss: 2.5337 train_time: 6.7m tok/s: 7013148 +3569/20000 train_loss: 2.3281 train_time: 6.7m tok/s: 7012648 +3570/20000 train_loss: 2.3019 train_time: 6.7m tok/s: 7012158 +3571/20000 train_loss: 2.3832 train_time: 6.7m tok/s: 7011673 +3572/20000 train_loss: 2.3914 train_time: 6.7m tok/s: 7011203 +3573/20000 train_loss: 2.2672 train_time: 6.7m tok/s: 7010683 +3574/20000 train_loss: 2.4242 train_time: 6.7m tok/s: 7010165 +3575/20000 train_loss: 2.5045 train_time: 6.7m tok/s: 7009686 +3576/20000 train_loss: 2.5092 train_time: 6.7m tok/s: 7009221 +3577/20000 train_loss: 2.4848 train_time: 6.7m tok/s: 7008742 +3578/20000 train_loss: 2.5610 train_time: 6.7m tok/s: 7008271 +3579/20000 train_loss: 2.5194 train_time: 6.7m tok/s: 7007794 +3580/20000 train_loss: 2.5210 train_time: 6.7m tok/s: 7007305 +3581/20000 train_loss: 2.5115 train_time: 6.7m tok/s: 7006830 +3582/20000 train_loss: 2.3791 train_time: 6.7m tok/s: 7006341 +3583/20000 train_loss: 2.4147 train_time: 6.7m tok/s: 7005848 +3584/20000 train_loss: 2.3580 train_time: 6.7m tok/s: 7005349 +3585/20000 train_loss: 2.3395 
train_time: 6.7m tok/s: 7004833 +3586/20000 train_loss: 2.1756 train_time: 6.7m tok/s: 7004307 +3587/20000 train_loss: 2.3308 train_time: 6.7m tok/s: 7003853 +3588/20000 train_loss: 2.3801 train_time: 6.7m tok/s: 7003378 +3589/20000 train_loss: 2.6193 train_time: 6.7m tok/s: 7002867 +3590/20000 train_loss: 2.5106 train_time: 6.7m tok/s: 7002397 +3591/20000 train_loss: 2.5268 train_time: 6.7m tok/s: 7001935 +3592/20000 train_loss: 2.5208 train_time: 6.7m tok/s: 7001472 +3593/20000 train_loss: 2.5709 train_time: 6.7m tok/s: 7001001 +3594/20000 train_loss: 2.4924 train_time: 6.7m tok/s: 7000528 +3595/20000 train_loss: 2.5374 train_time: 6.7m tok/s: 7000056 +3596/20000 train_loss: 2.5190 train_time: 6.7m tok/s: 6999591 +3597/20000 train_loss: 2.5500 train_time: 6.7m tok/s: 6999120 +3598/20000 train_loss: 2.2688 train_time: 6.7m tok/s: 6998612 +3599/20000 train_loss: 2.4736 train_time: 6.7m tok/s: 6998142 +3600/20000 train_loss: 2.5624 train_time: 6.7m tok/s: 6997686 +3601/20000 train_loss: 2.4300 train_time: 6.7m tok/s: 6997222 +3602/20000 train_loss: 2.3179 train_time: 6.7m tok/s: 6996747 +3603/20000 train_loss: 2.5487 train_time: 6.8m tok/s: 6996275 +3604/20000 train_loss: 2.5783 train_time: 6.8m tok/s: 6995807 +3605/20000 train_loss: 2.4440 train_time: 6.8m tok/s: 6995328 +3606/20000 train_loss: 2.4473 train_time: 6.8m tok/s: 6994845 +3607/20000 train_loss: 2.5774 train_time: 6.8m tok/s: 6994356 +3608/20000 train_loss: 2.4193 train_time: 6.8m tok/s: 6993886 +3609/20000 train_loss: 2.4335 train_time: 6.8m tok/s: 6993407 +3610/20000 train_loss: 2.5459 train_time: 6.8m tok/s: 6992937 +3611/20000 train_loss: 2.5320 train_time: 6.8m tok/s: 6992475 +3612/20000 train_loss: 2.4099 train_time: 6.8m tok/s: 6992003 +3613/20000 train_loss: 2.4477 train_time: 6.8m tok/s: 6991525 +3614/20000 train_loss: 2.5454 train_time: 6.8m tok/s: 6991072 +3615/20000 train_loss: 2.3676 train_time: 6.8m tok/s: 6990591 +3616/20000 train_loss: 2.4756 train_time: 6.8m tok/s: 6990110 +3617/20000 
train_loss: 2.4351 train_time: 6.8m tok/s: 6989633 +3618/20000 train_loss: 2.3693 train_time: 6.8m tok/s: 6989147 +3619/20000 train_loss: 2.6458 train_time: 6.8m tok/s: 6988677 +3620/20000 train_loss: 2.3821 train_time: 6.8m tok/s: 6988209 +3621/20000 train_loss: 2.5038 train_time: 6.8m tok/s: 6987749 +3622/20000 train_loss: 2.5285 train_time: 6.8m tok/s: 6987249 +3623/20000 train_loss: 2.5359 train_time: 6.8m tok/s: 6986732 +3624/20000 train_loss: 2.5947 train_time: 6.8m tok/s: 6986249 +3625/20000 train_loss: 2.4439 train_time: 6.8m tok/s: 6985819 +3626/20000 train_loss: 2.4044 train_time: 6.8m tok/s: 6985369 +3627/20000 train_loss: 2.3816 train_time: 6.8m tok/s: 6984924 +3628/20000 train_loss: 2.3982 train_time: 6.8m tok/s: 6984463 +3629/20000 train_loss: 2.5300 train_time: 6.8m tok/s: 6984009 +3630/20000 train_loss: 2.5393 train_time: 6.8m tok/s: 6983528 +3631/20000 train_loss: 2.5354 train_time: 6.8m tok/s: 6983058 +3632/20000 train_loss: 2.5869 train_time: 6.8m tok/s: 6982591 +3633/20000 train_loss: 2.3972 train_time: 6.8m tok/s: 6982153 +3634/20000 train_loss: 2.5175 train_time: 6.8m tok/s: 6981715 +3635/20000 train_loss: 2.5024 train_time: 6.8m tok/s: 6981238 +3636/20000 train_loss: 2.4855 train_time: 6.8m tok/s: 6980756 +3637/20000 train_loss: 2.4420 train_time: 6.8m tok/s: 6980274 +3638/20000 train_loss: 2.4300 train_time: 6.8m tok/s: 6979805 +3639/20000 train_loss: 2.4515 train_time: 6.8m tok/s: 6979325 +3640/20000 train_loss: 2.4013 train_time: 6.8m tok/s: 6978851 +3641/20000 train_loss: 2.3872 train_time: 6.8m tok/s: 6978414 +3642/20000 train_loss: 2.3036 train_time: 6.8m tok/s: 6977931 +3643/20000 train_loss: 2.3904 train_time: 6.8m tok/s: 6977477 +3644/20000 train_loss: 2.1024 train_time: 6.8m tok/s: 6976984 +3645/20000 train_loss: 2.5231 train_time: 6.8m tok/s: 6976524 +3646/20000 train_loss: 2.5359 train_time: 6.9m tok/s: 6976080 +3647/20000 train_loss: 2.5373 train_time: 6.9m tok/s: 6975633 +3648/20000 train_loss: 2.4791 train_time: 6.9m tok/s: 
6975181 +3649/20000 train_loss: 2.4510 train_time: 6.9m tok/s: 6974717 +3650/20000 train_loss: 2.6570 train_time: 6.9m tok/s: 6974224 +3651/20000 train_loss: 2.5567 train_time: 6.9m tok/s: 6973756 +3652/20000 train_loss: 2.4394 train_time: 6.9m tok/s: 6973302 +3653/20000 train_loss: 2.4510 train_time: 6.9m tok/s: 6972851 +3654/20000 train_loss: 2.3826 train_time: 6.9m tok/s: 6972420 +3655/20000 train_loss: 2.4060 train_time: 6.9m tok/s: 6971954 +3656/20000 train_loss: 2.3851 train_time: 6.9m tok/s: 6971469 +3657/20000 train_loss: 2.4269 train_time: 6.9m tok/s: 6971006 +3658/20000 train_loss: 2.5552 train_time: 6.9m tok/s: 6970537 +3659/20000 train_loss: 2.4341 train_time: 6.9m tok/s: 6970100 +3660/20000 train_loss: 2.3424 train_time: 6.9m tok/s: 6969617 +3661/20000 train_loss: 2.4460 train_time: 6.9m tok/s: 6969145 +3662/20000 train_loss: 2.4626 train_time: 6.9m tok/s: 6968704 +3663/20000 train_loss: 2.0474 train_time: 6.9m tok/s: 6968229 +3664/20000 train_loss: 2.6473 train_time: 6.9m tok/s: 6967793 +3665/20000 train_loss: 2.4913 train_time: 6.9m tok/s: 6967319 +3666/20000 train_loss: 2.4467 train_time: 6.9m tok/s: 6966846 +3667/20000 train_loss: 2.3273 train_time: 6.9m tok/s: 6966393 +3668/20000 train_loss: 2.2318 train_time: 6.9m tok/s: 6965920 +3669/20000 train_loss: 2.3908 train_time: 6.9m tok/s: 6965425 +3670/20000 train_loss: 2.4290 train_time: 6.9m tok/s: 6964987 +3671/20000 train_loss: 2.4360 train_time: 6.9m tok/s: 6964545 +3672/20000 train_loss: 2.5413 train_time: 6.9m tok/s: 6964103 +3673/20000 train_loss: 2.4566 train_time: 6.9m tok/s: 6963665 +3674/20000 train_loss: 2.5405 train_time: 6.9m tok/s: 6963226 +3675/20000 train_loss: 2.5830 train_time: 6.9m tok/s: 6962773 +3676/20000 train_loss: 2.4695 train_time: 6.9m tok/s: 6962323 +3677/20000 train_loss: 2.5379 train_time: 6.9m tok/s: 6961901 +3678/20000 train_loss: 2.4787 train_time: 6.9m tok/s: 6961441 +3679/20000 train_loss: 2.5713 train_time: 6.9m tok/s: 6960953 +3680/20000 train_loss: 2.4726 
train_time: 6.9m tok/s: 6960528 +3681/20000 train_loss: 2.5354 train_time: 6.9m tok/s: 6960080 +3682/20000 train_loss: 2.3550 train_time: 6.9m tok/s: 6959616 +3683/20000 train_loss: 2.5751 train_time: 6.9m tok/s: 6959165 +3684/20000 train_loss: 2.4065 train_time: 6.9m tok/s: 6958713 +3685/20000 train_loss: 2.6496 train_time: 6.9m tok/s: 6958255 +3686/20000 train_loss: 2.4960 train_time: 6.9m tok/s: 6957790 +3687/20000 train_loss: 2.5694 train_time: 6.9m tok/s: 6957369 +3688/20000 train_loss: 2.4781 train_time: 6.9m tok/s: 6956917 +3689/20000 train_loss: 2.5447 train_time: 7.0m tok/s: 6956473 +3690/20000 train_loss: 2.6045 train_time: 7.0m tok/s: 6956037 +3691/20000 train_loss: 2.5553 train_time: 7.0m tok/s: 6955582 +3692/20000 train_loss: 2.4950 train_time: 7.0m tok/s: 6955138 +3693/20000 train_loss: 2.5060 train_time: 7.0m tok/s: 6954676 +3694/20000 train_loss: 2.5063 train_time: 7.0m tok/s: 6954250 +3695/20000 train_loss: 2.4252 train_time: 7.0m tok/s: 6953784 +3696/20000 train_loss: 2.4112 train_time: 7.0m tok/s: 6953317 +3697/20000 train_loss: 2.4252 train_time: 7.0m tok/s: 6952860 +3698/20000 train_loss: 2.4755 train_time: 7.0m tok/s: 6952431 +3699/20000 train_loss: 2.4955 train_time: 7.0m tok/s: 6952006 +3700/20000 train_loss: 2.6308 train_time: 7.0m tok/s: 6951530 +3701/20000 train_loss: 2.5308 train_time: 7.0m tok/s: 6951081 +3702/20000 train_loss: 2.6049 train_time: 7.0m tok/s: 6950633 +3703/20000 train_loss: 2.5395 train_time: 7.0m tok/s: 6950170 +3704/20000 train_loss: 2.8504 train_time: 7.0m tok/s: 6949699 +3705/20000 train_loss: 2.4415 train_time: 7.0m tok/s: 6949281 +3706/20000 train_loss: 2.4586 train_time: 7.0m tok/s: 6948838 +3707/20000 train_loss: 2.4563 train_time: 7.0m tok/s: 6948409 +3708/20000 train_loss: 2.4078 train_time: 7.0m tok/s: 6947980 +3709/20000 train_loss: 2.5379 train_time: 7.0m tok/s: 6947557 +3710/20000 train_loss: 2.4045 train_time: 7.0m tok/s: 6947106 +3711/20000 train_loss: 2.5436 train_time: 7.0m tok/s: 6946655 +3712/20000 
train_loss: 2.4579 train_time: 7.0m tok/s: 6946208 +3713/20000 train_loss: 2.4666 train_time: 7.0m tok/s: 6945771 +3714/20000 train_loss: 2.4311 train_time: 7.0m tok/s: 6945336 +3715/20000 train_loss: 2.4250 train_time: 7.0m tok/s: 6944905 +3716/20000 train_loss: 2.4175 train_time: 7.0m tok/s: 6944443 +3717/20000 train_loss: 2.3527 train_time: 7.0m tok/s: 6944004 +3718/20000 train_loss: 2.5505 train_time: 7.0m tok/s: 6943562 +3719/20000 train_loss: 2.4370 train_time: 7.0m tok/s: 6943130 +3720/20000 train_loss: 2.5071 train_time: 7.0m tok/s: 6942681 +3721/20000 train_loss: 2.5569 train_time: 7.0m tok/s: 6942229 +3722/20000 train_loss: 2.4425 train_time: 7.0m tok/s: 6941778 +3723/20000 train_loss: 2.6144 train_time: 7.0m tok/s: 6941331 +3724/20000 train_loss: 2.4241 train_time: 7.0m tok/s: 6940899 +3725/20000 train_loss: 2.4685 train_time: 7.0m tok/s: 6940452 +3726/20000 train_loss: 2.4326 train_time: 7.0m tok/s: 6940018 +3727/20000 train_loss: 2.4226 train_time: 7.0m tok/s: 6939582 +3728/20000 train_loss: 2.4421 train_time: 7.0m tok/s: 6939152 +3729/20000 train_loss: 2.5146 train_time: 7.0m tok/s: 6938722 +3730/20000 train_loss: 2.4277 train_time: 7.0m tok/s: 6938285 +3731/20000 train_loss: 2.5127 train_time: 7.0m tok/s: 6937857 +3732/20000 train_loss: 2.4351 train_time: 7.1m tok/s: 6937409 +3733/20000 train_loss: 2.5374 train_time: 7.1m tok/s: 6936987 +3734/20000 train_loss: 2.4500 train_time: 7.1m tok/s: 6936568 +3735/20000 train_loss: 2.5236 train_time: 7.1m tok/s: 6936104 +3736/20000 train_loss: 2.4301 train_time: 7.1m tok/s: 6935632 +3737/20000 train_loss: 2.4456 train_time: 7.1m tok/s: 6935217 +3738/20000 train_loss: 2.3940 train_time: 7.1m tok/s: 6934770 +3739/20000 train_loss: 2.4605 train_time: 7.1m tok/s: 6934354 +3740/20000 train_loss: 2.5335 train_time: 7.1m tok/s: 6933930 +3741/20000 train_loss: 2.3322 train_time: 7.1m tok/s: 6933452 +3742/20000 train_loss: 2.4603 train_time: 7.1m tok/s: 6933036 +3743/20000 train_loss: 2.4098 train_time: 7.1m tok/s: 
6932581 +3744/20000 train_loss: 2.4352 train_time: 7.1m tok/s: 6932173 +3745/20000 train_loss: 2.5164 train_time: 7.1m tok/s: 6931757 +3746/20000 train_loss: 2.3734 train_time: 7.1m tok/s: 6931312 +3747/20000 train_loss: 2.4375 train_time: 7.1m tok/s: 6930879 +3748/20000 train_loss: 2.5497 train_time: 7.1m tok/s: 6930462 +3749/20000 train_loss: 2.5413 train_time: 7.1m tok/s: 6930024 +3750/20000 train_loss: 2.4797 train_time: 7.1m tok/s: 6929574 +3751/20000 train_loss: 2.5771 train_time: 7.1m tok/s: 6929156 +3752/20000 train_loss: 2.4399 train_time: 7.1m tok/s: 6928723 +3753/20000 train_loss: 2.3564 train_time: 7.1m tok/s: 6928293 +3754/20000 train_loss: 2.4468 train_time: 7.1m tok/s: 6927868 +3755/20000 train_loss: 2.3877 train_time: 7.1m tok/s: 6927418 +3756/20000 train_loss: 2.4834 train_time: 7.1m tok/s: 6926999 +3757/20000 train_loss: 2.3679 train_time: 7.1m tok/s: 6926566 +3758/20000 train_loss: 2.7266 train_time: 7.1m tok/s: 6926136 +3759/20000 train_loss: 2.6070 train_time: 7.1m tok/s: 6925719 +3760/20000 train_loss: 2.5438 train_time: 7.1m tok/s: 6925301 +3761/20000 train_loss: 2.6252 train_time: 7.1m tok/s: 6924841 +3762/20000 train_loss: 2.4711 train_time: 7.1m tok/s: 6924410 +3763/20000 train_loss: 2.4951 train_time: 7.1m tok/s: 6923984 +3764/20000 train_loss: 2.4541 train_time: 7.1m tok/s: 6923564 +3765/20000 train_loss: 2.4298 train_time: 7.1m tok/s: 6923135 +3766/20000 train_loss: 2.4160 train_time: 7.1m tok/s: 6922726 +3767/20000 train_loss: 2.4754 train_time: 7.1m tok/s: 6922287 +3768/20000 train_loss: 2.4730 train_time: 7.1m tok/s: 6921858 +3769/20000 train_loss: 2.4116 train_time: 7.1m tok/s: 6921437 +3770/20000 train_loss: 2.5575 train_time: 7.1m tok/s: 6921015 +3771/20000 train_loss: 2.4484 train_time: 7.1m tok/s: 6920582 +3772/20000 train_loss: 2.5163 train_time: 7.1m tok/s: 6920153 +3773/20000 train_loss: 2.4890 train_time: 7.1m tok/s: 6919738 +3774/20000 train_loss: 2.3915 train_time: 7.1m tok/s: 6919325 +3775/20000 train_loss: 2.6255 
train_time: 7.2m tok/s: 6918905 +3776/20000 train_loss: 2.4993 train_time: 7.2m tok/s: 6918469 +3777/20000 train_loss: 2.4009 train_time: 7.2m tok/s: 6918025 +3778/20000 train_loss: 2.4639 train_time: 7.2m tok/s: 6917620 +3779/20000 train_loss: 2.4570 train_time: 7.2m tok/s: 6917203 +3780/20000 train_loss: 2.3806 train_time: 7.2m tok/s: 6916774 +3781/20000 train_loss: 2.4478 train_time: 7.2m tok/s: 6916321 +3782/20000 train_loss: 2.3845 train_time: 7.2m tok/s: 6915883 +3783/20000 train_loss: 2.4574 train_time: 7.2m tok/s: 6915474 +3784/20000 train_loss: 2.4628 train_time: 7.2m tok/s: 6915085 +3785/20000 train_loss: 2.4815 train_time: 7.2m tok/s: 6914685 +3786/20000 train_loss: 2.5002 train_time: 7.2m tok/s: 6914264 +3787/20000 train_loss: 2.5079 train_time: 7.2m tok/s: 6913824 +3788/20000 train_loss: 2.4197 train_time: 7.2m tok/s: 6913400 +3789/20000 train_loss: 2.3864 train_time: 7.2m tok/s: 6913009 +3790/20000 train_loss: 2.4499 train_time: 7.2m tok/s: 6912589 +3791/20000 train_loss: 2.3469 train_time: 7.2m tok/s: 6912162 +3792/20000 train_loss: 2.4631 train_time: 7.2m tok/s: 6911726 +3793/20000 train_loss: 2.3849 train_time: 7.2m tok/s: 6911278 +3794/20000 train_loss: 2.4797 train_time: 7.2m tok/s: 6910860 +3795/20000 train_loss: 2.4674 train_time: 7.2m tok/s: 6910433 +3796/20000 train_loss: 2.5369 train_time: 7.2m tok/s: 6910009 +3797/20000 train_loss: 2.5795 train_time: 7.2m tok/s: 6909592 +3798/20000 train_loss: 2.6861 train_time: 7.2m tok/s: 6909180 +3799/20000 train_loss: 2.4848 train_time: 7.2m tok/s: 6908756 +3800/20000 train_loss: 2.4874 train_time: 7.2m tok/s: 6908315 +3801/20000 train_loss: 2.2677 train_time: 7.2m tok/s: 6907882 +3802/20000 train_loss: 2.4870 train_time: 7.2m tok/s: 6907472 +3803/20000 train_loss: 2.3987 train_time: 7.2m tok/s: 6907081 +3804/20000 train_loss: 2.4704 train_time: 7.2m tok/s: 6906681 +3805/20000 train_loss: 2.3905 train_time: 7.2m tok/s: 6906275 +3806/20000 train_loss: 2.4750 train_time: 7.2m tok/s: 6905859 +3807/20000 
train_loss: 2.4750 train_time: 7.2m tok/s: 6905438 +3808/20000 train_loss: 2.6567 train_time: 7.2m tok/s: 6905026 +3809/20000 train_loss: 2.5722 train_time: 7.2m tok/s: 6904617 +3810/20000 train_loss: 2.4330 train_time: 7.2m tok/s: 6904183 +3811/20000 train_loss: 2.4417 train_time: 7.2m tok/s: 6903760 +3812/20000 train_loss: 2.4410 train_time: 7.2m tok/s: 6903372 +3813/20000 train_loss: 2.4562 train_time: 7.2m tok/s: 6902953 +3814/20000 train_loss: 4.3643 train_time: 7.2m tok/s: 6902480 +3815/20000 train_loss: 2.4399 train_time: 7.2m tok/s: 6902055 +3816/20000 train_loss: 2.5606 train_time: 7.2m tok/s: 6901615 +3817/20000 train_loss: 2.4954 train_time: 7.2m tok/s: 6901230 +3818/20000 train_loss: 2.5139 train_time: 7.3m tok/s: 6900825 +3819/20000 train_loss: 2.3856 train_time: 7.3m tok/s: 6900430 +3820/20000 train_loss: 2.5529 train_time: 7.3m tok/s: 6900005 +3821/20000 train_loss: 2.4249 train_time: 7.3m tok/s: 6899606 +3822/20000 train_loss: 2.4596 train_time: 7.3m tok/s: 6899209 +3823/20000 train_loss: 2.4849 train_time: 7.3m tok/s: 6898815 +3824/20000 train_loss: 2.5082 train_time: 7.3m tok/s: 6898430 +3825/20000 train_loss: 2.4818 train_time: 7.3m tok/s: 6898013 +3826/20000 train_loss: 2.5556 train_time: 7.3m tok/s: 6897592 +3827/20000 train_loss: 2.5690 train_time: 7.3m tok/s: 6897179 +3828/20000 train_loss: 2.5521 train_time: 7.3m tok/s: 6896759 +3829/20000 train_loss: 2.5103 train_time: 7.3m tok/s: 6896358 +3830/20000 train_loss: 2.5445 train_time: 7.3m tok/s: 6895959 +3831/20000 train_loss: 2.5065 train_time: 7.3m tok/s: 6895538 +3832/20000 train_loss: 2.4585 train_time: 7.3m tok/s: 6895128 +3833/20000 train_loss: 2.4892 train_time: 7.3m tok/s: 6894718 +3834/20000 train_loss: 2.4200 train_time: 7.3m tok/s: 6894300 +3835/20000 train_loss: 2.4584 train_time: 7.3m tok/s: 6893896 +3836/20000 train_loss: 2.4239 train_time: 7.3m tok/s: 6893483 +3837/20000 train_loss: 2.4438 train_time: 7.3m tok/s: 6893096 +3838/20000 train_loss: 2.4516 train_time: 7.3m tok/s: 
6892665 +3839/20000 train_loss: 2.4359 train_time: 7.3m tok/s: 6892258 +3840/20000 train_loss: 2.3938 train_time: 7.3m tok/s: 6891856 +3841/20000 train_loss: 2.5600 train_time: 7.3m tok/s: 6891459 +3842/20000 train_loss: 2.4213 train_time: 7.3m tok/s: 6891053 +3843/20000 train_loss: 2.4505 train_time: 7.3m tok/s: 6890640 +3844/20000 train_loss: 2.3915 train_time: 7.3m tok/s: 6890269 +3845/20000 train_loss: 2.5076 train_time: 7.3m tok/s: 6889860 +3846/20000 train_loss: 2.5979 train_time: 7.3m tok/s: 6889435 +3847/20000 train_loss: 2.4147 train_time: 7.3m tok/s: 6889039 +3848/20000 train_loss: 2.4411 train_time: 7.3m tok/s: 6888631 +3849/20000 train_loss: 2.5728 train_time: 7.3m tok/s: 6888213 +3850/20000 train_loss: 2.5451 train_time: 7.3m tok/s: 6887781 +3851/20000 train_loss: 2.4591 train_time: 7.3m tok/s: 6887368 +3852/20000 train_loss: 2.4417 train_time: 7.3m tok/s: 6886975 +3853/20000 train_loss: 2.3931 train_time: 7.3m tok/s: 6886571 +3854/20000 train_loss: 2.2923 train_time: 7.3m tok/s: 6886184 +3855/20000 train_loss: 2.4881 train_time: 7.3m tok/s: 6885811 +3856/20000 train_loss: 2.4109 train_time: 7.3m tok/s: 6885391 +3857/20000 train_loss: 2.2105 train_time: 7.3m tok/s: 6884937 +3858/20000 train_loss: 2.4890 train_time: 7.3m tok/s: 6884497 +3859/20000 train_loss: 2.3970 train_time: 7.3m tok/s: 6884101 +3860/20000 train_loss: 2.3325 train_time: 7.3m tok/s: 6883725 +3861/20000 train_loss: 2.4748 train_time: 7.4m tok/s: 6883340 +3862/20000 train_loss: 2.5965 train_time: 7.4m tok/s: 6882947 +3863/20000 train_loss: 2.5172 train_time: 7.4m tok/s: 6882571 +3864/20000 train_loss: 2.4664 train_time: 7.4m tok/s: 6882160 +3865/20000 train_loss: 2.4892 train_time: 7.4m tok/s: 6881753 +3866/20000 train_loss: 2.4108 train_time: 7.4m tok/s: 6881352 +3867/20000 train_loss: 2.8966 train_time: 7.4m tok/s: 6880929 +3868/20000 train_loss: 2.3213 train_time: 7.4m tok/s: 6880500 +3869/20000 train_loss: 2.4434 train_time: 7.4m tok/s: 6880117 +3870/20000 train_loss: 2.5595 
train_time: 7.4m tok/s: 6879739 +3871/20000 train_loss: 2.4040 train_time: 7.4m tok/s: 6879339 +3872/20000 train_loss: 2.4521 train_time: 7.4m tok/s: 6878956 +3873/20000 train_loss: 2.3848 train_time: 7.4m tok/s: 6878552 +3874/20000 train_loss: 2.4459 train_time: 7.4m tok/s: 6878180 +3875/20000 train_loss: 2.4445 train_time: 7.4m tok/s: 6877772 +3876/20000 train_loss: 2.3330 train_time: 7.4m tok/s: 6877369 +3877/20000 train_loss: 2.6177 train_time: 7.4m tok/s: 6876959 +3878/20000 train_loss: 2.9864 train_time: 7.4m tok/s: 6876522 +3879/20000 train_loss: 2.4215 train_time: 7.4m tok/s: 6876109 +3880/20000 train_loss: 2.4189 train_time: 7.4m tok/s: 6875732 +3881/20000 train_loss: 2.4080 train_time: 7.4m tok/s: 6875333 +3882/20000 train_loss: 2.3107 train_time: 7.4m tok/s: 6874939 +3883/20000 train_loss: 2.4713 train_time: 7.4m tok/s: 6874544 +3884/20000 train_loss: 2.4661 train_time: 7.4m tok/s: 6874174 +3885/20000 train_loss: 2.4107 train_time: 7.4m tok/s: 6873769 +3886/20000 train_loss: 2.3840 train_time: 7.4m tok/s: 6873381 +3887/20000 train_loss: 2.4196 train_time: 7.4m tok/s: 6872980 +3888/20000 train_loss: 2.4241 train_time: 7.4m tok/s: 6872598 +3889/20000 train_loss: 2.4070 train_time: 7.4m tok/s: 6872206 +3890/20000 train_loss: 2.3569 train_time: 7.4m tok/s: 6871823 +3891/20000 train_loss: 2.5891 train_time: 7.4m tok/s: 6871426 +3892/20000 train_loss: 2.3919 train_time: 7.4m tok/s: 6871046 +3893/20000 train_loss: 2.5019 train_time: 7.4m tok/s: 6870660 +3894/20000 train_loss: 2.2318 train_time: 7.4m tok/s: 6870251 +3895/20000 train_loss: 2.5604 train_time: 7.4m tok/s: 6869862 +3896/20000 train_loss: 2.4684 train_time: 7.4m tok/s: 6869450 +3897/20000 train_loss: 2.4666 train_time: 7.4m tok/s: 6869046 +3898/20000 train_loss: 2.3843 train_time: 7.4m tok/s: 6868671 +3899/20000 train_loss: 2.6210 train_time: 7.4m tok/s: 6868292 +3900/20000 train_loss: 2.5398 train_time: 7.4m tok/s: 6867869 +3901/20000 train_loss: 2.4488 train_time: 7.4m tok/s: 6867496 +3902/20000 
train_loss: 2.4836 train_time: 7.4m tok/s: 6867075 +3903/20000 train_loss: 2.4843 train_time: 7.5m tok/s: 6866670 +3904/20000 train_loss: 2.3097 train_time: 7.5m tok/s: 6866278 +3905/20000 train_loss: 2.4339 train_time: 7.5m tok/s: 6865883 +3906/20000 train_loss: 2.5458 train_time: 7.5m tok/s: 6865482 +3907/20000 train_loss: 2.4673 train_time: 7.5m tok/s: 6865113 +3908/20000 train_loss: 2.4658 train_time: 7.5m tok/s: 6864735 +3909/20000 train_loss: 2.4747 train_time: 7.5m tok/s: 6864350 +3910/20000 train_loss: 2.4505 train_time: 7.5m tok/s: 6863940 +3911/20000 train_loss: 2.5220 train_time: 7.5m tok/s: 6863553 +3912/20000 train_loss: 2.5772 train_time: 7.5m tok/s: 6863171 +3913/20000 train_loss: 2.4878 train_time: 7.5m tok/s: 6862791 +3914/20000 train_loss: 2.4568 train_time: 7.5m tok/s: 6862414 +3915/20000 train_loss: 2.3558 train_time: 7.5m tok/s: 6862046 +3916/20000 train_loss: 2.4161 train_time: 7.5m tok/s: 6861649 +3917/20000 train_loss: 2.5839 train_time: 7.5m tok/s: 6861241 +3918/20000 train_loss: 2.4461 train_time: 7.5m tok/s: 6860863 +3919/20000 train_loss: 2.4019 train_time: 7.5m tok/s: 6860484 +3920/20000 train_loss: 2.3553 train_time: 7.5m tok/s: 6860086 +3921/20000 train_loss: 2.4960 train_time: 7.5m tok/s: 6859707 +3922/20000 train_loss: 2.5812 train_time: 7.5m tok/s: 6859312 +3923/20000 train_loss: 2.4943 train_time: 7.5m tok/s: 6858954 +3924/20000 train_loss: 2.5576 train_time: 7.5m tok/s: 6858563 +3925/20000 train_loss: 2.4399 train_time: 7.5m tok/s: 6858173 +3926/20000 train_loss: 2.3679 train_time: 7.5m tok/s: 6857759 +3927/20000 train_loss: 2.3425 train_time: 7.5m tok/s: 6857376 +3928/20000 train_loss: 2.4462 train_time: 7.5m tok/s: 6856983 +3929/20000 train_loss: 2.4344 train_time: 7.5m tok/s: 6856599 +3930/20000 train_loss: 2.5224 train_time: 7.5m tok/s: 6856227 +3931/20000 train_loss: 2.4873 train_time: 7.5m tok/s: 6855827 +3932/20000 train_loss: 2.4741 train_time: 7.5m tok/s: 6855441 +3933/20000 train_loss: 2.5457 train_time: 7.5m tok/s: 
6855079 +3934/20000 train_loss: 2.4876 train_time: 7.5m tok/s: 6854686 +3935/20000 train_loss: 2.4068 train_time: 7.5m tok/s: 6854285 +3936/20000 train_loss: 2.3849 train_time: 7.5m tok/s: 6853886 +3937/20000 train_loss: 2.4073 train_time: 7.5m tok/s: 6853517 +3938/20000 train_loss: 2.3514 train_time: 7.5m tok/s: 6853152 +3939/20000 train_loss: 2.4975 train_time: 7.5m tok/s: 6852799 +3940/20000 train_loss: 2.4246 train_time: 7.5m tok/s: 6852408 +3941/20000 train_loss: 2.5815 train_time: 7.5m tok/s: 6852010 +3942/20000 train_loss: 2.3550 train_time: 7.5m tok/s: 6851620 +3943/20000 train_loss: 2.4203 train_time: 7.5m tok/s: 6851233 +3944/20000 train_loss: 2.4997 train_time: 7.5m tok/s: 6850869 +3945/20000 train_loss: 2.4129 train_time: 7.5m tok/s: 6850479 +3946/20000 train_loss: 2.4487 train_time: 7.6m tok/s: 6850073 +3947/20000 train_loss: 2.4708 train_time: 7.6m tok/s: 6849727 +3948/20000 train_loss: 2.3537 train_time: 7.6m tok/s: 6849338 +3949/20000 train_loss: 2.3739 train_time: 7.6m tok/s: 6848980 +3950/20000 train_loss: 2.3438 train_time: 7.6m tok/s: 6848599 +3951/20000 train_loss: 2.4704 train_time: 7.6m tok/s: 6848199 +3952/20000 train_loss: 2.4448 train_time: 7.6m tok/s: 6847816 +3953/20000 train_loss: 2.4676 train_time: 7.6m tok/s: 6847442 +3954/20000 train_loss: 2.4827 train_time: 7.6m tok/s: 6847081 +3955/20000 train_loss: 2.4738 train_time: 7.6m tok/s: 6846716 +3956/20000 train_loss: 2.4587 train_time: 7.6m tok/s: 6846345 +3957/20000 train_loss: 2.3429 train_time: 7.6m tok/s: 6845950 +3958/20000 train_loss: 2.5289 train_time: 7.6m tok/s: 6845563 +3959/20000 train_loss: 2.2728 train_time: 7.6m tok/s: 6845157 +3960/20000 train_loss: 2.4012 train_time: 7.6m tok/s: 6844782 +3961/20000 train_loss: 2.3845 train_time: 7.6m tok/s: 6844400 +3962/20000 train_loss: 2.3858 train_time: 7.6m tok/s: 6844035 +3963/20000 train_loss: 2.4865 train_time: 7.6m tok/s: 6843646 +3964/20000 train_loss: 2.2929 train_time: 7.6m tok/s: 6843266 +3965/20000 train_loss: 2.5916 
train_time: 7.6m tok/s: 6842895 +3966/20000 train_loss: 2.4607 train_time: 7.6m tok/s: 6842514 +3967/20000 train_loss: 2.4265 train_time: 7.6m tok/s: 6842139 +3968/20000 train_loss: 2.4696 train_time: 7.6m tok/s: 6841770 +3969/20000 train_loss: 2.4181 train_time: 7.6m tok/s: 6841416 +3970/20000 train_loss: 2.5118 train_time: 7.6m tok/s: 6841043 +3971/20000 train_loss: 2.4174 train_time: 7.6m tok/s: 6840655 +3972/20000 train_loss: 2.4308 train_time: 7.6m tok/s: 6840273 +3973/20000 train_loss: 2.3967 train_time: 7.6m tok/s: 6839902 +3974/20000 train_loss: 2.3551 train_time: 7.6m tok/s: 6839534 +3975/20000 train_loss: 2.5185 train_time: 7.6m tok/s: 6839146 +3976/20000 train_loss: 2.4908 train_time: 7.6m tok/s: 6838773 +3977/20000 train_loss: 2.7632 train_time: 7.6m tok/s: 6838404 +3978/20000 train_loss: 2.4931 train_time: 7.6m tok/s: 6838030 +3979/20000 train_loss: 3.1353 train_time: 7.6m tok/s: 6837606 +3980/20000 train_loss: 2.4527 train_time: 7.6m tok/s: 6837221 +3981/20000 train_loss: 2.3806 train_time: 7.6m tok/s: 6836871 +3982/20000 train_loss: 2.4762 train_time: 7.6m tok/s: 6836502 +3983/20000 train_loss: 2.3985 train_time: 7.6m tok/s: 6836120 +3984/20000 train_loss: 2.5233 train_time: 7.6m tok/s: 6835746 +3985/20000 train_loss: 2.4775 train_time: 7.6m tok/s: 6835382 +3986/20000 train_loss: 2.4070 train_time: 7.6m tok/s: 6835031 +3987/20000 train_loss: 2.5668 train_time: 7.6m tok/s: 6834665 +3988/20000 train_loss: 2.4912 train_time: 7.6m tok/s: 6834310 +3989/20000 train_loss: 2.4397 train_time: 7.7m tok/s: 6833953 +3990/20000 train_loss: 2.4308 train_time: 7.7m tok/s: 6833573 +3991/20000 train_loss: 2.4465 train_time: 7.7m tok/s: 6833213 +3992/20000 train_loss: 2.4581 train_time: 7.7m tok/s: 6832840 +3993/20000 train_loss: 2.3835 train_time: 7.7m tok/s: 6832487 +3994/20000 train_loss: 2.1509 train_time: 7.7m tok/s: 6832095 +3995/20000 train_loss: 2.4570 train_time: 7.7m tok/s: 6831707 +3996/20000 train_loss: 2.3691 train_time: 7.7m tok/s: 6831340 +3997/20000 
train_loss: 2.4743 train_time: 7.7m tok/s: 6830976 +3998/20000 train_loss: 2.4019 train_time: 7.7m tok/s: 6830592 +3999/20000 train_loss: 2.4127 train_time: 7.7m tok/s: 6830216 +4000/20000 train_loss: 2.5109 train_time: 7.7m tok/s: 6829869 +4001/20000 train_loss: 2.4235 train_time: 7.7m tok/s: 6829501 +4002/20000 train_loss: 2.4290 train_time: 7.7m tok/s: 6829146 +4003/20000 train_loss: 2.3757 train_time: 7.7m tok/s: 6828777 +4004/20000 train_loss: 2.4843 train_time: 7.7m tok/s: 6828430 +4005/20000 train_loss: 2.4378 train_time: 7.7m tok/s: 6828069 +4006/20000 train_loss: 2.4658 train_time: 7.7m tok/s: 6827686 +4007/20000 train_loss: 2.4707 train_time: 7.7m tok/s: 6827284 +4008/20000 train_loss: 2.3598 train_time: 7.7m tok/s: 6826900 +4009/20000 train_loss: 2.3881 train_time: 7.7m tok/s: 6826559 +4010/20000 train_loss: 2.4929 train_time: 7.7m tok/s: 6826205 +4011/20000 train_loss: 2.4572 train_time: 7.7m tok/s: 6825842 +4012/20000 train_loss: 2.4730 train_time: 7.7m tok/s: 6825477 +4013/20000 train_loss: 2.3537 train_time: 7.7m tok/s: 6825084 +4014/20000 train_loss: 2.4029 train_time: 7.7m tok/s: 6824740 +4015/20000 train_loss: 2.4278 train_time: 7.7m tok/s: 6824349 +4016/20000 train_loss: 2.4270 train_time: 7.7m tok/s: 6823986 +4017/20000 train_loss: 2.5356 train_time: 7.7m tok/s: 6823622 +4018/20000 train_loss: 2.3880 train_time: 7.7m tok/s: 6823269 +4019/20000 train_loss: 2.3028 train_time: 7.7m tok/s: 6822895 +4020/20000 train_loss: 2.3337 train_time: 7.7m tok/s: 6822535 +4021/20000 train_loss: 2.3816 train_time: 7.7m tok/s: 6822140 +4022/20000 train_loss: 2.4630 train_time: 7.7m tok/s: 6821787 +4023/20000 train_loss: 2.4455 train_time: 7.7m tok/s: 6821435 +4024/20000 train_loss: 2.5103 train_time: 7.7m tok/s: 6821083 +4025/20000 train_loss: 2.4874 train_time: 7.7m tok/s: 6820749 +4026/20000 train_loss: 2.4179 train_time: 7.7m tok/s: 6820369 +4027/20000 train_loss: 2.3457 train_time: 7.7m tok/s: 6819987 +4028/20000 train_loss: 2.2439 train_time: 7.7m tok/s: 
6819618 +4029/20000 train_loss: 2.3894 train_time: 7.7m tok/s: 6819268 +4030/20000 train_loss: 2.3979 train_time: 7.7m tok/s: 6818925 +4031/20000 train_loss: 2.4774 train_time: 7.7m tok/s: 6818543 +4032/20000 train_loss: 2.3726 train_time: 7.8m tok/s: 6818187 +4033/20000 train_loss: 2.4349 train_time: 7.8m tok/s: 6817828 +4034/20000 train_loss: 2.6007 train_time: 7.8m tok/s: 6817485 +4035/20000 train_loss: 2.5282 train_time: 7.8m tok/s: 6817126 +4036/20000 train_loss: 2.4933 train_time: 7.8m tok/s: 6816766 +4037/20000 train_loss: 2.4826 train_time: 7.8m tok/s: 6816405 +4038/20000 train_loss: 2.4953 train_time: 7.8m tok/s: 6816036 +4039/20000 train_loss: 2.4712 train_time: 7.8m tok/s: 6815684 +4040/20000 train_loss: 2.4241 train_time: 7.8m tok/s: 6815307 +4041/20000 train_loss: 2.3883 train_time: 7.8m tok/s: 6814941 +4042/20000 train_loss: 2.3673 train_time: 7.8m tok/s: 6814581 +4043/20000 train_loss: 2.3778 train_time: 7.8m tok/s: 6814220 +4044/20000 train_loss: 2.4259 train_time: 7.8m tok/s: 6813854 +4045/20000 train_loss: 2.4568 train_time: 7.8m tok/s: 6813507 +4046/20000 train_loss: 2.5261 train_time: 7.8m tok/s: 6813147 +4047/20000 train_loss: 2.4990 train_time: 7.8m tok/s: 6812800 +4048/20000 train_loss: 2.4298 train_time: 7.8m tok/s: 6812436 +4049/20000 train_loss: 2.5541 train_time: 7.8m tok/s: 6812068 +4050/20000 train_loss: 2.4457 train_time: 7.8m tok/s: 6811695 +4051/20000 train_loss: 2.4080 train_time: 7.8m tok/s: 6811358 +4052/20000 train_loss: 2.4907 train_time: 7.8m tok/s: 6811005 +4053/20000 train_loss: 2.4391 train_time: 7.8m tok/s: 6810630 +4054/20000 train_loss: 2.4194 train_time: 7.8m tok/s: 6810277 +4055/20000 train_loss: 2.4473 train_time: 7.8m tok/s: 6809923 +4056/20000 train_loss: 2.3248 train_time: 7.8m tok/s: 6809532 +4057/20000 train_loss: 2.4150 train_time: 7.8m tok/s: 6809185 +4058/20000 train_loss: 2.5571 train_time: 7.8m tok/s: 6808816 +4059/20000 train_loss: 2.2852 train_time: 7.8m tok/s: 6808479 +4060/20000 train_loss: 2.3964 
train_time: 7.8m tok/s: 6808133 +4061/20000 train_loss: 2.4993 train_time: 7.8m tok/s: 6807788 +4062/20000 train_loss: 2.4867 train_time: 7.8m tok/s: 6807439 +4063/20000 train_loss: 2.4527 train_time: 7.8m tok/s: 6807058 +4064/20000 train_loss: 2.4276 train_time: 7.8m tok/s: 6806716 +4065/20000 train_loss: 2.5007 train_time: 7.8m tok/s: 6806378 +4066/20000 train_loss: 2.4107 train_time: 7.8m tok/s: 6806024 +4067/20000 train_loss: 2.4784 train_time: 7.8m tok/s: 6805673 +4068/20000 train_loss: 2.4842 train_time: 7.8m tok/s: 6805320 +4069/20000 train_loss: 2.3702 train_time: 7.8m tok/s: 6804977 +4070/20000 train_loss: 2.3748 train_time: 7.8m tok/s: 6804610 +4071/20000 train_loss: 2.7210 train_time: 7.8m tok/s: 6804234 +4072/20000 train_loss: 2.4521 train_time: 7.8m tok/s: 6803901 +4073/20000 train_loss: 2.4464 train_time: 7.8m tok/s: 6803524 +4074/20000 train_loss: 2.4076 train_time: 7.8m tok/s: 6803175 +4075/20000 train_loss: 2.5606 train_time: 7.9m tok/s: 6802826 +4076/20000 train_loss: 2.5849 train_time: 7.9m tok/s: 6802463 +4077/20000 train_loss: 2.4957 train_time: 7.9m tok/s: 6802131 +4078/20000 train_loss: 2.3756 train_time: 7.9m tok/s: 6801775 +4079/20000 train_loss: 3.0149 train_time: 7.9m tok/s: 6801372 +4080/20000 train_loss: 2.3631 train_time: 7.9m tok/s: 6801003 +4081/20000 train_loss: 2.3900 train_time: 7.9m tok/s: 6800652 +4082/20000 train_loss: 2.4094 train_time: 7.9m tok/s: 6800324 +4083/20000 train_loss: 2.3173 train_time: 7.9m tok/s: 6799984 +4084/20000 train_loss: 2.2540 train_time: 7.9m tok/s: 6799649 +4085/20000 train_loss: 2.3575 train_time: 7.9m tok/s: 6799285 +4086/20000 train_loss: 2.4926 train_time: 7.9m tok/s: 6798956 +4087/20000 train_loss: 2.5344 train_time: 7.9m tok/s: 6798616 +4088/20000 train_loss: 2.5354 train_time: 7.9m tok/s: 6798289 +4089/20000 train_loss: 2.4436 train_time: 7.9m tok/s: 6797939 +4090/20000 train_loss: 2.3718 train_time: 7.9m tok/s: 6797591 +4091/20000 train_loss: 2.5231 train_time: 7.9m tok/s: 6797237 +4092/20000 
train_loss: 2.5259 train_time: 7.9m tok/s: 6796863 +4093/20000 train_loss: 2.4950 train_time: 7.9m tok/s: 6796515 +4094/20000 train_loss: 2.2989 train_time: 7.9m tok/s: 6796144 +4095/20000 train_loss: 2.4546 train_time: 7.9m tok/s: 6795817 +4096/20000 train_loss: 2.2626 train_time: 7.9m tok/s: 6795432 +4097/20000 train_loss: 2.3126 train_time: 7.9m tok/s: 6795087 +4098/20000 train_loss: 2.4212 train_time: 7.9m tok/s: 6794761 +4099/20000 train_loss: 2.1958 train_time: 7.9m tok/s: 6794396 +4100/20000 train_loss: 2.4239 train_time: 7.9m tok/s: 6794044 +4101/20000 train_loss: 2.5327 train_time: 7.9m tok/s: 6793728 +4102/20000 train_loss: 2.2890 train_time: 7.9m tok/s: 6793385 +4103/20000 train_loss: 2.4391 train_time: 7.9m tok/s: 6793052 +4104/20000 train_loss: 2.3818 train_time: 7.9m tok/s: 6792697 +4105/20000 train_loss: 2.4331 train_time: 7.9m tok/s: 6792337 +4106/20000 train_loss: 2.4675 train_time: 7.9m tok/s: 6791997 +4107/20000 train_loss: 2.4084 train_time: 7.9m tok/s: 6791664 +4108/20000 train_loss: 2.4185 train_time: 7.9m tok/s: 6791338 +4109/20000 train_loss: 2.3947 train_time: 7.9m tok/s: 6791008 +4110/20000 train_loss: 2.3449 train_time: 7.9m tok/s: 6790642 +4111/20000 train_loss: 2.3994 train_time: 7.9m tok/s: 6790298 +4112/20000 train_loss: 2.4441 train_time: 7.9m tok/s: 6789953 +4113/20000 train_loss: 2.4252 train_time: 7.9m tok/s: 6789603 +4114/20000 train_loss: 2.3944 train_time: 7.9m tok/s: 6789257 +4115/20000 train_loss: 2.5275 train_time: 7.9m tok/s: 6788940 +4116/20000 train_loss: 2.3680 train_time: 7.9m tok/s: 6788601 +4117/20000 train_loss: 2.3803 train_time: 7.9m tok/s: 6788261 +4118/20000 train_loss: 2.4317 train_time: 8.0m tok/s: 6787926 +4119/20000 train_loss: 2.4651 train_time: 8.0m tok/s: 6787592 +4120/20000 train_loss: 2.4368 train_time: 8.0m tok/s: 6787211 +4121/20000 train_loss: 2.4031 train_time: 8.0m tok/s: 6786819 +4122/20000 train_loss: 2.2889 train_time: 8.0m tok/s: 6786489 +4123/20000 train_loss: 2.3871 train_time: 8.0m tok/s: 
6786151 +4124/20000 train_loss: 2.2592 train_time: 8.0m tok/s: 6785799 +4125/20000 train_loss: 2.4553 train_time: 8.0m tok/s: 6785463 +4126/20000 train_loss: 2.3752 train_time: 8.0m tok/s: 6785118 +4127/20000 train_loss: 2.3759 train_time: 8.0m tok/s: 6784781 +4128/20000 train_loss: 2.4330 train_time: 8.0m tok/s: 6784468 +4129/20000 train_loss: 2.4652 train_time: 8.0m tok/s: 6784121 +4130/20000 train_loss: 2.4007 train_time: 8.0m tok/s: 6783788 +4131/20000 train_loss: 2.4516 train_time: 8.0m tok/s: 6783446 +4132/20000 train_loss: 2.4241 train_time: 8.0m tok/s: 6783094 +4133/20000 train_loss: 2.4413 train_time: 8.0m tok/s: 6782741 +4134/20000 train_loss: 2.4733 train_time: 8.0m tok/s: 6782379 +4135/20000 train_loss: 2.4713 train_time: 8.0m tok/s: 6782054 +4136/20000 train_loss: 2.2978 train_time: 8.0m tok/s: 6781718 +4137/20000 train_loss: 2.4996 train_time: 8.0m tok/s: 6781407 +4138/20000 train_loss: 2.3066 train_time: 8.0m tok/s: 6781056 +4139/20000 train_loss: 2.4580 train_time: 8.0m tok/s: 6780712 +4140/20000 train_loss: 2.5223 train_time: 8.0m tok/s: 6780379 +4141/20000 train_loss: 2.3355 train_time: 8.0m tok/s: 6780028 +4142/20000 train_loss: 2.5032 train_time: 8.0m tok/s: 6779710 +4143/20000 train_loss: 2.4084 train_time: 8.0m tok/s: 6779393 +4144/20000 train_loss: 2.5095 train_time: 8.0m tok/s: 6779078 +4145/20000 train_loss: 2.2764 train_time: 8.0m tok/s: 6778711 +4146/20000 train_loss: 2.4539 train_time: 8.0m tok/s: 6778360 +4147/20000 train_loss: 2.5479 train_time: 8.0m tok/s: 6778008 +4148/20000 train_loss: 2.4394 train_time: 8.0m tok/s: 6777690 +4149/20000 train_loss: 2.2991 train_time: 8.0m tok/s: 6777353 +4150/20000 train_loss: 2.3534 train_time: 8.0m tok/s: 6777020 +4151/20000 train_loss: 2.4357 train_time: 8.0m tok/s: 6776705 +4152/20000 train_loss: 2.4172 train_time: 8.0m tok/s: 6776374 +4153/20000 train_loss: 2.5071 train_time: 8.0m tok/s: 6776036 +4154/20000 train_loss: 2.3608 train_time: 8.0m tok/s: 6775715 +4155/20000 train_loss: 2.4973 
train_time: 8.0m tok/s: 6775407 +4156/20000 train_loss: 2.4304 train_time: 8.0m tok/s: 6775109 +4157/20000 train_loss: 2.4824 train_time: 8.0m tok/s: 6774789 +4158/20000 train_loss: 2.4443 train_time: 8.0m tok/s: 6774482 +4159/20000 train_loss: 2.3514 train_time: 8.0m tok/s: 6774133 +4160/20000 train_loss: 2.3525 train_time: 8.0m tok/s: 6773825 +4161/20000 train_loss: 2.4091 train_time: 8.1m tok/s: 6773469 +4162/20000 train_loss: 2.3212 train_time: 8.1m tok/s: 6773121 +4163/20000 train_loss: 2.5249 train_time: 8.1m tok/s: 6772792 +4164/20000 train_loss: 2.4435 train_time: 8.1m tok/s: 6772485 +4165/20000 train_loss: 2.4464 train_time: 8.1m tok/s: 6772188 +4166/20000 train_loss: 2.4473 train_time: 8.1m tok/s: 6771855 +4167/20000 train_loss: 2.5277 train_time: 8.1m tok/s: 6771528 +4168/20000 train_loss: 2.4350 train_time: 8.1m tok/s: 6771202 +4169/20000 train_loss: 2.4214 train_time: 8.1m tok/s: 6770880 +4170/20000 train_loss: 2.3401 train_time: 8.1m tok/s: 6770524 +4171/20000 train_loss: 2.4406 train_time: 8.1m tok/s: 6770187 +4172/20000 train_loss: 2.3988 train_time: 8.1m tok/s: 6769873 +4173/20000 train_loss: 2.5159 train_time: 8.1m tok/s: 6769546 +4174/20000 train_loss: 2.3083 train_time: 8.1m tok/s: 6769235 +4175/20000 train_loss: 2.4046 train_time: 8.1m tok/s: 6768895 +4176/20000 train_loss: 2.5119 train_time: 8.1m tok/s: 6768570 +4177/20000 train_loss: 2.3823 train_time: 8.1m tok/s: 6768258 +4178/20000 train_loss: 2.4478 train_time: 8.1m tok/s: 6767942 +4179/20000 train_loss: 2.4008 train_time: 8.1m tok/s: 6767595 +4180/20000 train_loss: 2.3933 train_time: 8.1m tok/s: 6767262 +4181/20000 train_loss: 2.5454 train_time: 8.1m tok/s: 6766932 +4182/20000 train_loss: 2.4708 train_time: 8.1m tok/s: 6766600 +4183/20000 train_loss: 2.4617 train_time: 8.1m tok/s: 6766271 +4184/20000 train_loss: 2.4772 train_time: 8.1m tok/s: 6765932 +4185/20000 train_loss: 2.5712 train_time: 8.1m tok/s: 6765591 +4186/20000 train_loss: 2.4695 train_time: 8.1m tok/s: 6765260 +4187/20000 
train_loss: 2.2912 train_time: 8.1m tok/s: 6764932 +4188/20000 train_loss: 2.4588 train_time: 8.1m tok/s: 6764592 +4189/20000 train_loss: 2.4491 train_time: 8.1m tok/s: 6764282 +4190/20000 train_loss: 2.5327 train_time: 8.1m tok/s: 6763945 +4191/20000 train_loss: 2.4205 train_time: 8.1m tok/s: 6763618 +4192/20000 train_loss: 2.3717 train_time: 8.1m tok/s: 6763278 +4193/20000 train_loss: 2.4053 train_time: 8.1m tok/s: 6762968 +4194/20000 train_loss: 2.4726 train_time: 8.1m tok/s: 6762624 +4195/20000 train_loss: 2.4852 train_time: 8.1m tok/s: 6762288 +4196/20000 train_loss: 2.3096 train_time: 8.1m tok/s: 6761977 +4197/20000 train_loss: 2.3271 train_time: 8.1m tok/s: 6761650 +4198/20000 train_loss: 2.4300 train_time: 8.1m tok/s: 6761332 +4199/20000 train_loss: 2.2444 train_time: 8.1m tok/s: 6760994 +4200/20000 train_loss: 2.4366 train_time: 8.1m tok/s: 6760646 +4201/20000 train_loss: 2.3697 train_time: 8.1m tok/s: 6760310 +4202/20000 train_loss: 2.4529 train_time: 8.1m tok/s: 6759990 +4203/20000 train_loss: 2.5889 train_time: 8.1m tok/s: 6759669 +4204/20000 train_loss: 2.4933 train_time: 8.2m tok/s: 6759343 +4205/20000 train_loss: 2.4680 train_time: 8.2m tok/s: 6759009 +4206/20000 train_loss: 2.3970 train_time: 8.2m tok/s: 6758683 +4207/20000 train_loss: 2.4083 train_time: 8.2m tok/s: 6758367 +4208/20000 train_loss: 2.3884 train_time: 8.2m tok/s: 6758021 +4209/20000 train_loss: 2.3817 train_time: 8.2m tok/s: 6757706 +4210/20000 train_loss: 2.3879 train_time: 8.2m tok/s: 6757383 +4211/20000 train_loss: 2.5586 train_time: 8.2m tok/s: 6757057 +4212/20000 train_loss: 2.3679 train_time: 8.2m tok/s: 6756717 +4213/20000 train_loss: 2.3980 train_time: 8.2m tok/s: 6756399 +4214/20000 train_loss: 2.4242 train_time: 8.2m tok/s: 6756093 +4215/20000 train_loss: 2.4521 train_time: 8.2m tok/s: 6755771 +4216/20000 train_loss: 2.5499 train_time: 8.2m tok/s: 6755436 +4217/20000 train_loss: 2.2980 train_time: 8.2m tok/s: 6755121 +4218/20000 train_loss: 2.4565 train_time: 8.2m tok/s: 
6754783 +4219/20000 train_loss: 2.3816 train_time: 8.2m tok/s: 6754465 +4220/20000 train_loss: 2.0787 train_time: 8.2m tok/s: 6754131 +4221/20000 train_loss: 2.4712 train_time: 8.2m tok/s: 6753810 +4222/20000 train_loss: 2.4677 train_time: 8.2m tok/s: 6753493 +4223/20000 train_loss: 2.5702 train_time: 8.2m tok/s: 6753169 +4224/20000 train_loss: 2.4494 train_time: 8.2m tok/s: 6752855 +4225/20000 train_loss: 2.4554 train_time: 8.2m tok/s: 6752533 +4226/20000 train_loss: 2.4276 train_time: 8.2m tok/s: 6752214 +4227/20000 train_loss: 2.4675 train_time: 8.2m tok/s: 6751874 +4228/20000 train_loss: 2.4481 train_time: 8.2m tok/s: 6751544 +4229/20000 train_loss: 2.4379 train_time: 8.2m tok/s: 6751206 +4230/20000 train_loss: 2.3751 train_time: 8.2m tok/s: 6750881 +4231/20000 train_loss: 2.3649 train_time: 8.2m tok/s: 6750586 +4232/20000 train_loss: 2.4053 train_time: 8.2m tok/s: 6750250 +4233/20000 train_loss: 2.5404 train_time: 8.2m tok/s: 6749916 +4234/20000 train_loss: 2.3469 train_time: 8.2m tok/s: 6749611 +4235/20000 train_loss: 2.5126 train_time: 8.2m tok/s: 6749258 +4236/20000 train_loss: 2.5598 train_time: 8.2m tok/s: 6748940 +4237/20000 train_loss: 2.4393 train_time: 8.2m tok/s: 6748637 +4238/20000 train_loss: 2.5933 train_time: 8.2m tok/s: 6748324 +4239/20000 train_loss: 2.4947 train_time: 8.2m tok/s: 6748020 +4240/20000 train_loss: 2.5707 train_time: 8.2m tok/s: 6747704 +4241/20000 train_loss: 2.3574 train_time: 8.2m tok/s: 6747395 +4242/20000 train_loss: 2.4589 train_time: 8.2m tok/s: 6747057 +4243/20000 train_loss: 2.4055 train_time: 8.2m tok/s: 6746752 +4244/20000 train_loss: 2.4566 train_time: 8.2m tok/s: 6746404 +4245/20000 train_loss: 2.4502 train_time: 8.2m tok/s: 6746093 +4246/20000 train_loss: 2.4900 train_time: 8.3m tok/s: 6745761 +4247/20000 train_loss: 2.3756 train_time: 8.3m tok/s: 6745427 +4248/20000 train_loss: 2.4083 train_time: 8.3m tok/s: 6745116 +4249/20000 train_loss: 2.4298 train_time: 8.3m tok/s: 6744794 +4250/20000 train_loss: 2.3856 
train_time: 8.3m tok/s: 6744490 +4251/20000 train_loss: 2.4068 train_time: 8.3m tok/s: 6744193 +4252/20000 train_loss: 2.4820 train_time: 8.3m tok/s: 6743884 +4253/20000 train_loss: 2.5433 train_time: 8.3m tok/s: 6743545 +4254/20000 train_loss: 2.6653 train_time: 8.3m tok/s: 6743216 +4255/20000 train_loss: 2.4511 train_time: 8.3m tok/s: 6742908 +4256/20000 train_loss: 2.5048 train_time: 8.3m tok/s: 6742589 +4257/20000 train_loss: 2.5284 train_time: 8.3m tok/s: 6742281 +4258/20000 train_loss: 2.3311 train_time: 8.3m tok/s: 6741949 +4259/20000 train_loss: 2.4198 train_time: 8.3m tok/s: 6741632 +4260/20000 train_loss: 2.3915 train_time: 8.3m tok/s: 6741316 +4261/20000 train_loss: 2.2948 train_time: 8.3m tok/s: 6741000 +4262/20000 train_loss: 2.3748 train_time: 8.3m tok/s: 6740701 +4263/20000 train_loss: 2.3370 train_time: 8.3m tok/s: 6740383 +4264/20000 train_loss: 2.3772 train_time: 8.3m tok/s: 6740090 +4265/20000 train_loss: 2.4376 train_time: 8.3m tok/s: 6739777 +4266/20000 train_loss: 2.5148 train_time: 8.3m tok/s: 6739453 +4267/20000 train_loss: 2.4057 train_time: 8.3m tok/s: 6739142 +4268/20000 train_loss: 2.4648 train_time: 8.3m tok/s: 6738804 +4269/20000 train_loss: 2.4043 train_time: 8.3m tok/s: 6738490 +4270/20000 train_loss: 2.3187 train_time: 8.3m tok/s: 6738180 +4271/20000 train_loss: 2.4323 train_time: 8.3m tok/s: 6737855 +4272/20000 train_loss: 2.3868 train_time: 8.3m tok/s: 6737528 +4273/20000 train_loss: 2.4063 train_time: 8.3m tok/s: 6737221 +4274/20000 train_loss: 2.4568 train_time: 8.3m tok/s: 6736842 +4275/20000 train_loss: 2.3039 train_time: 8.3m tok/s: 6736528 +4276/20000 train_loss: 2.3880 train_time: 8.3m tok/s: 6736232 +4277/20000 train_loss: 2.3630 train_time: 8.3m tok/s: 6735915 +4278/20000 train_loss: 2.3824 train_time: 8.3m tok/s: 6735624 +4279/20000 train_loss: 2.3721 train_time: 8.3m tok/s: 6735325 +4280/20000 train_loss: 2.5270 train_time: 8.3m tok/s: 6735029 +4281/20000 train_loss: 2.6060 train_time: 8.3m tok/s: 6734726 +4282/20000 
train_loss: 2.4338 train_time: 8.3m tok/s: 6734403 +4283/20000 train_loss: 2.6000 train_time: 8.3m tok/s: 6734094 +4284/20000 train_loss: 2.5154 train_time: 8.3m tok/s: 6733779 +4285/20000 train_loss: 2.3909 train_time: 8.3m tok/s: 6733478 +4286/20000 train_loss: 2.4847 train_time: 8.3m tok/s: 6733148 +4287/20000 train_loss: 2.3741 train_time: 8.3m tok/s: 6732837 +4288/20000 train_loss: 2.3949 train_time: 8.3m tok/s: 6732554 +4289/20000 train_loss: 2.5701 train_time: 8.4m tok/s: 6732167 +4290/20000 train_loss: 2.3677 train_time: 8.4m tok/s: 6731883 +4291/20000 train_loss: 2.3203 train_time: 8.4m tok/s: 6731599 +4292/20000 train_loss: 2.4575 train_time: 8.4m tok/s: 6731274 +4293/20000 train_loss: 2.4277 train_time: 8.4m tok/s: 6730978 +4294/20000 train_loss: 2.3798 train_time: 8.4m tok/s: 6730677 +4295/20000 train_loss: 2.4201 train_time: 8.4m tok/s: 6730358 +4296/20000 train_loss: 2.2157 train_time: 8.4m tok/s: 6730042 +4297/20000 train_loss: 2.4754 train_time: 8.4m tok/s: 6729754 +4298/20000 train_loss: 2.4063 train_time: 8.4m tok/s: 6729439 +4299/20000 train_loss: 2.3419 train_time: 8.4m tok/s: 6729124 +4300/20000 train_loss: 2.5151 train_time: 8.4m tok/s: 6728810 +4301/20000 train_loss: 2.1872 train_time: 8.4m tok/s: 6728471 +4302/20000 train_loss: 2.3389 train_time: 8.4m tok/s: 6728139 +4303/20000 train_loss: 2.3301 train_time: 8.4m tok/s: 6727833 +4304/20000 train_loss: 2.3478 train_time: 8.4m tok/s: 6727552 +4305/20000 train_loss: 2.3325 train_time: 8.4m tok/s: 6727245 +4306/20000 train_loss: 2.3940 train_time: 8.4m tok/s: 6726920 +4307/20000 train_loss: 2.3410 train_time: 8.4m tok/s: 6726635 +4308/20000 train_loss: 2.3731 train_time: 8.4m tok/s: 6726319 +4309/20000 train_loss: 2.5797 train_time: 8.4m tok/s: 6726013 +4310/20000 train_loss: 2.4568 train_time: 8.4m tok/s: 6725706 +4311/20000 train_loss: 2.5394 train_time: 8.4m tok/s: 6725408 +4312/20000 train_loss: 2.4171 train_time: 8.4m tok/s: 6725132 +4313/20000 train_loss: 2.2539 train_time: 8.4m tok/s: 
6724827 +4314/20000 train_loss: 2.4301 train_time: 8.4m tok/s: 6724521 +4315/20000 train_loss: 2.4073 train_time: 8.4m tok/s: 6724217 +4316/20000 train_loss: 2.4199 train_time: 8.4m tok/s: 6723899 +4317/20000 train_loss: 2.4908 train_time: 8.4m tok/s: 6723570 +4318/20000 train_loss: 2.4865 train_time: 8.4m tok/s: 6723241 +4319/20000 train_loss: 2.1779 train_time: 8.4m tok/s: 6722933 +4320/20000 train_loss: 2.2692 train_time: 8.4m tok/s: 6722647 +4321/20000 train_loss: 2.3728 train_time: 8.4m tok/s: 6722340 +4322/20000 train_loss: 2.4214 train_time: 8.4m tok/s: 6722042 +4323/20000 train_loss: 2.4201 train_time: 8.4m tok/s: 6721741 +4324/20000 train_loss: 2.4411 train_time: 8.4m tok/s: 6721433 +4325/20000 train_loss: 2.4638 train_time: 8.4m tok/s: 6721149 +4326/20000 train_loss: 2.3654 train_time: 8.4m tok/s: 6720827 +4327/20000 train_loss: 2.4112 train_time: 8.4m tok/s: 6720528 +4328/20000 train_loss: 2.3850 train_time: 8.4m tok/s: 6720230 +4329/20000 train_loss: 2.4377 train_time: 8.4m tok/s: 6719927 +4330/20000 train_loss: 2.3047 train_time: 8.4m tok/s: 6719601 +4331/20000 train_loss: 2.3908 train_time: 8.4m tok/s: 6719293 +4332/20000 train_loss: 2.9184 train_time: 8.5m tok/s: 6718952 +4333/20000 train_loss: 2.3836 train_time: 8.5m tok/s: 6718658 +4334/20000 train_loss: 2.4439 train_time: 8.5m tok/s: 6718368 +4335/20000 train_loss: 2.5390 train_time: 8.5m tok/s: 6718055 +4336/20000 train_loss: 2.4153 train_time: 8.5m tok/s: 6717763 +4337/20000 train_loss: 2.5100 train_time: 8.5m tok/s: 6717468 +4338/20000 train_loss: 2.5340 train_time: 8.5m tok/s: 6717158 +4339/20000 train_loss: 2.4406 train_time: 8.5m tok/s: 6716843 +4340/20000 train_loss: 2.4258 train_time: 8.5m tok/s: 6716552 +4341/20000 train_loss: 2.4118 train_time: 8.5m tok/s: 6716250 +4342/20000 train_loss: 2.5188 train_time: 8.5m tok/s: 6715961 +4343/20000 train_loss: 2.4892 train_time: 8.5m tok/s: 6715657 +4344/20000 train_loss: 2.4908 train_time: 8.5m tok/s: 6715354 +4345/20000 train_loss: 2.3258 
train_time: 8.5m tok/s: 6715036 +4346/20000 train_loss: 2.3957 train_time: 8.5m tok/s: 6714727 +4347/20000 train_loss: 2.4077 train_time: 8.5m tok/s: 6714450 +4348/20000 train_loss: 2.3980 train_time: 8.5m tok/s: 6714129 +4349/20000 train_loss: 1.8630 train_time: 8.5m tok/s: 6713779 +4350/20000 train_loss: 2.2316 train_time: 8.5m tok/s: 6713482 +4351/20000 train_loss: 2.4199 train_time: 8.5m tok/s: 6713184 +4352/20000 train_loss: 2.3693 train_time: 8.5m tok/s: 6712871 +4353/20000 train_loss: 2.4635 train_time: 8.5m tok/s: 6712586 +4354/20000 train_loss: 2.4780 train_time: 8.5m tok/s: 6712297 +4355/20000 train_loss: 2.4055 train_time: 8.5m tok/s: 6712022 +4356/20000 train_loss: 2.4685 train_time: 8.5m tok/s: 6711721 +4357/20000 train_loss: 2.5099 train_time: 8.5m tok/s: 6711432 +4358/20000 train_loss: 2.4078 train_time: 8.5m tok/s: 6711120 +4359/20000 train_loss: 2.4194 train_time: 8.5m tok/s: 6710820 +4360/20000 train_loss: 2.3209 train_time: 8.5m tok/s: 6710527 +4361/20000 train_loss: 2.2833 train_time: 8.5m tok/s: 6710238 +4362/20000 train_loss: 2.4589 train_time: 8.5m tok/s: 6709949 +4363/20000 train_loss: 2.2172 train_time: 8.5m tok/s: 6709625 +4364/20000 train_loss: 2.3525 train_time: 8.5m tok/s: 6709329 +4365/20000 train_loss: 2.5513 train_time: 8.5m tok/s: 6709030 +4366/20000 train_loss: 2.7729 train_time: 8.5m tok/s: 6708720 +4367/20000 train_loss: 2.5279 train_time: 8.5m tok/s: 6708425 +4368/20000 train_loss: 2.5058 train_time: 8.5m tok/s: 6708146 +4369/20000 train_loss: 2.3113 train_time: 8.5m tok/s: 6707835 +4370/20000 train_loss: 2.4965 train_time: 8.5m tok/s: 6707552 +4371/20000 train_loss: 2.3937 train_time: 8.5m tok/s: 6707260 +4372/20000 train_loss: 2.4795 train_time: 8.5m tok/s: 6706949 +4373/20000 train_loss: 2.2506 train_time: 8.5m tok/s: 6706653 +4374/20000 train_loss: 2.3155 train_time: 8.5m tok/s: 6706341 +4375/20000 train_loss: 2.4081 train_time: 8.6m tok/s: 6706074 +4376/20000 train_loss: 2.3821 train_time: 8.6m tok/s: 6705758 +4377/20000 
train_loss: 2.5571 train_time: 8.6m tok/s: 6705462 +4378/20000 train_loss: 2.2715 train_time: 8.6m tok/s: 6705156 +4379/20000 train_loss: 2.3293 train_time: 8.6m tok/s: 6704869 +4380/20000 train_loss: 2.4374 train_time: 8.6m tok/s: 6704583 +4381/20000 train_loss: 2.5038 train_time: 8.6m tok/s: 6704317 +4382/20000 train_loss: 2.4885 train_time: 8.6m tok/s: 6704025 +4383/20000 train_loss: 2.4362 train_time: 8.6m tok/s: 6703727 +4384/20000 train_loss: 2.4967 train_time: 8.6m tok/s: 6703411 +4385/20000 train_loss: 2.4260 train_time: 8.6m tok/s: 6703121 +4386/20000 train_loss: 2.4018 train_time: 8.6m tok/s: 6702827 +4387/20000 train_loss: 2.4719 train_time: 8.6m tok/s: 6702512 +4388/20000 train_loss: 2.3864 train_time: 8.6m tok/s: 6702224 +4389/20000 train_loss: 2.2663 train_time: 8.6m tok/s: 6701922 +4390/20000 train_loss: 2.4075 train_time: 8.6m tok/s: 6701629 +4391/20000 train_loss: 2.2877 train_time: 8.6m tok/s: 6701335 +4392/20000 train_loss: 2.3958 train_time: 8.6m tok/s: 6701049 +4393/20000 train_loss: 2.3930 train_time: 8.6m tok/s: 6700760 +4394/20000 train_loss: 2.3707 train_time: 8.6m tok/s: 6700453 +4395/20000 train_loss: 2.3411 train_time: 8.6m tok/s: 6700148 +4396/20000 train_loss: 2.5812 train_time: 8.6m tok/s: 6699846 +4397/20000 train_loss: 2.4183 train_time: 8.6m tok/s: 6699560 +4398/20000 train_loss: 2.3458 train_time: 8.6m tok/s: 6699272 +4399/20000 train_loss: 2.3643 train_time: 8.6m tok/s: 6698991 +4400/20000 train_loss: 2.5068 train_time: 8.6m tok/s: 6698690 +4401/20000 train_loss: 2.4340 train_time: 8.6m tok/s: 6698403 +4402/20000 train_loss: 2.3762 train_time: 8.6m tok/s: 6698110 +4403/20000 train_loss: 2.3696 train_time: 8.6m tok/s: 6697815 +4404/20000 train_loss: 2.3183 train_time: 8.6m tok/s: 6697521 +4405/20000 train_loss: 2.5382 train_time: 8.6m tok/s: 6697217 +4406/20000 train_loss: 2.4099 train_time: 8.6m tok/s: 6696929 +4407/20000 train_loss: 2.4551 train_time: 8.6m tok/s: 6696631 +4408/20000 train_loss: 2.4768 train_time: 8.6m tok/s: 
6696356 +4409/20000 train_loss: 2.4232 train_time: 8.6m tok/s: 6696064 +4410/20000 train_loss: 2.4456 train_time: 8.6m tok/s: 6695775 +4411/20000 train_loss: 2.4647 train_time: 8.6m tok/s: 6695503 +4412/20000 train_loss: 2.5264 train_time: 8.6m tok/s: 6695213 +4413/20000 train_loss: 2.3452 train_time: 8.6m tok/s: 6694915 +4414/20000 train_loss: 2.3544 train_time: 8.6m tok/s: 6694618 +4415/20000 train_loss: 2.5621 train_time: 8.6m tok/s: 6694294 +4416/20000 train_loss: 2.3834 train_time: 8.6m tok/s: 6694011 +4417/20000 train_loss: 2.2985 train_time: 8.6m tok/s: 6693736 +4418/20000 train_loss: 2.3731 train_time: 8.7m tok/s: 6693416 +4419/20000 train_loss: 2.3707 train_time: 8.7m tok/s: 6693125 +4420/20000 train_loss: 2.2848 train_time: 8.7m tok/s: 6692856 +4421/20000 train_loss: 2.2557 train_time: 8.7m tok/s: 6692550 +4422/20000 train_loss: 2.3869 train_time: 8.7m tok/s: 6692263 +4423/20000 train_loss: 2.5359 train_time: 8.7m tok/s: 6691957 +4424/20000 train_loss: 2.4697 train_time: 8.7m tok/s: 6691698 +4425/20000 train_loss: 2.5028 train_time: 8.7m tok/s: 6691398 +4426/20000 train_loss: 2.3080 train_time: 8.7m tok/s: 6691124 +4427/20000 train_loss: 2.3345 train_time: 8.7m tok/s: 6690837 +4428/20000 train_loss: 2.4062 train_time: 8.7m tok/s: 6690548 +4429/20000 train_loss: 2.4093 train_time: 8.7m tok/s: 6690249 +4430/20000 train_loss: 2.5657 train_time: 8.7m tok/s: 6689936 +4431/20000 train_loss: 2.4655 train_time: 8.7m tok/s: 6689616 +4432/20000 train_loss: 2.3598 train_time: 8.7m tok/s: 6689327 +4433/20000 train_loss: 2.2747 train_time: 8.7m tok/s: 6689046 +4434/20000 train_loss: 2.5746 train_time: 8.7m tok/s: 6688735 +4435/20000 train_loss: 2.4354 train_time: 8.7m tok/s: 6688465 +4436/20000 train_loss: 2.4063 train_time: 8.7m tok/s: 6688172 +4437/20000 train_loss: 2.2345 train_time: 8.7m tok/s: 6687883 +4438/20000 train_loss: 2.5254 train_time: 8.7m tok/s: 6687608 +4439/20000 train_loss: 2.4846 train_time: 8.7m tok/s: 6687301 +4440/20000 train_loss: 2.4771 
train_time: 8.7m tok/s: 6687042 +4441/20000 train_loss: 2.3540 train_time: 8.7m tok/s: 6686768 +4442/20000 train_loss: 2.4533 train_time: 8.7m tok/s: 6686480 +4443/20000 train_loss: 2.5233 train_time: 8.7m tok/s: 6686222 +4444/20000 train_loss: 2.3749 train_time: 8.7m tok/s: 6685905 +4445/20000 train_loss: 2.3885 train_time: 8.7m tok/s: 6685629 +4446/20000 train_loss: 2.3158 train_time: 8.7m tok/s: 6685360 +4447/20000 train_loss: 2.3641 train_time: 8.7m tok/s: 6685070 +4448/20000 train_loss: 2.3101 train_time: 8.7m tok/s: 6684775 +4449/20000 train_loss: 2.5113 train_time: 8.7m tok/s: 6684489 +4450/20000 train_loss: 2.3496 train_time: 8.7m tok/s: 6684219 +4451/20000 train_loss: 2.4483 train_time: 8.7m tok/s: 6683941 +4452/20000 train_loss: 2.3737 train_time: 8.7m tok/s: 6683665 +4453/20000 train_loss: 2.4793 train_time: 8.7m tok/s: 6683373 +4454/20000 train_loss: 2.3342 train_time: 8.7m tok/s: 6683079 +4455/20000 train_loss: 2.4071 train_time: 8.7m tok/s: 6682800 +4456/20000 train_loss: 2.4193 train_time: 8.7m tok/s: 6682521 +4457/20000 train_loss: 2.6128 train_time: 8.7m tok/s: 6682250 +4458/20000 train_loss: 2.3908 train_time: 8.7m tok/s: 6681968 +4459/20000 train_loss: 2.2680 train_time: 8.7m tok/s: 6681666 +4460/20000 train_loss: 2.3231 train_time: 8.7m tok/s: 6681361 +4461/20000 train_loss: 2.3513 train_time: 8.8m tok/s: 6681089 +4462/20000 train_loss: 2.4796 train_time: 8.8m tok/s: 6680807 +4463/20000 train_loss: 2.2415 train_time: 8.8m tok/s: 6680526 +4464/20000 train_loss: 2.4919 train_time: 8.8m tok/s: 6680256 +4465/20000 train_loss: 2.4584 train_time: 8.8m tok/s: 6679959 +4466/20000 train_loss: 2.5148 train_time: 8.8m tok/s: 6679689 +4467/20000 train_loss: 2.4569 train_time: 8.8m tok/s: 6679421 +4468/20000 train_loss: 2.2138 train_time: 8.8m tok/s: 6679097 +4469/20000 train_loss: 2.3598 train_time: 8.8m tok/s: 6678778 +4470/20000 train_loss: 2.3528 train_time: 8.8m tok/s: 6678501 +4471/20000 train_loss: 2.4346 train_time: 8.8m tok/s: 6678233 +4472/20000 
train_loss: 2.4024 train_time: 8.8m tok/s: 6677954 +4473/20000 train_loss: 2.5180 train_time: 8.8m tok/s: 6677684 +4474/20000 train_loss: 2.3633 train_time: 8.8m tok/s: 6677376 +4475/20000 train_loss: 2.4749 train_time: 8.8m tok/s: 6677097 +4476/20000 train_loss: 2.4247 train_time: 8.8m tok/s: 6676831 +4477/20000 train_loss: 2.3418 train_time: 8.8m tok/s: 6676555 +4478/20000 train_loss: 2.3649 train_time: 8.8m tok/s: 6676289 +4479/20000 train_loss: 2.4315 train_time: 8.8m tok/s: 6676004 +4480/20000 train_loss: 2.3858 train_time: 8.8m tok/s: 6675731 +4481/20000 train_loss: 2.4196 train_time: 8.8m tok/s: 6675474 +4482/20000 train_loss: 2.3784 train_time: 8.8m tok/s: 6675186 +4483/20000 train_loss: 2.4186 train_time: 8.8m tok/s: 6674913 +4484/20000 train_loss: 2.5020 train_time: 8.8m tok/s: 6674611 +4485/20000 train_loss: 2.3376 train_time: 8.8m tok/s: 6674339 +4486/20000 train_loss: 2.3928 train_time: 8.8m tok/s: 6674060 +4487/20000 train_loss: 2.3820 train_time: 8.8m tok/s: 6673783 +4488/20000 train_loss: 2.3775 train_time: 8.8m tok/s: 6673480 +4489/20000 train_loss: 2.2819 train_time: 8.8m tok/s: 6673198 +4490/20000 train_loss: 2.4630 train_time: 8.8m tok/s: 6672918 +4491/20000 train_loss: 2.3961 train_time: 8.8m tok/s: 6672639 +4492/20000 train_loss: 2.4524 train_time: 8.8m tok/s: 6672376 +4493/20000 train_loss: 2.4646 train_time: 8.8m tok/s: 6672098 +4494/20000 train_loss: 2.4927 train_time: 8.8m tok/s: 6671818 +4495/20000 train_loss: 2.4502 train_time: 8.8m tok/s: 6671553 +4496/20000 train_loss: 2.3313 train_time: 8.8m tok/s: 6671268 +4497/20000 train_loss: 2.4382 train_time: 8.8m tok/s: 6670978 +4498/20000 train_loss: 2.3732 train_time: 8.8m tok/s: 6670694 +4499/20000 train_loss: 2.3653 train_time: 8.8m tok/s: 6670421 +4500/20000 train_loss: 2.3950 train_time: 8.8m tok/s: 6670160 +4501/20000 train_loss: 2.0644 train_time: 8.8m tok/s: 6669833 +4502/20000 train_loss: 2.3517 train_time: 8.8m tok/s: 6669556 +4503/20000 train_loss: 2.2344 train_time: 8.8m tok/s: 
6669282 +4504/20000 train_loss: 2.3176 train_time: 8.9m tok/s: 6669005 +4505/20000 train_loss: 2.3202 train_time: 8.9m tok/s: 6668722 +4506/20000 train_loss: 2.2690 train_time: 8.9m tok/s: 6668444 +4507/20000 train_loss: 2.3560 train_time: 8.9m tok/s: 6668186 +4508/20000 train_loss: 2.4659 train_time: 8.9m tok/s: 6667922 +4509/20000 train_loss: 2.4248 train_time: 8.9m tok/s: 6667638 +4510/20000 train_loss: 2.4551 train_time: 8.9m tok/s: 6667370 +4511/20000 train_loss: 2.2570 train_time: 8.9m tok/s: 6667103 +4512/20000 train_loss: 2.3594 train_time: 8.9m tok/s: 6666833 +4513/20000 train_loss: 2.3782 train_time: 8.9m tok/s: 6666540 +4514/20000 train_loss: 2.3909 train_time: 8.9m tok/s: 6666258 +4515/20000 train_loss: 2.3164 train_time: 8.9m tok/s: 6665984 +4516/20000 train_loss: 2.5687 train_time: 8.9m tok/s: 6665703 +4517/20000 train_loss: 2.6622 train_time: 8.9m tok/s: 6665430 +4518/20000 train_loss: 2.4496 train_time: 8.9m tok/s: 6665172 +4519/20000 train_loss: 2.3671 train_time: 8.9m tok/s: 6664890 +4520/20000 train_loss: 2.3096 train_time: 8.9m tok/s: 6664612 +4521/20000 train_loss: 2.4264 train_time: 8.9m tok/s: 6664342 +4522/20000 train_loss: 2.3487 train_time: 8.9m tok/s: 6664080 +4523/20000 train_loss: 2.4198 train_time: 8.9m tok/s: 6663809 +4524/20000 train_loss: 2.3123 train_time: 8.9m tok/s: 6663535 +4525/20000 train_loss: 2.4865 train_time: 8.9m tok/s: 6663269 +4526/20000 train_loss: 2.4302 train_time: 8.9m tok/s: 6662994 +4527/20000 train_loss: 2.4189 train_time: 8.9m tok/s: 6662727 +4528/20000 train_loss: 2.3302 train_time: 8.9m tok/s: 6662438 +4529/20000 train_loss: 2.3827 train_time: 8.9m tok/s: 6662156 +4530/20000 train_loss: 2.2939 train_time: 8.9m tok/s: 6661898 +4531/20000 train_loss: 2.5299 train_time: 8.9m tok/s: 6661618 +4532/20000 train_loss: 2.4027 train_time: 8.9m tok/s: 6661345 +4533/20000 train_loss: 2.2834 train_time: 8.9m tok/s: 6661084 +4534/20000 train_loss: 2.4274 train_time: 8.9m tok/s: 6660822 +4535/20000 train_loss: 2.4919 
train_time: 8.9m tok/s: 6660547 +4536/20000 train_loss: 2.2871 train_time: 8.9m tok/s: 6660264 +4537/20000 train_loss: 2.3841 train_time: 8.9m tok/s: 6659991 +4538/20000 train_loss: 2.1628 train_time: 8.9m tok/s: 6659707 +4539/20000 train_loss: 2.4336 train_time: 8.9m tok/s: 6659429 +4540/20000 train_loss: 2.3688 train_time: 8.9m tok/s: 6659156 +4541/20000 train_loss: 2.2972 train_time: 8.9m tok/s: 6658880 +4542/20000 train_loss: 2.3931 train_time: 8.9m tok/s: 6658622 +4543/20000 train_loss: 2.2963 train_time: 8.9m tok/s: 6658342 +4544/20000 train_loss: 2.6369 train_time: 8.9m tok/s: 6658053 +4545/20000 train_loss: 2.3818 train_time: 8.9m tok/s: 6657795 +4546/20000 train_loss: 2.3842 train_time: 9.0m tok/s: 6657535 +4547/20000 train_loss: 2.3422 train_time: 9.0m tok/s: 6657263 +4548/20000 train_loss: 2.2307 train_time: 9.0m tok/s: 6656991 +4549/20000 train_loss: 2.4269 train_time: 9.0m tok/s: 6656723 +4550/20000 train_loss: 2.4537 train_time: 9.0m tok/s: 6656454 +4551/20000 train_loss: 2.3520 train_time: 9.0m tok/s: 6656198 +4552/20000 train_loss: 2.3447 train_time: 9.0m tok/s: 6655925 +4553/20000 train_loss: 2.2882 train_time: 9.0m tok/s: 6655663 +4554/20000 train_loss: 2.4526 train_time: 9.0m tok/s: 6655368 +4555/20000 train_loss: 2.4530 train_time: 9.0m tok/s: 6655062 +4556/20000 train_loss: 2.4152 train_time: 9.0m tok/s: 6654802 +4557/20000 train_loss: 2.4609 train_time: 9.0m tok/s: 6654558 +4558/20000 train_loss: 2.5769 train_time: 9.0m tok/s: 6654293 +4559/20000 train_loss: 2.4101 train_time: 9.0m tok/s: 6654022 +4560/20000 train_loss: 2.4288 train_time: 9.0m tok/s: 6653761 +4561/20000 train_loss: 2.4439 train_time: 9.0m tok/s: 6653503 +4562/20000 train_loss: 2.3877 train_time: 9.0m tok/s: 6653233 +4563/20000 train_loss: 2.5011 train_time: 9.0m tok/s: 6652959 +4564/20000 train_loss: 2.3212 train_time: 9.0m tok/s: 6652675 +4565/20000 train_loss: 2.3975 train_time: 9.0m tok/s: 6652418 +4566/20000 train_loss: 2.3637 train_time: 9.0m tok/s: 6652149 +4567/20000 
train_loss: 2.2315 train_time: 9.0m tok/s: 6651888 +4568/20000 train_loss: 2.2608 train_time: 9.0m tok/s: 6651597 +4569/20000 train_loss: 2.4764 train_time: 9.0m tok/s: 6651327 +4570/20000 train_loss: 2.3482 train_time: 9.0m tok/s: 6651079 +4571/20000 train_loss: 2.4160 train_time: 9.0m tok/s: 6650818 +4572/20000 train_loss: 2.3996 train_time: 9.0m tok/s: 6650566 +4573/20000 train_loss: 2.3932 train_time: 9.0m tok/s: 6650303 +4574/20000 train_loss: 2.4027 train_time: 9.0m tok/s: 6650019 +4575/20000 train_loss: 2.2857 train_time: 9.0m tok/s: 6649716 +4576/20000 train_loss: 2.4186 train_time: 9.0m tok/s: 6649444 +4577/20000 train_loss: 2.4146 train_time: 9.0m tok/s: 6649169 +4578/20000 train_loss: 2.3690 train_time: 9.0m tok/s: 6648910 +4579/20000 train_loss: 2.4699 train_time: 9.0m tok/s: 6648630 +4580/20000 train_loss: 2.2528 train_time: 9.0m tok/s: 6648364 +4581/20000 train_loss: 1.9318 train_time: 9.0m tok/s: 6648067 +4582/20000 train_loss: 2.3707 train_time: 9.0m tok/s: 6647794 +4583/20000 train_loss: 2.4074 train_time: 9.0m tok/s: 6647576 +4584/20000 train_loss: 2.4615 train_time: 9.0m tok/s: 6647322 +4585/20000 train_loss: 2.3161 train_time: 9.0m tok/s: 6647050 +4586/20000 train_loss: 2.3606 train_time: 9.0m tok/s: 6646807 +4587/20000 train_loss: 2.5063 train_time: 9.0m tok/s: 6646553 +4588/20000 train_loss: 2.3980 train_time: 9.0m tok/s: 6646287 +4589/20000 train_loss: 2.4595 train_time: 9.1m tok/s: 6646037 +4590/20000 train_loss: 2.3099 train_time: 9.1m tok/s: 6645773 +4591/20000 train_loss: 2.3671 train_time: 9.1m tok/s: 6645524 +4592/20000 train_loss: 2.2558 train_time: 9.1m tok/s: 6645274 +4593/20000 train_loss: 2.4547 train_time: 9.1m tok/s: 6645019 +4594/20000 train_loss: 2.3219 train_time: 9.1m tok/s: 6644754 +4595/20000 train_loss: 2.1884 train_time: 9.1m tok/s: 6644426 +4596/20000 train_loss: 2.4317 train_time: 9.1m tok/s: 6644154 +4597/20000 train_loss: 2.4143 train_time: 9.1m tok/s: 6643872 +4598/20000 train_loss: 2.4631 train_time: 9.1m tok/s: 
6643598 +4599/20000 train_loss: 2.3784 train_time: 9.1m tok/s: 6643350 +4600/20000 train_loss: 2.5586 train_time: 9.1m tok/s: 6643090 +4601/20000 train_loss: 2.4844 train_time: 9.1m tok/s: 6642822 +4602/20000 train_loss: 2.5096 train_time: 9.1m tok/s: 6642560 +4603/20000 train_loss: 2.3824 train_time: 9.1m tok/s: 6642293 +4604/20000 train_loss: 2.3457 train_time: 9.1m tok/s: 6642026 +4605/20000 train_loss: 2.3507 train_time: 9.1m tok/s: 6641752 +4606/20000 train_loss: 2.3465 train_time: 9.1m tok/s: 6641486 +4607/20000 train_loss: 2.3530 train_time: 9.1m tok/s: 6641243 +4608/20000 train_loss: 2.3721 train_time: 9.1m tok/s: 6640982 +4609/20000 train_loss: 2.3216 train_time: 9.1m tok/s: 6640717 +4610/20000 train_loss: 2.4233 train_time: 9.1m tok/s: 6640454 +4611/20000 train_loss: 2.5809 train_time: 9.1m tok/s: 6640171 +4612/20000 train_loss: 2.4514 train_time: 9.1m tok/s: 6639926 +4613/20000 train_loss: 2.5024 train_time: 9.1m tok/s: 6639658 +4614/20000 train_loss: 2.4580 train_time: 9.1m tok/s: 6639402 +4615/20000 train_loss: 2.3091 train_time: 9.1m tok/s: 6639144 +4616/20000 train_loss: 2.2770 train_time: 9.1m tok/s: 6638886 +4617/20000 train_loss: 2.3153 train_time: 9.1m tok/s: 6638628 +4618/20000 train_loss: 2.3728 train_time: 9.1m tok/s: 6638386 +4619/20000 train_loss: 2.2854 train_time: 9.1m tok/s: 6638136 +4620/20000 train_loss: 2.3506 train_time: 9.1m tok/s: 6637859 +4621/20000 train_loss: 2.3922 train_time: 9.1m tok/s: 6637587 +4622/20000 train_loss: 2.4159 train_time: 9.1m tok/s: 6637314 +4623/20000 train_loss: 2.3734 train_time: 9.1m tok/s: 6637061 +4624/20000 train_loss: 2.5494 train_time: 9.1m tok/s: 6636795 +4625/20000 train_loss: 2.5872 train_time: 9.1m tok/s: 6636531 +4626/20000 train_loss: 2.5001 train_time: 9.1m tok/s: 6636264 +4627/20000 train_loss: 2.3993 train_time: 9.1m tok/s: 6636020 +4628/20000 train_loss: 2.4512 train_time: 9.1m tok/s: 6635752 +4629/20000 train_loss: 2.3709 train_time: 9.1m tok/s: 6635494 +4630/20000 train_loss: 2.1550 
train_time: 9.1m tok/s: 6635235 +4631/20000 train_loss: 2.3894 train_time: 9.1m tok/s: 6634993 +4632/20000 train_loss: 2.3094 train_time: 9.2m tok/s: 6634725 +4633/20000 train_loss: 2.2341 train_time: 9.2m tok/s: 6634475 +4634/20000 train_loss: 2.3453 train_time: 9.2m tok/s: 6634217 +4635/20000 train_loss: 2.4732 train_time: 9.2m tok/s: 6633954 +4636/20000 train_loss: 2.3893 train_time: 9.2m tok/s: 6633692 +4637/20000 train_loss: 2.4637 train_time: 9.2m tok/s: 6633444 +4638/20000 train_loss: 2.4544 train_time: 9.2m tok/s: 6633201 +4639/20000 train_loss: 2.3905 train_time: 9.2m tok/s: 6632947 +4640/20000 train_loss: 2.4066 train_time: 9.2m tok/s: 6632697 +4641/20000 train_loss: 2.3503 train_time: 9.2m tok/s: 6632445 +4642/20000 train_loss: 2.3108 train_time: 9.2m tok/s: 6632200 +4643/20000 train_loss: 2.4191 train_time: 9.2m tok/s: 6631955 +4644/20000 train_loss: 2.2541 train_time: 9.2m tok/s: 6631700 +4645/20000 train_loss: 2.3491 train_time: 9.2m tok/s: 6631444 +4646/20000 train_loss: 2.2912 train_time: 9.2m tok/s: 6631195 +4647/20000 train_loss: 2.2010 train_time: 9.2m tok/s: 6630945 +4648/20000 train_loss: 2.3714 train_time: 9.2m tok/s: 6630707 +4649/20000 train_loss: 2.3939 train_time: 9.2m tok/s: 6630464 +4650/20000 train_loss: 2.4119 train_time: 9.2m tok/s: 6630228 +4651/20000 train_loss: 2.4866 train_time: 9.2m tok/s: 6629999 +4652/20000 train_loss: 2.4629 train_time: 9.2m tok/s: 6629747 +4653/20000 train_loss: 2.4359 train_time: 9.2m tok/s: 6629497 +4654/20000 train_loss: 2.3003 train_time: 9.2m tok/s: 6629248 +4655/20000 train_loss: 2.2771 train_time: 9.2m tok/s: 6628995 +4656/20000 train_loss: 2.4664 train_time: 9.2m tok/s: 6628764 +4657/20000 train_loss: 2.3533 train_time: 9.2m tok/s: 6628495 +4658/20000 train_loss: 2.2944 train_time: 9.2m tok/s: 6628252 +4659/20000 train_loss: 2.3303 train_time: 9.2m tok/s: 6628016 +4660/20000 train_loss: 2.3425 train_time: 9.2m tok/s: 6627756 +4661/20000 train_loss: 2.0352 train_time: 9.2m tok/s: 6627447 +4662/20000 
train_loss: 2.3568 train_time: 9.2m tok/s: 6627194 +4663/20000 train_loss: 2.3865 train_time: 9.2m tok/s: 6626965 +4664/20000 train_loss: 2.2944 train_time: 9.2m tok/s: 6626709 +4665/20000 train_loss: 2.4455 train_time: 9.2m tok/s: 6626454 +4666/20000 train_loss: 2.3456 train_time: 9.2m tok/s: 6626193 +4667/20000 train_loss: 2.4316 train_time: 9.2m tok/s: 6625938 +4668/20000 train_loss: 2.4001 train_time: 9.2m tok/s: 6625699 +4669/20000 train_loss: 2.6147 train_time: 9.2m tok/s: 6625444 +4670/20000 train_loss: 2.2820 train_time: 9.2m tok/s: 6625197 +4671/20000 train_loss: 2.4562 train_time: 9.2m tok/s: 6624952 +4672/20000 train_loss: 2.3529 train_time: 9.2m tok/s: 6624686 +4673/20000 train_loss: 2.3304 train_time: 9.2m tok/s: 6624426 +4674/20000 train_loss: 2.4379 train_time: 9.2m tok/s: 6624166 +4675/20000 train_loss: 2.3725 train_time: 9.3m tok/s: 6623915 +4676/20000 train_loss: 2.3621 train_time: 9.3m tok/s: 6623670 +4677/20000 train_loss: 2.4199 train_time: 9.3m tok/s: 6623416 +4678/20000 train_loss: 2.3886 train_time: 9.3m tok/s: 6623166 +4679/20000 train_loss: 2.3009 train_time: 9.3m tok/s: 6622917 +4680/20000 train_loss: 2.3070 train_time: 9.3m tok/s: 6622670 +4681/20000 train_loss: 2.3792 train_time: 9.3m tok/s: 6622414 +4682/20000 train_loss: 2.2922 train_time: 9.3m tok/s: 6622153 +4683/20000 train_loss: 2.3632 train_time: 9.3m tok/s: 6621890 +4684/20000 train_loss: 2.3499 train_time: 9.3m tok/s: 6621638 +4685/20000 train_loss: 2.3248 train_time: 9.3m tok/s: 6621368 +4686/20000 train_loss: 2.2573 train_time: 9.3m tok/s: 6621106 +4687/20000 train_loss: 2.3874 train_time: 9.3m tok/s: 6620861 +4688/20000 train_loss: 2.4494 train_time: 9.3m tok/s: 6620604 +4689/20000 train_loss: 2.4416 train_time: 9.3m tok/s: 6620353 +4690/20000 train_loss: 2.4997 train_time: 9.3m tok/s: 6620119 +4691/20000 train_loss: 2.4036 train_time: 9.3m tok/s: 6619868 +4692/20000 train_loss: 2.4231 train_time: 9.3m tok/s: 6619620 +4693/20000 train_loss: 2.3706 train_time: 9.3m tok/s: 
6619361 +4694/20000 train_loss: 2.4534 train_time: 9.3m tok/s: 6619109 +4695/20000 train_loss: 2.4427 train_time: 9.3m tok/s: 6618859 +4696/20000 train_loss: 2.3429 train_time: 9.3m tok/s: 6618617 +4697/20000 train_loss: 2.3422 train_time: 9.3m tok/s: 6618387 +4698/20000 train_loss: 2.3717 train_time: 9.3m tok/s: 6618134 +4699/20000 train_loss: 2.3659 train_time: 9.3m tok/s: 6617887 +4700/20000 train_loss: 2.1605 train_time: 9.3m tok/s: 6617640 +4701/20000 train_loss: 2.4151 train_time: 9.3m tok/s: 6617393 +4702/20000 train_loss: 2.5048 train_time: 9.3m tok/s: 6617108 +4703/20000 train_loss: 2.4500 train_time: 9.3m tok/s: 6616873 +4704/20000 train_loss: 2.3683 train_time: 9.3m tok/s: 6616622 +4705/20000 train_loss: 2.3354 train_time: 9.3m tok/s: 6616393 +4706/20000 train_loss: 2.3323 train_time: 9.3m tok/s: 6616141 +4707/20000 train_loss: 2.4488 train_time: 9.3m tok/s: 6615885 +4708/20000 train_loss: 2.3587 train_time: 9.3m tok/s: 6615607 +4709/20000 train_loss: 2.3520 train_time: 9.3m tok/s: 6615352 +4710/20000 train_loss: 2.3980 train_time: 9.3m tok/s: 6615113 +4711/20000 train_loss: 2.3642 train_time: 9.3m tok/s: 6614860 +4712/20000 train_loss: 2.3221 train_time: 9.3m tok/s: 6614607 +4713/20000 train_loss: 2.4418 train_time: 9.3m tok/s: 6614371 +4714/20000 train_loss: 2.3435 train_time: 9.3m tok/s: 6614099 +4715/20000 train_loss: 2.4675 train_time: 9.3m tok/s: 6613846 +4716/20000 train_loss: 2.4935 train_time: 9.3m tok/s: 6613603 +4717/20000 train_loss: 2.3264 train_time: 9.3m tok/s: 6613356 +4718/20000 train_loss: 2.3116 train_time: 9.4m tok/s: 6613090 +4719/20000 train_loss: 2.3550 train_time: 9.4m tok/s: 6612854 +4720/20000 train_loss: 2.3015 train_time: 9.4m tok/s: 6612600 +4721/20000 train_loss: 2.2834 train_time: 9.4m tok/s: 6612349 +4722/20000 train_loss: 2.4777 train_time: 9.4m tok/s: 6612116 +4723/20000 train_loss: 2.3203 train_time: 9.4m tok/s: 6611874 +4724/20000 train_loss: 2.3666 train_time: 9.4m tok/s: 6611632 +4725/20000 train_loss: 2.2594 
train_time: 9.4m tok/s: 6611373 +4726/20000 train_loss: 2.4021 train_time: 9.4m tok/s: 6611131 +4727/20000 train_loss: 2.3799 train_time: 9.4m tok/s: 6610890 +4728/20000 train_loss: 2.3647 train_time: 9.4m tok/s: 6610640 +4729/20000 train_loss: 2.3689 train_time: 9.4m tok/s: 6610384 +4730/20000 train_loss: 2.2038 train_time: 9.4m tok/s: 6610136 +4731/20000 train_loss: 2.3653 train_time: 9.4m tok/s: 6609891 +4732/20000 train_loss: 2.3175 train_time: 9.4m tok/s: 6609652 +4733/20000 train_loss: 2.4170 train_time: 9.4m tok/s: 6609405 +4734/20000 train_loss: 2.3977 train_time: 9.4m tok/s: 6609164 +4735/20000 train_loss: 2.1970 train_time: 9.4m tok/s: 6608888 +4736/20000 train_loss: 2.3385 train_time: 9.4m tok/s: 6608642 +4737/20000 train_loss: 2.4767 train_time: 9.4m tok/s: 6608404 +4738/20000 train_loss: 2.3494 train_time: 9.4m tok/s: 6608165 +4739/20000 train_loss: 2.4865 train_time: 9.4m tok/s: 6607873 +4740/20000 train_loss: 2.4580 train_time: 9.4m tok/s: 6607617 +4741/20000 train_loss: 2.3785 train_time: 9.4m tok/s: 6607380 +4742/20000 train_loss: 2.3383 train_time: 9.4m tok/s: 6607140 +4743/20000 train_loss: 2.2986 train_time: 9.4m tok/s: 6606886 +4744/20000 train_loss: 2.3717 train_time: 9.4m tok/s: 6606648 +4745/20000 train_loss: 2.2579 train_time: 9.4m tok/s: 6606409 +4746/20000 train_loss: 2.2244 train_time: 9.4m tok/s: 6606170 +4747/20000 train_loss: 2.4595 train_time: 9.4m tok/s: 6605945 +4748/20000 train_loss: 2.4019 train_time: 9.4m tok/s: 6605674 +4749/20000 train_loss: 2.3708 train_time: 9.4m tok/s: 6605426 +4750/20000 train_loss: 2.4064 train_time: 9.4m tok/s: 6605193 +4751/20000 train_loss: 2.3283 train_time: 9.4m tok/s: 6604958 +4752/20000 train_loss: 2.3816 train_time: 9.4m tok/s: 6604736 +4753/20000 train_loss: 2.2606 train_time: 9.4m tok/s: 6604478 +4754/20000 train_loss: 2.2257 train_time: 9.4m tok/s: 6604233 +4755/20000 train_loss: 2.4110 train_time: 9.4m tok/s: 6604003 +4756/20000 train_loss: 2.3799 train_time: 9.4m tok/s: 6603762 +4757/20000 
train_loss: 2.3298 train_time: 9.4m tok/s: 6603499 +4758/20000 train_loss: 2.5531 train_time: 9.4m tok/s: 6603254 +4759/20000 train_loss: 2.3864 train_time: 9.4m tok/s: 6603018 +4760/20000 train_loss: 2.2650 train_time: 9.4m tok/s: 6602794 +4761/20000 train_loss: 2.4430 train_time: 9.5m tok/s: 6602547 +4762/20000 train_loss: 2.3488 train_time: 9.5m tok/s: 6602297 +4763/20000 train_loss: 2.4830 train_time: 9.5m tok/s: 6602053 +4764/20000 train_loss: 2.4408 train_time: 9.5m tok/s: 6601803 +4765/20000 train_loss: 2.4300 train_time: 9.5m tok/s: 6601563 +4766/20000 train_loss: 2.3526 train_time: 9.5m tok/s: 6601326 +4767/20000 train_loss: 2.3114 train_time: 9.5m tok/s: 6601082 +4768/20000 train_loss: 2.3788 train_time: 9.5m tok/s: 6600833 +4769/20000 train_loss: 2.3524 train_time: 9.5m tok/s: 6600590 +4770/20000 train_loss: 2.4241 train_time: 9.5m tok/s: 6600363 +4771/20000 train_loss: 2.3510 train_time: 9.5m tok/s: 6600135 +4772/20000 train_loss: 2.4267 train_time: 9.5m tok/s: 6599876 +4773/20000 train_loss: 2.4416 train_time: 9.5m tok/s: 6599626 +4774/20000 train_loss: 2.5463 train_time: 9.5m tok/s: 6599376 +4775/20000 train_loss: 2.4286 train_time: 9.5m tok/s: 6599134 +4776/20000 train_loss: 2.5284 train_time: 9.5m tok/s: 6598914 +4777/20000 train_loss: 2.3986 train_time: 9.5m tok/s: 6598677 +4778/20000 train_loss: 2.3653 train_time: 9.5m tok/s: 6598438 +4779/20000 train_loss: 2.1994 train_time: 9.5m tok/s: 6598192 +4780/20000 train_loss: 2.3140 train_time: 9.5m tok/s: 6597948 +4781/20000 train_loss: 2.3111 train_time: 9.5m tok/s: 6597711 +4782/20000 train_loss: 2.7084 train_time: 9.5m tok/s: 6597447 +4783/20000 train_loss: 2.4233 train_time: 9.5m tok/s: 6597193 +4784/20000 train_loss: 2.3772 train_time: 9.5m tok/s: 6596968 +4785/20000 train_loss: 2.4244 train_time: 9.5m tok/s: 6596731 +4786/20000 train_loss: 2.4241 train_time: 9.5m tok/s: 6596508 +4787/20000 train_loss: 2.3613 train_time: 9.5m tok/s: 6596258 +4788/20000 train_loss: 2.3749 train_time: 9.5m tok/s: 
6596024 +4789/20000 train_loss: 2.1519 train_time: 9.5m tok/s: 6595782 +4790/20000 train_loss: 2.4103 train_time: 9.5m tok/s: 6595541 +4791/20000 train_loss: 2.3498 train_time: 9.5m tok/s: 6595297 +4792/20000 train_loss: 2.2465 train_time: 9.5m tok/s: 6595041 +4793/20000 train_loss: 2.3518 train_time: 9.5m tok/s: 6594809 +4794/20000 train_loss: 2.2254 train_time: 9.5m tok/s: 6594584 +4795/20000 train_loss: 2.3972 train_time: 9.5m tok/s: 6594345 +4796/20000 train_loss: 2.2546 train_time: 9.5m tok/s: 6594116 +4797/20000 train_loss: 2.4077 train_time: 9.5m tok/s: 6593867 +4798/20000 train_loss: 2.5517 train_time: 9.5m tok/s: 6593616 +4799/20000 train_loss: 2.4754 train_time: 9.5m tok/s: 6593376 +4800/20000 train_loss: 2.4456 train_time: 9.5m tok/s: 6593163 +4801/20000 train_loss: 2.4197 train_time: 9.5m tok/s: 6592913 +4802/20000 train_loss: 2.3262 train_time: 9.5m tok/s: 6592671 +4803/20000 train_loss: 2.5160 train_time: 9.5m tok/s: 6592429 +4804/20000 train_loss: 1.9498 train_time: 9.6m tok/s: 6592133 +4805/20000 train_loss: 2.3894 train_time: 9.6m tok/s: 6591900 +4806/20000 train_loss: 2.3754 train_time: 9.6m tok/s: 6591677 +4807/20000 train_loss: 2.3658 train_time: 9.6m tok/s: 6591457 +4808/20000 train_loss: 2.2983 train_time: 9.6m tok/s: 6591240 +4809/20000 train_loss: 2.4076 train_time: 9.6m tok/s: 6591001 +4810/20000 train_loss: 2.4581 train_time: 9.6m tok/s: 6590773 +4811/20000 train_loss: 2.4040 train_time: 9.6m tok/s: 6590536 +4812/20000 train_loss: 2.3559 train_time: 9.6m tok/s: 6590311 +4813/20000 train_loss: 2.3541 train_time: 9.6m tok/s: 6590073 +4814/20000 train_loss: 2.3842 train_time: 9.6m tok/s: 6589857 +4815/20000 train_loss: 2.4448 train_time: 9.6m tok/s: 6589631 +4816/20000 train_loss: 2.3632 train_time: 9.6m tok/s: 6589397 +4817/20000 train_loss: 2.4957 train_time: 9.6m tok/s: 6589159 +4818/20000 train_loss: 2.2857 train_time: 9.6m tok/s: 6588925 +4819/20000 train_loss: 2.4361 train_time: 9.6m tok/s: 6588692 +4820/20000 train_loss: 2.3091 
train_time: 9.6m tok/s: 6588438 +4821/20000 train_loss: 2.3911 train_time: 9.6m tok/s: 6588182 +4822/20000 train_loss: 2.4185 train_time: 9.6m tok/s: 6587940 +4823/20000 train_loss: 2.3737 train_time: 9.6m tok/s: 6587689 +4824/20000 train_loss: 2.4813 train_time: 9.6m tok/s: 6587441 +4825/20000 train_loss: 2.5626 train_time: 9.6m tok/s: 6587208 +4826/20000 train_loss: 2.3041 train_time: 9.6m tok/s: 6586972 +4827/20000 train_loss: 2.2831 train_time: 9.6m tok/s: 6586737 +4828/20000 train_loss: 2.3220 train_time: 9.6m tok/s: 6586502 +4829/20000 train_loss: 2.3607 train_time: 9.6m tok/s: 6586269 +4830/20000 train_loss: 2.3450 train_time: 9.6m tok/s: 6586050 +4831/20000 train_loss: 2.2964 train_time: 9.6m tok/s: 6585806 +4832/20000 train_loss: 2.2531 train_time: 9.6m tok/s: 6585579 +4833/20000 train_loss: 2.2718 train_time: 9.6m tok/s: 6585338 +4834/20000 train_loss: 2.3919 train_time: 9.6m tok/s: 6585104 +4835/20000 train_loss: 2.3391 train_time: 9.6m tok/s: 6584891 +4836/20000 train_loss: 2.3539 train_time: 9.6m tok/s: 6584639 +4837/20000 train_loss: 2.5280 train_time: 9.6m tok/s: 6584398 +4838/20000 train_loss: 2.2809 train_time: 9.6m tok/s: 6584168 +4839/20000 train_loss: 2.4352 train_time: 9.6m tok/s: 6583947 +4840/20000 train_loss: 2.2695 train_time: 9.6m tok/s: 6583715 +4841/20000 train_loss: 2.4242 train_time: 9.6m tok/s: 6583477 +4842/20000 train_loss: 2.6819 train_time: 9.6m tok/s: 6583233 +4843/20000 train_loss: 2.3050 train_time: 9.6m tok/s: 6583012 +4844/20000 train_loss: 2.2924 train_time: 9.6m tok/s: 6582774 +4845/20000 train_loss: 2.3218 train_time: 9.6m tok/s: 6582543 +4846/20000 train_loss: 2.3417 train_time: 9.6m tok/s: 6582288 +4847/20000 train_loss: 2.5504 train_time: 9.7m tok/s: 6582042 +4848/20000 train_loss: 2.3142 train_time: 9.7m tok/s: 6581817 +4849/20000 train_loss: 2.4529 train_time: 9.7m tok/s: 6581597 +4850/20000 train_loss: 2.2955 train_time: 9.7m tok/s: 6581372 +4851/20000 train_loss: 2.3363 train_time: 9.7m tok/s: 6581139 +4852/20000 
train_loss: 2.3532 train_time: 9.7m tok/s: 6580906 +4853/20000 train_loss: 2.2921 train_time: 9.7m tok/s: 6580665 +4854/20000 train_loss: 2.4601 train_time: 9.7m tok/s: 6580431 +4855/20000 train_loss: 2.3503 train_time: 9.7m tok/s: 6580204 +4856/20000 train_loss: 2.2825 train_time: 9.7m tok/s: 6579973 +4857/20000 train_loss: 2.2879 train_time: 9.7m tok/s: 6579739 +4858/20000 train_loss: 2.2688 train_time: 9.7m tok/s: 6579496 +4859/20000 train_loss: 2.3118 train_time: 9.7m tok/s: 6579257 +4860/20000 train_loss: 2.4731 train_time: 9.7m tok/s: 6579042 +4861/20000 train_loss: 2.4071 train_time: 9.7m tok/s: 6578818 +4862/20000 train_loss: 2.3123 train_time: 9.7m tok/s: 6578578 +4863/20000 train_loss: 2.5817 train_time: 9.7m tok/s: 6578346 +4864/20000 train_loss: 2.3998 train_time: 9.7m tok/s: 6578120 +4865/20000 train_loss: 2.5060 train_time: 9.7m tok/s: 6577877 +4866/20000 train_loss: 2.2712 train_time: 9.7m tok/s: 6577634 +4867/20000 train_loss: 2.2445 train_time: 9.7m tok/s: 6577423 +4868/20000 train_loss: 2.3159 train_time: 9.7m tok/s: 6577203 +4869/20000 train_loss: 2.4389 train_time: 9.7m tok/s: 6576947 +4870/20000 train_loss: 2.3591 train_time: 9.7m tok/s: 6576719 +4871/20000 train_loss: 2.3657 train_time: 9.7m tok/s: 6576484 +4872/20000 train_loss: 2.2724 train_time: 9.7m tok/s: 6576206 +4873/20000 train_loss: 2.5457 train_time: 9.7m tok/s: 6575977 +4874/20000 train_loss: 2.4191 train_time: 9.7m tok/s: 6575769 +4875/20000 train_loss: 2.4048 train_time: 9.7m tok/s: 6575560 +4876/20000 train_loss: 2.4029 train_time: 9.7m tok/s: 6575338 +4877/20000 train_loss: 2.3022 train_time: 9.7m tok/s: 6575112 +4878/20000 train_loss: 2.4043 train_time: 9.7m tok/s: 6574870 +4879/20000 train_loss: 2.3621 train_time: 9.7m tok/s: 6574642 +4880/20000 train_loss: 2.4352 train_time: 9.7m tok/s: 6574418 +4881/20000 train_loss: 2.3415 train_time: 9.7m tok/s: 6574187 +4882/20000 train_loss: 2.2759 train_time: 9.7m tok/s: 6573983 +4883/20000 train_loss: 2.2529 train_time: 9.7m tok/s: 
6573761 +4884/20000 train_loss: 2.6038 train_time: 9.7m tok/s: 6573522 +4885/20000 train_loss: 2.4663 train_time: 9.7m tok/s: 6573298 +4886/20000 train_loss: 2.4370 train_time: 9.7m tok/s: 6573080 +4887/20000 train_loss: 2.4063 train_time: 9.7m tok/s: 6572857 +4888/20000 train_loss: 2.4064 train_time: 9.7m tok/s: 6572626 +4889/20000 train_loss: 2.3502 train_time: 9.8m tok/s: 6572402 +4890/20000 train_loss: 2.4054 train_time: 9.8m tok/s: 6572170 +4891/20000 train_loss: 2.1725 train_time: 9.8m tok/s: 6571937 +4892/20000 train_loss: 1.9582 train_time: 9.8m tok/s: 6571638 +4893/20000 train_loss: 2.4374 train_time: 9.8m tok/s: 6571414 +4894/20000 train_loss: 2.3811 train_time: 9.8m tok/s: 6571200 +4895/20000 train_loss: 2.3791 train_time: 9.8m tok/s: 6571002 +4895/20000 val_loss: 2.3588 val_bpb: 1.0778 +stopping_early: wallclock_cap train_time: 585887ms step: 4895/20000 +peak memory allocated: 41707 MiB reserved: 47048 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:2.33484962 val_bpb:1.06686712 eval_time:7856ms +Serialized model: 135418111 bytes +Code size (uncompressed): 182796 bytes +Code size (compressed): 45910 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 4.1s +Quantized weights: + gate_int8_row: blocks.attn.attn_gate_w + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int6)+lqer_asym: blocks.mlp.fc.weight + gptq (int7)+awqgrpint8+lqer_asym: tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights, smear_gate.weight, smear_lambda, softcap_neg, softcap_pos +Serialize: per-group lrzip compression... 
+Serialize: per-group compression done in 122.7s +Serialized model quantized+pergroup: 15945116 bytes +Total submission size quantized+pergroup: 15991026 bytes +Deserialize: per-group lrzip decompression... +Deserialize: decompression done in 21.0s +diagnostic quantized val_loss:2.35268328 val_bpb:1.07501589 eval_time:10604ms +Deserialize: per-group lrzip decompression... +Deserialize: decompression done in 20.9s +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (105.6s) +v5:precomputing ngram hints OUTSIDE eval timer +ngram_tilt:hints total=47851520 gated=13023303 token_gate=628130 within_gate=9866847 word_gate=2891588 agree2plus=303177 +ngram_tilt:precompute_outside_timer_done elapsed=164.70s total_targets=47851520 + +beginning TTT eval timer +ngram_tilt:using_precomputed_hints total_targets=47851520 (precompute time excluded from eval) +ttt_phased: total_docs:50000 prefix_docs:2500 suffix_docs:47500 num_phases:3 boundaries:[833, 1666, 2500] +ttp: b781/782 bl:2.1447 bb:1.0494 rl:2.1447 rb:1.0494 dl:17258-30330 gd:0 +ttpp: phase:1/3 pd:1296 gd:833 t:224.2s +tttg: c1/131 lr:0.001000 t:0.3s +tttg: c2/131 lr:0.001000 t:0.4s +tttg: c3/131 lr:0.000999 t:0.5s +tttg: c4/131 lr:0.000999 t:0.6s +tttg: c5/131 lr:0.000998 t:0.6s +tttg: c6/131 lr:0.000996 t:0.7s +tttg: c7/131 lr:0.000995 t:0.8s +tttg: c8/131 lr:0.000993 t:0.9s +tttg: c9/131 lr:0.000991 t:1.0s +tttg: c10/131 lr:0.000988 t:1.0s +tttg: c11/131 lr:0.000985 t:1.1s +tttg: c12/131 lr:0.000982 t:1.2s +tttg: c13/131 lr:0.000979 t:1.3s +tttg: c14/131 lr:0.000976 t:1.3s +tttg: c15/131 lr:0.000972 t:1.4s +tttg: c16/131 lr:0.000968 t:1.5s +tttg: c17/131 lr:0.000963 t:1.6s +tttg: c18/131 lr:0.000958 t:1.6s +tttg: c19/131 lr:0.000953 t:1.7s +tttg: c20/131 lr:0.000948 t:1.8s +tttg: c21/131 lr:0.000943 t:1.9s +tttg: c22/131 lr:0.000937 t:1.9s +tttg: c23/131 lr:0.000931 t:2.0s +tttg: c24/131 lr:0.000925 t:2.1s +tttg: c25/131 lr:0.000918 t:2.2s +tttg: c26/131 lr:0.000911 t:2.2s +tttg: 
c27/131 lr:0.000905 t:2.3s +tttg: c28/131 lr:0.000897 t:2.4s +tttg: c29/131 lr:0.000890 t:2.5s +tttg: c30/131 lr:0.000882 t:2.5s +tttg: c31/131 lr:0.000874 t:2.6s +tttg: c32/131 lr:0.000866 t:2.7s +tttg: c33/131 lr:0.000858 t:2.8s +tttg: c34/131 lr:0.000849 t:2.8s +tttg: c35/131 lr:0.000841 t:2.9s +tttg: c36/131 lr:0.000832 t:3.0s +tttg: c37/131 lr:0.000822 t:3.1s +tttg: c38/131 lr:0.000813 t:3.2s +tttg: c39/131 lr:0.000804 t:3.2s +tttg: c40/131 lr:0.000794 t:3.3s +tttg: c41/131 lr:0.000784 t:3.4s +tttg: c42/131 lr:0.000774 t:3.4s +tttg: c43/131 lr:0.000764 t:3.5s +tttg: c44/131 lr:0.000753 t:3.6s +tttg: c45/131 lr:0.000743 t:3.7s +tttg: c46/131 lr:0.000732 t:3.8s +tttg: c47/131 lr:0.000722 t:3.8s +tttg: c48/131 lr:0.000711 t:3.9s +tttg: c49/131 lr:0.000700 t:4.0s +tttg: c50/131 lr:0.000689 t:4.1s +tttg: c51/131 lr:0.000677 t:4.1s +tttg: c52/131 lr:0.000666 t:4.2s +tttg: c53/131 lr:0.000655 t:4.3s +tttg: c54/131 lr:0.000643 t:4.4s +tttg: c55/131 lr:0.000631 t:4.4s +tttg: c56/131 lr:0.000620 t:4.5s +tttg: c57/131 lr:0.000608 t:4.6s +tttg: c58/131 lr:0.000596 t:4.7s +tttg: c59/131 lr:0.000584 t:4.7s +tttg: c60/131 lr:0.000572 t:4.8s +tttg: c61/131 lr:0.000560 t:4.9s +tttg: c62/131 lr:0.000548 t:5.0s +tttg: c63/131 lr:0.000536 t:5.1s +tttg: c64/131 lr:0.000524 t:5.1s +tttg: c65/131 lr:0.000512 t:5.2s +tttg: c66/131 lr:0.000500 t:5.3s +tttg: c67/131 lr:0.000488 t:5.4s +tttg: c68/131 lr:0.000476 t:5.4s +tttg: c69/131 lr:0.000464 t:5.5s +tttg: c70/131 lr:0.000452 t:5.6s +tttg: c71/131 lr:0.000440 t:5.7s +tttg: c72/131 lr:0.000428 t:5.7s +tttg: c73/131 lr:0.000416 t:5.8s +tttg: c74/131 lr:0.000404 t:5.9s +tttg: c75/131 lr:0.000392 t:6.0s +tttg: c76/131 lr:0.000380 t:6.1s +tttg: c77/131 lr:0.000369 t:6.1s +tttg: c78/131 lr:0.000357 t:6.2s +tttg: c79/131 lr:0.000345 t:6.3s +tttg: c80/131 lr:0.000334 t:6.3s +tttg: c81/131 lr:0.000323 t:6.4s +tttg: c82/131 lr:0.000311 t:6.5s +tttg: c83/131 lr:0.000300 t:6.6s +tttg: c84/131 lr:0.000289 t:6.6s +tttg: c85/131 lr:0.000278 t:6.7s 
+tttg: c86/131 lr:0.000268 t:6.8s +tttg: c87/131 lr:0.000257 t:6.9s +tttg: c88/131 lr:0.000247 t:6.9s +tttg: c89/131 lr:0.000236 t:7.0s +tttg: c90/131 lr:0.000226 t:7.1s +tttg: c91/131 lr:0.000216 t:7.2s +tttg: c92/131 lr:0.000206 t:7.3s +tttg: c93/131 lr:0.000196 t:7.3s +tttg: c94/131 lr:0.000187 t:7.4s +tttg: c95/131 lr:0.000178 t:7.5s +tttg: c96/131 lr:0.000168 t:7.6s +tttg: c97/131 lr:0.000159 t:7.6s +tttg: c98/131 lr:0.000151 t:7.7s +tttg: c99/131 lr:0.000142 t:7.8s +tttg: c100/131 lr:0.000134 t:7.9s +tttg: c101/131 lr:0.000126 t:7.9s +tttg: c102/131 lr:0.000118 t:8.0s +tttg: c103/131 lr:0.000110 t:8.1s +tttg: c104/131 lr:0.000103 t:8.2s +tttg: c105/131 lr:0.000095 t:8.2s +tttg: c106/131 lr:0.000089 t:8.3s +tttg: c107/131 lr:0.000082 t:8.4s +tttg: c108/131 lr:0.000075 t:8.5s +tttg: c109/131 lr:0.000069 t:8.5s +tttg: c110/131 lr:0.000063 t:8.6s +tttg: c111/131 lr:0.000057 t:8.7s +tttg: c112/131 lr:0.000052 t:8.8s +tttg: c113/131 lr:0.000047 t:8.8s +tttg: c114/131 lr:0.000042 t:8.9s +tttg: c115/131 lr:0.000037 t:9.0s +tttg: c116/131 lr:0.000032 t:9.1s +tttg: c117/131 lr:0.000028 t:9.1s +tttg: c118/131 lr:0.000024 t:9.2s +tttg: c119/131 lr:0.000021 t:9.3s +tttg: c120/131 lr:0.000018 t:9.4s +tttg: c121/131 lr:0.000015 t:9.5s +tttg: c122/131 lr:0.000012 t:9.5s +tttg: c123/131 lr:0.000009 t:9.6s +tttg: c124/131 lr:0.000007 t:9.7s +tttg: c125/131 lr:0.000005 t:9.8s +tttg: c126/131 lr:0.000004 t:9.8s +tttg: c127/131 lr:0.000002 t:9.9s +tttg: c128/131 lr:0.000001 t:10.0s +tttg: c129/131 lr:0.000001 t:10.1s +tttg: c130/131 lr:0.000000 t:10.1s +ttpr: phase:1/3 t:236.1s +ttp: b755/782 bl:2.3830 bb:1.0764 rl:2.1768 rb:1.0533 dl:3397-3466 gd:0 +ttp: b749/782 bl:2.3934 bb:1.0860 rl:2.2001 rb:1.0570 dl:3039-3089 gd:0 +ttpp: phase:2/3 pd:2128 gd:1666 t:393.1s +tttg: c1/219 lr:0.001000 t:0.1s +tttg: c2/219 lr:0.001000 t:0.2s +tttg: c3/219 lr:0.001000 t:0.3s +tttg: c4/219 lr:0.001000 t:0.3s +tttg: c5/219 lr:0.000999 t:0.4s +tttg: c6/219 lr:0.000999 t:0.5s +tttg: c7/219 
lr:0.000998 t:0.6s +tttg: c8/219 lr:0.000997 t:0.6s +tttg: c9/219 lr:0.000997 t:0.7s +tttg: c10/219 lr:0.000996 t:0.8s +tttg: c11/219 lr:0.000995 t:0.9s +tttg: c12/219 lr:0.000994 t:1.0s +tttg: c13/219 lr:0.000993 t:1.0s +tttg: c14/219 lr:0.000991 t:1.1s +tttg: c15/219 lr:0.000990 t:1.2s +tttg: c16/219 lr:0.000988 t:1.3s +tttg: c17/219 lr:0.000987 t:1.4s +tttg: c18/219 lr:0.000985 t:1.4s +tttg: c19/219 lr:0.000983 t:1.5s +tttg: c20/219 lr:0.000981 t:1.6s +tttg: c21/219 lr:0.000979 t:1.7s +tttg: c22/219 lr:0.000977 t:1.7s +tttg: c23/219 lr:0.000975 t:1.8s +tttg: c24/219 lr:0.000973 t:1.9s +tttg: c25/219 lr:0.000970 t:2.0s +tttg: c26/219 lr:0.000968 t:2.0s +tttg: c27/219 lr:0.000965 t:2.1s +tttg: c28/219 lr:0.000963 t:2.2s +tttg: c29/219 lr:0.000960 t:2.3s +tttg: c30/219 lr:0.000957 t:2.4s +tttg: c31/219 lr:0.000954 t:2.4s +tttg: c32/219 lr:0.000951 t:2.5s +tttg: c33/219 lr:0.000948 t:2.6s +tttg: c34/219 lr:0.000945 t:2.7s +tttg: c35/219 lr:0.000941 t:2.7s +tttg: c36/219 lr:0.000938 t:2.8s +tttg: c37/219 lr:0.000934 t:2.9s +tttg: c38/219 lr:0.000931 t:3.0s +tttg: c39/219 lr:0.000927 t:3.0s +tttg: c40/219 lr:0.000923 t:3.1s +tttg: c41/219 lr:0.000919 t:3.2s +tttg: c42/219 lr:0.000915 t:3.3s +tttg: c43/219 lr:0.000911 t:3.3s +tttg: c44/219 lr:0.000907 t:3.4s +tttg: c45/219 lr:0.000903 t:3.5s +tttg: c46/219 lr:0.000898 t:3.6s +tttg: c47/219 lr:0.000894 t:3.7s +tttg: c48/219 lr:0.000890 t:3.7s +tttg: c49/219 lr:0.000885 t:3.8s +tttg: c50/219 lr:0.000880 t:3.9s +tttg: c51/219 lr:0.000876 t:4.0s +tttg: c52/219 lr:0.000871 t:4.1s +tttg: c53/219 lr:0.000866 t:4.1s +tttg: c54/219 lr:0.000861 t:4.2s +tttg: c55/219 lr:0.000856 t:4.3s +tttg: c56/219 lr:0.000851 t:4.4s +tttg: c57/219 lr:0.000846 t:4.4s +tttg: c58/219 lr:0.000841 t:4.5s +tttg: c59/219 lr:0.000835 t:4.6s +tttg: c60/219 lr:0.000830 t:4.7s +tttg: c61/219 lr:0.000824 t:4.8s +tttg: c62/219 lr:0.000819 t:4.8s +tttg: c63/219 lr:0.000813 t:4.9s +tttg: c64/219 lr:0.000808 t:5.0s +tttg: c65/219 lr:0.000802 t:5.1s +tttg: 
c66/219 lr:0.000796 t:5.1s +tttg: c67/219 lr:0.000790 t:5.2s +tttg: c68/219 lr:0.000784 t:5.3s +tttg: c69/219 lr:0.000779 t:5.4s +tttg: c70/219 lr:0.000773 t:5.4s +tttg: c71/219 lr:0.000766 t:5.5s +tttg: c72/219 lr:0.000760 t:5.6s +tttg: c73/219 lr:0.000754 t:5.7s +tttg: c74/219 lr:0.000748 t:5.8s +tttg: c75/219 lr:0.000742 t:5.8s +tttg: c76/219 lr:0.000735 t:5.9s +tttg: c77/219 lr:0.000729 t:6.0s +tttg: c78/219 lr:0.000722 t:6.1s +tttg: c79/219 lr:0.000716 t:6.1s +tttg: c80/219 lr:0.000709 t:6.2s +tttg: c81/219 lr:0.000703 t:6.3s +tttg: c82/219 lr:0.000696 t:6.4s +tttg: c83/219 lr:0.000690 t:6.4s +tttg: c84/219 lr:0.000683 t:6.5s +tttg: c85/219 lr:0.000676 t:6.6s +tttg: c86/219 lr:0.000670 t:6.7s +tttg: c87/219 lr:0.000663 t:6.8s +tttg: c88/219 lr:0.000656 t:6.8s +tttg: c89/219 lr:0.000649 t:6.9s +tttg: c90/219 lr:0.000642 t:7.0s +tttg: c91/219 lr:0.000635 t:7.1s +tttg: c92/219 lr:0.000628 t:7.1s +tttg: c93/219 lr:0.000621 t:7.2s +tttg: c94/219 lr:0.000614 t:7.3s +tttg: c95/219 lr:0.000607 t:7.4s +tttg: c96/219 lr:0.000600 t:7.4s +tttg: c97/219 lr:0.000593 t:7.5s +tttg: c98/219 lr:0.000586 t:7.6s +tttg: c99/219 lr:0.000579 t:7.7s +tttg: c100/219 lr:0.000572 t:7.7s +tttg: c101/219 lr:0.000565 t:7.8s +tttg: c102/219 lr:0.000558 t:7.9s +tttg: c103/219 lr:0.000550 t:8.0s +tttg: c104/219 lr:0.000543 t:8.1s +tttg: c105/219 lr:0.000536 t:8.1s +tttg: c106/219 lr:0.000529 t:8.2s +tttg: c107/219 lr:0.000522 t:8.3s +tttg: c108/219 lr:0.000514 t:8.4s +tttg: c109/219 lr:0.000507 t:8.4s +tttg: c110/219 lr:0.000500 t:8.5s +tttg: c111/219 lr:0.000493 t:8.6s +tttg: c112/219 lr:0.000486 t:8.7s +tttg: c113/219 lr:0.000478 t:8.8s +tttg: c114/219 lr:0.000471 t:8.8s +tttg: c115/219 lr:0.000464 t:8.9s +tttg: c116/219 lr:0.000457 t:9.0s +tttg: c117/219 lr:0.000450 t:9.1s +tttg: c118/219 lr:0.000442 t:9.1s +tttg: c119/219 lr:0.000435 t:9.2s +tttg: c120/219 lr:0.000428 t:9.3s +tttg: c121/219 lr:0.000421 t:9.4s +tttg: c122/219 lr:0.000414 t:9.5s +tttg: c123/219 lr:0.000407 t:9.5s +tttg: 
c124/219 lr:0.000400 t:9.6s +tttg: c125/219 lr:0.000393 t:9.7s +tttg: c126/219 lr:0.000386 t:9.8s +tttg: c127/219 lr:0.000379 t:9.8s +tttg: c128/219 lr:0.000372 t:9.9s +tttg: c129/219 lr:0.000365 t:10.0s +tttg: c130/219 lr:0.000358 t:10.1s +tttg: c131/219 lr:0.000351 t:10.2s +tttg: c132/219 lr:0.000344 t:10.2s +tttg: c133/219 lr:0.000337 t:10.3s +tttg: c134/219 lr:0.000330 t:10.4s +tttg: c135/219 lr:0.000324 t:10.5s +tttg: c136/219 lr:0.000317 t:10.5s +tttg: c137/219 lr:0.000310 t:10.6s +tttg: c138/219 lr:0.000304 t:10.7s +tttg: c139/219 lr:0.000297 t:10.8s +tttg: c140/219 lr:0.000291 t:10.9s +tttg: c141/219 lr:0.000284 t:10.9s +tttg: c142/219 lr:0.000278 t:11.0s +tttg: c143/219 lr:0.000271 t:11.1s +tttg: c144/219 lr:0.000265 t:11.2s +tttg: c145/219 lr:0.000258 t:11.2s +tttg: c146/219 lr:0.000252 t:11.3s +tttg: c147/219 lr:0.000246 t:11.4s +tttg: c148/219 lr:0.000240 t:11.5s +tttg: c149/219 lr:0.000234 t:11.6s +tttg: c150/219 lr:0.000227 t:11.6s +tttg: c151/219 lr:0.000221 t:11.7s +tttg: c152/219 lr:0.000216 t:11.8s +tttg: c153/219 lr:0.000210 t:11.9s +tttg: c154/219 lr:0.000204 t:11.9s +tttg: c155/219 lr:0.000198 t:12.0s +tttg: c156/219 lr:0.000192 t:12.1s +tttg: c157/219 lr:0.000187 t:12.2s +tttg: c158/219 lr:0.000181 t:12.3s +tttg: c159/219 lr:0.000176 t:12.3s +tttg: c160/219 lr:0.000170 t:12.4s +tttg: c161/219 lr:0.000165 t:12.5s +tttg: c162/219 lr:0.000159 t:12.6s +tttg: c163/219 lr:0.000154 t:12.6s +tttg: c164/219 lr:0.000149 t:12.7s +tttg: c165/219 lr:0.000144 t:12.8s +tttg: c166/219 lr:0.000139 t:12.9s +tttg: c167/219 lr:0.000134 t:12.9s +tttg: c168/219 lr:0.000129 t:13.0s +tttg: c169/219 lr:0.000124 t:13.1s +tttg: c170/219 lr:0.000120 t:13.2s +tttg: c171/219 lr:0.000115 t:13.3s +tttg: c172/219 lr:0.000110 t:13.3s +tttg: c173/219 lr:0.000106 t:13.4s +tttg: c174/219 lr:0.000102 t:13.5s +tttg: c175/219 lr:0.000097 t:13.6s +tttg: c176/219 lr:0.000093 t:13.6s +tttg: c177/219 lr:0.000089 t:13.7s +tttg: c178/219 lr:0.000085 t:13.8s +tttg: c179/219 lr:0.000081 
t:13.9s +tttg: c180/219 lr:0.000077 t:13.9s +tttg: c181/219 lr:0.000073 t:14.0s +tttg: c182/219 lr:0.000069 t:14.1s +tttg: c183/219 lr:0.000066 t:14.2s +tttg: c184/219 lr:0.000062 t:14.3s +tttg: c185/219 lr:0.000059 t:14.3s +tttg: c186/219 lr:0.000055 t:14.4s +tttg: c187/219 lr:0.000052 t:14.5s +tttg: c188/219 lr:0.000049 t:14.6s +tttg: c189/219 lr:0.000046 t:14.6s +tttg: c190/219 lr:0.000043 t:14.7s +tttg: c191/219 lr:0.000040 t:14.8s +tttg: c192/219 lr:0.000037 t:14.9s +tttg: c193/219 lr:0.000035 t:15.0s +tttg: c194/219 lr:0.000032 t:15.0s +tttg: c195/219 lr:0.000030 t:15.1s +tttg: c196/219 lr:0.000027 t:15.2s +tttg: c197/219 lr:0.000025 t:15.3s +tttg: c198/219 lr:0.000023 t:15.3s +tttg: c199/219 lr:0.000021 t:15.4s +tttg: c200/219 lr:0.000019 t:15.5s +tttg: c201/219 lr:0.000017 t:15.6s +tttg: c202/219 lr:0.000015 t:15.7s +tttg: c203/219 lr:0.000013 t:15.7s +tttg: c204/219 lr:0.000012 t:15.8s +tttg: c205/219 lr:0.000010 t:15.9s +tttg: c206/219 lr:0.000009 t:16.0s +tttg: c207/219 lr:0.000007 t:16.0s +tttg: c208/219 lr:0.000006 t:16.1s +tttg: c209/219 lr:0.000005 t:16.2s +tttg: c210/219 lr:0.000004 t:16.3s +tttg: c211/219 lr:0.000003 t:16.3s +tttg: c212/219 lr:0.000003 t:16.4s +tttg: c213/219 lr:0.000002 t:16.5s +tttg: c214/219 lr:0.000001 t:16.6s +tttg: c215/219 lr:0.000001 t:16.7s +tttg: c216/219 lr:0.000000 t:16.7s +tttg: c217/219 lr:0.000000 t:16.8s +tttg: c218/219 lr:0.000000 t:16.9s +ttpr: phase:2/3 t:411.7s +ttp: b748/782 bl:2.3181 bb:1.0818 rl:2.2114 rb:1.0594 dl:2992-3039 gd:0 +ttpp: phase:3/3 pd:2960 gd:2500 t:427.8s +tttg: c1/289 lr:0.001000 t:0.1s +tttg: c2/289 lr:0.001000 t:0.2s +tttg: c3/289 lr:0.001000 t:0.2s +tttg: c4/289 lr:0.001000 t:0.3s +tttg: c5/289 lr:0.001000 t:0.4s +tttg: c6/289 lr:0.000999 t:0.5s +tttg: c7/289 lr:0.000999 t:0.5s +tttg: c8/289 lr:0.000999 t:0.6s +tttg: c9/289 lr:0.000998 t:0.7s +tttg: c10/289 lr:0.000998 t:0.8s +tttg: c11/289 lr:0.000997 t:0.8s +tttg: c12/289 lr:0.000996 t:0.9s +tttg: c13/289 lr:0.000996 t:1.0s +tttg: 
c14/289 lr:0.000995 t:1.1s +tttg: c15/289 lr:0.000994 t:1.1s +tttg: c16/289 lr:0.000993 t:1.2s +tttg: c17/289 lr:0.000992 t:1.3s +tttg: c18/289 lr:0.000991 t:1.4s +tttg: c19/289 lr:0.000990 t:1.5s +tttg: c20/289 lr:0.000989 t:1.6s +tttg: c21/289 lr:0.000988 t:1.6s +tttg: c22/289 lr:0.000987 t:1.7s +tttg: c23/289 lr:0.000986 t:1.8s +tttg: c24/289 lr:0.000984 t:1.9s +tttg: c25/289 lr:0.000983 t:1.9s +tttg: c26/289 lr:0.000982 t:2.0s +tttg: c27/289 lr:0.000980 t:2.1s +tttg: c28/289 lr:0.000978 t:2.2s +tttg: c29/289 lr:0.000977 t:2.2s +tttg: c30/289 lr:0.000975 t:2.3s +tttg: c31/289 lr:0.000973 t:2.4s +tttg: c32/289 lr:0.000972 t:2.5s +tttg: c33/289 lr:0.000970 t:2.5s +tttg: c34/289 lr:0.000968 t:2.6s +tttg: c35/289 lr:0.000966 t:2.7s +tttg: c36/289 lr:0.000964 t:2.8s +tttg: c37/289 lr:0.000962 t:2.9s +tttg: c38/289 lr:0.000960 t:2.9s +tttg: c39/289 lr:0.000958 t:3.0s +tttg: c40/289 lr:0.000955 t:3.1s +tttg: c41/289 lr:0.000953 t:3.2s +tttg: c42/289 lr:0.000951 t:3.3s +tttg: c43/289 lr:0.000948 t:3.3s +tttg: c44/289 lr:0.000946 t:3.4s +tttg: c45/289 lr:0.000944 t:3.5s +tttg: c46/289 lr:0.000941 t:3.6s +tttg: c47/289 lr:0.000938 t:3.6s +tttg: c48/289 lr:0.000936 t:3.7s +tttg: c49/289 lr:0.000933 t:3.8s +tttg: c50/289 lr:0.000930 t:3.9s +tttg: c51/289 lr:0.000927 t:4.0s +tttg: c52/289 lr:0.000925 t:4.0s +tttg: c53/289 lr:0.000922 t:4.1s +tttg: c54/289 lr:0.000919 t:4.2s +tttg: c55/289 lr:0.000916 t:4.3s +tttg: c56/289 lr:0.000913 t:4.4s +tttg: c57/289 lr:0.000910 t:4.4s +tttg: c58/289 lr:0.000906 t:4.5s +tttg: c59/289 lr:0.000903 t:4.6s +tttg: c60/289 lr:0.000900 t:4.7s +tttg: c61/289 lr:0.000897 t:4.7s +tttg: c62/289 lr:0.000893 t:4.8s +tttg: c63/289 lr:0.000890 t:4.9s +tttg: c64/289 lr:0.000887 t:5.0s +tttg: c65/289 lr:0.000883 t:5.1s +tttg: c66/289 lr:0.000879 t:5.1s +tttg: c67/289 lr:0.000876 t:5.2s +tttg: c68/289 lr:0.000872 t:5.3s +tttg: c69/289 lr:0.000869 t:5.4s +tttg: c70/289 lr:0.000865 t:5.4s +tttg: c71/289 lr:0.000861 t:5.5s +tttg: c72/289 lr:0.000857 t:5.6s 
+tttg: c73/289 lr:0.000854 t:5.7s +tttg: c74/289 lr:0.000850 t:5.7s +tttg: c75/289 lr:0.000846 t:5.8s +tttg: c76/289 lr:0.000842 t:5.9s +tttg: c77/289 lr:0.000838 t:6.0s +tttg: c78/289 lr:0.000834 t:6.0s +tttg: c79/289 lr:0.000830 t:6.1s +tttg: c80/289 lr:0.000826 t:6.2s +tttg: c81/289 lr:0.000821 t:6.3s +tttg: c82/289 lr:0.000817 t:6.3s +tttg: c83/289 lr:0.000813 t:6.4s +tttg: c84/289 lr:0.000809 t:6.5s +tttg: c85/289 lr:0.000804 t:6.6s +tttg: c86/289 lr:0.000800 t:6.6s +tttg: c87/289 lr:0.000796 t:6.7s +tttg: c88/289 lr:0.000791 t:6.8s +tttg: c89/289 lr:0.000787 t:6.9s +tttg: c90/289 lr:0.000782 t:7.0s +tttg: c91/289 lr:0.000778 t:7.0s +tttg: c92/289 lr:0.000773 t:7.1s +tttg: c93/289 lr:0.000769 t:7.2s +tttg: c94/289 lr:0.000764 t:7.3s +tttg: c95/289 lr:0.000759 t:7.4s +tttg: c96/289 lr:0.000755 t:7.5s +tttg: c97/289 lr:0.000750 t:7.5s +tttg: c98/289 lr:0.000745 t:7.6s +tttg: c99/289 lr:0.000740 t:7.7s +tttg: c100/289 lr:0.000736 t:7.8s +tttg: c101/289 lr:0.000731 t:7.9s +tttg: c102/289 lr:0.000726 t:8.0s +tttg: c103/289 lr:0.000721 t:8.0s +tttg: c104/289 lr:0.000716 t:8.1s +tttg: c105/289 lr:0.000711 t:8.2s +tttg: c106/289 lr:0.000706 t:8.3s +tttg: c107/289 lr:0.000701 t:8.3s +tttg: c108/289 lr:0.000696 t:8.4s +tttg: c109/289 lr:0.000691 t:8.5s +tttg: c110/289 lr:0.000686 t:8.6s +tttg: c111/289 lr:0.000681 t:8.6s +tttg: c112/289 lr:0.000676 t:8.7s +tttg: c113/289 lr:0.000671 t:8.8s +tttg: c114/289 lr:0.000666 t:8.9s +tttg: c115/289 lr:0.000661 t:9.0s +tttg: c116/289 lr:0.000656 t:9.0s +tttg: c117/289 lr:0.000650 t:9.1s +tttg: c118/289 lr:0.000645 t:9.2s +tttg: c119/289 lr:0.000640 t:9.3s +tttg: c120/289 lr:0.000635 t:9.3s +tttg: c121/289 lr:0.000629 t:9.4s +tttg: c122/289 lr:0.000624 t:9.5s +tttg: c123/289 lr:0.000619 t:9.6s +tttg: c124/289 lr:0.000614 t:9.6s +tttg: c125/289 lr:0.000608 t:9.7s +tttg: c126/289 lr:0.000603 t:9.8s +tttg: c127/289 lr:0.000598 t:9.9s +tttg: c128/289 lr:0.000592 t:10.0s +tttg: c129/289 lr:0.000587 t:10.0s +tttg: c130/289 lr:0.000581 
t:10.1s +tttg: c131/289 lr:0.000576 t:10.2s +tttg: c132/289 lr:0.000571 t:10.3s +tttg: c133/289 lr:0.000565 t:10.3s +tttg: c134/289 lr:0.000560 t:10.4s +tttg: c135/289 lr:0.000554 t:10.5s +tttg: c136/289 lr:0.000549 t:10.6s +tttg: c137/289 lr:0.000544 t:10.6s +tttg: c138/289 lr:0.000538 t:10.7s +tttg: c139/289 lr:0.000533 t:10.8s +tttg: c140/289 lr:0.000527 t:10.9s +tttg: c141/289 lr:0.000522 t:11.0s +tttg: c142/289 lr:0.000516 t:11.0s +tttg: c143/289 lr:0.000511 t:11.1s +tttg: c144/289 lr:0.000505 t:11.2s +tttg: c145/289 lr:0.000500 t:11.3s +tttg: c146/289 lr:0.000495 t:11.3s +tttg: c147/289 lr:0.000489 t:11.4s +tttg: c148/289 lr:0.000484 t:11.5s +tttg: c149/289 lr:0.000478 t:11.6s +tttg: c150/289 lr:0.000473 t:11.6s +tttg: c151/289 lr:0.000467 t:11.7s +tttg: c152/289 lr:0.000462 t:11.8s +tttg: c153/289 lr:0.000456 t:11.9s +tttg: c154/289 lr:0.000451 t:12.0s +tttg: c155/289 lr:0.000446 t:12.0s +tttg: c156/289 lr:0.000440 t:12.1s +tttg: c157/289 lr:0.000435 t:12.2s +tttg: c158/289 lr:0.000429 t:12.3s +tttg: c159/289 lr:0.000424 t:12.3s +tttg: c160/289 lr:0.000419 t:12.4s +tttg: c161/289 lr:0.000413 t:12.5s +tttg: c162/289 lr:0.000408 t:12.6s +tttg: c163/289 lr:0.000402 t:12.7s +tttg: c164/289 lr:0.000397 t:12.8s +tttg: c165/289 lr:0.000392 t:12.8s +tttg: c166/289 lr:0.000386 t:12.9s +tttg: c167/289 lr:0.000381 t:13.0s +tttg: c168/289 lr:0.000376 t:13.1s +tttg: c169/289 lr:0.000371 t:13.1s +tttg: c170/289 lr:0.000365 t:13.2s +tttg: c171/289 lr:0.000360 t:13.3s +tttg: c172/289 lr:0.000355 t:13.4s +tttg: c173/289 lr:0.000350 t:13.5s +tttg: c174/289 lr:0.000344 t:13.5s +tttg: c175/289 lr:0.000339 t:13.6s +tttg: c176/289 lr:0.000334 t:13.7s +tttg: c177/289 lr:0.000329 t:13.8s +tttg: c178/289 lr:0.000324 t:13.9s +tttg: c179/289 lr:0.000319 t:14.0s +tttg: c180/289 lr:0.000314 t:14.0s +tttg: c181/289 lr:0.000309 t:14.1s +tttg: c182/289 lr:0.000304 t:14.2s +tttg: c183/289 lr:0.000299 t:14.3s +tttg: c184/289 lr:0.000294 t:14.4s +tttg: c185/289 lr:0.000289 t:14.4s +tttg: 
c186/289 lr:0.000284 t:14.5s +tttg: c187/289 lr:0.000279 t:14.6s +tttg: c188/289 lr:0.000274 t:14.7s +tttg: c189/289 lr:0.000269 t:14.7s +tttg: c190/289 lr:0.000264 t:14.8s +tttg: c191/289 lr:0.000260 t:14.9s +tttg: c192/289 lr:0.000255 t:15.0s +tttg: c193/289 lr:0.000250 t:15.1s +tttg: c194/289 lr:0.000245 t:15.1s +tttg: c195/289 lr:0.000241 t:15.2s +tttg: c196/289 lr:0.000236 t:15.3s +tttg: c197/289 lr:0.000231 t:15.4s +tttg: c198/289 lr:0.000227 t:15.4s +tttg: c199/289 lr:0.000222 t:15.5s +tttg: c200/289 lr:0.000218 t:15.6s +tttg: c201/289 lr:0.000213 t:15.7s +tttg: c202/289 lr:0.000209 t:15.8s +tttg: c203/289 lr:0.000204 t:15.8s +tttg: c204/289 lr:0.000200 t:15.9s +tttg: c205/289 lr:0.000196 t:16.0s +tttg: c206/289 lr:0.000191 t:16.1s +tttg: c207/289 lr:0.000187 t:16.1s +tttg: c208/289 lr:0.000183 t:16.2s +tttg: c209/289 lr:0.000179 t:16.3s +tttg: c210/289 lr:0.000174 t:16.4s +tttg: c211/289 lr:0.000170 t:16.5s +tttg: c212/289 lr:0.000166 t:16.5s +tttg: c213/289 lr:0.000162 t:16.6s +tttg: c214/289 lr:0.000158 t:16.7s +tttg: c215/289 lr:0.000154 t:16.8s +tttg: c216/289 lr:0.000150 t:16.9s +tttg: c217/289 lr:0.000146 t:16.9s +tttg: c218/289 lr:0.000143 t:17.0s +tttg: c219/289 lr:0.000139 t:17.1s +tttg: c220/289 lr:0.000135 t:17.2s +tttg: c221/289 lr:0.000131 t:17.3s +tttg: c222/289 lr:0.000128 t:17.3s +tttg: c223/289 lr:0.000124 t:17.4s +tttg: c224/289 lr:0.000121 t:17.5s +tttg: c225/289 lr:0.000117 t:17.6s +tttg: c226/289 lr:0.000113 t:17.7s +tttg: c227/289 lr:0.000110 t:17.7s +tttg: c228/289 lr:0.000107 t:17.8s +tttg: c229/289 lr:0.000103 t:17.9s +tttg: c230/289 lr:0.000100 t:18.0s +tttg: c231/289 lr:0.000097 t:18.1s +tttg: c232/289 lr:0.000094 t:18.1s +tttg: c233/289 lr:0.000090 t:18.2s +tttg: c234/289 lr:0.000087 t:18.3s +tttg: c235/289 lr:0.000084 t:18.4s +tttg: c236/289 lr:0.000081 t:18.4s +tttg: c237/289 lr:0.000078 t:18.5s +tttg: c238/289 lr:0.000075 t:18.6s +tttg: c239/289 lr:0.000073 t:18.7s +tttg: c240/289 lr:0.000070 t:18.8s +tttg: c241/289 
lr:0.000067 t:18.8s +tttg: c242/289 lr:0.000064 t:18.9s +tttg: c243/289 lr:0.000062 t:19.0s +tttg: c244/289 lr:0.000059 t:19.1s +tttg: c245/289 lr:0.000056 t:19.1s +tttg: c246/289 lr:0.000054 t:19.2s +tttg: c247/289 lr:0.000052 t:19.3s +tttg: c248/289 lr:0.000049 t:19.4s +tttg: c249/289 lr:0.000047 t:19.5s +tttg: c250/289 lr:0.000045 t:19.5s +tttg: c251/289 lr:0.000042 t:19.6s +tttg: c252/289 lr:0.000040 t:19.7s +tttg: c253/289 lr:0.000038 t:19.8s +tttg: c254/289 lr:0.000036 t:19.9s +tttg: c255/289 lr:0.000034 t:19.9s +tttg: c256/289 lr:0.000032 t:20.0s +tttg: c257/289 lr:0.000030 t:20.1s +tttg: c258/289 lr:0.000028 t:20.2s +tttg: c259/289 lr:0.000027 t:20.3s +tttg: c260/289 lr:0.000025 t:20.3s +tttg: c261/289 lr:0.000023 t:20.4s +tttg: c262/289 lr:0.000022 t:20.5s +tttg: c263/289 lr:0.000020 t:20.6s +tttg: c264/289 lr:0.000018 t:20.7s +tttg: c265/289 lr:0.000017 t:20.7s +tttg: c266/289 lr:0.000016 t:20.8s +tttg: c267/289 lr:0.000014 t:20.9s +tttg: c268/289 lr:0.000013 t:21.0s +tttg: c269/289 lr:0.000012 t:21.1s +tttg: c270/289 lr:0.000011 t:21.1s +tttg: c271/289 lr:0.000010 t:21.2s +tttg: c272/289 lr:0.000009 t:21.3s +tttg: c273/289 lr:0.000008 t:21.4s +tttg: c274/289 lr:0.000007 t:21.4s +tttg: c275/289 lr:0.000006 t:21.5s +tttg: c276/289 lr:0.000005 t:21.6s +tttg: c277/289 lr:0.000004 t:21.7s +tttg: c278/289 lr:0.000004 t:21.8s +tttg: c279/289 lr:0.000003 t:21.8s +tttg: c280/289 lr:0.000002 t:21.9s +tttg: c281/289 lr:0.000002 t:22.0s +tttg: c282/289 lr:0.000001 t:22.1s +tttg: c283/289 lr:0.000001 t:22.1s +tttg: c284/289 lr:0.000001 t:22.2s +tttg: c285/289 lr:0.000000 t:22.3s +tttg: c286/289 lr:0.000000 t:22.4s +tttg: c287/289 lr:0.000000 t:22.5s +tttg: c288/289 lr:0.000000 t:22.5s +ttpr: phase:3/3 t:452.1s +ttp: b732/782 bl:2.3722 bb:1.0924 rl:2.2229 rb:1.0619 dl:2416-2441 gd:1 +ttp: b724/782 bl:2.3161 bb:1.0575 rl:2.2286 rb:1.0616 dl:2203-2231 gd:1 +ttp: b714/782 bl:2.3061 bb:1.0215 rl:2.2327 rb:1.0593 dl:2018-2035 gd:1 +ttp: b709/782 bl:2.4416 bb:1.0922 
rl:2.2428 rb:1.0610 dl:1937-1952 gd:1 +ttp: b700/782 bl:2.2712 bb:1.0142 rl:2.2440 rb:1.0588 dl:1824-1834 gd:1 +ttp: b689/782 bl:2.3865 bb:1.0745 rl:2.2496 rb:1.0595 dl:1706-1715 gd:1 +ttp: b686/782 bl:2.4400 bb:1.0742 rl:2.2566 rb:1.0601 dl:1675-1685 gd:1 +ttp: b673/782 bl:2.3581 bb:1.0585 rl:2.2600 rb:1.0600 dl:1562-1571 gd:1 +ttp: b666/782 bl:2.4083 bb:1.0631 rl:2.2646 rb:1.0601 dl:1507-1514 gd:1 +ttp: b660/782 bl:2.3686 bb:1.0471 rl:2.2677 rb:1.0597 dl:1466-1474 gd:1 +ttp: b653/782 bl:2.2868 bb:1.0368 rl:2.2682 rb:1.0591 dl:1419-1425 gd:1 +ttp: b645/782 bl:2.2988 bb:1.0285 rl:2.2690 rb:1.0582 dl:1367-1375 gd:1 +ttp: b637/782 bl:2.3641 bb:1.0781 rl:2.2713 rb:1.0587 dl:1320-1325 gd:1 +ttp: b629/782 bl:2.3467 bb:1.0099 rl:2.2731 rb:1.0575 dl:1276-1280 gd:1 +ttp: b621/782 bl:2.2885 bb:1.0451 rl:2.2734 rb:1.0572 dl:1231-1237 gd:1 +ttp: b613/782 bl:2.3329 bb:1.0387 rl:2.2746 rb:1.0568 dl:1190-1195 gd:1 +ttp: b606/782 bl:2.3549 bb:1.0641 rl:2.2762 rb:1.0570 dl:1159-1164 gd:1 +ttp: b596/782 bl:2.2875 bb:1.0458 rl:2.2764 rb:1.0568 dl:1115-1119 gd:1 +ttp: b589/782 bl:2.2749 bb:1.0103 rl:2.2764 rb:1.0559 dl:1086-1089 gd:1 +ttp: b582/782 bl:2.3464 bb:1.0306 rl:2.2776 rb:1.0554 dl:1056-1060 gd:1 +ttp: b573/782 bl:2.3663 bb:1.0667 rl:2.2790 rb:1.0556 dl:1021-1025 gd:1 +ttp: b565/782 bl:2.3769 bb:1.0297 rl:2.2805 rb:1.0552 dl:993-997 gd:1 +ttp: b558/782 bl:2.3730 bb:1.0613 rl:2.2819 rb:1.0553 dl:968-972 gd:1 +ttp: b551/782 bl:2.3327 bb:1.0543 rl:2.2826 rb:1.0553 dl:946-949 gd:1 +ttp: b544/782 bl:2.3448 bb:1.0685 rl:2.2835 rb:1.0555 dl:924-927 gd:1 +ttp: b536/782 bl:2.3126 bb:1.0414 rl:2.2839 rb:1.0553 dl:899-902 gd:1 +ttp: b528/782 bl:2.3310 bb:1.0419 rl:2.2845 rb:1.0551 dl:875-878 gd:1 +ttp: b520/782 bl:2.3217 bb:1.0011 rl:2.2849 rb:1.0544 dl:852-854 gd:1 +ttp: b512/782 bl:2.3012 bb:1.0627 rl:2.2851 rb:1.0545 dl:829-832 gd:1 +ttp: b504/782 bl:2.3134 bb:1.0321 rl:2.2855 rb:1.0542 dl:807-809 gd:1 +ttp: b496/782 bl:2.4151 bb:1.0455 rl:2.2869 rb:1.0541 dl:785-788 gd:1 +ttp: 
b488/782 bl:2.2926 bb:1.0088 rl:2.2869 rb:1.0536 dl:766-769 gd:1 +ttp: b483/782 bl:2.2531 bb:1.0279 rl:2.2866 rb:1.0534 dl:754-756 gd:1 +ttp: b462/782 bl:2.3334 bb:1.0357 rl:2.2870 rb:1.0532 dl:706-708 gd:1 +ttp: b454/782 bl:2.3806 bb:1.0812 rl:2.2879 rb:1.0535 dl:689-691 gd:1 +ttp: b447/782 bl:2.3221 bb:1.0667 rl:2.2882 rb:1.0536 dl:674-676 gd:1 +ttp: b439/782 bl:2.3233 bb:1.0367 rl:2.2885 rb:1.0534 dl:657-659 gd:1 +ttp: b431/782 bl:2.3682 bb:1.0506 rl:2.2892 rb:1.0534 dl:642-643 gd:1 +ttp: b423/782 bl:2.3096 bb:1.0538 rl:2.2893 rb:1.0534 dl:626-629 gd:1 +ttp: b415/782 bl:2.2813 bb:1.0567 rl:2.2893 rb:1.0534 dl:611-613 gd:1 +ttp: b408/782 bl:2.2936 bb:1.0665 rl:2.2893 rb:1.0535 dl:597-598 gd:1 +ttp: b401/782 bl:2.2418 bb:1.0300 rl:2.2889 rb:1.0533 dl:584-586 gd:1 +ttp: b393/782 bl:2.2960 bb:1.0545 rl:2.2890 rb:1.0534 dl:570-571 gd:1 +ttp: b385/782 bl:2.4035 bb:1.0718 rl:2.2898 rb:1.0535 dl:555-557 gd:1 +ttp: b377/782 bl:2.2230 bb:1.0184 rl:2.2893 rb:1.0533 dl:542-544 gd:1 +ttp: b369/782 bl:2.3443 bb:1.0591 rl:2.2897 rb:1.0533 dl:528-530 gd:1 +ttp: b361/782 bl:2.3467 bb:1.0956 rl:2.2901 rb:1.0536 dl:515-517 gd:1 +ttp: b353/782 bl:2.1936 bb:1.0031 rl:2.2895 rb:1.0532 dl:501-503 gd:1 +ttp: b345/782 bl:2.3542 bb:1.0717 rl:2.2898 rb:1.0534 dl:489-491 gd:1 +ttp: b337/782 bl:2.3085 bb:1.0505 rl:2.2900 rb:1.0533 dl:477-478 gd:1 +ttp: b329/782 bl:2.2847 bb:1.0826 rl:2.2899 rb:1.0535 dl:465-466 gd:1 +ttp: b321/782 bl:2.3379 bb:1.0673 rl:2.2902 rb:1.0536 dl:453-455 gd:1 +ttp: b313/782 bl:2.2830 bb:1.0757 rl:2.2901 rb:1.0537 dl:440-442 gd:1 +ttp: b305/782 bl:2.3327 bb:1.0843 rl:2.2904 rb:1.0538 dl:429-430 gd:1 +ttp: b297/782 bl:2.3983 bb:1.0836 rl:2.2909 rb:1.0540 dl:417-418 gd:1 +ttp: b289/782 bl:2.3228 bb:1.0803 rl:2.2910 rb:1.0541 dl:405-406 gd:1 +ttp: b281/782 bl:2.2856 bb:1.0835 rl:2.2910 rb:1.0542 dl:394-395 gd:1 +ttp: b273/782 bl:2.3298 bb:1.0739 rl:2.2912 rb:1.0543 dl:383-384 gd:1 +ttp: b265/782 bl:2.3695 bb:1.1025 rl:2.2915 rb:1.0545 dl:372-374 gd:1 +ttp: b258/782 
bl:2.4245 bb:1.0879 rl:2.2921 rb:1.0547 dl:364-365 gd:1 +ttp: b250/782 bl:2.3067 bb:1.0695 rl:2.2921 rb:1.0547 dl:354-355 gd:1 +ttp: b243/782 bl:2.3446 bb:1.0758 rl:2.2923 rb:1.0548 dl:345-346 gd:1 +ttp: b234/782 bl:2.4130 bb:1.1434 rl:2.2928 rb:1.0551 dl:334-335 gd:1 +ttp: b232/782 bl:2.2974 bb:1.0829 rl:2.2928 rb:1.0552 dl:331-333 gd:1 +ttp: b223/782 bl:2.3196 bb:1.1199 rl:2.2929 rb:1.0555 dl:321-322 gd:1 +ttp: b214/782 bl:2.3376 bb:1.1186 rl:2.2930 rb:1.0557 dl:310-312 gd:1 +ttp: b208/782 bl:2.3897 bb:1.1312 rl:2.2934 rb:1.0559 dl:304-305 gd:1 +ttp: b200/782 bl:2.3567 bb:1.0896 rl:2.2936 rb:1.0560 dl:296-297 gd:1 +ttp: b190/782 bl:2.3390 bb:1.0754 rl:2.2937 rb:1.0561 dl:284-285 gd:1 +ttp: b181/782 bl:2.3283 bb:1.1243 rl:2.2938 rb:1.0563 dl:275-276 gd:1 +ttp: b174/782 bl:2.4400 bb:1.1508 rl:2.2943 rb:1.0565 dl:268-269 gd:1 +ttp: b164/782 bl:2.4316 bb:1.1503 rl:2.2946 rb:1.0568 dl:259-260 gd:1 +ttp: b157/782 bl:2.3562 bb:1.1285 rl:2.2948 rb:1.0570 dl:252-253 gd:1 +ttp: b148/782 bl:2.3307 bb:1.1028 rl:2.2949 rb:1.0571 dl:243-244 gd:1 +ttp: b140/782 bl:2.4267 bb:1.1330 rl:2.2952 rb:1.0573 dl:235-236 gd:1 +ttp: b132/782 bl:2.4398 bb:1.1587 rl:2.2956 rb:1.0575 dl:228-229 gd:1 +ttp: b126/782 bl:2.3880 bb:1.1376 rl:2.2958 rb:1.0577 dl:222-223 gd:1 +ttp: b122/782 bl:2.4099 bb:1.1409 rl:2.2961 rb:1.0579 dl:219-219 gd:1 +ttp: b114/782 bl:2.4589 bb:1.1402 rl:2.2965 rb:1.0581 dl:211-212 gd:1 +ttp: b106/782 bl:2.4236 bb:1.1666 rl:2.2967 rb:1.0583 dl:204-205 gd:1 +ttp: b99/782 bl:2.4970 bb:1.1760 rl:2.2972 rb:1.0586 dl:198-199 gd:1 +ttp: b89/782 bl:2.4852 bb:1.1484 rl:2.2975 rb:1.0588 dl:189-190 gd:1 +ttp: b82/782 bl:2.4909 bb:1.1856 rl:2.2979 rb:1.0590 dl:183-183 gd:1 +ttp: b73/782 bl:2.5360 bb:1.2448 rl:2.2984 rb:1.0593 dl:174-175 gd:1 +ttp: b65/782 bl:2.4611 bb:1.1673 rl:2.2986 rb:1.0595 dl:167-169 gd:1 +ttp: b59/782 bl:2.4824 bb:1.1826 rl:2.2990 rb:1.0597 dl:162-163 gd:1 +ttp: b51/782 bl:2.4830 bb:1.1879 rl:2.2993 rb:1.0599 dl:154-155 gd:1 +ttp: b45/782 bl:2.4500 bb:1.1723 
rl:2.2995 rb:1.0601 dl:148-149 gd:1 +ttp: b38/782 bl:2.5963 bb:1.1908 rl:2.2999 rb:1.0603 dl:141-142 gd:1 +ttp: b31/782 bl:2.4286 bb:1.1521 rl:2.3001 rb:1.0604 dl:134-135 gd:1 +ttp: b24/782 bl:2.4516 bb:1.1562 rl:2.3003 rb:1.0605 dl:127-128 gd:1 +ttp: b15/782 bl:2.6514 bb:1.2313 rl:2.3007 rb:1.0607 dl:115-117 gd:1 +ttp: b6/782 bl:2.6955 bb:1.2018 rl:2.3012 rb:1.0609 dl:99-101 gd:1 +quantized_ttt_phased val_loss:2.31756847 val_bpb:1.05903968 eval_time:554723ms +total_eval_time:554.7s diff --git a/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed42.log b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed42.log new file mode 100644 index 0000000000..5187ee3210 --- /dev/null +++ b/records/track_10min_16mb/2026-04-30_NgramTilt_V21_LeakyReLU_1.05851/train_seed42.log @@ -0,0 +1,5847 @@ +nohup: ignoring input +==================================================== + v5 PRIMARY noLC fulltilt + precompute outside timer: V21 + #1953 + #1948 + fulltilt-tilt SEED=42 Thu Apr 30 05:59:22 UTC 2026 + LeakyReLU slope 0.3 (code patch + v5 hint-precompute-outside-timer), EVAL_SEQ_LEN 2048 (no long-ctx for cap), no_qv, fulltilt-tilt +==================================================== +W0430 05:59:24.045000 943476 torch/distributed/run.py:803] +W0430 05:59:24.045000 943476 torch/distributed/run.py:803] ***************************************** +W0430 05:59:24.045000 943476 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0430 05:59:24.045000 943476 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + agree_add_boost: 0.5 + artifact_dir: + attn_clip_sigmas: 13.0 + attn_out_gate_enabled: False + attn_out_gate_src: proj + awq_lite_bits: 8 + awq_lite_enabled: True + awq_lite_group_size: 64 + awq_lite_group_top_k: 1 + beta1: 0.9 + beta2: 0.99 + caseops_enabled: True + compressor: pergroup + data_dir: /runpod-volume/caseops_data/datasets + datasets_dir: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 14.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + fused_ce_enabled: True + gate_window: 12 + gated_attn_enabled: False + gated_attn_init_std: 0.01 + gated_attn_quant_gate: True + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 0.5 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/eda0247f-74dc-42c3-bca1-899aa80e6c11.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + lqer_asym_enabled: True + lqer_asym_group: 64 + lqer_enabled: True + lqer_factor_bits: 4 + lqer_gain_select: False + lqer_rank: 4 + lqer_scope: all + lqer_top_k: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.1 + mlp_clip_sigmas: 11.5 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + 
ngram_hint_precompute_outside: True + ngram_tilt_enabled: True + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2500 + qk_gain_init: 5.25 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: eda0247f-74dc-42c3-bca1-899aa80e6c11 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + smear_gate_enabled: True + sparse_attn_gate_enabled: True + sparse_attn_gate_init_std: 0.0 + sparse_attn_gate_scale: 0.5 + temperature_scale: 1.0 + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + token_boost: 2.625 + token_order: 16 + token_threshold: 0.8 + tokenizer_path: /runpod-volume/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model + train_batch_tokens: 786432 + train_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.99 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 80 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_bytes_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin + val_doc_fraction: 1.0 + val_files: /runpod-volume/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin + val_loss_every: 0 + vocab_size: 8192 + warmdown_frac: 0.85 + warmup_steps: 20 + within_boost: 0.75 + within_tau: 0.45 + word_boost: 0.75 + word_normalize: strip_punct_lower + word_order: 4 + word_tau: 0.65 + world_size: 8 + xsa_last_n: 11 
+train_shards: 1499 +val_tokens: 47851520 +model_params:35945673 +gptq:reserving 0s, effective=599500ms +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +1/20000 train_loss: 9.0087 train_time: 0.0m tok/s: 17813757 +2/20000 train_loss: 12.8396 train_time: 0.0m tok/s: 11353557 +3/20000 train_loss: 10.2198 train_time: 0.0m tok/s: 10164607 +4/20000 train_loss: 8.6819 train_time: 0.0m tok/s: 9670678 +5/20000 train_loss: 7.9295 train_time: 0.0m tok/s: 9401870 +6/20000 train_loss: 7.5654 train_time: 0.0m tok/s: 9223922 +7/20000 train_loss: 7.2991 train_time: 0.0m tok/s: 9106273 +8/20000 train_loss: 6.9412 train_time: 0.0m tok/s: 9019275 +9/20000 train_loss: 6.6091 train_time: 0.0m tok/s: 8953788 +10/20000 train_loss: 6.5120 train_time: 0.0m tok/s: 8893055 +11/20000 train_loss: 6.1863 train_time: 0.0m tok/s: 8750805 +12/20000 train_loss: 5.8676 train_time: 0.0m tok/s: 8702513 +13/20000 train_loss: 5.7070 train_time: 0.0m tok/s: 8669440 +14/20000 train_loss: 5.3298 train_time: 0.0m tok/s: 8640628 +15/20000 train_loss: 5.2682 train_time: 0.0m tok/s: 8621953 +16/20000 train_loss: 5.2408 train_time: 0.0m tok/s: 8604580 +17/20000 train_loss: 5.1147 train_time: 0.0m tok/s: 8593361 +18/20000 train_loss: 5.0961 train_time: 0.0m tok/s: 8588455 +19/20000 train_loss: 4.9911 train_time: 0.0m tok/s: 8583699 +20/20000 train_loss: 4.8820 train_time: 0.0m tok/s: 8577087 +21/20000 train_loss: 4.8061 train_time: 0.0m tok/s: 8553787 +22/20000 train_loss: 4.8512 train_time: 0.0m tok/s: 8532718 +23/20000 train_loss: 4.8027 train_time: 0.0m tok/s: 8518229 +24/20000 
train_loss: 4.9048 train_time: 0.0m tok/s: 8507379 +25/20000 train_loss: 4.6799 train_time: 0.0m tok/s: 8503538 +26/20000 train_loss: 4.7073 train_time: 0.0m tok/s: 8496215 +27/20000 train_loss: 4.5864 train_time: 0.0m tok/s: 8491111 +28/20000 train_loss: 4.6475 train_time: 0.0m tok/s: 8491100 +29/20000 train_loss: 4.5763 train_time: 0.0m tok/s: 8489289 +30/20000 train_loss: 4.5564 train_time: 0.0m tok/s: 8483863 +31/20000 train_loss: 4.5506 train_time: 0.0m tok/s: 8479038 +32/20000 train_loss: 4.5139 train_time: 0.0m tok/s: 8470053 +33/20000 train_loss: 4.4774 train_time: 0.1m tok/s: 8466505 +34/20000 train_loss: 4.4157 train_time: 0.1m tok/s: 8460762 +35/20000 train_loss: 4.3554 train_time: 0.1m tok/s: 8454917 +36/20000 train_loss: 4.5061 train_time: 0.1m tok/s: 8446453 +37/20000 train_loss: 4.4410 train_time: 0.1m tok/s: 8444580 +38/20000 train_loss: 4.3638 train_time: 0.1m tok/s: 8444645 +39/20000 train_loss: 4.4932 train_time: 0.1m tok/s: 8443041 +40/20000 train_loss: 4.4562 train_time: 0.1m tok/s: 8440786 +41/20000 train_loss: 4.3357 train_time: 0.1m tok/s: 8438705 +42/20000 train_loss: 4.2650 train_time: 0.1m tok/s: 8437099 +43/20000 train_loss: 4.2950 train_time: 0.1m tok/s: 8433654 +44/20000 train_loss: 4.2303 train_time: 0.1m tok/s: 8428557 +45/20000 train_loss: 4.3550 train_time: 0.1m tok/s: 8426695 +46/20000 train_loss: 4.2711 train_time: 0.1m tok/s: 8423335 +47/20000 train_loss: 4.1348 train_time: 0.1m tok/s: 8421384 +48/20000 train_loss: 4.1720 train_time: 0.1m tok/s: 8419541 +49/20000 train_loss: 4.1124 train_time: 0.1m tok/s: 8418157 +50/20000 train_loss: 4.0770 train_time: 0.1m tok/s: 8415646 +51/20000 train_loss: 4.2929 train_time: 0.1m tok/s: 8415021 +52/20000 train_loss: 4.2221 train_time: 0.1m tok/s: 8410471 +53/20000 train_loss: 4.1508 train_time: 0.1m tok/s: 8409071 +54/20000 train_loss: 4.1554 train_time: 0.1m tok/s: 8407667 +55/20000 train_loss: 4.1609 train_time: 0.1m tok/s: 8407699 +56/20000 train_loss: 4.0900 train_time: 0.1m tok/s: 
8404234 +57/20000 train_loss: 4.1306 train_time: 0.1m tok/s: 8403971 +58/20000 train_loss: 4.0633 train_time: 0.1m tok/s: 8401762 +59/20000 train_loss: 4.0296 train_time: 0.1m tok/s: 8400594 +60/20000 train_loss: 3.9405 train_time: 0.1m tok/s: 8398508 +61/20000 train_loss: 3.9449 train_time: 0.1m tok/s: 8400351 +62/20000 train_loss: 4.0539 train_time: 0.1m tok/s: 8400275 +63/20000 train_loss: 4.1396 train_time: 0.1m tok/s: 8400582 +64/20000 train_loss: 3.9402 train_time: 0.1m tok/s: 8400382 +65/20000 train_loss: 4.0522 train_time: 0.1m tok/s: 8400438 +66/20000 train_loss: 4.0038 train_time: 0.1m tok/s: 8400514 +67/20000 train_loss: 3.9151 train_time: 0.1m tok/s: 8398036 +68/20000 train_loss: 3.9464 train_time: 0.1m tok/s: 8397950 +69/20000 train_loss: 3.8502 train_time: 0.1m tok/s: 8396709 +70/20000 train_loss: 3.9585 train_time: 0.1m tok/s: 8395402 +71/20000 train_loss: 3.8774 train_time: 0.1m tok/s: 8395004 +72/20000 train_loss: 4.0561 train_time: 0.1m tok/s: 8395002 +73/20000 train_loss: 3.8692 train_time: 0.1m tok/s: 8394867 +74/20000 train_loss: 3.8879 train_time: 0.1m tok/s: 8393030 +75/20000 train_loss: 3.8760 train_time: 0.1m tok/s: 8390432 +76/20000 train_loss: 3.8444 train_time: 0.1m tok/s: 8389896 +77/20000 train_loss: 3.7994 train_time: 0.1m tok/s: 8390289 +78/20000 train_loss: 3.7165 train_time: 0.1m tok/s: 8388436 +79/20000 train_loss: 3.8736 train_time: 0.1m tok/s: 8388948 +80/20000 train_loss: 3.7745 train_time: 0.1m tok/s: 8388136 +81/20000 train_loss: 3.7167 train_time: 0.1m tok/s: 8386692 +82/20000 train_loss: 3.7345 train_time: 0.1m tok/s: 8385416 +83/20000 train_loss: 3.6099 train_time: 0.1m tok/s: 8385015 +84/20000 train_loss: 3.6539 train_time: 0.1m tok/s: 8383756 +85/20000 train_loss: 3.6271 train_time: 0.1m tok/s: 8383049 +86/20000 train_loss: 3.4179 train_time: 0.1m tok/s: 8382227 +87/20000 train_loss: 3.6514 train_time: 0.1m tok/s: 8382745 +88/20000 train_loss: 3.5347 train_time: 0.1m tok/s: 8381423 +89/20000 train_loss: 3.5722 
train_time: 0.1m tok/s: 8379753 +90/20000 train_loss: 3.5772 train_time: 0.1m tok/s: 8379248 +91/20000 train_loss: 3.6206 train_time: 0.1m tok/s: 8379503 +92/20000 train_loss: 3.6984 train_time: 0.1m tok/s: 8379383 +93/20000 train_loss: 3.6010 train_time: 0.1m tok/s: 8378002 +94/20000 train_loss: 3.6329 train_time: 0.1m tok/s: 8378283 +95/20000 train_loss: 3.6013 train_time: 0.1m tok/s: 8378156 +96/20000 train_loss: 3.5712 train_time: 0.2m tok/s: 8377287 +97/20000 train_loss: 3.4690 train_time: 0.2m tok/s: 8376168 +98/20000 train_loss: 3.5339 train_time: 0.2m tok/s: 8375759 +99/20000 train_loss: 3.4887 train_time: 0.2m tok/s: 8376182 +100/20000 train_loss: 3.4266 train_time: 0.2m tok/s: 8375856 +101/20000 train_loss: 3.4329 train_time: 0.2m tok/s: 8374768 +102/20000 train_loss: 3.4855 train_time: 0.2m tok/s: 8374887 +103/20000 train_loss: 3.3708 train_time: 0.2m tok/s: 8374595 +104/20000 train_loss: 3.4741 train_time: 0.2m tok/s: 8374031 +105/20000 train_loss: 3.3666 train_time: 0.2m tok/s: 8373488 +106/20000 train_loss: 3.4951 train_time: 0.2m tok/s: 8372742 +107/20000 train_loss: 3.2530 train_time: 0.2m tok/s: 8371606 +108/20000 train_loss: 3.4156 train_time: 0.2m tok/s: 8370583 +109/20000 train_loss: 3.4078 train_time: 0.2m tok/s: 8369722 +110/20000 train_loss: 3.4401 train_time: 0.2m tok/s: 8369148 +111/20000 train_loss: 3.4264 train_time: 0.2m tok/s: 8368163 +112/20000 train_loss: 3.4336 train_time: 0.2m tok/s: 8368393 +113/20000 train_loss: 3.3479 train_time: 0.2m tok/s: 8368018 +114/20000 train_loss: 3.3913 train_time: 0.2m tok/s: 8368129 +115/20000 train_loss: 3.4235 train_time: 0.2m tok/s: 8367765 +116/20000 train_loss: 3.2332 train_time: 0.2m tok/s: 8367664 +117/20000 train_loss: 3.4513 train_time: 0.2m tok/s: 8367640 +118/20000 train_loss: 3.3891 train_time: 0.2m tok/s: 8366970 +119/20000 train_loss: 3.3608 train_time: 0.2m tok/s: 8366379 +120/20000 train_loss: 3.3482 train_time: 0.2m tok/s: 8366141 +121/20000 train_loss: 3.2975 train_time: 0.2m tok/s: 
8366335 +122/20000 train_loss: 3.3100 train_time: 0.2m tok/s: 8366383 +123/20000 train_loss: 3.2921 train_time: 0.2m tok/s: 8366013 +124/20000 train_loss: 3.3437 train_time: 0.2m tok/s: 8365151 +125/20000 train_loss: 3.2407 train_time: 0.2m tok/s: 8364900 +126/20000 train_loss: 3.2555 train_time: 0.2m tok/s: 8364912 +127/20000 train_loss: 3.2823 train_time: 0.2m tok/s: 8364609 +128/20000 train_loss: 3.3313 train_time: 0.2m tok/s: 8363963 +129/20000 train_loss: 3.2968 train_time: 0.2m tok/s: 8363208 +130/20000 train_loss: 3.2729 train_time: 0.2m tok/s: 8362601 +131/20000 train_loss: 3.2279 train_time: 0.2m tok/s: 8362199 +132/20000 train_loss: 3.1761 train_time: 0.2m tok/s: 8362559 +133/20000 train_loss: 3.2326 train_time: 0.2m tok/s: 8361786 +134/20000 train_loss: 3.1373 train_time: 0.2m tok/s: 8361650 +135/20000 train_loss: 2.9806 train_time: 0.2m tok/s: 8360382 +136/20000 train_loss: 3.2406 train_time: 0.2m tok/s: 8359233 +137/20000 train_loss: 3.0948 train_time: 0.2m tok/s: 8358260 +138/20000 train_loss: 3.2821 train_time: 0.2m tok/s: 8357671 +139/20000 train_loss: 3.2456 train_time: 0.2m tok/s: 8358062 +140/20000 train_loss: 3.1692 train_time: 0.2m tok/s: 8357845 +141/20000 train_loss: 3.0977 train_time: 0.2m tok/s: 8357074 +142/20000 train_loss: 3.2961 train_time: 0.2m tok/s: 8357053 +143/20000 train_loss: 3.3638 train_time: 0.2m tok/s: 8356667 +144/20000 train_loss: 3.2911 train_time: 0.2m tok/s: 8356261 +145/20000 train_loss: 3.2585 train_time: 0.2m tok/s: 8355984 +146/20000 train_loss: 3.2680 train_time: 0.2m tok/s: 8356015 +147/20000 train_loss: 3.1713 train_time: 0.2m tok/s: 8355837 +148/20000 train_loss: 3.1924 train_time: 0.2m tok/s: 8355539 +149/20000 train_loss: 3.2620 train_time: 0.2m tok/s: 8355103 +150/20000 train_loss: 3.1983 train_time: 0.2m tok/s: 8354823 +151/20000 train_loss: 3.5557 train_time: 0.2m tok/s: 8354274 +152/20000 train_loss: 3.1759 train_time: 0.2m tok/s: 8353418 +153/20000 train_loss: 3.2964 train_time: 0.2m tok/s: 8352295 
+154/20000 train_loss: 3.2022 train_time: 0.2m tok/s: 8352619 +155/20000 train_loss: 3.1404 train_time: 0.2m tok/s: 8352635 +156/20000 train_loss: 3.0424 train_time: 0.2m tok/s: 8351886 +157/20000 train_loss: 3.0913 train_time: 0.2m tok/s: 8351593 +158/20000 train_loss: 3.1904 train_time: 0.2m tok/s: 8351467 +159/20000 train_loss: 3.0521 train_time: 0.2m tok/s: 8351143 +160/20000 train_loss: 3.1740 train_time: 0.3m tok/s: 8351057 +161/20000 train_loss: 3.1329 train_time: 0.3m tok/s: 8351128 +162/20000 train_loss: 3.0726 train_time: 0.3m tok/s: 8350005 +163/20000 train_loss: 3.1473 train_time: 0.3m tok/s: 8349350 +164/20000 train_loss: 3.0306 train_time: 0.3m tok/s: 8348909 +165/20000 train_loss: 3.2067 train_time: 0.3m tok/s: 8348720 +166/20000 train_loss: 3.1356 train_time: 0.3m tok/s: 8348859 +167/20000 train_loss: 3.1349 train_time: 0.3m tok/s: 8348034 +168/20000 train_loss: 3.1907 train_time: 0.3m tok/s: 8347737 +169/20000 train_loss: 3.1070 train_time: 0.3m tok/s: 8347296 +170/20000 train_loss: 2.8128 train_time: 0.3m tok/s: 8346560 +171/20000 train_loss: 3.1329 train_time: 0.3m tok/s: 8346074 +172/20000 train_loss: 3.0910 train_time: 0.3m tok/s: 8345964 +173/20000 train_loss: 3.2297 train_time: 0.3m tok/s: 8345421 +174/20000 train_loss: 3.1141 train_time: 0.3m tok/s: 8345533 +175/20000 train_loss: 3.1411 train_time: 0.3m tok/s: 8345439 +176/20000 train_loss: 3.1559 train_time: 0.3m tok/s: 8345618 +177/20000 train_loss: 3.1297 train_time: 0.3m tok/s: 8345302 +178/20000 train_loss: 2.9611 train_time: 0.3m tok/s: 8345277 +179/20000 train_loss: 3.3190 train_time: 0.3m tok/s: 8344937 +180/20000 train_loss: 2.9711 train_time: 0.3m tok/s: 8344615 +181/20000 train_loss: 2.9547 train_time: 0.3m tok/s: 8344803 +182/20000 train_loss: 3.0487 train_time: 0.3m tok/s: 8344401 +183/20000 train_loss: 2.9941 train_time: 0.3m tok/s: 8343265 +184/20000 train_loss: 3.0015 train_time: 0.3m tok/s: 8342762 +185/20000 train_loss: 2.7213 train_time: 0.3m tok/s: 8341850 +186/20000 
train_loss: 3.1062 train_time: 0.3m tok/s: 8341403 +187/20000 train_loss: 3.0427 train_time: 0.3m tok/s: 8341237 +188/20000 train_loss: 3.1932 train_time: 0.3m tok/s: 8341450 +189/20000 train_loss: 3.5065 train_time: 0.3m tok/s: 8341598 +190/20000 train_loss: 3.0809 train_time: 0.3m tok/s: 8341553 +191/20000 train_loss: 3.0474 train_time: 0.3m tok/s: 8341245 +192/20000 train_loss: 3.0148 train_time: 0.3m tok/s: 8341545 +193/20000 train_loss: 2.9913 train_time: 0.3m tok/s: 8341274 +194/20000 train_loss: 2.9980 train_time: 0.3m tok/s: 8340937 +195/20000 train_loss: 2.8926 train_time: 0.3m tok/s: 8340506 +196/20000 train_loss: 3.1216 train_time: 0.3m tok/s: 8338554 +197/20000 train_loss: 3.0418 train_time: 0.3m tok/s: 8340210 +198/20000 train_loss: 3.0488 train_time: 0.3m tok/s: 8339926 +199/20000 train_loss: 3.0536 train_time: 0.3m tok/s: 8340171 +200/20000 train_loss: 3.0576 train_time: 0.3m tok/s: 8340439 +201/20000 train_loss: 3.1083 train_time: 0.3m tok/s: 8340537 +202/20000 train_loss: 3.3213 train_time: 0.3m tok/s: 8340171 +203/20000 train_loss: 3.0699 train_time: 0.3m tok/s: 8339993 +204/20000 train_loss: 3.0792 train_time: 0.3m tok/s: 8339747 +205/20000 train_loss: 3.0570 train_time: 0.3m tok/s: 8339991 +206/20000 train_loss: 2.9469 train_time: 0.3m tok/s: 8339671 +207/20000 train_loss: 3.0879 train_time: 0.3m tok/s: 8339355 +208/20000 train_loss: 2.9330 train_time: 0.3m tok/s: 8339224 +209/20000 train_loss: 2.9950 train_time: 0.3m tok/s: 8338784 +210/20000 train_loss: 3.0690 train_time: 0.3m tok/s: 8338616 +211/20000 train_loss: 3.2682 train_time: 0.3m tok/s: 8338362 +212/20000 train_loss: 3.0158 train_time: 0.3m tok/s: 8338314 +213/20000 train_loss: 2.9386 train_time: 0.3m tok/s: 8337687 +214/20000 train_loss: 3.0900 train_time: 0.3m tok/s: 8336971 +215/20000 train_loss: 3.0341 train_time: 0.3m tok/s: 8335571 +216/20000 train_loss: 3.0879 train_time: 0.3m tok/s: 8336933 +217/20000 train_loss: 3.0184 train_time: 0.3m tok/s: 8336852 +218/20000 train_loss: 
3.0276 train_time: 0.3m tok/s: 8337100 +219/20000 train_loss: 3.1140 train_time: 0.3m tok/s: 8336767 +220/20000 train_loss: 3.3308 train_time: 0.3m tok/s: 8336391 +221/20000 train_loss: 2.9277 train_time: 0.3m tok/s: 8335196 +222/20000 train_loss: 2.9755 train_time: 0.3m tok/s: 8335461 +223/20000 train_loss: 2.9996 train_time: 0.4m tok/s: 8335620 +224/20000 train_loss: 2.9995 train_time: 0.4m tok/s: 8335263 +225/20000 train_loss: 3.0679 train_time: 0.4m tok/s: 8334371 +226/20000 train_loss: 3.0370 train_time: 0.4m tok/s: 8334772 +227/20000 train_loss: 3.0707 train_time: 0.4m tok/s: 8334325 +228/20000 train_loss: 3.0860 train_time: 0.4m tok/s: 8334159 +229/20000 train_loss: 3.0855 train_time: 0.4m tok/s: 8334184 +230/20000 train_loss: 2.9516 train_time: 0.4m tok/s: 8334552 +231/20000 train_loss: 3.1039 train_time: 0.4m tok/s: 8334332 +232/20000 train_loss: 2.9822 train_time: 0.4m tok/s: 8334116 +233/20000 train_loss: 3.0155 train_time: 0.4m tok/s: 8334114 +234/20000 train_loss: 3.0169 train_time: 0.4m tok/s: 8332981 +235/20000 train_loss: 2.9340 train_time: 0.4m tok/s: 8334118 +236/20000 train_loss: 3.0109 train_time: 0.4m tok/s: 8333887 +237/20000 train_loss: 2.8947 train_time: 0.4m tok/s: 8333605 +238/20000 train_loss: 3.0771 train_time: 0.4m tok/s: 8333279 +239/20000 train_loss: 3.0042 train_time: 0.4m tok/s: 8333014 +240/20000 train_loss: 3.1526 train_time: 0.4m tok/s: 8332792 +241/20000 train_loss: 3.0189 train_time: 0.4m tok/s: 8332808 +242/20000 train_loss: 3.0885 train_time: 0.4m tok/s: 8333043 +243/20000 train_loss: 3.0007 train_time: 0.4m tok/s: 8333273 +244/20000 train_loss: 3.0391 train_time: 0.4m tok/s: 8333479 +245/20000 train_loss: 2.9846 train_time: 0.4m tok/s: 8333159 +246/20000 train_loss: 3.0338 train_time: 0.4m tok/s: 8333217 +247/20000 train_loss: 2.9753 train_time: 0.4m tok/s: 8332871 +248/20000 train_loss: 2.8992 train_time: 0.4m tok/s: 8333017 +249/20000 train_loss: 2.9800 train_time: 0.4m tok/s: 8332803 +250/20000 train_loss: 2.9859 
train_time: 0.4m tok/s: 8332863 +251/20000 train_loss: 2.9328 train_time: 0.4m tok/s: 8332695 +252/20000 train_loss: 2.9331 train_time: 0.4m tok/s: 8332541 +253/20000 train_loss: 3.0358 train_time: 0.4m tok/s: 8332566 +254/20000 train_loss: 3.0861 train_time: 0.4m tok/s: 8332325 +255/20000 train_loss: 3.1017 train_time: 0.4m tok/s: 8332127 +256/20000 train_loss: 2.9651 train_time: 0.4m tok/s: 8332143 +257/20000 train_loss: 2.9613 train_time: 0.4m tok/s: 8332171 +258/20000 train_loss: 3.0156 train_time: 0.4m tok/s: 8331260 +259/20000 train_loss: 2.9484 train_time: 0.4m tok/s: 8331211 +260/20000 train_loss: 3.1488 train_time: 0.4m tok/s: 8331242 +261/20000 train_loss: 2.9399 train_time: 0.4m tok/s: 8330621 +262/20000 train_loss: 2.7864 train_time: 0.4m tok/s: 8330479 +263/20000 train_loss: 2.7959 train_time: 0.4m tok/s: 8330379 +264/20000 train_loss: 2.9745 train_time: 0.4m tok/s: 8330305 +265/20000 train_loss: 2.9932 train_time: 0.4m tok/s: 8330181 +266/20000 train_loss: 2.9231 train_time: 0.4m tok/s: 8329836 +267/20000 train_loss: 2.9311 train_time: 0.4m tok/s: 8329800 +268/20000 train_loss: 3.0145 train_time: 0.4m tok/s: 8329781 +269/20000 train_loss: 2.9962 train_time: 0.4m tok/s: 8329907 +270/20000 train_loss: 2.9980 train_time: 0.4m tok/s: 8329722 +271/20000 train_loss: 3.0016 train_time: 0.4m tok/s: 8329809 +272/20000 train_loss: 3.0594 train_time: 0.4m tok/s: 8329559 +273/20000 train_loss: 2.9292 train_time: 0.4m tok/s: 8329830 +274/20000 train_loss: 3.0246 train_time: 0.4m tok/s: 8329830 +275/20000 train_loss: 2.9425 train_time: 0.4m tok/s: 8329905 +276/20000 train_loss: 2.8729 train_time: 0.4m tok/s: 8330087 +277/20000 train_loss: 2.8671 train_time: 0.4m tok/s: 8329803 +278/20000 train_loss: 2.8325 train_time: 0.4m tok/s: 8328977 +279/20000 train_loss: 2.9669 train_time: 0.4m tok/s: 8328393 +280/20000 train_loss: 3.0117 train_time: 0.4m tok/s: 8328377 +281/20000 train_loss: 2.7592 train_time: 0.4m tok/s: 8328443 +282/20000 train_loss: 3.0679 train_time: 
0.4m tok/s: 8328279 +283/20000 train_loss: 2.8644 train_time: 0.4m tok/s: 8328146 +284/20000 train_loss: 2.9139 train_time: 0.4m tok/s: 8326561 +285/20000 train_loss: 2.9631 train_time: 0.4m tok/s: 8327736 +286/20000 train_loss: 2.9969 train_time: 0.5m tok/s: 8327540 +287/20000 train_loss: 2.8249 train_time: 0.5m tok/s: 8327518 +288/20000 train_loss: 2.9647 train_time: 0.5m tok/s: 8327410 +289/20000 train_loss: 2.8874 train_time: 0.5m tok/s: 8327067 +290/20000 train_loss: 2.9129 train_time: 0.5m tok/s: 8326979 +291/20000 train_loss: 2.8766 train_time: 0.5m tok/s: 8326835 +292/20000 train_loss: 2.7213 train_time: 0.5m tok/s: 8326555 +293/20000 train_loss: 2.9355 train_time: 0.5m tok/s: 8326422 +294/20000 train_loss: 3.0614 train_time: 0.5m tok/s: 8326024 +295/20000 train_loss: 3.0062 train_time: 0.5m tok/s: 8326180 +296/20000 train_loss: 3.0598 train_time: 0.5m tok/s: 8326122 +297/20000 train_loss: 2.9419 train_time: 0.5m tok/s: 8325805 +298/20000 train_loss: 2.9812 train_time: 0.5m tok/s: 8325823 +299/20000 train_loss: 2.8210 train_time: 0.5m tok/s: 8325807 +300/20000 train_loss: 3.0208 train_time: 0.5m tok/s: 8325652 +301/20000 train_loss: 2.9634 train_time: 0.5m tok/s: 8325865 +302/20000 train_loss: 2.8640 train_time: 0.5m tok/s: 8325882 +303/20000 train_loss: 2.9272 train_time: 0.5m tok/s: 8325838 +304/20000 train_loss: 2.9339 train_time: 0.5m tok/s: 8325687 +305/20000 train_loss: 2.9361 train_time: 0.5m tok/s: 8325436 +306/20000 train_loss: 3.0064 train_time: 0.5m tok/s: 8325389 +307/20000 train_loss: 2.9115 train_time: 0.5m tok/s: 8325390 +308/20000 train_loss: 2.8993 train_time: 0.5m tok/s: 8325252 +309/20000 train_loss: 3.0366 train_time: 0.5m tok/s: 8324593 +310/20000 train_loss: 2.8584 train_time: 0.5m tok/s: 8325114 +311/20000 train_loss: 2.9297 train_time: 0.5m tok/s: 8325095 +312/20000 train_loss: 2.8386 train_time: 0.5m tok/s: 8324901 +313/20000 train_loss: 2.8361 train_time: 0.5m tok/s: 8324801 +314/20000 train_loss: 2.8713 train_time: 0.5m tok/s: 
8324676 +315/20000 train_loss: 2.9604 train_time: 0.5m tok/s: 8324185 +316/20000 train_loss: 2.6961 train_time: 0.5m tok/s: 8324127 +317/20000 train_loss: 2.8048 train_time: 0.5m tok/s: 8323795 +318/20000 train_loss: 2.9182 train_time: 0.5m tok/s: 8323765 +319/20000 train_loss: 2.8973 train_time: 0.5m tok/s: 8323611 +320/20000 train_loss: 3.0276 train_time: 0.5m tok/s: 8322991 +321/20000 train_loss: 3.0035 train_time: 0.5m tok/s: 8323188 +322/20000 train_loss: 2.9657 train_time: 0.5m tok/s: 8323153 +323/20000 train_loss: 3.0083 train_time: 0.5m tok/s: 8323023 +324/20000 train_loss: 2.9081 train_time: 0.5m tok/s: 8322938 +325/20000 train_loss: 2.8969 train_time: 0.5m tok/s: 8322398 +326/20000 train_loss: 2.8953 train_time: 0.5m tok/s: 8322688 +327/20000 train_loss: 2.8332 train_time: 0.5m tok/s: 8322637 +328/20000 train_loss: 2.8604 train_time: 0.5m tok/s: 8322712 +329/20000 train_loss: 2.8223 train_time: 0.5m tok/s: 8322630 +330/20000 train_loss: 2.7780 train_time: 0.5m tok/s: 8322781 +331/20000 train_loss: 2.8899 train_time: 0.5m tok/s: 8322418 +332/20000 train_loss: 2.9727 train_time: 0.5m tok/s: 8322081 +333/20000 train_loss: 2.8749 train_time: 0.5m tok/s: 8321993 +334/20000 train_loss: 3.0885 train_time: 0.5m tok/s: 8322059 +335/20000 train_loss: 2.8467 train_time: 0.5m tok/s: 8321755 +336/20000 train_loss: 2.9329 train_time: 0.5m tok/s: 8321719 +337/20000 train_loss: 2.8175 train_time: 0.5m tok/s: 8321781 +338/20000 train_loss: 2.8875 train_time: 0.5m tok/s: 8321875 +339/20000 train_loss: 2.9367 train_time: 0.5m tok/s: 8321947 +340/20000 train_loss: 2.9688 train_time: 0.5m tok/s: 8321799 +341/20000 train_loss: 2.9263 train_time: 0.5m tok/s: 8321725 +342/20000 train_loss: 2.8094 train_time: 0.5m tok/s: 8321677 +343/20000 train_loss: 2.9143 train_time: 0.5m tok/s: 8321646 +344/20000 train_loss: 2.8264 train_time: 0.5m tok/s: 8321479 +345/20000 train_loss: 2.8572 train_time: 0.5m tok/s: 8321297 +346/20000 train_loss: 2.8709 train_time: 0.5m tok/s: 8321213 
+347/20000 train_loss: 2.8884 train_time: 0.5m tok/s: 8321148 +348/20000 train_loss: 2.8568 train_time: 0.5m tok/s: 8321013 +349/20000 train_loss: 2.9316 train_time: 0.5m tok/s: 8321070 +350/20000 train_loss: 2.7782 train_time: 0.6m tok/s: 8321058 +351/20000 train_loss: 2.7981 train_time: 0.6m tok/s: 8321142 +352/20000 train_loss: 2.7647 train_time: 0.6m tok/s: 8320978 +353/20000 train_loss: 2.6155 train_time: 0.6m tok/s: 8320685 +354/20000 train_loss: 2.9856 train_time: 0.6m tok/s: 8320417 +355/20000 train_loss: 2.9201 train_time: 0.6m tok/s: 8320309 +356/20000 train_loss: 2.8311 train_time: 0.6m tok/s: 8320034 +357/20000 train_loss: 2.7778 train_time: 0.6m tok/s: 8319629 +358/20000 train_loss: 2.7863 train_time: 0.6m tok/s: 8319566 +359/20000 train_loss: 2.8946 train_time: 0.6m tok/s: 8319542 +360/20000 train_loss: 2.8906 train_time: 0.6m tok/s: 8319427 +361/20000 train_loss: 2.9662 train_time: 0.6m tok/s: 8319130 +362/20000 train_loss: 2.8707 train_time: 0.6m tok/s: 8318978 +363/20000 train_loss: 2.9528 train_time: 0.6m tok/s: 8318867 +364/20000 train_loss: 2.8166 train_time: 0.6m tok/s: 8318392 +365/20000 train_loss: 2.8019 train_time: 0.6m tok/s: 8318405 +366/20000 train_loss: 2.8067 train_time: 0.6m tok/s: 8318201 +367/20000 train_loss: 2.9206 train_time: 0.6m tok/s: 8318214 +368/20000 train_loss: 2.7269 train_time: 0.6m tok/s: 8317967 +369/20000 train_loss: 2.8811 train_time: 0.6m tok/s: 8317919 +370/20000 train_loss: 2.8711 train_time: 0.6m tok/s: 8317879 +371/20000 train_loss: 2.8781 train_time: 0.6m tok/s: 8317929 +372/20000 train_loss: 2.8298 train_time: 0.6m tok/s: 8317786 +373/20000 train_loss: 2.7126 train_time: 0.6m tok/s: 8318007 +374/20000 train_loss: 2.7281 train_time: 0.6m tok/s: 8318016 +375/20000 train_loss: 2.6767 train_time: 0.6m tok/s: 8317858 +376/20000 train_loss: 2.9141 train_time: 0.6m tok/s: 8317740 +377/20000 train_loss: 2.7271 train_time: 0.6m tok/s: 8317852 +378/20000 train_loss: 2.8231 train_time: 0.6m tok/s: 8317857 +379/20000 
train_loss: 2.8799 train_time: 0.6m tok/s: 8317751 +380/20000 train_loss: 2.8882 train_time: 0.6m tok/s: 8317532 +381/20000 train_loss: 2.9014 train_time: 0.6m tok/s: 8317431 +382/20000 train_loss: 2.9543 train_time: 0.6m tok/s: 8317344 +383/20000 train_loss: 2.9433 train_time: 0.6m tok/s: 8317490 +384/20000 train_loss: 2.8102 train_time: 0.6m tok/s: 8317269 +385/20000 train_loss: 2.8273 train_time: 0.6m tok/s: 8317080 +386/20000 train_loss: 2.8751 train_time: 0.6m tok/s: 8317092 +387/20000 train_loss: 3.0446 train_time: 0.6m tok/s: 8316158 +388/20000 train_loss: 2.8779 train_time: 0.6m tok/s: 8316816 +389/20000 train_loss: 2.9085 train_time: 0.6m tok/s: 8316926 +390/20000 train_loss: 2.7441 train_time: 0.6m tok/s: 8316754 +391/20000 train_loss: 2.7159 train_time: 0.6m tok/s: 8316713 +392/20000 train_loss: 2.7758 train_time: 0.6m tok/s: 8316594 +393/20000 train_loss: 2.8452 train_time: 0.6m tok/s: 8316843 +394/20000 train_loss: 2.8291 train_time: 0.6m tok/s: 8316693 +395/20000 train_loss: 2.9269 train_time: 0.6m tok/s: 8316377 +396/20000 train_loss: 2.8220 train_time: 0.6m tok/s: 8316141 +397/20000 train_loss: 2.8282 train_time: 0.6m tok/s: 8316093 +398/20000 train_loss: 2.8551 train_time: 0.6m tok/s: 8315985 +399/20000 train_loss: 2.7685 train_time: 0.6m tok/s: 8315874 +400/20000 train_loss: 2.8592 train_time: 0.6m tok/s: 8315853 +401/20000 train_loss: 2.8565 train_time: 0.6m tok/s: 8316107 +402/20000 train_loss: 2.7225 train_time: 0.6m tok/s: 8316121 +403/20000 train_loss: 2.9507 train_time: 0.6m tok/s: 8316004 +404/20000 train_loss: 2.9292 train_time: 0.6m tok/s: 8315904 +405/20000 train_loss: 2.9241 train_time: 0.6m tok/s: 8315804 +406/20000 train_loss: 2.8062 train_time: 0.6m tok/s: 8315767 +407/20000 train_loss: 2.8308 train_time: 0.6m tok/s: 8315343 +408/20000 train_loss: 2.8341 train_time: 0.6m tok/s: 8315222 +409/20000 train_loss: 2.8022 train_time: 0.6m tok/s: 8315183 +410/20000 train_loss: 2.8703 train_time: 0.6m tok/s: 8315171 +411/20000 train_loss: 
2.8135 train_time: 0.6m tok/s: 8315082 +412/20000 train_loss: 2.8144 train_time: 0.6m tok/s: 8315275 +413/20000 train_loss: 2.7057 train_time: 0.7m tok/s: 8315177 +414/20000 train_loss: 2.7273 train_time: 0.7m tok/s: 8315152 +415/20000 train_loss: 2.7018 train_time: 0.7m tok/s: 8315071 +416/20000 train_loss: 2.7630 train_time: 0.7m tok/s: 8315104 +417/20000 train_loss: 2.7665 train_time: 0.7m tok/s: 8315036 +418/20000 train_loss: 2.7874 train_time: 0.7m tok/s: 8314931 +419/20000 train_loss: 2.8066 train_time: 0.7m tok/s: 8314916 +420/20000 train_loss: 2.7979 train_time: 0.7m tok/s: 8314897 +421/20000 train_loss: 2.8659 train_time: 0.7m tok/s: 8315051 +422/20000 train_loss: 2.8414 train_time: 0.7m tok/s: 8315020 +423/20000 train_loss: 2.8364 train_time: 0.7m tok/s: 8314955 +424/20000 train_loss: 2.9020 train_time: 0.7m tok/s: 8314715 +425/20000 train_loss: 2.8008 train_time: 0.7m tok/s: 8314679 +426/20000 train_loss: 2.8228 train_time: 0.7m tok/s: 8314004 +427/20000 train_loss: 2.8269 train_time: 0.7m tok/s: 8314176 +428/20000 train_loss: 2.7908 train_time: 0.7m tok/s: 8314169 +429/20000 train_loss: 2.7270 train_time: 0.7m tok/s: 8314242 +430/20000 train_loss: 2.8591 train_time: 0.7m tok/s: 8314148 +431/20000 train_loss: 2.6819 train_time: 0.7m tok/s: 8314005 +432/20000 train_loss: 2.7330 train_time: 0.7m tok/s: 8314004 +433/20000 train_loss: 2.6762 train_time: 0.7m tok/s: 8314043 +434/20000 train_loss: 2.6429 train_time: 0.7m tok/s: 8313725 +435/20000 train_loss: 2.8633 train_time: 0.7m tok/s: 8313267 +436/20000 train_loss: 2.4870 train_time: 0.7m tok/s: 8313056 +437/20000 train_loss: 2.7377 train_time: 0.7m tok/s: 8313060 +438/20000 train_loss: 2.8543 train_time: 0.7m tok/s: 8313230 +439/20000 train_loss: 2.7657 train_time: 0.7m tok/s: 8312519 +440/20000 train_loss: 2.6723 train_time: 0.7m tok/s: 8312566 +441/20000 train_loss: 2.9090 train_time: 0.7m tok/s: 8312498 +442/20000 train_loss: 2.9619 train_time: 0.7m tok/s: 8312471 +443/20000 train_loss: 2.9135 
train_time: 0.7m tok/s: 8312548 +444/20000 train_loss: 2.9309 train_time: 0.7m tok/s: 8312575 +445/20000 train_loss: 2.8900 train_time: 0.7m tok/s: 8312426 +446/20000 train_loss: 2.7687 train_time: 0.7m tok/s: 8312654 +447/20000 train_loss: 2.7951 train_time: 0.7m tok/s: 8312541 +448/20000 train_loss: 2.8127 train_time: 0.7m tok/s: 8312831 +449/20000 train_loss: 2.7771 train_time: 0.7m tok/s: 8312591 +450/20000 train_loss: 2.8180 train_time: 0.7m tok/s: 8312718 +451/20000 train_loss: 2.5212 train_time: 0.7m tok/s: 8312486 +452/20000 train_loss: 2.7493 train_time: 0.7m tok/s: 8312205 +453/20000 train_loss: 2.6864 train_time: 0.7m tok/s: 8311986 +454/20000 train_loss: 2.6903 train_time: 0.7m tok/s: 8311729 +455/20000 train_loss: 2.7531 train_time: 0.7m tok/s: 8311628 +456/20000 train_loss: 2.7666 train_time: 0.7m tok/s: 8311696 +457/20000 train_loss: 2.6783 train_time: 0.7m tok/s: 8311469 +458/20000 train_loss: 2.7570 train_time: 0.7m tok/s: 8311507 +459/20000 train_loss: 2.8813 train_time: 0.7m tok/s: 8310666 +460/20000 train_loss: 2.8033 train_time: 0.7m tok/s: 8311006 +461/20000 train_loss: 2.8661 train_time: 0.7m tok/s: 8311107 +462/20000 train_loss: 2.9150 train_time: 0.7m tok/s: 8311145 +463/20000 train_loss: 2.8069 train_time: 0.7m tok/s: 8311213 +464/20000 train_loss: 2.7663 train_time: 0.7m tok/s: 8311417 +465/20000 train_loss: 2.9416 train_time: 0.7m tok/s: 8311225 +466/20000 train_loss: 2.8469 train_time: 0.7m tok/s: 8311148 +467/20000 train_loss: 2.8367 train_time: 0.7m tok/s: 8311163 +468/20000 train_loss: 2.9814 train_time: 0.7m tok/s: 8311023 +469/20000 train_loss: 2.7276 train_time: 0.7m tok/s: 8310674 +470/20000 train_loss: 2.7628 train_time: 0.7m tok/s: 8310762 +471/20000 train_loss: 2.8858 train_time: 0.7m tok/s: 8310412 +472/20000 train_loss: 2.9824 train_time: 0.7m tok/s: 8310005 +473/20000 train_loss: 2.7126 train_time: 0.7m tok/s: 8309616 +474/20000 train_loss: 2.6865 train_time: 0.7m tok/s: 8309650 +475/20000 train_loss: 2.8525 train_time: 
0.7m tok/s: 8309686 +476/20000 train_loss: 2.6161 train_time: 0.8m tok/s: 8309297 +477/20000 train_loss: 2.7182 train_time: 0.8m tok/s: 8309275 +478/20000 train_loss: 2.8182 train_time: 0.8m tok/s: 8309285 +479/20000 train_loss: 2.7912 train_time: 0.8m tok/s: 8309196 +480/20000 train_loss: 3.0399 train_time: 0.8m tok/s: 8309076 +481/20000 train_loss: 2.8450 train_time: 0.8m tok/s: 8309086 +482/20000 train_loss: 2.7815 train_time: 0.8m tok/s: 8309044 +483/20000 train_loss: 2.8191 train_time: 0.8m tok/s: 8309004 +484/20000 train_loss: 2.8777 train_time: 0.8m tok/s: 8308939 +485/20000 train_loss: 2.7558 train_time: 0.8m tok/s: 8308935 +486/20000 train_loss: 2.7538 train_time: 0.8m tok/s: 8309007 +487/20000 train_loss: 2.8186 train_time: 0.8m tok/s: 8308980 +488/20000 train_loss: 2.7534 train_time: 0.8m tok/s: 8308836 +489/20000 train_loss: 2.3565 train_time: 0.8m tok/s: 8308651 +490/20000 train_loss: 2.8571 train_time: 0.8m tok/s: 8308609 +491/20000 train_loss: 2.7728 train_time: 0.8m tok/s: 8308671 +492/20000 train_loss: 2.7752 train_time: 0.8m tok/s: 8308680 +493/20000 train_loss: 2.6710 train_time: 0.8m tok/s: 8308673 +494/20000 train_loss: 2.6763 train_time: 0.8m tok/s: 8308566 +495/20000 train_loss: 2.7887 train_time: 0.8m tok/s: 8308473 +496/20000 train_loss: 2.6865 train_time: 0.8m tok/s: 8308195 +497/20000 train_loss: 2.9280 train_time: 0.8m tok/s: 8308275 +498/20000 train_loss: 2.8232 train_time: 0.8m tok/s: 8308297 +499/20000 train_loss: 2.9264 train_time: 0.8m tok/s: 8308394 +500/20000 train_loss: 2.7415 train_time: 0.8m tok/s: 8308360 +501/20000 train_loss: 2.9090 train_time: 0.8m tok/s: 8308509 +502/20000 train_loss: 2.7082 train_time: 0.8m tok/s: 8308528 +503/20000 train_loss: 2.7864 train_time: 0.8m tok/s: 8308354 +504/20000 train_loss: 2.6837 train_time: 0.8m tok/s: 8308240 +505/20000 train_loss: 2.8765 train_time: 0.8m tok/s: 8308086 +506/20000 train_loss: 2.7881 train_time: 0.8m tok/s: 8308065 +507/20000 train_loss: 2.7592 train_time: 0.8m tok/s: 
8308005 +508/20000 train_loss: 2.9152 train_time: 0.8m tok/s: 8308100 +509/20000 train_loss: 2.8975 train_time: 0.8m tok/s: 8308172 +510/20000 train_loss: 2.6950 train_time: 0.8m tok/s: 8308205 +511/20000 train_loss: 2.8679 train_time: 0.8m tok/s: 8307947 +512/20000 train_loss: 2.8523 train_time: 0.8m tok/s: 8307925 +513/20000 train_loss: 2.8902 train_time: 0.8m tok/s: 8307941 +514/20000 train_loss: 2.8456 train_time: 0.8m tok/s: 8307848 +515/20000 train_loss: 2.8484 train_time: 0.8m tok/s: 8307424 +516/20000 train_loss: 2.7200 train_time: 0.8m tok/s: 8307382 +517/20000 train_loss: 2.8059 train_time: 0.8m tok/s: 8307449 +518/20000 train_loss: 2.9380 train_time: 0.8m tok/s: 8307561 +519/20000 train_loss: 2.7365 train_time: 0.8m tok/s: 8307477 +520/20000 train_loss: 2.6585 train_time: 0.8m tok/s: 8307486 +521/20000 train_loss: 2.7645 train_time: 0.8m tok/s: 8307430 +522/20000 train_loss: 2.7411 train_time: 0.8m tok/s: 8307471 +523/20000 train_loss: 2.7244 train_time: 0.8m tok/s: 8307383 +524/20000 train_loss: 2.7952 train_time: 0.8m tok/s: 8307405 +525/20000 train_loss: 2.7181 train_time: 0.8m tok/s: 8307043 +526/20000 train_loss: 2.8226 train_time: 0.8m tok/s: 8307152 +527/20000 train_loss: 2.8789 train_time: 0.8m tok/s: 8306749 +528/20000 train_loss: 2.8669 train_time: 0.8m tok/s: 8306605 +529/20000 train_loss: 2.8805 train_time: 0.8m tok/s: 8306536 +530/20000 train_loss: 2.8998 train_time: 0.8m tok/s: 8306383 +531/20000 train_loss: 3.2048 train_time: 0.8m tok/s: 8306310 +532/20000 train_loss: 3.1061 train_time: 0.8m tok/s: 8305992 +533/20000 train_loss: 2.6725 train_time: 0.8m tok/s: 8306018 +534/20000 train_loss: 2.8683 train_time: 0.8m tok/s: 8306094 +535/20000 train_loss: 2.7898 train_time: 0.8m tok/s: 8305996 +536/20000 train_loss: 2.6734 train_time: 0.8m tok/s: 8305657 +537/20000 train_loss: 2.8922 train_time: 0.8m tok/s: 8305362 +538/20000 train_loss: 2.7378 train_time: 0.8m tok/s: 8305328 +539/20000 train_loss: 2.8475 train_time: 0.9m tok/s: 8305424 
+540/20000 train_loss: 2.8540 train_time: 0.9m tok/s: 8305267 +541/20000 train_loss: 2.2827 train_time: 0.9m tok/s: 8305051 +542/20000 train_loss: 2.8403 train_time: 0.9m tok/s: 8304830 +543/20000 train_loss: 2.8078 train_time: 0.9m tok/s: 8304863 +544/20000 train_loss: 2.8363 train_time: 0.9m tok/s: 8304875 +545/20000 train_loss: 2.7789 train_time: 0.9m tok/s: 8304837 +546/20000 train_loss: 2.8094 train_time: 0.9m tok/s: 8304845 +547/20000 train_loss: 2.7682 train_time: 0.9m tok/s: 8304970 +548/20000 train_loss: 2.7376 train_time: 0.9m tok/s: 8304624 +549/20000 train_loss: 2.6815 train_time: 0.9m tok/s: 8304422 +550/20000 train_loss: 2.7734 train_time: 0.9m tok/s: 8304351 +551/20000 train_loss: 2.7257 train_time: 0.9m tok/s: 8304311 +552/20000 train_loss: 2.8900 train_time: 0.9m tok/s: 8303824 +553/20000 train_loss: 2.7458 train_time: 0.9m tok/s: 8303474 +554/20000 train_loss: 2.5926 train_time: 0.9m tok/s: 8303460 +555/20000 train_loss: 2.6650 train_time: 0.9m tok/s: 8303556 +556/20000 train_loss: 2.7554 train_time: 0.9m tok/s: 8303659 +557/20000 train_loss: 2.8711 train_time: 0.9m tok/s: 8303756 +558/20000 train_loss: 2.8283 train_time: 0.9m tok/s: 8303526 +559/20000 train_loss: 2.7465 train_time: 0.9m tok/s: 8303731 +560/20000 train_loss: 2.7803 train_time: 0.9m tok/s: 8303558 +561/20000 train_loss: 2.7912 train_time: 0.9m tok/s: 8303536 +562/20000 train_loss: 2.8427 train_time: 0.9m tok/s: 8303559 +563/20000 train_loss: 2.8295 train_time: 0.9m tok/s: 8303601 +564/20000 train_loss: 2.9396 train_time: 0.9m tok/s: 8303538 +565/20000 train_loss: 2.8431 train_time: 0.9m tok/s: 8303338 +566/20000 train_loss: 2.7443 train_time: 0.9m tok/s: 8303263 +567/20000 train_loss: 2.6899 train_time: 0.9m tok/s: 8303353 +568/20000 train_loss: 2.8259 train_time: 0.9m tok/s: 8303435 +569/20000 train_loss: 2.6689 train_time: 0.9m tok/s: 8303411 +570/20000 train_loss: 2.6824 train_time: 0.9m tok/s: 8303272 +571/20000 train_loss: 2.7864 train_time: 0.9m tok/s: 8303098 +572/20000 
train_loss: 2.6360 train_time: 0.9m tok/s: 8302952 +573/20000 train_loss: 2.6437 train_time: 0.9m tok/s: 8303003 +574/20000 train_loss: 2.7343 train_time: 0.9m tok/s: 8303107 +575/20000 train_loss: 2.5123 train_time: 0.9m tok/s: 8303195 +576/20000 train_loss: 2.7701 train_time: 0.9m tok/s: 8302931 +577/20000 train_loss: 2.8539 train_time: 0.9m tok/s: 8302634 +578/20000 train_loss: 2.8321 train_time: 0.9m tok/s: 8302643 +579/20000 train_loss: 2.7331 train_time: 0.9m tok/s: 8302745 +580/20000 train_loss: 2.8064 train_time: 0.9m tok/s: 8302889 +581/20000 train_loss: 2.7787 train_time: 0.9m tok/s: 8302723 +582/20000 train_loss: 2.7719 train_time: 0.9m tok/s: 8302499 +583/20000 train_loss: 2.7269 train_time: 0.9m tok/s: 8302717 +584/20000 train_loss: 2.7871 train_time: 0.9m tok/s: 8302759 +585/20000 train_loss: 2.8013 train_time: 0.9m tok/s: 8302767 +586/20000 train_loss: 2.6355 train_time: 0.9m tok/s: 8302706 +587/20000 train_loss: 2.7267 train_time: 0.9m tok/s: 8302674 +588/20000 train_loss: 2.7054 train_time: 0.9m tok/s: 8302707 +589/20000 train_loss: 2.7467 train_time: 0.9m tok/s: 8302532 +590/20000 train_loss: 2.7655 train_time: 0.9m tok/s: 8302700 +591/20000 train_loss: 2.7554 train_time: 0.9m tok/s: 8302719 +592/20000 train_loss: 2.7378 train_time: 0.9m tok/s: 8302804 +593/20000 train_loss: 2.7391 train_time: 0.9m tok/s: 8302773 +594/20000 train_loss: 2.6362 train_time: 0.9m tok/s: 8302780 +595/20000 train_loss: 2.8036 train_time: 0.9m tok/s: 8302789 +596/20000 train_loss: 2.6761 train_time: 0.9m tok/s: 8302842 +597/20000 train_loss: 2.7522 train_time: 0.9m tok/s: 8302926 +598/20000 train_loss: 2.7887 train_time: 0.9m tok/s: 8302947 +599/20000 train_loss: 2.6972 train_time: 0.9m tok/s: 8302596 +600/20000 train_loss: 2.7420 train_time: 0.9m tok/s: 8302782 +601/20000 train_loss: 2.7255 train_time: 0.9m tok/s: 8302690 +602/20000 train_loss: 2.7601 train_time: 1.0m tok/s: 8302625 +603/20000 train_loss: 2.7557 train_time: 1.0m tok/s: 8302632 +604/20000 train_loss: 
2.7439 train_time: 1.0m tok/s: 8302469 +605/20000 train_loss: 2.6520 train_time: 1.0m tok/s: 8302464 +606/20000 train_loss: 2.6518 train_time: 1.0m tok/s: 8302559 +607/20000 train_loss: 2.7337 train_time: 1.0m tok/s: 8302665 +608/20000 train_loss: 2.6602 train_time: 1.0m tok/s: 8302493 +609/20000 train_loss: 2.7255 train_time: 1.0m tok/s: 8302490 +610/20000 train_loss: 2.7681 train_time: 1.0m tok/s: 8302524 +611/20000 train_loss: 2.8850 train_time: 1.0m tok/s: 8302421 +612/20000 train_loss: 2.8319 train_time: 1.0m tok/s: 8302218 +613/20000 train_loss: 2.8064 train_time: 1.0m tok/s: 8302200 +614/20000 train_loss: 2.7984 train_time: 1.0m tok/s: 8302113 +615/20000 train_loss: 2.7475 train_time: 1.0m tok/s: 8302221 +616/20000 train_loss: 2.7714 train_time: 1.0m tok/s: 8302063 +617/20000 train_loss: 2.7297 train_time: 1.0m tok/s: 8301959 +618/20000 train_loss: 2.7396 train_time: 1.0m tok/s: 8301972 +619/20000 train_loss: 2.7908 train_time: 1.0m tok/s: 8301747 +620/20000 train_loss: 2.8857 train_time: 1.0m tok/s: 8301952 +621/20000 train_loss: 2.6886 train_time: 1.0m tok/s: 8302000 +622/20000 train_loss: 2.7241 train_time: 1.0m tok/s: 8301980 +623/20000 train_loss: 2.7303 train_time: 1.0m tok/s: 8301922 +624/20000 train_loss: 2.4400 train_time: 1.0m tok/s: 8301763 +625/20000 train_loss: 2.7518 train_time: 1.0m tok/s: 8301811 +626/20000 train_loss: 2.8937 train_time: 1.0m tok/s: 8301753 +627/20000 train_loss: 2.6930 train_time: 1.0m tok/s: 8301578 +628/20000 train_loss: 2.8563 train_time: 1.0m tok/s: 8301499 +629/20000 train_loss: 2.8459 train_time: 1.0m tok/s: 8301471 +630/20000 train_loss: 2.6994 train_time: 1.0m tok/s: 8301487 +631/20000 train_loss: 2.8224 train_time: 1.0m tok/s: 8301441 +632/20000 train_loss: 2.8377 train_time: 1.0m tok/s: 8301459 +633/20000 train_loss: 2.7188 train_time: 1.0m tok/s: 8301591 +634/20000 train_loss: 2.9365 train_time: 1.0m tok/s: 8301696 +635/20000 train_loss: 2.7398 train_time: 1.0m tok/s: 8301572 +636/20000 train_loss: 2.8711 
train_time: 1.0m tok/s: 8301419 +637/20000 train_loss: 2.7615 train_time: 1.0m tok/s: 8301402 +638/20000 train_loss: 2.5716 train_time: 1.0m tok/s: 8301358 +639/20000 train_loss: 2.7245 train_time: 1.0m tok/s: 8301265 +640/20000 train_loss: 2.7088 train_time: 1.0m tok/s: 8301065 +641/20000 train_loss: 2.7861 train_time: 1.0m tok/s: 8301032 +642/20000 train_loss: 2.7937 train_time: 1.0m tok/s: 8300986 +643/20000 train_loss: 2.7564 train_time: 1.0m tok/s: 8301067 +644/20000 train_loss: 2.7995 train_time: 1.0m tok/s: 8301135 +645/20000 train_loss: 2.8742 train_time: 1.0m tok/s: 8301249 +646/20000 train_loss: 2.7762 train_time: 1.0m tok/s: 8301345 +647/20000 train_loss: 2.8247 train_time: 1.0m tok/s: 8301089 +648/20000 train_loss: 2.7391 train_time: 1.0m tok/s: 8300882 +649/20000 train_loss: 2.8735 train_time: 1.0m tok/s: 8301025 +650/20000 train_loss: 2.7573 train_time: 1.0m tok/s: 8301094 +651/20000 train_loss: 2.7429 train_time: 1.0m tok/s: 8301007 +652/20000 train_loss: 2.6974 train_time: 1.0m tok/s: 8300953 +653/20000 train_loss: 2.6555 train_time: 1.0m tok/s: 8300952 +654/20000 train_loss: 2.7097 train_time: 1.0m tok/s: 8300979 +655/20000 train_loss: 2.6993 train_time: 1.0m tok/s: 8300733 +656/20000 train_loss: 2.6450 train_time: 1.0m tok/s: 8300553 +657/20000 train_loss: 2.6565 train_time: 1.0m tok/s: 8300554 +658/20000 train_loss: 2.6944 train_time: 1.0m tok/s: 8300561 +659/20000 train_loss: 2.7425 train_time: 1.0m tok/s: 8300678 +660/20000 train_loss: 2.7526 train_time: 1.0m tok/s: 8300578 +661/20000 train_loss: 2.7961 train_time: 1.0m tok/s: 8300656 +662/20000 train_loss: 2.6871 train_time: 1.0m tok/s: 8300712 +663/20000 train_loss: 2.7855 train_time: 1.0m tok/s: 8300710 +664/20000 train_loss: 2.7906 train_time: 1.0m tok/s: 8300368 +665/20000 train_loss: 2.8353 train_time: 1.1m tok/s: 8300204 +666/20000 train_loss: 2.8224 train_time: 1.1m tok/s: 8300259 +667/20000 train_loss: 2.7619 train_time: 1.1m tok/s: 8300003 +668/20000 train_loss: 2.7443 train_time: 
1.1m tok/s: 8300171 +669/20000 train_loss: 2.6291 train_time: 1.1m tok/s: 8300104 +670/20000 train_loss: 2.6398 train_time: 1.1m tok/s: 8300126 +671/20000 train_loss: 2.6513 train_time: 1.1m tok/s: 8300095 +672/20000 train_loss: 2.7804 train_time: 1.1m tok/s: 8300019 +673/20000 train_loss: 2.6199 train_time: 1.1m tok/s: 8300003 +674/20000 train_loss: 2.8544 train_time: 1.1m tok/s: 8299919 +675/20000 train_loss: 2.6126 train_time: 1.1m tok/s: 8299743 +676/20000 train_loss: 2.8452 train_time: 1.1m tok/s: 8299680 +677/20000 train_loss: 2.6643 train_time: 1.1m tok/s: 8299801 +678/20000 train_loss: 2.7601 train_time: 1.1m tok/s: 8299801 +679/20000 train_loss: 2.6939 train_time: 1.1m tok/s: 8299822 +680/20000 train_loss: 2.9021 train_time: 1.1m tok/s: 8299697 +681/20000 train_loss: 2.7749 train_time: 1.1m tok/s: 8299658 +682/20000 train_loss: 2.8693 train_time: 1.1m tok/s: 8299538 +683/20000 train_loss: 2.8522 train_time: 1.1m tok/s: 8299474 +684/20000 train_loss: 2.7840 train_time: 1.1m tok/s: 8299431 +685/20000 train_loss: 2.6533 train_time: 1.1m tok/s: 8299422 +686/20000 train_loss: 2.8806 train_time: 1.1m tok/s: 8299320 +687/20000 train_loss: 2.7717 train_time: 1.1m tok/s: 8299462 +688/20000 train_loss: 2.7766 train_time: 1.1m tok/s: 8298888 +689/20000 train_loss: 2.8100 train_time: 1.1m tok/s: 8299109 +690/20000 train_loss: 2.7376 train_time: 1.1m tok/s: 8299038 +691/20000 train_loss: 2.8502 train_time: 1.1m tok/s: 8299063 +692/20000 train_loss: 2.9082 train_time: 1.1m tok/s: 8299104 +693/20000 train_loss: 2.8087 train_time: 1.1m tok/s: 8298815 +694/20000 train_loss: 2.8139 train_time: 1.1m tok/s: 8298749 +695/20000 train_loss: 2.7989 train_time: 1.1m tok/s: 8298795 +696/20000 train_loss: 2.8001 train_time: 1.1m tok/s: 8298833 +697/20000 train_loss: 2.6668 train_time: 1.1m tok/s: 8298802 +698/20000 train_loss: 2.8350 train_time: 1.1m tok/s: 8298736 +699/20000 train_loss: 2.6846 train_time: 1.1m tok/s: 8298721 +700/20000 train_loss: 2.6287 train_time: 1.1m tok/s: 
8298823 +701/20000 train_loss: 2.6323 train_time: 1.1m tok/s: 8298815 +702/20000 train_loss: 2.6233 train_time: 1.1m tok/s: 8298707 +703/20000 train_loss: 2.5130 train_time: 1.1m tok/s: 8298596 +704/20000 train_loss: 2.8452 train_time: 1.1m tok/s: 8298578 +705/20000 train_loss: 2.7733 train_time: 1.1m tok/s: 8298290 +706/20000 train_loss: 2.7536 train_time: 1.1m tok/s: 8298109 +707/20000 train_loss: 2.7517 train_time: 1.1m tok/s: 8298178 +708/20000 train_loss: 2.8362 train_time: 1.1m tok/s: 8297974 +709/20000 train_loss: 2.8019 train_time: 1.1m tok/s: 8298228 +710/20000 train_loss: 2.6346 train_time: 1.1m tok/s: 8298160 +711/20000 train_loss: 2.7097 train_time: 1.1m tok/s: 8298146 +712/20000 train_loss: 2.6394 train_time: 1.1m tok/s: 8298161 +713/20000 train_loss: 2.6951 train_time: 1.1m tok/s: 8298121 +714/20000 train_loss: 2.7545 train_time: 1.1m tok/s: 8298128 +715/20000 train_loss: 2.7078 train_time: 1.1m tok/s: 8298108 +716/20000 train_loss: 2.7246 train_time: 1.1m tok/s: 8298145 +717/20000 train_loss: 2.9165 train_time: 1.1m tok/s: 8297654 +718/20000 train_loss: 2.8169 train_time: 1.1m tok/s: 8297984 +719/20000 train_loss: 2.7478 train_time: 1.1m tok/s: 8298118 +720/20000 train_loss: 2.6921 train_time: 1.1m tok/s: 8298092 +721/20000 train_loss: 2.8326 train_time: 1.1m tok/s: 8297968 +722/20000 train_loss: 2.6637 train_time: 1.1m tok/s: 8297759 +723/20000 train_loss: 2.8672 train_time: 1.1m tok/s: 8297761 +724/20000 train_loss: 2.7640 train_time: 1.1m tok/s: 8297829 +725/20000 train_loss: 2.6268 train_time: 1.1m tok/s: 8297886 +726/20000 train_loss: 2.7807 train_time: 1.1m tok/s: 8297955 +727/20000 train_loss: 2.6035 train_time: 1.1m tok/s: 8297960 +728/20000 train_loss: 2.7978 train_time: 1.1m tok/s: 8297941 +729/20000 train_loss: 2.8350 train_time: 1.2m tok/s: 8297817 +730/20000 train_loss: 2.7661 train_time: 1.2m tok/s: 8297810 +731/20000 train_loss: 2.8660 train_time: 1.2m tok/s: 8297866 +732/20000 train_loss: 2.7096 train_time: 1.2m tok/s: 8297875 
+733/20000 train_loss: 2.8850 train_time: 1.2m tok/s: 8297671 +734/20000 train_loss: 2.7393 train_time: 1.2m tok/s: 8297484 +735/20000 train_loss: 2.7917 train_time: 1.2m tok/s: 8297398 +736/20000 train_loss: 2.6670 train_time: 1.2m tok/s: 8297406 +737/20000 train_loss: 2.8021 train_time: 1.2m tok/s: 8297345 +738/20000 train_loss: 2.6717 train_time: 1.2m tok/s: 8297280 +739/20000 train_loss: 2.5957 train_time: 1.2m tok/s: 8297372 +740/20000 train_loss: 2.8301 train_time: 1.2m tok/s: 8297033 +741/20000 train_loss: 2.8245 train_time: 1.2m tok/s: 8297366 +742/20000 train_loss: 2.6877 train_time: 1.2m tok/s: 8297417 +743/20000 train_loss: 2.8414 train_time: 1.2m tok/s: 8297476 +744/20000 train_loss: 2.7248 train_time: 1.2m tok/s: 8297446 +745/20000 train_loss: 2.7384 train_time: 1.2m tok/s: 8297403 +746/20000 train_loss: 2.8204 train_time: 1.2m tok/s: 8297465 +747/20000 train_loss: 2.7030 train_time: 1.2m tok/s: 8297495 +748/20000 train_loss: 2.7486 train_time: 1.2m tok/s: 8297484 +749/20000 train_loss: 2.8073 train_time: 1.2m tok/s: 8297444 +750/20000 train_loss: 2.8176 train_time: 1.2m tok/s: 8297397 +751/20000 train_loss: 2.6822 train_time: 1.2m tok/s: 8297350 +752/20000 train_loss: 2.7749 train_time: 1.2m tok/s: 8297248 +753/20000 train_loss: 2.4224 train_time: 1.2m tok/s: 8297080 +754/20000 train_loss: 2.6687 train_time: 1.2m tok/s: 8296968 +755/20000 train_loss: 2.8643 train_time: 1.2m tok/s: 8297051 +756/20000 train_loss: 3.1201 train_time: 1.2m tok/s: 8297052 +757/20000 train_loss: 2.7877 train_time: 1.2m tok/s: 8296978 +758/20000 train_loss: 2.7214 train_time: 1.2m tok/s: 8296942 +759/20000 train_loss: 2.6822 train_time: 1.2m tok/s: 8296550 +760/20000 train_loss: 2.8602 train_time: 1.2m tok/s: 8296901 +761/20000 train_loss: 2.7322 train_time: 1.2m tok/s: 8296956 +762/20000 train_loss: 2.8311 train_time: 1.2m tok/s: 8297097 +763/20000 train_loss: 2.6589 train_time: 1.2m tok/s: 8297098 +764/20000 train_loss: 2.7034 train_time: 1.2m tok/s: 8296931 +765/20000 
train_loss: 2.6744 train_time: 1.2m tok/s: 8296835 +766/20000 train_loss: 2.6684 train_time: 1.2m tok/s: 8296745 +767/20000 train_loss: 2.6903 train_time: 1.2m tok/s: 8296716 +768/20000 train_loss: 2.7369 train_time: 1.2m tok/s: 8296698 +769/20000 train_loss: 2.7653 train_time: 1.2m tok/s: 8296782 +770/20000 train_loss: 2.7693 train_time: 1.2m tok/s: 8296763 +771/20000 train_loss: 2.7840 train_time: 1.2m tok/s: 8296776 +772/20000 train_loss: 2.7676 train_time: 1.2m tok/s: 8296820 +773/20000 train_loss: 2.7031 train_time: 1.2m tok/s: 8296771 +774/20000 train_loss: 2.8448 train_time: 1.2m tok/s: 8296814 +775/20000 train_loss: 2.8117 train_time: 1.2m tok/s: 8296910 +776/20000 train_loss: 2.8967 train_time: 1.2m tok/s: 8296747 +777/20000 train_loss: 2.9946 train_time: 1.2m tok/s: 8296413 +778/20000 train_loss: 2.7078 train_time: 1.2m tok/s: 8296257 +779/20000 train_loss: 2.4482 train_time: 1.2m tok/s: 8296236 +780/20000 train_loss: 2.7765 train_time: 1.2m tok/s: 8296236 +781/20000 train_loss: 2.7510 train_time: 1.2m tok/s: 8296253 +782/20000 train_loss: 3.0301 train_time: 1.2m tok/s: 8296167 +783/20000 train_loss: 2.5320 train_time: 1.2m tok/s: 8295981 +784/20000 train_loss: 2.9025 train_time: 1.2m tok/s: 8296022 +785/20000 train_loss: 2.8582 train_time: 1.2m tok/s: 8296108 +786/20000 train_loss: 2.7230 train_time: 1.2m tok/s: 8296206 +787/20000 train_loss: 2.6346 train_time: 1.2m tok/s: 8296130 +788/20000 train_loss: 2.6754 train_time: 1.2m tok/s: 8296160 +789/20000 train_loss: 2.7905 train_time: 1.2m tok/s: 8296035 +790/20000 train_loss: 2.6396 train_time: 1.2m tok/s: 8295904 +791/20000 train_loss: 2.5933 train_time: 1.2m tok/s: 8295826 +792/20000 train_loss: 2.7325 train_time: 1.3m tok/s: 8295828 +793/20000 train_loss: 2.7115 train_time: 1.3m tok/s: 8295855 +794/20000 train_loss: 2.7200 train_time: 1.3m tok/s: 8295619 +795/20000 train_loss: 2.8574 train_time: 1.3m tok/s: 8295687 +796/20000 train_loss: 2.7175 train_time: 1.3m tok/s: 8295844 +797/20000 train_loss: 
2.7364 train_time: 1.3m tok/s: 8295892 +798/20000 train_loss: 2.7487 train_time: 1.3m tok/s: 8295843 +799/20000 train_loss: 2.7865 train_time: 1.3m tok/s: 8295831 +800/20000 train_loss: 2.7143 train_time: 1.3m tok/s: 8295899 +801/20000 train_loss: 2.7490 train_time: 1.3m tok/s: 8295931 +802/20000 train_loss: 2.8162 train_time: 1.3m tok/s: 8295889 +803/20000 train_loss: 2.6884 train_time: 1.3m tok/s: 8295866 +804/20000 train_loss: 2.6690 train_time: 1.3m tok/s: 8295916 +805/20000 train_loss: 2.6819 train_time: 1.3m tok/s: 8295826 +806/20000 train_loss: 2.7969 train_time: 1.3m tok/s: 8295833 +807/20000 train_loss: 2.7816 train_time: 1.3m tok/s: 8295869 +808/20000 train_loss: 2.8153 train_time: 1.3m tok/s: 8295927 +809/20000 train_loss: 2.6283 train_time: 1.3m tok/s: 8295864 +810/20000 train_loss: 2.8478 train_time: 1.3m tok/s: 8295825 +811/20000 train_loss: 2.8485 train_time: 1.3m tok/s: 8295917 +812/20000 train_loss: 2.6848 train_time: 1.3m tok/s: 8295932 +813/20000 train_loss: 2.7511 train_time: 1.3m tok/s: 8295857 +814/20000 train_loss: 2.7852 train_time: 1.3m tok/s: 8295900 +815/20000 train_loss: 2.8946 train_time: 1.3m tok/s: 8295925 +816/20000 train_loss: 2.7158 train_time: 1.3m tok/s: 8296007 +817/20000 train_loss: 2.6982 train_time: 1.3m tok/s: 8296034 +818/20000 train_loss: 2.7748 train_time: 1.3m tok/s: 8296069 +819/20000 train_loss: 2.7918 train_time: 1.3m tok/s: 8295973 +820/20000 train_loss: 3.0389 train_time: 1.3m tok/s: 8295840 +821/20000 train_loss: 2.7877 train_time: 1.3m tok/s: 8295738 +822/20000 train_loss: 2.6005 train_time: 1.3m tok/s: 8295699 +823/20000 train_loss: 2.6417 train_time: 1.3m tok/s: 8295684 +824/20000 train_loss: 2.8072 train_time: 1.3m tok/s: 8295655 +825/20000 train_loss: 2.8816 train_time: 1.3m tok/s: 8295521 +826/20000 train_loss: 2.8607 train_time: 1.3m tok/s: 8295468 +827/20000 train_loss: 2.6405 train_time: 1.3m tok/s: 8295505 +828/20000 train_loss: 2.7220 train_time: 1.3m tok/s: 8295578 +829/20000 train_loss: 3.3518 
train_time: 1.3m tok/s: 8295512 +830/20000 train_loss: 2.7548 train_time: 1.3m tok/s: 8295351 +831/20000 train_loss: 2.7382 train_time: 1.3m tok/s: 8295235 +832/20000 train_loss: 2.7740 train_time: 1.3m tok/s: 8295255 +833/20000 train_loss: 2.8629 train_time: 1.3m tok/s: 8295230 +834/20000 train_loss: 2.6932 train_time: 1.3m tok/s: 8295295 +835/20000 train_loss: 2.7784 train_time: 1.3m tok/s: 8295300 +836/20000 train_loss: 2.6204 train_time: 1.3m tok/s: 8295219 +837/20000 train_loss: 2.5056 train_time: 1.3m tok/s: 8295060 +838/20000 train_loss: 2.6266 train_time: 1.3m tok/s: 8294917 +839/20000 train_loss: 2.7218 train_time: 1.3m tok/s: 8294850 +840/20000 train_loss: 3.1152 train_time: 1.3m tok/s: 8294798 +841/20000 train_loss: 2.7132 train_time: 1.3m tok/s: 8294749 +842/20000 train_loss: 2.7220 train_time: 1.3m tok/s: 8294657 +843/20000 train_loss: 2.6449 train_time: 1.3m tok/s: 8294535 +844/20000 train_loss: 2.7341 train_time: 1.3m tok/s: 8294699 +845/20000 train_loss: 2.6809 train_time: 1.3m tok/s: 8294712 +846/20000 train_loss: 2.6698 train_time: 1.3m tok/s: 8294630 +847/20000 train_loss: 2.7196 train_time: 1.3m tok/s: 8294593 +848/20000 train_loss: 2.6391 train_time: 1.3m tok/s: 8294627 +849/20000 train_loss: 2.7447 train_time: 1.3m tok/s: 8294640 +850/20000 train_loss: 2.5853 train_time: 1.3m tok/s: 8294669 +851/20000 train_loss: 2.7553 train_time: 1.3m tok/s: 8294660 +852/20000 train_loss: 2.5657 train_time: 1.3m tok/s: 8294640 +853/20000 train_loss: 2.7151 train_time: 1.3m tok/s: 8294619 +854/20000 train_loss: 2.7049 train_time: 1.3m tok/s: 8294325 +855/20000 train_loss: 2.7534 train_time: 1.4m tok/s: 8294382 +856/20000 train_loss: 2.6900 train_time: 1.4m tok/s: 8294555 +857/20000 train_loss: 2.8341 train_time: 1.4m tok/s: 8294564 +858/20000 train_loss: 2.8270 train_time: 1.4m tok/s: 8294589 +859/20000 train_loss: 2.7284 train_time: 1.4m tok/s: 8294531 +860/20000 train_loss: 2.6671 train_time: 1.4m tok/s: 8294463 +861/20000 train_loss: 2.7085 train_time: 
1.4m tok/s: 8294459 +862/20000 train_loss: 2.6575 train_time: 1.4m tok/s: 8294320 +863/20000 train_loss: 2.9023 train_time: 1.4m tok/s: 8294206 +864/20000 train_loss: 2.7471 train_time: 1.4m tok/s: 8294141 +865/20000 train_loss: 2.7772 train_time: 1.4m tok/s: 8294100 +866/20000 train_loss: 2.6312 train_time: 1.4m tok/s: 8294046 +867/20000 train_loss: 2.6686 train_time: 1.4m tok/s: 8294003 +868/20000 train_loss: 2.6776 train_time: 1.4m tok/s: 8293969 +869/20000 train_loss: 2.6940 train_time: 1.4m tok/s: 8294070 +870/20000 train_loss: 2.6665 train_time: 1.4m tok/s: 8294034 +871/20000 train_loss: 2.6537 train_time: 1.4m tok/s: 8293954 +872/20000 train_loss: 2.7489 train_time: 1.4m tok/s: 8294049 +873/20000 train_loss: 2.6562 train_time: 1.4m tok/s: 8294141 +874/20000 train_loss: 2.8032 train_time: 1.4m tok/s: 8294074 +875/20000 train_loss: 2.7548 train_time: 1.4m tok/s: 8294138 +876/20000 train_loss: 2.8111 train_time: 1.4m tok/s: 8294159 +877/20000 train_loss: 2.6979 train_time: 1.4m tok/s: 8294266 +878/20000 train_loss: 2.6905 train_time: 1.4m tok/s: 8294270 +879/20000 train_loss: 2.7355 train_time: 1.4m tok/s: 8294272 +880/20000 train_loss: 2.7430 train_time: 1.4m tok/s: 8294289 +881/20000 train_loss: 2.6663 train_time: 1.4m tok/s: 8294301 +882/20000 train_loss: 2.6867 train_time: 1.4m tok/s: 8294213 +883/20000 train_loss: 2.7811 train_time: 1.4m tok/s: 8294166 +884/20000 train_loss: 2.5199 train_time: 1.4m tok/s: 8294176 +885/20000 train_loss: 2.6323 train_time: 1.4m tok/s: 8294245 +886/20000 train_loss: 2.7023 train_time: 1.4m tok/s: 8294219 +887/20000 train_loss: 2.6647 train_time: 1.4m tok/s: 8294228 +888/20000 train_loss: 2.6402 train_time: 1.4m tok/s: 8294238 +889/20000 train_loss: 2.8013 train_time: 1.4m tok/s: 8294204 +890/20000 train_loss: 2.6004 train_time: 1.4m tok/s: 8294232 +891/20000 train_loss: 2.6834 train_time: 1.4m tok/s: 8294163 +892/20000 train_loss: 2.7063 train_time: 1.4m tok/s: 8294273 +893/20000 train_loss: 2.6458 train_time: 1.4m tok/s: 
8294263 +894/20000 train_loss: 2.7050 train_time: 1.4m tok/s: 8294282 +895/20000 train_loss: 2.7347 train_time: 1.4m tok/s: 8294310 +896/20000 train_loss: 2.7970 train_time: 1.4m tok/s: 8294213 +897/20000 train_loss: 2.7144 train_time: 1.4m tok/s: 8294095 +898/20000 train_loss: 2.6857 train_time: 1.4m tok/s: 8294077 +899/20000 train_loss: 2.6688 train_time: 1.4m tok/s: 8294201 +900/20000 train_loss: 2.7170 train_time: 1.4m tok/s: 8293859 +901/20000 train_loss: 2.6337 train_time: 1.4m tok/s: 8294138 +902/20000 train_loss: 2.6247 train_time: 1.4m tok/s: 8294077 +903/20000 train_loss: 2.5815 train_time: 1.4m tok/s: 8294059 +904/20000 train_loss: 2.5665 train_time: 1.4m tok/s: 8294108 +905/20000 train_loss: 2.6955 train_time: 1.4m tok/s: 8294111 +906/20000 train_loss: 2.7737 train_time: 1.4m tok/s: 8294160 +907/20000 train_loss: 2.7549 train_time: 1.4m tok/s: 8294211 +908/20000 train_loss: 2.8321 train_time: 1.4m tok/s: 8294277 +909/20000 train_loss: 2.7711 train_time: 1.4m tok/s: 8294258 +910/20000 train_loss: 2.8257 train_time: 1.4m tok/s: 8294225 +911/20000 train_loss: 2.7224 train_time: 1.4m tok/s: 8294324 +912/20000 train_loss: 2.5640 train_time: 1.4m tok/s: 8293946 +913/20000 train_loss: 2.7284 train_time: 1.4m tok/s: 8293727 +914/20000 train_loss: 2.8046 train_time: 1.4m tok/s: 8293777 +915/20000 train_loss: 2.7424 train_time: 1.4m tok/s: 8293849 +916/20000 train_loss: 2.7503 train_time: 1.4m tok/s: 8293863 +917/20000 train_loss: 2.6817 train_time: 1.4m tok/s: 8293756 +918/20000 train_loss: 2.5597 train_time: 1.5m tok/s: 8293720 +919/20000 train_loss: 2.6169 train_time: 1.5m tok/s: 8293732 +920/20000 train_loss: 2.6384 train_time: 1.5m tok/s: 8293811 +921/20000 train_loss: 2.5038 train_time: 1.5m tok/s: 8293780 +922/20000 train_loss: 2.7040 train_time: 1.5m tok/s: 8293717 +923/20000 train_loss: 2.6313 train_time: 1.5m tok/s: 8293657 +924/20000 train_loss: 2.6171 train_time: 1.5m tok/s: 8293629 +925/20000 train_loss: 2.9362 train_time: 1.5m tok/s: 8293568 
+926/20000 train_loss: 2.5529 train_time: 1.5m tok/s: 8293459 +927/20000 train_loss: 2.7464 train_time: 1.5m tok/s: 8293388 +928/20000 train_loss: 2.7986 train_time: 1.5m tok/s: 8293474 +929/20000 train_loss: 2.7133 train_time: 1.5m tok/s: 8293536 +930/20000 train_loss: 2.8809 train_time: 1.5m tok/s: 8293516 +931/20000 train_loss: 2.7619 train_time: 1.5m tok/s: 8293458 +932/20000 train_loss: 2.6941 train_time: 1.5m tok/s: 8293499 +933/20000 train_loss: 2.7061 train_time: 1.5m tok/s: 8293477 +934/20000 train_loss: 2.7147 train_time: 1.5m tok/s: 8293372 +935/20000 train_loss: 2.7950 train_time: 1.5m tok/s: 8293366 +936/20000 train_loss: 2.6119 train_time: 1.5m tok/s: 8293323 +937/20000 train_loss: 2.7236 train_time: 1.5m tok/s: 8293278 +938/20000 train_loss: 2.5030 train_time: 1.5m tok/s: 8293148 +939/20000 train_loss: 2.5132 train_time: 1.5m tok/s: 8292947 +940/20000 train_loss: 2.7668 train_time: 1.5m tok/s: 8292859 +941/20000 train_loss: 2.8628 train_time: 1.5m tok/s: 8292829 +942/20000 train_loss: 2.6959 train_time: 1.5m tok/s: 8292809 +943/20000 train_loss: 2.7055 train_time: 1.5m tok/s: 8292878 +944/20000 train_loss: 2.7998 train_time: 1.5m tok/s: 8292902 +945/20000 train_loss: 2.7235 train_time: 1.5m tok/s: 8292984 +946/20000 train_loss: 2.6258 train_time: 1.5m tok/s: 8292691 +947/20000 train_loss: 2.7945 train_time: 1.5m tok/s: 8292612 +948/20000 train_loss: 2.7209 train_time: 1.5m tok/s: 8292667 +949/20000 train_loss: 2.7109 train_time: 1.5m tok/s: 8292829 +950/20000 train_loss: 2.7215 train_time: 1.5m tok/s: 8292760 +951/20000 train_loss: 2.7858 train_time: 1.5m tok/s: 8292784 +952/20000 train_loss: 2.5494 train_time: 1.5m tok/s: 8292746 +953/20000 train_loss: 2.6845 train_time: 1.5m tok/s: 8292806 +954/20000 train_loss: 2.6898 train_time: 1.5m tok/s: 8292593 +955/20000 train_loss: 2.7997 train_time: 1.5m tok/s: 8292541 +956/20000 train_loss: 2.8012 train_time: 1.5m tok/s: 8292501 +957/20000 train_loss: 2.6737 train_time: 1.5m tok/s: 8292406 +958/20000 
train_loss: 2.7615 train_time: 1.5m tok/s: 8292336 +959/20000 train_loss: 2.7770 train_time: 1.5m tok/s: 8292410 +960/20000 train_loss: 2.9850 train_time: 1.5m tok/s: 8292405 +961/20000 train_loss: 2.7904 train_time: 1.5m tok/s: 8292446 +962/20000 train_loss: 2.6981 train_time: 1.5m tok/s: 8292353 +963/20000 train_loss: 2.7506 train_time: 1.5m tok/s: 8292304 +964/20000 train_loss: 2.6572 train_time: 1.5m tok/s: 8292330 +965/20000 train_loss: 2.7795 train_time: 1.5m tok/s: 8292344 +966/20000 train_loss: 2.7727 train_time: 1.5m tok/s: 8292365 +967/20000 train_loss: 2.5718 train_time: 1.5m tok/s: 8292321 +968/20000 train_loss: 2.6727 train_time: 1.5m tok/s: 8292294 +969/20000 train_loss: 2.7023 train_time: 1.5m tok/s: 8292268 +970/20000 train_loss: 2.7033 train_time: 1.5m tok/s: 8292113 +971/20000 train_loss: 2.5500 train_time: 1.5m tok/s: 8291911 +972/20000 train_loss: 2.4939 train_time: 1.5m tok/s: 8291777 +973/20000 train_loss: 2.5627 train_time: 1.5m tok/s: 8291599 +974/20000 train_loss: 2.6866 train_time: 1.5m tok/s: 8291524 +975/20000 train_loss: 2.7491 train_time: 1.5m tok/s: 8291569 +976/20000 train_loss: 2.6846 train_time: 1.5m tok/s: 8291484 +977/20000 train_loss: 2.7308 train_time: 1.5m tok/s: 8291428 +978/20000 train_loss: 2.7629 train_time: 1.5m tok/s: 8291452 +979/20000 train_loss: 2.6029 train_time: 1.5m tok/s: 8291471 +980/20000 train_loss: 2.6519 train_time: 1.5m tok/s: 8291146 +981/20000 train_loss: 2.6397 train_time: 1.6m tok/s: 8291244 +982/20000 train_loss: 2.6290 train_time: 1.6m tok/s: 8291332 +983/20000 train_loss: 2.7525 train_time: 1.6m tok/s: 8291353 +984/20000 train_loss: 2.6464 train_time: 1.6m tok/s: 8291286 +985/20000 train_loss: 2.7633 train_time: 1.6m tok/s: 8291261 +986/20000 train_loss: 2.6870 train_time: 1.6m tok/s: 8291239 +987/20000 train_loss: 2.6757 train_time: 1.6m tok/s: 8291342 +988/20000 train_loss: 2.5424 train_time: 1.6m tok/s: 8291255 +989/20000 train_loss: 2.6684 train_time: 1.6m tok/s: 8291281 +990/20000 train_loss: 
2.6639 train_time: 1.6m tok/s: 8291052 +991/20000 train_loss: 2.8053 train_time: 1.6m tok/s: 8291210 +992/20000 train_loss: 2.6279 train_time: 1.6m tok/s: 8291233 +993/20000 train_loss: 2.5743 train_time: 1.6m tok/s: 8291291 +994/20000 train_loss: 2.7147 train_time: 1.6m tok/s: 8291229 +995/20000 train_loss: 2.8797 train_time: 1.6m tok/s: 8291320 +996/20000 train_loss: 2.8171 train_time: 1.6m tok/s: 8291293 +997/20000 train_loss: 2.7741 train_time: 1.6m tok/s: 8291424 +998/20000 train_loss: 2.6603 train_time: 1.6m tok/s: 8291420 +999/20000 train_loss: 2.7502 train_time: 1.6m tok/s: 8291494 +1000/20000 train_loss: 2.7887 train_time: 1.6m tok/s: 8291508 +1001/20000 train_loss: 2.6736 train_time: 1.6m tok/s: 8291522 +1002/20000 train_loss: 2.7246 train_time: 1.6m tok/s: 8291458 +1003/20000 train_loss: 2.6443 train_time: 1.6m tok/s: 8291388 +1004/20000 train_loss: 2.6519 train_time: 1.6m tok/s: 8291275 +1005/20000 train_loss: 2.6357 train_time: 1.6m tok/s: 8291363 +1006/20000 train_loss: 2.7104 train_time: 1.6m tok/s: 8291370 +1007/20000 train_loss: 2.5870 train_time: 1.6m tok/s: 8291414 +1008/20000 train_loss: 2.5316 train_time: 1.6m tok/s: 8291312 +1009/20000 train_loss: 2.6862 train_time: 1.6m tok/s: 8291242 +1010/20000 train_loss: 2.8172 train_time: 1.6m tok/s: 8291344 +1011/20000 train_loss: 2.8028 train_time: 1.6m tok/s: 8291345 +1012/20000 train_loss: 2.4291 train_time: 1.6m tok/s: 8291269 +1013/20000 train_loss: 2.6167 train_time: 1.6m tok/s: 8291100 +1014/20000 train_loss: 2.7281 train_time: 1.6m tok/s: 8291092 +1015/20000 train_loss: 2.7571 train_time: 1.6m tok/s: 8291071 +1016/20000 train_loss: 2.5613 train_time: 1.6m tok/s: 8290947 +1017/20000 train_loss: 2.7099 train_time: 1.6m tok/s: 8290992 +1018/20000 train_loss: 2.7847 train_time: 1.6m tok/s: 8290826 +1019/20000 train_loss: 2.6675 train_time: 1.6m tok/s: 8290782 +1020/20000 train_loss: 2.6883 train_time: 1.6m tok/s: 8290721 +1021/20000 train_loss: 2.6644 train_time: 1.6m tok/s: 8290592 +1022/20000 
train_loss: 2.7582 train_time: 1.6m tok/s: 8290652 +1023/20000 train_loss: 2.6656 train_time: 1.6m tok/s: 8290686 +1024/20000 train_loss: 2.6840 train_time: 1.6m tok/s: 8290760 +1025/20000 train_loss: 2.7606 train_time: 1.6m tok/s: 8290774 +1026/20000 train_loss: 3.3080 train_time: 1.6m tok/s: 8290568 +1027/20000 train_loss: 2.5470 train_time: 1.6m tok/s: 8290453 +1028/20000 train_loss: 2.6138 train_time: 1.6m tok/s: 8290533 +1029/20000 train_loss: 2.6961 train_time: 1.6m tok/s: 8290397 +1030/20000 train_loss: 2.5558 train_time: 1.6m tok/s: 8290363 +1031/20000 train_loss: 2.6123 train_time: 1.6m tok/s: 8290232 +1032/20000 train_loss: 2.7306 train_time: 1.6m tok/s: 8290265 +1033/20000 train_loss: 2.8480 train_time: 1.6m tok/s: 8290247 +1034/20000 train_loss: 2.6284 train_time: 1.6m tok/s: 8290225 +1035/20000 train_loss: 2.7264 train_time: 1.6m tok/s: 8290290 +1036/20000 train_loss: 2.6793 train_time: 1.6m tok/s: 8290244 +1037/20000 train_loss: 2.7000 train_time: 1.6m tok/s: 8290273 +1038/20000 train_loss: 2.4940 train_time: 1.6m tok/s: 8290226 +1039/20000 train_loss: 2.6972 train_time: 1.6m tok/s: 8290262 +1040/20000 train_loss: 2.6464 train_time: 1.6m tok/s: 8290341 +1041/20000 train_loss: 2.6616 train_time: 1.6m tok/s: 8290270 +1042/20000 train_loss: 2.6669 train_time: 1.6m tok/s: 8290163 +1043/20000 train_loss: 2.7216 train_time: 1.6m tok/s: 8290293 +1044/20000 train_loss: 2.6650 train_time: 1.7m tok/s: 8290159 +1045/20000 train_loss: 2.7900 train_time: 1.7m tok/s: 8290096 +1046/20000 train_loss: 2.7329 train_time: 1.7m tok/s: 8290060 +1047/20000 train_loss: 2.6961 train_time: 1.7m tok/s: 8290131 +1048/20000 train_loss: 2.6000 train_time: 1.7m tok/s: 8290125 +1049/20000 train_loss: 2.7256 train_time: 1.7m tok/s: 8290097 +1050/20000 train_loss: 2.8196 train_time: 1.7m tok/s: 8290131 +1051/20000 train_loss: 2.7290 train_time: 1.7m tok/s: 8290208 +1052/20000 train_loss: 2.6132 train_time: 1.7m tok/s: 8290240 +1053/20000 train_loss: 2.6429 train_time: 1.7m tok/s: 
8290126 +1054/20000 train_loss: 2.6169 train_time: 1.7m tok/s: 8290105 +1055/20000 train_loss: 2.5842 train_time: 1.7m tok/s: 8290135 +1056/20000 train_loss: 2.6564 train_time: 1.7m tok/s: 8290119 +1057/20000 train_loss: 2.7149 train_time: 1.7m tok/s: 8290027 +1058/20000 train_loss: 2.5992 train_time: 1.7m tok/s: 8290014 +1059/20000 train_loss: 2.7480 train_time: 1.7m tok/s: 8290023 +1060/20000 train_loss: 2.6713 train_time: 1.7m tok/s: 8289817 +1061/20000 train_loss: 2.7516 train_time: 1.7m tok/s: 8289741 +1062/20000 train_loss: 2.7687 train_time: 1.7m tok/s: 8289763 +1063/20000 train_loss: 2.7642 train_time: 1.7m tok/s: 8289712 +1064/20000 train_loss: 2.6602 train_time: 1.7m tok/s: 8289870 +1065/20000 train_loss: 2.4648 train_time: 1.7m tok/s: 8289766 +1066/20000 train_loss: 2.7878 train_time: 1.7m tok/s: 8289628 +1067/20000 train_loss: 2.7928 train_time: 1.7m tok/s: 8289627 +1068/20000 train_loss: 2.6958 train_time: 1.7m tok/s: 8289681 +1069/20000 train_loss: 2.6422 train_time: 1.7m tok/s: 8289669 +1070/20000 train_loss: 2.5726 train_time: 1.7m tok/s: 8289627 +1071/20000 train_loss: 2.7354 train_time: 1.7m tok/s: 8289619 +1072/20000 train_loss: 2.6651 train_time: 1.7m tok/s: 8289705 +1073/20000 train_loss: 2.6650 train_time: 1.7m tok/s: 8289725 +1074/20000 train_loss: 2.6912 train_time: 1.7m tok/s: 8289780 +1075/20000 train_loss: 2.7155 train_time: 1.7m tok/s: 8289879 +1076/20000 train_loss: 2.7147 train_time: 1.7m tok/s: 8289771 +1077/20000 train_loss: 2.6620 train_time: 1.7m tok/s: 8289621 +1078/20000 train_loss: 2.8109 train_time: 1.7m tok/s: 8289594 +1079/20000 train_loss: 2.6719 train_time: 1.7m tok/s: 8289590 +1080/20000 train_loss: 2.6461 train_time: 1.7m tok/s: 8289724 +1081/20000 train_loss: 2.6727 train_time: 1.7m tok/s: 8289684 +1082/20000 train_loss: 2.6684 train_time: 1.7m tok/s: 8289693 +1083/20000 train_loss: 2.6029 train_time: 1.7m tok/s: 8289681 +1084/20000 train_loss: 2.6838 train_time: 1.7m tok/s: 8289672 +1085/20000 train_loss: 2.6418 
train_time: 1.7m tok/s: 8289674 +1086/20000 train_loss: 2.6871 train_time: 1.7m tok/s: 8289759 +1087/20000 train_loss: 2.6662 train_time: 1.7m tok/s: 8289710 +1088/20000 train_loss: 2.8441 train_time: 1.7m tok/s: 8289590 +1089/20000 train_loss: 2.8072 train_time: 1.7m tok/s: 8289593 +1090/20000 train_loss: 2.6342 train_time: 1.7m tok/s: 8289566 +1091/20000 train_loss: 2.6721 train_time: 1.7m tok/s: 8289587 +1092/20000 train_loss: 2.6997 train_time: 1.7m tok/s: 8289624 +1093/20000 train_loss: 2.7487 train_time: 1.7m tok/s: 8289719 +1094/20000 train_loss: 2.8096 train_time: 1.7m tok/s: 8289708 +1095/20000 train_loss: 2.6185 train_time: 1.7m tok/s: 8289661 +1096/20000 train_loss: 2.5221 train_time: 1.7m tok/s: 8289657 +1097/20000 train_loss: 2.6454 train_time: 1.7m tok/s: 8289504 +1098/20000 train_loss: 2.6637 train_time: 1.7m tok/s: 8289441 +1099/20000 train_loss: 2.5080 train_time: 1.7m tok/s: 8289500 +1100/20000 train_loss: 2.5778 train_time: 1.7m tok/s: 8289498 +1101/20000 train_loss: 2.6555 train_time: 1.7m tok/s: 8289543 +1102/20000 train_loss: 2.6712 train_time: 1.7m tok/s: 8289544 +1103/20000 train_loss: 2.7344 train_time: 1.7m tok/s: 8289574 +1104/20000 train_loss: 2.7002 train_time: 1.7m tok/s: 8289625 +1105/20000 train_loss: 2.7228 train_time: 1.7m tok/s: 8289668 +1106/20000 train_loss: 2.7289 train_time: 1.7m tok/s: 8289781 +1107/20000 train_loss: 2.7410 train_time: 1.8m tok/s: 8289731 +1108/20000 train_loss: 2.6759 train_time: 1.8m tok/s: 8289670 +1109/20000 train_loss: 2.6663 train_time: 1.8m tok/s: 8289382 +1110/20000 train_loss: 2.6663 train_time: 1.8m tok/s: 8289701 +1111/20000 train_loss: 2.6353 train_time: 1.8m tok/s: 8289731 +1112/20000 train_loss: 2.6223 train_time: 1.8m tok/s: 8289713 +1113/20000 train_loss: 2.6511 train_time: 1.8m tok/s: 8289652 +1114/20000 train_loss: 2.8156 train_time: 1.8m tok/s: 8289653 +1115/20000 train_loss: 2.6675 train_time: 1.8m tok/s: 8289673 +1116/20000 train_loss: 2.8600 train_time: 1.8m tok/s: 8289775 +1117/20000 
train_loss: 2.6900 train_time: 1.8m tok/s: 8289647 +1118/20000 train_loss: 2.7256 train_time: 1.8m tok/s: 8289648 +1119/20000 train_loss: 2.7414 train_time: 1.8m tok/s: 8289634 +1120/20000 train_loss: 2.6326 train_time: 1.8m tok/s: 8289637 +1121/20000 train_loss: 2.6419 train_time: 1.8m tok/s: 8289682 +1122/20000 train_loss: 2.7385 train_time: 1.8m tok/s: 8289764 +1123/20000 train_loss: 2.5331 train_time: 1.8m tok/s: 8289828 +1124/20000 train_loss: 2.6846 train_time: 1.8m tok/s: 8289658 +1125/20000 train_loss: 2.5687 train_time: 1.8m tok/s: 8289681 +1126/20000 train_loss: 2.6399 train_time: 1.8m tok/s: 8289764 +1127/20000 train_loss: 2.8796 train_time: 1.8m tok/s: 8289801 +1128/20000 train_loss: 2.8540 train_time: 1.8m tok/s: 8289726 +1129/20000 train_loss: 2.5903 train_time: 1.8m tok/s: 8289740 +1130/20000 train_loss: 2.7668 train_time: 1.8m tok/s: 8289757 +1131/20000 train_loss: 2.7480 train_time: 1.8m tok/s: 8289793 +1132/20000 train_loss: 2.6195 train_time: 1.8m tok/s: 8289858 +1133/20000 train_loss: 2.5727 train_time: 1.8m tok/s: 8289871 +1134/20000 train_loss: 2.7329 train_time: 1.8m tok/s: 8289865 +1135/20000 train_loss: 2.7223 train_time: 1.8m tok/s: 8289870 +1136/20000 train_loss: 2.5657 train_time: 1.8m tok/s: 8289957 +1137/20000 train_loss: 2.6078 train_time: 1.8m tok/s: 8289952 +1138/20000 train_loss: 2.5649 train_time: 1.8m tok/s: 8289990 +1139/20000 train_loss: 2.5529 train_time: 1.8m tok/s: 8290017 +1140/20000 train_loss: 2.6680 train_time: 1.8m tok/s: 8290001 +1141/20000 train_loss: 2.7064 train_time: 1.8m tok/s: 8290041 +1142/20000 train_loss: 2.6940 train_time: 1.8m tok/s: 8290032 +1143/20000 train_loss: 2.7221 train_time: 1.8m tok/s: 8289986 +1144/20000 train_loss: 2.7719 train_time: 1.8m tok/s: 8289990 +1145/20000 train_loss: 2.7275 train_time: 1.8m tok/s: 8289979 +1146/20000 train_loss: 2.5903 train_time: 1.8m tok/s: 8290005 +1147/20000 train_loss: 2.7482 train_time: 1.8m tok/s: 8290072 +1148/20000 train_loss: 2.5592 train_time: 1.8m tok/s: 
8290099 +1149/20000 train_loss: 2.7173 train_time: 1.8m tok/s: 8290033 +1150/20000 train_loss: 2.5930 train_time: 1.8m tok/s: 8290071 +1151/20000 train_loss: 2.5966 train_time: 1.8m tok/s: 8290065 +1152/20000 train_loss: 2.4608 train_time: 1.8m tok/s: 8290067 +1153/20000 train_loss: 2.5986 train_time: 1.8m tok/s: 8290054 +1154/20000 train_loss: 2.7343 train_time: 1.8m tok/s: 8290065 +1155/20000 train_loss: 2.5865 train_time: 1.8m tok/s: 8290044 +1156/20000 train_loss: 2.6965 train_time: 1.8m tok/s: 8290109 +1157/20000 train_loss: 2.6817 train_time: 1.8m tok/s: 8290059 +1158/20000 train_loss: 2.7806 train_time: 1.8m tok/s: 8290023 +1159/20000 train_loss: 2.7105 train_time: 1.8m tok/s: 8290040 +1160/20000 train_loss: 2.6770 train_time: 1.8m tok/s: 8290159 +1161/20000 train_loss: 2.6578 train_time: 1.8m tok/s: 8290157 +1162/20000 train_loss: 2.7018 train_time: 1.8m tok/s: 8290136 +1163/20000 train_loss: 2.6980 train_time: 1.8m tok/s: 8290179 +1164/20000 train_loss: 2.6612 train_time: 1.8m tok/s: 8290143 +1165/20000 train_loss: 2.5280 train_time: 1.8m tok/s: 8290058 +1166/20000 train_loss: 2.7248 train_time: 1.8m tok/s: 8290046 +1167/20000 train_loss: 2.7683 train_time: 1.8m tok/s: 8289960 +1168/20000 train_loss: 2.5654 train_time: 1.8m tok/s: 8289858 +1169/20000 train_loss: 2.6962 train_time: 1.8m tok/s: 8289868 +1170/20000 train_loss: 2.9088 train_time: 1.8m tok/s: 8289935 +1171/20000 train_loss: 2.6582 train_time: 1.9m tok/s: 8290044 +1172/20000 train_loss: 2.7082 train_time: 1.9m tok/s: 8290088 +1173/20000 train_loss: 2.6415 train_time: 1.9m tok/s: 8290117 +1174/20000 train_loss: 2.7030 train_time: 1.9m tok/s: 8290125 +1175/20000 train_loss: 2.6541 train_time: 1.9m tok/s: 8290100 +1176/20000 train_loss: 2.7741 train_time: 1.9m tok/s: 8290095 +1177/20000 train_loss: 2.7789 train_time: 1.9m tok/s: 8290030 +1178/20000 train_loss: 2.6656 train_time: 1.9m tok/s: 8289949 +1179/20000 train_loss: 2.5437 train_time: 1.9m tok/s: 8289875 +1180/20000 train_loss: 2.6035 
train_time: 1.9m tok/s: 8289796 +1181/20000 train_loss: 2.6624 train_time: 1.9m tok/s: 8289827 +1182/20000 train_loss: 2.5818 train_time: 1.9m tok/s: 8289901 +1183/20000 train_loss: 2.7845 train_time: 1.9m tok/s: 8289976 +1184/20000 train_loss: 2.4910 train_time: 1.9m tok/s: 8289866 +1185/20000 train_loss: 2.6695 train_time: 1.9m tok/s: 8289913 +1186/20000 train_loss: 2.6624 train_time: 1.9m tok/s: 8289843 +1187/20000 train_loss: 2.7636 train_time: 1.9m tok/s: 8289719 +1188/20000 train_loss: 2.8720 train_time: 1.9m tok/s: 8289823 +1189/20000 train_loss: 2.6345 train_time: 1.9m tok/s: 8289850 +1190/20000 train_loss: 2.6962 train_time: 1.9m tok/s: 8289876 +1191/20000 train_loss: 2.6439 train_time: 1.9m tok/s: 8289891 +1192/20000 train_loss: 2.6717 train_time: 1.9m tok/s: 8289868 +1193/20000 train_loss: 2.6938 train_time: 1.9m tok/s: 8289860 +1194/20000 train_loss: 2.6980 train_time: 1.9m tok/s: 8289787 +1195/20000 train_loss: 2.6016 train_time: 1.9m tok/s: 8289776 +1196/20000 train_loss: 2.8403 train_time: 1.9m tok/s: 8289754 +1197/20000 train_loss: 2.5704 train_time: 1.9m tok/s: 8289741 +1198/20000 train_loss: 2.6925 train_time: 1.9m tok/s: 8289726 +1199/20000 train_loss: 2.7395 train_time: 1.9m tok/s: 8289835 +1200/20000 train_loss: 2.7285 train_time: 1.9m tok/s: 8289899 +1201/20000 train_loss: 2.7233 train_time: 1.9m tok/s: 8289975 +1202/20000 train_loss: 2.8214 train_time: 1.9m tok/s: 8290034 +1203/20000 train_loss: 2.6659 train_time: 1.9m tok/s: 8290035 +1204/20000 train_loss: 2.7038 train_time: 1.9m tok/s: 8290104 +1205/20000 train_loss: 2.7365 train_time: 1.9m tok/s: 8289984 +1206/20000 train_loss: 2.7635 train_time: 1.9m tok/s: 8289972 +1207/20000 train_loss: 2.5724 train_time: 1.9m tok/s: 8289897 +1208/20000 train_loss: 2.5677 train_time: 1.9m tok/s: 8289917 +1209/20000 train_loss: 2.6944 train_time: 1.9m tok/s: 8289876 +1210/20000 train_loss: 2.6276 train_time: 1.9m tok/s: 8289886 +1211/20000 train_loss: 2.5623 train_time: 1.9m tok/s: 8289870 +1212/20000 
train_loss: 2.5357 train_time: 1.9m tok/s: 8289774 +1213/20000 train_loss: 2.8085 train_time: 1.9m tok/s: 8289689 +1214/20000 train_loss: 2.6244 train_time: 1.9m tok/s: 8289776 +1215/20000 train_loss: 2.7184 train_time: 1.9m tok/s: 8289851 +1216/20000 train_loss: 2.6802 train_time: 1.9m tok/s: 8289944 +1217/20000 train_loss: 2.7604 train_time: 1.9m tok/s: 8289896 +1218/20000 train_loss: 2.6975 train_time: 1.9m tok/s: 8289968 +1219/20000 train_loss: 3.3063 train_time: 1.9m tok/s: 8289990 +1220/20000 train_loss: 2.6071 train_time: 1.9m tok/s: 8289954 +1221/20000 train_loss: 2.7657 train_time: 1.9m tok/s: 8289921 +1222/20000 train_loss: 2.5917 train_time: 1.9m tok/s: 8289942 +1223/20000 train_loss: 2.6984 train_time: 1.9m tok/s: 8289882 +1224/20000 train_loss: 2.7103 train_time: 1.9m tok/s: 8289870 +1225/20000 train_loss: 2.5383 train_time: 1.9m tok/s: 8289856 +1226/20000 train_loss: 2.6507 train_time: 1.9m tok/s: 8289828 +1227/20000 train_loss: 2.8631 train_time: 1.9m tok/s: 8289819 +1228/20000 train_loss: 2.6573 train_time: 1.9m tok/s: 8289773 +1229/20000 train_loss: 2.6616 train_time: 1.9m tok/s: 8289749 +1230/20000 train_loss: 2.7540 train_time: 1.9m tok/s: 8289750 +1231/20000 train_loss: 2.6936 train_time: 1.9m tok/s: 8289778 +1232/20000 train_loss: 2.6812 train_time: 1.9m tok/s: 8289821 +1233/20000 train_loss: 2.6616 train_time: 1.9m tok/s: 8289738 +1234/20000 train_loss: 2.6524 train_time: 2.0m tok/s: 8289763 +1235/20000 train_loss: 2.5949 train_time: 2.0m tok/s: 8289723 +1236/20000 train_loss: 2.6399 train_time: 2.0m tok/s: 8289720 +1237/20000 train_loss: 2.6166 train_time: 2.0m tok/s: 8289745 +1238/20000 train_loss: 2.5816 train_time: 2.0m tok/s: 8289846 +1239/20000 train_loss: 2.5846 train_time: 2.0m tok/s: 8289882 +1240/20000 train_loss: 2.5392 train_time: 2.0m tok/s: 8289892 +1241/20000 train_loss: 2.5892 train_time: 2.0m tok/s: 8289847 +1242/20000 train_loss: 2.5857 train_time: 2.0m tok/s: 8289712 +1243/20000 train_loss: 2.6755 train_time: 2.0m tok/s: 
8289547 +1244/20000 train_loss: 2.7768 train_time: 2.0m tok/s: 8289474 +1245/20000 train_loss: 2.6669 train_time: 2.0m tok/s: 8289324 +1246/20000 train_loss: 2.7858 train_time: 2.0m tok/s: 8289324 +1247/20000 train_loss: 2.7811 train_time: 2.0m tok/s: 8289266 +1248/20000 train_loss: 2.6493 train_time: 2.0m tok/s: 8289352 +1249/20000 train_loss: 2.6384 train_time: 2.0m tok/s: 8289394 +1250/20000 train_loss: 2.6450 train_time: 2.0m tok/s: 8289467 +1251/20000 train_loss: 2.6006 train_time: 2.0m tok/s: 8289536 +1252/20000 train_loss: 2.6738 train_time: 2.0m tok/s: 8289479 +1253/20000 train_loss: 2.6218 train_time: 2.0m tok/s: 8289468 +1254/20000 train_loss: 2.6841 train_time: 2.0m tok/s: 8289414 +1255/20000 train_loss: 2.4468 train_time: 2.0m tok/s: 8289425 +1256/20000 train_loss: 2.6775 train_time: 2.0m tok/s: 8289406 +1257/20000 train_loss: 2.6010 train_time: 2.0m tok/s: 8289381 +1258/20000 train_loss: 2.6300 train_time: 2.0m tok/s: 8289324 +1259/20000 train_loss: 2.7604 train_time: 2.0m tok/s: 8289208 +1260/20000 train_loss: 2.7066 train_time: 2.0m tok/s: 8289279 +1261/20000 train_loss: 2.7848 train_time: 2.0m tok/s: 8289334 +1262/20000 train_loss: 2.6972 train_time: 2.0m tok/s: 8289326 +1263/20000 train_loss: 2.7058 train_time: 2.0m tok/s: 8289371 +1264/20000 train_loss: 2.6344 train_time: 2.0m tok/s: 8289372 +1265/20000 train_loss: 2.6325 train_time: 2.0m tok/s: 8289336 +1266/20000 train_loss: 2.6446 train_time: 2.0m tok/s: 8289315 +1267/20000 train_loss: 2.6872 train_time: 2.0m tok/s: 8289344 +1268/20000 train_loss: 2.4885 train_time: 2.0m tok/s: 8289405 +1269/20000 train_loss: 2.6810 train_time: 2.0m tok/s: 8289389 +1270/20000 train_loss: 2.6481 train_time: 2.0m tok/s: 8289413 +1271/20000 train_loss: 2.5640 train_time: 2.0m tok/s: 8289442 +1272/20000 train_loss: 2.8062 train_time: 2.0m tok/s: 8289508 +1273/20000 train_loss: 2.7470 train_time: 2.0m tok/s: 8289556 +1274/20000 train_loss: 2.6750 train_time: 2.0m tok/s: 8289603 +1275/20000 train_loss: 2.7781 
train_time: 2.0m tok/s: 8289597 +1276/20000 train_loss: 2.7137 train_time: 2.0m tok/s: 8289579 +1277/20000 train_loss: 2.7021 train_time: 2.0m tok/s: 8289469 +1278/20000 train_loss: 2.6150 train_time: 2.0m tok/s: 8289635 +1279/20000 train_loss: 2.7173 train_time: 2.0m tok/s: 8289714 +1280/20000 train_loss: 2.6410 train_time: 2.0m tok/s: 8289674 +1281/20000 train_loss: 2.8148 train_time: 2.0m tok/s: 8289637 +1282/20000 train_loss: 2.5363 train_time: 2.0m tok/s: 8289555 +1283/20000 train_loss: 2.6318 train_time: 2.0m tok/s: 8289600 +1284/20000 train_loss: 2.6146 train_time: 2.0m tok/s: 8289610 +1285/20000 train_loss: 2.7804 train_time: 2.0m tok/s: 8289637 +1286/20000 train_loss: 2.6689 train_time: 2.0m tok/s: 8289624 +1287/20000 train_loss: 2.7074 train_time: 2.0m tok/s: 8289611 +1288/20000 train_loss: 2.7208 train_time: 2.0m tok/s: 8289651 +1289/20000 train_loss: 2.7481 train_time: 2.0m tok/s: 8289617 +1290/20000 train_loss: 2.6312 train_time: 2.0m tok/s: 8289678 +1291/20000 train_loss: 2.7456 train_time: 2.0m tok/s: 8289719 +1292/20000 train_loss: 2.7271 train_time: 2.0m tok/s: 8289726 +1293/20000 train_loss: 2.7142 train_time: 2.0m tok/s: 8289742 +1294/20000 train_loss: 2.7186 train_time: 2.0m tok/s: 8289773 +1295/20000 train_loss: 2.7468 train_time: 2.0m tok/s: 8289836 +1296/20000 train_loss: 2.6851 train_time: 2.0m tok/s: 8289535 +1297/20000 train_loss: 2.5932 train_time: 2.1m tok/s: 8289695 +1298/20000 train_loss: 2.6589 train_time: 2.1m tok/s: 8289774 +1299/20000 train_loss: 2.5134 train_time: 2.1m tok/s: 8289742 +1300/20000 train_loss: 2.6338 train_time: 2.1m tok/s: 8289741 +1301/20000 train_loss: 2.6841 train_time: 2.1m tok/s: 8289791 +1302/20000 train_loss: 2.6684 train_time: 2.1m tok/s: 8289871 +1303/20000 train_loss: 2.8871 train_time: 2.1m tok/s: 8289819 +1304/20000 train_loss: 2.7367 train_time: 2.1m tok/s: 8289785 +1305/20000 train_loss: 2.7642 train_time: 2.1m tok/s: 8289760 +1306/20000 train_loss: 2.8538 train_time: 2.1m tok/s: 8289738 +1307/20000 
train_loss: 2.6249 train_time: 2.1m tok/s: 8289702 +1308/20000 train_loss: 2.6284 train_time: 2.1m tok/s: 8289669 +1309/20000 train_loss: 2.6644 train_time: 2.1m tok/s: 8289631 +1310/20000 train_loss: 2.5617 train_time: 2.1m tok/s: 8289539 +1311/20000 train_loss: 2.6100 train_time: 2.1m tok/s: 8289447 +1312/20000 train_loss: 2.5485 train_time: 2.1m tok/s: 8289451 +1313/20000 train_loss: 2.5721 train_time: 2.1m tok/s: 8289441 +1314/20000 train_loss: 2.5571 train_time: 2.1m tok/s: 8289511 +1315/20000 train_loss: 2.4212 train_time: 2.1m tok/s: 8289422 +1316/20000 train_loss: 2.6830 train_time: 2.1m tok/s: 8289337 +1317/20000 train_loss: 2.6843 train_time: 2.1m tok/s: 8289418 +1318/20000 train_loss: 2.7176 train_time: 2.1m tok/s: 8289502 +1319/20000 train_loss: 2.8105 train_time: 2.1m tok/s: 8289546 +1320/20000 train_loss: 2.7265 train_time: 2.1m tok/s: 8289546 +1321/20000 train_loss: 2.7221 train_time: 2.1m tok/s: 8289584 +1322/20000 train_loss: 2.7394 train_time: 2.1m tok/s: 8289638 +1323/20000 train_loss: 2.5862 train_time: 2.1m tok/s: 8289603 +1324/20000 train_loss: 2.6482 train_time: 2.1m tok/s: 8289585 +1325/20000 train_loss: 2.8296 train_time: 2.1m tok/s: 8289624 +1326/20000 train_loss: 2.8349 train_time: 2.1m tok/s: 8289637 +1327/20000 train_loss: 2.6598 train_time: 2.1m tok/s: 8289602 +1328/20000 train_loss: 2.6520 train_time: 2.1m tok/s: 8289628 +1329/20000 train_loss: 2.7424 train_time: 2.1m tok/s: 8289630 +1330/20000 train_loss: 2.6112 train_time: 2.1m tok/s: 8289667 +1331/20000 train_loss: 2.6786 train_time: 2.1m tok/s: 8289647 +1332/20000 train_loss: 2.9194 train_time: 2.1m tok/s: 8289598 +1333/20000 train_loss: 2.8342 train_time: 2.1m tok/s: 8289540 +1334/20000 train_loss: 2.6707 train_time: 2.1m tok/s: 8289565 +1335/20000 train_loss: 2.6290 train_time: 2.1m tok/s: 8289533 +1336/20000 train_loss: 2.6590 train_time: 2.1m tok/s: 8289534 +1337/20000 train_loss: 2.8070 train_time: 2.1m tok/s: 8289545 +1338/20000 train_loss: 2.8904 train_time: 2.1m tok/s: 
8289551 +1339/20000 train_loss: 2.6914 train_time: 2.1m tok/s: 8289525 +1340/20000 train_loss: 2.5319 train_time: 2.1m tok/s: 8289538 +1341/20000 train_loss: 2.5616 train_time: 2.1m tok/s: 8289544 +1342/20000 train_loss: 2.6442 train_time: 2.1m tok/s: 8289530 +1343/20000 train_loss: 2.6638 train_time: 2.1m tok/s: 8289438 +1344/20000 train_loss: 2.6880 train_time: 2.1m tok/s: 8289479 +1345/20000 train_loss: 2.6402 train_time: 2.1m tok/s: 8289518 +1346/20000 train_loss: 2.7671 train_time: 2.1m tok/s: 8289560 +1347/20000 train_loss: 2.7919 train_time: 2.1m tok/s: 8289551 +1348/20000 train_loss: 2.6919 train_time: 2.1m tok/s: 8289570 +1349/20000 train_loss: 2.7235 train_time: 2.1m tok/s: 8289578 +1350/20000 train_loss: 2.6769 train_time: 2.1m tok/s: 8289592 +1351/20000 train_loss: 2.7885 train_time: 2.1m tok/s: 8289533 +1352/20000 train_loss: 2.6690 train_time: 2.1m tok/s: 8289551 +1353/20000 train_loss: 2.7048 train_time: 2.1m tok/s: 8289562 +1354/20000 train_loss: 2.3823 train_time: 2.1m tok/s: 8289535 +1355/20000 train_loss: 2.5791 train_time: 2.1m tok/s: 8289480 +1356/20000 train_loss: 2.6877 train_time: 2.1m tok/s: 8289544 +1357/20000 train_loss: 2.6747 train_time: 2.1m tok/s: 8289618 +1358/20000 train_loss: 2.7438 train_time: 2.1m tok/s: 8289620 +1359/20000 train_loss: 2.5080 train_time: 2.1m tok/s: 8289560 +1360/20000 train_loss: 2.7351 train_time: 2.2m tok/s: 8289440 +1361/20000 train_loss: 2.6339 train_time: 2.2m tok/s: 8289428 +1362/20000 train_loss: 2.6175 train_time: 2.2m tok/s: 8289437 +1363/20000 train_loss: 2.7021 train_time: 2.2m tok/s: 8289348 +1364/20000 train_loss: 2.5359 train_time: 2.2m tok/s: 8289394 +1365/20000 train_loss: 2.5318 train_time: 2.2m tok/s: 8289401 +1366/20000 train_loss: 2.6229 train_time: 2.2m tok/s: 8289438 +1367/20000 train_loss: 2.7057 train_time: 2.2m tok/s: 8289517 +1368/20000 train_loss: 2.5719 train_time: 2.2m tok/s: 8289515 +1369/20000 train_loss: 2.6747 train_time: 2.2m tok/s: 8289504 +1370/20000 train_loss: 2.7219 
train_time: 2.2m tok/s: 8289517 +1371/20000 train_loss: 2.6987 train_time: 2.2m tok/s: 8289429 +1372/20000 train_loss: 2.7290 train_time: 2.2m tok/s: 8289437 +1373/20000 train_loss: 2.6653 train_time: 2.2m tok/s: 8289512 +1374/20000 train_loss: 2.7701 train_time: 2.2m tok/s: 8289488 +1375/20000 train_loss: 2.7282 train_time: 2.2m tok/s: 8289534 +1376/20000 train_loss: 2.5834 train_time: 2.2m tok/s: 8289552 +1377/20000 train_loss: 2.6405 train_time: 2.2m tok/s: 8289568 +1378/20000 train_loss: 2.5870 train_time: 2.2m tok/s: 8289561 +1379/20000 train_loss: 2.6362 train_time: 2.2m tok/s: 8289537 +1380/20000 train_loss: 2.5969 train_time: 2.2m tok/s: 8289477 +1381/20000 train_loss: 2.6423 train_time: 2.2m tok/s: 8289535 +1382/20000 train_loss: 2.6654 train_time: 2.2m tok/s: 8289503 +1383/20000 train_loss: 2.6851 train_time: 2.2m tok/s: 8289605 +1384/20000 train_loss: 2.5768 train_time: 2.2m tok/s: 8289644 +1385/20000 train_loss: 2.6073 train_time: 2.2m tok/s: 8289683 +1386/20000 train_loss: 2.7596 train_time: 2.2m tok/s: 8289740 +1387/20000 train_loss: 2.6777 train_time: 2.2m tok/s: 8289815 +1388/20000 train_loss: 2.7557 train_time: 2.2m tok/s: 8289864 +1389/20000 train_loss: 2.6059 train_time: 2.2m tok/s: 8289829 +1390/20000 train_loss: 2.7366 train_time: 2.2m tok/s: 8289860 +1391/20000 train_loss: 2.5455 train_time: 2.2m tok/s: 8289858 +1392/20000 train_loss: 2.7353 train_time: 2.2m tok/s: 8289872 +1393/20000 train_loss: 2.6311 train_time: 2.2m tok/s: 8289844 +1394/20000 train_loss: 2.8983 train_time: 2.2m tok/s: 8289854 +1395/20000 train_loss: 2.4958 train_time: 2.2m tok/s: 8289800 +1396/20000 train_loss: 2.8222 train_time: 2.2m tok/s: 8289815 +1397/20000 train_loss: 2.7164 train_time: 2.2m tok/s: 8289810 +1398/20000 train_loss: 2.8064 train_time: 2.2m tok/s: 8289850 +1399/20000 train_loss: 2.6388 train_time: 2.2m tok/s: 8289869 +1400/20000 train_loss: 2.7437 train_time: 2.2m tok/s: 8289668 +1401/20000 train_loss: 2.7299 train_time: 2.2m tok/s: 8289909 +1402/20000 
train_loss: 2.5746 train_time: 2.2m tok/s: 8289966 +1403/20000 train_loss: 2.5970 train_time: 2.2m tok/s: 8290021 +1404/20000 train_loss: 2.6829 train_time: 2.2m tok/s: 8289933 +1405/20000 train_loss: 2.7138 train_time: 2.2m tok/s: 8289923 +1406/20000 train_loss: 2.8384 train_time: 2.2m tok/s: 8289881 +1407/20000 train_loss: 2.5746 train_time: 2.2m tok/s: 8289769 +1408/20000 train_loss: 2.6943 train_time: 2.2m tok/s: 8289738 +1409/20000 train_loss: 2.7697 train_time: 2.2m tok/s: 8289704 +1410/20000 train_loss: 2.6568 train_time: 2.2m tok/s: 8289745 +1411/20000 train_loss: 2.6951 train_time: 2.2m tok/s: 8289784 +1412/20000 train_loss: 2.7362 train_time: 2.2m tok/s: 8289849 +1413/20000 train_loss: 2.6159 train_time: 2.2m tok/s: 8289898 +1414/20000 train_loss: 2.6011 train_time: 2.2m tok/s: 8289937 +1415/20000 train_loss: 2.6644 train_time: 2.2m tok/s: 8289984 +1416/20000 train_loss: 2.6193 train_time: 2.2m tok/s: 8290024 +1417/20000 train_loss: 2.6256 train_time: 2.2m tok/s: 8290036 +1418/20000 train_loss: 2.7543 train_time: 2.2m tok/s: 8290018 +1419/20000 train_loss: 2.6610 train_time: 2.2m tok/s: 8289964 +1420/20000 train_loss: 2.6295 train_time: 2.2m tok/s: 8289935 +1421/20000 train_loss: 2.7504 train_time: 2.2m tok/s: 8289964 +1422/20000 train_loss: 2.7445 train_time: 2.2m tok/s: 8290018 +1423/20000 train_loss: 2.7065 train_time: 2.2m tok/s: 8290087 +1424/20000 train_loss: 2.7075 train_time: 2.3m tok/s: 8290120 +1425/20000 train_loss: 2.6289 train_time: 2.3m tok/s: 8290139 +1426/20000 train_loss: 2.6965 train_time: 2.3m tok/s: 8290137 +1427/20000 train_loss: 2.6480 train_time: 2.3m tok/s: 8290079 +1428/20000 train_loss: 2.6511 train_time: 2.3m tok/s: 8290041 +1429/20000 train_loss: 2.5895 train_time: 2.3m tok/s: 8290014 +1430/20000 train_loss: 2.6250 train_time: 2.3m tok/s: 8290061 +1431/20000 train_loss: 2.6025 train_time: 2.3m tok/s: 8290066 +1432/20000 train_loss: 2.4340 train_time: 2.3m tok/s: 8290070 +1433/20000 train_loss: 2.6247 train_time: 2.3m tok/s: 
8290072 +1434/20000 train_loss: 2.7408 train_time: 2.3m tok/s: 8290029 +1435/20000 train_loss: 2.7156 train_time: 2.3m tok/s: 8290151 +1436/20000 train_loss: 2.6158 train_time: 2.3m tok/s: 8290208 +1437/20000 train_loss: 2.7040 train_time: 2.3m tok/s: 8290089 +1438/20000 train_loss: 2.7535 train_time: 2.3m tok/s: 8290252 +1439/20000 train_loss: 2.6696 train_time: 2.3m tok/s: 8290299 +1440/20000 train_loss: 2.7105 train_time: 2.3m tok/s: 8290235 +1441/20000 train_loss: 2.6704 train_time: 2.3m tok/s: 8290254 +1442/20000 train_loss: 2.5798 train_time: 2.3m tok/s: 8290199 +1443/20000 train_loss: 2.5953 train_time: 2.3m tok/s: 8290196 +1444/20000 train_loss: 2.5516 train_time: 2.3m tok/s: 8290237 +1445/20000 train_loss: 2.6645 train_time: 2.3m tok/s: 8290159 +1446/20000 train_loss: 2.7756 train_time: 2.3m tok/s: 8290183 +1447/20000 train_loss: 2.7184 train_time: 2.3m tok/s: 8290203 +1448/20000 train_loss: 2.6965 train_time: 2.3m tok/s: 8290210 +1449/20000 train_loss: 2.6414 train_time: 2.3m tok/s: 8290246 +1450/20000 train_loss: 2.7317 train_time: 2.3m tok/s: 8290171 +1451/20000 train_loss: 2.5956 train_time: 2.3m tok/s: 8290219 +1452/20000 train_loss: 2.6192 train_time: 2.3m tok/s: 8290203 +1453/20000 train_loss: 2.6481 train_time: 2.3m tok/s: 8290213 +1454/20000 train_loss: 2.7536 train_time: 2.3m tok/s: 8290105 +1455/20000 train_loss: 2.5734 train_time: 2.3m tok/s: 8290122 +1456/20000 train_loss: 2.4553 train_time: 2.3m tok/s: 8290157 +1457/20000 train_loss: 2.4776 train_time: 2.3m tok/s: 8290160 +1458/20000 train_loss: 2.5955 train_time: 2.3m tok/s: 8290109 +1459/20000 train_loss: 2.6735 train_time: 2.3m tok/s: 8290091 +1460/20000 train_loss: 2.6961 train_time: 2.3m tok/s: 8290104 +1461/20000 train_loss: 2.7715 train_time: 2.3m tok/s: 8290143 +1462/20000 train_loss: 2.6461 train_time: 2.3m tok/s: 8290185 +1463/20000 train_loss: 2.6614 train_time: 2.3m tok/s: 8290284 +1464/20000 train_loss: 2.6443 train_time: 2.3m tok/s: 8290139 +1465/20000 train_loss: 2.6788 
train_time: 2.3m tok/s: 8290050 +1466/20000 train_loss: 2.6148 train_time: 2.3m tok/s: 8290046 +1467/20000 train_loss: 2.6164 train_time: 2.3m tok/s: 8290134 +1468/20000 train_loss: 2.5571 train_time: 2.3m tok/s: 8290086 +1469/20000 train_loss: 2.5827 train_time: 2.3m tok/s: 8290101 +1470/20000 train_loss: 2.5082 train_time: 2.3m tok/s: 8290074 +1471/20000 train_loss: 2.8054 train_time: 2.3m tok/s: 8290060 +1472/20000 train_loss: 2.8790 train_time: 2.3m tok/s: 8290021 +1473/20000 train_loss: 2.7658 train_time: 2.3m tok/s: 8289917 +1474/20000 train_loss: 2.7332 train_time: 2.3m tok/s: 8289869 +1475/20000 train_loss: 2.6932 train_time: 2.3m tok/s: 8289941 +1476/20000 train_loss: 2.7734 train_time: 2.3m tok/s: 8289916 +1477/20000 train_loss: 2.6032 train_time: 2.3m tok/s: 8289895 +1478/20000 train_loss: 2.6055 train_time: 2.3m tok/s: 8289910 +1479/20000 train_loss: 2.5880 train_time: 2.3m tok/s: 8289926 +1480/20000 train_loss: 2.6088 train_time: 2.3m tok/s: 8289939 +1481/20000 train_loss: 2.6343 train_time: 2.3m tok/s: 8289951 +1482/20000 train_loss: 3.0609 train_time: 2.3m tok/s: 8289847 +1483/20000 train_loss: 2.6934 train_time: 2.3m tok/s: 8289791 +1484/20000 train_loss: 2.6290 train_time: 2.3m tok/s: 8289760 +1485/20000 train_loss: 2.7844 train_time: 2.3m tok/s: 8289774 +1486/20000 train_loss: 2.5946 train_time: 2.3m tok/s: 8289794 +1487/20000 train_loss: 2.6944 train_time: 2.4m tok/s: 8289863 +1488/20000 train_loss: 2.6450 train_time: 2.4m tok/s: 8289911 +1489/20000 train_loss: 2.5784 train_time: 2.4m tok/s: 8289973 +1490/20000 train_loss: 2.6722 train_time: 2.4m tok/s: 8289934 +1491/20000 train_loss: 2.6699 train_time: 2.4m tok/s: 8289923 +1492/20000 train_loss: 2.5969 train_time: 2.4m tok/s: 8289958 +1493/20000 train_loss: 2.6667 train_time: 2.4m tok/s: 8289960 +1494/20000 train_loss: 2.6404 train_time: 2.4m tok/s: 8289929 +1495/20000 train_loss: 2.5674 train_time: 2.4m tok/s: 8289852 +1496/20000 train_loss: 2.6768 train_time: 2.4m tok/s: 8289843 +1497/20000 
train_loss: 2.6022 train_time: 2.4m tok/s: 8289873 +1498/20000 train_loss: 2.9241 train_time: 2.4m tok/s: 8289841 +1499/20000 train_loss: 2.6967 train_time: 2.4m tok/s: 8289898 +1500/20000 train_loss: 2.7213 train_time: 2.4m tok/s: 8289922 +1501/20000 train_loss: 2.6910 train_time: 2.4m tok/s: 8289955 +1502/20000 train_loss: 2.8005 train_time: 2.4m tok/s: 8289892 +1503/20000 train_loss: 2.6704 train_time: 2.4m tok/s: 8289854 +1504/20000 train_loss: 2.7262 train_time: 2.4m tok/s: 8289925 +1505/20000 train_loss: 2.6814 train_time: 2.4m tok/s: 8289918 +1506/20000 train_loss: 2.7087 train_time: 2.4m tok/s: 8289916 +1507/20000 train_loss: 2.8259 train_time: 2.4m tok/s: 8289941 +1508/20000 train_loss: 2.5302 train_time: 2.4m tok/s: 8289832 +1509/20000 train_loss: 2.5798 train_time: 2.4m tok/s: 8289873 +1510/20000 train_loss: 2.5392 train_time: 2.4m tok/s: 8289852 +1511/20000 train_loss: 2.4979 train_time: 2.4m tok/s: 8289838 +1512/20000 train_loss: 2.5664 train_time: 2.4m tok/s: 8289868 +1513/20000 train_loss: 2.7208 train_time: 2.4m tok/s: 8289917 +1514/20000 train_loss: 2.7470 train_time: 2.4m tok/s: 8289959 +1515/20000 train_loss: 2.6917 train_time: 2.4m tok/s: 8289994 +1516/20000 train_loss: 2.5843 train_time: 2.4m tok/s: 8289976 +1517/20000 train_loss: 2.5939 train_time: 2.4m tok/s: 8289933 +1518/20000 train_loss: 2.7321 train_time: 2.4m tok/s: 8289938 +1519/20000 train_loss: 2.6561 train_time: 2.4m tok/s: 8289890 +1520/20000 train_loss: 2.6659 train_time: 2.4m tok/s: 8289901 +1521/20000 train_loss: 2.6451 train_time: 2.4m tok/s: 8289867 +1522/20000 train_loss: 2.6459 train_time: 2.4m tok/s: 8289889 +1523/20000 train_loss: 2.6730 train_time: 2.4m tok/s: 8289909 +1524/20000 train_loss: 2.6426 train_time: 2.4m tok/s: 8289902 +1525/20000 train_loss: 2.6276 train_time: 2.4m tok/s: 8289883 +1526/20000 train_loss: 2.7202 train_time: 2.4m tok/s: 8289794 +1527/20000 train_loss: 2.6743 train_time: 2.4m tok/s: 8289736 +1528/20000 train_loss: 2.4648 train_time: 2.4m tok/s: 
8289612 +1529/20000 train_loss: 2.6276 train_time: 2.4m tok/s: 8289753 +1530/20000 train_loss: 2.6090 train_time: 2.4m tok/s: 8289734 +1531/20000 train_loss: 2.3566 train_time: 2.4m tok/s: 8289752 +1532/20000 train_loss: 2.6189 train_time: 2.4m tok/s: 8289792 +1533/20000 train_loss: 2.6721 train_time: 2.4m tok/s: 8289694 +1534/20000 train_loss: 2.6341 train_time: 2.4m tok/s: 8289644 +1535/20000 train_loss: 2.7731 train_time: 2.4m tok/s: 8289638 +1536/20000 train_loss: 2.6943 train_time: 2.4m tok/s: 8289695 +1537/20000 train_loss: 3.0719 train_time: 2.4m tok/s: 8289636 +1538/20000 train_loss: 2.7273 train_time: 2.4m tok/s: 8289575 +1539/20000 train_loss: 2.6285 train_time: 2.4m tok/s: 8289602 +1540/20000 train_loss: 2.6781 train_time: 2.4m tok/s: 8289589 +1541/20000 train_loss: 2.5856 train_time: 2.4m tok/s: 8289627 +1542/20000 train_loss: 2.6145 train_time: 2.4m tok/s: 8289625 +1543/20000 train_loss: 2.6287 train_time: 2.4m tok/s: 8289641 +1544/20000 train_loss: 2.5817 train_time: 2.4m tok/s: 8289662 +1545/20000 train_loss: 2.6177 train_time: 2.4m tok/s: 8289613 +1546/20000 train_loss: 2.4991 train_time: 2.4m tok/s: 8289579 +1547/20000 train_loss: 2.7421 train_time: 2.4m tok/s: 8289591 +1548/20000 train_loss: 2.6856 train_time: 2.4m tok/s: 8289583 +1549/20000 train_loss: 2.5766 train_time: 2.4m tok/s: 8289635 +1550/20000 train_loss: 2.7105 train_time: 2.5m tok/s: 8289612 +1551/20000 train_loss: 2.6733 train_time: 2.5m tok/s: 8289607 +1552/20000 train_loss: 2.5505 train_time: 2.5m tok/s: 8289614 +1553/20000 train_loss: 2.4925 train_time: 2.5m tok/s: 8289513 +1554/20000 train_loss: 2.5861 train_time: 2.5m tok/s: 8289554 +1555/20000 train_loss: 2.6270 train_time: 2.5m tok/s: 8289569 +1556/20000 train_loss: 2.5132 train_time: 2.5m tok/s: 8289558 +1557/20000 train_loss: 2.5526 train_time: 2.5m tok/s: 8289493 +1558/20000 train_loss: 2.5655 train_time: 2.5m tok/s: 8289406 +1559/20000 train_loss: 2.5477 train_time: 2.5m tok/s: 8289403 +1560/20000 train_loss: 2.6183 
train_time: 2.5m tok/s: 8289489 +1561/20000 train_loss: 2.5433 train_time: 2.5m tok/s: 8289467 +1562/20000 train_loss: 2.5881 train_time: 2.5m tok/s: 8289501 +1563/20000 train_loss: 2.4923 train_time: 2.5m tok/s: 8289516 +1564/20000 train_loss: 2.5840 train_time: 2.5m tok/s: 8289483 +1565/20000 train_loss: 2.5718 train_time: 2.5m tok/s: 8289431 +1566/20000 train_loss: 2.7408 train_time: 2.5m tok/s: 8289408 +1567/20000 train_loss: 2.6868 train_time: 2.5m tok/s: 8289436 +1568/20000 train_loss: 2.5256 train_time: 2.5m tok/s: 8289487 +1569/20000 train_loss: 2.5940 train_time: 2.5m tok/s: 8289459 +1570/20000 train_loss: 2.5458 train_time: 2.5m tok/s: 8289444 +1571/20000 train_loss: 2.6133 train_time: 2.5m tok/s: 8289426 +1572/20000 train_loss: 3.1973 train_time: 2.5m tok/s: 8289480 +1573/20000 train_loss: 2.7536 train_time: 2.5m tok/s: 8289458 +1574/20000 train_loss: 2.5991 train_time: 2.5m tok/s: 8289391 +1575/20000 train_loss: 2.5450 train_time: 2.5m tok/s: 8289393 +1576/20000 train_loss: 2.5396 train_time: 2.5m tok/s: 8289377 +1577/20000 train_loss: 2.5724 train_time: 2.5m tok/s: 8289350 +1578/20000 train_loss: 2.4928 train_time: 2.5m tok/s: 8289273 +1579/20000 train_loss: 2.7610 train_time: 2.5m tok/s: 8289234 +1580/20000 train_loss: 2.6520 train_time: 2.5m tok/s: 8289258 +1581/20000 train_loss: 2.4919 train_time: 2.5m tok/s: 8289274 +1582/20000 train_loss: 2.5201 train_time: 2.5m tok/s: 8289247 +1583/20000 train_loss: 2.5840 train_time: 2.5m tok/s: 8289259 +1584/20000 train_loss: 2.5565 train_time: 2.5m tok/s: 8289269 +1585/20000 train_loss: 2.7077 train_time: 2.5m tok/s: 8289294 +1586/20000 train_loss: 2.5587 train_time: 2.5m tok/s: 8289267 +1587/20000 train_loss: 2.5920 train_time: 2.5m tok/s: 8289287 +1588/20000 train_loss: 2.6336 train_time: 2.5m tok/s: 8289288 +1589/20000 train_loss: 2.6964 train_time: 2.5m tok/s: 8289280 +1590/20000 train_loss: 2.6479 train_time: 2.5m tok/s: 8289295 +1591/20000 train_loss: 2.6423 train_time: 2.5m tok/s: 8289281 +1592/20000 
train_loss: 2.5726 train_time: 2.5m tok/s: 8289296 +1593/20000 train_loss: 2.6373 train_time: 2.5m tok/s: 8289321 +1594/20000 train_loss: 2.7380 train_time: 2.5m tok/s: 8289374 +1595/20000 train_loss: 2.6700 train_time: 2.5m tok/s: 8289429 +1596/20000 train_loss: 2.4443 train_time: 2.5m tok/s: 8289436 +1597/20000 train_loss: 2.5591 train_time: 2.5m tok/s: 8289439 +1598/20000 train_loss: 2.6178 train_time: 2.5m tok/s: 8289474 +1599/20000 train_loss: 2.6159 train_time: 2.5m tok/s: 8289455 +1600/20000 train_loss: 2.8126 train_time: 2.5m tok/s: 8289509 +1601/20000 train_loss: 2.6539 train_time: 2.5m tok/s: 8289530 +1602/20000 train_loss: 2.7662 train_time: 2.5m tok/s: 8289403 +1603/20000 train_loss: 2.5763 train_time: 2.5m tok/s: 8289415 +1604/20000 train_loss: 2.6000 train_time: 2.5m tok/s: 8289475 +1605/20000 train_loss: 2.6192 train_time: 2.5m tok/s: 8289446 +1606/20000 train_loss: 2.6125 train_time: 2.5m tok/s: 8289461 +1607/20000 train_loss: 2.5265 train_time: 2.5m tok/s: 8289467 +1608/20000 train_loss: 2.5057 train_time: 2.5m tok/s: 8289431 +1609/20000 train_loss: 2.7006 train_time: 2.5m tok/s: 8289260 +1610/20000 train_loss: 2.6048 train_time: 2.5m tok/s: 8289400 +1611/20000 train_loss: 2.5893 train_time: 2.5m tok/s: 8289420 +1612/20000 train_loss: 2.6534 train_time: 2.5m tok/s: 8289374 +1613/20000 train_loss: 2.6523 train_time: 2.6m tok/s: 8289311 +1614/20000 train_loss: 2.7154 train_time: 2.6m tok/s: 8289353 +1615/20000 train_loss: 2.7230 train_time: 2.6m tok/s: 8289168 +1616/20000 train_loss: 2.6559 train_time: 2.6m tok/s: 8289329 +1617/20000 train_loss: 2.6027 train_time: 2.6m tok/s: 8289318 +1618/20000 train_loss: 3.0212 train_time: 2.6m tok/s: 8289336 +1619/20000 train_loss: 2.7345 train_time: 2.6m tok/s: 8289338 +1620/20000 train_loss: 2.5622 train_time: 2.6m tok/s: 8289397 +1621/20000 train_loss: 2.5789 train_time: 2.6m tok/s: 8289446 +1622/20000 train_loss: 2.7540 train_time: 2.6m tok/s: 8289424 +1623/20000 train_loss: 2.6729 train_time: 2.6m tok/s: 
8289470 +1624/20000 train_loss: 2.6263 train_time: 2.6m tok/s: 8289378 +1625/20000 train_loss: 2.6282 train_time: 2.6m tok/s: 8289337 +1626/20000 train_loss: 2.6990 train_time: 2.6m tok/s: 8289318 +1627/20000 train_loss: 2.4426 train_time: 2.6m tok/s: 8289320 +1628/20000 train_loss: 2.5911 train_time: 2.6m tok/s: 8289337 +1629/20000 train_loss: 2.5683 train_time: 2.6m tok/s: 8289348 +1630/20000 train_loss: 2.5837 train_time: 2.6m tok/s: 8289344 +1631/20000 train_loss: 2.7993 train_time: 2.6m tok/s: 8289392 +1632/20000 train_loss: 2.6962 train_time: 2.6m tok/s: 8289380 +1633/20000 train_loss: 2.6685 train_time: 2.6m tok/s: 8289369 +1634/20000 train_loss: 2.6197 train_time: 2.6m tok/s: 8289414 +1635/20000 train_loss: 2.6757 train_time: 2.6m tok/s: 8289409 +1636/20000 train_loss: 2.4593 train_time: 2.6m tok/s: 8289306 +1637/20000 train_loss: 2.5506 train_time: 2.6m tok/s: 8289218 +1638/20000 train_loss: 2.4993 train_time: 2.6m tok/s: 8289210 +1639/20000 train_loss: 2.5273 train_time: 2.6m tok/s: 8289201 +1640/20000 train_loss: 2.3713 train_time: 2.6m tok/s: 8289195 +1641/20000 train_loss: 2.5441 train_time: 2.6m tok/s: 8289213 +1642/20000 train_loss: 2.7469 train_time: 2.6m tok/s: 8289231 +1643/20000 train_loss: 2.4356 train_time: 2.6m tok/s: 8289263 +1644/20000 train_loss: 2.4756 train_time: 2.6m tok/s: 8289286 +1645/20000 train_loss: 2.7615 train_time: 2.6m tok/s: 8289231 +1646/20000 train_loss: 2.5191 train_time: 2.6m tok/s: 8289274 +1647/20000 train_loss: 2.7400 train_time: 2.6m tok/s: 8289046 +1648/20000 train_loss: 2.6378 train_time: 2.6m tok/s: 8289222 +1649/20000 train_loss: 2.7455 train_time: 2.6m tok/s: 8289199 +1650/20000 train_loss: 2.5639 train_time: 2.6m tok/s: 8289219 +1651/20000 train_loss: 2.7306 train_time: 2.6m tok/s: 8289272 +1652/20000 train_loss: 2.6466 train_time: 2.6m tok/s: 8289336 +1653/20000 train_loss: 2.7454 train_time: 2.6m tok/s: 8289369 +1654/20000 train_loss: 2.6709 train_time: 2.6m tok/s: 8289389 +1655/20000 train_loss: 2.5596 
train_time: 2.6m tok/s: 8289403 +1656/20000 train_loss: 2.6004 train_time: 2.6m tok/s: 8289382 +1657/20000 train_loss: 2.6426 train_time: 2.6m tok/s: 8289425 +1658/20000 train_loss: 2.6346 train_time: 2.6m tok/s: 8289425 +1659/20000 train_loss: 2.5898 train_time: 2.6m tok/s: 8289389 +1660/20000 train_loss: 2.5584 train_time: 2.6m tok/s: 8289356 +1661/20000 train_loss: 2.7490 train_time: 2.6m tok/s: 8289269 +1662/20000 train_loss: 2.7385 train_time: 2.6m tok/s: 8289183 +1663/20000 train_loss: 2.7954 train_time: 2.6m tok/s: 8289088 +1664/20000 train_loss: 2.8005 train_time: 2.6m tok/s: 8289094 +1665/20000 train_loss: 2.8007 train_time: 2.6m tok/s: 8289102 +1666/20000 train_loss: 2.6963 train_time: 2.6m tok/s: 8289078 +1667/20000 train_loss: 2.6052 train_time: 2.6m tok/s: 8288970 +1668/20000 train_loss: 2.6319 train_time: 2.6m tok/s: 8289147 +1669/20000 train_loss: 2.7543 train_time: 2.6m tok/s: 8289144 +1670/20000 train_loss: 2.5522 train_time: 2.6m tok/s: 8289156 +1671/20000 train_loss: 2.4846 train_time: 2.6m tok/s: 8289187 +1672/20000 train_loss: 2.6066 train_time: 2.6m tok/s: 8289194 +1673/20000 train_loss: 2.5759 train_time: 2.6m tok/s: 8289205 +1674/20000 train_loss: 2.6393 train_time: 2.6m tok/s: 8289227 +1675/20000 train_loss: 2.4378 train_time: 2.6m tok/s: 8289282 +1676/20000 train_loss: 2.6768 train_time: 2.7m tok/s: 8289264 +1677/20000 train_loss: 2.5994 train_time: 2.7m tok/s: 8289282 +1678/20000 train_loss: 2.6688 train_time: 2.7m tok/s: 8289219 +1679/20000 train_loss: 2.6078 train_time: 2.7m tok/s: 8289189 +1680/20000 train_loss: 2.5286 train_time: 2.7m tok/s: 8289159 +1681/20000 train_loss: 2.5149 train_time: 2.7m tok/s: 8289156 +1682/20000 train_loss: 2.6209 train_time: 2.7m tok/s: 8289213 +1683/20000 train_loss: 2.6239 train_time: 2.7m tok/s: 8289195 +1684/20000 train_loss: 2.5695 train_time: 2.7m tok/s: 8289138 +1685/20000 train_loss: 2.6733 train_time: 2.7m tok/s: 8289126 +1686/20000 train_loss: 2.5784 train_time: 2.7m tok/s: 8289169 +1687/20000 
train_loss: 2.5505 train_time: 2.7m tok/s: 8289189 +1688/20000 train_loss: 2.5665 train_time: 2.7m tok/s: 8289192 +1689/20000 train_loss: 2.5221 train_time: 2.7m tok/s: 8289231 +1690/20000 train_loss: 2.8026 train_time: 2.7m tok/s: 8289234 +1691/20000 train_loss: 2.5978 train_time: 2.7m tok/s: 8289238 +1692/20000 train_loss: 2.5819 train_time: 2.7m tok/s: 8289290 +1693/20000 train_loss: 2.3905 train_time: 2.7m tok/s: 8289264 +1694/20000 train_loss: 2.6124 train_time: 2.7m tok/s: 8289289 +1695/20000 train_loss: 2.6133 train_time: 2.7m tok/s: 8289297 +1696/20000 train_loss: 2.6552 train_time: 2.7m tok/s: 8289304 +1697/20000 train_loss: 2.7536 train_time: 2.7m tok/s: 8289313 +1698/20000 train_loss: 2.6500 train_time: 2.7m tok/s: 8289315 +1699/20000 train_loss: 2.7457 train_time: 2.7m tok/s: 8289284 +1700/20000 train_loss: 2.5793 train_time: 2.7m tok/s: 8289284 +1701/20000 train_loss: 2.4770 train_time: 2.7m tok/s: 8289332 +1702/20000 train_loss: 2.6095 train_time: 2.7m tok/s: 8289393 +1703/20000 train_loss: 2.6425 train_time: 2.7m tok/s: 8289370 +1704/20000 train_loss: 2.7578 train_time: 2.7m tok/s: 8289389 +1705/20000 train_loss: 2.7683 train_time: 2.7m tok/s: 8289360 +1706/20000 train_loss: 2.6990 train_time: 2.7m tok/s: 8289407 +1707/20000 train_loss: 2.8456 train_time: 2.7m tok/s: 8289488 +1708/20000 train_loss: 2.4698 train_time: 2.7m tok/s: 8289459 +1709/20000 train_loss: 2.6600 train_time: 2.7m tok/s: 8289418 +1710/20000 train_loss: 2.6355 train_time: 2.7m tok/s: 8289442 +1711/20000 train_loss: 2.5721 train_time: 2.7m tok/s: 8289462 +1712/20000 train_loss: 2.6973 train_time: 2.7m tok/s: 8289481 +1713/20000 train_loss: 2.7929 train_time: 2.7m tok/s: 8289473 +1714/20000 train_loss: 2.5588 train_time: 2.7m tok/s: 8289480 +1715/20000 train_loss: 2.7454 train_time: 2.7m tok/s: 8289480 +1716/20000 train_loss: 2.7268 train_time: 2.7m tok/s: 8289435 +1717/20000 train_loss: 2.7327 train_time: 2.7m tok/s: 8289461 +1718/20000 train_loss: 2.8165 train_time: 2.7m tok/s: 
8289480 +1719/20000 train_loss: 2.7142 train_time: 2.7m tok/s: 8289479 +1720/20000 train_loss: 2.5282 train_time: 2.7m tok/s: 8289482 +1721/20000 train_loss: 2.6022 train_time: 2.7m tok/s: 8289465 +1722/20000 train_loss: 2.7339 train_time: 2.7m tok/s: 8289456 +1723/20000 train_loss: 2.6209 train_time: 2.7m tok/s: 8289480 +1724/20000 train_loss: 2.6630 train_time: 2.7m tok/s: 8289539 +1725/20000 train_loss: 2.5943 train_time: 2.7m tok/s: 8289534 +1726/20000 train_loss: 2.6297 train_time: 2.7m tok/s: 8289597 +1727/20000 train_loss: 2.5957 train_time: 2.7m tok/s: 8289616 +1728/20000 train_loss: 2.8359 train_time: 2.7m tok/s: 8289637 +1729/20000 train_loss: 2.6579 train_time: 2.7m tok/s: 8289653 +1730/20000 train_loss: 2.7412 train_time: 2.7m tok/s: 8289696 +1731/20000 train_loss: 2.7398 train_time: 2.7m tok/s: 8289723 +1732/20000 train_loss: 2.7029 train_time: 2.7m tok/s: 8289755 +1733/20000 train_loss: 2.6964 train_time: 2.7m tok/s: 8289770 +1734/20000 train_loss: 2.6076 train_time: 2.7m tok/s: 8289749 +1735/20000 train_loss: 2.4626 train_time: 2.7m tok/s: 8289725 +1736/20000 train_loss: 2.6986 train_time: 2.7m tok/s: 8289671 +1737/20000 train_loss: 2.5797 train_time: 2.7m tok/s: 8289682 +1738/20000 train_loss: 2.8127 train_time: 2.7m tok/s: 8289704 +1739/20000 train_loss: 2.7428 train_time: 2.7m tok/s: 8289740 +1740/20000 train_loss: 2.3853 train_time: 2.8m tok/s: 8289748 +1741/20000 train_loss: 2.7948 train_time: 2.8m tok/s: 8289766 +1742/20000 train_loss: 2.6065 train_time: 2.8m tok/s: 8289801 +1743/20000 train_loss: 2.5185 train_time: 2.8m tok/s: 8289860 +1744/20000 train_loss: 2.5818 train_time: 2.8m tok/s: 8289802 +1745/20000 train_loss: 2.6255 train_time: 2.8m tok/s: 8289832 +1746/20000 train_loss: 2.6077 train_time: 2.8m tok/s: 8289897 +1747/20000 train_loss: 2.6369 train_time: 2.8m tok/s: 8289856 +1748/20000 train_loss: 2.5655 train_time: 2.8m tok/s: 8289877 +1749/20000 train_loss: 2.6263 train_time: 2.8m tok/s: 8289844 +1750/20000 train_loss: 2.6700 
train_time: 2.8m tok/s: 8289830 +1751/20000 train_loss: 2.6696 train_time: 2.8m tok/s: 8289853 +1752/20000 train_loss: 2.6229 train_time: 2.8m tok/s: 8289850 +1753/20000 train_loss: 2.6093 train_time: 2.8m tok/s: 8289918 +1754/20000 train_loss: 2.6758 train_time: 2.8m tok/s: 8289914 +1755/20000 train_loss: 2.5880 train_time: 2.8m tok/s: 8289934 +1756/20000 train_loss: 2.6143 train_time: 2.8m tok/s: 8289922 +1757/20000 train_loss: 2.5828 train_time: 2.8m tok/s: 8289884 +1758/20000 train_loss: 2.8590 train_time: 2.8m tok/s: 8289897 +1759/20000 train_loss: 2.6340 train_time: 2.8m tok/s: 8289878 +1760/20000 train_loss: 2.5281 train_time: 2.8m tok/s: 8289876 +1761/20000 train_loss: 2.6349 train_time: 2.8m tok/s: 8289891 +1762/20000 train_loss: 2.6996 train_time: 2.8m tok/s: 8289926 +1763/20000 train_loss: 2.7069 train_time: 2.8m tok/s: 8289976 +1764/20000 train_loss: 2.6542 train_time: 2.8m tok/s: 8290035 +1765/20000 train_loss: 2.5818 train_time: 2.8m tok/s: 8290059 +1766/20000 train_loss: 2.7126 train_time: 2.8m tok/s: 8290073 +1767/20000 train_loss: 2.5701 train_time: 2.8m tok/s: 8290101 +1768/20000 train_loss: 2.6498 train_time: 2.8m tok/s: 8290131 +1769/20000 train_loss: 2.6444 train_time: 2.8m tok/s: 8290094 +1770/20000 train_loss: 2.6616 train_time: 2.8m tok/s: 8290094 +1771/20000 train_loss: 2.5271 train_time: 2.8m tok/s: 8290094 +1772/20000 train_loss: 2.5448 train_time: 2.8m tok/s: 8290104 +1773/20000 train_loss: 2.8507 train_time: 2.8m tok/s: 8290106 +1774/20000 train_loss: 2.7066 train_time: 2.8m tok/s: 8290153 +1775/20000 train_loss: 2.7488 train_time: 2.8m tok/s: 8290170 +1776/20000 train_loss: 2.5782 train_time: 2.8m tok/s: 8290165 +1777/20000 train_loss: 2.6936 train_time: 2.8m tok/s: 8290138 +1778/20000 train_loss: 2.6524 train_time: 2.8m tok/s: 8290179 +1779/20000 train_loss: 2.6546 train_time: 2.8m tok/s: 8290229 +1780/20000 train_loss: 2.6533 train_time: 2.8m tok/s: 8290202 +1781/20000 train_loss: 2.5253 train_time: 2.8m tok/s: 8290173 +1782/20000 
train_loss: 2.4177 train_time: 2.8m tok/s: 8290170 +1783/20000 train_loss: 2.6245 train_time: 2.8m tok/s: 8290145 +1784/20000 train_loss: 2.6234 train_time: 2.8m tok/s: 8290178 +1785/20000 train_loss: 2.6367 train_time: 2.8m tok/s: 8290211 +1786/20000 train_loss: 2.7920 train_time: 2.8m tok/s: 8290071 +1787/20000 train_loss: 2.7117 train_time: 2.8m tok/s: 8290239 +1788/20000 train_loss: 2.6257 train_time: 2.8m tok/s: 8290308 +1789/20000 train_loss: 2.7159 train_time: 2.8m tok/s: 8290313 +1790/20000 train_loss: 2.5190 train_time: 2.8m tok/s: 8290362 +1791/20000 train_loss: 2.3548 train_time: 2.8m tok/s: 8290293 +1792/20000 train_loss: 2.6315 train_time: 2.8m tok/s: 8290241 +1793/20000 train_loss: 2.4954 train_time: 2.8m tok/s: 8290249 +1794/20000 train_loss: 2.4335 train_time: 2.8m tok/s: 8290219 +1795/20000 train_loss: 2.6275 train_time: 2.8m tok/s: 8290275 +1796/20000 train_loss: 2.6406 train_time: 2.8m tok/s: 8290250 +1797/20000 train_loss: 2.8086 train_time: 2.8m tok/s: 8290228 +1798/20000 train_loss: 2.5702 train_time: 2.8m tok/s: 8290239 +1799/20000 train_loss: 2.6548 train_time: 2.8m tok/s: 8290247 +1800/20000 train_loss: 2.5546 train_time: 2.8m tok/s: 8290279 +1801/20000 train_loss: 2.6855 train_time: 2.8m tok/s: 8290331 +1802/20000 train_loss: 2.5868 train_time: 2.8m tok/s: 8290329 +1803/20000 train_loss: 2.6414 train_time: 2.9m tok/s: 8290315 +1804/20000 train_loss: 2.6016 train_time: 2.9m tok/s: 8290300 +1805/20000 train_loss: 2.5874 train_time: 2.9m tok/s: 8290304 +1806/20000 train_loss: 2.8101 train_time: 2.9m tok/s: 8290279 +1807/20000 train_loss: 2.7069 train_time: 2.9m tok/s: 8290270 +1808/20000 train_loss: 2.7242 train_time: 2.9m tok/s: 8290267 +1809/20000 train_loss: 2.6120 train_time: 2.9m tok/s: 8290217 +1810/20000 train_loss: 2.7155 train_time: 2.9m tok/s: 8290130 +1811/20000 train_loss: 2.5655 train_time: 2.9m tok/s: 8290127 +1812/20000 train_loss: 2.6125 train_time: 2.9m tok/s: 8290131 +1813/20000 train_loss: 2.7006 train_time: 2.9m tok/s: 
8290160 +1814/20000 train_loss: 2.7466 train_time: 2.9m tok/s: 8290179 +1815/20000 train_loss: 2.5505 train_time: 2.9m tok/s: 8290188 +1816/20000 train_loss: 2.4879 train_time: 2.9m tok/s: 8290132 +1817/20000 train_loss: 2.8407 train_time: 2.9m tok/s: 8290085 +1818/20000 train_loss: 2.6704 train_time: 2.9m tok/s: 8290115 +1819/20000 train_loss: 2.6908 train_time: 2.9m tok/s: 8289993 +1820/20000 train_loss: 2.5759 train_time: 2.9m tok/s: 8290091 +1821/20000 train_loss: 2.6439 train_time: 2.9m tok/s: 8290102 +1822/20000 train_loss: 2.7076 train_time: 2.9m tok/s: 8290131 +1823/20000 train_loss: 2.3949 train_time: 2.9m tok/s: 8290111 +1824/20000 train_loss: 2.6029 train_time: 2.9m tok/s: 8290081 +1825/20000 train_loss: 2.6236 train_time: 2.9m tok/s: 8290047 +1826/20000 train_loss: 2.4582 train_time: 2.9m tok/s: 8290018 +1827/20000 train_loss: 2.5753 train_time: 2.9m tok/s: 8289965 +1828/20000 train_loss: 2.5220 train_time: 2.9m tok/s: 8289942 +1829/20000 train_loss: 2.4708 train_time: 2.9m tok/s: 8289899 +1830/20000 train_loss: 2.6467 train_time: 2.9m tok/s: 8289913 +1831/20000 train_loss: 2.6506 train_time: 2.9m tok/s: 8289909 +1832/20000 train_loss: 2.6405 train_time: 2.9m tok/s: 8289933 +1833/20000 train_loss: 2.7046 train_time: 2.9m tok/s: 8289924 +1834/20000 train_loss: 2.6721 train_time: 2.9m tok/s: 8289991 +1835/20000 train_loss: 2.5727 train_time: 2.9m tok/s: 8289994 +1836/20000 train_loss: 2.5778 train_time: 2.9m tok/s: 8289953 +1837/20000 train_loss: 2.4866 train_time: 2.9m tok/s: 8289995 +1838/20000 train_loss: 2.6090 train_time: 2.9m tok/s: 8290006 +1839/20000 train_loss: 2.7081 train_time: 2.9m tok/s: 8289879 +1840/20000 train_loss: 2.6524 train_time: 2.9m tok/s: 8290023 +1841/20000 train_loss: 2.6946 train_time: 2.9m tok/s: 8290019 +1842/20000 train_loss: 2.6282 train_time: 2.9m tok/s: 8290039 +1843/20000 train_loss: 2.5589 train_time: 2.9m tok/s: 8290097 +1844/20000 train_loss: 2.6547 train_time: 2.9m tok/s: 8290070 +1845/20000 train_loss: 2.8158 
train_time: 2.9m tok/s: 8290039 +1846/20000 train_loss: 2.5909 train_time: 2.9m tok/s: 8290040 +1847/20000 train_loss: 2.4679 train_time: 2.9m tok/s: 8290042 +1848/20000 train_loss: 2.5278 train_time: 2.9m tok/s: 8289958 +1849/20000 train_loss: 2.5266 train_time: 2.9m tok/s: 8289991 +1850/20000 train_loss: 2.6525 train_time: 2.9m tok/s: 8290014 +1851/20000 train_loss: 2.6582 train_time: 2.9m tok/s: 8290048 +1852/20000 train_loss: 2.6734 train_time: 2.9m tok/s: 8290022 +1853/20000 train_loss: 2.5092 train_time: 2.9m tok/s: 8290014 +1854/20000 train_loss: 2.5508 train_time: 2.9m tok/s: 8290031 +1855/20000 train_loss: 2.6357 train_time: 2.9m tok/s: 8290049 +1856/20000 train_loss: 2.6715 train_time: 2.9m tok/s: 8290061 +1857/20000 train_loss: 2.7893 train_time: 2.9m tok/s: 8289925 +1858/20000 train_loss: 2.7324 train_time: 2.9m tok/s: 8290105 +1859/20000 train_loss: 2.6434 train_time: 2.9m tok/s: 8290126 +1860/20000 train_loss: 2.5791 train_time: 2.9m tok/s: 8290164 +1861/20000 train_loss: 2.5691 train_time: 2.9m tok/s: 8290146 +1862/20000 train_loss: 2.5410 train_time: 2.9m tok/s: 8290180 +1863/20000 train_loss: 2.6204 train_time: 2.9m tok/s: 8290194 +1864/20000 train_loss: 2.5799 train_time: 2.9m tok/s: 8290177 +1865/20000 train_loss: 2.7351 train_time: 2.9m tok/s: 8290141 +1866/20000 train_loss: 2.6326 train_time: 3.0m tok/s: 8290059 +1867/20000 train_loss: 2.5472 train_time: 3.0m tok/s: 8290038 +1868/20000 train_loss: 2.5482 train_time: 3.0m tok/s: 8290009 +1869/20000 train_loss: 2.6837 train_time: 3.0m tok/s: 8290003 +1870/20000 train_loss: 2.6340 train_time: 3.0m tok/s: 8290018 +1871/20000 train_loss: 2.5469 train_time: 3.0m tok/s: 8290039 +1872/20000 train_loss: 2.5854 train_time: 3.0m tok/s: 8290039 +1873/20000 train_loss: 2.7294 train_time: 3.0m tok/s: 8290054 +1874/20000 train_loss: 2.6603 train_time: 3.0m tok/s: 8290090 +1875/20000 train_loss: 2.7443 train_time: 3.0m tok/s: 8290039 +1876/20000 train_loss: 2.7810 train_time: 3.0m tok/s: 8290106 +1877/20000 
train_loss: 2.9580 train_time: 3.0m tok/s: 8290050 +1878/20000 train_loss: 2.6546 train_time: 3.0m tok/s: 8290027 +1879/20000 train_loss: 2.6077 train_time: 3.0m tok/s: 8290016 +1880/20000 train_loss: 2.7505 train_time: 3.0m tok/s: 8289996 +1881/20000 train_loss: 2.5947 train_time: 3.0m tok/s: 8290011 +1882/20000 train_loss: 2.7270 train_time: 3.0m tok/s: 8290034 +1883/20000 train_loss: 2.5909 train_time: 3.0m tok/s: 8290001 +1884/20000 train_loss: 2.5610 train_time: 3.0m tok/s: 8289963 +1885/20000 train_loss: 2.6374 train_time: 3.0m tok/s: 8289942 +1886/20000 train_loss: 2.5631 train_time: 3.0m tok/s: 8289966 +1887/20000 train_loss: 2.6277 train_time: 3.0m tok/s: 8289930 +1888/20000 train_loss: 2.4888 train_time: 3.0m tok/s: 8289975 +1889/20000 train_loss: 2.5460 train_time: 3.0m tok/s: 8289990 +1890/20000 train_loss: 2.6743 train_time: 3.0m tok/s: 8289994 +1891/20000 train_loss: 2.5088 train_time: 3.0m tok/s: 8290029 +1892/20000 train_loss: 2.6814 train_time: 3.0m tok/s: 8290054 +1893/20000 train_loss: 2.6826 train_time: 3.0m tok/s: 8290045 +1894/20000 train_loss: 2.5974 train_time: 3.0m tok/s: 8290054 +1895/20000 train_loss: 2.6483 train_time: 3.0m tok/s: 8290074 +1896/20000 train_loss: 2.6221 train_time: 3.0m tok/s: 8290031 +1897/20000 train_loss: 2.5623 train_time: 3.0m tok/s: 8290076 +1898/20000 train_loss: 2.7007 train_time: 3.0m tok/s: 8290062 +1899/20000 train_loss: 2.6297 train_time: 3.0m tok/s: 8290074 +1900/20000 train_loss: 2.6156 train_time: 3.0m tok/s: 8290102 +1901/20000 train_loss: 2.6859 train_time: 3.0m tok/s: 8290113 +1902/20000 train_loss: 2.6006 train_time: 3.0m tok/s: 8290116 +1903/20000 train_loss: 2.7628 train_time: 3.0m tok/s: 8290115 +1904/20000 train_loss: 3.1416 train_time: 3.0m tok/s: 8290079 +1905/20000 train_loss: 2.4826 train_time: 3.0m tok/s: 8290028 +1906/20000 train_loss: 2.6308 train_time: 3.0m tok/s: 8290002 +1907/20000 train_loss: 2.5151 train_time: 3.0m tok/s: 8289994 +1908/20000 train_loss: 2.5362 train_time: 3.0m tok/s: 
8290017 +1909/20000 train_loss: 2.5855 train_time: 3.0m tok/s: 8290040 +1910/20000 train_loss: 2.5332 train_time: 3.0m tok/s: 8290060 +1911/20000 train_loss: 2.4814 train_time: 3.0m tok/s: 8290060 +1912/20000 train_loss: 2.7027 train_time: 3.0m tok/s: 8290086 +1913/20000 train_loss: 2.7120 train_time: 3.0m tok/s: 8290094 +1914/20000 train_loss: 2.6948 train_time: 3.0m tok/s: 8290112 +1915/20000 train_loss: 2.7060 train_time: 3.0m tok/s: 8290169 +1916/20000 train_loss: 2.5766 train_time: 3.0m tok/s: 8290180 +1917/20000 train_loss: 2.7139 train_time: 3.0m tok/s: 8290203 +1918/20000 train_loss: 2.5770 train_time: 3.0m tok/s: 8290208 +1919/20000 train_loss: 2.5572 train_time: 3.0m tok/s: 8290231 +1920/20000 train_loss: 2.4943 train_time: 3.0m tok/s: 8290232 +1921/20000 train_loss: 2.7059 train_time: 3.0m tok/s: 8290218 +1922/20000 train_loss: 2.5984 train_time: 3.0m tok/s: 8290238 +1923/20000 train_loss: 2.5236 train_time: 3.0m tok/s: 8290266 +1924/20000 train_loss: 2.5957 train_time: 3.0m tok/s: 8290309 +1925/20000 train_loss: 2.5285 train_time: 3.0m tok/s: 8290254 +1926/20000 train_loss: 2.7436 train_time: 3.0m tok/s: 8290270 +1927/20000 train_loss: 2.5713 train_time: 3.0m tok/s: 8290290 +1928/20000 train_loss: 2.6347 train_time: 3.0m tok/s: 8290295 +1929/20000 train_loss: 2.6408 train_time: 3.0m tok/s: 8290294 +1930/20000 train_loss: 2.7059 train_time: 3.1m tok/s: 8290282 +1931/20000 train_loss: 2.6401 train_time: 3.1m tok/s: 8290299 +1932/20000 train_loss: 2.7560 train_time: 3.1m tok/s: 8290264 +1933/20000 train_loss: 2.6515 train_time: 3.1m tok/s: 8290281 +1934/20000 train_loss: 2.6652 train_time: 3.1m tok/s: 8290322 +1935/20000 train_loss: 2.5591 train_time: 3.1m tok/s: 8290294 +1936/20000 train_loss: 2.6781 train_time: 3.1m tok/s: 8290249 +1937/20000 train_loss: 2.6718 train_time: 3.1m tok/s: 8290254 +1938/20000 train_loss: 2.6975 train_time: 3.1m tok/s: 8290295 +1939/20000 train_loss: 2.6180 train_time: 3.1m tok/s: 8290333 +1940/20000 train_loss: 2.8139 
train_time: 3.1m tok/s: 8290359 +1941/20000 train_loss: 2.4738 train_time: 3.1m tok/s: 8290387 +1942/20000 train_loss: 2.4976 train_time: 3.1m tok/s: 8290435 +1943/20000 train_loss: 2.4980 train_time: 3.1m tok/s: 8290452 +1944/20000 train_loss: 2.5234 train_time: 3.1m tok/s: 8290440 +1945/20000 train_loss: 2.5715 train_time: 3.1m tok/s: 8290477 +1946/20000 train_loss: 2.6210 train_time: 3.1m tok/s: 8290342 +1947/20000 train_loss: 2.6885 train_time: 3.1m tok/s: 8290517 +1948/20000 train_loss: 2.6837 train_time: 3.1m tok/s: 8290561 +1949/20000 train_loss: 2.7277 train_time: 3.1m tok/s: 8290529 +1950/20000 train_loss: 2.5849 train_time: 3.1m tok/s: 8290571 +1951/20000 train_loss: 2.8011 train_time: 3.1m tok/s: 8290479 +1952/20000 train_loss: 2.8197 train_time: 3.1m tok/s: 8290545 +1953/20000 train_loss: 2.6386 train_time: 3.1m tok/s: 8290575 +1954/20000 train_loss: 2.5848 train_time: 3.1m tok/s: 8290579 +1955/20000 train_loss: 2.8469 train_time: 3.1m tok/s: 8290541 +1956/20000 train_loss: 2.5796 train_time: 3.1m tok/s: 8290536 +1957/20000 train_loss: 2.6011 train_time: 3.1m tok/s: 8290553 +1958/20000 train_loss: 2.5690 train_time: 3.1m tok/s: 8290580 +1959/20000 train_loss: 2.5482 train_time: 3.1m tok/s: 8290584 +1960/20000 train_loss: 2.5001 train_time: 3.1m tok/s: 8290568 +1961/20000 train_loss: 2.5150 train_time: 3.1m tok/s: 8290550 +1962/20000 train_loss: 2.5980 train_time: 3.1m tok/s: 8290578 +1963/20000 train_loss: 2.5633 train_time: 3.1m tok/s: 8290650 +1964/20000 train_loss: 2.5851 train_time: 3.1m tok/s: 8290716 +1965/20000 train_loss: 2.5863 train_time: 3.1m tok/s: 8290684 +1966/20000 train_loss: 2.7447 train_time: 3.1m tok/s: 8290649 +1967/20000 train_loss: 2.5570 train_time: 3.1m tok/s: 8290626 +1968/20000 train_loss: 2.7037 train_time: 3.1m tok/s: 8290603 +1969/20000 train_loss: 2.7658 train_time: 3.1m tok/s: 8290631 +1970/20000 train_loss: 2.5912 train_time: 3.1m tok/s: 8290640 +1971/20000 train_loss: 2.6143 train_time: 3.1m tok/s: 8290629 +1972/20000 
train_loss: 2.6814 train_time: 3.1m tok/s: 8290647 +1973/20000 train_loss: 2.5994 train_time: 3.1m tok/s: 8290641 +1974/20000 train_loss: 2.7719 train_time: 3.1m tok/s: 8290654 +1975/20000 train_loss: 2.5244 train_time: 3.1m tok/s: 8290669 +1976/20000 train_loss: 2.7181 train_time: 3.1m tok/s: 8290700 +1977/20000 train_loss: 2.5265 train_time: 3.1m tok/s: 8290746 +1978/20000 train_loss: 2.6921 train_time: 3.1m tok/s: 8290747 +1979/20000 train_loss: 2.5237 train_time: 3.1m tok/s: 8290780 +1980/20000 train_loss: 2.5761 train_time: 3.1m tok/s: 8290764 +1981/20000 train_loss: 2.4745 train_time: 3.1m tok/s: 8290740 +1982/20000 train_loss: 2.6388 train_time: 3.1m tok/s: 8290704 +1983/20000 train_loss: 2.3818 train_time: 3.1m tok/s: 8290711 +1984/20000 train_loss: 2.6869 train_time: 3.1m tok/s: 8290731 +1985/20000 train_loss: 2.6267 train_time: 3.1m tok/s: 8290749 +1986/20000 train_loss: 2.6724 train_time: 3.1m tok/s: 8290773 +1987/20000 train_loss: 2.6883 train_time: 3.1m tok/s: 8290819 +1988/20000 train_loss: 2.6504 train_time: 3.1m tok/s: 8290865 +1989/20000 train_loss: 2.4860 train_time: 3.1m tok/s: 8290904 +1990/20000 train_loss: 2.6527 train_time: 3.1m tok/s: 8290941 +1991/20000 train_loss: 2.5696 train_time: 3.1m tok/s: 8290994 +1992/20000 train_loss: 2.7516 train_time: 3.1m tok/s: 8291020 +1993/20000 train_loss: 2.5681 train_time: 3.2m tok/s: 8291045 +1994/20000 train_loss: 2.6140 train_time: 3.2m tok/s: 8291067 +1995/20000 train_loss: 2.5215 train_time: 3.2m tok/s: 8291067 +1996/20000 train_loss: 2.6147 train_time: 3.2m tok/s: 8291087 +1997/20000 train_loss: 2.5925 train_time: 3.2m tok/s: 8291090 +1998/20000 train_loss: 2.6135 train_time: 3.2m tok/s: 8291139 +1999/20000 train_loss: 2.6771 train_time: 3.2m tok/s: 8291131 +2000/20000 train_loss: 2.4927 train_time: 3.2m tok/s: 8291148 +2001/20000 train_loss: 2.5829 train_time: 3.2m tok/s: 8291179 +2002/20000 train_loss: 2.4523 train_time: 3.2m tok/s: 8291199 +2003/20000 train_loss: 2.6492 train_time: 3.2m tok/s: 
8291202 +2004/20000 train_loss: 2.6347 train_time: 3.2m tok/s: 8291264 +2005/20000 train_loss: 2.6165 train_time: 3.2m tok/s: 8291323 +2006/20000 train_loss: 2.5791 train_time: 3.2m tok/s: 8291375 +2007/20000 train_loss: 2.6216 train_time: 3.2m tok/s: 8291320 +2008/20000 train_loss: 2.5454 train_time: 3.2m tok/s: 8291310 +2009/20000 train_loss: 2.6492 train_time: 3.2m tok/s: 8291338 +2010/20000 train_loss: 2.7025 train_time: 3.2m tok/s: 8291299 +2011/20000 train_loss: 2.5708 train_time: 3.2m tok/s: 8291302 +2012/20000 train_loss: 2.5746 train_time: 3.2m tok/s: 8291348 +2013/20000 train_loss: 2.4797 train_time: 3.2m tok/s: 8291319 +2014/20000 train_loss: 2.4742 train_time: 3.2m tok/s: 8291005 +2015/20000 train_loss: 2.6916 train_time: 3.2m tok/s: 8291316 +2016/20000 train_loss: 2.4779 train_time: 3.2m tok/s: 8291334 +2017/20000 train_loss: 2.6328 train_time: 3.2m tok/s: 8291327 +2018/20000 train_loss: 2.6270 train_time: 3.2m tok/s: 8291368 +2019/20000 train_loss: 2.7204 train_time: 3.2m tok/s: 8291425 +2020/20000 train_loss: 2.7076 train_time: 3.2m tok/s: 8291440 +2021/20000 train_loss: 2.5610 train_time: 3.2m tok/s: 8291402 +2022/20000 train_loss: 2.4910 train_time: 3.2m tok/s: 8291382 +2023/20000 train_loss: 2.6881 train_time: 3.2m tok/s: 8291389 +2024/20000 train_loss: 2.6380 train_time: 3.2m tok/s: 8291414 +2025/20000 train_loss: 2.4804 train_time: 3.2m tok/s: 8291386 +2026/20000 train_loss: 2.7093 train_time: 3.2m tok/s: 8291371 +2027/20000 train_loss: 2.5967 train_time: 3.2m tok/s: 8291375 +2028/20000 train_loss: 2.7139 train_time: 3.2m tok/s: 8291353 +2029/20000 train_loss: 2.4944 train_time: 3.2m tok/s: 8291365 +2030/20000 train_loss: 2.5111 train_time: 3.2m tok/s: 8291425 +2031/20000 train_loss: 2.5048 train_time: 3.2m tok/s: 8291456 +2032/20000 train_loss: 2.5616 train_time: 3.2m tok/s: 8291413 +2033/20000 train_loss: 2.8412 train_time: 3.2m tok/s: 8291355 +2034/20000 train_loss: 2.6811 train_time: 3.2m tok/s: 8291334 +2035/20000 train_loss: 2.6551 
train_time: 3.2m tok/s: 8291385 +2036/20000 train_loss: 2.6146 train_time: 3.2m tok/s: 8291379 +2037/20000 train_loss: 2.8215 train_time: 3.2m tok/s: 8291311 +2038/20000 train_loss: 2.5898 train_time: 3.2m tok/s: 8291246 +2039/20000 train_loss: 2.6309 train_time: 3.2m tok/s: 8291310 +2040/20000 train_loss: 2.5776 train_time: 3.2m tok/s: 8291315 +2041/20000 train_loss: 2.6383 train_time: 3.2m tok/s: 8291318 +2042/20000 train_loss: 2.5297 train_time: 3.2m tok/s: 8291281 +2043/20000 train_loss: 2.4900 train_time: 3.2m tok/s: 8291254 +2044/20000 train_loss: 2.6508 train_time: 3.2m tok/s: 8291273 +2045/20000 train_loss: 2.4366 train_time: 3.2m tok/s: 8291291 +2046/20000 train_loss: 2.4630 train_time: 3.2m tok/s: 8291344 +2047/20000 train_loss: 2.7441 train_time: 3.2m tok/s: 8291379 +2048/20000 train_loss: 2.5711 train_time: 3.2m tok/s: 8291419 +2049/20000 train_loss: 2.7087 train_time: 3.2m tok/s: 8291483 +2050/20000 train_loss: 2.6641 train_time: 3.2m tok/s: 8291453 +2051/20000 train_loss: 2.6474 train_time: 3.2m tok/s: 8291482 +2052/20000 train_loss: 2.5261 train_time: 3.2m tok/s: 8291515 +2053/20000 train_loss: 2.6339 train_time: 3.2m tok/s: 8291529 +2054/20000 train_loss: 2.6695 train_time: 3.2m tok/s: 8291551 +2055/20000 train_loss: 2.5748 train_time: 3.2m tok/s: 8291526 +2056/20000 train_loss: 2.6133 train_time: 3.3m tok/s: 8291547 +2057/20000 train_loss: 2.6575 train_time: 3.3m tok/s: 8291507 +2058/20000 train_loss: 2.5596 train_time: 3.3m tok/s: 8291530 +2059/20000 train_loss: 2.5005 train_time: 3.3m tok/s: 8291544 +2060/20000 train_loss: 2.5871 train_time: 3.3m tok/s: 8291571 +2061/20000 train_loss: 2.5861 train_time: 3.3m tok/s: 8291583 +2062/20000 train_loss: 2.6091 train_time: 3.3m tok/s: 8291609 +2063/20000 train_loss: 2.5513 train_time: 3.3m tok/s: 8291594 +2064/20000 train_loss: 2.7987 train_time: 3.3m tok/s: 8291620 +2065/20000 train_loss: 2.5345 train_time: 3.3m tok/s: 8291640 +2066/20000 train_loss: 2.6121 train_time: 3.3m tok/s: 8291576 +2067/20000 
train_loss: 2.6664 train_time: 3.3m tok/s: 8291515 +2068/20000 train_loss: 2.6048 train_time: 3.3m tok/s: 8291539 +2069/20000 train_loss: 2.4607 train_time: 3.3m tok/s: 8291538 +2070/20000 train_loss: 2.6100 train_time: 3.3m tok/s: 8291578 +2071/20000 train_loss: 2.5489 train_time: 3.3m tok/s: 8291588 +2072/20000 train_loss: 2.5948 train_time: 3.3m tok/s: 8291586 +2073/20000 train_loss: 2.5307 train_time: 3.3m tok/s: 8291561 +2074/20000 train_loss: 2.6917 train_time: 3.3m tok/s: 8291554 +2075/20000 train_loss: 2.5746 train_time: 3.3m tok/s: 8291600 +2076/20000 train_loss: 2.6688 train_time: 3.3m tok/s: 8291607 +2077/20000 train_loss: 3.5665 train_time: 3.3m tok/s: 8291530 +2078/20000 train_loss: 2.7101 train_time: 3.3m tok/s: 8291442 +2079/20000 train_loss: 2.6564 train_time: 3.3m tok/s: 8291442 +2080/20000 train_loss: 2.6023 train_time: 3.3m tok/s: 8291407 +2081/20000 train_loss: 2.6039 train_time: 3.3m tok/s: 8291359 +2082/20000 train_loss: 2.5942 train_time: 3.3m tok/s: 8291387 +2083/20000 train_loss: 2.5436 train_time: 3.3m tok/s: 8291245 +2084/20000 train_loss: 2.5772 train_time: 3.3m tok/s: 8291274 +2085/20000 train_loss: 2.5786 train_time: 3.3m tok/s: 8291131 +2086/20000 train_loss: 2.6103 train_time: 3.3m tok/s: 8291312 +2087/20000 train_loss: 2.5430 train_time: 3.3m tok/s: 8291297 +2088/20000 train_loss: 2.4582 train_time: 3.3m tok/s: 8291273 +2089/20000 train_loss: 2.6284 train_time: 3.3m tok/s: 8291313 +2090/20000 train_loss: 2.7349 train_time: 3.3m tok/s: 8291358 +2091/20000 train_loss: 2.6191 train_time: 3.3m tok/s: 8291388 +2092/20000 train_loss: 2.6517 train_time: 3.3m tok/s: 8291395 +2093/20000 train_loss: 2.6436 train_time: 3.3m tok/s: 8291433 +2094/20000 train_loss: 2.5984 train_time: 3.3m tok/s: 8291465 +2095/20000 train_loss: 2.5864 train_time: 3.3m tok/s: 8291517 +2096/20000 train_loss: 2.6904 train_time: 3.3m tok/s: 8291519 +2097/20000 train_loss: 2.5687 train_time: 3.3m tok/s: 8291527 +2098/20000 train_loss: 2.4823 train_time: 3.3m tok/s: 
8291539 +2099/20000 train_loss: 2.4886 train_time: 3.3m tok/s: 8291553 +2100/20000 train_loss: 2.5773 train_time: 3.3m tok/s: 8291564 +2101/20000 train_loss: 2.6519 train_time: 3.3m tok/s: 8291492 +2102/20000 train_loss: 2.5672 train_time: 3.3m tok/s: 8291475 +2103/20000 train_loss: 2.5789 train_time: 3.3m tok/s: 8291463 +2104/20000 train_loss: 2.7015 train_time: 3.3m tok/s: 8291520 +2105/20000 train_loss: 2.7181 train_time: 3.3m tok/s: 8291556 +2106/20000 train_loss: 2.7032 train_time: 3.3m tok/s: 8291602 +2107/20000 train_loss: 2.5995 train_time: 3.3m tok/s: 8291610 +2108/20000 train_loss: 2.5364 train_time: 3.3m tok/s: 8291600 +2109/20000 train_loss: 2.7123 train_time: 3.3m tok/s: 8291598 +2110/20000 train_loss: 2.5728 train_time: 3.3m tok/s: 8291613 +2111/20000 train_loss: 2.5540 train_time: 3.3m tok/s: 8291611 +2112/20000 train_loss: 2.5530 train_time: 3.3m tok/s: 8291626 +2113/20000 train_loss: 2.5222 train_time: 3.3m tok/s: 8291629 +2114/20000 train_loss: 2.7815 train_time: 3.3m tok/s: 8291652 +2115/20000 train_loss: 2.4908 train_time: 3.3m tok/s: 8291667 +2116/20000 train_loss: 2.6482 train_time: 3.3m tok/s: 8291725 +2117/20000 train_loss: 2.6899 train_time: 3.3m tok/s: 8291766 +2118/20000 train_loss: 2.6936 train_time: 3.3m tok/s: 8291820 +2119/20000 train_loss: 2.7927 train_time: 3.3m tok/s: 8291844 +2120/20000 train_loss: 2.6526 train_time: 3.4m tok/s: 8291865 +2121/20000 train_loss: 2.6317 train_time: 3.4m tok/s: 8291869 +2122/20000 train_loss: 2.5488 train_time: 3.4m tok/s: 8291882 +2123/20000 train_loss: 2.4025 train_time: 3.4m tok/s: 8291861 +2124/20000 train_loss: 2.6329 train_time: 3.4m tok/s: 8291883 +2125/20000 train_loss: 2.6149 train_time: 3.4m tok/s: 8291962 +2126/20000 train_loss: 2.6272 train_time: 3.4m tok/s: 8291961 +2127/20000 train_loss: 2.5549 train_time: 3.4m tok/s: 8291912 +2128/20000 train_loss: 2.5337 train_time: 3.4m tok/s: 8291935 +2129/20000 train_loss: 2.3653 train_time: 3.4m tok/s: 8291928 +2130/20000 train_loss: 2.6738 
train_time: 3.4m tok/s: 8291946 +2131/20000 train_loss: 2.5861 train_time: 3.4m tok/s: 8291967 +2132/20000 train_loss: 2.9060 train_time: 3.4m tok/s: 8291978 +2133/20000 train_loss: 2.6964 train_time: 3.4m tok/s: 8292010 +2134/20000 train_loss: 2.6296 train_time: 3.4m tok/s: 8291934 +2135/20000 train_loss: 2.5824 train_time: 3.4m tok/s: 8291902 +2136/20000 train_loss: 2.5968 train_time: 3.4m tok/s: 8291967 +2137/20000 train_loss: 2.5364 train_time: 3.4m tok/s: 8291985 +2138/20000 train_loss: 2.5873 train_time: 3.4m tok/s: 8291997 +2139/20000 train_loss: 2.6778 train_time: 3.4m tok/s: 8291982 +2140/20000 train_loss: 2.5487 train_time: 3.4m tok/s: 8291992 +2141/20000 train_loss: 2.5237 train_time: 3.4m tok/s: 8292019 +2142/20000 train_loss: 2.6039 train_time: 3.4m tok/s: 8292026 +2143/20000 train_loss: 2.4848 train_time: 3.4m tok/s: 8292002 +2144/20000 train_loss: 2.5850 train_time: 3.4m tok/s: 8292042 +2145/20000 train_loss: 2.6658 train_time: 3.4m tok/s: 8292044 +2146/20000 train_loss: 2.5916 train_time: 3.4m tok/s: 8292004 +2147/20000 train_loss: 2.5544 train_time: 3.4m tok/s: 8291992 +2148/20000 train_loss: 2.7034 train_time: 3.4m tok/s: 8291986 +2149/20000 train_loss: 2.5341 train_time: 3.4m tok/s: 8291992 +2150/20000 train_loss: 2.7333 train_time: 3.4m tok/s: 8291989 +2151/20000 train_loss: 2.6847 train_time: 3.4m tok/s: 8292001 +2152/20000 train_loss: 2.5944 train_time: 3.4m tok/s: 8291998 +2153/20000 train_loss: 2.4621 train_time: 3.4m tok/s: 8291988 +2154/20000 train_loss: 2.6287 train_time: 3.4m tok/s: 8291979 +2155/20000 train_loss: 2.5160 train_time: 3.4m tok/s: 8291977 +2156/20000 train_loss: 2.6089 train_time: 3.4m tok/s: 8291971 +2157/20000 train_loss: 2.5869 train_time: 3.4m tok/s: 8292034 +2158/20000 train_loss: 2.5861 train_time: 3.4m tok/s: 8292027 +2159/20000 train_loss: 2.5334 train_time: 3.4m tok/s: 8292033 +2160/20000 train_loss: 2.4307 train_time: 3.4m tok/s: 8292067 +2161/20000 train_loss: 2.6130 train_time: 3.4m tok/s: 8292087 +2162/20000 
train_loss: 2.6328 train_time: 3.4m tok/s: 8292131 +2163/20000 train_loss: 2.5553 train_time: 3.4m tok/s: 8292134 +2164/20000 train_loss: 2.6036 train_time: 3.4m tok/s: 8292154 +2165/20000 train_loss: 2.6352 train_time: 3.4m tok/s: 8292156 +2166/20000 train_loss: 2.5209 train_time: 3.4m tok/s: 8292145 +2167/20000 train_loss: 2.5253 train_time: 3.4m tok/s: 8292092 +2168/20000 train_loss: 2.5819 train_time: 3.4m tok/s: 8292073 +2169/20000 train_loss: 2.6276 train_time: 3.4m tok/s: 8292028 +2170/20000 train_loss: 2.4920 train_time: 3.4m tok/s: 8292015 +2171/20000 train_loss: 2.6018 train_time: 3.4m tok/s: 8292012 +2172/20000 train_loss: 2.5086 train_time: 3.4m tok/s: 8292024 +2173/20000 train_loss: 2.7131 train_time: 3.4m tok/s: 8292053 +2174/20000 train_loss: 2.5495 train_time: 3.4m tok/s: 8292066 +2175/20000 train_loss: 2.4521 train_time: 3.4m tok/s: 8292027 +2176/20000 train_loss: 2.6667 train_time: 3.4m tok/s: 8292053 +2177/20000 train_loss: 2.6090 train_time: 3.4m tok/s: 8292053 +2178/20000 train_loss: 2.5260 train_time: 3.4m tok/s: 8292101 +2179/20000 train_loss: 2.7057 train_time: 3.4m tok/s: 8292117 +2180/20000 train_loss: 2.5616 train_time: 3.4m tok/s: 8292131 +2181/20000 train_loss: 2.5550 train_time: 3.4m tok/s: 8292114 +2182/20000 train_loss: 2.4372 train_time: 3.4m tok/s: 8292150 +2183/20000 train_loss: 2.5303 train_time: 3.5m tok/s: 8292185 +2184/20000 train_loss: 2.5372 train_time: 3.5m tok/s: 8292184 +2185/20000 train_loss: 2.4353 train_time: 3.5m tok/s: 8292186 +2186/20000 train_loss: 2.7413 train_time: 3.5m tok/s: 8292203 +2187/20000 train_loss: 2.5540 train_time: 3.5m tok/s: 8292166 +2188/20000 train_loss: 2.5506 train_time: 3.5m tok/s: 8292175 +2189/20000 train_loss: 2.6329 train_time: 3.5m tok/s: 8292239 +2190/20000 train_loss: 2.6745 train_time: 3.5m tok/s: 8292276 +2191/20000 train_loss: 2.6067 train_time: 3.5m tok/s: 8292282 +2192/20000 train_loss: 2.6464 train_time: 3.5m tok/s: 8292303 +2193/20000 train_loss: 2.5787 train_time: 3.5m tok/s: 
8292317 +2194/20000 train_loss: 2.5835 train_time: 3.5m tok/s: 8292352 +2195/20000 train_loss: 2.6095 train_time: 3.5m tok/s: 8292307 +2196/20000 train_loss: 2.6028 train_time: 3.5m tok/s: 8292291 +2197/20000 train_loss: 2.5812 train_time: 3.5m tok/s: 8292287 +2198/20000 train_loss: 2.5352 train_time: 3.5m tok/s: 8292293 +2199/20000 train_loss: 2.5373 train_time: 3.5m tok/s: 8292317 +2200/20000 train_loss: 2.6137 train_time: 3.5m tok/s: 8292384 +2201/20000 train_loss: 2.6671 train_time: 3.5m tok/s: 8292395 +2202/20000 train_loss: 2.5884 train_time: 3.5m tok/s: 8292340 +2203/20000 train_loss: 2.5674 train_time: 3.5m tok/s: 8292247 +2204/20000 train_loss: 2.4201 train_time: 3.5m tok/s: 8292253 +2205/20000 train_loss: 2.6445 train_time: 3.5m tok/s: 8292220 +2206/20000 train_loss: 2.5663 train_time: 3.5m tok/s: 8292240 +2207/20000 train_loss: 2.5223 train_time: 3.5m tok/s: 8292248 +2208/20000 train_loss: 2.6521 train_time: 3.5m tok/s: 8292267 +2209/20000 train_loss: 2.7697 train_time: 3.5m tok/s: 8292313 +2210/20000 train_loss: 2.7221 train_time: 3.5m tok/s: 8292332 +2211/20000 train_loss: 2.6573 train_time: 3.5m tok/s: 8292316 +2212/20000 train_loss: 2.4505 train_time: 3.5m tok/s: 8292341 +layer_loop:enabled step:2212 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2213/20000 train_loss: 2.8799 train_time: 3.5m tok/s: 8290354 +2214/20000 train_loss: 2.6519 train_time: 3.5m tok/s: 8288573 +2215/20000 train_loss: 2.6916 train_time: 3.5m tok/s: 8286861 +2216/20000 train_loss: 2.6041 train_time: 3.5m tok/s: 8285088 +2217/20000 train_loss: 2.7447 train_time: 3.5m tok/s: 8283239 +2218/20000 train_loss: 2.7245 train_time: 3.5m tok/s: 8281466 +2219/20000 train_loss: 2.5453 train_time: 3.5m tok/s: 8279704 +2220/20000 train_loss: 2.5754 train_time: 3.5m tok/s: 8277888 +2221/20000 train_loss: 2.7018 train_time: 3.5m tok/s: 8276155 +2222/20000 train_loss: 2.5240 train_time: 3.5m tok/s: 8274390 +2223/20000 train_loss: 2.6029 train_time: 3.5m 
tok/s: 8272593 +2224/20000 train_loss: 2.4153 train_time: 3.5m tok/s: 8270836 +2225/20000 train_loss: 2.5763 train_time: 3.5m tok/s: 8269044 +2226/20000 train_loss: 2.5582 train_time: 3.5m tok/s: 8267335 +2227/20000 train_loss: 2.5307 train_time: 3.5m tok/s: 8265609 +2228/20000 train_loss: 2.5708 train_time: 3.5m tok/s: 8263854 +2229/20000 train_loss: 2.5299 train_time: 3.5m tok/s: 8262086 +2230/20000 train_loss: 2.5482 train_time: 3.5m tok/s: 8260364 +2231/20000 train_loss: 2.3497 train_time: 3.5m tok/s: 8258610 +2232/20000 train_loss: 2.5423 train_time: 3.5m tok/s: 8256806 +2233/20000 train_loss: 2.6722 train_time: 3.5m tok/s: 8255141 +2234/20000 train_loss: 2.7006 train_time: 3.5m tok/s: 8253433 +2235/20000 train_loss: 2.6533 train_time: 3.6m tok/s: 8251728 +2236/20000 train_loss: 2.6055 train_time: 3.6m tok/s: 8250008 +2237/20000 train_loss: 2.6368 train_time: 3.6m tok/s: 8248263 +2238/20000 train_loss: 2.6565 train_time: 3.6m tok/s: 8246520 +2239/20000 train_loss: 2.7573 train_time: 3.6m tok/s: 8244744 +2240/20000 train_loss: 2.7669 train_time: 3.6m tok/s: 8243047 +2241/20000 train_loss: 2.4029 train_time: 3.6m tok/s: 8241343 +2242/20000 train_loss: 2.5853 train_time: 3.6m tok/s: 8239606 +2243/20000 train_loss: 2.5704 train_time: 3.6m tok/s: 8237926 +2244/20000 train_loss: 2.5933 train_time: 3.6m tok/s: 8236212 +2245/20000 train_loss: 2.6316 train_time: 3.6m tok/s: 8234470 +2246/20000 train_loss: 2.5407 train_time: 3.6m tok/s: 8232796 +2247/20000 train_loss: 2.6831 train_time: 3.6m tok/s: 8231117 +2248/20000 train_loss: 2.6684 train_time: 3.6m tok/s: 8229408 +2249/20000 train_loss: 2.5449 train_time: 3.6m tok/s: 8227727 +2250/20000 train_loss: 2.5389 train_time: 3.6m tok/s: 8226015 +2251/20000 train_loss: 2.5833 train_time: 3.6m tok/s: 8224343 +2252/20000 train_loss: 2.7330 train_time: 3.6m tok/s: 8222654 +2253/20000 train_loss: 2.5716 train_time: 3.6m tok/s: 8221000 +2254/20000 train_loss: 2.5642 train_time: 3.6m tok/s: 8219232 +2255/20000 train_loss: 2.4265 
train_time: 3.6m tok/s: 8217526 +2256/20000 train_loss: 2.4719 train_time: 3.6m tok/s: 8215859 +2257/20000 train_loss: 2.5176 train_time: 3.6m tok/s: 8214104 +2258/20000 train_loss: 2.5385 train_time: 3.6m tok/s: 8212333 +2259/20000 train_loss: 2.5620 train_time: 3.6m tok/s: 8210689 +2260/20000 train_loss: 2.5165 train_time: 3.6m tok/s: 8208987 +2261/20000 train_loss: 2.6054 train_time: 3.6m tok/s: 8207354 +2262/20000 train_loss: 2.7081 train_time: 3.6m tok/s: 8205642 +2263/20000 train_loss: 2.6994 train_time: 3.6m tok/s: 8203990 +2264/20000 train_loss: 2.4714 train_time: 3.6m tok/s: 8202280 +2265/20000 train_loss: 2.5564 train_time: 3.6m tok/s: 8200693 +2266/20000 train_loss: 2.5837 train_time: 3.6m tok/s: 8198999 +2267/20000 train_loss: 2.6004 train_time: 3.6m tok/s: 8197349 +2268/20000 train_loss: 2.6907 train_time: 3.6m tok/s: 8195715 +2269/20000 train_loss: 2.7710 train_time: 3.6m tok/s: 8194023 +2270/20000 train_loss: 2.5252 train_time: 3.6m tok/s: 8192395 +2271/20000 train_loss: 2.5757 train_time: 3.6m tok/s: 8190703 +2272/20000 train_loss: 2.5117 train_time: 3.6m tok/s: 8189072 +2273/20000 train_loss: 2.5748 train_time: 3.6m tok/s: 8187456 +2274/20000 train_loss: 3.2742 train_time: 3.6m tok/s: 8185755 +2275/20000 train_loss: 2.4158 train_time: 3.6m tok/s: 8184062 +2276/20000 train_loss: 2.5209 train_time: 3.6m tok/s: 8182390 +2277/20000 train_loss: 2.7794 train_time: 3.6m tok/s: 8180752 +2278/20000 train_loss: 2.7189 train_time: 3.7m tok/s: 8179155 +2279/20000 train_loss: 2.6406 train_time: 3.7m tok/s: 8177514 +2280/20000 train_loss: 2.7333 train_time: 3.7m tok/s: 8175910 +2281/20000 train_loss: 2.5420 train_time: 3.7m tok/s: 8174351 +2282/20000 train_loss: 2.7305 train_time: 3.7m tok/s: 8172751 +2283/20000 train_loss: 2.9771 train_time: 3.7m tok/s: 8171110 +2284/20000 train_loss: 2.4886 train_time: 3.7m tok/s: 8169473 +2285/20000 train_loss: 2.5270 train_time: 3.7m tok/s: 8167872 +2286/20000 train_loss: 2.4520 train_time: 3.7m tok/s: 8166147 +2287/20000 
train_loss: 2.4572 train_time: 3.7m tok/s: 8164521 +2288/20000 train_loss: 2.6488 train_time: 3.7m tok/s: 8162881 +2289/20000 train_loss: 2.6620 train_time: 3.7m tok/s: 8161279 +2290/20000 train_loss: 2.6427 train_time: 3.7m tok/s: 8159681 +2291/20000 train_loss: 2.5350 train_time: 3.7m tok/s: 8158101 +2292/20000 train_loss: 2.4754 train_time: 3.7m tok/s: 8156495 +2293/20000 train_loss: 2.5558 train_time: 3.7m tok/s: 8154897 +2294/20000 train_loss: 2.4628 train_time: 3.7m tok/s: 8153329 +2295/20000 train_loss: 2.6636 train_time: 3.7m tok/s: 8151746 +2296/20000 train_loss: 2.5104 train_time: 3.7m tok/s: 8150172 +2297/20000 train_loss: 2.6258 train_time: 3.7m tok/s: 8148532 +2298/20000 train_loss: 2.4971 train_time: 3.7m tok/s: 8146874 +2299/20000 train_loss: 2.5963 train_time: 3.7m tok/s: 8145301 +2300/20000 train_loss: 2.6222 train_time: 3.7m tok/s: 8143686 +2301/20000 train_loss: 2.3976 train_time: 3.7m tok/s: 8142088 +2302/20000 train_loss: 2.5739 train_time: 3.7m tok/s: 8140549 +2303/20000 train_loss: 2.4994 train_time: 3.7m tok/s: 8138982 +2304/20000 train_loss: 2.5911 train_time: 3.7m tok/s: 8137362 +2305/20000 train_loss: 2.4470 train_time: 3.7m tok/s: 8135676 +2306/20000 train_loss: 2.6719 train_time: 3.7m tok/s: 8134117 +2307/20000 train_loss: 2.6450 train_time: 3.7m tok/s: 8132587 +2308/20000 train_loss: 2.5860 train_time: 3.7m tok/s: 8131064 +2309/20000 train_loss: 2.5421 train_time: 3.7m tok/s: 8129504 +2310/20000 train_loss: 2.6229 train_time: 3.7m tok/s: 8127919 +2311/20000 train_loss: 2.6152 train_time: 3.7m tok/s: 8126410 +2312/20000 train_loss: 2.6310 train_time: 3.7m tok/s: 8124792 +2313/20000 train_loss: 2.6181 train_time: 3.7m tok/s: 8123245 +2314/20000 train_loss: 2.4363 train_time: 3.7m tok/s: 8121661 +2315/20000 train_loss: 2.4126 train_time: 3.7m tok/s: 8120063 +2316/20000 train_loss: 2.3652 train_time: 3.7m tok/s: 8118503 +2317/20000 train_loss: 2.7504 train_time: 3.7m tok/s: 8116844 +2318/20000 train_loss: 2.6389 train_time: 3.7m tok/s: 
8115300 +2319/20000 train_loss: 2.4504 train_time: 3.7m tok/s: 8113742 +2320/20000 train_loss: 2.6196 train_time: 3.7m tok/s: 8112199 +2321/20000 train_loss: 2.5968 train_time: 3.8m tok/s: 8110654 +2322/20000 train_loss: 2.4604 train_time: 3.8m tok/s: 8109112 +2323/20000 train_loss: 2.6340 train_time: 3.8m tok/s: 8107570 +2324/20000 train_loss: 2.5737 train_time: 3.8m tok/s: 8105969 +2325/20000 train_loss: 2.5943 train_time: 3.8m tok/s: 8104406 +2326/20000 train_loss: 2.5995 train_time: 3.8m tok/s: 8102870 +2327/20000 train_loss: 2.5770 train_time: 3.8m tok/s: 8101346 +2328/20000 train_loss: 2.5304 train_time: 3.8m tok/s: 8099845 +2329/20000 train_loss: 2.4132 train_time: 3.8m tok/s: 8098350 +2330/20000 train_loss: 2.6775 train_time: 3.8m tok/s: 8096797 +2331/20000 train_loss: 2.5801 train_time: 3.8m tok/s: 8095266 +2332/20000 train_loss: 2.3998 train_time: 3.8m tok/s: 8093731 +2333/20000 train_loss: 2.6232 train_time: 3.8m tok/s: 8092213 +2334/20000 train_loss: 2.2830 train_time: 3.8m tok/s: 8090597 +2335/20000 train_loss: 2.5444 train_time: 3.8m tok/s: 8089121 +2336/20000 train_loss: 2.6277 train_time: 3.8m tok/s: 8087702 +2337/20000 train_loss: 2.6572 train_time: 3.8m tok/s: 8086194 +2338/20000 train_loss: 2.5334 train_time: 3.8m tok/s: 8084636 +2339/20000 train_loss: 2.6192 train_time: 3.8m tok/s: 8083138 +2340/20000 train_loss: 2.5719 train_time: 3.8m tok/s: 8081630 +2341/20000 train_loss: 2.5547 train_time: 3.8m tok/s: 8080180 +2342/20000 train_loss: 2.5023 train_time: 3.8m tok/s: 8078631 +2343/20000 train_loss: 2.4343 train_time: 3.8m tok/s: 8077146 +2344/20000 train_loss: 2.7054 train_time: 3.8m tok/s: 8075619 +2345/20000 train_loss: 3.0418 train_time: 3.8m tok/s: 8074088 +2346/20000 train_loss: 2.5283 train_time: 3.8m tok/s: 8072573 +2347/20000 train_loss: 2.5334 train_time: 3.8m tok/s: 8071027 +2348/20000 train_loss: 2.7228 train_time: 3.8m tok/s: 8069535 +2349/20000 train_loss: 2.5956 train_time: 3.8m tok/s: 8068062 +2350/20000 train_loss: 2.5778 
train_time: 3.8m tok/s: 8066566 +2351/20000 train_loss: 2.5684 train_time: 3.8m tok/s: 8065122 +2352/20000 train_loss: 2.5947 train_time: 3.8m tok/s: 8063623 +2353/20000 train_loss: 2.4990 train_time: 3.8m tok/s: 8062053 +2354/20000 train_loss: 2.5453 train_time: 3.8m tok/s: 8060607 +2355/20000 train_loss: 2.5102 train_time: 3.8m tok/s: 8059138 +2356/20000 train_loss: 2.5687 train_time: 3.8m tok/s: 8057655 +2357/20000 train_loss: 2.5197 train_time: 3.8m tok/s: 8056171 +2358/20000 train_loss: 2.5226 train_time: 3.8m tok/s: 8054705 +2359/20000 train_loss: 2.5087 train_time: 3.8m tok/s: 8053218 +2360/20000 train_loss: 2.5832 train_time: 3.8m tok/s: 8051739 +2361/20000 train_loss: 2.5939 train_time: 3.8m tok/s: 8050274 +2362/20000 train_loss: 2.5251 train_time: 3.8m tok/s: 8048713 +2363/20000 train_loss: 2.5409 train_time: 3.8m tok/s: 8047267 +2364/20000 train_loss: 2.6554 train_time: 3.9m tok/s: 8045741 +2365/20000 train_loss: 2.5557 train_time: 3.9m tok/s: 8044303 +2366/20000 train_loss: 2.6198 train_time: 3.9m tok/s: 8042844 +2367/20000 train_loss: 2.5366 train_time: 3.9m tok/s: 8041390 +2368/20000 train_loss: 2.6843 train_time: 3.9m tok/s: 8039953 +2369/20000 train_loss: 2.5187 train_time: 3.9m tok/s: 8038493 +2370/20000 train_loss: 2.6036 train_time: 3.9m tok/s: 8037038 +2371/20000 train_loss: 2.5908 train_time: 3.9m tok/s: 8035597 +2372/20000 train_loss: 2.6324 train_time: 3.9m tok/s: 8034165 +2373/20000 train_loss: 2.4956 train_time: 3.9m tok/s: 8032675 +2374/20000 train_loss: 2.5590 train_time: 3.9m tok/s: 8031204 +2375/20000 train_loss: 2.5488 train_time: 3.9m tok/s: 8029757 +2376/20000 train_loss: 2.4979 train_time: 3.9m tok/s: 8028313 +2377/20000 train_loss: 2.4139 train_time: 3.9m tok/s: 8026837 +2378/20000 train_loss: 2.5139 train_time: 3.9m tok/s: 8025375 +2379/20000 train_loss: 2.8960 train_time: 3.9m tok/s: 8023909 +2380/20000 train_loss: 2.5065 train_time: 3.9m tok/s: 8022477 +2381/20000 train_loss: 2.6692 train_time: 3.9m tok/s: 8021092 +2382/20000 
train_loss: 2.4629 train_time: 3.9m tok/s: 8019665 +2383/20000 train_loss: 2.6464 train_time: 3.9m tok/s: 8018186 +2384/20000 train_loss: 2.6560 train_time: 3.9m tok/s: 8016766 +2385/20000 train_loss: 2.6713 train_time: 3.9m tok/s: 8015286 +2386/20000 train_loss: 2.6318 train_time: 3.9m tok/s: 8013851 +2387/20000 train_loss: 2.4930 train_time: 3.9m tok/s: 8012418 +2388/20000 train_loss: 2.9492 train_time: 3.9m tok/s: 8010841 +2389/20000 train_loss: 2.3594 train_time: 3.9m tok/s: 8009289 +2390/20000 train_loss: 2.5891 train_time: 3.9m tok/s: 8007885 +2391/20000 train_loss: 2.5180 train_time: 3.9m tok/s: 8006443 +2392/20000 train_loss: 2.6426 train_time: 3.9m tok/s: 8005056 +2393/20000 train_loss: 2.5431 train_time: 3.9m tok/s: 8003686 +2394/20000 train_loss: 2.6311 train_time: 3.9m tok/s: 8002287 +2395/20000 train_loss: 2.6373 train_time: 3.9m tok/s: 8000897 +2396/20000 train_loss: 2.6038 train_time: 3.9m tok/s: 7999543 +2397/20000 train_loss: 2.6886 train_time: 3.9m tok/s: 7998135 +2398/20000 train_loss: 2.5166 train_time: 3.9m tok/s: 7996718 +2399/20000 train_loss: 2.5219 train_time: 3.9m tok/s: 7995316 +2400/20000 train_loss: 2.5231 train_time: 3.9m tok/s: 7993898 +2401/20000 train_loss: 2.6050 train_time: 3.9m tok/s: 7992528 +2402/20000 train_loss: 2.5077 train_time: 3.9m tok/s: 7991141 +2403/20000 train_loss: 2.8782 train_time: 3.9m tok/s: 7989662 +2404/20000 train_loss: 2.5499 train_time: 3.9m tok/s: 7988286 +2405/20000 train_loss: 2.4814 train_time: 3.9m tok/s: 7986928 +2406/20000 train_loss: 2.5731 train_time: 3.9m tok/s: 7985535 +2407/20000 train_loss: 2.5799 train_time: 4.0m tok/s: 7984206 +2408/20000 train_loss: 2.6559 train_time: 4.0m tok/s: 7982802 +2409/20000 train_loss: 2.5912 train_time: 4.0m tok/s: 7981413 +2410/20000 train_loss: 2.5747 train_time: 4.0m tok/s: 7979998 +2411/20000 train_loss: 2.5325 train_time: 4.0m tok/s: 7978635 +2412/20000 train_loss: 2.6527 train_time: 4.0m tok/s: 7977222 +2413/20000 train_loss: 2.4942 train_time: 4.0m tok/s: 
7975789 +2414/20000 train_loss: 2.5594 train_time: 4.0m tok/s: 7974456 +2415/20000 train_loss: 2.5489 train_time: 4.0m tok/s: 7973052 +2416/20000 train_loss: 2.5811 train_time: 4.0m tok/s: 7971662 +2417/20000 train_loss: 2.5505 train_time: 4.0m tok/s: 7970258 +2418/20000 train_loss: 2.5037 train_time: 4.0m tok/s: 7968863 +2419/20000 train_loss: 2.5496 train_time: 4.0m tok/s: 7967501 +2420/20000 train_loss: 2.5951 train_time: 4.0m tok/s: 7966122 +2421/20000 train_loss: 2.6249 train_time: 4.0m tok/s: 7964789 +2422/20000 train_loss: 2.6069 train_time: 4.0m tok/s: 7963436 +2423/20000 train_loss: 2.5030 train_time: 4.0m tok/s: 7962059 +2424/20000 train_loss: 2.6222 train_time: 4.0m tok/s: 7960703 +2425/20000 train_loss: 2.6090 train_time: 4.0m tok/s: 7959290 +2426/20000 train_loss: 2.5448 train_time: 4.0m tok/s: 7957937 +2427/20000 train_loss: 2.4147 train_time: 4.0m tok/s: 7956549 +2428/20000 train_loss: 2.5232 train_time: 4.0m tok/s: 7955142 +2429/20000 train_loss: 2.4949 train_time: 4.0m tok/s: 7953808 +2430/20000 train_loss: 2.4655 train_time: 4.0m tok/s: 7952460 +2431/20000 train_loss: 2.5862 train_time: 4.0m tok/s: 7951016 +2432/20000 train_loss: 2.5450 train_time: 4.0m tok/s: 7949686 +2433/20000 train_loss: 2.6749 train_time: 4.0m tok/s: 7948370 +2434/20000 train_loss: 2.4874 train_time: 4.0m tok/s: 7947004 +2435/20000 train_loss: 2.6716 train_time: 4.0m tok/s: 7945646 +2436/20000 train_loss: 2.5349 train_time: 4.0m tok/s: 7944317 +2437/20000 train_loss: 2.5780 train_time: 4.0m tok/s: 7942985 +2438/20000 train_loss: 2.5266 train_time: 4.0m tok/s: 7941623 +2439/20000 train_loss: 2.5257 train_time: 4.0m tok/s: 7940262 +2440/20000 train_loss: 2.5050 train_time: 4.0m tok/s: 7938922 +2441/20000 train_loss: 2.5389 train_time: 4.0m tok/s: 7937599 +2442/20000 train_loss: 2.5173 train_time: 4.0m tok/s: 7936280 +2443/20000 train_loss: 2.6372 train_time: 4.0m tok/s: 7934920 +2444/20000 train_loss: 2.5840 train_time: 4.0m tok/s: 7933613 +2445/20000 train_loss: 2.5051 
train_time: 4.0m tok/s: 7932265 +2446/20000 train_loss: 2.7138 train_time: 4.0m tok/s: 7930910 +2447/20000 train_loss: 2.7230 train_time: 4.0m tok/s: 7929570 +2448/20000 train_loss: 2.5842 train_time: 4.0m tok/s: 7928250 +2449/20000 train_loss: 2.4981 train_time: 4.0m tok/s: 7926894 +2450/20000 train_loss: 2.5508 train_time: 4.1m tok/s: 7925554 +2451/20000 train_loss: 2.5417 train_time: 4.1m tok/s: 7924270 +2452/20000 train_loss: 2.5665 train_time: 4.1m tok/s: 7922896 +2453/20000 train_loss: 2.4379 train_time: 4.1m tok/s: 7921562 +2454/20000 train_loss: 2.4481 train_time: 4.1m tok/s: 7920249 +2455/20000 train_loss: 2.5500 train_time: 4.1m tok/s: 7918857 +2456/20000 train_loss: 2.5582 train_time: 4.1m tok/s: 7917555 +2457/20000 train_loss: 2.6213 train_time: 4.1m tok/s: 7916274 +2458/20000 train_loss: 2.7157 train_time: 4.1m tok/s: 7914972 +2459/20000 train_loss: 2.5600 train_time: 4.1m tok/s: 7913714 +2460/20000 train_loss: 2.5860 train_time: 4.1m tok/s: 7912380 +2461/20000 train_loss: 2.6626 train_time: 4.1m tok/s: 7911065 +2462/20000 train_loss: 2.5418 train_time: 4.1m tok/s: 7909771 +2463/20000 train_loss: 2.6028 train_time: 4.1m tok/s: 7908450 +2464/20000 train_loss: 2.5220 train_time: 4.1m tok/s: 7907099 +2465/20000 train_loss: 2.6131 train_time: 4.1m tok/s: 7905836 +2466/20000 train_loss: 2.3269 train_time: 4.1m tok/s: 7904529 +2467/20000 train_loss: 2.5911 train_time: 4.1m tok/s: 7903200 +2468/20000 train_loss: 2.4780 train_time: 4.1m tok/s: 7901843 +2469/20000 train_loss: 2.5797 train_time: 4.1m tok/s: 7900520 +2470/20000 train_loss: 2.6088 train_time: 4.1m tok/s: 7899251 +2471/20000 train_loss: 2.5813 train_time: 4.1m tok/s: 7897978 +2472/20000 train_loss: 2.6836 train_time: 4.1m tok/s: 7896677 +2473/20000 train_loss: 2.5528 train_time: 4.1m tok/s: 7895370 +2474/20000 train_loss: 2.8426 train_time: 4.1m tok/s: 7894026 +2475/20000 train_loss: 2.6921 train_time: 4.1m tok/s: 7892764 +2476/20000 train_loss: 2.5713 train_time: 4.1m tok/s: 7891456 +2477/20000 
train_loss: 2.4964 train_time: 4.1m tok/s: 7890177 +2478/20000 train_loss: 2.5296 train_time: 4.1m tok/s: 7888883 +2479/20000 train_loss: 2.6447 train_time: 4.1m tok/s: 7887620 +2480/20000 train_loss: 2.5777 train_time: 4.1m tok/s: 7886357 +2481/20000 train_loss: 2.4659 train_time: 4.1m tok/s: 7885071 +2482/20000 train_loss: 2.6205 train_time: 4.1m tok/s: 7883780 +2483/20000 train_loss: 2.5710 train_time: 4.1m tok/s: 7882569 +2484/20000 train_loss: 2.5492 train_time: 4.1m tok/s: 7881243 +2485/20000 train_loss: 2.4896 train_time: 4.1m tok/s: 7879938 +2486/20000 train_loss: 2.5771 train_time: 4.1m tok/s: 7878630 +2487/20000 train_loss: 2.6019 train_time: 4.1m tok/s: 7877391 +2488/20000 train_loss: 2.5726 train_time: 4.1m tok/s: 7876126 +2489/20000 train_loss: 2.4657 train_time: 4.1m tok/s: 7874837 +2490/20000 train_loss: 2.6323 train_time: 4.1m tok/s: 7873601 +2491/20000 train_loss: 2.5807 train_time: 4.1m tok/s: 7872317 +2492/20000 train_loss: 2.5439 train_time: 4.1m tok/s: 7870999 +2493/20000 train_loss: 2.5732 train_time: 4.2m tok/s: 7869723 +2494/20000 train_loss: 2.4490 train_time: 4.2m tok/s: 7868460 +2495/20000 train_loss: 2.4923 train_time: 4.2m tok/s: 7867173 +2496/20000 train_loss: 2.6058 train_time: 4.2m tok/s: 7865971 +2497/20000 train_loss: 2.5675 train_time: 4.2m tok/s: 7864728 +2498/20000 train_loss: 2.5576 train_time: 4.2m tok/s: 7863478 +2499/20000 train_loss: 2.6020 train_time: 4.2m tok/s: 7862215 +2500/20000 train_loss: 2.6705 train_time: 4.2m tok/s: 7860950 +2501/20000 train_loss: 2.5620 train_time: 4.2m tok/s: 7859675 +2502/20000 train_loss: 2.4124 train_time: 4.2m tok/s: 7858456 +2503/20000 train_loss: 2.5204 train_time: 4.2m tok/s: 7857222 +2504/20000 train_loss: 2.6174 train_time: 4.2m tok/s: 7855910 +2505/20000 train_loss: 2.5201 train_time: 4.2m tok/s: 7854668 +2506/20000 train_loss: 2.5713 train_time: 4.2m tok/s: 7853378 +2507/20000 train_loss: 2.4316 train_time: 4.2m tok/s: 7852140 +2508/20000 train_loss: 2.5895 train_time: 4.2m tok/s: 
7850892 +2509/20000 train_loss: 2.6257 train_time: 4.2m tok/s: 7849669 +2510/20000 train_loss: 2.5246 train_time: 4.2m tok/s: 7848394 +2511/20000 train_loss: 2.5743 train_time: 4.2m tok/s: 7847035 +2512/20000 train_loss: 2.6195 train_time: 4.2m tok/s: 7845800 +2513/20000 train_loss: 2.4798 train_time: 4.2m tok/s: 7844607 +2514/20000 train_loss: 2.5702 train_time: 4.2m tok/s: 7843381 +2515/20000 train_loss: 2.6074 train_time: 4.2m tok/s: 7842162 +2516/20000 train_loss: 2.5984 train_time: 4.2m tok/s: 7840882 +2517/20000 train_loss: 2.4406 train_time: 4.2m tok/s: 7839677 +2518/20000 train_loss: 2.5454 train_time: 4.2m tok/s: 7838449 +2519/20000 train_loss: 2.5808 train_time: 4.2m tok/s: 7837193 +2520/20000 train_loss: 2.5346 train_time: 4.2m tok/s: 7836011 +2521/20000 train_loss: 2.6150 train_time: 4.2m tok/s: 7834838 +2522/20000 train_loss: 2.6140 train_time: 4.2m tok/s: 7833634 +2523/20000 train_loss: 2.5583 train_time: 4.2m tok/s: 7832429 +2524/20000 train_loss: 2.5373 train_time: 4.2m tok/s: 7831223 +2525/20000 train_loss: 2.4461 train_time: 4.2m tok/s: 7830025 +2526/20000 train_loss: 2.5357 train_time: 4.2m tok/s: 7828793 +2527/20000 train_loss: 2.5519 train_time: 4.2m tok/s: 7827501 +2528/20000 train_loss: 2.6009 train_time: 4.2m tok/s: 7826299 +2529/20000 train_loss: 2.5582 train_time: 4.2m tok/s: 7825089 +2530/20000 train_loss: 2.4847 train_time: 4.2m tok/s: 7823872 +2531/20000 train_loss: 2.4019 train_time: 4.2m tok/s: 7822675 +2532/20000 train_loss: 2.5244 train_time: 4.2m tok/s: 7821345 +2533/20000 train_loss: 2.5397 train_time: 4.2m tok/s: 7820132 +2534/20000 train_loss: 2.4583 train_time: 4.2m tok/s: 7818957 +2535/20000 train_loss: 2.5752 train_time: 4.3m tok/s: 7817761 +2536/20000 train_loss: 2.5691 train_time: 4.3m tok/s: 7816556 +2537/20000 train_loss: 2.4735 train_time: 4.3m tok/s: 7815344 +2538/20000 train_loss: 2.5922 train_time: 4.3m tok/s: 7814161 +2539/20000 train_loss: 2.7870 train_time: 4.3m tok/s: 7812994 +2540/20000 train_loss: 2.5413 
train_time: 4.3m tok/s: 7811775 +2541/20000 train_loss: 2.5330 train_time: 4.3m tok/s: 7810622 +2542/20000 train_loss: 2.5476 train_time: 4.3m tok/s: 7809366 +2543/20000 train_loss: 2.6343 train_time: 4.3m tok/s: 7808157 +2544/20000 train_loss: 2.6570 train_time: 4.3m tok/s: 7806946 +2545/20000 train_loss: 2.4918 train_time: 4.3m tok/s: 7805722 +2546/20000 train_loss: 2.5387 train_time: 4.3m tok/s: 7804532 +2547/20000 train_loss: 2.5105 train_time: 4.3m tok/s: 7803382 +2548/20000 train_loss: 2.7881 train_time: 4.3m tok/s: 7802182 +2549/20000 train_loss: 2.5493 train_time: 4.3m tok/s: 7800995 +2550/20000 train_loss: 2.8252 train_time: 4.3m tok/s: 7799849 +2551/20000 train_loss: 2.5305 train_time: 4.3m tok/s: 7798682 +2552/20000 train_loss: 2.7463 train_time: 4.3m tok/s: 7797479 +2553/20000 train_loss: 2.5832 train_time: 4.3m tok/s: 7796192 +2554/20000 train_loss: 2.4617 train_time: 4.3m tok/s: 7795069 +2555/20000 train_loss: 2.5515 train_time: 4.3m tok/s: 7793932 +2556/20000 train_loss: 2.5546 train_time: 4.3m tok/s: 7792730 +2557/20000 train_loss: 2.4996 train_time: 4.3m tok/s: 7791572 +2558/20000 train_loss: 2.6084 train_time: 4.3m tok/s: 7790394 +2559/20000 train_loss: 2.4337 train_time: 4.3m tok/s: 7789195 +2560/20000 train_loss: 2.4442 train_time: 4.3m tok/s: 7787957 +2561/20000 train_loss: 2.5290 train_time: 4.3m tok/s: 7786804 +2562/20000 train_loss: 2.4698 train_time: 4.3m tok/s: 7785709 +2563/20000 train_loss: 2.4392 train_time: 4.3m tok/s: 7784496 +2564/20000 train_loss: 2.4310 train_time: 4.3m tok/s: 7783309 +2565/20000 train_loss: 2.5385 train_time: 4.3m tok/s: 7782106 +2566/20000 train_loss: 2.5647 train_time: 4.3m tok/s: 7780999 +2567/20000 train_loss: 2.5855 train_time: 4.3m tok/s: 7779852 +2568/20000 train_loss: 2.6146 train_time: 4.3m tok/s: 7778694 +2569/20000 train_loss: 2.6568 train_time: 4.3m tok/s: 7777564 +2570/20000 train_loss: 2.5494 train_time: 4.3m tok/s: 7776413 +2571/20000 train_loss: 2.6068 train_time: 4.3m tok/s: 7775282 +2572/20000 
train_loss: 2.4359 train_time: 4.3m tok/s: 7774100 +2573/20000 train_loss: 2.4941 train_time: 4.3m tok/s: 7772899 +2574/20000 train_loss: 2.6808 train_time: 4.3m tok/s: 7771728 +2575/20000 train_loss: 2.5845 train_time: 4.3m tok/s: 7770574 +2576/20000 train_loss: 2.5079 train_time: 4.3m tok/s: 7769451 +2577/20000 train_loss: 2.4747 train_time: 4.3m tok/s: 7768306 +2578/20000 train_loss: 2.4417 train_time: 4.4m tok/s: 7767157 +2579/20000 train_loss: 2.5394 train_time: 4.4m tok/s: 7765993 +2580/20000 train_loss: 2.4621 train_time: 4.4m tok/s: 7764825 +2581/20000 train_loss: 2.3444 train_time: 4.4m tok/s: 7763625 +2582/20000 train_loss: 2.5593 train_time: 4.4m tok/s: 7762412 +2583/20000 train_loss: 2.5878 train_time: 4.4m tok/s: 7761326 +2584/20000 train_loss: 2.5856 train_time: 4.4m tok/s: 7760198 +2585/20000 train_loss: 2.4826 train_time: 4.4m tok/s: 7759052 +2586/20000 train_loss: 2.5101 train_time: 4.4m tok/s: 7757931 +2587/20000 train_loss: 2.5421 train_time: 4.4m tok/s: 7756778 +2588/20000 train_loss: 2.6055 train_time: 4.4m tok/s: 7755610 +2589/20000 train_loss: 2.4782 train_time: 4.4m tok/s: 7754475 +2590/20000 train_loss: 2.5138 train_time: 4.4m tok/s: 7753352 +2591/20000 train_loss: 2.4714 train_time: 4.4m tok/s: 7752221 +2592/20000 train_loss: 2.4312 train_time: 4.4m tok/s: 7751070 +2593/20000 train_loss: 2.5056 train_time: 4.4m tok/s: 7749921 +2594/20000 train_loss: 2.4257 train_time: 4.4m tok/s: 7748781 +2595/20000 train_loss: 2.6065 train_time: 4.4m tok/s: 7747643 +2596/20000 train_loss: 3.0948 train_time: 4.4m tok/s: 7746509 +2597/20000 train_loss: 2.4059 train_time: 4.4m tok/s: 7745389 +2598/20000 train_loss: 2.5067 train_time: 4.4m tok/s: 7744301 +2599/20000 train_loss: 2.6161 train_time: 4.4m tok/s: 7743214 +2600/20000 train_loss: 2.5765 train_time: 4.4m tok/s: 7742035 +2601/20000 train_loss: 2.4978 train_time: 4.4m tok/s: 7740890 +2602/20000 train_loss: 2.7667 train_time: 4.4m tok/s: 7739771 +2603/20000 train_loss: 2.5190 train_time: 4.4m tok/s: 
7738597 +2604/20000 train_loss: 2.5216 train_time: 4.4m tok/s: 7737531 +2605/20000 train_loss: 2.6637 train_time: 4.4m tok/s: 7736324 +2606/20000 train_loss: 2.4122 train_time: 4.4m tok/s: 7735263 +2607/20000 train_loss: 2.4749 train_time: 4.4m tok/s: 7734096 +2608/20000 train_loss: 2.5258 train_time: 4.4m tok/s: 7733023 +2609/20000 train_loss: 2.4871 train_time: 4.4m tok/s: 7731883 +2610/20000 train_loss: 2.4762 train_time: 4.4m tok/s: 7730768 +2611/20000 train_loss: 2.6121 train_time: 4.4m tok/s: 7729697 +2612/20000 train_loss: 2.6791 train_time: 4.4m tok/s: 7728536 +2613/20000 train_loss: 2.5561 train_time: 4.4m tok/s: 7727442 +2614/20000 train_loss: 2.5422 train_time: 4.4m tok/s: 7726363 +2615/20000 train_loss: 2.6696 train_time: 4.4m tok/s: 7725266 +2616/20000 train_loss: 2.5790 train_time: 4.4m tok/s: 7724151 +2617/20000 train_loss: 2.5676 train_time: 4.4m tok/s: 7723025 +2618/20000 train_loss: 2.5650 train_time: 4.4m tok/s: 7721921 +2619/20000 train_loss: 2.4551 train_time: 4.4m tok/s: 7720834 +2620/20000 train_loss: 2.4235 train_time: 4.4m tok/s: 7719712 +2621/20000 train_loss: 2.5023 train_time: 4.5m tok/s: 7718649 +2622/20000 train_loss: 2.4919 train_time: 4.5m tok/s: 7717565 +2623/20000 train_loss: 2.5056 train_time: 4.5m tok/s: 7716417 +2624/20000 train_loss: 2.3422 train_time: 4.5m tok/s: 7715295 +2625/20000 train_loss: 2.6304 train_time: 4.5m tok/s: 7714227 +2626/20000 train_loss: 2.3545 train_time: 4.5m tok/s: 7713133 +2627/20000 train_loss: 2.4577 train_time: 4.5m tok/s: 7712035 +2628/20000 train_loss: 2.6580 train_time: 4.5m tok/s: 7710986 +2629/20000 train_loss: 2.5559 train_time: 4.5m tok/s: 7709905 +2630/20000 train_loss: 2.6164 train_time: 4.5m tok/s: 7708788 +2631/20000 train_loss: 2.5909 train_time: 4.5m tok/s: 7707673 +2632/20000 train_loss: 2.6305 train_time: 4.5m tok/s: 7706610 +2633/20000 train_loss: 2.4780 train_time: 4.5m tok/s: 7705496 +2634/20000 train_loss: 2.5648 train_time: 4.5m tok/s: 7704403 +2635/20000 train_loss: 2.4915 
train_time: 4.5m tok/s: 7703336 +2636/20000 train_loss: 2.5435 train_time: 4.5m tok/s: 7702264 +2637/20000 train_loss: 2.4495 train_time: 4.5m tok/s: 7701186 +2638/20000 train_loss: 2.5394 train_time: 4.5m tok/s: 7700077 +2639/20000 train_loss: 2.2718 train_time: 4.5m tok/s: 7698963 +2640/20000 train_loss: 2.5405 train_time: 4.5m tok/s: 7697884 +2641/20000 train_loss: 2.5931 train_time: 4.5m tok/s: 7696784 +2642/20000 train_loss: 2.6497 train_time: 4.5m tok/s: 7695764 +2643/20000 train_loss: 2.5579 train_time: 4.5m tok/s: 7694601 +2644/20000 train_loss: 2.5777 train_time: 4.5m tok/s: 7693570 +2645/20000 train_loss: 2.5347 train_time: 4.5m tok/s: 7692497 +2646/20000 train_loss: 2.5665 train_time: 4.5m tok/s: 7691382 +2647/20000 train_loss: 2.6571 train_time: 4.5m tok/s: 7690337 +2648/20000 train_loss: 2.5201 train_time: 4.5m tok/s: 7689257 +2649/20000 train_loss: 2.5411 train_time: 4.5m tok/s: 7688189 +2650/20000 train_loss: 2.4645 train_time: 4.5m tok/s: 7687164 +2651/20000 train_loss: 2.4307 train_time: 4.5m tok/s: 7686092 +2652/20000 train_loss: 2.3529 train_time: 4.5m tok/s: 7685003 +2653/20000 train_loss: 2.6536 train_time: 4.5m tok/s: 7683878 +2654/20000 train_loss: 2.2672 train_time: 4.5m tok/s: 7682727 +2655/20000 train_loss: 2.9463 train_time: 4.5m tok/s: 7681600 +2656/20000 train_loss: 2.4578 train_time: 4.5m tok/s: 7680529 +2657/20000 train_loss: 2.4378 train_time: 4.5m tok/s: 7679509 +2658/20000 train_loss: 2.6176 train_time: 4.5m tok/s: 7678398 +2659/20000 train_loss: 2.5292 train_time: 4.5m tok/s: 7677331 +2660/20000 train_loss: 2.5925 train_time: 4.5m tok/s: 7676278 +2661/20000 train_loss: 2.5428 train_time: 4.5m tok/s: 7675248 +2662/20000 train_loss: 2.3326 train_time: 4.5m tok/s: 7674189 +2663/20000 train_loss: 2.7200 train_time: 4.5m tok/s: 7673141 +2664/20000 train_loss: 2.5368 train_time: 4.6m tok/s: 7672049 +2665/20000 train_loss: 2.4919 train_time: 4.6m tok/s: 7671003 +2666/20000 train_loss: 2.4038 train_time: 4.6m tok/s: 7669979 +2667/20000 
train_loss: 2.2918 train_time: 4.6m tok/s: 7668952 +2668/20000 train_loss: 2.5726 train_time: 4.6m tok/s: 7667901 +2669/20000 train_loss: 2.4550 train_time: 4.6m tok/s: 7666868 +2670/20000 train_loss: 2.5660 train_time: 4.6m tok/s: 7665875 +2671/20000 train_loss: 2.6702 train_time: 4.6m tok/s: 7664864 +2672/20000 train_loss: 2.6134 train_time: 4.6m tok/s: 7663849 +2673/20000 train_loss: 2.5672 train_time: 4.6m tok/s: 7662865 +2674/20000 train_loss: 2.6238 train_time: 4.6m tok/s: 7661831 +2675/20000 train_loss: 2.5295 train_time: 4.6m tok/s: 7660813 +2676/20000 train_loss: 2.5196 train_time: 4.6m tok/s: 7659750 +2677/20000 train_loss: 2.4356 train_time: 4.6m tok/s: 7658709 +2678/20000 train_loss: 2.4755 train_time: 4.6m tok/s: 7657569 +2679/20000 train_loss: 2.3187 train_time: 4.6m tok/s: 7656554 +2680/20000 train_loss: 2.4332 train_time: 4.6m tok/s: 7655565 +2681/20000 train_loss: 2.4468 train_time: 4.6m tok/s: 7654552 +2682/20000 train_loss: 2.5389 train_time: 4.6m tok/s: 7653539 +2683/20000 train_loss: 2.4657 train_time: 4.6m tok/s: 7652527 +2684/20000 train_loss: 2.4773 train_time: 4.6m tok/s: 7651421 +2685/20000 train_loss: 2.8083 train_time: 4.6m tok/s: 7650375 +2686/20000 train_loss: 2.5242 train_time: 4.6m tok/s: 7649385 +2687/20000 train_loss: 2.5792 train_time: 4.6m tok/s: 7648413 +2688/20000 train_loss: 2.4538 train_time: 4.6m tok/s: 7647390 +2689/20000 train_loss: 2.5512 train_time: 4.6m tok/s: 7646358 +2690/20000 train_loss: 2.5487 train_time: 4.6m tok/s: 7645300 +2691/20000 train_loss: 2.5274 train_time: 4.6m tok/s: 7644297 +2692/20000 train_loss: 2.4691 train_time: 4.6m tok/s: 7643256 +2693/20000 train_loss: 2.4481 train_time: 4.6m tok/s: 7642177 +2694/20000 train_loss: 2.4732 train_time: 4.6m tok/s: 7641163 +2695/20000 train_loss: 2.5771 train_time: 4.6m tok/s: 7640130 +2696/20000 train_loss: 2.4740 train_time: 4.6m tok/s: 7639143 +2697/20000 train_loss: 2.5574 train_time: 4.6m tok/s: 7638126 +2698/20000 train_loss: 2.5726 train_time: 4.6m tok/s: 
7637099 +2699/20000 train_loss: 2.4777 train_time: 4.6m tok/s: 7636112 +2700/20000 train_loss: 2.4564 train_time: 4.6m tok/s: 7635099 +2701/20000 train_loss: 2.5835 train_time: 4.6m tok/s: 7634101 +2702/20000 train_loss: 2.3716 train_time: 4.6m tok/s: 7633085 +2703/20000 train_loss: 2.5560 train_time: 4.6m tok/s: 7632070 +2704/20000 train_loss: 2.4481 train_time: 4.6m tok/s: 7631014 +2705/20000 train_loss: 2.5073 train_time: 4.6m tok/s: 7630018 +2706/20000 train_loss: 2.5115 train_time: 4.6m tok/s: 7629047 +2707/20000 train_loss: 2.6176 train_time: 4.7m tok/s: 7627997 +2708/20000 train_loss: 2.6536 train_time: 4.7m tok/s: 7626937 +2709/20000 train_loss: 2.5704 train_time: 4.7m tok/s: 7625964 +2710/20000 train_loss: 2.6256 train_time: 4.7m tok/s: 7624911 +2711/20000 train_loss: 2.7039 train_time: 4.7m tok/s: 7623877 +2712/20000 train_loss: 2.4920 train_time: 4.7m tok/s: 7622924 +2713/20000 train_loss: 2.5686 train_time: 4.7m tok/s: 7621884 +2714/20000 train_loss: 2.6242 train_time: 4.7m tok/s: 7620898 +2715/20000 train_loss: 2.4536 train_time: 4.7m tok/s: 7619941 +2716/20000 train_loss: 2.4187 train_time: 4.7m tok/s: 7618960 +2717/20000 train_loss: 2.5260 train_time: 4.7m tok/s: 7617978 +2718/20000 train_loss: 2.4493 train_time: 4.7m tok/s: 7616954 +2719/20000 train_loss: 2.4409 train_time: 4.7m tok/s: 7615980 +2720/20000 train_loss: 2.5634 train_time: 4.7m tok/s: 7614913 +2721/20000 train_loss: 2.4095 train_time: 4.7m tok/s: 7613866 +2722/20000 train_loss: 2.4847 train_time: 4.7m tok/s: 7612876 +2723/20000 train_loss: 2.4610 train_time: 4.7m tok/s: 7611924 +2724/20000 train_loss: 2.5216 train_time: 4.7m tok/s: 7610874 +2725/20000 train_loss: 2.6384 train_time: 4.7m tok/s: 7609898 +2726/20000 train_loss: 2.5066 train_time: 4.7m tok/s: 7608923 +2727/20000 train_loss: 2.5617 train_time: 4.7m tok/s: 7607963 +2728/20000 train_loss: 2.8963 train_time: 4.7m tok/s: 7607002 +2729/20000 train_loss: 2.6999 train_time: 4.7m tok/s: 7606016 +2730/20000 train_loss: 2.5326 
train_time: 4.7m tok/s: 7605039 +2731/20000 train_loss: 2.6113 train_time: 4.7m tok/s: 7604040 +2732/20000 train_loss: 2.6245 train_time: 4.7m tok/s: 7603025 +2733/20000 train_loss: 2.5434 train_time: 4.7m tok/s: 7602011 +2734/20000 train_loss: 2.6167 train_time: 4.7m tok/s: 7601061 +2735/20000 train_loss: 2.4265 train_time: 4.7m tok/s: 7600103 +2736/20000 train_loss: 2.5524 train_time: 4.7m tok/s: 7599146 +2737/20000 train_loss: 2.4734 train_time: 4.7m tok/s: 7598137 +2738/20000 train_loss: 2.4146 train_time: 4.7m tok/s: 7597186 +2739/20000 train_loss: 2.5373 train_time: 4.7m tok/s: 7596203 +2740/20000 train_loss: 2.5545 train_time: 4.7m tok/s: 7595216 +2741/20000 train_loss: 2.5223 train_time: 4.7m tok/s: 7594251 +2742/20000 train_loss: 2.4775 train_time: 4.7m tok/s: 7593275 +2743/20000 train_loss: 2.5809 train_time: 4.7m tok/s: 7592264 +2744/20000 train_loss: 2.5880 train_time: 4.7m tok/s: 7591325 +2745/20000 train_loss: 2.6478 train_time: 4.7m tok/s: 7590390 +2746/20000 train_loss: 2.6168 train_time: 4.7m tok/s: 7589390 +2747/20000 train_loss: 2.4400 train_time: 4.7m tok/s: 7588418 +2748/20000 train_loss: 2.5137 train_time: 4.7m tok/s: 7587429 +2749/20000 train_loss: 2.5853 train_time: 4.7m tok/s: 7586506 +2750/20000 train_loss: 2.6463 train_time: 4.8m tok/s: 7585548 +2751/20000 train_loss: 2.6250 train_time: 4.8m tok/s: 7584574 +2752/20000 train_loss: 2.5016 train_time: 4.8m tok/s: 7583612 +2753/20000 train_loss: 2.4889 train_time: 4.8m tok/s: 7582659 +2754/20000 train_loss: 2.4544 train_time: 4.8m tok/s: 7581683 +2755/20000 train_loss: 2.4963 train_time: 4.8m tok/s: 7580748 +2756/20000 train_loss: 2.4780 train_time: 4.8m tok/s: 7579809 +2757/20000 train_loss: 2.4529 train_time: 4.8m tok/s: 7578826 +2758/20000 train_loss: 2.5935 train_time: 4.8m tok/s: 7577832 +2759/20000 train_loss: 2.4749 train_time: 4.8m tok/s: 7576927 +2760/20000 train_loss: 2.4057 train_time: 4.8m tok/s: 7575953 +2761/20000 train_loss: 2.6409 train_time: 4.8m tok/s: 7574947 +2762/20000 
train_loss: 2.5392 train_time: 4.8m tok/s: 7574018 +2763/20000 train_loss: 2.5901 train_time: 4.8m tok/s: 7573067 +2764/20000 train_loss: 2.5649 train_time: 4.8m tok/s: 7572117 +2765/20000 train_loss: 2.5045 train_time: 4.8m tok/s: 7571189 +2766/20000 train_loss: 2.4603 train_time: 4.8m tok/s: 7570272 +2767/20000 train_loss: 2.5096 train_time: 4.8m tok/s: 7569299 +2768/20000 train_loss: 2.6387 train_time: 4.8m tok/s: 7568337 +2769/20000 train_loss: 2.5583 train_time: 4.8m tok/s: 7567304 +2770/20000 train_loss: 2.7016 train_time: 4.8m tok/s: 7566377 +2771/20000 train_loss: 2.5050 train_time: 4.8m tok/s: 7565395 +2772/20000 train_loss: 2.5277 train_time: 4.8m tok/s: 7564463 +2773/20000 train_loss: 2.5092 train_time: 4.8m tok/s: 7563500 +2774/20000 train_loss: 2.4683 train_time: 4.8m tok/s: 7562586 +2775/20000 train_loss: 2.4689 train_time: 4.8m tok/s: 7561666 +2776/20000 train_loss: 2.4155 train_time: 4.8m tok/s: 7560765 +2777/20000 train_loss: 2.4803 train_time: 4.8m tok/s: 7559798 +2778/20000 train_loss: 2.5297 train_time: 4.8m tok/s: 7558868 +2779/20000 train_loss: 2.3965 train_time: 4.8m tok/s: 7557843 +2780/20000 train_loss: 2.6587 train_time: 4.8m tok/s: 7556936 +2781/20000 train_loss: 2.6379 train_time: 4.8m tok/s: 7556016 +2782/20000 train_loss: 2.4596 train_time: 4.8m tok/s: 7555106 +2783/20000 train_loss: 2.6210 train_time: 4.8m tok/s: 7554185 +2784/20000 train_loss: 2.6457 train_time: 4.8m tok/s: 7553253 +2785/20000 train_loss: 2.5179 train_time: 4.8m tok/s: 7552343 +2786/20000 train_loss: 2.5174 train_time: 4.8m tok/s: 7551402 +2787/20000 train_loss: 2.5814 train_time: 4.8m tok/s: 7550445 +2788/20000 train_loss: 2.4237 train_time: 4.8m tok/s: 7549501 +2789/20000 train_loss: 2.5436 train_time: 4.8m tok/s: 7548602 +2790/20000 train_loss: 2.6068 train_time: 4.8m tok/s: 7547650 +2791/20000 train_loss: 2.3633 train_time: 4.8m tok/s: 7546682 +2792/20000 train_loss: 2.5163 train_time: 4.8m tok/s: 7545762 +2793/20000 train_loss: 2.5418 train_time: 4.9m tok/s: 
7544808 +2794/20000 train_loss: 2.5052 train_time: 4.9m tok/s: 7543882 +2795/20000 train_loss: 2.3742 train_time: 4.9m tok/s: 7542992 +2796/20000 train_loss: 2.5125 train_time: 4.9m tok/s: 7542079 +2797/20000 train_loss: 2.5443 train_time: 4.9m tok/s: 7541129 +2798/20000 train_loss: 2.5160 train_time: 4.9m tok/s: 7540196 +2799/20000 train_loss: 2.7790 train_time: 4.9m tok/s: 7539273 +2800/20000 train_loss: 2.6263 train_time: 4.9m tok/s: 7538386 +2801/20000 train_loss: 2.5010 train_time: 4.9m tok/s: 7537493 +2802/20000 train_loss: 2.5134 train_time: 4.9m tok/s: 7536563 +2803/20000 train_loss: 2.5884 train_time: 4.9m tok/s: 7535592 +2804/20000 train_loss: 2.6324 train_time: 4.9m tok/s: 7534671 +2805/20000 train_loss: 2.4778 train_time: 4.9m tok/s: 7533771 +2806/20000 train_loss: 2.5043 train_time: 4.9m tok/s: 7532849 +2807/20000 train_loss: 2.6619 train_time: 4.9m tok/s: 7531918 +2808/20000 train_loss: 2.5852 train_time: 4.9m tok/s: 7530923 +2809/20000 train_loss: 2.4423 train_time: 4.9m tok/s: 7530031 +2810/20000 train_loss: 2.5246 train_time: 4.9m tok/s: 7529117 +2811/20000 train_loss: 2.5925 train_time: 4.9m tok/s: 7528188 +2812/20000 train_loss: 2.5944 train_time: 4.9m tok/s: 7527321 +2813/20000 train_loss: 2.3919 train_time: 4.9m tok/s: 7526394 +2814/20000 train_loss: 2.5354 train_time: 4.9m tok/s: 7525506 +2815/20000 train_loss: 2.6710 train_time: 4.9m tok/s: 7524621 +2816/20000 train_loss: 2.6328 train_time: 4.9m tok/s: 7523698 +2817/20000 train_loss: 2.5386 train_time: 4.9m tok/s: 7522786 +2818/20000 train_loss: 2.5612 train_time: 4.9m tok/s: 7521919 +2819/20000 train_loss: 2.4722 train_time: 4.9m tok/s: 7521060 +2820/20000 train_loss: 2.4956 train_time: 4.9m tok/s: 7520146 +2821/20000 train_loss: 2.4386 train_time: 4.9m tok/s: 7519205 +2822/20000 train_loss: 2.7896 train_time: 4.9m tok/s: 7518230 +2823/20000 train_loss: 2.6126 train_time: 4.9m tok/s: 7517337 +2824/20000 train_loss: 2.6712 train_time: 4.9m tok/s: 7516445 +2825/20000 train_loss: 2.4765 
train_time: 4.9m tok/s: 7515535 +2826/20000 train_loss: 2.5638 train_time: 4.9m tok/s: 7514666 +2827/20000 train_loss: 2.4441 train_time: 4.9m tok/s: 7513801 +2828/20000 train_loss: 2.6208 train_time: 4.9m tok/s: 7512878 +2829/20000 train_loss: 2.4087 train_time: 4.9m tok/s: 7512011 +2830/20000 train_loss: 2.5036 train_time: 4.9m tok/s: 7511094 +2831/20000 train_loss: 2.7149 train_time: 4.9m tok/s: 7510244 +2832/20000 train_loss: 2.5279 train_time: 4.9m tok/s: 7509363 +2833/20000 train_loss: 2.6864 train_time: 4.9m tok/s: 7508464 +2834/20000 train_loss: 2.6328 train_time: 4.9m tok/s: 7507555 +2835/20000 train_loss: 2.5888 train_time: 5.0m tok/s: 7506658 +2836/20000 train_loss: 2.4971 train_time: 5.0m tok/s: 7505784 +2837/20000 train_loss: 2.5740 train_time: 5.0m tok/s: 7504891 +2838/20000 train_loss: 2.5450 train_time: 5.0m tok/s: 7504010 +2839/20000 train_loss: 2.5233 train_time: 5.0m tok/s: 7503109 +2840/20000 train_loss: 2.6090 train_time: 5.0m tok/s: 7502207 +2841/20000 train_loss: 2.5473 train_time: 5.0m tok/s: 7501332 +2842/20000 train_loss: 2.6238 train_time: 5.0m tok/s: 7500424 +2843/20000 train_loss: 2.4416 train_time: 5.0m tok/s: 7499519 +2844/20000 train_loss: 2.4360 train_time: 5.0m tok/s: 7498656 +2845/20000 train_loss: 2.5188 train_time: 5.0m tok/s: 7497787 +2846/20000 train_loss: 2.3889 train_time: 5.0m tok/s: 7496928 +2847/20000 train_loss: 2.4257 train_time: 5.0m tok/s: 7496002 +2848/20000 train_loss: 2.5580 train_time: 5.0m tok/s: 7495112 +2849/20000 train_loss: 2.6248 train_time: 5.0m tok/s: 7494254 +2850/20000 train_loss: 2.5764 train_time: 5.0m tok/s: 7493339 +2851/20000 train_loss: 2.7756 train_time: 5.0m tok/s: 7492468 +2852/20000 train_loss: 2.4613 train_time: 5.0m tok/s: 7491586 +2853/20000 train_loss: 2.5322 train_time: 5.0m tok/s: 7490694 +2854/20000 train_loss: 2.4541 train_time: 5.0m tok/s: 7489824 +2855/20000 train_loss: 2.6349 train_time: 5.0m tok/s: 7488938 +2856/20000 train_loss: 2.4658 train_time: 5.0m tok/s: 7488084 +2857/20000 
train_loss: 2.5335 train_time: 5.0m tok/s: 7487257 +2858/20000 train_loss: 2.5089 train_time: 5.0m tok/s: 7486377 +2859/20000 train_loss: 3.1496 train_time: 5.0m tok/s: 7485454 +2860/20000 train_loss: 2.4905 train_time: 5.0m tok/s: 7484597 +2861/20000 train_loss: 2.4960 train_time: 5.0m tok/s: 7483765 +2862/20000 train_loss: 2.5102 train_time: 5.0m tok/s: 7482892 +2863/20000 train_loss: 2.3497 train_time: 5.0m tok/s: 7482028 +2864/20000 train_loss: 2.3803 train_time: 5.0m tok/s: 7481160 +2865/20000 train_loss: 2.6113 train_time: 5.0m tok/s: 7480287 +2866/20000 train_loss: 2.5235 train_time: 5.0m tok/s: 7479440 +2867/20000 train_loss: 2.3872 train_time: 5.0m tok/s: 7478578 +2868/20000 train_loss: 2.4298 train_time: 5.0m tok/s: 7477714 +2869/20000 train_loss: 2.5582 train_time: 5.0m tok/s: 7476879 +2870/20000 train_loss: 2.6493 train_time: 5.0m tok/s: 7476020 +2871/20000 train_loss: 2.4562 train_time: 5.0m tok/s: 7475191 +2872/20000 train_loss: 3.0336 train_time: 5.0m tok/s: 7474271 +2873/20000 train_loss: 2.4688 train_time: 5.0m tok/s: 7473294 +2874/20000 train_loss: 2.6097 train_time: 5.0m tok/s: 7472440 +2875/20000 train_loss: 2.5315 train_time: 5.0m tok/s: 7471623 +2876/20000 train_loss: 2.5860 train_time: 5.0m tok/s: 7470793 +2877/20000 train_loss: 2.5108 train_time: 5.0m tok/s: 7469915 +2878/20000 train_loss: 2.4971 train_time: 5.1m tok/s: 7469090 +2879/20000 train_loss: 2.5354 train_time: 5.1m tok/s: 7468257 +2880/20000 train_loss: 2.5404 train_time: 5.1m tok/s: 7467403 +2881/20000 train_loss: 2.5996 train_time: 5.1m tok/s: 7466593 +2882/20000 train_loss: 2.6848 train_time: 5.1m tok/s: 7465739 +2883/20000 train_loss: 2.6413 train_time: 5.1m tok/s: 7464864 +2884/20000 train_loss: 2.6037 train_time: 5.1m tok/s: 7464017 +2885/20000 train_loss: 2.5562 train_time: 5.1m tok/s: 7463172 +2886/20000 train_loss: 2.5530 train_time: 5.1m tok/s: 7462327 +2887/20000 train_loss: 2.5218 train_time: 5.1m tok/s: 7461478 +2888/20000 train_loss: 2.6159 train_time: 5.1m tok/s: 
7460588 +2889/20000 train_loss: 2.5675 train_time: 5.1m tok/s: 7459758 +2890/20000 train_loss: 2.5891 train_time: 5.1m tok/s: 7458890 +2891/20000 train_loss: 2.5303 train_time: 5.1m tok/s: 7458100 +2892/20000 train_loss: 2.4673 train_time: 5.1m tok/s: 7457255 +2893/20000 train_loss: 2.3151 train_time: 5.1m tok/s: 7456402 +2894/20000 train_loss: 2.5533 train_time: 5.1m tok/s: 7455544 +2895/20000 train_loss: 2.5206 train_time: 5.1m tok/s: 7454696 +2896/20000 train_loss: 2.4927 train_time: 5.1m tok/s: 7453877 +2897/20000 train_loss: 2.5690 train_time: 5.1m tok/s: 7453037 +2898/20000 train_loss: 2.5617 train_time: 5.1m tok/s: 7452237 +2899/20000 train_loss: 2.5618 train_time: 5.1m tok/s: 7451397 +2900/20000 train_loss: 2.6085 train_time: 5.1m tok/s: 7450544 +2901/20000 train_loss: 2.4171 train_time: 5.1m tok/s: 7449698 +2902/20000 train_loss: 2.5346 train_time: 5.1m tok/s: 7448801 +2903/20000 train_loss: 2.4653 train_time: 5.1m tok/s: 7447968 +2904/20000 train_loss: 2.4622 train_time: 5.1m tok/s: 7447149 +2905/20000 train_loss: 2.6071 train_time: 5.1m tok/s: 7446303 +2906/20000 train_loss: 2.4156 train_time: 5.1m tok/s: 7445493 +2907/20000 train_loss: 2.4541 train_time: 5.1m tok/s: 7444695 +2908/20000 train_loss: 2.5135 train_time: 5.1m tok/s: 7443856 +2909/20000 train_loss: 2.5144 train_time: 5.1m tok/s: 7443016 +2910/20000 train_loss: 2.3947 train_time: 5.1m tok/s: 7442185 +2911/20000 train_loss: 2.5944 train_time: 5.1m tok/s: 7441344 +2912/20000 train_loss: 2.5390 train_time: 5.1m tok/s: 7440464 +2913/20000 train_loss: 2.5701 train_time: 5.1m tok/s: 7439619 +2914/20000 train_loss: 2.6236 train_time: 5.1m tok/s: 7438824 +2915/20000 train_loss: 2.5375 train_time: 5.1m tok/s: 7438025 +2916/20000 train_loss: 2.4691 train_time: 5.1m tok/s: 7437186 +2917/20000 train_loss: 2.5182 train_time: 5.1m tok/s: 7436360 +2918/20000 train_loss: 2.8601 train_time: 5.1m tok/s: 7435531 +2919/20000 train_loss: 2.6925 train_time: 5.1m tok/s: 7434695 +2920/20000 train_loss: 2.4755 
train_time: 5.1m tok/s: 7433875 +2921/20000 train_loss: 2.4274 train_time: 5.2m tok/s: 7433059 +2922/20000 train_loss: 2.4653 train_time: 5.2m tok/s: 7432256 +2923/20000 train_loss: 2.4581 train_time: 5.2m tok/s: 7431427 +2924/20000 train_loss: 2.5733 train_time: 5.2m tok/s: 7430614 +2925/20000 train_loss: 2.6224 train_time: 5.2m tok/s: 7429785 +2926/20000 train_loss: 2.4967 train_time: 5.2m tok/s: 7428920 +2927/20000 train_loss: 2.5087 train_time: 5.2m tok/s: 7428128 +2928/20000 train_loss: 2.3518 train_time: 5.2m tok/s: 7427268 +2929/20000 train_loss: 2.5595 train_time: 5.2m tok/s: 7426468 +2930/20000 train_loss: 2.4493 train_time: 5.2m tok/s: 7425646 +2931/20000 train_loss: 2.4403 train_time: 5.2m tok/s: 7424848 +2932/20000 train_loss: 2.4287 train_time: 5.2m tok/s: 7424043 +2933/20000 train_loss: 2.5201 train_time: 5.2m tok/s: 7423249 +2934/20000 train_loss: 2.5334 train_time: 5.2m tok/s: 7422425 +2935/20000 train_loss: 2.5271 train_time: 5.2m tok/s: 7421616 +2936/20000 train_loss: 2.5160 train_time: 5.2m tok/s: 7420814 +2937/20000 train_loss: 2.6488 train_time: 5.2m tok/s: 7420030 +2938/20000 train_loss: 2.5955 train_time: 5.2m tok/s: 7419237 +2939/20000 train_loss: 2.5776 train_time: 5.2m tok/s: 7418430 +2940/20000 train_loss: 2.3921 train_time: 5.2m tok/s: 7417607 +2941/20000 train_loss: 2.6225 train_time: 5.2m tok/s: 7416795 +2942/20000 train_loss: 2.5324 train_time: 5.2m tok/s: 7416001 +2943/20000 train_loss: 2.4242 train_time: 5.2m tok/s: 7415157 +2944/20000 train_loss: 2.4112 train_time: 5.2m tok/s: 7414344 +2945/20000 train_loss: 2.5171 train_time: 5.2m tok/s: 7413542 +2946/20000 train_loss: 2.4908 train_time: 5.2m tok/s: 7412726 +2947/20000 train_loss: 2.4953 train_time: 5.2m tok/s: 7411922 +2948/20000 train_loss: 2.4739 train_time: 5.2m tok/s: 7411130 +2949/20000 train_loss: 2.5420 train_time: 5.2m tok/s: 7410333 +2950/20000 train_loss: 2.6409 train_time: 5.2m tok/s: 7409522 +2951/20000 train_loss: 2.7278 train_time: 5.2m tok/s: 7408721 +2952/20000 
train_loss: 2.5532 train_time: 5.2m tok/s: 7407918 +2953/20000 train_loss: 2.5009 train_time: 5.2m tok/s: 7407129 +2954/20000 train_loss: 2.4773 train_time: 5.2m tok/s: 7406328 +2955/20000 train_loss: 2.4615 train_time: 5.2m tok/s: 7405512 +2956/20000 train_loss: 2.4595 train_time: 5.2m tok/s: 7404722 +2957/20000 train_loss: 2.7208 train_time: 5.2m tok/s: 7403861 +2958/20000 train_loss: 2.4363 train_time: 5.2m tok/s: 7403085 +2959/20000 train_loss: 2.4288 train_time: 5.2m tok/s: 7402341 +2960/20000 train_loss: 2.4282 train_time: 5.2m tok/s: 7401571 +2961/20000 train_loss: 2.5311 train_time: 5.2m tok/s: 7400768 +2962/20000 train_loss: 2.4758 train_time: 5.2m tok/s: 7399972 +2963/20000 train_loss: 2.6897 train_time: 5.2m tok/s: 7399182 +2964/20000 train_loss: 2.5806 train_time: 5.3m tok/s: 7398396 +2965/20000 train_loss: 2.6967 train_time: 5.3m tok/s: 7397613 +2966/20000 train_loss: 2.5784 train_time: 5.3m tok/s: 7396848 +2967/20000 train_loss: 2.5514 train_time: 5.3m tok/s: 7396045 +2968/20000 train_loss: 2.4828 train_time: 5.3m tok/s: 7395286 +2969/20000 train_loss: 2.6660 train_time: 5.3m tok/s: 7394478 +2970/20000 train_loss: 2.4726 train_time: 5.3m tok/s: 7393690 +2971/20000 train_loss: 2.4887 train_time: 5.3m tok/s: 7392885 +2972/20000 train_loss: 2.5155 train_time: 5.3m tok/s: 7392105 +2973/20000 train_loss: 2.4409 train_time: 5.3m tok/s: 7391332 +2974/20000 train_loss: 2.5871 train_time: 5.3m tok/s: 7390542 +2975/20000 train_loss: 2.5883 train_time: 5.3m tok/s: 7389761 +2976/20000 train_loss: 2.5445 train_time: 5.3m tok/s: 7389008 +2977/20000 train_loss: 2.4829 train_time: 5.3m tok/s: 7388205 +2978/20000 train_loss: 2.5794 train_time: 5.3m tok/s: 7387407 +2979/20000 train_loss: 2.4749 train_time: 5.3m tok/s: 7386608 +2980/20000 train_loss: 2.5849 train_time: 5.3m tok/s: 7385833 +2981/20000 train_loss: 2.3864 train_time: 5.3m tok/s: 7385056 +2982/20000 train_loss: 2.6201 train_time: 5.3m tok/s: 7384263 +2983/20000 train_loss: 2.4102 train_time: 5.3m tok/s: 
7383482 +2984/20000 train_loss: 2.4478 train_time: 5.3m tok/s: 7382691 +2985/20000 train_loss: 2.4437 train_time: 5.3m tok/s: 7381919 +2986/20000 train_loss: 2.5728 train_time: 5.3m tok/s: 7381141 +2987/20000 train_loss: 2.4910 train_time: 5.3m tok/s: 7380304 +2988/20000 train_loss: 2.5509 train_time: 5.3m tok/s: 7379553 +2989/20000 train_loss: 2.4350 train_time: 5.3m tok/s: 7378770 +2990/20000 train_loss: 2.6988 train_time: 5.3m tok/s: 7377988 +2991/20000 train_loss: 2.4830 train_time: 5.3m tok/s: 7377233 +2992/20000 train_loss: 2.5980 train_time: 5.3m tok/s: 7376460 +2993/20000 train_loss: 2.6487 train_time: 5.3m tok/s: 7375649 +2994/20000 train_loss: 2.4309 train_time: 5.3m tok/s: 7374899 +2995/20000 train_loss: 2.6409 train_time: 5.3m tok/s: 7374142 +2996/20000 train_loss: 2.4241 train_time: 5.3m tok/s: 7373364 +2997/20000 train_loss: 2.5777 train_time: 5.3m tok/s: 7372578 +2998/20000 train_loss: 2.4826 train_time: 5.3m tok/s: 7371837 +2999/20000 train_loss: 2.4638 train_time: 5.3m tok/s: 7371016 +3000/20000 train_loss: 2.4840 train_time: 5.3m tok/s: 7370249 +3001/20000 train_loss: 2.5502 train_time: 5.3m tok/s: 7369500 +3002/20000 train_loss: 2.5050 train_time: 5.3m tok/s: 7368774 +3003/20000 train_loss: 2.5363 train_time: 5.3m tok/s: 7368032 +3004/20000 train_loss: 2.4637 train_time: 5.3m tok/s: 7367278 +3005/20000 train_loss: 2.5398 train_time: 5.3m tok/s: 7366525 +3006/20000 train_loss: 2.7526 train_time: 5.3m tok/s: 7365732 +3007/20000 train_loss: 2.5009 train_time: 5.4m tok/s: 7364942 +3008/20000 train_loss: 2.4957 train_time: 5.4m tok/s: 7364199 +3009/20000 train_loss: 2.4754 train_time: 5.4m tok/s: 7363450 +3010/20000 train_loss: 2.4135 train_time: 5.4m tok/s: 7362657 +3011/20000 train_loss: 2.5777 train_time: 5.4m tok/s: 7361864 +3012/20000 train_loss: 2.5457 train_time: 5.4m tok/s: 7361149 +3013/20000 train_loss: 2.5188 train_time: 5.4m tok/s: 7360403 +3014/20000 train_loss: 2.5573 train_time: 5.4m tok/s: 7359673 +3015/20000 train_loss: 2.6593 
train_time: 5.4m tok/s: 7358923 +3016/20000 train_loss: 2.6823 train_time: 5.4m tok/s: 7358171 +3017/20000 train_loss: 2.4668 train_time: 5.4m tok/s: 7357407 +3018/20000 train_loss: 2.6175 train_time: 5.4m tok/s: 7356643 +3019/20000 train_loss: 2.4567 train_time: 5.4m tok/s: 7355938 +3020/20000 train_loss: 3.1200 train_time: 5.4m tok/s: 7355099 +3021/20000 train_loss: 2.4679 train_time: 5.4m tok/s: 7354316 +3022/20000 train_loss: 2.4517 train_time: 5.4m tok/s: 7353578 +3023/20000 train_loss: 2.5723 train_time: 5.4m tok/s: 7352832 +3024/20000 train_loss: 3.4325 train_time: 5.4m tok/s: 7351998 +3025/20000 train_loss: 2.4386 train_time: 5.4m tok/s: 7351230 +3026/20000 train_loss: 2.5236 train_time: 5.4m tok/s: 7350421 +3027/20000 train_loss: 2.5662 train_time: 5.4m tok/s: 7349682 +3028/20000 train_loss: 2.6463 train_time: 5.4m tok/s: 7348912 +3029/20000 train_loss: 2.7220 train_time: 5.4m tok/s: 7348208 +3030/20000 train_loss: 2.5623 train_time: 5.4m tok/s: 7347509 +3031/20000 train_loss: 2.5127 train_time: 5.4m tok/s: 7346802 +3032/20000 train_loss: 2.5403 train_time: 5.4m tok/s: 7346046 +3033/20000 train_loss: 2.5684 train_time: 5.4m tok/s: 7345347 +3034/20000 train_loss: 2.4828 train_time: 5.4m tok/s: 7344614 +3035/20000 train_loss: 2.3005 train_time: 5.4m tok/s: 7343860 +3036/20000 train_loss: 2.5710 train_time: 5.4m tok/s: 7343119 +3037/20000 train_loss: 2.4842 train_time: 5.4m tok/s: 7342387 +3038/20000 train_loss: 2.5276 train_time: 5.4m tok/s: 7341673 +3039/20000 train_loss: 2.5187 train_time: 5.4m tok/s: 7340924 +3040/20000 train_loss: 2.4073 train_time: 5.4m tok/s: 7340180 +3041/20000 train_loss: 2.6290 train_time: 5.4m tok/s: 7339449 +3042/20000 train_loss: 2.5788 train_time: 5.4m tok/s: 7338680 +3043/20000 train_loss: 2.6437 train_time: 5.4m tok/s: 7337893 +3044/20000 train_loss: 2.5855 train_time: 5.4m tok/s: 7337231 +3045/20000 train_loss: 2.5752 train_time: 5.4m tok/s: 7336508 +3046/20000 train_loss: 2.5592 train_time: 5.4m tok/s: 7335783 +3047/20000 
train_loss: 2.3877 train_time: 5.4m tok/s: 7335014 +3048/20000 train_loss: 2.3539 train_time: 5.4m tok/s: 7334283 +3049/20000 train_loss: 2.5074 train_time: 5.4m tok/s: 7333546 +3050/20000 train_loss: 2.6039 train_time: 5.5m tok/s: 7332813 +3051/20000 train_loss: 2.3611 train_time: 5.5m tok/s: 7332120 +3052/20000 train_loss: 2.4186 train_time: 5.5m tok/s: 7331401 +3053/20000 train_loss: 2.6086 train_time: 5.5m tok/s: 7330619 +3054/20000 train_loss: 2.4201 train_time: 5.5m tok/s: 7329873 +3055/20000 train_loss: 2.5346 train_time: 5.5m tok/s: 7329145 +3056/20000 train_loss: 2.5436 train_time: 5.5m tok/s: 7328429 +3057/20000 train_loss: 2.5029 train_time: 5.5m tok/s: 7327711 +3058/20000 train_loss: 2.5473 train_time: 5.5m tok/s: 7327005 +3059/20000 train_loss: 2.4617 train_time: 5.5m tok/s: 7326289 +3060/20000 train_loss: 2.5684 train_time: 5.5m tok/s: 7325577 +3061/20000 train_loss: 2.4382 train_time: 5.5m tok/s: 7324862 +3062/20000 train_loss: 2.5901 train_time: 5.5m tok/s: 7324138 +3063/20000 train_loss: 2.5429 train_time: 5.5m tok/s: 7323402 +3064/20000 train_loss: 2.4390 train_time: 5.5m tok/s: 7322634 +3065/20000 train_loss: 2.4084 train_time: 5.5m tok/s: 7321901 +3066/20000 train_loss: 2.6699 train_time: 5.5m tok/s: 7321142 +3067/20000 train_loss: 2.4535 train_time: 5.5m tok/s: 7320403 +3068/20000 train_loss: 2.6014 train_time: 5.5m tok/s: 7319711 +3069/20000 train_loss: 2.3904 train_time: 5.5m tok/s: 7318962 +3070/20000 train_loss: 2.5560 train_time: 5.5m tok/s: 7318222 +3071/20000 train_loss: 2.5093 train_time: 5.5m tok/s: 7317495 +3072/20000 train_loss: 2.5410 train_time: 5.5m tok/s: 7316803 +3073/20000 train_loss: 2.6042 train_time: 5.5m tok/s: 7316094 +3074/20000 train_loss: 2.4253 train_time: 5.5m tok/s: 7315374 +3075/20000 train_loss: 2.4115 train_time: 5.5m tok/s: 7314642 +3076/20000 train_loss: 2.4698 train_time: 5.5m tok/s: 7313930 +3077/20000 train_loss: 2.5285 train_time: 5.5m tok/s: 7313229 +3078/20000 train_loss: 2.3914 train_time: 5.5m tok/s: 
7312505 +3079/20000 train_loss: 2.4310 train_time: 5.5m tok/s: 7311771 +3080/20000 train_loss: 3.2295 train_time: 5.5m tok/s: 7311012 +3081/20000 train_loss: 2.3817 train_time: 5.5m tok/s: 7310317 +3082/20000 train_loss: 2.4439 train_time: 5.5m tok/s: 7309626 +3083/20000 train_loss: 2.4634 train_time: 5.5m tok/s: 7308935 +3084/20000 train_loss: 2.4662 train_time: 5.5m tok/s: 7308232 +3085/20000 train_loss: 2.5325 train_time: 5.5m tok/s: 7307512 +3086/20000 train_loss: 2.5705 train_time: 5.5m tok/s: 7306846 +3087/20000 train_loss: 2.6097 train_time: 5.5m tok/s: 7306108 +3088/20000 train_loss: 2.5716 train_time: 5.5m tok/s: 7305392 +3089/20000 train_loss: 2.4474 train_time: 5.5m tok/s: 7304688 +3090/20000 train_loss: 2.6913 train_time: 5.5m tok/s: 7303990 +3091/20000 train_loss: 2.4444 train_time: 5.5m tok/s: 7303301 +3092/20000 train_loss: 2.4806 train_time: 5.5m tok/s: 7302578 +3093/20000 train_loss: 2.5252 train_time: 5.6m tok/s: 7301826 +3094/20000 train_loss: 2.4747 train_time: 5.6m tok/s: 7301076 +3095/20000 train_loss: 2.3304 train_time: 5.6m tok/s: 7300375 +3096/20000 train_loss: 2.5068 train_time: 5.6m tok/s: 7299691 +3097/20000 train_loss: 2.5500 train_time: 5.6m tok/s: 7298999 +3098/20000 train_loss: 2.4669 train_time: 5.6m tok/s: 7298296 +3099/20000 train_loss: 2.3244 train_time: 5.6m tok/s: 7297584 +3100/20000 train_loss: 2.4076 train_time: 5.6m tok/s: 7296875 +3101/20000 train_loss: 2.6640 train_time: 5.6m tok/s: 7296210 +3102/20000 train_loss: 2.6487 train_time: 5.6m tok/s: 7295485 +3103/20000 train_loss: 2.4843 train_time: 5.6m tok/s: 7294776 +3104/20000 train_loss: 2.5856 train_time: 5.6m tok/s: 7294074 +3105/20000 train_loss: 2.3862 train_time: 5.6m tok/s: 7293383 +3106/20000 train_loss: 2.5853 train_time: 5.6m tok/s: 7292715 +3107/20000 train_loss: 2.3384 train_time: 5.6m tok/s: 7291974 +3108/20000 train_loss: 2.4687 train_time: 5.6m tok/s: 7291273 +3109/20000 train_loss: 2.5589 train_time: 5.6m tok/s: 7290600 +3110/20000 train_loss: 2.4217 
train_time: 5.6m tok/s: 7289907 +3111/20000 train_loss: 2.4315 train_time: 5.6m tok/s: 7289212 +3112/20000 train_loss: 2.3854 train_time: 5.6m tok/s: 7288517 +3113/20000 train_loss: 2.5220 train_time: 5.6m tok/s: 7287752 +3114/20000 train_loss: 2.5429 train_time: 5.6m tok/s: 7287076 +3115/20000 train_loss: 2.5620 train_time: 5.6m tok/s: 7286374 +3116/20000 train_loss: 2.5754 train_time: 5.6m tok/s: 7285668 +3117/20000 train_loss: 2.6006 train_time: 5.6m tok/s: 7284976 +3118/20000 train_loss: 2.5344 train_time: 5.6m tok/s: 7284264 +3119/20000 train_loss: 2.5685 train_time: 5.6m tok/s: 7283581 +3120/20000 train_loss: 2.5532 train_time: 5.6m tok/s: 7282904 +3121/20000 train_loss: 2.5326 train_time: 5.6m tok/s: 7282212 +3122/20000 train_loss: 2.5538 train_time: 5.6m tok/s: 7281509 +3123/20000 train_loss: 2.4266 train_time: 5.6m tok/s: 7280831 +3124/20000 train_loss: 2.5336 train_time: 5.6m tok/s: 7280159 +3125/20000 train_loss: 2.4329 train_time: 5.6m tok/s: 7279454 +3126/20000 train_loss: 2.1199 train_time: 5.6m tok/s: 7278732 +3127/20000 train_loss: 2.6003 train_time: 5.6m tok/s: 7278072 +3128/20000 train_loss: 2.5038 train_time: 5.6m tok/s: 7277360 +3129/20000 train_loss: 2.5325 train_time: 5.6m tok/s: 7276675 +3130/20000 train_loss: 2.4751 train_time: 5.6m tok/s: 7275994 +3131/20000 train_loss: 2.4399 train_time: 5.6m tok/s: 7275311 +3132/20000 train_loss: 2.5638 train_time: 5.6m tok/s: 7274620 +3133/20000 train_loss: 2.5145 train_time: 5.6m tok/s: 7273934 +3134/20000 train_loss: 2.5469 train_time: 5.6m tok/s: 7273273 +3135/20000 train_loss: 2.5025 train_time: 5.7m tok/s: 7272596 +3136/20000 train_loss: 2.5680 train_time: 5.7m tok/s: 7271919 +3137/20000 train_loss: 2.6027 train_time: 5.7m tok/s: 7271247 +3138/20000 train_loss: 2.4463 train_time: 5.7m tok/s: 7270535 +3139/20000 train_loss: 2.4497 train_time: 5.7m tok/s: 7269879 +3140/20000 train_loss: 2.4256 train_time: 5.7m tok/s: 7269190 +3141/20000 train_loss: 2.5734 train_time: 5.7m tok/s: 7268511 +3142/20000 
train_loss: 2.5418 train_time: 5.7m tok/s: 7267794 +3143/20000 train_loss: 2.5479 train_time: 5.7m tok/s: 7267111 +3144/20000 train_loss: 2.5473 train_time: 5.7m tok/s: 7266466 +3145/20000 train_loss: 2.0690 train_time: 5.7m tok/s: 7265739 +3146/20000 train_loss: 2.4465 train_time: 5.7m tok/s: 7265069 +3147/20000 train_loss: 2.5490 train_time: 5.7m tok/s: 7264410 +3148/20000 train_loss: 2.4830 train_time: 5.7m tok/s: 7263738 +3149/20000 train_loss: 2.5965 train_time: 5.7m tok/s: 7263061 +3150/20000 train_loss: 2.3880 train_time: 5.7m tok/s: 7262355 +3151/20000 train_loss: 2.4058 train_time: 5.7m tok/s: 7261727 +3152/20000 train_loss: 2.4725 train_time: 5.7m tok/s: 7261059 +3153/20000 train_loss: 2.5212 train_time: 5.7m tok/s: 7260405 +3154/20000 train_loss: 2.3718 train_time: 5.7m tok/s: 7259713 +3155/20000 train_loss: 2.4728 train_time: 5.7m tok/s: 7259047 +3156/20000 train_loss: 2.5897 train_time: 5.7m tok/s: 7258358 +3157/20000 train_loss: 2.5929 train_time: 5.7m tok/s: 7257695 +3158/20000 train_loss: 2.6330 train_time: 5.7m tok/s: 7257018 +3159/20000 train_loss: 2.5218 train_time: 5.7m tok/s: 7256357 +3160/20000 train_loss: 2.3692 train_time: 5.7m tok/s: 7255661 +3161/20000 train_loss: 2.5348 train_time: 5.7m tok/s: 7255007 +3162/20000 train_loss: 2.6202 train_time: 5.7m tok/s: 7254372 +3163/20000 train_loss: 2.5470 train_time: 5.7m tok/s: 7253690 +3164/20000 train_loss: 2.3717 train_time: 5.7m tok/s: 7253005 +3165/20000 train_loss: 2.5047 train_time: 5.7m tok/s: 7252337 +3166/20000 train_loss: 2.4214 train_time: 5.7m tok/s: 7251686 +3167/20000 train_loss: 2.5250 train_time: 5.7m tok/s: 7251024 +3168/20000 train_loss: 2.5111 train_time: 5.7m tok/s: 7250354 +3169/20000 train_loss: 2.4891 train_time: 5.7m tok/s: 7249703 +3170/20000 train_loss: 2.6197 train_time: 5.7m tok/s: 7248979 +3171/20000 train_loss: 2.6526 train_time: 5.7m tok/s: 7248349 +3172/20000 train_loss: 2.6582 train_time: 5.7m tok/s: 7247690 +3173/20000 train_loss: 2.7111 train_time: 5.7m tok/s: 
7246989 +3174/20000 train_loss: 2.4650 train_time: 5.7m tok/s: 7246336 +3175/20000 train_loss: 2.6110 train_time: 5.7m tok/s: 7245684 +3176/20000 train_loss: 2.4433 train_time: 5.7m tok/s: 7245050 +3177/20000 train_loss: 2.4470 train_time: 5.7m tok/s: 7244374 +3178/20000 train_loss: 2.5914 train_time: 5.8m tok/s: 7243650 +3179/20000 train_loss: 2.4297 train_time: 5.8m tok/s: 7242931 +3180/20000 train_loss: 2.5291 train_time: 5.8m tok/s: 7242289 +3181/20000 train_loss: 2.3763 train_time: 5.8m tok/s: 7241655 +3182/20000 train_loss: 2.8157 train_time: 5.8m tok/s: 7240984 +3183/20000 train_loss: 2.6566 train_time: 5.8m tok/s: 7240328 +3184/20000 train_loss: 2.4968 train_time: 5.8m tok/s: 7239639 +3185/20000 train_loss: 2.5749 train_time: 5.8m tok/s: 7238971 +3186/20000 train_loss: 2.5264 train_time: 5.8m tok/s: 7238355 +3187/20000 train_loss: 2.5703 train_time: 5.8m tok/s: 7237702 +3188/20000 train_loss: 2.3911 train_time: 5.8m tok/s: 7237046 +3189/20000 train_loss: 2.5913 train_time: 5.8m tok/s: 7236446 +3190/20000 train_loss: 2.5740 train_time: 5.8m tok/s: 7235794 +3191/20000 train_loss: 2.4781 train_time: 5.8m tok/s: 7235130 +3192/20000 train_loss: 2.3756 train_time: 5.8m tok/s: 7234506 +3193/20000 train_loss: 2.4587 train_time: 5.8m tok/s: 7233868 +3194/20000 train_loss: 2.4452 train_time: 5.8m tok/s: 7233244 +3195/20000 train_loss: 2.5014 train_time: 5.8m tok/s: 7232590 +3196/20000 train_loss: 2.3887 train_time: 5.8m tok/s: 7231947 +3197/20000 train_loss: 2.4670 train_time: 5.8m tok/s: 7231297 +3198/20000 train_loss: 2.4922 train_time: 5.8m tok/s: 7230663 +3199/20000 train_loss: 2.5067 train_time: 5.8m tok/s: 7230001 +3200/20000 train_loss: 2.6316 train_time: 5.8m tok/s: 7229359 +3201/20000 train_loss: 2.5681 train_time: 5.8m tok/s: 7228720 +3202/20000 train_loss: 2.2660 train_time: 5.8m tok/s: 7228023 +3203/20000 train_loss: 2.5436 train_time: 5.8m tok/s: 7227364 +3204/20000 train_loss: 2.4430 train_time: 5.8m tok/s: 7226720 +3205/20000 train_loss: 2.5027 
train_time: 5.8m tok/s: 7226083 +3206/20000 train_loss: 2.4455 train_time: 5.8m tok/s: 7225461 +3207/20000 train_loss: 2.5527 train_time: 5.8m tok/s: 7224817 +3208/20000 train_loss: 2.5809 train_time: 5.8m tok/s: 7224176 +3209/20000 train_loss: 2.4532 train_time: 5.8m tok/s: 7223534 +3210/20000 train_loss: 2.7146 train_time: 5.8m tok/s: 7222868 +3211/20000 train_loss: 2.5104 train_time: 5.8m tok/s: 7222248 +3212/20000 train_loss: 2.4045 train_time: 5.8m tok/s: 7221617 +3213/20000 train_loss: 2.5550 train_time: 5.8m tok/s: 7220942 +3214/20000 train_loss: 2.4174 train_time: 5.8m tok/s: 7220273 +3215/20000 train_loss: 2.4334 train_time: 5.8m tok/s: 7219657 +3216/20000 train_loss: 2.3250 train_time: 5.8m tok/s: 7219011 +3217/20000 train_loss: 2.4573 train_time: 5.8m tok/s: 7218367 +3218/20000 train_loss: 2.4992 train_time: 5.8m tok/s: 7217746 +3219/20000 train_loss: 2.5620 train_time: 5.8m tok/s: 7217107 +3220/20000 train_loss: 2.6253 train_time: 5.8m tok/s: 7216498 +3221/20000 train_loss: 2.4274 train_time: 5.9m tok/s: 7215888 +3222/20000 train_loss: 2.5267 train_time: 5.9m tok/s: 7215253 +3223/20000 train_loss: 2.6323 train_time: 5.9m tok/s: 7214591 +3224/20000 train_loss: 2.9704 train_time: 5.9m tok/s: 7213903 +3225/20000 train_loss: 2.5757 train_time: 5.9m tok/s: 7213276 +3226/20000 train_loss: 2.4038 train_time: 5.9m tok/s: 7212627 +3227/20000 train_loss: 2.4830 train_time: 5.9m tok/s: 7212004 +3228/20000 train_loss: 2.8148 train_time: 5.9m tok/s: 7211356 +3229/20000 train_loss: 2.4096 train_time: 5.9m tok/s: 7210751 +3230/20000 train_loss: 2.4425 train_time: 5.9m tok/s: 7210108 +3231/20000 train_loss: 2.5574 train_time: 5.9m tok/s: 7209509 +3232/20000 train_loss: 2.5294 train_time: 5.9m tok/s: 7208890 +3233/20000 train_loss: 2.5237 train_time: 5.9m tok/s: 7208263 +3234/20000 train_loss: 2.5934 train_time: 5.9m tok/s: 7207608 +3235/20000 train_loss: 2.5289 train_time: 5.9m tok/s: 7206955 +3236/20000 train_loss: 2.2800 train_time: 5.9m tok/s: 7206336 +3237/20000 
train_loss: 2.4675 train_time: 5.9m tok/s: 7205725 +3238/20000 train_loss: 2.3632 train_time: 5.9m tok/s: 7205075 +3239/20000 train_loss: 2.3968 train_time: 5.9m tok/s: 7204418 +3240/20000 train_loss: 2.4947 train_time: 5.9m tok/s: 7203797 +3241/20000 train_loss: 2.4977 train_time: 5.9m tok/s: 7203199 +3242/20000 train_loss: 2.5252 train_time: 5.9m tok/s: 7202593 +3243/20000 train_loss: 2.4453 train_time: 5.9m tok/s: 7201971 +3244/20000 train_loss: 2.5583 train_time: 5.9m tok/s: 7201333 +3245/20000 train_loss: 2.6331 train_time: 5.9m tok/s: 7200710 +3246/20000 train_loss: 2.5135 train_time: 5.9m tok/s: 7200100 +3247/20000 train_loss: 2.4101 train_time: 5.9m tok/s: 7199471 +3248/20000 train_loss: 2.4641 train_time: 5.9m tok/s: 7198825 +3249/20000 train_loss: 2.5948 train_time: 5.9m tok/s: 7198204 +3250/20000 train_loss: 2.5486 train_time: 5.9m tok/s: 7197573 +3251/20000 train_loss: 2.5854 train_time: 5.9m tok/s: 7196933 +3252/20000 train_loss: 2.4455 train_time: 5.9m tok/s: 7196304 +3253/20000 train_loss: 2.4404 train_time: 5.9m tok/s: 7195694 +3254/20000 train_loss: 2.9084 train_time: 5.9m tok/s: 7195007 +3255/20000 train_loss: 2.4589 train_time: 5.9m tok/s: 7194370 +3256/20000 train_loss: 2.4984 train_time: 5.9m tok/s: 7193805 +3257/20000 train_loss: 2.5737 train_time: 5.9m tok/s: 7193129 +3258/20000 train_loss: 2.5417 train_time: 5.9m tok/s: 7192523 +3259/20000 train_loss: 2.4893 train_time: 5.9m tok/s: 7191915 +3260/20000 train_loss: 2.4826 train_time: 5.9m tok/s: 7191321 +3261/20000 train_loss: 2.5216 train_time: 5.9m tok/s: 7190728 +3262/20000 train_loss: 2.4027 train_time: 5.9m tok/s: 7190091 +3263/20000 train_loss: 2.4644 train_time: 5.9m tok/s: 7189482 +3264/20000 train_loss: 2.5121 train_time: 6.0m tok/s: 7188864 +3265/20000 train_loss: 2.5335 train_time: 6.0m tok/s: 7188248 +3266/20000 train_loss: 2.5419 train_time: 6.0m tok/s: 7187674 +3267/20000 train_loss: 2.4892 train_time: 6.0m tok/s: 7187056 +3268/20000 train_loss: 2.5159 train_time: 6.0m tok/s: 
7186424 +3269/20000 train_loss: 2.6167 train_time: 6.0m tok/s: 7185794 +3270/20000 train_loss: 2.4660 train_time: 6.0m tok/s: 7185171 +3271/20000 train_loss: 2.5525 train_time: 6.0m tok/s: 7184526 +3272/20000 train_loss: 2.5093 train_time: 6.0m tok/s: 7183936 +3273/20000 train_loss: 2.6726 train_time: 6.0m tok/s: 7183326 +3274/20000 train_loss: 2.4739 train_time: 6.0m tok/s: 7182728 +3275/20000 train_loss: 2.5227 train_time: 6.0m tok/s: 7182144 +3276/20000 train_loss: 2.4958 train_time: 6.0m tok/s: 7181545 +3277/20000 train_loss: 2.5035 train_time: 6.0m tok/s: 7180858 +3278/20000 train_loss: 2.4002 train_time: 6.0m tok/s: 7180291 +3279/20000 train_loss: 2.4710 train_time: 6.0m tok/s: 7179678 +3280/20000 train_loss: 2.3869 train_time: 6.0m tok/s: 7179092 +3281/20000 train_loss: 2.4256 train_time: 6.0m tok/s: 7178429 +3282/20000 train_loss: 2.4917 train_time: 6.0m tok/s: 7177787 +3283/20000 train_loss: 2.4325 train_time: 6.0m tok/s: 7177198 +3284/20000 train_loss: 2.6316 train_time: 6.0m tok/s: 7176613 +3285/20000 train_loss: 2.4892 train_time: 6.0m tok/s: 7175992 +3286/20000 train_loss: 2.4447 train_time: 6.0m tok/s: 7175391 +3287/20000 train_loss: 2.5094 train_time: 6.0m tok/s: 7174823 +3288/20000 train_loss: 2.5037 train_time: 6.0m tok/s: 7174229 +3289/20000 train_loss: 2.4839 train_time: 6.0m tok/s: 7173620 +3290/20000 train_loss: 2.4590 train_time: 6.0m tok/s: 7173036 +3291/20000 train_loss: 2.4432 train_time: 6.0m tok/s: 7172454 +3292/20000 train_loss: 2.4648 train_time: 6.0m tok/s: 7171860 +3293/20000 train_loss: 2.4589 train_time: 6.0m tok/s: 7171230 +3294/20000 train_loss: 2.4295 train_time: 6.0m tok/s: 7170630 +3295/20000 train_loss: 2.5179 train_time: 6.0m tok/s: 7170041 +3296/20000 train_loss: 2.3923 train_time: 6.0m tok/s: 7169416 +3297/20000 train_loss: 2.4029 train_time: 6.0m tok/s: 7168810 +3298/20000 train_loss: 2.4810 train_time: 6.0m tok/s: 7168196 +3299/20000 train_loss: 2.3018 train_time: 6.0m tok/s: 7167548 +3300/20000 train_loss: 2.4603 
train_time: 6.0m tok/s: 7166960 +3301/20000 train_loss: 2.4298 train_time: 6.0m tok/s: 7166367 +3302/20000 train_loss: 2.2260 train_time: 6.0m tok/s: 7165696 +3303/20000 train_loss: 2.6549 train_time: 6.0m tok/s: 7165109 +3304/20000 train_loss: 2.4978 train_time: 6.0m tok/s: 7164535 +3305/20000 train_loss: 2.5661 train_time: 6.0m tok/s: 7163947 +3306/20000 train_loss: 2.6147 train_time: 6.0m tok/s: 7163386 +3307/20000 train_loss: 2.5220 train_time: 6.1m tok/s: 7162767 +3308/20000 train_loss: 2.2471 train_time: 6.1m tok/s: 7162169 +3309/20000 train_loss: 2.4734 train_time: 6.1m tok/s: 7161575 +3310/20000 train_loss: 2.5375 train_time: 6.1m tok/s: 7161013 +3311/20000 train_loss: 2.6588 train_time: 6.1m tok/s: 7160392 +3312/20000 train_loss: 2.5314 train_time: 6.1m tok/s: 7159828 +3313/20000 train_loss: 2.3741 train_time: 6.1m tok/s: 7159227 +3314/20000 train_loss: 2.4932 train_time: 6.1m tok/s: 7158673 +3315/20000 train_loss: 2.5193 train_time: 6.1m tok/s: 7158077 +3316/20000 train_loss: 2.5116 train_time: 6.1m tok/s: 7157515 +3317/20000 train_loss: 2.4245 train_time: 6.1m tok/s: 7156898 +3318/20000 train_loss: 2.3951 train_time: 6.1m tok/s: 7156298 +3319/20000 train_loss: 2.3987 train_time: 6.1m tok/s: 7155696 +3320/20000 train_loss: 2.4221 train_time: 6.1m tok/s: 7155103 +3321/20000 train_loss: 2.4437 train_time: 6.1m tok/s: 7154516 +3322/20000 train_loss: 2.3967 train_time: 6.1m tok/s: 7153916 +3323/20000 train_loss: 2.4075 train_time: 6.1m tok/s: 7153311 +3324/20000 train_loss: 2.4545 train_time: 6.1m tok/s: 7152717 +3325/20000 train_loss: 2.5554 train_time: 6.1m tok/s: 7152169 +3326/20000 train_loss: 2.5091 train_time: 6.1m tok/s: 7151592 +3327/20000 train_loss: 2.4195 train_time: 6.1m tok/s: 7151021 +3328/20000 train_loss: 2.5676 train_time: 6.1m tok/s: 7150382 +3329/20000 train_loss: 2.4589 train_time: 6.1m tok/s: 7149820 +3330/20000 train_loss: 2.8022 train_time: 6.1m tok/s: 7149216 +3331/20000 train_loss: 2.6140 train_time: 6.1m tok/s: 7148634 +3332/20000 
train_loss: 2.5293 train_time: 6.1m tok/s: 7148042 +3333/20000 train_loss: 2.3851 train_time: 6.1m tok/s: 7147498 +3334/20000 train_loss: 2.4664 train_time: 6.1m tok/s: 7146907 +3335/20000 train_loss: 2.5640 train_time: 6.1m tok/s: 7146345 +3336/20000 train_loss: 2.5167 train_time: 6.1m tok/s: 7145785 +3337/20000 train_loss: 2.2930 train_time: 6.1m tok/s: 7145230 +3338/20000 train_loss: 2.4640 train_time: 6.1m tok/s: 7144634 +3339/20000 train_loss: 2.4096 train_time: 6.1m tok/s: 7144032 +3340/20000 train_loss: 2.4313 train_time: 6.1m tok/s: 7143455 +3341/20000 train_loss: 2.4451 train_time: 6.1m tok/s: 7142874 +3342/20000 train_loss: 2.3835 train_time: 6.1m tok/s: 7142273 +3343/20000 train_loss: 2.3845 train_time: 6.1m tok/s: 7141699 +3344/20000 train_loss: 2.5397 train_time: 6.1m tok/s: 7141119 +3345/20000 train_loss: 2.4693 train_time: 6.1m tok/s: 7140553 +3346/20000 train_loss: 2.4839 train_time: 6.1m tok/s: 7139959 +3347/20000 train_loss: 2.5239 train_time: 6.1m tok/s: 7139382 +3348/20000 train_loss: 2.5812 train_time: 6.1m tok/s: 7138785 +3349/20000 train_loss: 2.4968 train_time: 6.1m tok/s: 7138208 +3350/20000 train_loss: 2.4481 train_time: 6.2m tok/s: 7137642 +3351/20000 train_loss: 2.4829 train_time: 6.2m tok/s: 7137099 +3352/20000 train_loss: 2.4912 train_time: 6.2m tok/s: 7136527 +3353/20000 train_loss: 2.4329 train_time: 6.2m tok/s: 7135911 +3354/20000 train_loss: 2.5235 train_time: 6.2m tok/s: 7135325 +3355/20000 train_loss: 2.5622 train_time: 6.2m tok/s: 7134734 +3356/20000 train_loss: 2.3523 train_time: 6.2m tok/s: 7134122 +3357/20000 train_loss: 2.3535 train_time: 6.2m tok/s: 7133544 +3358/20000 train_loss: 2.5081 train_time: 6.2m tok/s: 7132949 +3359/20000 train_loss: 2.4437 train_time: 6.2m tok/s: 7132382 +3360/20000 train_loss: 2.4519 train_time: 6.2m tok/s: 7131830 +3361/20000 train_loss: 2.5160 train_time: 6.2m tok/s: 7131287 +3362/20000 train_loss: 2.4883 train_time: 6.2m tok/s: 7130744 +3363/20000 train_loss: 2.4211 train_time: 6.2m tok/s: 
7130130 +3364/20000 train_loss: 2.5065 train_time: 6.2m tok/s: 7129582 +3365/20000 train_loss: 2.5105 train_time: 6.2m tok/s: 7129025 +3366/20000 train_loss: 2.3797 train_time: 6.2m tok/s: 7128450 +3367/20000 train_loss: 2.4190 train_time: 6.2m tok/s: 7127864 +3368/20000 train_loss: 2.3751 train_time: 6.2m tok/s: 7127321 +3369/20000 train_loss: 2.6137 train_time: 6.2m tok/s: 7126709 +3370/20000 train_loss: 2.5867 train_time: 6.2m tok/s: 7126119 +3371/20000 train_loss: 2.5008 train_time: 6.2m tok/s: 7125561 +3372/20000 train_loss: 2.5788 train_time: 6.2m tok/s: 7125015 +3373/20000 train_loss: 2.4649 train_time: 6.2m tok/s: 7124450 +3374/20000 train_loss: 2.4657 train_time: 6.2m tok/s: 7123902 +3375/20000 train_loss: 2.4768 train_time: 6.2m tok/s: 7123315 +3376/20000 train_loss: 2.4241 train_time: 6.2m tok/s: 7122752 +3377/20000 train_loss: 2.5516 train_time: 6.2m tok/s: 7122199 +3378/20000 train_loss: 2.3387 train_time: 6.2m tok/s: 7121625 +3379/20000 train_loss: 2.4119 train_time: 6.2m tok/s: 7121077 +3380/20000 train_loss: 2.3512 train_time: 6.2m tok/s: 7120468 +3381/20000 train_loss: 2.3308 train_time: 6.2m tok/s: 7119902 +3382/20000 train_loss: 2.5739 train_time: 6.2m tok/s: 7119384 +3383/20000 train_loss: 2.4963 train_time: 6.2m tok/s: 7118815 +3384/20000 train_loss: 2.4819 train_time: 6.2m tok/s: 7118235 +3385/20000 train_loss: 2.4738 train_time: 6.2m tok/s: 7117710 +3386/20000 train_loss: 2.4925 train_time: 6.2m tok/s: 7117143 +3387/20000 train_loss: 2.5179 train_time: 6.2m tok/s: 7116575 +3388/20000 train_loss: 2.2938 train_time: 6.2m tok/s: 7115932 +3389/20000 train_loss: 2.4199 train_time: 6.2m tok/s: 7115408 +3390/20000 train_loss: 2.4964 train_time: 6.2m tok/s: 7114842 +3391/20000 train_loss: 2.4946 train_time: 6.2m tok/s: 7114318 +3392/20000 train_loss: 2.4608 train_time: 6.2m tok/s: 7113756 +3393/20000 train_loss: 2.4411 train_time: 6.3m tok/s: 7113209 +3394/20000 train_loss: 2.4623 train_time: 6.3m tok/s: 7112681 +3395/20000 train_loss: 2.4558 
train_time: 6.3m tok/s: 7112116 +3396/20000 train_loss: 2.6042 train_time: 6.3m tok/s: 7111583 +3397/20000 train_loss: 2.5238 train_time: 6.3m tok/s: 7110996 +3398/20000 train_loss: 2.3634 train_time: 6.3m tok/s: 7110432 +3399/20000 train_loss: 2.4214 train_time: 6.3m tok/s: 7109852 +3400/20000 train_loss: 2.5236 train_time: 6.3m tok/s: 7109286 +3401/20000 train_loss: 2.4066 train_time: 6.3m tok/s: 7108735 +3402/20000 train_loss: 2.4460 train_time: 6.3m tok/s: 7108217 +3403/20000 train_loss: 2.4835 train_time: 6.3m tok/s: 7107675 +3404/20000 train_loss: 2.4491 train_time: 6.3m tok/s: 7107126 +3405/20000 train_loss: 2.6259 train_time: 6.3m tok/s: 7106562 +3406/20000 train_loss: 2.4740 train_time: 6.3m tok/s: 7106038 +3407/20000 train_loss: 2.5022 train_time: 6.3m tok/s: 7105486 +3408/20000 train_loss: 2.5675 train_time: 6.3m tok/s: 7104924 +3409/20000 train_loss: 2.4109 train_time: 6.3m tok/s: 7104363 +3410/20000 train_loss: 2.4201 train_time: 6.3m tok/s: 7103800 +3411/20000 train_loss: 2.3420 train_time: 6.3m tok/s: 7103251 +3412/20000 train_loss: 2.3418 train_time: 6.3m tok/s: 7102700 +3413/20000 train_loss: 2.3354 train_time: 6.3m tok/s: 7102115 +3414/20000 train_loss: 2.4392 train_time: 6.3m tok/s: 7101590 +3415/20000 train_loss: 2.5920 train_time: 6.3m tok/s: 7101048 +3416/20000 train_loss: 2.4995 train_time: 6.3m tok/s: 7100508 +3417/20000 train_loss: 2.5651 train_time: 6.3m tok/s: 7099983 +3418/20000 train_loss: 2.5026 train_time: 6.3m tok/s: 7099418 +3419/20000 train_loss: 2.5278 train_time: 6.3m tok/s: 7098852 +3420/20000 train_loss: 2.5372 train_time: 6.3m tok/s: 7098283 +3421/20000 train_loss: 2.3724 train_time: 6.3m tok/s: 7097739 +3422/20000 train_loss: 2.6820 train_time: 6.3m tok/s: 7097172 +3423/20000 train_loss: 2.4012 train_time: 6.3m tok/s: 7096621 +3424/20000 train_loss: 2.4492 train_time: 6.3m tok/s: 7096120 +3425/20000 train_loss: 2.4224 train_time: 6.3m tok/s: 7095582 +3426/20000 train_loss: 2.5048 train_time: 6.3m tok/s: 7095052 +3427/20000 
train_loss: 2.4574 train_time: 6.3m tok/s: 7094517 +3428/20000 train_loss: 2.4555 train_time: 6.3m tok/s: 7093952 +3429/20000 train_loss: 2.5958 train_time: 6.3m tok/s: 7093416 +3430/20000 train_loss: 2.4138 train_time: 6.3m tok/s: 7092868 +3431/20000 train_loss: 2.4666 train_time: 6.3m tok/s: 7092326 +3432/20000 train_loss: 2.5927 train_time: 6.3m tok/s: 7091774 +3433/20000 train_loss: 2.4548 train_time: 6.3m tok/s: 7091237 +3434/20000 train_loss: 2.5114 train_time: 6.3m tok/s: 7090698 +3435/20000 train_loss: 2.4386 train_time: 6.4m tok/s: 7090129 +3436/20000 train_loss: 2.4114 train_time: 6.4m tok/s: 7089598 +3437/20000 train_loss: 2.5472 train_time: 6.4m tok/s: 7089048 +3438/20000 train_loss: 2.4870 train_time: 6.4m tok/s: 7088511 +3439/20000 train_loss: 2.3244 train_time: 6.4m tok/s: 7087965 +3440/20000 train_loss: 2.4800 train_time: 6.4m tok/s: 7087421 +3441/20000 train_loss: 2.3781 train_time: 6.4m tok/s: 7086902 +3442/20000 train_loss: 2.4075 train_time: 6.4m tok/s: 7086366 +3443/20000 train_loss: 2.6580 train_time: 6.4m tok/s: 7085749 +3444/20000 train_loss: 2.3229 train_time: 6.4m tok/s: 7085220 +3445/20000 train_loss: 2.4615 train_time: 6.4m tok/s: 7084707 +3446/20000 train_loss: 2.5265 train_time: 6.4m tok/s: 7084210 +3447/20000 train_loss: 2.4848 train_time: 6.4m tok/s: 7083667 +3448/20000 train_loss: 2.6048 train_time: 6.4m tok/s: 7083145 +3449/20000 train_loss: 2.4741 train_time: 6.4m tok/s: 7082623 +3450/20000 train_loss: 2.4281 train_time: 6.4m tok/s: 7082091 +3451/20000 train_loss: 2.4902 train_time: 6.4m tok/s: 7081574 +3452/20000 train_loss: 2.5142 train_time: 6.4m tok/s: 7081048 +3453/20000 train_loss: 2.4023 train_time: 6.4m tok/s: 7080490 +3454/20000 train_loss: 2.5197 train_time: 6.4m tok/s: 7079952 +3455/20000 train_loss: 2.4801 train_time: 6.4m tok/s: 7079417 +3456/20000 train_loss: 2.4824 train_time: 6.4m tok/s: 7078868 +3457/20000 train_loss: 2.4390 train_time: 6.4m tok/s: 7078340 +3458/20000 train_loss: 2.3951 train_time: 6.4m tok/s: 
7077809 +3459/20000 train_loss: 2.3627 train_time: 6.4m tok/s: 7077260 +3460/20000 train_loss: 2.5176 train_time: 6.4m tok/s: 7076757 +3461/20000 train_loss: 2.6255 train_time: 6.4m tok/s: 7076238 +3462/20000 train_loss: 2.5459 train_time: 6.4m tok/s: 7075724 +3463/20000 train_loss: 2.5698 train_time: 6.4m tok/s: 7075195 +3464/20000 train_loss: 2.4813 train_time: 6.4m tok/s: 7074659 +3465/20000 train_loss: 2.5264 train_time: 6.4m tok/s: 7074112 +3466/20000 train_loss: 2.5895 train_time: 6.4m tok/s: 7073557 +3467/20000 train_loss: 2.4422 train_time: 6.4m tok/s: 7073037 +3468/20000 train_loss: 2.6316 train_time: 6.4m tok/s: 7072462 +3469/20000 train_loss: 2.3911 train_time: 6.4m tok/s: 7071924 +3470/20000 train_loss: 2.3986 train_time: 6.4m tok/s: 7071385 +3471/20000 train_loss: 2.3894 train_time: 6.4m tok/s: 7070864 +3472/20000 train_loss: 2.4341 train_time: 6.4m tok/s: 7070330 +3473/20000 train_loss: 2.3967 train_time: 6.4m tok/s: 7069795 +3474/20000 train_loss: 2.4729 train_time: 6.4m tok/s: 7069321 +3475/20000 train_loss: 2.5199 train_time: 6.4m tok/s: 7068809 +3476/20000 train_loss: 2.4534 train_time: 6.4m tok/s: 7068293 +3477/20000 train_loss: 2.5265 train_time: 6.4m tok/s: 7067796 +3478/20000 train_loss: 2.5563 train_time: 6.5m tok/s: 7067253 +3479/20000 train_loss: 2.4552 train_time: 6.5m tok/s: 7066724 +3480/20000 train_loss: 2.5218 train_time: 6.5m tok/s: 7066215 +3481/20000 train_loss: 2.4812 train_time: 6.5m tok/s: 7065689 +3482/20000 train_loss: 2.4246 train_time: 6.5m tok/s: 7065172 +3483/20000 train_loss: 2.4135 train_time: 6.5m tok/s: 7064639 +3484/20000 train_loss: 2.4664 train_time: 6.5m tok/s: 7064092 +3485/20000 train_loss: 2.3978 train_time: 6.5m tok/s: 7063548 +3486/20000 train_loss: 2.3790 train_time: 6.5m tok/s: 7063039 +3487/20000 train_loss: 2.4752 train_time: 6.5m tok/s: 7062537 +3488/20000 train_loss: 2.3655 train_time: 6.5m tok/s: 7062003 +3489/20000 train_loss: 2.3809 train_time: 6.5m tok/s: 7061508 +3490/20000 train_loss: 2.4803 
train_time: 6.5m tok/s: 7060982 +3491/20000 train_loss: 2.5808 train_time: 6.5m tok/s: 7060473 +3492/20000 train_loss: 2.4219 train_time: 6.5m tok/s: 7059941 +3493/20000 train_loss: 2.5487 train_time: 6.5m tok/s: 7059420 +3494/20000 train_loss: 2.5215 train_time: 6.5m tok/s: 7058899 +3495/20000 train_loss: 2.4859 train_time: 6.5m tok/s: 7058394 +3496/20000 train_loss: 2.5887 train_time: 6.5m tok/s: 7057857 +3497/20000 train_loss: 2.6826 train_time: 6.5m tok/s: 7057333 +3498/20000 train_loss: 2.3489 train_time: 6.5m tok/s: 7056780 +3499/20000 train_loss: 2.4571 train_time: 6.5m tok/s: 7056263 +3500/20000 train_loss: 2.3643 train_time: 6.5m tok/s: 7055765 +3501/20000 train_loss: 2.4152 train_time: 6.5m tok/s: 7055250 +3502/20000 train_loss: 3.2537 train_time: 6.5m tok/s: 7054691 +3503/20000 train_loss: 2.4561 train_time: 6.5m tok/s: 7054189 +3504/20000 train_loss: 2.5520 train_time: 6.5m tok/s: 7053684 +3505/20000 train_loss: 2.5014 train_time: 6.5m tok/s: 7053158 +3506/20000 train_loss: 2.5595 train_time: 6.5m tok/s: 7052641 +3507/20000 train_loss: 2.5266 train_time: 6.5m tok/s: 7052153 +3508/20000 train_loss: 2.6145 train_time: 6.5m tok/s: 7051633 +3509/20000 train_loss: 2.4345 train_time: 6.5m tok/s: 7051124 +3510/20000 train_loss: 2.4274 train_time: 6.5m tok/s: 7050612 +3511/20000 train_loss: 2.4147 train_time: 6.5m tok/s: 7050096 +3512/20000 train_loss: 2.4612 train_time: 6.5m tok/s: 7049615 +3513/20000 train_loss: 2.4101 train_time: 6.5m tok/s: 7049102 +3514/20000 train_loss: 2.5405 train_time: 6.5m tok/s: 7048598 +3515/20000 train_loss: 2.3628 train_time: 6.5m tok/s: 7048087 +3516/20000 train_loss: 2.2678 train_time: 6.5m tok/s: 7047556 +3517/20000 train_loss: 2.4471 train_time: 6.5m tok/s: 7047047 +3518/20000 train_loss: 2.4150 train_time: 6.5m tok/s: 7046531 +3519/20000 train_loss: 2.8152 train_time: 6.5m tok/s: 7046012 +3520/20000 train_loss: 2.6382 train_time: 6.5m tok/s: 7045504 +3521/20000 train_loss: 2.4947 train_time: 6.6m tok/s: 7045002 +3522/20000 
train_loss: 2.5101 train_time: 6.6m tok/s: 7044511 +3523/20000 train_loss: 2.3836 train_time: 6.6m tok/s: 7043984 +3524/20000 train_loss: 2.7616 train_time: 6.6m tok/s: 7043456 +3525/20000 train_loss: 2.5207 train_time: 6.6m tok/s: 7042973 +3526/20000 train_loss: 2.4885 train_time: 6.6m tok/s: 7042480 +3527/20000 train_loss: 2.4110 train_time: 6.6m tok/s: 7041958 +3528/20000 train_loss: 2.4551 train_time: 6.6m tok/s: 7041471 +3529/20000 train_loss: 2.4471 train_time: 6.6m tok/s: 7040972 +3530/20000 train_loss: 2.6740 train_time: 6.6m tok/s: 7040477 +3531/20000 train_loss: 2.3005 train_time: 6.6m tok/s: 7039962 +3532/20000 train_loss: 2.2754 train_time: 6.6m tok/s: 7039426 +3533/20000 train_loss: 2.4349 train_time: 6.6m tok/s: 7038943 +3534/20000 train_loss: 2.4674 train_time: 6.6m tok/s: 7038449 +3535/20000 train_loss: 2.4061 train_time: 6.6m tok/s: 7037905 +3536/20000 train_loss: 2.4711 train_time: 6.6m tok/s: 7037399 +3537/20000 train_loss: 2.6009 train_time: 6.6m tok/s: 7036911 +3538/20000 train_loss: 2.4461 train_time: 6.6m tok/s: 7036401 +3539/20000 train_loss: 2.4700 train_time: 6.6m tok/s: 7035894 +3540/20000 train_loss: 2.4923 train_time: 6.6m tok/s: 7035408 +3541/20000 train_loss: 2.2918 train_time: 6.6m tok/s: 7034869 +3542/20000 train_loss: 2.4060 train_time: 6.6m tok/s: 7034385 +3543/20000 train_loss: 2.5108 train_time: 6.6m tok/s: 7033919 +3544/20000 train_loss: 2.4217 train_time: 6.6m tok/s: 7033426 +3545/20000 train_loss: 2.4562 train_time: 6.6m tok/s: 7032932 +3546/20000 train_loss: 2.3594 train_time: 6.6m tok/s: 7032448 +3547/20000 train_loss: 2.3354 train_time: 6.6m tok/s: 7031939 +3548/20000 train_loss: 2.5740 train_time: 6.6m tok/s: 7031456 +3549/20000 train_loss: 2.5424 train_time: 6.6m tok/s: 7030946 +3550/20000 train_loss: 2.5435 train_time: 6.6m tok/s: 7030445 +3551/20000 train_loss: 2.5676 train_time: 6.6m tok/s: 7029952 +3552/20000 train_loss: 2.4730 train_time: 6.6m tok/s: 7029451 +3553/20000 train_loss: 2.5794 train_time: 6.6m tok/s: 
7028928 +3554/20000 train_loss: 2.4968 train_time: 6.6m tok/s: 7028440 +3555/20000 train_loss: 2.5131 train_time: 6.6m tok/s: 7027919 +3556/20000 train_loss: 2.5184 train_time: 6.6m tok/s: 7027441 +3557/20000 train_loss: 2.4359 train_time: 6.6m tok/s: 7026950 +3558/20000 train_loss: 2.6188 train_time: 6.6m tok/s: 7026446 +3559/20000 train_loss: 2.4893 train_time: 6.6m tok/s: 7025958 +3560/20000 train_loss: 2.4495 train_time: 6.6m tok/s: 7025472 +3561/20000 train_loss: 3.1508 train_time: 6.6m tok/s: 7024970 +3562/20000 train_loss: 2.3820 train_time: 6.6m tok/s: 7024474 +3563/20000 train_loss: 2.4776 train_time: 6.6m tok/s: 7023990 +3564/20000 train_loss: 2.4798 train_time: 6.7m tok/s: 7023501 +3565/20000 train_loss: 2.4718 train_time: 6.7m tok/s: 7023002 +3566/20000 train_loss: 2.4701 train_time: 6.7m tok/s: 7022501 +3567/20000 train_loss: 2.5216 train_time: 6.7m tok/s: 7022009 +3568/20000 train_loss: 2.5333 train_time: 6.7m tok/s: 7021501 +3569/20000 train_loss: 2.3183 train_time: 6.7m tok/s: 7021010 +3570/20000 train_loss: 2.3002 train_time: 6.7m tok/s: 7020506 +3571/20000 train_loss: 2.3844 train_time: 6.7m tok/s: 7020037 +3572/20000 train_loss: 2.3876 train_time: 6.7m tok/s: 7019537 +3573/20000 train_loss: 2.2646 train_time: 6.7m tok/s: 7019025 +3574/20000 train_loss: 2.4195 train_time: 6.7m tok/s: 7018508 +3575/20000 train_loss: 2.5021 train_time: 6.7m tok/s: 7018028 +3576/20000 train_loss: 2.5048 train_time: 6.7m tok/s: 7017568 +3577/20000 train_loss: 2.4834 train_time: 6.7m tok/s: 7017078 +3578/20000 train_loss: 2.5625 train_time: 6.7m tok/s: 7016595 +3579/20000 train_loss: 2.5155 train_time: 6.7m tok/s: 7016114 +3580/20000 train_loss: 2.5200 train_time: 6.7m tok/s: 7015595 +3581/20000 train_loss: 2.5040 train_time: 6.7m tok/s: 7015139 +3582/20000 train_loss: 2.3786 train_time: 6.7m tok/s: 7014647 +3583/20000 train_loss: 2.4107 train_time: 6.7m tok/s: 7014167 +3584/20000 train_loss: 2.3550 train_time: 6.7m tok/s: 7013670 +3585/20000 train_loss: 2.3413 
train_time: 6.7m tok/s: 7013149 +3586/20000 train_loss: 2.1771 train_time: 6.7m tok/s: 7012644 +3587/20000 train_loss: 2.3261 train_time: 6.7m tok/s: 7012166 +3588/20000 train_loss: 2.3762 train_time: 6.7m tok/s: 7011693 +3589/20000 train_loss: 2.6170 train_time: 6.7m tok/s: 7011165 +3590/20000 train_loss: 2.5090 train_time: 6.7m tok/s: 7010702 +3591/20000 train_loss: 2.5254 train_time: 6.7m tok/s: 7010232 +3592/20000 train_loss: 2.5125 train_time: 6.7m tok/s: 7009766 +3593/20000 train_loss: 2.5689 train_time: 6.7m tok/s: 7009283 +3594/20000 train_loss: 2.4895 train_time: 6.7m tok/s: 7008812 +3595/20000 train_loss: 2.5334 train_time: 6.7m tok/s: 7008336 +3596/20000 train_loss: 2.5142 train_time: 6.7m tok/s: 7007890 +3597/20000 train_loss: 2.5492 train_time: 6.7m tok/s: 7007376 +3598/20000 train_loss: 2.2644 train_time: 6.7m tok/s: 7006877 +3599/20000 train_loss: 2.4678 train_time: 6.7m tok/s: 7006418 +3600/20000 train_loss: 2.5597 train_time: 6.7m tok/s: 7005950 +3601/20000 train_loss: 2.4273 train_time: 6.7m tok/s: 7005494 +3602/20000 train_loss: 2.3135 train_time: 6.7m tok/s: 7004984 +3603/20000 train_loss: 2.5459 train_time: 6.7m tok/s: 7004497 +3604/20000 train_loss: 2.5742 train_time: 6.7m tok/s: 7004050 +3605/20000 train_loss: 2.4399 train_time: 6.7m tok/s: 7003573 +3606/20000 train_loss: 2.4455 train_time: 6.7m tok/s: 7003098 +3607/20000 train_loss: 2.5733 train_time: 6.8m tok/s: 7002621 +3608/20000 train_loss: 2.4212 train_time: 6.8m tok/s: 7002141 +3609/20000 train_loss: 2.4298 train_time: 6.8m tok/s: 7001646 +3610/20000 train_loss: 2.5449 train_time: 6.8m tok/s: 7001159 +3611/20000 train_loss: 2.5282 train_time: 6.8m tok/s: 7000707 +3612/20000 train_loss: 2.4062 train_time: 6.8m tok/s: 7000249 +3613/20000 train_loss: 2.4426 train_time: 6.8m tok/s: 6999777 +3614/20000 train_loss: 2.5392 train_time: 6.8m tok/s: 6999323 +3615/20000 train_loss: 2.3687 train_time: 6.8m tok/s: 6998835 +3616/20000 train_loss: 2.4741 train_time: 6.8m tok/s: 6998366 +3617/20000 
train_loss: 2.4300 train_time: 6.8m tok/s: 6997864 +3618/20000 train_loss: 2.3676 train_time: 6.8m tok/s: 6997378 +3619/20000 train_loss: 2.6401 train_time: 6.8m tok/s: 6996900 +3620/20000 train_loss: 2.3782 train_time: 6.8m tok/s: 6996421 +3621/20000 train_loss: 2.4969 train_time: 6.8m tok/s: 6995967 +3622/20000 train_loss: 2.5241 train_time: 6.8m tok/s: 6995476 +3623/20000 train_loss: 2.5351 train_time: 6.8m tok/s: 6994968 +3624/20000 train_loss: 2.5898 train_time: 6.8m tok/s: 6994490 +3625/20000 train_loss: 2.4384 train_time: 6.8m tok/s: 6994023 +3626/20000 train_loss: 2.4041 train_time: 6.8m tok/s: 6993554 +3627/20000 train_loss: 2.3779 train_time: 6.8m tok/s: 6993110 +3628/20000 train_loss: 2.3938 train_time: 6.8m tok/s: 6992636 +3629/20000 train_loss: 2.5326 train_time: 6.8m tok/s: 6992155 +3630/20000 train_loss: 2.5355 train_time: 6.8m tok/s: 6991673 +3631/20000 train_loss: 2.5370 train_time: 6.8m tok/s: 6991226 +3632/20000 train_loss: 2.5801 train_time: 6.8m tok/s: 6990764 +3633/20000 train_loss: 2.3912 train_time: 6.8m tok/s: 6990296 +3634/20000 train_loss: 2.5125 train_time: 6.8m tok/s: 6989844 +3635/20000 train_loss: 2.4983 train_time: 6.8m tok/s: 6989403 +3636/20000 train_loss: 2.4837 train_time: 6.8m tok/s: 6988946 +3637/20000 train_loss: 2.4436 train_time: 6.8m tok/s: 6988485 +3638/20000 train_loss: 2.4264 train_time: 6.8m tok/s: 6988039 +3639/20000 train_loss: 2.4529 train_time: 6.8m tok/s: 6987574 +3640/20000 train_loss: 2.3957 train_time: 6.8m tok/s: 6987111 +3641/20000 train_loss: 2.3896 train_time: 6.8m tok/s: 6986669 +3642/20000 train_loss: 2.2978 train_time: 6.8m tok/s: 6986233 +3643/20000 train_loss: 2.3839 train_time: 6.8m tok/s: 6985763 +3644/20000 train_loss: 2.0912 train_time: 6.8m tok/s: 6985257 +3645/20000 train_loss: 2.5215 train_time: 6.8m tok/s: 6984811 +3646/20000 train_loss: 2.5317 train_time: 6.8m tok/s: 6984390 +3647/20000 train_loss: 2.5385 train_time: 6.8m tok/s: 6983956 +3648/20000 train_loss: 2.4802 train_time: 6.8m tok/s: 
6983485 +3649/20000 train_loss: 2.4499 train_time: 6.8m tok/s: 6983065 +3650/20000 train_loss: 2.6527 train_time: 6.9m tok/s: 6982573 +3651/20000 train_loss: 2.5564 train_time: 6.9m tok/s: 6982131 +3652/20000 train_loss: 2.4341 train_time: 6.9m tok/s: 6981705 +3653/20000 train_loss: 2.4481 train_time: 6.9m tok/s: 6981251 +3654/20000 train_loss: 2.3805 train_time: 6.9m tok/s: 6980803 +3655/20000 train_loss: 2.4052 train_time: 6.9m tok/s: 6980297 +3656/20000 train_loss: 2.3830 train_time: 6.9m tok/s: 6979807 +3657/20000 train_loss: 2.4250 train_time: 6.9m tok/s: 6979381 +3658/20000 train_loss: 2.5595 train_time: 6.9m tok/s: 6978921 +3659/20000 train_loss: 2.4264 train_time: 6.9m tok/s: 6978479 +3660/20000 train_loss: 2.3356 train_time: 6.9m tok/s: 6978014 +3661/20000 train_loss: 2.4414 train_time: 6.9m tok/s: 6977537 +3662/20000 train_loss: 2.4615 train_time: 6.9m tok/s: 6977083 +3663/20000 train_loss: 2.0425 train_time: 6.9m tok/s: 6976618 +3664/20000 train_loss: 2.6435 train_time: 6.9m tok/s: 6976158 +3665/20000 train_loss: 2.4860 train_time: 6.9m tok/s: 6975703 +3666/20000 train_loss: 2.4433 train_time: 6.9m tok/s: 6975242 +3667/20000 train_loss: 2.3198 train_time: 6.9m tok/s: 6974778 +3668/20000 train_loss: 2.2215 train_time: 6.9m tok/s: 6974295 +3669/20000 train_loss: 2.3880 train_time: 6.9m tok/s: 6973809 +3670/20000 train_loss: 2.4223 train_time: 6.9m tok/s: 6973363 +3671/20000 train_loss: 2.4314 train_time: 6.9m tok/s: 6972903 +3672/20000 train_loss: 2.5391 train_time: 6.9m tok/s: 6972459 +3673/20000 train_loss: 2.4531 train_time: 6.9m tok/s: 6972016 +3674/20000 train_loss: 2.5365 train_time: 6.9m tok/s: 6971581 +3675/20000 train_loss: 2.5772 train_time: 6.9m tok/s: 6971130 +3676/20000 train_loss: 2.4693 train_time: 6.9m tok/s: 6970677 +3677/20000 train_loss: 2.5344 train_time: 6.9m tok/s: 6970230 +3678/20000 train_loss: 2.4734 train_time: 6.9m tok/s: 6969763 +3679/20000 train_loss: 2.5732 train_time: 6.9m tok/s: 6969270 +3680/20000 train_loss: 2.4711 
train_time: 6.9m tok/s: 6968846 +3681/20000 train_loss: 2.5347 train_time: 6.9m tok/s: 6968397 +3682/20000 train_loss: 2.3534 train_time: 6.9m tok/s: 6967915 +3683/20000 train_loss: 2.5737 train_time: 6.9m tok/s: 6967459 +3684/20000 train_loss: 2.4020 train_time: 6.9m tok/s: 6967005 +3685/20000 train_loss: 2.6487 train_time: 6.9m tok/s: 6966543 +3686/20000 train_loss: 2.4893 train_time: 6.9m tok/s: 6966087 +3687/20000 train_loss: 2.5658 train_time: 6.9m tok/s: 6965663 +3688/20000 train_loss: 2.4730 train_time: 6.9m tok/s: 6965220 +3689/20000 train_loss: 2.5402 train_time: 6.9m tok/s: 6964771 +3690/20000 train_loss: 2.5979 train_time: 6.9m tok/s: 6964305 +3691/20000 train_loss: 2.5518 train_time: 6.9m tok/s: 6963853 +3692/20000 train_loss: 2.4890 train_time: 6.9m tok/s: 6963417 +3693/20000 train_loss: 2.5058 train_time: 7.0m tok/s: 6962968 +3694/20000 train_loss: 2.5065 train_time: 7.0m tok/s: 6962532 +3695/20000 train_loss: 2.4263 train_time: 7.0m tok/s: 6962069 +3696/20000 train_loss: 2.4121 train_time: 7.0m tok/s: 6961608 +3697/20000 train_loss: 2.4253 train_time: 7.0m tok/s: 6961134 +3698/20000 train_loss: 2.4699 train_time: 7.0m tok/s: 6960683 +3699/20000 train_loss: 2.4890 train_time: 7.0m tok/s: 6960255 +3700/20000 train_loss: 2.6283 train_time: 7.0m tok/s: 6959775 +3701/20000 train_loss: 2.5250 train_time: 7.0m tok/s: 6959320 +3702/20000 train_loss: 2.6027 train_time: 7.0m tok/s: 6958873 +3703/20000 train_loss: 2.5329 train_time: 7.0m tok/s: 6958421 +3704/20000 train_loss: 2.8432 train_time: 7.0m tok/s: 6957940 +3705/20000 train_loss: 2.4398 train_time: 7.0m tok/s: 6957511 +3706/20000 train_loss: 2.4593 train_time: 7.0m tok/s: 6957084 +3707/20000 train_loss: 2.4550 train_time: 7.0m tok/s: 6956649 +3708/20000 train_loss: 2.4043 train_time: 7.0m tok/s: 6956201 +3709/20000 train_loss: 2.5321 train_time: 7.0m tok/s: 6955778 +3710/20000 train_loss: 2.4011 train_time: 7.0m tok/s: 6955336 +3711/20000 train_loss: 2.5421 train_time: 7.0m tok/s: 6954882 +3712/20000 
train_loss: 2.4515 train_time: 7.0m tok/s: 6954425 +3713/20000 train_loss: 2.4627 train_time: 7.0m tok/s: 6953986 +3714/20000 train_loss: 2.4293 train_time: 7.0m tok/s: 6953577 +3715/20000 train_loss: 2.4230 train_time: 7.0m tok/s: 6953135 +3716/20000 train_loss: 2.4205 train_time: 7.0m tok/s: 6952667 +3717/20000 train_loss: 2.3518 train_time: 7.0m tok/s: 6952208 +3718/20000 train_loss: 2.5453 train_time: 7.0m tok/s: 6951759 +3719/20000 train_loss: 2.4330 train_time: 7.0m tok/s: 6951334 +3720/20000 train_loss: 2.5002 train_time: 7.0m tok/s: 6950879 +3721/20000 train_loss: 2.5596 train_time: 7.0m tok/s: 6950430 +3722/20000 train_loss: 2.4446 train_time: 7.0m tok/s: 6949998 +3723/20000 train_loss: 2.6126 train_time: 7.0m tok/s: 6949554 +3724/20000 train_loss: 2.4246 train_time: 7.0m tok/s: 6949091 +3725/20000 train_loss: 2.4685 train_time: 7.0m tok/s: 6948649 +3726/20000 train_loss: 2.4302 train_time: 7.0m tok/s: 6948214 +3727/20000 train_loss: 2.4202 train_time: 7.0m tok/s: 6947763 +3728/20000 train_loss: 2.4435 train_time: 7.0m tok/s: 6947334 +3729/20000 train_loss: 2.5120 train_time: 7.0m tok/s: 6946901 +3730/20000 train_loss: 2.4277 train_time: 7.0m tok/s: 6946452 +3731/20000 train_loss: 2.5097 train_time: 7.0m tok/s: 6946005 +3732/20000 train_loss: 2.4297 train_time: 7.0m tok/s: 6945571 +3733/20000 train_loss: 2.5335 train_time: 7.0m tok/s: 6945145 +3734/20000 train_loss: 2.4514 train_time: 7.0m tok/s: 6944714 +3735/20000 train_loss: 2.5197 train_time: 7.0m tok/s: 6944263 +3736/20000 train_loss: 2.4304 train_time: 7.1m tok/s: 6943794 +3737/20000 train_loss: 2.4450 train_time: 7.1m tok/s: 6943358 +3738/20000 train_loss: 2.3873 train_time: 7.1m tok/s: 6942914 +3739/20000 train_loss: 2.4596 train_time: 7.1m tok/s: 6942484 +3740/20000 train_loss: 2.5310 train_time: 7.1m tok/s: 6942050 +3741/20000 train_loss: 2.3283 train_time: 7.1m tok/s: 6941585 +3742/20000 train_loss: 2.4582 train_time: 7.1m tok/s: 6941168 +3743/20000 train_loss: 2.4075 train_time: 7.1m tok/s: 
6940709 +3744/20000 train_loss: 2.4315 train_time: 7.1m tok/s: 6940287 +3745/20000 train_loss: 2.5123 train_time: 7.1m tok/s: 6939852 +3746/20000 train_loss: 2.3687 train_time: 7.1m tok/s: 6939412 +3747/20000 train_loss: 2.4307 train_time: 7.1m tok/s: 6938974 +3748/20000 train_loss: 2.5475 train_time: 7.1m tok/s: 6938562 +3749/20000 train_loss: 2.5375 train_time: 7.1m tok/s: 6938136 +3750/20000 train_loss: 2.4781 train_time: 7.1m tok/s: 6937659 +3751/20000 train_loss: 2.5772 train_time: 7.1m tok/s: 6937221 +3752/20000 train_loss: 2.4319 train_time: 7.1m tok/s: 6936813 +3753/20000 train_loss: 2.3549 train_time: 7.1m tok/s: 6936370 +3754/20000 train_loss: 2.4404 train_time: 7.1m tok/s: 6935937 +3755/20000 train_loss: 2.3914 train_time: 7.1m tok/s: 6935496 +3756/20000 train_loss: 2.4776 train_time: 7.1m tok/s: 6935070 +3757/20000 train_loss: 2.3670 train_time: 7.1m tok/s: 6934630 +3758/20000 train_loss: 2.7205 train_time: 7.1m tok/s: 6934202 +3759/20000 train_loss: 2.6036 train_time: 7.1m tok/s: 6933782 +3760/20000 train_loss: 2.5442 train_time: 7.1m tok/s: 6933368 +3761/20000 train_loss: 2.6231 train_time: 7.1m tok/s: 6932894 +3762/20000 train_loss: 2.4668 train_time: 7.1m tok/s: 6932471 +3763/20000 train_loss: 2.4906 train_time: 7.1m tok/s: 6932055 +3764/20000 train_loss: 2.4499 train_time: 7.1m tok/s: 6931635 +3765/20000 train_loss: 2.4253 train_time: 7.1m tok/s: 6931196 +3766/20000 train_loss: 2.4115 train_time: 7.1m tok/s: 6930772 +3767/20000 train_loss: 2.4738 train_time: 7.1m tok/s: 6930355 +3768/20000 train_loss: 2.4678 train_time: 7.1m tok/s: 6929918 +3769/20000 train_loss: 2.4089 train_time: 7.1m tok/s: 6929493 +3770/20000 train_loss: 2.5564 train_time: 7.1m tok/s: 6929058 +3771/20000 train_loss: 2.4493 train_time: 7.1m tok/s: 6928630 +3772/20000 train_loss: 2.5157 train_time: 7.1m tok/s: 6928213 +3773/20000 train_loss: 2.4870 train_time: 7.1m tok/s: 6927754 +3774/20000 train_loss: 2.3880 train_time: 7.1m tok/s: 6927346 +3775/20000 train_loss: 2.6263 
train_time: 7.1m tok/s: 6926933 +3776/20000 train_loss: 2.4962 train_time: 7.1m tok/s: 6926489 +3777/20000 train_loss: 2.4013 train_time: 7.1m tok/s: 6926047 +3778/20000 train_loss: 2.4621 train_time: 7.2m tok/s: 6925623 +3779/20000 train_loss: 2.4570 train_time: 7.2m tok/s: 6925203 +3780/20000 train_loss: 2.3775 train_time: 7.2m tok/s: 6924769 +3781/20000 train_loss: 2.4491 train_time: 7.2m tok/s: 6924323 +3782/20000 train_loss: 2.3837 train_time: 7.2m tok/s: 6923894 +3783/20000 train_loss: 2.4562 train_time: 7.2m tok/s: 6923473 +3784/20000 train_loss: 2.4641 train_time: 7.2m tok/s: 6923066 +3785/20000 train_loss: 2.4824 train_time: 7.2m tok/s: 6922653 +3786/20000 train_loss: 2.4988 train_time: 7.2m tok/s: 6922223 +3787/20000 train_loss: 2.5094 train_time: 7.2m tok/s: 6921795 +3788/20000 train_loss: 2.4189 train_time: 7.2m tok/s: 6921393 +3789/20000 train_loss: 2.3868 train_time: 7.2m tok/s: 6920984 +3790/20000 train_loss: 2.4480 train_time: 7.2m tok/s: 6920551 +3791/20000 train_loss: 2.3415 train_time: 7.2m tok/s: 6920095 +3792/20000 train_loss: 2.4607 train_time: 7.2m tok/s: 6919667 +3793/20000 train_loss: 2.3852 train_time: 7.2m tok/s: 6919223 +3794/20000 train_loss: 2.4770 train_time: 7.2m tok/s: 6918794 +3795/20000 train_loss: 2.4690 train_time: 7.2m tok/s: 6918385 +3796/20000 train_loss: 2.5346 train_time: 7.2m tok/s: 6917963 +3797/20000 train_loss: 2.5794 train_time: 7.2m tok/s: 6917559 +3798/20000 train_loss: 2.6839 train_time: 7.2m tok/s: 6917132 +3799/20000 train_loss: 2.4824 train_time: 7.2m tok/s: 6916707 +3800/20000 train_loss: 2.4848 train_time: 7.2m tok/s: 6916265 +3801/20000 train_loss: 2.2662 train_time: 7.2m tok/s: 6915814 +3802/20000 train_loss: 2.4853 train_time: 7.2m tok/s: 6915424 +3803/20000 train_loss: 2.3963 train_time: 7.2m tok/s: 6915029 +3804/20000 train_loss: 2.4667 train_time: 7.2m tok/s: 6914616 +3805/20000 train_loss: 2.3891 train_time: 7.2m tok/s: 6914201 +3806/20000 train_loss: 2.4756 train_time: 7.2m tok/s: 6913787 +3807/20000 
train_loss: 2.4704 train_time: 7.2m tok/s: 6913376 +3808/20000 train_loss: 2.6572 train_time: 7.2m tok/s: 6912959 +3809/20000 train_loss: 2.5652 train_time: 7.2m tok/s: 6912552 +3810/20000 train_loss: 2.4263 train_time: 7.2m tok/s: 6912116 +3811/20000 train_loss: 2.4402 train_time: 7.2m tok/s: 6911682 +3812/20000 train_loss: 2.4376 train_time: 7.2m tok/s: 6911269 +3813/20000 train_loss: 2.4542 train_time: 7.2m tok/s: 6910844 +3814/20000 train_loss: 4.3540 train_time: 7.2m tok/s: 6910380 +3815/20000 train_loss: 2.4367 train_time: 7.2m tok/s: 6909972 +3816/20000 train_loss: 2.5520 train_time: 7.2m tok/s: 6909505 +3817/20000 train_loss: 2.4941 train_time: 7.2m tok/s: 6909128 +3818/20000 train_loss: 2.5115 train_time: 7.2m tok/s: 6908722 +3819/20000 train_loss: 2.3850 train_time: 7.2m tok/s: 6908313 +3820/20000 train_loss: 2.5549 train_time: 7.2m tok/s: 6907907 +3821/20000 train_loss: 2.4198 train_time: 7.3m tok/s: 6907518 +3822/20000 train_loss: 2.4549 train_time: 7.3m tok/s: 6907112 +3823/20000 train_loss: 2.4784 train_time: 7.3m tok/s: 6906685 +3824/20000 train_loss: 2.5046 train_time: 7.3m tok/s: 6906297 +3825/20000 train_loss: 2.4807 train_time: 7.3m tok/s: 6905889 +3826/20000 train_loss: 2.5548 train_time: 7.3m tok/s: 6905458 +3827/20000 train_loss: 2.5648 train_time: 7.3m tok/s: 6905035 +3828/20000 train_loss: 2.5500 train_time: 7.3m tok/s: 6904622 +3829/20000 train_loss: 2.5085 train_time: 7.3m tok/s: 6904230 +3830/20000 train_loss: 2.5433 train_time: 7.3m tok/s: 6903822 +3831/20000 train_loss: 2.5075 train_time: 7.3m tok/s: 6903406 +3832/20000 train_loss: 2.4524 train_time: 7.3m tok/s: 6902989 +3833/20000 train_loss: 2.4855 train_time: 7.3m tok/s: 6902579 +3834/20000 train_loss: 2.4184 train_time: 7.3m tok/s: 6902161 +3835/20000 train_loss: 2.4563 train_time: 7.3m tok/s: 6901741 +3836/20000 train_loss: 2.4222 train_time: 7.3m tok/s: 6901314 +3837/20000 train_loss: 2.4424 train_time: 7.3m tok/s: 6900909 +3838/20000 train_loss: 2.4498 train_time: 7.3m tok/s: 
6900512 +3839/20000 train_loss: 2.4353 train_time: 7.3m tok/s: 6900090 +3840/20000 train_loss: 2.3913 train_time: 7.3m tok/s: 6899673 +3841/20000 train_loss: 2.5570 train_time: 7.3m tok/s: 6899256 +3842/20000 train_loss: 2.4179 train_time: 7.3m tok/s: 6898870 +3843/20000 train_loss: 2.4476 train_time: 7.3m tok/s: 6898463 +3844/20000 train_loss: 2.3891 train_time: 7.3m tok/s: 6898080 +3845/20000 train_loss: 2.5041 train_time: 7.3m tok/s: 6897690 +3846/20000 train_loss: 2.5958 train_time: 7.3m tok/s: 6897265 +3847/20000 train_loss: 2.4140 train_time: 7.3m tok/s: 6896858 +3848/20000 train_loss: 2.4390 train_time: 7.3m tok/s: 6896439 +3849/20000 train_loss: 2.5702 train_time: 7.3m tok/s: 6896044 +3850/20000 train_loss: 2.5412 train_time: 7.3m tok/s: 6895636 +3851/20000 train_loss: 2.4585 train_time: 7.3m tok/s: 6895214 +3852/20000 train_loss: 2.4434 train_time: 7.3m tok/s: 6894803 +3853/20000 train_loss: 2.3913 train_time: 7.3m tok/s: 6894411 +3854/20000 train_loss: 2.2869 train_time: 7.3m tok/s: 6894012 +3855/20000 train_loss: 2.4910 train_time: 7.3m tok/s: 6893621 +3856/20000 train_loss: 2.4100 train_time: 7.3m tok/s: 6893190 +3857/20000 train_loss: 2.2076 train_time: 7.3m tok/s: 6892755 +3858/20000 train_loss: 2.4858 train_time: 7.3m tok/s: 6892321 +3859/20000 train_loss: 2.3970 train_time: 7.3m tok/s: 6891911 +3860/20000 train_loss: 2.3303 train_time: 7.3m tok/s: 6891514 +3861/20000 train_loss: 2.4701 train_time: 7.3m tok/s: 6891104 +3862/20000 train_loss: 2.5971 train_time: 7.3m tok/s: 6890702 +3863/20000 train_loss: 2.5163 train_time: 7.3m tok/s: 6890313 +3864/20000 train_loss: 2.4629 train_time: 7.4m tok/s: 6889919 +3865/20000 train_loss: 2.4851 train_time: 7.4m tok/s: 6889534 +3866/20000 train_loss: 2.4061 train_time: 7.4m tok/s: 6889125 +3867/20000 train_loss: 2.8911 train_time: 7.4m tok/s: 6888687 +3868/20000 train_loss: 2.3163 train_time: 7.4m tok/s: 6888258 +3869/20000 train_loss: 2.4424 train_time: 7.4m tok/s: 6887867 +3870/20000 train_loss: 2.5593 
train_time: 7.4m tok/s: 6887477 +3871/20000 train_loss: 2.4042 train_time: 7.4m tok/s: 6887083 +3872/20000 train_loss: 2.4479 train_time: 7.4m tok/s: 6886673 +3873/20000 train_loss: 2.3787 train_time: 7.4m tok/s: 6886295 +3874/20000 train_loss: 2.4422 train_time: 7.4m tok/s: 6885912 +3875/20000 train_loss: 2.4448 train_time: 7.4m tok/s: 6885508 +3876/20000 train_loss: 2.3304 train_time: 7.4m tok/s: 6885105 +3877/20000 train_loss: 2.6177 train_time: 7.4m tok/s: 6884681 +3878/20000 train_loss: 2.9861 train_time: 7.4m tok/s: 6884240 +3879/20000 train_loss: 2.4195 train_time: 7.4m tok/s: 6883842 +3880/20000 train_loss: 2.4129 train_time: 7.4m tok/s: 6883446 +3881/20000 train_loss: 2.4049 train_time: 7.4m tok/s: 6883047 +3882/20000 train_loss: 2.3135 train_time: 7.4m tok/s: 6882666 +3883/20000 train_loss: 2.4637 train_time: 7.4m tok/s: 6882272 +3884/20000 train_loss: 2.4648 train_time: 7.4m tok/s: 6881894 +3885/20000 train_loss: 2.4093 train_time: 7.4m tok/s: 6881488 +3886/20000 train_loss: 2.3933 train_time: 7.4m tok/s: 6881098 +3887/20000 train_loss: 2.4179 train_time: 7.4m tok/s: 6880683 +3888/20000 train_loss: 2.4244 train_time: 7.4m tok/s: 6880296 +3889/20000 train_loss: 2.4083 train_time: 7.4m tok/s: 6879922 +3890/20000 train_loss: 2.3527 train_time: 7.4m tok/s: 6879518 +3891/20000 train_loss: 2.5876 train_time: 7.4m tok/s: 6879117 +3892/20000 train_loss: 2.3937 train_time: 7.4m tok/s: 6878709 +3893/20000 train_loss: 2.4967 train_time: 7.4m tok/s: 6878344 +3894/20000 train_loss: 2.2295 train_time: 7.4m tok/s: 6877926 +3895/20000 train_loss: 2.5577 train_time: 7.4m tok/s: 6877551 +3896/20000 train_loss: 2.4659 train_time: 7.4m tok/s: 6877160 +3897/20000 train_loss: 2.4649 train_time: 7.4m tok/s: 6876760 +3898/20000 train_loss: 2.3873 train_time: 7.4m tok/s: 6876355 +3899/20000 train_loss: 2.6146 train_time: 7.4m tok/s: 6875967 +3900/20000 train_loss: 2.5389 train_time: 7.4m tok/s: 6875533 +3901/20000 train_loss: 2.4455 train_time: 7.4m tok/s: 6875147 +3902/20000 
train_loss: 2.4826 train_time: 7.4m tok/s: 6874731 +3903/20000 train_loss: 2.4767 train_time: 7.4m tok/s: 6874320 +3904/20000 train_loss: 2.3054 train_time: 7.4m tok/s: 6873923 +3905/20000 train_loss: 2.4322 train_time: 7.4m tok/s: 6873532 +3906/20000 train_loss: 2.5454 train_time: 7.4m tok/s: 6873145 +3907/20000 train_loss: 2.4658 train_time: 7.5m tok/s: 6872766 +3908/20000 train_loss: 2.4627 train_time: 7.5m tok/s: 6872370 +3909/20000 train_loss: 2.4672 train_time: 7.5m tok/s: 6871988 +3910/20000 train_loss: 2.4461 train_time: 7.5m tok/s: 6871593 +3911/20000 train_loss: 2.5221 train_time: 7.5m tok/s: 6871202 +3912/20000 train_loss: 2.5765 train_time: 7.5m tok/s: 6870816 +3913/20000 train_loss: 2.4862 train_time: 7.5m tok/s: 6870462 +3914/20000 train_loss: 2.4576 train_time: 7.5m tok/s: 6870046 +3915/20000 train_loss: 2.3535 train_time: 7.5m tok/s: 6869653 +3916/20000 train_loss: 2.4130 train_time: 7.5m tok/s: 6869280 +3917/20000 train_loss: 2.5786 train_time: 7.5m tok/s: 6868867 +3918/20000 train_loss: 2.4445 train_time: 7.5m tok/s: 6868488 +3919/20000 train_loss: 2.4056 train_time: 7.5m tok/s: 6868121 +3920/20000 train_loss: 2.3557 train_time: 7.5m tok/s: 6867740 +3921/20000 train_loss: 2.4895 train_time: 7.5m tok/s: 6867348 +3922/20000 train_loss: 2.5776 train_time: 7.5m tok/s: 6866947 +3923/20000 train_loss: 2.4932 train_time: 7.5m tok/s: 6866576 +3924/20000 train_loss: 2.5565 train_time: 7.5m tok/s: 6866203 +3925/20000 train_loss: 2.4392 train_time: 7.5m tok/s: 6865833 +3926/20000 train_loss: 2.3595 train_time: 7.5m tok/s: 6865439 +3927/20000 train_loss: 2.3387 train_time: 7.5m tok/s: 6865064 +3928/20000 train_loss: 2.4440 train_time: 7.5m tok/s: 6864690 +3929/20000 train_loss: 2.4328 train_time: 7.5m tok/s: 6864335 +3930/20000 train_loss: 2.5181 train_time: 7.5m tok/s: 6863970 +3931/20000 train_loss: 2.4851 train_time: 7.5m tok/s: 6863581 +3932/20000 train_loss: 2.4717 train_time: 7.5m tok/s: 6863216 +3933/20000 train_loss: 2.5438 train_time: 7.5m tok/s: 
6862852 +3934/20000 train_loss: 2.4872 train_time: 7.5m tok/s: 6862476 +3935/20000 train_loss: 2.4036 train_time: 7.5m tok/s: 6862054 +3936/20000 train_loss: 2.3821 train_time: 7.5m tok/s: 6861688 +3937/20000 train_loss: 2.4037 train_time: 7.5m tok/s: 6861306 +3938/20000 train_loss: 2.3515 train_time: 7.5m tok/s: 6860950 +3939/20000 train_loss: 2.4934 train_time: 7.5m tok/s: 6860586 +3940/20000 train_loss: 2.4185 train_time: 7.5m tok/s: 6860177 +3941/20000 train_loss: 2.5808 train_time: 7.5m tok/s: 6859790 +3942/20000 train_loss: 2.3511 train_time: 7.5m tok/s: 6859418 +3943/20000 train_loss: 2.4124 train_time: 7.5m tok/s: 6859042 +3944/20000 train_loss: 2.4939 train_time: 7.5m tok/s: 6858682 +3945/20000 train_loss: 2.4070 train_time: 7.5m tok/s: 6858290 +3946/20000 train_loss: 2.4450 train_time: 7.5m tok/s: 6857896 +3947/20000 train_loss: 2.4692 train_time: 7.5m tok/s: 6857542 +3948/20000 train_loss: 2.3502 train_time: 7.5m tok/s: 6857170 +3949/20000 train_loss: 2.3697 train_time: 7.5m tok/s: 6856786 +3950/20000 train_loss: 2.3389 train_time: 7.6m tok/s: 6856409 +3951/20000 train_loss: 2.4687 train_time: 7.6m tok/s: 6855983 +3952/20000 train_loss: 2.4443 train_time: 7.6m tok/s: 6855619 +3953/20000 train_loss: 2.4699 train_time: 7.6m tok/s: 6855248 +3954/20000 train_loss: 2.4840 train_time: 7.6m tok/s: 6854859 +3955/20000 train_loss: 2.4743 train_time: 7.6m tok/s: 6854485 +3956/20000 train_loss: 2.4576 train_time: 7.6m tok/s: 6854118 +3957/20000 train_loss: 2.3400 train_time: 7.6m tok/s: 6853739 +3958/20000 train_loss: 2.5264 train_time: 7.6m tok/s: 6853345 +3959/20000 train_loss: 2.2674 train_time: 7.6m tok/s: 6852945 +3960/20000 train_loss: 2.3958 train_time: 7.6m tok/s: 6852551 +3961/20000 train_loss: 2.3811 train_time: 7.6m tok/s: 6852173 +3962/20000 train_loss: 2.3895 train_time: 7.6m tok/s: 6851802 +3963/20000 train_loss: 2.4814 train_time: 7.6m tok/s: 6851423 +3964/20000 train_loss: 2.2918 train_time: 7.6m tok/s: 6851024 +3965/20000 train_loss: 2.5963 
train_time: 7.6m tok/s: 6850667 +3966/20000 train_loss: 2.4597 train_time: 7.6m tok/s: 6850309 +3967/20000 train_loss: 2.4256 train_time: 7.6m tok/s: 6849938 +3968/20000 train_loss: 2.4689 train_time: 7.6m tok/s: 6849556 +3969/20000 train_loss: 2.4149 train_time: 7.6m tok/s: 6849187 +3970/20000 train_loss: 2.5131 train_time: 7.6m tok/s: 6848823 +3971/20000 train_loss: 2.4183 train_time: 7.6m tok/s: 6848462 +3972/20000 train_loss: 2.4326 train_time: 7.6m tok/s: 6848086 +3973/20000 train_loss: 2.3940 train_time: 7.6m tok/s: 6847718 +3974/20000 train_loss: 2.3530 train_time: 7.6m tok/s: 6847348 +3975/20000 train_loss: 2.5165 train_time: 7.6m tok/s: 6846973 +3976/20000 train_loss: 2.4940 train_time: 7.6m tok/s: 6846611 +3977/20000 train_loss: 2.7638 train_time: 7.6m tok/s: 6846258 +3978/20000 train_loss: 2.4883 train_time: 7.6m tok/s: 6845915 +3979/20000 train_loss: 3.1384 train_time: 7.6m tok/s: 6845486 +3980/20000 train_loss: 2.4497 train_time: 7.6m tok/s: 6845090 +3981/20000 train_loss: 2.3784 train_time: 7.6m tok/s: 6844725 +3982/20000 train_loss: 2.4772 train_time: 7.6m tok/s: 6844364 +3983/20000 train_loss: 2.3929 train_time: 7.6m tok/s: 6843978 +3984/20000 train_loss: 2.5175 train_time: 7.6m tok/s: 6843602 +3985/20000 train_loss: 2.4737 train_time: 7.6m tok/s: 6843233 +3986/20000 train_loss: 2.4027 train_time: 7.6m tok/s: 6842884 +3987/20000 train_loss: 2.5584 train_time: 7.6m tok/s: 6842523 +3988/20000 train_loss: 2.4898 train_time: 7.6m tok/s: 6842188 +3989/20000 train_loss: 2.4366 train_time: 7.6m tok/s: 6841802 +3990/20000 train_loss: 2.4252 train_time: 7.6m tok/s: 6841443 +3991/20000 train_loss: 2.4469 train_time: 7.6m tok/s: 6841101 +3992/20000 train_loss: 2.4494 train_time: 7.6m tok/s: 6840765 +3993/20000 train_loss: 2.3831 train_time: 7.7m tok/s: 6840426 +3994/20000 train_loss: 2.1488 train_time: 7.7m tok/s: 6840006 +3995/20000 train_loss: 2.4500 train_time: 7.7m tok/s: 6839648 +3996/20000 train_loss: 2.3635 train_time: 7.7m tok/s: 6839295 +3997/20000 
train_loss: 2.4735 train_time: 7.7m tok/s: 6838904 +3998/20000 train_loss: 2.3975 train_time: 7.7m tok/s: 6838536 +3999/20000 train_loss: 2.4116 train_time: 7.7m tok/s: 6838187 +4000/20000 train_loss: 2.5072 train_time: 7.7m tok/s: 6837848 +4001/20000 train_loss: 2.4244 train_time: 7.7m tok/s: 6837479 +4002/20000 train_loss: 2.4292 train_time: 7.7m tok/s: 6837106 +4003/20000 train_loss: 2.3695 train_time: 7.7m tok/s: 6836743 +4004/20000 train_loss: 2.4847 train_time: 7.7m tok/s: 6836371 +4005/20000 train_loss: 2.4389 train_time: 7.7m tok/s: 6836009 +4006/20000 train_loss: 2.4662 train_time: 7.7m tok/s: 6835611 +4007/20000 train_loss: 2.4723 train_time: 7.7m tok/s: 6835226 +4008/20000 train_loss: 2.3589 train_time: 7.7m tok/s: 6834858 +4009/20000 train_loss: 2.3866 train_time: 7.7m tok/s: 6834499 +4010/20000 train_loss: 2.4901 train_time: 7.7m tok/s: 6834153 +4011/20000 train_loss: 2.4547 train_time: 7.7m tok/s: 6833791 +4012/20000 train_loss: 2.4684 train_time: 7.7m tok/s: 6833405 +4013/20000 train_loss: 2.3532 train_time: 7.7m tok/s: 6833001 +4014/20000 train_loss: 2.3977 train_time: 7.7m tok/s: 6832651 +4015/20000 train_loss: 2.4257 train_time: 7.7m tok/s: 6832262 +4016/20000 train_loss: 2.4218 train_time: 7.7m tok/s: 6831887 +4017/20000 train_loss: 2.5352 train_time: 7.7m tok/s: 6831533 +4018/20000 train_loss: 2.3891 train_time: 7.7m tok/s: 6831171 +4019/20000 train_loss: 2.2971 train_time: 7.7m tok/s: 6830777 +4020/20000 train_loss: 2.3329 train_time: 7.7m tok/s: 6830425 +4021/20000 train_loss: 2.3784 train_time: 7.7m tok/s: 6830046 +4022/20000 train_loss: 2.4642 train_time: 7.7m tok/s: 6829678 +4023/20000 train_loss: 2.4438 train_time: 7.7m tok/s: 6829322 +4024/20000 train_loss: 2.5052 train_time: 7.7m tok/s: 6828967 +4025/20000 train_loss: 2.4854 train_time: 7.7m tok/s: 6828632 +4026/20000 train_loss: 2.4137 train_time: 7.7m tok/s: 6828279 +4027/20000 train_loss: 2.3435 train_time: 7.7m tok/s: 6827892 +4028/20000 train_loss: 2.2444 train_time: 7.7m tok/s: 
6827514 +4029/20000 train_loss: 2.3828 train_time: 7.7m tok/s: 6827143 +4030/20000 train_loss: 2.3988 train_time: 7.7m tok/s: 6826796 +4031/20000 train_loss: 2.4765 train_time: 7.7m tok/s: 6826444 +4032/20000 train_loss: 2.3699 train_time: 7.7m tok/s: 6826069 +4033/20000 train_loss: 2.4281 train_time: 7.7m tok/s: 6825730 +4034/20000 train_loss: 2.5976 train_time: 7.7m tok/s: 6825359 +4035/20000 train_loss: 2.5254 train_time: 7.7m tok/s: 6825010 +4036/20000 train_loss: 2.4899 train_time: 7.8m tok/s: 6824650 +4037/20000 train_loss: 2.4816 train_time: 7.8m tok/s: 6824300 +4038/20000 train_loss: 2.4912 train_time: 7.8m tok/s: 6823962 +4039/20000 train_loss: 2.4733 train_time: 7.8m tok/s: 6823607 +4040/20000 train_loss: 2.4227 train_time: 7.8m tok/s: 6823221 +4041/20000 train_loss: 2.3873 train_time: 7.8m tok/s: 6822864 +4042/20000 train_loss: 2.3595 train_time: 7.8m tok/s: 6822509 +4043/20000 train_loss: 2.3786 train_time: 7.8m tok/s: 6822134 +4044/20000 train_loss: 2.4185 train_time: 7.8m tok/s: 6821769 +4045/20000 train_loss: 2.4571 train_time: 7.8m tok/s: 6821436 +4046/20000 train_loss: 2.5233 train_time: 7.8m tok/s: 6821083 +4047/20000 train_loss: 2.4972 train_time: 7.8m tok/s: 6820739 +4048/20000 train_loss: 2.4226 train_time: 7.8m tok/s: 6820380 +4049/20000 train_loss: 2.5480 train_time: 7.8m tok/s: 6820048 +4050/20000 train_loss: 2.4407 train_time: 7.8m tok/s: 6819653 +4051/20000 train_loss: 2.4067 train_time: 7.8m tok/s: 6819285 +4052/20000 train_loss: 2.4907 train_time: 7.8m tok/s: 6818934 +4053/20000 train_loss: 2.4328 train_time: 7.8m tok/s: 6818584 +4054/20000 train_loss: 2.4165 train_time: 7.8m tok/s: 6818250 +4055/20000 train_loss: 2.4427 train_time: 7.8m tok/s: 6817912 +4056/20000 train_loss: 2.3208 train_time: 7.8m tok/s: 6817552 +4057/20000 train_loss: 2.4139 train_time: 7.8m tok/s: 6817217 +4058/20000 train_loss: 2.5559 train_time: 7.8m tok/s: 6816827 +4059/20000 train_loss: 2.2795 train_time: 7.8m tok/s: 6816495 +4060/20000 train_loss: 2.3949 
train_time: 7.8m tok/s: 6816170 +4061/20000 train_loss: 2.4970 train_time: 7.8m tok/s: 6815841 +4062/20000 train_loss: 2.4855 train_time: 7.8m tok/s: 6815507 +4063/20000 train_loss: 2.4491 train_time: 7.8m tok/s: 6815146 +4064/20000 train_loss: 2.4283 train_time: 7.8m tok/s: 6814808 +4065/20000 train_loss: 2.4946 train_time: 7.8m tok/s: 6814462 +4066/20000 train_loss: 2.4064 train_time: 7.8m tok/s: 6814097 +4067/20000 train_loss: 2.4733 train_time: 7.8m tok/s: 6813742 +4068/20000 train_loss: 2.4797 train_time: 7.8m tok/s: 6813409 +4069/20000 train_loss: 2.3650 train_time: 7.8m tok/s: 6813049 +4070/20000 train_loss: 2.3680 train_time: 7.8m tok/s: 6812690 +4071/20000 train_loss: 2.7196 train_time: 7.8m tok/s: 6812310 +4072/20000 train_loss: 2.4502 train_time: 7.8m tok/s: 6811968 +4073/20000 train_loss: 2.4491 train_time: 7.8m tok/s: 6811593 +4074/20000 train_loss: 2.4050 train_time: 7.8m tok/s: 6811237 +4075/20000 train_loss: 2.5574 train_time: 7.8m tok/s: 6810882 +4076/20000 train_loss: 2.5832 train_time: 7.8m tok/s: 6810537 +4077/20000 train_loss: 2.4930 train_time: 7.8m tok/s: 6810189 +4078/20000 train_loss: 2.3754 train_time: 7.8m tok/s: 6809833 +4079/20000 train_loss: 3.0116 train_time: 7.9m tok/s: 6809409 +4080/20000 train_loss: 2.3635 train_time: 7.9m tok/s: 6809052 +4081/20000 train_loss: 2.3903 train_time: 7.9m tok/s: 6808707 +4082/20000 train_loss: 2.4031 train_time: 7.9m tok/s: 6808370 +4083/20000 train_loss: 2.3157 train_time: 7.9m tok/s: 6808037 +4084/20000 train_loss: 2.2543 train_time: 7.9m tok/s: 6807685 +4085/20000 train_loss: 2.3530 train_time: 7.9m tok/s: 6807308 +4086/20000 train_loss: 2.4932 train_time: 7.9m tok/s: 6806960 +4087/20000 train_loss: 2.5314 train_time: 7.9m tok/s: 6806618 +4088/20000 train_loss: 2.5333 train_time: 7.9m tok/s: 6806276 +4089/20000 train_loss: 2.4444 train_time: 7.9m tok/s: 6805933 +4090/20000 train_loss: 2.3707 train_time: 7.9m tok/s: 6805586 +4091/20000 train_loss: 2.5203 train_time: 7.9m tok/s: 6805236 +4092/20000 
train_loss: 2.5277 train_time: 7.9m tok/s: 6804860 +4093/20000 train_loss: 2.4958 train_time: 7.9m tok/s: 6804511 +4094/20000 train_loss: 2.2944 train_time: 7.9m tok/s: 6804136 +4095/20000 train_loss: 2.4483 train_time: 7.9m tok/s: 6803784 +4096/20000 train_loss: 2.2595 train_time: 7.9m tok/s: 6803423 +4097/20000 train_loss: 2.3059 train_time: 7.9m tok/s: 6803083 +4098/20000 train_loss: 2.4173 train_time: 7.9m tok/s: 6802768 +4099/20000 train_loss: 2.1911 train_time: 7.9m tok/s: 6802385 +4100/20000 train_loss: 2.4199 train_time: 7.9m tok/s: 6802021 +4101/20000 train_loss: 2.5312 train_time: 7.9m tok/s: 6801705 +4102/20000 train_loss: 2.2882 train_time: 7.9m tok/s: 6801334 +4103/20000 train_loss: 2.4363 train_time: 7.9m tok/s: 6801001 +4104/20000 train_loss: 2.3843 train_time: 7.9m tok/s: 6800668 +4105/20000 train_loss: 2.4328 train_time: 7.9m tok/s: 6800262 +4106/20000 train_loss: 2.4629 train_time: 7.9m tok/s: 6799954 +4107/20000 train_loss: 2.4071 train_time: 7.9m tok/s: 6799614 +4108/20000 train_loss: 2.4177 train_time: 7.9m tok/s: 6799275 +4109/20000 train_loss: 2.3919 train_time: 7.9m tok/s: 6798943 +4110/20000 train_loss: 2.3435 train_time: 7.9m tok/s: 6798589 +4111/20000 train_loss: 2.3976 train_time: 7.9m tok/s: 6798254 +4112/20000 train_loss: 2.4437 train_time: 7.9m tok/s: 6797899 +4113/20000 train_loss: 2.4202 train_time: 7.9m tok/s: 6797550 +4114/20000 train_loss: 2.3915 train_time: 7.9m tok/s: 6797212 +4115/20000 train_loss: 2.5271 train_time: 7.9m tok/s: 6796871 +4116/20000 train_loss: 2.3672 train_time: 7.9m tok/s: 6796521 +4117/20000 train_loss: 2.3818 train_time: 7.9m tok/s: 6796170 +4118/20000 train_loss: 2.4307 train_time: 7.9m tok/s: 6795828 +4119/20000 train_loss: 2.4647 train_time: 7.9m tok/s: 6795475 +4120/20000 train_loss: 2.4346 train_time: 7.9m tok/s: 6795118 +4121/20000 train_loss: 2.4027 train_time: 7.9m tok/s: 6794746 +4122/20000 train_loss: 2.2857 train_time: 8.0m tok/s: 6794411 +4123/20000 train_loss: 2.3857 train_time: 8.0m tok/s: 
6794072 +4124/20000 train_loss: 2.2550 train_time: 8.0m tok/s: 6793720 +4125/20000 train_loss: 2.4531 train_time: 8.0m tok/s: 6793374 +4126/20000 train_loss: 2.3729 train_time: 8.0m tok/s: 6793033 +4127/20000 train_loss: 2.3760 train_time: 8.0m tok/s: 6792704 +4128/20000 train_loss: 2.4355 train_time: 8.0m tok/s: 6792367 +4129/20000 train_loss: 2.4616 train_time: 8.0m tok/s: 6792024 +4130/20000 train_loss: 2.3970 train_time: 8.0m tok/s: 6791679 +4131/20000 train_loss: 2.4507 train_time: 8.0m tok/s: 6791334 +4132/20000 train_loss: 2.4222 train_time: 8.0m tok/s: 6791002 +4133/20000 train_loss: 2.4381 train_time: 8.0m tok/s: 6790631 +4134/20000 train_loss: 2.4765 train_time: 8.0m tok/s: 6790283 +4135/20000 train_loss: 2.4691 train_time: 8.0m tok/s: 6789955 +4136/20000 train_loss: 2.2932 train_time: 8.0m tok/s: 6789612 +4137/20000 train_loss: 2.4980 train_time: 8.0m tok/s: 6789299 +4138/20000 train_loss: 2.3044 train_time: 8.0m tok/s: 6788946 +4139/20000 train_loss: 2.4532 train_time: 8.0m tok/s: 6788592 +4140/20000 train_loss: 2.5205 train_time: 8.0m tok/s: 6788232 +4141/20000 train_loss: 2.3312 train_time: 8.0m tok/s: 6787897 +4142/20000 train_loss: 2.5032 train_time: 8.0m tok/s: 6787546 +4143/20000 train_loss: 2.4041 train_time: 8.0m tok/s: 6787232 +4144/20000 train_loss: 2.5081 train_time: 8.0m tok/s: 6786894 +4145/20000 train_loss: 2.2718 train_time: 8.0m tok/s: 6786518 +4146/20000 train_loss: 2.4525 train_time: 8.0m tok/s: 6786183 +4147/20000 train_loss: 2.5416 train_time: 8.0m tok/s: 6785852 +4148/20000 train_loss: 2.4351 train_time: 8.0m tok/s: 6785531 +4149/20000 train_loss: 2.2961 train_time: 8.0m tok/s: 6785192 +4150/20000 train_loss: 2.3497 train_time: 8.0m tok/s: 6784843 +4151/20000 train_loss: 2.4356 train_time: 8.0m tok/s: 6784515 +4152/20000 train_loss: 2.4178 train_time: 8.0m tok/s: 6784177 +4153/20000 train_loss: 2.5047 train_time: 8.0m tok/s: 6783842 +4154/20000 train_loss: 2.3565 train_time: 8.0m tok/s: 6783501 +4155/20000 train_loss: 2.4932 
train_time: 8.0m tok/s: 6783183 +4156/20000 train_loss: 2.4259 train_time: 8.0m tok/s: 6782857 +4157/20000 train_loss: 2.4856 train_time: 8.0m tok/s: 6782541 +4158/20000 train_loss: 2.4438 train_time: 8.0m tok/s: 6782215 +4159/20000 train_loss: 2.3457 train_time: 8.0m tok/s: 6781869 +4160/20000 train_loss: 2.3518 train_time: 8.0m tok/s: 6781537 +4161/20000 train_loss: 2.4066 train_time: 8.0m tok/s: 6781208 +4162/20000 train_loss: 2.3186 train_time: 8.0m tok/s: 6780870 +4163/20000 train_loss: 2.5220 train_time: 8.0m tok/s: 6780526 +4164/20000 train_loss: 2.4399 train_time: 8.0m tok/s: 6780213 +4165/20000 train_loss: 2.4417 train_time: 8.1m tok/s: 6779894 +4166/20000 train_loss: 2.4449 train_time: 8.1m tok/s: 6779553 +4167/20000 train_loss: 2.5228 train_time: 8.1m tok/s: 6779223 +4168/20000 train_loss: 2.4345 train_time: 8.1m tok/s: 6778897 +4169/20000 train_loss: 2.4171 train_time: 8.1m tok/s: 6778576 +4170/20000 train_loss: 2.3374 train_time: 8.1m tok/s: 6778237 +4171/20000 train_loss: 2.4369 train_time: 8.1m tok/s: 6777911 +4172/20000 train_loss: 2.3924 train_time: 8.1m tok/s: 6777581 +4173/20000 train_loss: 2.5197 train_time: 8.1m tok/s: 6777257 +4174/20000 train_loss: 2.3053 train_time: 8.1m tok/s: 6776926 +4175/20000 train_loss: 2.4024 train_time: 8.1m tok/s: 6776598 +4176/20000 train_loss: 2.5054 train_time: 8.1m tok/s: 6776281 +4177/20000 train_loss: 2.3773 train_time: 8.1m tok/s: 6775970 +4178/20000 train_loss: 2.4486 train_time: 8.1m tok/s: 6775667 +4179/20000 train_loss: 2.3986 train_time: 8.1m tok/s: 6775333 +4180/20000 train_loss: 2.3862 train_time: 8.1m tok/s: 6774999 +4181/20000 train_loss: 2.5434 train_time: 8.1m tok/s: 6774677 +4182/20000 train_loss: 2.4704 train_time: 8.1m tok/s: 6774340 +4183/20000 train_loss: 2.4604 train_time: 8.1m tok/s: 6774014 +4184/20000 train_loss: 2.4723 train_time: 8.1m tok/s: 6773675 +4185/20000 train_loss: 2.5698 train_time: 8.1m tok/s: 6773341 +4186/20000 train_loss: 2.4666 train_time: 8.1m tok/s: 6773019 +4187/20000 
train_loss: 2.2912 train_time: 8.1m tok/s: 6772690 +4188/20000 train_loss: 2.4573 train_time: 8.1m tok/s: 6772332 +4189/20000 train_loss: 2.4432 train_time: 8.1m tok/s: 6772001 +4190/20000 train_loss: 2.5328 train_time: 8.1m tok/s: 6771689 +4191/20000 train_loss: 2.4201 train_time: 8.1m tok/s: 6771376 +4192/20000 train_loss: 2.3685 train_time: 8.1m tok/s: 6771038 +4193/20000 train_loss: 2.4028 train_time: 8.1m tok/s: 6770723 +4194/20000 train_loss: 2.4676 train_time: 8.1m tok/s: 6770374 +4195/20000 train_loss: 2.4824 train_time: 8.1m tok/s: 6770028 +4196/20000 train_loss: 2.3096 train_time: 8.1m tok/s: 6769694 +4197/20000 train_loss: 2.3249 train_time: 8.1m tok/s: 6769355 +4198/20000 train_loss: 2.4285 train_time: 8.1m tok/s: 6769022 +4199/20000 train_loss: 2.2411 train_time: 8.1m tok/s: 6768698 +4200/20000 train_loss: 2.4335 train_time: 8.1m tok/s: 6768348 +4201/20000 train_loss: 2.3659 train_time: 8.1m tok/s: 6768013 +4202/20000 train_loss: 2.4515 train_time: 8.1m tok/s: 6767683 +4203/20000 train_loss: 2.5870 train_time: 8.1m tok/s: 6767361 +4204/20000 train_loss: 2.4889 train_time: 8.1m tok/s: 6767052 +4205/20000 train_loss: 2.4659 train_time: 8.1m tok/s: 6766710 +4206/20000 train_loss: 2.3878 train_time: 8.1m tok/s: 6766388 +4207/20000 train_loss: 2.4054 train_time: 8.1m tok/s: 6766070 +4208/20000 train_loss: 2.3865 train_time: 8.2m tok/s: 6765739 +4209/20000 train_loss: 2.3794 train_time: 8.2m tok/s: 6765418 +4210/20000 train_loss: 2.3809 train_time: 8.2m tok/s: 6765099 +4211/20000 train_loss: 2.5536 train_time: 8.2m tok/s: 6764752 +4212/20000 train_loss: 2.3714 train_time: 8.2m tok/s: 6764395 +4213/20000 train_loss: 2.3961 train_time: 8.2m tok/s: 6764096 +4214/20000 train_loss: 2.4199 train_time: 8.2m tok/s: 6763766 +4215/20000 train_loss: 2.4491 train_time: 8.2m tok/s: 6763463 +4216/20000 train_loss: 2.5494 train_time: 8.2m tok/s: 6763135 +4217/20000 train_loss: 2.2983 train_time: 8.2m tok/s: 6762798 +4218/20000 train_loss: 2.4529 train_time: 8.2m tok/s: 
6762462 +4219/20000 train_loss: 2.3814 train_time: 8.2m tok/s: 6762118 +4220/20000 train_loss: 2.0826 train_time: 8.2m tok/s: 6761800 +4221/20000 train_loss: 2.4683 train_time: 8.2m tok/s: 6761480 +4222/20000 train_loss: 2.4651 train_time: 8.2m tok/s: 6761170 +4223/20000 train_loss: 2.5645 train_time: 8.2m tok/s: 6760838 +4224/20000 train_loss: 2.4495 train_time: 8.2m tok/s: 6760504 +4225/20000 train_loss: 2.4559 train_time: 8.2m tok/s: 6760171 +4226/20000 train_loss: 2.4249 train_time: 8.2m tok/s: 6759842 +4227/20000 train_loss: 2.4637 train_time: 8.2m tok/s: 6759505 +4228/20000 train_loss: 2.4434 train_time: 8.2m tok/s: 6759154 +4229/20000 train_loss: 2.4399 train_time: 8.2m tok/s: 6758676 +4230/20000 train_loss: 2.3728 train_time: 8.2m tok/s: 6758319 +4231/20000 train_loss: 2.3631 train_time: 8.2m tok/s: 6758021 +4232/20000 train_loss: 2.4057 train_time: 8.2m tok/s: 6757521 +4233/20000 train_loss: 2.5433 train_time: 8.2m tok/s: 6757213 +4234/20000 train_loss: 2.3417 train_time: 8.2m tok/s: 6756790 +4235/20000 train_loss: 2.5135 train_time: 8.2m tok/s: 6756453 +4236/20000 train_loss: 2.5582 train_time: 8.2m tok/s: 6756092 +4237/20000 train_loss: 2.4336 train_time: 8.2m tok/s: 6755764 +4238/20000 train_loss: 2.5923 train_time: 8.2m tok/s: 6755407 +4239/20000 train_loss: 2.4940 train_time: 8.2m tok/s: 6755077 +4240/20000 train_loss: 2.5675 train_time: 8.2m tok/s: 6754725 +4241/20000 train_loss: 2.3551 train_time: 8.2m tok/s: 6754353 +4242/20000 train_loss: 2.4594 train_time: 8.2m tok/s: 6754008 +4243/20000 train_loss: 2.4047 train_time: 8.2m tok/s: 6753622 +4244/20000 train_loss: 2.4563 train_time: 8.2m tok/s: 6753313 +4245/20000 train_loss: 2.4492 train_time: 8.2m tok/s: 6752832 +4246/20000 train_loss: 2.4946 train_time: 8.2m tok/s: 6752523 +4247/20000 train_loss: 2.3696 train_time: 8.2m tok/s: 6752180 +4248/20000 train_loss: 2.4087 train_time: 8.2m tok/s: 6751870 +4249/20000 train_loss: 2.4281 train_time: 8.2m tok/s: 6751538 +4250/20000 train_loss: 2.3873 
train_time: 8.3m tok/s: 6751230 +4251/20000 train_loss: 2.4059 train_time: 8.3m tok/s: 6750905 +4252/20000 train_loss: 2.4843 train_time: 8.3m tok/s: 6750615 +4253/20000 train_loss: 2.5384 train_time: 8.3m tok/s: 6750277 +4254/20000 train_loss: 2.6599 train_time: 8.3m tok/s: 6749960 +4255/20000 train_loss: 2.4473 train_time: 8.3m tok/s: 6749671 +4256/20000 train_loss: 2.5086 train_time: 8.3m tok/s: 6749347 +4257/20000 train_loss: 2.5233 train_time: 8.3m tok/s: 6749032 +4258/20000 train_loss: 2.3286 train_time: 8.3m tok/s: 6748705 +4259/20000 train_loss: 2.4181 train_time: 8.3m tok/s: 6748375 +4260/20000 train_loss: 2.3916 train_time: 8.3m tok/s: 6748070 +4261/20000 train_loss: 2.2910 train_time: 8.3m tok/s: 6747752 +4262/20000 train_loss: 2.3727 train_time: 8.3m tok/s: 6747462 +4263/20000 train_loss: 2.3348 train_time: 8.3m tok/s: 6747142 +4264/20000 train_loss: 2.3663 train_time: 8.3m tok/s: 6746808 +4265/20000 train_loss: 2.4364 train_time: 8.3m tok/s: 6746500 +4266/20000 train_loss: 2.5118 train_time: 8.3m tok/s: 6746165 +4267/20000 train_loss: 2.4037 train_time: 8.3m tok/s: 6745863 +4268/20000 train_loss: 2.4635 train_time: 8.3m tok/s: 6745536 +4269/20000 train_loss: 2.4039 train_time: 8.3m tok/s: 6745228 +4270/20000 train_loss: 2.3166 train_time: 8.3m tok/s: 6744927 +4271/20000 train_loss: 2.4268 train_time: 8.3m tok/s: 6744614 +4272/20000 train_loss: 2.3836 train_time: 8.3m tok/s: 6744289 +4273/20000 train_loss: 2.4074 train_time: 8.3m tok/s: 6743956 +4274/20000 train_loss: 2.4539 train_time: 8.3m tok/s: 6743608 +4275/20000 train_loss: 2.3017 train_time: 8.3m tok/s: 6743294 +4276/20000 train_loss: 2.3849 train_time: 8.3m tok/s: 6742988 +4277/20000 train_loss: 2.3573 train_time: 8.3m tok/s: 6742687 +4278/20000 train_loss: 2.3804 train_time: 8.3m tok/s: 6742365 +4279/20000 train_loss: 2.3683 train_time: 8.3m tok/s: 6742047 +4280/20000 train_loss: 2.5229 train_time: 8.3m tok/s: 6741748 +4281/20000 train_loss: 2.6019 train_time: 8.3m tok/s: 6741447 +4282/20000 
train_loss: 2.4264 train_time: 8.3m tok/s: 6741122 +4283/20000 train_loss: 2.5997 train_time: 8.3m tok/s: 6740808 +4284/20000 train_loss: 2.5164 train_time: 8.3m tok/s: 6740508 +4285/20000 train_loss: 2.3901 train_time: 8.3m tok/s: 6740202 +4286/20000 train_loss: 2.4813 train_time: 8.3m tok/s: 6739863 +4287/20000 train_loss: 2.3717 train_time: 8.3m tok/s: 6739535 +4288/20000 train_loss: 2.3968 train_time: 8.3m tok/s: 6739233 +4289/20000 train_loss: 2.5699 train_time: 8.3m tok/s: 6738856 +4290/20000 train_loss: 2.3607 train_time: 8.3m tok/s: 6738573 +4291/20000 train_loss: 2.3186 train_time: 8.3m tok/s: 6738280 +4292/20000 train_loss: 2.4555 train_time: 8.3m tok/s: 6737972 +4293/20000 train_loss: 2.4250 train_time: 8.4m tok/s: 6737648 +4294/20000 train_loss: 2.3751 train_time: 8.4m tok/s: 6737340 +4295/20000 train_loss: 2.4169 train_time: 8.4m tok/s: 6737034 +4296/20000 train_loss: 2.2180 train_time: 8.4m tok/s: 6736706 +4297/20000 train_loss: 2.4765 train_time: 8.4m tok/s: 6736412 +4298/20000 train_loss: 2.4062 train_time: 8.4m tok/s: 6736100 +4299/20000 train_loss: 2.3432 train_time: 8.4m tok/s: 6735778 +4300/20000 train_loss: 2.5165 train_time: 8.4m tok/s: 6735469 +4301/20000 train_loss: 2.1848 train_time: 8.4m tok/s: 6735117 +4302/20000 train_loss: 2.3400 train_time: 8.4m tok/s: 6734797 +4303/20000 train_loss: 2.3308 train_time: 8.4m tok/s: 6734503 +4304/20000 train_loss: 2.3514 train_time: 8.4m tok/s: 6734208 +4305/20000 train_loss: 2.3303 train_time: 8.4m tok/s: 6733905 +4306/20000 train_loss: 2.3861 train_time: 8.4m tok/s: 6733574 +4307/20000 train_loss: 2.3377 train_time: 8.4m tok/s: 6733275 +4308/20000 train_loss: 2.3711 train_time: 8.4m tok/s: 6732969 +4309/20000 train_loss: 2.5788 train_time: 8.4m tok/s: 6732656 +4310/20000 train_loss: 2.4527 train_time: 8.4m tok/s: 6732334 +4311/20000 train_loss: 2.5382 train_time: 8.4m tok/s: 6732054 +4312/20000 train_loss: 2.4152 train_time: 8.4m tok/s: 6731755 +4313/20000 train_loss: 2.2540 train_time: 8.4m tok/s: 
6731439 +4314/20000 train_loss: 2.4276 train_time: 8.4m tok/s: 6731120 +4315/20000 train_loss: 2.4035 train_time: 8.4m tok/s: 6730830 +4316/20000 train_loss: 2.4145 train_time: 8.4m tok/s: 6730510 +4317/20000 train_loss: 2.4880 train_time: 8.4m tok/s: 6730193 +4318/20000 train_loss: 2.4834 train_time: 8.4m tok/s: 6729871 +4319/20000 train_loss: 2.1766 train_time: 8.4m tok/s: 6729549 +4320/20000 train_loss: 2.2638 train_time: 8.4m tok/s: 6729257 +4321/20000 train_loss: 2.3716 train_time: 8.4m tok/s: 6728956 +4322/20000 train_loss: 2.4197 train_time: 8.4m tok/s: 6728648 +4323/20000 train_loss: 2.4112 train_time: 8.4m tok/s: 6728352 +4324/20000 train_loss: 2.4374 train_time: 8.4m tok/s: 6728050 +4325/20000 train_loss: 2.4597 train_time: 8.4m tok/s: 6727766 +4326/20000 train_loss: 2.3643 train_time: 8.4m tok/s: 6727465 +4327/20000 train_loss: 2.4084 train_time: 8.4m tok/s: 6727172 +4328/20000 train_loss: 2.3819 train_time: 8.4m tok/s: 6726879 +4329/20000 train_loss: 2.4348 train_time: 8.4m tok/s: 6726591 +4330/20000 train_loss: 2.3009 train_time: 8.4m tok/s: 6726272 +4331/20000 train_loss: 2.3838 train_time: 8.4m tok/s: 6725974 +4332/20000 train_loss: 2.9158 train_time: 8.4m tok/s: 6725635 +4333/20000 train_loss: 2.3827 train_time: 8.4m tok/s: 6725335 +4334/20000 train_loss: 2.4441 train_time: 8.4m tok/s: 6725023 +4335/20000 train_loss: 2.5361 train_time: 8.4m tok/s: 6724714 +4336/20000 train_loss: 2.4145 train_time: 8.5m tok/s: 6724408 +4337/20000 train_loss: 2.5032 train_time: 8.5m tok/s: 6724125 +4338/20000 train_loss: 2.5365 train_time: 8.5m tok/s: 6723824 +4339/20000 train_loss: 2.4431 train_time: 8.5m tok/s: 6723516 +4340/20000 train_loss: 2.4233 train_time: 8.5m tok/s: 6723230 +4341/20000 train_loss: 2.4107 train_time: 8.5m tok/s: 6722949 +4342/20000 train_loss: 2.5157 train_time: 8.5m tok/s: 6722668 +4343/20000 train_loss: 2.4872 train_time: 8.5m tok/s: 6722365 +4344/20000 train_loss: 2.4865 train_time: 8.5m tok/s: 6722041 +4345/20000 train_loss: 2.3206 
train_time: 8.5m tok/s: 6721725 +4346/20000 train_loss: 2.3915 train_time: 8.5m tok/s: 6721439 +4347/20000 train_loss: 2.4026 train_time: 8.5m tok/s: 6721161 +4348/20000 train_loss: 2.3976 train_time: 8.5m tok/s: 6720858 +4349/20000 train_loss: 1.8610 train_time: 8.5m tok/s: 6720515 +4350/20000 train_loss: 2.2267 train_time: 8.5m tok/s: 6720213 +4351/20000 train_loss: 2.4175 train_time: 8.5m tok/s: 6719933 +4352/20000 train_loss: 2.3706 train_time: 8.5m tok/s: 6719615 +4353/20000 train_loss: 2.4613 train_time: 8.5m tok/s: 6719340 +4354/20000 train_loss: 2.4761 train_time: 8.5m tok/s: 6719045 +4355/20000 train_loss: 2.4059 train_time: 8.5m tok/s: 6718778 +4356/20000 train_loss: 2.4682 train_time: 8.5m tok/s: 6718480 +4357/20000 train_loss: 2.5053 train_time: 8.5m tok/s: 6718173 +4358/20000 train_loss: 2.4054 train_time: 8.5m tok/s: 6717903 +4359/20000 train_loss: 2.4193 train_time: 8.5m tok/s: 6717596 +4360/20000 train_loss: 2.3197 train_time: 8.5m tok/s: 6717317 +4361/20000 train_loss: 2.2813 train_time: 8.5m tok/s: 6717027 +4362/20000 train_loss: 2.4558 train_time: 8.5m tok/s: 6716724 +4363/20000 train_loss: 2.2143 train_time: 8.5m tok/s: 6716406 +4364/20000 train_loss: 2.3503 train_time: 8.5m tok/s: 6716108 +4365/20000 train_loss: 2.5505 train_time: 8.5m tok/s: 6715827 +4366/20000 train_loss: 2.7721 train_time: 8.5m tok/s: 6715518 +4367/20000 train_loss: 2.5278 train_time: 8.5m tok/s: 6715227 +4368/20000 train_loss: 2.5053 train_time: 8.5m tok/s: 6714924 +4369/20000 train_loss: 2.3122 train_time: 8.5m tok/s: 6714618 +4370/20000 train_loss: 2.4949 train_time: 8.5m tok/s: 6714335 +4371/20000 train_loss: 2.3880 train_time: 8.5m tok/s: 6714022 +4372/20000 train_loss: 2.4783 train_time: 8.5m tok/s: 6713714 +4373/20000 train_loss: 2.2524 train_time: 8.5m tok/s: 6713408 +4374/20000 train_loss: 2.3128 train_time: 8.5m tok/s: 6713112 +4375/20000 train_loss: 2.4045 train_time: 8.5m tok/s: 6712816 +4376/20000 train_loss: 2.3834 train_time: 8.5m tok/s: 6712490 +4377/20000 
train_loss: 2.5521 train_time: 8.5m tok/s: 6712204 +4378/20000 train_loss: 2.2660 train_time: 8.5m tok/s: 6711893 +4379/20000 train_loss: 2.3272 train_time: 8.6m tok/s: 6711617 +4380/20000 train_loss: 2.4335 train_time: 8.6m tok/s: 6711340 +4381/20000 train_loss: 2.5031 train_time: 8.6m tok/s: 6711068 +4382/20000 train_loss: 2.4871 train_time: 8.6m tok/s: 6710769 +4383/20000 train_loss: 2.4339 train_time: 8.6m tok/s: 6710451 +4384/20000 train_loss: 2.4952 train_time: 8.6m tok/s: 6710157 +4385/20000 train_loss: 2.4210 train_time: 8.6m tok/s: 6709852 +4386/20000 train_loss: 2.3982 train_time: 8.6m tok/s: 6709553 +4387/20000 train_loss: 2.4682 train_time: 8.6m tok/s: 6709247 +4388/20000 train_loss: 2.3857 train_time: 8.6m tok/s: 6708956 +4389/20000 train_loss: 2.2617 train_time: 8.6m tok/s: 6708661 +4390/20000 train_loss: 2.4042 train_time: 8.6m tok/s: 6708361 +4391/20000 train_loss: 2.2865 train_time: 8.6m tok/s: 6708058 +4392/20000 train_loss: 2.3929 train_time: 8.6m tok/s: 6707767 +4393/20000 train_loss: 2.3914 train_time: 8.6m tok/s: 6707459 +4394/20000 train_loss: 2.3708 train_time: 8.6m tok/s: 6707175 +4395/20000 train_loss: 2.3400 train_time: 8.6m tok/s: 6706891 +4396/20000 train_loss: 2.5816 train_time: 8.6m tok/s: 6706585 +4397/20000 train_loss: 2.4130 train_time: 8.6m tok/s: 6706301 +4398/20000 train_loss: 2.3429 train_time: 8.6m tok/s: 6706019 +4399/20000 train_loss: 2.3590 train_time: 8.6m tok/s: 6705737 +4400/20000 train_loss: 2.5065 train_time: 8.6m tok/s: 6705454 +4401/20000 train_loss: 2.4328 train_time: 8.6m tok/s: 6705171 +4402/20000 train_loss: 2.3784 train_time: 8.6m tok/s: 6704890 +4403/20000 train_loss: 2.3663 train_time: 8.6m tok/s: 6704606 +4404/20000 train_loss: 2.3194 train_time: 8.6m tok/s: 6704327 +4405/20000 train_loss: 2.5378 train_time: 8.6m tok/s: 6704032 +4406/20000 train_loss: 2.4058 train_time: 8.6m tok/s: 6703741 +4407/20000 train_loss: 2.4497 train_time: 8.6m tok/s: 6703462 +4408/20000 train_loss: 2.4758 train_time: 8.6m tok/s: 
6703181 +4409/20000 train_loss: 2.4200 train_time: 8.6m tok/s: 6702915 +4410/20000 train_loss: 2.4457 train_time: 8.6m tok/s: 6702639 +4411/20000 train_loss: 2.4640 train_time: 8.6m tok/s: 6702387 +4412/20000 train_loss: 2.5236 train_time: 8.6m tok/s: 6702091 +4413/20000 train_loss: 2.3427 train_time: 8.6m tok/s: 6701799 +4414/20000 train_loss: 2.3517 train_time: 8.6m tok/s: 6701519 +4415/20000 train_loss: 2.5635 train_time: 8.6m tok/s: 6701197 +4416/20000 train_loss: 2.3844 train_time: 8.6m tok/s: 6700905 +4417/20000 train_loss: 2.2951 train_time: 8.6m tok/s: 6700616 +4418/20000 train_loss: 2.3700 train_time: 8.6m tok/s: 6700318 +4419/20000 train_loss: 2.3696 train_time: 8.6m tok/s: 6700043 +4420/20000 train_loss: 2.2846 train_time: 8.6m tok/s: 6699763 +4421/20000 train_loss: 2.2516 train_time: 8.6m tok/s: 6699475 +4422/20000 train_loss: 2.3867 train_time: 8.7m tok/s: 6699189 +4423/20000 train_loss: 2.5335 train_time: 8.7m tok/s: 6698885 +4424/20000 train_loss: 2.4636 train_time: 8.7m tok/s: 6698620 +4425/20000 train_loss: 2.5000 train_time: 8.7m tok/s: 6698338 +4426/20000 train_loss: 2.3037 train_time: 8.7m tok/s: 6698062 +4427/20000 train_loss: 2.3325 train_time: 8.7m tok/s: 6697773 +4428/20000 train_loss: 2.3992 train_time: 8.7m tok/s: 6697487 +4429/20000 train_loss: 2.4078 train_time: 8.7m tok/s: 6697205 +4430/20000 train_loss: 2.5593 train_time: 8.7m tok/s: 6696892 +4431/20000 train_loss: 2.4652 train_time: 8.7m tok/s: 6696552 +4432/20000 train_loss: 2.3570 train_time: 8.7m tok/s: 6696262 +4433/20000 train_loss: 2.2747 train_time: 8.7m tok/s: 6695970 +4434/20000 train_loss: 2.5701 train_time: 8.7m tok/s: 6695659 +4435/20000 train_loss: 2.4319 train_time: 8.7m tok/s: 6695376 +4436/20000 train_loss: 2.4043 train_time: 8.7m tok/s: 6695092 +4437/20000 train_loss: 2.2318 train_time: 8.7m tok/s: 6694798 +4438/20000 train_loss: 2.5273 train_time: 8.7m tok/s: 6694519 +4439/20000 train_loss: 2.4861 train_time: 8.7m tok/s: 6694225 +4440/20000 train_loss: 2.4758 
train_time: 8.7m tok/s: 6693950 +4441/20000 train_loss: 2.3512 train_time: 8.7m tok/s: 6693680 +4442/20000 train_loss: 2.4496 train_time: 8.7m tok/s: 6693385 +4443/20000 train_loss: 2.5191 train_time: 8.7m tok/s: 6693119 +4444/20000 train_loss: 2.3710 train_time: 8.7m tok/s: 6692805 +4445/20000 train_loss: 2.3881 train_time: 8.7m tok/s: 6692527 +4446/20000 train_loss: 2.3101 train_time: 8.7m tok/s: 6692244 +4447/20000 train_loss: 2.3631 train_time: 8.7m tok/s: 6691951 +4448/20000 train_loss: 2.3067 train_time: 8.7m tok/s: 6691652 +4449/20000 train_loss: 2.5094 train_time: 8.7m tok/s: 6691358 +4450/20000 train_loss: 2.3468 train_time: 8.7m tok/s: 6691084 +4451/20000 train_loss: 2.4480 train_time: 8.7m tok/s: 6690788 +4452/20000 train_loss: 2.3652 train_time: 8.7m tok/s: 6690512 +4453/20000 train_loss: 2.4747 train_time: 8.7m tok/s: 6690217 +4454/20000 train_loss: 2.3302 train_time: 8.7m tok/s: 6689932 +4455/20000 train_loss: 2.4053 train_time: 8.7m tok/s: 6689656 +4456/20000 train_loss: 2.4194 train_time: 8.7m tok/s: 6689361 +4457/20000 train_loss: 2.6127 train_time: 8.7m tok/s: 6689091 +4458/20000 train_loss: 2.3862 train_time: 8.7m tok/s: 6688819 +4459/20000 train_loss: 2.2665 train_time: 8.7m tok/s: 6688528 +4460/20000 train_loss: 2.3211 train_time: 8.7m tok/s: 6688215 +4461/20000 train_loss: 2.3537 train_time: 8.7m tok/s: 6687931 +4462/20000 train_loss: 2.4765 train_time: 8.7m tok/s: 6687653 +4463/20000 train_loss: 2.2405 train_time: 8.7m tok/s: 6687334 +4464/20000 train_loss: 2.4953 train_time: 8.7m tok/s: 6687076 +4465/20000 train_loss: 2.4583 train_time: 8.8m tok/s: 6686788 +4466/20000 train_loss: 2.5095 train_time: 8.8m tok/s: 6686506 +4467/20000 train_loss: 2.4530 train_time: 8.8m tok/s: 6686235 +4468/20000 train_loss: 2.2166 train_time: 8.8m tok/s: 6685917 +4469/20000 train_loss: 2.3555 train_time: 8.8m tok/s: 6685606 +4470/20000 train_loss: 2.3491 train_time: 8.8m tok/s: 6685343 +4471/20000 train_loss: 2.4330 train_time: 8.8m tok/s: 6685070 +4472/20000 
train_loss: 2.3990 train_time: 8.8m tok/s: 6684779 +4473/20000 train_loss: 2.5104 train_time: 8.8m tok/s: 6684495 +4474/20000 train_loss: 2.3578 train_time: 8.8m tok/s: 6684206 +4475/20000 train_loss: 2.4741 train_time: 8.8m tok/s: 6683929 +4476/20000 train_loss: 2.4173 train_time: 8.8m tok/s: 6683656 +4477/20000 train_loss: 2.3421 train_time: 8.8m tok/s: 6683360 +4478/20000 train_loss: 2.3640 train_time: 8.8m tok/s: 6683092 +4479/20000 train_loss: 2.4296 train_time: 8.8m tok/s: 6682820 +4480/20000 train_loss: 2.3867 train_time: 8.8m tok/s: 6682538 +4481/20000 train_loss: 2.4155 train_time: 8.8m tok/s: 6682280 +4482/20000 train_loss: 2.3762 train_time: 8.8m tok/s: 6681987 +4483/20000 train_loss: 2.4165 train_time: 8.8m tok/s: 6681707 +4484/20000 train_loss: 2.4974 train_time: 8.8m tok/s: 6681408 +4485/20000 train_loss: 2.3361 train_time: 8.8m tok/s: 6681121 +4486/20000 train_loss: 2.3906 train_time: 8.8m tok/s: 6680868 +4487/20000 train_loss: 2.3808 train_time: 8.8m tok/s: 6680568 +4488/20000 train_loss: 2.3757 train_time: 8.8m tok/s: 6680270 +4489/20000 train_loss: 2.2765 train_time: 8.8m tok/s: 6679985 +4490/20000 train_loss: 2.4607 train_time: 8.8m tok/s: 6679699 +4491/20000 train_loss: 2.3961 train_time: 8.8m tok/s: 6679412 +4492/20000 train_loss: 2.4482 train_time: 8.8m tok/s: 6679138 +4493/20000 train_loss: 2.4597 train_time: 8.8m tok/s: 6678865 +4494/20000 train_loss: 2.4887 train_time: 8.8m tok/s: 6678588 +4495/20000 train_loss: 2.4468 train_time: 8.8m tok/s: 6678309 +4496/20000 train_loss: 2.3280 train_time: 8.8m tok/s: 6678029 +4497/20000 train_loss: 2.4385 train_time: 8.8m tok/s: 6677742 +4498/20000 train_loss: 2.3704 train_time: 8.8m tok/s: 6677461 +4499/20000 train_loss: 2.3604 train_time: 8.8m tok/s: 6677186 +4500/20000 train_loss: 2.3922 train_time: 8.8m tok/s: 6676915 +4501/20000 train_loss: 2.0596 train_time: 8.8m tok/s: 6676590 +4502/20000 train_loss: 2.3531 train_time: 8.8m tok/s: 6676315 +4503/20000 train_loss: 2.2314 train_time: 8.8m tok/s: 
6676049 +4504/20000 train_loss: 2.3194 train_time: 8.8m tok/s: 6675759 +4505/20000 train_loss: 2.3178 train_time: 8.8m tok/s: 6675474 +4506/20000 train_loss: 2.2608 train_time: 8.8m tok/s: 6675197 +4507/20000 train_loss: 2.3522 train_time: 8.9m tok/s: 6674926 +4508/20000 train_loss: 2.4639 train_time: 8.9m tok/s: 6674660 +4509/20000 train_loss: 2.4258 train_time: 8.9m tok/s: 6674388 +4510/20000 train_loss: 2.4530 train_time: 8.9m tok/s: 6674125 +4511/20000 train_loss: 2.2494 train_time: 8.9m tok/s: 6673846 +4512/20000 train_loss: 2.3563 train_time: 8.9m tok/s: 6673565 +4513/20000 train_loss: 2.3731 train_time: 8.9m tok/s: 6673278 +4514/20000 train_loss: 2.3872 train_time: 8.9m tok/s: 6672990 +4515/20000 train_loss: 2.3132 train_time: 8.9m tok/s: 6672716 +4516/20000 train_loss: 2.5643 train_time: 8.9m tok/s: 6672423 +4517/20000 train_loss: 2.6638 train_time: 8.9m tok/s: 6672138 +4518/20000 train_loss: 2.4458 train_time: 8.9m tok/s: 6671879 +4519/20000 train_loss: 2.3627 train_time: 8.9m tok/s: 6671618 +4520/20000 train_loss: 2.3099 train_time: 8.9m tok/s: 6671337 +4521/20000 train_loss: 2.4271 train_time: 8.9m tok/s: 6671055 +4522/20000 train_loss: 2.3445 train_time: 8.9m tok/s: 6670782 +4523/20000 train_loss: 2.4164 train_time: 8.9m tok/s: 6670499 +4524/20000 train_loss: 2.3124 train_time: 8.9m tok/s: 6670231 +4525/20000 train_loss: 2.4880 train_time: 8.9m tok/s: 6669979 +4526/20000 train_loss: 2.4301 train_time: 8.9m tok/s: 6669718 +4527/20000 train_loss: 2.4127 train_time: 8.9m tok/s: 6669444 +4528/20000 train_loss: 2.3276 train_time: 8.9m tok/s: 6669168 +4529/20000 train_loss: 2.3791 train_time: 8.9m tok/s: 6668875 +4530/20000 train_loss: 2.2905 train_time: 8.9m tok/s: 6668587 +4531/20000 train_loss: 2.5287 train_time: 8.9m tok/s: 6668292 +4532/20000 train_loss: 2.3981 train_time: 8.9m tok/s: 6668023 +4533/20000 train_loss: 2.2795 train_time: 8.9m tok/s: 6667772 +4534/20000 train_loss: 2.4270 train_time: 8.9m tok/s: 6667512 +4535/20000 train_loss: 2.4913 
train_time: 8.9m tok/s: 6667230 +4536/20000 train_loss: 2.2861 train_time: 8.9m tok/s: 6666948 +4537/20000 train_loss: 2.3806 train_time: 8.9m tok/s: 6666659 +4538/20000 train_loss: 2.1558 train_time: 8.9m tok/s: 6666376 +4539/20000 train_loss: 2.4297 train_time: 8.9m tok/s: 6666110 +4540/20000 train_loss: 2.3672 train_time: 8.9m tok/s: 6665834 +4541/20000 train_loss: 2.2974 train_time: 8.9m tok/s: 6665566 +4542/20000 train_loss: 2.3860 train_time: 8.9m tok/s: 6665299 +4543/20000 train_loss: 2.2957 train_time: 8.9m tok/s: 6665019 +4544/20000 train_loss: 2.6316 train_time: 8.9m tok/s: 6664734 +4545/20000 train_loss: 2.3768 train_time: 8.9m tok/s: 6664476 +4546/20000 train_loss: 2.3791 train_time: 8.9m tok/s: 6664201 +4547/20000 train_loss: 2.3396 train_time: 8.9m tok/s: 6663912 +4548/20000 train_loss: 2.2313 train_time: 8.9m tok/s: 6663663 +4549/20000 train_loss: 2.4267 train_time: 8.9m tok/s: 6663399 +4550/20000 train_loss: 2.4522 train_time: 9.0m tok/s: 6663137 +4551/20000 train_loss: 2.3482 train_time: 9.0m tok/s: 6662893 +4552/20000 train_loss: 2.3445 train_time: 9.0m tok/s: 6662608 +4553/20000 train_loss: 2.2841 train_time: 9.0m tok/s: 6662362 +4554/20000 train_loss: 2.4559 train_time: 9.0m tok/s: 6662056 +4555/20000 train_loss: 2.4510 train_time: 9.0m tok/s: 6661751 +4556/20000 train_loss: 2.4111 train_time: 9.0m tok/s: 6661464 +4557/20000 train_loss: 2.4583 train_time: 9.0m tok/s: 6661206 +4558/20000 train_loss: 2.5731 train_time: 9.0m tok/s: 6660947 +4559/20000 train_loss: 2.4103 train_time: 9.0m tok/s: 6660674 +4560/20000 train_loss: 2.4260 train_time: 9.0m tok/s: 6660408 +4561/20000 train_loss: 2.4416 train_time: 9.0m tok/s: 6660161 +4562/20000 train_loss: 2.3871 train_time: 9.0m tok/s: 6659924 +4563/20000 train_loss: 2.4996 train_time: 9.0m tok/s: 6659641 +4564/20000 train_loss: 2.3199 train_time: 9.0m tok/s: 6659375 +4565/20000 train_loss: 2.3918 train_time: 9.0m tok/s: 6659119 +4566/20000 train_loss: 2.3614 train_time: 9.0m tok/s: 6658858 +4567/20000 
train_loss: 2.2262 train_time: 9.0m tok/s: 6658611 +4568/20000 train_loss: 2.2567 train_time: 9.0m tok/s: 6658323 +4569/20000 train_loss: 2.4755 train_time: 9.0m tok/s: 6658062 +4570/20000 train_loss: 2.3418 train_time: 9.0m tok/s: 6657820 +4571/20000 train_loss: 2.4145 train_time: 9.0m tok/s: 6657565 +4572/20000 train_loss: 2.3961 train_time: 9.0m tok/s: 6657303 +4573/20000 train_loss: 2.3895 train_time: 9.0m tok/s: 6657045 +4574/20000 train_loss: 2.4003 train_time: 9.0m tok/s: 6656783 +4575/20000 train_loss: 2.2791 train_time: 9.0m tok/s: 6656516 +4576/20000 train_loss: 2.4175 train_time: 9.0m tok/s: 6656271 +4577/20000 train_loss: 2.4102 train_time: 9.0m tok/s: 6656011 +4578/20000 train_loss: 2.3658 train_time: 9.0m tok/s: 6655757 +4579/20000 train_loss: 2.4701 train_time: 9.0m tok/s: 6655467 +4580/20000 train_loss: 2.2497 train_time: 9.0m tok/s: 6655187 +4581/20000 train_loss: 1.9333 train_time: 9.0m tok/s: 6654883 +4582/20000 train_loss: 2.3701 train_time: 9.0m tok/s: 6654618 +4583/20000 train_loss: 2.4078 train_time: 9.0m tok/s: 6654390 +4584/20000 train_loss: 2.4583 train_time: 9.0m tok/s: 6654147 +4585/20000 train_loss: 2.3146 train_time: 9.0m tok/s: 6653882 +4586/20000 train_loss: 2.3567 train_time: 9.0m tok/s: 6653634 +4587/20000 train_loss: 2.5047 train_time: 9.0m tok/s: 6653346 +4588/20000 train_loss: 2.3948 train_time: 9.0m tok/s: 6653088 +4589/20000 train_loss: 2.4592 train_time: 9.0m tok/s: 6652826 +4590/20000 train_loss: 2.3049 train_time: 9.0m tok/s: 6652564 +4591/20000 train_loss: 2.3639 train_time: 9.0m tok/s: 6652323 +4592/20000 train_loss: 2.2539 train_time: 9.0m tok/s: 6652065 +4593/20000 train_loss: 2.4555 train_time: 9.1m tok/s: 6651812 +4594/20000 train_loss: 2.3171 train_time: 9.1m tok/s: 6651546 +4595/20000 train_loss: 2.1855 train_time: 9.1m tok/s: 6651261 +4596/20000 train_loss: 2.4285 train_time: 9.1m tok/s: 6651008 +4597/20000 train_loss: 2.4130 train_time: 9.1m tok/s: 6650698 +4598/20000 train_loss: 2.4603 train_time: 9.1m tok/s: 
6650447 +4599/20000 train_loss: 2.3748 train_time: 9.1m tok/s: 6650199 +4600/20000 train_loss: 2.5591 train_time: 9.1m tok/s: 6649943 +4601/20000 train_loss: 2.4829 train_time: 9.1m tok/s: 6649695 +4602/20000 train_loss: 2.5079 train_time: 9.1m tok/s: 6649446 +4603/20000 train_loss: 2.3784 train_time: 9.1m tok/s: 6649205 +4604/20000 train_loss: 2.3440 train_time: 9.1m tok/s: 6648926 +4605/20000 train_loss: 2.3469 train_time: 9.1m tok/s: 6648671 +4606/20000 train_loss: 2.3461 train_time: 9.1m tok/s: 6648419 +4607/20000 train_loss: 2.3485 train_time: 9.1m tok/s: 6648184 +4608/20000 train_loss: 2.3692 train_time: 9.1m tok/s: 6647902 +4609/20000 train_loss: 2.3181 train_time: 9.1m tok/s: 6647644 +4610/20000 train_loss: 2.4194 train_time: 9.1m tok/s: 6647405 +4611/20000 train_loss: 2.5795 train_time: 9.1m tok/s: 6647137 +4612/20000 train_loss: 2.4497 train_time: 9.1m tok/s: 6646884 +4613/20000 train_loss: 2.5011 train_time: 9.1m tok/s: 6646622 +4614/20000 train_loss: 2.4568 train_time: 9.1m tok/s: 6646368 +4615/20000 train_loss: 2.3050 train_time: 9.1m tok/s: 6646082 +4616/20000 train_loss: 2.2722 train_time: 9.1m tok/s: 6645843 +4617/20000 train_loss: 2.3128 train_time: 9.1m tok/s: 6645587 +4618/20000 train_loss: 2.3701 train_time: 9.1m tok/s: 6645351 +4619/20000 train_loss: 2.2857 train_time: 9.1m tok/s: 6645101 +4620/20000 train_loss: 2.3530 train_time: 9.1m tok/s: 6644849 +4621/20000 train_loss: 2.3891 train_time: 9.1m tok/s: 6644565 +4622/20000 train_loss: 2.4144 train_time: 9.1m tok/s: 6644293 +4623/20000 train_loss: 2.3717 train_time: 9.1m tok/s: 6644048 +4624/20000 train_loss: 2.5495 train_time: 9.1m tok/s: 6643793 +4625/20000 train_loss: 2.5831 train_time: 9.1m tok/s: 6643528 +4626/20000 train_loss: 2.4950 train_time: 9.1m tok/s: 6643261 +4627/20000 train_loss: 2.3984 train_time: 9.1m tok/s: 6643019 +4628/20000 train_loss: 2.4493 train_time: 9.1m tok/s: 6642740 +4629/20000 train_loss: 2.3749 train_time: 9.1m tok/s: 6642495 +4630/20000 train_loss: 2.1516 
train_time: 9.1m tok/s: 6642231 +4631/20000 train_loss: 2.3874 train_time: 9.1m tok/s: 6641970 +4632/20000 train_loss: 2.3126 train_time: 9.1m tok/s: 6641711 +4633/20000 train_loss: 2.2328 train_time: 9.1m tok/s: 6641444 +4634/20000 train_loss: 2.3448 train_time: 9.1m tok/s: 6641201 +4635/20000 train_loss: 2.4744 train_time: 9.1m tok/s: 6640922 +4636/20000 train_loss: 2.3821 train_time: 9.2m tok/s: 6640661 +4637/20000 train_loss: 2.4636 train_time: 9.2m tok/s: 6640406 +4638/20000 train_loss: 2.4506 train_time: 9.2m tok/s: 6640153 +4639/20000 train_loss: 2.3899 train_time: 9.2m tok/s: 6639905 +4640/20000 train_loss: 2.4049 train_time: 9.2m tok/s: 6639643 +4641/20000 train_loss: 2.3465 train_time: 9.2m tok/s: 6639380 +4642/20000 train_loss: 2.3124 train_time: 9.2m tok/s: 6639109 +4643/20000 train_loss: 2.4205 train_time: 9.2m tok/s: 6638841 +4644/20000 train_loss: 2.2553 train_time: 9.2m tok/s: 6638579 +4645/20000 train_loss: 2.3492 train_time: 9.2m tok/s: 6638319 +4646/20000 train_loss: 2.2886 train_time: 9.2m tok/s: 6638060 +4647/20000 train_loss: 2.1952 train_time: 9.2m tok/s: 6637810 +4648/20000 train_loss: 2.3751 train_time: 9.2m tok/s: 6637537 +4649/20000 train_loss: 2.3902 train_time: 9.2m tok/s: 6637287 +4650/20000 train_loss: 2.4108 train_time: 9.2m tok/s: 6637026 +4651/20000 train_loss: 2.4822 train_time: 9.2m tok/s: 6636753 +4652/20000 train_loss: 2.4624 train_time: 9.2m tok/s: 6636479 +4653/20000 train_loss: 2.4322 train_time: 9.2m tok/s: 6636229 +4654/20000 train_loss: 2.2968 train_time: 9.2m tok/s: 6635973 +4655/20000 train_loss: 2.2708 train_time: 9.2m tok/s: 6635708 +4656/20000 train_loss: 2.4646 train_time: 9.2m tok/s: 6635454 +4657/20000 train_loss: 2.3532 train_time: 9.2m tok/s: 6635199 +4658/20000 train_loss: 2.2959 train_time: 9.2m tok/s: 6634945 +4659/20000 train_loss: 2.3288 train_time: 9.2m tok/s: 6634709 +4660/20000 train_loss: 2.3410 train_time: 9.2m tok/s: 6634454 +4661/20000 train_loss: 2.0325 train_time: 9.2m tok/s: 6634138 +4662/20000 
train_loss: 2.3542 train_time: 9.2m tok/s: 6633875 +4663/20000 train_loss: 2.3865 train_time: 9.2m tok/s: 6633654 +4664/20000 train_loss: 2.2924 train_time: 9.2m tok/s: 6633395 +4665/20000 train_loss: 2.4440 train_time: 9.2m tok/s: 6633140 +4666/20000 train_loss: 2.3333 train_time: 9.2m tok/s: 6632885 +4667/20000 train_loss: 2.4309 train_time: 9.2m tok/s: 6632617 +4668/20000 train_loss: 2.3975 train_time: 9.2m tok/s: 6632384 +4669/20000 train_loss: 2.6144 train_time: 9.2m tok/s: 6632120 +4670/20000 train_loss: 2.2799 train_time: 9.2m tok/s: 6631862 +4671/20000 train_loss: 2.4520 train_time: 9.2m tok/s: 6631600 +4672/20000 train_loss: 2.3491 train_time: 9.2m tok/s: 6631341 +4673/20000 train_loss: 2.3259 train_time: 9.2m tok/s: 6631078 +4674/20000 train_loss: 2.4326 train_time: 9.2m tok/s: 6630834 +4675/20000 train_loss: 2.3725 train_time: 9.2m tok/s: 6630570 +4676/20000 train_loss: 2.3605 train_time: 9.2m tok/s: 6630319 +4677/20000 train_loss: 2.4168 train_time: 9.2m tok/s: 6630076 +4678/20000 train_loss: 2.3839 train_time: 9.2m tok/s: 6629827 +4679/20000 train_loss: 2.2910 train_time: 9.3m tok/s: 6629586 +4680/20000 train_loss: 2.3024 train_time: 9.3m tok/s: 6629336 +4681/20000 train_loss: 2.3740 train_time: 9.3m tok/s: 6629077 +4682/20000 train_loss: 2.2895 train_time: 9.3m tok/s: 6628821 +4683/20000 train_loss: 2.3611 train_time: 9.3m tok/s: 6628567 +4684/20000 train_loss: 2.3500 train_time: 9.3m tok/s: 6628315 +4685/20000 train_loss: 2.3215 train_time: 9.3m tok/s: 6628046 +4686/20000 train_loss: 2.2539 train_time: 9.3m tok/s: 6627787 +4687/20000 train_loss: 2.3871 train_time: 9.3m tok/s: 6627517 +4688/20000 train_loss: 2.4456 train_time: 9.3m tok/s: 6627262 +4689/20000 train_loss: 2.4388 train_time: 9.3m tok/s: 6627013 +4690/20000 train_loss: 2.4983 train_time: 9.3m tok/s: 6626767 +4691/20000 train_loss: 2.4024 train_time: 9.3m tok/s: 6626509 +4692/20000 train_loss: 2.4186 train_time: 9.3m tok/s: 6626259 +4693/20000 train_loss: 2.3652 train_time: 9.3m tok/s: 
6625999 +4694/20000 train_loss: 2.4527 train_time: 9.3m tok/s: 6625750 +4695/20000 train_loss: 2.4378 train_time: 9.3m tok/s: 6625477 +4696/20000 train_loss: 2.3405 train_time: 9.3m tok/s: 6625225 +4697/20000 train_loss: 2.3377 train_time: 9.3m tok/s: 6624994 +4698/20000 train_loss: 2.3723 train_time: 9.3m tok/s: 6624750 +4699/20000 train_loss: 2.3659 train_time: 9.3m tok/s: 6624519 +4700/20000 train_loss: 2.1580 train_time: 9.3m tok/s: 6624269 +4701/20000 train_loss: 2.4155 train_time: 9.3m tok/s: 6624018 +4702/20000 train_loss: 2.5016 train_time: 9.3m tok/s: 6623718 +4703/20000 train_loss: 2.4466 train_time: 9.3m tok/s: 6623457 +4704/20000 train_loss: 2.3639 train_time: 9.3m tok/s: 6623217 +4705/20000 train_loss: 2.3338 train_time: 9.3m tok/s: 6622974 +4706/20000 train_loss: 2.3281 train_time: 9.3m tok/s: 6622719 +4707/20000 train_loss: 2.4484 train_time: 9.3m tok/s: 6622482 +4708/20000 train_loss: 2.3554 train_time: 9.3m tok/s: 6622204 +4709/20000 train_loss: 2.3504 train_time: 9.3m tok/s: 6621957 +4710/20000 train_loss: 2.3936 train_time: 9.3m tok/s: 6621698 +4711/20000 train_loss: 2.3593 train_time: 9.3m tok/s: 6621439 +4712/20000 train_loss: 2.3171 train_time: 9.3m tok/s: 6621207 +4713/20000 train_loss: 2.4409 train_time: 9.3m tok/s: 6620946 +4714/20000 train_loss: 2.3416 train_time: 9.3m tok/s: 6620677 +4715/20000 train_loss: 2.4660 train_time: 9.3m tok/s: 6620423 +4716/20000 train_loss: 2.4928 train_time: 9.3m tok/s: 6620177 +4717/20000 train_loss: 2.3211 train_time: 9.3m tok/s: 6619935 +4718/20000 train_loss: 2.3110 train_time: 9.3m tok/s: 6619691 +4719/20000 train_loss: 2.3513 train_time: 9.3m tok/s: 6619457 +4720/20000 train_loss: 2.3034 train_time: 9.3m tok/s: 6619185 +4721/20000 train_loss: 2.2790 train_time: 9.3m tok/s: 6618930 +4722/20000 train_loss: 2.4753 train_time: 9.4m tok/s: 6618690 +4723/20000 train_loss: 2.3220 train_time: 9.4m tok/s: 6618443 +4724/20000 train_loss: 2.3629 train_time: 9.4m tok/s: 6618200 +4725/20000 train_loss: 2.2543 
train_time: 9.4m tok/s: 6617928 +4726/20000 train_loss: 2.3963 train_time: 9.4m tok/s: 6617671 +4727/20000 train_loss: 2.3740 train_time: 9.4m tok/s: 6617429 +4728/20000 train_loss: 2.3554 train_time: 9.4m tok/s: 6617183 +4729/20000 train_loss: 2.3643 train_time: 9.4m tok/s: 6616932 +4730/20000 train_loss: 2.2017 train_time: 9.4m tok/s: 6616687 +4731/20000 train_loss: 2.3617 train_time: 9.4m tok/s: 6616426 +4732/20000 train_loss: 2.3129 train_time: 9.4m tok/s: 6616189 +4733/20000 train_loss: 2.4162 train_time: 9.4m tok/s: 6615943 +4734/20000 train_loss: 2.3951 train_time: 9.4m tok/s: 6615702 +4735/20000 train_loss: 2.1973 train_time: 9.4m tok/s: 6615434 +4736/20000 train_loss: 2.3372 train_time: 9.4m tok/s: 6615202 +4737/20000 train_loss: 2.4776 train_time: 9.4m tok/s: 6614947 +4738/20000 train_loss: 2.3508 train_time: 9.4m tok/s: 6614703 +4739/20000 train_loss: 2.4810 train_time: 9.4m tok/s: 6614405 +4740/20000 train_loss: 2.4548 train_time: 9.4m tok/s: 6614156 +4741/20000 train_loss: 2.3779 train_time: 9.4m tok/s: 6613886 +4742/20000 train_loss: 2.3399 train_time: 9.4m tok/s: 6613656 +4743/20000 train_loss: 2.2966 train_time: 9.4m tok/s: 6613399 +4744/20000 train_loss: 2.3667 train_time: 9.4m tok/s: 6613160 +4745/20000 train_loss: 2.2543 train_time: 9.4m tok/s: 6612915 +4746/20000 train_loss: 2.2238 train_time: 9.4m tok/s: 6612680 +4747/20000 train_loss: 2.4582 train_time: 9.4m tok/s: 6612444 +4748/20000 train_loss: 2.3972 train_time: 9.4m tok/s: 6612177 +4749/20000 train_loss: 2.3641 train_time: 9.4m tok/s: 6611914 +4750/20000 train_loss: 2.4015 train_time: 9.4m tok/s: 6611689 +4751/20000 train_loss: 2.3246 train_time: 9.4m tok/s: 6611453 +4752/20000 train_loss: 2.3775 train_time: 9.4m tok/s: 6611215 +4753/20000 train_loss: 2.2574 train_time: 9.4m tok/s: 6610970 +4754/20000 train_loss: 2.2225 train_time: 9.4m tok/s: 6610727 +4755/20000 train_loss: 2.4098 train_time: 9.4m tok/s: 6610480 +4756/20000 train_loss: 2.3780 train_time: 9.4m tok/s: 6610231 +4757/20000 
train_loss: 2.3282 train_time: 9.4m tok/s: 6609989 +4758/20000 train_loss: 2.5541 train_time: 9.4m tok/s: 6609746 +4759/20000 train_loss: 2.3841 train_time: 9.4m tok/s: 6609522 +4760/20000 train_loss: 2.2599 train_time: 9.4m tok/s: 6609280 +4761/20000 train_loss: 2.4405 train_time: 9.4m tok/s: 6609041 +4762/20000 train_loss: 2.3450 train_time: 9.4m tok/s: 6608793 +4763/20000 train_loss: 2.4769 train_time: 9.4m tok/s: 6608545 +4764/20000 train_loss: 2.4395 train_time: 9.4m tok/s: 6608300 +4765/20000 train_loss: 2.4230 train_time: 9.5m tok/s: 6608049 +4766/20000 train_loss: 2.3487 train_time: 9.5m tok/s: 6607807 +4767/20000 train_loss: 2.3050 train_time: 9.5m tok/s: 6607563 +4768/20000 train_loss: 2.3768 train_time: 9.5m tok/s: 6607323 +4769/20000 train_loss: 2.3496 train_time: 9.5m tok/s: 6607077 +4770/20000 train_loss: 2.4166 train_time: 9.5m tok/s: 6606825 +4771/20000 train_loss: 2.3464 train_time: 9.5m tok/s: 6606600 +4772/20000 train_loss: 2.4227 train_time: 9.5m tok/s: 6606339 +4773/20000 train_loss: 2.4359 train_time: 9.5m tok/s: 6606106 +4774/20000 train_loss: 2.5465 train_time: 9.5m tok/s: 6605853 +4775/20000 train_loss: 2.4233 train_time: 9.5m tok/s: 6605580 +4776/20000 train_loss: 2.5193 train_time: 9.5m tok/s: 6605356 +4777/20000 train_loss: 2.3943 train_time: 9.5m tok/s: 6605120 +4778/20000 train_loss: 2.3614 train_time: 9.5m tok/s: 6604883 +4779/20000 train_loss: 2.1982 train_time: 9.5m tok/s: 6604628 +4780/20000 train_loss: 2.3159 train_time: 9.5m tok/s: 6604394 +4781/20000 train_loss: 2.3072 train_time: 9.5m tok/s: 6604164 +4782/20000 train_loss: 2.7087 train_time: 9.5m tok/s: 6603895 +4783/20000 train_loss: 2.4208 train_time: 9.5m tok/s: 6603654 +4784/20000 train_loss: 2.3756 train_time: 9.5m tok/s: 6603414 +4785/20000 train_loss: 2.4229 train_time: 9.5m tok/s: 6603165 +4786/20000 train_loss: 2.4212 train_time: 9.5m tok/s: 6602936 +4787/20000 train_loss: 2.3573 train_time: 9.5m tok/s: 6602698 +4788/20000 train_loss: 2.3737 train_time: 9.5m tok/s: 
6602449 +4789/20000 train_loss: 2.1518 train_time: 9.5m tok/s: 6602205 +4790/20000 train_loss: 2.4080 train_time: 9.5m tok/s: 6601964 +4791/20000 train_loss: 2.3496 train_time: 9.5m tok/s: 6601716 +4792/20000 train_loss: 2.2430 train_time: 9.5m tok/s: 6601478 +4793/20000 train_loss: 2.3481 train_time: 9.5m tok/s: 6601236 +4794/20000 train_loss: 2.2226 train_time: 9.5m tok/s: 6601005 +4795/20000 train_loss: 2.3902 train_time: 9.5m tok/s: 6600768 +4796/20000 train_loss: 2.2539 train_time: 9.5m tok/s: 6600554 +4797/20000 train_loss: 2.4014 train_time: 9.5m tok/s: 6600306 +4798/20000 train_loss: 2.5482 train_time: 9.5m tok/s: 6600047 +4799/20000 train_loss: 2.4734 train_time: 9.5m tok/s: 6599818 +4800/20000 train_loss: 2.4457 train_time: 9.5m tok/s: 6599591 +4801/20000 train_loss: 2.4180 train_time: 9.5m tok/s: 6599333 +4802/20000 train_loss: 2.3239 train_time: 9.5m tok/s: 6599110 +4803/20000 train_loss: 2.5116 train_time: 9.5m tok/s: 6598875 +4804/20000 train_loss: 1.9468 train_time: 9.5m tok/s: 6598590 +4805/20000 train_loss: 2.3877 train_time: 9.5m tok/s: 6598332 +4806/20000 train_loss: 2.3722 train_time: 9.5m tok/s: 6598102 +4807/20000 train_loss: 2.3611 train_time: 9.5m tok/s: 6597888 +4808/20000 train_loss: 2.2900 train_time: 9.6m tok/s: 6597647 +4809/20000 train_loss: 2.4079 train_time: 9.6m tok/s: 6597416 +4810/20000 train_loss: 2.4536 train_time: 9.6m tok/s: 6597200 +4811/20000 train_loss: 2.4003 train_time: 9.6m tok/s: 6596954 +4812/20000 train_loss: 2.3534 train_time: 9.6m tok/s: 6596718 +4813/20000 train_loss: 2.3524 train_time: 9.6m tok/s: 6596481 +4814/20000 train_loss: 2.3808 train_time: 9.6m tok/s: 6596262 +4815/20000 train_loss: 2.4411 train_time: 9.6m tok/s: 6596026 +4816/20000 train_loss: 2.3612 train_time: 9.6m tok/s: 6595779 +4817/20000 train_loss: 2.4890 train_time: 9.6m tok/s: 6595547 +4818/20000 train_loss: 2.2813 train_time: 9.6m tok/s: 6595318 +4819/20000 train_loss: 2.4325 train_time: 9.6m tok/s: 6595092 +4820/20000 train_loss: 2.3077 
train_time: 9.6m tok/s: 6594840 +4821/20000 train_loss: 2.3921 train_time: 9.6m tok/s: 6594608 +4822/20000 train_loss: 2.4091 train_time: 9.6m tok/s: 6594381 +4823/20000 train_loss: 2.3777 train_time: 9.6m tok/s: 6594145 +4824/20000 train_loss: 2.4769 train_time: 9.6m tok/s: 6593901 +4825/20000 train_loss: 2.5617 train_time: 9.6m tok/s: 6593648 +4826/20000 train_loss: 2.2990 train_time: 9.6m tok/s: 6593399 +4827/20000 train_loss: 2.2797 train_time: 9.6m tok/s: 6593164 +4828/20000 train_loss: 2.3190 train_time: 9.6m tok/s: 6592919 +4829/20000 train_loss: 2.3610 train_time: 9.6m tok/s: 6592674 +4830/20000 train_loss: 2.3417 train_time: 9.6m tok/s: 6592434 +4831/20000 train_loss: 2.3005 train_time: 9.6m tok/s: 6592194 +4832/20000 train_loss: 2.2504 train_time: 9.6m tok/s: 6591956 +4833/20000 train_loss: 2.2702 train_time: 9.6m tok/s: 6591725 +4834/20000 train_loss: 2.3891 train_time: 9.6m tok/s: 6591485 +4835/20000 train_loss: 2.3358 train_time: 9.6m tok/s: 6591246 +4836/20000 train_loss: 2.3487 train_time: 9.6m tok/s: 6590982 +4837/20000 train_loss: 2.5204 train_time: 9.6m tok/s: 6590744 +4838/20000 train_loss: 2.2807 train_time: 9.6m tok/s: 6590513 +4839/20000 train_loss: 2.4294 train_time: 9.6m tok/s: 6590287 +4840/20000 train_loss: 2.2677 train_time: 9.6m tok/s: 6590049 +4841/20000 train_loss: 2.4232 train_time: 9.6m tok/s: 6589810 +4842/20000 train_loss: 2.6843 train_time: 9.6m tok/s: 6589568 +4843/20000 train_loss: 2.3009 train_time: 9.6m tok/s: 6589330 +4844/20000 train_loss: 2.2899 train_time: 9.6m tok/s: 6589101 +4845/20000 train_loss: 2.3239 train_time: 9.6m tok/s: 6588881 +4846/20000 train_loss: 2.3398 train_time: 9.6m tok/s: 6588622 +4847/20000 train_loss: 2.5501 train_time: 9.6m tok/s: 6588406 +4848/20000 train_loss: 2.3077 train_time: 9.6m tok/s: 6588162 +4849/20000 train_loss: 2.4476 train_time: 9.6m tok/s: 6587948 +4850/20000 train_loss: 2.2920 train_time: 9.6m tok/s: 6587695 +4851/20000 train_loss: 2.3365 train_time: 9.7m tok/s: 6587462 +4852/20000 
train_loss: 2.3513 train_time: 9.7m tok/s: 6587234 +4853/20000 train_loss: 2.2910 train_time: 9.7m tok/s: 6586981 +4854/20000 train_loss: 2.4535 train_time: 9.7m tok/s: 6586751 +4855/20000 train_loss: 2.3458 train_time: 9.7m tok/s: 6586530 +4856/20000 train_loss: 2.2774 train_time: 9.7m tok/s: 6586306 +4857/20000 train_loss: 2.2852 train_time: 9.7m tok/s: 6586070 +4858/20000 train_loss: 2.2667 train_time: 9.7m tok/s: 6585820 +4859/20000 train_loss: 2.3087 train_time: 9.7m tok/s: 6585573 +4860/20000 train_loss: 2.4699 train_time: 9.7m tok/s: 6585352 +4861/20000 train_loss: 2.4042 train_time: 9.7m tok/s: 6585123 +4862/20000 train_loss: 2.3105 train_time: 9.7m tok/s: 6584890 +4863/20000 train_loss: 2.5780 train_time: 9.7m tok/s: 6584662 +4864/20000 train_loss: 2.3987 train_time: 9.7m tok/s: 6584432 +4865/20000 train_loss: 2.5059 train_time: 9.7m tok/s: 6584176 +4866/20000 train_loss: 2.2645 train_time: 9.7m tok/s: 6583920 +4867/20000 train_loss: 2.2462 train_time: 9.7m tok/s: 6583696 +4868/20000 train_loss: 2.3088 train_time: 9.7m tok/s: 6583469 +4869/20000 train_loss: 2.4410 train_time: 9.7m tok/s: 6583215 +4870/20000 train_loss: 2.3563 train_time: 9.7m tok/s: 6582997 +4871/20000 train_loss: 2.3606 train_time: 9.7m tok/s: 6582768 +4872/20000 train_loss: 2.2728 train_time: 9.7m tok/s: 6582509 +4873/20000 train_loss: 2.5435 train_time: 9.7m tok/s: 6582277 +4874/20000 train_loss: 2.4189 train_time: 9.7m tok/s: 6582066 +4875/20000 train_loss: 2.4035 train_time: 9.7m tok/s: 6581848 +4876/20000 train_loss: 2.4026 train_time: 9.7m tok/s: 6581629 +4877/20000 train_loss: 2.3003 train_time: 9.7m tok/s: 6581403 +4878/20000 train_loss: 2.4007 train_time: 9.7m tok/s: 6581162 +4879/20000 train_loss: 2.3613 train_time: 9.7m tok/s: 6580938 +4880/20000 train_loss: 2.4316 train_time: 9.7m tok/s: 6580702 +4881/20000 train_loss: 2.3413 train_time: 9.7m tok/s: 6580473 +4882/20000 train_loss: 2.2732 train_time: 9.7m tok/s: 6580241 +4883/20000 train_loss: 2.2500 train_time: 9.7m tok/s: 
6580017 +4884/20000 train_loss: 2.6043 train_time: 9.7m tok/s: 6579799 +4885/20000 train_loss: 2.4654 train_time: 9.7m tok/s: 6579567 +4886/20000 train_loss: 2.4352 train_time: 9.7m tok/s: 6579351 +4887/20000 train_loss: 2.4040 train_time: 9.7m tok/s: 6579120 +4888/20000 train_loss: 2.4027 train_time: 9.7m tok/s: 6578891 +4889/20000 train_loss: 2.3467 train_time: 9.7m tok/s: 6578671 +4890/20000 train_loss: 2.4002 train_time: 9.7m tok/s: 6578435 +4891/20000 train_loss: 2.1710 train_time: 9.7m tok/s: 6578195 +4892/20000 train_loss: 1.9570 train_time: 9.7m tok/s: 6577905 +4893/20000 train_loss: 2.4379 train_time: 9.8m tok/s: 6577666 +4894/20000 train_loss: 2.3804 train_time: 9.8m tok/s: 6577465 +4895/20000 train_loss: 2.3734 train_time: 9.8m tok/s: 6577250 +4895/20000 val_loss: 2.3562 val_bpb: 1.0766 +stopping_early: wallclock_cap train_time: 585331ms step: 4895/20000 +peak memory allocated: 41707 MiB reserved: 47048 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:2.33221133 val_bpb:1.06566160 eval_time:7561ms +Serialized model: 135418111 bytes +Code size (uncompressed): 182796 bytes +Code size (compressed): 45910 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 4.1s +Quantized weights: + gate_int8_row: blocks.attn.attn_gate_w + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int6)+lqer_asym: blocks.mlp.fc.weight + gptq (int7)+awqgrpint8+lqer_asym: tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights, smear_gate.weight, smear_lambda, softcap_neg, softcap_pos +Serialize: per-group lrzip compression... 
+Serialize: per-group compression done in 119.5s +Serialized model quantized+pergroup: 15949305 bytes +Total submission size quantized+pergroup: 15995215 bytes +Deserialize: per-group lrzip decompression... +Deserialize: decompression done in 21.1s +diagnostic quantized val_loss:2.34933075 val_bpb:1.07348401 eval_time:10651ms +Deserialize: per-group lrzip decompression... +Deserialize: decompression done in 21.1s +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (104.8s) +v5:precomputing ngram hints OUTSIDE eval timer +ngram_tilt:hints total=47851520 gated=13023303 token_gate=628130 within_gate=9866847 word_gate=2891588 agree2plus=303177 +ngram_tilt:precompute_outside_timer_done elapsed=160.59s total_targets=47851520 + +beginning TTT eval timer +ngram_tilt:using_precomputed_hints total_targets=47851520 (precompute time excluded from eval) +ttt_phased: total_docs:50000 prefix_docs:2500 suffix_docs:47500 num_phases:3 boundaries:[833, 1666, 2500] +ttp: b777/782 bl:2.3077 bb:1.0815 rl:2.3077 rb:1.0815 dl:8452-9229 gd:0 +ttp: b772/782 bl:2.3220 bb:1.0944 rl:2.3134 rb:1.0866 dl:5762-6095 gd:0 +ttp: b767/782 bl:2.2635 bb:1.0710 rl:2.3012 rb:1.0828 dl:4681-4858 gd:0 +ttpp: phase:1/3 pd:1296 gd:833 t:225.4s +tttg: c1/131 lr:0.001000 t:1.8s +tttg: c2/131 lr:0.001000 t:1.9s +tttg: c3/131 lr:0.000999 t:2.0s +tttg: c4/131 lr:0.000999 t:2.1s +tttg: c5/131 lr:0.000998 t:2.1s +tttg: c6/131 lr:0.000996 t:2.2s +tttg: c7/131 lr:0.000995 t:2.3s +tttg: c8/131 lr:0.000993 t:2.4s +tttg: c9/131 lr:0.000991 t:2.4s +tttg: c10/131 lr:0.000988 t:2.5s +tttg: c11/131 lr:0.000985 t:2.6s +tttg: c12/131 lr:0.000982 t:2.7s +tttg: c13/131 lr:0.000979 t:2.7s +tttg: c14/131 lr:0.000976 t:2.8s +tttg: c15/131 lr:0.000972 t:2.9s +tttg: c16/131 lr:0.000968 t:3.0s +tttg: c17/131 lr:0.000963 t:3.0s +tttg: c18/131 lr:0.000958 t:3.1s +tttg: c19/131 lr:0.000953 t:3.2s +tttg: c20/131 lr:0.000948 t:3.2s +tttg: c21/131 lr:0.000943 t:3.3s +tttg: c22/131 lr:0.000937 t:3.4s 
+tttg: c23/131 lr:0.000931 t:3.5s +tttg: c24/131 lr:0.000925 t:3.5s +tttg: c25/131 lr:0.000918 t:3.6s +tttg: c26/131 lr:0.000911 t:3.7s +tttg: c27/131 lr:0.000905 t:3.8s +tttg: c28/131 lr:0.000897 t:3.9s +tttg: c29/131 lr:0.000890 t:3.9s +tttg: c30/131 lr:0.000882 t:4.0s +tttg: c31/131 lr:0.000874 t:4.1s +tttg: c32/131 lr:0.000866 t:4.2s +tttg: c33/131 lr:0.000858 t:4.2s +tttg: c34/131 lr:0.000849 t:4.3s +tttg: c35/131 lr:0.000841 t:4.4s +tttg: c36/131 lr:0.000832 t:4.4s +tttg: c37/131 lr:0.000822 t:4.5s +tttg: c38/131 lr:0.000813 t:4.6s +tttg: c39/131 lr:0.000804 t:4.7s +tttg: c40/131 lr:0.000794 t:4.7s +tttg: c41/131 lr:0.000784 t:4.8s +tttg: c42/131 lr:0.000774 t:4.9s +tttg: c43/131 lr:0.000764 t:5.0s +tttg: c44/131 lr:0.000753 t:5.0s +tttg: c45/131 lr:0.000743 t:5.1s +tttg: c46/131 lr:0.000732 t:5.2s +tttg: c47/131 lr:0.000722 t:5.3s +tttg: c48/131 lr:0.000711 t:5.3s +tttg: c49/131 lr:0.000700 t:5.4s +tttg: c50/131 lr:0.000689 t:5.5s +tttg: c51/131 lr:0.000677 t:5.6s +tttg: c52/131 lr:0.000666 t:5.6s +tttg: c53/131 lr:0.000655 t:5.7s +tttg: c54/131 lr:0.000643 t:5.8s +tttg: c55/131 lr:0.000631 t:5.9s +tttg: c56/131 lr:0.000620 t:5.9s +tttg: c57/131 lr:0.000608 t:6.0s +tttg: c58/131 lr:0.000596 t:6.1s +tttg: c59/131 lr:0.000584 t:6.2s +tttg: c60/131 lr:0.000572 t:6.2s +tttg: c61/131 lr:0.000560 t:6.3s +tttg: c62/131 lr:0.000548 t:6.4s +tttg: c63/131 lr:0.000536 t:6.5s +tttg: c64/131 lr:0.000524 t:6.5s +tttg: c65/131 lr:0.000512 t:6.6s +tttg: c66/131 lr:0.000500 t:6.7s +tttg: c67/131 lr:0.000488 t:6.8s +tttg: c68/131 lr:0.000476 t:6.8s +tttg: c69/131 lr:0.000464 t:6.9s +tttg: c70/131 lr:0.000452 t:7.0s +tttg: c71/131 lr:0.000440 t:7.0s +tttg: c72/131 lr:0.000428 t:7.1s +tttg: c73/131 lr:0.000416 t:7.2s +tttg: c74/131 lr:0.000404 t:7.3s +tttg: c75/131 lr:0.000392 t:7.4s +tttg: c76/131 lr:0.000380 t:7.4s +tttg: c77/131 lr:0.000369 t:7.5s +tttg: c78/131 lr:0.000357 t:7.6s +tttg: c79/131 lr:0.000345 t:7.7s +tttg: c80/131 lr:0.000334 t:7.7s +tttg: c81/131 lr:0.000323 
t:7.8s +tttg: c82/131 lr:0.000311 t:7.9s +tttg: c83/131 lr:0.000300 t:8.0s +tttg: c84/131 lr:0.000289 t:8.0s +tttg: c85/131 lr:0.000278 t:8.1s +tttg: c86/131 lr:0.000268 t:8.2s +tttg: c87/131 lr:0.000257 t:8.3s +tttg: c88/131 lr:0.000247 t:8.3s +tttg: c89/131 lr:0.000236 t:8.4s +tttg: c90/131 lr:0.000226 t:8.5s +tttg: c91/131 lr:0.000216 t:8.5s +tttg: c92/131 lr:0.000206 t:8.6s +tttg: c93/131 lr:0.000196 t:8.7s +tttg: c94/131 lr:0.000187 t:8.8s +tttg: c95/131 lr:0.000178 t:8.9s +tttg: c96/131 lr:0.000168 t:8.9s +tttg: c97/131 lr:0.000159 t:9.0s +tttg: c98/131 lr:0.000151 t:9.1s +tttg: c99/131 lr:0.000142 t:9.1s +tttg: c100/131 lr:0.000134 t:9.2s +tttg: c101/131 lr:0.000126 t:9.3s +tttg: c102/131 lr:0.000118 t:9.4s +tttg: c103/131 lr:0.000110 t:9.5s +tttg: c104/131 lr:0.000103 t:9.5s +tttg: c105/131 lr:0.000095 t:9.6s +tttg: c106/131 lr:0.000089 t:9.7s +tttg: c107/131 lr:0.000082 t:9.7s +tttg: c108/131 lr:0.000075 t:9.8s +tttg: c109/131 lr:0.000069 t:9.9s +tttg: c110/131 lr:0.000063 t:10.0s +tttg: c111/131 lr:0.000057 t:10.0s +tttg: c112/131 lr:0.000052 t:10.1s +tttg: c113/131 lr:0.000047 t:10.2s +tttg: c114/131 lr:0.000042 t:10.3s +tttg: c115/131 lr:0.000037 t:10.3s +tttg: c116/131 lr:0.000032 t:10.4s +tttg: c117/131 lr:0.000028 t:10.5s +tttg: c118/131 lr:0.000024 t:10.5s +tttg: c119/131 lr:0.000021 t:10.6s +tttg: c120/131 lr:0.000018 t:10.7s +tttg: c121/131 lr:0.000015 t:10.8s +tttg: c122/131 lr:0.000012 t:10.8s +tttg: c123/131 lr:0.000009 t:10.9s +tttg: c124/131 lr:0.000007 t:11.0s +tttg: c125/131 lr:0.000005 t:11.1s +tttg: c126/131 lr:0.000004 t:11.1s +tttg: c127/131 lr:0.000002 t:11.2s +tttg: c128/131 lr:0.000001 t:11.3s +tttg: c129/131 lr:0.000001 t:11.4s +tttg: c130/131 lr:0.000000 t:11.4s +ttpr: phase:1/3 t:238.5s +ttp: b757/782 bl:2.2770 bb:1.0599 rl:2.2975 rb:1.0793 dl:3550-3633 gd:0 +ttp: b753/782 bl:2.2098 bb:0.9976 rl:2.2865 rb:1.0686 dl:3284-3344 gd:0 +ttpp: phase:2/3 pd:2128 gd:1666 t:410.7s +tttg: c1/219 lr:0.001000 t:0.1s +tttg: c2/219 lr:0.001000 
t:0.2s +tttg: c3/219 lr:0.001000 t:0.2s +tttg: c4/219 lr:0.001000 t:0.3s +tttg: c5/219 lr:0.000999 t:0.4s +tttg: c6/219 lr:0.000999 t:0.4s +tttg: c7/219 lr:0.000998 t:0.5s +tttg: c8/219 lr:0.000997 t:0.6s +tttg: c9/219 lr:0.000997 t:0.7s +tttg: c10/219 lr:0.000996 t:0.7s +tttg: c11/219 lr:0.000995 t:0.8s +tttg: c12/219 lr:0.000994 t:0.9s +tttg: c13/219 lr:0.000993 t:1.0s +tttg: c14/219 lr:0.000991 t:1.0s +tttg: c15/219 lr:0.000990 t:1.1s +tttg: c16/219 lr:0.000988 t:1.2s +tttg: c17/219 lr:0.000987 t:1.3s +tttg: c18/219 lr:0.000985 t:1.3s +tttg: c19/219 lr:0.000983 t:1.4s +tttg: c20/219 lr:0.000981 t:1.5s +tttg: c21/219 lr:0.000979 t:1.6s +tttg: c22/219 lr:0.000977 t:1.6s +tttg: c23/219 lr:0.000975 t:1.7s +tttg: c24/219 lr:0.000973 t:1.8s +tttg: c25/219 lr:0.000970 t:1.9s +tttg: c26/219 lr:0.000968 t:1.9s +tttg: c27/219 lr:0.000965 t:2.0s +tttg: c28/219 lr:0.000963 t:2.1s +tttg: c29/219 lr:0.000960 t:2.1s +tttg: c30/219 lr:0.000957 t:2.2s +tttg: c31/219 lr:0.000954 t:2.3s +tttg: c32/219 lr:0.000951 t:2.4s +tttg: c33/219 lr:0.000948 t:2.5s +tttg: c34/219 lr:0.000945 t:2.5s +tttg: c35/219 lr:0.000941 t:2.6s +tttg: c36/219 lr:0.000938 t:2.7s +tttg: c37/219 lr:0.000934 t:2.8s +tttg: c38/219 lr:0.000931 t:2.8s +tttg: c39/219 lr:0.000927 t:2.9s +tttg: c40/219 lr:0.000923 t:3.0s +tttg: c41/219 lr:0.000919 t:3.1s +tttg: c42/219 lr:0.000915 t:3.1s +tttg: c43/219 lr:0.000911 t:3.2s +tttg: c44/219 lr:0.000907 t:3.3s +tttg: c45/219 lr:0.000903 t:3.4s +tttg: c46/219 lr:0.000898 t:3.4s +tttg: c47/219 lr:0.000894 t:3.5s +tttg: c48/219 lr:0.000890 t:3.6s +tttg: c49/219 lr:0.000885 t:3.7s +tttg: c50/219 lr:0.000880 t:3.7s +tttg: c51/219 lr:0.000876 t:3.8s +tttg: c52/219 lr:0.000871 t:3.9s +tttg: c53/219 lr:0.000866 t:4.0s +tttg: c54/219 lr:0.000861 t:4.0s +tttg: c55/219 lr:0.000856 t:4.1s +tttg: c56/219 lr:0.000851 t:4.2s +tttg: c57/219 lr:0.000846 t:4.3s +tttg: c58/219 lr:0.000841 t:4.3s +tttg: c59/219 lr:0.000835 t:4.4s +tttg: c60/219 lr:0.000830 t:4.5s +tttg: c61/219 lr:0.000824 
t:4.6s +tttg: c62/219 lr:0.000819 t:4.6s +tttg: c63/219 lr:0.000813 t:4.7s +tttg: c64/219 lr:0.000808 t:4.8s +tttg: c65/219 lr:0.000802 t:4.9s +tttg: c66/219 lr:0.000796 t:4.9s +tttg: c67/219 lr:0.000790 t:5.0s +tttg: c68/219 lr:0.000784 t:5.1s +tttg: c69/219 lr:0.000779 t:5.2s +tttg: c70/219 lr:0.000773 t:5.2s +tttg: c71/219 lr:0.000766 t:5.3s +tttg: c72/219 lr:0.000760 t:5.4s +tttg: c73/219 lr:0.000754 t:5.5s +tttg: c74/219 lr:0.000748 t:5.5s +tttg: c75/219 lr:0.000742 t:5.6s +tttg: c76/219 lr:0.000735 t:5.7s +tttg: c77/219 lr:0.000729 t:5.8s +tttg: c78/219 lr:0.000722 t:5.8s +tttg: c79/219 lr:0.000716 t:5.9s +tttg: c80/219 lr:0.000709 t:6.0s +tttg: c81/219 lr:0.000703 t:6.1s +tttg: c82/219 lr:0.000696 t:6.1s +tttg: c83/219 lr:0.000690 t:6.2s +tttg: c84/219 lr:0.000683 t:6.3s +tttg: c85/219 lr:0.000676 t:6.4s +tttg: c86/219 lr:0.000670 t:6.4s +tttg: c87/219 lr:0.000663 t:6.5s +tttg: c88/219 lr:0.000656 t:6.6s +tttg: c89/219 lr:0.000649 t:6.6s +tttg: c90/219 lr:0.000642 t:6.7s +tttg: c91/219 lr:0.000635 t:6.8s +tttg: c92/219 lr:0.000628 t:6.9s +tttg: c93/219 lr:0.000621 t:7.0s +tttg: c94/219 lr:0.000614 t:7.0s +tttg: c95/219 lr:0.000607 t:7.1s +tttg: c96/219 lr:0.000600 t:7.2s +tttg: c97/219 lr:0.000593 t:7.2s +tttg: c98/219 lr:0.000586 t:7.3s +tttg: c99/219 lr:0.000579 t:7.4s +tttg: c100/219 lr:0.000572 t:7.5s +tttg: c101/219 lr:0.000565 t:7.5s +tttg: c102/219 lr:0.000558 t:7.6s +tttg: c103/219 lr:0.000550 t:7.7s +tttg: c104/219 lr:0.000543 t:7.8s +tttg: c105/219 lr:0.000536 t:7.8s +tttg: c106/219 lr:0.000529 t:7.9s +tttg: c107/219 lr:0.000522 t:8.0s +tttg: c108/219 lr:0.000514 t:8.0s +tttg: c109/219 lr:0.000507 t:8.1s +tttg: c110/219 lr:0.000500 t:8.2s +tttg: c111/219 lr:0.000493 t:8.3s +tttg: c112/219 lr:0.000486 t:8.4s +tttg: c113/219 lr:0.000478 t:8.4s +tttg: c114/219 lr:0.000471 t:8.5s +tttg: c115/219 lr:0.000464 t:8.6s +tttg: c116/219 lr:0.000457 t:8.6s +tttg: c117/219 lr:0.000450 t:8.7s +tttg: c118/219 lr:0.000442 t:8.8s +tttg: c119/219 lr:0.000435 t:8.9s 
+tttg: c120/219 lr:0.000428 t:8.9s +tttg: c121/219 lr:0.000421 t:9.0s +tttg: c122/219 lr:0.000414 t:9.1s +tttg: c123/219 lr:0.000407 t:9.2s +tttg: c124/219 lr:0.000400 t:9.2s +tttg: c125/219 lr:0.000393 t:9.3s +tttg: c126/219 lr:0.000386 t:9.4s +tttg: c127/219 lr:0.000379 t:9.5s +tttg: c128/219 lr:0.000372 t:9.5s +tttg: c129/219 lr:0.000365 t:9.6s +tttg: c130/219 lr:0.000358 t:9.7s +tttg: c131/219 lr:0.000351 t:9.8s +tttg: c132/219 lr:0.000344 t:9.8s +tttg: c133/219 lr:0.000337 t:9.9s +tttg: c134/219 lr:0.000330 t:10.0s +tttg: c135/219 lr:0.000324 t:10.1s +tttg: c136/219 lr:0.000317 t:10.1s +tttg: c137/219 lr:0.000310 t:10.2s +tttg: c138/219 lr:0.000304 t:10.3s +tttg: c139/219 lr:0.000297 t:10.3s +tttg: c140/219 lr:0.000291 t:10.4s +tttg: c141/219 lr:0.000284 t:10.5s +tttg: c142/219 lr:0.000278 t:10.6s +tttg: c143/219 lr:0.000271 t:10.7s +tttg: c144/219 lr:0.000265 t:10.7s +tttg: c145/219 lr:0.000258 t:10.8s +tttg: c146/219 lr:0.000252 t:10.9s +tttg: c147/219 lr:0.000246 t:11.0s +tttg: c148/219 lr:0.000240 t:11.0s +tttg: c149/219 lr:0.000234 t:11.1s +tttg: c150/219 lr:0.000227 t:11.2s +tttg: c151/219 lr:0.000221 t:11.2s +tttg: c152/219 lr:0.000216 t:11.3s +tttg: c153/219 lr:0.000210 t:11.4s +tttg: c154/219 lr:0.000204 t:11.5s +tttg: c155/219 lr:0.000198 t:11.6s +tttg: c156/219 lr:0.000192 t:11.6s +tttg: c157/219 lr:0.000187 t:11.7s +tttg: c158/219 lr:0.000181 t:11.8s +tttg: c159/219 lr:0.000176 t:11.9s +tttg: c160/219 lr:0.000170 t:12.0s +tttg: c161/219 lr:0.000165 t:12.0s +tttg: c162/219 lr:0.000159 t:12.1s +tttg: c163/219 lr:0.000154 t:12.2s +tttg: c164/219 lr:0.000149 t:12.2s +tttg: c165/219 lr:0.000144 t:12.3s +tttg: c166/219 lr:0.000139 t:12.4s +tttg: c167/219 lr:0.000134 t:12.5s +tttg: c168/219 lr:0.000129 t:12.5s +tttg: c169/219 lr:0.000124 t:12.6s +tttg: c170/219 lr:0.000120 t:12.7s +tttg: c171/219 lr:0.000115 t:12.8s +tttg: c172/219 lr:0.000110 t:12.8s +tttg: c173/219 lr:0.000106 t:12.9s +tttg: c174/219 lr:0.000102 t:13.0s +tttg: c175/219 lr:0.000097 
t:13.1s +tttg: c176/219 lr:0.000093 t:13.1s +tttg: c177/219 lr:0.000089 t:13.2s +tttg: c178/219 lr:0.000085 t:13.3s +tttg: c179/219 lr:0.000081 t:13.4s +tttg: c180/219 lr:0.000077 t:13.4s +tttg: c181/219 lr:0.000073 t:13.5s +tttg: c182/219 lr:0.000069 t:13.6s +tttg: c183/219 lr:0.000066 t:13.6s +tttg: c184/219 lr:0.000062 t:13.7s +tttg: c185/219 lr:0.000059 t:13.8s +tttg: c186/219 lr:0.000055 t:13.9s +tttg: c187/219 lr:0.000052 t:13.9s +tttg: c188/219 lr:0.000049 t:14.0s +tttg: c189/219 lr:0.000046 t:14.1s +tttg: c190/219 lr:0.000043 t:14.2s +tttg: c191/219 lr:0.000040 t:14.3s +tttg: c192/219 lr:0.000037 t:14.4s +tttg: c193/219 lr:0.000035 t:14.4s +tttg: c194/219 lr:0.000032 t:14.5s +tttg: c195/219 lr:0.000030 t:14.6s +tttg: c196/219 lr:0.000027 t:14.7s +tttg: c197/219 lr:0.000025 t:14.8s +tttg: c198/219 lr:0.000023 t:14.8s +tttg: c199/219 lr:0.000021 t:14.9s +tttg: c200/219 lr:0.000019 t:15.0s +tttg: c201/219 lr:0.000017 t:15.1s +tttg: c202/219 lr:0.000015 t:15.2s +tttg: c203/219 lr:0.000013 t:15.2s +tttg: c204/219 lr:0.000012 t:15.3s +tttg: c205/219 lr:0.000010 t:15.4s +tttg: c206/219 lr:0.000009 t:15.5s +tttg: c207/219 lr:0.000007 t:15.5s +tttg: c208/219 lr:0.000006 t:15.6s +tttg: c209/219 lr:0.000005 t:15.7s +tttg: c210/219 lr:0.000004 t:15.8s +tttg: c211/219 lr:0.000003 t:15.8s +tttg: c212/219 lr:0.000003 t:15.9s +tttg: c213/219 lr:0.000002 t:16.0s +tttg: c214/219 lr:0.000001 t:16.1s +tttg: c215/219 lr:0.000001 t:16.1s +tttg: c216/219 lr:0.000000 t:16.2s +tttg: c217/219 lr:0.000000 t:16.3s +tttg: c218/219 lr:0.000000 t:16.4s +ttpr: phase:2/3 t:428.8s +ttp: b748/782 bl:2.3170 bb:1.0813 rl:2.2896 rb:1.0699 dl:2992-3039 gd:0 +ttpp: phase:3/3 pd:2960 gd:2500 t:444.9s +tttg: c1/289 lr:0.001000 t:0.1s +tttg: c2/289 lr:0.001000 t:0.2s +tttg: c3/289 lr:0.001000 t:0.2s +tttg: c4/289 lr:0.001000 t:0.3s +tttg: c5/289 lr:0.001000 t:0.4s +tttg: c6/289 lr:0.000999 t:0.4s +tttg: c7/289 lr:0.000999 t:0.5s +tttg: c8/289 lr:0.000999 t:0.6s +tttg: c9/289 lr:0.000998 t:0.7s 
+tttg: c10/289 lr:0.000998 t:0.7s +tttg: c11/289 lr:0.000997 t:0.8s +tttg: c12/289 lr:0.000996 t:0.9s +tttg: c13/289 lr:0.000996 t:1.0s +tttg: c14/289 lr:0.000995 t:1.1s +tttg: c15/289 lr:0.000994 t:1.1s +tttg: c16/289 lr:0.000993 t:1.2s +tttg: c17/289 lr:0.000992 t:1.3s +tttg: c18/289 lr:0.000991 t:1.3s +tttg: c19/289 lr:0.000990 t:1.4s +tttg: c20/289 lr:0.000989 t:1.5s +tttg: c21/289 lr:0.000988 t:1.6s +tttg: c22/289 lr:0.000987 t:1.6s +tttg: c23/289 lr:0.000986 t:1.7s +tttg: c24/289 lr:0.000984 t:1.8s +tttg: c25/289 lr:0.000983 t:1.9s +tttg: c26/289 lr:0.000982 t:1.9s +tttg: c27/289 lr:0.000980 t:2.0s +tttg: c28/289 lr:0.000978 t:2.1s +tttg: c29/289 lr:0.000977 t:2.2s +tttg: c30/289 lr:0.000975 t:2.2s +tttg: c31/289 lr:0.000973 t:2.3s +tttg: c32/289 lr:0.000972 t:2.4s +tttg: c33/289 lr:0.000970 t:2.5s +tttg: c34/289 lr:0.000968 t:2.6s +tttg: c35/289 lr:0.000966 t:2.6s +tttg: c36/289 lr:0.000964 t:2.7s +tttg: c37/289 lr:0.000962 t:2.8s +tttg: c38/289 lr:0.000960 t:2.8s +tttg: c39/289 lr:0.000958 t:2.9s +tttg: c40/289 lr:0.000955 t:3.0s +tttg: c41/289 lr:0.000953 t:3.1s +tttg: c42/289 lr:0.000951 t:3.1s +tttg: c43/289 lr:0.000948 t:3.2s +tttg: c44/289 lr:0.000946 t:3.3s +tttg: c45/289 lr:0.000944 t:3.4s +tttg: c46/289 lr:0.000941 t:3.4s +tttg: c47/289 lr:0.000938 t:3.5s +tttg: c48/289 lr:0.000936 t:3.6s +tttg: c49/289 lr:0.000933 t:3.7s +tttg: c50/289 lr:0.000930 t:3.7s +tttg: c51/289 lr:0.000927 t:3.8s +tttg: c52/289 lr:0.000925 t:3.9s +tttg: c53/289 lr:0.000922 t:4.0s +tttg: c54/289 lr:0.000919 t:4.0s +tttg: c55/289 lr:0.000916 t:4.1s +tttg: c56/289 lr:0.000913 t:4.2s +tttg: c57/289 lr:0.000910 t:4.3s +tttg: c58/289 lr:0.000906 t:4.3s +tttg: c59/289 lr:0.000903 t:4.4s +tttg: c60/289 lr:0.000900 t:4.5s +tttg: c61/289 lr:0.000897 t:4.6s +tttg: c62/289 lr:0.000893 t:4.6s +tttg: c63/289 lr:0.000890 t:4.7s +tttg: c64/289 lr:0.000887 t:4.8s +tttg: c65/289 lr:0.000883 t:4.9s +tttg: c66/289 lr:0.000879 t:4.9s +tttg: c67/289 lr:0.000876 t:5.0s +tttg: c68/289 lr:0.000872 
t:5.1s +tttg: c69/289 lr:0.000869 t:5.2s +tttg: c70/289 lr:0.000865 t:5.2s +tttg: c71/289 lr:0.000861 t:5.3s +tttg: c72/289 lr:0.000857 t:5.4s +tttg: c73/289 lr:0.000854 t:5.5s +tttg: c74/289 lr:0.000850 t:5.5s +tttg: c75/289 lr:0.000846 t:5.6s +tttg: c76/289 lr:0.000842 t:5.7s +tttg: c77/289 lr:0.000838 t:5.8s +tttg: c78/289 lr:0.000834 t:5.8s +tttg: c79/289 lr:0.000830 t:5.9s +tttg: c80/289 lr:0.000826 t:6.0s +tttg: c81/289 lr:0.000821 t:6.1s +tttg: c82/289 lr:0.000817 t:6.1s +tttg: c83/289 lr:0.000813 t:6.2s +tttg: c84/289 lr:0.000809 t:6.3s +tttg: c85/289 lr:0.000804 t:6.3s +tttg: c86/289 lr:0.000800 t:6.4s +tttg: c87/289 lr:0.000796 t:6.5s +tttg: c88/289 lr:0.000791 t:6.6s +tttg: c89/289 lr:0.000787 t:6.7s +tttg: c90/289 lr:0.000782 t:6.7s +tttg: c91/289 lr:0.000778 t:6.8s +tttg: c92/289 lr:0.000773 t:6.9s +tttg: c93/289 lr:0.000769 t:7.0s +tttg: c94/289 lr:0.000764 t:7.0s +tttg: c95/289 lr:0.000759 t:7.1s +tttg: c96/289 lr:0.000755 t:7.2s +tttg: c97/289 lr:0.000750 t:7.2s +tttg: c98/289 lr:0.000745 t:7.3s +tttg: c99/289 lr:0.000740 t:7.4s +tttg: c100/289 lr:0.000736 t:7.5s +tttg: c101/289 lr:0.000731 t:7.5s +tttg: c102/289 lr:0.000726 t:7.6s +tttg: c103/289 lr:0.000721 t:7.7s +tttg: c104/289 lr:0.000716 t:7.8s +tttg: c105/289 lr:0.000711 t:7.8s +tttg: c106/289 lr:0.000706 t:7.9s +tttg: c107/289 lr:0.000701 t:8.0s +tttg: c108/289 lr:0.000696 t:8.1s +tttg: c109/289 lr:0.000691 t:8.1s +tttg: c110/289 lr:0.000686 t:8.2s +tttg: c111/289 lr:0.000681 t:8.3s +tttg: c112/289 lr:0.000676 t:8.4s +tttg: c113/289 lr:0.000671 t:8.4s +tttg: c114/289 lr:0.000666 t:8.5s +tttg: c115/289 lr:0.000661 t:8.6s +tttg: c116/289 lr:0.000656 t:8.7s +tttg: c117/289 lr:0.000650 t:8.7s +tttg: c118/289 lr:0.000645 t:8.8s +tttg: c119/289 lr:0.000640 t:8.9s +tttg: c120/289 lr:0.000635 t:9.0s +tttg: c121/289 lr:0.000629 t:9.0s +tttg: c122/289 lr:0.000624 t:9.1s +tttg: c123/289 lr:0.000619 t:9.2s +tttg: c124/289 lr:0.000614 t:9.3s +tttg: c125/289 lr:0.000608 t:9.3s +tttg: c126/289 lr:0.000603 
t:9.4s +tttg: c127/289 lr:0.000598 t:9.5s +tttg: c128/289 lr:0.000592 t:9.6s +tttg: c129/289 lr:0.000587 t:9.7s +tttg: c130/289 lr:0.000581 t:9.7s +tttg: c131/289 lr:0.000576 t:9.8s +tttg: c132/289 lr:0.000571 t:9.9s +tttg: c133/289 lr:0.000565 t:10.0s +tttg: c134/289 lr:0.000560 t:10.0s +tttg: c135/289 lr:0.000554 t:10.1s +tttg: c136/289 lr:0.000549 t:10.2s +tttg: c137/289 lr:0.000544 t:10.3s +tttg: c138/289 lr:0.000538 t:10.3s +tttg: c139/289 lr:0.000533 t:10.4s +tttg: c140/289 lr:0.000527 t:10.5s +tttg: c141/289 lr:0.000522 t:10.6s +tttg: c142/289 lr:0.000516 t:10.6s +tttg: c143/289 lr:0.000511 t:10.7s +tttg: c144/289 lr:0.000505 t:10.8s +tttg: c145/289 lr:0.000500 t:10.9s +tttg: c146/289 lr:0.000495 t:10.9s +tttg: c147/289 lr:0.000489 t:11.0s +tttg: c148/289 lr:0.000484 t:11.1s +tttg: c149/289 lr:0.000478 t:11.2s +tttg: c150/289 lr:0.000473 t:11.2s +tttg: c151/289 lr:0.000467 t:11.3s +tttg: c152/289 lr:0.000462 t:11.4s +tttg: c153/289 lr:0.000456 t:11.5s +tttg: c154/289 lr:0.000451 t:11.5s +tttg: c155/289 lr:0.000446 t:11.6s +tttg: c156/289 lr:0.000440 t:11.7s +tttg: c157/289 lr:0.000435 t:11.8s +tttg: c158/289 lr:0.000429 t:11.8s +tttg: c159/289 lr:0.000424 t:11.9s +tttg: c160/289 lr:0.000419 t:12.0s +tttg: c161/289 lr:0.000413 t:12.1s +tttg: c162/289 lr:0.000408 t:12.1s +tttg: c163/289 lr:0.000402 t:12.2s +tttg: c164/289 lr:0.000397 t:12.3s +tttg: c165/289 lr:0.000392 t:12.4s +tttg: c166/289 lr:0.000386 t:12.4s +tttg: c167/289 lr:0.000381 t:12.5s +tttg: c168/289 lr:0.000376 t:12.6s +tttg: c169/289 lr:0.000371 t:12.7s +tttg: c170/289 lr:0.000365 t:12.7s +tttg: c171/289 lr:0.000360 t:12.8s +tttg: c172/289 lr:0.000355 t:12.9s +tttg: c173/289 lr:0.000350 t:13.0s +tttg: c174/289 lr:0.000344 t:13.0s +tttg: c175/289 lr:0.000339 t:13.1s +tttg: c176/289 lr:0.000334 t:13.2s +tttg: c177/289 lr:0.000329 t:13.3s +tttg: c178/289 lr:0.000324 t:13.3s +tttg: c179/289 lr:0.000319 t:13.4s +tttg: c180/289 lr:0.000314 t:13.5s +tttg: c181/289 lr:0.000309 t:13.6s +tttg: c182/289 
lr:0.000304 t:13.6s +tttg: c183/289 lr:0.000299 t:13.7s +tttg: c184/289 lr:0.000294 t:13.8s +tttg: c185/289 lr:0.000289 t:13.9s +tttg: c186/289 lr:0.000284 t:13.9s +tttg: c187/289 lr:0.000279 t:14.0s +tttg: c188/289 lr:0.000274 t:14.1s +tttg: c189/289 lr:0.000269 t:14.2s +tttg: c190/289 lr:0.000264 t:14.2s +tttg: c191/289 lr:0.000260 t:14.3s +tttg: c192/289 lr:0.000255 t:14.4s +tttg: c193/289 lr:0.000250 t:14.5s +tttg: c194/289 lr:0.000245 t:14.5s +tttg: c195/289 lr:0.000241 t:14.6s +tttg: c196/289 lr:0.000236 t:14.7s +tttg: c197/289 lr:0.000231 t:14.8s +tttg: c198/289 lr:0.000227 t:14.8s +tttg: c199/289 lr:0.000222 t:14.9s +tttg: c200/289 lr:0.000218 t:15.0s +tttg: c201/289 lr:0.000213 t:15.1s +tttg: c202/289 lr:0.000209 t:15.1s +tttg: c203/289 lr:0.000204 t:15.2s +tttg: c204/289 lr:0.000200 t:15.3s +tttg: c205/289 lr:0.000196 t:15.4s +tttg: c206/289 lr:0.000191 t:15.4s +tttg: c207/289 lr:0.000187 t:15.5s +tttg: c208/289 lr:0.000183 t:15.6s +tttg: c209/289 lr:0.000179 t:15.6s +tttg: c210/289 lr:0.000174 t:15.7s +tttg: c211/289 lr:0.000170 t:15.8s +tttg: c212/289 lr:0.000166 t:15.9s +tttg: c213/289 lr:0.000162 t:16.0s +tttg: c214/289 lr:0.000158 t:16.0s +tttg: c215/289 lr:0.000154 t:16.1s +tttg: c216/289 lr:0.000150 t:16.2s +tttg: c217/289 lr:0.000146 t:16.3s +tttg: c218/289 lr:0.000143 t:16.3s +tttg: c219/289 lr:0.000139 t:16.4s +tttg: c220/289 lr:0.000135 t:16.5s +tttg: c221/289 lr:0.000131 t:16.5s +tttg: c222/289 lr:0.000128 t:16.6s +tttg: c223/289 lr:0.000124 t:16.7s +tttg: c224/289 lr:0.000121 t:16.8s +tttg: c225/289 lr:0.000117 t:16.9s +tttg: c226/289 lr:0.000113 t:16.9s +tttg: c227/289 lr:0.000110 t:17.0s +tttg: c228/289 lr:0.000107 t:17.1s +tttg: c229/289 lr:0.000103 t:17.2s +tttg: c230/289 lr:0.000100 t:17.2s +tttg: c231/289 lr:0.000097 t:17.3s +tttg: c232/289 lr:0.000094 t:17.4s +tttg: c233/289 lr:0.000090 t:17.5s +tttg: c234/289 lr:0.000087 t:17.5s +tttg: c235/289 lr:0.000084 t:17.6s +tttg: c236/289 lr:0.000081 t:17.7s +tttg: c237/289 lr:0.000078 t:17.8s 
+tttg: c238/289 lr:0.000075 t:17.8s +tttg: c239/289 lr:0.000073 t:17.9s +tttg: c240/289 lr:0.000070 t:18.0s +tttg: c241/289 lr:0.000067 t:18.1s +tttg: c242/289 lr:0.000064 t:18.1s +tttg: c243/289 lr:0.000062 t:18.2s +tttg: c244/289 lr:0.000059 t:18.3s +tttg: c245/289 lr:0.000056 t:18.4s +tttg: c246/289 lr:0.000054 t:18.5s +tttg: c247/289 lr:0.000052 t:18.5s +tttg: c248/289 lr:0.000049 t:18.6s +tttg: c249/289 lr:0.000047 t:18.7s +tttg: c250/289 lr:0.000045 t:18.7s +tttg: c251/289 lr:0.000042 t:18.8s +tttg: c252/289 lr:0.000040 t:18.9s +tttg: c253/289 lr:0.000038 t:19.0s +tttg: c254/289 lr:0.000036 t:19.0s +tttg: c255/289 lr:0.000034 t:19.1s +tttg: c256/289 lr:0.000032 t:19.2s +tttg: c257/289 lr:0.000030 t:19.3s +tttg: c258/289 lr:0.000028 t:19.4s +tttg: c259/289 lr:0.000027 t:19.4s +tttg: c260/289 lr:0.000025 t:19.5s +tttg: c261/289 lr:0.000023 t:19.6s +tttg: c262/289 lr:0.000022 t:19.6s +tttg: c263/289 lr:0.000020 t:19.7s +tttg: c264/289 lr:0.000018 t:19.8s +tttg: c265/289 lr:0.000017 t:19.9s +tttg: c266/289 lr:0.000016 t:20.0s +tttg: c267/289 lr:0.000014 t:20.0s +tttg: c268/289 lr:0.000013 t:20.1s +tttg: c269/289 lr:0.000012 t:20.2s +tttg: c270/289 lr:0.000011 t:20.3s +tttg: c271/289 lr:0.000010 t:20.3s +tttg: c272/289 lr:0.000009 t:20.4s +tttg: c273/289 lr:0.000008 t:20.5s +tttg: c274/289 lr:0.000007 t:20.6s +tttg: c275/289 lr:0.000006 t:20.7s +tttg: c276/289 lr:0.000005 t:20.7s +tttg: c277/289 lr:0.000004 t:20.8s +tttg: c278/289 lr:0.000004 t:20.9s +tttg: c279/289 lr:0.000003 t:21.0s +tttg: c280/289 lr:0.000002 t:21.0s +tttg: c281/289 lr:0.000002 t:21.1s +tttg: c282/289 lr:0.000001 t:21.2s +tttg: c283/289 lr:0.000001 t:21.2s +tttg: c284/289 lr:0.000001 t:21.3s +tttg: c285/289 lr:0.000000 t:21.4s +tttg: c286/289 lr:0.000000 t:21.5s +tttg: c287/289 lr:0.000000 t:21.5s +tttg: c288/289 lr:0.000000 t:21.6s +ttpr: phase:3/3 t:468.1s +ttp: b731/782 bl:2.3393 bb:1.0433 rl:2.2933 rb:1.0678 dl:2377-2414 gd:1 +ttp: b723/782 bl:2.2926 bb:1.0292 rl:2.2933 rb:1.0653 
dl:2185-2203 gd:1 +ttp: b716/782 bl:2.2477 bb:1.0386 rl:2.2907 rb:1.0637 dl:2054-2069 gd:1 +ttp: b705/782 bl:2.3600 bb:1.0608 rl:2.2941 rb:1.0636 dl:1885-1898 gd:1 +ttp: b701/782 bl:2.3053 bb:1.0336 rl:2.2947 rb:1.0622 dl:1835-1847 gd:1 +ttp: b689/782 bl:2.3828 bb:1.0728 rl:2.2983 rb:1.0626 dl:1706-1715 gd:1 +ttp: b685/782 bl:2.2948 bb:1.0270 rl:2.2981 rb:1.0612 dl:1665-1675 gd:1 +ttp: b678/782 bl:2.3419 bb:1.0251 rl:2.2997 rb:1.0598 dl:1601-1610 gd:1 +ttp: b666/782 bl:2.4056 bb:1.0618 rl:2.3032 rb:1.0599 dl:1507-1514 gd:1 +ttp: b659/782 bl:2.3017 bb:1.0387 rl:2.3031 rb:1.0592 dl:1459-1466 gd:1 +ttp: b651/782 bl:2.3859 bb:1.0427 rl:2.3055 rb:1.0587 dl:1406-1411 gd:1 +ttp: b642/782 bl:2.3170 bb:1.0374 rl:2.3058 rb:1.0582 dl:1349-1356 gd:1 +ttp: b633/782 bl:2.2716 bb:1.0207 rl:2.3049 rb:1.0572 dl:1297-1302 gd:1 +ttp: b624/782 bl:2.3487 bb:1.0632 rl:2.3060 rb:1.0573 dl:1249-1255 gd:1 +ttp: b618/782 bl:2.3965 bb:1.0666 rl:2.3080 rb:1.0576 dl:1216-1221 gd:1 +ttp: b610/782 bl:2.2426 bb:1.0028 rl:2.3066 rb:1.0564 dl:1177-1182 gd:1 +ttp: b604/782 bl:2.3734 bb:1.0418 rl:2.3080 rb:1.0561 dl:1150-1154 gd:1 +ttp: b594/782 bl:2.3320 bb:1.0646 rl:2.3084 rb:1.0562 dl:1107-1110 gd:1 +ttp: b586/782 bl:2.2510 bb:1.0293 rl:2.3074 rb:1.0557 dl:1073-1076 gd:1 +ttp: b579/782 bl:2.3396 bb:1.0341 rl:2.3079 rb:1.0553 dl:1044-1048 gd:1 +ttp: b575/782 bl:2.2811 bb:1.0381 rl:2.3075 rb:1.0550 dl:1029-1033 gd:1 +ttp: b568/782 bl:2.3537 bb:1.0805 rl:2.3082 rb:1.0555 dl:1004-1007 gd:1 +ttp: b561/782 bl:2.2398 bb:1.0103 rl:2.3072 rb:1.0547 dl:979-983 gd:1 +ttp: b551/782 bl:2.3273 bb:1.0518 rl:2.3075 rb:1.0547 dl:946-949 gd:1 +ttp: b545/782 bl:2.3282 bb:1.0295 rl:2.3078 rb:1.0543 dl:927-930 gd:1 +ttp: b536/782 bl:2.3104 bb:1.0404 rl:2.3078 rb:1.0541 dl:899-902 gd:1 +ttp: b515/782 bl:2.3412 bb:1.0425 rl:2.3082 rb:1.0540 dl:838-841 gd:1 +ttp: b508/782 bl:2.3835 bb:1.0479 rl:2.3091 rb:1.0539 dl:817-820 gd:1 +ttp: b501/782 bl:2.3728 bb:1.0483 rl:2.3099 rb:1.0538 dl:799-802 gd:1 +ttp: b493/782 bl:2.3531 
bb:1.0387 rl:2.3104 rb:1.0537 dl:778-780 gd:1 +ttp: b485/782 bl:2.2904 bb:1.0318 rl:2.3102 rb:1.0534 dl:759-761 gd:1 +ttp: b477/782 bl:2.3951 bb:1.0315 rl:2.3111 rb:1.0532 dl:740-742 gd:1 +ttp: b469/782 bl:2.3207 bb:1.0206 rl:2.3112 rb:1.0528 dl:721-724 gd:1 +ttp: b463/782 bl:2.3068 bb:1.0380 rl:2.3111 rb:1.0527 dl:708-710 gd:1 +ttp: b457/782 bl:2.2485 bb:1.0294 rl:2.3105 rb:1.0525 dl:695-697 gd:1 +ttp: b451/782 bl:2.3973 bb:1.0847 rl:2.3113 rb:1.0528 dl:682-685 gd:1 +ttp: b444/782 bl:2.3013 bb:1.0603 rl:2.3112 rb:1.0528 dl:668-670 gd:1 +ttp: b438/782 bl:2.3048 bb:1.0518 rl:2.3112 rb:1.0528 dl:655-657 gd:1 +ttp: b429/782 bl:2.2382 bb:1.0209 rl:2.3106 rb:1.0525 dl:638-640 gd:1 +ttp: b421/782 bl:2.2867 bb:1.0012 rl:2.3104 rb:1.0521 dl:622-624 gd:1 +ttp: b414/782 bl:2.2026 bb:1.0085 rl:2.3095 rb:1.0518 dl:609-611 gd:1 +ttp: b406/782 bl:2.3010 bb:1.0596 rl:2.3094 rb:1.0518 dl:593-595 gd:1 +ttp: b398/782 bl:2.2357 bb:0.9984 rl:2.3089 rb:1.0514 dl:579-581 gd:1 +ttp: b390/782 bl:2.3446 bb:1.0563 rl:2.3091 rb:1.0515 dl:564-566 gd:1 +ttp: b381/782 bl:2.4191 bb:1.0996 rl:2.3099 rb:1.0518 dl:549-550 gd:1 +ttp: b374/782 bl:2.2883 bb:1.0316 rl:2.3098 rb:1.0516 dl:537-538 gd:1 +ttp: b366/782 bl:2.3354 bb:1.0699 rl:2.3099 rb:1.0518 dl:524-525 gd:1 +ttp: b358/782 bl:2.3992 bb:1.0767 rl:2.3105 rb:1.0519 dl:510-512 gd:1 +ttp: b350/782 bl:2.3182 bb:1.0535 rl:2.3105 rb:1.0519 dl:497-498 gd:1 +ttp: b342/782 bl:2.3683 bb:1.1204 rl:2.3109 rb:1.0523 dl:485-486 gd:1 +ttp: b334/782 bl:2.3688 bb:1.0648 rl:2.3112 rb:1.0524 dl:472-474 gd:1 +ttp: b326/782 bl:2.2961 bb:1.0515 rl:2.3111 rb:1.0524 dl:461-462 gd:1 +ttp: b318/782 bl:2.3368 bb:1.0679 rl:2.3113 rb:1.0525 dl:448-450 gd:1 +ttp: b310/782 bl:2.2858 bb:1.0958 rl:2.3111 rb:1.0527 dl:437-438 gd:1 +ttp: b302/782 bl:2.2940 bb:1.0550 rl:2.3111 rb:1.0527 dl:424-426 gd:1 +ttp: b294/782 bl:2.3079 bb:1.0781 rl:2.3110 rb:1.0528 dl:412-414 gd:1 +ttp: b286/782 bl:2.3680 bb:1.1046 rl:2.3113 rb:1.0531 dl:400-402 gd:1 +ttp: b278/782 bl:2.2520 bb:1.0549 
rl:2.3110 rb:1.0531 dl:389-391 gd:1 +ttp: b270/782 bl:2.3069 bb:1.0555 rl:2.3110 rb:1.0531 dl:379-380 gd:1 +ttp: b262/782 bl:2.4400 bb:1.1415 rl:2.3116 rb:1.0535 dl:369-370 gd:1 +ttp: b254/782 bl:2.3487 bb:1.1134 rl:2.3117 rb:1.0537 dl:358-360 gd:1 +ttp: b246/782 bl:2.3436 bb:1.0954 rl:2.3119 rb:1.0539 dl:349-350 gd:1 +ttp: b238/782 bl:2.3102 bb:1.1018 rl:2.3119 rb:1.0540 dl:338-340 gd:1 +ttp: b230/782 bl:2.4474 bb:1.1484 rl:2.3124 rb:1.0544 dl:329-330 gd:1 +ttp: b222/782 bl:2.3724 bb:1.1089 rl:2.3126 rb:1.0546 dl:320-321 gd:1 +ttp: b214/782 bl:2.3336 bb:1.1167 rl:2.3127 rb:1.0548 dl:310-312 gd:1 +ttp: b207/782 bl:2.3363 bb:1.1228 rl:2.3127 rb:1.0550 dl:303-304 gd:1 +ttp: b197/782 bl:2.3396 bb:1.1060 rl:2.3128 rb:1.0552 dl:292-294 gd:1 +ttp: b189/782 bl:2.4077 bb:1.1360 rl:2.3131 rb:1.0554 dl:283-284 gd:1 +ttp: b181/782 bl:2.3302 bb:1.1252 rl:2.3132 rb:1.0556 dl:275-276 gd:1 +ttp: b172/782 bl:2.5140 bb:1.1526 rl:2.3138 rb:1.0559 dl:266-267 gd:1 +ttp: b165/782 bl:2.3348 bb:1.1088 rl:2.3138 rb:1.0561 dl:260-260 gd:1 +ttp: b158/782 bl:2.3356 bb:1.1043 rl:2.3139 rb:1.0562 dl:253-254 gd:1 +ttp: b150/782 bl:2.3307 bb:1.1067 rl:2.3140 rb:1.0563 dl:245-246 gd:1 +ttp: b141/782 bl:2.4556 bb:1.1206 rl:2.3143 rb:1.0565 dl:236-237 gd:1 +ttp: b133/782 bl:2.3595 bb:1.1318 rl:2.3144 rb:1.0567 dl:229-230 gd:1 +ttp: b124/782 bl:2.3657 bb:1.1556 rl:2.3146 rb:1.0569 dl:220-222 gd:1 +ttp: b117/782 bl:2.4743 bb:1.2023 rl:2.3149 rb:1.0572 dl:214-215 gd:1 +ttp: b110/782 bl:2.3620 bb:1.1210 rl:2.3151 rb:1.0574 dl:208-208 gd:1 +ttp: b103/782 bl:2.4338 bb:1.1715 rl:2.3153 rb:1.0576 dl:202-202 gd:1 +ttp: b94/782 bl:2.5516 bb:1.2057 rl:2.3158 rb:1.0579 dl:193-194 gd:1 +ttp: b85/782 bl:2.5024 bb:1.1984 rl:2.3162 rb:1.0582 dl:185-186 gd:1 +ttp: b77/782 bl:2.5039 bb:1.2299 rl:2.3166 rb:1.0585 dl:178-179 gd:1 +ttp: b69/782 bl:2.4660 bb:1.2037 rl:2.3168 rb:1.0588 dl:171-172 gd:1 +ttp: b61/782 bl:2.4397 bb:1.2076 rl:2.3171 rb:1.0590 dl:164-165 gd:1 +ttp: b53/782 bl:2.5003 bb:1.1914 rl:2.3174 
rb:1.0592 dl:156-157 gd:1 +ttp: b45/782 bl:2.4477 bb:1.1712 rl:2.3176 rb:1.0594 dl:148-149 gd:1 +ttp: b37/782 bl:2.5714 bb:1.2120 rl:2.3180 rb:1.0596 dl:140-141 gd:1 +ttp: b29/782 bl:2.6203 bb:1.2122 rl:2.3184 rb:1.0598 dl:132-133 gd:1 +ttp: b21/782 bl:2.6003 bb:1.2267 rl:2.3188 rb:1.0600 dl:123-124 gd:1 +ttp: b13/782 bl:2.6787 bb:1.2136 rl:2.3192 rb:1.0602 dl:112-114 gd:1 +ttp: b6/782 bl:2.7123 bb:1.2093 rl:2.3196 rb:1.0604 dl:99-101 gd:1 +quantized_ttt_phased val_loss:2.31451121 val_bpb:1.05764263 eval_time:575915ms +total_eval_time:575.9s