* fix: add safeguards during data processing
Signed-off-by: Oleg S <[email protected]>
* fix: add a safeguard for max_batch_len & max_seq_len in training
Some training arguments need to be validated against each other, but we currently have no logic to enforce this. This commit adds a pre-training check that errors out if max_batch_len is smaller than max_seq_len, since that breaks our ability to generate training batches.
Signed-off-by: Oleg S <[email protected]>
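A minimal sketch of the kind of pre-training check described above; the function name `validate_training_args` and the exact `args` attribute layout are assumptions based on the commit message, not the project's actual API:

```python
# Hypothetical safeguard: only the max_batch_len >= max_seq_len relationship
# comes from the commit message; everything else here is illustrative.
def validate_training_args(args) -> None:
    if args.max_batch_len < args.max_seq_len:
        raise ValueError(
            f"max_batch_len ({args.max_batch_len}) must be >= max_seq_len "
            f"({args.max_seq_len}); otherwise a single maximum-length sample "
            f"cannot fit into any training batch."
        )
```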
* fix: add fallback logic to use the distributed sampler
The multipack sampler requires the dataset to have a certain shape relative to the number of GPUs in order to distribute all of the samples across the different nodes. When that requirement is not met, the train loader becomes empty, which prevents us from training at all.
This commit resolves the issue by falling back to the distributed sampler when the multipack
sampler fails.
Signed-off-by: Oleg S <[email protected]>
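A rough sketch of that fallback path, assuming a PyTorch `DataLoader`; `build_multipack_sampler` is a hypothetical stand-in for whatever constructs the multipack batch sampler, and only `DataLoader`/`DistributedSampler` are real torch APIs here:

```python
from torch.utils.data import DataLoader, DistributedSampler

def build_train_loader(dataset, build_multipack_sampler, collate_fn, batch_size):
    # Try the multipack batch sampler first.
    batch_sampler = build_multipack_sampler(dataset)
    loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn)
    if len(loader) == 0:
        # Multipack could not distribute the samples across the ranks, so the
        # loader is empty; fall back to the plain distributed sampler.
        # (DistributedSampler assumes the process group is already initialized.)
        sampler = DistributedSampler(dataset)
        loader = DataLoader(
            dataset,
            sampler=sampler,
            batch_size=batch_size,
            collate_fn=collate_fn,
        )
    return loader
```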
---------
Signed-off-by: Oleg S <[email protected]>
f"\033[36mat {args.max_seq_len} max sequence length, the number of samples to be dropped is {num_dropped_samples}\033[0m"
231
242
)
232
243
print(f"\033[36m({((num_dropped_samples/len(lens)) *100):.2f}% of total)\033[0m")
244
+
ifnum_dropped_samples==len(data):
245
+
raiseRuntimeError(
246
+
f"Dataset does not contain any samples containing less than {args.max_seq_len=} tokens.\nPlease consider increasing your `max_seq_len` value, or adding more samples."