System Info
transformers==4.51.3
Python version: 3.11
Who can help?
@zach-huggingface @SunMarc @MekkCyber
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When using SFTTrainer with BitsAndBytes and the TensorBoard integration, the TrainingArguments are serialized to JSON for logging, which fails with:
```
[rank0]: Traceback (most recent call last):
[rank0]: main({'model_name_or_path': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'model_revision': 'main', 'torch_dtype': 'bfloat16', 'attn_implementation': 'flex_attention', 'use_liger': False, 'use_peft': False, 'lora_r': 16, 'lora_alpha': 8, 'lora_dropout': 0.05, 'lora_target_modules': ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'], 'lora_modules_to_save': [], 'load_in_4bit': False, 'load_in_8bit': True, 'dataset_name': 'gsm8k', 'dataset_config': 'main', 'dataset_train_split': 'train', 'dataset_test_split': 'test', 'dataset_text_field': 'text', 'dataset_kwargs': {'add_special_tokens': False, 'append_concat_token': False}, 'max_seq_length': 512, 'dataset_batch_size': 1000, 'packing': False, 'num_train_epochs': 10, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'auto_find_batch_size': False, 'eval_strategy': 'epoch', 'bf16': True, 'tf32': False, 'learning_rate': 0.0002, 'warmup_steps': 10, 'lr_scheduler_type': 'inverse_sqrt', 'optim': 'adamw_torch_fused', 'max_grad_norm': 1.0, 'seed': 42, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': False, 'gradient_checkpointing_kwargs': {'use_reentrant': False}, 'fsdp': 'full_shard auto_wrap', 'fsdp_config': {'activation_checkpointing': True, 'cpu_ram_efficient_loading': False, 'sync_module_states': True, 'use_orig_params': True, 'limit_all_gathers': False}, 'save_strategy': 'epoch', 'save_total_limit': 1, 'resume_from_checkpoint': False, 'log_level': 'info', 'logging_strategy': 'steps', 'logging_steps': 1, 'report_to': ['tensorboard'], 'output_dir': '/mnt/shared/Llama-4-Scout-17B-16E-Instruct'})
[rank0]: File "/tmp/tmp.jsNRcydokN/ephemeral_script.py", line 126, in main
[rank0]: trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/trainer.py", line 2238, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/trainer.py", line 2462, in _inner_training_loop
[rank0]: self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/trainer_callback.py", line 506, in on_train_begin
[rank0]: return self.call_event("on_train_begin", args, state, control)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/trainer_callback.py", line 556, in call_event
[rank0]: result = getattr(callback, event)(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/integrations/integration_utils.py", line 698, in on_train_begin
[rank0]: self.tb_writer.add_text("args", args.to_json_string())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/training_args.py", line 2509, in to_json_string
[rank0]: return json.dumps(self.to_dict(), indent=2)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib64/python3.11/json/__init__.py", line 238, in dumps
[rank0]: **kw).encode(obj)
[rank0]: ^^^^^^^^^^^
[rank0]: File "/usr/lib64/python3.11/json/encoder.py", line 202, in encode
[rank0]: chunks = list(chunks)
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/usr/lib64/python3.11/json/encoder.py", line 432, in _iterencode
[rank0]: yield from _iterencode_dict(o, _current_indent_level)
[rank0]: File "/usr/lib64/python3.11/json/encoder.py", line 406, in _iterencode_dict
[rank0]: yield from chunks
[rank0]: File "/usr/lib64/python3.11/json/encoder.py", line 406, in _iterencode_dict
[rank0]: yield from chunks
[rank0]: File "/usr/lib64/python3.11/json/encoder.py", line 439, in _iterencode
[rank0]: o = _default(o)
[rank0]: ^^^^^^^^^^^
[rank0]: File "/usr/lib64/python3.11/json/encoder.py", line 180, in default
[rank0]: raise TypeError(f'Object of type {o.__class__.__name__} '
[rank0]: TypeError: Object of type BitsAndBytesConfig is not JSON serializable
```
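The failure mode can be reproduced without transformers at all: `json.dumps` raises this exact `TypeError` whenever a dict value is an arbitrary object with no JSON encoding. Below is a minimal sketch using a hypothetical `QuantConfig` class as a stand-in for `BitsAndBytesConfig` (the real config lives in transformers; only the shape of the failure is reproduced here):

```python
import json

class QuantConfig:
    """Stand-in for BitsAndBytesConfig: a plain object json cannot encode."""
    def __init__(self, load_in_8bit=True):
        self.load_in_8bit = load_in_8bit

# TrainingArguments.to_dict() leaves the config object embedded in the dict,
# so json.dumps hits it and falls through to the default encoder, which raises.
args_dict = {"output_dir": "/tmp/out", "quantization_config": QuantConfig()}

try:
    json.dumps(args_dict, indent=2)
except TypeError as e:
    print(e)  # Object of type QuantConfig is not JSON serializable
```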
Expected behavior
The BitsAndBytesConfig should be converted to a dict (it exposes a to_dict() method) before the TrainingArguments are serialized, so that to_json_string() succeeds and the TensorBoard callback can log the arguments.
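One possible shape of the fix, sketched without transformers: pass a `default=` handler to `json.dumps` that falls back to an object's own `to_dict()` when it has one. The `QuantConfig` class below is a hypothetical stand-in for `BitsAndBytesConfig`, not the real config; the point is only that a `to_dict()`-aware encoder makes the dump succeed:

```python
import json

class QuantConfig:
    """Stand-in for BitsAndBytesConfig (hypothetical), exposing to_dict()."""
    def __init__(self, load_in_8bit=True):
        self.load_in_8bit = load_in_8bit

    def to_dict(self):
        return {"load_in_8bit": self.load_in_8bit}

def json_default(obj):
    # Fall back to the object's own dict representation when available;
    # otherwise re-raise the standard TypeError.
    if hasattr(obj, "to_dict"):
        return obj.to_dict()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

args_dict = {"output_dir": "/tmp/out", "quantization_config": QuantConfig()}
print(json.dumps(args_dict, default=json_default, indent=2))
```

With this handler the nested config serializes as a plain JSON object instead of raising, which is the behavior the TensorBoard callback needs.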