Dear Fairchem team,
Thank you so much for developing the UMA models and for their great performance.
I have some questions about element references across different DFT packages and settings when fine-tuning uma-s-1p1 on the "odac" task.
DFT dataset: CP2K calculations of MOFs; ~3000 DFT points in total, computed with the PBE functional and the DFT-D3(BJ) dispersion correction.
Model to fine-tune: uma-s-1p1, odac task.
The fine-tune YAML and the data YAML were generated with the command below:
```
python create_uma_finetune_dataset.py --train-dir ./cp2k_db/pbe_d3/train --val-dir ./cp2k_dataset/cp2k_db/pbe_d3/val --uma-task odac --regression-tasks efs --base-model uma-s-1p1 --output-dir uma_omat_ft_efs --num-workers 16
```
The elem_refs were then generated automatically in the data/uma_conserving_data_task_energy_force_stress.yaml file.

1. Looking at element_references.py (https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/modules/normalization/element_references.py), the generated elem_refs should come from a linear least-squares fit via torch.linalg.lstsq, so I should not need to compute the atomic reference energy in vacuum with CP2K for each element in my dataset. Do I understand this correctly?
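To check my understanding, here is a minimal sketch of what I think the fit does (toy compositions and energies, not the actual fairchem implementation):

```python
import torch

# Toy composition matrix: one row per structure, one column per element
# (e.g. counts of C, H, O in each structure). All numbers are made up.
X = torch.tensor([[4.0,  8.0, 2.0],
                  [6.0,  6.0, 1.0],
                  [2.0, 10.0, 4.0]], dtype=torch.float64)
E = torch.tensor([-120.3, -131.9, -98.4], dtype=torch.float64)  # toy DFT energies

# Per-element reference energies from a linear least-squares fit,
# which is what I understand element_references.py to do with torch.linalg.lstsq.
refs = torch.linalg.lstsq(X, E.unsqueeze(1)).solution.squeeze(1)

# The referenced energy that the model would actually be trained on.
residual = E - X @ refs
print(refs, residual)
```

So, if this is right, no vacuum atomic calculations would be needed; the references are purely a statistical fit to my dataset.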
2. From the OC22 paper (https://arxiv.org/pdf/2206.08917), it makes sense that the normalized E_i,ML is the training target. What I do not understand is why the UMA paper (https://arxiv.org/pdf/2506.23971) adds the heat of formation (HOF) and uses it as the reference energy. Why is the HOF needed?
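To state my reading of the two schemes explicitly (my interpretation, not a quote from either paper):

$$\hat{E} = E_{\mathrm{DFT}} - \sum_{e} N_e\,\varepsilon_e$$

In OC22 the per-element references $\varepsilon_e$ are fit by linear regression, so $\hat{E}$ is just a centered regression target; my reading of the UMA paper is that the references are instead chosen so that $\hat{E}$ behaves like a heat of formation. Is the motivation to put the different tasks and datasets on one comparable energy scale, or is there another reason?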
3. I also have a question about elem_refs fitted from a single system in different configurations. If every structure has the same composition xA + yB + zC, the composition matrix never changes while the energies do, so what do the fitted elem_refs actually mean? In that case, how can we trust that the elem_refs give a good reference for the potential energy surface?
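For concreteness, here is a toy example of the degenerate case I mean (made-up numbers):

```python
import torch

# Four configurations of the SAME composition xA + yB + zC, so every row
# of the composition matrix is identical and the fit is rank-deficient.
X = torch.tensor([[2.0, 4.0, 1.0]] * 4, dtype=torch.float64)
E = torch.tensor([-100.2, -99.7, -101.1, -100.4], dtype=torch.float64)

refs = torch.linalg.lstsq(X, E.unsqueeze(1)).solution.squeeze(1)

# Only the mean energy can be absorbed by the references; the split across
# A, B, C is arbitrary (lstsq returns one minimum-norm solution), and the
# configuration-to-configuration variation all stays in the residual target.
print(refs)
print(E - X @ refs)
```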
I also have questions about elem_refs in an iterative fine-tuning workflow. I need to run multiple fine-tuning loops: the MOF structures stay the same, but each loop adds new configurations with energies, forces, and stresses (the model fine-tuned in the first loop is used to generate configurations for the second loop).
- Across the fine-tuning loops, should I always keep the elem_refs and normalizer_rmsd fitted on the first loop's dataset, or should I refit them on each loop's dataset?
- From my understanding of fine-tuning UMA models, every run retains the backbone from the checkpoint and loads freshly initialized heads via this setting:
```yaml
model:
  _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
  checkpoint_location:
    _target_: fairchem.core.calculate.pretrained_mlip.pretrained_checkpoint_path_from_name
    model_name: ${base_model_name}
  overrides:
    backbone:
      otf_graph: true
      max_neighbors: ${max_neighbors}
      regress_stress: ${data.regress_stress}
      always_use_pbc: true
    pass_through_head_outputs: ${data.pass_through_head_outputs}
  heads: ${data.heads}
```
In my case, since I am fine-tuning on the same MOFs over several loops, I think I can retain both the backbone and the heads from the previously fine-tuned model, right? I ran some tests loading the backbone and the heads from the checkpoint_path, but it did not work. How can I make it work?
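Roughly, this is what I tried (the checkpoint path is a placeholder for my first-loop output, and this is only my attempt, not something from the docs):

```yaml
model:
  _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
  # point at the checkpoint written by the previous fine-tuning loop
  # (placeholder path)
  checkpoint_location: /path/to/loop1_run/checkpoint.pt
  overrides:
    backbone:
      otf_graph: true
      max_neighbors: ${max_neighbors}
      regress_stress: ${data.regress_stress}
      always_use_pbc: true
    pass_through_head_outputs: ${data.pass_through_head_outputs}
  # no `heads:` override here, hoping the trained heads stored in the
  # checkpoint are kept instead of being re-initialized from scratch
```

Is there a supported way to continue from a fine-tuned checkpoint together with its trained heads?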
Below are my first-loop fine-tuning configs, uma_sm_finetune_template.yaml and uma_conserving_data_task_energy_force_stress.yaml. First, the template:
```yaml
defaults:
  - data: uma_conserving_data_task_energy_force_stress
  - _self_

job:
  device_type: CUDA
  scheduler:
    mode: LOCAL
    ranks_per_node: 1
    num_nodes: 1
  debug: false
  run_dir: /home/juhuang/scratch/uma_finetune/loop_fine_tune/uma_finetune_runs/
  run_name: uma_odac_ft_e1f100s10
  logger:
    _target_: fairchem.core.common.logger.WandBSingletonLogger.init_wandb
    _partial_: true
    entity: xxx
    project: uma_finetune

base_model_name: uma-s-1p1
max_neighbors: 300
epochs: 200
steps: null
batch_size: 4
lr: 1e-4
weight_decay: 1e-3
evaluate_every_n_steps: 200
checkpoint_every_n_steps: 5000

train_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    odac: ${data.train_dataset}
  combined_dataset_config:
    sampling:
      type: temperature
      temperature: 1.0

train_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${train_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.common.data_parallel.BalancedBatchSampler
    _partial_: true
    batch_size: ${batch_size}
    shuffle: true
    seed: 0
  num_workers: 0
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${data.tasks_list}

val_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    odac: ${data.val_dataset}
  combined_dataset_config:
    sampling:
      type: temperature
      temperature: 1.0

eval_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${val_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.common.data_parallel.BalancedBatchSampler
    _partial_: true
    batch_size: ${batch_size}
    shuffle: false
    seed: 0
  num_workers: 0
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${data.tasks_list}

runner:
  _target_: fairchem.core.components.train.train_runner.TrainEvalRunner
  train_dataloader: ${train_dataloader}
  eval_dataloader: ${eval_dataloader}
  train_eval_unit:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.MLIPTrainEvalUnit
    job_config: ${job}
    tasks: ${data.tasks_list}
    model:
      _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
      checkpoint_location:
        _target_: fairchem.core.calculate.pretrained_mlip.pretrained_checkpoint_path_from_name
        model_name: ${base_model_name}
      overrides:
        backbone:
          otf_graph: true
          max_neighbors: ${max_neighbors}
          regress_stress: ${data.regress_stress}
          always_use_pbc: true
        pass_through_head_outputs: ${data.pass_through_head_outputs}
      heads: ${data.heads}
    optimizer_fn:
      _target_: torch.optim.AdamW
      _partial_: true
      lr: ${lr}
      weight_decay: ${weight_decay}
    cosine_lr_scheduler_fn:
      _target_: fairchem.core.units.mlip_unit.mlip_unit._get_consine_lr_scheduler
      _partial_: true
      warmup_factor: 0.2
      warmup_epochs: 10
      lr_min_factor: 0.01
      epochs: ${epochs}
      steps: ${steps}
    print_every: 100
    clip_grad_norm: 100
  max_epochs: ${epochs}
  max_steps: ${steps}
  evaluate_every_n_steps: ${evaluate_every_n_steps}
  callbacks:
    - _target_: fairchem.core.components.train.train_runner.TrainCheckpointCallback
      checkpoint_every_n_steps: ${checkpoint_every_n_steps}
      max_saved_checkpoints: 10
    - _target_: torchtnt.framework.callbacks.TQDMProgressBar
```
And uma_conserving_data_task_energy_force_stress.yaml:

```yaml
dataset_name: odac
elem_refs:
- 1.055359684407542e-11
- -16.286047526750394
- 1.2975363006262342e-09
- -3.0382807381101884e-10
- 1.4938450476620346e-10
- -80.63138882017302
- -154.77571475099126
- -270.5314486747352
- -436.15079857909586
- -659.7657404948458
- 4.0472514228895307e-11
- -5.1386450650170445e-11
- -28.362243215582566
- -61.3421829097735
- -111.81805406684589
- -181.1878420805824
- -277.84211909687383
- -408.1613777774754
- 9.094947017729282e-13
- 3.069544618483633e-12
- -2.2737367544323206e-13
- -1277.558359578204
- 2.7284841053187847e-12
- -1952.3561835373582
- -2.2737367544323206e-13
- -2834.992497942113
- -3363.2049478315917
- -3958.723179217824
- -4607.824100742575
- -1309.8793138371884
- -1648.9706171805071
- 5.684341886080801e-13
- 0.0
- 9.094947017729282e-13
- -254.7169766514607
- -364.9047665022991
- -9.094947017729282e-13
- 1.1368683772161603e-13
- 2.2737367544323206e-13
- -1048.316399185378
- -1287.9145342616596
- 0.0
- -1859.2051422421227
- 0.0
- -2575.970152842453
- 0.0
- 0.0
- -1007.7889691433224
- -1255.0341831041471
- -1535.3639606620357
- 0.0
- 0.0
- 0.0
- -312.36599574730445
- 0.0
- 0.0
- 0.0
- -869.9256886932654
- 0.0
- 0.0
- -1545.6479435678816
- 0.0
- -2167.9535175954616
- -2535.4579680737893
- -2946.0212595125977
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- -7158.497128005289
- 0.0
- 0.0
- -1860.6112981993438
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- -99.5617630258107
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
normalizer_rmsd: 0.7443373240113296
train_dataset:
  splits:
    train:
      src: /home/juhuang/scratch/uma_finetune/loop_fine_tune/uma_odac_ft_e1f100s10/train
  format: ase_db
  transforms:
    common_transform:
      dataset_name: ${data.dataset_name}
    stress_reshape_transform:
      dataset_name: ${data.dataset_name}
val_dataset:
  splits:
    val:
      src: /home/juhuang/scratch/uma_finetune/loop_fine_tune/uma_odac_ft_e1f100s10/val
  format: ase_db
  transforms:
    common_transform:
      dataset_name: ${data.dataset_name}
    stress_reshape_transform:
      dataset_name: ${data.dataset_name}
regress_stress: true
pass_through_head_outputs: true
heads:
  efs:
    module: fairchem.core.models.uma.escn_md.MLP_EFS_Head
tasks_list:
  - _target_: fairchem.core.units.mlip_unit.mlip_unit.Task
    name: energy
    level: system
    property: energy
    loss_fn:
      _target_: fairchem.core.modules.loss.DDPMTLoss
      loss_fn:
        _target_: fairchem.core.modules.loss.PerAtomMAELoss
      coefficient: 1
    out_spec:
      dim:
        - 1
      dtype: float32
    normalizer:
      _target_: fairchem.core.modules.normalization.normalizer.Normalizer
      mean: 0.0
      rmsd: ${data.normalizer_rmsd}
    element_references:
      _target_: fairchem.core.modules.normalization.element_references.ElementReferences
      element_references:
        _target_: torch.DoubleTensor
        _args_:
          - ${data.elem_refs}
    datasets:
      - ${data.dataset_name}
    metrics:
      - mae
      - per_atom_mae
  - _target_: fairchem.core.units.mlip_unit.mlip_unit.Task
    name: forces
    level: atom
    property: forces
    train_on_free_atoms: true
    eval_on_free_atoms: true
    loss_fn:
      _target_: fairchem.core.modules.loss.DDPMTLoss
      loss_fn:
        _target_: fairchem.core.modules.loss.L2NormLoss
        reduction: mean
      coefficient: 100
    out_spec:
      dim:
        - 3
      dtype: float32
    normalizer:
      _target_: fairchem.core.modules.normalization.normalizer.Normalizer
      mean: 0.0
      rmsd: ${data.normalizer_rmsd}
    datasets:
      - ${data.dataset_name}
    metrics:
      - mae
      - cosine_similarity
      - magnitude_error
  - _target_: fairchem.core.units.mlip_unit.mlip_unit.Task
    name: stress
    level: system
    property: stress
    loss_fn:
      _target_: fairchem.core.modules.loss.DDPMTLoss
      loss_fn:
        _target_: fairchem.core.modules.loss.MAELoss
        reduction: mean
      coefficient: 10
    out_spec:
      dim:
        - 1
        - 9
      dtype: float32
    normalizer:
      _target_: fairchem.core.modules.normalization.normalizer.Normalizer
      mean: 0.0
      rmsd: ${data.normalizer_rmsd}
    datasets:
      - ${data.dataset_name}
    metrics:
      - mae
```
I would really appreciate professional feedback from the developer team.
Best,
Ju