
Element references for fine-tuning UMA on the odac task using CP2K calculations #1937

@JujuHuang

Description


Dear Fairchem team,

Thanks so much for developing the UMA models and continually improving their performance.
I have some questions about element references across different DFT packages and settings when fine-tuning uma-s-1p1 on the "odac" task.
DFT dataset: CP2K calculations of MOFs; in total I have ~3000 DFT points with the PBE functional and DFT-D3(BJ) correction.
Fine-tune target: uma-s-1p1, odac task.

The fine-tune yaml and the data yaml file were generated by the command below:

python create_uma_finetune_dataset.py --train-dir ./cp2k_db/pbe_d3/train --val-dir ./cp2k_dataset/cp2k_db/pbe_d3/val --uma-task odac --regression-tasks efs --base-model uma-s-1p1 --output-dir uma_omat_ft_efs --num-workers 16

The elem_refs were then generated automatically in data/uma_conserving_data_task_energy_force_stress.yaml.

1. Looking at the element_references.py code (https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/modules/normalization/element_references.py), the generated elem_refs come from a linear least-squares regression via torch.linalg.lstsq, so I should not need to compute an in-vacuum atomic reference energy with CP2K for each element in my dataset. Do I understand this correctly?
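To make sure I understand the fitting step, here is a minimal sketch of what I believe the regression does (my own toy reproduction with invented compositions and energies, not the actual fairchem implementation):

```python
import torch

# Toy dataset: each row counts the atoms of each element (columns) in one
# structure. The three columns stand for, e.g., H, C, O; values are invented.
composition = torch.tensor([
    [2.0, 1.0, 1.0],   # structure 1
    [4.0, 2.0, 2.0],   # structure 2
    [2.0, 3.0, 1.0],   # structure 3
    [6.0, 1.0, 4.0],   # structure 4
], dtype=torch.float64)

# Total DFT energies of the four structures (invented numbers).
energies = torch.tensor([-40.0, -80.0, -70.0, -110.0], dtype=torch.float64)

# Linear least-squares fit: energies ≈ composition @ refs,
# giving one per-element reference energy per column.
refs = torch.linalg.lstsq(composition, energies.unsqueeze(1)).solution.squeeze(1)

# The referenced (residual) energies are what remains after subtracting
# the linear composition baseline.
residual = energies - composition @ refs
print(refs)      # fitted per-element reference energies
print(residual)  # zero here because the toy data are exactly linear
```

So, if this is right, the references are purely a dataset-level regression and no per-element vacuum calculations are required.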
2. Reading the OC22 paper (https://arxiv.org/pdf/2206.08917), it makes sense that the normalized E_i,ML is the training target. I do not understand why, in the UMA paper (https://arxiv.org/pdf/2506.23971), the heat of formation (HOF) is added to form the reference energy. Why do we need to add the HOF?
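For reference, my understanding of the OC22-style normalized target (matching the Normalizer with mean 0.0 and rmsd in the data yaml below) is roughly the following; this is my own sketch with invented numbers, not the fairchem code:

```python
import torch

# One invented structure: atom counts per element column (e.g. H, C, O).
composition = torch.tensor([2.0, 1.0, 1.0], dtype=torch.float64)
# Fitted per-element reference energies (invented example values).
elem_refs = torch.tensor([-2.5, -15.0, -20.0], dtype=torch.float64)

e_dft = -41.2   # total DFT energy of the structure (made up)
rmsd = 0.744    # like normalizer_rmsd in the data yaml (example value)

# Normalized training target: subtract the linear element-reference baseline,
# then divide by the dataset RMSD (mean is 0 in the config).
target = (e_dft - composition @ elem_refs) / rmsd
print(target)
```

My question is where the HOF term fits into this picture in the UMA setup.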
3. I also have a question about elem_refs fitted from a single system with many different configurations. If every structure has the same composition xA + yB + zC, the number and identity of the elements never change while the energies do. In that case, where do the elem_refs come from, and how can we trust that they give a good reference for the potential energy surface?
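To make question 3 concrete, here is a toy demonstration (invented numbers): when every structure shares the same composition, the design matrix has rank 1, so the fit can only absorb the mean energy, and the split of that mean among the elements is arbitrary:

```python
import torch

# Every structure has the same composition xA + yB + zC (here 2A + 1B + 3C),
# so all rows of the design matrix are identical (rank 1).
row = torch.tensor([2.0, 1.0, 3.0], dtype=torch.float64)
composition = row.repeat(5, 1)

# Energies vary between configurations (invented values).
energies = torch.tensor([-50.0, -52.0, -49.5, -51.0, -50.5], dtype=torch.float64)

refs = torch.linalg.lstsq(composition, energies.unsqueeze(1)).solution.squeeze(1)

# The fitted baseline is the same for every configuration and equals the
# mean energy; the individual A/B/C references are not uniquely determined.
print(composition @ refs)   # five identical values
print(energies.mean())
```

So with a fixed composition the fit is degenerate, which is exactly why I am unsure how much to trust the resulting per-element references.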

I also have questions about elem_refs during the fine-tuning process itself. I need to run multiple fine-tuning loops: the MOF structures are the same, but the configurations (with their energies, forces, and stresses) differ, because the model fine-tuned in one loop is used to generate configurations for the next loop.

  1. Across these loops, should I always keep the elem_refs and normalizer_rmsd computed from the first loop's dataset, or should I recompute them for each loop's dataset?
  2. My understanding of fine-tuning UMA models is that each run retains the backbone from the checkpoint and loads freshly initialized heads via:
    model:
      _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
      checkpoint_location:
        _target_: fairchem.core.calculate.pretrained_mlip.pretrained_checkpoint_path_from_name
        model_name: ${base_model_name}
      overrides:
        backbone:
          otf_graph: true
          max_neighbors: ${max_neighbors}
          regress_stress: ${data.regress_stress}
          always_use_pbc: true
        pass_through_head_outputs: ${data.pass_through_head_outputs}
      heads: ${data.heads}

In my case, since I am fine-tuning the same MOFs over several loops, I think I can retain both the backbone and the heads from the previously fine-tuned model. I tested loading the backbone and heads from its checkpoint_path, but it does not work. How can I make it work?
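For concreteness, what I attempted looks roughly like the fragment below (the path is a placeholder; I am not sure this is the supported way to point initialize_finetuning_model at a previous fine-tuned checkpoint instead of a pretrained model name):

```yaml
model:
  _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
  # Placeholder path to the final checkpoint from the previous loop,
  # replacing pretrained_checkpoint_path_from_name.
  checkpoint_location: /path/to/previous_loop/checkpoint.pt
  overrides:
    backbone:
      otf_graph: true
      max_neighbors: ${max_neighbors}
      regress_stress: ${data.regress_stress}
      always_use_pbc: true
    pass_through_head_outputs: ${data.pass_through_head_outputs}
  heads: ${data.heads}
```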

Below are my first-loop uma_sm_finetune_template.yaml and uma_conserving_data_task_energy_force_stress.yaml:

defaults:
- data: uma_conserving_data_task_energy_force_stress
- _self_
job:
  device_type: CUDA
  scheduler:
    mode: LOCAL
    ranks_per_node: 1
    num_nodes: 1
  debug: false
  run_dir: /home/juhuang/scratch/uma_finetune/loop_fine_tune/uma_finetune_runs/
  run_name: uma_odac_ft_e1f100s10
  logger:
    _target_: fairchem.core.common.logger.WandBSingletonLogger.init_wandb
    _partial_: true
    entity: xxx
    project: uma_finetune
base_model_name: uma-s-1p1
max_neighbors: 300
epochs: 200
steps: null
batch_size: 4
lr: 1e-4
weight_decay: 1e-3
evaluate_every_n_steps: 200
checkpoint_every_n_steps: 5000
train_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    odac: ${data.train_dataset}
  combined_dataset_config:
    sampling:
      type: temperature
      temperature: 1.0
train_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${train_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.common.data_parallel.BalancedBatchSampler
    _partial_: true
    batch_size: ${batch_size}
    shuffle: true
    seed: 0
  num_workers: 0
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${data.tasks_list}
val_dataset:
  _target_: fairchem.core.datasets.mt_concat_dataset.create_concat_dataset
  dataset_configs:
    odac: ${data.val_dataset}
  combined_dataset_config:
    sampling:
      type: temperature
      temperature: 1.0
eval_dataloader:
  _target_: fairchem.core.components.common.dataloader_builder.get_dataloader
  dataset: ${val_dataset}
  batch_sampler_fn:
    _target_: fairchem.core.common.data_parallel.BalancedBatchSampler
    _partial_: true
    batch_size: ${batch_size}
    shuffle: false
    seed: 0
  num_workers: 0
  collate_fn:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.mt_collater_adapter
    tasks: ${data.tasks_list}
runner:
  _target_: fairchem.core.components.train.train_runner.TrainEvalRunner
  train_dataloader: ${train_dataloader}
  eval_dataloader: ${eval_dataloader}
  train_eval_unit:
    _target_: fairchem.core.units.mlip_unit.mlip_unit.MLIPTrainEvalUnit
    job_config: ${job}
    tasks: ${data.tasks_list}
    model:
      _target_: fairchem.core.units.mlip_unit.mlip_unit.initialize_finetuning_model
      checkpoint_location:
        _target_: fairchem.core.calculate.pretrained_mlip.pretrained_checkpoint_path_from_name
        model_name: ${base_model_name}
      overrides:
        backbone:
          otf_graph: true
          max_neighbors: ${max_neighbors}
          regress_stress: ${data.regress_stress}
          always_use_pbc: true
        pass_through_head_outputs: ${data.pass_through_head_outputs}
      heads: ${data.heads}
    optimizer_fn:
      _target_: torch.optim.AdamW
      _partial_: true
      lr: ${lr}
      weight_decay: ${weight_decay}
    cosine_lr_scheduler_fn:
      _target_: fairchem.core.units.mlip_unit.mlip_unit._get_consine_lr_scheduler
      _partial_: true
      warmup_factor: 0.2
      warmup_epochs: 10
      lr_min_factor: 0.01
      epochs: ${epochs}
      steps: ${steps}
    print_every: 100
    clip_grad_norm: 100
  max_epochs: ${epochs}
  max_steps: ${steps}
  evaluate_every_n_steps: ${evaluate_every_n_steps}
  callbacks:
  - _target_: fairchem.core.components.train.train_runner.TrainCheckpointCallback
    checkpoint_every_n_steps: ${checkpoint_every_n_steps}
    max_saved_checkpoints: 10
  - _target_: torchtnt.framework.callbacks.TQDMProgressBar
dataset_name: odac
elem_refs:
- 1.055359684407542e-11
- -16.286047526750394
- 1.2975363006262342e-09
- -3.0382807381101884e-10
- 1.4938450476620346e-10
- -80.63138882017302
- -154.77571475099126
- -270.5314486747352
- -436.15079857909586
- -659.7657404948458
- 4.0472514228895307e-11
- -5.1386450650170445e-11
- -28.362243215582566
- -61.3421829097735
- -111.81805406684589
- -181.1878420805824
- -277.84211909687383
- -408.1613777774754
- 9.094947017729282e-13
- 3.069544618483633e-12
- -2.2737367544323206e-13
- -1277.558359578204
- 2.7284841053187847e-12
- -1952.3561835373582
- -2.2737367544323206e-13
- -2834.992497942113
- -3363.2049478315917
- -3958.723179217824
- -4607.824100742575
- -1309.8793138371884
- -1648.9706171805071
- 5.684341886080801e-13
- 0.0
- 9.094947017729282e-13
- -254.7169766514607
- -364.9047665022991
- -9.094947017729282e-13
- 1.1368683772161603e-13
- 2.2737367544323206e-13
- -1048.316399185378
- -1287.9145342616596
- 0.0
- -1859.2051422421227
- 0.0
- -2575.970152842453
- 0.0
- 0.0
- -1007.7889691433224
- -1255.0341831041471
- -1535.3639606620357
- 0.0
- 0.0
- 0.0
- -312.36599574730445
- 0.0
- 0.0
- 0.0
- -869.9256886932654
- 0.0
- 0.0
- -1545.6479435678816
- 0.0
- -2167.9535175954616
- -2535.4579680737893
- -2946.0212595125977
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- -7158.497128005289
- 0.0
- 0.0
- -1860.6112981993438
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- -99.5617630258107
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
- 0.0
normalizer_rmsd: 0.7443373240113296
train_dataset:
  splits:
    train:
      src: /home/juhuang/scratch/uma_finetune/loop_fine_tune/uma_odac_ft_e1f100s10/train
  format: ase_db
  transforms:
    common_transform:
      dataset_name: ${data.dataset_name}
    stress_reshape_transform:
      dataset_name: ${data.dataset_name}
val_dataset:
  splits:
    val:
      src: /home/juhuang/scratch/uma_finetune/loop_fine_tune/uma_odac_ft_e1f100s10/val
  format: ase_db
  transforms:
    common_transform:
      dataset_name: ${data.dataset_name}
    stress_reshape_transform:
      dataset_name: ${data.dataset_name}
regress_stress: true
pass_through_head_outputs: true
heads:
  efs:
    module: fairchem.core.models.uma.escn_md.MLP_EFS_Head
tasks_list:
- _target_: fairchem.core.units.mlip_unit.mlip_unit.Task
  name: energy
  level: system
  property: energy
  loss_fn:
    _target_: fairchem.core.modules.loss.DDPMTLoss
    loss_fn:
      _target_: fairchem.core.modules.loss.PerAtomMAELoss
    coefficient: 1
  out_spec:
    dim:
    - 1
    dtype: float32
  normalizer:
    _target_: fairchem.core.modules.normalization.normalizer.Normalizer
    mean: 0.0
    rmsd: ${data.normalizer_rmsd}
  element_references:
    _target_: fairchem.core.modules.normalization.element_references.ElementReferences
    element_references:
      _target_: torch.DoubleTensor
      _args_:
      - ${data.elem_refs}
  datasets:
  - ${data.dataset_name}
  metrics:
  - mae
  - per_atom_mae
- _target_: fairchem.core.units.mlip_unit.mlip_unit.Task
  name: forces
  level: atom
  property: forces
  train_on_free_atoms: true
  eval_on_free_atoms: true
  loss_fn:
    _target_: fairchem.core.modules.loss.DDPMTLoss
    loss_fn:
      _target_: fairchem.core.modules.loss.L2NormLoss
    reduction: mean
    coefficient: 100
  out_spec:
    dim:
    - 3
    dtype: float32
  normalizer:
    _target_: fairchem.core.modules.normalization.normalizer.Normalizer
    mean: 0.0
    rmsd: ${data.normalizer_rmsd}
  datasets:
  - ${data.dataset_name}
  metrics:
  - mae
  - cosine_similarity
  - magnitude_error
- _target_: fairchem.core.units.mlip_unit.mlip_unit.Task
  name: stress
  level: system
  property: stress
  loss_fn:
    _target_: fairchem.core.modules.loss.DDPMTLoss
    loss_fn:
      _target_: fairchem.core.modules.loss.MAELoss
    reduction: mean
    coefficient: 10
  out_spec:
    dim:
    - 1
    - 9
    dtype: float32
  normalizer:
    _target_: fairchem.core.modules.normalization.normalizer.Normalizer
    mean: 0.0
    rmsd: ${data.normalizer_rmsd}
  datasets:
  - ${data.dataset_name}
  metrics:
  - mae

I would really appreciate professional feedback from the developer team.

Best,
Ju
