Hi @guuuj. Is there a way for you to run this in an interactive session on the HPC cluster and monitor CPU activity while QE runs? My suspicion is that QE somehow only gets one hardware CPU core, and the more ranks you spawn, the slower it gets if you are limited to a single core. Another way to check this is to run QE in solid_dmft on only 1 core and see whether that is faster than 4. Can you try this? If this turns out to be the problem, you need to make sure that you are allowed to spawn a second MPI process on your HPC system and, if needed, add special flags for oversubscription. But let's first figure out if this is the problem here. Best,
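PS: a quick way to check the core count from inside the job is to print the CPU affinity of the running process. This is only a minimal sketch (assuming a Linux node; the script name is made up, it is not part of solid_dmft):

```python
# check_affinity.py -- hypothetical helper script, not part of solid_dmft
# Prints how many CPU cores this process is actually allowed to run on.
# If the scheduler pins the job to a single core, every additional MPI rank
# only oversubscribes that core and the DFT step slows down.
import os

allowed = os.sched_getaffinity(0)  # set of CPU ids available to this process (Linux only)
print(f"process may run on {len(allowed)} core(s): {sorted(allowed)}")
```

If this reports 1 core while your batch script requests more, the job's CPU binding (or the allocation itself) is the problem rather than solid_dmft.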
I have encountered an issue similar to the one discussed above. My Quantum ESPRESSO (QE) is compiled with Intel MKL + Intel MPI, and I am running it within a conda environment that includes TRIQS (installed via conda) and solid_dmft (installed via conda). When I run the QE calculation inside solid_dmft (the Ce₂O₃ example), using
However, when I open the corresponding QE output file, there is no explicit error message; it simply stops before any SCF iteration begins. I also tried using the mpirun from the conda environment before sourcing the Intel MPI environment, but in that case the program does not occupy any CPU cores. Another issue is that when I set
Could anyone advise what the root cause might be, or how to properly configure the MPI environment so that QE runs correctly inside solid_dmft?
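In case it helps with the diagnosis, this is a small check of which MPI launcher the environment resolves first; the script name and the version probing are my own additions (not part of solid_dmft or TRIQS):

```python
# which_mpi.py -- hypothetical diagnostic, not part of solid_dmft or TRIQS
# Shows which MPI launcher the current environment resolves first. A pw.x
# built against Intel MPI but started with the conda environment's launcher
# (or the other way round) can stall without printing any error message.
import shutil
import subprocess

for launcher in ("mpirun", "mpiexec", "srun"):
    path = shutil.which(launcher)
    print(f"{launcher}: {path}")
    if path:
        # first line of the version output identifies the MPI flavour
        out = subprocess.run([path, "--version"], capture_output=True, text=True)
        lines = (out.stdout or out.stderr).splitlines()
        print("   ", lines[0] if lines else "(no version output)")
```

Mixing MPI implementations (launching an Intel MPI binary with a different mpirun) is a common cause of silent hangs, so the launcher reported here should match the build that QE was compiled with.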
Dear developers,


Recently, I have been using solid_dmft to run the CSC DFT+DMFT calculations (the Ce2O3 case from the tutorial) on our HPC cluster, with QE as the DFT code, and I found that the QE SCF part is very slow.
With 4 cores, the QE SCF step takes 1m27s CPU / 6m7s WALL time; with 12 cores it is even worse: 50s CPU but 10m23s WALL time. In the latter case, most of the wall time comes from the "electrons" part (515.29s), and within "electrons" most of it comes from "sum_band" (269.56s). See below.
It seems that the MPI efficiency is very low, so I checked the code of qe_manager.py. I found that QE is called via "qe_exec += f'pw.x -nk {number_cores}'", i.e. with k-point parallelization. I also found that QE seems to use the diagonalization parallelization by default when I call it through Slurm directly. So I modified the line to "qe_exec += f'pw.x -nd {number_cores}'" and reran the calculation with solid_dmft. However, the situation got even worse: with 9 cores (see below), the QE SCF step takes about 2h CPU time and 6h wall time.
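For reference, this is the change I tried, sketched in isolation; only the two quoted f-strings correspond to the line I found and edited in qe_manager.py, the rest (variable values, prints) is just illustration:

```python
# Simplified sketch of the pw.x call-string change described above.
number_cores = 9

# original: '-nk' splits the k-points into pools (k-point parallelization)
qe_exec = f'pw.x -nk {number_cores}'

# my modification: '-nd' instead sets the size of the linear-algebra group
# used for the subspace diagonalization
qe_exec_modified = f'pw.x -nd {number_cores}'

print(qe_exec)           # pw.x -nk 9
print(qe_exec_modified)  # pw.x -nd 9
```

As far as I understand, -nk and -nd are not interchangeable: -nk distributes k-points over pools, while -nd only controls the diagonalization group within each pool, so swapping one for the other changes the parallelization strategy rather than just the core count.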
This is very strange. For comparison, I also ran a plain QE SCF calculation through Slurm directly (i.e. a normal QE SCF run, not via solid_dmft), again with 9 cores. That run takes 26s CPU / 28s wall time, which is very fast.
So my question is: why is the QE SCF calculation so slow when it is called by solid_dmft?