
Intro Tutorial 3: Parallelization

As detailed in Tutorial 1, the GW workflow of WEST involves three steps:

  • Step 1: Mean-field calculation (pw.x)

  • Step 2: Calculation of dielectric screening (wstat.x)

  • Step 3: Calculation of quasiparticle corrections (wfreq.x)

In each step, several levels of MPI and OpenMP parallelization can be used to accelerate the calculation by distributing the workload across processors. In this tutorial, we will explore the parallelization levels available in WEST.

WEST adopts a multilevel parallelization strategy:

  • world: The group of all MPI processes.

  • image: world can be partitioned into images. When using the projective dielectric eigendecomposition (PDEP) technique to compute the static dielectric screening, perturbations are distributed across images. Each image is responsible for computing the density response only for the perturbations owned by the image.

  • pool: Each image can be partitioned into pools, each working on a subgroup of k-points and spin channels.

  • bgrp: Each pool can be partitioned into band groups, each working on a subgroup of bands (wavefunctions).

  • R&G: Wavefunctions in the plane-wave basis set, as well as density in either real (R) or reciprocal (G) space, are distributed across MPI processes. Fast Fourier transform (FFT) is used to transform quantities between R space and G space. By default, a 3D FFT is parallelized by distributing planes of the 3D grid in real space to MPI processes. In reciprocal space, the G-vectors are distributed.

  • SIMT: (used only if GPUs are available) Each MPI process in a band group is capable of offloading computations to a GPU device using the single instruction, multiple threads (SIMT) protocol.

Image of parallelization levels

In this figure, the multilevel parallelization of WEST is exemplified for the case of 16 total MPI processes. The processes are divided into two images. Each image is divided into two pools, each of which is further divided into two band groups. Within each band group there are two MPI processes, each of which is connected to a GPU device. Using GPUs to accelerate WEST calculations is discussed in Tutorial 4.
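
For reference, the layout in this figure would correspond to a launch command along the following lines. This is a hypothetical sketch: it requires 16 cores and, because pools currently distribute only spin channels, a spin-polarized system.

[ ]:
%%bash
# Hypothetical launch matching the figure: 16 MPI processes split into 2 images,
# each image into 2 pools (spin channels), each pool into 2 band groups,
# leaving 2 processes per band group for the R&G distribution of wavefunctions.
mpirun -n 16 wstat.x -ni 2 -nk 2 -nb 2 -i wstat.in > wstat.out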

WEST (wstat.x, wfreq.x, wbse_init.x, and wbse.x) supports image parallelization (activated by command line option -nimage or -ni), pool parallelization (activated by -npool or -nk) for spin channels (currently not for k-points), band parallelization (activated by -nband or -nb), and R&G parallelization (activated by default).

Some remarks:

  • If any parallelization flag is NOT given, it is the same as setting the corresponding flag to 1, for instance: mpirun -n 2 wstat.x -i wstat.in > wstat.out is the same as: mpirun -n 2 wstat.x -ni 1 -nk 1 -nb 1 -i wstat.in > wstat.out. This implies that if no parallelization flag is specified, WEST by default only uses the R&G level of parallelization.

  • The R&G parallelization level helps reduce memory footprint (per MPI process), as each process only stores a fraction of the wavefunctions.

  • The amount of data exchange is low for the image and pool levels, moderate for bgrp, and high for R&G, as the parallel FFT algorithm requires all-to-all communications, that is, every MPI process within the same band group must exchange data with every other process in that group. Therefore, the lowest wall times are typically obtained with the smallest number of cores per band group, as illustrated below.
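
As an illustration of the last remark, consider two hypothetical ways of running wstat.x on 8 cores (output file names are placeholders): pure R&G parallelization, and the same cores split into four images so that each parallel FFT involves only two processes.

[ ]:
%%bash
# Hypothetical comparison; requires 8 MPI processes.
# (a) Pure R&G parallelization: every FFT is distributed over all 8 processes.
mpirun -n 8 wstat.x -i wstat.in > wstat.rg.out
# (b) Four images of 2 processes each: each FFT involves only 2 processes,
#     reducing the all-to-all communication within each band group.
mpirun -n 8 wstat.x -ni 4 -i wstat.in > wstat.4image.out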

For more information about the implementation, the reader is referred to Govoni et al., J. Chem. Theory Comput. 11, 2680 (2015) and Yu et al., J. Chem. Theory Comput. 18, 4690-4707 (2022).

Now let us try running WEST with different parallelization strategies. We will reuse the input files used in Tutorial 1. Download these files to your current directory:

[ ]:
%%bash
wget -N -q https://west-code.org/doc/training/silane/pw.in
wget -N -q https://west-code.org/doc/training/silane/wstat.in
wget -N -q https://west-code.org/doc/training/silane/wfreq.in
wget -N -q http://www.quantum-simulation.org/potentials/sg15_oncv/upf/H_ONCV_PBE-1.2.upf
wget -N -q http://www.quantum-simulation.org/potentials/sg15_oncv/upf/Si_ONCV_PBE-1.2.upf

First we perform a ground state calculation using pw.x as usual.

[ ]:
%%bash
mpirun -n 2 pw.x -i pw.in > pw.out

As in Tutorial 1, we run wstat.x and wfreq.x with two MPI processes (CPU cores).

[ ]:
%%bash
mpirun -n 2 wstat.x -i wstat.in > wstat.out
mpirun -n 2 wfreq.x -i wfreq.in > wfreq.out

Since none of -nimage, -npool, or -nband is specified, WEST (wstat.x and wfreq.x) defaults to the R&G level of parallelization. Alternatively, the calculation can be carried out using image parallelization, which is controlled by the -ni command line switch.

[ ]:
%%bash
mpirun -n 2 wstat.x -ni 2 -i wstat.in > wstat.2image.out
mpirun -n 2 wfreq.x -ni 2 -i wfreq.in > wfreq.2image.out

The reader is encouraged to try this out and compare the output files obtained with and without -ni 2 (a quick way of comparing them is sketched after the following list).

  • Using a different parallelization strategy should not change the physical observables except for very minor numerical deviations.

  • The estimated memory consumption roughly doubles when using -ni 2, as each image needs to allocate a copy of the wavefunctions.

  • For a small calculation like this one, the choice of parallelization strategy has only a small impact on the time to solution.
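
A quick, if crude, way to compare the two runs is to diff the corresponding output files produced by the cells above; the differences should be confined to the parallelization report, memory estimates, timings, and possibly tiny numerical deviations.

[ ]:
%%bash
# Compare the outputs obtained with and without -ni 2.
diff wstat.out wstat.2image.out | head -n 40
diff wfreq.out wfreq.2image.out | head -n 40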

In addition to image parallelization, WEST supports parallelization over spin and band indices. These parallelization levels can be used separately or in combination, as summarized in the table below.

| code        | image | pool | band group | GPU |
|-------------|-------|------|------------|-----|
| wstat.x     | x     | x    | x          | x^a |
| wfreq.x     | x     | x    | x          | x   |
| westpp.x    |       | x    |            | x   |
| wbse_init.x | x     | x    | x^b        | x^a |
| wbse.x      | x     | x    | x          | x   |

  • ^a: GPU acceleration is not available when running WEST and Qbox in client-server mode.

  • ^b: Supported but not recommended.

We underline the importance of choosing an appropriate parallelization strategy when running large-scale simulations on massively parallel supercomputers. The optimal parameters depend on the specifics of the simulation and of the hardware. In general, the image parallelization level is preferred over the other levels, since it requires the least communication. The pool (spin) parallelization level usually speeds up calculations of spin-polarized systems. The efficiency of the R&G level of parallelization is largely limited by the efficiency of the parallel FFT.

Below are some restrictions and guidelines for choosing the parallelization parameters. n_total denotes the total number of MPI processes. n_image, n_pool, and n_bgrp denote the numbers of images, pools, and band groups, respectively. nbndocc denotes the number of occupied bands. n_pdep_eigen, n_pdep_times, n_pdep_eigen_to_use, qp_bandrange, n_liouville_eigen, and n_liouville_times are input parameters described in the manual. An illustrative example is given after the lists below.

wstat.x, wfreq.x, wbse_init.x, wbse.x:

  • n_total / (n_image * n_pool * n_bgrp) < MIN(# planes in the Z direction, # cores in one node) [recommended for FFT efficiency on CPUs]

  • n_total / (n_image * n_pool * n_bgrp) as small as possible (limited by the memory of the GPU) [recommended for FFT efficiency on GPUs]

wstat.x:

  • n_image <= n_pdep_eigen * n_pdep_times [required]

  • n_bgrp <= nbndocc [required]

wfreq.x:

  • n_image <= n_pdep_eigen_to_use [required]

  • n_bgrp <= nbndocc [required]

  • n_bgrp <= qp_bandrange[2] - qp_bandrange[1] + 1 [required]

  • n_pdep_eigen_to_use / n_image > 32 [recommended for GEMM efficiency on CPUs]

  • n_pdep_eigen_to_use / n_image > 128 [recommended for GEMM efficiency on GPUs]

wbse_init.x:

  • n_image <= nbndocc [required]

wbse.x:

  • n_image <= n_liouville_eigen * n_liouville_times [required for Davidson]

  • n_image == 1 or 3 [required for Lanczos]

  • n_bgrp <= nbndocc [required]
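
As an illustration of these rules, the following hypothetical wstat.x launch for the silane system of Tutorial 1 (which has 4 occupied bands) combines image and band-group parallelization on 8 cores while respecting the restrictions above.

[ ]:
%%bash
# Hypothetical example; requires 8 MPI processes.
# 2 images x 2 band groups x 2 processes per band group:
#   n_image = 2 <= n_pdep_eigen * n_pdep_times
#   n_bgrp  = 2 <= nbndocc (4 occupied bands for silane)
mpirun -n 8 wstat.x -ni 2 -nb 2 -i wstat.in > wstat.8.out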

An example of using WEST with all the parallelization levels and GPUs is given in Tutorial 4.