Sunday, February 25, 2007

Compiling SIESTA in parallel with Portland compiler

Here I'll post my notes on compiling SIESTA on an AMD Opteron machine using the Portland compiler (pgf90/pgf95). You may also find them useful when compiling with another compiler or on another architecture. I do not guarantee anything, only that everything written below worked for me; that does not mean it will work for you. Moreover, if your hard disk somehow gets formatted after following my instructions, don't blame me: you use my advice at your own risk. I also have an unfinished version of similar notes on compiling SIESTA on an Intel machine with ifort (an extended version of the ones by Sebastien Le Roux), but I still haven't had time to finish and publish them. If I do, I'll post the link here as well. No instructions on compiling MPICH are given, because I assume you are using a cluster which already has MPICH installed. If this is not the case, I refer you to Sebastien Le Roux's HOW-TO, which can be found in the SIESTA mailing list archives.

What is needed to compile parallel SIESTA?

  • SIESTA distribution

  • Compatible Fortran and C compilers (Intel ifort and icc for an Intel machine, pgf95/pgcc for AMD, etc.)

  • MPI (Open MPI, MPICH, LAM, etc.), compiled with the same compiler which will be used for the SIESTA compilation. It is usually already installed on clusters and supercomputers, so check which MPI installations are available and which compilers they support (a quick way to check this is sketched just after this list). If both mpich and mpich2 are available, I would rather recommend mpich: I remember having some problems with the latter.

  • BLAS and LAPACK, which can be downloaded from netlib.org, or (better) use ACML or Intel MKL instead. You may also try GotoBLAS, but I haven't managed to compile SIESTA with it.

  • BLACS - should be compiled from scratch with the same options and the same compiler as SIESTA.

  • SCALAPACK – should be compiled from scratch after compilation of LAPACK/BLAS and BLACS.
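
Before going further, it is worth checking which compiler your MPI installation was actually built with. With MPICH-style wrappers this can usually be done as sketched below (the wrapper name and its output are only an example; adjust to your installation):

which mpif90
mpif90 -show    # prints the underlying compile command, e.g. pgf95 plus the MPI include and library paths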

The examples below refer to an Opteron machine with the Portland compiler (accessible as pgf95 for Fortran and pgcc for C), but the idea can be transferred to other compilers and architectures.

Option 1. Compiling BLAS and LAPACK from scratch

This works almost everywhere, but it is less efficient than using libraries optimized for a specific processor (such as ACML or Intel MKL). This option is also described in Sebastien Le Roux's HOW-TO, but for an Intel machine. I repeat almost the same here for an AMD machine, which makes almost no difference. Some things from that HOW-TO did not work for me, so I'll mention them as well.

To compile BLAS and LAPACK you only need the lapack.tar.gz archive with LAPACK, since BLAS is included in this package, downloadable from netlib.org.
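
For reference, fetching and unpacking it typically looks something like this (the exact file name and version on netlib.org may differ):

wget http://www.netlib.org/lapack/lapack.tgz    # sometimes the archive is called lapack.tar.gz
tar xzf lapack.tgz
cd lapack-*    # the directory name depends on the LAPACK version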

BLAS AND LAPACK

Here is my make.inc (standard comments are skipped; my own comments follow the values on each line):

PLAT = _LINUX

FORTRAN = pgf95 The compiler which will be used for the SIESTA compilation. On an Intel machine it could be ifort. You can also use pgf90 instead.

OPTS = -g -O0 This indicates that we do not want optimization. Generally these options should be the same as those used for the SIESTA compilation, but this is not strictly necessary. You may want to change them to -fastsse, -O2 or something else available via the man pgf95 command. I do not recommend the -fast flag here, but you may try it. In any case, never use -fast for the SIESTA compilation itself; use -fastsse instead. I'll come back to this later.

DRVOPTS = $(OPTS)

NOOPT =

LOADER = pgf95 Generally the loader should be the same as the compiler given in FORTRAN

LOADOPTS =

ARCH = ar

ARCHFLAGS= cr

RANLIB = ranlib

BLASLIB = ../../blas$(PLAT).a

LAPACKLIB = lapack$(PLAT).a

TMGLIB = tmglib$(PLAT).a

EIGSRCLIB = eigsrc$(PLAT).a

LINSRCLIB = linsrc$(PLAT).a

Variables without comments were generally left at their default values. To compile, type:

make blaslib

make lapacklib

After that, the .a libraries will be found in the current directory. I suggest renaming them to libblas.a and liblapack.a and moving them somewhere like a ~/siesta/lib directory (which you of course have to create).
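
Put together, my whole BLAS/LAPACK sequence looks roughly like this (the library names follow from PLAT = _LINUX in make.inc, and ~/siesta/lib is just my convention):

cd ~/lapack    # wherever LAPACK was unpacked
make blaslib
make lapacklib
mkdir -p ~/siesta/lib
cp blas_LINUX.a ~/siesta/lib/libblas.a
cp lapack_LINUX.a ~/siesta/lib/liblapack.a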

BLACS

Take one of the MPI files from the BMAKES directory and use it as a template for Bmake.inc. I'll post here only the lines which have to be modified:

BTOPdir = ~/BLACS Current directory, where BLACS was unpacked

COMMLIB = MPI

PLAT = LINUX

MPIdir = /opt/mpich/pgi62_64 Nota bene! MPICH must be built with the same compiler version and architecture that we will use here, so keep that in mind if you have to compile MPICH yourself!

MPIdev = ch_p4

MPIplat = LINUX

MPILIBdir = $(MPIdir)/$(MPIdev)/lib Check carefully where your libmpich.a (or an equivalent such as libmpi.a) is located; this parameter should point to the directory where it is stored

MPIINCdir = $(MPIdir)/$(MPIdev)/include The path to the MPI include files, such as mpi.h.

MPILIB = $(MPILIBdir)/libmpich.a full path to libmpich.a, check if it is correct and such a library exists

INTFACE = -Dadd_ This sets the Fortran-to-C name-mangling convention (an appended underscore); setting it to -Dadd_ worked for me every time.

F77 = pgf95

F77NO_OPTFLAGS = -g -O0 Compiler flags for files which should not be compiled with optimization

F77FLAGS = $(F77NO_OPTFLAGS) This can and should be changed; I'm just doing test compilations without any optimization. As with LAPACK, these flags should preferably (but not necessarily) be the same as for the SIESTA compilation. In my benchmark tests, changing these flags in BLACS did not influence the overall SIESTA performance at all, but that doesn't mean you shouldn't try to optimize BLACS.

F77LOADER = $(F77)

F77LOADFLAGS =

CC = pgcc The C compiler, which should be fully compatible with the Fortran compiler. For Portland it should be of the same architecture and version as pgf95; the same is true for the Intel compilers.

CCFLAGS = $(F77FLAGS) In general the C compiler flags should duplicate the Fortran flags, but you may play with them as well.

CCLOADER = $(CC)

CCLOADFLAGS =

So to compile just type: make mpi.

NOTA BENE: Generally, the command make clean removes the previous compilation, so you can start over from the very beginning if you want to. This should be done after every change in the makefiles and before recompiling. In the case of BLACS, however, you should type make mpi what=clean, because a plain make clean won't work.

After compilation you'll find the compiled libraries in the LIB subdirectory. I suggest moving them into something like ~/siesta/lib as well and renaming them to libblacsF77init.a, libblacsCinit.a and libblacs.a.
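
In my case that amounted to something like the following (the exact file names in LIB depend on the PLAT and debug-level settings in Bmake.inc, so check what actually gets produced):

cd ~/BLACS
make mpi
# to wipe a previous build, use: make mpi what=clean
cp LIB/blacsF77init_MPI-LINUX-0.a ~/siesta/lib/libblacsF77init.a
cp LIB/blacsCinit_MPI-LINUX-0.a ~/siesta/lib/libblacsCinit.a
cp LIB/blacs_MPI-LINUX-0.a ~/siesta/lib/libblacs.a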

SCALAPACK

Here is my SLmake.inc (its editable part, actually) with comments:

home = ~/scalapack-1.7.5 A path where SCALAPACK was unpacked

PLAT = LINUX

BLACSdir = ~/siesta/lib The directory where the BLACS libraries compiled above are located.

USEMPI = -DUsingMpiBlacs

SMPLIB = /opt/mpich/pgi62_64/lib/libmpich.a Full path to mpich

I had set a wrong path to the BLACS libraries (three lines below), and it still worked for me! So it seems ScaLAPACK does not strictly require them here, or it somehow finds the libraries automatically via BLACSdir.

BLACSFINIT = $(BLACSdir)/libblacsF77init.a

BLACSCINIT = $(BLACSdir)/libblacsCinit.a

BLACSLIB = $(BLACSdir)/libblacs.a

For compilers and options see the same comments for BLACS and LAPACK compilation

F77 = pgf95

CC = pgcc

NOOPT = -g -O0

F77FLAGS = $(NOOPT)

DRVOPTS = $(F77FLAGS)

CCFLAGS = $(NOOPT)

SRCFLAG =

F77LOADER = $(F77)

CCLOADER = $(CC)

F77LOADFLAGS =

CCLOADFLAGS =

CDEFS = -DAdd_ -DNO_IEEE $(USEMPI)

BLASLIB = ~/siesta/lib/libblas.a Path to previously compiled BLAS lib.

So after running make you'll find libscalapack.a which I would also copy to ~/siesta/lib or so.
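
Put together, the ScaLAPACK step is just the following (assuming the library ends up in the top directory, as it did for me):

cd ~/scalapack-1.7.5
make
cp libscalapack.a ~/siesta/lib/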

At last: compiling SIESTA

So here is my arch.make :

SIESTA_ARCH=pgf95-natanzon

FC=pgf95

FFLAGS= -g -O0 We are compiling SIESTA without any optimization. You can (and I advise you to) play with the optimization flags instead of stopping at these options: they are slow, but they guarantee a successful compilation, which is what matters at this stage. You can also set this parameter to -fastsse or so (see man pgf95 for details). Problems have been reported with the -fast flag, so I do NOT recommend it; it may cause crashes in SIESTA runs.

FPPFLAGS= -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT

FFLAGS_DEBUG= -g -O0 This should normally not be changed; some SIESTA files have to be compiled without optimization

RANLIB=echo

#

NETCDF_LIBS=

NETCDF_INTERFACE=

DEFS_CDF=

#

MPI_INTERFACE=libmpi_f90.a This also normally should not be changed

MPI_INCLUDE=/opt/mpich/pgi62_64/include path to MPI include dir

MPI_LIBS=/opt/mpich/pgi62_64/lib path, where libmpich.a is located

DEFS_MPI=-DMPI Option for compiling parallel version of SIESTA. Necessary!

#

HOME_LIB=~/siesta/lib Path where the libraries we compiled previously (BLAS, LAPACK, BLACS, ScaLAPACK) are located

LAPACK_LIBS= -llapack There are two ways of specifying a library: give the full filename, like liblapack.a, or write it as shown here. The linker automatically turns -lname into libname.a, searching the directories given with -L.

BLAS_LIBS=-lblas

BLACS_LIBS = -lblacsCinit -lblacsF77init -lblacs

SCALAPACK_LIBS = -lscalapack

LIBS= -L$(HOME_LIB) $(SCALAPACK_LIBS) $(BLACS_LIBS) $(LAPACK_LIBS) $(BLAS_LIBS) $(NETCDF_LIBS)

SYS=bsd

DEFS= $(DEFS_CDF) $(DEFS_MPI)

#

.F.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

.F90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

To compile SIESTA, just type make. I expect everything should be OK.

What else? Adding optimization

This should really be done from the very beginning, but an over-optimized SIESTA sometimes does not work, so it is safer to compile without any optimization first and make sure that everything works. For this I recommend running a few SCF tasks (without cell relaxation, 0 CG moves) to make sure the parallelization works.
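
A minimal test run looks something like this (the mpirun syntax, the number of processes and the path to the siesta binary are only an example; your MPI installation may use mpiexec or require a machinefile):

mpirun -np 4 ~/siesta/Src/siesta < test.fdf > test.out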

To optimize, you should replace F77FLAGS and CCFLAGS everywhere in arch.make and in the supporting libraries, and recompile everything from the beginning. For the pgf compiler I recommend this one:

F77FLAGS = -fastsse

I think this is the best that can be done. On an Intel machine, -O3 -tpp7 (or similar) should be used instead; it gives the best optimization there. You may also try the -fast flag on an Intel machine, where the compiler is supposed to discover the best optimization flags automatically, but it does something wrong for pgf. So never use -fast with pgf, although this is the n-th time I'm repeating this. Some additional changes also have to be made in arch.make:

atom.o: This file and the following ones should still be compiled without optimization

$(FC) -c $(FFLAGS_DEBUG) atom.f

#

dfscf.o:

$(FC) -c $(FFLAGS_DEBUG) $(INCFLAGS) dfscf.f

#

#

electrostatic.o:

$(FC) -c $(FFLAGS_DEBUG) electrostatic.f

#

.F.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

.F90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<


Option 2. Linking SIESTA with external libraries (ACML, Intel MKL)

What for? My benchmarks show that using the ACML library on an AMD64 machine noticeably increases the overall SIESTA performance, so it is worth thinking about linking external libraries like ACML, Intel MKL, GotoBLAS, etc.

Note: I haven't managed to link Intel MKL so far, but that doesn't mean it can't be done; keep trying. So I just share my experience of linking ACML here. I think other libraries can be linked in the same way.

What is needed?

  • The ACML library installed. If you don't have it, just download it, unpack it and run the install-blablabla.sh script, which will do everything automatically; you only have to provide a path where to install. I assume below that my ACML is installed in ~/acml.

  • Precompiled BLACS according to instructions in Option 1 chapter.

  • Sources of SCALAPACK and of course SIESTA, those two should be recompiled with ACML support.

  • No LAPACK/BLAS are needed at this time, because ACML will replace them.

SCALAPACK

Compared to the previous chapter, only one change has to be made in the SLmake.inc file, namely the path to BLAS:

BLASLIB = ~/acml/pgi64/lib/libacml.a The full path to libacml.a. Be careful to use the correct version (32- or 64-bit) compatible with your compiler.

SIESTA

Here I'll give you my arch.make file again, but the comments will be given only for lines which have changed.

SIESTA_ARCH=pgf95-matterhorn

#

FC=pgf95

FC_ASIS=$(FC)

#

FFLAGS= -fastsse -Wl,-R/opt/pgi/linux86-64/6.2/libso/ -Wl,-R~/acml/pgi64/lib/

You can see that additional flags are passed to the compiler: -fastsse switches on optimization, while the two -Wl,-R options add the ACML and PGI library directories to the runtime search path, both of which are necessary for correct linking.

FFLAGS_DEBUG= -g -O0

LDFLAGS= -fastsse -Bstatic The second flag tells the linker to link the libraries statically, so you won't have to set LD_LIBRARY_PATH before running SIESTA. If you have trouble with linking, try deleting the -Bstatic flag and recompiling. But in this case you should execute

export LD_LIBRARY_PATH=/opt/pgi/linux86-64/6.2/libso

before running SIESTA, otherwise you'll get errors like: cannot load shared libraries XXX.so

Note that the -Bstatic flag is called -static for ifort. The linking may also crash with this flag for the Intel compiler, but for pgf it works fine.

RANLIB=echo

COMP_LIBS= #dc_lapack.a Do not use internal LAPACK of SIESTA!

MPI_ROOT=/opt/mpich/pgi62_64/ch_p4

MPI_LIBS= -L$(MPI_ROOT)/lib -lmpich

ACML = -mp -L~/acml/pgi64/lib -lacml -Wl,-R/opt/pgi/linux86-64/6.2/libso/ -Wl,-R~/acml/pgi64/lib/

The -L/-lacml part points to the ACML library, and the two -Wl,-R options simply repeat the compiler flags. I am not completely sure why -mp is needed here; in the PGI compilers it enables OpenMP/multithreading support.

MPI_INTERFACE=libmpi_f90.a

MPI_INCLUDE=/opt/mpich/pgi62_64/ch_p4/include

DEFS_MPI=-DMPI

#

SCALAPACK = -L~/siesta/lib -lscalapack

BLACS = -L~/siesta/lib -lblacs -lblacsF77init -lblacsCinit -lblacsF77init -lblacs

NUMA = -L/opt/pgi/linux86-64/6.2/libso -lnuma This library is needed for static linking. If you link dynamically (no -Bstatic flag) you do not need this library.

LIBS= $(SCALAPACK) $(BLACS) $(ACML) $(MPI_LIBS) $(NUMA)

SYS=cpu_time

DEFS= $(DEFS_CDF) $(DEFS_MPI)

#

#

# Important (at least for V5.0-1 of the pgf90 compiler...)

# Compile atom.f and electrostatic.f without optimization.

#

atom.o:

$(FC) -c $(FFLAGS_DEBUG) atom.f

#

dfscf.o:

$(FC) -c $(FFLAGS_DEBUG) $(INCFLAGS) dfscf.f

#

#

electrostatic.o:

$(FC) -c $(FFLAGS_DEBUG) electrostatic.f

#

.F.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

.F90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<
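
Once the binary is built, a quick way to check whether the -Bstatic linking discussed above really worked is ldd (the path to the binary is just an example):

ldd ~/siesta/Src/siesta
# a fully static binary prints "not a dynamic executable";
# otherwise you will see the list of .so files it needs at run time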

So, voila! That's all I can tell you about parallel SIESTA compilation. I've also done some benchmarks, but I decided not to include them here. Just ask in the comments if you want a PDF file of these notes with benchmark tests of the parallelization.

Sunday, February 4, 2007

How to calculate Density of States with SIESTA

Here I'll briefly explain how to calculate the density of states with SIESTA.
For this you need to run your calculation and finally get a file SystemLabel.EIG in the working directory. Take a look at this file. This is how a typical output looks:
       -3.4561                                                                                                                          
196     1 40                                                                                                                                          
    1   -52.65773   -52.61839   -52.61035   -52.59173   -31.48590   -31.40974   -31.38261   -31.22230   -31.10230   -31.05041                               
        -30.99880   -30.97822   -30.88169   -30.85002   -30.80920   -30.74749   -22.38872   -21.66313   -21.52788   -21.47900                               
        -21.10694   -20.81325   -20.77569   -20.74793   -10.80684   -10.31129   -10.14217   -10.05258    -9.95418    -9.79580                               
         -8.93664    -8.90326    -8.39606    -8.29401    -8.20815    -7.69086    -7.60359    -7.48728   
           ....                                                           
    2   -52.66887   -52.60975   -52.60674   -52.59284   -31.51384   -31.50220   -31.45885   -31.18657   -31.02913   -30.99967                               
        -30.99402   -30.96087   -30.86182   -30.84823   -30.79439   -30.72731   -22.64175   -21.65435   -21.63041   -21.51864                               
           ....                                                         
    3   -52.67693   -52.60817   -52.59919   -52.59389   -31.58918   -31.51596   -31.50736   -31.14978   -31.05660   -31.00531        
          ....
The numbers at the start of the file are very important, so here is an explanation of their meaning:
-3.4561 eV is the Fermi energy of the system, which has 196 bands, 1 spin component (so it is spin-unpolarized) and 40 k-points, each introduced by its index in the first column. After the Fermi energy, add the following numbers:
eta - I'd call it the precision; set it to a value smaller than 1.
Ne - the number of eigenvalues; I advise making it large.
Emin and Emax - the interval of energies (in eV) for which the DOS will be calculated.

Put these values exactly on the first line, after the Fermi energy. I did it this way:
-3.4561 0.05 1000 -10 10

Close the file and run the eig2dos utility:
    eig2dos < SystemLabel.EIG | tee dos.dat
The eig2dos utility is located in the Utils directory of SIESTA. You can compile it with the f95 compiler or another one:
    f95 eig2dos.f -o eig2dos
You may plot the results using gnuplot or the grace program:
    xmgrace dos.dat
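
If you prefer gnuplot, a minimal plot looks like this (I assume here that the first column of dos.dat is the energy and the second is the DOS; check the header of the file written by eig2dos):
    gnuplot
    gnuplot> plot "dos.dat" using 1:2 with lines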

Friday, January 19, 2007

Notes on using PBS systems on clusters and supercomputers

These short notes may help you (and me, after I forget it all :) ) to make use of large supercomputers with PBS systems installed. I'll explain how to run, monitor and delete tasks using PBS. At the moment I'm using the computer located at ACK CYFRONET AGH in Kraków, where Torque PBS is installed, and a cluster with OpenPBS located at the Interdisciplinary Center for Computer Modelling of Warsaw University. Generalizing my (short) experience with PBS, I'll tell you how to put tasks in a queue, monitor them and delete them, and how to change the resources required for a task; a sample of a typical PBS script will also be provided. It is assumed that you already know the basics of Unix/Linux and have an account on a computer with a PBS system installed.


The Portable Batch System is a queue system which allows optimal management of computer resources (memory, CPU, etc.) on supercomputers and clusters that are used by many users for large calculations. Most big computers consist of several nodes, each of which has several processor units. Normally one runs a command from a workstation or terminal just by typing its name and waits until it is finished. This command will use 99% of the CPU time and will run on one processor unit. If one needs to run two commands at the same time, one opens two terminal windows and runs two commands, each of which will use 50% of the CPU. When we want to make use of parallel calculations on several nodes and CPUs, special software is needed, and this is exactly what PBS provides.

It works like a cron daemon or the Task Scheduler in Windows. You specify a program to be run and the resources to be used (i.e. the number of nodes and processors, the amount of memory, etc.) and put it into a queue with the following command:

qsub script.sh

where script.sh is a shell script which contains the path to the program and the resources to be used. I'll post an example of such a script below.
You may also specify resources on the command line. For details see:

man qsub
man pbs_resources

Sometimes you also have to specify the queue (often named after the node or partition) to submit to:

qsub -q queue_name script.sh

For details, see the documentation provided by your system administrator. After submitting a job to the queue you'll get an output like:

98982.supercomp

Here 98982 is the job identifier and "supercomp" is the PBS server hostname. You can run the following command to see the status of this job:

qstat -f 98982

To see the status of all jobs just type

qstat -a

Or you may want to see only your own jobs:

qstat -u your_loginname

In order to delete the job with id=98982 you may type:

qdel 98982

In order to change the resources given in the script run.sh, you may use the qalter command. I will not discuss all the possible options of qsub, qalter, qdel and qstat, because you can find them by typing man command_name; I'll only give you the minimum required and recommend reading the manuals and literature for further details. Let us look inside the script run.sh:

#!/bin/sh
#PBS -N WaterStone
#PBS -l nodes=1:ppn=8
#PBS -l walltime=72:00:00
cd $PBS_O_WORKDIR/
~/siesta/Src/siesta < input.fdf > screen.out

This is a usual Unix shell script, where the special commands recognized by the PBS system are put in comments beginning with #PBS. The first directive specifies the name of the task (WaterStone), which will appear in the statistics (qstat). The second one gives the number of nodes (1) and the number of processors per node (8). Consult the documentation provided by your system administrator to learn how to set these parameters. For example, on the computers I'm using at the moment I always have to set nodes=1, because I have access to only one node, and the maximum number of processors is restricted to 8 for my account. Note also that on most servers tasks which request fewer resources (less memory and fewer processors) have priority, so sometimes it is useful to set ppn=2 or ppn=1, because such a task will start earlier than one with ppn=8. Of course, this depends on the policy of the system administrators.

The walltime directive specifies the amount of time allotted to the task. After this period the task will be terminated automatically, so if you need more time you should change this parameter later with the qalter command.
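
For example, to extend the wall time of the job above to 96 hours, something like this should work (see man qalter; depending on the site policy, not every attribute may be changed after submission):

qalter -l walltime=96:00:00 98982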

The cd line makes the directory where the script run.sh is located the working directory; all the program output will be stored there, as well as the files WaterStone.task_id_o and WaterStone.task_id_e. All the "screen" output goes to the former file and possible errors go to the latter. I strongly recommend that you neither use spaces in the name of the working directory nor create the working directory inside a folder whose name or path contains spaces! This may cause problems, for example the job may fail to start.

The last line is the command to run, with input read from input.fdf and output stored in screen.out. Note that in this case nothing will be written to WaterStone.task_id_o, because no output goes to the "screen". The WaterStone.task_id_o file is created only after the task has finished, whereas output to screen.out is written continuously, so I recommend redirecting all the output to a file as I did; then you'll be able to monitor it the whole time.

Of course, there are many other options which can be defined in the script. For example, you may specify amount of memory for the program:

#PBS -l mem=1gb

Or force the PBS system to notify you about the status of the job:

#PBS -M noddeat@leavejournal.com
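
On the systems I used one can also choose which events trigger the e-mail with the -m option (a = abort, b = begin, e = end); check man qsub to see whether this works on your machine:

#PBS -m abe
#PBS -M noddeat@leavejournal.com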

Not all the other options may work, this depends on the settings of the particular system.

NOTE: This article is a draft. I'll add more information here as soon as I have time. You may ask questions or share your own experience on PBS here in the comments.

Sorry for my bad English; you may correct this article and post a corrected version in the comments. I also ask you to post comments in English only, because I may share a link to this article within the English-speaking scientific community.