Sunday, February 25, 2007

Compiling SIESTA in parallel with Portland compiler

Here I'll post my notes on compiling SIESTA on an AMD Opteron machine with the Portland compiler (pgf90/pgf95). You may also find them useful when compiling with another compiler or architecture. I guarantee nothing except that everything written below worked for me; this doesn't mean it will work for you. Moreover, if your hard disk somehow gets formatted after following my instructions, don't blame me: you use my advice at your own risk. I also have an unfinished version of similar notes on compiling SIESTA on an Intel machine with ifort (an extended version of the notes by Sebastien Le Roux), but I still haven't had time to finish and publish them. If I do, I'll post the link here as well. No instructions on compiling MPICH are given, because I assume you're using a cluster that already has MPICH installed. If this is not the case, I refer you to Sebastien Le Roux's HOW-TO, which can be found in the SIESTA mailing list archives.

What is needed to compile parallel SIESTA?

  • SIESTA distribution

  • Compatible Fortran and C compilers (Intel ifort and icc for an Intel machine, pgf95/pgcc for AMD, etc.)

  • MPI (Open MPI, MPICH, LAM, etc.), compiled with the same compiler that will be used for the SIESTA compilation. On clusters and supercomputers it is usually already installed, so first check which MPI builds are available and which compilers they support (a quick check is sketched after this list). If both mpich and mpich2 are available, I'd recommend mpich over mpich2: I remember having some problems with the latter.

  • BLAS and LAPACK, which can be downloaded from netlib.org, or (even better) use ACML or Intel MKL instead. You may also try GotoBLAS, but I haven't managed to compile SIESTA with it.

  • BLACS, which should be compiled from scratch with the same options and the same compiler as SIESTA.

  • SCALAPACK, which should be compiled from scratch after compiling LAPACK/BLAS and BLACS.
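
To check which compiler an already-installed MPICH was built with, you can usually ask its compiler wrapper. This is only a sketch; wrapper names and locations depend on your cluster:

which mpif90
mpif90 -show    # MPICH wrappers print the underlying compiler and the full compile/link line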

The examples below refer to an Opteron machine with the Portland compiler (accessible as pgf95 for Fortran and pgcc for C), but the idea carries over to other compilers and architectures.

Option 1. Compiling BLAS and LAPACK from scratch

This works almost everywhere, but it is less efficient than using libraries optimized for a specific processor (such as ACML or Intel MKL). This option is also described in Sebastien Le Roux's HOW-TO, but for an Intel machine. I repeat almost the same steps for an AMD machine, which makes almost no difference. Some things from that HOW-TO didn't work for me, though, so I'll mention those as well.

To compile BLAS and LAPACK you only need the lapack.tar.gz archive with LAPACK, since BLAS is included in that package, downloadable from netlib.org.
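
For example (the exact archive name and URL on netlib.org may have changed, so adjust as needed):

wget http://www.netlib.org/lapack/lapack.tgz
tar xzf lapack.tgz
cd lapack-*    # the directory name depends on the LAPACK version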

BLAS AND LAPACK

Here is my make.inc (the original comments are skipped; my own comments follow the values they refer to):

PLAT = _LINUX

FORTRAN = pgf95 The compiler that will be used for the SIESTA compilation. On an Intel machine it could be ifort. You can also use pgf90 instead.

OPTS = -g -O0 This means we do not want any optimization. In general these options should match the ones used for the SIESTA compilation, although they don't have to. You may want to change them to -fastsse, -O2 or something else available via the man pgf95 command. I do not recommend using the -fast flag here, but you may try. In any case, never use -fast for the SIESTA compilation itself; use -fastsse instead. I'll come back to this later.

DRVOPTS = $(OPTS)

NOOPT =

LOADER = pgf95 In general the loader should be the same as the compiler given in the FORTRAN line.

LOADOPTS =

ARCH = ar

ARCHFLAGS= cr

RANLIB = ranlib

BLASLIB = ../../blas$(PLAT).a

LAPACKLIB = lapack$(PLAT).a

TMGLIB = tmglib$(PLAT).a

EIGSRCLIB = eigsrc$(PLAT).a

LINSRCLIB = linsrc$(PLAT).a

Variables without a comment were generally left at their defaults. To compile, type:

make blaslib

make lapacklib

After that the .a libraries will be found in the current directory. I suggest renaming them to libblas.a and liblapack.a and moving them somewhere like a ~/siesta/lib directory (which, of course, you have to create).
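
With PLAT = _LINUX the archives end up in the LAPACK top directory as blas_LINUX.a and lapack_LINUX.a, so the whole step looks roughly like this (the ~/siesta/lib location is just my convention):

make blaslib
make lapacklib
mkdir -p ~/siesta/lib
cp blas_LINUX.a ~/siesta/lib/libblas.a
cp lapack_LINUX.a ~/siesta/lib/liblapack.a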

BLACS

Take one of the MPI files from the BMAKES directory and use it as a template for Bmake.inc. I'll only post the lines that have to be modified:

BTOPdir = ~/BLACS The directory where BLACS was unpacked.

COMMLIB = MPI

PLAT = LINUX

MPIdir = /opt/mpich/pgi62_64 Nota bene: MPICH must be compiled with the same compiler version and for the same architecture as we use here, so keep that in mind if you ever have to compile MPICH yourself!

MPIdev = ch_p4

MPIplat = LINUX

MPILIBdir = $(MPIdir)/$(MPIdev)/lib Check carefully where your libmpich.a (or equivalent, like libmpi.a) is located; this parameter should point to the directory where it is stored.

MPIINCdir = $(MPIdir)/$(MPIdev)/include The path to the MPI include files, such as mpi.h.

MPILIB = $(MPILIBdir)/libmpich.a The full path to libmpich.a; check that it is correct and that such a library actually exists.

INTFACE = -Dadd_ This selects the Fortran-to-C name-mangling convention; -Dadd_ means a trailing underscore is appended to Fortran symbol names, which matches most Linux Fortran compilers. Setting it to -Dadd_ worked for me every time.

F77 = pgf95

F77NO_OPTFLAGS = -g -O0 Compiler flags for files that must be compiled without optimization.

F77FLAGS = $(F77NO_OPTFLAGS) This can and should be changed; I'm just doing test compilations without any optimization. As with LAPACK, these flags should preferably (but not necessarily) be the same as for the SIESTA compilation. In my benchmark tests, changing these flags in BLACS didn't influence the overall SIESTA performance at all, but that doesn't mean you shouldn't try to optimize BLACS.

F77LOADER = $(F77)

F77LOADFLAGS =

CC = pgcc The C compiler, which must be fully compatible with the Fortran compiler. For Portland it should be of the same architecture and version as pgf95; the same holds for the Intel compilers.

CCFLAGS = $(F77FLAGS) In general the C compiler flags should duplicate the Fortran flags, but you may play with them too.

CCLOADER = $(CC)

CCLOADFLAGS =

So to compile, just type: make mpi.

NOTA BENE: Usually the command make clean removes the results of the previous compilation, so you can start again from the very beginning if you want to; this should be done after every change in the makefiles and before recompiling. In the case of BLACS, however, you have to type make mpi what=clean; a plain make clean won't work.

After the compilation you'll find the libraries in the LIB subdirectory. I suggest moving them into something like ~/siesta/lib as well, renamed to libblacsF77init.a, libblacsCinit.a and libblacs.a.
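
With the default naming scheme the archives in LIB/ are called something like blacs_MPI-LINUX-0.a (the exact names depend on COMMLIB, PLAT and the debug level in your Bmake.inc), so the move boils down to:

cp LIB/blacsF77init_MPI-LINUX-0.a ~/siesta/lib/libblacsF77init.a
cp LIB/blacsCinit_MPI-LINUX-0.a ~/siesta/lib/libblacsCinit.a
cp LIB/blacs_MPI-LINUX-0.a ~/siesta/lib/libblacs.a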

SCALAPACK

Here is my SLmake.inc (its editable part, actually) with comments:

home = ~/scalapack-1.7.5 A path where SCALAPACK was unpacked

PLAT = LINUX

BLACSdir = ~/siesta/lib The directory where the previously compiled BLACS libraries are located.

USEMPI = -DUsingMpiBlacs

SMPLIB = /opt/mpich/pgi62_64/lib/libmpich.a The full path to the MPICH library.

I had set a wrong path to the BLACS libraries (the three lines below), but everything still worked for me. So it seems that SCALAPACK either does not strictly require them here or somehow finds the libraries automatically via BLACSdir.

BLACSFINIT = $(BLACSdir)/libblacsF77init.a

BLACSCINIT = $(BLACSdir)/libblacsCinit.a

BLACSLIB = $(BLACSdir)/libblacs.a

For compilers and options see the same comments for BLACS and LAPACK compilation

F77 = pgf95

CC = pgcc

NOOPT = -g -O0

F77FLAGS = $(NOOPT)

DRVOPTS = $(F77FLAGS)

CCFLAGS = $(NOOPT)

SRCFLAG =

F77LOADER = $(F77)

CCLOADER = $(CC)

F77LOADFLAGS =

CCLOADFLAGS =

CDEFS = -DAdd_ -DNO_IEEE $(USEMPI)

BLASLIB = ~/siesta/lib/libblas.a Path to previously compiled BLAS lib.

So after running make you'll find libscalapack.a, which I would also copy to ~/siesta/lib or so.
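
That is, assuming the same ~/siesta/lib convention as before:

cd ~/scalapack-1.7.5
make
cp libscalapack.a ~/siesta/lib/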

At last: compiling SIESTA

So here is my arch.make :

SIESTA_ARCH=pgf95-natanzon

FC=pgf95

FFLAGS= -g -O0 We are compiling SIESTA without any optimization. You can (and I advise you to) play with the optimization flags and not stop at these options: they are slow, but they guarantee a successful compilation, which is more important at this point. You can also set this parameter to -fastsse or so (see man pgf95 for details). Problems have been reported with the -fast flag, so I do NOT recommend using it: it may cause crashes in SIESTA runs.

FPPFLAGS= -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT

FFLAGS_DEBUG= -g -O0 This normally should not be changed – some SIESTA files should be compiled without optimization

RANLIB=echo

#

NETCDF_LIBS=

NETCDF_INTERFACE=

DEFS_CDF=

#

MPI_INTERFACE=libmpi_f90.a This also normally should not be changed

MPI_INCLUDE=/opt/mpich/pgi62_64/include path to MPI include dir

MPI_LIBS=/opt/mpich/pgi62_64/lib The path where libmpich.a is located.

DEFS_MPI=-DMPI Option for compiling parallel version of SIESTA. Necessary!

#

HOME_LIB=~/siesta/lib The path where the libraries we compiled earlier (BLAS, LAPACK, BLACS, SCALAPACK) are located.

LAPACK_LIBS= -llapack There are two ways of specifying a library: give the full filename, like liblapack.a, or use the form shown here. In the latter case the linker prepends lib, appends the .a (or .so) extension and searches the directories given with -L.

BLAS_LIBS=-lblas

BLACS_LIBS = -lblacsCinit -lblacsF77init -lblacs

SCALAPACK_LIBS = -lscalapack

LIBS= -L$(HOME_LIB) $(SCALAPACK_LIBS) $(BLACS_LIBS) $(LAPACK_LIBS) $(BLAS_LIBS) $(NETCDF_LIBS)

SYS=bsd

DEFS= $(DEFS_CDF) $(DEFS_MPI)

#

.F.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

.F90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

To compile SIESTA, just type make. I expect everything should be OK.
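
For the SIESTA 2.x sources this boils down to putting the arch.make into the Src directory and building there (a sketch only; the directory names are my own layout and the exact procedure depends on the SIESTA version):

cd ~/siesta-2.0/Src    # wherever the SIESTA sources were unpacked
cp ~/arch.make.pgf95 arch.make    # the arch.make shown above
make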

What else? Adding optimization

This should really be done from the very beginning, but an over-optimized SIESTA sometimes doesn't work, so it is safer to compile without any optimization first and make sure that everything works. For that I recommend running a few SCF tasks (without cell relaxation, 0 CG moves) to check that the parallelization works (see the note on benchmarks at the end of the post).
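
A quick sanity check of such a run (the input name and core count below are only placeholders): launch a small job through MPI and look at the top of the output, which states whether the serial or the parallel version is running and on how many nodes.

mpirun -np 4 ./siesta < test.fdf > test.out
head -30 test.out    # look for the "Parallel version" message and the number of nodes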

To optimize, you should replace the optimization flags everywhere (FFLAGS in arch.make, F77FLAGS, OPTS and CCFLAGS in the supporting libraries) and recompile everything from the beginning. For the pgf compiler I recommend using this one:

F77FLAGS = -fastsse

I think this is the best that can be done. On an Intel machine the -O3 -tpp7 (or similar) flags should be used instead; they give the best optimization. You can also try the -fast flag on an Intel machine, where the compiler is supposed to pick the best optimization flags automatically, but it does something wrong with pgf, so never use it with pgf (this is the N-th time I'm repeating this). Some additional changes should also be made in arch.make:

atom.o: This file and the following ones should still be compiled without optimization

$(FC) -c $(FFLAGS_DEBUG) atom.f

#

dfscf.o:

$(FC) -c $(FFLAGS_DEBUG) $(INCFLAGS) dfscf.f

#

#

electrostatic.o:

$(FC) -c $(FFLAGS_DEBUG) electrostatic.f

#

.F.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

.F90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<


Option 2. Linking SIESTA with external libraries (ACML, Intel MKL)

What for? My benchmarks show that using the ACML library on an AMD64 machine noticeably improves the overall SIESTA performance, so it's worth considering external libraries like ACML, Intel MKL, GotoBLAS, etc.

Note: I haven't managed to link Intel MKL so far, but that doesn't mean it can't be done; keep trying. Here I just share my experience of linking ACML. I think other libraries can be linked in the same way.

What is needed?

  • The ACML library installed. If you don't have it, just download and unpack it and run the install-blablabla.sh script, which does everything automatically; you only have to provide the path where it should be installed (a short sketch follows this list). I assume my ACML is installed in ~/acml.

  • BLACS precompiled according to the instructions in the Option 1 chapter.

  • The sources of SCALAPACK and, of course, SIESTA; these two have to be recompiled with ACML support.

  • No LAPACK/BLAS are needed at this time, because ACML will replace them.
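
The installation mentioned in the first bullet is roughly the following (the names below are placeholders; use whatever your ACML download is actually called):

tar xzf acml-X.Y.Z-pgi64.tgz    # placeholder archive name
./install-acml-X.Y.Z-pgi64.sh    # placeholder installer name; it asks for the target path, e.g. ~/acml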

SCALAPACK

Compared to the previous chapter, only one change has to be made in the SLmake.inc file, namely the line that points to the BLAS library:

BLASLIB = ~/acml/pgi64/lib/libacml.a The full path to libacml.a. Make sure you pick the correct version (32- or 64-bit) compatible with your compiler.
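
Then rebuild SCALAPACK from scratch against ACML and put the result next to the other libraries:

cd ~/scalapack-1.7.5
make clean
make
cp libscalapack.a ~/siesta/lib/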

SIESTA

Here I'll give you my arch.make file again, but comments are given only for the lines that have changed.

SIESTA_ARCH=pgf95-matterhorn

#

FC=pgf95

FC_ASIS=$(FC)

#

FFLAGS= -fastsse -Wl,-R/opt/pgi/linux86-64/6.2/libso/ -Wl,-R~/acml/pgi64/lib/

You see that additional flags are being passed to the compiler: -fastsse turns on the optimization, while the two -Wl,-R options embed run-time search paths for the PGI and ACML shared libraries, both of which are necessary for correct linking.

FFLAGS_DEBUG= -g -O0

LDFLAGS= -fastsse -Bstatic The second flag tells the linker to link the libraries statically, so you won't have to set LD_LIBRARY_PATH before running SIESTA. If you have trouble linking, try removing the -Bstatic flag and recompiling, but in that case you have to execute

export LD_LIBRARY_PATH=/opt/pgi/linux86-64/6.2/libso

before SIESTA run or otherwise you'll get errors like: Cannot load shared libraries XXX.so

Note that the -Bstatic flag is called -static for ifort. The linking can also crash for the Intel compiler with this flag, but with pgf it works fine.

RANLIB=echo

COMP_LIBS= #dc_lapack.a Do not use SIESTA's internal LAPACK!

MPI_ROOT=/opt/mpich/pgi62_64/ch_p4

MPI_LIBS= -L$(MPI_ROOT)/lib -lmpich

ACML = -mp -L~/acml/pgi64/lib -lacml -Wl,-R/opt/pgi/linux86-64/6.2/libso/ -Wl,-R~/acml/pgi64/lib/

The first part is the path to ACML, and the two -Wl,-R options simply repeat the compiler flags. I don't know exactly why -mp is needed here (it is PGI's OpenMP flag).

MPI_INTERFACE=libmpi_f90.a

MPI_INCLUDE=/opt/mpich/pgi62_64/ch_p4/include

DEFS_MPI=-DMPI

#

SCALAPACK = -L~/siesta/lib -lscalapack

BLACS = -L~/siesta/lib -lblacs -lblacsF77init -lblacsCinit -lblacsF77init -lblacs

NUMA = -L/opt/pgi/linux86-64/6.2/libso -lnuma This library is needed for static linking. If you link dynamically (no -Bstatic flag) you do not need this library.

LIBS= $(SCALAPACK) $(BLACS) $(ACML) $(MPI_LIBS) $(NUMA)

SYS=cpu_time

DEFS= $(DEFS_CDF) $(DEFS_MPI)

#

#

# Important (at least for V5.0-1 of the pgf90 compiler...)

# Compile atom.f and electrostatic.f without optimization.

#

atom.o:

$(FC) -c $(FFLAGS_DEBUG) atom.f

#

dfscf.o:

$(FC) -c $(FFLAGS_DEBUG) $(INCFLAGS) dfscf.f

#

#

electrostatic.o:

$(FC) -c $(FFLAGS_DEBUG) electrostatic.f

#

.F.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<

.F90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $(DEFS) $<

.f90.o:

$(FC) -c $(FFLAGS) $(INCFLAGS) $<
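
Once the binary is built, a quick way to verify the linking (ldd is a standard Linux tool; the binary is produced as siesta in the build directory):

ldd ./siesta    # "not a dynamic executable" means -Bstatic did its job
ldd ./siesta | grep -i acml    # for a dynamic build, ACML should appear in this list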

So, voila! That's all I can tell you about compiling SIESTA in parallel. I've also done some benchmarks, but decided not to include them here. Just ask me in the comments if you want a PDF file of these notes together with the parallelization benchmark tests.

10 comments:

  1. Hi,
    Thanks,
    It was very useful for me.
    Javad

  2. I have a script which can compile and
    install parallel SIESTA
    on cluster with any architecture automatically,
    one-click procedure. Unfortunately I have no time
    to finish description of this script...

    Michel

  3. Dear Michel,
    I don't think that such a script will work on all architectures, but it will certainly help.

    I'd be grateful if you share such a script e.g. by posting it to the SIESTA mail list.

  4. Hi Michel,

    Could you also post the details to compile Siesta 3.0 beta ? The information you have given are very useful.

    Thank you

    - Madan Mithra

  5. Dear Madan Mithra,
    Unfortunately, I haven't tried SIESTA 3.0 yet. But I think that most of these instructions are applicable to this version.

    I advise you to try it and to post questions on the SIESTA mail list.

  6. Thanks for the reply Mr.noddeat. I will post it on the SIESTA mail list. However, just for your information, the installation method is slightly different for Ver.3.0 compared to the previous releases, making it fail in systems using older version of 'make'. I have also found that the mailing list does not get updated recently.

    Thank you

    - Madan Mithra

  7. We've compiled Siesta-2.0.2 with Intel compilers, MKL and Open MPI (the cluster has infiniband hardware). The siesta jobs run fine on single node(8 cores) But when the job goes beyond one node, there is no scaling. In fact the 2nodes(16 cores) takes more, much more time than single node job. The same setup we tried with mvapich2 also. In that case, the jobs which run on 2 nodes endup with MPI communication error.
    Can you please help us to resolve this issue.
    My email id is: sangamesh.banappa@locuz.com

  8. I have installed siesta-3.0-beta with MKL support and ifort compiler in an itanium cluster... I had some problems with the intel 11.X.XXX family compiler (an internal bug) but with the family 10.X.XXX there was no problem.... except fot the associated TESTS, this is the repetitive error message:

    Running h2o test....
    ERROR STOP from Node: 0
    abort pº^D

    Have anybody any idea?

  9. panfidalgo,
    I'm sorry, but I haven't tried SIESTA 3.0 myself. Also, there is too little information to diagnose the message. The problem may be in the setup of your cluster, not in SIESTA. Please examine the output files carefully.

    Sangamesh,
    Please check if you really have a parallel version of SIESTA running. At the beginning of output you have a message: Serial version or Parallel version.

  10. I would like to compile siesta on a machine IBM AIX 5.3,so how I can do it?

