Compiling and Running Code

Getting Started

Toolchain selection

For developing code on Grace, we recommend using the Intel software stack (often referred to as a "toolchain" here at HPRC). This includes the Intel compilers (icc/icpc/ifort), the Intel Math Kernel Library (MKL), and Intel MPI.

We highly recommend users select a particular toolchain and stick with modules that use it. At present, we support the following toolchains:

intel - Described above
iimpi - Intel compilers and Intel MPI (without MKL)
foss - Open-Source Software stack (GCC/OpenMPI/BLAS/LAPACK/etc)

Detailed information about each of the currently supported toolchain releases can be found on our Toolchains page. Toolchains, like software packages on our clusters, are organized using the Modules System. You can load a toolchain with the following command:

[ username@grace ~]$ module load [toolchain Name]

Important Note: Do NOT mix modules from different toolchains. Remember to ALWAYS purge all modules when switching toolchains.
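A clean switch between toolchains might look like the sketch below. It assumes the module command provided by the Modules System is available in your shell. The name switch_toolchain is a hypothetical helper used only for illustration, and intel/2022a is simply the version that appears elsewhere on this page.

```shell
# Hypothetical helper (not an HPRC-provided command): start from a clean
# module environment, then load exactly one toolchain.
switch_toolchain() {
    local toolchain="$1"
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found (are you on a cluster node?)" >&2
        return 1
    fi
    module purge                          # unload everything from the previous toolchain
    module load "$toolchain" || return 1  # e.g. intel/2022a
    module list                           # confirm what is loaded now
}
```

Calling switch_toolchain intel/2022a purges whatever was loaded before, loads the requested toolchain, and lists the result, so modules from two toolchains never end up mixed.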

More information on using the Modules System can be found on our Modules System page.

Using the intel toolchain

After initializing the compiler environment, you can use the "man" command to obtain a complete list of the available compilation options for the language you plan to use. The following three commands will provide information on the C, C++, and Fortran compilers, respectively.

[ username@grace ~]$ man icc
[ username@grace ~]$ man icpc
[ username@grace ~]$ man ifort

Each compiler requires appropriate file name extensions. These extensions are meant to identify files with different programming language contents, thereby enabling the compiler script to hand these files to the appropriate compiling subsystem: preprocessor, compiler, linker, etc. See the table below for valid extensions for each language.

Extension	Compiler	Description
.c	icc	C source code passed to the compiler.
.C, .CC, .cc, .cpp, .cxx	icpc	C++ source code passed to the compiler.
.f, .for, .ftn	ifort	Fixed form Fortran source code passed to the compiler.
.fpp	ifort	Fortran fixed form source code that can be preprocessed by the Intel Fortran preprocessor fpp.
.f90	ifort	Free form Fortran 90/95 source code passed to the compiler.
.F	ifort	Fortran fixed form source code, will be passed to preprocessor (fpp) and then passed to the Fortran compiler.
.o	icc/icpc/ifort	Compiled object file--generated with the -c option--passed to the linker.

Basic Valid File Extensions

Note: The icpc command ("C++" compiler) uses the same compiler options as the icc ("C" compiler) command. Invoking the compiler using icpc compiles '.c' and '.i' files as C++. Invoking the compiler using icc compiles '.c' and '.i' files as C. Using icpc always links in C++ libraries. Using icc only links in C++ libraries if C++ source is provided on the command line.

Compiling

Invoking the compiler

To compile your program and create an executable you need to invoke the correct compiler. The default output file name is a.out but this can be changed using the -o compiler flag. All compilers are capable of preprocessing, compiling, assembling, and linking. See the table below for the correct compiler commands for the different languages.

Language	Compiler	Syntax
C	icc	icc *[c compiler_flags]* file1 [ file2 ]...
C++	icpc	icpc *[c++ compiler_flags]* file1 [ file2 ]...
F90	ifort	ifort *[fortran compiler_flags]* file1 [ file2 ]...
F77	ifort	ifort *[fortran compiler_flags]* file1 [ file2 ]...

In the table above, fileN is an appropriate source file, assembly file, object file, object library, or other linkable file.

Basic compiler flags

The next sections introduce some of the most common compiler flags. These flags are accepted by all compilers (icc/icpc/ifort) with some notable exceptions. For a full description of all the compiler flags please consult the appropriate man pages.

Flag	Description
-help [category]	Displays all available compiler options, or only the options in the given category.
-o	Specifies the name of the output file. For an executable, this name is used instead of a.out.
-c	Compiles the file only; the linking phase is skipped.
-L <dir>	Tells the linker to search for libraries in directory <dir> ahead of the standard library directories.
-l <name>	Tells the linker to search for a library named lib<name>.so or lib<name>.a.

Optimization flags

The default optimization level for Intel compilers is -O2 (which enables optimizations like inlining, constant/copy propagation, loop unrolling, peephole optimizations, etc.). The table below shows some additional commonly used optimization flags that can improve run time.

Flag	Description
-O3	Performs -O2 optimizations and enables more aggressive loop transformations.
-xHost	Tells the compiler to generate vector instructions for the highest instruction set available on the host machine.
-fast	Convenience flag. On Linux this is a shortcut for -ipo, -O3, -no-prec-div, -static, and -xHost.
-ip	Performs inter-procedural optimization within the same file.
-ipo	Performs inter-procedural optimization between files.
-parallel	Enables automatic parallelization by the compiler (very conservative).
-opt-report=[n]	Generates an optimization report. n represents the level of detail (0 to 3, 3 being most detailed).
-vec-report[=n]	Generates a vectorization report. n represents the level of detail (0 to 7, 7 being most detailed).

NOTE: there is no guarantee that using a combination of the flags above will provide additional speedup compared to -O2. In some rare cases (e.g. floating point imprecision) using flags like -fast might result in code that produces incorrect results.

Debugging flags

The table below shows some compiler flags that can be useful in debugging your program.

Flag	Description
-g	Produces symbolic debug information in the object file.
-warn	Specifies diagnostic messages to be issued by the compiler.
-traceback	Tells the compiler to generate extra information in the object file to provide source file traceback information when a severe error occurs at run time.
-check	Checks for certain conditions at run time (e.g. uninitialized variables, array bounds). Note, since the resulting code includes additional run time checks it may affect run time significantly. THIS IS AN IFORT ONLY FLAG
-fpe0	Throws an exception for invalid operations, overflow, and division by zero. THIS IS AN IFORT ONLY FLAG

Flags affecting floating point operations

Some optimizations can change how floating point arithmetic is performed, which may result in round-off errors in certain cases. The table below shows a number of flags that instruct the compiler how to deal with floating point operations:

Flag	Description
-fp-model precise	disable optimizations that are not value safe on floating point data (See man page for other options)
-fltconsistency	enables improved floating-point consistency. This might slightly reduce execution speed.
-fp-speculation=strict	tells the compiler to disable speculation on floating-point operations (See man page for other options)
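Why the order of operations matters can be seen without any compiler at all. The sketch below uses awk, which computes in double precision, to add the same three numbers in two different orders. The values are chosen purely for illustration. An optimizer that is allowed to reassociate additions may effectively turn one form into the other, which is the kind of value-unsafe transformation that -fp-model precise disables.

```shell
# Add 1e16, -1e16 and 1 in two different orders using double precision.
# Mathematically both sums equal 1, but in the second form the small term
# is lost: -1e16 + 1 is not representable as a double and rounds to -1e16.
result=$(awk 'BEGIN {
    a = 1e16; b = -1e16; c = 1
    left  = (a + b) + c    # 0 + 1
    right = a + (b + c)    # 1e16 + (-1e16)
    printf "%g %g", left, right
}')
echo "(a+b)+c and a+(b+c): $result"
```

On a typical x86-64 system the first sum comes out as 1 and the second as 0, even though both expressions are algebraically identical.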

Examples

Several examples of compile commands are listed below.

Example 1: Compile a program consisting of C source files and an object file.

[ username@grace ~]$ icc objfile.o subroutine.c main.c

Example 2: Compile and link source files and an object file, rename the output myprog.x

[ username@grace ~]$ icc -o myprog.x subroutine.c myobjs.o main.c

Example 3: Compile and link source file and library libmyutils.so residing in directory mylibs

[ username@grace ~]$ icc -L mylibs -lmyutils main.c

Example 4: Compile and link program with aggressive optimization enabled using latest vector instructions and printing an optimization report.

[ username@grace ~]$ icc -fast -xHost -opt-report -o myprog.x myprog.c
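The -c flag from the basic flags table allows a build to be split into separate compile and link steps, which is how object files like the ones in Examples 1 and 2 are typically produced. The sketch below reuses the file names from those examples. The name build_in_steps is a hypothetical helper, and the CC variable is an assumption that lets you substitute another compiler.

```shell
# Hypothetical helper: compile each source file to an object file, then link.
build_in_steps() {
    local cc="${CC:-icc}"                   # Intel C compiler unless CC is set
    "$cc" -c subroutine.c || return 1       # writes subroutine.o, no linking
    "$cc" -c main.c       || return 1       # writes main.o, no linking
    "$cc" -o myprog.x main.o subroutine.o   # link the objects into myprog.x
}
```

After editing only main.c, repeating the second and third commands is enough; subroutine.o does not need to be rebuilt.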

OpenMP Programs

Compiling OpenMP code

To compile a program containing OpenMP parallel directives, use the following flags to create multi-threaded versions:

Flag	Description
-qopenmp	Enables parallelizer to generate multi-threaded code.
-qopenmp-stubs	Enables compilation of OpenMP programs in sequential mode.

Examples:

[ username@grace ~]$ icc -qopenmp -o myprog.x myprog.c
[ username@grace ~]$ ifort -qopenmp -o myprog.x myprog.f90
[ username@grace ~]$ ifort -qopenmp-stubs -o myprog.x myprog.f90

Running OpenMP code

The table below shows some of the more common environmental variables that can be used to affect OpenMP behavior at run time.

Environment Variable	Example	Example-Purpose	Default value
OMP_NUM_THREADS=n[,m]*	OMP_NUM_THREADS=8	Sets the maximum number of threads per nesting level to 8.	1
OMP_STACKSIZE=size[B|K|M|G]	OMP_STACKSIZE=8M	Sets the size for the private stack of each worker thread to 8MB. Possible unit suffixes are B (bytes), K (KB), M (MB), and G (GB).	4M
OMP_SCHEDULE=type[,chunk]	OMP_SCHEDULE=DYNAMIC	Sets the default run-time schedule type to DYNAMIC. Possible values for type are STATIC, DYNAMIC, GUIDED, and AUTO.	STATIC
OMP_DYNAMIC	OMP_DYNAMIC=true	Enable dynamic adjustment of number of threads.	false
OMP_NESTED	OMP_NESTED=true	Enable nested OpenMP regions.	false
OMP_DISPLAY_ENV=val	OMP_DISPLAY_ENV=VERBOSE	Instruct the OpenMP runtime to display OpenMP version and environmental variables in verbose form. Possible values are TRUE, FALSE, VERBOSE.	FALSE

Examples

Example 1: set the number of threads to 8 and set the stack size for each worker thread to 16MB. Note: insufficient stack size is a common cause of run-time crashes in OpenMP programs.

[ username@grace ~]$ export OMP_NUM_THREADS=8
[ username@grace ~]$ export OMP_STACKSIZE=16M
[ username@grace ~]$ ./myprog.x

Example 2: enable nested parallel regions and set the number of threads to use for first nesting level to 4 and second nesting level to 2.

[ username@grace ~]$ export OMP_NESTED=true
[ username@grace ~]$ export OMP_NUM_THREADS=4,2
[ username@grace ~]$ ./myprog.x

Example 3: set maximum number of threads to use to 16, but let run time decide how many threads will actually be used in order to optimize the use of system resources.

[ username@grace ~]$ export OMP_DYNAMIC=true
[ username@grace ~]$ export OMP_NUM_THREADS=16
[ username@grace ~]$ ./myprog.x

Example 4: change the default scheduling type to dynamic with chunk size of 100.

[ username@grace ~]$ export OMP_SCHEDULE="dynamic,100"
[ username@grace ~]$ export OMP_NUM_THREADS=16
[ username@grace ~]$ ./myprog.x

Advanced OpenMP

The following table shows some more advanced environmental variables that can be used to control where OpenMP threads will actually be placed.

Env var	Description	Default value
KMP_AFFINITY	binds OpenMP threads to physical threads.
OMP_PLACES	Defines an ordered list of places where threads can execute. Every place is a set of hardware (HW) threads. Places can be given as an explicit list described by nonnegative numbers or as an abstract name. The abstract name can be 'threads' (every place consists of exactly one HW thread), 'cores' (every place contains all the HW threads of a core), or 'sockets' (every place contains all the HW threads of a socket).	'threads'
OMP_PROC_BIND	Sets the thread affinity policy to be used for parallel regions at the corresponding nesting level. Acceptable values are true, false, or a comma separated list, each element of which is one of the following values: master (all threads will be bound to same place as master thread), close (all threads will be bound to successive places close to place of master thread), spread (all threads will be distributed among the places evenly). NOTE: if both OMP_PROC_BIND and KMP_AFFINITY are set the latter will take precedence	'false'

Example 1: Consider a node with two sockets, each with 8 cores. For a program with two levels of nested parallelism, place the outer-level threads on different sockets and keep the inner-level threads on the same socket as their master thread.

[ username@grace ~]$ export OMP_NESTED=true
[ username@grace ~]$ export OMP_NUM_THREADS=2,8
[ username@grace ~]$ export OMP_PLACES="sockets"
[ username@grace ~]$ export OMP_PROC_BIND="spread,master"
[ username@grace ~]$ ./myprog.x

MPI Programs

There are multiple MPI stacks installed on HPRC clusters: OpenMPI and Intel MPI. The recommended MPI stack for software development is Intel MPI, and most of this section focuses on it.

Intel MPI

To use the Intel MPI environment you need to load the Intel module. This can be done with the following command:

[ username@grace ~]$ module load intel/2022a

Note: It is no longer possible to load the default intel module. You must specify the version you are loading for the sake of consistency and clarity. More information about finding and loading modules can be found on our Modules System page.

Compiling MPI Code

To compile MPI code, an MPI compiler wrapper is used. The wrapper calls the appropriate underlying compiler with additional linker flags specific to MPI programs. The Intel MPI software stack has wrappers for the Intel compilers as well as wrappers for the GNU compilers. Any argument not recognized by the wrapper is passed to the underlying compiler. Therefore, any valid compiler flag (Intel or GNU) will also work when using the MPI wrappers.

The following table shows the most commonly used MPI wrappers used by Intel MPI.

MPI Wrapper	Compiler	Language	Example
mpiicc	icc	C	mpiicc <compiler_flags> prog.c
mpicc	gcc	C	mpicc <compiler_flags> prog.c
mpiicpc	icpc	C++	mpiicpc <compiler_flags> prog.cpp
mpicxx	g++	C++	mpicxx <compiler_flags> prog.cpp
mpiifort	ifort	Fortran	mpiifort <compiler_flags> prog.f90
mpif90	gfortran	Fortran	mpif90 <compiler_flags> prog.f90

To see the full compiler command of any of the MPI wrapper scripts, use the -show flag. This flag does not actually call the compiler; it only prints the full compiler command and exits. This can be useful for debugging and when experiencing problems with any of the compiler wrappers.

Example: Show the full compiler command for the mpiifort wrapper script

[ username@grace ~]$ mpiifort -show   
ifort -I/software/easybuild/software/impi/4.1.3.049/intel64/include -I/software/easybuild/software/impi/4.1.3.049/intel64/include 
-L/software/easybuild/software/impi/4.1.3.049/intel64/lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker   
/software/easybuild/software/impi/4.1.3.049/intel64/lib -Xlinker -rpath -Xlinker 
-opt/intel/mpi-rt/4.1 -lmpigf -lmpi -lmpigi -ldl -lrt -lpthread

Running MPI Code

Running MPI code requires an MPI launcher. The latter will setup the environment and start the requested number of MPI tasks on the needed nodes.

Use the following command to launch an MPI program, where [mpi_flags] are options passed to the MPI launcher, <executable> is the name of the MPI program, and [executable params] are optional parameters for the MPI program. (We continue to assume use of the Intel MPI stack here.)

[ username@grace ~]$ mpirun [mpi_flags] <executable> [executable params]

Note: <executable> must be on the $PATH otherwise the launcher will not be able to find the executable.

For a list of the most common mpi_flags, see the table below. This table shows only a very small subset of all flags. To see a full listing, type mpirun --help.

Flag	Description
-np <n>	The number of mpi tasks to start.
-n <n>	The number of mpi tasks to start (same as -np).
-perhost <n>	Places <n> consecutive (MPI) processes on each host/node.
-ppn <n>	Stands for Process (i.e., task) Per Node (same as -perhost)
-hostfile <file>	The name of the file that contains the list of host/node names the launcher will place tasks on.
-f <file>	Same as -hostfile
-hosts	comma separated list of specific host/node names.
-help	Shows list of available flags and options

Hybrid MPI/OpenMP Code

To compile hybrid MPI/OpenMP programs (i.e. MPI programs that also contain OpenMP directives), invoke the appropriate MPI wrapper and add the -qopenmp flag to enable processing of OpenMP directives.

Running a hybrid program is very similar to running a pure mpi program. To control the number of OpenMP threads to use per task the OMP_NUM_THREADS environmental variable can be set.
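A common way to choose OMP_NUM_THREADS for a hybrid run is to divide the cores on a node evenly among the MPI tasks placed there. The sketch below shows only that arithmetic. The core and task counts are assumed values for illustration and should be replaced with the figures for your node type and job request.

```shell
# Assumed values for illustration; adjust to the node type and job request.
cores_per_node=48    # assumption: cores available to the job on one node
tasks_per_node=8     # MPI tasks placed on each node

# Give every MPI task an equal share of the cores for its OpenMP threads.
export OMP_NUM_THREADS=$(( cores_per_node / tasks_per_node ))
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

With these values each task gets 6 threads, so 8 tasks per node use all 48 cores without oversubscribing them.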

Advanced: mapping tasks and threads

Explicitly mapping MPI tasks to processors can result in significantly better performance. This is especially true for hybrid MPI/OpenMP programs, where both MPI tasks and OpenMP threads are mapped onto the available cores of a node. The Intel MPI stack provides a way to control the pinning of MPI tasks using the environmental variable 'I_MPI_PIN_DOMAIN'.

[ username@grace ~]$ export I_MPI_PIN_DOMAIN=<domain>

Where <domain> can have the following values: node, socket, core, cache1, cache2, cache3. The domain tells where to pin the tasks. For example "socket" will pin the tasks on different sockets. To map the OpenMP threads the affinity setting for OpenMP will be used.

NOTE: the above syntax is just one way to describe the pinning. Please visit the Process Pinning documentation or the Intel MPI reference (see Further Information section for link) for alternative ways to pin tasks using the I_MPI_PIN_DOMAIN environmental variable.

Examples

In this section are various examples for compiling and running MPI programs with the Intel toolchain.

Example 1: Compile MPI program written in C and name it mpi_prog.x. Use the underlying Intel compiler with -O3 optimization.

[ username@grace ~]$ mpiicc -o mpi_prog.x -O3 mpi_prog.c

Example 2: Compile an MPI program written in Fortran and name it mpi_prog.x, this time using the underlying GNU Fortran compiler.

[ username@grace ~]$ mpif90 -o mpi_prog.x mpi_prog.f90

Example 3: Run mpi program on local host using 4 tasks.

[ username@grace ~]$ mpirun -np 4 mpi_prog.x

Example 4: Run mpi program on a specific host using 4 tasks.

[ username@grace ~]$ mpirun -np 4 -hosts login1 mpi_prog.x

Example 5: Run an MPI program with 4 tasks on two different hosts listed in a host file, assigning tasks in round-robin fashion.

[ username@grace ~]$ mpirun -np 4 -perhost 1 -hostfile mylist mpi_prog.x

where mylist is a file that contains the following lines:

login1
login2

Note: If you don't specify -perhost, all the tasks will be started on login1, even though the hostfile contains multiple entries.
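The host file from Example 5 can be created from the shell, and the round-robin placement can be illustrated with a short loop. The loop below is only a model of the placement described above (one task per host in turn). It does not call mpirun, and what it prints is not launcher output.

```shell
# Create the two-line host file used in Example 5.
printf '%s\n' login1 login2 > mylist

# Illustration only: show where each of the 4 ranks lands when tasks are
# handed out one per host in turn.
nhosts=$(wc -l < mylist)
rank=0
while [ "$rank" -lt 4 ]; do
    line=$(( rank % nhosts + 1 ))       # host file line for this rank
    host=$(sed -n "${line}p" mylist)
    echo "rank $rank -> $host"
    rank=$(( rank + 1 ))
done
```

Ranks 0 and 2 are shown on login1 and ranks 1 and 3 on login2, which is the alternating pattern that -perhost 1 requests.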

Example 6: Run 4 different programs concurrently using mpirun (MPMD style program).

[ username@grace ~]$ mpirun -np 1 prog1.x : -np 1 prog2.x : -np 1 prog3.x : -np 1 prog4.x

Note: For executing a large number of serial (or OpenMP) programs we recommend using the tamulauncher utility.

Example 7: Compile an MPI Fortran program named hybrid.f90 that also contains OpenMP directives, using the underlying Intel Fortran compiler.

[ username@grace ~]$ mpiifort -qopenmp -o hybrid.x hybrid.f90

Example 8: Run the hybrid program named hybrid.x using 8 tasks where every task will use 2 threads in its OpenMP regions.

[ username@grace ~]$ export OMP_NUM_THREADS=2
[ username@grace ~]$ mpirun -np 8 ./hybrid.x

Example 9: Run a hybrid MPI/OpenMP program using 2 tasks with 10 threads each, pin the tasks to different sockets, and keep all OpenMP threads of a task within its socket.

[ username@grace ~]$ export I_MPI_PIN_DOMAIN=socket
[ username@grace ~]$ export OMP_NUM_THREADS=10
[ username@grace ~]$ export OMP_PLACES="sockets"
[ username@grace ~]$ export OMP_PROC_BIND="master"
[ username@grace ~]$ mpirun -np 2 ./hybrid.x

Further Information

For a detailed description of the Intel MPI stack, please visit the Intel MPI Developer Reference Manual. This site contains detailed information about the MPI compiler wrappers, an in-depth discussion of mpirun and its options, and guidance on tuning your application for best performance and pinning tasks.

OpenMPI

Using OpenMPI is very similar to using Intel MPI, with a few minor differences. To use OpenMPI you will need to load one of the OpenMPI modules. HPRC has OpenMPI versions built with the Intel compilers as well as the GNU compilers. The underlying compiler depends on the loaded OpenMPI module.

Example 1: Load OpenMPI version 4.1.4 with GCC dependency 11.3.0.

[ username@grace ~]$ module load GCC/11.3.0 OpenMPI/4.1.4

To see a list of all available OpenMPI versions type:

[ username@grace ~]$ module spider openmpi

Compiling

The table below shows the various mpi compiler wrappers. The names will be the same regardless of the underlying compiler.

MPI wrapper	Language	Example
mpicc	C	mpicc <compiler_flags> prog.c
mpic++	C++	mpic++ <compiler_flags> prog.cpp
mpif90	Fortran	mpif90 <compiler_flags> prog.f90

To see the complete compiler command use the -show flag.

Running

To launch an MPI program you will use the mpirun command. This command is very similar to the Intel MPI mpirun launcher discussed above. However, some of the flags are different for OpenMPI. The table below shows some of the more common flags.

Flag	Description
-np <n>	The number of mpi tasks to start.
-npernode <n>	Places <n> (MPI) processes per node on each allocated node.
-npersocket <n>	Places <n> (MPI) processes per socket on each allocated node.
-hostfile <file>	The name of the file that contains the list of host/node names the launcher will place tasks on.
-host	comma separated list of specific host/node names.

To see all the available options and flags (including short descriptions) use the following command:

[ username@grace ~]$ mpirun -help

CUDA Programming

In order to compile, run, and debug CUDA programs, a CUDA module must be loaded:

[ username@grace ~]$ module load CUDA/12.0

For more information on the modules system, please see our Modules System page.

Compiling CUDA C/C++ with NVIDIA nvcc

The compiler nvcc is the NVIDIA CUDA C/C++ compiler. The command line for invoking it is:

[ username@grace ~]$ nvcc [options] -o cuda_prog.exe file1 file2 ...

where file1, file2, ... are any appropriate source, assembly, object, object library, or other (linkable) files that are linked to generate the executable file cuda_prog.exe.

By default, nvcc will use gcc to compile your source code. However, it is better to use the Intel compiler by adding the flag -ccbin=icc to your compile command.

For more information on nvcc, please refer to the online manual.

Running CUDA Programs

Only some of the login nodes have GPUs installed, so before running GPU code on a login node, make sure that node has a GPU. To check the current load on the device, run the NVIDIA System Management Interface program, nvidia-smi. This command shows which GPU device your code is running on, how much memory is in use on the device, and the GPU utilization.

[ username@grace ~]$ nvidia-smi  
Sun Apr 23 22:20:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   27C    P0    30W / 250W |      5MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You can test your CUDA program on the login node as long as you abide by the rules stated in Computing Environment. For production runs, you should submit a batch job to run your code on the compute nodes. In order to be placed on GPU nodes with available GPUs, a job needs to request them with the following two lines in a job file.

#SBATCH --gres=gpu:1 #Request 1 GPU
#SBATCH --partition=gpu #Request the GPU partition/queue

For more detailed information about requesting specific types of GPUs, check out the Batch Optional Job Specification section.

Debugging CUDA Programs

CUDA programs must be compiled with "-g -G" to force O0 optimization and to generate code with debugging information. To generate debugging code for a GPU of compute capability 7.0 (the target selected by the flags in this example), compile and link the code with the following:

[ username@grace ~]$ nvcc -g -G -arch=compute_70 -code=sm_70 cuda_prog.cu -o cuda_prog.out

For more information on cuda-gdb, please refer to its online manual.

Misc

GNU gcc and Intel C/C++ Interoperability

C++ compilers are interoperable if they can link object files and libraries generated by one compiler with object files and libraries generated by the second compiler, and the resulting executable runs successfully. Some GNU gcc versions are interoperable and some are not. By default, the Intel compiler will generate code that is interoperable with the version of gcc it finds on your system.

The Intel C++ Compiler options that affect GNU gcc interoperability include:

-cxxlib
-gcc-name
-gcc-version
-gxx-name
-fabi-version
-no-gcc (see gcc Predefined Macros for more information)

The Intel C++ Compiler is interoperable with GNU gcc compiler versions greater than or equal to 3.2. For more information, see the Intel C++ Compiler documentation on the Intel Software Documentation page.