Compiling and Running Code
Getting Started
Toolchain selection
For developing code on ACES we recommend using the intel software stack (often referred to as a "toolchain" here at HPRC). This includes the Intel compilers (icc/icpc/ifort), the Intel Math Kernel Library (MKL), and Intel MPI.
We highly recommend users select a particular toolchain and stick with modules that use it. At present, we support the following toolchains:
- intel - Described above
- iimpi - Intel compilers and Intel MPI (the intel toolchain without MKL)
- foss - Open-Source Software stack (GCC/OpenMPI/BLAS/LAPACK/etc)
Detailed information about each of the currently supported toolchain releases can be found on our Toolchains page. Toolchains, like software packages on our clusters, are organized using the Modules System. You can load a toolchain with the following command:
[ username@aces ~]$ module load [toolchain Name]
Important Note: Do NOT mix modules from different toolchains. Remember to ALWAYS purge all modules when switching toolchains.
More information on using the Modules System can be found on our Modules System page.
Using the intel toolchain
After initializing the compiler environment, you can use the "man" command to obtain a complete list of the available compilation options for the language you plan to use. The following three commands will provide information on the C, C++, and Fortran compilers, respectively.
[ username@aces ~]$ man icc
[ username@aces ~]$ man icpc
[ username@aces ~]$ man ifort
Each compiler requires appropriate file name extensions. These extensions are meant to identify files with different programming language contents, thereby enabling the compiler script to hand these files to the appropriate compiling subsystem: preprocessor, compiler, linker, etc. See the table below for valid extensions for each language.
Extension | Compiler | Description |
---|---|---|
.c | icc | C source code passed to the compiler. |
.C, .CC, .cc, .cpp, .cxx | icpc | C++ source code passed to the compiler. |
.f, .for, .ftn | ifort | Fixed form Fortran source code passed to the compiler. |
.fpp | ifort | Fortran fixed form source code that can be preprocessed by the Intel Fortran preprocessor fpp. |
.f90 | ifort | Free form Fortran 90/95 source code passed to the compiler. |
.F | ifort | Fortran fixed form source code, will be passed to preprocessor (fpp) and then passed to the Fortran compiler. |
.o | icc/icpc/ifort | Compiled object file--generated with the -c option--passed to the linker. |
Basic Valid File Extensions
Note: The icpc command ("C++" compiler) uses the same compiler options as the icc ("C" compiler) command. Invoking the compiler using icpc compiles '.c', and '.i' files as C++. Invoking the compiler using icc compiles '.c' and '.i' files as C. Using icpc always links in C++ libraries. Using icc only links in C++ libraries if C++ source is provided on the command line.
Compiling
Invoking the compiler
To compile your program and create an executable you need to invoke the correct compiler. The default output file name is a.out but this can be changed using the -o compiler flag. All compilers are capable of preprocessing, compiling, assembling, and linking. See the table below for the correct compiler commands for the different languages.
Language | Compiler | Syntax |
---|---|---|
C | icc | icc [c compiler_flags] file1 [ file2 ]... |
C++ | icpc | icpc [c++ compiler_flags] file1 [ file2 ]... |
F90 | ifort | ifort [fortran compiler_flags] file1 [ file2 ]... |
F77 | ifort | ifort [fortran compiler_flags] file1 [ file2 ]... |
In the table above, fileN is an appropriate source file, assembly file, object file, object library, or other linkable file.
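For illustration, consider a minimal C source file (the file name hello.c is just a placeholder):

#include <stdio.h>

int main(void) {
    printf("Hello from ACES\n");
    return 0;
}

It could be compiled and run with:

[ username@aces ~]$ icc -o hello.x hello.c
[ username@aces ~]$ ./hello.x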
Basic compiler flags
The next sections introduce some of the most common compiler flags. These flags are accepted by all compilers (icc/icpc/ifort) with some notable exceptions. For a full description of all the compiler flags please consult the appropriate man pages.
Flag | Description |
---|---|
-help [category] | Displays all available compiler options, or only the options in the specified category. |
-o <file> | Specifies the name of the output file. For an executable, the output file is named <file> instead of the default a.out. |
-c | Compile only; the linking phase is skipped. |
-L <dir> | Tells the linker to search for libraries in directory <dir> ahead of the standard library directories. |
-l <name> | Tells the linker to search for the library named lib<name>.so or lib<name>.a. |
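As a short sketch of how -c and -o work together (file names are hypothetical), a program can be compiled to object files first and linked in a separate step:

[ username@aces ~]$ icc -c subroutine.c
[ username@aces ~]$ icc -c main.c
[ username@aces ~]$ icc -o myprog.x main.o subroutine.o

The first two commands only compile (no linking) and produce subroutine.o and main.o; the last command links the object files into the executable myprog.x.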
Optimization flags
The default optimization level for Intel compilers is -O2 (which enables optimizations like inlining, constant/copy propagation, loop unrolling, peephole optimizations, etc.). The table below shows some additional commonly used optimization flags that can improve run time.
Flag | Description |
---|---|
-O3 | Performs -O2 optimizations and enables more aggressive loop transformations. |
-xHost | Tells the compiler to generate vector instructions for the highest instruction set available on the host machine. |
-fast | Convenience flag. On Linux this is a shortcut for -ipo, -O3, -no-prec-div, -static, and -xHost |
-ip | Perform inter-procedural optimization within the same file |
-ipo | Perform inter-procedural optimization between files |
-parallel | Enables automatic parallelization by the compiler (very conservative) |
-opt-report=[n] | Generates an optimization report. n represents the level of detail (0..3, 3 being most detailed) |
-vec-report[=n] | Generates a vectorization report. n represents the level of detail (0..7, 7 being most detailed) |
NOTE: there is no guarantee that using a combination of the flags above will provide additional speedup compared to -O2. In some rare cases (e.g. floating point imprecision) using flags like -fast might result in code that produces incorrect results.
Debugging flags
The table below shows some compiler flags that can be useful in debugging your program.
Flag | Description |
---|---|
-g | Produces symbolic debug information in the object file. |
-warn | Specifies diagnostic messages to be issued by the compiler. |
-traceback | Tells the compiler to generate extra information in the object file to provide source file traceback information when a severe error occurs at run time. |
-check | Checks for certain conditions at run time (e.g. uninitialized variables, array bounds). Note, since the resulting code includes additional run time checks it may affect run time significantly. THIS IS AN IFORT ONLY FLAG |
-fpe0 | Aborts execution when floating-point exceptions (invalid, overflow, divide by zero) occur. THIS IS AN IFORT ONLY FLAG |
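For example, a Fortran program could be compiled with full debugging support as follows (a sketch combining the flags above; the file name is hypothetical):

[ username@aces ~]$ ifort -g -traceback -check -fpe0 -o myprog.x myprog.f90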
Flags affecting floating point operations
Some optimizations might affect how floating-point arithmetic is performed. This might result in round-off errors in certain cases. The table below shows a number of flags that instruct the compiler how to deal with floating-point operations:
Flag | Description |
---|---|
-fp-model precise | Disables optimizations that are not value safe on floating-point data (see man page for other options) |
-fltconsistency | Enables improved floating-point consistency. This might slightly reduce execution speed. |
-fp-speculation=strict | Tells the compiler to disable speculation on floating-point operations (see man page for other options) |
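For example, to keep the default -O2 optimizations while disabling those that are not value safe on floating-point data:

[ username@aces ~]$ icc -O2 -fp-model precise -o myprog.x myprog.c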
Examples
Several examples of compile commands are listed below.
Example 1: Compile a program consisting of C source files and an object file.
[ username@aces ~]$ icc objfile.o subroutine.c main.c
Example 2: Compile and link source files and an object file, rename the output myprog.x
[ username@aces ~]$ icc -o myprog.x subroutine.c myobjs.o main.c
Example 3: Compile and link source file and library libmyutils.so residing in directory mylibs
[ username@aces ~]$ icc -L mylibs -lmyutils main.c
Example 4: Compile and link program with aggressive optimization enabled using latest vector instructions and printing an optimization report.
[ username@aces ~]$ icc -fast -xHost -opt-report -o myprog.x myprog.c
OpenMP Programs
Compiling OpenMP code
To compile a program containing OpenMP parallel directives, the following flags can be used to create multi-threaded versions:
Flag | Description |
---|---|
-qopenmp | Enables parallelizer to generate multi-threaded code. |
-qopenmp-stubs | Enables compilation of OpenMP programs in sequential mode. |
Examples:
[ username@aces ~]$ icc -qopenmp -o myprog.x myprog.c
[ username@aces ~]$ ifort -qopenmp -o myprog.x myprog.f90
[ username@aces ~]$ ifort -qopenmp-stubs -o myprog.x myprog.f90
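For reference, a minimal OpenMP program in C that the commands above could be applied to might look like this (the file name myprog.c is just an illustration):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Every thread in the team executes this block */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

When built with -qopenmp-stubs instead, the OpenMP library calls resolve to their sequential stubs and the program runs with a single thread.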
Running OpenMP code
The table below shows some of the more common environmental variables that can be used to affect OpenMP behavior at run time.
Environment Variable | Example | Example-Purpose | Default value |
---|---|---|---|
OMP_NUM_THREADS=n[,m]* | OMP_NUM_THREADS=8 | Sets the maximum number of threads per nesting level to 8. | 1 |
OMP_STACKSIZE=size[B|K|M|G] | OMP_STACKSIZE=8M | Sets the size of the private stack of each worker thread to 8MB. Possible units are B (bytes), K (KB), M (MB), and G (GB). | 4M |
OMP_SCHEDULE=type[,chunk] | OMP_SCHEDULE=DYNAMIC | Sets the default run-time schedule type to DYNAMIC. Possible values for type are STATIC, DYNAMIC, GUIDED, and AUTO. | STATIC |
OMP_DYNAMIC | OMP_DYNAMIC=true | Enable dynamic adjustment of number of threads. | false |
OMP_NESTED | OMP_NESTED=true | Enable nested OpenMP regions. | false |
OMP_DISPLAY_ENV=val | OMP_DISPLAY_ENV=VERBOSE | Instruct the OpenMP runtime to display OpenMP version and environmental variables in verbose form. Possible values are TRUE, FALSE, VERBOSE. | FALSE |
Examples
Example 1: set the number of threads to 8 and the stack size for worker threads to 16MB. Note: insufficient stack size is a common cause of run-time crashes in OpenMP programs.
[ username@aces ~]$ export OMP_NUM_THREADS=8
[ username@aces ~]$ export OMP_STACKSIZE=16M
[ username@aces ~]$ ./myprog.x
Example 2: enable nested parallel regions and set the number of threads to 4 for the first nesting level and 2 for the second.
[ username@aces ~]$ export OMP_NESTED=true
[ username@aces ~]$ export OMP_NUM_THREADS=4,2
[ username@aces ~]$ ./myprog.x
Example 3: set the maximum number of threads to 16, but let the runtime decide how many threads will actually be used in order to optimize the use of system resources.
[ username@aces ~]$ export OMP_DYNAMIC=true
[ username@aces ~]$ export OMP_NUM_THREADS=16
[ username@aces ~]$ ./myprog.x
Example 4: change the default scheduling type to dynamic with chunk size of 100.
[ username@aces ~]$ export OMP_SCHEDULE="dynamic,100"
[ username@aces ~]$ export OMP_NUM_THREADS=16
[ username@aces ~]$ ./myprog.x
Advanced OpenMP
The following table shows some more advanced environmental variables that can be used to control where OpenMP threads will actually be placed.
Env var | Description | Default value |
---|---|---|
KMP_AFFINITY | Binds OpenMP threads to physical processing units (Intel runtime specific). | |
OMP_PLACES | Defines an ordered list of places where threads can execute. Every place is a set of hardware (HW) threads. Can be defined as an explicit list of places described by nonnegative numbers or as an abstract name. The abstract name can be 'threads' (every place consists of exactly one HW thread), 'cores' (every place contains all the HW threads of a core), or 'sockets' (every place contains all the HW threads of a socket). | 'threads' |
OMP_PROC_BIND | Sets the thread affinity policy to be used for parallel regions at the corresponding nesting level. Acceptable values are true, false, or a comma-separated list, each element of which is one of the following values: master (all threads will be bound to the same place as the master thread), close (all threads will be bound to successive places close to the place of the master thread), spread (all threads will be distributed evenly among the places). NOTE: if both OMP_PROC_BIND and KMP_AFFINITY are set, the latter will take precedence. | 'false' |
Example 1: Suppose a node with two sockets, each with 8 cores, and a program with two nesting levels. Place the outer-level threads on different sockets and the inner-level threads on the same socket as their master thread.
[ username@aces ~]$ export OMP_NESTED=true
[ username@aces ~]$ export OMP_NUM_THREADS=2,8
[ username@aces ~]$ export OMP_PLACES="sockets"
[ username@aces ~]$ export OMP_PROC_BIND="spread,master"
[ username@aces ~]$ ./myprog.x
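For context, a corresponding C program with two nesting levels might look like the following sketch (file and variable names are hypothetical):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Outer level: 2 threads, spread over the two sockets */
    #pragma omp parallel
    {
        int outer = omp_get_thread_num();
        /* Inner level: 8 threads per outer thread, bound to the master's socket */
        #pragma omp parallel
        {
            printf("outer thread %d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }
    }
    return 0;
}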
MPI Programs
There are multiple MPI stacks installed on HPRC clusters: OpenMPI and Intel MPI. The recommended MPI stack for software development is Intel MPI, and most of this section will focus on it.
Intel MPI
To use the Intel MPI environment you need to load the Intel module. This can be done with the following command:
[ username@aces ~]$ module load intel/2022a
Note: It is no longer possible to load a default intel module; for the sake of consistency and clarity, you must specify the version you want to load. More information about finding and loading modules can be found on our Modules Systems page.
Compiling MPI Code
To compile MPI code, an MPI compiler wrapper is used. The wrapper will call the appropriate underlying compiler with additional linker flags specific to MPI programs. The Intel MPI software stack has wrappers for the Intel compilers as well as wrappers for the GNU compilers. Any argument not recognized by the wrapper will be passed to the underlying compiler. Therefore, any valid compiler flag (Intel or GNU) will also work when using the MPI wrappers.
The following table shows the most commonly used MPI wrappers used by Intel MPI.
MPI Wrapper | Compiler | Language | Example |
---|---|---|---|
mpiicc | icc | C | mpiicc <compiler_flags> prog.c |
mpicc | gcc | C | mpicc <compiler_flags> prog.c |
mpiicpc | icpc | C++ | mpiicpc <compiler_flags> prog.cpp |
mpicxx | g++ | C++ | mpicxx <compiler_flags> prog.cpp |
mpiifort | ifort | Fortran | mpiifort <compiler_flags> prog.f90 |
mpif90 | gfortran | Fortran | mpif90 <compiler_flags> prog.f90 |
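For reference, a minimal MPI program in C, such as the mpi_prog.c used in the examples further down, might look like:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from task %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}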
To see the full compiler command of any of the MPI wrapper scripts, use the -show flag. This flag does not actually call the compiler; it only prints the full compiler command and exits. This can be useful for debugging purposes and/or when experiencing problems with any of the compiler wrappers.
Example: Show the full compiler command for the mpiifort wrapper script
[ username@aces ~]$ mpiifort -show
ifort -I/software/easybuild/software/impi/4.1.3.049/intel64/include -I/software/easybuild/software/impi/4.1.3.049/intel64/include
-L/software/easybuild/software/impi/4.1.3.049/intel64/lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker
/software/easybuild/software/impi/4.1.3.049/intel64/lib -Xlinker -rpath -Xlinker
/opt/intel/mpi-rt/4.1 -lmpigf -lmpi -lmpigi -ldl -lrt -lpthread
Running MPI Code
Running MPI code requires an MPI launcher, which will set up the environment and start the requested number of MPI tasks on the needed nodes.
Use the following command to launch an MPI program, where [mpi_flags] are options passed to the MPI launcher:
[ username@aces ~]$ mpirun [mpi_flags] <executable> [executable params]
Note: <executable> must be on the $PATH; otherwise the launcher will not be able to find it.
For a list of the most common mpi_flags, see the table below. This table shows only a very small subset of all flags. To see a full listing, type mpirun --help.
Flag | Description |
---|---|
-np <n> | The number of MPI tasks to start. |
-n <n> | The number of MPI tasks to start (same as -np). |
-perhost <n> | Places <n> consecutive (MPI) tasks on each node. |
-ppn <n> | Stands for Processes (i.e., tasks) Per Node (same as -perhost). |
-hostfile <file> | The name of the file that contains the list of host/node names the launcher will place tasks on. |
-f <file> | Same as -hostfile |
-hosts {host list} | comma separated list of specific host/node names. |
-help | Shows list of available flags and options |
Hybrid MPI/OpenMP Code
To compile hybrid MPI/OpenMP programs (i.e. MPI programs that also contain OpenMP directives), invoke the appropriate MPI wrapper and add the -qopenmp flag to enable processing of the OpenMP directives.
Running a hybrid program is very similar to running a pure MPI program. To control the number of OpenMP threads used per task, set the OMP_NUM_THREADS environmental variable.
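As a sketch, a minimal hybrid program in C (hypothetical file hybrid.c; the examples below use a Fortran version, hybrid.f90) might look like:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, provided;
    /* Request FUNNELED thread support: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Each MPI task spawns its own team of OpenMP threads */
    #pragma omp parallel
    {
        printf("task %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}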
Advanced: mapping tasks and threads
Explicitly mapping MPI tasks to processors can result in significantly better performance. This is especially true for hybrid MPI/OpenMP programs where both MPI tasks and OpenMP threads are being mapped onto the available cores on a node. The Intel MPI stack provides a way to control the pinning of MPI tasks using the environmental variable I_MPI_PIN_DOMAIN.
[ username@aces ~]$ export I_MPI_PIN_DOMAIN=<domain>
Where <domain> can have the following values: node, socket, core, cache1, cache2, cache3. The domain tells where to pin the tasks. For example "socket" will pin the tasks on different sockets. To map the OpenMP threads the affinity setting for OpenMP will be used.
NOTE: the above syntax is just one way to describe the pinning. Please visit the Process Pinning documentation or the Intel MPI reference (see Further Information section for link) for alternative ways to pin tasks using the I_MPI_PIN_DOMAIN environmental variable.
Examples
In this section are various examples for compiling and running MPI programs with the Intel toolchain.
Example 1: Compile MPI program written in C and name it mpi_prog.x. Use the underlying Intel compiler with -O3 optimization.
[ username@aces ~]$ mpiicc -o mpi_prog.x -O3 mpi_prog.c
Example 2: Compile an MPI program written in Fortran and name it mpi_prog.x, this time using the underlying GNU Fortran compiler.
[ username@aces ~]$ mpif90 -o mpi_prog.x mpi_prog.f90
Example 3: Run an MPI program on the local host using 4 tasks.
[ username@aces ~]$ mpirun -np 4 mpi_prog.x
Example 4: Run an MPI program on a specific host using 4 tasks.
[ username@aces ~]$ mpirun -np 4 -hosts login1 mpi_prog.x
Example 5: Run an MPI program with 4 tasks on two different hosts using a host file, assigning tasks in round-robin fashion.
[ username@aces ~]$ mpirun -np 4 -perhost 1 -hostfile mylist mpi_prog.x
where mylist is a file that contains the following lines:
login1
login2
Note: If you don't specify -perhost 1, all the tasks will be started on login1, even though the host file contains multiple entries.
Example 6: Run 4 different programs concurrently using mpirun (MPMD style program).
[ username@aces ~]$ mpirun -np 1 prog1.x : -np 1 prog2.x : -np 1 prog3.x : -np 1 prog4.x
Note: For executing a large number of serial (or OpenMP) programs we recommend using the tamulauncher utility.
Example 7: Compile an MPI Fortran program named hybrid.f90 that also contains OpenMP directives, using the underlying Intel Fortran compiler.
[ username@aces ~]$ mpiifort -qopenmp -o hybrid.x hybrid.f90
Example 8: Run the hybrid program named hybrid.x using 8 tasks where every task will use 2 threads in its OpenMP regions.
[ username@aces ~]$ export OMP_NUM_THREADS=2
[ username@aces ~]$ mpirun -np 8 ./hybrid.x
Example 9: Run a hybrid MPI/OpenMP program using 2 tasks and 10 threads per task, pin the tasks to different sockets, and keep all OpenMP threads of a task within its socket.
[ username@aces ~]$ export I_MPI_PIN_DOMAIN=socket
[ username@aces ~]$ export OMP_NUM_THREADS=10
[ username@aces ~]$ export OMP_PLACES="sockets"
[ username@aces ~]$ export OMP_PROC_BIND="master"
[ username@aces ~]$ mpirun -np 2 ./hybrid.x
Further Information
For a detailed description of the Intel MPI stack, please visit the Intel MPI Developer Reference Manual. This site contains detailed information about the MPI compiler wrappers, an in-depth discussion of mpirun and its options, as well as guidance on tuning your application for best performance and pinning tasks.
OpenMPI
Using OpenMPI is very similar to using Intel MPI, with a few minor differences. To use OpenMPI you will need to load one of the OpenMPI modules. HPRC has OpenMPI versions built with Intel compilers as well as GNU compilers; the underlying compiler depends on the loaded OpenMPI module.
Example 1: Load OpenMPI version 4.1.4 with GCC dependency 11.3.0.
[ username@aces ~]$ module load GCC/11.3.0 OpenMPI/4.1.4
To see a list of all available OpenMPI versions type:
[ username@aces ~]$ module spider openmpi
Compiling
The table below shows the various mpi compiler wrappers. The names will be the same regardless of the underlying compiler.
MPI wrapper | Language | Example |
---|---|---|
mpicc | C | mpicc <compiler_flags> prog.c |
mpic++ | C++ | mpic++ <compiler_flags> prog.cpp |
mpif90 | Fortran | mpif90 <compiler_flags> prog.f90 |
To see the complete compiler command use the -show flag.
Running
To launch an MPI program you will use the mpirun command. This command is very similar to the Intel MPI mpirun launcher discussed above. However, some of the flags are different for OpenMPI. The table below shows some of the more common flags.
Flag | Description |
---|---|
-np <n> | The number of MPI tasks to start. |
-npernode <n> | Places <n> MPI processes on each allocated node. |
-npersocket <n> | Places <n> MPI processes on each socket of each allocated node. |
-hostfile <file> | The name of the file that contains the list of host/node names the launcher will place tasks on. |
-host {host list} | comma separated list of specific host/node names. |
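For example, to start 8 tasks with 4 tasks per node on the nodes listed in a host file (the file name mylist is hypothetical):

[ username@aces ~]$ mpirun -np 8 -npernode 4 -hostfile mylist ./mpi_prog.x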
To see all the available options and flags (including short descriptions) use the following command:
[ username@aces ~]$ mpirun -help
CUDA Programming
In order to compile, run, and debug CUDA programs, a CUDA module must be loaded:
[ username@aces ~]$ module load CUDA/12.0
For more information on the modules system, please see our Modules System page.
Compiling CUDA C/C++ with NVIDIA nvcc
The compiler nvcc is the NVIDIA CUDA C/C++ compiler. The command line for invoking it is:
[ username@aces ~]$ nvcc [options] -o cuda_prog.exe file1 file2 ...
where file1, file2, ... are any appropriate source, assembly, object, object library, or other (linkable) files that are linked to generate the executable file cuda_prog.exe.
By default, nvcc will use gcc to compile your source code. However, it is better to use the Intel compiler by adding the flag -ccbin=icc to your compile command.
For more information on nvcc, please refer to the online manual.
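For illustration, a minimal CUDA C source file that nvcc can compile might look like the sketch below (the file name cuda_prog.cu is hypothetical); it launches a simple kernel that adds two vectors:

#include <stdio.h>
#include <cuda_runtime.h>

/* Each thread adds one element of a and b */
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    float *a, *b, *c;
    /* Managed (unified) memory keeps the example short */
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

It can be compiled with, for example:

[ username@aces ~]$ nvcc -o cuda_prog.exe cuda_prog.cu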
Running CUDA Programs
Only some of the login nodes have GPUs installed, so when you want to run GPU code on a login node, make sure the node has a GPU. To see load information for the device, run the NVIDIA system management interface program nvidia-smi. This command will tell you which GPU device your code is running on, how much memory is used on the device, and the GPU utilization.
[ username@aces ~]$ nvidia-smi
Sun Apr 23 22:20:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:3B:00.0 Off | 0 |
| N/A 27C P0 30W / 250W | 5MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
You can test your CUDA program on the login node as long as you abide by the rules stated in Computing Environment. For production runs, you should submit a batch job to run your code on the compute nodes. In order to be placed on GPU nodes with available GPUs, a job needs to request them with the following two lines in a job file.
#SBATCH --gres=gpu:1 #Request 1 GPU
#SBATCH --partition=gpu #Request the GPU partition/queue
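A complete minimal job file might look like the sketch below (job name, run time, task count, and memory are placeholder values; adjust them for your own run):

#!/bin/bash
#SBATCH --job-name=cuda_test     #Set the job name
#SBATCH --time=00:10:00          #Request 10 minutes of wall time
#SBATCH --ntasks=1               #Request 1 task
#SBATCH --mem=8G                 #Request 8GB of memory
#SBATCH --gres=gpu:1             #Request 1 GPU
#SBATCH --partition=gpu          #Request the GPU partition/queue

module load CUDA/12.0
./cuda_prog.exe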
For more detailed information about requesting specific types of GPUs, check out the Batch Optional Job Specification section.
Debugging CUDA Programs
CUDA programs must be compiled with "-g -G" to force -O0 optimization and to generate code with debugging information. To generate debugging code for the A100 GPUs on ACES, compile and link the code with the following:
[ username@aces ~]$ nvcc -g -G -arch=compute_80 -code=sm_80 cuda_prog.cu -o cuda_prog.out
The resulting executable can then be debugged with cuda-gdb, the NVIDIA CUDA debugger. For more information on cuda-gdb, please refer to its online manual.
Misc
GNU gcc and Intel C/C++ Interoperability
C++ compilers are interoperable if they can link object files and libraries generated by one compiler with object files and libraries generated by the other compiler, and the resulting executable runs successfully. Some GNU gcc* versions are interoperable with the Intel compilers and some are not. By default, the Intel compiler will generate code that is interoperable with the version of gcc it finds on your system.
The Intel(R) C++ Compiler options that affect GNU gcc* interoperability include:
- -cxxlib
- -gcc-name
- -gcc-version
- -gxx-name
- -fabi-version
- -no-gcc (see gcc Predefined Macros for more information)
The Intel(R) C++ Compiler is interoperable with GNU gcc* compiler versions greater than or equal to 3.2. See the Intel(R) C++ Compiler Documentation for more information at the Intel Software Documentation page.