Compiling and Running on Terra
- 1 Compiling and Running on Terra
- 1.1 A Note on Terra Usage
- 1.2 Getting Started
- 1.3 Compiling
- 1.4 OpenMP Programs
- 1.5 CUDA Programming
- 1.6 Misc
A Note on Terra Usage
Much of the information on this page was originally written for Ada, but the majority of it applies to Terra as well. Some modifications to module names, directories, or other minor details may be needed.
For developing code on Terra we recommend the Intel software stack (often referred to as a "toolchain" here at HPRC), which includes the Intel compilers (icc/icpc/ifort), the Intel Math Kernel Library (MKL), and the Intel MPI. Note for Terra users: the Intel compilers are the only compilers able to build programs for the Phi co-processors.
We highly recommend users select a particular toolchain and stick with modules that use it. At present, we support the following toolchains:
- intel - described above
- iomkl - which substitutes OpenMPI for Intel's MPI
- foss - which is entirely Free and Open-Source Software (GCC/OpenMPI/BLAS/LAPACK/etc)
Detailed information about each of the currently supported toolchain releases can be found on our Toolchains page.
Toolchains, like software packages on our clusters, are organized with the Modules System. You can load a toolchain with the following command:
[ netID@cluster ~]$ module load [toolchain Name]
Important Note: Do NOT mix modules from different toolchains. Remember to ALWAYS purge all modules when switching toolchains.
More information on using the Modules System can be found on our Modules System page.
Using the intel toolchain
After initializing the compiler environment, you can use the "man" command to obtain a complete list of the available compilation options for the language you plan to use. The following three commands will provide information on the C, C++, and Fortran compilers, respectively.
[ netID@cluster ~]$ man icc
[ netID@cluster ~]$ man icpc
[ netID@cluster ~]$ man ifort
Each compiler requires appropriate file name extensions. These extensions are meant to identify files with different programming language contents, thereby enabling the compiler script to hand these files to the appropriate compiling subsystem: preprocessor, compiler, linker, etc. See table below for valid extensions for each language.
| Extension | Compiler | Description |
|---|---|---|
| .c | icc | C source code passed to the compiler. |
| .C, .CC, .cc, .cpp, .cxx | icpc | C++ source code passed to the compiler. |
| .f, .for, .ftn | ifort | Fixed form Fortran source code passed to the compiler. |
| .fpp | ifort | Fortran fixed form source code that can be preprocessed by the Intel Fortran preprocessor fpp. |
| .f90 | ifort | Free form Fortran 90/95 source code passed to the compiler. |
| .F | ifort | Fortran fixed form source code, passed to the preprocessor (fpp) and then to the Fortran compiler. |
| .o | icc/icpc/ifort | Compiled object file (generated with the -c option) passed to the linker. |
Note: The icpc command ("C++" compiler) uses the same compiler options as the icc ("C" compiler) command. Invoking the compiler using icpc compiles '.c', and '.i' files as C++. Invoking the compiler using icc compiles '.c' and '.i' files as C. Using icpc always links in C++ libraries. Using icc only links in C++ libraries if C++ source is provided on the command line.
Invoking the compiler
To compile your program and create an executable you need to invoke the correct compiler. The default output file name is a.out but this can be changed using the -o compiler flag. All compilers are capable of preprocessing, compiling, assembling, and linking. See table below for the correct compiler commands for the different languages.
| Language | Compiler | Command syntax |
|---|---|---|
| C | icc | icc [c compiler_flags] file1 [ file2 ]... |
| C++ | icpc | icpc [c++ compiler_flags] file1 [ file2 ]... |
| F90 | ifort | ifort [fortran compiler_flags] file1 [ file2 ]... |
| F77 | ifort | ifort [fortran compiler_flags] file1 [ file2 ]... |
In the table above, fileN is an appropriate source file, assembly file, object file, object library, or other linkable file.
Basic compiler flags
The next sections introduce some of the most common compiler flags. These flags are accepted by all compilers (icc/icpc/ifort) with some notable exceptions. For a full description of all the compiler flags please consult the appropriate man pages.
| Flag | Description |
|---|---|
| -help [category] | Displays all available compiler options, or only the options in the given category. |
| -o <file> | Specifies the name of the output file. For an executable, the output file will be named <file> instead of a.out. |
| -c | Compile only; the linking phase is skipped. |
| -L <dir> | Tells the linker to search for libraries in directory <dir> ahead of the standard library directories. |
| -l<name> | Tells the linker to search for a library named lib<name>.so or lib<name>.a. |
The default optimization level for the Intel compilers is -O2 (which enables optimizations such as inlining, constant/copy propagation, loop unrolling, and peephole optimizations). The table below shows some additional commonly used optimization flags that can improve run time.
| Flag | Description |
|---|---|
| -O3 | Performs -O2 optimizations and enables more aggressive loop transformations. |
| -xHost | Generates vector instructions for the highest instruction set available on the host machine. |
| -fast | Convenience flag. On Linux this is a shortcut for -ipo, -O3, -no-prec-div, -static, and -xHost. |
| -ip | Performs inter-procedural optimization within a single file. |
| -ipo | Performs inter-procedural optimization between files. |
| -parallel | Enables automatic parallelization by the compiler (very conservative). |
| -opt-report[=n] | Generates an optimization report. n represents the level of detail (0..3, 3 being most detailed). |
| -vec-report[=n] | Generates a vectorization report. n represents the level of detail (0..7, 7 being most detailed). |
NOTE: there is no guarantee that a combination of the flags above will provide additional speedup over -O2. In some rare cases, flags like -fast can even produce incorrect results (e.g. due to floating point imprecision).
Large Memory Nodes on Terra: These 48 nodes have 128GB of memory.
The table below shows some compiler flags that can be useful in debugging your program.
| Flag | Description |
|---|---|
| -g | Produces symbolic debug information in the object file. |
| -warn | Specifies the diagnostic messages to be issued by the compiler. |
| -traceback | Generates extra information in the object file to provide source file traceback information when a severe error occurs at run time. |
| -check | Checks for certain conditions at run time (e.g. uninitialized variables, array bounds). Note: since the resulting code includes additional run-time checks, it may significantly affect run time. (ifort only) |
| -fpe0 | Throws an exception on invalid operations, overflow, and divide by zero. (ifort only) |
Flags affecting floating point operations
Some optimizations affect how floating point arithmetic is performed and can introduce round-off differences. The table below shows a number of flags that instruct the compiler how to treat floating point operations:
| Flag | Description |
|---|---|
| -fp-model precise | Disables optimizations that are not value-safe on floating point data (see the man page for other options). |
| -fltconsistency | Enables improved floating-point consistency. This might slightly reduce execution speed. |
| -fp-speculation=strict | Disables speculation on floating-point operations (see the man page for other options). |
Several examples of compile commands are listed below.
Example 1: Compile a program consisting of C source files and an object file.
[ netID@cluster ~]$ icc objfile.o subroutine.c main.c
Example 2: Compile and link source files and an object file, and name the output myprog.x.
[ netID@cluster ~]$ icc -o myprog.x subroutine.c myobjs.o main.c
Example 3: Compile and link a source file against the library libmyutils.so residing in the directory mylibs.
[ netID@cluster ~]$ icc -L mylibs -lmyutils main.c
Example 4: Compile and link a program with aggressive optimization enabled, use the latest vector instructions, and print an optimization report.
[ netID@cluster ~]$ icc -fast -xHost -opt-report -o myprog.x myprog.c
Compiling OpenMP code
To compile a program containing OpenMP parallel directives, the following flags can be used to create multi-threaded versions:
| Flag | Description |
|---|---|
| -qopenmp | Enables the parallelizer to generate multi-threaded code. |
| -qopenmp-stubs | Enables compilation of OpenMP programs in sequential mode. |
[ netID@cluster ~]$ icc -qopenmp -o myprog.x myprog.c
[ netID@cluster ~]$ ifort -qopenmp -o myprog.x myprog.f90
[ netID@cluster ~]$ ifort -qopenmp-stubs -o myprog.x myprog.f90
Running OpenMP code
The table below shows some of the more common environment variables that can be used to affect OpenMP behavior at run time.
| Environment Variable | Example | Purpose of Example | Default value |
|---|---|---|---|
| OMP_NUM_THREADS=n[,m]* | OMP_NUM_THREADS=8 | Sets the maximum number of threads per nesting level to 8. | 1 |
| OMP_STACKSIZE=size[B\|K\|M\|G] | OMP_STACKSIZE=8M | Sets the size of the private stack of each worker thread to 8 MB. Possible values for the unit suffix are B (bytes), K (KB), M (MB), and G (GB). | 4M |
| OMP_SCHEDULE=type[,chunk] | OMP_SCHEDULE=DYNAMIC | Sets the default run-time schedule type to DYNAMIC. Possible values for type are STATIC, DYNAMIC, GUIDED, and AUTO. | STATIC |
| OMP_DYNAMIC | OMP_DYNAMIC=true | Enables dynamic adjustment of the number of threads. | false |
| OMP_NESTED | OMP_NESTED=true | Enables nested OpenMP regions. | false |
| OMP_DISPLAY_ENV=val | OMP_DISPLAY_ENV=VERBOSE | Instructs the OpenMP runtime to display the OpenMP version and environment variables in verbose form. Possible values are TRUE, FALSE, and VERBOSE. | FALSE |
Example 1: Set the number of threads to 8 and the stack size for worker threads to 16 MB. Note: insufficient stack size is a common cause of run-time crashes of OpenMP programs.
[ netID@cluster ~]$ export OMP_NUM_THREADS=8
[ netID@cluster ~]$ export OMP_STACKSIZE=16M
[ netID@cluster ~]$ ./myprog.x
Example 2: Enable nested parallel regions and set the number of threads for the first nesting level to 4 and the second nesting level to 2.
[ netID@cluster ~]$ export OMP_NESTED=true
[ netID@cluster ~]$ export OMP_NUM_THREADS=4,2
[ netID@cluster ~]$ ./myprog.x
Example 3: Set the maximum number of threads to 16, but let the runtime decide how many threads will actually be used in order to optimize the use of system resources.
[ netID@cluster ~]$ export OMP_DYNAMIC=true
[ netID@cluster ~]$ export OMP_NUM_THREADS=16
[ netID@cluster ~]$ ./myprog.x
Example 4: Change the default scheduling type to dynamic with a chunk size of 100.
[ netID@cluster ~]$ export OMP_SCHEDULE="dynamic,100"
[ netID@cluster ~]$ export OMP_NUM_THREADS=16
[ netID@cluster ~]$ ./myprog.x
The following table shows some more advanced environment variables that can be used to control where OpenMP threads are placed:
| Env var | Description | Default value |
|---|---|---|
| KMP_AFFINITY | Binds OpenMP threads to physical processing units (Intel runtime specific). | |
| OMP_PLACES | Defines an ordered list of places where threads can execute. Every place is a set of hardware (HW) threads. Can be defined as an explicit list of places described by nonnegative numbers, or as an abstract name: 'threads' (every place consists of exactly one HW thread), 'cores' (every place contains all the HW threads of one core), or 'sockets' (every place contains all the HW threads of one socket). | 'threads' |
| OMP_PROC_BIND | Sets the thread affinity policy to be used for parallel regions at the corresponding nesting level. Acceptable values are true, false, or a comma-separated list of: master (all threads bound to the same place as the master thread), close (threads bound to successive places close to the master thread's place), spread (threads distributed evenly among the places). NOTE: if both OMP_PROC_BIND and KMP_AFFINITY are set, the latter takes precedence. | 'false' |
Example 1: Suppose a node has two sockets, each with 8 cores. For a program with two nesting levels, place the outer-level threads on different sockets and the inner-level threads on the same socket as their master thread.
[ netID@cluster ~]$ export OMP_NESTED=true
[ netID@cluster ~]$ export OMP_NUM_THREADS=2,8
[ netID@cluster ~]$ export OMP_PLACES="sockets"
[ netID@cluster ~]$ export OMP_PROC_BIND="spread,master"
[ netID@cluster ~]$ ./myprog.x
CUDA Programming
In order to compile, run, and debug CUDA programs, a CUDA module must be loaded:
[ netID@terra3 ~]$ module load CUDA/8.0.44
For more information on the modules system, please see our Modules System page.
Compiling CUDA C/C++ with NVIDIA nvcc
The compiler nvcc is the NVIDIA CUDA C/C++ compiler. The command line for invoking it is:
[ netID@terra3 ~]$ nvcc [options] -o cuda_prog.exe file1 file2 ...
where file1, file2, ... are any appropriate source, assembly, object, object library, or other (linkable) files that are linked to generate the executable file cuda_prog.exe.
The CUDA devices on Terra are dual-GPU K80s. K80 GPUs are compute capability 3.7 devices. When compiling your code, you need to specify:
[ netID@terra3 ~]$ nvcc -arch=compute_37 -code=sm_37 ...
By default, nvcc will use gcc to compile your source code. However, it is better to use the Intel compiler by adding the flag -ccbin=icc to your compile command.
For more information on nvcc, please refer to the online manual.
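As a minimal sketch (not from the original page; all file and symbol names are illustrative) of the kind of CUDA C++ source the commands above would compile:

```cuda
#include <cstdio>

// Each thread adds one element of the two input vectors.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; cudaMalloc/cudaMemcpy also work.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // launch enough blocks to cover n
    cudaDeviceSynchronize();                        // wait for the kernel to finish

    printf("c[0] = %g\n", c[0]);                    // expect 3
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

This would be built for the K80 with, e.g., nvcc -arch=compute_37 -code=sm_37 -o vecadd.exe vecadd.cu.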
Running CUDA Programs
Only one login node on Terra (terra3) is equipped with a dual-GPU K80. To find out the load on the device, run the NVIDIA system management interface program nvidia-smi. This command will tell you which GPU device your code is running on, how much memory is used on the device, and the GPU utilization.
[ netID@terra3 ~]$ nvidia-smi
Fri Feb 10 11:44:30 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:83:00.0     Off |                  Off |
| N/A   27C    P8    26W / 149W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:84:00.0     Off |                  Off |
| N/A   32C    P8    29W / 149W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
You can test your CUDA program on the login node as long as you abide by the rules stated in Computing Environment. For production runs, you should submit a batch job to run your code on the compute nodes. Terra has 48 compute nodes each with one dual-GPU K80 and 128GB (host) memory. In order to be placed on GPU nodes with available GPUs, a job needs to request them with the following two lines in a job file.
#SBATCH --gres=gpu:1      #Request 1 GPU
#SBATCH --partition=gpu   #Request the GPU partition/queue
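A complete job file sketch putting these lines together (the job name, time, and memory values are illustrative placeholders, and cuda_prog.exe stands in for your own executable):

```shell
#!/bin/bash
#SBATCH --job-name=cuda_test    # illustrative job name
#SBATCH --time=00:10:00         # adjust the walltime for your run
#SBATCH --ntasks=1
#SBATCH --mem=8G                # host memory; adjust as needed
#SBATCH --gres=gpu:1            # request 1 GPU
#SBATCH --partition=gpu         # request the GPU partition/queue

module load CUDA/8.0.44
./cuda_prog.exe
```

Submit the job file with sbatch and check its status with squeue.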
Debugging CUDA Programs
CUDA programs must be compiled with "-g -G" to disable optimizations (forcing -O0) and to generate code with debugging information. To generate debugging code for the K80, compile and link the code with:
[ netID@terra3 ~]$ nvcc -g -G -arch=compute_37 -code=sm_37 cuda_prog.cu -o cuda_prog.out
The resulting executable can then be debugged with cuda-gdb, the NVIDIA CUDA debugger.
For more information on cuda-gdb, please refer to its online manual.
GNU gcc and Intel C/C++ Interoperability
C++ compilers are interoperable if they can link object files and libraries generated by one compiler with those generated by another, and the resulting executable runs successfully. Some GNU gcc* versions are interoperable with the Intel compilers and some are not. By default, the Intel compiler generates code that is interoperable with the version of gcc it finds on your system.
The Intel(R) C++ Compiler options that affect GNU gcc* interoperability include:
- -no-gcc (see gcc Predefined Macros for more information)
The Intel(R) C++ Compiler is interoperable with GNU gcc* compiler versions greater than or equal to 3.2. For more information, see the Intel(R) C++ Compiler Documentation at the Intel Software Documentation page.