Terra:Compile:All

From TAMU HPRC
Revision as of 10:27, 25 April 2018 by Rodriguez.dylan

Compiling and Running on Terra

A Note on Terra Usage

Much of the information on this page was originally written for Ada.

Most of the content applies to Terra as well, although module names, directories, and other minor details may need to be adjusted.

Getting Started

Toolchain selection

For developing code on Terra, we recommend the intel software stack (often referred to as a "toolchain" here at HPRC), which includes the Intel compilers (icc/icpc/ifort), the Intel Math Kernel Library (MKL), and Intel MPI. A note for Terra users: the Intel compilers are the only compilers able to compile programs for the Phi co-processors.

We highly recommend users select a particular toolchain and stick with modules that use it. At present, we support the following toolchains:

  • intel - described above
  • iomkl - which substitutes OpenMPI for Intel's MPI
  • foss - which is entirely Free and Open-Source Software (GCC/OpenMPI/BLAS/LAPACK/etc)

Detailed information about each of the currently supported toolchain releases can be found on our Toolchains page.

Toolchains, like software packages on our clusters, are organized with the Modules System. You can load a toolchain with the following command:

[ netID@cluster ~]$ module load [toolchain Name]

Important Note: Do NOT mix modules from different toolchains. Remember to ALWAYS purge all modules when switching toolchains.
More information on using the Modules System can be found on our Modules System page.

Using the intel toolchain

After initializing the compiler environment, you can use the "man" command to obtain a complete list of the available compilation options for the language you plan to use. The following three commands will provide information on the C, C++, and Fortran compilers, respectively.

[ netID@cluster ~]$ man icc
[ netID@cluster ~]$ man icpc
[ netID@cluster ~]$ man ifort

Each compiler requires appropriate file name extensions. These extensions are meant to identify files with different programming language contents, thereby enabling the compiler script to hand these files to the appropriate compiling subsystem: preprocessor, compiler, linker, etc. See table below for valid extensions for each language.

Basic Valid File Extensions
Extension | Compiler | Description
.c | icc | C source code passed to the compiler.
.C, .CC, .cc, .cpp, .cxx | icpc | C++ source code passed to the compiler.
.f, .for, .ftn | ifort | Fixed form Fortran source code passed to the compiler.
.fpp | ifort | Fixed form Fortran source code that can be preprocessed by the Intel Fortran preprocessor fpp.
.f90 | ifort | Free form Fortran 90/95 source code passed to the compiler.
.F | ifort | Fixed form Fortran source code that is passed to the preprocessor (fpp) and then to the Fortran compiler.
.o | icc/icpc/ifort | Compiled object file (generated with the -c option) passed to the linker.

Note: The icpc command ("C++" compiler) uses the same compiler options as the icc ("C" compiler) command. Invoking the compiler using icpc compiles '.c' and '.i' files as C++. Invoking the compiler using icc compiles '.c' and '.i' files as C. Using icpc always links in C++ libraries. Using icc only links in C++ libraries if C++ source is provided on the command line.


Compiling

Invoking the compiler

To compile your program and create an executable you need to invoke the correct compiler. The default output file name is a.out but this can be changed using the -o compiler flag. All compilers are capable of preprocessing, compiling, assembling, and linking. See table below for the correct compiler commands for the different languages.

Language | Compiler | Syntax
C | icc | icc [c compiler_flags] file1 [ file2 ]...
C++ | icpc | icpc [c++ compiler_flags] file1 [ file2 ]...
F90 | ifort | ifort [fortran compiler_flags] file1 [ file2 ]...
F77 | ifort | ifort [fortran compiler_flags] file1 [ file2 ]...

In the table above, fileN is an appropriate source file, assembly file, object file, object library, or other linkable file.

Basic compiler flags

The next sections introduce some of the most common compiler flags. These flags are accepted by all compilers (icc/icpc/ifort) with some notable exceptions. For a full description of all the compiler flags please consult the appropriate man pages.

Flag | Description
-help [category] | Displays all available compiler options, or only the options in the given category.
-o <file> | Specifies the name of the output file. For an executable, the output file is named <file> instead of a.out.
-c | Compiles the file only; the linking phase is skipped.
-L <dir> | Tells the linker to search for libraries in directory <dir> ahead of the standard library directories.
-l<name> | Tells the linker to search for a library named lib<name>.so or lib<name>.a.

Optimization flags

The default optimization level for Intel compilers is -O2 (which enables optimizations like inlining, constant/copy propagation, loop unrolling, peephole optimizations, etc.). The table below shows some additional commonly used optimization flags that can be used to improve run time.

Flag | Description
-O3 | Performs -O2 optimizations and enables more aggressive loop transformations.
-xHost | Tells the compiler to generate vector instructions for the highest instruction set available on the host machine.
-fast | Convenience flag. On Linux this is a shortcut for -ipo, -O3, -no-prec-div, -static, and -xHost.
-ip | Performs inter-procedural optimization within the same file.
-ipo | Performs inter-procedural optimization between files.
-parallel | Enables automatic parallelization by the compiler (very conservative).
-opt-report=[n] | Generates an optimization report. n represents the level of detail (0..3, 3 being most detailed).
-vec-report[=n] | Generates a vectorization report. n represents the level of detail (0..7, 7 being most detailed).

NOTE: There is no guarantee that any combination of the flags above will provide additional speedup compared to -O2. In rare cases, flags like -fast can reduce floating-point precision enough that the resulting code produces incorrect results.

Large Memory Nodes on Terra: These 48 nodes have 128GB of memory.

Debugging flags

The table below shows some compiler flags that can be useful in debugging your program.

Flag | Description
-g | Produces symbolic debug information in the object file.
-warn | Specifies diagnostic messages to be issued by the compiler.
-traceback | Tells the compiler to generate extra information in the object file to provide source file traceback information when a severe error occurs at run time.
-check | Checks for certain conditions at run time (e.g. uninitialized variables, array bounds). Because the resulting code includes additional run-time checks, it may run significantly slower. (ifort only)
-fpe0 | Throws an exception for invalid operations, overflow, and divide by zero. (ifort only)

Flags affecting floating point operations

Some optimizations can change how floating-point arithmetic is performed, which may introduce round-off differences in certain cases. The table below shows a number of flags that tell the compiler how to handle floating-point operations:

Flag | Description
-fp-model precise | Disables optimizations that are not value safe on floating-point data (see the man page for other options).
-fltconsistency | Enables improved floating-point consistency. This might slightly reduce execution speed.
-fp-speculation=strict | Tells the compiler to disable speculation on floating-point operations (see the man page for other options).

Examples

Several examples of compile commands are listed below.
Example 1: Compile a program consisting of C source files and an object file.

[ netID@cluster ~]$ icc objfile.o subroutine.c  main.c 

Example 2: Compile and link source files and an object file, and name the output myprog.x.

[ netID@cluster ~]$ icc -o myprog.x  subroutine.c myobjs.o  main.c

Example 3: Compile and link a source file with the library libmyutils.so residing in the directory mylibs.

[ netID@cluster ~]$ icc main.c -L mylibs -lmyutils

Example 4: Compile and link a program with aggressive optimization enabled, use the latest vector instructions, and print an optimization report.

[ netID@cluster ~]$ icc -fast -xHost -opt-report -o myprog.x myprog.c


OpenMP Programs

Compiling OpenMP code

To compile a program containing OpenMP parallel directives, use one of the following flags:

Flag | Description
-qopenmp | Enables the parallelizer to generate multi-threaded code.
-qopenmp-stubs | Enables compilation of OpenMP programs in sequential mode.

Examples:

[ netID@cluster ~]$ icc -qopenmp -o myprog.x myprog.c
[ netID@cluster ~]$ ifort -qopenmp -o myprog.x myprog.f90
[ netID@cluster ~]$ ifort -qopenmp-stubs -o myprog.x myprog.f90

Running OpenMP code

The table below shows some of the more common environment variables that can be used to affect OpenMP behavior at run time.

Environment Variable | Example | Purpose of Example | Default value
OMP_NUM_THREADS=n[,m]* | OMP_NUM_THREADS=8 | Sets the maximum number of threads per nesting level to 8. | 1
OMP_STACKSIZE=size[B|K|M|G] | OMP_STACKSIZE=8M | Sets the size of the private stack of each worker thread to 8MB. Possible unit suffixes are B (bytes), K (KB), M (MB), and G (GB). | 4M
OMP_SCHEDULE=type[,chunk] | OMP_SCHEDULE=DYNAMIC | Sets the default run-time schedule type to DYNAMIC. Possible values for type are STATIC, DYNAMIC, GUIDED, and AUTO. | STATIC
OMP_DYNAMIC | OMP_DYNAMIC=true | Enables dynamic adjustment of the number of threads. | false
OMP_NESTED | OMP_NESTED=true | Enables nested OpenMP regions. | false
OMP_DISPLAY_ENV=val | OMP_DISPLAY_ENV=VERBOSE | Instructs the OpenMP runtime to display the OpenMP version and environment variables in verbose form. Possible values are TRUE, FALSE, VERBOSE. | FALSE

Examples

Example 1: Set the number of threads to 8 and the stack size of each worker thread to 16MB. Note: insufficient stack size is a common cause of run-time crashes in OpenMP programs.

[ netID@cluster ~]$ export OMP_NUM_THREADS=8
[ netID@cluster ~]$ export OMP_STACKSIZE=16M
[ netID@cluster ~]$ ./myprog.x

Example 2: enable nested parallel regions and set the number of threads to use for first nesting level to 4 and second nesting level to 2

[ netID@cluster ~]$ export OMP_NESTED=true
[ netID@cluster ~]$ export OMP_NUM_THREADS=4,2
[ netID@cluster ~]$ ./myprog.x

Example 3: Set the maximum number of threads to 16, but let the runtime decide how many threads are actually used in order to optimize the use of system resources.

[ netID@cluster ~]$ export OMP_DYNAMIC=true
[ netID@cluster ~]$ export OMP_NUM_THREADS=16
[ netID@cluster ~]$ ./myprog.x

Example 4: change the default scheduling type to dynamic with chunk size of 100.

[ netID@cluster ~]$ export OMP_SCHEDULE="dynamic,100"
[ netID@cluster ~]$ export OMP_NUM_THREADS=16
[ netID@cluster ~]$ ./myprog.x

Advanced OpenMP

The following table shows some more advanced environment variables that can be used to control where OpenMP threads are placed.

Env var | Description | Default value
KMP_AFFINITY | Binds OpenMP threads to physical threads. |
OMP_PLACES | Defines an ordered list of places where threads can execute. Every place is a set of hardware (HW) threads. Can be defined as an explicit list of places described by nonnegative numbers or as an abstract name. The abstract name can be 'threads' (every place consists of exactly one HW thread), 'cores' (every place contains all the HW threads of the core), or 'sockets' (every place contains all the HW threads of the socket). | 'threads'
OMP_PROC_BIND | Sets the thread affinity policy to be used for parallel regions at the corresponding nesting level. Acceptable values are true, false, or a comma-separated list, each element of which is one of the following values: master (all threads will be bound to the same place as the master thread), close (all threads will be bound to successive places close to the place of the master thread), spread (all threads will be distributed among the places evenly). NOTE: if both OMP_PROC_BIND and KMP_AFFINITY are set, the latter takes precedence. | 'false'

Example 1: Suppose a node has two sockets, each with 8 cores. For a program with two nesting levels, place the outer-level threads on different sockets and the inner-level threads on the same socket as their master thread.

[ netID@cluster ~]$ export OMP_NESTED=true
[ netID@cluster ~]$ export OMP_NUM_THREADS=2,8
[ netID@cluster ~]$ export OMP_PLACES="sockets"
[ netID@cluster ~]$ export OMP_PROC_BIND="spread,master"
[ netID@cluster ~]$ ./myprog.x


CUDA Programming

Access

In order to compile, run, and debug CUDA programs, a CUDA module must be loaded:

[ netID@terra3 ~]$ module load CUDA/8.0.44

For more information on the modules system, please see our Modules System page.

Compiling CUDA C/C++ with NVIDIA nvcc

The compiler nvcc is the NVIDIA CUDA C/C++ compiler. The command line for invoking it is:

[ netID@terra3 ~]$ nvcc [options] -o cuda_prog.exe file1 file2 ...

where file1, file2, ... are any appropriate source, assembly, object, object library, or other (linkable) files that are linked to generate the executable file cuda_prog.exe.

The CUDA devices on Terra are dual-GPU K80s. K80 GPUs are compute capability 3.7 devices. When compiling your code, you need to specify:

[ netID@terra3 ~]$ nvcc -arch=compute_37 -code=sm_37 ...

By default, nvcc will use gcc to compile your source code. However, it is better to use the Intel compiler by adding the flag -ccbin=icc to your compile command.

For more information on nvcc, please refer to the online manual.

Running CUDA Programs

Only one login node on Terra (terra3) is equipped with a dual-GPU K80. To check the load on the device, run the NVIDIA System Management Interface program nvidia-smi. This command reports which GPU device your code is running on, how much memory is used on the device, and the GPU utilization.

[ netID@terra3 ~]$ nvidia-smi
Fri Feb 10 11:44:30 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:83:00.0     Off |                  Off |
| N/A   27C    P8    26W / 149W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:84:00.0     Off |                  Off |
| N/A   32C    P8    29W / 149W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You can test your CUDA program on the login node as long as you abide by the rules stated in Computing Environment. For production runs, you should submit a batch job to run your code on the compute nodes. Terra has 48 compute nodes each with one dual-GPU K80 and 128GB (host) memory. In order to be placed on GPU nodes with available GPUs, a job needs to request them with the following two lines in a job file.

#SBATCH --gres=gpu:1                 #Request 1 GPU
#SBATCH --partition=gpu              #Request the GPU partition/queue

Debugging CUDA Programs

To be debugged with cuda-gdb, CUDA programs must be compiled with "-g -G", which forces -O0 optimization and generates code with debugging information. To generate debugging code for the K80, compile and link the code with the following:

[ netID@terra3 ~]$ nvcc -g -G -arch=compute_37 -code=sm_37 cuda_prog.cu -o cuda_prog.out

For more information on cuda-gdb, please refer to its online manual.

Misc

GNU gcc and Intel C/C++ Interoperability

C++ compilers are interoperable if they can link object files and libraries generated by one compiler with object files and libraries generated by the second compiler, and the resulting executable runs successfully. Some GNU gcc* versions are interoperable with each other and some are not. By default, the Intel compiler will generate code that is interoperable with the version of gcc it finds on your system.

The Intel(R) C++ Compiler options that affect GNU gcc* interoperability include:

  • -cxxlib
  • -gcc-name
  • -gcc-version
  • -gxx-name
  • -fabi-version
  • -no-gcc (see gcc Predefined Macros for more information)

The Intel(R) C++ Compiler is interoperable with GNU gcc* compiler versions greater than or equal to 3.2. See the Intel(R) C++ Compiler Documentation for more information at the Intel Software Documentation page.