ACES
ACES Phase I
To apply for an account on ACES Phase I, please fill out the ACES Phase I application form.
Hardware Summary
| Component | Quantity | Description |
|---|---|---|
| Graphcore IPU | 16 | 16 Colossus GC200 IPUs and dual AMD Rome CPU server on a 100 GbE RoCE fabric |
| Intel FPGA PAC D5005 | 2 | FPGA SoC with Intel Stratix 10 SX FPGAs, 64-bit quad-core Arm Cortex-A53 processors, and 32 GB DDR4 |
| Intel Optane SSDs | 8 | 3 TB of Intel Optane SSDs addressable as memory using MemVerge Memory Machine |
Access
SSH access for XSEDE/ACCESS users
As of August 31st, login to ACES Phase I for XSEDE/ACCESS users transitioned from the XSEDE SSO hub to a jump host:
ssh -J username@aces-jump.hprc.tamu.edu:8822 username@login.aces.hprc.tamu.edu
You can also find your username(s) under your ACCESS profile: https://allocations.access-ci.org/profile
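To avoid typing the jump-host options on every login, you can add a host alias to the ~/.ssh/config file on your local machine. This is a minimal sketch; the alias name aces is arbitrary, and username must be replaced with your ACCESS username:

# ~/.ssh/config on your local machine (alias name "aces" is arbitrary)
Host aces
    HostName login.aces.hprc.tamu.edu
    User username
    ProxyJump username@aces-jump.hprc.tamu.edu:8822
    IdentityFile ~/.ssh/id_ed25519

With this entry in place, ssh aces connects through the jump host automatically.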
SSH access for XSEDE/ACCESS MobaXterm users
MobaXterm can use a jump host to connect to a server in a secured network zone. Configuring your SSH session to use the jump host saves time during login.
1. Click on the Session tab to create a new session, then click SSH in the top ribbon.
2. Enter login.aces.hprc.tamu.edu into the Remote host * box.
3. Enter your username in the Specify username box.
4. Open the Advanced SSH settings tab, check the Use private key box, and enter the path to your private key.
5. Go to the Network settings tab, enable Connect through SSH gateway (jump host), and enter the respective login data. The gateway host should be aces-jump.hprc.tamu.edu and the port should be 8822. Click OK.
6. Start the session by double-clicking login.aces.hprc.tamu.edu (username) in the left-hand panel or under the Sessions tab.
Generating SSH Keys
If you do not already have an ed25519 key pair, you can use the instructions below to generate your SSH keys on the host you will log in from. The keys should automatically be generated in $HOME/.ssh/. Do NOT change the name or location of the generated keys.
Instructions for Windows users
Please use the Windows PowerShell to generate your keys.
Once you have launched the Windows PowerShell App type:
ssh-keygen -t ed25519
The command will output the following text:
Generating public/private ed25519 key pair.
Enter file in which to save the key (C:\Users\username/.ssh/id_ed25519):
If you already have a .ssh directory, the following text will not be displayed.
Created directory 'C:\Users\username/.ssh'.
Hit Enter at the following prompt.
Enter passphrase (empty for no passphrase):
Hit Enter at the following prompt.
Enter same passphrase again:
More text output:
Your identification has been saved in C:\Users\username/.ssh/id_ed25519.
Your public key has been saved in C:\Users\username/.ssh/id_ed25519.pub.
The key fingerprint is:
SHA256:long_string_of_text_here
The key's randomart image is:
(many lines of characters here)
Instructions for Linux and Mac users
Launch a terminal on your Linux or Mac system and type:
ssh-keygen -t ed25519
The command will output the following text:
Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/username/.ssh/id_ed25519):
If you already have a .ssh directory, the following text will not be displayed.
Created directory '/Users/username/.ssh'
Hit Enter at the following prompt.
Enter passphrase (empty for no passphrase):
Hit Enter at the following prompt.
Enter same passphrase again:
More text output:
Your identification has been saved in /Users/username/.ssh/id_ed25519.
Your public key has been saved in /Users/username/.ssh/id_ed25519.pub.
The key fingerprint is:
SHA256:long_string_of_text_here
The key's randomart image is:
(many lines of characters here)
How to submit your id_ed25519.pub key
Type the following command to show the contents of the id_ed25519.pub file.
cat $HOME/.ssh/id_ed25519.pub
The cat command should output a single line of text that starts with
ssh-ed25519 many_characters_here username@host
The actual string will be specific to your key and system.
Copy the output of the above cat command and include it in an email to keys@hprc.tamu.edu, or attach the id_ed25519.pub file to your email.
Do NOT send your private key. You will be notified by email when your public key has been installed and your account is active.
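Optionally, you can copy the public key straight to the clipboard instead of selecting the cat output by hand. A convenience sketch, assuming pbcopy (macOS) or xclip (Linux) is available on your machine:

# macOS: copy the public key to the clipboard
pbcopy < $HOME/.ssh/id_ed25519.pub

# Linux, with xclip installed
xclip -selection clipboard < $HOME/.ssh/id_ed25519.pub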
Graphcore IPUs
From the login node, ssh into the poplar1 system.
[username@login ~]$ ssh poplar1
Set up the Poplar SDK environment
In this step, set up several environment variables to use the Graphcore tools and Poplar graph programming framework.
[username@poplar1 ~]$ source /opt/gc/poplar/poplar_sdk-ubuntu_20_04-[ver]/poplar-ubuntu_20_04-[ver]/enable.sh
[username@poplar1 ~]$ source /opt/gc/poplar/poplar_sdk-ubuntu_20_04-[ver]/popart-ubuntu_20_04-[ver]/enable.sh
[ver] indicates the version number of the package.
Example commands with an existing version on ACES:
source /opt/gc/poplar/poplar_sdk-ubuntu_20_04-3.1.0+1205-58b501c780/poplar-ubuntu_20_04-3.1.0+6824-9c103dc348/enable.sh
source /opt/gc/poplar/poplar_sdk-ubuntu_20_04-3.1.0+1205-58b501c780/popart-ubuntu_20_04-3.1.0+6824-9c103dc348/enable.sh
Create a local cache directory and point the Poplar executable caches at it:

mkdir -p /localdata/$USER/tmp
export TF_POPLAR_FLAGS=--executable_cache_path=/localdata/$USER/tmp
export POPTORCH_CACHE_DIR=/localdata/$USER/tmp
# export POPLAR_LOG_LEVEL=INFO
# export POPLIBS_LOG_LEVEL=INFO
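As a quick sanity check that the SDK environment is active, you can ask the Poplar compiler for its version (popc ships with the Poplar SDK; the exact version string depends on the release installed on ACES):

# prints the Poplar compiler version if the enable.sh scripts were sourced correctly
popc --version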
Set up framework environments for the IPU
PyTorch (Poptorch)
Set up PyTorch (Poptorch)
The local home directory is small (300 GB total). You can store large files in /localdata/username (or use the localdata symlink in your home directory). /localdata has 3.5 TB available.
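If the localdata symlink is not already present in your home directory (it may be pre-created on ACES; creating it manually here is an assumption), you can set it up once:

# create your directory on the large local volume and link it from $HOME
mkdir -p /localdata/$USER
ln -s /localdata/$USER $HOME/localdata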
[username@poplar1 ~]$ cd localdata
[username@poplar1 localdata]$ virtualenv -p python3 poptorch_test
[username@poplar1 localdata]$ source poptorch_test/bin/activate
[username@poplar1 localdata]$ python -m pip install -U pip
[username@poplar1 localdata]$ python -m pip install <sdk_path>/poptorch_x.x.x.whl
For <sdk_path>/poptorch_x.x.x.whl, you can use /opt/gc/poplar/poplar_sdk-ubuntu_20_04-3.1.0+1205-58b501c780/poptorch-3.1.0+98660_0a383de63f_ubuntu_20_04-cp38-cp38-linux_x86_64.whl, which exists on ACES.
Clone a copy of the Graphcore tutorials repository and change the directory to mnist
[username@poplar1 localdata]$ git clone https://github.com/graphcore/tutorials.git
[username@poplar1 localdata]$ cd tutorials/simple_applications/pytorch/mnist/
Install the dependencies and run the model
[username@poplar1 mnist]$ pip install -r requirements.txt
[username@poplar1 mnist]$ python mnist_poptorch.py
TensorFlow 2
Set up TensorFlow 2 for IPU
The local home directory is small (300 GB total). You can store large files in /localdata/username (or use the localdata symlink in your home directory). /localdata has 3.5 TB available.
[username@poplar1 ~]$ cd localdata
[username@poplar1 localdata]$ virtualenv venv_tf2 -p python3.8
[username@poplar1 localdata]$ source venv_tf2/bin/activate
[username@poplar1 localdata]$ python -m pip install <sdk_path>/tensorflow_x.x.x.whl
For <sdk_path>/tensorflow_x.x.x.whl, you can use /opt/gc/poplar/poplar_sdk-ubuntu_20_04-3.1.0+1205-58b501c780/tensorflow-2.6.3+gc3.1.0+246224+2b7af067dae+amd_znver1-cp38-cp38-linux_x86_64.whl, which exists on ACES.
Clone a copy of the Graphcore tutorials repository and change the directory to tensorflow2/keras/completed_demos
[username@poplar1 localdata]$ git clone https://github.com/graphcore/tutorials.git
[username@poplar1 localdata]$ cd tutorials/tutorials/tensorflow2/keras/completed_demos/
Run the model
[username@poplar1 completed_demos]$ python completed_demo_ipu.py
gc-monitor
gc-monitor is a command-line utility that provides a comprehensive overview of IPU device information, including details about any processes using the IPUs, presented as a table.
[username@poplar1 localdata]$ gc-monitor
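For continuous monitoring, gc-monitor can be combined with the standard watch utility; the 5-second interval below is just an illustration:

# redraw the IPU usage table every 5 seconds; press Ctrl-C to exit
watch -n 5 gc-monitor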
Graphcore Documentation can be found at https://docs.graphcore.ai/en/latest/
Liqid PCIe Card with Intel Optane SSDs
Submit a standard batch job or interactive job to the memverge partition
srun --partition=memverge --time=24:00:00 --pty bash
Sample job file:
#!/bin/bash
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example         #Set the job name to Example
#SBATCH --time=24:00:00            #Set the wall clock limit to 24 hrs
#SBATCH --nodes=1                  #Request 1 node
#SBATCH --ntasks-per-node=64       #Request 64 tasks/cores per node
#SBATCH --mem=248G                 #Request 248GB per node
#SBATCH --output=Example.%j        #Redirect stdout/err to file
#SBATCH --partition=memverge       #Specify the MemVerge partition

# lines required to set up the environment for your code

# add the mm command in front of your executable to run with Memory Machine
mm executable
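Assuming the job file above is saved as example_job.slurm (the filename is arbitrary), submit it and monitor its status with the standard Slurm commands:

sbatch example_job.slurm   # submit the job to the memverge partition
squeue -u $USER            # check the status of your queued/running jobs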
Sample job file to run with Singularity:
#!/bin/bash
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example         #Set the job name to Example
#SBATCH --time=24:00:00            #Set the wall clock limit to 24 hrs
#SBATCH --nodes=1                  #Request 1 node
#SBATCH --ntasks-per-node=64       #Request 64 tasks/cores per node
#SBATCH --mem=248G                 #Request 248GB per node
#SBATCH --output=Example.%j        #Redirect stdout/err to file
#SBATCH --partition=memverge       #Specify the MemVerge partition

# Required directories and libraries for MemVerge Memory Machine
export SINGULARITY_BIND='/var/log/memverge,/etc/memverge,/opt/memverge,/var/memverge'
for lib in \
  libblkid.so.1 \
  libcrypto.so.1.1 \
  libc.so.6 \
  libdaxctl.so.1 \
  libdl.so.2 \
  libgcc_s.so.1 \
  libkmod.so.2 \
  liblzma.so.5 \
  libmount.so.1 \
  libm.so.6 \
  libndctl.so.6 \
  libpcre2-8.so.0 \
  libprotobuf-c.so.1 \
  libpthread.so.0 \
  librt.so.1 \
  libselinux.so.1 \
  libssl.so.1.1 \
  libstdc++.so.6 \
  libudev.so.1 \
  libuuid.so.1 \
  libz.so.1 \
; do
  export SINGULARITY_BIND=$SINGULARITY_BIND,/lib64/$lib:/lib/$lib
done

# run your Singularity container command, including the mm command for Memory Machine
singularity exec filename.sif mm executable
Intel FPGA PAC D5005
The FPGA nodes support both an older OpenCL development workflow and a newer Intel oneAPI workflow; this section covers FPGA code compilation with the oneAPI toolchain.
Access
⚠️ It is recommended to compile an FPGA binary on a CPU node, as the compilation times are extensive for hardware images, and the availability of FPGA nodes is limited. As such, Texas A&M High Performance Research Computing has created a Singularity container available for performing this compilation on CPU-only nodes. After compilation, users may request an FPGA node to run their emulator/hardware image.
CPU-Only (For Compiling Binaries)
On the login node, create a copy of the container and samples:
cp -R /scratch/training/oneapi-aces $SCRATCH
Launch an interactive job on a CPU node:
srun --partition=cpu --nodes=1 --mem=64G --time=12:00:00 --pty bash -i
Start an interactive session in the Singularity container:
cd $SCRATCH/oneapi-aces
singularity shell --env-file env-vars oneapi-2022.1.0.sif
Follow the steps in the next section, "Getting Started".
FPGA (For Running Binaries)
To access the Intel FPGA PAC D5005, submit an interactive job to the FPGA partition from the login node:
srun --partition=fpga --nodes=1 --time=24:00:00 --pty bash -i
Getting Started
Once the session starts, you need to load the environment variables to access and interact with the FPGA on the node:
$ source /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 4.2.46(2)-release
   args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: intelfpgadpcpp -- latest
:: intelpython -- latest
:: ipp -- latest
:: ippcp -- latest
:: ipp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vpl -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
Telemetry for the FPGA can be viewed using the 'fpgainfo' command:
$ fpgainfo
FPGA information utility

Usage:
        fpgainfo [-h] [-B <bus>] [-D <device>] [-F <function>] [-S <socket-id>]
                 {errors,power,temp,fme,port,bmc}

        -h,--help           Print this help
        -B,--bus            Set target bus number
        -D,--device         Set target device number
        -F,--function       Set target function number
        -S,--socket-id      Set target socket number

Subcommands:

Print and clear errors
        fpgainfo errors [-h] [-c] {all,fme,port}
        -h,--help           Print this help
        -c,--clear          Clear all errors
        --force             Retry clearing errors 64 times to clear certain error conditions

Print power metrics
        fpgainfo power [-h]
        -h,--help           Print this help

Print thermal metrics
        fpgainfo temp [-h]
        -h,--help           Print this help

Print FME information
        fpgainfo fme [-h]
        -h,--help           Print this help

Print accelerator port information
        fpgainfo port [-h]
        -h,--help           Print this help

Print all Board Management Controller sensor values
        fpgainfo bmc [-h]
        -h,--help           Print this help
For continuous monitoring, use this command in conjunction with the watch command.
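For instance, to redraw the thermal metrics every few seconds (the interval is arbitrary):

# refresh the FPGA thermal metrics every 5 seconds; press Ctrl-C to exit
watch -n 5 fpgainfo temp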
To run a status check on the FPGA, run:
$ aocl diagnose
This will display information about the libraries and initialization status of the FPGA device.
If the device shows as "Uninitialized", it can be initialized with a standard image with:
$ aocl initialize acl0 pac_s10
aocl initialize: Running initialize from /opt/intel/oneapi/intelfpgadpcpp/latest/board/intel_s10sx_pac/linux64/libexec
Program succeed.
The FPGA device must be initialized with the image that matches the compilation target of a binary; i.e., if a binary is compiled for "pac_s10", the board must be initialized with the "pac_s10" standard image before running. There are two potential image options for the "aocl initialize" command:
| Name | Description |
|---|---|
| pac_s10 | Standard Intel FPGA PAC D5005 (Intel Stratix 10 SX) without unified shared memory (USM) support. |
| pac_s10_usm | Standard Intel FPGA PAC D5005 (Intel Stratix 10 SX) with unified shared memory (USM) support. The device must be initialized with this image if a binary using USM will be run on the FPGA device. |
More information regarding unified shared memory can be found here: Unified Shared Memory — DPC++ Reference documentation
If the node has multiple FPGA devices, they can be viewed with:
$ aocl list-devices
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/intel/oneapi/intelfpgadpcpp/latest/board/intel_s10sx_pac

Vendor: Intel Corp

Physical Dev Name   Status   Information

pac_ed00000         Passed   Intel PAC Platform (pac_ed00000)
                             PCIe 29:00.0
                             USM not supported

DIAGNOSTIC_PASSED
--------------------------------------------------------------------
The user can then target the correct device when running their code or initializing the device.
Example
oneAPI Samples
The README.md in each directory contains information for compiling and running.
$ git clone https://github.com/oneapi-src/oneAPI-samples.git
$ cd oneAPI-samples/DirectProgramming/DPC++FPGA/Tutorials
Example 1: fpga_compile
Navigate to the "fpga_compile" example under "GettingStarted" within the oneAPI samples repository:
$ cd GettingStarted
$ cd fpga_compile
Create a build directory for configuration files:
$ mkdir build
$ cd build
Configure the program to compile for the Intel S10 SX PAC (Intel PAC D5005):
$ cmake .. -DFPGA_DEVICE=intel_s10sx_pac:pac_s10
Once the configuration completes, there will be several make options available:
| Command | Device Image Type | Compilation Duration | Description |
|---|---|---|---|
| make fpga_emu | FPGA Emulator | Seconds | Compile for emulation (compiles quickly, targets an emulated FPGA device). Allows the user to validate a design, but does not represent actual performance of the code on hardware. |
| make report | Optimization Report | Minutes | Generate the optimization report. The FPGA device code is partially compiled for hardware. The compiler generates an optimization report that describes the structures generated on the FPGA, identifies performance bottlenecks, and estimates resource utilization. |
| make fpga | FPGA Hardware | Hours | Compile for FPGA hardware (takes longer to compile, targets the FPGA device). Compiles the actual bitstream for running the program on hardware. |
The recommended workflow is to compile a program for emulation before compiling for hardware execution. Emulation does not compile the program to run on the FPGA itself, but rather on the CPU via a virtual FPGA emulation device. This lets the user validate the correctness of their design while benefiting from the short compile times of CPU compilation. The optimization report then helps the user improve aspects of their design before moving on to hardware compilation. A sketch of this sequence is shown below.
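A minimal sketch of that workflow using the make targets from the table above; the emulator binary name follows the sample's naming convention and may differ, so check the sample's README:

make fpga_emu             # seconds: build for the CPU-based emulation device
./fpga_compile.fpga_emu   # validate correctness (binary name is sample-specific)
make report               # minutes: generate the optimization report
make fpga                 # hours: compile the hardware bitstream (run on a CPU node)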
Example 2: buffered_host_streaming
Navigate to the "buffered_host_streaming" example under "DesignPatterns" within the oneAPI samples repository:
$ cd DesignPatterns
$ cd buffered_host_streaming
Create a build directory for configuration files:
$ mkdir build
$ cd build
Configure the program to compile for the Intel S10 SX PAC (Intel PAC D5005):
$ cmake .. -DFPGA_DEVICE=intel_s10sx_pac:pac_s10_usm -DUSM_HOST_ALLOCATIONS_ENABLED=1
Note that the value of FPGA_DEVICE is the USM variant of the FPGA device. As in Example 1, there are three make targets to choose from: fpga_emu, report, and fpga. Once compilation finishes, ensure that the board is initialized with the correct standard image:
$ aocl initialize acl0 pac_s10_usm
$ aocl list-devices
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/intel/oneapi/intelfpgadpcpp/latest/board/intel_s10sx_pac

Vendor: Intel Corp

Physical Dev Name   Status   Information

pac_ed00000         Passed   Intel PAC Platform (pac_ed00000)
                             PCIe 29:00.0
                             USM supported

DIAGNOSTIC_PASSED
--------------------------------------------------------------------
Example output:
$ ./buffered_host_streaming.fpga
Repetitions:      200
Buffers:          2
Buffer Count:     524288
Iterations:       4
Total Threads:    64

Running the roofline analysis
Producer (32 threads)
        Time:       1.1101 ms
        Throughput: 30226.1777 MB/s
Consumer (32 threads)
        Time:       1.0272 ms
        Throughput: 32667.0989 MB/s
Producer & Consumer (32 threads, each)
        Time:       3.4327 ms
        Throughput: 9774.9486 MB/s
Kernel
        Time:       3.5139 ms
        Throughput: 9549.1001 MB/s

Maximum Design Throughput: 9549.1001 MB/s
The FPGA kernel limits the performance of the design
Done the roofline analysis

Running the full design without API
        Average latency without API: 4.3190 ms
        Average processing time without API: 749.3281 ms
        Average throughput without API: 8955.8717 MB/s

Running the full design with API
        Average latency with API: 4.6629 ms
        Average processing time with API: 1005.6579 ms
        Average throughput with API: 6673.1306 MB/s
PASSED
Example output if the incorrect standard board image is programmed:
$ aocl initialize acl0 pac_s10   # incorrect image; binaries were compiled for pac_s10_usm
$ ./buffered_host_streaming.fpga
Repetitions:      200
Buffers:          2
Buffer Count:     524288
Iterations:       4
Total Threads:    64
ERROR: The selected device does not support USM host allocations
terminate called without an active exception
Aborted (core dumped)
Resources
| Resource | Description |
|---|---|
| FPGA Optimization Guide for Intel® oneAPI Toolkits | The FPGA Optimization Guide for Intel® oneAPI Toolkits provides guidance on leveraging the functionalities of SYCL* to optimize a design. |
| Intel® FPGA Programmable Acceleration Card D5005 Data Sheet | This datasheet for the Intel® FPGA PAC shows electrical, mechanical, compliance, and other key specifications. This datasheet assists data center operators and system integrators to properly deploy the Intel® FPGA PAC into their servers. It also documents the FPGA power envelope, connectivity speeds to memory, and network connectivity, so that accelerator function unit (AFU) developers can properly design and test their IP. |
| Intel® FPGA Training | Set of labs for using FPGAs with oneAPI through the Intel® DevCloud. |
| Intel® Quartus® Prime Pro Edition User Guide: Scripting | Detailed guide for running Quartus programs on the command line. |
| Intel® Stratix® 10 FPGAs & SoC FPGA | Assorted documentation for the Stratix 10 FPGA family, including pinouts and device schematics. |
| Intel® Stratix® 10 FPGA Developer Center | Provides various resources to complete an Intel® FPGA design on the Stratix 10 architecture. |
| Why is FPGA Compilation Different? | Describes differences between CPU, GPU, and FPGA program compilation. |
Support
Please report any issues encountered on the FPGAs to help@hprc.tamu.edu, and include information about actions taken and/or commands run prior to the error so the HPRC team may reproduce and resolve the issue.
Intel GPU
Intel GPU access is by invitation only.
Sample job file:
#!/bin/bash
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example         #Set the job name to Example
#SBATCH --time=24:00:00            #Set the wall clock limit to 24 hrs
#SBATCH --nodes=1                  #Request 1 node
#SBATCH --ntasks-per-node=64       #Request 64 tasks/cores per node
#SBATCH --mem=248G                 #Request 248GB per node
#SBATCH --output=Example.%j        #Redirect stdout/err to file
#SBATCH --partition=atsp           #Specify the Intel GPU partition

# lines required to set up the environment for your code
Command to activate Intel oneAPI:
source /sw/restricted/oneapi_nda/setvars.sh

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 4.4.20(1)-release
   args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: clck -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: inspector -- latest
:: intelpython -- latest
:: ipp -- latest
:: ippcp -- latest
:: ipp -- latest
:: itac -- latest
:: mkl -- latest
:: mpi -- latest
:: neural-compressor -- latest
:: pytorch -- latest
:: tbb -- latest
:: tensorflow -- latest
:: vpl -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
Command to activate a specific oneAPI environment:
source /sw/restricted/oneapi_nda/setvars.sh --force
source activate env
For example, to load the oneAPI intel optimized python environment run:
source /sw/restricted/oneapi_nda/setvars.sh --force
source activate intelpython
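To confirm that the intelpython environment is active, a quick check of which interpreter is now first on your PATH (the exact path and version shown will depend on the installed oneAPI release):

which python      # should point into the oneAPI intelpython environment
python --version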