
Intel PVC GPUs

Introduction

The ACES cluster has 6 nodes, each with 4 Intel Data Center GPU Max 1100 GPUs. Throughout this documentation, these GPUs are referred to as the Intel PVC GPUs.

Accessing Intel PVC GPUs

Interactive

Access a compute node interactively using srun with the resource options --partition=pvc and --gres=gpu:pvc:<num_gpus>.

srun --time=01:00:00 --partition=pvc --gres=gpu:pvc:1 --pty bash -i

Load all the necessary modules

module purge
module load intel/2023.07
module load intel/AIKit/2023.2.0

The intel/AIKit module comes with a conda binary and a few default conda environments in a shared space.

To see all the default environments, use the following command

conda env list

For PyTorch, create a new environment by cloning the shared aikit-pt-gpu environment

conda create -n aikit-pt-gpu-clone --clone aikit-pt-gpu

For TensorFlow, create a new environment by cloning the shared aikit-tf-gpu environment

conda create -n aikit-tf-gpu-clone --clone aikit-tf-gpu

Activate the conda environment

source activate aikit-pt-gpu-clone

OR

source activate aikit-tf-gpu-clone
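
Optionally, verify that the active environment can see the PVC GPUs before running a workload. The checks below are a minimal sketch and assume the cloned environments provide Intel Extension for PyTorch and Intel Extension for TensorFlow, respectively, as the shared AI Kit environments do.

# PyTorch clone: Intel Extension for PyTorch exposes the PVC GPUs as the "xpu" device
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.is_available(), torch.xpu.device_count())"

# TensorFlow clone: the Intel extension registers the PVC GPUs as "XPU" devices
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('XPU'))"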

Install any additional packages if required

conda install <required-package>
pip install <required-package>

Run your Python script

cd $SCRATCH/<path-to-your-script>
python <name-of-script>.py
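
If your allocation includes more than one GPU and you want to restrict the script to a single device, the Level Zero runtime honors the ZE_AFFINITY_MASK environment variable; the device index 0 below is only an example.

ZE_AFFINITY_MASK=0 python <name-of-script>.py # expose only the first visible GPU to the script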

Job Submission

The Intel PVC GPUs can be accessed through Slurm with the resource options --partition=pvc and --gres=gpu:pvc:<num_gpus>.

The following is an example of a job script for the TensorFlow environment. For PyTorch, replace aikit-tf-gpu with aikit-pt-gpu and aikit-tf-gpu-clone with aikit-pt-gpu-clone.

#!/bin/bash
##NECESSARY JOB SPECIFICATIONS

#SBATCH --job-name=tf_demo
#SBATCH --time=01:00:00                  
#SBATCH --nodes=1                  
#SBATCH --output=tf_demo.%j
#SBATCH --mem=100GB
#SBATCH --gres=gpu:pvc:1              
#SBATCH --partition=pvc               

# load all the necessary modules
module purge
module load intel/2023.07
module load intel/AIKit/2023.2.0

ENV_NAME=aikit-tf-gpu-clone

# If it doesn't exist, create the environment
if ! conda env list | grep -q "$ENV_NAME"; then
    conda create -n "$ENV_NAME" --clone aikit-tf-gpu
fi

# activate the conda environment
source activate $ENV_NAME

# change directory to your script
cd $SCRATCH/<path-to-your-script>

# executable command
python <name-of-script>.py
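
Save the script and submit it with sbatch; the filename job.slurm below is just an example. squeue can be used to confirm that the job is queued or running.

sbatch job.slurm      # submit the job script above (example filename)
squeue -u $USER       # check the status of your jobs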

Monitor utilization

Launch a VNC interactive job through the portal. Make sure to select Intel GPU Max (PVC) as the node type.

Take note of the compute node that is assigned to your job.

Follow the Slurm guide to create a job script. Request the same compute node that was assigned earlier using the --nodelist option. For example, if the compute node is ac026, the job script would list the following resources:

#SBATCH --job-name=tf_demo
#SBATCH --time=01:00:00                  
#SBATCH --nodes=1                  
#SBATCH --output=tf_demo.%j
#SBATCH --mem=100GB
#SBATCH --gres=gpu:pvc:1              
#SBATCH --partition=pvc  
#SBATCH --nodelist=ac026 # use the same compute node here

Submit the job using the sbatch command

sbatch job.slurm

There are two commands to monitor GPU utilization, which are detailed in the following sections.

sysmon

$ sysmon -h
Usage: ./sysmon [options]
Options:
--processes [-p]    Print short device information and running processes (default)
--list [-l]         Print list of devices and subdevices
--details [-d]      Print detailed information for all of the devices and subdevices
--help [-h]         Print help message
--version           Print version

This utility provides basic information about the GPUs on a node, including the list of processes currently attached to each GPU.

The process mode (the default) prints short device information for all available GPUs along with their running processes. Example output:

$ sysmon
=====================================================================================
GPU 0: Intel(R) Data Center GPU Max 1100    PCI Bus: 0000:1b:00.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26516    Subdevices: 0
EU Count: 448    Threads Per EU: 8    EU SIMD Width: 16    Total Memory(MB): 46679.2
Core Frequency(MHz): 1400.0 of 1550.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
     PID,  Device Memory Used(MB),  Shared Memory Used(MB),  GPU Engines, Executable
    4809,                     2.2,                     0.0,      COMPUTE, /usr/bin/xpumd
 2651639,                     5.2,                     0.0,  COMPUTE;DMA, python
 2651661,                 46213.8,                     0.0,  COMPUTE;DMA, python
 2652076,                     2.2,                     0.0,      UNKNOWN, sysmon
=====================================================================================
GPU 1: Intel(R) Data Center GPU Max 1100    PCI Bus: 0000:21:00.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26516    Subdevices: 0
EU Count: 448    Threads Per EU: 8    EU SIMD Width: 16    Total Memory(MB): 46679.2
Core Frequency(MHz): 200.0 of 1550.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
     PID,  Device Memory Used(MB),  Shared Memory Used(MB),  GPU Engines, Executable
    4809,                     2.2,                     0.0,      COMPUTE, /usr/bin/xpumd
 2651639,                     0.4,                     0.0,      COMPUTE, python
 2651661,                     6.9,                     0.0,      COMPUTE, python
 2652076,                     2.2,                     0.0,      UNKNOWN, sysmon
=====================================================================================
GPU 2: Intel(R) Data Center GPU Max 1100    PCI Bus: 0000:29:00.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26516    Subdevices: 0
EU Count: 448    Threads Per EU: 8    EU SIMD Width: 16    Total Memory(MB): 46679.2
Core Frequency(MHz): 200.0 of 1550.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
     PID,  Device Memory Used(MB),  Shared Memory Used(MB),  GPU Engines, Executable
    4809,                     2.2,                     0.0,      COMPUTE, /usr/bin/xpumd
 2651639,                     0.4,                     0.0,      COMPUTE, python
 2651661,                     6.9,                     0.0,      COMPUTE, python
 2652076,                     2.2,                     0.0,      UNKNOWN, sysmon
=====================================================================================
GPU 3: Intel(R) Data Center GPU Max 1100    PCI Bus: 0000:2d:00.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26516    Subdevices: 0
EU Count: 448    Threads Per EU: 8    EU SIMD Width: 16    Total Memory(MB): 46679.2
Core Frequency(MHz): 200.0 of 1550.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
     PID,  Device Memory Used(MB),  Shared Memory Used(MB),  GPU Engines, Executable
    4809,                     2.2,                     0.0,      COMPUTE, /usr/bin/xpumd
 2651639,                     0.4,                     0.0,      COMPUTE, python
 2651661,                     4.4,                     0.0,      COMPUTE, python
 2652076,                     2.2,                     0.0,      UNKNOWN, sysmon

To monitor usage periodically, pair it with the Linux watch utility:

watch -n 5 sysmon # calls sysmon every 5 seconds

xpumcli

See the XPU Manager CLI help info

$ xpumcli -h
Intel XPU Manager Command Line Interface -- v1.2
Intel XPU Manager Command Line Interface provides the Intel data center GPU model and monitoring capabilities. It can also be used to change the Intel data center GPU settings and update the firmware.
Intel XPU Manager is based on Intel oneAPI Level Zero. Before using Intel XPU Manager, the GPU driver and Intel oneAPI Level Zero should be installed rightly.
Supported devices:
  - Intel Data Center GPU 
Usage: xpumcli [Options]
  xpumcli -v
  xpumcli -h
  xpumcli discovery
Options:
  -h,--help                   Print this help message and exit
  -v,--version                Display version information and exit.
Subcommands:
  discovery                   Discover the GPU devices installed on this machine and provide the device info.
  topology                    Get the system topology.
  group                       Group the managed GPU devices.
  diag                        Run some test suites to diagnose GPU.
  health                      Get the GPU device component health status.
  policy                      Get and set the GPU policies.
  updatefw                    Update GPU firmware
  config                      Get and change the GPU settings.
  topdown                     Expected feature.
  ps                          List status of processes.
  vgpu                        Create and remove virtual GPUs in SRIOV configuration.
  stats                       List the GPU aggregated statistics since last execution of this command or XPU Manager daemon is started.
  dump                        Dump device statistics data.
  log                         Collect GPU debug logs.
  agentset                    Get or change some XPU Manager settings.
  amcsensor                   List the AMC real-time sensor readings.

XPU Manager can be used to get raw device statistics such as temperature, core frequency, and power.

A full list of available metrics is shown in the help for the dump command.

$ xpumcli dump
Dump device statistics data.
Usage: xpumcli dump [Options]
  xpumcli dump -d [deviceIds] -t [deviceTileIds] -m [metricsIds] -i [timeInterval] -n [dumpTimes]
  xpumcli dump --rawdata --start -d [deviceId] -t [deviceTileId] -m [metricsIds]
  xpumcli dump --rawdata --list
  xpumcli dump --rawdata --stop [taskId]
Options:
  -h,--help                   Print this help message and exit
  -j,--json                   Print result in JSON format

  -d,--device                 The device IDs or PCI BDF addresses to query. The value of "-1" means all devices.
  -t,--tile                   The device tile IDs to query. If the device has only one tile, this parameter should not be specified.
  -m,--metrics                Metrics type to collect raw data, options. Separated by the comma.
                              0. GPU Utilization (%), GPU active time of the elapsed time, per tile
                              1. GPU Power (W), per tile
                              2. GPU Frequency (MHz), per tile
                              3. GPU Core Temperature (Celsius Degree), per tile
                              4. GPU Memory Temperature (Celsius Degree), per tile
                              5. GPU Memory Utilization (%), per tile
                              6. GPU Memory Read (kB/s), per tile
                              7. GPU Memory Write (kB/s), per tile
                              8. GPU Energy Consumed (J), per tile
                              9. GPU EU Array Active (%), the normalized sum of all cycles on all EUs that were spent actively executing instructions. Per tile.
                              10. GPU EU Array Stall (%), the normalized sum of all cycles on all EUs during which the EUs were stalled. Per tile.
                                  At least one thread is loaded, but the EU is stalled. Per tile.
                              11. GPU EU Array Idle (%), the normalized sum of all cycles on all cores when no threads were scheduled on a core. Per tile.
                              12. Reset Counter, per tile.
                              13. Programming Errors, per tile.
                              14. Driver Errors, per tile.
                              15. Cache Errors Correctable, per tile.
                              16. Cache Errors Uncorrectable, per tile.
                              17. GPU Memory Bandwidth Utilization (%)
                              18. GPU Memory Used (MiB)
                              19. PCIe Read (kB/s), per GPU
                              20. PCIe Write (kB/s), per GPU
                              21. Xe Link Throughput (kB/s), a list of tile-to-tile Xe Link throughput.
                              22. Compute engine utilizations (%), per tile.
                              23. Render engine utilizations (%), per tile.
                              24. Media decoder engine utilizations (%), per tile.
                              25. Media encoder engine utilizations (%), per tile.
                              26. Copy engine utilizations (%), per tile.
                              27. Media enhancement engine utilizations (%), per tile.
                              28. 3D engine utilizations (%), per tile.
                              29. GPU Memory Errors Correctable, per tile. Other non-compute correctable errors are also included.
                              30. GPU Memory Errors Uncorrectable, per tile. Other non-compute uncorrectable errors are also included.
                              31. Compute engine group utilization (%), per tile.
                              32. Render engine group utilization (%), per tile.
                              33. Media engine group utilization (%), per tile.
                              34. Copy engine group utilization (%), per tile.
                              35. Throttle reason, per tile.
                              36. Media Engine Frequency (MHz), per tile

  -i                          The interval (in seconds) to dump the device statistics to screen. Default value: 1 second.
  -n                          Number of the device statistics dump to screen. The dump will never be ended if this parameter is not specified.

  --rawdata                   Dump the required raw statistics to a file in background.
  --start                     Start a new background task to dump the raw statistics to a file. The task ID and the generated file path are returned.
  --stop                      Stop one active dump task.
  --list                      List all the active dump tasks.

Usage example for the dump command:

$ xpumcli dump -d 0 -m 0,1,2,3,4,5
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
11:01:41.000,    0, 0.00, 27.90,    0, 25.00, 21.00, 0.05
11:01:42.000,    0, 0.00, 28.87,    0, 24.50, 21.00, 0.05
11:01:43.000,    0, 0.00, 28.76,    0, 25.00, 21.00, 0.05
11:01:44.000,    0, 0.00, 28.77,    0, 24.00, 21.00, 0.05
11:01:45.000,    0, 0.00, 28.78,    0, 23.50, 21.00, 0.05
11:01:46.000,    0, 0.00, 28.84,    0, 24.00, 21.00, 0.05
11:01:47.000,    0, 0.00, 28.73,    0, 24.50, 21.00, 0.05
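
To keep a utilization log for the duration of a job, set the sampling interval and count with -i and -n and redirect the output to a file; the metric IDs, interval, count, and filename below are only examples.

xpumcli dump -d 0 -m 0,1,18 -i 5 -n 720 > gpu0_stats.csv   # sample GPU 0 utilization, power, and memory used every 5 s for 1 hour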