TensorFlow Benchmarks (HPRC)
In order to compare different installations of TensorFlow on our clusters (and with other clusters), we conducted a number of benchmarks.
Overview
The benchmarks came from the TensorFlow Benchmarks git repository, downloaded February 8, 2018.
All runs were done in a single batch job on a single node on the cluster. All runs used every CPU available on the node that was allocated. All GPU runs used 2 GPUs (NVIDIA Tesla K20s on ada and NVIDIA Tesla K80s on terra).
For Anaconda's TensorFlow, we used Anaconda3/5.1.0 with either tensorflow/1.5.0 (CPU only) or tensorflow-gpu/1.4.1 (GPU). For the locally built versions, we used TensorFlow/1.5.0-foss-2017b-Python-3.6.3 (CPU only) and TensorFlow/1.5.0-goolfc-2017b-Python-3.6.3 (GPU). [Note: we do not yet have a version built with the Intel compilers, nor have we managed to build these on ada or curie yet.]
For non-GPU (CPU only) runs, we used:
python tf_cnn_benchmarks.py --batch_size=32 --model=$model --variable_update=parameter_server --device=cpu --data_format=NHWC
For GPU runs, we used:
python tf_cnn_benchmarks.py --batch_size=32 --model=$model --variable_update=parameter_server --num_gpus=2
Summary
Unsurprising result #1: the CUDA/GPU versions ran faster than the non-CUDA (CPU-only) versions.
In the end we found, unsurprisingly, that in most cases the locally built non-GPU version outperformed the Anaconda version by a factor of three (3x) or more. This is because the local build can take advantage of CPU extensions, such as AVX, that are available on the cluster where it was built.
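To see which of these extensions a node's CPUs actually advertise, the flags in /proc/cpuinfo can be checked directly. A quick sketch (Linux-specific; the extensions listed are just a sample of the ones a local build might use):

```shell
#!/bin/bash
# List which SIMD extensions this node's CPUs advertise (Linux).
# A locally built TensorFlow can be compiled to use these; generic
# binary wheels (e.g. from Anaconda) are typically built without AVX.
flags=$(grep -m1 -E '^(flags|Features)' /proc/cpuinfo)
for ext in sse4_1 sse4_2 avx avx2 avx512f; do
    case " $flags " in
        *" $ext "*) echo "$ext: yes" ;;
        *)          echo "$ext: no"  ;;
    esac
done
```

If TensorFlow was built without an extension the node supports, it prints a warning to that effect at startup, which is an easy way to spot an under-optimized binary.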
Similarly unsurprising: the two CUDA versions showed little difference in performance. Since both use the same CUDA code, differences in CPU performance had much less impact.
It should be noted that we haven't fully explored hybrid CPU/GPU use, but given these results, the locally built version would likely perform better if hybrid use is desired.
Regardless, for now you should probably stick with the GPU versions. The queue waits might be longer, but the SUs saved, and the faster turnaround once the job starts, might outweigh the time spent waiting for a GPU.
Results
Terra
A complete description of Terra can be found in the Terra user's guide.
model | CPUx28, local (images/second) | CPUx28, Anaconda (images/second) |
---|---|---|
alexnet | 18.87 | 5.94 |
vgg11 | 2.58 | 0.71 |
vgg16 | 1.33 | 0.36 |
resnet50 | 3.33 | 0.58 |
vgg19 | 1.05 | 0.29 |
inception3 | 3.27 | 0.68 |
resnet101 | 1.93 | 0.34 |
resnet152 | 1.33 | 0.23 |
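As a quick sanity check on the factor-of-three claim in the Summary, the per-model speedup of the local build over the Anaconda build can be computed from these numbers. A sketch with the table values hardcoded, assuming both columns are images/second:

```shell
#!/bin/bash
# Speedup of the locally built TensorFlow over the Anaconda build,
# per model, using the Terra CPU numbers (images/second) above.
awk 'BEGIN {
    n = split("alexnet vgg11 vgg16 resnet50 vgg19 inception3 resnet101 resnet152", m)
    split("18.87 2.58 1.33 3.33 1.05 3.27 1.93 1.33", loc)
    split("5.94 0.71 0.36 0.58 0.29 0.68 0.34 0.23", ana)
    for (i = 1; i <= n; i++)
        printf "%-10s %.1fx\n", m[i], loc[i] / ana[i]
}'
```

Every model comes out at 3.2x or better, consistent with the Summary.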
HPE DL385 Gen 10
256 GB memory, AMD EPYC 7451 24-core processor, one AMD Radeon Instinct MI25 (Vega 10, rev 01) GPU with 16 GB memory
TensorFlow was run in a Docker container as a natively installed application (i.e., not via Anaconda).
model | CPUx24, local (images/second) | GPUx1 (MI25) |
---|---|---|
alexnet | 10.20 | - |
vgg11 | 1.25 | - |
vgg16 | 0.62 | - |
resnet50 | 0.97 | - |
vgg19 | 0.50 | - |
inception3 | 1.15 | - |
resnet101 | 0.58 | - |
resnet152 | 0.39 | - |
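For a rough cross-machine comparison, the alexnet CPU results can be normalized per core. A sketch assuming both values are local-build images/second (it ignores clock speed and memory differences):

```shell
#!/bin/bash
# Per-core alexnet throughput on the two CPU systems benchmarked above.
awk 'BEGIN {
    printf "Terra (28 cores): %.3f images/second/core\n", 18.87 / 28
    printf "DL385 (24 cores): %.3f images/second/core\n", 10.20 / 24
}'
```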
Scripts
Common run script
Referred to as bmtf.sh in the job scripts below.
#!/bin/bash
WITH_GPU=0
if [ "$1" == "--with-gpu" ] ; then
    WITH_GPU=1
fi
RUN_SOURCE=1
RUN_ANACONDA=1
for model in alexnet vgg11 vgg16 resnet50 vgg19 inception3 resnet101 resnet152; do
    # Parameters to pass to the benchmark
    if [ 0 -ne $WITH_GPU ] ; then
        TFPARMS="--batch_size=32 --model=$model --variable_update=parameter_server --num_gpus=2" # from the documentation
    else
        TFPARMS="--batch_size=32 --model=$model --variable_update=parameter_server --device=cpu --data_format=NHWC" # need last two for non-GPU
    fi
    echo -e "\n### `date` - Running on `hostname` with parameters: python tf_cnn_benchmarks.py $TFPARMS\n"
    # built from scratch
    if [ 0 -ne $RUN_SOURCE ] ; then
        echo -e "\n### Running locally built version of TensorFlow\n"
        ml purge
        if [ 0 -ne $WITH_GPU ] ; then
            ml TensorFlow/1.5.0-goolfc-2017b-Python-3.6.3
        else
            ml TensorFlow/1.5.0-foss-2017b-Python-3.6.3 # not available on ada yet
        fi
        ml
        which python
        /usr/bin/time -av python /scratch/group/hprc/TensorFlowBM/tensorflow_benchmarks-20180208/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py $TFPARMS
    fi
    # installed by conda
    if [ 0 -ne $RUN_ANACONDA ] ; then
        echo -e "\n### Running Anaconda version of TensorFlow\n"
        ml purge
        ml Anaconda3/5.1.0
        if [ 0 -ne $WITH_GPU ] ; then
            source activate $SCRATCH/myAnaconda3/5.1.0-tensorflow-gpu
        else
            source activate $SCRATCH/myAnaconda3/5.1.0-tensorflow
        fi
        conda list
        which python
        /usr/bin/time -av python /scratch/group/hprc/TensorFlowBM/tensorflow_benchmarks-20180208/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py $TFPARMS
        source deactivate
    fi
done
# EOF
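To check what the loop above will actually run, without loading any modules or launching TensorFlow, the parameter logic can be exercised on its own (a dry-run sketch for a single model):

```shell
#!/bin/bash
# Dry run: print the two command lines bmtf.sh builds for one model,
# without loading modules or running the benchmark itself.
for WITH_GPU in 0 1; do
    model=resnet50
    if [ 0 -ne $WITH_GPU ] ; then
        TFPARMS="--batch_size=32 --model=$model --variable_update=parameter_server --num_gpus=2"
    else
        TFPARMS="--batch_size=32 --model=$model --variable_update=parameter_server --device=cpu --data_format=NHWC"
    fi
    echo "python tf_cnn_benchmarks.py $TFPARMS"
done
```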
Batch job scripts
Terra
GPU
#!/bin/bash
## ENVIRONMENT SETTINGS; CHANGE WITH CAUTION
#SBATCH --export=NONE # Do not propagate environment
#SBATCH --get-user-env=L # Replicate login environment
## NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=bmtf-gpu
#SBATCH --time=24:00:00
#SBATCH --nodes=1 # Request 1 node
#SBATCH --ntasks-per-node=28 # Request all avail cores on node
#SBATCH --mem=56G # Request all avail memory on node
#SBATCH --output=out.bmtf-gpu.%j
#SBATCH --partition=staff # Priority testing queue
## OPTIONAL JOB SPECIFICATIONS
#JKPBATCH --account=123456 # Set billing account to 123456
#JKPBATCH --mail-type=ALL # Send email on all job events
#JKPBATCH --mail-user=j-perdue@tamu.edu # Where to send email
## GPUS
#SBATCH --gres=gpu:2
echo -e "\n# Job submitted from $SLURM_SUBMIT_HOST:$SLURM_SUBMIT_DIR\n"
echo "#################### Start: Job Script `basename $0` ###########################"
cat $0
cat bmtf.sh
echo "#################### End Job Script `basename $0` ###########################"
./bmtf.sh --with-gpu
echo "### Ending Job"
# EOF
CPU-only
#!/bin/bash
## ENVIRONMENT SETTINGS; CHANGE WITH CAUTION
#SBATCH --export=NONE # Do not propagate environment
#SBATCH --get-user-env=L # Replicate login environment
## NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=bmtf
#SBATCH --time=24:00:00
#SBATCH --nodes=1 # Request 1 node
#SBATCH --ntasks-per-node=28 # Request all avail cores on node
#SBATCH --mem=56G # Request all avail memory on node
#SBATCH --output=out.bmtf.%j
#SBATCH --partition=staff # Priority testing queue
## OPTIONAL JOB SPECIFICATIONS
#JKPBATCH --account=123456 # Set billing account to 123456
#JKPBATCH --mail-type=ALL # Send email on all job events
#JKPBATCH --mail-user=j-perdue@tamu.edu # Where to send email
## GPUS
#SBATCH --gres=gpu:0
echo -e "\n# Job submitted from $SLURM_SUBMIT_HOST:$SLURM_SUBMIT_DIR\n"
echo "#################### Start: Job Script `basename $0` ###########################"
cat $0
cat bmtf.sh
echo "#################### End Job Script `basename $0` ###########################"
./bmtf.sh
echo "### Ending Job"
# EOF