Software for Machine Learning

Choosing Which Machine Learning Software to Use

Several machine learning tools are available on the clusters, so it is important to choose the one best suited to the needs of your project. Most of these tools are Python libraries, so familiarity with Python could influence your decision on which to use. Some packages are better suited to GPU use than others. The machine learning tools available on the clusters are listed below, with a short description of each.

Keras - Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.
PyTorch - Tensors and dynamic neural networks in Python with strong GPU acceleration.
Scikit-Learn - An open-source Python library for data mining and data analysis. Built on NumPy, SciPy, and matplotlib.
TensorFlow - An open-source Python library for machine intelligence.
Caffe - A modular deep learning framework.

Python

The Python page explains how to load Python and set up a virtual environment, which may be necessary when installing additional packages or libraries. Note that not every version of Python is compatible with every machine learning tool, so check which Python versions your chosen package supports before installing it.
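Before installing anything, it can help to see which Python version and which machine learning packages the current environment already provides. The following is a minimal sketch using only the standard library (Python 3.8 or later). The distribution names in PACKAGES are assumptions based on the names these projects use on PyPI; adjust the list to match the tools you intend to use.

```python
import sys
from importlib import metadata

# Assumed PyPI distribution names for the tools listed on this page.
PACKAGES = ["keras", "torch", "scikit-learn", "tensorflow"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Report the interpreter version first, since package support depends on it.
    print("Python", ".".join(str(part) for part in sys.version_info[:3]))
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running the script inside an activated virtual environment reports what that environment contains, which may differ from the system-wide Python.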

Anaconda

Anaconda is a distribution of Python aimed at simplifying package management. Anaconda can also create virtual environments, which may make it a more convenient way to install and use machine learning tools.

Datasets

ImageNet Dataset

The ImageNet 2012 dataset is a large collection of images and bounding box annotations. It is widely used in computer vision tasks such as image classification and object detection. Processed versions of the dataset are located in the scratch directory.

More specifically, the TensorFlow version of ImageNet (TFRecord format) is located at

/scratch/data/tensorflow-computer-vision-datasets/ILSVRC2012_tfrecord

Here is a preview of the dataset directory structure:

ILSVRC2012_tfrecord
|-- train-00000-of-01024
|-- train-00001-of-01024
|-- train-00002-of-01024
|-- ...
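The shard names above follow a train-NNNNN-of-01024 pattern, so the training split can be checked for completeness and opened with a few lines of Python. This is a minimal sketch, not an official loader: the helper names are hypothetical, the shard count of 1024 is read off the listing above, and open_train_dataset assumes TensorFlow is installed in your environment. The records are returned as raw serialized bytes, because the feature schema needed to parse them is not given on this page.

```python
from pathlib import Path

# Dataset location as given on this page.
TFRECORD_DIR = Path("/scratch/data/tensorflow-computer-vision-datasets/ILSVRC2012_tfrecord")
# Shard count inferred from the "-of-01024" suffix in the listing.
NUM_TRAIN_SHARDS = 1024


def expected_train_shards(num_shards=NUM_TRAIN_SHARDS):
    """Return the shard file names implied by the naming pattern."""
    return [f"train-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


def missing_train_shards(root=TFRECORD_DIR):
    """List expected training shards that are not present under root."""
    present = {path.name for path in Path(root).glob("train-*-of-*")}
    return [name for name in expected_train_shards() if name not in present]


def open_train_dataset(root=TFRECORD_DIR):
    """Open all training shards as a dataset of raw serialized records."""
    import tensorflow as tf  # imported lazily so the helpers above work without it

    files = sorted(str(path) for path in Path(root).glob("train-*-of-*"))
    return tf.data.TFRecordDataset(files)
```

Calling missing_train_shards() on a cluster node should return an empty list if every training shard is visible from that node.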

The PyTorch format of ImageNet can be found at

/scratch/data/pytorch-computer-vision-datasets/imagenet-raw-dataset

Here is a preview of the dataset directory structure:

imagenet-raw-dataset
|-- train
|   |-- n01440764
|   |   |-- n01440764_10026.JPEG
|   |   |-- n01440764_10027.JPEG
|   |   |-- n01440764_10029.JPEG
|   |   |-- ...
|-- val
|   |-- n01440764
|   |   |-- ILSVRC2012_val_00000293.JPEG
|   |   |-- ILSVRC2012_val_00002138.JPEG
|   |   |-- ILSVRC2012_val_00003014.JPEG
|   |   |-- ...
|-- bounding_boxes
|   |-- n01440764
|   |   |-- n01440764_10040.xml
|   |   |-- n01440764_10048.xml
|   |   |-- n01440764_10074.xml
|   |   |-- ...
|-- imagenet_2012_bounding_boxes.csv
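Because the train and val directories hold one subdirectory per class, they match the layout that torchvision's ImageFolder expects. The following is a minimal sketch under that assumption: the helper names are hypothetical, load_split requires torchvision to be installed, and the resize and crop sizes are common illustrative defaults rather than settings prescribed by this page.

```python
from pathlib import Path

# Dataset location as given on this page.
IMAGENET_DIR = Path("/scratch/data/pytorch-computer-vision-datasets/imagenet-raw-dataset")


def images_per_class(split_dir):
    """Count the .JPEG files in each class subdirectory of a split."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.JPEG"))
    return counts


def load_split(split="train", root=IMAGENET_DIR):
    """Open one split ("train" or "val") as a torchvision ImageFolder dataset."""
    # Imported lazily so images_per_class works without torchvision.
    from torchvision import datasets, transforms

    # Illustrative preprocessing; choose sizes that suit your model.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    return datasets.ImageFolder(str(Path(root) / split), transform=preprocess)
```

ImageFolder assigns integer labels from the sorted class directory names, so the train and val splits receive consistent labels as long as both contain the same set of class directories.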