Software for Machine Learning
Choosing Which Machine Learning Software to Use
With the variety of machine learning tools available on the clusters, it is important to choose the one best suited to the needs of your project. Most of these tools are Python libraries, so your familiarity with Python may influence which one you choose. Some packages are better suited to GPU use than others. Below is a list of the machine learning tools available on the clusters, with a short description of each.
- Keras - Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.
- PyTorch - Tensors and dynamic neural networks in Python with strong GPU acceleration.
- Scikit-Learn - An open-source Python library for data mining and data analysis. Built on NumPy, SciPy, and matplotlib.
- TensorFlow - An open-source Python library for machine intelligence.
- Caffe - A modular deep learning framework.
- Faster-RCNN using TensorFlow - Faster R-CNN is an object detection framework based on deep convolutional networks. It consists of a Region Proposal Network (RPN) and an object detection network, which are trained to share convolutional layers so that testing is fast.
Python
The Python page is a useful resource for learning how to load Python and set up a virtual environment, which may be necessary when installing additional packages or libraries. Note that not every version of Python is compatible with every machine learning tool, so check the version requirements and other limitations of the package you are using.
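As a minimal sketch of the compatibility check described above, a script can compare the running interpreter against the version range a package supports before anything is installed. The function name `python_is_supported` and the bounds used here (3.9 to 3.12) are placeholder assumptions for illustration, not the requirements of any particular package; take the real values from the package's own documentation.

```python
import sys


def python_is_supported(version, minimum, maximum):
    """Return True if the (major, minor) part of `version` lies in [minimum, maximum]."""
    return tuple(minimum) <= tuple(version[:2]) <= tuple(maximum)


# Hypothetical bounds, for illustration only.
MINIMUM = (3, 9)
MAXIMUM = (3, 12)

if __name__ == "__main__":
    current = sys.version_info
    label = f"Python {current.major}.{current.minor}"
    if python_is_supported(current, MINIMUM, MAXIMUM):
        print(f"{label} is within the assumed supported range.")
    else:
        print(f"{label} is outside the assumed supported range; load a different version.")
```

Running this inside the virtual environment confirms which interpreter the environment actually uses.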
Anaconda
Anaconda is a distribution of Python aimed at simplifying package management. It can also create virtual environments, which may make it a more convenient way to install and use machine learning tools.
Datasets
ImageNet Dataset
The ImageNet 2012 dataset is a large collection of images and bounding box annotations. It is widely used in computer vision tasks such as image classification and object detection. Processed versions of the dataset are located in the scratch directory.
The TensorFlow version of ImageNet, stored in TFRecord format, is located at
/scratch/data/tensorflow-computer-vision-datasets/ILSVRC2012_tfrecord
Here is a preview of the dataset directory structure:
ILSVRC2012_tfrecord
|-- train-00000-of-01024
|-- train-00001-of-01024
|-- train-00002-of-01024
|-- ...
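Before starting a long training job, it can help to confirm that all shards are present. The sketch below uses only the Python standard library and assumes that every shard follows the `split-index-of-total` naming shown in the listing above. The helper names `list_shards` and `missing_shard_indices` are hypothetical, and only the `train` split appears in the listing, so the names of any other splits are an assumption.

```python
import re
from pathlib import Path

# Shard names in the listing look like "train-00000-of-01024".
SHARD_PATTERN = re.compile(r"^(?P<split>[a-z]+)-(?P<index>\d{5})-of-(?P<total>\d{5})$")


def list_shards(directory, split="train"):
    """Return the sorted shard paths belonging to one split in `directory`."""
    shards = []
    for path in Path(directory).iterdir():
        match = SHARD_PATTERN.match(path.name)
        if match and match.group("split") == split:
            shards.append(path)
    return sorted(shards)


def missing_shard_indices(shards):
    """Return the shard indices that are absent, using the total encoded in the names."""
    if not shards:
        return []
    total = int(SHARD_PATTERN.match(shards[0].name).group("total"))
    present = {int(SHARD_PATTERN.match(p.name).group("index")) for p in shards}
    return [i for i in range(total) if i not in present]


if __name__ == "__main__":
    root = Path("/scratch/data/tensorflow-computer-vision-datasets/ILSVRC2012_tfrecord")
    if root.is_dir():
        train = list_shards(root)
        print(f"{len(train)} training shards found, {len(missing_shard_indices(train))} missing")
```

The resulting paths, converted to strings, can be passed as the filename list to `tf.data.TFRecordDataset`. How the records inside each shard are encoded is not described here, so the parsing step is left out.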
The PyTorch format of ImageNet can be found at
/scratch/data/pytorch-computer-vision-datasets/imagenet-raw-dataset
Here is a preview of the dataset directory structure:
imagenet-raw-dataset
|-- train
| |-- n01440764
| | |-- n01440764_10026.JPEG
| | |-- n01440764_10027.JPEG
| | |-- n01440764_10029.JPEG
| | |-- ...
|-- val
| |-- n01440764
| | |-- ILSVRC2012_val_00000293.JPEG
| | |-- ILSVRC2012_val_00002138.JPEG
| | |-- ILSVRC2012_val_00003014.JPEG
| | |-- ...
|-- bounding_boxes
| |-- n01440764
| | |-- n01440764_10040.xml
| | |-- n01440764_10048.xml
| | |-- n01440764_10074.xml
| | |-- ...
|-- imagenet_2012_bounding_boxes.csv
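The `train` and `val` directories use a one-subdirectory-per-class layout, which is the arrangement that `torchvision.datasets.ImageFolder` is designed to read. To show how that layout maps to labels, here is a minimal standard-library sketch that assigns an integer to each class directory (in sorted order) and collects the image paths. The function name `index_image_folder` is hypothetical, and the `.JPEG` extension is taken from the listing above.

```python
from pathlib import Path


def index_image_folder(split_dir, extension=".JPEG"):
    """Build a class-to-index map and a list of (image path, label) pairs.

    `split_dir` is a directory such as `train` or `val` whose
    subdirectories are named after the classes.
    """
    split_dir = Path(split_dir)
    classes = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    samples = []
    for name in classes:
        for image in sorted((split_dir / name).glob(f"*{extension}")):
            samples.append((image, class_to_idx[name]))
    return class_to_idx, samples


if __name__ == "__main__":
    root = Path("/scratch/data/pytorch-computer-vision-datasets/imagenet-raw-dataset")
    if (root / "train").is_dir():
        class_to_idx, samples = index_image_folder(root / "train")
        print(f"{len(class_to_idx)} classes, {len(samples)} training images")
```

In practice, pointing `torchvision.datasets.ImageFolder` at the `train` or `val` directory should be the simpler route. The sketch is mainly useful for inspecting the dataset without loading PyTorch.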