
SW:Spark Jupyter Notebook

From TAMU HPRC


default notebook

You can use a default Python virtual environment in the Spark Jupyter Notebook portal app by leaving the "Optional Python Environment to be activated" field blank.

The default Spark notebook uses the module Spark/2.4.0-intel-2018b-Hadoop-2.7-Java-1.8-Python-3.6.6 and the following Python packages: jupyter, numpy, sklearn, pandas, seaborn, pyarrow.

create your own notebook

You can create your own Spark Jupyter Notebook Python virtual environment for use on the HPRC Portal, but you must use the following module to create your Python virtualenv:

GRACE:
module load iccifort/2019.5.281  impi/2018.5.288  Spark/2.4.5-Python-3.7.4-Java-1.8
TERRA:
module load Spark/2.4.0-intel-2018b-Hadoop-2.7-Java-1.8-Python-3.6.6

Note that you will need enough available file quota (~10,000 files), since pip creates thousands of files.
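To gauge how much of your file quota an environment consumes, you can count the files under your virtualenv directory. A quick sketch, using the example pip_envs directory from this page:

```shell
# Count the files under the example pip_envs directory; compare the
# result against your remaining file quota before installing more packages.
ENV_DIR=/scratch/user/mynetid/pip_envs
find "$ENV_DIR" -type f | wc -l
```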

To create a Python virtual environment called my_spark_notebook-python-3.6.6-foss-2018b (you can name it whatever you like), do the following on the command line. You can save your virtual environments in any $SCRATCH directory you want; this example uses a directory called /scratch/user/mynetid/pip_envs, but you can use another name instead of pip_envs.

mkdir -p /scratch/user/mynetid/pip_envs

A good practice is to include the Python version in the name of your virtualenv so that you can tell which module to load when activating it.

The next four lines will create your virtual environment using the Spark module on Terra.

module purge
module load Spark/2.4.0-intel-2018b-Hadoop-2.7-Java-1.8-Python-3.6.6
export SPARK_HOME=$EBROOTSPARK
virtualenv /scratch/user/mynetid/pip_envs/my_spark_notebook-python-3.6.6-foss-2018b

Then activate the virtual environment by sourcing the activate script inside it (using its full path) and install your Python packages.

First install the required dependencies (jupyter, numpy, sklearn, pandas, seaborn, pyarrow); then you can install any additional packages.

source /scratch/user/mynetid/pip_envs/my_spark_notebook-python-3.6.6-foss-2018b/bin/activate
python3 -m pip install jupyter
python3 -m pip install numpy
python3 -m pip install sklearn
python3 -m pip install pandas
python3 -m pip install seaborn
python3 -m pip install pyarrow
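Before pointing the portal app at the environment, it can help to confirm that the required packages are importable. A minimal check, run with the virtualenv activated:

```python
# Check that each required notebook dependency can be found by the
# active Python interpreter; prints any that are missing.
import importlib.util

required = ["jupyter", "numpy", "sklearn", "pandas", "seaborn", "pyarrow"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
if missing:
    print("missing packages:", ", ".join(missing))
else:
    print("all required packages found")
```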

When you are finished installing your Python packages, go to the Spark Jupyter Notebook portal app and enter the full path of your virtualenv in the 'Optional Python Environment to be activated' field; in this example, /scratch/user/mynetid/pip_envs/my_spark_notebook-python-3.6.6-foss-2018b.