College of Engineering HPC

HPC Access Instructions

What is an HPC?

The College of Engineering (CoE) High Performance Computing (HPC) system is a computing system with powerful multi-core and multi-socket servers, high performance storage, GPUs, and large amounts of memory, tied together by an ultra-fast inter-connection network. Its design allows the HPC to support some of the most compute and memory intensive programs developed today. While HPC clusters in general are designed for software that works across multiple nodes, each node in the cluster is a powerful machine that can cover the computational needs of most users.

The CoE HPC cluster is a Linux distributed cluster featuring a large number of nodes with leading edge Intel processors that are tightly integrated via a very high-speed communication network. A subset of the nodes have additional memory (256 GB per node) and accelerators (NVIDIA P-100 GPUs), making them suitable for some of the most demanding computational tasks.

  • Compute infrastructure: A total of 36 nodes of various configurations with a total of 1008 compute cores provided by Intel Xeon E5-2660 v4 (codename Broadwell, 2.0 GHz, 35 MB cache) processors, communicating via a 56 Gbps high-speed internal network. This system provides 100 TFlop/s of peak performance. Of the nodes, 15 each include one NVIDIA Tesla P100 12 GB GPU, and 1 has two NVIDIA Tesla P100 GPUs. The GPU subsystem provides 68.6 TFlop/s of the overall 100 TFlop/s peak performance.
  • Memory: 20 nodes (compute nodes) have 128 GB of RAM, and 16 nodes (GPU and condo nodes) feature 256 GB. The total memory of the system is 6.7 TB.
  • SSD input/output: All GPU and condo nodes are equipped with solid state drives (SSDs) for ultra-high-performance input/output (I/O).  The total system SSD capacity is 12.8 TB.
  • Parallel file systems: The HPC has 110 TB of home directory and data storage, which is available from all nodes via /home and /data. Additionally, the HPC has a high-throughput Lustre parallel file system with a usable space of 0.541 PB, available via /scratch. Each group will have a sub-directory in /data and /scratch that they can write to.
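
As a quick sanity check, the totals quoted in the list above can be reproduced with shell arithmetic. This is only a sketch: the per-node core count is derived here by dividing the stated totals (assuming cores are spread evenly across nodes) and is not taken from a separate specification.

```shell
#!/bin/bash
# Figures taken from the hardware list above.
total_nodes=36
total_cores=1008
nodes_128gb=20
nodes_256gb=16
single_gpu_nodes=15
dual_gpu_nodes=1

# Derived values (assumption: every node has the same number of cores).
cores_per_node=$(( total_cores / total_nodes ))
total_ram_gb=$(( nodes_128gb * 128 + nodes_256gb * 256 ))
total_gpus=$(( single_gpu_nodes * 1 + dual_gpu_nodes * 2 ))

echo "Cores per node: ${cores_per_node}"
echo "Total RAM:      ${total_ram_gb} GB"
echo "Total GPUs:     ${total_gpus}"
```

The RAM total of 6656 GB corresponds to the roughly 6.7 TB quoted above.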

How to Access the HPC

The HPC can only be accessed while connected to the campus network, i.e., SJSU_premier WiFi, an on-campus LAN port, or VPN. If connecting from outside campus, you will need to first establish a VPN connection. Follow the campus instructions for setting up the VPN; you can use your SJSUOne credentials to establish a connection.

HPC systems are primarily accessed via a terminal interface, and many of our users write custom programs to run complex analyses. In the future, we may also provide interactive access to the HPC systems through Jupyter Notebooks or other interactive options.

If you are connecting from a computer running the Windows OS, you will need to download and install PuTTY. OSX and Linux do not require additional software to connect.

Windows

Connect via PuTTY to coe-hpc1.sjsu.edu.

Linux

Open the Terminal app and type:

ssh SJSU_ID@coe-hpc.sjsu.edu

If the previous command times out when connecting via VPN, use:

ssh SJSU_ID@coe-hpc1.sjsu.edu

Please note:

  • You will be prompted for a password. Type in your SJSUOne password, then press ENTER. The terminal will not display your password as you type. The HPC does not store your password, nor does it verify your password locally. Thus, if you have been told your HPC account has been created and you cannot log in, double-check your password by logging into SJSUOne through another channel (e.g., one.sjsu.edu) to verify that you remember the correct password, then try again.
  • The first time you log into the HPC system, you will be asked if you would like to cache the server fingerprint. Type yes and press ENTER.
  • You are now connected to the HPC Login node. From here you can compile programs, submit jobs, or request interactive nodes.
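
The two host names above can be combined into a small helper that tries coe-hpc.sjsu.edu first and falls back to coe-hpc1.sjsu.edu. This is a minimal sketch, not an official tool: the function name hpc_login and the 10-second timeout are assumptions, and it relies on ssh returning exit status 255 when the connection itself fails (which can also happen for reasons other than a timeout).

```shell
#!/bin/bash
# hpc_login is a hypothetical helper name, not something provided by the HPC.
hpc_login() {
  local user="$1"
  local status=0

  # First attempt: the primary host name, giving up after 10 seconds.
  ssh -o ConnectTimeout=10 "${user}@coe-hpc.sjsu.edu" || status=$?

  # ssh reserves exit status 255 for its own errors (e.g., connection failure).
  if [ "$status" -eq 255 ]; then
    echo "coe-hpc.sjsu.edu unreachable; trying coe-hpc1.sjsu.edu" >&2
    status=0
    ssh "${user}@coe-hpc1.sjsu.edu" || status=$?
  fi
  return "$status"
}

# Usage (replace SJSU_ID with your own ID):
#   hpc_login SJSU_ID
```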

Requesting HPC accounts/access

The CoE HPC is available only to CoE faculty and students, and only faculty may request access. Students needing access for a research project or a class should ask their research advisor or class professor to request HPC access via the access request form, which can also be found on the Department of Computer Engineering web site. Access is granted for up to 6 months for a class project, 1 year for a capstone or research project, and indefinitely for faculty. Student access may be renewed/extended by submitting an additional access request in subsequent semesters.

Accessing HPC Resources

The HPC system is a community resource, shared by many students and faculty in the College of Engineering. As such, it uses a resource scheduling program, called Slurm, to ensure fair access to computing resources among all users. Slurm allows requesting resources both for interactive computing and in batch mode, in which a series of commands is automatically executed once the resources are allocated.

When first accessing the HPC (via ssh or PuTTY, for example), users are logged in to the same server, the login node. This node can be used to write scripts and code, compile programs, and test execution of your programs on small data. However, it should never be used for large-scale computation, as that will negatively impact the ability of other users to access and use the HPC system. Instead, users should schedule jobs to be executed by Slurm in batch mode when resources become available (preferred) or request interactive resources to use for executing necessary computations.

Batch Jobs

Batch jobs are simple Linux Bash scripts that contain one or more commands that should be executed by Slurm. The top of the script must contain instructions for the Slurm batch scheduler, in the form of comments, that dictate the type and amount of resources being requested. An example of such a script is listed below.

#!/bin/bash
#
#SBATCH --job-name=pl2ap
#SBATCH --output=pl2ap-srun.log
#
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=1000
#SBATCH --mail-user=username@sjsu.edu
#SBATCH --mail-type=END
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
/home/david/programs/pl2ap/build/pl2ap pl2ap -t 0.9 /home/david/data/WikiWords200.csr

The example script above executes a multi-threaded OpenMP program called pl2ap using 4 cores on 1 node. The SBATCH lines tell Slurm what resources are needed (1 task, running on 1 node, requesting 4 cores and 1 GB of RAM per core, for a period of 10 minutes) and provide other options for running the job (the job name and what the job log file should be named). Note that the compute nodes have a maximum time limit of 24 hours and the gpu nodes of 48 hours. The default time on all nodes is 4 hours. Condo nodes do not have any time limit. The mail-user and mail-type parameters specify that the HPC should email the user at the provided address when the job is complete (or has ended for some other reason, e.g., if it has run out of time). The OMP_NUM_THREADS, OMP_PLACES, and OMP_PROC_BIND environment variables are used to ensure thread affinity to physical cores. Note that the program and data are stored in the user’s home directory (/home/david/). Home directories are a parallel data resource available on the login node and all other nodes. You should always use full paths when referencing programs you execute, data they require, and log files the programs should write to.

Assuming the script above is stored in a file called myscript.sh, you can schedule the job by executing:

sbatch myscript.sh
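
To capture the job ID at submission time and immediately look the job up in the queue, something like the following can be used. It is a sketch under assumptions: the function name submit_and_show is hypothetical, and it relies on the standard sbatch --parsable option (which prints only the job ID, optionally followed by a cluster name) and squeue -j.

```shell
#!/bin/bash
# submit_and_show is a hypothetical helper name used for illustration.
submit_and_show() {
  local script="$1"
  local job_id

  if ! command -v sbatch >/dev/null 2>&1; then
    echo "sbatch not found; run this on the HPC login node" >&2
    return 1
  fi

  # --parsable makes sbatch print "JOBID" or "JOBID;CLUSTER" and nothing else.
  job_id="$(sbatch --parsable "$script")" || return 1
  job_id="${job_id%%;*}"   # drop the optional ";CLUSTER" suffix

  echo "Submitted job ${job_id}"
  squeue -j "$job_id"      # show the job's current state in the queue
}

# Usage:
#   submit_and_show myscript.sh
```

A job that is no longer needed can be removed from the queue with scancel followed by the job ID.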

Execute man sbatch for detailed information on options you can include in the batch script. Additional Slurm help and tutorials can be found in the official Slurm documentation.
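
In the example batch script, the number 4 appears both in --cpus-per-task and in OMP_NUM_THREADS, and the two can drift apart when the script is edited. One way to keep them in sync, sketched below, is to derive the thread count from SLURM_CPUS_PER_TASK, which Slurm sets inside a job when --cpus-per-task is given. The fallback value of 1 outside a job is an assumption made for this example.

```shell
#!/bin/bash
# Inside a Slurm job, SLURM_CPUS_PER_TASK mirrors the --cpus-per-task request.
# Outside a job (or if the option was omitted) fall back to 1 thread (assumed default).
threads="${SLURM_CPUS_PER_TASK:-1}"

export OMP_NUM_THREADS="$threads"
export OMP_PLACES=cores
export OMP_PROC_BIND=spread

echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```

These lines would replace the three export lines in the batch script; the program invocation stays the same.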

Interactive Jobs

Interactive resources can be requested through the srun command, by specifying a pseudo-terminal the task should execute in, which is accomplished by using the --pty flag.

srun --ntasks 1 --nodes 1 --cpus-per-task 4 --pty /bin/bash

which is equivalent to,

srun -n 1 -N 1 -c 4 --pty /bin/bash

GPUs are a type of resource only available on GPU capable nodes (gpu and condo partitions). They are only available if requested using the --gres flag.

srun -p gpu --gres=gpu --pty /bin/bash
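
Once the interactive session starts, it can be useful to confirm that a GPU was actually granted. The sketch below assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs that request a GPU (this depends on how the cluster is configured) and that nvidia-smi is installed on the GPU nodes; the function name gpu_check is hypothetical.

```shell
#!/bin/bash
# gpu_check is a hypothetical helper; run it inside the interactive session.
gpu_check() {
  # Typically set by Slurm when a GPU was requested with --gres (cluster dependent).
  echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"

  if command -v nvidia-smi >/dev/null 2>&1; then
    # List the name and total memory of each visible GPU.
    nvidia-smi --query-gpu=name,memory.total --format=csv \
      || echo "nvidia-smi failed: no usable GPU in this session"
  else
    echo "nvidia-smi not found: this shell is probably not on a GPU node"
  fi
}

gpu_check
```

The number of GPUs can also be stated explicitly in the request, e.g., --gres=gpu:1.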

Partitions/Queues

There are several queues, or node partitions, that can be used for submitting jobs. The compute partition contains all compute nodes (128 GB RAM, no GPU), and the gpu partition contains all general GPU programming capable nodes (256 GB RAM, with an NVIDIA P100, A100, or H100 GPU). The condo partition is a special partition that contains nodes that belong to research labs in the College of Engineering. Those nodes can be used by general users as long as they are not being used by their owners. If the condo node owner requests resources on their node, any currently running jobs will be preempted to give the owner access. Note that each queue has a different set of limits on requesting resources. You can find the current limits and status of nodes in those partitions by executing sinfo.

PARTITION    TIMELIMIT (days)   NODES   NODELIST
Compute      12                 17      c[1-14,18-20]
GPU (H100)   7                  2       g[2,6]
GPU (A100)   7                  2       g[7,13]
GPU (P100)   7                  10      g[1,3-5,9,11-12,14-16]
Condo        21                 4       condo[3-4,6-7]

Additionally, the program squeue can be used to find out more details about the status of jobs currently in the Slurm queue.
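
The NODELIST column uses Slurm's compact bracket notation, where c[1-14,18-20] stands for c1 through c14 plus c18 through c20. On the cluster itself, scontrol show hostnames expands such an expression. The sketch below is a small stand-alone illustration of how the notation is read; the function name count_nodes is hypothetical and it handles only a single bracket group of plain numbers and ranges.

```shell
#!/bin/bash
# count_nodes is a hypothetical helper that counts the nodes named by an
# expression such as 'c[1-14,18-20]'. Only one bracket group is supported.
count_nodes() {
  local spec="$1"
  local inner part
  local total=0

  # A plain host name without brackets is a single node.
  if [[ "$spec" != *"["*"]" ]]; then
    echo 1
    return
  fi

  inner="${spec#*\[}"    # drop everything up to and including '['
  inner="${inner%\]}"    # drop the trailing ']'

  local IFS=','
  for part in $inner; do
    if [[ "$part" == *-* ]]; then
      # A range such as 3-5 contributes (end - start + 1) nodes.
      total=$(( total + 10#${part#*-} - 10#${part%-*} + 1 ))
    else
      total=$(( total + 1 ))
    fi
  done
  echo "$total"
}

count_nodes 'c[1-14,18-20]'
count_nodes 'g[1,3-5,9,11-12,14-16]'
```

The counts agree with the NODES column for the Compute and GPU (P100) rows of the table.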

Condo nodes/queue

Condo nodes are owned by specific faculty in the CoE. However, the nodes are configured and accessed in the same way as any other HPC compute/gpu node. When the condo node is not being used by its owner, it is available to be used by any user, with the caveat that their jobs may be preempted at any time. When the owner (or someone in their group) requests access to resources on their condo nodes, any jobs currently using those resources on the node are alerted (via a signal) that they need to stop and are forcefully stopped within 5 minutes so that resources can be provided to the condo owner.

Condo nodes should be homogeneous with each other as much as possible to enable potential parallel computing workloads as well as to reduce the burden of creating (and maintaining) customized node images for different types of hardware architectures. As such, condo nodes are purchased at most once a year. An email will go out to all CoE faculty providing details about the current condo node model, giving them the opportunity to buy-in to the condo queue.
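
Because jobs on condo nodes can be preempted, long-running work there benefits from saving its progress regularly and reacting to the stop signal. The sketch below shows the general pattern in Bash. It makes several assumptions that should be confirmed with the HPC administrators: that the alert signal is SIGTERM, and that it is delivered to the shell running the loop. The names on_preempt, run_steps, and CHECKPOINT_FILE are hypothetical.

```shell
#!/bin/bash
# Hypothetical checkpoint file; use a full path in a real job.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-./checkpoint.txt}"
stop_requested=0

# Signal handler: only record that a stop was requested (assumed signal: SIGTERM).
on_preempt() {
  stop_requested=1
}
trap on_preempt TERM

# Run `total` steps, recording the last completed step after each one.
run_steps() {
  local total="$1"
  local step
  for (( step = 1; step <= total; step++ )); do
    if (( stop_requested )); then
      echo "Stop requested; last completed step is saved in ${CHECKPOINT_FILE}"
      return 143   # conventional exit status for termination by SIGTERM
    fi
    sleep 0        # placeholder for one unit of real work
    echo "$step" > "$CHECKPOINT_FILE"
  done
}

# Usage inside a batch script:
#   run_steps 1000
```

On restart, the job can read the checkpoint file and resume from the step after the one recorded there.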

Modular Software

A great deal of software for parallel and scientific computing has been pre-loaded and is available via modules. To see available modules, execute,

module avail

and use the load command to make those modules available to your scripts.

module load python3
python -V

Additional details can be found in the User Guide for Lmod.

Interactive Data Science via Jupyter Notebook

The HPC is equipped with several versions of Python and a number of libraries useful in Data Mining, Machine Learning, and Data Science in general. Moreover, Jupyter Notebook is available in each of the modular Python versions on the HPC. The tutorial below will show you how to establish an interactive session on the HPC and access that session from your computer’s browser via an SSH tunnel. The directions will be provided only for Linux. Please read the appropriate manuals or find online instructions for setting up an SSH tunnel via PuTTY.

First, an explanation of how the process works. We will create two separate SSH sessions to the HPC. The first will be used to establish an interactive HPC session. The second SSH session will open a tunnel from your computer to the HPC login node and forward traffic from a port with ID PORT_ID, which will be a number you choose between 10000 and 63999. If a port with that number is already in use, you will need to select another port ID. We will then open another tunnel from the HPC login node to the HPC interactive node you were assigned, using the same port. Finally, on the HPC interactive node, we will start Jupyter Notebook and use the provided token/link to access the Notebook session from our browser.

In the following steps, replace SJSU_ID, PORT_ID, and NODE_ID with appropriate values.

  1. Connect to HPC and request interactive session. You may customize the interactive session request as required for your task. See Slurm manual for details. Take note of the interactive node that your session started on. For example, you may get a session on c1 or g5. Use this value for NODE_ID in step 3 below. Leave the Terminal window open (you can minimize it).
    ssh SJSU_ID@coe-hpc.sjsu.edu
    srun -n 1 -N 1 -c 1 --pty /bin/bash
  2. In a different Terminal window (or Putty session), start a tunnel to the HPC, forwarding port PORT_ID.
    ssh -L PORT_ID:localhost:PORT_ID SJSU_ID@coe-hpc.sjsu.edu
  3. In the login node session opened in step 2, start a tunnel from the HPC login node to the HPC interactive node, forwarding port PORT_ID. Use the NODE_ID you noted in step 1.
    ssh -L PORT_ID:localhost:PORT_ID SJSU_ID@NODE_ID
  4. Start Jupyter Notebook on the HPC interactive node. You may customize which version of Python you want to load (see Modules section above). After the command below, you will get a link with a token ID that will allow you to access your Jupyter Notebook. Copy the link.
    module load python3
    jupyter notebook --no-browser --port=PORT_ID
  5. Paste the Notebook link from the previous step in your browser, then press ENTER. Enjoy!
  6. At the end of your session, remember to shut down Jupyter Notebook (save your notebook, then, in the interactive node terminal window, press CTRL+C, type y, and press ENTER), then close both SSH sessions to the HPC (in each terminal window, type exit, then press ENTER).
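
The steps above involve the same port number in three places, so it can help to pick the port once and print the exact commands to run. The following is a sketch only: the function names pick_port and print_tunnel_commands are hypothetical, and the script does not check whether the chosen port is already in use.

```shell
#!/bin/bash
# Pick a random port in the range 10000-63999 suggested above.
# RANDOM yields 15 bits, so two draws are combined to cover the whole range.
pick_port() {
  echo $(( 10000 + (RANDOM * 32768 + RANDOM) % 54000 ))
}

# Print the commands from steps 2-4 with the placeholders filled in.
print_tunnel_commands() {
  local user="$1" node="$2" port="$3"
  cat <<EOF
# Step 2, on your computer:
ssh -L ${port}:localhost:${port} ${user}@coe-hpc.sjsu.edu
# Step 3, in the login node session opened by step 2:
ssh -L ${port}:localhost:${port} ${user}@${node}
# Step 4, on the interactive node from step 1:
module load python3
jupyter notebook --no-browser --port=${port}
EOF
}

# Usage (replace SJSU_ID and the node name with your own values):
#   print_tunnel_commands SJSU_ID c1 "$(pick_port)"
```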

How Do I Install X Version of Software Y?

The short answer is you’ll need to do it yourself. The HPC was built and is being maintained by faculty in the Department of Computer Engineering and we do not have the time or resources available to support helping users with software installs or to modify the HPC environment when new software is needed by some faculty. You can install most software in your home directory by following instructions for making/compiling the software. Additionally, the HPC provides multiple alternatives to help you install custom software:

  • A lot of software can be installed via Python modules using pip or Anaconda (see examples below for Tensorflow).
  • Java programs can be run from your home directory structure.
  • If low-level libraries are needed and cannot be compiled locally in your home directory, consider using Singularity, an application similar to Docker available on the HPC via module load singularity. Singularity provides functionality to build small/minimal containers and run those containers as single application environments.
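
For software built from source, a common pattern is to install under a prefix inside your home directory and add its bin directory to PATH. The sketch below assumes a prefix of $HOME/.local and an autotools-style package (both assumptions; follow the build instructions of the software you are installing).

```shell
#!/bin/bash
# Assumed install prefix inside the home directory.
PREFIX="${HOME}/.local"

# Typical build steps for an autotools-style package (hypothetical example):
#   ./configure --prefix="${PREFIX}" && make && make install
# Python packages can be installed per user instead:
#   python -m pip install --user PACKAGE

# Add the prefix's bin directory to PATH only if it is not already there.
case ":${PATH}:" in
  *":${PREFIX}/bin:"*) ;;
  *) export PATH="${PREFIX}/bin:${PATH}" ;;
esac

echo "PATH now includes ${PREFIX}/bin"
```

Placing the case statement in ~/.bashrc makes the change apply to every new login session.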

How Do I Install Tensorflow?

Tensorflow is very particular about the versions of Python and CUDA it works with. Using Tensorflow without GPUs is very simple. Execute the following, substituting the Python version for your desired Python version.

module load python3
python -m pip install tensorflow

If you need to use Tensorflow with GPUs, read on.

Install Tensorflow-gpu for Python 3.6

# First, install tensorflow-gpu in the correct Python installation.
coe-hpc1:~$ module load python3/3.6.6
coe-hpc1:~$ module load cuda/10.0
coe-hpc1:~$ python -m pip install --user tensorflow-gpu
# Now test that tensorflow is working.
coe-hpc1:~$ srun -p gpu --gres=gpu -n 1 -N 1 -c 2 --pty /bin/bash
g1:~$ module load python3/3.6.6
g1:~$ module load cuda/10.0
g1:~$ ipython
Python 3.6.6 (default, Sep 1 2018, 23:40:54) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import tensorflow as tf
In [2]: sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-08-22 01:13:56.622426: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[...]
2019-08-22 01:13:56.903809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
[...]
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:03:00.0, compute capability: 6.0
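
The tf.Session call used in the transcript above is part of the TensorFlow 1.x API and is not available at the top level in TensorFlow 2.x. If pip installs a newer release, a version-tolerant check such as the sketch below may be more convenient. The function name check_tf_gpu and the PYTHON override variable are assumptions made for this example; run it on a GPU node after loading the same modules as above.

```shell
#!/bin/bash
# check_tf_gpu is a hypothetical helper. Set PYTHON to choose the interpreter.
check_tf_gpu() {
  "${PYTHON:-python}" - <<'EOF'
try:
    import tensorflow as tf
except ImportError:
    print("tensorflow is not installed in this Python environment")
else:
    print("TensorFlow version:", tf.__version__)
    config = getattr(tf, "config", None)
    if config is not None and hasattr(config, "list_physical_devices"):
        # Newer TensorFlow 2.x releases
        print("GPUs:", config.list_physical_devices("GPU"))
    else:
        # TensorFlow 1.x
        print("GPU available:", tf.test.is_gpu_available())
EOF
}

# Usage:
#   check_tf_gpu
```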

Install Tensorflow-gpu in Python 3.7 via Anaconda

First, install Anaconda, if you have not already done so. The Anaconda documentation provides additional details for installing Anaconda in a Linux environment.

  1. Download your preferred Anaconda installer. The example below uses the Linux installer for the Python 3.7 version of Anaconda 2019.07.
    coe-hpc1:~$ wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
  2. Verify the integrity of the downloaded file by comparing its SHA-256 sum with the one published by Anaconda. Anaconda publishes the hashes for all packages; the one for the specific package we downloaded is at https://docs.anaconda.com/anaconda/install/hashes/Anaconda3-2019.07-Linux-x86_64.sh-hash/.
    The command below should produce the same hash as on the web page.
    coe-hpc1:~$ sha256sum Anaconda3-2019.07-Linux-x86_64.sh
  3. Execute the installer 
    coe-hpc1:~$ sh Anaconda3-2019.07-Linux-x86_64.sh
  4. The installer prompts "In order to continue the installation process, please review the license agreement." Press Enter to view the license terms.
  5. Scroll to the bottom of the license terms and enter "Yes" to agree.
  6. Press Enter to accept the default install location.
  7. The installer prompts "Do you wish the installer to initialize Anaconda3 by running conda init?" Choose "yes".
  8. Log out of the HPC and log back in to activate Anaconda. If you chose no in step 7, Anaconda will not be activated when you log back in. In that case, to initialize Anaconda, run `source /bin/activate`
    and then run `conda init`.
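
The hash comparison in step 2 can be automated rather than done by eye, using the check mode of sha256sum. In the sketch below, the function name verify_sha256 is hypothetical, and the expected hash must be copied from the Anaconda web page; no real hash value is given here.

```shell
#!/bin/bash
# verify_sha256 FILE EXPECTED_HASH
# Succeeds only if FILE's SHA-256 sum equals EXPECTED_HASH.
verify_sha256() {
  local file="$1" expected="$2"
  # sha256sum -c expects lines of the form "<hash>  <filename>" (two spaces).
  echo "${expected}  ${file}" | sha256sum -c -
}

# Usage (paste the published hash in place of EXPECTED_HASH):
#   verify_sha256 Anaconda3-2019.07-Linux-x86_64.sh EXPECTED_HASH
```

On a match, sha256sum prints the file name followed by OK; on a mismatch it reports a failure and returns a non-zero exit status, in which case the installer should not be executed.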

Now set up an environment for your project, installing tensorflow-gpu and ipython (a command-line precursor to Jupyter Notebook) in it. Currently, TF 1.14 does not work with Python 3.7, which is the installed version of Python in Anaconda, so we must install TF 1.13 instead. In the future, after logging into the HPC or a node, you will have to run `conda activate tf-gpu` to be able to use the installed libraries.
(base) coe-hpc1:~$ module load cuda/10.0
(base) coe-hpc1:~$ conda create --name tf-gpu tensorflow-gpu=1.13 ipython

If prompted to, update Anaconda.
(base) coe-hpc1:~$ conda update -n base -c defaults conda

Test that TF-gpu is working. Note that TF-gpu will only work on a gpu/condo node if you have requested and have been granted access to the GPU resource. It will produce errors on the login node or on compute nodes.
(base) coe-hpc1:~$ srun -p gpu --gres=gpu -n 1 -N 1 -c 2 --pty /bin/bash
(base) g1:~$ module load cuda/10.0
(base) g1:~$ conda activate tf-gpu

Conda may have added one of the site-packages paths from the other Python installations (e.g., the base Python 3 installation on the system) to your sys.path list. If this happens, TF will fail with an error. Double-check the system path list and correct it if necessary.
(tf-gpu) g1:~$ ipython
Python 3.7.4 (default, Aug 13 2019, 20:35:49) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.7.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import sys
In [2]: sys.path 
Out[2]: 
['/home/david/anaconda3/envs/tf-gpu/bin',
'/home/david/anaconda3/envs/tf-gpu/lib/python37.zip',
'/home/david/anaconda3/envs/tf-gpu/lib/python3.7',
'/home/david/anaconda3/envs/tf-gpu/lib/python3.7/lib-dynload',
'',
'/home/david/.local/lib/python3.7/site-packages',
'/home/david/anaconda3/envs/tf-gpu/lib/python3.7/site-packages',
'/home/david/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/IPython/extensions',
'/home/david/.ipython']
In [3]: sys.path.remove('/home/david/.local/lib/python3.7/site-packages')
In [4]: import tensorflow as tf
In [5]: sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) 
2019-08-22 10:47:10.502572: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
[...]
2019-08-22 10:47:10.667236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
totalMemory: 11.91GiB freeMemory: 11.66GiB
[...]
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
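
Instead of removing the ~/.local site-packages entry by hand in every session, Python can be told not to add the per-user site-packages directory at all by setting the standard PYTHONNOUSERSITE environment variable. This is a sketch that was not part of the tested procedure above; note that it also hides any packages you installed with pip install --user.

```shell
#!/bin/bash
# Run after `conda activate tf-gpu` and before starting python or ipython.
# Any non-empty value disables the per-user site-packages directory.
export PYTHONNOUSERSITE=1

echo "PYTHONNOUSERSITE=${PYTHONNOUSERSITE}"
```

Running unset PYTHONNOUSERSITE restores the default behavior for the rest of the session.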

Install Tensorflow-gpu in Python 3.6 via Anaconda

If necessary, follow the steps in the `Install Tensorflow-gpu in Python 3.7 via Anaconda` tutorial to install and initialize Anaconda.

Create a Python 3.6 Anaconda environment and install tensorflow-gpu and ipython.
(base) coe-hpc1:~$ module load cuda/10.0
(base) coe-hpc1:~$ conda create -n py36 python=3.6 tensorflow-gpu ipython

Test that TF-gpu is working. Note that TF-gpu will only work on a gpu/condo node if you have requested and have been granted access to the GPU resource. It will produce errors on the login node or on compute nodes.
(base) coe-hpc1:~$ srun -p gpu --gres=gpu -n 1 -N 1 -c 2 --pty /bin/bash
(base) g1:~$ module load cuda/10.0
(base) g1:~$ conda activate py36
(py36) g1:~$ ipython
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.7.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import tensorflow as tf
In [2]: sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) 
2019-08-22 10:54:23.302337: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[...]
2019-08-22 10:54:23.516445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
[...]
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:03:00.0, compute capability: 6.0