Clusters

The Oden Institute has a number of small clusters that are owned by centers and are used only by those affiliated with the respective center.

CSEM Peano

Warning

Peano will be going offline on June 30th. It has outlived its usefulness.

Peano is a 40-node compute cluster built from Stampede1 nodes acquired from TACC. Each node contains 2 x 8-core E5-2680 (Sandy Bridge) processors with 32GB RAM, a single 250GB drive, and an InfiniBand FDR interconnect (Mellanox ConnectX-3).

The login node is a Dell PE R515 with 2 x hex-core Opteron processors, 32GB RAM, dual GigE NICs, InfiniBand FDR (Mellanox ConnectX-3 Pro), and an attached MD1000 storage array. Storage for home directories is 3.6TB in a RAID 5 configuration. An additional scratch area, ‘/scratch/’, is available as a RAID 5 group with 16TB of usable storage.

The cluster was configured using OpenHPC (http://openhpc.community/). The queuing engine is Slurm (https://slurm.schedmd.com/) and provisioning is handled by Warewulf (http://warewulf.lbl.gov). A brief howto is provided below and is also shown in the MOTD when logging in:

To run an interactive shell, try:
         srun -p normal -t 0:30:00 -n 32 --pty /bin/bash -l

For an example slurm job script for MPI, see:
         /opt/ohpc/pub/examples/slurm/job.mpi
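
A minimal sketch of such a script, assuming a GNU-built OpenMPI stack (the module names and ./my_mpi_app are placeholders; the file at the path above is the authoritative example for this cluster):

         #!/bin/bash
         #SBATCH -J mpi_test           # Job name
         #SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
         #SBATCH -p normal             # Queue (partition)
         #SBATCH -N 2                  # Total number of nodes requested
         #SBATCH -n 32                 # Total number of mpi tasks requested
         #SBATCH -t 00:30:00           # Run time (hh:mm:ss)

         # Load a GNU-built MPI stack; exact module names are assumptions, check 'module avail'
         module load gnu openmpi

         # ./my_mpi_app is a placeholder for your own MPI executable
         mpirun ./my_mpi_app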

Further information:

  • This cluster is only available to students in the CSEM program.

  • SSH access is only from campus or UT’s VPN.

  • Home directories are not mounted.

Also note, OpenHPC makes available pre-packaged builds for a variety of MPI families such as OpenMPI, MPICH, and MVAPICH. Only the GNU family of pre-packaged builds has been installed. The Intel-built MPI families available in the module system have not been tested; use them at your own risk.
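
For example, selecting one of the GNU-built MPI stacks and compiling against it might look like the following (the exact module names and versions are assumptions; check 'module avail' on Peano, and my_mpi_app.c is a placeholder source file):

         module avail                 # list the installed compiler and MPI modules
         module load gnu openmpi      # GNU compilers plus an MPI build (names may differ)
         mpicc -O2 -o my_mpi_app my_mpi_app.c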

To request a user account to access Peano or for any help, please submit a help request to RT.

Note

Home directories and scratch areas on Peano are not backed up.

CRIOS sverdrup

Note

Sverdrup and its associated storage node (sverdrup-nas) both underwent upgrades beginning in late March 2024.

System information:

  • OpenHPC cluster running Rocky Linux 9.3 (https://openhpc.community/)

  • $HOME is an NFS file system -> /home (2.0 TB)

  • /scratch is 105TB

  • /scratch2 is 125TB

  • /opt/apps/ is an NFS file system -> 100GB

  • Queuing system is Slurm 22.05. One queue is available -> normal (35 Intel Omni-Path nodes -> 28 cores/node, 64 GB/node, 980 cores); a quick check from the login node is shown below
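
To confirm the queue and your jobs from the login node (standard Slurm commands):

   sinfo -p normal                 # show the state of the normal partition
   squeue -u $USER                 # show your own pending and running jobs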

Node composition:

  • Dual-socket 14-core Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz (Haswell)

  • 64 GB of RAM per node

  • Omni-Path high-performance communications HBA (100 Gb/s)

  • 10 Gb/s networking

Wish list

Requests from research staff in the CRIOS group.

The CRIOS cluster is getting a major reboot from March 25 to April 5 and will be unreachable during that time. Below is a list of requested software and features that Patrick can pass to Oden RT, in no particular order:

  • git >= 2.31

  • Allow all users to ssh into compute nodes used interactively

  • Update GNU compiler collection >= 12.2, but still need all the current GNU collection

  • tmux >= 2.7

  • Keep z-shell as a shell option

  • Keep the feature to use a Jupyter Notebook/Lab connection from the compute node to the local browser

  • gdb on all compute nodes

  • Ability to use Vim and compile codes on compute nodes (I wonder if that means having the same env as the login node.)

  • Singularity (an open-source container platform similar to Docker; TACC has it)

  • Okular is not supported; use evince or xpdf instead.

The following base packages provided with Rocky Linux have been installed:

  • git-2.39

  • zsh-5.8

  • tmux-3.2a

  • GNU 12.2 and GNU 13.1, OpenHPC builds as modules

  • gdb 10.2, installed from the base OS

  • apptainer-1.3 (formerly singularity)

Module

Lmod has been installed. Use the module commands to view and load available modules.
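
Typical Lmod usage looks like the following (the module names shown are assumptions based on the packages listed above; confirm with 'module avail'):

   module avail                    # list all available modules
   module list                     # show currently loaded modules
   module load gnu13               # load a compiler module, e.g. GNU 13.1
   module spider openmpi           # search for a module and see how to load it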

Intel compilers

Intel OneAPI compilers have been installed and are available via the module command.
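
Something like the following should bring the OneAPI compilers into the environment (the exact module name is an assumption; check 'module avail' first):

   module avail intel              # find the Intel OneAPI module name
   module load intel               # load it (the name may differ on sverdrup)
   icx --version                   # OneAPI C compiler; icpx and ifx cover C++ and Fortran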

Apptainer

Apptainer, formerly Singularity, has been installed on all the compute nodes using packages provided by Rocky Linux. The installed version is 1.3. No module is needed to load apptainer.

Note

Apptainer is not installed on the login node.

Create a job script, job.apptainer

#!/bin/bash

#SBATCH -J test               # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -N 2                  # Total number of nodes requested
#SBATCH -n 16                 # Total number of mpi tasks requested
#SBATCH -t 01:30:00           # Run time (hh:mm:ss) - 1.5 hours

# Launch apptainer job

/usr/bin/apptainer run docker://busybox uname -r

Yields

[stew@sverdrup]$ sbatch job.apptainer
Check the log file for the output:

INFO:    Using cached SIF image
5.14.0-362.24.1.el9_3.x86_64
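
To follow the job while it runs and to read the output afterwards (the job id below is illustrative):

[stew@sverdrup]$ squeue -u $USER          # is the job pending or running?
[stew@sverdrup]$ cat job.12345.out        # %j in the script expands to the job id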

Compute node access

A request was made to allow ssh access to the nodes without having to go through Slurm.

There is an advantage to using srun when reserving nodes: the environment is exported properly to the node(s) for the submission. This is not the case for regular ssh sessions to the nodes.

Warning

ssh’ing into the nodes is not recommended unless you own a job running there that was queued via Slurm. It is strongly recommended that you use srun to reserve a node rather than ssh’ing into it. The underlying queueing engine is not aware of ssh sessions, so the node could be allocated to someone else, or a job could be queued to a node someone has ssh’d into. This could cause resource conflicts on the node.
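
For example, to reserve a full node interactively through Slurm instead of ssh’ing in directly (the time limit and task count are illustrative; the nodes have 28 cores):

   srun -p normal -N 1 -n 28 -t 2:00:00 --pty /bin/bash -l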

Jupyter Notebook

Using a Python virtual environment, jupyter-notebook was installed into the home directory. It was possible to connect to the Jupyter server over ssh to both the login node and to a compute node using a proxy jump. This should work the same as before; a sketch of the workflow is shown below.
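
A sketch of that workflow, assuming a virtual environment in $HOME and a compute node reserved through Slurm (the username, the compute node name c001, and port 8888 are placeholders):

On sverdrup, set up the environment once:

   python3 -m venv $HOME/jupyter-env
   source $HOME/jupyter-env/bin/activate
   pip install notebook

On a compute node reserved with srun, start the server without a browser:

   jupyter notebook --no-browser --port=8888

On your local machine, forward the port through the login node with a proxy jump, then open http://localhost:8888 in a browser:

   ssh -J username@sverdrup -L 8888:localhost:8888 username@c001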