SLURM

What is it?

Slurm is the job queuing system used on RCC's HPC cluster "premise".

You will find much more information at the official slurm website: http://slurm.schedmd.com

Why you need it

You must use slurm commands on "premise.sr.unh.edu" to submit your jobs for scheduling on the "premise" cluster's compute nodes. The options you provide define the required hardware and restrict the nodes on which your job will run.

Commands

All of the slurm commands support a --help option that provides a lot of good usage information.

sinfo: Shows the status of the compute nodes.
srun: Interactively run the given command on a remote node.
squeue: Shows your jobs that are running or waiting to run.
sacct: Shows your jobs that have completed or failed.
sbatch: Submit a job into the job queue.

Common sbatch Options

Your slurm job is expected to use no more resources than you have requested. Unless you specify otherwise, the default is to run 1 task on 1 node with 1 CPU (also called core or thread), reserving 2MB of physical RAM.

--help

Display the full list of options. Available for most slurm commands.

-n #

Number of times this task will be executed. Each task is assumed to be a single thread unless specified otherwise. All tasks will be run simultaneously on one or more nodes, depending on the availability of the required resources on each node.

-N #

Minimum number of nodes to run this task on.

--ntasks-per-node=n

Number of tasks to invoke on each node. Another way of requesting your code be run on multiple nodes.

-c #

Number of CPUs (also called threads or cores) required by each task.

--mem=MB

Maximum number of megabytes of RAM your code needs to run. Default is 2MB of RAM.

--mem-per-cpu=MB

Alternative to --mem that specifies the memory needed per CPU (thread) instead of the total.

--gres=list

List of Generic RESources being requested. (See section below.) Can be a single resource like "ramdisk", or include a quantity like "gpu:2". Normally only one item is listed, but it can be a true comma-separated list of resources.

-D path

Change directory to the provided path before executing the task.

--mail-user=ops@sr.unh.edu

Send job status email to the address given.

--mail-type=type

The state changes you wish to be notified about. Values include: BEGIN, END, FAIL or ALL

--time=#

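Execution time limit. A bare number is interpreted as minutes; forms such as minutes:seconds and hours:minutes:seconds are also accepted. Your job will be killed if it has not completed when the limit is reached.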

-p name

Partition to submit jobs to; defaults to "shared". Users who have purchased nodes can use their exclusive partition to queue their jobs with a higher priority on the nodes they paid for.

Your jobs will get a scheduling priority over jobs in the "shared" queue, but they will only be run on your hardware, even if "shared" nodes are idle. Using this option to gain priority is not always the best choice.
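As a rough illustration of how the options above combine, the following Python sketch assembles an sbatch command line. The helper name "build_sbatch_command" and its keyword arguments are made up for this example and are not part of slurm; only the flags themselves come from the list above.

```python
def build_sbatch_command(script, ntasks=None, nodes=None, cpus_per_task=None,
                         mem_mb=None, time_min=None, partition=None):
    """Return an sbatch argument list (hypothetical helper, not part of slurm)."""
    cmd = ["sbatch"]
    if ntasks is not None:
        cmd += ["-n", str(ntasks)]          # number of tasks
    if nodes is not None:
        cmd += ["-N", str(nodes)]           # minimum number of nodes
    if cpus_per_task is not None:
        cmd += ["-c", str(cpus_per_task)]   # CPUs required by each task
    if mem_mb is not None:
        cmd.append("--mem={}".format(mem_mb))      # RAM in MB
    if time_min is not None:
        cmd.append("--time={}".format(time_min))   # limit in minutes
    if partition is not None:
        cmd += ["-p", partition]
    cmd.append(script)
    return cmd


# One task with 6 cores, 32 MB of RAM and a 90 minute limit.
print(" ".join(build_sbatch_command("run.sh", cpus_per_task=6, mem_mb=32, time_min=90)))
```

Options left at None are simply omitted, so slurm falls back to its own defaults for them.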

Examples

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
shared*      up 365-00:00:     14  alloc node[101-114]
harish       up 365-00:00:      4  alloc node[101-104]

sacct --format JobID,JobName,State,Start,End,User,TotalCPU,NodeList

1278           hostname  COMPLETED 2016-06-27T13:17:47 2016-06-27T13:17:47       rea  00:00.042   node[101-114]

See "man sacct" for a list of additional field names accepted by --format, or use "sacct -e" to display them.

squeue

     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      1150    shared run1_pap  harishv  R 22-05:27:13      4 node[105-108]
      1153    shared run2_pap  harishv  R 22-05:20:56      3 node[109-111]
      1158    shared    npt03     dra4  R 16-22:39:46      3 node[112-114]
      1162    harish PB12PEO9  harishv  R 2-00:55:25      1 node103
      1163    harish     1cmz  hossein  R 1-23:02:31      1 node104
      1171    harish  emb12r1     dra4  R    5:19:18      1 node101
      1172    harish  emb12r2     dra4  R    5:18:40      1 node102

squeue -o '%A,%u,%j,%D,%N,%C,%b,%h,%M,%L,%o'

JOBID,USER,NAME,NODES,NODELIST,CPUS,GRES,OVER_SUBSCRIBE,TIME,TIME_LEFT,COMMAND
2565,harishv,1zv4,1,node101,24,gpu:2,NO,36:00,39-23:24:00,/mnt/lustre/chem-eng/harishv/tmp/tmpqUweoX

squeue -o '%6A %8P %16V %10u %10g %8M %16S %14Li %10R'

JOBID  PARTITIO SUBMIT_TIME      USER       GROUP      TIME     START_TIME       TIME_LEFT     i NODELIST(R
8192   harish   2017-07-22T18:12 harishv    chem-eng   0:00     2017-08-31T12:39 40-00:00:00   i (Resources
8054   shared   2017-07-19T14:38 arthurmelo hale       3-02:04: 2017-07-21T15:44 361-21:55:14  i node[109,1
8237   shared   2017-07-24T10:45 itelles12  macmanesla 5:51:30  2017-07-24T11:57 364-18:08:30  i node110
8249   shared   2017-07-24T15:15 gvc1002    mel        2:33:29  2017-07-24T15:15 364-21:26:31  i node105

sacct --format jobid,partition,user,jobname,state,end,elapsed

20642            shared       rea slurm_eg1+  COMPLETED 2018-02-19T17:06:37   00:00:06 
20633            shared       rea slurm_eg1+  COMPLETED 2018-02-19T17:06:37   00:00:06 
20643            shared       rea slurm_eg1+     FAILED 2018-02-19T17:08:37   00:00:00

Shows prior job run accounting. The elapsed time column for completed jobs may be useful.
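One possible use of the elapsed time column is to pick a "--time" limit for a similar future job. The Python sketch below converts an elapsed value to seconds and suggests a limit in minutes. The function names and the 1.5 safety margin are assumptions chosen for illustration, not an RCC recommendation.

```python
import math


def elapsed_to_seconds(elapsed):
    """Convert an elapsed value such as '00:00:06' or '2-00:55:25' to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in elapsed.split(":")]
    while len(parts) < 3:          # tolerate MM:SS by padding the hours
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def suggest_time_minutes(elapsed, margin=1.5):
    """Suggest a --time value in minutes; the margin is an arbitrary assumption."""
    return max(1, int(math.ceil(elapsed_to_seconds(elapsed) * margin / 60.0)))
```

For a job that previously ran for one hour, suggest_time_minutes("01:00:00") gives a 90 minute limit.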

srun -n 3 hostname

node105.rcchpc
node105.rcchpc
node105.rcchpc

Schedules the given command to be run as 3 independent tasks. More than one task may run simultaneously on a single node, until the resources on that node are depleted. The output above shows that each single-threaded task ran on the same node. Running with "-n 30" would exceed the 24 cores on a single node and require more than one node to run this job.
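The arithmetic behind the "-n 30" remark can be sketched in Python as follows. This is a simplified model: it assumes 24 cores per node as stated above, considers only CPU cores (not memory or gres), and assumes the nodes are otherwise idle. The function name is made up for this example.

```python
def nodes_needed(ntasks, cpus_per_task=1, cores_per_node=24):
    """Minimum node count when whole tasks are packed onto identical nodes."""
    tasks_per_node = cores_per_node // cpus_per_task
    if tasks_per_node < 1:
        raise ValueError("a single task does not fit on one node")
    return -(-ntasks // tasks_per_node)   # ceiling division


print(nodes_needed(3))    # three single-threaded tasks
print(nodes_needed(30))   # thirty single-threaded tasks
```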

srun --mail-user user@unh.edu bad_command

slurmstepd: error: execve(): bad_command: No such file or directory
srun: error: node105: task 0: Exited with exit code 2

srun printenv

LANG=en_US.UTF-8
LANGUAGE=
XMODIFIERS=@im=none
USER=rea
LOGNAME=rea
HOME=/mnt/lustre/rcc/rea
PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
MAIL=/var/spool/mail/rea
SHELL=/bin/tcsh
SSH_CLIENT=2606:4100:38a0:241::27 53596 22
SSH_CONNECTION=2606:4100:38a0:241::27 53596 2606:4100:38a0:240::100 22
SSH_TTY=/dev/pts/4
TERM=xterm-256color
XDG_SESSION_ID=1065
XDG_RUNTIME_DIR=/run/user/3489
HOSTTYPE=x86_64-linux
VENDOR=unknown
OSTYPE=linux
MACHTYPE=x86_64
SHLVL=1
PWD=/mnt/lustre/rcc/rea
GROUP=users
HOST=premise.sr.unh.edu
REMOTEHOST=q.sr.unh.edu
HOSTNAME=premise.sr.unh.edu
LS_COLORS=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:*.tar=38;5;9:*.tgz=38;5;9:*.arc=38;5;9:*.arj=38;5;9:*.taz=38;5;9:*.lha=38;5;9:*.lz4=38;5;9:*.lzh=38;5;9:*.lzma=38;5;9:*.tlz=38;5;9:*.txz=38;5;9:*.tzo=38;5;9:*.t7z=38;5;9:*.zip=38;5;9:*.z=38;5;9:*.Z=38;5;9:*.dz=38;5;9:*.gz=38;5;9:*.lrz=38;5;9:*.lz=38;5;9:*.lzo=38;5;9:*.xz=38;5;9:*.bz2=38;5;9:*.bz=38;5;9:*.tbz=38;5;9:*.tbz2=38;5;9:*.tz=38;5;9:*.deb=38;5;9:*.rpm=38;5;9:*.jar=38;5;9:*.war=38;5;9:*.ear=38;5;9:*.sar=38;5;9:*.rar=38;5;9:*.alz=38;5;9:*.ace=38;5;9:*.zoo=38;5;9:*.cpio=38;5;9:*.7z=38;5;9:*.rz=38;5;9:*.cab=38;5;9:*.jpg=38;5;13:*.jpeg=38;5;13:*.gif=38;5;13:*.bmp=38;5;13:*.pbm=38;5;13:*.pgm=38;5;13:*.ppm=38;5;13:*.tga=38;5;13:*.xbm=38;5;13:*.xpm=38;5;13:*.tif=38;5;13:*.tiff=38;5;13:*.png=38;5;13:*.svg=38;5;13:*.svgz=38;5;13:*.mng=38;5;13:*.pcx=38;5;13:*.mov=38;5;13:*.mpg=38;5;13:*.mpeg=38;5;13:*.m2v=38;5;13:*.mkv=38;5;13:*.webm=38;5;13:*.ogm=38;5;13:*.mp4=38;5;13:*.m4v=38;5;13:*.mp4v=38;5;13:*.vob=38;5;13:*.qt=38;5;13:*.nuv=38;5;13:*.wmv=38;5;13:*.asf=38;5;13:*.rm=38;5;13:*.rmvb=38;5;13:*.flc=38;5;13:*.avi=38;5;13:*.fli=38;5;13:*.flv=38;5;13:*.gl=38;5;13:*.dl=38;5;13:*.xcf=38;5;13:*.xwd=38;5;13:*.yuv=38;5;13:*.cgm=38;5;13:*.emf=38;5;13:*.axv=38;5;13:*.anx=38;5;13:*.ogv=38;5;13:*.ogx=38;5;13:*.aac=38;5;45:*.au=38;5;45:*.flac=38;5;45:*.mid=38;5;45:*.midi=38;5;45:*.mka=38;5;45:*.mp3=38;5;45:*.mpc=38;5;45:*.ogg=38;5;45:*.ra=38;5;45:*.wav=38;5;45:*.axa=38;5;45:*.oga=38;5;45:*.spx=38;5;45:*.xspf=38;5;45:
LESSOPEN=||/usr/bin/lesspipe.sh %s
MODULESHOME=/usr/share/Modules
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles:/mnt/lustre/software/modulefiles
LOADEDMODULES=
SLURM_PRIO_PROCESS=0
SRUN_DEBUG=3
SLURM_CLUSTER_NAME=rcchpc
SLURM_SUBMIT_DIR=/mnt/lustre/rcc/rea
SLURM_SUBMIT_HOST=premise.sr.unh.edu
SLURM_JOB_NAME=printenv
SLURM_JOB_CPUS_PER_NODE=24
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_DISTRIBUTION=cyclic
SLURM_JOB_ID=1176
SLURM_JOBID=1176
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES=1
SLURM_NODELIST=node105
SLURM_JOB_PARTITION=shared
SLURM_TASKS_PER_NODE=1
SLURM_SRUN_COMM_PORT=49012
SLURM_STEP_NODELIST=node105
SLURM_JOB_NODELIST=node105
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=49012
SLURM_SRUN_COMM_HOST=10.200.200.1
CUDA_VISIBLE_DEVICES=NoDevFiles
GPU_DEVICE_ORDINAL=NoDevFiles
SLURM_TOPOLOGY_ADDR=node105
SLURM_TOPOLOGY_ADDR_PATTERN=node
TMPDIR=/tmp
SLURM_CPUS_ON_NODE=24
SLURM_TASK_PID=20277
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=10.200.200.1
SLURM_GTIDS=0
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_JOB_UID=3489
SLURM_JOB_USER=rea
SLURMD_NODENAME=node105

srun --exclude=node117,node118 hostname

node133.rcchpc

The "--exclude" option tells slurm not to consider the listed nodes when scheduling your job. Use it on "srun" or "sbatch" to reduce the pool of available machines on which your job will be scheduled. This can be very useful if your code doesn't run well on certain hardware, or if you wish to ensure that your long-running job does not tie up a higher-end node than needed.

At this time the Premise cluster has two nodes (node117 & node118) with AMD (not Intel) based CPUs. Code optimized for Intel may not run as efficiently (or at all) on these two nodes.

Node ranges may also be given as node[117-118], but the brackets may need to be escaped so they are not interpreted by the calling shell.

For example: --exclude=node\[117-118\]
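When srun or sbatch is launched from a Python wrapper rather than typed at a prompt, the escaping question looks a little different. The sketch below only builds the command and does not run it. It uses the standard library "shlex.quote" function; the variable names are arbitrary.

```python
import shlex

# As one element of an argument list (for example, for subprocess.run
# without shell=True) the brackets need no escaping, because no shell
# ever parses them.
argv = ["srun", "--exclude=node[117-118]", "hostname"]

# If the command has to be handed to a shell as a single string,
# shlex.quote adds the quoting that protects the brackets.
shell_line = " ".join(shlex.quote(arg) for arg in argv)
print(shell_line)
```

Single quotes around the whole option are an alternative to the backslash form shown above.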

Consider using the --exclude option to keep long-running jobs off these enhanced nodes:

GPU: node[101-104,119-124]
High RAM: node[109-112],node125
AMD CPU: node[117-118]
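To see exactly which hosts a bracketed range covers, or to build one long --exclude value from several groups, a small expansion helper can be handy. The Python sketch below handles only the simple "name[a-b,c]" forms used on this page, not every node list syntax slurm accepts. The function name is made up for this example.

```python
import re


def expand_nodelist(nodelist):
    """Expand e.g. 'node[109-112],node125' into a list of individual node names."""
    names = []
    # Split on commas that are outside of brackets.
    for item in re.findall(r"[^,\[\]]+(?:\[[^\]]*\])?", nodelist):
        match = re.match(r"^([^\[\]]+)\[([^\[\]]+)\]$", item)
        if not match:
            names.append(item)            # plain name without a range
            continue
        prefix, body = match.groups()
        for part in body.split(","):
            first, _, last = part.partition("-")
            last = last or first          # single number, e.g. "119"
            for number in range(int(first), int(last) + 1):
                names.append(prefix + str(number).zfill(len(first)))
    return names


# Build one --exclude value covering the GPU and high RAM nodes listed above.
enhanced = expand_nodelist("node[101-104,119-124]") + expand_nodelist("node[109-112],node125")
exclude_opt = "--exclude=" + ",".join(enhanced)
print(exclude_opt)
```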

sacct -o jobid,jobname,maxrss,state --units=M

Use this command AFTER an "srun" of your code to determine how much memory it used. This can be helpful to determine a minimum starting value. Always use a slightly higher value than your test case returns.

bash-4.2$ srun env

bash-4.2$ sacct -j 20632 -o jobid,jobname,maxrss,state --units=M
20632               env      0.87M  COMPLETED

Generic RESource (gres)

Slurm allows for the definition of resources that can be requested in a generic way. The premise cluster has two "gres" resources defined. Those resources are:

gpu

Four nodes are each equipped with an NVidia K80 GPU, which appears as two individual GPU chips, each of which is equivalent to a slightly underclocked NVidia K40 GPU.

Regardless of which physical GPU you are allocated, GPUs are referenced as if they were the only GPUs on the system, starting at 0. If you request a single GPU and are allocated the second GPU, you must access it as GPU 0 in your code. If you request both "K40" GPUs, your code will have access to both 0 and 1.

Please note that the GPU node purchaser has priority on these nodes and keeps them busy most of the time. Wait times for this resource are expected to be significant.

ramdisk

Four nodes have been configured to allow jobs to make use of available RAM as a disk. The RAM on each node is limited, so only the high memory nodes offer this resource. We have chosen not to automatically delete the contents when a job finishes, so it is important that your jobs clean up after themselves upon normal exit (and that you manually clean up after them when they exit abnormally).

To request a gres resource you use the --gres argument to srun or sbatch with a value that specifies the generic resource you are requesting. Resource quantity defaults to 1, but can be explicit like "--gres=gpu:2".
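The two rules described in this section (quantity defaults to 1, and GPUs are always numbered from 0 inside the job) can be sketched in Python as follows. The function names are made up for this example, and only the simple "name" and "name:count" forms shown on this page are handled.

```python
def parse_gres(gres):
    """Parse a --gres value like 'ramdisk' or 'gpu:2' into {name: count}."""
    resources = {}
    for item in gres.split(","):
        name, _, count = item.partition(":")
        resources[name] = int(count) if count else 1   # quantity defaults to 1
    return resources


def visible_gpu_indices(gres):
    """GPU indices the job's code should use; numbering always starts at 0."""
    return list(range(parse_gres(gres).get("gpu", 0)))


print(parse_gres("gpu:2,ramdisk"))
print(visible_gpu_indices("gpu:1"))
```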

Example gres gpu request

sbatch --gres=gpu:1 myscript.csh

The myscript.csh script should run the desired code as if there were only one GPU on the system. This is likely the default behavior for most GPU-enabled codes, but you may need to reference the GPU explicitly as index 0.

Example using a gres ramdisk

sbatch --gres=ramdisk hello_ramdisk.csh

The hello_ramdisk.csh script could look like this:

#!/bin/csh
echo "ramdisk as I found it"
ls -R /mnt/ramdisk

echo "make a directory for my job and create a file(s)"
set mydir=/mnt/ramdisk/$SLURM_JOB_ID
mkdir $mydir
echo "hello" > $mydir/world
echo "check the file"
cat -n  $mydir/world
echo "ramdisk in use"
ls -R /mnt/ramdisk

echo "cleaning up after myself"
rm -rf $mydir

echo "ramdisk as I am leaving it"
ls -R /mnt/ramdisk
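For jobs written in Python, the same create, use and clean up pattern might look like the sketch below. The "job_ramdisk" helper is hypothetical; it assumes the ramdisk is mounted at /mnt/ramdisk as in the csh script above and falls back to the directory name "manual" when SLURM_JOB_ID is not set.

```python
import contextlib
import os
import shutil


@contextlib.contextmanager
def job_ramdisk(base="/mnt/ramdisk", job_id=None):
    """Create a per-job directory under base and remove it when the block exits."""
    job_id = job_id or os.environ.get("SLURM_JOB_ID", "manual")
    mydir = os.path.join(base, str(job_id))
    os.mkdir(mydir)
    try:
        yield mydir
    finally:
        # Runs on normal exit and on Python exceptions, but not when the
        # process is killed outright, so manual cleanup may still be
        # needed after an abnormal exit.
        shutil.rmtree(mydir, ignore_errors=True)
```

Inside a job it would be used as "with job_ramdisk() as mydir:", with all scratch files written under mydir.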

Code

sbatch script to run namd2

#!/bin/tcsh
#SBATCH -D $HOME/run_from_dir
#SBATCH --partition=shared
#SBATCH -J myjobname
#SBATCH --time=960:00:00
#SBATCH -N 1
#SBATCH --cpus-per-task=24 
#SBATCH --output NAMD-myjob.log
#SBATCH --gres=gpu:2

cd $HOME/run_from_dir

foreach mod ( mpi/mvapich2-x86_64 namd )
  module load $mod
  end

module list

namd2  +devices 0-1 ++ppn 24 ./input_file.conf

sbatch python script example

The following example is a complex slurm job script written in python. Most example slurm job scripts are shell scripts, but other scripting languages may also be used. This example uses the "#SBATCH --array" comment syntax to submit 10 slurm jobs in a single submit, and to limit the concurrently running jobs to 2. These "#SBATCH" comments must appear first in the script.

A python process pool is created to match the size of the slurm job submission option "--cpus-per-task". For complexity, the worker function is run 50 times on each node.

The "sig_handler" function is useful when running this example by hand, but is not required for slurm. Adding the line "signal.signal(signal.SIGUSR1, sig_handler)" to the main() function would capture the event generated by the slurm option "--signal=B:USR1@120".

The provided "SEnv" class is a shorthand way to read environment variables and provide default values. The defaults are useful when running the code outside of slurm. In all cases below it would be cleaner to do the integer type cast inside the SEnv method, but that would cause problems for non-integer slurm environment variables.

Zip is used to generate a dictionary matching each input argument with its result. The best way to use the results depends on your data. Consider using pickle to save the final output for a later post-processing merge step.

#!/usr/bin/python
#SBATCH --ntasks-per-node=1 --cpus-per-task=24 --nodes=1 --mem=60 --time=6:00
#SBATCH --array=0-9%2
'''
Run my 1 task, 24 thread job, on 1 node, using 60MB RAM, for 6min.
Run 10 jobs starting at 0, running at most 2 concurrently
'''

import signal, sys, os, time, logging, multiprocessing

pool = None
def sig_handler(signum, frame):
    "Handle ^C and other interrupts better."
    # logging.warning replaces the deprecated logging.warn alias
    logging.warning("got signum={}".format(signum))
    global pool
    if pool:
        logging.warning("Pool cleanup")
        pool.terminate()
        pool.join()
        sys.exit(-1)

def worker(n, f=0.1, sec=2): 
    "The work being threaded."
    logging.debug("worker {}".format(n))
    time.sleep(sec)
    return n*f

class SEnv(dict): 
    "System Env lookup, dictionary defaults when env var DNE."
    def __getattribute__(self, name): return os.getenv(name, super(SEnv, self).get(name) )

def main():
    logging.basicConfig( level=logging.INFO)
    signal.signal(signal.SIGINT, sig_handler)
    e = SEnv(dict(SLURM_CPUS_PER_TASK=4, SLURM_ARRAY_TASK_ID=3))
    ary_id = int(e.SLURM_ARRAY_TASK_ID)
    global pool
    pool = multiprocessing.Pool(processes=int(e.SLURM_CPUS_PER_TASK))
    single_arg_list = range(ary_id*100, ary_id*100+50)
    results = pool.map(worker, single_arg_list)
    logging.info( "all processes completed: results={}".format(
            dict(zip(single_arg_list,results))))

if __name__ == "__main__": main()
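The pickle suggestion could be implemented along the following lines. This is only a sketch: the "results_<id>.pkl" file naming and both function names are assumptions, and each array task is expected to call save_results with its own SLURM_ARRAY_TASK_ID so that the files do not collide.

```python
import glob
import os
import pickle


def save_results(results, task_id, outdir="."):
    """Write one array task's results dict to results_<task_id>.pkl."""
    path = os.path.join(outdir, "results_{}.pkl".format(task_id))
    with open(path, "wb") as handle:
        pickle.dump(results, handle)
    return path


def merge_results(outdir="."):
    """Combine every results_*.pkl file in outdir into a single dict."""
    merged = {}
    for path in sorted(glob.glob(os.path.join(outdir, "results_*.pkl"))):
        with open(path, "rb") as handle:
            merged.update(pickle.load(handle))
    return merged
```

In the script above, main() could end with a call such as save_results(dict(zip(single_arg_list, results)), ary_id), and merge_results would be run once after all array jobs have finished.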

Non-exclusive node usage

Jobs are given exclusive use of an entire node by default. RCC has configured slurm to allow users to share a node by opting in with the "--share" option. Unfortunately the "--share" option is not listed by "sbatch --help".

When memory is unspecified it defaults to the total amount of RAM on the node. For slurm to know how much available memory remains you must specify the memory needed in MB (--mem=32).

CPU cores are similar to memory and you are given everything available on the node by default. For slurm to know the remaining cpu resources you must specify your job cpu core needs (-c 6).

Non-exclusively execute your "run.sh" script using up to 32 MB RAM and 6 CPU cores:

sbatch --share --mem=32 -c 6 run.sh

Run four NAMD jobs on a single node:

module add namd
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf

Non-exclusive active nodes with sufficient resources are allocated before idle nodes. This means more efficient use of the HPC cluster, since two or more jobs will share a single node and leave more nodes available.
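A simple way to estimate how many identical shared jobs fit on one node is to take the tighter of the core and memory limits. The Python sketch below does this; the function name is made up, and the node memory figure in the usage line is a placeholder rather than the actual RAM of any premise node.

```python
def jobs_per_node(cores_per_node, mem_per_node_mb, cpus_per_job, mem_per_job_mb):
    """How many identical jobs fit on one node, limited by cores or by memory."""
    return min(cores_per_node // cpus_per_job, mem_per_node_mb // mem_per_job_mb)


# Four 6-core jobs fill a 24-core node, as in the NAMD example above.
# The 128000 MB node memory is an assumed placeholder value.
print(jobs_per_node(24, 128000, cpus_per_job=6, mem_per_job_mb=32))
```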

Important notes:

Exceeding the specified memory cancels your running job.
Your job is restricted to the CPU cores specified. If it tries to use more, it is slowed down but not cancelled.