Slurm is the job queuing system used on RCC's HPC cluster "premise".
You will find much more information at the official slurm website: http://slurm.schedmd.com
You must use slurm commands on "premise.sr.unh.edu" to submit your jobs for scheduling on the "premise" cluster's compute nodes. The options you provide define the required hardware and restrict the nodes on which your job will run.
All of the slurm commands support a --help option that provides a lot of good usage information.
The expectation is that your slurm job uses no more resources than you have requested. Unless you specify otherwise, the default is to run 1 task on 1 node with 1 cpu (also called a core or thread), reserving 2MB of physical RAM.
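For example, a sketch of explicitly requesting more than the defaults (the program name is a placeholder for your own code):

srun --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=100 ./my_program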
Partition to submit jobs to; defaults to "shared". Users who have purchased nodes can use their exclusive partition to queue their jobs with a higher priority on the nodes they paid for.
Your jobs will get a scheduling priority over jobs in the "shared" queue, but they will only be run on your hardware, even if "shared" nodes are idle. Using this option to gain priority is not always the best choice.
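For example, a hypothetical member of the group that owns the "harish" partition (shown in the listing below) could submit to it with:

sbatch --partition=harish myscript.csh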
PARTITION AVAIL TIMELIMIT   NODES STATE NODELIST
shared*   up    365-00:00:     14 alloc node[101-114]
harish    up    365-00:00:      4 alloc node[101-104]
1278 hostname COMPLETED 2016-06-27T13:17:47 2016-06-27T13:17:47 rea 00:00.042 node[101-114]
See "man sacct" for a list of more field names arguments for --format. Or use "sacct -e" to display them.
JOBID PARTITION     NAME    USER ST        TIME NODES NODELIST(REASON)
 1150    shared run1_pap harishv  R 22-05:27:13     4 node[105-108]
 1153    shared run2_pap harishv  R 22-05:20:56     3 node[109-111]
 1158    shared    npt03    dra4  R 16-22:39:46     3 node[112-114]
 1162    harish PB12PEO9 harishv  R  2-00:55:25     1 node103
 1163    harish     1cmz hossein  R  1-23:02:31     1 node104
 1171    harish  emb12r1    dra4  R     5:19:18     1 node101
 1172    harish  emb12r2    dra4  R     5:18:40     1 node102
JOBID,USER,NAME,NODES,NODELIST,CPUS,GRES,OVER_SUBSCRIBE,TIME,TIME_LEFT,COMMAND
2565,harishv,1zv4,1,node101,24,gpu:2,NO,36:00,39-23:24:00,/mnt/lustre/chem-eng/harishv/tmp/tmpqUweoX
JOBID PARTITIO SUBMIT_TIME      USER       GROUP      TIME     START_TIME       TIME_LEFT    i NODELIST(R
8192  harish   2017-07-22T18:12 harishv    chem-eng   0:00     2017-08-31T12:39 40-00:00:00  i (Resources
8054  shared   2017-07-19T14:38 arthurmelo hale       3-02:04: 2017-07-21T15:44 361-21:55:14 i node[109,1
8237  shared   2017-07-24T10:45 itelles12  macmanesla 5:51:30  2017-07-24T11:57 364-18:08:30 i node110
8249  shared   2017-07-24T15:15 gvc1002    mel        2:33:29  2017-07-24T15:15 364-21:26:31 i node105
20642 shared rea slurm_eg1+ COMPLETED 2018-02-19T17:06:37 00:00:06
20633 shared rea slurm_eg1+ COMPLETED 2018-02-19T17:06:37 00:00:06
20643 shared rea slurm_eg1+ FAILED    2018-02-19T17:08:37 00:00:00
Shows accounting for previously run jobs. The elapsed time column for completed jobs may be useful.
node105.rcchpc
node105.rcchpc
node105.rcchpc
Schedules the given command to be run as 3 independent tasks. More than one task may run simultaneously on a single node, until that node's resources are depleted. The above shows that each single-threaded task ran on the same node. Running with "-n 30" would exceed the 24 cores on a single node and require more than one node to run this job.
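For reference, output like the three lines above comes from an srun of this form (hostname is just a stand-in for your own command):

srun -n 3 hostname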
slurmstepd: error: execve(): bad_command: No such file or directory
srun: error: node105: task 0: Exited with exit code 2
LANG=en_US.UTF-8
LANGUAGE=
XMODIFIERS=@im=none
USER=rea
LOGNAME=rea
HOME=/mnt/lustre/rcc/rea
PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
MAIL=/var/spool/mail/rea
SHELL=/bin/tcsh
SSH_CLIENT=2606:4100:38a0:241::27 53596 22
SSH_CONNECTION=2606:4100:38a0:241::27 53596 2606:4100:38a0:240::100 22
SSH_TTY=/dev/pts/4
TERM=xterm-256color
XDG_SESSION_ID=1065
XDG_RUNTIME_DIR=/run/user/3489
HOSTTYPE=x86_64-linux
VENDOR=unknown
OSTYPE=linux
MACHTYPE=x86_64
SHLVL=1
PWD=/mnt/lustre/rcc/rea
GROUP=users
HOST=premise.sr.unh.edu
REMOTEHOST=q.sr.unh.edu
HOSTNAME=premise.sr.unh.edu
LS_COLORS=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:*.tar=38;5;9:*.tgz=38;5;9:*.arc=38;5;9:*.arj=38;5;9:*.taz=38;5;9:*.lha=38;5;9:*.lz4=38;5;9:*.lzh=38;5;9:*.lzma=38;5;9:*.tlz=38;5;9:*.txz=38;5;9:*.tzo=38;5;9:*.t7z=38;5;9:*.zip=38;5;9:*.z=38;5;9:*.Z=38;5;9:*.dz=38;5;9:*.gz=38;5;9:*.lrz=38;5;9:*.lz=38;5;9:*.lzo=38;5;9:*.xz=38;5;9:*.bz2=38;5;9:*.bz=38;5;9:*.tbz=38;5;9:*.tbz2=38;5;9:*.tz=38;5;9:*.deb=38;5;9:*.rpm=38;5;9:*.jar=38;5;9:*.war=38;5;9:*.ear=38;5;9:*.sar=38;5;9:*.rar=38;5;9:*.alz=38;5;9:*.ace=38;5;9:*.zoo=38;5;9:*.cpio=38;5;9:*.7z=38;5;9:*.rz=38;5;9:*.cab=38;5;9:*.jpg=38;5;13:*.jpeg=38;5;13:*.gif=38;5;13:*.bmp=38;5;13:*.pbm=38;5;13:*.pgm=38;5;13:*.ppm=38;5;13:*.tga=38;5;13:*.xbm=38;5;13:*.xpm=38;5;13:*.tif=38;5;13:*.tiff=38;5;13:*.png=38;5;13:*.svg=38;5;13:*.svgz=38;5;13:*.mng=38;5;13:*.pcx=38;5;13:*.mov=38;5;13:*.mpg=38;5;13:*.mpeg=38;5;13:*.m2v=38;5;13:*.mkv=38;5;13:*.webm=38;5;13:*.ogm=38;5;13:*.mp4=38;5;13:*.m4v=38;5;13:*.mp4v=38;5;13:*.vob=38;5;13:*.qt=38;5;13:*.nuv=38;5;13:*.wmv=38;5;13:*.asf=38;5;13:*.rm=38;5;13:*.rmvb=38;5;13:*.flc=38;5;13:*.avi=38;5;13:*.fli=38;5;13:*.flv=38;5;13:*.gl=38;5;13:*.dl=38;5;13:*.xcf=38;5;13:*.xwd=38;5;13:*.yuv=38;5;13:*.cgm=38;5;13:*.emf=38;5;13:*.axv=38;5;13:*.anx=38;5;13:*.ogv=38;5;13:*.ogx=38;5;13:*.aac=38;5;45:*.au=38;5;45:*.flac=38;5;45:*.mid=38;5;45:*.midi=38;5;45:*.mka=38;5;45:*.mp3=38;5;45:*.mpc=38;5;45:*.ogg=38;5;45:*.ra=38;5;45:*.wav=38;5;45:*.axa=38;5;45:*.oga=38;5;45:*.spx=38;5;45:*.xspf=38;5;45:
LESSOPEN=||/usr/bin/lesspipe.sh %s
MODULESHOME=/usr/share/Modules
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles:/mnt/lustre/software/modulefiles
LOADEDMODULES=
SLURM_PRIO_PROCESS=0
SRUN_DEBUG=3
SLURM_CLUSTER_NAME=rcchpc
SLURM_SUBMIT_DIR=/mnt/lustre/rcc/rea
SLURM_SUBMIT_HOST=premise.sr.unh.edu
SLURM_JOB_NAME=printenv
SLURM_JOB_CPUS_PER_NODE=24
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_DISTRIBUTION=cyclic
SLURM_JOB_ID=1176
SLURM_JOBID=1176
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES=1
SLURM_NODELIST=node105
SLURM_JOB_PARTITION=shared
SLURM_TASKS_PER_NODE=1
SLURM_SRUN_COMM_PORT=49012
SLURM_STEP_NODELIST=node105
SLURM_JOB_NODELIST=node105
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=49012
SLURM_SRUN_COMM_HOST=10.200.200.1
CUDA_VISIBLE_DEVICES=NoDevFiles
GPU_DEVICE_ORDINAL=NoDevFiles
SLURM_TOPOLOGY_ADDR=node105
SLURM_TOPOLOGY_ADDR_PATTERN=node
TMPDIR=/tmp
SLURM_CPUS_ON_NODE=24
SLURM_TASK_PID=20277
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=10.200.200.1
SLURM_GTIDS=0
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_JOB_UID=3489
SLURM_JOB_USER=rea
SLURMD_NODENAME=node105
node133.rcchpc
Exclude tells slurm not to consider these nodes when scheduling your job. Use the "--exclude" option on "srun" or "sbatch" to reduce the pool of machines on which your job may be scheduled. This can be very useful if your code doesn't run well on certain hardware, or if you wish to ensure that a long running job does not tie up a higher end node than it needs.
At this time the Premise cluster has two nodes (node117 & node118) with AMD (not Intel) based CPUs. Code optimized for Intel may not run as efficiently (or at all) on these two nodes.
Node ranges may also be given as node[117-118], but the brackets may need to be escaped so they are not interpreted by the calling shell.
For example: --exclude=node\[117-118\]
Consider using the --exclude option to keep long running jobs off these enhanced nodes:
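For example (my_long_job.csh is a placeholder for your own batch script):

sbatch --exclude=node\[117-118\] my_long_job.csh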
Use this command AFTER an "srun" of your code to determine how much memory it used. This can help you determine a minimum starting value. Always request a slightly higher value than your test case returns.
bash-4.2$ srun env
bash-4.2$ sacct -j 20632 -o jobid,jobname,maxrss,state --units=M
20632    env    0.87M    COMPLETED
Slurm allows for the definition of resources that can be requested in a generic way. The premise cluster has two "gres" resources defined. Those resources are:
Four nodes are equipped with an NVIDIA K80 GPU. The K80 appears as two individual GPU chips, each of which is equivalent to a slightly underclocked NVIDIA K40 GPU.
Regardless of which physical GPU you are allocated, it is referenced as if it were the only GPU on the system, with numbering starting at 0. If you request a single GPU and are allocated the second physical GPU, you must still access it as GPU 0 in your code. If you request both "K40" GPUs then your code will have access to both 0 and 1.
Please note that the GPU node purchaser has priority on these nodes and keeps them busy most of the time. Wait times for this resource are expected to be significant.
To request a gres resource, use the --gres argument to srun or sbatch with a value that specifies the generic resource you are requesting. The resource quantity defaults to 1, but can be given explicitly, as in "--gres=gpu:2".
sbatch --gres=gpu:1 myscript.csh
The myscript.csh script should run the desired code as if there were only one GPU on the system. This is likely the default behavior for most GPU-utilizing codes, but you may need to reference the GPU as index 0.
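As a minimal, hypothetical sketch (the program name and its --device argument are placeholders, not RCC-provided software), myscript.csh might look like:

#!/bin/csh
# slurm typically exposes only the allocated GPU(s) to the job, so a single
# requested GPU is addressed as device 0 regardless of which physical GPU it is
echo "CUDA_VISIBLE_DEVICES is: $CUDA_VISIBLE_DEVICES"
./my_gpu_code --device 0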
sbatch --gres=ramdisk hello_ramdisk.csh
The hello_ramdisk.csh script could look like this:
#!/bin/csh
echo "ramdisk as I found it"
ls -R /mnt/ramdisk

echo "make a directory for my job and create a file(s)"
set mydir=/mnt/ramdisk/$SLURM_JOB_ID
mkdir $mydir
echo "hello" > $mydir/world
echo "check the file"
cat -n $mydir/world

echo "ramdisk in use"
ls -R /mnt/ramdisk

echo "cleaning up after myself"
rm -rf $mydir
echo "ramdisk as I am leaving it"
ls -R /mnt/ramdisk
#!/bin/tcsh
#SBATCH -D $HOME/run_from_dir
#SBATCH --partition=shared
#SBATCH -J myjobname
#SBATCH --time=960:00:00
#SBATCH -N 1
#SBATCH --cpus-per-task=24
#SBATCH --output NAMD-myjob.log
#SBATCH --gres=gpu:2

cd $HOME/run_from_dir

foreach mod ( mpi/mvapich2-x86_64 namd )
    module load $mod
end
module list

namd2 +devices 0-1 ++ppn 24 ./input_file.conf
The following example is a more complex slurm job script written in python. Most example slurm job scripts are shell scripts, but other scripting languages may also be used. This example uses the "#SBATCH --array" comment syntax to submit 10 slurm jobs in a single submission, and to limit the number of concurrently running jobs to 2. These "#SBATCH" comments must appear at the top of the script.
A python process pool is created whose size matches the slurm job submission option "--cpus-per-task". For added complexity, the worker function is run 50 times on each node.
The "sig_handler function" is useful when running this example by hand, but not required for slurm. Adding the line "signal.signal(signal.SIGUSR1, sig_handler)" to the main() function would capture the event generated by slurm option "--signal=B:USR1@120".
The provided "SEnv class" is a shorthand way to use environment variables, and provide default values. The defaults are useful when running the code outside of slurm. In all cases below it would be cleaner to do the integer type cast inside the SEnv method. But that would cause problems for non-int slurm environment variables.
zip is used to generate a dictionary matching each input argument with its result. The best way to use the results depends on your data. Consider using pickle to save the final output for a later post-processing merge step.
#!/usr/bin/python
#SBATCH --ntasks-per-node=1 --cpus-per-task=24 --nodes=1 --mem=60 --time=6:00
#SBATCH --array=0-9%2
'''
Run my 1 task, 24 thread job, on 1 node, using 60MB RAM, for 6min.
Run 10 jobs starting at 0, running at most 2 concurrently.
'''
import signal, sys, os, time, logging, multiprocessing

pool = None

def sig_handler(signum, frame):
    "Handle ^C and other interrupts better."
    logging.warn("got signum={}".format(signum))
    global pool
    if pool:
        logging.warn("Pool cleanup")
        pool.terminate()
        pool.join()
    sys.exit(-1)

def worker(n, f=0.1, sec=2):
    "The work being threaded."
    logging.debug("worker {}".format(n))
    time.sleep(sec)
    return n*f

class SEnv(dict):
    "System Env lookup, dictionary defaults when env var DNE."
    def __getattribute__(self, name):
        return os.getenv(name, super(SEnv, self).get(name))

def main():
    logging.basicConfig(level=logging.INFO)
    signal.signal(signal.SIGINT, sig_handler)
    e = SEnv(dict(SLURM_CPUS_PER_TASK=4, SLURM_ARRAY_TASK_ID=3))
    ary_id = int(e.SLURM_ARRAY_TASK_ID)
    global pool
    pool = multiprocessing.Pool(processes=int(e.SLURM_CPUS_PER_TASK))
    single_arg_list = range(ary_id*100, ary_id*100+50)
    results = pool.map(worker, single_arg_list)
    logging.info("all processes completed: results={}".format(
        dict(zip(single_arg_list, results))))

if __name__ == "__main__":
    main()
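Assuming the script above is saved as slurm_eg1.py (the file name here is an assumption), it can be submitted with:

sbatch slurm_eg1.py

or, if the SIGUSR1 handler line mentioned above is added, with the signal delivered two minutes before the time limit:

sbatch --signal=B:USR1@120 slurm_eg1.py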
Jobs are given exclusive use of an entire node by default. RCC has configured slurm to allow users to share a node by opting in with the "--share" option. Unfortunately the "--share" option is not listed by "sbatch --help".
When memory is unspecified it defaults to the total amount of RAM on the node. For slurm to know how much memory remains available, you must specify the memory your job needs in MB (--mem=32).
CPU cores work the same way as memory: you are given everything available on the node by default. For slurm to know how many cpu cores remain available, you must specify how many cores your job needs (-c 6).
sbatch --share --mem=32 -c 6 run.sh
module add namd
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf
sNAMD.py --sopt "--share --mem=32" --ppn 6 input.conf
Non-exclusive active nodes with sufficient resources are allocated before idle nodes. This makes more efficient use of the HPC cluster, since two or more jobs share a single node, leaving more nodes available for other work.