Monitoring Jobs and Partitions/Queues
Kill Running Jobs
Jobs can be killed by using scancel. Shown below are some example commands.
# Kill all PENDING job owned by sleong on partition general-short
scancel --state=PENDING --user=sleong --partition=general-short
# Kill the job with job id 1234
scancel 1234
# Send signal to the job with job id 1234 with signal USR1
scancel --signal=USR1 1234Check Job Status
To check Slurm job status, squeue and scontrol are used. Shown below are example commands.
# Show all jobs
squeue
# Show a user's jobs
squeue -u sleong
# Show a specific job
squeue --job 1234
# Show detail of a job
scontrol show job=1234Check Queues/Partitions
The sinfo is used to check Slurm Queues/Partitions. Shown below is an example command and output.
[sleong@c2-login-001 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
consumption up infinite 0 n/a
general* up infinite 5 inval c2-gpu-[017-018],c2-node-[084-086]
general* up infinite 2 down* c2-gpu-[010,012]
general* up infinite 1 drng c2-node-001
general* up infinite 2 drain c2-node-[002,004]
general* up infinite 2 mix c2-gpu-011,c2-node-003
general* up infinite 1 alloc c2-gpu-013
general* up infinite 90 idle c2-bigmem-001,c2-gpu-[001-009,014-015],c2-node-[005-078,080-083]
general-preemptable up infinite 0 n/a
general-short up 30:00 5 inval c2-gpu-[017-018],c2-node-[084-086]
general-short up 30:00 2 down* c2-gpu-[010,012]
general-short up 30:00 1 drng c2-node-001
general-short up 30:00 2 drain c2-node-[002,004]
general-short up 30:00 2 mix c2-gpu-011,c2-node-003
general-short up 30:00 1 alloc c2-gpu-013
general-short up 30:00 90 idle c2-bigmem-001,c2-gpu-[001-009,014-015],c2-node-[005-078,080-083]
subscription up infinite 0 n/a There is another Slurm command sjstat that is similar to the LSF bqueues. Shown below is an example command and output.
[sleong@c2-login-001 ~]$ sjstat -c
Scheduling pool data:
-------------------------------------------------------------
Pool Memory Cpus Total Usable Free Other Traits
-------------------------------------------------------------
consumpti 0Mb 0 0 0 0
general* 4116480Mb 64 1 1 1 location=local
general* 1024000Mb 64 101 94 90 location=local
general* 773120Mb 256 1 1 1 location=local
general-p 0Mb 0 0 0 0
general-s 4116480Mb 64 1 1 1 location=local
general-s 1024000Mb 64 101 94 90 location=local
general-s 773120Mb 256 1 1 1 location=local
subscript 0Mb 0 0 0 0 Check Job Output
There is a similar command to LSF bpeek in Slurm called sattach. You can use it to check for standard output and error of your jobs. Shown below is an example command.
# Usage: sattach [options] <jobid.stepid>
sattach 15568.0Query Slurm Accounting/Usage Data
If you want to query your usage and accounting information, Slurm provides a command sacct to query Slurm accounting data. Shown below are example commands. Please refer to Slurm Workload Manager - sacct documentation for details.
# Show your usage
sacct
# Show your usage in details
sacct -lA job usage data can be queried with the command sstat. Shown below is an example command.
sstat 1234