The LSF Scheduler

The LSF Scheduler is the current scheduling program for the Engineering Cluster. It controls all job activity - including jobs created via OpenOnDemand.

Interactive Jobs

Interactive jobs come in several flavors.

  • VNC Sessions are started with the Interactive Apps menu option in the top grey bar in OpenOnDemand.
  • Jupyter Notebooks are started the same way.


You may also start interactive shell jobs. If you intend to have a long-running job, we suggest SSHing to ssh.engr.wustl.edu (or using the Clusters menu above for Shell Access) and starting the screen command, which will place you in a virtual terminal session that will continue to run after you disconnect.

If you do use screen, press CTRL-A then D to detach from the screen and let things continue in the background. To reconnect later, reconnect to ssh.engr.wustl.edu and use the command screen -r. If you have multiple screens, it will list them so you can choose which one to reattach.

You can start your screen with screen -R screenname, replacing screenname with a descriptive name you can use to reconnect later with screen -r screenname.
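
Putting that together, a typical named-session workflow looks like this ("myjob" is just an example name):

screen -R myjob     # start a session named "myjob" (or reattach if it already exists)
# ... launch your long-running command inside the session ...
# press CTRL-A then D to detach and leave it running
screen -r myjob     # reattach later from a new SSH session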

Another good utility for this is tmux.

You can start an interactive shell with the command:

bsub -q interactive -Is /bin/bash


Generally, most options to bsub will work with interactive jobs, such as requesting a GPU:

bsub -gpu "num=2:mode=exclusive_process:gmodel=TeslaK40c" -q interactive -Is /bin/bash


The -Is /bin/bash must be the last item on the command line.

Note that your job submission will sit and wait if resources are not available!

See the Basic Batch Jobs section below for more common resource options you can use with interactive jobs.
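
For example, an interactive shell that also reserves 4 CPUs on a single host and 16GB of RAM (using the same resource options described under Basic Batch Jobs) could be requested like so - the numbers here are just placeholders:

bsub -q interactive -n 4 -R "span[hosts=1]" -R "rusage[mem=16]" -Is /bin/bash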

Viewing Job Queues

The bqueues command gives you basic information about the status of the job queues in the system. Job queues are groups of nodes that are tasked to run certain types of jobs - nodes can be in more than one queue. Most queues are for special types of jobs, or for nodes that are dedicated to a certain research group.

  [seasuser@ssh ~]$ bqueues
  QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
  admin            50  Open:Active       -    -    -    -     0     0     0     0
  dataq            33  Open:Active       -    -    -    -     0     0     0     0
  normal           30  Open:Inact        -    -    -    -     0     0     0     0
  interactive      30  Open:Active       -    -    -    -     0     0     0     0
  SEAS-Lab-PhD      1  Open:Active       -    -    -    -     0     0     0     0

More information on bqueues is here.

There are a handful of common queues in the ENGR cluster available for use. Other listed queues from the bqueues command are for the use of specific labs, or the OnDemand system.

Queue               Usage                                            Notes
normal              Default queue for batch jobs
interactive         Queue for interactive jobs
cpu-compute         Advanced queue for batch jobs                    7-day job time limit
cpu-compute-long    Advanced queue for batch jobs                    21-day job time limit
cpu-compute-debug   Testing queue for the resources above            4-hour job time limit
gpu-compute         Advanced queue with Ampere GPUs for batch jobs   7-day job time limit
gpu-compute-long    Advanced queue with Ampere GPUs for batch jobs   21-day job time limit
gpu-compute-debug   Testing queue for the GPU queues above           4-hour job time limit

Viewing Cluster Nodes

bhosts gives information about the status of individual nodes in the cluster.

[seasuser@ssh ~]$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
gnode01.seas.wustl ok              -      1      0      0      0      0      0
gnode02.seas.wustl ok              -     16      0      0      0      0      0
node01.seas.wustl. ok              -      1      0      0      0      0      0
node02.seas.wustl. ok              -      1      0      0      0      0      0

More information on bhosts is here.

The bhosts -w -gpu command will list all GPUs in the cluster with their full names, but please be aware many GPUs are reserved for the use of specific research groups.

Viewing Running Jobs

bjobs gives information on running jobs.

[seasuser@ssh lsf]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
123     seasuse RUN   SEAS-CPU   ssh.seas.wu node01.seas CPU-Test   Jan 01 12:00


More information on bjobs options is here.

Basic Batch Jobs

Basic LSF Job Submission

bsub submits job script files.

A simple example of a job script file looks like so:

#BSUB -o cpu_test.%J
#BSUB -R '(!gpu)'
#BSUB -N
#BSUB -J PythonJob
#BSUB -R "rusage[mem=25]"
python mycode.py


The lines beginning with "#" are often referred to as pragmas - directives in the script that pass options to the scheduler.

This file submits a Python script to be run, with the output going to a file called cpu_test.XXX (-o cpu_test.%J, where %J becomes the job number). When the job is done, I'm mailed a notification (-N), and when I check status with bjobs, the job name is PythonJob (-J PythonJob).

The first -R option selects a machine type. At this time there are no specific types to request; however, this example requests that the scheduler avoid a GPU node - if you are not using GPU resources, it's good cluster citizenship to avoid those nodes.

The last -R option requests the amount of RAM (in GB) your job will use. It's good manners to do this on shared systems.

More information on bsub options is here.

If you are using Python 3+ and you're expecting to follow your output file as your Python code prints out information, you might not see it as it happens!
Python 3+ doesn't flush output with the regular "print" function.

To force it to do so, execute with the flag "-u":
python -u mycode.py

or make sure your print statements include a 'newline' "\n" at the end (alternatively, pass flush=True to your print calls).

See:

https://stackoverflow.com/questions/25897335/why-doesnt-print-output-show-up-immediately-in-the-terminal-when-there-is-no-ne

for more.


To submit the job, if you have saved the above example as the file "cpujob":

[seasuser@ssh lsf]$ bsub < cpujob
Job <123> is submitted to queue default

shows the job is submitted successfully.

Sometimes when you create a job file with a Windows file editor, the file has a different type of "newline" code than Linux uses. That confuses LSF's bsub program, and usually results in a job that fails instantly. To see if your submission file is affected, do


file cpujob

and if it comes back as having "<CR><LF>" line terminators, do


dos2unix cpujob

to convert the line terminators to Linux-style.


Requesting Multiple CPUs

You can request that a job be assigned more than one CPU. It is imperative that the application you are running, or the code you are executing, respect that setting. To request multiple CPUs, add

#BSUB -n 4
#BSUB -R "span[hosts=1]"

to your script to request 4 CPUs as an example. Most hosts in the ENGR cluster have at least 16 cores, some up to 36 - the second line tells LSF to make sure all the CPUs requested are on the same host. To use CPUs on multiple hosts, your application/program must support MPI, addressed in another article.

Your application or code must be told to use the number of CPUs asked for. LSF sets an environment variable called LSB_DJOB_NUMPROC that matches the number of CPUs you requested. The method for doing this varies widely.

R, for example, can have this in its scripts:

cores <- as.numeric(Sys.getenv('LSB_DJOB_NUMPROC'))

to read that variable as the number of cores it can use. In Python, os.environ['LSB_DJOB_NUMPROC'] could be passed to the proper variable or function depending on the code in use.
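
As a minimal Python sketch (assuming the standard library multiprocessing module and a made-up work function), that could look like:

import os
from multiprocessing import Pool

def square(x):
    # placeholder work function
    return x * x

if __name__ == "__main__":
    # Fall back to 1 CPU if the variable is not set (e.g. when running outside LSF)
    cores = int(os.environ.get("LSB_DJOB_NUMPROC", "1"))
    with Pool(processes=cores) as pool:
        print(pool.map(square, range(16)))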

Avoiding GPUs

If you are not using GPUs, it is good manners to avoid those nodes. You can do that like so:

#BSUB -R '(!gpu)'

Requesting Memory

You should request in advance the amount of RAM (in GB) your job expects to need - it's good manners on shared systems. Do this with the pragma:

#BSUB -R "rusage[mem=25]"

This example requests 25GB of system memory. Your job won't start until there's a node capable of giving you that much RAM.

Requesting GPUs

The SEAS Compute cluster has support for GPU computing. Requesting a GPU requires you to choose the queue you wish to submit to - many GPUs are within faculty-owned devices and have limited availability.

A simple example of a GPU job script file looks like so:

#BSUB -o mygpucode_out.%J
#BSUB -R "select[type==any]"
#BSUB -gpu "num=1:mode=exclusive_process:gmodel=TeslaK40c"
#BSUB -N
#BSUB -J PythonGPUJob
python mygpucode.py

This file submits a Python script to be run, with the output going to a file called mygpucode_out.XXX (-o mygpucode_out.%J, where %J becomes the job number). When the job is done, I'm mailed a notification (-N), and when I check status with bjobs, the job name is PythonGPUJob (-J PythonGPUJob). I've further requested a single Tesla K40c to run the job against (-gpu "num=1:mode=exclusive_process:gmodel=TeslaK40c") that starts in exclusive process mode - so only that process can access the GPU.

In the open access cluster nodes, the available GPU types are:

  • NVIDIAGeForceRTX2080
  • NVIDIAGeForceGTX1080Ti
  • TeslaK40c
  • NVIDIAGeForceGTXTITANBlack
  • NVIDIAGeForceGTXTITAN

The available GPU types in the gpu-compute queues (see Special Queues below) are:

  • NVIDIAA40
  • NVIDIAA10080GBPCIe
  • NVIDIAA100_SXM4_80GB

The full name of a card must be used, and capitalization matters.

The complete list of pragmas available for requesting GPU resources is here: https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/bsub.gpu.1.html
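
As a further hedged example, a batch submission targeting one of the A40 cards in the gpu-compute queue (see Special Queues below) might look like the following; the script and output file names are placeholders:

#BSUB -q gpu-compute
#BSUB -o mygpucode_out.%J
#BSUB -J A40Job
#BSUB -gpu "num=1:mode=exclusive_process:gmodel=NVIDIAA40"
python mygpucode.py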

Requesting Specific Hosts or Groups of Hosts

If you have a need to run on a specific host, you can request one or more candidates by adding this to your submission file:

#BSUB -m "host1 host2"

where "host1 host2" are the hostnames of machines you want to allow your job to run on - you can specify one or more hosts, and LSF will either choose whatever is available (in the case of multiple hosts) or wait for the requested host to become available.

Alternatively, you can choose a group of hosts (you can see existing groups with the bmgroups command):

#BSUB -m SEAS-CPUALL


The above pragma would pick any of the CPU-only open-access nodes in the cluster.

Be very careful with this pragma - you can cause your jobs to run poorly or not execute as soon as they could if you pile your jobs on specific hosts, rather than choosing resources generically.

Requesting Multiple Queues

You can submit jobs to multiple queues, and the job will run wherever resources are first available.

#BSUB -q "cpu-compute normal" 

will submit a job to either of the above queues, favoring the new cpu-compute queue, and will fall back to the normal queue if no resources are available there.

Deleting (Killing) Jobs

If you need to stop a job that is either running or pending, use the command

bkill jobnum

where "jobnum" is the number of the job to kill, as shown by the bjobs command.
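
If you want to clear out everything you have submitted, LSF's bkill also accepts 0 as a special job ID meaning all of your own jobs - double-check with bjobs before using it:

bkill 0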

Job Arrays

If you have a job that would benefit from running thousands of times, with input/output files changing against a job index number, you can start a job array.

Job arrays will run the same program multiple times, and an environment variable called $LSB_JOBINDEX will increment to provide a unique number to each job. An array job submission might look like so:

#BSUB -n 3
#BSUB -R "span[hosts=1]"
#BSUB -J "thisjobname[1-9999]"

python thiswork.py -i $LSB_JOBINDEX

This job script requested 3 CPUs on a single host for each array task, and started 9,999 tasks, with the input file being the number of the task - so the first task expected an input file named "1".
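
As a hypothetical sketch of the receiving side, thiswork.py might read the index with argparse and open the matching input file (the flag and file naming mirror the example above but are otherwise illustrative):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-i", type=int, required=True, help="job array index from $LSB_JOBINDEX")
args = parser.parse_args()

# Each array task opens the input file named after its index ("1", "2", ...)
with open(str(args.i)) as f:
    data = f.read()

print(f"Task {args.i} read {len(data)} bytes")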

Basic MPI Jobs - Normal Queue

MPICH uses the regular networking interface of any given node to transfer MPI communications. You should only use this in cases where your program does not rely on heavy MPI traffic to accomplish its work - minor status coordination of embarrassingly parallel jobs is OK; large-scale attempts at memory sharing will not work out well.

bsub submits job script files.

A simple example of a MPICH job script file looks like so:

#BSUB -G SEAS-Lab-Group
#BSUB -o mpi_test.%J
#BSUB -R '(!gpu)'
#BSUB -N
#BSUB -J MPIjob
#BSUB -a mpich
#BSUB -n 30

module add mpi/mpich-3.2-x86_64
mpiexec -np 30 /path/to/mpihello


The lines beginning with "#" are often referred to as pragmas - directives in the script that pass options to the scheduler.

This file submits an MPI "Hello World" program (available under /project/compute/bin) to the research group "SEAS-Lab-Group". The output goes to a file "mpi_test.%J", where %J becomes the job number.

When the job is done, I'm mailed a notification (-N), and when I check status with bjobs, the job name is MPIjob (-J MPIjob).

Additionally, I've told the scheduler I want the "mpich" resource (-a mpich) and I want to use 30 CPUs (-n 30).

The -G option is not necessary if your research group does not have any lab-owned priority hardware, and the group name "SEAS-Lab-Group" is a nonexistent placeholder.

The -R option selects a machine type. At this time there are no specific types to request; however, this example requests that the scheduler avoid a GPU node - if you are not using GPU resources, it's good cluster citizenship to avoid those nodes.

More information on bsub options is here.

To submit the job, if you have saved the above example as the file "mpijob":

[seasuser@ssh lsf]$ bsub < mpijob
Job <123> is submitted to queue default
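
If your own code is in Python, a minimal hello-world equivalent can be written with mpi4py (this assumes mpi4py is installed in your Python environment, which this article does not cover). You would then replace the mpiexec line in the job script with something like mpiexec -np 30 python /path/to/hello_mpi.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each MPI rank reports its identity; with -np 30 you should see 30 lines of output
print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()}")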

Apptainer/Docker Containers

Apptainer (previously Singularity) is a type of container management that runs daemonless, executing much more like a user application. It can run/convert existing Docker containers and pull containers from Apptainer, Docker, or personal container repositories.

Unlike Docker, it stores its containers as single files with the .sif extension. Also unlike Docker, the default pull action for Apptainer pulls the container to the current working directory, not a central store location.

Apptainer's documentation is here: https://apptainer.org/docs/user/latest/

In these basic instructions, we will use Apptainer as a user application within LSF - starting jobs as normal, but executing Apptainer and a container as the main content of the job.

See Managing Apptainer/Singularity Containers for information on how to create Apptainer containers within the ENGR cluster.

Interactive Apptainer Jobs

Start your interactive job as you normally would, requesting specific resources (like GPUs) as normal. Once in your session, you can pull a new container or start a container that already exists locally.

To pull a container, make sure you are in a directory that has enough space, or your shared project directory. Either

apptainer pull alpine

or

apptainer pull docker://alpine

will bring down the default Alpine image (a small, basic Linux environment) from the Apptainer or Docker container repository, respectively. After that:

apptainer exec alpine /bin/sh

enters into the container. Alternatively:

apptainer run alpine

will run the defined runscript for the container (do note Alpine does not have a run script - it will simply exit, having no task to complete).

Accessing File Resources Inside Containers

Apptainer will, by default, mount your home directory, /tmp, /var/tmp, and $PWD. You can bind additional locations if needed, either to the original location or to specific locations as required by your container.

If your container expects to see your data, stored on the system as /project/group/projdata, under /data, you can do:

apptainer run --bind /project/group/projdata:/data mycontainer

A quick way to recreate the general directories you would expect to see from /project or /storage1 on the compute node would be:

apptainer run --bind /opt,/project,/home,/storage1 mycontainer

A single directory listed as part of a --bind option simply binds the directory in-place.

GPUs and Apptainer

Apptainer fully supports GPU containers. To run a GPU container, add the --nv flag to your command line:

apptainer run --bind /project,/opt,/home,/storage1 --nv tensorflow_latest-gpu.sif

Your code would then run as you would expect, just within the TensorFlow container and with access to the node's GPUs.

Apptainer Background Instances

Some containers are meant to provide services to other processes, such as the CARLA Simulator. Apptainer containers can run as instances, meaning they will run as background processes. To run something as an instance:

apptainer instance start --nv CARLA_latest.sif user_CARLA

user_CARLA here is the friendly name of the instance. To see running instances:

user@host> apptainer instance list
INSTANCE NAME    PID    IP    IMAGE
user_CARLA       5222         /tmp/CARLA_latest.sif

To connect to a running instance, you can use either the run or shell subcommands:

apptainer run instance://user_CARLA
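
The shell form works the same way and drops you into an interactive shell inside the running instance:

apptainer shell instance://user_CARLA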

Apptainer in Batch Jobs

Using Apptainer in a batch job is just as easy as running any other batch job. Craft your job script as you normally would, substituting an Apptainer command to run the container rather than an on-host executable file.
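
A minimal sketch of such a job script, reusing the TensorFlow container and bind mounts from the earlier examples (the container, script, and output file names are placeholders for your own):

#BSUB -o apptainer_test.%J
#BSUB -J ApptainerJob
#BSUB -gpu "num=1:mode=exclusive_process"

apptainer exec --bind /project,/opt,/home,/storage1 --nv tensorflow_latest-gpu.sif python mycode.py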


Special Queues - CPU and GPU

Special CPU Queues

There are three special CPU queues in the cluster meant to service batch jobs with our best systems.

Queue               Time Limit   Interactive?
cpu-compute         7 days       n
cpu-compute-long    21 days      n
cpu-compute-debug   4 hours      y

Jobs will be killed at the specified time limits. Only the Debug queue allows interactive jobs under a short time limit to maximize availability of these limited resources.

You may submit to these queues with the "-q" option, indicating the queue name after the flag.

Scratch Space

You may utilize the path /scratch on all of these machines to access a large slice of NVME scratch space. Data is cleaned up automatically after 10 days, and you are responsible for moving data from /scratch to a permanent storage location.

The directory /scratch/long is cleaned after 24 days, for jobs using the cpu-compute-long queue.
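
As a hedged illustration, a job in one of these queues might stage data through /scratch like so (all of the paths below are placeholders for your own project locations):

#BSUB -q cpu-compute
#BSUB -o scratch_test.%J
#BSUB -J ScratchJob

# Stage input into fast local scratch, run there, then copy results back out
mkdir -p /scratch/$USER/$LSB_JOBID
cp /project/group/projdata/input.dat /scratch/$USER/$LSB_JOBID/
cd /scratch/$USER/$LSB_JOBID
python /project/group/code/mycode.py input.dat
cp results.out /project/group/projdata/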

Special GPU Queues

There are three special GPU queues in the cluster meant to service batch jobs with our best GPUs, which at this time includes nVIDIA A40s and A100s.

Queue                    Time Limit   Interactive?
gpu-compute              7 days       n
gpu-compute-long         21 days      n
gpu-compute-debug        4 hours      y
gpu-compute-debug-long   12 hours     y

Jobs will be killed at the specified time limits. Only the debug queues allow interactive jobs, under short time limits to maximize availability of these limited resources.

You may submit to these queues with the "-q" option, indicating the queue name after the flag. The request string for each GPU model:

GPU                     LSF Device Name        # Devices
nVIDIA A100 80GB        NVIDIAA10080GBPCIe     8
nVIDIA A100 SXM4 80GB   NVIDIAA100_SXM4_80GB   12
nVIDIA A40 48GB         NVIDIAA40              12

Scratch Space

You may utilize the path /scratch on all of these machines to access a large slice of NVME scratch space. Data is cleaned up automatically after 10 days, and you are responsible for moving data from /scratch to a permanent storage location.

The directory /scratch/long is cleaned after 24 days, for jobs using the gpu-compute-long queue.


Multi-instance GPU capability is currently disabled on A100 hosts.


Multi-Instance A100 GPUs

Two of the 8 A100s are currently configured for Multi-Instance. This allows users to request a logical subset of a GPU, leaving the remainder available for other users. This is highly recommended for interactive debug sessions, and would yield a maximum of 14 10GB GPUs for use interactively.

Each GPU can be subdivided up to 7 ways, each receiving a certain portion of the RAM (listed in GB) and compute capacity (listed in sevenths of the whole) of each card.

MIG Option   Slice layout per card   Maximum MIG GPUs
mig=1        10GB/1C x7              7 GPUs with 10GB RAM and 1 compute slice
mig=2        20GB/2C x3              3 GPUs with 20GB RAM and 2 compute slices
mig=3        40GB/3C x2              2 GPUs with 40GB RAM and 3 compute slices
mig=4        40GB/4C x1              1 GPU with 40GB RAM and 4 compute slices
mig=7        80GB/7C x1              1 GPU with 80GB RAM and 7 compute slices

The above table shows the effects of requesting MIG GPUs. In the event that multiple MIG sizes are requested, the availability of a requested size depends on what has already been subdivided: concurrent MIG GPUs on a card must fit within its seven compute slices without overlapping.

For example, if a job requests a "mig=3" GPU, subsequent jobs could either request (1) "mig=2" or (3) "mig=1" jobs. The "mig=4" GPU is a special case, taking 4/7ths of the compute against half the RAM, limiting other users to either (1) "mig=2" GPUs or (2) "mig=1" GPUs.

"mig=3" and "mig=4" GPUs are granted 2 NVDEC units; "mig=2" GPUs are granted one; "mig=1" GPUs do not receive an NVDEC unit.

Requesting a MIG slice would look like so (here requesting a mig=1 GPU):


#BSUB -gpu "num=1:gmodel=NVIDIAA10080GBPCIe:mig=1"
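
For example, an interactive debug session against a single 10GB MIG slice, combining this pragma with the gpu-compute-debug queue recommended above for interactive debugging, could be requested like so:

bsub -q gpu-compute-debug -gpu "num=1:gmodel=NVIDIAA10080GBPCIe:mig=1" -Is /bin/bash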