The LSF Scheduler
The LSF Scheduler is the current scheduling program for the Engineering Cluster. It controls all job activity - including jobs created via OpenOnDemand.
Interactive Jobs
Interactive jobs come in several flavors.
- VNC Sessions are started with the Interactive Apps menu option in the top grey bar in OpenOnDemand.
- Jupyter Notebooks are started the same way.
You may also start interactive shell jobs. If you intend to have a long-running job, we suggest SSHing to ssh.engr.wustl.edu (or using the Clusters menu above for Shell Access) and starting the screen command, which will place you in a virtual terminal that will continue to run after you disconnect.
If you do use screen, press CTRL-A then D to detach from the screen and let things continue in the background. To reconnect later, log back in to ssh.engr.wustl.edu and use the command screen -r. If you have multiple screens, it will list what is running.
You can start your screen with screen -R screenname, replacing screenname with a descriptive name you can use to reconnect later with screen -r screenname.
Another good utility for this is tmux.
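As a sketch, the equivalent tmux workflow looks like this (the session name mysession is just a placeholder):
tmux new -s mysession      # start a named session
# work, then press CTRL-B then D to detach
tmux ls                    # list running sessions
tmux attach -t mysession   # reattach later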
You can start an interactive shell with the command:
bsub -q interactive -Is /bin/bash
Generally, most options to bsub will work with interactive jobs, such as requesting a GPU:
bsub -gpu "num=2:mode=exclusive_process:gmodel=TeslaK40c" -q interactive -Is /bin/bash
The -Is /bin/bash must be the last item on the command line.
Note that your job submission will sit and wait if resources are not available!
See the Basic Batch Jobs section below for more common resource options you can use with interactive jobs.
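For example, a sketch of an interactive shell that also asks for 4 CPUs on one host and 16 GB of RAM, combining options covered later in this page:
bsub -q interactive -n 4 -R "span[hosts=1]" -R "rusage[mem=16]" -Is /bin/bash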
Viewing Job Queues
The bqueues command gives you basic information about the status of the job queues in the system. Job queues are groups of nodes that are tasked to run certain types of jobs - nodes can be in more than one queue. Most queues are for special types of jobs, or for nodes that are dedicated to a certain research group.
[seasuser@ssh ~]$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
admin 50 Open:Active - - - - 0 0 0 0
dataq 33 Open:Active - - - - 0 0 0 0
normal 30 Open:Inact - - - - 0 0 0 0
interactive 30 Open:Active - - - - 0 0 0 0
SEAS-Lab-PhD 1 Open:Active - - - - 0 0 0 0
More information on bqueues is here.
There are a handful of common queues in the ENGR cluster available for use. Other listed queues from the bqueues command are for the use of specific labs, or the OnDemand system.
Queue | Usage | Notes |
---|---|---|
normal | Default queue for batch jobs | |
interactive | Queue for interactive jobs | |
cpu-compute | Advanced queue for batch jobs | 7-day job time limit |
cpu-compute-long | Advanced queue for batch jobs | 21-day job time limit |
cpu-compute-debug | Testing queue for the resources above | 4-hour job time limit |
gpu-compute | Advanced queue with Ampere GPUs for batch jobs | 7-day job time limit |
gpu-compute-long | Advanced queue with Ampere GPUs for batch jobs | 21-day job time limit |
gpu-compute-debug | Testing queue for the GPU queues above | 4-hour job time limit |
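To see the full configuration of a particular queue from the table - limits, run windows, and member hosts - bqueues accepts a queue name with the -l (long format) flag. A quick sketch:
bqueues -l cpu-compute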
Viewing Cluster Nodes
bhosts gives information about the status of individual nodes in the cluster.
[seasuser@ssh ~]$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
gnode01.seas.wustl ok - 1 0 0 0 0 0
gnode02.seas.wustl ok - 16 0 0 0 0 0
node01.seas.wustl. ok - 1 0 0 0 0 0
node02.seas.wustl. ok - 1 0 0 0 0 0
More information on bhosts is here.
The bhosts -w -gpu command will list all GPUs in the cluster with their full names, but please be aware many GPUs are reserved for the use of specific research groups.
Viewing Running Jobs
bjobs gives information on running jobs.
[seasuser@ssh lsf]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
123 seasuse RUN SEAS-CPU ssh.seas.wu node01.seas CPU-Test Jan 01 12:00
More information on bjobs options is here.
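A few commonly useful bjobs variations, shown as a sketch (job ID 123 is just the example from the output above):
bjobs -l 123    # long-format details for a single job
bjobs -p        # pending jobs, with the reason they have not started
bjobs -u all    # jobs from every user, not just your own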
Basic Batch Jobs
Basic LSF Job Submission
bsub submits job script files.
A simple example of a job script file looks like so:
#BSUB -o cpu_test.%J
#BSUB -R '(!gpu)'
#BSUB -N
#BSUB -J PythonJob
#BSUB -R "rusage[mem=25]"
python mycode.py
The lines beginning with "#BSUB" are often referred to as pragmas - directives that set options for the scheduler.
This file submits a Python script to be run with the output going to a file called cpu_test.XXX (-o cpu_test.%J, where %J becomes the job number). When the job is done, I'm mailed a notification (-N), and when I check status with bjobs, the job name is PythonJob (-J PythonJob).
The first -R option selects a machine type. At this time there are no specific types to request; however, this example requests that the scheduler avoid a GPU node - if you are not using GPU resources, it's good cluster citizenship to avoid those nodes.
The last -R option requests the amount of RAM (in GB) your job will use. It's good manners to do this on shared systems.
More information on bsub options is here.
If you are using Python 3+, and you're expecting to follow your output file as your Python code prints out information, you might not be seeing it as it happens!
Python 3+ doesn't flush output with the regular "print" function.
To force it to do so, execute with the flag "-u":
python -u mycode.py
or pass flush=True to your print calls.
To submit the job, if you have saved the above example as the file "cpujob":
[seasuser@ssh lsf]$ bsub < cpujob
Job <123> is submitted to queue default
shows the job is submitted successfully.
Sometimes when you create a job file with a Windows file editor, the file has a different type of "newline" code than Linux uses. That confuses LSF's bsub program, and usually results in a job that fails instantly. To see if your submission file is affected, do
file cpujob
and if it comes back as having "<CR><LF>" line terminators, do
dos2unix cpujob
to convert the line terminators to Linux-style.
Requesting Multiple CPUs
You can request that a job be assigned more than one CPU. It is imperative that the application you are running, or the code you are executing, respect that setting. To request multiple CPUs, add
#BSUB -n 4
#BSUB -R "span[hosts=1]"
to your script to request 4 CPUs as an example. Most hosts in the ENGR cluster have at least 16 cores, some up to 36 - the second line tells LSF to make sure all the CPUs requested are on the same host. To use CPUs on multiple hosts, your application/program must support MPI, addressed in another article.
Your application or code must be told to use the number of CPUs asked for. LSF sets an environment variable called LSB_DJOB_NUMPROC that matches the number of CPUs you requested. The method for doing this varies widely.
R, for example, can have this in its scripts:
cores <- as.numeric(Sys.getenv('LSB_DJOB_NUMPROC'))
to read that variable as the number of cores it can use. In Python, os.environ['LSB_DJOB_NUMPROC'] could be passed to the proper variable or function depending on the code in use.
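As a sketch, a job script can also hand that value straight to your program on the command line (the --threads flag on mycode.py is a hypothetical example - use whatever option your code actually reads):
#BSUB -n 4
#BSUB -R "span[hosts=1]"
python mycode.py --threads "$LSB_DJOB_NUMPROC"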
Avoiding GPUs
If you are not using GPUs, it is good manners to avoid those nodes. You can do that like so:
#BSUB -R '(!gpu)'
Requesting Memory
You should request, in advance, the amount of RAM (in GB) your job expects to need - it's good manners on shared systems. Do this with the pragma:
#BSUB -R "rusage[mem=25]"
This example requests 25GB of system memory. Your job won't start until there's a node capable of giving you that much RAM.
Requesting GPUs
The SEAS Compute cluster has support for GPU computing. Requesting a GPU requires you to choose the queue you wish to submit to - many GPUs are within faculty-owned devices and have limited availability.
A simple example of a GPU job script file looks like so:
#BSUB -o mygpucode_out.%J
#BSUB -R "select[type==any]"
#BSUB -gpu "num=1:mode=exclusive_process:gmodel=TeslaK40c"
#BSUB -N
#BSUB -J PythonGPUJob
python mygpucode.py
This file submits a Python script with the output going to a file called mygpucode_out.XXX (-o mygpucode_out.%J, where %J becomes the job number). When the job is done, I'm mailed a notification (-N), and when I check status with bjobs, the job name is PythonGPUJob (-J PythonGPUJob). I've further requested a single Tesla K40c to run the job against (-gpu "num=1:mode=exclusive_process:gmodel=TeslaK40c") that starts in exclusive process mode - so only that process can access the GPU.
In the open access cluster nodes, the available GPU types are:
- NVIDIAGeForceRTX2080
- NVIDIAGeForceGTX1080Ti
- TeslaK40c
- NVIDIAGeForceGTXTITANBlack
- NVIDIAGeForceGTXTITAN
The available GPU types in the gpu-compute queues (see Special Queues below) are:
- NVIDIAA40
- NVIDIAA10080GBPCIe
- NVIDIAA100_SXM4_80GB
The full name of a card must be used, and capitalization matters.
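For example, a sketch of a pragma requesting one of the open-access cards listed above (the mode and model here simply mirror the earlier K40c example):
#BSUB -gpu "num=1:mode=exclusive_process:gmodel=NVIDIAGeForceRTX2080"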
The complete list of pragmas available for requesting GPU resources is here: https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/bsub.gpu.1.html
Requesting Specific Hosts or Groups of Hosts
If you have a need to run on a specific host, you can request one or more candidates by adding this to your submission file:
#BSUB -m "host1 host2"
where "host1 host2" are the hostnames of machines you'd want to allow your job to run on - you can specify one or more hosts, and LSF will either choose whichever is available (when multiple hosts are listed) or wait for the single requested host to become available.
Alternatively, you can choose a group of hosts (you can see existing groups with the bmgroups command):
#BSUB -m SEAS-CPUALL
The above pragma would pick any of the CPU only open boxes in the cluster.
Be very careful with this pragma - you can cause your jobs to run poorly or not execute as soon as they could if you pile your jobs on specific hosts, rather than choosing resources generically.
Requesting Multiple Queues
You can submit jobs to multiple queues, and the job will run wherever resources are first available.
#BSUB -q "cpu-compute normal"
will submit a job to either of the above queues, favoring the new cpu-compute queue, and will fall back to the normal queue if no resources are available there.
Deleting (Killing) Jobs
If you need to stop a job that is either running or pending, use the command
bkill jobnum
where "jobnum" is the number of the job to kill, as shown from the bjobs command.
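A quick sketch, using the job ID from the bjobs example above; job ID 0 is LSF shorthand for all of your own jobs:
bkill 123    # kill one specific job
bkill 0      # kill all of your pending and running jobs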
Job Arrays
If you have a job that would benefit from running thousands of times, with input/output files changing against a job index number, you can start a job array.
Job arrays will run the same program multiple times, and an environment variable called $LSB_JOBINDEX will increment to provide a unique number to each task. An array job submission might look like so:
#BSUB -n 3
#BSUB -R "span[hosts=1]"
#BSUB -J "thisjobname[1-9999]"
python thiswork.py -i $LSB_JOBINDEX
This job script requested 3 CPUs on a single host for each array task, and started 9,999 tasks, with the input file being the number of the task - so the first task expected an input file named "1".
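LSF can also keep the output of each task separate - %I in the -o template expands to the array index, just as %J expands to the job ID. A sketch, assuming the same thiswork.py script:
#BSUB -J "thisjobname[1-9999]"
#BSUB -o thisjobname.%J.%I
python thiswork.py -i $LSB_JOBINDEX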
Basic MPI Jobs - Normal Queue
bsub submits job script files.
A simple example of a MPICH job script file looks like so:
#BSUB -G SEAS-Lab-Group
#BSUB -o mpi_test.%J
#BSUB -R '(!gpu)'
#BSUB -N
#BSUB -J MPIjob
#BSUB -a mpich
#BSUB -n 30
module add mpi/mpich-3.2-x86_64
mpiexec -np 30 /path/to/mpihello
The lines beginning with "#BSUB" are often referred to as pragmas - directives that set options for the scheduler.
This file submits an MPI "Hello World" program (available under /project/compute/bin) to the research group "SEAS-Lab-Group". The output goes to a file "mpi_test.%J", where %J becomes the job number.
When the job is done, I'm mailed a notification (-N), and when I check status with bjobs, the job name is MPIjob (-J).
Additionally, I've told the scheduler I want the "mpich" resource (-a mpich) and I want to use 30 CPUs (-n 30).
The -G option is not necessary if your research group does not have any lab-owned priority hardware, and the group name "SEAS-Lab-Group" is a nonexistent placeholder.
The -R option selects a machine type. At this time there are no specific types to request; however, this example requests that the scheduler avoid a GPU node - if you are not using GPU resources, it's good cluster citizenship to avoid those nodes.
More information on bsub options is here.
To submit the job, if you have saved the above example as the file "mpijob":
[seasuser@ssh lsf]$ bsub < mpijob
Job <123> is submitted to queue default
Apptainer/Docker Containers
Apptainer (previously Singularity) is a container platform that runs daemonless, executing much more like a user application. It can run and convert existing Docker containers, and pull containers from Apptainer, Docker, or personal container repositories.
Unlike Docker, it stores its container format in a single file, ending in .sif. Also unlike Docker, the default pull action for Apptainer pulls the container to the current working directory, and not a central store location.
Apptainer's documentation is here:Ā https://apptainer.org/docs/user/latest/
In these basic instructions, we will use Apptainer as a user application within LSF - starting jobs as normal, but executing Apptainer and a container as the main content of the job.
See Managing Apptainer/Singularity Containers for information on how to create Apptainer containers within the ENGR cluster.
Interactive Apptainer Jobs
Start your interactive job as you normally would, requesting specific resources (like GPUs) as normal. Once in your session, you can pull a new container or start a container that already exists locally.
To pull a container, make sure you are in a directory that has enough space, or your shared project directory, and do:
apptainer pull library://alpine
or
apptainer pull docker://alpine
Either command will bring down the default Alpine image (a small, basic Linux environment) from the Apptainer library or the Docker container repository, respectively, saving it as a .sif file (for example, alpine_latest.sif) in the current directory. After that:
apptainer exec alpine_latest.sif /bin/sh
drops you into a shell inside the container. Alternatively:
apptainer run alpine_latest.sif
will run the defined runscript for the container (do note Alpine does not have a run script - it will simply exit, having no task to complete).
Accessing File Resources Inside Containers
Apptainer will, by default, mount your home directory, /tmp, /var/tmp, and $PWD. You can bind additional locations if needed, either to the original location or to specific locations as required by your container.
If your container expects to see your data, stored on the system as /project/group/projdata, under /data, you can do:
apptainer run --bind /project/group/projdata:/data mycontainer
A quick way to recreate the general directories you would expect to see from /project or /storage1 on the compute node would be:
apptainer run --bind /opt,/project,/home,/storage1 mycontainer
A single directory listed as part of a --bind option simply binds the directory in-place.
GPUs and Apptainer
Apptainer fully supports GPU containers. To run a GPU container, add the --nv flag to your command line:
apptainer run --bind /project,/opt,/home,/storage1 --nv tensorflow_latest-gpu.sif
The container starts with GPU access enabled, so code run inside it (such as a TensorFlow Python script) behaves just as you would expect, only within the container.
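To run one specific script inside the container instead of its default runscript, the exec subcommand works the same way. A sketch, reusing the mygpucode.py example from earlier (the file name is illustrative):
apptainer exec --nv --bind /project,/opt,/home,/storage1 tensorflow_latest-gpu.sif python mygpucode.py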
Apptainer Background Instances
Some containers are meant to provide services to other processes, such as the CARLA Simulator. Apptainer containers can run as instances, meaning they will run as background processes. To run something as an instance:
apptainer instance start --nv CARLA_latest.sif user_CARLA
user_CARLA here is the friendly name of the instance. To see running instances:
user@host> apptainer instance list
INSTANCE NAME PID IP IMAGE
user_CARLA 5222 /tmp/CARLA_latest.sif
To connect to a running instance, you can use either the run or shell subcommands:
apptainer run instance://user_CARLA
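To get an interactive shell inside a running instance, or to shut it down when you are finished:
apptainer shell instance://user_CARLA
apptainer instance stop user_CARLA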
Apptainer in Batch Jobs
Using Apptainer in a batch job is just as easy as running any other batch job. Craft your job script as you normally would, substituting an Apptainer command to run the container rather than an on-host executable file.
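A minimal sketch of such a script, reusing the TensorFlow container and GPU options from earlier sections (the queue, GPU model, and bind paths are illustrative and should match your actual needs):
#BSUB -o apptainer_test.%J
#BSUB -q gpu-compute
#BSUB -gpu "num=1:mode=exclusive_process:gmodel=NVIDIAA40"
#BSUB -N
#BSUB -J ApptainerGPUJob
apptainer exec --nv --bind /project,/opt,/home,/storage1 tensorflow_latest-gpu.sif python mygpucode.py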
Special Queues - CPU and GPU
Special CPU Queues
There are three special CPU queues in the cluster meant to service batch jobs with our best systems.
Queue | Time Limit | Interactive? |
---|---|---|
cpu-compute | 7 days | n |
cpu-compute-long | 21 days | n |
cpu-compute-debug | 4 hours | y |
Jobs will be killed at the specified time limits. Only the Debug queue allows interactive jobs under a short time limit to maximize availability of these limited resources.
You may submit to these queues with the -q option, indicating the queue name after the flag.
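For example, to send the earlier cpujob script to the cpu-compute queue, either pass -q on the command line or add it as a pragma:
bsub -q cpu-compute < cpujob
or, inside the script:
#BSUB -q cpu-compute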
Scratch Space
You may utilize the path /scratch on all of these machines to access a large slice of NVME scratch space. Data is cleaned up automatically after 10 days, and you are responsible for moving data from /scratch to a permanent storage location.
The directory /scratch/long is cleaned after 24 days, for jobs using the cpu-compute-long queue.
Special GPU Queues
There are three special GPU queues in the cluster meant to service batch jobs with our best GPUs, which at this time includes nVIDIA A40s and A100s.
Queue | Time Limit | Interactive? |
---|---|---|
gpu-compute | 7 days | n |
gpu-compute-long | 21 days | n |
gpu-compute-debug | 4 hours | y |
gpu-compute-debug-long | 12 hours | y |
Jobs will be killed at the specified time limits. Only the debug queues allow interactive jobs, under short time limits, to maximize availability of these limited resources.
You may submit to these queues with the -q option, indicating the queue name after the flag. The request string for each GPU model:
GPU | LSF Device Name | # Devices |
---|---|---|
nVIDIA A100 80GB | NVIDIAA10080GBPCIe | 8 |
nVIDIA A100 SXM4 80GB | NVIDIAA100_SXM4_80GB | 12 |
nVIDIA A40 48GB | NVIDIAA40 | 12 |
Scratch Space
You may utilize the path /scratch on all of these machines to access a large slice of NVME scratch space. Data is cleaned up automatically after 10 days, and you are responsible for moving data from /scratch to a permanent storage location.
The directory /scratch/long is cleaned after 24 days, for jobs using the gpu-compute-long queue.
Multi-Instance GPU capability is currently disabled on most A100 hosts; see the next section for the exceptions.
Multi-Instance A100 GPUs
Two of the 8 A100s are currently configured for Multi-Instance GPU (MIG). This allows users to request a logical subset of a GPU, leaving the remainder available for other users. This is highly recommended for interactive debug sessions, and would yield a maximum of 14 10GB GPU slices for interactive use.
Each GPU can be subdivided up to 7 ways, with each slice receiving a certain portion of the RAM (listed in GB) and of the compute capacity (listed in sevenths of the whole card).
When multiple MIG sizes are requested, the availability of a given size depends on what has already been subdivided on each card: slices cannot overlap, so the compute sevenths and RAM already handed out limit what remains.
For example, if a job requests a "mig=3" GPU, subsequent jobs on that card could request either one "mig=2" GPU or three "mig=1" GPUs. The "mig=4" GPU is a special case, taking 4/7ths of the compute against half the RAM, limiting other users to either one "mig=2" GPU or two "mig=1" GPUs.
"mig=3" and "mig=4" GPUs are granted two NVDEC units; "mig=2" GPUs are granted one; "mig=1" GPUs do not receive an NVDEC unit.
Requesting a MIG slice would look like so, here asking for a mig=1 GPU:
#BSUB -gpu "num=1:gmodel=NVIDIAA10080GBPCIe:mig=1"
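Since MIG slices are recommended for interactive debug sessions, a sketch of an interactive request (assuming the MIG-enabled A100s are reachable from the gpu-compute-debug queue):
bsub -q gpu-compute-debug -gpu "num=1:gmodel=NVIDIAA10080GBPCIe:mig=1" -Is /bin/bash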