Compute2 General Guidelines

Basics

Slurm puts a lot of hardware at your fingertips, but it must be used wisely so you don’t affect others’ work. This page contains guidelines for using Slurm.

  • Don’t submit 1000 jobs until you’ve seen 1 job finish successfully.

  • Job submission has overhead. Avoid submitting jobs that complete in just seconds. If you can’t avoid it, look into job arrays.

  • Limit the total number of files you place in a single directory to a few thousand at most. Normal filesystem operations become unwieldy in directories with millions of files.

  • Every Slurm job gets its own temporary directory that is cleaned up for you! Do your work there, if possible. The path is: /tmp/.

  • Don’t monopolize queues with large numbers of long-running (multi-day) jobs. Use QOS to limit your running jobs if necessary.

  • Use srun for interactive jobs and sbatch for batch jobs.

  • Don’t rely on the host environment to develop or install your software.

    • Create your own environment using container technology, or

    • Encapsulate all your dependencies.
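The guidelines above can be combined into a single batch script. The sketch below is illustrative only: the partition, script name, program, and input file naming are assumptions, not site requirements. It shows a job array (to avoid submitting many short jobs individually) and working in the job's temporary directory.

```shell
#!/bin/bash
# Hypothetical batch script illustrating the guidelines above;
# adjust the partition, time, and program for your own work.
#SBATCH --job-name=example
#SBATCH --partition=general-cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00
#SBATCH --array=1-100          # one array task per input instead of 100 tiny jobs

# Do the work in the job's private temporary directory,
# which Slurm cleans up automatically when the job ends.
cd /tmp

# Process one input per array task (input naming is illustrative only).
/path/to/my_analysis input_${SLURM_ARRAY_TASK_ID}.dat
```

Per the first guideline, submit a single task first (e.g. `sbatch --array=1-1 job.sh`) and confirm it finishes successfully before submitting the full array.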

General Partition Default Job Resources

The resources listed below are the defaults for jobs in the general partitions. These defaults may not be efficient for your jobs.

Users are expected to understand their analysis and request resources accordingly.

  • Per Job Default Parameters

    • --cpus-per-task = 1

    • --mem-per-cpu = 4GB

    • --time = 8 hours
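Rather than relying on these defaults, you can state your requests explicitly. The commands below are a sketch; the resource values are placeholders, and `job.sh` is a hypothetical script name.

```shell
# Interactive job (srun): request 4 CPUs, 8 GB per CPU, 2 hours
# on the general-interactive partition, with a shell as the task.
srun --partition=general-interactive --cpus-per-task=4 \
     --mem-per-cpu=8G --time=02:00:00 --pty /bin/bash

# Batch job (sbatch): the same flags can be given on the command line
# and override any #SBATCH directives inside job.sh.
sbatch --partition=general-cpu --cpus-per-task=8 \
       --mem-per-cpu=4G --time=12:00:00 job.sh
```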

Partition/Queue Configuration

Partition: workshop
Priority: High
Partition Limits: Run Time Limit: 24 hours; Max Jobs: 16; Max CPU Cores: 8; Max Memory: 64 GB; Max GPUs: 1 MIG (1/7 H100 GPU)
Cost: No cost
Description: Shared pool of hosts for workshop jobs. It has higher priority than all other general partitions that share hosts with this partition. To ensure the availability of resources for the workshop partition, the Slurm reservation feature is used to reserve resources for workshop participants for a specific period.

Partition: general-interactive
Priority: High
Partition Limits: Run Time Limit: 5 days; Max Jobs: 2; Max CPU Cores: 8; Max Memory: 64 GB; Max GPUs: 1 MIG (1/7 H100 GPU)
Cost: Low cost per CPU and GPU per hour
Description: Shared pool of CPU and GPU hosts for interactive jobs for entry-level users and developers. NoVNC and the OOD Desktop provide a virtual environment for developers to replace local workstations.

Partition: general-short
Priority: High
Partition Limits: Run Time Limit: 30 minutes; Max Jobs: 16; Max CPU Cores: 218; Max Memory: 1744 GB
Cost: Low cost per CPU and GPU per hour
Description: Shared pool of hosts for jobs that run for 30 minutes or less. These resources can be leveraged for backfill.

Partition: general-gpu
Priority: Normal
Partition Limits: Run Time Limit: 15 days; Max GPUs: 8; Max Memory per GPU: 80 GB
Cost: Low cost per CPU and GPU per hour
Description: Shared pool of hosts for GPU jobs. It has higher priority than general-short, which shares hosts with this partition.

Partition: general-cpu
Priority: Normal
Partition Limits: Run Time Limit: 15 days; Max Jobs: 100; Max CPU Cores: 128; Max Memory: 2 TB
Cost: Low cost per CPU and GPU per hour
Description: Shared pool of hosts for general use.

Partition: general-bigmem
Priority: Normal
Partition Limits: Run Time Limit: 15 days; Max CPU Cores: 128; Max Memory: 8 TB
Cost: c2-bigmem-*
Description: Shared pool of hosts for general use.

Partition: general-preemptible-cpu
Priority: Low
Partition Limits: Run Time Limit: 15 days; Grace Time: 0 minutes
Cost: No cost
Description: Pool of hosts shared from the subscription and condo partitions. Jobs will be terminated when owners/subscribers need these resources.

Partition: general-preemptible-gpu
Priority: Low
Partition Limits: Run Time Limit: 15 days; Grace Time: 0 minutes
Cost: No cost
Description: Pool of hosts shared from the subscription and condo partitions. Jobs will be terminated when owners/subscribers need these resources.

Partition: condo-$name
Priority: Normal
Partition Limits: QOS per group
Cost: Pay for server + operational fee
Description: Purchase servers and receive a dedicated partition named for the PI lab, department, or school. Optionally, the general pool of resources can be included for easier scheduling; custom configurations can augment condo resources this way without the need to use a different partition. Condo owners can opt into sharing resources in the preemptible queues.

Unless indicated otherwise, maximum resource limits are per user, across all jobs, per partition.
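You can also query the partition configuration on the cluster itself. The commands below are standard Slurm; the partition names follow the table above.

```shell
# Show time limit, node count, CPUs per node, and memory per node
# for selected partitions (%P=partition, %l=time limit, %D=nodes,
# %c=CPUs per node, %m=memory per node in MB).
sinfo --partition=general-cpu,general-gpu --format="%P %l %D %c %m"

# List your own running and pending jobs, including their partitions.
squeue --me
```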

Cost Calculator

  • A calculator has been developed to help users estimate the cost of their jobs.

  • It can be downloaded here: Cost Calculator

Storage Types

Storage Type: Active Storage
Description: Persistent active storage, such as storage1, storage2, and storage3. The path format is /storageN/fs1/<allocation name>, where N is 1, 2, or 3 for storage1, storage2, or storage3 respectively.
Size: 5 TB to many petabytes
Performance: High
Storage Persistence: Yes
Globus Accessible: Yes
SMB Accessible: Yes
Across Nodes: Yes

Storage Type: Home Directory
Description: User home directory, for example /home/<washukey>
Size: Default is 50 GB
Performance: High
Storage Persistence: Yes
Globus Accessible: No
SMB Accessible: No
Across Nodes: Yes

Storage Type: Scratch Storage
Description: Temporary scratch storage, such as scratch2. The path format is /scratch2/fs1/<allocation name>
Size: Default is 10 TB
Performance: Higher
Storage Persistence: No (the system deletes data older than 30 days)
Globus Accessible: No
SMB Accessible: No
Across Nodes: Yes

Storage Type: Local Storage
Description: Local storage devices, such as /tmp
Size: Less than 10 TB normally
Performance: Highest
Storage Persistence: No
Globus Accessible: No
SMB Accessible: No
Across Nodes: No
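A common pattern that follows from this table is staging: copy input from persistent active storage to fast node-local /tmp, compute there, and copy results back before the job ends, since local storage is neither persistent nor shared across nodes. The sketch below assumes placeholder paths and a hypothetical program; replace <allocation name> with your actual allocation.

```shell
# Illustrative staging pattern for a job script (paths are placeholders).
INPUT=/storage1/fs1/<allocation name>/data/input.dat
RESULTS=/storage1/fs1/<allocation name>/results

# Stage input onto fast node-local storage.
cp "$INPUT" /tmp/

# Compute against the local copy (my_analysis is a hypothetical program).
/path/to/my_analysis /tmp/input.dat > /tmp/output.dat

# Copy results back to persistent storage before the job ends,
# because /tmp is cleaned up and is not visible from other nodes.
mkdir -p "$RESULTS"
cp /tmp/output.dat "$RESULTS/"
```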