Parallel Computing
MPI Parallel Jobs
MPI jobs are scheduled largely like any other job. To facilitate communication between parallel jobs run both on the same host and on different hosts, you’ll need to enable host networking and IPC modes with the LSF_DOCKER_NETWORK
and LSF_DOCKER_IPC
environment variables, as well as request job slots and resources sufficient to run the job. Below is a hypothetical example of how to run an mpi_hello_world MPI application with 2 processes forced to run on separate hosts:
> export LSF_DOCKER_NETWORK=host > export LSF_DOCKER_IPC=host > bsub -n 2 -R 'affinity[core(1)] span[ptile=1]' -I -q general-interactive \ -a 'docker(joe.user/ubuntu16.04_openmpi_lsf:1.1)' /usr/local/mpi/bin/mpirun /usr/local/bin/mpi_hello_world Job <2042> is submitted to queue <default>. <<Waiting for dispatch ...>> <<Starting on compute1-exec-3.ris.wustl.edu>> 1.1: Pulling from joe.user/ubuntu16.04_openmpi_lsf Digest: sha256:5661e21c7f20ea1b3b14537d17078cdc288c4d615e82d96bf9f2057366a0fb8f Status: Image is up to date for joe.user/ubuntu16.04_openmpi_lsf:1.1 docker.io/joe.user/ubuntu16.04_openmpi_lsf:1.1 Hello world from processor compute1-exec-3.ris.wustl.edu, rank 0 out of 2 processors Hello world from processor compute1-exec-9.ris.wustl.edu, rank 1 out of 2 processors
GPU Parallel Jobs
There are execution nodes in the cluster with GPUs. These are “resources” in the parlance of the cluster manager. Simply add the -gpu
flag to make them visible.
We provide services for the entire WashU campus and as such, we suggest a best practices of not over allocating how many resources a single user utilizes. To that point, we suggest that a single user take up no more than 10 GPUs at a time.
Below is a hypothetical example using Tensorflow 2:
#! /usr/bin/env python # docker run --gpus all -it tensorflow/tensorflow:latest-gpu /usr/local/bin/python tf.test.py from __future__ import absolute_import, division, print_function, unicode_literals import tensorflow as tf mnist = tf.keras.datasets.mnist (x_train, y_train), (x_test, y_test) = mnist.load_data() x_train, x_test = x_train / 255.0, x_test / 255.0 model = tf.keras.models.Sequential([ tf.keras.layers.Flatten(input_shape=(28, 28)), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(10, activation='softmax') ]) model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) model.fit(x_train, y_train, epochs=5) model.evaluate(x_test, y_test, verbose=2)
Here we save the above code as a file named tf2.py and submit to the cluster:
-
If you are using other options that are part of the
-R
option, you will need to listgpuhost
first. -
- If you are using
select
for another option, you will need to placegpuhost
within that and first. -
-
-R select[gpuhost,port8888=1]
-
- If you are using
> bsub -Is -q general-interactive -R 'gpuhost' -gpu "num=4:gmodel=TeslaV100_SXM2_32GB" -a 'docker(tensorflow/tensorflow:latest-gpu)' /bin/bash -c "CUDA_VISIBLE_DEVICES=0,1,2,3; python tf2.py; ... lots of output ... 2019-11-01 22:07:25.455050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2019-11-01 22:07:25.607212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 pciBusID: 0000:18:00.0 2019-11-01 22:07:25.608657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 pciBusID: 0000:3b:00.0 2019-11-01 22:07:25.610081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 pciBusID: 0000:86:00.0 2019-11-01 22:07:25.611511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 ...
Note that in this particular example, the Tensorflow 2 container expects the use of the CUDA_VISIBLE_DEVICES environment variable. This behavior will likely vary with different software containers.