Blog from February, 2017

Project Update: Network mount separation

Executive Summary (Request For Action)

We're working to separate workstation mounts from HPC mounts so that heavy usage under LSF won't affect the responsiveness of your workstation. We pushed out changes today that will require everything to remount, so please log out tonight.

We ask users of Linux workstations to log out tonight so that network mounts can be refreshed.

Project details can be found here: ITDEV-1917

Details for those who want to know

There is a new DNS alias, "ces-workstation", with two new IP addresses.

If you want to make these changes effective right away, and you are a member of the "info" LDAP group, read on and follow these instructions. Otherwise, log out of your workstation at the end of the day today and we'll take care of this for you.

Today we made configuration changes to our "Cluster Export Services" (CES) nodes that made a new pair of IP addresses available. We then deployed configuration changes to workstations to make use of these new IP addresses.
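
As a purely hypothetical illustration of the workstation-side change (the real map names, keys, and options are managed by IT and may differ), the update amounts to pointing the automount map entries at the new alias instead of a specific CES host:

# hypothetical autofs map entry, e.g. in an auto.gscmnt map (illustrative only)
# before: mounted from a specific CES host
sata130  -rw,intr,tcp,nfsvers=3,mountproto=tcp,sloppy  ces201:/vol/aggr3/sata130
# after: mounted via the new workstation-only alias
sata130  -rw,intr,tcp,nfsvers=3,mountproto=tcp,sloppy  ces-workstation:/vol/aggr3/sata130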

The DNS alias "ces-workstation" maps to two new IPs:

-> host ces-workstation
ces-workstation.gsc.wustl.edu has address 10.100.3.207
ces-workstation.gsc.wustl.edu has address 10.100.3.206

These two IPs are served by the host named "ces3".

The rest of the CES cluster, hosts ces1 and ces2, uses 6 other IPs:

-> host ces
ces.gsc.wustl.edu has address 10.100.3.200
ces.gsc.wustl.edu has address 10.100.3.201
ces.gsc.wustl.edu has address 10.100.3.202
ces.gsc.wustl.edu has address 10.100.3.203
ces.gsc.wustl.edu has address 10.100.3.204
ces.gsc.wustl.edu has address 10.100.3.205

Check which server your workstation is using for network mounts

A typical workstation might have several NFS mounts present. Until today, they would have used any of the 6 IPs served by the CES cluster. Here's an example where two data mounts under /gscmnt are served by the IP "10.100.3.201".

-> mount -t nfs | egrep "10.100.3.20[0-5]"
gpfs-aggr3.gsc.wustl.edu:/vol/aggr3 on /vol/aggr3 type nfs (rw,tcp,nfsvers=3,addr=10.100.3.201)
ces201:/vol/aggr3/sata130 on /gscmnt/sata130 type nfs (rw,intr,tcp,nfsvers=3,mountproto=tcp,sloppy,addr=10.100.3.201)
ces201:/vol/aggr50/gc5002 on /gscmnt/gc5002 type nfs (rw,intr,tcp,nfsvers=3,mountproto=tcp,sloppy,addr=10.100.3.201)

Note that /gsc and /gscuser go to the home-app cluster, not the CES cluster.
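
If you're curious, you can confirm that yourself; the addresses reported for those two mounts should fall outside the 10.100.3.200-207 range used by CES:

-> mount -t nfs | egrep " on /gsc | on /gscuser "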

Unmount existing NFS mounts so that they can be remounted with the new IPs

Members of the "info" LDAP group have permission to use "sudo" to run the umount command.

-> groups | grep -q info && echo "YES! I'm in 'info'" || echo "No, I am not in 'info'"
YES! I'm in 'info'

Use sudo to unmount NFS mounts:

-> sudo umount -t nfs -a
umount.nfs: /gscmnt/gc5002: device is busy
umount.nfs: /gscuser: device is busy
umount.nfs: /gsc: device is busy

Note that /gscuser will report "busy" because your active login (you) will have files open in /gscuser, so it will not be unmounted. That's OK; we are not concerned about /gscuser.

If you have programs running out of /gsc, it will report "busy" as well, but we don't care about /gsc right now either.

If any /gscmnt mount point reports "busy", you need to find the program that still has files open on that mount point and stop it so the mount can be released.
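
One way to find those processes (using /gscmnt/gc5002 purely as an example path) is with fuser or lsof:

-> fuser -vm /gscmnt/gc5002
-> lsof /gscmnt/gc5002

fuser -vm lists every process using the mounted filesystem, while lsof lists the open files themselves; stop those processes, then retry the umount.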

Once the following check returns no output, the CES mounts have been cleaned up properly:

-> mount -t nfs | grep ces

Automount will remount a desired mount point from the new location

Now take a look in some directory you want to use:

-> ls -l /gscmnt/sata130
total 0
-rw-r--r-- 1 root root 0 2007-01-23 12:54 DISK_TECHD
drwxrwsr-x 2 root techd 512 2017-02-09 21:29 techd

That mount wasn't there before; automount mounted it for you. Now check which server IP we're using:

-> mount -t nfs | grep ces
ces-workstation:/vol/aggr3/sata130 on /gscmnt/sata130 type nfs (rw,intr,tcp,nfsvers=3,mountproto=tcp,sloppy,addr=10.100.3.207)

Here we see the desired new name and IP, ces-workstation on 10.100.3.207.

Now this workstation is using the ces3 server, and load produced by HPC jobs will have less impact on its performance.

What's the plan to address storage problems?

Executive Summary

We know that access to network-attached storage has been regularly degraded. I wanted to write this IT blog entry to communicate what our team is doing to address these ongoing problems.

We have several projects running concurrently, each with components that we expect to improve storage performance.

Separate HPC and Workstation network mounts

The project page for this work is here: /wiki/spaces/IT/pages/180459155

The active action items related to that page are here:

ITDEV-1917

Once we have separation between HPC nodes and workstations, along with a software update, we think we'll have better control over and visibility into access patterns.

Adding hardware to the CES cluster

CES stands for "Cluster Export Services". These are the servers that handle the network mount protocols, e.g. NFS and SMB/CIFS. The older Lucid-based compute cluster mounts storage over NFS through the CES cluster, as do the Lucid Workstations. There are 3 servers for this. When we complete the above project to separate HPC from Workstation mounts, we'll have HPC going through 2 servers and Workstations going through 1. We'll purchase an additional server to add to the Workstation pool, making it a cluster of 2.

ITDEV-1900

Improve HPC configuration related to LSF, Docker, and resource consumption

There is a separate project to address the stability of HPC nodes, covering both memory utilization and LSF configuration. That work is here:

ITDEV-1719

We believe we can create LSF "elim" scripts that expose I/O-related metrics, giving us more control over job starts and helping us avoid the "thundering herd" problem.
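
As a rough sketch only (the resource name io_wait below is a placeholder and would need to be defined in LSF's configuration, e.g. lsf.shared, before the LIM would accept it), an elim is simply a long-running script that periodically prints resource name/value pairs on stdout:

#!/bin/bash
# Hypothetical elim sketch: report one custom resource ("io_wait") to the LIM.
# The metric below (cumulative iowait ticks from /proc/stat) is illustrative;
# a production elim would report a properly computed rate.
while true; do
    iowait=$(awk '/^cpu /{print $6}' /proc/stat)
    # elim output protocol: "<number_of_resources> <name1> <value1> ..." per line
    echo "1 io_wait $iowait"
    sleep 15
done

Scheduling thresholds keyed on a resource like this could then throttle how many I/O-heavy jobs start at once.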

Provide an updated Workstation image

We have work in progress on an updated Workstation image (to Ubuntu 16.04, Xenial):

ITDEV-813

We believe we can segment the MGI user base into communities with different workstation needs. Some people only need to SSH into a host that acts as an LSF client. Others need to install development software. Many need to run Docker. Some need a full "workstation" with email, browser, audio, etc. Rather than trying to have it "all done" at once, we can break the work into smaller feature sets and offer early releases for small, focused use cases.

 

First Post!

Hello MGI community!

I'm starting this IT blog in an attempt to create a single place to communicate news and events related to MGI IT infrastructure!