Summary
We are aware of an uptick in storage latency issues over the last several weeks. Remediation was delayed while we prepared for, then executed, the May 16 Maintenance Outage. With that work complete, we turned our attention back to this latency problem. This week we believe we made progress identifying causes, and changes are being made to address them.
Details
The relevant hosts are the "production-hpc" and "research-hpc" LSF host groups. These two host groups have evolved over the last several months to include the blade17 and blade14 hosts. You may guess, correctly, that these are different generations of computer:
- blade14: 96671 MB (~94 GB) RAM, 24 processors, Intel(R) Xeon(R) CPU X5660 @ 2.80GHz, 2x 1G NIC
- blade17: 386746 MB (~378 GB) RAM, 48 processors, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 2x 10G NIC
The blade17s have much more RAM, CPU, and network power than the blade14s.
We tune the blades in a number of ways to optimize the kernel for their workloads. Several of the tuning parameters relate to memory management for the GPFS cluster storage software. GPFS requires 16G of memory. In addition, the Linux OS needs memory to do basic things: drive the networks, run SSH, puppet, cron, fork/exec basic bash programs, etc. So we take the 16G for GPFS, add a cushion for the OS, and reserve 25G of RAM in total. We use an LSF "elim" program (elim.mem) to subtract this 25G from the memory offered to LSF jobs. So on a blade14, there's ~71G of RAM available for use by LSF jobs.
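To make that arithmetic concrete, here is a minimal sketch of the reservation calculation. This is illustrative only; the real elim.mem follows the LSF elim output protocol, and the variable names here are hypothetical:

# Illustrative sketch only; not the real elim.mem.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)   # MemTotal is reported in kB
reserved_mb=$(( 25 * 1024 ))                                      # 16G for GPFS plus an OS cushion
echo "available to LSF jobs: $(( total_mb - reserved_mb )) MB"    # works out to ~71G on a blade14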
During recent occurrences of "storage latency", we observe processes unable to allocate memory:
error: fork: Cannot allocate memory
One process that reports this is "ssh". The GPFS cluster software uses ssh to deliver commands to its cluster members. When ssh can't fork, those commands can't be executed.
> ssh root@blade15-1-12
ssh_exchange_identification: read: Connection reset by peer
When this happens, the cluster members can't talk to the blade:
May 25th 2017, 09:23:55.000 linuscs116 mmfs Thu May 25 09:23:45.980 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:52.000 linuscs117 mmfs Thu May 25 09:23:46.156 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:52.000 home-app3 mmfs Thu May 25 09:23:47.648 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:47.000 pnsd2 mmfs Thu May 25 09:23:46.505 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:46.000 pnsd2 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:10:59.000 pnsd1 mmfs Thu May 25 09:10:52.739 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:38.000 home-app4 mmfs Thu May 25 09:09:30.206 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:30.000 home-app4 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:30.000 home-app4 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:00:35.000 linuscs118 mmfs Thu May 25 09:00:29.430 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:00:31.000 linuscs88 mmfs Thu May 25 09:00:25.079 2017: [E] Connection from 10.100.5.172 timed out
At this point, the cluster members must decide what to do about the unresponsive node. Filesystem activity pauses:
Thu May 25 07:52:31.888 2017: [I] Recovering nodes in cluster gpfs-home-app.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:53:05.269 2017: [N] Node 10.100.5.172 (blade15-1-12) lease renewal is overdue. Pinging to check if it is alive
Thu May 25 07:54:58.546 2017: [D] Leave protocol detail info: LA: 165 LFLG: 4883640 LFLG delta: 165
Thu May 25 07:54:58.559 2017: [I] Recovering nodes in cluster gpfs-sol.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:55:05.295 2017: [E] Node 10.100.5.172 (blade15-1-12) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60.
Thu May 25 07:55:06.074 2017: [I] Recovering nodes in cluster gpfs.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:55:07.837 2017: [I] Log recovery for log group 212 in aggr14 completed in 0.130482000s
Thu May 25 07:55:08.205 2017: [I] Recovered 1 nodes for file system aggr14.
Thu May 25 07:55:09.902 2017: [D] Leave protocol detail info: LA: 165 LFLG: 4883651 LFLG delta: 165
Thu May 25 07:55:09.931 2017: [I] Recovering nodes in cluster gpfs-sol2.gsc.wustl.edu: 10.100.5.172
Thu May 25 09:10:52.739 2017: [E] Connection from 10.100.5.172 timed out
At this point, the reader might wonder, "What good is this clustered filesystem if everything stops when a node goes bad?" Please pause to remember your Computer Science: the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem).
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
- Consistency
- Availability
- Partition tolerance
In other words, the CAP theorem states that in the presence of a network partition, one has to choose between consistency and availability.
For a cluster filesystem, would you rather choose "Available" or "Consistent"? If you choose availability, you must accept the possibility of data corruption during a partition. Here, we choose Consistency, and thus give up Availability.
In short, we'd rather have your filesystem be slow than corrupt your data.
Why are we running out of memory?
But why are we running out of memory? We're reserving some for the OS. We impose limits in LSF. What are we missing?
Yesterday we re-discovered one of our tuning parameters.
root@blade17-1-1:~# sysctl vm.min_free_kbytes
vm.min_free_kbytes = 11631000

(~/git/puppet-modules)-(master)
(ins)-> grep -A1 vm.min_free_kbytes hiera/roles/ostack_kilo_hpc.yaml
'vm.min_free_kbytes':
  'value': '11631000'
What's this parameter for?
min_free_kbytes
The minimum number of kilobytes to keep free across the system. This value is used to compute a watermark value for each low memory zone, which are then assigned a number of reserved free pages proportional to their size.
Be cautious when setting this parameter, as both too-low and too-high values can be damaging and break your system. Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM-killing multiple processes. However, setting this parameter to a value that is too high (5-10% of total system memory) will cause your system to become out-of-memory immediately. Linux is designed to use all available RAM to cache file system data. Setting a high min_free_kbytes value results in the system spending too much time reclaiming memory.
This parameter must be tuned to find a Goldilocks value that is not too small and not too large. Based on our history (https://jira.gsc.wustl.edu/browse/INFOSYS-15484) we've set this value to 3% of total memory. But we learned two things this week:
- We set this number to a fixed value across all the HPC nodes, missing the fact that the blade14s have much less RAM than the blade17s. On the blade17s it was ~3% of RAM, but the same number on a blade14 is ~12% of its RAM (see the arithmetic sketched below)!
- We failed to account for this amount of RAM in the number we reserve for LSF. This allows LSF jobs to consume memory that should be reserved for the kernel!
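A quick back-of-the-envelope check, using the MemTotal figures above, shows the mismatch:

# Fraction of RAM consumed by min_free_kbytes (MemTotal in /proc/meminfo is in kB):
awk -v mfk=11631000 '/MemTotal/ {printf "min_free_kbytes is %.1f%% of RAM\n", 100*mfk/$2}' /proc/meminfo
# blade17: 11631000 kB / ~378G RAM ≈ 3%
# blade14: 11631000 kB / ~94G RAM  ≈ 12%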
Issue #1 is fixed here https://jira.gsc.wustl.edu/browse/ITDEV-3309 and was deployed last night.
Issue #2 is being tracked here https://jira.gsc.wustl.edu/browse/ITDEV-3311 and will be deployed as soon as possible.
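For illustration, the adjusted reservation might look like the following sketch (variable names are hypothetical; the real elim.mem and its output format differ):

# Hypothetical sketch: subtract min_free_kbytes on top of the fixed 25G reservation.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
mfk_mb=$(( $(sysctl -n vm.min_free_kbytes) / 1024 ))
echo "available to LSF jobs: $(( total_mb - 25*1024 - mfk_mb )) MB"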
In addition to fixing these parameters, we're also going to update our server tests to automatically close blades when we detect problems like these (and others, like improper permissions on the docker socket). That work is being tracked here: https://jira.gsc.wustl.edu/browse/ITDEV-3317
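As a sketch of the idea (the specific check and expected values here are hypothetical; badmin hclose is the LSF command that closes a host to new jobs):

# Hypothetical server-test sketch: close this blade to new LSF jobs if a check fails.
host=$(hostname)
# Example check: docker socket should be mode 660, group docker (expected values assumed).
if ! stat -c '%a %G' /var/run/docker.sock | grep -q '^660 docker$'; then
    badmin hclose -C "bad docker.sock permissions" "$host"
fi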
We are hopeful that these improvements will return stability to the cluster so you can get back to your work!
The May 16 maintenance outage wrapped up last week, with some drama related to Samba services (smb-cluster) and the aggr14 filesystem. All services were restored by late last week, and we've compiled a collection of follow-up tasks that you can see here:
- ITDEV-3229
It took about 4 hours just to get things cleanly shut down for maintenance. The primary goals, enabling DMAPI on the GPFS filesystems, were accomplished relatively soon after that. The secondary goals regarding the home-app cluster were completed next, making future maintenance on home-app servers much easier.
The unexpected work came after that, when Samba services did not properly return and the repair of the aggr14 filesystem corruption took some time. Regarding aggr14, it turns out that some of the reported filesystem corruption was in fact a false positive caused by a bug that has since been fixed in the next version of GPFS. The number of files that were actually corrupted was small, and the data was recovered after the filesystem came back online. So, in the end, no data was lost at all. Ironically, the aggr14 data in question is scheduled for deletion.
As usual, we thank you for your patience during maintenance outages. We know that the interruption can be frustrating.
There will be an IT Systems Outage on Tuesday May 16 2017 beginning at 5:15pm
Details of this outage, its goals and impact, can be found here: /wiki/spaces/IT/pages/180463890
In brief:
- All running LSF jobs will be terminated
- Any pending LSF jobs will be left pending
- User sessions will be terminated; save your work and log out, but leave your workstation running