Summary
We are aware of an uptick in storage latency issues over the last several weeks. Remediation was delayed while we prepared for, then executed, the May 16 Maintenance Outage. With that work complete, we turned our attention back to this latency problem. This week we believe we made progress identifying causes, and changes are being made to address them.
Details
The relevant hosts are the "production-hpc" and "research-hpc" LSF host groups. These two host groups have evolved over the last several months to include the blade17 and blade14 hosts. You may guess, correctly, that these are different generations of computer:
- blade14: 96671 MB (~94 GB) RAM, 24 processors, Intel(R) Xeon(R) CPU X5660 @ 2.80GHz, 2x 1G NIC
- blade17: 386746 MB (~378 GB) RAM, 48 processors, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 2x 10G NIC
The blade17s have much more RAM, CPU, and network power than the blade14s.
We tune the blades in a number of ways to optimize the kernel for their workloads. Several of the tuning parameters relate to memory management for the GPFS cluster storage software. GPFS requires 16G of memory. In addition, the Linux OS needs memory to do basic things: drive the networks, run SSH, puppet, cron, fork/exec basic bash programs, etc. So we take the 16G for GPFS, add a cushion for the OS, and reserve 25G of RAM in total. We use an LSF "elim" program (elim.mem) to subtract this 25G from the memory offered to LSF jobs. So on a blade14, there's ~71G of RAM available for use by LSF jobs.
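To make that arithmetic concrete, here is a minimal sketch of the reservation calculation. This is illustrative only; the real elim.mem follows the LSF elim output protocol, and the variable names here are hypothetical:

# Illustrative sketch only; not the real elim.mem.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)   # MemTotal is reported in kB
reserved_mb=$(( 25 * 1024 ))                                      # 16G for GPFS plus an OS cushion
echo "available to LSF jobs: $(( total_mb - reserved_mb )) MB"    # works out to ~71G on a blade14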
During recent occurrences of "storage latency", we observe processes unable to allocate memory:
error: fork: Cannot allocate memory
One process that reports this is "ssh". The GPFS cluster software uses ssh to deliver commands to its cluster members. When ssh can't fork, those commands can't be executed.
> ssh root@blade15-1-12
ssh_exchange_identification: read: Connection reset by peer
When this happens, the cluster members can't talk to the blade:
May 25th 2017, 09:23:55.000 linuscs116 mmfs Thu May 25 09:23:45.980 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:52.000 linuscs117 mmfs Thu May 25 09:23:46.156 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:52.000 home-app3 mmfs Thu May 25 09:23:47.648 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:47.000 pnsd2 mmfs Thu May 25 09:23:46.505 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:46.000 pnsd2 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:10:59.000 pnsd1 mmfs Thu May 25 09:10:52.739 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:38.000 home-app4 mmfs Thu May 25 09:09:30.206 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:30.000 home-app4 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:30.000 home-app4 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:00:35.000 linuscs118 mmfs Thu May 25 09:00:29.430 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:00:31.000 linuscs88 mmfs Thu May 25 09:00:25.079 2017: [E] Connection from 10.100.5.172 timed out
At this point, the cluster members must decide what to do about the unresponsive node. Filesystem activity pauses:
Thu May 25 07:52:31.888 2017: [I] Recovering nodes in cluster gpfs-home-app.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:53:05.269 2017: [N] Node 10.100.5.172 (blade15-1-12) lease renewal is overdue. Pinging to check if it is alive
Thu May 25 07:54:58.546 2017: [D] Leave protocol detail info: LA: 165 LFLG: 4883640 LFLG delta: 165
Thu May 25 07:54:58.559 2017: [I] Recovering nodes in cluster gpfs-sol.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:55:05.295 2017: [E] Node 10.100.5.172 (blade15-1-12) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60.
Thu May 25 07:55:06.074 2017: [I] Recovering nodes in cluster gpfs.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:55:07.837 2017: [I] Log recovery for log group 212 in aggr14 completed in 0.130482000s
Thu May 25 07:55:08.205 2017: [I] Recovered 1 nodes for file system aggr14.
Thu May 25 07:55:09.902 2017: [D] Leave protocol detail info: LA: 165 LFLG: 4883651 LFLG delta: 165
Thu May 25 07:55:09.931 2017: [I] Recovering nodes in cluster gpfs-sol2.gsc.wustl.edu: 10.100.5.172
Thu May 25 09:10:52.739 2017: [E] Connection from 10.100.5.172 timed out
At this point, the reader might wonder, "What good is this clustered filesystem if everything stops when a node goes bad?" Please pause to remember your Computer Science: the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem).
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
- Consistency
- Availability
- Partition tolerance
In other words, the CAP theorem states that in the presence of a network partition, one has to choose between consistency and availability.
For a cluster filesystem, would you rather choose "Available" or "Consistent"? If you choose availability, you must accept the possibility of data corruption during a partition. Here, we choose Consistency, and thus give up Availability.
In short, we'd rather have your filesystem be slow than corrupt your data.
Why are we running out of memory?
But why are we running out of memory? We're reserving some for the OS. We impose limits in LSF. What are we missing?
Yesterday we re-discovered one of our tuning parameters.
root@blade17-1-1:~# sysctl vm.min_free_kbytes
vm.min_free_kbytes = 11631000

(~/git/puppet-modules)-(master)
(ins)-> grep -A1 vm.min_free_kbytes hiera/roles/ostack_kilo_hpc.yaml
'vm.min_free_kbytes':
  'value': '11631000'
What's this parameter for?
min_free_kbytes
The minimum number of kilobytes to keep free across the system. This value is used to compute a watermark value for each low memory zone, which are then assigned a number of reserved free pages proportional to their size.
Be cautious when setting this parameter, as both too-low and too-high values can be damaging and break your system. Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM-killing multiple processes. However, setting this parameter to a value that is too high (5-10% of total system memory) will cause your system to become out-of-memory immediately. Linux is designed to use all available RAM to cache file system data. Setting a high min_free_kbytes value results in the system spending too much time reclaiming memory.
This parameter must be tuned to find a Goldilocks value that is not too small and not too large. Based on our history (https://jira.gsc.wustl.edu/browse/INFOSYS-15484) we've set this value to 3% of total memory. But we learned two things this week:
- We set this number to a fixed value across all the HPC nodes, missing the fact that the blade14s have much less RAM than the blade17s. On the blade17s it was ~3% of RAM, but the same number on a blade14 is ~12% of its RAM (see the arithmetic sketched below)!
- We failed to account for this amount of RAM in the number we reserve for LSF. This allows LSF jobs to consume memory that should be reserved for the kernel!
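A quick back-of-the-envelope check, using the MemTotal figures above, shows the mismatch:

# Fraction of RAM consumed by min_free_kbytes (MemTotal in /proc/meminfo is in kB):
awk -v mfk=11631000 '/MemTotal/ {printf "min_free_kbytes is %.1f%% of RAM\n", 100*mfk/$2}' /proc/meminfo
# blade17: 11631000 kB / ~378G RAM ≈ 3%
# blade14: 11631000 kB / ~94G RAM  ≈ 12%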
Issue #1 is fixed here https://jira.gsc.wustl.edu/browse/ITDEV-3309 and was deployed last night.
Issue #2 is being tracked here https://jira.gsc.wustl.edu/browse/ITDEV-3311 and will be deployed as soon as possible.
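For illustration, the adjusted reservation might look like the following sketch (variable names are hypothetical; the real elim.mem and its output format differ):

# Hypothetical sketch: subtract min_free_kbytes on top of the fixed 25G reservation.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
mfk_mb=$(( $(sysctl -n vm.min_free_kbytes) / 1024 ))
echo "available to LSF jobs: $(( total_mb - 25*1024 - mfk_mb )) MB"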
In addition to fixing these parameters, we're also going to update our server tests to automatically close blades when we detect problems like these (and others, like improper permissions on the docker socket). That work is being tracked here: https://jira.gsc.wustl.edu/browse/ITDEV-3317
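As a sketch of the idea (the specific check and expected values here are hypothetical; badmin hclose is the LSF command that closes a host to new jobs):

# Hypothetical server-test sketch: close this blade to new LSF jobs if a check fails.
host=$(hostname)
# Example check: docker socket should be mode 660, group docker (expected values assumed).
if ! stat -c '%a %G' /var/run/docker.sock | grep -q '^660 docker$'; then
    badmin hclose -C "bad docker.sock permissions" "$host"
fi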
We are hopeful that these improvements will return stability to the cluster so you can get back to your work!
The May 16 maintenance outage wrapped up last week, with some drama related to Samba services (smb-cluster) and the aggr14 filesystem. All services were restored by late last week, and we've compiled a collection of follow-up tasks that you can see here:
- ITDEV-3229
It took about 4 hours just to get things cleanly shut down for maintenance. The primary goals, enabling DMAPI on the GPFS filesystems, were accomplished relatively soon after that. The secondary goals regarding the home-app cluster were completed next, making future maintenance on home-app servers much easier.
The unexpected work came after that, when Samba services did not properly return and the repair of the aggr14 filesystem corruption took some time. Regarding aggr14, it turns out that some of the reported filesystem corruption was in fact a false positive caused by a bug that has since been fixed in the next version of GPFS. The number of files that were actually corrupted was small, and the data was recovered after the filesystem came back online. So, in the end, no data was lost at all. Ironically, the aggr14 data in question is scheduled for deletion.
As usual, we thank you for your patience during maintenance outages. We know that the interruption can be frustrating.
There will be an IT Systems Outage on Tuesday May 16 2017 beginning at 5:15pm
Details of this outage, its goals and impact, can be found here: /wiki/spaces/IT/pages/180463890
In brief:
- All running LSF jobs will be terminated
- Any pending LSF jobs will be left pending
- User sessions will be terminated; save your work and log out, but leave your workstation running