
Remove nfsiostat collector for diamond if possible, which may be broken on tools workers
Closed, Resolved · Public · 0 Estimated Story Points

Description

Besides the fact that monitoring anything NFS on the client side is asking for trouble, I noticed that the Nfsiostat diamond collector is trying to run on tools-worker-1010 (and I presume on other workers as well). It appears to be struggling with something in Python:

Jul 17 03:01:55 tools-worker-1010 diamond[672]: Collector failed!
Jul 17 03:01:55 tools-worker-1010 diamond[672]: Traceback (most recent call last):
Jul 17 03:01:55 tools-worker-1010 diamond[672]: File "/usr/lib/python2.7/dist-packages/diamond/utils/scheduler.py", line 73, in collector_process
Jul 17 03:01:55 tools-worker-1010 diamond[672]: collector._run()
Jul 17 03:01:55 tools-worker-1010 diamond[672]: File "/usr/lib/python2.7/dist-packages/diamond/collector.py", line 477, in _run
Jul 17 03:01:55 tools-worker-1010 diamond[672]: self.collect()
Jul 17 03:01:55 tools-worker-1010 diamond[672]: File "/usr/share/diamond/collectors/Nfsiostat/Nfsiostat.py", line 460, in collect
Jul 17 03:01:55 tools-worker-1010 diamond[672]: for k, v in info[a].iteritems():
Jul 17 03:01:55 tools-worker-1010 diamond[672]: AttributeError: 'NoneType' object has no attribute 'iteritems'
Jul 17 03:01:56 tools-worker-1010 diamond[672]: Collector failed!
Jul 17 03:01:56 tools-worker-1010 diamond[672]: Traceback (most recent call last):
Jul 17 03:01:56 tools-worker-1010 diamond[672]: File "/usr/lib/python2.7/dist-packages/diamond/utils/scheduler.py", line 73, in collector_process
Jul 17 03:01:56 tools-worker-1010 diamond[672]: collector._run()
Jul 17 03:01:56 tools-worker-1010 diamond[672]: File "/usr/lib/python2.7/dist-packages/diamond/collector.py", line 477, in _run
Jul 17 03:01:56 tools-worker-1010 diamond[672]: self.collect()
Jul 17 03:01:56 tools-worker-1010 diamond[672]: File "/usr/share/diamond/collectors/Nfsiostat/Nfsiostat.py", line 460, in collect
Jul 17 03:01:56 tools-worker-1010 diamond[672]: for k, v in info[a].iteritems():
Jul 17 03:01:56 tools-worker-1010 diamond[672]: AttributeError: 'NoneType' object has no attribute 'iteritems'
Jul 17 03:01:58 tools-worker-1010 diamond[672]: Collector failed!
Jul 17 03:01:58 tools-worker-1010 diamond[672]: Traceback (most recent call last):
Jul 17 03:01:58 tools-worker-1010 diamond[672]: File "/usr/lib/python2.7/dist-packages/diamond/utils/scheduler.py", line 73, in collector_process
Jul 17 03:01:58 tools-worker-1010 diamond[672]: collector._run()
Jul 17 03:01:58 tools-worker-1010 diamond[672]: File "/usr/lib/python2.7/dist-packages/diamond/collector.py", line 477, in _run
Jul 17 03:01:58 tools-worker-1010 diamond[672]: self.collect()
Jul 17 03:01:58 tools-worker-1010 diamond[672]: File "/usr/share/diamond/collectors/Nfsiostat/Nfsiostat.py", line 460, in collect
Jul 17 03:01:58 tools-worker-1010 diamond[672]: for k, v in info[a].iteritems():
Jul 17 03:01:58 tools-worker-1010 diamond[672]: AttributeError: 'NoneType' object has no attribute 'iteritems'

This probably should not be monitored on the clients (and probably not with Diamond).
There may be other tickets to merge this into, but I wanted to file this before I forgot.

Event Timeline

  • define labstore::nfs_mount: diamond::collector { 'Nfsiostat' }

This is a custom collector deployed via Puppet (modules/diamond/files/collector/nfsiostat.py). I doubt there's an existing Prometheus exporter for this use case, so the existing collector would probably need to be converted.

I think that node_exporter's mountstats collector might have all the data we want/need to replace diamond::collector { 'Nfsiostat' }. It does not track data exactly like the diamond collector that Chase ported over from sysstat, but it appears to be very thorough. The biggest difference I see from a casual examination is that the Prometheus version treats the totals in /proc/self/mountstats as per-mount-point values (because that is how the file is organized), when these stats are actually computed per server, not per mount. This leads to duplicated data like:

node_mountstats_nfs_total_read_bytes_total{export="nfs-tools-project.svc.eqiad.wmnet:/project/tools/home",instance="tools-sgebastion-07:9100",job="node",protocol="tcp"}	57397540672
node_mountstats_nfs_total_read_bytes_total{export="nfs-tools-project.svc.eqiad.wmnet:/project/tools/project",instance="tools-sgebastion-07:9100",job="node",protocol="tcp"}	57397540672
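
For dashboards, a rough way to collapse those duplicates (just a sketch; the label names come from the samples above, and the regex assumes every export looks like server:/path) is to derive a server label from export and take the max across that server's mounts, so each server's counter is only counted once:

  max by (instance, protocol, server) (
    label_replace(
      node_mountstats_nfs_total_read_bytes_total,
      "server", "$1", "export", "([^:]+):.*"
    )
  )

In Grafana that could back a per-server panel instead of a per-mount one.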

Figuring out how to enable this collector selectively, as we have been doing for the diamond collector within the labstore::nfs_mount define, is a challenge. Our puppet code wants the prometheus::node_exporter::collectors_extra hiera setting to list any and all extra collectors to enable. One way we could handle it is to make setting that hiera key part of the process of adding NFS mounts to a Cloud VPS host. For projects like tools we can use prefix hiera to add the needed settings. Similar things can be done in other projects where NFS is project-global or managed by instance prefix. I don't think that we want to add the collector for all Cloud VPS instances just to get metrics for the small(ish) number of instances which are NFS clients.
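
For tools, the prefix hiera entry would be something like this (a sketch; which instance prefix to attach it to, and whether the list needs to carry any collectors already enabled elsewhere, are open questions):

  prometheus::node_exporter::collectors_extra:
    - mountstats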

@Bstorm are there existing NFS client dashboards that you use in https://grafana-labs.wikimedia.org/ or is this diamond data something that we added at some point but never set up visualizations for? I imported an upstream dashboard for mountstats data that you can see at https://grafana-labs.wikimedia.org/d/keQ9P_-iz/nfs-mountstats?orgId=1&var-job=node&var-node=tools-sgebastion-07&var-port=9100&var-mountpoint=All. The only instance we seem to have the mountstats collector already enabled for is tools-sgebastion-07. This collector was enabled by @JHedden in rOPUP4b0033cc956e: tools: add mountstats to bastion node exporter about a month ago.


The diamond data doesn't seem to be used anywhere. It *has* caused problems in the past because monitoring NFS on the client side is a great way to build up load and cause issues whenever a mount becomes unavailable. I see little value in monitoring NFS mount stats on the client side by mount instead of by server, because the problems typically affect all mounts from the same server. I don't think we lose much with the Prometheus exporter. I kind of wish it could be presented as "server" instead of "export", but we can always label things and mess with Grafana to do that, right?
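
On the "server" vs "export" point, one sketch (not something we have configured today, and it assumes exports always look like server:/path) would be a metric_relabel_configs rule on the node scrape job that copies the host part of export into a server label:

  metric_relabel_configs:
    - source_labels: [export]
      regex: '([^:]+):/.*'
      target_label: server
      replacement: '$1'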

In general, I've always relied on nethogs to find the biggest consumers of NFS data. Using this might be handy with an alert watermark attached to it, especially since I see the Prometheus exporter parses out every operation that takes place, which is really cool (saw that here https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=1h&g0.expr=node_mountstats_nfs_operations_sent_bytes_total{instance="tools-sgebastion-07:9100"}&g0.tab=1).
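
As a sketch of that nethogs-style view (the operation label name is an assumption about how the mountstats collector tags these series, so check it against the actual exporter output), something like this shows the top operation types by bytes sent over 5 minute windows:

  topk(5, sum by (operation) (
    rate(node_mountstats_nfs_operations_sent_bytes_total{instance="tools-sgebastion-07:9100"}[5m])
  ))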

My only other comment on that topic is that we might be able to gate that exporter setting on profile::wmcs::nfsclient in general, which would make tools servers that have mount_nfs = false not use the exporter and respect the setting for all purposes, or something like that.
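
A rough sketch of that idea (purely hypothetical; the real profile::wmcs::nfsclient parameters and how the node_exporter profile consumes its collector list would need checking):

  # Hypothetical Puppet sketch: only request the mountstats collector when the
  # host actually mounts NFS. Parameter and class internals are illustrative.
  class profile::wmcs::nfsclient (
    Boolean $mount_nfs = true,
  ) {
    if $mount_nfs {
      # ...existing NFS mount handling...
      # and arrange for 'mountstats' to be added to the node_exporter's
      # collector list here, instead of relying only on per-project hiera.
    }
  }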

> The diamond data doesn't seem to be used anywhere. It *has* caused problems in the past because monitoring NFS on the client side is a great way to build up load and cause issues whenever a mount becomes unavailable.

I think the Prometheus exporter works just by parsing /proc/self/mountstats and /proc/self/mountinfo, but maybe reads of those files block when the kernel is trying to refresh mounts?

> I see little value in monitoring NFS mount stats on the client side by mount instead of by server, because the problems typically affect all mounts from the same server.

We can just drop the Diamond collector and not replace it with Prometheus if the data is not valuable. The "measure all the things" mentality has saved me a few times in my life, but only in relation to financial transactions, not server load.


Well, I think that monitoring the way the Prometheus data does is potentially valuable... I just don't care much whether it is by mount or by server, since most issues will be per server rather than per mount. I am also curious why @JHedden enabled it, because he may have a really good idea of how it could be used before we nix the idea.

I enabled the node exporter mountstats plugin to help diagnose the "slowness" our users have been reporting on tools-sgebastion-07.tools.eqiad.wmflabs. Being able to line up multiple system metrics next to each other with a historical timeline can help identify usage patterns and resource contention.

My assumption is that it's happening a lot more than the few pings we get in IRC. The more data we collect, the better we can react to questions like "why was this so slow last week", or "every time I use X it's slow".

Side note: I also enabled the traffic shaping (qdisc) plugin around the same time for this same reason.

If you feel that either of these is not providing any value, I'm OK with removing them.

I think that's a great reason to enable the metric, especially on the bastions, where this happens all the time. We could, honestly, just add that hiera value to the bastion prefixes and keep a dashboard with an alert limit in it. It won't actually alert us unless we do more than that, but it might be nice to see something like "bastion with nfs values above ____" in the tools-basic-alerts or something. But let's blow away the nfsiostat collector, because nobody is looking at it anymore. I think people were most likely using it directly with Graphite for the sort of analysis you just mentioned.
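
If we later want the alerting side, a rule along these lines could express the "values above ____" idea (purely illustrative: the threshold is an arbitrary placeholder, the instance regex assumes the bastion naming shown earlier, and the rule would still need wiring into whatever alert routing tools uses):

  - alert: BastionNfsReadRateHigh
    # 100000000 bytes/s (~100 MB/s) is an arbitrary placeholder, not a recommendation.
    expr: max by (instance) (rate(node_mountstats_nfs_total_read_bytes_total{instance=~"tools-sgebastion-.*"}[10m])) > 100000000
    for: 15m
    annotations:
      summary: "High NFS read rate on {{ $labels.instance }}"

Using max by (instance) rather than sum avoids double counting the duplicated per-mount counters noted earlier in the task.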

Bstorm triaged this task as Medium priority. Feb 11 2020, 4:29 PM