Opened 5 weeks ago

Last modified 25 hours ago

#29816 assigned task

replace "Tor VM hosts" spreadsheet with Grafana dashboard

Reported by: anarcat Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Minor Keywords:
Cc: Actual Points:
Parent ID: #30273 Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

Our KVM allocation strategy is currently managed through a Google spreadsheet. This is suboptimal for a few reasons:

  1. it is hard to keep up to date - for example, moly is not listed in there even though it's marked as a "KVM host" in LDAP
  2. it's not real-time data - for example, even if a host is allocated one vCPU, it might be totally idle most of the time and doing mostly network or disk I/O, while another might hit the CPU hard. actual load is what matters
  3. it's hosted by Google - that has a few problems, the most important of which is that some TPA members do not actually *want* to use Google services and might be reluctant to update it, worsening problem 1

I propose we shift this to a Grafana dashboard. I already have a prototype in the form of the "Node exporter server metrics" Grafana dashboard, which shows basic stats for multiple hosts in parallel. I set the dashboard's default in Grafana to show the 6 KVM hosts:

<https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100&var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100&var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100&var-node=unifolium.torproject.org:9100>

That looks like this:

https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png

... but it's not ideal:

  • it shows stats that are irrelevant for this purpose, like context switches or detailed disk and memory stats
  • it's missing critical information, like the number of KVM guests hosted on the machine and how much CPU and disk space is allocated

This is the information we should be showing:

  • disk capacity vs allocation
  • disk utilization
  • CPU count vs allocation
  • actual CPU utilization
  • load?
  • memory capacity vs allocation
  • actual memory usage
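The "actual" side of those panels is already covered by standard node-exporter metrics; the "allocation" side would have to come from metrics we export ourselves. Assuming the allocations were exported as hypothetical `kvm_guest_*_allocated` metrics, the "vs allocation" panels could pair queries like these (sketch only; metric names and the /srv mountpoint are assumptions):

```
# disk: capacity (node exporter) vs allocation (hypothetical metric),
# assuming guest images live under /srv
node_filesystem_size_bytes{mountpoint="/srv"}
sum by (instance) (kvm_guest_disk_bytes_allocated)

# CPU: physical core count (standard idiom) vs allocated vCPUs
count by (instance) (node_cpu_seconds_total{mode="idle"})
sum by (instance) (kvm_guest_vcpus_allocated)
```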

Some of that information currently lives *only* in the spreadsheet. For example, disk allocations are only available there, because the KVM guests are backed by QCOW (QEMU Copy On Write) disk images, which only take up space as the guest actually uses it. This has the advantage of allowing us to over-provision, but it means we must keep the allocation metadata somewhere else.

So for now it's in the spreadsheet, but we could find a way to move it somewhere Prometheus can scrape. One trick the Prometheus node exporter has is the "textfile collector": it can expose metrics stored as text files in /var/lib/prometheus/node-exporter/*.prom. This is how the smartctl and APT metrics get shipped, for example: a cron job (well, a systemd timer) regularly rewrites the file, atomically. So one option could be to move this information into (say) LDAP or Puppet/Hiera and write it into such a file with a cron job (LDAP) or Puppet (Hiera).
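Exporting the allocation metadata through the textfile collector could look roughly like this. A minimal Python sketch: the `kvm_guest_*` metric names are made up, /var/lib/prometheus/node-exporter is the stock textfile-collector path, and the demo writes to a temp directory so it can run anywhere:

```python
import os
import tempfile

# Where the node exporter's textfile collector reads metrics in production;
# this demo writes to a temp dir instead so it can run unprivileged.
TEXTFILE_DIR = "/var/lib/prometheus/node-exporter"

def write_prom(path, samples):
    """Atomically write samples [(name, labels, value)] in Prometheus text format."""
    lines = ['%s{%s} %s' % (name,
                            ",".join('%s="%s"' % kv for kv in sorted(labels.items())),
                            value)
             for name, labels, value in samples]
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # rename is atomic on POSIX, so the scraper never sees a partial file
    os.rename(tmp, path)

# example: allocation metadata that currently lives only in the spreadsheet
samples = [
    ("kvm_guest_vcpus_allocated", {"guest": "example-guest"}, 4),
    ("kvm_guest_disk_bytes_allocated", {"guest": "example-guest"}, 80 * 2**30),
]
out_path = os.path.join(tempfile.gettempdir(), "kvm_allocations.prom")
write_prom(out_path, samples)
```

In a real deployment the cron job or Puppet run would write to TEXTFILE_DIR and pull the sample values from LDAP or Hiera rather than hardcoding them.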

Then we'd build a custom Grafana dashboard and get rid of the spreadsheet.

A stop-gap measure might be to simplify the spreadsheet and move it to a plain-text markdown file. We would lose the automatic calculations the spreadsheet provides, in exchange for easier updating and transparency.

Child Tickets

Change History (6)

comment:1 Changed 5 weeks ago by anarcat

if I might add, given the trouble I am having figuring out how moly was built and what hardware it's running, I'm thinking more and more we should keep more details about the various devices somewhere. maybe it could be in LDAP, but I can't help but think this is stuff that could very well live in a YAML file in Hiera.

comment:2 Changed 5 weeks ago by anarcat

other possible inspirations include:

the latter specifically mentions some interesting metrics that we might be able to use for our purposes:

  "block.<num>.capacity" - logical size in bytes of the block device
                           backing image as unsigned long long.
  "block.<num>.physical" - physical size in bytes of the container of the
                           backing image as unsigned long long.
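Those libvirt stats could feed the same textfile-collector pipeline. A rough Python sketch that parses `virsh domstats --block` style output into Prometheus text format; the `kvm_block_*` metric names are made up and the sample output is fabricated for illustration:

```python
import re

def parse_domstats(text):
    """Map domain name -> {stat key: value} from `virsh domstats` key=value output."""
    stats, domain = {}, None
    for line in text.splitlines():
        m = re.match(r"Domain: '(.+)'", line.strip())
        if m:
            domain = m.group(1)
            stats[domain] = {}
        elif "=" in line and domain:
            key, _, value = line.strip().partition("=")
            stats[domain][key] = value
    return stats

def to_prom(stats):
    """Emit Prometheus text-format lines for block capacity and physical size."""
    out = []
    for domain, kv in sorted(stats.items()):
        for key, value in sorted(kv.items()):
            m = re.match(r"block\.(\d+)\.(capacity|physical)$", key)
            if m:
                out.append('kvm_block_%s_bytes{domain="%s",block="%s"} %s'
                           % (m.group(2), domain, m.group(1), value))
    return "\n".join(out) + "\n"

# fabricated sample of `virsh domstats --block` output
sample = """Domain: 'test-guest'
  block.count=1
  block.0.name=vda
  block.0.capacity=85899345920
  block.0.physical=21474836480
"""
print(to_prom(parse_domstats(sample)))
```

Comparing `capacity` (logical) against `physical` (actual) per guest would give exactly the over-provisioning view the spreadsheet tries to capture by hand.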
Last edited 5 weeks ago by anarcat (previous) (diff)

comment:3 Changed 3 weeks ago by anarcat

another thing to consider here is that we don't have a clear, global view of which (physical) machines we have and how much they cost. we do have a list of machines in LDAP, but that includes limited information and does not include cost, so it's hard to do requirements assessment and depreciation evaluation.

comment:4 Changed 3 weeks ago by anarcat

Owner: changed from tpa to anarcat
Status: new → assigned

i'll start looking into this more directly as part of the Hiera move in #30020.

comment:5 Changed 7 days ago by anarcat

Description: modified (diff)

comment:6 Changed 25 hours ago by anarcat

Parent ID: #30273

this is part of the broader inventory problem, which i documented in a separate ticket

Note: See TracTickets for help on using tickets.