Opened 7 months ago

Closed 6 months ago

#29684 closed defect (fixed)

setup a grafana server somewhere

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID: #29681
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

Prometheus on its own is nice, but the graphs are not that great. We should setup Grafana on top of that instead.

Grafana is a pain in the bottom to install on Debian: there are upstream packages, but they are a mess, so my approach so far has been to use Docker.

I guess we can use the test server for this for now.

note there is a puppet module for grafana, which supports deploying both with the upstream debian package and docker: https://forge.puppet.com/puppet/grafana
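For illustration, deploying through that Forge module might look like the sketch below. The parameter names (`install_method`, `version`, `cfg`) follow the module's documented interface, but the version number and port are placeholders, not recommendations; verify against the module release actually pinned in our Puppet tree.

```puppet
# Hypothetical sketch using the puppet/grafana Forge module.
class { 'grafana':
  install_method => 'repo',   # or 'docker' to use the upstream image
  version        => '6.0.0',  # placeholder version
  cfg            => {
    server => {
      http_port => 3000,      # Grafana's default port
    },
  },
}
```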

Child Tickets

Change History (9)

comment:1 Changed 7 months ago by anarcat

Description: modified (diff)

comment:2 Changed 6 months ago by anarcat

grafana has been installed using the upstream debian packages on grafana.torproject.org. the admin password is in tor-passwords.git, but i'm thinking of setting up LDAP authentication instead to avoid having another source of truth for passwords.

things seem to be already working: i configured the prometheus datasource and a dashboard by hand, but that could also be deployed through puppet (eventually).
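For reference, Grafana (since 5.0) supports file-based provisioning, so the eventual Puppet deployment would only have to drop a file like the sketch below. The filename is an assumption, and the local URL assumes Prometheus keeps running on the same VM as it does now.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # assumes Prometheus on the same host
    isDefault: true
```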

comment:3 Changed 6 months ago by anarcat

note that i had a discussion with ln5 about how to deploy the grafana package, because it's not an official debian package. ln5 had concerns about introducing Docker in the infrastructure for various (legitimate) reasons related to complexity and security theater. hopefully i'm summarizing that well, correct me if I'm wrong. :)

in the end, I chose to deploy using the upstream debian package because it's a technology we all know well, but I think this is a special case (a webapp) where using Docker containers actually makes a lot of sense. things move too fast for debian packaging to catch up - just figuring out all those node.js dependencies would be a nightmare. plus the Docker environment provides *some* isolation from nasty upstream code: because the container runs as the "grafana" user, builtin protections in the Linux kernel (process boundaries and user isolation) shield us from possible bad code, while the Debian package runs as root during install. and even though the daemon runs as a dedicated user in Debian too, it doesn't benefit from the namespace, PID, and cgroup isolation as far as I know.
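As an aside, the namespace/cgroup isolation mentioned above isn't exclusive to Docker: a systemd drop-in could give the Debian-packaged daemon comparable sandboxing. The directives below are standard systemd ones, but this sketch is untested against the actual grafana-server unit, and the writable paths are assumptions.

```ini
# /etc/systemd/system/grafana-server.service.d/hardening.conf (sketch)
[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
# assumed state/log paths; adjust to the package's actual layout
ReadWritePaths=/var/lib/grafana /var/log/grafana
```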

but for the time being, it's a good enough solution, especially since deploying the docker way would have involved auditing *three* more puppet modules...

comment:4 Changed 6 months ago by anarcat

only remaining things to do here, just to be clear:

  • confirm the choice of deploying using debian packages
  • deploy a better authentication system than "one big admin password" (maybe?)
  • deploy dashboards through puppet
  • deploy the Prometheus datasource through puppet
  • correctly configure an admin password on first install

All of this is optional if we treat the Grafana configuration as "precious" data that we don't need to manage through Puppet. I'm tempted to do that right now and just go along with what we have.
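On the "admin password on first install" item: Grafana reads the initial admin credentials from its main config file, so Puppet could template something like the sketch below before first startup. Once the database exists, the password can instead be changed with `grafana-cli admin reset-admin-password`. The value shown is a placeholder, not a real secret.

```ini
# /etc/grafana/grafana.ini (fragment, sketch only)
[security]
admin_user = admin
admin_password = CHANGEME   ; placeholder, would come from the password manager
```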

comment:5 Changed 6 months ago by anarcat

the first step here, to be clear, is a choice between the following options:

  1. Grafana installed with the upstream Debian package, no isolation (current situation)
  2. Grafana installed with the upstream Debian package, in its own VM
  3. Grafana installed with the upstream Docker image
  4. Something else than Grafana, but still using Prometheus
  5. Going back to Munin

TL;DR: I'm for option 1 for now and eventually option 3 if upstream can't figure out Debian packaging. I need a decision on this to move forward with the munin-node cleanup and Grafana configuration, but I'll continue the deployment of Prometheus exporters everywhere in any case (unless people feel strongly for option 5).

Taking those in reverse order:

I don't think anyone is seriously considering option 5 here, but I just added it to make things clear.

I am somewhat opposed to option 4: I don't know of any good replacement for Prometheus that is better packaged in Debian and would allow us to graph metrics from Prometheus the way we need. We *can* build custom graphs and dashboards using the console templates, but my experience with Prometheus graphs so far has been painful at best. They are hard to make and hard to share, while there is already a library of Grafana dashboards we can draw from (even if it's a little small).

Regarding option 3, I don't care that much about Debian vs Docker. I originally wanted to try Docker images because I didn't feel comfortable installing arbitrary upstream code as root in our infrastructure. I also liked the idea of getting the little extra isolation Docker provides from that non-vetted upstream code, even if it means a few extra layers of abstraction and weird tradeoffs. But (understandably) ln5 wasn't comfortable using containers at all, and I figured it might be simpler to just use a Debian package for now, since it's something we're all familiar with.

(Option 2) So that's why we're running the upstream Debian package now, without isolation - that is, in the same VM as the Prometheus server. As discussed with ln5 over IRC, the catastrophic scenario we would avoid by setting up Grafana in a separate VM is that someone takes over the Grafana server and uses it to start attacking other nodes in the network running the Prometheus exporters. They would need to hack those and also escape their sandboxes to do any more significant damage to other nodes. Another attack vector is getting at the Prometheus data itself, but that is currently protected by an "invite" password, so it's not really that much of a concern. That said, if an attacker could get privilege escalation and access to the Prometheus accounts, they might be able to silence alarms and inject arbitrary data into the Prometheus database.

Setting up a separate VM for Grafana would mean that the Grafana server wouldn't talk to Prometheus locally anymore, which could have a performance impact on graph generation time. We *could* host the two VMs on the same physical box, but that would require rebuilding the Prometheus server as well.

So I don't think the tradeoffs of running Grafana in a separate VM are worth it.

I would continue with the current Debian-based setup (option 1) or, if we're worried about trusting those packages, switch to the Docker image (option 3). In any case, I would prefer that we continue the implementation until it's on par with what we get from Munin out of the box, which involves adding a few more exporters to get stats about databases and webservers.

This is all Prometheus stuff, and so far I haven't seen resistance to that technology, so from now on I'll go on the assumption that I can continue deploying those exporters, which are well packaged in Debian and easier to deploy anyway, with minimal dependencies.

The open question for me is whether I should tear out the remaining traces of the Munin configuration on the hosts. There are still munin-node daemons running everywhere and failing cronjobs making noise. Removing that stuff would also show what's missing from our Prometheus setup, which would be useful in itself.

The other question is whether we go with Grafana at all or find "something else" (option 4). I'd like to keep going with Grafana and finish its configuration, naturally, but I'm open to alternative suggestions of course.

Alright, sorry for the long email, but I figured it was worth documenting all the options carefully.

comment:6 Changed 6 months ago by anarcat

i have had approval from weasel to go ahead with Grafana, either Docker or .debs, he doesn't care, although he'd prefer that untrusted debs be in a separate VM.

keep authentication simple: no need to share precious LDAP creds with yet another webapp - that's already the case with jenkins and weasel doesn't like it. so we could set up anonymous auth like we did for prometheus and nagios and just keep the admin password in the password manager.
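The anonymous-auth approach sketched above maps onto a small grafana.ini fragment. These are real Grafana config keys; the org name is an assumption for illustration.

```ini
# /etc/grafana/grafana.ini (fragment, sketch only)
[auth.anonymous]
enabled = true
org_name = Main Org.   ; assumed org name
org_role = Viewer      ; read-only; admin keeps the password-manager login
```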

comment:7 Changed 6 months ago by anarcat

something regarding authentication came up through a third-party scraper deployment, in #29863. there were concerns the node exporter would leak information that could be exploited for side-channel attacks. the node exporter is firewalled, but all that data is then made available on the prometheus server, protected only by a trivial password. they will assess the exposed data and see if the additional authentication burden is worth the risk.

comment:8 Changed 6 months ago by anarcat

progress: deployed the data source and dashboards through puppet. now, going back to our checklist, we have:

  • confirm the choice of deploying using debian packages (done - we can switch to Docker eventually if we want to)
  • deploy a better authentication system than "one big admin password" (still pending)
  • deploy dashboards through puppet (done)
  • deploy the Prometheus datasource through puppet (done)
  • correctly configure an admin password on first install (next up, depends on #30009)
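For the record, the dashboards-through-puppet piece likely amounts to Grafana's file-based dashboard provisioning, something like the sketch below: Puppet manages this file plus the dashboard JSON in the referenced directory. The provider name and path are assumptions.

```yaml
# /etc/grafana/provisioning/dashboards/default.yaml (hypothetical path)
apiVersion: 1
providers:
  - name: default        # assumed provider name
    type: file
    options:
      path: /var/lib/grafana/dashboards   # assumed dashboard JSON directory
```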

comment:9 Changed 6 months ago by anarcat

Resolution: fixed
Status: assigned → closed

i have moved the authentication questions into #30023 and the docker deployment info into #30026, so I believe we can close this ticket for now.
