Opened 7 months ago

Closed 6 months ago

#29684 closed defect (fixed)

setup a grafana server somewhere

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID: #29681
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

Prometheus on its own is nice, but the graphs are not that great. We should setup Grafana on top of that instead.

Grafana is a pain in the bottom to install on Debian: there are upstream packages, but they are a mess, so my approach so far has been to use Docker.

I guess we can use the test server for this for now.

note there is a puppet module for grafana, which supports deploying both with the upstream debian package and docker: https://forge.puppet.com/puppet/grafana
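For illustration, deploying through that Forge module might look like the sketch below. The parameter names (`install_method`, `version`, `cfg`) follow the module's documented interface, but the version number and port are placeholders, not recommendations; verify against the module release actually pinned in our Puppet tree.

```puppet
# Hypothetical sketch using the puppet/grafana Forge module.
class { 'grafana':
  install_method => 'repo',   # or 'docker' to use the upstream image
  version        => '6.0.0',  # placeholder version
  cfg            => {
    server => {
      http_port => 3000,      # Grafana's default port
    },
  },
}
```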

Child Tickets

Change History (9)

comment:1 Changed 7 months ago by anarcat

Description: modified (diff)

comment:2 Changed 6 months ago by anarcat

grafana has been installed using the upstream debian packages on grafana.torproject.org. the admin password is in tor-passwords.git, but i'm thinking of setting up LDAP authentication instead to avoid having another source of truth for passwords.

things seem to be already working: i configured the prometheus datasource and a dashboard by hand, but that could also be deployed through puppet (eventually).
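For reference, Grafana (since 5.0) supports file-based provisioning, so the eventual Puppet deployment would only have to drop a file like the sketch below. The filename is an assumption, and the local URL assumes Prometheus keeps running on the same VM as it does now.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # assumes Prometheus on the same host
    isDefault: true
```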

comment:3 Changed 6 months ago by anarcat

note that i had a discussion with ln5 about how to deploy the grafana package, because it's not an official debian package. ln5 had concerns about introducing Docker in the infrastructure for various (legitimate) reasons related to complexity and security theater. hopefully i'm summarizing that well, correct me if I'm wrong. :)

in the end, I chose to deploy using the upstream debian package because it's a technology we all know well, but I think this is a special case (a webapp) where using Docker containers actually makes a lot of sense. things move too fast for debian packaging to catch up - just figuring out all those node.js dependencies would be a nightmare. plus the Docker environment provides *some* isolation from nasty upstream code: because the container runs as the "grafana" user, builtin protections in the Linux kernel (process boundaries and user isolation) shield us from possible bad code, while the Debian package runs as root during install. and even though the daemon runs as a dedicated user in Debian too, it doesn't benefit from the namespace, PID, and cgroup isolation as far as I know.
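As an aside, the namespace/cgroup isolation mentioned above isn't exclusive to Docker: a systemd drop-in could give the Debian-packaged daemon comparable sandboxing. The directives below are standard systemd ones, but this sketch is untested against the actual grafana-server unit, and the writable paths are assumptions.

```ini
# /etc/systemd/system/grafana-server.service.d/hardening.conf (sketch)
[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
# assumed state/log paths; adjust to the package's actual layout
ReadWritePaths=/var/lib/grafana /var/log/grafana
```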

but for the time being, it's a good enough solution, especially since deploying the docker way would have involved auditing *three* more puppet modules...

comment:4 Changed 6 months ago by anarcat

only remaining things to do here, just to be clear:

  • confirm the choice of deploying using debian packages
  • deploy a better authentication system than "one big admin password" (maybe?)
  • deploy dashboards through puppet
  • deploy the Prometheus datasource through puppet
  • correctly configure an admin password on first install

All of this is optional if we treat the Grafana configuration as "precious" data that we don't need to manage through Puppet. I'm tempted to do that right now and just go along with what we have.
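On the "admin password on first install" item: Grafana reads the initial admin credentials from its main config file, so Puppet could template something like the sketch below before first startup. Once the database exists, the password can instead be changed with `grafana-cli admin reset-admin-password`. The value shown is a placeholder, not a real secret.

```ini
# /etc/grafana/grafana.ini (fragment, sketch only)
[security]
admin_user = admin
admin_password = CHANGEME   ; placeholder, would come from the password manager
```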

comment:5 Changed 6 months ago by anarcat

the first step here, to be clear, is a choice between the following options:

  1. Grafana installed with the upstream Debian package, no isolation (current situation)
  2. Grafana installed with the upstream Debian package, in its own VM
  3. Grafana installed with the upstream Docker image
  4. Something else than Grafana, but still using Prometheus
  5. Going back to Munin

TL;DR: I'm for option 1 for now and eventually option 3 if upstream can't figure out Debian packaging. I need a decision on this to move forward with the munin-node cleanup and Grafana configuration, but I'll continue the deployment of Prometheus exporters everywhere in any case (unless people feel strongly for option 5).

Taking those in reverse order:

I don't think anyone is seriously considering option 5 here, but I just added it to make things clear.

I am somewhat opposed to option 4: I don't know of any good replacement for Prometheus that is better packaged in Debian and would allow us to graph metrics from Prometheus the way we need. We *can* build custom graphs and dashboards using the console templates, but my experience with Prometheus graphs so far has been painful at best. They are hard to make and hard to share, while there is already a library of Grafana dashboards we can draw from (even if it's a little small).

Regarding option 3, I don't care that much about Debian vs Docker. I originally wanted to try Docker images because I didn't feel comfortable installing arbitrary upstream code as root in our infrastructure. I also liked the idea of getting the little extra isolation Docker provides from that non-vetted upstream code, even if it means a few extra layers of abstraction and weird tradeoffs. But (understandably) ln5 wasn't comfortable using containers at all, and I figured it might be simpler to just use a Debian package for now, since it's something we're all familiar with.

(Option 2) So that's why we're running the upstream Debian package now, without isolation - that is, in the same VM as the Prometheus server. As discussed with ln5 over IRC, the catastrophic scenario we would avoid by setting up Grafana in a separate VM is that someone takes over the Grafana server and uses it to start attacking other nodes in the network running the Prometheus exporters. They would need to hack those and also escape their sandboxes to do any more significant damage to other nodes. Another attack vector is getting at the Prometheus data itself, but that is currently protected by an "invite" password, so it's not really that much of a concern. That said, if an attacker could get privilege escalation and access to the Prometheus accounts, they might be able to silence alarms and inject arbitrary data into the Prometheus database.

Setting up a separate VM for Grafana would mean that the Grafana server wouldn't talk to Prometheus locally anymore, which could have a performance impact on graph generation time. We *could* host the two VMs on the same physical box, but that would require rebuilding the Prometheus server as well.

So I don't think the tradeoffs of running Grafana in a separate VM are worth it.

I would continue with the current Debian-based setup (option 1) or, if we're worried about trusting those packages, switch to the Docker image (option 3). In any case, I would prefer that we continue the implementation until it's on par with what we get from Munin out of the box, which involves adding a few more exporters to get stats about databases and webservers.

This is all Prometheus stuff, and so far I haven't seen resistance to that technology, so from now on I'll go on the assumption that I can continue deploying those exporters, which are well packaged in Debian and easier to deploy anyway, with minimal dependencies.

The open question for me is whether I should tear out the remaining traces of the Munin configuration on the hosts. There are still munin-node daemons running everywhere and failing cronjobs making noise. Removing that stuff would also show what's missing from our Prometheus setup, which would be useful in itself.

The other question is whether we go with Grafana at all or find "something else" (option 4). I'd like to keep going with Grafana and finish its configuration, naturally, but I'm open to alternative suggestions of course.

Alright, sorry for the long email, but I figured it was worth documenting all the options carefully.

comment:6 Changed 6 months ago by anarcat

i have had approval from weasel to go ahead with Grafana, either Docker or .debs, he doesn't care, although he'd prefer that untrusted debs be in a separate VM.

keep authentication simple: no need to share precious LDAP creds with yet another webapp - that's already the case with jenkins and weasel doesn't like it. so we could set up anonymous auth like we did for prometheus and nagios and just keep the admin password in the password manager.
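The anonymous-auth approach sketched above maps onto a small grafana.ini fragment. These are real Grafana config keys; the org name is an assumption for illustration.

```ini
# /etc/grafana/grafana.ini (fragment, sketch only)
[auth.anonymous]
enabled = true
org_name = Main Org.   ; assumed org name
org_role = Viewer      ; read-only; admin keeps the password-manager login
```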

comment:7 Changed 6 months ago by anarcat

something regarding authentication came up through a third-party scraper deployment, in #29863. there were concerns the node exporter would leak information that could be exploited for side-channel attacks. the node exporter is firewalled, but all that data is then made available on the prometheus server, protected only by a trivial password. they will assess the exposed data and see if the additional authentication burden is worth the risk.

comment:8 Changed 6 months ago by anarcat

progress: deployed the data source and dashboards through puppet. now, going back to our checklist, we have:

  • confirm the choice of deploying using debian packages (done - we can switch to Docker eventually if we want to)
  • deploy a better authentication system than "one big admin password" (still pending)
  • deploy dashboards through puppet (done)
  • deploy the Prometheus datasource through puppet (done)
  • correctly configure an admin password on first install (next up, depends on #30009)
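For the record, the dashboards-through-puppet piece likely amounts to Grafana's file-based dashboard provisioning, something like the sketch below: Puppet manages this file plus the dashboard JSON in the referenced directory. The provider name and path are assumptions.

```yaml
# /etc/grafana/provisioning/dashboards/default.yaml (hypothetical path)
apiVersion: 1
providers:
  - name: default        # assumed provider name
    type: file
    options:
      path: /var/lib/grafana/dashboards   # assumed dashboard JSON directory
```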

comment:9 Changed 6 months ago by anarcat

Resolution: fixed
Status: assigned → closed

i have moved the authentication questions into #30023 and the docker deployment info into #30026, so I believe we can close this ticket for now.
