Add disk space monitoring for snowflake infrastructure

Trac:
Child Ticket(s): #30929 (moved)

added anti-censorship-roadmap budget_needed component::circumvention/snowflake priority::medium severity::normal sponsor::30-can status::assigned type::task labels

Trac:
Description: We've run out of disk space at both the snowflake bridge (#28390 (moved)) and the broker (#29861 (moved)), which has caused snowflake to stop working. We've set up rotating and compressed logs but it would be nice to have some disk space monitoring to alert us if/when this happens again

to

We've run out of disk space at both the snowflake bridge (#26661 (moved), #28390 (moved)) and the broker (#29861 (moved)), which has caused snowflake to stop working. We've set up rotating and compressed logs but it would be nice to have some disk space monitoring to alert us if/when this happens again

Trac:
Cc: dcf, arlolra, cohosh to dcf, arlolra, cohosh, anarcat

On IRC, armadev and anarcat suggest using prometheus, which is being used to monitor torproject machines. https://help.torproject.org/tsa/howto/prometheus/ If the Snowflake bridge and broker were hosted and managed by Tor, then I gather that we would get disk space monitoring (and other things) automatically through their common puppet deployment. But as they are currently separately managed, anarcat says says we need to

apt install prometheus-node-exporter (from stretch-backports please)
provide the sysadmins with canonical hostnames

apt install prometheus-node-exporter (from stretch-backports please)

is actually:

apt install -t stretch-backports prometheus-node-exporter

... assuming that backports is already configured. if it isn't, such a line in /etc/apt/sources.list.d/backports.debian.org.list should suffice:

deb	https://deb.debian.org/debian/	stretch-backports	main contrib non-free

... followed by an apt update, naturally.

I believe the two hostnames to scrape are snowflake-broker.bamsoftware.com and snowflake.bamsoftware.com, correct me if that's wrong. So that second bullet point is probably already done. :)

The plan from TPA side is to hook those as part of a new node job in the scrape_configs in prometheus.yml, from Puppet, in profile::prometheus::server.

ah, and keep in mind that prometheus doesn't currently send alerts although it's planned, in the long term, to replace our current alerting system (nagios) with Prometheus, once we're confident it doesn't misbehave and performs well. this will take at least a few weeks.

Trac:
Description: We've run out of disk space at both the snowflake bridge (#26661 (moved), #28390 (moved)) and the broker (#29861 (moved)), which has caused snowflake to stop working. We've set up rotating and compressed logs but it would be nice to have some disk space monitoring to alert us if/when this happens again

to

We've run out of disk space at both the snowflake bridge (#26661 (moved), #28390 (moved)) and the broker (#29861 (moved)), which has caused snowflake to stop working. We've set up rotating and compressed logs but it would be nice to have some disk space monitoring to alert us if/when this happens again

Also, as discussed on IRC, we should eventually move the broker to a TPA machine.

the alerting project, that is the move from Nagios to Prometheus, has officially been documented in #29864 (moved)

This should be set up on this end. I'm adding it only for the broker at the moment, just see how it goes. I added a firewall rule to allow connections from the TPA prometheus machine.

I believe the two hostnames to scrape are snowflake-broker.bamsoftware.com and snowflake.bamsoftware.com, correct me if that's wrong. So that second bullet point is probably already done. :)

Yup, that's correct. Right now it's just set up for snowflake-broker.

Okay, taking a step back for bit.

Right now prometheus is running and the output can be accessed by locally visiting localhost:9100/metrics. I'm going to hold off on allowing prometheus1.torproject.org to access it until we are confident that the data exported won't introduce new side channels.

In a discussion with anarcat in IRC, it looks like at the moment there is anonymous access to the prometheus server that shows the graphs, and the authorization for accessing this is very light. We should make sure we're comfortable with this before doing it.

Here is a summary of the features available: https://github.com/prometheus/node_exporter

Replying to cohosh:

Right now prometheus is running and the output can be accessed by locally visiting localhost:9100/metrics. I'm going to hold off on allowing prometheus1.torproject.org to access it until we are confident that the data exported won't introduce new side channels.

I did ssh -N -L 9100:127.0.0.1:9100 snowflake-broker and then browsed to !http://127.0.0.1:9100/metrics. And yeah, looks like there is some potentially useful (to an attacker) stuff in there:

# HELP apt_upgrades_pending Apt package pending updates by origin.
# HELP node_boot_time_seconds Node boot time, in unixtime.
# HELP node_entropy_available_bits Bits of available entropy.
# HELP node_intr_total Total number of interrupts serviced.
# HELP node_uname_info Labeled system information as provided by the uname system call.
# HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
# HELP node_sockstat_TCP_inuse Number of TCP sockets in state inuse.

It doesn't look disastrous on its own, but I think you're right to be cautious.

It looks like this information is not encrypted in transit? Maybe we could expose it on an authenticated onion service instead?

that's right: endpoints are in cleartext. in general, Prometheus does not offer any form of encryption or authentication, that's delegated to reverse proxies or whatever you want to put in front:

https://prometheus.io/docs/operating/security/#authentication-authorization-and-encryption

so far we've avoided doing this on the exporters because of the sheer complexity of adding stunnel or tor or whatever to all those targets. it would be quite annoying, to say the least, to wrap probes in tor - i don't actually know if we could do that. but i suspect we might be able to scrape metrics off HTTPS remotes, i just haven't done that yet.

I edited /etc/init.d/prometheus-node-exporter and /etc/default/prometheus-node-exporter to add the following line:

ARGS="--no-collector.arp --no-collector.bcache --no-collector.bonding --no-collector.conntrack --no-collector.cpu --no-collector.edac --no-collector.entropy --no-collector.filefd --no-collector.hwmon --no-collector.infiniband --no-collector.ipvs --no-collector.loadavg --no-collector.mdadm --no-collector.meminfo --no-collector.netclass --no-collector.netdev --no-collector.netstat --no-collector.nfs --no-collector.nfsd --no-collector.sockstat --no-collector.stat --no-collector.textfile --no-collector.timex --no-collector.uname --no-collector.vmstat --no-collector.xfs --no-collector.zfs"

Unfortunately there's no --disable-all option or config file so we have to disable all of the collectors individual except the ones that we want. The collectors listed here aren't quite what we got. The full list is in /etc/default/prometheus-node-exporter.

The above arguments will leave only the diskstats, filesystem, systemd, and time collectors enabled for now:

Apr  9 15:13:03 broker systemd[1]: Started Prometheus exporter for machine metrics.
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Starting node_exporter (version=0.17.0+ds, branch=debian/sid, revision=0.17.0+ds-2~bpo9+1)" source="node_exporter.go:156"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Build context (go=go1.10.5, user=pkg-go-maintainers@lists.alioth.debian.org, date=20181223-17:51:48)" source="node_exporter.go:157"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Enabled collectors:" source="node_exporter.go:97"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - diskstats" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - filesystem" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - systemd" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - time" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Listening on :9100" source="node_exporter.go:170"

We might also want to enable collectors related to memory/CPU to make sure we're not straining the system too badly.

I think the biggest ones to stay away from are related to network data (like netstat).

I'll put this in needs_review and if we decide it's merge_ready we can allow the Tor Project prometheus scraper to access port 9100.

Trac:
Status: new to needs_review

I didn't necessarily mean to imply that we need to turn off a lot of the exported metrics; just that some of them are sensitive enough that we can't just expose port 9100 to everyone. It feels a little weird in this day and age to rely on IP address–based authentication, but personally I can live with it. I'm fine with going ahead with what you've set up, cohosh, and making incremental adjustments in the future if we decide they're necessary.

Replying to anarcat:

it would be quite annoying, to say the least, to wrap probes in tor - i don't actually know if we could do that.

The way this would work is we would give you an onion name and an auth cookie. You put those in HidServAuth in torrc as

HidServAuth xxxxxxxxxxxxxxxx.onion authcookieauthcookie

Then, instead of configuring prometheus to fetch from !http://snowflake.bamsoftware.com:9100/, you configure it to fetch from !http://xxxxxxxxxxxxxxxx.onion:9100/ with a proxy_url of !socks5://127.0.0.1:9050/.

On the server side, we would add HiddenServiceAuthorizeClient to torrc:

HiddenServiceDir /var/lib/tor/prometheus_node_exporter
HiddenServicePort 9100 127.0.0.1:9100
HiddenServiceAuthorizeClient basic prometheus

and then get the auth cookie from /var/lib/tor/prometheus_node_exporter/hostname.

Nice! This looks like a really neat solution.

In IRC it also sounded like there was little-to-no authentication on the server that displays these metrics after scraping. Is that the case?

Trac:
Parent: N/A to #30152 (moved)

Add disk space monitoring for snowflake infrastructure

Child items ...

Activity