Opened 4 weeks ago

Last modified 45 hours ago

#29863 needs_review task

Add disk space monitoring for snowflake infrastructure

Reported by: cohosh
Owned by:
Priority: Medium
Milestone:
Component: Obfuscation/Snowflake
Version:
Severity: Normal
Keywords: snowflake
Cc: dcf, arlolra, cohosh, anarcat
Actual Points:
Parent ID: #30152
Points:
Reviewer:
Sponsor: Sponsor19

Description (last modified by cohosh)

We've run out of disk space at both the snowflake bridge (#26661, #28390) and the broker (#29861), which has caused snowflake to stop working. We've set up rotating and compressed logs, but it would be nice to have some disk space monitoring to alert us if/when this happens again.

Also, as discussed on IRC, we should eventually move the broker to a TPA machine.

Child Tickets

Change History (19)

comment:1 Changed 4 weeks ago by dcf

Description: modified (diff)

comment:2 Changed 4 weeks ago by anarcat

Cc: anarcat added

comment:3 Changed 4 weeks ago by dcf

On IRC, armadev and anarcat suggest using prometheus, which is being used to monitor torproject machines.

https://help.torproject.org/tsa/howto/prometheus/

If the Snowflake bridge and broker were hosted and managed by Tor, then I gather that we would get disk space monitoring (and other things) automatically through their common puppet deployment. But as they are currently separately managed, anarcat says we need to

  • apt install prometheus-node-exporter (from stretch-backports please)
  • provide the sysadmins with canonical hostnames

comment:4 Changed 4 weeks ago by anarcat

  • apt install prometheus-node-exporter (from stretch-backports please)

is actually:

apt install -t stretch-backports prometheus-node-exporter

... assuming that backports is already configured. if it isn't, a line like this in /etc/apt/sources.list.d/backports.debian.org.list should suffice:

deb	https://deb.debian.org/debian/	stretch-backports	main contrib non-free

... followed by an apt update, naturally.
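
putting it all together, the whole sequence would be something like this (a sketch, assuming backports isn't configured yet and you're running as root):

echo 'deb https://deb.debian.org/debian/ stretch-backports main contrib non-free' > /etc/apt/sources.list.d/backports.debian.org.list
apt update
apt install -t stretch-backports prometheus-node-exporter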

I believe the two hostnames to scrape are snowflake-broker.bamsoftware.com and snowflake.bamsoftware.com, correct me if that's wrong. So that second bullet point is probably already done. :)

The plan from the TPA side is to hook those in as part of a new node job in the scrape_configs in prometheus.yml, generated from Puppet in profile::prometheus::server.
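
roughly what such a node job could look like in prometheus.yml (a sketch for illustration only; the real config is generated from Puppet):

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'snowflake-broker.bamsoftware.com:9100'
          - 'snowflake.bamsoftware.com:9100'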

comment:5 Changed 4 weeks ago by anarcat

ah, and keep in mind that prometheus doesn't currently send alerts. it's planned, in the long term, to replace our current alerting system (nagios) with Prometheus once we're confident it doesn't misbehave and performs well, but this will take at least a few weeks.
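
for reference, once alerting does land, a disk space rule along these lines is roughly what this ticket is after (a sketch only; metric names as exported by node_exporter 0.17, threshold and duration picked arbitrarily):

groups:
  - name: disk
    rules:
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"} < 0.10
        for: 30m
        annotations:
          summary: "less than 10% disk space left on {{ $labels.instance }} ({{ $labels.mountpoint }})"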

comment:6 Changed 4 weeks ago by cohosh

Description: modified (diff)

comment:7 Changed 4 weeks ago by anarcat

the alerting project, that is, the move from Nagios to Prometheus, has officially been documented in #29864

comment:8 Changed 2 weeks ago by cohosh

This should now be set up on this end. I'm adding it only for the broker at the moment, just to see how it goes. I added a firewall rule to allow connections from the TPA prometheus machine (a rough sketch is at the end of this comment).

I believe the two hostnames to scrape are snowflake-broker.bamsoftware.com and snowflake.bamsoftware.com, correct me if that's wrong. So that second bullet point is probably already done. :)

Yup, that's correct. Right now it's just set up for snowflake-broker.
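
The firewall rule might look roughly like this (a sketch, assuming plain iptables; the actual rule and the scraper's source address may differ):

iptables -A INPUT -p tcp -s prometheus1.torproject.org --dport 9100 -j ACCEPT
iptables -A INPUT -p tcp --dport 9100 -j DROP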

comment:9 Changed 2 weeks ago by cohosh

Okay, taking a step back for a bit.

Right now the prometheus node exporter is running and its output can be accessed locally by visiting localhost:9100/metrics. I'm going to hold off on allowing prometheus1.torproject.org to access it until we are confident that the exported data won't introduce new side channels.

From a discussion with anarcat on IRC, it sounds like at the moment there is anonymous access to the prometheus server that shows the graphs, and the authorization for accessing it is very light. We should make sure we're comfortable with this before doing it.

comment:10 Changed 2 weeks ago by cohosh

Here is a summary of the features available: https://github.com/prometheus/node_exporter

comment:11 in reply to:  9 Changed 2 weeks ago by dcf

Replying to cohosh:

Right now the prometheus node exporter is running and its output can be accessed locally by visiting localhost:9100/metrics. I'm going to hold off on allowing prometheus1.torproject.org to access it until we are confident that the exported data won't introduce new side channels.

I did ssh -N -L 9100:127.0.0.1:9100 snowflake-broker and then browsed to http://127.0.0.1:9100/metrics. And yeah, looks like there is some potentially useful (to an attacker) stuff in there:

# HELP apt_upgrades_pending Apt package pending updates by origin.
# HELP node_boot_time_seconds Node boot time, in unixtime.
# HELP node_entropy_available_bits Bits of available entropy.
# HELP node_intr_total Total number of interrupts serviced.
# HELP node_uname_info Labeled system information as provided by the uname system call.
# HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
# HELP node_sockstat_TCP_inuse Number of TCP sockets in state inuse.

It doesn't look disastrous on its own, but I think you're right to be cautious.

It looks like this information is not encrypted in transit? Maybe we could expose it on an authenticated onion service instead?

comment:12 Changed 2 weeks ago by anarcat

that's right: endpoints are in cleartext. in general, Prometheus does not offer any form of encryption or authentication; that's delegated to reverse proxies or whatever you want to put in front:

https://prometheus.io/docs/operating/security/#authentication-authorization-and-encryption

so far we've avoided doing this on the exporters because of the sheer complexity of adding stunnel or tor or whatever to all those targets. it would be quite annoying, to say the least, to wrap probes in tor - i don't actually know if we could do that. but i suspect we *might* be able to scrape metrics off HTTPS remotes, i just haven't done that yet.
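
for illustration, scraping an exporter sitting behind a TLS-terminating reverse proxy with basic auth (one of the options in the security doc above) could look something like this on the scraping side (a sketch; hostname and credentials are placeholders):

scrape_configs:
  - job_name: 'node-https'
    scheme: https
    basic_auth:
      username: prometheus
      password: changeme
    static_configs:
      - targets: ['snowflake-broker.bamsoftware.com:9100']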

comment:13 Changed 11 days ago by cohosh

I edited /etc/init.d/prometheus-node-exporter and /etc/default/prometheus-node-exporter to add the following line:

ARGS="--no-collector.arp --no-collector.bcache --no-collector.bonding --no-collector.conntrack --no-collector.cpu --no-collector.edac --no-collector.entropy --no-collector.filefd --no-collector.hwmon --no-collector.infiniband --no-collector.ipvs --no-collector.loadavg --no-collector.mdadm --no-collector.meminfo --no-collector.netclass --no-collector.netdev --no-collector.netstat --no-collector.nfs --no-collector.nfsd --no-collector.sockstat --no-collector.stat --no-collector.textfile --no-collector.timex --no-collector.uname --no-collector.vmstat --no-collector.xfs --no-collector.zfs"

Unfortunately there's no --disable-all option or config file, so we have to disable all of the collectors individually except the ones that we want. The collectors listed here aren't quite what we got; the full list is in /etc/default/prometheus-node-exporter.

The above arguments will leave only the diskstats, filesystem, systemd, and time collectors enabled for now:

Apr  9 15:13:03 broker systemd[1]: Started Prometheus exporter for machine metrics.
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Starting node_exporter (version=0.17.0+ds, branch=debian/sid, revision=0.17.0+ds-2~bpo9+1)" source="node_exporter.go:156"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Build context (go=go1.10.5, user=pkg-go-maintainers@lists.alioth.debian.org, date=20181223-17:51:48)" source="node_exporter.go:157"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Enabled collectors:" source="node_exporter.go:97"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - diskstats" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - filesystem" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - systemd" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg=" - time" source="node_exporter.go:104"
Apr  9 15:13:03 broker prometheus-node-exporter[5469]: time="2019-04-09T15:13:03Z" level=info msg="Listening on :9100" source="node_exporter.go:170"
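
As a quick local sanity check (just a sketch), the filesystem collector's disk space metrics, which are what this ticket is ultimately about, should show up in the exporter's output:

curl -s http://127.0.0.1:9100/metrics | grep '^node_filesystem_avail_bytes'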

We might also want to enable collectors related to memory/CPU to make sure we're not straining the system too badly.

I think the biggest ones to stay away from are related to network data (like netstat).

I'll put this in needs_review and if we decide it's merge_ready we can allow the Tor Project prometheus scraper to access port 9100.

comment:14 Changed 11 days ago by cohosh

Status: new → needs_review

comment:15 Changed 9 days ago by dcf

I didn't necessarily mean to imply that we need to turn off a lot of the exported metrics; just that some of them are sensitive enough that we can't just expose port 9100 to everyone. It feels a little weird in this day and age to rely on IP address–based authentication, but personally I can live with it. I'm fine with going ahead with what you've set up, cohosh, and making incremental adjustments in the future if we decide they're necessary.

comment:16 in reply to:  12 Changed 9 days ago by dcf

Replying to anarcat:

it would be quite annoying, to say the least, to wrap probes in tor - i don't actually know if we could do that.

The way this would work is we would give you an onion name and an auth cookie. You put those in HidServAuth in torrc as

HidServAuth xxxxxxxxxxxxxxxx.onion authcookieauthcookie

Then, instead of configuring prometheus to fetch from http://snowflake.bamsoftware.com:9100/, you configure it to fetch from http://xxxxxxxxxxxxxxxx.onion:9100/ with a proxy_url of socks5://127.0.0.1:9050/.
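
In prometheus.yml terms, that scrape job could look something like this (a sketch; the onion name is a placeholder):

scrape_configs:
  - job_name: 'snowflake-broker-onion'
    proxy_url: 'socks5://127.0.0.1:9050/'
    static_configs:
      - targets: ['xxxxxxxxxxxxxxxx.onion:9100']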

On the server side, we would add HiddenServiceAuthorizeClient to torrc:

HiddenServiceDir /var/lib/tor/prometheus_node_exporter
HiddenServicePort 9100 127.0.0.1:9100
HiddenServiceAuthorizeClient basic prometheus

and then get the auth cookie from /var/lib/tor/prometheus_node_exporter/hostname.

comment:17 Changed 9 days ago by cohosh

Nice! This looks like a really neat solution.

In IRC it also sounded like there was little-to-no authentication on the server that displays these metrics after scraping. Is that the case?

comment:18 Changed 9 days ago by phw

Parent ID: #30152

comment:19 in reply to:  17 Changed 45 hours ago by cohosh

Replying to cohosh:

In IRC it also sounded like there was little-to-no authentication on the server that displays these metrics after scraping. Is that the case?

anarcat has opened ticket #30023 to deal with authentication on the grafana server. It's also worth noting that snowflake, as well as any other third-party services TPO decides to monitor, will be on the same server.

We can move the discussion on authentication to that ticket.
