We've run out of disk space at both the snowflake bridge (#26661 (moved), #28390 (moved)) and the broker (#29861 (moved)), which has caused Snowflake to stop working. We've set up rotating and compressed logs, but it would be nice to have some disk space monitoring to alert us if/when this happens again.
Also, as discussed on IRC, we should eventually move the broker to a TPA machine.
On IRC, armadev and anarcat suggest using prometheus, which is being used to monitor torproject machines.
https://help.torproject.org/tsa/howto/prometheus/
If the Snowflake bridge and broker were hosted and managed by Tor, then I gather that we would get disk space monitoring (and other things) automatically through their common puppet deployment. But as they are currently separately managed, anarcat says we need to:
... assuming that backports is already configured. if it isn't, such a line in /etc/apt/sources.list.d/backports.debian.org.list should suffice:
    deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
... followed by an apt update, naturally.
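Putting those pieces together, a minimal sketch of the whole step on the bridge/broker might look like the following. The only parts confirmed by this ticket are the stretch-backports line and the prometheus-node-exporter package (whose defaults file comes up later); the rest is just standard apt usage.

    # Add stretch-backports if it is not already configured
    echo 'deb https://deb.debian.org/debian/ stretch-backports main contrib non-free' \
        | sudo tee /etc/apt/sources.list.d/backports.debian.org.list

    # Refresh the package lists, then pull the node exporter from backports
    sudo apt update
    sudo apt install -t stretch-backports prometheus-node-exporter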
I believe the two hostnames to scrape are snowflake-broker.bamsoftware.com and snowflake.bamsoftware.com; correct me if that's wrong. So that second bullet point is probably already done. :)
The plan from TPA side is to hook those as part of a new nodejob in the scrape_configs in prometheus.yml, from Puppet, in profile::prometheus::server.
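For anyone unfamiliar with what that produces, here is a rough sketch of the kind of job entry that would end up in prometheus.yml. The job name and layout are assumptions on my part (the real thing is generated by Puppet); only the hostnames and the node exporter's port 9100 come from this ticket.

    # Illustrative scrape job only; the actual config is generated by Puppet
    # in profile::prometheus::server, so names and structure may differ.
    scrape_configs:
      - job_name: 'snowflake'
        static_configs:
          - targets:
              - 'snowflake-broker.bamsoftware.com:9100'
              - 'snowflake.bamsoftware.com:9100'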
ah, and keep in mind that prometheus doesn't currently send alerts although it's planned, in the long term, to replace our current alerting system (nagios) with Prometheus, once we're confident it doesn't misbehave and performs well. this will take at least a few weeks.
This should be set up on this end. I'm adding it only for the broker at the moment, just to see how it goes. I added a firewall rule to allow connections from the TPA prometheus machine.
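For illustration only (the actual rule on the broker isn't reproduced in this ticket, and the Prometheus server's address below is a placeholder), the rule is conceptually something like:

    # Allow only the TPA Prometheus server to reach the node exporter on 9100,
    # and refuse the port to everyone else. Placeholder address, not the real one.
    iptables -A INPUT -p tcp -s <prometheus1-address> --dport 9100 -j ACCEPT
    iptables -A INPUT -p tcp --dport 9100 -j DROP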
I believe the two hostnames to scrape are snowflake-broker.bamsoftware.com and snowflake.bamsoftware.com; correct me if that's wrong. So that second bullet point is probably already done. :)
Yup, that's correct. Right now it's just set up for snowflake-broker.
Right now prometheus is running and the output can be accessed by locally visiting localhost:9100/metrics. I'm going to hold off on allowing prometheus1.torproject.org to access it until we are confident that the data exported won't introduce new side channels.
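A quick local spot check, in case it's useful to anyone following along (nothing assumed here beyond the port already mentioned):

    # On the broker itself: should print the first node_* metric lines
    curl -s http://localhost:9100/metrics | head -n 20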
In a discussion with anarcat in IRC, it looks like at the moment there is anonymous access to the prometheus server that shows the graphs, and the authorization for accessing this is very light. We should make sure we're comfortable with this before doing it.
Right now prometheus is running and the output can be accessed by locally visiting localhost:9100/metrics. I'm going to hold off on allowing prometheus1.torproject.org to access it until we are confident that the data exported won't introduce new side channels.
I did ssh -N -L 9100:127.0.0.1:9100 snowflake-broker and then browsed to http://127.0.0.1:9100/metrics. And yeah, looks like there is some potentially useful (to an attacker) stuff in there:
    # HELP apt_upgrades_pending Apt package pending updates by origin.
    # HELP node_boot_time_seconds Node boot time, in unixtime.
    # HELP node_entropy_available_bits Bits of available entropy.
    # HELP node_intr_total Total number of interrupts serviced.
    # HELP node_uname_info Labeled system information as provided by the uname system call.
    # HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
    # HELP node_sockstat_TCP_inuse Number of TCP sockets in state inuse.
It doesn't look disastrous on its own, but I think you're right to be cautious.
It looks like this information is not encrypted in transit? Maybe we could expose it on an authenticated onion service instead?
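As a sketch of that idea (the directory name below is made up, and the client-authorization step is only summarized, so treat this as a possibility rather than a plan), the broker's torrc could expose the exporter only as an onion service:

    # torrc excerpt -- illustrative only, not an existing setup.
    HiddenServiceDir /var/lib/tor/prometheus-exporter/
    HiddenServicePort 9100 127.0.0.1:9100
    # For a v3 onion service, client authorization means dropping one .auth file
    # per authorized client (here, the Prometheus host) into
    # /var/lib/tor/prometheus-exporter/authorized_clients/ so that nobody else
    # can even reach the service.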
that's right: endpoints are in cleartext. in general, Prometheus does not offer any form of encryption or authentication, that's delegated to reverse proxies or whatever you want to put in front:
so far we've avoided doing this on the exporters because of the sheer complexity of adding stunnel or tor or whatever to all those targets. it would be quite annoying, to say the least, to wrap probes in tor - i don't actually know if we could do that. but i suspect we might be able to scrape metrics off HTTPS remotes, i just haven't done that yet.
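If scraping HTTPS remotes does pan out, the relevant per-job knob in Prometheus is the scheme field. A hypothetical job (placeholder target, and it presumes a TLS-terminating reverse proxy in front of the exporter) would look like:

    # Hypothetical HTTPS scrape job; TLS would be handled by a reverse proxy
    # sitting in front of the node exporter, not by the exporter itself.
    - job_name: 'snowflake-https'
      scheme: https
      static_configs:
        - targets: ['snowflake-broker.bamsoftware.com:443']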
Unfortunately there's no --disable-all option or config file, so we have to disable all of the collectors individually except the ones that we want. The collectors listed here aren't quite what we got; the full list is in /etc/default/prometheus-node-exporter.
The resulting arguments leave only the diskstats, filesystem, systemd, and time collectors enabled for now:
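The exact flag list isn't reproduced in this ticket, but as a rough, abbreviated sketch (collector names follow node_exporter's --collector.<name> / --no-collector.<name> convention, and most of the flags are omitted here), the defaults file ends up looking something like:

    # /etc/default/prometheus-node-exporter -- abbreviated illustration only.
    # systemd is not a default collector, so it is enabled explicitly; every
    # other unwanted default collector gets a --no-collector flag (only a few
    # of those flags are shown here).
    ARGS="--collector.systemd --no-collector.cpu --no-collector.meminfo \
          --no-collector.netdev --no-collector.loadavg --no-collector.uname"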
I didn't necessarily mean to imply that we need to turn off a lot of the exported metrics; just that some of them are sensitive enough that we can't just expose port 9100 to everyone. It feels a little weird in this day and age to rely on IP address–based authentication, but personally I can live with it. I'm fine with going ahead with what you've set up, cohosh, and making incremental adjustments in the future if we decide they're necessary.