Opened 5 months ago

Closed 3 weeks ago

#31244 closed enhancement (fixed)

long term prometheus metrics

Reported by: anarcat Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

data retention on the primary prometheus has been expanded to 30 days, which is nice, but that's not enough. create another (a third, technically, but a second in this cluster) prometheus server that would scrape *all* metrics off the *first* server, but at a different sampling rate so we can keep metrics for a longer, possibly multi-year timeline.

Review the storage requirements math in #29388 and compare with reality.

This, obviously, is a followup to the general prometheus setup ticket in #29389.

Child Tickets

Change History (10)

comment:1 Changed 3 months ago by anarcat

in #29388, i said:

> (1.3byte/(15s)) * 15 d * 2500 * 80  to Gibyte

  ((1.3 * byte) / (15 * second)) * (15 * day) * 2500 * 80 =
  approx. 20.92123 gibibytes

If we expand this to 30d (the current retention policy), we get:

> 30d×1.3byte/(15s)×2500×80 to Gibyte

  (((30 * day) * (1.3 * byte)) / (15 * second)) * 2500 * 80 = approx. 41.842461 gibibytes

In other words, the current server should take about 40 GiB of storage. It's actually taking much less:

21G	/var/lib/prometheus/metrics2/

There are a few reasons for this:

  1. we don't have 2500 metrics, we have 1289
  2. we don't have 80 hosts, we have 75
  3. each host doesn't necessarily expose all metrics

Regardless of 3, stripping down to 1300 metrics over 75 hosts gives an estimate that actually matches the current consumption, more or less:

> 30d×1.3byte/(15s)×1300×75 to Gibyte

  (((30 * day) * (1.3 * byte)) / (15 * second)) * 1300 * 75 = approx. 20.3982 gibibytes
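
For reference, here's the same back-of-the-envelope estimate as a small Python sketch (purely illustrative; it just encodes the ~1.3 bytes/sample figure used above):

  # rough Prometheus storage estimate, assuming ~1.3 bytes per sample
  BYTES_PER_SAMPLE = 1.3

  def estimate_gib(interval_s, retention_days, metrics, hosts):
      """retention / scrape interval = samples per series; multiply by
      bytes per sample and by the number of series (metrics * hosts)"""
      samples = retention_days * 86400 / interval_s
      return samples * BYTES_PER_SAMPLE * metrics * hosts / 2**30

  print(estimate_gib(15, 30, 2500, 80))  # ~41.8 GiB, the naive estimate
  print(estimate_gib(15, 30, 1300, 75))  # ~20.4 GiB, matching actual usage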

So let's play with those schedules a bit. Here's the same data, but with hourly pulls for a year:

> 365d×1.3byte/(1h)×1300×75 to Gibyte

  (((365 * day) * (1.3 * byte)) / (1 * hour)) * 1300 * 75 = approx. 1.0340754 gibibytes

Holy macaroni! Only 1GB! We could keep 20 years of data with this!

Let's see 15-minute increments:

> 365d×1.3byte/(15min)×1300×75 to Gibyte

  (((365 * day) * (1.3 * byte)) / (15 * minute)) * 1300 * 75 = approx. 4.1363016 gibibytes

Still very reasonable! And a 5-minute frequency will, of course, give us:

> 365d×1.3byte/(5min)×1300×75 to Gibyte

  (((365 * day) * (1.3 * byte)) / (5 * minute)) * 1300 * 75 = approx. 12.408905 gibibytes

So, basically, we have this:

Frequency   Retention period   Storage used
15 s        30 days             20 GiB
5 min       10 years           120 GiB
5 min       5 years             60 GiB
5 min       1 year              12 GiB
15 min      10 years            40 GiB
15 min      5 years             20 GiB
15 min      1 year               4 GiB
1 hour      10 years            10 GiB
1 hour      5 years              5 GiB
1 hour      1 year               1 GiB
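
That table can be regenerated with a quick loop under the same assumptions (1.3 bytes/sample, 1300 metrics, 75 hosts); the table above just rounds to friendlier numbers:

  # regenerate the frequency/retention table, assuming 1.3 bytes/sample
  # and the 1300 metrics x 75 hosts estimated above
  for interval_s, label in [(300, "5 min"), (900, "15 min"), (3600, "1 hour")]:
      for years in (10, 5, 1):
          samples = years * 365 * 86400 / interval_s
          gib = samples * 1.3 * 1300 * 75 / 2**30
          print(f"{label:7} {years:2} year(s) {gib:7.1f} GiB")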

So how long do we want to keep that stuff anyways? I like the 15-minute/5-year plan personally (20 GiB), although I *also* like the idea of just shoving samples every 5 minutes like we were doing with Munin, which gives us 12 GiB over a year, or 60 GiB over five years...

Thoughts?

comment:2 Changed 3 months ago by anarcat

so far, nbg1 is struggling a bit with the 20-30 GiB of storage, so i'm tempted to downgrade to 15-minute samples. but maybe we can start with 5min/1yr (12 GiB) and see how that goes for now?

comment:3 Changed 3 months ago by anarcat

if/when this gets created, check ticket #31781 while we're here.

comment:4 Changed 3 months ago by anarcat

i originally thought of pushing this out faster to remove the load on the original prometheus server, but it seems the problem with that one is not necessarily due to load but more to network issues; see #31916 for details.

comment:5 Changed 8 weeks ago by anarcat

Owner: changed from tpa to anarcat
Status: new → assigned

i've decided to postpone the creation of a secondary server and instead change the retention period on the current server to see if it fixes the reliability issues detailed in #31916. if, in 30 days, we still have this problem, then we can set up a secondary to see if we can reproduce the problem there. after all, we don't need a redundant setup as long as we don't do alerting, for which we still use nagios (#29864). see also the commit log for more details:

origin/master 7cda3928fe9c6bf83ee3e8977b74d58acbb7519a
Author:     Antoine Beaupré <anarcat@debian.org>
AuthorDate: Tue Oct 22 13:46:05 2019 -0400
Commit:     Antoine Beaupré <anarcat@debian.org>
CommitDate: Tue Oct 22 13:46:05 2019 -0400

Parent:     91e379a5 make all mpm_worker paramaters configurable
Merged:     master sudo-ldap
Contained:  master

downgrade scrape interval on internal prometheus server (#31916)

This is an attempt at fixing the reliability issues on the prometheus
server detailed in #31916. The current theory is that ipsec might be
the culprit, but it's also possible that the prometheus is overloaded
and that's creating all sorts of other, unrelated problems.

This is sidetracking the setup of a *separate* long term monitoring
server (#31244), of course, but I'm not sure that's really necessary
for now. Since we don't use prometheus for alerting (#29864), we don't
absolutely /need/ redundancy here so we can afford a SPOF for
Prometheus while we figure out this bug.

If, in thirty days, we still have reliability problems, we will know
this is not due to the retention period and can cycle back to the
other solutions, including creating a secondary server to see if it
reproduces the problem.

1 file changed, 2 insertions(+), 1 deletion(-)
modules/profile/manifests/prometheus/server/internal.pp | 3 ++-

modified   modules/profile/manifests/prometheus/server/internal.pp
@@ -42,7 +42,8 @@ class profile::prometheus::server::internal (
     vhost_name          => $vhost_name,
     collect_scrape_jobs => $collect_scrape_jobs,
     scrape_configs      => $scrape_configs,
-    storage_retention   => '30d',
+    storage_retention   => '365d',
+    scrape_interval     => '5m',
   }
   # expose our IP address to exporters so they can allow us in
   #

comment:6 Changed 8 weeks ago by anarcat

Status: assigned → needs_information

comment:7 Changed 7 weeks ago by anarcat

it has not been 30 days yet, but we are still seeing problems and, worse, all rate graphs have been broken by this change. Because rates are usually computed over 5-minute windows and rate() needs at least two samples in the window, the 5-minute scrape interval leaves Prometheus with nothing to return and Grafana spews out "no data" on critical graphs like network usage. I've pushed this back to a 1m scrape interval but kept the 365d retention. This might eventually cause storage problems, but we have plenty of time to deal with those and I need those graphs back online ASAP.
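
For the record, the arithmetic behind the breakage is simple: rate() needs at least two points inside its range window, and a 5m scrape interval leaves a typical 5m window with at most one. A rough sketch (ignoring scrape timing jitter):

  # samples that land in a rate() window for a given scrape interval;
  # rate() needs at least 2 points in the window to return anything
  WINDOW_S = 300  # a typical rate(...[5m]) window
  for scrape_interval_s in (15, 60, 300):
      n = WINDOW_S // scrape_interval_s
      print(f"{scrape_interval_s:4}s scrapes: {n:2} samples per 5m window ->",
            "ok" if n >= 2 else "no data")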

comment:8 Changed 3 weeks ago by anarcat

nbg1 has hit 90% disk usage recently (hit 80% on nov. 6th):

/dev/sda1           38G     33G  3.8G  90% /

I've tweaked the reserved space down to 1% to give us some extra room:

root@hetzner-nbg1-01:~# df -h /
Filesystem        Size  Used Avail Use% Mounted on
/dev/sda1          38G   33G  4.9G  87% /

but this is definitely not going to work in the long term. we'll need to give this server more space one way or another; basically, the disk would need to double to keep the year of samples we want.
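
For scale, redoing the earlier back-of-the-envelope math with the current settings (1m scrapes, 365d retention) and the same assumptions as comment:1 (~1.3 bytes/sample, 1300 metrics × 75 hosts) lands around 60 GiB of TSDB data alone, before the OS and everything else:

  # expected TSDB size at the current settings: 1m scrapes kept for 365 days,
  # assuming ~1.3 bytes/sample and 1300 metrics x 75 hosts as in comment:1
  samples = 365 * 86400 / 60               # samples per series over a year
  gib = samples * 1.3 * 1300 * 75 / 2**30
  print(f"~{gib:.0f} GiB")                 # roughly 62 GiB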

comment:9 Changed 3 weeks ago by anarcat

doubling the disk space (from 40 GB to 80 GB) would double the price, but it can easily be done in the rescale menu. all that's needed is a reboot. i guess i'll add that to the agenda for monday.

comment:10 Changed 3 weeks ago by anarcat

Resolution: fixed
Status: needs_information → closed

i grew the server to 80GB, which resolves this for a year at least. we'll see how it goes from here.
