Opened 8 weeks ago

Last modified 5 days ago

#31244 new enhancement

long term prometheus metrics

Reported by:  anarcat                                  Owned by:       tpa
Priority:     Medium                                   Milestone:
Component:    Internal Services/Tor Sysadmin Team      Version:
Severity:     Normal                                   Keywords:
Cc:                                                    Actual Points:
Parent ID:                                             Points:
Reviewer:                                              Sponsor:

Description

Data retention on the primary Prometheus server has been expanded to 30 days, which is nice, but that's not enough. Create another Prometheus server (a third, technically, but a second in this cluster) that would scrape *all* metrics off the *first* server, but at a coarser sampling rate, so we can keep metrics over a longer, possibly multi-year timeline.

Review the storage requirements math in #29388 and compare with reality.

This, obviously, is a followup to the general prometheus setup ticket in #29389.
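
What is described above is essentially Prometheus federation: the long-term server scrapes the primary's /federate endpoint at a coarser interval. As a rough sketch only (not a tested config; the target address and the 15-minute interval are placeholders, not decisions), the long-term server's scrape configuration could look something like:

  # prometheus.yml on the long-term (second) server -- a sketch only
  scrape_configs:
    - job_name: 'federate'
      scrape_interval: 15m        # coarser than the primary's 15s scrape
      honor_labels: true          # keep labels as the primary recorded them
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job=~".+"}'         # pull *all* series off the primary
      static_configs:
        - targets:
            - 'prometheus.example.org:9090'  # placeholder for the primary's address

The longer retention itself would then be set on that instance with --storage.tsdb.retention.time (available in Prometheus 2.8 and later).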

Child Tickets

Change History (1)

comment:1 Changed 5 days ago by anarcat

In #29388, I said:

> (1.3byte/(15s)) * 15 d * 2500 * 80  to Gibyte

  ((1.3 * byte) / (15 * second)) * (15 * day) * 2500 * 80 =
  approx. 20.92123 gibibytes

If we expand this to 30d (the current retention policy), we get:

> 30d×1.3byte/(15s)×2500×80 to Gibyte

  (((30 * day) * (1.3 * byte)) / (15 * second)) * 2500 * 80 = approx. 41.842461 gibibytes

In other words, the current server should take about 40 GiB of storage. It's actually taking much less:

21G	/var/lib/prometheus/metrics2/

There are a few reasons for this:

  1. we don't have 2500 metrics, we have 1289
  2. we don't have 80 hosts, we have 75
  3. each host doesn't necessarily expose all metrics

Even ignoring point 3, scaling down to 1300 metrics over 75 hosts gives an estimate that more or less matches the current consumption:

> 30d×1.3byte/(15s)×1300×75 to Gibyte

  (((30 * day) * (1.3 * byte)) / (15 * second)) * 1300 * 75 = approx. 20.3982 gibibytes
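
The rule of thumb behind all of these estimates is: (retention / scrape interval) samples per series, times roughly 1.3 bytes per sample, times the number of series (metrics per host times hosts). As a sanity check, here is the same arithmetic as a small Python snippet (the 1.3 bytes/sample figure is the rule of thumb above, not a measured constant):

  def tsdb_size_bytes(retention_days, interval_seconds, metrics_per_host, hosts,
                      bytes_per_sample=1.3):
      """Rough TSDB size: samples per series * bytes per sample * number of series."""
      samples_per_series = retention_days * 86400 / interval_seconds
      return samples_per_series * bytes_per_sample * metrics_per_host * hosts

  GIB = 2 ** 30
  # 30 days at 15s, 1300 metrics over 75 hosts: ~20.4 GiB, close to the 21G on disk
  print(tsdb_size_bytes(30, 15, 1300, 75) / GIB)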

So let's play with those schedules a bit. Here's the same data, but with hourly pulls for a year:

> 365d×1.3byte/(1h)×1300×75 to Gibyte

  (((365 * day) * (1.3 * byte)) / (1 * hour)) * 1300 * 75 = approx. 1.0340754 gibibytes

Holy macaroni! Only 1 GiB! We could keep 20 years of data with this!

Let's look at 15-minute increments:

> 365d×1.3byte/(15min)×1300×75 to Gibyte

  (((365 * day) * (1.3 * byte)) / (15 * minute)) * 1300 * 75 = approx. 4.1363016 gibibytes

Still very reasonable! And a 5-minute frequency will, of course, give us:

> 365d×1.3byte/(5min)×1300×75 to Gibyte

  (((365 * day) * (1.3 * byte)) / (5 * minute)) * 1300 * 75 = approx. 12.408905 gibibytes

So, basically, we have this:

Frequency    Retention period   Storage used
15 seconds   30 days            20 GiB
5 minutes    10 years           120 GiB
5 minutes    5 years            60 GiB
5 minutes    1 year             12 GiB
15 minutes   10 years           40 GiB
15 minutes   5 years            20 GiB
15 minutes   1 year             4 GiB
1 hour       10 years           10 GiB
1 hour       5 years            5 GiB
1 hour       1 year             1 GiB
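
For reference, the table can be reproduced with the same rule of thumb; a quick sketch, again assuming 1300 metrics over 75 hosts:

  GIB = 2 ** 30
  BYTES_PER_SAMPLE = 1.3   # rule-of-thumb bytes per sample
  SERIES = 1300 * 75       # ~metrics per host * number of hosts

  schedules = [  # (label, scrape interval in seconds, retention in days)
      ("15 seconds", 15, 30),
      ("5 minutes", 300, 10 * 365), ("5 minutes", 300, 5 * 365), ("5 minutes", 300, 365),
      ("15 minutes", 900, 10 * 365), ("15 minutes", 900, 5 * 365), ("15 minutes", 900, 365),
      ("1 hour", 3600, 10 * 365), ("1 hour", 3600, 5 * 365), ("1 hour", 3600, 365),
  ]
  for label, interval_s, days in schedules:
      size = days * 86400 / interval_s * BYTES_PER_SAMPLE * SERIES
      print(f"{label:<11} {days:>5} days  {size / GIB:6.1f} GiB")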

So how long do we want to keep that stuff anyways? I personally like the 15-minute / 5-year plan (20 GiB), although I *also* like the idea of just taking samples every 5 minutes like we were doing with Munin, which gives us 12 GiB per year, or 60 GiB over five years...

Thoughts?
