Opened 6 weeks ago

Closed 2 weeks ago

Last modified 4 days ago

#29388 closed task (fixed)

Find out requirements for running Prometheus

Reported by: ln5 Owned by: anarcat
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #29389 Points:
Reviewer: Sponsor:

Description

Investigate disk storage and latency requirements, as well as RAM requirements, for running Prometheus.

Child Tickets

Attachments (1)

prometheus-2-0-stats.png (31.3 KB) - added by anarcat 2 weeks ago.


Change History (6)

comment:1 Changed 2 weeks ago by anarcat

Owner: changed from hiro to anarcat

I can research this. In particular, retention is a problem in Prometheus: it defaults to 15 days and we might want more, but that will require a *lot* of disk space. So I'll also research alternatives and downsampling mechanisms.
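
For reference, retention is set with a command-line flag; a minimal sketch, assuming a Prometheus 2.x install (the paths and the 90d value are placeholders, and depending on the exact 2.x release the flag is spelled --storage.tsdb.retention or --storage.tsdb.retention.time):

  prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/metrics \
    --storage.tsdb.retention.time=90d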

Last edited 2 weeks ago by anarcat

comment:2 Changed 2 weeks ago by anarcat

Parent ID: #29389

Changed 2 weeks ago by anarcat

Attachment: prometheus-2-0-stats.png added

comment:3 Changed 2 weeks ago by anarcat

Okay, I had an interesting conversation with folks in #prometheus on freenode about the topic of downsampling. The Prometheus folks argue that downsampling is not necessary because the TSDB (time-series database) compresses samples very efficiently: apparently, the "worst-case compression is 1.3 bytes per sample", which means that, for a year of samples taken every minute, you get:

> 1.3byte/minute * year

  (1.3 * (byte / minute)) * year = 683.748 kilobytes

... that is, roughly 683KB per metric per year. A typical "node exporter" host exposes about 2500 metrics; multiplied by our current ~80-host setup, that means an entire year of samples would take up 136GB (127GiB):

> 1.3byte/minute * year * 2500 * 80

  (1.3 * (byte / minute)) * year * 2500 * 80 = 136.7496 gigabytes

> 1.3byte/minute * year * 2500 * 80 to Gibyte

  (1.3 * (byte / minute)) * year * 2500 * 80 = approx. 127.35799 gibibytes
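
To double-check that arithmetic, here is the same calculation as a quick Python sketch (1.3 bytes/sample is the worst-case figure quoted above; 2500 metrics per host and 80 hosts are the same assumptions):

  # worst-case TSDB storage for one year of samples taken every minute
  bytes_per_sample = 1.3                # worst-case compression quoted above
  samples_per_year = 365.25 * 24 * 60   # one sample per minute, Julian year
  metrics_per_host = 2500               # typical node exporter target
  hosts = 80                            # current fleet size

  per_metric = bytes_per_sample * samples_per_year   # ~683.7 kB per metric per year
  total = per_metric * metrics_per_host * hosts
  print(per_metric / 1e3, "kB per metric per year")
  print(total / 1e9, "GB,", total / 2**30, "GiB")    # ~136.7 GB, ~127.4 GiB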

This is actually not bad at all in terms of total amount of storage. The problem they identified is more the performance impact of historical queries: one of them (SuperQ) said that queries can take as much as 20-30s when the cache is cold and ~500ms when it's hot. But I guess this is something that can also be figured out later.

Still, the disk usage difference with Munin is kind of dramatic. Here's the disk usage on my home server, running with three targets. You can see Prometheus (yellow line) slowly filling the disk up to its retention limit (~300 days, using around 12GiB) while Munin (green line) stays steady at 380MB.

[graph: prometheus-2-0-stats.png, attached above]

That's with about four targets. If we extrapolate that to Tor's setup with 80 targets, it would give us 240GiB of disk usage, about double the above estimate. That might be related to the fact that I didn't change the sample rate: I stuck to the default 15-second scrape interval, while the above calculations assumed 60-second intervals. Scaled down to my four targets and up to that 15-second rate, the estimate would predict around 25GiB of disk used (127GiB/20 * 4) rather than the 12GiB I'm actually seeing, which goes to show Prometheus is actually pretty good at compressing those samples.
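
Spelled out, the extrapolation and the (127GiB/20 * 4) expression above look like this (a rough sketch using the numbers from my home server):

  # extrapolating the home-server observation to the full fleet
  observed_gib = 12           # home server: ~4 targets, 15s scrape, ~300 days retained
  home_targets = 4
  fleet_targets = 80
  yearly_estimate_gib = 127   # the 60s-interval, 80-host estimate computed above

  extrapolated = observed_gib * fleet_targets / home_targets                   # ~240 GiB
  expected_at_home = yearly_estimate_gib / (fleet_targets / home_targets) * 4  # ~25 GiB at 15s scraping
  print(extrapolated, "GiB if the whole fleet behaved like my home server")
  print(round(expected_at_home, 1), "GiB expected at home vs", observed_gib, "GiB observed")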

Server memory would also go a long way in generating responsive graphs: if the "hot" part of the database can be cached in memory, queries will run much faster. Hopefully we'll rarely run queries over an entire year and won't need hundreds of GiB of memory.

Now, this brings us back to downsampling: if we *do* want to have year-long queries happening frequently, then we'll stumble upon those slow queries from time to time, so we'll need to find a solution to that problem, which, for the record, was determined to be out of scope by the Prometheus team.

Traditionally, the solution in Prometheus land is federation: simply have a second server that scrapes the first at a different sampling frequency. So you can have a first server that scrapes every target every 15 seconds and keeps two weeks of samples, a second server that scrapes the first every day, a third that scrapes the second every month, and so on. This complicates the setup, as it requires multiple servers to be set up, but it also means there are now multiple data sources to query. Grafana *does* support multiple data sources, but it makes panels more complicated, and most won't work out of the box with such a setup.
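
For illustration, the scrape job on such a second-level server would look roughly like this (a sketch only: the hostname and the match[] selector are made up, and as far as I understand scrape intervals can't actually be stretched to a day or a month because of Prometheus' ~5-minute staleness window, so the longer tiers above are more of a thought experiment):

  scrape_configs:
    - job_name: 'federate'
      scrape_interval: 1m        # lower resolution than the 15s on the first-level server
      honor_labels: true         # keep the labels as set by the first-level server
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job="node"}'       # which series to pull; selector made up for this example
      static_configs:
        - targets:
            - 'prometheus1.example.org:9090'   # hypothetical first-level server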

Others have come up with solutions for this:

  • Digital Ocean wrote a tool called Vulcan for this (now abandoned)
  • some Prometheus folks started the Thanos project (good introduction) that builds on top of Prometheus to enable downsampling
  • another project that was mentioned in the chat is Cortex, a "multitenant, horizontally scalable Prometheus as a Service" that seems to be specifically designed for Kubernetes
  • finally, there's a tool called sop (introduction) that can extract samples from a Prometheus instance and archive them into "something else" like OpenTSDB, InfluxDB, or another Prometheus server, after downsampling
  • update: there's also a tool called Gnocchi that might be worth investigating

All those solutions add more complexity to a tool that's already not very familiar to the team, so I would suggest we first deploy Prometheus with its default retention period (15 days) and scrape interval (15 seconds) and see what that brings us. This would give us a fairly reasonable 20GiB of disk usage to start with:

> (1.3byte/(15s)) * 15 d * 2500 * 80  to Gibyte

  ((1.3 * byte) / (15 * second)) * (15 * day) * 2500 * 80 =
  approx. 20.92123 gibibytes

Obviously, this would be extremely fast if it all lived in memory, but I think we could also get away with a 1:10 (2GB) or 1:5 (4GB) memory-to-disk ratio.
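
Checking those numbers (the 1:10 and 1:5 memory-to-disk ratios are my own rule of thumb, not an upstream guideline):

  # 15 days of retention at a 15-second scrape interval, worst-case compression
  bytes_per_sample = 1.3
  samples = 15 * 24 * 3600 / 15      # 86400 samples per series over 15 days
  series = 2500 * 80                 # metrics per host * hosts

  disk_gib = bytes_per_sample * samples * series / 2**30
  print(round(disk_gib, 1), "GiB of disk")                   # ~20.9 GiB
  print(round(disk_gib / 10, 1), "GiB of RAM at 1:10")       # ~2 GiB
  print(round(disk_gib / 5, 1), "GiB of RAM at 1:5")         # ~4 GiB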

So, long story short, we should expect Prometheus to use:

  • 2GB of RAM
  • 30GB of disk

to start with, with disk usage possibly growing much more (~10x) over time. Latency is of course critical, so it would be preferable to run this on SSD drives at least.

Last edited 4 days ago by anarcat

comment:4 Changed 2 weeks ago by anarcat

Also note that there's an entirely different discussion to be had regarding alerting or, more broadly, "how do we phase out Nagios", which I'm keeping out of this ticket. Let's treat Prometheus as a Munin replacement for now and deal with the broader monitoring ideas later.

comment:5 Changed 2 weeks ago by anarcat

Resolution: fixed
Status: assigned → closed