Opened 7 months ago

Closed 5 months ago

Last modified 5 months ago

#29410 closed task (fixed)

Can Prometheus help with multiple checks turning into one single alarm?

Reported by: ln5 Owned by: tpa
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #29681 Points:
Reviewer: Sponsor:

Description

This question came up when discussing doing more checks of services over IPv6.

Change History (3)

comment:1 Changed 6 months ago by anarcat

it depends what you mean by "multiple checks" or "duplicates". prom's alerting system is designed to be highly available (HA), so even if multiple alerting nodes fire the same alert, the user will receive only one notification (thanks to $magic).
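
To make the "$magic" a bit more concrete: the usual idea is that an alert is identified by its label set, so identical alerts arriving from redundant servers collapse into one, while an alert with even one different label counts as a separate alert. The following is only a toy sketch of that idea in Python, not Alertmanager's actual code; the `fingerprint` and `deduplicate` helpers and the label names are assumptions made up for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Alert = Dict[str, str]  # in this sketch an alert is just its label set


def fingerprint(alert: Alert) -> FrozenSet[Tuple[str, str]]:
    """Identity of an alert: its full label set, regardless of who sent it."""
    return frozenset(alert.items())


def deduplicate(received: Iterable[Alert]) -> List[Alert]:
    """Keep the first copy of each distinct label set, preserving order."""
    seen = set()
    unique = []
    for alert in received:
        key = fingerprint(alert)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


# Two hypothetical replicas evaluate the same rule and send the same alert.
from_replica_1 = {"alertname": "HTTPDown", "instance": "tpo"}
from_replica_2 = {"alertname": "HTTPDown", "instance": "tpo"}
# A slightly different check carries a different label set, so it is kept.
ipv6_check = {"alertname": "HTTPDown", "instance": "tpo", "family": "ipv6"}

notifications = deduplicate([from_replica_1, from_replica_2, ipv6_check])
print(len(notifications))
```

The two replica copies collapse into one entry, but the IPv6 variant survives as its own alert.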

but i suspect that if you have multiple slightly different checks, it *will* send multiple alerts. the key here, in my experience with nagios anyway, is to set up alerting dependencies so that if service A (e.g. http://tpo) fails because of service B (e.g. ICMP to tpo *or* ICMPv6 to tpo), you only get warned once (e.g. for service B). but it's tricky to set up, easy to get wrong, and i'm not sure prometheus supports such dependency chains.
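
As a rough illustration of the dependency idea described above, here is a minimal Python sketch that drops an alert when a check it depends on is already firing for the same host. It is not how nagios or prometheus implement this; the `suppress_dependents` function, the `DEPENDS_ON` map and the `host`/`check` keys are all hypothetical names chosen for the example.

```python
from typing import Dict, List

Alert = Dict[str, str]


def suppress_dependents(firing: List[Alert],
                        depends_on: Dict[str, List[str]]) -> List[Alert]:
    """Drop alerts whose check depends on another check firing on the same host."""
    # Index the currently firing checks by host.
    firing_by_host: Dict[str, set] = {}
    for alert in firing:
        firing_by_host.setdefault(alert["host"], set()).add(alert["check"])

    kept = []
    for alert in firing:
        parents = depends_on.get(alert["check"], [])
        active = firing_by_host[alert["host"]]
        if any(parent in active for parent in parents):
            continue  # a lower-level failure already explains this alert
        kept.append(alert)
    return kept


# Hypothetical dependency map: the HTTP check depends on either ping check.
DEPENDS_ON = {"http": ["icmp", "icmpv6"]}

firing = [
    {"host": "tpo", "check": "icmpv6"},
    {"host": "tpo", "check": "http"},
]
remaining = suppress_dependents(firing, DEPENDS_ON)
print(remaining)
```

Only the ICMPv6 alert remains, which matches the "warned once, for service B" behaviour. On the prometheus side, Alertmanager's inhibition rules (`inhibit_rules` in its configuration) look like the closest built-in equivalent, though whether they cover full dependency chains is not something this sketch settles.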

I found prometheus to be somewhat lacking in monitoring: the HA design is good, but there's nothing like the vast diversity of service checks that nagios has. in particular, you can set thresholds and there *are* many plugins to monitor a lot of things, but they don't come with predefined limits, like "90% disk usage or load of NCPU*2 is WARNING", so you need to define those on your own.
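
To show what "define those on your own" amounts to, here is a small Python sketch of the two example limits quoted above (90% disk usage, load of NCPU*2). The function names and default values are assumptions for illustration only; in prometheus itself such limits would be written as alerting rule expressions rather than Python.

```python
def disk_state(used_fraction: float, warn_at: float = 0.90) -> str:
    """Return WARNING once disk usage reaches the given fraction (default 90%)."""
    return "WARNING" if used_fraction >= warn_at else "OK"


def load_state(load1: float, ncpu: int, factor: float = 2.0) -> str:
    """Return WARNING once the load average reaches ncpu * factor (default NCPU*2)."""
    return "WARNING" if load1 >= ncpu * factor else "OK"


# Hypothetical readings for a 4-CPU machine.
print(disk_state(0.95))      # 95% full, above the 90% limit
print(load_state(3.0, 4))    # load 3.0, below the limit of 8.0
```

The point is only that every such limit, and its default, has to be chosen and maintained by whoever writes the rules.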

comment:2 Changed 5 months ago by anarcat

Parent ID: #29681
Resolution: fixed
Status: new → closed

i think I answered this question, otherwise reopen.

comment:3 Changed 5 months ago by anarcat

further discussion on the topic of alerting should be held in #29864
