Closed Bug 927941 Opened 11 years ago Closed 10 years ago

Disable notifications for individual slaves

Categories: mozilla.org Graveyard :: Server Operations (task)
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: coop; Assigned: ashish

The plethora of nagios alerts for slaves in the #buildduty IRC channel makes it hard to see when there are larger issues that buildduty should be jumping on immediately, e.g. network issues, command queues filling up, issues with masters, etc.

After conferring with the people responsible for buildduty, we've agreed to start using the nagios web interface to address issues with individual slaves. We'd like to turn off the IRC alerts for individual slaves to unclutter the #buildduty channel.

If we could replace the individual slave alerts with aggregate alerts so we could be notified with a WARNING when, say, 50% of a given pool is offline, that would be ideal. I don't know how difficult aggregate alerts are to set up; maybe we punt that work to another bug.
Blocks: re-nagios
Relops no longer manages nagios for releng, so slotting this into the SRE group's queue.
Assignee: relops → server-ops
Component: RelOps → Server Operations
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → shyam
(In reply to Chris Cooper [:coop] from comment #0)
> The plethora of nagios alerts for slaves in the #buildduty IRC channel makes
> it hard to see when there are larger issues that buildduty should be jumping
> on immediately, e.g. network issues, command queues filling up, issues with
> masters, etc.
> 
> After conferring with the people responsible for buildduty, we've agreed to
> start using the nagios web interface to address issues with individual
> slaves. We'd like to turn off the IRC alerts for individual slaves to
> unclutter the #buildduty channel.
> 
> If we could replace the individual slave alerts with aggregate alerts so we
> could be notified with a WARNING when, say, 50% of a given pool is offline,
> that would be ideal. I don't know how difficult aggregate alerts are to
> setup; maybe we punt that work to another bug.

Cluster checks are not difficult to set up and can be done pretty easily. Let us know which hosts and/or services need to be clustered.
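
For reference, the shape of such a check follows the standard Nagios cluster-monitoring pattern: a command wrapping the stock check_cluster plugin, which a per-hostgroup service then invokes. A minimal sketch (the plugin path and command name below are the usual defaults, not necessarily what our puppet config uses):

    define command {
        command_name    check_host_cluster
        ; $ARG1$ = label, $ARG2$/$ARG3$ = warning/critical thresholds,
        ; $ARG4$ = comma-separated state IDs of the member hosts
        command_line    /usr/local/nagios/libexec/check_cluster --host -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
    }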
(In reply to Ashish Vijayaram [:ashish] from comment #2)
> Cluster checks are not difficult to setup and can be done pretty easily. Let
> us know hosts and/or services that need to be clustered.

I think that having cluster checks for the existing Host Groups (http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=summary) would be adequate.

However, we already have access to that info via the web interface, so having that info via IRC at a 50% threshold per group is a nice-to-have addition. It shouldn't stop us from disabling the individual host IRC alerts now.
Ping: any update on this?
I'll take this bug and will cluster the large hostgroups to alert WARNING at 50% and CRITICAL at 75%. What would you want to do with individual host alerts?

FYI - the cluster check does not say which hosts are down (either in IRC or in the web UI). It'll just put out a summary such as: 21 up, 0 down, 0 unreachable. You'd have to navigate the web UI to figure out which hosts in that hostgroup are down.
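
To make the thresholds concrete, a per-hostgroup service built on a check_cluster command might look like this for a 21-host pool (the hostnames, template, and contact group here are illustrative, not the actual releng definitions):

    define service {
        host_name            nagios1.private.releng.scl3.mozilla.com
        service_description  Cluster check - panda pool (example)
        use                  generic-service     ; assumed base template
        contact_groups       build               ; assumed IRC/email contact group
        ; warn once more than 10 of the 21 members are DOWN/UNREACHABLE (~50%),
        ; critical past 15 (~75%); member list elided after three hosts
        check_command        check_host_cluster!"panda pool"!10!15!$HOSTSTATEID:panda-0001$,$HOSTSTATEID:panda-0002$,$HOSTSTATEID:panda-0003$
    }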
Assignee: server-ops → ashish
Status: NEW → ASSIGNED
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> I'll take this bug and will cluster the large hostgroups to alert WARNING at
> 50% and CRITICAL at 75%. What would you want to do with individual host
> alerts?

Those thresholds sound good.

We still want individual host alerts to appear in the web UI; we just want them to disappear from IRC.
 
> FYI - the cluster check does not (either in IRC or on the Web UI) elaborate
> which hosts are down. It'll just put out a number as such: 21 up, 0 down, 0
> unreachable. You'd have to navigate the web UI and figure out which hosts
> are down in that hostgroup.

That will work. We're committed to using the web UI to deal with individual hosts anyway.

Thanks.
Aha! I just re-discovered that cluster checks already exist for about two dozen hostgroups:

https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.scl3.mozilla.com

I've changed them all to percentage thresholds at 50% and 75%.
As discussed on vidyo, :ashish will work on removing the e-mail and IRC notifications for individual slaves, independent of the cluster checks being complete.

And then will continue on the cluster checks.

:ashish can you please let us know once you have an ETA on the first part.
Flags: needinfo?(ashish)
(In reply to Justin Wood (:Callek) from comment #8)
> As discussed on vidyo, :ashish will work on removing the e-mail and IRC
> notifications for individual slaves, independant of the cluster checks being
> complete.
> 
Significant progress so far:

Notifications for pandas removed in sysadmins r79407.
Notifications for tegras removed in sysadmins r79408.

Those two should cut down the alerts in #buildduty and email. Work on the others is in progress.

> And then will continue on the cluster checks.
> 
> :ashish can you please let us know once you have an ETA on the first part.
Optimistically by Dec 20, given the number of hostgroups to go through (and I'm on PTO for the rest of this week)...
Flags: needinfo?(ashish)
(In reply to Ashish Vijayaram [:ashish] from comment #9)
> > And then will continue on the cluster checks.
> > 
> > :ashish can you please let us know once you have an ETA on the first part.
> Optimistically by Dec 20 given the number of hostgroups to go through (and
> I'm on PTO rest of this week)...

FWIW I will be adding the cluster checks for each hostgroup in parallel, so that would mean Dec 20 for everything requested (remove alerts + add/verify cluster checks) in this bug.
Summary: Disable IRC alerts for issues with individual slaves → Disable notifications for individual slaves
(In reply to Ashish Vijayaram [:ashish] [PTO till 12/15] from comment #9)
> (In reply to Justin Wood (:Callek) from comment #8)
> > As discussed on vidyo, :ashish will work on removing the e-mail and IRC
> > notifications for individual slaves, independant of the cluster checks being
> > complete.
> > 
> Significantly:
> 
> Notifications for pandas removed in sysadmins r79407.
> Notifications for tegras removed in sysadmins r79408.
> 

Missing [at least] one piece:

[11:16:06]	nagios-releng	Wed 08:16:08 PST [4236] tegra-149.build.mtv1.mozilla.com:tegra agent check is CRITICAL: Connection refused (http://m.allizom.org/tegra+agent+check)
(In reply to Justin Wood (:Callek) from comment #11)
> Missing [at least] one piece:
> 
> [11:16:06]	nagios-releng	Wed 08:16:08 PST [4236]
> tegra-149.build.mtv1.mozilla.com:tegra agent check is CRITICAL: Connection
> refused (http://m.allizom.org/tegra+agent+check)

That is a service check. Do you want to remove all service check alerts for pandas/tegras as well?
Correct. We don't want alerts for individual slaves in IRC or email, but we do still want them retained in the nagios UI. That means service checks as well as health/ping checks.
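
To illustrate what I mean (this is just a sketch of the Nagios side, not the actual releng puppet layout or template names): the checks keep running and showing up in the web UI, and only the outbound notifications get dropped, e.g.:

    define host {
        use                    releng-slave        ; hypothetical base template
        host_name              tegra-149.build.mtv1.mozilla.com
        check_command          check-host-alive    ; ping/up-down check keeps running
        ; state still updates in the web UI; IRC/email notifications are suppressed
        notifications_enabled  0
    }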
Notifications for bld-centos6-hp hosts removed in r79684. These hosts also have a few service checks, such as Disk, IDE, and Buildbot:

https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=bld-centos6-hp&style=detail

I'll leave those as-is for now and get the noisier alerts out of the way first.
Notifications for panda-relays removed in r79713.
Notifications for prod-t-w732-ix removed in r79726.
Notifications for prod-t-xp32-ix removed in r79727.
Notifications for prod-talos-linux32-ix removed in r79728.
Notifications for prod-talos-linux64-ix removed in r79729.
Notifications for prod-talos-mtnlion-r5 removed in r79730.
Notifications for prod-talos-r3-fed removed in r79802.
Notifications for prod-talos-r3-fed64 removed in r79803.
Notifications for prod-talos-r4-snow removed in r79804.
Notifications for r5-production-builders removed in r79805.
Notifications for r5-try-builders removed in r79806.
Notifications for scl1-bld-linux64-ix removed in r79807.
Notifications for scl3-production-buildbot-masters removed in r79808.
Notifications for t-w864-ix removed in r79809.
Notifications for tegra-linux-servers removed in r79810.
Notifications for use1-production-buildbot-masters removed in r79811.
Notifications for usw2-production-buildbot-masters removed in r79812.
Notifications for w64-ix-slaves/w64r2-ix-slaves removed in r79813.
(In reply to Ashish Vijayaram [:ashish] from comment #17)
> Notifications for scl3-production-buildbot-masters removed in r79808.
> Notifications for use1-production-buildbot-masters removed in r79811.
> Notifications for usw2-production-buildbot-masters removed in r79812.

Whoops! The buildbot masters should have stayed.

Also, I see r5-lion "service" checks still alerting today. :ashish, what's the status of the service check removals?
Flags: needinfo?(ashish)
(In reply to Justin Wood (:Callek) from comment #18)
> (In reply to Ashish Vijayaram [:ashish] from comment #17)
> > Notifications for scl3-production-buildbot-masters removed in r79808.
> > Notifications for use1-production-buildbot-masters removed in r79811.
> > Notifications for usw2-production-buildbot-masters removed in r79812.
> 
> Whops!  Buildbot-masters should have stayed.
> 
Notifications for *-production-buildbot-masters added back in r80267.

> Also I see r5-lion "service" checks still alerting today, :ashish, whats the
> status of the service check removals?
> 
To avoid a lot of back-and-forth, it would help to have a list of what should alert where. It would surely be an extensive list, but I don't have enough know-how to decide on that myself... Some services also alert oncall in addition to #buildteam + email + IRC. OTOH, if you would like me to go all-out and send no service notifications to #buildteam + email + IRC at all, let me know and I can do just that.
Flags: needinfo?(ashish)
:ashish, I just saw:

[22:20:43]	nagios-releng	Thu 19:21:20 PST [4085] b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
[22:25:42]	nagios-releng	Thu 19:26:19 PST [4086] b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
[22:30:42]	nagios-releng	Thu 19:31:19 PST [4087] b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

which, if it were a slave, should be alerting less often even if we did want it in #buildduty.

Can you please make sure it and its services don't alert in #buildduty?

(In reply to Ashish Vijayaram [:ashish] from comment #19)
> To avoid a lot of back-and-forth it would help to get a list of what should
> alert where. It would surely be an extensive list but I don't have enough
> know-how to decide on that... Some services alert also oncalls in addition
> to #buildteam + email + irc. OTOH if you would like me to go all-out and not
> have any service notifications sent to #buildteam + email + irc then let me
> know and I can do just that.

Can you give me a list, URL, or instructions on how to gather that list of *existing* alerts [host and service] that go to #buildduty and release[+*]? Or enumerate them here in a way that works for you (e.g. by group), so I can answer this for you and get this bug closed out for once.
Flags: needinfo?(ashish)
(In reply to Justin Wood (:Callek) from comment #20)
> :ashish I just see:
> 
> [22:20:43]	nagios-releng	Thu 19:21:20 PST [4085]
> b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss
> = 100%
> [22:25:42]	nagios-releng	Thu 19:26:19 PST [4086]
> b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss
> = 100%
> [22:30:42]	nagios-releng	Thu 19:31:19 PST [4087]
> b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss
> = 100%
> 
> which if it was a slave should be alerting less often, even if we wanted it
> in #buildduty.
> 
> Can you please make sure it, and its services don't alert in #buildduty.
> 

Notifications for bld-centos6-hp removed in r81322.

> (In reply to Ashish Vijayaram [:ashish] from comment #19)
> > To avoid a lot of back-and-forth it would help to get a list of what should
> > alert where. It would surely be an extensive list but I don't have enough
> > know-how to decide on that... Some services alert also oncalls in addition
> > to #buildteam + email + irc. OTOH if you would like me to go all-out and not
> > have any service notifications sent to #buildteam + email + irc then let me
> > know and I can do just that.
> 
> Can you give me a list, url, or instructions on how to gather said list of
> *existing* alerts [host and service] that go to #buildduty, release[+*] ? 
> or enumerate them here in a way that works for you (e.g. by group) so I can
> answer this for you, and get this bug closed out for once.

Two ways:

* Web UI - https://nagios.mozilla.org/releng-scl3/cgi-bin/config.cgi?type=services
* Puppet - preferred for better flexibility

All checks that alert the "build" contact group should be on your radar. Feel free to ping/PM me on IRC to discuss.
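
If it helps, what you'd be grepping for in the puppet-managed object configs is essentially any definition carrying that contact group, roughly of this shape (the service name and check command here are made up for illustration):

    define service {
        hostgroup_name       bld-lion-r5
        service_description  buildbot
        check_command        check_buildbot       ; illustrative command name
        use                  generic-service      ; assumed base template
        ; this is the line that routes alerts to #buildduty + email
        contact_groups       build
    }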
Flags: needinfo?(ashish)
I'm still seeing individual alerts for the following classes of slaves:

* bld-centos6-hp (buildbot)
* bld-lion-r5 (buildbot, disk)
* b-linux64-hp (buildbot, ping, IDE)
* w64-ix-slave (disk)
Yeah, the service checks are still pending some review (see Comment 21). The difficulty is that service checks are mapped to entire hostgroups (one check -> many hosts). Splitting these is quite tedious, and it would be much simpler to go through each service check and evaluate where it should alert. That evaluation is what Callek is working on (Comment 20).
(In reply to Ashish Vijayaram [:ashish] from comment #23)
> Yeah, the service checks are still pending some review (see Comment 21). The
> difficulty arises in how service checks are mapped to complete hostgroups
> (1->many). Splitting these is quite tedious and it would be much simpler if
> we could go through each service check and evaluate where that check should
> alert. That evaluation is what Callek is working on (Comment 20).

OK, looking at https://nagios.mozilla.org/releng-scl3/cgi-bin/config.cgi?type=services, it's actually harder than I'd hoped to glean useful info out of it. Can you possibly do one of the following:

(a) tell me how to limit my search (e.g. to things that already alert to a specific contact group [e.g. build])
(a.2) if possible, in addition to (a), tell me how to sort the columns in the web UI (e.g. the sort by host makes some of this quite hard)
(b) tar/gz up the nagios configs from puppet and send them my way, using gpg if necessary
(b.2) get me approval for read-only access to the appropriate spot(s) on infra puppet [this has the added benefit that I'm willing to write you a patch to fix these]
(c) enumerate here a full list of services that alert to "build" (and the hostgroups they apply to)

After looking at that page, my preference, in order of most preferred to least, is c, b, a.

Thank You.
So Corey (:cshields) granted me infra puppet access per an IRC convo with him. I have yet to test said access, but I should be able to move forward here.
I just sent ashish a first-go patch at some cleanup. I know there will be more to do (I didn't audit everything yet), and there is at least one thing I need to split out as discussed above.

I also note the following for ashish to do for me (since many individual hosts are in said incorrect state):

b-linux64-hp-002.build.scl1.mozilla.com should not alert host status to 'build'

foopy* should perform HOST checks (e.g. ping/up/down) and alert to 'build'
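
For the foopies, something along these lines is what I'm after (sketch only; the hostname, template, and command names are whatever infra puppet already uses, not literal):

    define host {
        use                    generic-host                     ; assumed base template
        host_name              foopy-XX.build.scl1.mozilla.com  ; placeholder hostname
        check_command          check-host-alive                 ; plain ping/up-down host check
        ; unlike the individual slaves, host up/down for foopies should still notify
        notifications_enabled  1
        contact_groups         build
    }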
Mailed-in patch reviewed and submitted in r81825.
So I've just e-mailed ashish another patch; after this one I have resolved c#26.

I actually don't think I need to split up the service checks I thought I saw; it looks like it's all "server"-class machines that we want to know about (I saw "ubuntu" and it's really just KVM host systems).

I think with this patch, we can resolve this and file individual bugs for future issues.
Thanks :Callek! Submitted your patch in r82194. Closing this out now.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard