Closed Bug 927941 Opened 11 years ago Closed 10 years ago

Disable notifications for individual slaves

Categories: mozilla.org Graveyard :: Server Operations (task)
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: coop; Assigned: ashish

The plethora of nagios alerts for slaves in the #buildduty IRC channel makes it hard to see when there are larger issues that buildduty should be jumping on immediately, e.g. network issues, command queues filling up, issues with masters, etc.

After conferring with the people responsible for buildduty, we've agreed to start using the nagios web interface to address issues with individual slaves. We'd like to turn off the IRC alerts for individual slaves to unclutter the #buildduty channel.

If we could replace the individual slave alerts with aggregate alerts so we could be notified with a WARNING when, say, 50% of a given pool is offline, that would be ideal. I don't know how difficult aggregate alerts are to set up; maybe we punt that work to another bug.
Blocks: re-nagios
Relops no longer manages nagios for releng, so slotting this into the SRE group's queue.
Assignee: relops → server-ops
Component: RelOps → Server Operations
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → shyam
(In reply to Chris Cooper [:coop] from comment #0)
> The plethora of nagios alerts for slaves in the #buildduty IRC channel makes
> it hard to see when there are larger issues that buildduty should be jumping
> on immediately, e.g. network issues, command queues filling up, issues with
> masters, etc.
> 
> After conferring with the people responsible for buildduty, we've agreed to
> start using the nagios web interface to address issues with individual
> slaves. We'd like to turn off the IRC alerts for individual slaves to
> unclutter the #buildduty channel.
> 
> If we could replace the individual slave alerts with aggregate alerts so we
> could be notified with a WARNING when, say, 50% of a given pool is offline,
> that would be ideal. I don't know how difficult aggregate alerts are to
> setup; maybe we punt that work to another bug.

Cluster checks are not difficult to set up and can be done pretty easily. Let us know which hosts and/or services need to be clustered.
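
For reference, the shape of such a check follows the standard Nagios cluster-monitoring pattern: a command wrapping the stock check_cluster plugin, which a per-hostgroup service then invokes. A minimal sketch (the plugin path and command name below are the usual defaults, not necessarily what our puppet config uses):

    define command {
        command_name    check_host_cluster
        ; $ARG1$ = label, $ARG2$/$ARG3$ = warning/critical thresholds,
        ; $ARG4$ = comma-separated state IDs of the member hosts
        command_line    /usr/local/nagios/libexec/check_cluster --host -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
    }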
(In reply to Ashish Vijayaram [:ashish] from comment #2)
> Cluster checks are not difficult to setup and can be done pretty easily. Let
> us know hosts and/or services that need to be clustered.

I think that having cluster checks for the existing Host Groups (http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=summary) would be adequate.

However, we already have access to that info via the web interface, so having that info via IRC at a 50% threshold per group is a nice-to-have addition. It shouldn't stop us from disabling the individual host IRC alerts now.
Ping: any update on this?
I'll take this bug and will cluster the large hostgroups to alert WARNING at 50% and CRITICAL at 75%. What would you want to do with individual host alerts?

FYI - the cluster check does not say which hosts are down (either in IRC or in the web UI). It'll just put out a summary such as: 21 up, 0 down, 0 unreachable. You'd have to navigate the web UI to figure out which hosts in that hostgroup are down.
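
To make the thresholds concrete, a per-hostgroup service built on a check_cluster command might look like this for a 21-host pool (the hostnames, template, and contact group here are illustrative, not the actual releng definitions):

    define service {
        host_name            nagios1.private.releng.scl3.mozilla.com
        service_description  Cluster check - panda pool (example)
        use                  generic-service     ; assumed base template
        contact_groups       build               ; assumed IRC/email contact group
        ; warn once more than 10 of the 21 members are DOWN/UNREACHABLE (~50%),
        ; critical past 15 (~75%); member list elided after three hosts
        check_command        check_host_cluster!"panda pool"!10!15!$HOSTSTATEID:panda-0001$,$HOSTSTATEID:panda-0002$,$HOSTSTATEID:panda-0003$
    }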
Assignee: server-ops → ashish
Status: NEW → ASSIGNED
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> I'll take this bug and will cluster the large hostgroups to alert WARNING at
> 50% and CRITICAL at 75%. What would you want to do with individual host
> alerts?

Those thresholds sound good.

We still want individual host alerts to appear in the web UI; we just want them to disappear from IRC.
 
> FYI - the cluster check does not (either in IRC or on the Web UI) elaborate
> which hosts are down. It'll just put out a number as such: 21 up, 0 down, 0
> unreachable. You'd have to navigate the web UI and figure out which hosts
> are down in that hostgroup.

That will work. We're committed to using the web UI to deal with individual hosts anyway.

Thanks.
Aha! I just re-discovered that cluster checks already exist for about two dozen hostgroups:

https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.scl3.mozilla.com

I've changed them all to percentage thresholds at 50% and 75%.
As discussed on vidyo, :ashish will work on removing the e-mail and IRC notifications for individual slaves, independent of the cluster checks being complete.

And then will continue on the cluster checks.

:ashish can you please let us know once you have an ETA on the first part.
Flags: needinfo?(ashish)
(In reply to Justin Wood (:Callek) from comment #8)
> As discussed on vidyo, :ashish will work on removing the e-mail and IRC
> notifications for individual slaves, independant of the cluster checks being
> complete.
> 
Significant progress so far:

Notifications for pandas removed in sysadmins r79407.
Notifications for tegras removed in sysadmins r79408.

Those two should cut down the alerts in #buildduty and email. Work on the others is in progress.

> And then will continue on the cluster checks.
> 
> :ashish can you please let us know once you have an ETA on the first part.
Optimistically by Dec 20, given the number of hostgroups to go through (and I'm on PTO for the rest of this week)...
Flags: needinfo?(ashish)
(In reply to Ashish Vijayaram [:ashish] from comment #9)
> > And then will continue on the cluster checks.
> > 
> > :ashish can you please let us know once you have an ETA on the first part.
> Optimistically by Dec 20 given the number of hostgroups to go through (and
> I'm on PTO rest of this week)...

FWIW I will be adding the cluster checks for each hostgroup in parallel, so that would mean Dec 20 for everything requested (remove alerts + add/verify cluster checks) in this bug.
Summary: Disable IRC alerts for issues with individual slaves → Disable notifications for individual slaves
(In reply to Ashish Vijayaram [:ashish] [PTO till 12/15] from comment #9)
> (In reply to Justin Wood (:Callek) from comment #8)
> > As discussed on vidyo, :ashish will work on removing the e-mail and IRC
> > notifications for individual slaves, independant of the cluster checks being
> > complete.
> > 
> Significantly:
> 
> Notifications for pandas removed in sysadmins r79407.
> Notifications for tegras removed in sysadmins r79408.
> 

Missing [at least] one piece:

[11:16:06]	nagios-releng	Wed 08:16:08 PST [4236] tegra-149.build.mtv1.mozilla.com:tegra agent check is CRITICAL: Connection refused (http://m.allizom.org/tegra+agent+check)
(In reply to Justin Wood (:Callek) from comment #11)
> Missing [at least] one piece:
> 
> [11:16:06]	nagios-releng	Wed 08:16:08 PST [4236]
> tegra-149.build.mtv1.mozilla.com:tegra agent check is CRITICAL: Connection
> refused (http://m.allizom.org/tegra+agent+check)

That is a service check. Do you want to remove all service check alerts for pandas/tegras as well?
Correct. We don't want alerts for individual slaves in IRC or email, but we do still want them retained in the nagios UI. That means service checks as well as health/ping checks.
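
To illustrate what I mean (this is just a sketch of the Nagios side, not the actual releng puppet layout or template names): the checks keep running and showing up in the web UI, and only the outbound notifications get dropped, e.g.:

    define host {
        use                    releng-slave        ; hypothetical base template
        host_name              tegra-149.build.mtv1.mozilla.com
        check_command          check-host-alive    ; ping/up-down check keeps running
        ; state still updates in the web UI; IRC/email notifications are suppressed
        notifications_enabled  0
    }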
Notifications for bld-centos6-hp hosts removed in r79684. These hosts also have a few service checks, such as Disk, IDE, and Buildbot:

https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=bld-centos6-hp&style=detail

I'll leave those as-is for now and get the noisier alerts out of the way first.
Notifications for panda-relays removed in r79713.
Notifications for prod-t-w732-ix removed in r79726.
Notifications for prod-t-xp32-ix removed in r79727.
Notifications for prod-talos-linux32-ix removed in r79728.
Notifications for prod-talos-linux64-ix removed in r79729.
Notifications for prod-talos-mtnlion-r5 removed in r79730.
Notifications for prod-talos-r3-fed removed in r79802.
Notifications for prod-talos-r3-fed64 removed in r79803.
Notifications for prod-talos-r4-snow removed in r79804.
Notifications for r5-production-builders removed in r79805.
Notifications for r5-try-builders removed in r79806.
Notifications for scl1-bld-linux64-ix removed in r79807.
Notifications for scl3-production-buildbot-masters removed in r79808.
Notifications for t-w864-ix removed in r79809.
Notifications for tegra-linux-servers removed in r79810.
Notifications for use1-production-buildbot-masters removed in r79811.
Notifications for usw2-production-buildbot-masters removed in r79812.
Notifications for w64-ix-slaves/w64r2-ix-slaves removed in r79813.
(In reply to Ashish Vijayaram [:ashish] from comment #17)
> Notifications for scl3-production-buildbot-masters removed in r79808.
> Notifications for use1-production-buildbot-masters removed in r79811.
> Notifications for usw2-production-buildbot-masters removed in r79812.

Whoops! The buildbot masters should have stayed.

Also, I see r5-lion "service" checks still alerting today. :ashish, what's the status of the service check removals?
Flags: needinfo?(ashish)
(In reply to Justin Wood (:Callek) from comment #18)
> (In reply to Ashish Vijayaram [:ashish] from comment #17)
> > Notifications for scl3-production-buildbot-masters removed in r79808.
> > Notifications for use1-production-buildbot-masters removed in r79811.
> > Notifications for usw2-production-buildbot-masters removed in r79812.
> 
> Whops!  Buildbot-masters should have stayed.
> 
Notifications for *-production-buildbot-masters added back in r80267.

> Also I see r5-lion "service" checks still alerting today, :ashish, whats the
> status of the service check removals?
> 
To avoid a lot of back-and-forth, it would help to have a list of what should alert where. It would surely be an extensive list, but I don't have enough know-how to decide on that myself... Some services also alert oncall in addition to #buildteam + email + IRC. OTOH, if you would like me to go all-out and send no service notifications to #buildteam + email + IRC at all, let me know and I can do just that.
Flags: needinfo?(ashish)
:ashish, I just saw:

[22:20:43]	nagios-releng	Thu 19:21:20 PST [4085] b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
[22:25:42]	nagios-releng	Thu 19:26:19 PST [4086] b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
[22:30:42]	nagios-releng	Thu 19:31:19 PST [4087] b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

which, if it were a slave, should be alerting less often even if we did want it in #buildduty.

Can you please make sure it and its services don't alert in #buildduty?

(In reply to Ashish Vijayaram [:ashish] from comment #19)
> To avoid a lot of back-and-forth it would help to get a list of what should
> alert where. It would surely be an extensive list but I don't have enough
> know-how to decide on that... Some services alert also oncalls in addition
> to #buildteam + email + irc. OTOH if you would like me to go all-out and not
> have any service notifications sent to #buildteam + email + irc then let me
> know and I can do just that.

Can you give me a list, URL, or instructions on how to gather that list of *existing* alerts [host and service] that go to #buildduty and release[+*]? Or enumerate them here in a way that works for you (e.g. by group), so I can answer this for you and get this bug closed out for once.
Flags: needinfo?(ashish)
(In reply to Justin Wood (:Callek) from comment #20)
> :ashish I just see:
> 
> [22:20:43]	nagios-releng	Thu 19:21:20 PST [4085]
> b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss
> = 100%
> [22:25:42]	nagios-releng	Thu 19:26:19 PST [4086]
> b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss
> = 100%
> [22:30:42]	nagios-releng	Thu 19:31:19 PST [4087]
> b-linux64-hp-001.build.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss
> = 100%
> 
> which if it was a slave should be alerting less often, even if we wanted it
> in #buildduty.
> 
> Can you please make sure it, and its services don't alert in #buildduty.
> 

Notifications for bld-centos6-hp removed in r81322.

> (In reply to Ashish Vijayaram [:ashish] from comment #19)
> > To avoid a lot of back-and-forth it would help to get a list of what should
> > alert where. It would surely be an extensive list but I don't have enough
> > know-how to decide on that... Some services alert also oncalls in addition
> > to #buildteam + email + irc. OTOH if you would like me to go all-out and not
> > have any service notifications sent to #buildteam + email + irc then let me
> > know and I can do just that.
> 
> Can you give me a list, url, or instructions on how to gather said list of
> *existing* alerts [host and service] that go to #buildduty, release[+*] ? 
> or enumerate them here in a way that works for you (e.g. by group) so I can
> answer this for you, and get this bug closed out for once.

Two ways:

* Web UI - https://nagios.mozilla.org/releng-scl3/cgi-bin/config.cgi?type=services
* Puppet - preferred for better flexibility

All checks that alert the "build" contact group should be on your radar. Feel free to ping/PM me on IRC to discuss.
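
If it helps, what you'd be grepping for in the puppet-managed object configs is essentially any definition carrying that contact group, roughly of this shape (the service name and check command here are made up for illustration):

    define service {
        hostgroup_name       bld-lion-r5
        service_description  buildbot
        check_command        check_buildbot       ; illustrative command name
        use                  generic-service      ; assumed base template
        ; this is the line that routes alerts to #buildduty + email
        contact_groups       build
    }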
Flags: needinfo?(ashish)
I'm still seeing individual alerts for the following classes of slaves:

* bld-centos6-hp (buildbot)
* bld-lion-r5 (buildbot, disk)
* b-linux64-hp (buildbot, ping, IDE)
* w64-ix-slave (disk)
Yeah, the service checks are still pending some review (see Comment 21). The difficulty is that service checks are mapped to entire hostgroups (one check -> many hosts). Splitting these is quite tedious, and it would be much simpler to go through each service check and evaluate where it should alert. That evaluation is what Callek is working on (Comment 20).
(In reply to Ashish Vijayaram [:ashish] from comment #23)
> Yeah, the service checks are still pending some review (see Comment 21). The
> difficulty arises in how service checks are mapped to complete hostgroups
> (1->many). Splitting these is quite tedious and it would be much simpler if
> we could go through each service check and evaluate where that check should
> alert. That evaluation is what Callek is working on (Comment 20).

OK, looking at https://nagios.mozilla.org/releng-scl3/cgi-bin/config.cgi?type=services, it's actually harder than I'd hoped to glean useful info out of it. Can you possibly do one of the following:

(a) tell me how to limit my search (e.g. to things that already alert to a specific contact group [e.g. build])
(a.2) if possible, in addition to (a), tell me how to sort the columns in the web UI (e.g. the sort by host makes some of this quite hard)
(b) tar/gz up the nagios configs from puppet and send them my way, using gpg if necessary
(b.2) get me approval for read-only access to the appropriate spot(s) on infra puppet [this has the added benefit that I'm willing to write you a patch to fix these]
(c) enumerate here a full list of services that alert to "build" (and the hostgroups they apply to)

After looking at that page, my preference, in order of most preferred to least, is c, b, a.

Thank You.
So Corey (:cshields) granted me infra puppet access per an IRC convo with him. I have yet to test said access, but I should be able to move forward here.
I just sent ashish a first-go patch at some cleanup. I know there will be more to do (I didn't audit everything yet), and there is at least one thing I need to split out as discussed above.

I also note the following for ashish to do for me (since many individual hosts are in said incorrect state):

b-linux64-hp-002.build.scl1.mozilla.com should not alert host status to 'build'

foopy* should perform HOST checks (e.g. ping/up/down) and alert to 'build'
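
For the foopies, something along these lines is what I'm after (sketch only; the hostname, template, and command names are whatever infra puppet already uses, not literal):

    define host {
        use                    generic-host                     ; assumed base template
        host_name              foopy-XX.build.scl1.mozilla.com  ; placeholder hostname
        check_command          check-host-alive                 ; plain ping/up-down host check
        ; unlike the individual slaves, host up/down for foopies should still notify
        notifications_enabled  1
        contact_groups         build
    }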
Mailed-in patch reviewed and submitted in r81825.
So I've just e-mailed ashish another patch; after this one I have resolved c#26.

I actually don't think I need to split up the service checks I thought I saw; it looks like it's all "server"-class machines that we want to know about (I saw "ubuntu" and it's really just KVM host systems).

I think with this patch, we can resolve this and file individual bugs for future issues.
Thanks :Callek! Submitted your patch in r82194. Closing this out now.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard