Closed
Bug 1330293
Opened 7 years ago
Closed 6 years ago
Prevent nagios_blocker_checker.pl from running longer than 5 minutes (and log to sentry if it does)
Categories
(bugzilla.mozilla.org :: General, defect, P3)
RESOLVED
FIXED
People
(Reporter: gcox, Assigned: gcox)
Details
User Story
In infra bug 1329995, five nagios_blocker_checker.pl processes stacked up and sat around for over an hour. This wasted swap and pushed the box into alarm. NRPE calls on this box use -t 60 as a timeout, but that timeout didn't propagate down to the actual Perl process, leaving the processes orphaned. That is why I think the script deserves a high-but-finite time limit. Correlation is not causation, but the timestamp on the process coincides with an admin hopping onto the admin server and running a BMO update, in case this alters your thinking.
Attachments
(1 file)
nagios_blocker_checker.pl: if it doesn't complete in 5 minutes, the odds are that NRPE has long since given up and abandoned it. The Perl script should have something (à la alarm(300)) to cut itself off in case it gets stuck.
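A minimal sketch of the alarm(300) idea, for illustration only and not taken from the actual patch. The helper name run_with_timeout and the $on_timeout callback are hypothetical; the callback stands in for whatever error reporting (for example the Sentry logging asked for in the summary) the real script would use. The pattern itself is the standard eval/alarm idiom from the Perl documentation. The demo uses a 1-second limit where the bug asks for 300.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run $work, abort it after $seconds, and invoke
# $on_timeout (a stand-in for an error reporter) if the limit is hit.
sub run_with_timeout {
    my ($seconds, $work, $on_timeout) = @_;
    my $value;
    my $ok = eval {
        # The handler dies so that control leaves $work and lands in $@.
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm($seconds);
        $value = $work->();
        alarm(0);    # finished in time: cancel the pending alarm
        1;
    };
    my $err = $@;
    alarm(0);        # make sure no alarm stays armed on any exit path
    return $value if $ok;
    if ($err eq "timed out\n") {
        $on_timeout->($seconds) if $on_timeout;
        return;
    }
    die $err;        # some other failure: re-raise it unchanged
}

my @reports;
my $report = sub { push @reports, "exceeded $_[0]s" };

# Completes well inside its limit.
my $fast = run_with_timeout(5, sub { return "OK" }, $report);

# Simulated stuck check: waits 3 seconds against a 1-second limit.
my $slow = run_with_timeout(1, sub { select(undef, undef, undef, 3); return "never" }, $report);

print "fast: $fast\n";
print "slow: ", (defined $slow ? $slow : "aborted"), "\n";
print "reports: @reports\n";
```

One caveat worth keeping in mind: Perl delivers signals between opcodes, so a single long-running call into C code (a database driver, for instance) may not be interrupted the moment the alarm fires.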
Comment 1•7 years ago
Two hours running for these processes so far, killing...
[root@bugzillaadm.private.scl3 pradcliffe]# ps auxww | grep '1640[12]'
root 16401 26.5 40.1 2477232 1576324 ? R 14:57 33:35 /usr/bin/perl /data/bugzilla/www/bugzilla.mozilla.org/scripts/nagios_blocker_checker.pl server-ops-devservices@mozilla-org.bugs
root 16402 26.5 39.3 2454016 1545352 ? R 14:57 33:36 /usr/bin/perl /data/bugzilla/www/bugzilla.mozilla.org/scripts/nagios_blocker_checker.pl --product Infrastructure & Operations --component MOC: Projects --severity blocker
[root@bugzillaadm.private.scl3 pradcliffe]# date
Tue Jan 24 17:04:16 UTC 2017
Comment 2•7 years ago
:dkl, could you work on this? We get alerted every week when you guys push updates.
Thu 09:01:02 PST [5723] bugzillaadm.private.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 23% free (466 MB out of 2047 MB) (http://m.mozilla.org/Swap)
Comment 3•7 years ago
This is still going on; I got another page and had to go kill another script today.
Comment 4•7 years ago
dkl: do you have any bandwidth to get a timeout in the perl script this quarter?
Flags: needinfo?(dkl)
Comment 5•7 years ago
(In reply to Keegan Ferrando [:fauweh] from comment #4)
> dkl: do you have any bandwidth to get a timeout in the perl script this
> quarter?
I am not working on BMO at the moment. Needinfo'ing dylan about this question.
dkl
Flags: needinfo?(dkl) → needinfo?(dylan)
Comment 6•7 years ago
It's on my radar, but not very high at the moment.
Flags: needinfo?(dylan)
Priority: -- → P3
Comment 7•6 years ago (Assignee)
Tossed a PR at this. https://github.com/mozilla-bteam/bmo/pull/327 I feel pretty good about the alarm bit, not so sure about the sentry bit, but.
Comment 8•6 years ago
The sentry bit will work. I was surprised we exported that sentry function, but apparently we do.
Updated•6 years ago
Assignee: nobody → gcox
Comment 9•6 years ago
Comment 10•6 years ago (Assignee)
Merged, so marking this as fixed until proven otherwise. The deployment of the script could trigger the issue one more time, but, oh well, can't win 'em all.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED