Closed
Bug 1082852
Opened 10 years ago
Closed 9 years ago
slaverebooter hangs indefinitely waiting for a free worker
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Assigned: Callek)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4264] )
Attachments
(2 files)
4.34 KB, patch | Callek: feedback+ | coop: checked-in+
4.18 KB, patch | Callek: review+ | coop: checked-in+
The extra debugging I added to the slaverebooter script has narrowed the issue down to the loop where the script waits for a free worker thread:

2014-10-12 05:09:46,832 - DEBUG - Waiting for a free worker...

I haven't had time to debug the logic that surrounds that loop, but I have hacked in a hard timeout for now. The script now also generates more debug output by default, and it generates its own rotating log file in an attempt to avoid the console-redirection-based logging we rely on now. This is currently running on bm74.

Callek: can you verify the logic of the worker thread creation and reaping? The timeout I added should save us for now, but I'd rather not throw reboot requests into the void.
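For reference, here is a minimal sketch of the two changes described above: a rotating log file owned by the script itself, and a hard timeout on the "waiting for a free worker" loop. All names, the timeout value, and the worker API are illustrative assumptions, not the actual reboot-idle-slaves.py code.

```python
import logging
import logging.handlers
import time

# The script writes its own rotating log instead of relying on
# console redirection (filename/size are assumptions).
log = logging.getLogger("slaverebooter")
handler = logging.handlers.RotatingFileHandler(
    "reboot-idle-slaves.log", maxBytes=50 * 1024 * 1024, backupCount=1)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
log.addHandler(handler)
log.setLevel(logging.DEBUG)

WORKER_WAIT_TIMEOUT = 1800  # seconds; matches the "after 1800 seconds" log line


def wait_for_free_worker(workers, max_workers, timeout=WORKER_WAIT_TIMEOUT):
    """Block until a worker slot frees up, or give up after `timeout`
    seconds instead of spinning forever."""
    start = time.time()
    while len([w for w in workers if w.is_alive()]) >= max_workers:
        if time.time() - start > timeout:
            log.warning("Gave up waiting for a free worker after %d seconds",
                        timeout)
            return False
        log.debug("Waiting for a free worker...")
        time.sleep(0.5)
    return True
```

The point of the hard timeout is that a stuck worker degrades one iteration instead of hanging the whole script.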
Attachment #8505040 - Flags: feedback?(bugspam.Callek)
Assignee | Comment 1 • 10 years ago
Comment on attachment 8505040 [details] [diff] [review]
Run slaverebooter in debug mode

I didn't review too closely tonight, but f+ for more logging and rotated logs!
Attachment #8505040 - Flags: feedback?(bugspam.Callek) → feedback+
Reporter | Comment 2 • 10 years ago
Comment on attachment 8505040 [details] [diff] [review]
Run slaverebooter in debug mode

Since it's already running in production, I've landed this.

https://hg.mozilla.org/build/tools/rev/f935d3cd6cb4
Attachment #8505040 - Flags: checked-in+
Reporter | Comment 3 • 10 years ago
We're in a similar state again on bm74, only now we're looping indefinitely on the following:

2014-11-18 18:55:02,291 - DEBUG - Found a running worker. Attempting to join...

It's been "running" in this state since Nov 14, which corresponds pretty well with philor's comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1100621#c0

-r--r--r-- 1 cltbld cltbld 1 Nov 14 19:43 slaverebooter.lock

Callek: do you want to have a look, or should I just get this unblocked by removing the ?
Reporter | Comment 5 • 10 years ago
(In reply to Chris Cooper [:coop] from comment #3)
> Callek: do you want to have a look, or should I just get this unblocked by
> removing the ?

Er, by removing the lockfile.
Assignee | Comment 6 • 10 years ago
Sooo, info:

2014-11-14 19:47:04,333 - DEBUG - Waiting for a free worker...
2014-11-14 19:47:04,783 - INFO - talos-linux64-ix-021 - Graceful shutdown failed, rebooting anyway
2014-11-14 19:47:04,834 - DEBUG - Waiting for a free worker...
2014-11-14 19:47:04,862 - INFO - talos-linux64-ix-021 - Reboot queued
2014-11-14 19:47:05,335 - DEBUG - Waiting for a free worker...
... # Many instances of "Waiting for a free worker", ~2 per second
2014-11-14 20:13:12,202 - DEBUG - Waiting for a free worker...
2014-11-14 20:13:12,703 - WARNING - Gave up waiting for a free worker after 1800 seconds
2014-11-14 20:13:12,704 - DEBUG - Found a running worker. Attempting to join...
... # Repeated up until this very second, ~1 per second!

[jwood@buildbot-master74.srv.releng.usw2.mozilla.com slaverebooter]$ grep "Found a running worker. Attempting to join..." reboot-idle-slaves.log{,.1} | wc -l
343423
[jwood@buildbot-master74.srv.releng.usw2.mozilla.com slaverebooter]$ grep "Waiting for a free worker..." reboot-idle-slaves.log{,.1} | wc -l
19181

This all corresponds to the block at http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#258 so something is either thread-racy or broken here. Might need more debugging goop.

(Relatedly, what happened to our nagios alert for slaverebooter.lock!?!?!)
Assignee | Comment 7 • 10 years ago
fwiw I killed the procs and removed the lockfile (which was old):

-r--r--r-- 1 cltbld cltbld 1 Nov 14 19:43 slaverebooter.lock
Comment 8 • 10 years ago
It's dead again. talos-linux64, you say? Probably not a coincidence that talos-linux64 has about 50 more slaves than it needs, and ssh reboots fail 100% of the time on it, and graceful shutdowns fail 100% of the time on it. Might improve our chances of keeping slaverebooter alive through the weekend, which I'm going to take off, if we disabled 50 linux64 slaves (and maybe 10 or 20 linux32s) so it wouldn't have to deal with them. Or, just tell it not to even try to deal with them, that'd probably work fine too.
Reporter | Updated • 10 years ago
Assignee: nobody → bugspam.Callek
OS: Mac OS X → Linux
Hardware: x86 → x86_64
Reporter | Comment 9 • 10 years ago
Here are some extra debugging aids I've added, as well as some tweaks to the various timeouts. This is currently running in production on bm74.
Attachment #8535608 - Flags: review?(bugspam.Callek)
Reporter | Comment 10 • 10 years ago
Despite the changes in attachment 8535608 [details] [diff] [review], we're still eventually getting into a state where we loop indefinitely on "Found a running worker. Attempting to join..."

There are a bunch of things we need to do to fix this:

* Create/resurrect a nagios alert for slaverebooter.lock file age. If it's older than 1 iteration, i.e. 1 hour, something is wrong. This will help with manually unblocking until we debug the problem.
* Auto-terminate the script after some threshold (50min?) even if there are still workers active. Will we compound the worker issue by running again right away? I don't know.
* Figure out what that running worker is actually blocked on. Is it the same worker every time, or multiple different workers? Is the blockage local or on slaveapi? This is the next stage of debugging that needs to happen here.
* Fix the cronjob to stop creating the slaverebooter.log, now that we have rotating logs generated by the script itself.
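The first item above (alert when the lockfile outlives one iteration) could be sketched as a standard plugin-style check: exit 0 for OK, 2 for CRITICAL. The lockfile path here is an assumed example, not the real deployment path, and the 1-hour threshold comes from the iteration length named above.

```python
import os
import sys
import time

LOCKFILE = "/builds/slaverebooter/slaverebooter.lock"  # assumed path
MAX_AGE = 60 * 60  # one iteration = 1 hour


def check_lockfile_age(path=LOCKFILE, max_age=MAX_AGE):
    """Return (exit_status, message) in the usual nagios convention:
    0 = OK, 2 = CRITICAL."""
    if not os.path.exists(path):
        # No lockfile means no run in progress; nothing to alert on.
        return (0, "OK: no lockfile at %s" % path)
    age = int(time.time() - os.path.getmtime(path))
    if age > max_age:
        return (2, "CRITICAL: %s is %d seconds old (max %d); "
                   "slaverebooter is likely stuck" % (path, age, max_age))
    return (0, "OK: %s is %d seconds old" % (path, age))


if __name__ == "__main__":
    status, message = check_lockfile_age()
    print(message)
    sys.exit(status)
```

A stale-but-present lockfile is exactly the stuck state described in comment 3, so alerting on mtime age catches it without needing to inspect the process.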
Assignee | Comment 11 • 10 years ago
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

::: buildfarm/maintenance/reboot-idle-slaves.py
@@ +40,4 @@
>
>  SLAVE_QUEUE = Queue.Queue()
>
> +PP = pprint.PrettyPrinter(indent=4)

Is PP still used? (I don't see it)
Reporter | Comment 12 • 10 years ago
(In reply to Justin Wood (:Callek) from comment #11)
> Is PP still used? (I don't see it)

Nope, but I figured we were still in debugging mode here. I'm happy to rip it out when we're happy with slaverebooter performance.
Assignee | Comment 13 • 10 years ago
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

wfm (for now, this script is a pain)
Attachment #8535608 - Flags: review?(bugspam.Callek) → review+
Reporter | Comment 14 • 10 years ago
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

https://hg.mozilla.org/build/tools/rev/0d8d4f943f6e
Attachment #8535608 - Flags: checked-in+
Updated • 9 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4264]
Assignee | Comment 15 • 9 years ago
This was fixed waaay back in 2014; however, we have since disabled slaverebooter.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated • 7 years ago
Component: Tools → General