Closed Bug 1082852 Opened 10 years ago Closed 9 years ago

slaverebooter hangs indefinitely waiting for a free worker

Categories

(Release Engineering :: General, defect)

Hardware: x86_64 Linux
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: Callek)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4264] )

Attachments

(2 files)

The extra debugging I added to the slaverebooter script has narrowed the issue down to the loop where the script waits for a free worker thread:

2014-10-12 05:09:46,832 - DEBUG - Waiting for a free worker...

I haven't had time to debug the logic that surrounds that loop, but I have hacked in a hard timeout for now. The script now also generates more debug output by default, and writes its own rotating log file in an attempt to avoid the console-redirection-based logging we rely on now.

This is currently running on bm74.

Callek: can you verify the logic of the worker thread creation and reaping? The timeout I added should save us for now, but I'd rather not throw reboot requests into the void.
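For context, the loop in question behaves roughly like the sketch below (hypothetical names throughout: `MAX_WORKERS`, the `workers` list, and `wait_for_free_worker` are stand-ins for whatever the real script tracks, not the actual code). The hard timeout mirrors the 1800-second warning that shows up in the log later in this bug:

```python
import logging
import time

log = logging.getLogger(__name__)

# Assumed values: the real script's pool size and timeout may differ.
MAX_WORKERS = 8
WORKER_WAIT_TIMEOUT = 1800  # seconds; matches the warning seen in the log


def wait_for_free_worker(workers):
    """Block until a worker slot frees up, or give up after a hard timeout.

    Without the deadline, a worker thread that never terminates means
    this loop logs "Waiting for a free worker..." forever.
    """
    deadline = time.time() + WORKER_WAIT_TIMEOUT
    while True:
        # Reap finished threads so their slots become available again.
        workers[:] = [w for w in workers if w.is_alive()]
        if len(workers) < MAX_WORKERS:
            return True
        if time.time() > deadline:
            log.warning("Gave up waiting for a free worker after %d seconds",
                        WORKER_WAIT_TIMEOUT)
            return False
        log.debug("Waiting for a free worker...")
        time.sleep(0.5)
```

The key point of the timeout is that the loop can now exit even if the reaping logic is wrong, at the cost of possibly dropping a reboot request.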
Attachment #8505040 - Flags: feedback?(bugspam.Callek)
Comment on attachment 8505040 [details] [diff] [review]
Run slaverebooter in debug mode

Review of attachment 8505040 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't review too closely tonight, but f+ for more logging and rotated logs!
Attachment #8505040 - Flags: feedback?(bugspam.Callek) → feedback+
Comment on attachment 8505040 [details] [diff] [review]
Run slaverebooter in debug mode

Review of attachment 8505040 [details] [diff] [review]:
-----------------------------------------------------------------

Since it's already running in production, I've landed this.

https://hg.mozilla.org/build/tools/rev/f935d3cd6cb4
Attachment #8505040 - Flags: checked-in+
We're in similar state again on bm74, only now we're looping indefinitely on the following:

2014-11-18 18:55:02,291 - DEBUG - Found a running worker. Attempting to join...

It's been "running" in this state since Nov 14, which corresponds pretty well with philor's comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1100621#c0

-r--r--r-- 1 cltbld cltbld         1 Nov 14 19:43 slaverebooter.lock

Callek: do you want to have a look, or should I just get this unblocked by removing the ?
(In reply to Chris Cooper [:coop] from comment #3)
> Callek: do you want to have a look, or should I just get this unblocked by
> removing the ?

Er, by removing the lockfile.
Sooo, info:

2014-11-14 19:47:04,333 - DEBUG - Waiting for a free worker...
2014-11-14 19:47:04,783 - INFO - talos-linux64-ix-021 - Graceful shutdown failed, rebooting anyway
2014-11-14 19:47:04,834 - DEBUG - Waiting for a free worker...
2014-11-14 19:47:04,862 - INFO - talos-linux64-ix-021 - Reboot queued
2014-11-14 19:47:05,335 - DEBUG - Waiting for a free worker...
...
# Many instances of "Waiting for a free worker" ~2 per second...
...
2014-11-14 20:13:12,202 - DEBUG - Waiting for a free worker...
2014-11-14 20:13:12,703 - WARNING - Gave up waiting for a free worker after 1800 seconds
2014-11-14 20:13:12,704 - DEBUG - Found a running worker. Attempting to join...
...
# Repeated up until this very second, ~1 per second!


[jwood@buildbot-master74.srv.releng.usw2.mozilla.com slaverebooter]$ grep "Found a running worker. Attempting to join..." reboot-idle-slaves.log{,.1} | wc -l
343423
[jwood@buildbot-master74.srv.releng.usw2.mozilla.com slaverebooter]$ grep "Waiting for a free worker..." reboot-idle-slaves.log{,.1} | wc -l
19181

This all corresponds to the block at http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#258 so something is either thread-racy or broken here.

Might need more debugging goop.
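The symptom is consistent with a join loop along these lines (a sketch of the failure mode, not the actual code at that MXR link): if `join()` is called with a timeout and the worker thread never terminates, the surrounding `while w.is_alive()` loop logs and retries forever, which would produce the ~1/sec "Attempting to join..." lines above. A hard deadline breaks the loop:

```python
import time


def drain_workers(workers, join_timeout=1.0, max_total_wait=300):
    """Join outstanding worker threads, but bail out after a hard deadline.

    Sketch only. If a worker is wedged (e.g. blocked on a network call to
    slaveapi with no socket timeout), a plain
    `while w.is_alive(): w.join(join_timeout)` loop spins forever.
    """
    deadline = time.time() + max_total_wait
    for w in workers:
        while w.is_alive():
            if time.time() > deadline:
                return False  # give up; leave the stragglers behind
            w.join(join_timeout)
    return True
```

If the real code lacks such a deadline, any single wedged worker explains both the stuck lockfile and the log spam.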

(relatedly, what happened to our nagios alert for slaverebooter.lock!?!?!)
fwiw I killed the procs and removed the lockfile (which was old)

-r--r--r-- 1 cltbld cltbld         1 Nov 14 19:43 slaverebooter.lock
It's dead again. talos-linux64, you say? Probably not a coincidence that talos-linux64 has about 50 more slaves than it needs, and ssh reboots fail 100% of the time on it, and graceful shutdowns fail 100% of the time on it.

Might improve our chances of keeping slaverebooter alive through the weekend, which I'm going to take off, if we disabled 50 linux64 slaves (and maybe 10 or 20 linux32s) so it wouldn't have to deal with them. Or, just tell it not to even try to deal with them, that'd probably work fine too.
Assignee: nobody → bugspam.Callek
OS: Mac OS X → Linux
Hardware: x86 → x86_64
Here are some extra debugging aids I've added, as well as some tweaks to the various timeouts.

This is currently running in production on bm74.
Attachment #8535608 - Flags: review?(bugspam.Callek)
Despite the changes in attachment 8535608 [details] [diff] [review], we're still eventually getting into a state where we loop indefinitely on "Found a running worker. Attempting to join..."

There are a bunch of things we need to do to fix this:

* create/resurrect a nagios alert for slaverebooter.lock file age. If it's older than 1 iteration, i.e. 1 hour, something is wrong. This will help with manually unblocking until we debug the problem.

* auto-terminate the script after some threshold (50min?) even if there are still workers active. Will we compound the worker issue by running again right away? I don't know.

* figure out what that running worker is actually blocked on. Is it the same worker every time, or multiple different workers? Is the blockage local or on slaveapi? This is the next stage of debugging that needs to happen here.

* fix the cronjob to stop creating the slaverebooter.log, now that we have rotating logs generated by the script itself
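The first bullet (a lockfile-age alert) could be a small check script in the usual nagios plugin style. This is a sketch only; the lockfile path and thresholds are assumptions, not the actual deployed configuration:

```python
import os
import time

# Assumed path and threshold: one iteration is about an hour, so anything
# older than that suggests slaverebooter is wedged.
LOCKFILE = "/builds/slaverebooter/slaverebooter.lock"
MAX_AGE = 3600  # seconds


def check_lockfile_age(path=LOCKFILE, max_age=MAX_AGE):
    """Return (exit_code, message) in nagios convention:
    0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
    if not os.path.exists(path):
        return 0, "OK: no lockfile, slaverebooter not running"
    age = int(time.time() - os.path.getmtime(path))
    if age > max_age:
        return 2, "CRITICAL: %s is %d seconds old (max %d)" % (path, age, max_age)
    return 0, "OK: %s is %d seconds old" % (path, age)
```

A check like this would have caught the Nov 14 lockfile, which sat untouched for four days.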
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

Review of attachment 8535608 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/reboot-idle-slaves.py
@@ +40,4 @@
>  
>  SLAVE_QUEUE = Queue.Queue()
>  
> +PP = pprint.PrettyPrinter(indent=4)

is PP still used? (I don't see it)
(In reply to Justin Wood (:Callek) from comment #11) 
> is PP still used? (I don't see it)

Nope, but I figured we were still in debugging mode here. I'm happy to rip it out when we're happy with slaverebooter performance.
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

wfm (for now, this script is a pain)
Attachment #8535608 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

Review of attachment 8535608 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/0d8d4f943f6e
Attachment #8535608 - Flags: checked-in+
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4264]
This was fixed way back in 2014; however, we have since disabled slaverebooter.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Tools → General