Closed Bug 1082852 Opened 10 years ago Closed 9 years ago

slaverebooter hangs indefinitely waiting for a free worker

Categories

(Release Engineering :: General, defect)

Hardware: x86_64 Linux
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: Callek)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4264] )

Attachments

(2 files)

The extra debugging I added to the slaverebooter script has narrowed the issue down to the loop where the script waits for a free worker thread:

2014-10-12 05:09:46,832 - DEBUG - Waiting for a free worker...

I haven't had time to debug the logic that surrounds that loop, but I have hacked in a hard timeout for now. The script now also generates more debug output by default, and writes its own rotating log file in an attempt to avoid the console-redirection-based logging we rely on now.

This is currently running on bm74.

Callek: can you verify the logic of the worker thread creation and reaping? The timeout I added should save us for now, but I'd rather not throw reboot requests into the void.
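For context, the loop in question behaves roughly like the sketch below (hypothetical names throughout: `MAX_WORKERS`, the `workers` list, and `wait_for_free_worker` are stand-ins for whatever the real script tracks, not the actual code). The hard timeout mirrors the 1800-second warning that shows up in the log later in this bug:

```python
import logging
import time

log = logging.getLogger(__name__)

# Assumed values: the real script's pool size and timeout may differ.
MAX_WORKERS = 8
WORKER_WAIT_TIMEOUT = 1800  # seconds; matches the warning seen in the log


def wait_for_free_worker(workers):
    """Block until a worker slot frees up, or give up after a hard timeout.

    Without the deadline, a worker thread that never terminates means
    this loop logs "Waiting for a free worker..." forever.
    """
    deadline = time.time() + WORKER_WAIT_TIMEOUT
    while True:
        # Reap finished threads so their slots become available again.
        workers[:] = [w for w in workers if w.is_alive()]
        if len(workers) < MAX_WORKERS:
            return True
        if time.time() > deadline:
            log.warning("Gave up waiting for a free worker after %d seconds",
                        WORKER_WAIT_TIMEOUT)
            return False
        log.debug("Waiting for a free worker...")
        time.sleep(0.5)
```

The key point of the timeout is that the loop can now exit even if the reaping logic is wrong, at the cost of possibly dropping a reboot request.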
Attachment #8505040 - Flags: feedback?(bugspam.Callek)
Comment on attachment 8505040 [details] [diff] [review]
Run slaverebooter in debug mode

Review of attachment 8505040 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't review too closely tonight, but f+ for more logging and rotated logs!
Attachment #8505040 - Flags: feedback?(bugspam.Callek) → feedback+
Comment on attachment 8505040 [details] [diff] [review]
Run slaverebooter in debug mode

Review of attachment 8505040 [details] [diff] [review]:
-----------------------------------------------------------------

Since it's already running in production, I've landed this.

https://hg.mozilla.org/build/tools/rev/f935d3cd6cb4
Attachment #8505040 - Flags: checked-in+
We're in similar state again on bm74, only now we're looping indefinitely on the following:

2014-11-18 18:55:02,291 - DEBUG - Found a running worker. Attempting to join...

It's been "running" in this state since Nov 14, which corresponds pretty well with philor's comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1100621#c0

-r--r--r-- 1 cltbld cltbld         1 Nov 14 19:43 slaverebooter.lock

Callek: do you want to have a look, or should I just get this unblocked by removing the ?
(In reply to Chris Cooper [:coop] from comment #3)
> Callek: do you want to have a look, or should I just get this unblocked by
> removing the ?

Er, by removing the lockfile.
Sooo, info:

2014-11-14 19:47:04,333 - DEBUG - Waiting for a free worker...
2014-11-14 19:47:04,783 - INFO - talos-linux64-ix-021 - Graceful shutdown failed, rebooting anyway
2014-11-14 19:47:04,834 - DEBUG - Waiting for a free worker...
2014-11-14 19:47:04,862 - INFO - talos-linux64-ix-021 - Reboot queued
2014-11-14 19:47:05,335 - DEBUG - Waiting for a free worker...
...
# Many instances of "Waiting for a free worker" ~2 per second...
...
2014-11-14 20:13:12,202 - DEBUG - Waiting for a free worker...
2014-11-14 20:13:12,703 - WARNING - Gave up waiting for a free worker after 1800 seconds
2014-11-14 20:13:12,704 - DEBUG - Found a running worker. Attempting to join...
...
# Repeated up until this very second, ~1 per second!


[jwood@buildbot-master74.srv.releng.usw2.mozilla.com slaverebooter]$ grep "Found a running worker. Attempting to join..." reboot-idle-slaves.log{,.1} | wc -l
343423
[jwood@buildbot-master74.srv.releng.usw2.mozilla.com slaverebooter]$ grep "Waiting for a free worker..." reboot-idle-slaves.log{,.1} | wc -l
19181

This all corresponds to the block at http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#258 so something is either thread-racy or broken here.

Might need more debugging goop.
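The symptom is consistent with a join loop along these lines (a sketch of the failure mode, not the actual code at that MXR link): if `join()` is called with a timeout and the worker thread never terminates, the surrounding `while w.is_alive()` loop logs and retries forever, which would produce the ~1/sec "Attempting to join..." lines above. A hard deadline breaks the loop:

```python
import time


def drain_workers(workers, join_timeout=1.0, max_total_wait=300):
    """Join outstanding worker threads, but bail out after a hard deadline.

    Sketch only. If a worker is wedged (e.g. blocked on a network call to
    slaveapi with no socket timeout), a plain
    `while w.is_alive(): w.join(join_timeout)` loop spins forever.
    """
    deadline = time.time() + max_total_wait
    for w in workers:
        while w.is_alive():
            if time.time() > deadline:
                return False  # give up; leave the stragglers behind
            w.join(join_timeout)
    return True
```

If the real code lacks such a deadline, any single wedged worker explains both the stuck lockfile and the log spam.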

(relatedly, what happened to our nagios alert for slaverebooter.lock!?!?!)
fwiw I killed the procs and removed the lockfile (which was old)

-r--r--r-- 1 cltbld cltbld         1 Nov 14 19:43 slaverebooter.lock
It's dead again. talos-linux64, you say? Probably not a coincidence that talos-linux64 has about 50 more slaves than it needs, and ssh reboots fail 100% of the time on it, and graceful shutdowns fail 100% of the time on it.

Might improve our chances of keeping slaverebooter alive through the weekend, which I'm going to take off, if we disabled 50 linux64 slaves (and maybe 10 or 20 linux32s) so it wouldn't have to deal with them. Or, just tell it not to even try to deal with them, that'd probably work fine too.
Assignee: nobody → bugspam.Callek
OS: Mac OS X → Linux
Hardware: x86 → x86_64
Here are some extra debugging aids I've added, as well as some tweaks to the various timeouts.

This is currently running in production on bm74.
Attachment #8535608 - Flags: review?(bugspam.Callek)
Despite the changes in attachment 8535608 [details] [diff] [review], we're still eventually getting into a state where we loop indefinitely on "Found a running worker. Attempting to join..."

There are a bunch of things we need to do to fix this:

* create/resurrect a nagios alert for slaverebooter.lock file age. If it's older than 1 iteration, i.e. 1 hour, something is wrong. This will help with manually unblocking until we debug the problem.

* auto-terminate the script after some threshold (50min?) even if there are still workers active. Will we compound the worker issue by running again right away? I don't know.

* figure out what that running worker is actually blocked on. Is it the same worker every time, or multiple different workers? Is the blockage local or on slaveapi? This is the next stage of debugging that needs to happen here.

* fix the cronjob to stop creating the slaverebooter.log, now that we have rotating logs generated by the script itself
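The first bullet (a lockfile-age alert) could be a small check script in the usual nagios plugin style. This is a sketch only; the lockfile path and thresholds are assumptions, not the actual deployed configuration:

```python
import os
import time

# Assumed path and threshold: one iteration is about an hour, so anything
# older than that suggests slaverebooter is wedged.
LOCKFILE = "/builds/slaverebooter/slaverebooter.lock"
MAX_AGE = 3600  # seconds


def check_lockfile_age(path=LOCKFILE, max_age=MAX_AGE):
    """Return (exit_code, message) in nagios convention:
    0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
    if not os.path.exists(path):
        return 0, "OK: no lockfile, slaverebooter not running"
    age = int(time.time() - os.path.getmtime(path))
    if age > max_age:
        return 2, "CRITICAL: %s is %d seconds old (max %d)" % (path, age, max_age)
    return 0, "OK: %s is %d seconds old" % (path, age)
```

A check like this would have caught the Nov 14 lockfile, which sat untouched for four days.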
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

Review of attachment 8535608 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/reboot-idle-slaves.py
@@ +40,4 @@
>  
>  SLAVE_QUEUE = Queue.Queue()
>  
> +PP = pprint.PrettyPrinter(indent=4)

is PP still used? (I don't see it)
(In reply to Justin Wood (:Callek) from comment #11) 
> is PP still used? (I don't see it)

Nope, but I figured we were still in debugging mode here. I'm happy to rip it out when we're happy with slaverebooter performance.
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

wfm (for now, this script is a pain)
Attachment #8535608 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8535608 [details] [diff] [review]
More debugging, better timeouts

Review of attachment 8535608 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/0d8d4f943f6e
Attachment #8535608 - Flags: checked-in+
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4264]
This was fixed way back in 2014; however, we have since disabled slaverebooter.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Tools → General