Closed
Bug 1079778
Opened 10 years ago
Closed 9 years ago
Disabled pandas taking jobs
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: pmoore, Unassigned)
References
Details
Today, we had an issue where panda-0425 kept taking jobs, despite being disabled in slavealloc.

Reason: this is where we check the disabled/enabled state from slavealloc: https://hg.mozilla.org/build/tools/file/2469042323a6/buildfarm/mobile/watch_devices.sh#l56

The new buildbot.tac file is only retrieved if the buildbot slave is not running. Typically, if a panda starts failing jobs and it is considered to be a problem with the foopy, an error.flg file gets generated on the foopy in the panda directory to mark it as "bad", and we expect the running buildbot job to trigger the buildbot slave to terminate after the job completes. Once the buildbot slave is no longer running, the new buildbot.tac file gets pulled down from slavealloc, and we are all good to go.

Today, this did not happen. Even though the panda was failing every job it took, no error.flg file got placed on foopy66 in the /builds/panda-0425 directory. Therefore buildbot kept running, and watch_devices.sh never attempted to download the new buildbot.tac file from slavealloc, which would have shown that the slave is disabled.

I think there are two problems with the current approach:

1) It assumes that a panda will only be disabled if it is faulty (but it may also be disabled for a different reason, such as for a loaner, or to be decomm'd or moved to a different rack).
2) It assumes that if the jobs are failing, the error.flg file will get generated, which appears not to be the case.

I would propose that we fix this in one of the following ways, either:

1) When a device is disabled in slavealloc, a disabled.flg file gets generated automatically directly on the foopy (disabled.flg is already a supported mechanism for disabling a device).
2) We query the buildbot.tac file from slavealloc on every iteration of watch_devices.sh (every 5 mins per device) to see if the slave has been disabled, and act accordingly (a rough sketch of this check follows below).

Considerations: option 1 has the extra complication that it needs to handle retries in case of a failure to write to the foopy, and option 2 considerably increases traffic from foopies to slavealloc, increasing load on those systems.

Other considered solutions welcome! :D

Pete
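For illustration only, here is a rough sketch of what the per-iteration check in option 2 might look like. This is not the actual watch_devices.sh code: the slavealloc endpoint, the variable names, and the "disabled" marker in the served tac file are all assumptions.

#!/bin/bash
# Rough sketch only -- not the actual watch_devices.sh code. It illustrates
# option 2: on each iteration, fetch this device's buildbot.tac from
# slavealloc and, if the slave is disabled there, drop a disabled.flg into
# the device directory so the existing foopy logic shuts the slave down
# gracefully after its current job.
#
# SLAVEALLOC_GETTAC, DEVICE, DEVICE_DIR and the "disabled" marker text are
# assumed names for illustration; the real endpoint and variables may differ.

SLAVEALLOC_GETTAC="https://slavealloc.example.com/gettac"
DEVICE="panda-0425"
DEVICE_DIR="/builds/${DEVICE}"

check_slavealloc_disabled() {
    local tmp_tac
    tmp_tac=$(mktemp) || return 1

    # A failed download is not proof that the device is disabled, so bail
    # out without doing anything.
    if ! curl -fsS "${SLAVEALLOC_GETTAC}/${DEVICE}" -o "${tmp_tac}"; then
        rm -f "${tmp_tac}"
        return 1
    fi

    # Assumed: slavealloc serves a stub tac containing the word "disabled"
    # when a slave is disabled.
    if grep -qi "disabled" "${tmp_tac}"; then
        echo "$(date): ${DEVICE} disabled in slavealloc" > "${DEVICE_DIR}/disabled.flg"
    fi

    rm -f "${tmp_tac}"
}

check_slavealloc_disabled

The main trade-off, as noted above, is the extra slavealloc request per device every few minutes, which increases load on that system.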
Updated•10 years ago
Blocks: panda-0425
Updated•10 years ago
No longer blocks: panda-0425
Reporter
Updated•10 years ago
Blocks: panda-0425
Reporter
Updated•10 years ago
Assignee: nobody → pmoore
Reporter
Comment 1•10 years ago
Happened again today, interestingly with the same panda as the original case (panda-0425). Therefore I'm going to move this bug to the top of my queue.
Comment 2•10 years ago
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81854248
Comment 3•10 years ago
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/82064766
Reporter
Updated•9 years ago
Assignee: pmoore → nobody
Comment 5•9 years ago
Beats me. I rarely if ever look at what pandas are taking jobs.
Flags: needinfo?(ryanvm)
Comment 6•9 years ago
It's certainly not the case that every panda that's disabled continues taking jobs, but then that never was the case. If this was about panda-0425 continuing to take jobs after being disabled, we wouldn't know if it still occurs, since that panda has been disabled for 133 days now.
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard