Closed Bug 1079778 Opened 10 years ago Closed 9 years ago

Disabled pandas taking jobs

Categories: Infrastructure & Operations Graveyard :: CIDuty
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: pmoore
Assignee: Unassigned
Today, we had an issue where panda-0425 kept taking jobs, despite being disabled in slavealloc.

Reason:

This is where we check the disabled/enabled state from slavealloc:
https://hg.mozilla.org/build/tools/file/2469042323a6/buildfarm/mobile/watch_devices.sh#l56

The new buildbot.tac file is only retrieved if the buildbot slave is not running.

Typically, if a panda starts failing jobs and it is considered to be a problem with the foopy, an error.flg file is generated on the foopy in the panda's directory to mark it as "bad", and we expect the currently running buildbot job to trigger the buildbot slave to terminate after the job completes. Once the buildbot slave is no longer running, the new buildbot.tac file gets pulled down from slavealloc, and we are all good to go.
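
To make the failure mode concrete, here is a rough per-device sketch of the flow described above. This is not the real watch_devices.sh: the slavealloc URL, the process check and the exact commands are illustrative assumptions; only the error.flg and buildbot.tac file names come from this bug.

    #!/bin/bash
    # Illustrative sketch only -- not the actual watch_devices.sh logic.
    DEVICE="panda-0425"
    DEVICE_DIR="/builds/${DEVICE}"
    TAC_URL="https://slavealloc.example.mozilla.org/gettac/${DEVICE}"   # hypothetical URL

    if [ -f "${DEVICE_DIR}/error.flg" ]; then
        # Device already marked bad: the running job is expected to finish and
        # the buildbot slave to shut down rather than take another job.
        echo "$(date): ${DEVICE} marked bad, waiting for the slave to stop"
    elif ! pgrep -f "twistd.*${DEVICE_DIR}" > /dev/null; then
        # Only when no buildbot slave is running for this device is a fresh
        # buildbot.tac pulled down, which is the only point at which a device
        # that was disabled in slavealloc actually stops taking jobs.
        curl -sSf "${TAC_URL}" -o "${DEVICE_DIR}/buildbot.tac"
    fi
    # If the slave keeps running and no error.flg ever appears (today's case),
    # neither branch helps: the stale buildbot.tac stays in place and the
    # disabled panda keeps taking jobs.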

Today, this did not happen. Even though the panda was failing every job it took, no error.flg file got placed on foopy66 in the /builds/panda-0425 directory. Therefore the buildbot slave kept running, and watch_devices.sh never attempted to download a new buildbot.tac file from slavealloc, which would have shown that the slave was disabled.

I think there are two problems with the current approach:
1) it assumes that a panda will only be disabled if it is faulty (but it may also be disabled for a different reason, such as for a loaner, or to be decommissioned or moved to a different rack)
2) it assumes that if jobs are failing, an error.flg file will get generated, which today's incident shows is not always the case

I would propose that we fix this in one of two ways, either:
1) when a device is disabled in slavealloc, a disabled.flg file gets generated automatically, directly on the foopy (disabled.flg is already a supported way of disabling a device), or
2) we query the buildbot.tac file from slavealloc on every iteration of watch_devices.sh (every 5 mins per device) to see whether the slave has been disabled, and act accordingly (a rough sketch of this follows the considerations below).

Considerations: option 1 has the extra complication that it needs to handle retries if writing to the foopy fails, and option 2 considerably increases traffic from the foopies to slavealloc, increasing load on those systems.
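
For illustration, here is a minimal sketch of what option 2 could look like inside each watch_devices.sh iteration. Everything in it is an assumption: the slavealloc API endpoint, the JSON shape with an "enabled" field, and reacting by dropping a disabled.flg into the device directory are all hypothetical, and a real implementation might instead re-download and compare buildbot.tac as described above.

    # Hypothetical addition to each watch_devices.sh pass (option 2 sketch).
    DEVICE="panda-0425"
    DEVICE_DIR="/builds/${DEVICE}"
    API_URL="https://slavealloc.example.mozilla.org/api/slaves/${DEVICE}"   # made-up endpoint

    # Ask slavealloc whether the slave is still enabled; the "enabled" JSON
    # field is assumed for illustration.
    ENABLED=$(curl -sSf "${API_URL}" \
              | python -c 'import json,sys; print(json.load(sys.stdin)["enabled"])')

    # Anything other than an explicit "True" (including a failed query) is
    # treated as disabled here and marked on the foopy so the existing
    # disabled.flg handling can take over; a real implementation would want
    # to distinguish query failures from a genuine disable.
    if [ "${ENABLED}" != "True" ]; then
        echo "disabled in slavealloc" > "${DEVICE_DIR}/disabled.flg"
    fi

Note that this trades simplicity for load: one extra slavealloc request per device every 5 minutes, which is exactly the traffic concern raised above.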

Other suggestions for solutions are welcome! :D

Pete
Depends on: 1082955
No longer blocks: panda-0425
Blocks: panda-0425
Assignee: nobody → pmoore
Happened again today, interestingly with the same panda as in the original case (panda-0425). Therefore I'm going to try to move this bug to the top of my queue.
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81854248
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/82064766
Hey Ryan, is this still occurring?

Pete
Flags: needinfo?(ryanvm)
Assignee: pmoore → nobody
Beats me. I rarely if ever look at what pandas are taking jobs.
Flags: needinfo?(ryanvm)
It's certainly not the case that every panda that's disabled continues taking jobs, but then that never was the case. If this was about panda-0425 continuing to take jobs after being disabled, we wouldn't know if it still occurs, since that panda has been disabled for 133 days now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard