Closed Bug 1079778 Opened 10 years ago Closed 9 years ago

Disabled pandas taking jobs

Categories: Infrastructure & Operations Graveyard :: CIDuty
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: pmoore
Assignee: Unassigned
Today, we had an issue where panda-0425 kept taking jobs, despite being disabled in slavealloc.

Reason:

This is where we check the disabled/enabled state from slavealloc:
https://hg.mozilla.org/build/tools/file/2469042323a6/buildfarm/mobile/watch_devices.sh#l56

The new buildbot.tac file is only retrieved if the buildbot slave is not running.

Typically, if a panda starts failing jobs and it is considered to be a problem with the foopy, an error.flg file is generated on the foopy in the panda's directory to mark it as "bad", and we expect the currently running buildbot job to trigger the buildbot slave to terminate after the job completes. Once the buildbot slave is no longer running, the new buildbot.tac file gets pulled down from slavealloc, and we are all good to go.
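
To make the failure mode concrete, here is a rough per-device sketch of the flow described above. This is not the real watch_devices.sh: the slavealloc URL, the process check and the exact commands are illustrative assumptions; only the error.flg and buildbot.tac file names come from this bug.

    #!/bin/bash
    # Illustrative sketch only -- not the actual watch_devices.sh logic.
    DEVICE="panda-0425"
    DEVICE_DIR="/builds/${DEVICE}"
    TAC_URL="https://slavealloc.example.mozilla.org/gettac/${DEVICE}"   # hypothetical URL

    if [ -f "${DEVICE_DIR}/error.flg" ]; then
        # Device already marked bad: the running job is expected to finish and
        # the buildbot slave to shut down rather than take another job.
        echo "$(date): ${DEVICE} marked bad, waiting for the slave to stop"
    elif ! pgrep -f "twistd.*${DEVICE_DIR}" > /dev/null; then
        # Only when no buildbot slave is running for this device is a fresh
        # buildbot.tac pulled down, which is the only point at which a device
        # that was disabled in slavealloc actually stops taking jobs.
        curl -sSf "${TAC_URL}" -o "${DEVICE_DIR}/buildbot.tac"
    fi
    # If the slave keeps running and no error.flg ever appears (today's case),
    # neither branch helps: the stale buildbot.tac stays in place and the
    # disabled panda keeps taking jobs.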

Today, this did not happen. Even though the panda was failing every job it took, no error.flg file got placed on foopy66 in the /builds/panda-0425 directory. Therefore the buildbot slave kept running, and watch_devices.sh never attempted to download a new buildbot.tac file from slavealloc, which would have shown that the slave was disabled.

I think there are two problems with the current approach:
1) it assumes that a panda will only be disabled if it is faulty (but it may also be disabled for a different reason, such as for a loaner, or to be decommissioned or moved to a different rack)
2) it assumes that if jobs are failing, an error.flg file will get generated, which today's incident shows is not always the case

I would propose that we fix this in one of two ways, either:
1) when a device is disabled in slavealloc, a disabled.flg file gets generated automatically, directly on the foopy (disabled.flg is already a supported way of disabling a device), or
2) we query the buildbot.tac file from slavealloc on every iteration of watch_devices.sh (every 5 mins per device) to see whether the slave has been disabled, and act accordingly (a rough sketch of this follows the considerations below).

Considerations: option 1 has the extra complication that it needs to handle retries if writing to the foopy fails, and option 2 considerably increases traffic from the foopies to slavealloc, increasing load on those systems.
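
For illustration, here is a minimal sketch of what option 2 could look like inside each watch_devices.sh iteration. Everything in it is an assumption: the slavealloc API endpoint, the JSON shape with an "enabled" field, and reacting by dropping a disabled.flg into the device directory are all hypothetical, and a real implementation might instead re-download and compare buildbot.tac as described above.

    # Hypothetical addition to each watch_devices.sh pass (option 2 sketch).
    DEVICE="panda-0425"
    DEVICE_DIR="/builds/${DEVICE}"
    API_URL="https://slavealloc.example.mozilla.org/api/slaves/${DEVICE}"   # made-up endpoint

    # Ask slavealloc whether the slave is still enabled; the "enabled" JSON
    # field is assumed for illustration.
    ENABLED=$(curl -sSf "${API_URL}" \
              | python -c 'import json,sys; print(json.load(sys.stdin)["enabled"])')

    # Anything other than an explicit "True" (including a failed query) is
    # treated as disabled here and marked on the foopy so the existing
    # disabled.flg handling can take over; a real implementation would want
    # to distinguish query failures from a genuine disable.
    if [ "${ENABLED}" != "True" ]; then
        echo "disabled in slavealloc" > "${DEVICE_DIR}/disabled.flg"
    fi

Note that this trades simplicity for load: one extra slavealloc request per device every 5 minutes, which is exactly the traffic concern raised above.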

Other suggestions for solutions are welcome! :D

Pete
Depends on: 1082955
No longer blocks: panda-0425
Blocks: panda-0425
Assignee: nobody → pmoore
Happened again today, interestingly with the same panda as in the original case (panda-0425). Therefore I'm going to try to move this bug to the top of my queue.
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81854248
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/82064766
Hey Ryan, is this still occurring?

Pete
Flags: needinfo?(ryanvm)
Assignee: pmoore → nobody
Beats me. I rarely if ever look at what pandas are taking jobs.
Flags: needinfo?(ryanvm)
It's certainly not the case that every panda that's disabled continues taking jobs, but then that never was the case. If this was about panda-0425 continuing to take jobs after being disabled, we wouldn't know if it still occurs, since that panda has been disabled for 133 days now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard