Closed
Bug 1072405
Opened 10 years ago
Closed 10 years ago
Investigate why backfilled pandas haven't taken any jobs
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: coop)
References
Details
(Whiteboard: [capacity])
Attachments
(3 files)
5.60 KB, patch | Callek: review+ | coop: checked-in+
15.94 KB, patch | Callek: review+ | coop: checked-in+
36.82 KB, patch | Callek: review+ | coop: checked-in+
None of the pandas that bug 1056143 backfilled into production while their chassis was being removed have taken any jobs since.
Assignee • Comment 1 • 10 years ago
If I had to guess, these pandas are likely still associated with their original foopy instead of a foopy that still exists. I'll dig into this tomorrow.
Assignee • Updated • 10 years ago
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Assignee • Comment 2 • 10 years ago
I'm slotting the replacement pandas into the same foopies as the original pandas that they replaced. This information is contained in https://bugzilla.mozilla.org/show_bug.cgi?id=1056143#c8
Attachment #8496045 - Flags: review?(bugspam.Callek)
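A minimal Python sketch of the remapping described in comment 2. It assumes devices.json is a JSON object keyed by device name and that each entry carries a `foopy` field; the helper name `slot_replacements`, the `relayhost` field, and every device and host name below are hypothetical. Only the foopy is copied: a replacement sits in a different chassis, so its relay assignment has to come from elsewhere.

```python
def slot_replacements(devices, swaps):
    """Return a copy of `devices` in which each replacement panda is
    assigned the foopy of the panda it replaces.

    `swaps` maps original device name -> replacement device name.
    The "foopy" field name is an assumption about the file layout.
    """
    updated = {name: dict(entry) for name, entry in devices.items()}
    for original, replacement in swaps.items():
        entry = updated.setdefault(replacement, {})
        entry["foopy"] = devices[original]["foopy"]
    return updated


# Hypothetical data, for illustration only.
devices = {
    "panda-0001": {"foopy": "foopy-example-1", "relayhost": "relay-example-1"},
    "panda-0002": {"foopy": "foopy-retired", "relayhost": "relay-example-2"},
}
remapped = slot_replacements(devices, {"panda-0001": "panda-0002"})
print(remapped["panda-0002"])
```

The function returns a new mapping instead of editing in place, so the result can be diffed against the original before anything is committed.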
Comment 3 • 10 years ago
Comment on attachment 8496045: Add replacement pandas to new foopies.
stamp+, are we removing (or did we already) the swapped-out pandas from devices.json?
Attachment #8496045 - Flags: review?(bugspam.Callek) → review+
Assignee • Comment 4 • 10 years ago
(In reply to Justin Wood (:Callek) from comment #3)
> stamp+, are we removing (or did we already) the swapped-out pandas from
> devices.json?
That will be step #2. I'll post a patch for it shortly.
Assignee • Comment 5 • 10 years ago
Attachment #8496289 - Flags: review?(bugspam.Callek)
Assignee • Comment 6 • 10 years ago
Comment on attachment 8496045: Add replacement pandas to new foopies.
https://hg.mozilla.org/build/tools/rev/5ad6931211e8
Attachment #8496045 - Flags: checked-in+
Comment 7 • 10 years ago
Comment on attachment 8496289: Remove decommissioned pandas from devices.json
I didn't cross check the panda list. But r+
Attachment #8496289 - Flags: review?(bugspam.Callek) → review+
Reporter • Comment 8 • 10 years ago
Step in the right direction, now they are taking jobs, but every single one of them is now disabled, because they all fail every other job (with a tiny bit of variation as they break some jobs even earlier) like https://tbpl.mozilla.org/php/getParsedLog.php?id=49026169&tree=Mozilla-Inbound#error1, failing to powercycle 75 times and thus burning the job.
Assignee • Comment 9 • 10 years ago
(In reply to Phil Ringnalda (:philor) from comment #8)
> Step in the right direction, now they are taking jobs, but every single one
> of them is now disabled, because they all fail every other job (with a tiny
> bit of variation as they break some jobs even earlier) like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=49026169&tree=Mozilla-Inbound#error1,
> failing to powercycle 75 times and thus burning the job.
Failing to powercycle means that the relay host is probably wrong and needs updating too. That info isn't available in bug 1056143, so I'm going to need to go spelunking in inventory to find it.
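One way to narrow down a wrong relay host is to compare what devices.json says against the inventory. The sketch below is illustrative Python under stated assumptions: the `relayhost` field name, the shape of the inventory export (a plain device-to-relay mapping), the helper name `relay_mismatches`, and all values are hypothetical, not taken from the real files.

```python
def relay_mismatches(devices, inventory):
    """Return (device, configured_relay, inventory_relay) tuples for every
    device whose configured relay host differs from the inventory record.

    `devices` mimics a devices.json mapping; `inventory` is assumed to be
    a simple device -> relay host mapping exported from the inventory.
    """
    mismatches = []
    for name in sorted(devices):
        configured = devices[name].get("relayhost")
        recorded = inventory.get(name)
        if configured != recorded:
            mismatches.append((name, configured, recorded))
    return mismatches


# Hypothetical data, for illustration only.
devices = {
    "panda-0001": {"relayhost": "relay-a"},
    "panda-0002": {"relayhost": "relay-b"},
}
inventory = {"panda-0001": "relay-a", "panda-0002": "relay-c"}

for name, configured, recorded in relay_mismatches(devices, inventory):
    print(f"{name}: devices.json has {configured}, inventory has {recorded}")
```

A device missing from the inventory shows up with `None` as its recorded relay, which makes gaps visible along with outright disagreements.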
Assignee • Comment 10 • 10 years ago
Comment on attachment 8496289: Remove decommissioned pandas from devices.json
https://hg.mozilla.org/build/tools/rev/00d99fd9508f
Attachment #8496289 - Flags: checked-in+
Assignee • Comment 11 • 10 years ago
(In reply to Chris Cooper [:coop] from comment #9)
> Failing to powercycle means that the relay host is probably wrong and needs
> updating too. That info isn't available in bug 1056143, so I'm going to need
> to go spelunking in inventory to find it.
https://hg.mozilla.org/build/tools/rev/16985a437a01
I've kicked off a re-image of all the affected pandas, and have re-enabled them all in slavealloc.
Reporter • Comment 12 • 10 years ago
I may well have disabled some for failing after you fixed them, thinking that I was just disabling ones that I failed to actually disable last night.
Assignee • Comment 13 • 10 years ago
(In reply to Phil Ringnalda (:philor) from comment #12)
> I may well have disabled some for failing after you fixed them, thinking
> that I was just disabling ones that I failed to actually disable last night.
I will check them all again after the in-progress reconfig finishes.
Reporter • Comment 14 • 10 years ago
Still busted, and disabled again: panda-0616, panda-0623, panda-0630, panda-0631.
Reporter • Comment 15 • 10 years ago
Also disabled panda-0615, panda-0624, panda-0628, panda-0632 and panda-0633, so let's just say "all of them, once they finally take two jobs so they can burn one of them."
Assignee • Comment 16 • 10 years ago
Since the relay assignments in devices.json are correct now, either the assignments are wrong in the inventory or maybe there's a problem at the mozpool layer. I'll start diving into the logs today.
Assignee • Comment 17 • 10 years ago
A reconfig on Saturday should have updated the tools checkout on the foopies (tools/buildfarm/maintenance/end_to_end_reconfig.sh), but the foopies still had an out-of-date tools repo. As a result, any mozharness scripts that referenced the tools checkout used stale relay data when trying to reboot the pandas. I updated the tools checkout on the foopies today, and then re-enabled 3 pandas to gauge results. Each of those 3 pandas has now run at least 2 successful jobs in a row, so I've re-enabled all the other pandas as well.
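The two checks in comment 17 (is a foopy's tools checkout current, and has a canary panda passed enough jobs in a row) can be sketched in Python as below. This is an illustration under assumptions, not the tooling that was actually used: the helper names, the host names, and the revision strings are hypothetical, and how the per-host revisions are collected is left out.

```python
def stale_checkouts(expected_rev, checkout_revs):
    """Return the hosts whose tools checkout is not at `expected_rev`.

    `checkout_revs` maps host name -> revision currently checked out.
    """
    return sorted(
        host for host, rev in checkout_revs.items() if rev != expected_rev
    )


def ready_to_reenable(results, required=2):
    """True if the most recent `required` job results are all successes.

    `results` is a chronological list of booleans (True = job passed).
    """
    recent = results[-required:]
    return len(recent) == required and all(recent)


# Hypothetical data, for illustration only.
checkout_revs = {
    "foopy-example-1": "bbbbbbbbbbbb",
    "foopy-example-2": "aaaaaaaaaaaa",
}
print(stale_checkouts("bbbbbbbbbbbb", checkout_revs))
print(ready_to_reenable([False, True, True]))
```

A panda that fails every other job never satisfies `ready_to_reenable` with `required=2`, which is why two consecutive passes is a useful bar for this particular failure pattern.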
Assignee • Updated • 10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee • Comment 18 • 10 years ago
The pandas I'm removing represent our reserve capacity. They'll be used to backfill any hardware failures in the panda pool. When that happens, a given panda will be slotted into a new foopy and relay, so keeping this old information in devices.json is pointless. Removing these pandas and their associated foopy mappings will also make fabric actions that touch foopies less painful, because we won't be trying to access 20 decommissioned foopies.
Attachment #8497132 - Flags: review?(bugspam.Callek)
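A rough Python sketch of the pruning described in comment 18: drop the reserve pandas from a devices.json-style mapping and report which foopies no longer have any device pointing at them. The `foopy` field name, the helper name `prune_devices`, and all device and host names are assumptions for illustration; every entry is assumed to have a `foopy` value.

```python
def prune_devices(devices, reserve):
    """Remove the `reserve` devices and list foopies left with no devices.

    Returns (kept_devices, orphaned_foopies). Assumes every entry in
    `devices` has a "foopy" key (an assumption about the file layout).
    """
    kept = {
        name: entry for name, entry in devices.items() if name not in reserve
    }
    foopies_before = {entry["foopy"] for entry in devices.values()}
    foopies_after = {entry["foopy"] for entry in kept.values()}
    return kept, sorted(foopies_before - foopies_after)


# Hypothetical data, for illustration only.
devices = {
    "panda-0001": {"foopy": "foopy-example-1"},
    "panda-0002": {"foopy": "foopy-example-1"},
    "panda-0003": {"foopy": "foopy-example-2"},
}
kept, orphaned = prune_devices(devices, {"panda-0003"})
print(sorted(kept))
print(orphaned)
```

The orphaned list is the set of foopies that anything iterating over devices.json would stop trying to reach once the reserve entries are gone.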
Comment 19 • 10 years ago
Comment on attachment 8497132: Prune reserve pandas from devices.json
Stamp
Attachment #8497132 - Flags: review?(bugspam.Callek) → review+
Assignee • Comment 20 • 10 years ago
Comment on attachment 8497132: Prune reserve pandas from devices.json
https://hg.mozilla.org/build/tools/rev/179bfe89bf2a
Attachment #8497132 - Flags: checked-in+
Updated • 6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated • 4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard