Closed
Bug 1151591
Opened 9 years ago
Closed 9 years ago
Please re-balance the Linux and Windows test pools
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: coop)
References
Details
Attachments
(1 file)
patch, 818 bytes (Callek: review+)
As far as I understand, we're backlogged for Windows 7 and I know we have idle capacity in the Linux pools. From looking at today's data (http://builddata.pub.build.mozilla.org/reports/pending/pending.html) I can only see backlog for Windows 8 jobs.

sheriffs: if we got X machines from the Linux pools to turn into Windows testers, what distribution would you want (xp vs win7 vs win8)? We might want to verify this after a few days' worth of data under the new SETA-based scheduling.
Reporter
Comment 1•9 years ago
From the testpool emails [1] I would propose:
* win7: 40%
* win8: 40%
* win-xp: 20%

Ryan: would this work for you?

[1]
win7-ix: 6142
  0: 5349 87.09%
 15: 424 6.90%
 30: 361 5.88%
 45: 4 0.07%
 60: 4 0.07%
win8-ix: 5734
  0: 4832 84.27%
 15: 531 9.26%
 30: 344 6.00%
 45: 25 0.44%
 60: 2 0.03%
xp-ix: 5171
  0: 4766 92.17%
 15: 196 3.79%
 30: 209 4.04%
Flags: needinfo?(ryanvm)
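The wait-time buckets in comment 1 can be reduced to a single "share of jobs that waited at all" number per pool, which is what the proposed 40/40/20 split is eyeballing. A minimal sketch (bucket data copied from the testpool email above; the `pct_waiting` helper name is mine):

```python
# Wait-time buckets (bucket -> job count) from the testpool email in comment 1.
pools = {
    "win7-ix": {0: 5349, 15: 424, 30: 361, 45: 4, 60: 4},
    "win8-ix": {0: 4832, 15: 531, 30: 344, 45: 25, 60: 2},
    "xp-ix":   {0: 4766, 15: 196, 30: 209},
}

def pct_waiting(buckets):
    """Percentage of jobs that landed outside the 0 bucket, i.e. had to wait."""
    total = sum(buckets.values())
    waited = total - buckets.get(0, 0)
    return 100.0 * waited / total

for name, buckets in pools.items():
    print(f"{name}: {pct_waiting(buckets):.2f}% of jobs waited")
```

This reproduces the complements of the 0-bucket percentages quoted above (roughly 12.9% for win7-ix, 15.7% for win8-ix, 7.8% for xp-ix), which is why win7 and win8 get the larger shares in the proposal.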
Assignee
Comment 3•9 years ago
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #1)
> From the testpool emails [1] I would propose
> * win7 40%
> * win8 40%
> * win-xp 20%

I worry somewhat about extrapolating from a single day's wait times. All three platforms have capacity issues during peak load. I would split any moved slaves -- I suggested 30 total in https://bugzilla.mozilla.org/show_bug.cgi?id=1122901#c21 -- equally between the three platforms. If we decide we *care* more about Win7 and Win8 results, then I'm fine with the proposed allotment.

For reference, here are the current ADI counts, in case we want to make support decisions based on what our users are using:
* Win7: 64M
* WinXP: 21M
* Win8: 17M
From https://dataviz.mozilla.org/views/PlatformVersionFirefoxADI/WindowsDetails#1

I will start the process of disabling some talos-linux*-ix slaves for reimaging tomorrow.
Assignee: nobody → coop
Priority: -- → P2
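Concretely, the two options on the table work out to 10/10/10 (even split) versus 12/12/6 (the 40/40/20 proposal) of the 30 slaves. A quick sketch of the arithmetic, using largest-remainder rounding so the allocation always sums to the total (the `allocate` helper and its weight dicts are illustrative, not anything from the bug):

```python
def allocate(total, weights):
    """Split `total` machines across pools proportionally to `weights`,
    handing any leftover units to the largest fractional remainders."""
    quotas = {k: total * w / sum(weights.values()) for k, w in weights.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    leftover = total - sum(alloc.values())
    for k in sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc

print(allocate(30, {"win7": 40, "win8": 40, "winxp": 20}))  # proposed weighted split
print(allocate(30, {"win7": 1, "win8": 1, "winxp": 1}))     # even split
```

With 30 machines both weightings divide evenly, so the rounding step is a no-op here; it only matters for totals that don't divide cleanly.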
Reporter
Comment 4•9 years ago
I'm happy with even distribution. Thanks coop for taking this on!
Assignee
Comment 5•9 years ago
As a first step, I've disabled 30 talos-linux*-ix slaves to see whether we miss the capacity tomorrow, before we go through the process of re-imaging them all:
* talos-linux32-ix-0[46-55]
* talos-linux64-ix-[100-119]

cc-ing buildduty so they're aware.
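The two bracketed ranges above expand to exactly 30 hostnames. A small sketch to enumerate them (the `expand` helper is mine, and it assumes the pool's zero-padded three-digit numbering, e.g. `talos-linux32-ix-046`):

```python
def expand(prefix, start, end, width=3):
    """Expand a slave-name range like talos-linux32-ix-0[46-55] into hostnames."""
    return [f"{prefix}-{n:0{width}d}" for n in range(start, end + 1)]

# The two blocks disabled in comment 5.
disabled = (expand("talos-linux32-ix", 46, 55)      # 10 hosts
            + expand("talos-linux64-ix", 100, 119)) # 20 hosts
print(len(disabled), disabled[0], disabled[-1])
```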
Assignee
Updated•9 years ago
Blocks: talos-linux64-ix-102
Assignee
Updated•9 years ago
Blocks: talos-linux64-ix-104
Assignee
Updated•9 years ago
Blocks: talos-linux64-ix-106
Assignee
Updated•9 years ago
Blocks: talos-linux64-ix-107
Assignee
Updated•9 years ago
Blocks: talos-linux64-ix-108
Assignee
Updated•9 years ago
Blocks: talos-linux64-ix-117
Assignee
Updated•9 years ago
Blocks: talos-linux32-ix-049
Assignee
Updated•9 years ago
Blocks: talos-linux32-ix-050
Assignee
Updated•9 years ago
Blocks: talos-linux32-ix-055
Comment 6•9 years ago
I was about to suggest that rather than taking a contiguous chunk of working slaves, you instead start by taking the slaves which have been out of service back as far as last July, depending off various "we can't update talos" bugs, when I realized that actually means you can't take *any* of the working ones.

Because we cannot reimage Linux talos slaves, they are not excess capacity; they are our entire stock of spares. Take the disabled ones, they are useless, but until we can successfully reimage, the working ones are not up for grabs.
Assignee
Comment 7•9 years ago
(In reply to Phil Ringnalda (:philor) from comment #6)
> I was about to suggest that rather than taking a contiguous chunk of working
> slaves, you instead start by taking the slaves which have been out of
> service back as far as last July, depending off various "we can't update
> talos" bugs, when I realized that actually means you can't take *any* of the
> working ones.
>
> Because we cannot reimage Linux talos slaves, they are not excess capacity,
> they are our entire stock of spares. Take the disabled ones, they are
> useless, but until we can successfully reimage, the working ones are not up
> for grabs.

I understand, but it's much easier for both releng and DCOps to work with contiguous blocks of machines.

I'd also much rather fix bug 1141416 so we can actually re-image these slaves again. Not being able to re-image is untenable long-term. We're currently blocked on a community member who was testing a fix in bug 1112773. If that doesn't come to fruition in the next week, I'll have someone from buildduty test it in staging.
Assignee
Comment 8•9 years ago
I've already added these machines to slavealloc (disabled), and the various metro|e10s|normal variants to graphserver.
Attachment #8636633 - Flags: review?(bugspam.Callek)
Assignee
Comment 9•9 years ago
Amy: I'm going to need a time-slice from a Windows admin to get these added to DNS, added to the correct domains, and configured in our GPO setup. Any estimate on when that could happen? Also, do I need to invoke DCOps here at all, or is Relops able to handle everything?
Flags: needinfo?(arich)
Comment 10•9 years ago
I don't think there's anything you specifically need a windows admin for. This is all dcops (switching VLANs) and anyone with access to inventory. I believe all of the installation and configuration stuff should just magically happen once they have the right information in inventory and are netbooted. I can do the latter bit. It looks like you've designated 10 for each pool, correct?
Flags: needinfo?(arich)
Updated•9 years ago
Attachment #8636633 - Flags: review?(bugspam.Callek) → review+
Comment 11•9 years ago
Inventory and nagios changes made. I've opened bug 1186137 for dcops to change the vlan.
Depends on: 1186137
Assignee
Comment 12•9 years ago
(In reply to Amy Rich [:arr] [:arich] from comment #11)
> Inventory and nagios changes made. I've opened bug 1186137 for dcops to
> change the vlan.

Thanks, Amy.
Comment 13•9 years ago
There's an issue with the w7 and w8 machines, but most of the xp machines appear to be up (not 164). Can you take a look at them and verify that they're up to snuff?
Assignee
Comment 14•9 years ago
In production: https://hg.mozilla.org/build/buildbot-configs/rev/b166a5cdbcc2
Comment 15•9 years ago
Coop, the rest of these have finished installing. If you could give them a look to see if they're good to go...
Assignee
Comment 16•9 years ago
I enabled t-xp32-ix-172 this morning and it's happily passing jobs now, so I've enabled the rest as well.

Amy: is there any cleanup we want to do on the linux side now that these are re-imaged?
Flags: needinfo?(arich)
Comment 17•9 years ago
Since we re-purposed instead of decomming, there's no cleanup to do in inventory or nagios. I don't know if you have additional things you need to clean up in buildbot/slavealloc/etc, though.
Flags: needinfo?(arich)
Assignee
Comment 18•9 years ago
I've fixed the entries in graphserver that were causing issues this morning. I've also marked all of the linux slaves as decomm in slavealloc. I think we are done here.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Reporter
Comment 19•9 years ago
Thank you very much!
Updated•6 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard