Closed Bug 1151591 Opened 9 years ago Closed 9 years ago

Please re-balance the Linux and Windows test pools

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: coop)

References

Details

Attachments

(1 file)

As far as I understand, we're backlogged for Windows 7 and I know we have idle capacity in the Linux pools.


From looking at today's data:
http://builddata.pub.build.mozilla.org/reports/pending/pending.html
I can only see backlog for Windows 8 jobs.

sheriffs: If we got X machines from the Linux pools to turn into Windows testers, what distribution would you want (xp vs. win7 vs. win8)?

We might want to re-verify this after a few days' worth of data under the new SETA-based scheduling.
From the testpool emails [1]
I would propose 
* win7   40%
* win8   40%
* win-xp 20%
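As a rough illustration (not any releng tooling), the proposed split could be applied to a batch of X reimaged machines like this; the rounding rule here (hand leftovers to the pools listed first) is an assumption:

```python
# Hypothetical sketch: splitting X reimaged Linux slaves across the
# Windows test pools using the proposed 40/40/20 ratio. Any rounding
# leftover is handed out in declaration order (win7 and win8 first).
def split_pool(total, weights):
    """Return machine counts per pool for the given fractional weights."""
    counts = {pool: int(total * w) for pool, w in weights.items()}
    leftover = total - sum(counts.values())
    for pool in list(weights):
        if leftover == 0:
            break
        counts[pool] += 1
        leftover -= 1
    return counts

print(split_pool(30, {"win7": 0.4, "win8": 0.4, "win-xp": 0.2}))
# 30 machines -> {'win7': 12, 'win8': 12, 'win-xp': 6}
```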

Ryan: would this work for you?

[1] Wait times per pool (minutes waited before starting: job count, percentage of jobs):
win7-ix: 6142
  0:     5349    87.09%
 15:      424     6.90%
 30:      361     5.88%
 45:        4     0.07%
 60:        4     0.07%


win8-ix: 5734
  0:     4832    84.27%
 15:      531     9.26%
 30:      344     6.00%
 45:       25     0.44%
 60:        2     0.03%


xp-ix: 5171
  0:     4766    92.17%
 15:      196     3.79%
 30:      209     4.04%
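For context, the breakdowns above can be reproduced from raw bucket counts; this is a hypothetical sketch of the report format, not the actual testpool-email generator:

```python
# Hypothetical sketch: turn wait-time bucket counts (jobs that waited
# 0, 15, 30, ... minutes before starting) into the percentage report
# shown above.
def wait_time_report(name, buckets):
    total = sum(buckets.values())
    lines = [f"{name}: {total}"]
    for minutes, jobs in sorted(buckets.items()):
        lines.append(f"{minutes:>3}: {jobs:>8} {100.0 * jobs / total:>8.2f}%")
    return "\n".join(lines)

print(wait_time_report("win7-ix", {0: 5349, 15: 424, 30: 361, 45: 4, 60: 4}))
```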
Flags: needinfo?(ryanvm)
sgtm, thanks
Flags: needinfo?(ryanvm)
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #1)
> From the testpool emails [1]
> I would propose 
> * win7   40%
> * win8   40%
> * win-xp 20%

I worry somewhat about extrapolating from a single day's wait times. All three platforms have capacity issues during peak load.

I would split any moved slaves -- I suggested 30 total in https://bugzilla.mozilla.org/show_bug.cgi?id=1122901#c21 -- equally between the three platforms. 

If we decide we *care* more about Win7 and Win8 results, then I'm fine with the proposed allotment.

For reference, here are current ADI counts, in case we want to make support decisions based on what our users are using:

* Win7:  64M
* WinXP: 21M
* Win8:  17M

From https://dataviz.mozilla.org/views/PlatformVersionFirefoxADI/WindowsDetails#1

I will start the process of disabling some talos-linux*-ix slaves for reimaging tomorrow.
Assignee: nobody → coop
Priority: -- → P2
I'm happy with an even distribution.

Thanks coop for taking this on!
As a first step, I've disabled 30 talos-linux*-ix slaves to see whether we miss the capacity tomorrow, before we go through the process of re-imaging them all:

* talos-linux32-ix-0[46-55]
* talos-linux64-ix-[100-119]
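The bracketed ranges above expand to 30 hostnames in total. A hypothetical helper for expanding them (the bracket syntax and zero padding follow the convention in this comment, not any releng tool):

```python
# Hypothetical sketch: expand slave-name ranges like
# "talos-linux64-ix-[100-119]" into concrete hostnames.
import re

def expand_range(pattern):
    """Expand a 'prefix[lo-hi]suffix' pattern into a list of hostnames."""
    m = re.match(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if not m:
        return [pattern]
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)  # preserve zero padding from the range bounds
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(int(lo), int(hi) + 1)]

hosts = expand_range("talos-linux32-ix-0[46-55]") + \
        expand_range("talos-linux64-ix-[100-119]")
print(len(hosts))  # 30 slaves disabled in total
```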

cc-ing buildduty so they're aware.
I was about to suggest that, rather than taking a contiguous chunk of working slaves, you instead start by taking the slaves which have been out of service as far back as last July (depending on various "we can't update talos" bugs), but then I realized that actually means you can't take *any* of the working ones.

Because we cannot reimage Linux talos slaves, they are not excess capacity; they are our entire stock of spares. Take the disabled ones, since they are useless, but until we can successfully reimage, the working ones are not up for grabs.
(In reply to Phil Ringnalda (:philor) from comment #6)
> I was about to suggest that, rather than taking a contiguous chunk of
> working slaves, you instead start by taking the slaves which have been out
> of service as far back as last July (depending on various "we can't update
> talos" bugs), but then I realized that actually means you can't take *any*
> of the working ones.
> 
> Because we cannot reimage Linux talos slaves, they are not excess capacity;
> they are our entire stock of spares. Take the disabled ones, since they are
> useless, but until we can successfully reimage, the working ones are not up
> for grabs.

I understand, but it's much easier for both releng and DCOps to work with contiguous blocks of machines.

I'd also much rather fix bug 1141416 so we can actually re-image these slaves again. Not being able to re-image is untenable long-term. We're currently blocked on a community member who was testing a fix in bug 1112773. If that doesn't come to fruition in the next week, I'll have someone from buildduty test it in staging.
I've already added these machines to slavealloc (disabled), and the various metro|e10s|normal variants to graphserver.
Attachment #8636633 - Flags: review?(bugspam.Callek)
Amy: I'm going to need a time-slice from a Windows admin to get these added to DNS, added to the correct domains, and configured in our GPO setup. Any estimate on when that could happen?

Also, do I need to invoke DCOps here at all, or is Relops able to handle everything?
Flags: needinfo?(arich)
I don't think there's anything you specifically need a windows admin for. This is all dcops (switching VLANs) and anyone with access to inventory. I believe all of the installation and configuration stuff should just magically happen once they have the right information in inventory and are netbooted. I can do the latter bit.

It looks like you've designated 10 for each pool, correct?
Flags: needinfo?(arich)
Attachment #8636633 - Flags: review?(bugspam.Callek) → review+
Inventory and nagios changes made. I've opened bug 1186137 for dcops to change the vlan.
Depends on: 1186137
(In reply to Amy Rich [:arr] [:arich] from comment #11)
> Inventory and nagios changes made. I've opened bug 1186137 for dcops to
> change the vlan.

Thanks, Amy.
There's an issue with the w7 and w8 machines, but most of the xp machines appear to be up (all except 164). Can you take a look at them and verify that they're up to snuff?
Coop, the rest of these have finished installing. If you could give them a look to see if they're good to go...
I enabled t-xp32-ix-172 this morning and it's happily passing jobs now, so I've enabled the rest as well.

Amy: is there any cleanup we want to do on the linux side now that these are re-imaged?
Flags: needinfo?(arich)
Since we re-purposed instead of decomming, there's no cleanup to do in inventory or nagios. I don't know if you have additional things you need to clean up in buildbot/slavealloc/etc, though.
Flags: needinfo?(arich)
I've fixed the entries in graphserver that were causing issues this morning. I've also marked all of the linux slaves as decomm in slavealloc.

I think we are done here.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Thank you very much!
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard