Closed Bug 1375986 Opened 7 years ago Closed 6 years ago

too many win7 instances running for pending load (buildbot jobs)

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: kmoir, Unassigned)

Details

Attachments

(6 files)

Since bug 1370300 landed, our need for win7 spot instances should have dropped significantly.

However, catlee mentioned to me last week that the number of win7 instances provisioned is too high for the pending count.

Here are the relevant graphs:
- running jobs:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/running
- instances by moz-type:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard?from=now-2d&to=now&panelId=8&fullscreen

Could you please investigate why so many win7 AWS instances are being provisioned for buildbot when our pending counts don't seem to warrant it?
Summary: too many win7 instances up for pending load → too many win7 instances running for pending load (buildbot jobs)
Assigning this to Alin. But this comes second in priority after key rotations and bug 1373289.
Assignee: nobody → aselagea
Status: NEW → ASSIGNED
Priority: -- → P2
Priority: P2 → P1
Checked the AWS console and terminated the few win7 instances older than 3 days.

On the other hand, I think [1] takes into account all running instances from both BB and TC (need to confirm that), since it's looking at data coming from 'aws_watch_pending'. Dropping the win7 debug tests that were running in BB *should* have resulted in a significant drop in the number of running instances - which happened - but I don't think we should expect it to fall to half of the initial value, since we still have the TC tests.

I created three separate panels (see attachments) that grab data from reportor (which I think only shows BB data) and noticed that the current number of win7 instances used for BB jobs is roughly half of what it was before bug 1370300 landed. I don't have the rights to create a new dashboard, so I can't provide links to those panels; that's why I attached screenshots instead.

So I wonder if there are other numbers you'd expect here? Or am I missing something?

[1] https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard?from=1496332460400&to=1501084460400&panelId=8&fullscreen
So the problem seems to be related to the fact that idle w732 instances are not being stopped.
Looking at the two screenshots below, during the last 2 days we had ~60 t-w732 spot machines running constantly, even though the number of pending jobs was mostly down to zero.
Looking at the "aws_stop_idle.py" script, I noticed that we only pass the following instance types as parameters:
    - bld-linux64
    - tst-linux64
    - tst-linux32
    - tst-emulator64
    - try-linux64
    - av-linux64

I think we should add the Windows instance types as well (rough sketch after the list):
    - b-2008
    - y-2008
    - t-w732
    - g-w732
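
For what it's worth, here's a rough conceptual sketch of what stopping idle Windows spot instances by moz-type could look like. This is *not* the actual aws_stop_idle.py logic or CLI (it uses boto3 and a placeholder idle check), just an illustration of covering the Windows types as well:

# Conceptual sketch only -- NOT the real aws_stop_idle.py. It assumes instances
# carry a 'moz-type' tag and uses boto3; the real script has its own CLI and an
# SSH-based idle check, which is exactly the part that's unclear for Windows.
import boto3

WINDOWS_MOZ_TYPES = ["b-2008", "y-2008", "t-w732", "g-w732"]

def is_idle(instance):
    # Placeholder: always False here. The real check would inspect the buildbot
    # slave's state (the SSH/key problem discussed below is why this is hard).
    return False

def stop_idle_windows_instances(region="us-east-1", dry_run=True):
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:moz-type", "Values": WINDOWS_MOZ_TYPES},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    idle_ids = [inst["InstanceId"]
                for res in reservations
                for inst in res["Instances"]
                if is_idle(inst)]

    if idle_ids and not dry_run:
        ec2.stop_instances(InstanceIds=idle_ids)
    return idle_ids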
Per IRC:

<catlee> aselagea|buildduty: I don't think that will work, since ssh into the machines doesn't work
16:36:39 <catlee> or does it?
16:37:13 <catlee> I was wondering maybe about the idleizer settings in slavealloc for those machines
16:37:14 * aselagea|buildduty checks
16:39:58 <catlee> the idleizer settings are in the template in slavealloc I think
16:42:36 <aselagea|buildduty> hmm, ssh works..but only using a password
16:42:47 <aselagea|buildduty> it's not working when using a key
Evolution of the no. of running Windows instances during the last 7 days (b-2008, y-2008, t-w732, g-w732).
Attached image running_jobs_7_days.PNG
Evolution of the no. of running Windows jobs during the last 7 days (b-2008, y-2008, t-w732, g-w732).
Sorry for not tackling this earlier. 

So at this point, the number of AWS instances required for buildbot jobs is fairly low. The two screenshots above show that the idle instances get terminated once the queue empties, but we probably wait too long before doing so.

I *think* we could adjust the idleizer settings [1] a bit (see the sketch below) and reduce:
   - the idle time before a reboot (7 hours at this point)
   - the time before a reboot after being disconnected from a buildbot master (1 hour at this point)

[1] https://hg.mozilla.org/build/tools/file/default/lib/python/slavealloc/logic/buildbottac.py#l50
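
For illustration only, here's roughly what lowering those two values would look like, using the setting names that show up in the TAC (max_idle_time / max_disconnected_time). The specific numbers are just examples to make the proposal concrete, not values anyone has agreed on:

# Illustrative only -- example values, not a decision.
max_idle_time = 3600 * 2        # reboot after 2 idle hours instead of 7
max_disconnected_time = 1800    # reboot after 30 min disconnected from a master instead of 1 hour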

@Chris: what would you suggest here?
Flags: needinfo?(catlee)
I think these values are set directly in the slavealloc templates at this point, rather than coming from buildbottac.py.

e.g. https://secure.pub.build.mozilla.org/slavealloc/api/gettac/g-w732-spot-001 has
max_idle_time=1500
max_disconnected_time=3600*1

These settings are configured in the database I think.
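
A quick way to check the values a given slave currently gets is to read them back from the gettac endpoint mentioned above. A minimal sketch, assuming the endpoint returns the .tac contents as plain text (if it's JSON-wrapped, extract the relevant field first) and that you have network access/auth to secure.pub.build.mozilla.org:

# Check the idleizer values a slave currently gets from slavealloc.
import re
import requests

BASE = "https://secure.pub.build.mozilla.org/slavealloc/api/gettac/"

def idleizer_settings(slave_name):
    tac = requests.get(BASE + slave_name).text
    settings = {}
    for key in ("max_idle_time", "max_disconnected_time"):
        m = re.search(r"%s\s*=\s*([0-9*]+)" % key, tac)
        if m:
            # eval is safe enough here: the regex only allows digits and '*',
            # so expressions like "3600*1" evaluate to plain integers.
            settings[key] = eval(m.group(1))
    return settings

print(idleizer_settings("g-w732-spot-001"))
# expected, per the comment above: {'max_idle_time': 1500, 'max_disconnected_time': 3600}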
Flags: needinfo?(catlee)
So the slavealloc table containing the template is "tac_templates" - I was confused by the fact that we store python code as a field value :)

Indeed, the idleizer values are the same as the ones mentioned in #comment 13 above, so it seems we'll need to adjust those.
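
If the change does end up being made directly in the database, a hypothetical sketch of the edit is below. The column names ("name", "template"), the MySQL-style %s placeholders, and the plain DB-API connection are assumptions about the slavealloc schema, so treat this as pseudocode until checked against the real table:

# Hypothetical: rewrite the idleizer values stored in the "tac_templates" row.
import re

def lower_idleizer_values(conn, template_name, new_idle=3600 * 2, new_disconnected=1800):
    cur = conn.cursor()
    cur.execute("SELECT template FROM tac_templates WHERE name = %s", (template_name,))
    (template,) = cur.fetchone()

    # The template field stores Python code, so just substitute the assignments.
    template = re.sub(r"max_idle_time\s*=\s*[0-9*]+",
                      "max_idle_time = %d" % new_idle, template)
    template = re.sub(r"max_disconnected_time\s*=\s*[0-9*]+",
                      "max_disconnected_time = %d" % new_disconnected, template)

    cur.execute("UPDATE tac_templates SET template = %s WHERE name = %s",
                (template, template_name))
    conn.commit()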

@catlee: I noticed that you made similar changes in bug 1304064 and were able to follow some metrics to confirm that the effects were the expected ones. Do you recall how you obtained those metrics? They would also be useful in this case.
Note: I will no longer work on buildduty in 2018, so I'm counting on the new buildduty folks to deal with this bug ;)
Thanks!
Assignee: aselagea → nobody
Status: ASSIGNED → NEW
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: catlee → jlund
Because we are now Buildbot-free (aside from esr52), we no longer need to worry about provisioner efficiency improvements here.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard