Closed Bug 1192234 Opened 9 years ago Closed 9 years ago

High Linux64 pending job backlog

Categories: Infrastructure & Operations Graveyard :: CIDuty
Type: task
Severity: blocker
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: RyanVM, Unassigned)

Details
We're seeing very high Linux64 test backlog right now, but slave health isn't showing us anywhere near maxed out on instances. I'm closing trees until this can be investigated.
Builds are also falling way behind.
Summary: High Linux64 pending test jobs → High Linux64 pending job backlog
looking:

 Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: 2015-08-07 07:01:38,323 - Cannot start
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: Traceback (most recent call last):
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:   File "aws_watch_pending.py", line 260, in do_request_spot_instances
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:     is_spot=True, dryrun=dryrun, all_instances=all_instances)
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:   File "aws_watch_pending.py", line 313, in do_request_instance
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:     "region_dns_atom": get_region_dns_atom(region)})
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:   File "/builds/aws_manager/cloud-tools/cloudtools/aws/instance.py", line 315, in user_data_from_template
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:     user_data = user_data.format(**tokens)
Aug 07 07:01:38 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: KeyError: 'moz_instance_type'
...and the errors started at

 Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: 2015-08-06 11:16:17,648 - Cannot start
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: Traceback (most recent call last):
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:   File "aws_watch_pending.py", line 260, in do_request_spot_instances
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:     is_spot=True, dryrun=dryrun, all_instances=all_instances)
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:   File "aws_watch_pending.py", line 313, in do_request_instance
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:     "region_dns_atom": get_region_dns_atom(region)})
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:   File "/builds/aws_manager/cloud-tools/cloudtools/aws/instance.py", line 315, in user_data_from_template
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py:     user_data = user_data.format(**tokens)
Aug 06 11:16:17 aws-manager2.srv.releng.scl3.mozilla.com aws_watch_pending.py: KeyError: 'moz_instance_type'
catlee applied the following fix manually to see if that stops the tracebacks:



(aws_manager)[buildduty@aws-manager2.srv.releng.scl3.mozilla.com cloud-tools]$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   cloudtools/aws/instance.py

no changes added to commit (use "git add" and/or "git commit -a")
(aws_manager)[buildduty@aws-manager2.srv.releng.scl3.mozilla.com cloud-tools]$ git diff
diff --git a/cloudtools/aws/instance.py b/cloudtools/aws/instance.py
index 1c801b5..365ee8a 100644
--- a/cloudtools/aws/instance.py
+++ b/cloudtools/aws/instance.py
@@ -312,7 +312,7 @@ def create_block_device_mapping(ami, device_map):
 def user_data_from_template(moz_instance_type, tokens):
     user_data = get_user_data_tmpl(moz_instance_type)
     if user_data:
-        user_data = user_data.format(**tokens)
+        user_data = user_data.format(moz_instance_type=moz_instance_type, **tokens)
     return user_data
I missed a token during an earlier refactor. It is corrected here: https://github.com/mozilla/build-cloud-tools/pull/99
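The failure mode is standard str.format behavior: any placeholder in the template with no matching keyword raises KeyError, which is exactly the "KeyError: 'moz_instance_type'" in the tracebacks above. A minimal sketch (the template string here is hypothetical, not the real user-data template):

```python
# Hypothetical user-data template with a {moz_instance_type} placeholder.
template = "#!/bin/bash\necho 'starting {moz_instance_type} in {region_dns_atom}'"

tokens = {"region_dns_atom": "use1"}  # no moz_instance_type key after the refactor

try:
    template.format(**tokens)
except KeyError as e:
    print("Cannot start:", e)  # KeyError: 'moz_instance_type'

# The manual fix passes the missing token explicitly, as in the diff above:
user_data = template.format(moz_instance_type="tst-linux64", **tokens)
print(user_data)
```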
We're now creating new instances properly, but it's taking a while to work through the backlog.
After the above fix, we hit another issue that prevented us from creating new instances:

There are not enough free addresses in subnet 'subnet-5cd0d828' to satisfy the requested number of instances. (400 response code)

The SNS alert about this also failed to arrive in a timely manner, coming >30min after we started hitting this error. 

We were still in this state as of 30min ago, but a new round of spot requests has just triggered. I'm watching those to see if they succeed.
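The 400 error means the chosen subnet had no free private IPs left. One way to avoid stuffing requests into an exhausted subnet is to check available addresses first; a sketch below, assuming subnet records shaped like the entries boto3's EC2 describe_subnets returns (the function name and sample data are illustrative, not from cloud-tools):

```python
def pick_subnet(subnets, count_needed):
    """Return the subnet with the most free addresses that can fit the
    requested instance count, or None if no subnet has enough capacity."""
    candidates = [s for s in subnets
                  if s["AvailableIpAddressCount"] >= count_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["AvailableIpAddressCount"])

# Example: the exhausted subnet from the error above vs. one with headroom.
subnets = [
    {"SubnetId": "subnet-5cd0d828", "AvailableIpAddressCount": 0},
    {"SubnetId": "subnet-aaaa1111", "AvailableIpAddressCount": 120},
]
print(pick_subnet(subnets, 10)["SubnetId"])  # subnet-aaaa1111
print(pick_subnet(subnets, 500))             # None
```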
(In reply to Chris Cooper [:coop] from comment #7) 
> We were still in this state as of 30min ago, but a new round of spot
> requests has just triggered. I'm watching those to see if they succeed.

The new requests are being honored now, so I think we're past this particular hurdle. 

There are systemic improvements we can make to avoid these problems in the future, or at least make diagnosis easier and more timely. From IRC:

[1:07pm] catlee: there are several problems here: 1) we don't get alerts when this code is failing; 2) it takes too long to create spot requests (like 3-4s per spot instance), so it takes a really long time to spin up enough instances to respond to a load spike now; 3) the code doesn't track free IP addresses well enough, so tries to stuff too many instances into the same subnet
[1:07pm] catlee: it will error out eventually and recover, but then you're back to 2)
[1:07pm] coop: is there a bug on file for #1 already?
[1:08pm] catlee: and because we're churning so much, we probably are hitting the request limit more often
[1:08pm] catlee: no
[1:08pm] catlee: oh, and apparently we fail to kill this script off when it's taking too long to run, so we have 2 going in parallel
[1:09pm] coop: so sounds like there are scripting, logging, and reporting improvements we could make to avoid a lot of pain
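For the "2 going in parallel" problem catlee mentions, a common guard is a non-blocking file lock so a new run exits immediately if a previous run is still going. A minimal sketch (the lock path and function name are hypothetical; this is not how cloud-tools currently does it):

```python
import fcntl
import sys

# Hypothetical lock path; a real deployment would pick a path the
# buildduty user owns and that survives only for the process lifetime.
LOCK_PATH = "/tmp/aws_watch_pending.lock"

def acquire_single_instance_lock(path=LOCK_PATH):
    """Return an open, locked file handle, or None if another run holds it.

    The lock is released automatically when the handle is closed or the
    process exits, so a crashed run cannot wedge future runs."""
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh

if __name__ == "__main__":
    lock = acquire_single_instance_lock()
    if lock is None:
        sys.exit("another aws_watch_pending run is in progress; exiting")
    # ... do the work, keeping `lock` open for the duration ...
```

Combining this with a hard runtime limit (e.g. running the script under a watchdog that kills it after N minutes) would address both the overlap and the "taking too long to run" halves of the problem.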
(In reply to Chris Cooper [:coop] from comment #8)
> [1:09pm] coop: so sounds like there are scripting, logging, and reporting
> improvements we could make to avoid a lot of pain

I've re-purposed bug 1109871 to track these improvements.
Things are caught up, so I've reopened trees.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard