Closed
Bug 1192234
Opened 9 years ago
Closed 9 years ago
High Linux64 pending job backlog
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RyanVM, Unassigned)
Details
We're seeing very high Linux64 test backlog right now, but slave health isn't showing us anywhere near maxed out on instances. I'm closing trees until this can be investigated.
Comment 1 (Reporter) • 9 years ago
Builds are also falling way behind.
Summary: High Linux64 pending test jobs → High Linux64 pending job backlog
Comment 2•9 years ago
looking: aws_watch_pending.py on aws-manager2.srv.releng.scl3.mozilla.com is failing with:

2015-08-07 07:01:38,323 - Cannot start
Traceback (most recent call last):
  File "aws_watch_pending.py", line 260, in do_request_spot_instances
    is_spot=True, dryrun=dryrun, all_instances=all_instances)
  File "aws_watch_pending.py", line 313, in do_request_instance
    "region_dns_atom": get_region_dns_atom(region)})
  File "/builds/aws_manager/cloud-tools/cloudtools/aws/instance.py", line 315, in user_data_from_template
    user_data = user_data.format(**tokens)
KeyError: 'moz_instance_type'
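The traceback above boils down to str.format() being handed a template that still references {moz_instance_type} while the token dict no longer supplies it. A minimal reproduction (the template text and token values here are illustrative, not the real cloud-tools template):

```python
# str.format raises KeyError as soon as the template references a
# placeholder that is missing from the supplied keyword arguments.
user_data_tmpl = "type={moz_instance_type} dns={region_dns_atom}"
tokens = {"region_dns_atom": "use1"}  # 'moz_instance_type' dropped during a refactor

try:
    user_data_tmpl.format(**tokens)
except KeyError as e:
    missing = e.args[0]  # the name of the unresolved placeholder

print(missing)  # -> moz_instance_type
```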
Comment 3•9 years ago
....and the errors started at 2015-08-06 11:16:17, with the same traceback:

2015-08-06 11:16:17,648 - Cannot start
Traceback (most recent call last):
  File "aws_watch_pending.py", line 260, in do_request_spot_instances
    is_spot=True, dryrun=dryrun, all_instances=all_instances)
  File "aws_watch_pending.py", line 313, in do_request_instance
    "region_dns_atom": get_region_dns_atom(region)})
  File "/builds/aws_manager/cloud-tools/cloudtools/aws/instance.py", line 315, in user_data_from_template
    user_data = user_data.format(**tokens)
KeyError: 'moz_instance_type'
Comment 4•9 years ago
catlee applied the following fix manually to see if that stops the tracebacks:

(aws_manager)[buildduty@aws-manager2.srv.releng.scl3.mozilla.com cloud-tools]$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   cloudtools/aws/instance.py

no changes added to commit (use "git add" and/or "git commit -a")

(aws_manager)[buildduty@aws-manager2.srv.releng.scl3.mozilla.com cloud-tools]$ git diff
diff --git a/cloudtools/aws/instance.py b/cloudtools/aws/instance.py
index 1c801b5..365ee8a 100644
--- a/cloudtools/aws/instance.py
+++ b/cloudtools/aws/instance.py
@@ -312,7 +312,7 @@ def create_block_device_mapping(ami, device_map):
 def user_data_from_template(moz_instance_type, tokens):
     user_data = get_user_data_tmpl(moz_instance_type)
     if user_data:
-        user_data = user_data.format(**tokens)
+        user_data = user_data.format(moz_instance_type=moz_instance_type, **tokens)
     return user_data
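For readability, here is how the helper reads after the patch, with the template loader stubbed out so the sketch is self-contained (the real get_user_data_tmpl lives in cloud-tools; the stub and template text below are illustrative only):

```python
def get_user_data_tmpl(moz_instance_type):
    # Stand-in for the real cloud-tools loader, which returns the
    # user-data template for the given instance type.
    return "instance_type={moz_instance_type} dns={region_dns_atom}"

def user_data_from_template(moz_instance_type, tokens):
    user_data = get_user_data_tmpl(moz_instance_type)
    if user_data:
        # The fix: pass moz_instance_type explicitly alongside the tokens,
        # since the callers' token dicts no longer carry it after the refactor.
        user_data = user_data.format(moz_instance_type=moz_instance_type, **tokens)
    return user_data

print(user_data_from_template("tst-linux64", {"region_dns_atom": "use1"}))
# -> instance_type=tst-linux64 dns=use1
```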
Comment 5•9 years ago
I missed a token during an earlier refactor. It is corrected here: https://github.com/mozilla/build-cloud-tools/pull/99
Comment 6•9 years ago
We're now creating new instances properly, but it's taking a while to work through the backlog.
Comment 7•9 years ago
After the above fix, we hit another issue that prevented us from creating new instances:

There are not enough free addresses in subnet 'subnet-5cd0d828' to satisfy the requested number of instances. (400 response code)

The SNS alert about this also failed to arrive in a timely manner, coming >30min after we started hitting this error. We were still in this state as of 30min ago, but a new round of spot requests has just triggered. I'm watching those to see if they succeed.
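One way to avoid over-stuffing a subnet like this is to consult the free-address counts (EC2's DescribeSubnets reports AvailableIpAddressCount per subnet) and spread requests accordingly before submitting them. A hypothetical allocation helper, not the cloud-tools code:

```python
def pick_subnets(needed, free_by_subnet):
    """Spread `needed` instance requests across subnets without
    oversubscribing any of them.

    free_by_subnet maps subnet id -> free IP count (e.g. taken from
    EC2 DescribeSubnets' AvailableIpAddressCount). Returns the
    per-subnet allocation plus any unsatisfiable remainder.
    """
    allocation = {}
    # Fill the emptiest-last: start with the subnet with the most room.
    for subnet, free in sorted(free_by_subnet.items(), key=lambda kv: -kv[1]):
        if needed <= 0:
            break
        take = min(free, needed)
        if take > 0:
            allocation[subnet] = take
            needed -= take
    return allocation, needed  # remainder > 0 means capacity is exhausted

print(pick_subnets(10, {"subnet-a": 6, "subnet-b": 3}))
# -> ({'subnet-a': 6, 'subnet-b': 3}, 1)
```

A leftover remainder greater than zero is the signal to alert immediately, rather than discovering the exhaustion via a 400 from the API.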
Comment 8•9 years ago
(In reply to Chris Cooper [:coop] from comment #7)
> We were still in this state as of 30min ago, but a new round of spot
> requests has just triggered. I'm watching those to see if they succeed.

The new requests are being honored now, so I think we're past this particular hurdle.

There are systemic improvements we can make to avoid these problems in the future, or at least make diagnosis easier and more timely. From IRC:

[1:07pm] catlee: there are several problems here: 1) we don't get alerts when this code is failing; 2) it takes too long to create spot requests (like 3-4s per spot instance), so it takes a really long time to spin up enough instances to respond to a load spike now; 3) the code doesn't track free IP addresses well enough, so tries to stuff too many instances into the same subnet
[1:07pm] catlee: it will error out eventually and recover, but then you're back to 2)
[1:07pm] coop: is there a bug on file for #1 already?
[1:08pm] catlee: and because we're churning so much, we probably are hitting the request limit more often
[1:08pm] catlee: no
[1:08pm] catlee: oh, and apparently we fail to kill this script off when it's taking too long to run, so we have 2 going in parallel
[1:09pm] coop: so sounds like there are scripting, logging, and reporting improvements we could make to avoid a lot of pain
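For the "2 going in parallel" problem catlee mentions, a conventional guard is an flock()-based lock file, so a second invocation of the script exits immediately instead of running concurrently. A sketch (the lock path is illustrative, and this is not the cloud-tools code):

```python
import fcntl

def acquire_lock(path="/tmp/aws_watch_pending.lock"):
    """Return an open lock-file handle, or None if another copy holds it.

    flock() locks follow the open file description, so a second caller
    (another process, or even a second open() of the same path) gets
    EWOULDBLOCK instead of silently running in parallel.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh  # keep the handle open for the lifetime of the script
```

The caller would exit (or log loudly) when acquire_lock() returns None; the lock is released automatically when the process exits, so a crashed run cannot wedge future ones.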
Comment 9•9 years ago
(In reply to Chris Cooper [:coop] from comment #8)
> [1:09pm] coop: so sounds like there are scripting, logging, and reporting
> improvements we could make to avoid a lot of pain

I've re-purposed bug 1109871 to track these improvements.
Comment 10 (Reporter) • 9 years ago
Things are caught up, so I've reopened trees.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated • 4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard