Closed Bug 1143681 Opened 9 years ago Closed 8 years ago

Some AWS test slaves not being recycled as expected

Categories

(Release Engineering :: General, defect)

Platform: x86_64 Linux
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: coop, Unassigned)

Attachments

(1 file)

For the last few weeks, we've had a bunch (>80) of tst-linux64-spot nodes that are not being recycled properly, despite often-high numbers (>1000) of pending jobs.

As an example, the instance that has been in this state the longest, tst-linux64-spot-233, has no status info when I try to look it up on aws-manager2, and the AWS console reports nothing about this instance either.

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:search=tst-linux64-spot-2;sort=tag:Name
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tst-linux64-spot&name=tst-linux64-spot-233

Attached is the list of instances that are in this state, in case there is some pattern. (The URL above provides the same data.)

We should figure out what is preventing these instances from being recycled properly and find a way to make sure it happens automatically.
This may be similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1141339#c20, where we had instances without the recycling script installed.

We should probably install that script on all machines and make it exit 0 on non-AWS instances.

The script lives here: http://hg.mozilla.org/build/puppet/file/1cc84a9642ee/modules/runner/files/check_ami.py
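
To make that safe to deploy everywhere, the script could bail out early when it isn't running on AWS. A minimal sketch of such a guard, assuming the standard EC2 instance metadata endpoint is a good-enough AWS detector (this guard is not part of check_ami.py today):

    import socket
    import sys

    def running_on_aws(timeout=2):
        # The EC2 instance metadata service is only reachable from inside
        # AWS, so a failed connection means this is not an AWS instance.
        try:
            conn = socket.create_connection(("169.254.169.254", 80), timeout)
            conn.close()
            return True
        except socket.error:
            return False

    if not running_on_aws():
        sys.exit(0)  # non-AWS machine: nothing to check, exit cleanly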
If you search for the hostname in Spot Requests in the AWS console, you can click on the instance ID to get more details about the state. For all the instances I've checked so far, the tags have been empty (including Name), which is why you can't find them in the instance list by name.
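
For scripting that lookup instead of clicking through the console, a rough boto3 sketch (hedged: it assumes the spot requests, unlike the instances themselves, still carry a Name tag; the hostname and region are just examples):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_spot_instance_requests(
        Filters=[{"Name": "tag:Name", "Values": ["tst-linux64-spot-233"]}]
    )
    for req in resp["SpotInstanceRequests"]:
        # InstanceId ties the request back to the (untagged) instance.
        print(req["SpotInstanceRequestId"], req.get("InstanceId"), req["State"])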

The states of the individual instances have been a mix of starting up and shutting down. Here's a sampling:

tst-linux64-spot-233  use1 i-69c0d065 shutting-down
tst-linux64-spot-1456 usw2 i-69c0d065 pending
tst-linux64-spot-783  usw2 i-97c0d09b pending
tst-linux64-spot-379  usw2 i-6dc0d061 pending
tst-linux64-spot-1166 usw2 i-95c0d099 pending

I tried terminating tst-linux64-spot-1166 just to make sure I could. It worked.

Knowing this, I can sort the console listing to batch terminate these instances by hand.

However, is there a way to add a check or timeout to cloud-tools for instances that spend undue time in the pending or shutting-down state?
If the instances have a launch_time attribute (sometimes it disappears), we can use it. Pseudo-code would look like this:

if i.launch_time and now - i.launch_time > timedelta(days=2):
  i.terminate()
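
Fleshed out into a sweep that cloud-tools could run periodically, a minimal sketch (assuming boto3; the region list, the two-day threshold, and the reap_stuck_instances name are illustrative, not anything that exists in cloud-tools today):

    from datetime import datetime, timedelta, timezone
    import boto3

    MAX_AGE = timedelta(days=2)

    def reap_stuck_instances(region):
        ec2 = boto3.resource("ec2", region_name=region)
        # Only look at instances stuck entering or leaving service.
        stuck = ec2.instances.filter(Filters=[
            {"Name": "instance-state-name",
             "Values": ["pending", "shutting-down"]},
        ])
        now = datetime.now(timezone.utc)
        for i in stuck:
            # launch_time sometimes disappears on these broken instances,
            # so guard against it being missing.
            if i.launch_time and now - i.launch_time > MAX_AGE:
                print("terminating %s (%s since %s)"
                      % (i.id, i.state["Name"], i.launch_time))
                i.terminate()

    for region in ("us-east-1", "us-west-2"):
        reap_stuck_instances(region)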

Sometimes these instances are indestructible; I had to open an AWS ticket last month to kill one of those.
(In reply to Rail Aliiev [:rail] from comment #3)
> Sometimes these instances are indestructible; I had to open an AWS ticket
> last month to kill one of those.

I was able to terminate all the pending ones (which is promising), but I couldn't affect the state of those listed as shutting-down. I opened a support case for those: https://console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=1359595871&language=en
(In reply to Chris Cooper [:coop] from comment #4) 
> I was able to terminate all the pending ones (which is promising), but I
> couldn't affect the state of those listed as shutting-down. I opened a
> support case for those:
> https://console.aws.amazon.com/support/home?region=us-east-1#/case/
> ?displayId=1359595871&language=en

Support case was resolved this morning.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General