Closed Bug 1143681 Opened 9 years ago Closed 8 years ago

Some AWS test slaves not being recycled as expected

Categories

(Release Engineering :: General, defect)

Platform: x86_64 Linux
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: coop, Unassigned)

Attachments

(1 file)

For the last few weeks, we've had a bunch (>80) of tst-linux64-spot nodes that are not being recycled properly, despite often-high numbers (>1000) of pending jobs.

As an example, the instance that has been in this state the longest, tst-linux64-spot-233, has no status info when I try to look it up on aws-manager2, and the AWS console reports nothing about this instance either.

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:search=tst-linux64-spot-2;sort=tag:Name
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tst-linux64-spot&name=tst-linux64-spot-233

Attached is the list of instances that are in this state, in case there is some pattern. (The URL above provides the same data.)

We should figure out what is preventing these instances from being recycled properly and find a way to make sure it happens automatically.
This may be similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1141339#c20, where we had instances without the recycling script installed.

We should probably install that script on all machines and make it exit 0 on non-AWS instances.

The script lives here: http://hg.mozilla.org/build/puppet/file/1cc84a9642ee/modules/runner/files/check_ami.py
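
To make that safe to deploy everywhere, the script could bail out early when it isn't running on AWS. A minimal sketch of such a guard, assuming the standard EC2 instance metadata endpoint is a good-enough AWS detector (this guard is not part of check_ami.py today):

    import socket
    import sys

    def running_on_aws(timeout=2):
        # The EC2 instance metadata service is only reachable from inside
        # AWS, so a failed connection means this is not an AWS instance.
        try:
            conn = socket.create_connection(("169.254.169.254", 80), timeout)
            conn.close()
            return True
        except socket.error:
            return False

    if not running_on_aws():
        sys.exit(0)  # non-AWS machine: nothing to check, exit cleanly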
If you search for the hostname in Spot Requests in the AWS console, you can click on the instance ID to get more details about the state. For all the instances I've checked so far, the tags have been empty (including Name), which is why you can't find them in the instance list by name.
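
For scripting that lookup instead of clicking through the console, a rough boto3 sketch (hedged: it assumes the spot requests, unlike the instances themselves, still carry a Name tag; the hostname and region are just examples):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_spot_instance_requests(
        Filters=[{"Name": "tag:Name", "Values": ["tst-linux64-spot-233"]}]
    )
    for req in resp["SpotInstanceRequests"]:
        # InstanceId ties the request back to the (untagged) instance.
        print(req["SpotInstanceRequestId"], req.get("InstanceId"), req["State"])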

The states of the individual instances have been a mix of starting up and shutting down. Here's a sampling:

tst-linux64-spot-233  use1 i-69c0d065 shutting-down
tst-linux64-spot-1456 usw2 i-69c0d065 pending
tst-linux64-spot-783  usw2 i-97c0d09b pending
tst-linux64-spot-379  usw2 i-6dc0d061 pending
tst-linux64-spot-1166 usw2 i-95c0d099 pending

I tried terminating tst-linux64-spot-1166 just to make sure I could. It worked.

Knowing this, I can sort the console listing to batch terminate these instances by hand.

However, is there a way to add a check or timeout to cloud-tools for instances that spend undue time in the pending or shutting-down state?
If the instances have a launch_time attribute (sometimes it disappears), we can use it. Pseudo-code would look like this:

if i.launch_time and now - i.launch_time > timedelta(days=2):
  i.terminate()
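
Fleshed out into a sweep that cloud-tools could run periodically, a minimal sketch (assuming boto3; the region list, the two-day threshold, and the reap_stuck_instances name are illustrative, not anything that exists in cloud-tools today):

    from datetime import datetime, timedelta, timezone
    import boto3

    MAX_AGE = timedelta(days=2)

    def reap_stuck_instances(region):
        ec2 = boto3.resource("ec2", region_name=region)
        # Only look at instances stuck entering or leaving service.
        stuck = ec2.instances.filter(Filters=[
            {"Name": "instance-state-name",
             "Values": ["pending", "shutting-down"]},
        ])
        now = datetime.now(timezone.utc)
        for i in stuck:
            # launch_time sometimes disappears on these broken instances,
            # so guard against it being missing.
            if i.launch_time and now - i.launch_time > MAX_AGE:
                print("terminating %s (%s since %s)"
                      % (i.id, i.state["Name"], i.launch_time))
                i.terminate()

    for region in ("us-east-1", "us-west-2"):
        reap_stuck_instances(region)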

Sometimes these instances are indestructible; I had to open an AWS ticket last month to kill one of those.
(In reply to Rail Aliiev [:rail] from comment #3)
> Sometimes these instances are indestructible; I had to open an AWS ticket
> last month to kill one of those.

I was able to terminate all the pending ones (which is promising), but I couldn't affect the state of those listed as shutting-down. I opened a support case for those: https://console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=1359595871&language=en
(In reply to Chris Cooper [:coop] from comment #4) 
> I was able to terminate all the pending ones (which is promising), but I
> couldn't affect the state of those listed as shutting-down. I opened a
> support case for those:
> https://console.aws.amazon.com/support/home?region=us-east-1#/case/
> ?displayId=1359595871&language=en

Support case was resolved this morning.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General