Closed Bug 1600071 Opened 5 years ago Closed 4 years ago

Reduce AWS over-provisioning

Categories

(Taskcluster :: Services, enhancement)


Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1627769

People

(Reporter: catlee, Unassigned)

References

Details

Attachments

(1 file)

https://datastudio.google.com/u/0/reporting/1bKG2tsH0dfC810wT38LLwB-UH7Fhnc9L/page/tyG6 shows that less than half of our t-win10-64 instances are used.

This means that we are creating many more instances than required, and those instances spend a few minutes booting up, then wait for a few minutes before shutting down.

One idea is to reduce toSpawn by some proportion of runningCapacity, with the rationale that some of our running instances will finish their tasks and become available to run jobs soon.
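For illustration only, that adjustment might look something like the sketch below. This is not the actual worker-manager estimator; the function name and RUNNING_DISCOUNT are invented for this example, while toSpawn and runningCapacity mirror the names used above.

```ts
// Hypothetical sketch of the proposed adjustment (not worker-manager code).
// RUNNING_DISCOUNT is an assumed tuning knob: the fraction of currently
// running workers we expect to finish a task and pick up pending work soon.
const RUNNING_DISCOUNT = 0.5;

function adjustedToSpawn(pendingTasks: number, runningCapacity: number): number {
  const naiveToSpawn = pendingTasks; // current behaviour: roughly 1:1 with pending
  const expectedToFreeUp = Math.floor(runningCapacity * RUNNING_DISCOUNT);
  return Math.max(0, naiveToSpawn - expectedToFreeUp);
}
```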

Irene is out until Monday, so I spoke to bstack about this today. Here's the transcript of our Slack conversation:

coop 4:50 PM catlee is asking about windows idle times again WRT https://bugzilla.mozilla.org/show_bug.cgi?id=1585644
does that fall under the auspices of your worker manager improvements, or should i task owlish with it as a separate investigation under the AWS provider?

bstack 4:51 PM I think it is more of a worker thing if I understand the issue correctly
it depends on how they're tracking the numbers, etc
but this can be included in my wider world if we want. I would probably farm some of the work out to relops once we know what's going on

coop 4:53 PM yeah, my goal here is either to find and fix a provider bug that's causing over-provisioning, or punt to relops

bstack 4:53 PM if we are just seeing the issue with windows workers then it is probably an issue with the workers in a sense
otoh if they can't possibly make the windows workers boot faster then we need to account for it in our provisioning
it isn't a provider issue in either case.
just a worker-manager issue in general or a worker imaging one
aiui the reason the windows idle stuff is so long is that they run puppet on boot or something

coop 4:55 PM catlee was saying that boot time on windows is only 3-4min. not sure how that compares to linux

bstack 4:55 PM I think on linux it is more on the order of 20-30 seconds
but I could be wrong about that

coop 4:56 PM what's the cycle time on subsequent provisioning passes? do we keep track of previous requests between iterations to avoid over-provisioning?
and does switching to spot fleet help with any of this?

bstack 4:57 PM haha, I see catlee has mentioned that to you too
that doesn't exactly work because we only know how many pending tasks there are. we don't know if they're new pending tasks on top of what was claimed by the workers we spun up the last time or if they are still pending from before
we could slow down provisioning in general but then it is a matter of time before we start to hear about how slow we are
plus dustin has found that the pending count that queue provides is wildly inaccurate so that isn't helping I expect
imho this needs to be solved at the queue level and probably the best way to solve it is with postgres there if we're doing that soon
otherwise it needs to be more bookkeeping on queue's part which we can do but is a bit of wasted work if postgres happens right afterward
or maybe there are ways to do this better with the data we have that I haven't thought of yet
it is a decently hard problem though that has a lot of seemingly simple solutions that are often wrong in surprising ways
(imho of course)
I guess we could hack something in where for a workerpool you tell us how long you think the workers will take to claim a task
and then we could not provision more rapidly than that time * 1.5 or something
but much like the old scalingRatio I think that will be a short-term fix that ends up being wrong as time goes on and actually being more unhelpful than helpful
since it would require people keeping track of those numbers and updating things
we could also maybe do something with querying the queue's list of workers to see if they've claimed work yet. that would be a possibility. I'll talk to dustin about it

In my load testing, I have found that the pendingCount will remain high for a long time even when claimWork calls are not returning any new work. Part of this is the 20-second cache we apply in the queue (only making the expensive API calls to Azure every 20s for each taskQueue). But even factoring that in, I found that pending counts remained >>0 for much longer than 20s. I suspect that there's some very eventual eventual-consistency going on there. Azure only promises that the returned number is >= the actual number, so they're within their specified behavior.
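As a rough illustration of the caching described above (a hypothetical wrapper, not the queue's actual implementation; fetchPendingCountFromAzure stands in for the real call):

```ts
// Hypothetical 20-second cache around an expensive pending-count lookup.
const CACHE_TTL_MS = 20 * 1000;
const cache = new Map<string, { value: number; fetchedAt: number }>();

async function pendingCount(
  taskQueueId: string,
  fetchPendingCountFromAzure: (id: string) => Promise<number>,
): Promise<number> {
  const entry = cache.get(taskQueueId);
  if (entry && Date.now() - entry.fetchedAt < CACHE_TTL_MS) {
    // Cached value may be stale, and Azure itself only promises a count
    // that is >= the actual number of pending messages.
    return entry.value;
  }
  const value = await fetchPendingCountFromAzure(taskQueueId);
  cache.set(taskQueueId, { value, fetchedAt: Date.now() });
  return value;
}
```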

It sounds like the current system is making a few assumptions that are overly simplistic:

  • Instances that are provisioned are able to boot up and claim a task before the next time the provider checks pending load
  • No current running instances will finish their current tasks and claim a new pending task

In other words, the provider attempts to spin up instances 1:1 to the size of the pending queue. Is that accurate?

If so, then these are the responsibility of the provider IMO. The specifics of timings will vary by worker type, but it will almost always be the case that currently pending tasks are claimed by existing workers before newly requested instances are able to claim them.

What were the problems with scalingRatio?

Could the provider keep some local state and assume that new instances will take at least N minutes to claim a new task, and reduce the number of new instances it requests by the number of instances requested recently? IMO it's better to err slightly on the side of under-provisioning.
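To make that concrete, here is a sketch of the kind of bookkeeping I have in mind. None of these names exist in worker-manager; CLAIM_GRACE_MS is the assumed "at least N minutes" from the question.

```ts
// Hypothetical bookkeeping: remember how many instances were requested for
// each worker pool within the last CLAIM_GRACE_MS and subtract them from the
// next request, erring on the side of under-provisioning.
const CLAIM_GRACE_MS = 5 * 60 * 1000; // assume new instances need ~5 min to claim

interface RecentRequest { count: number; at: number; }
const recentRequests = new Map<string, RecentRequest[]>();

function instancesToRequest(workerPoolId: string, pendingTasks: number): number {
  const now = Date.now();
  const recent = (recentRequests.get(workerPoolId) ?? [])
    .filter(r => now - r.at < CLAIM_GRACE_MS);
  const alreadyRequested = recent.reduce((sum, r) => sum + r.count, 0);
  const toRequest = Math.max(0, pendingTasks - alreadyRequested);
  if (toRequest > 0) {
    recent.push({ count: toRequest, at: now });
  }
  recentRequests.set(workerPoolId, recent);
  return toRequest;
}
```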

I don't think provisioning is as big an immediate concern as not properly terminating workers that have ceased to be useful.

In an ideal world, workers are meant to shut down after 5 minutes of idle. If they are consistently busy, workers shut down after 72 hours and are hard-killed after 96 hours. I've cc-ed both Wander and Pete to provide clarification on my understanding here.
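For reference, the intended lifetime policy as I understand it would look roughly like this. The thresholds come from the paragraph above; the function itself is only an illustration, not docker-worker or generic-worker code.

```ts
// Sketch of the intended worker lifetime policy (illustrative only).
const IDLE_SHUTDOWN_MS = 5 * 60 * 1000;        // shut down after 5 min idle
const MAX_LIFETIME_MS  = 72 * 60 * 60 * 1000;  // graceful shutdown after 72 h
const HARD_KILL_MS     = 96 * 60 * 60 * 1000;  // hard kill after 96 h

function lifetimeAction(
  bootedAt: number,
  lastTaskFinishedAt: number,
  now: number = Date.now(),
): "keep-running" | "graceful-shutdown" | "hard-kill" {
  if (now - bootedAt > HARD_KILL_MS) return "hard-kill";
  if (now - bootedAt > MAX_LIFETIME_MS) return "graceful-shutdown";
  if (now - lastTaskFinishedAt > IDLE_SHUTDOWN_MS) return "graceful-shutdown";
  return "keep-running";
}
```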

However, something is not working correctly here. A quick search in the AWS console reveals many instances still running after more than a week:

e.g. us-east-1

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:launchTime=%3C2020-02-14T00:00-05:00;instanceState=running;sort=launchTime

comm-3/b-win2012: 3
comm-t/t-win7-32: 4
gecko-1/b-win2012: 5
gecko-3/b-win2012: 4
gecko-t/t-linux-large: 28
gecko-t/t-linux-xlarge: 6
gecko-t/t-win7-32: 98
gecko-t/t-win7-32-gpu: 134
gecko-t/t-win10-64: 45
l10n-3/linux: 1

Total: 328

Similar queries for other regions:
us-west-1:
https://us-west-1.console.aws.amazon.com/ec2/v2/home?region=us-west-1#Instances:launchTime=%3C2020-02-14T00:00-05:00;instanceState=running;sort=tag:Name
Total: 85

us-west-2:
https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:launchTime=%3C2020-02-14T00:00-05:00;instanceState=running;sort=tag:Name
Total: 125

eu-central-1:
https://eu-central-1.console.aws.amazon.com/ec2/v2/home?region=eu-central-1#Instances:launchTime=%3C2020-02-14T00:00-05:00;instanceState=running;sort=tag:Name
Total: 36

That's 574 instances and associated workers running longer than they should, perhaps indefinitely. I've tasked :bstack with figuring out how these workers get into that state and then figuring out how to make sure worker lifetimes are enforced.
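For anyone without console access, roughly the same query can be run with the AWS SDK. This is an illustrative sketch assuming the AWS SDK for JavaScript v3 and working credentials; it filters on launch time client-side rather than via the console's launchTime filter.

```ts
import { EC2Client, paginateDescribeInstances } from "@aws-sdk/client-ec2";

// List running instances launched more than maxAgeDays ago in one region.
async function longRunningInstances(region: string, maxAgeDays = 7): Promise<string[]> {
  const client = new EC2Client({ region });
  const cutoff = Date.now() - maxAgeDays * 24 * 60 * 60 * 1000;
  const stale: string[] = [];

  const pages = paginateDescribeInstances(
    { client },
    { Filters: [{ Name: "instance-state-name", Values: ["running"] }] },
  );
  for await (const page of pages) {
    for (const reservation of page.Reservations ?? []) {
      for (const instance of reservation.Instances ?? []) {
        if (instance.LaunchTime && instance.LaunchTime.getTime() < cutoff) {
          stale.push(instance.InstanceId ?? "unknown");
        }
      }
    }
  }
  return stale;
}
```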

Assignee: nobody → bstack
Flags: needinfo?(wcosta)
Flags: needinfo?(pmoore)

Hmm, I am investigating this issue on bare-metal instances. I see the workerShutdown event in the log, but the instances are still running. I thought this was exclusive to bare metal, but it seems it is not even exclusive to docker-worker images.

Flags: needinfo?(wcosta)

This is to investigate why some instances don't shut down. Note that ssh is not enabled on the instances by default; we need to enable the ssh security group for the instance in the EC2 console first.

Today I could see some metal instances still running past the expiration period. From Papertrail, I can see the shutdown started:

Feb 23 22:52:10 ip-10-145-34-202 docker-worker {"type":"shutdown","source":"top","provisionerId":"gecko-t","workerId":"i-034d4e740481f21cf","workerGroup":"aws","workerType":"t-linux-large","workerNodeType":"m5.large"}
Feb 23 22:52:10 ip-10-145-34-202 docker-worker WORKER_METRICS {"eventType":"instanceShutdown","worker":"docker-worker","workerPoolId":"gecko-t/t-linux-large","workerId":"i-034d4e740481f21cf","timestamp":1582509130,"region":"us-east-1","instanceType":"m5.large"}
Feb 23 22:52:10 ip-10-145-34-202 rsyslogd [origin software="rsyslogd" swVersion="7.4.4" x-pid="1035" x-info="http://www.rsyslog.com"] exiting on signal 15.

But it stalled for some reason. I tried to connect to it through ssh, but sshd was already down.

I'm not sure I have access to these workers. Indeed worker manager should enforce that cloud workers live a maximum of 96 hours as a fallback.

From the worker perspective, the fact they live so long suggests that something isn't quite right. Rob, any ideas?
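To illustrate the kind of fallback being suggested, a worker-manager-side scan might look roughly like this. Worker and terminateInstance are made-up stand-ins here, not worker-manager internals.

```ts
// Hypothetical fallback: terminate any tracked worker older than 96 h,
// regardless of what the worker itself reports.
const MAX_WORKER_AGE_MS = 96 * 60 * 60 * 1000;

interface Worker { workerId: string; created: Date; state: string; }

async function enforceMaxLifetime(
  workers: Worker[],
  terminateInstance: (workerId: string) => Promise<void>,
): Promise<void> {
  const now = Date.now();
  for (const worker of workers) {
    if (worker.state !== "stopped" && now - worker.created.getTime() > MAX_WORKER_AGE_MS) {
      await terminateInstance(worker.workerId); // don't trust the worker to exit on its own
    }
  }
}
```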

Flags: needinfo?(pmoore) → needinfo?(rthijssen)

(In reply to Pete Moore [:pmoore][:pete] from comment #8)

From the worker perspective, the fact they live so long suggests that something isn't quite right. Rob, any ideas?

if an instance is properly borked, and we are relying on a process on that same instance to shut it down, i suspect we may wait a long time for satisfaction.

dustin once wrote a beautifully named scheduled task to target and terminate unproductive instances. i suggest we revive it or something similar. see bug 1509892 and bug 1435635.

Flags: needinfo?(rthijssen)
Blocks: 1625317

not working on this currently. unassigning myself for now and will pick it back up when I get around to this again

Assignee: bstack → nobody

:wcosta do you still need ssh access to the workers?

Flags: needinfo?(wcosta)

(In reply to Tom Prince [:tomprince] from comment #11)

:wcosta do you still need ssh access to the workers?

No, now with staging environment, I can debug problems there.

Flags: needinfo?(wcosta)

I think this ended up being a dupe of bug 1627769. :bstack has a fix in progress.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → DUPLICATE
