Closed Bug 1298010 Opened 8 years ago Closed 8 years ago

update Generic Worker to terminate host instance if the worker type definition has changed since the instance was started

Categories

(Taskcluster :: Workers, defect)

Platform: Unspecified
OS: Windows
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: pmoore)

References

Details

Attachments

(1 file)

If the worker type definition has changed since an instance was created, the instance should complete the job in progress and then shut down/terminate.
The proposed solution we discussed in #taskcluster was this:

1. Implement an unauthenticated API endpoint that exposes the lastModified of a workerType.
2. In the worker process, poll that endpoint infrequently and compare the value to the one provided by the provisioner in the UserData service.
3. If the times differ, we can assume that the worker type definition has changed and the worker should do this graceful restart (see the sketch just below).
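
For illustration only, here is a minimal sketch of that polling loop in Go. fetchLastModified is a hypothetical helper standing in for the proposed (not yet existing) endpoint, and the startedWith value is a placeholder for whatever the provisioner would hand over via UserData; none of these names are real APIs.

package main

import (
	"log"
	"time"
)

// fetchLastModified is a hypothetical helper that would call the proposed
// unauthenticated endpoint and return the workerType's lastModified time.
func fetchLastModified(workerType string) (time.Time, error) {
	return time.Time{}, nil // placeholder
}

func main() {
	workerType := "gecko-t-win10-64"
	// Placeholder for the lastModified value the provisioner handed over
	// via the UserData service when this instance was created.
	startedWith := time.Now()

	for {
		time.Sleep(30 * time.Minute) // poll infrequently
		latest, err := fetchLastModified(workerType)
		if err != nil {
			log.Printf("could not check lastModified: %v", err)
			continue
		}
		if !latest.Equal(startedWith) {
			log.Print("worker type definition changed: finish current task, then shut down")
			break
		}
	}
}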

There is another option involving Pulse, which would require a reliable Pulse connection and would also require listening to the AMQP server.

I think for the polling, there should be a soft and a hard deadline for this system. The soft kill should be "complete this job", and the hard kill should be however long your longest job is + 1/2 hours. That way you can have wall-clock guarantees about how long after a worker type is modified it is fully deployed.

PR incoming for the provisioner changes required
One exception to the terminate-if-changed rule might be if the maximum number of instances has increased. Somebody may have increased the max limit, for example in order to increase load capacity, and in that case terminating instances might be unhelpful.
Yeah, maybe also price changes, or regions getting added.

The decision about whether to terminate could end up being quite a tricky one (e.g. let's say a region gets removed - that might mean an instance should be terminated if it is in that region, but not if it is in a different region). Another tricky case is price changes - should we terminate? Can we determine what price we are currently paying, and whether it is over the maxPrice?

I think we should aim to keep it as simple as possible, and keep any business logic for making that decision on the side of the provisioner that understands the worker type definitions.

I wonder if instead we should have a different endpoint that a worker calls to simply ask "should I die?" and the provisioner replies "yes, DIE!!!" or "naaaaahhh, I'll let you live a little longer".

The max instances increase use case is a good one! Also, I think it would be wrong to say "don't bump the lastModified if maxInstances changes", since it is important we monitor when these things change. That's why I'm thinking that "lastModified" might not be the best signal for deciding whether an instance should terminate; instead we should have a separate data point to track that.
+1 I really like the should-I-die endpoint idea. I think the sheriffs would too.
Assignee: nobody → pmoore
To avoid needing an extra endpoint, and to keep the decision manual for the moment, how about we add something like "rolloutRevision" to the config in the worker type definition?

It would start with:

  "rolloutRevision": 1

Then each time we actively decide we want all workers to roll over to the new version, we increment it.

Then, between tasks, or after a given period of inactivity (15 mins?), each worker will query the latest revision to see if it matches its own. If not, it terminates.

The only tricky piece at the moment is that the worker probably does not have the scopes to query the config (currently the provisioner hands the config over on instance startup).
The problem is that for the provisioner to know whether an instance should die, we'd need a bunch more information from the queue to be able to know for sure. We'd also need to track a lot more state with the instance than we currently do. I think the best approach to the killing/shutdown of running workers is to keep that information in the worker itself.

If I remember correctly, the approach we thought we'd be taking is that any change to the worker type definition would cause workers to stop accepting new jobs and shut down when finished with the current one. I still think this is the right approach. It does mean that on the odd deployment we'll get a few wasted billing cycles, but it avoids a huge amount of complexity.

I think that the rollout revision idea is a good one. Exposing a field like that, unprotected, for all worker types would be trivial and fine to do. It doesn't strictly need to be an incrementing number, since it is a boolean comparison of equality.

What if we didn't set this in the schema, but rather tracked it internally and created a new option "?rollout=yes" to the endpoint /worker-type/:workerType/update, which updated a UUID that was stored in the azure entity and exposed through a /worker-type/:workerType/deployment-id endpoint.  Doing it this way would lower the chance of the field being used incorrectly, and would allow for documentation in the API docs.
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #6)
> What if we didn't set this in the schema, but rather tracked it internally
> and created a new option "?rollout=yes" to the endpoint
> /worker-type/:workerType/update, which updated a UUID that was stored in the
> azure entity and exposed through a /worker-type/:workerType/deployment-id
> endpoint.  Doing it this way would lower the chance of the field being used
> incorrectly, and would allow for documentation in the API docs.

I like it!

So, if I understand correctly: when updating the worker type, you make a choice about whether to gracefully terminate existing instances via a query string parameter in the updateWorkerType request. If it is set, the update will change the value returned by the new deployment-id endpoint. Workers would also be given the UUID of the deployment they originated from, and they could compare the value from the endpoint against this initial value. If it is different, they terminate.

A couple of questions:

1) Am I right to think the worker would get its own UUID by a mechanism other than querying the endpoint on startup? That would avoid the race condition where the worker comes from one deployment, but further deployments happen between the spot instance request being raised and the worker starting to run. In other words, it avoids the worker seeing a value from the endpoint that has moved on since it was created, and concluding that it is itself from a later deployment than it really is. Would we pass this in with the worker type's secrets?

2) Would we also need to update the provisioner page on tools.taskcluster.net to enable setting this optional query string parameter?

Maybe, instead of combining it with the updateWorkerType endpoint, we could have a separate endpoint to signal that any workers not on the current UUID should terminate themselves? I'm wondering if this might reduce the impact on the tools site, yet still have the benefit of being a documented API endpoint that could also be called some time after rolling out a worker type change.
Flags: needinfo?(jhford)
To keep things simple for now, I've granted permission to the worker types that use generic-worker to access their worker type definitions. This means that, between tasks, if running under the AWS provisioner, they will query the worker type definition to check the latest config. It will contain a deploymentId property, which they can compare against their own.
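
For illustration only, here is a minimal sketch of that between-tasks check. fetchWorkerTypeDeploymentId is a hypothetical helper that would query the provisioner for the worker type definition (using the scope granted by the script below); the real generic-worker code differs in detail.

package main

import "log"

// fetchWorkerTypeDeploymentId is a hypothetical helper that would fetch the
// latest worker type definition from the AWS provisioner and return the
// deploymentId property from its config.
func fetchWorkerTypeDeploymentId(workerType string) (string, error) {
	return "", nil // placeholder
}

// shouldTerminate reports whether this worker's deploymentId no longer
// matches the one in the current worker type definition.
func shouldTerminate(workerType, myDeploymentId string) bool {
	latest, err := fetchWorkerTypeDeploymentId(workerType)
	if err != nil {
		log.Printf("could not check deploymentId: %v", err)
		return false // play it safe: keep running if the check fails
	}
	if latest != myDeploymentId {
		log.Printf("New deploymentId %q (was %q) - shutting down after the current task", latest, myDeploymentId)
		return true
	}
	return false
}

func main() {
	// Between tasks, the worker would call something like:
	if shouldTerminate("ami-test-win2012r2", "first deployment") {
		log.Print("terminating host instance")
	}
}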

Granting permission was done with this script:



package main

import (
	"log"
	"os"

	tcclient "github.com/taskcluster/taskcluster-client-go"
	"github.com/taskcluster/taskcluster-client-go/auth"
)

func main() {
	workerTypes := []string{
		"ami-test-win2012r2",
		"gecko-t-win10-64",
		"gecko-t-win10-64-gpu",
		"gecko-t-win7-32",
		"gecko-t-win7-32-gpu",
		"gecko-1-b-win2012",
		"gecko-2-b-win2012",
		"gecko-3-b-win2012",
		"win2012r2",
		"win2012r2-mediatests",
		"win2012r2_gw-test",
	}

	myAuth := auth.New(
		&tcclient.Credentials{
			ClientID:    os.Getenv("TASKCLUSTER_CLIENT_ID"),
			AccessToken: os.Getenv("TASKCLUSTER_ACCESS_TOKEN"),
			Certificate: os.Getenv("TASKCLUSTER_CERTIFICATE"),
		},
	)

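	// Create a role for each worker type, granting it the scope to view its
	// own worker type definition in the AWS provisioner.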
	for _, wt := range workerTypes {
		role := &auth.CreateRoleRequest{
			Description: "This role was auto-generated by pmoore on 28 October 2016 to allow generic worker to access its own config, if its worker type is updated. It queries the provisioner to get the latest worker type definition, to establish if a new deployment is required, by checking the deploymentId field, and seeing if that is different to its current deploymentId. If it is, it will shut down.",
			Scopes: []string{
				"aws-provisioner:view-worker-type:" + wt,
			},
		}
		roleName := "worker-type:aws-provisioner-v1/" + wt
		resp, err := myAuth.CreateRole(roleName, role)
		if err != nil {
			panic(err)
		}
		log.Printf("Created role %v on %v", roleName, resp.Created)
	}

	log.Print("All done.")
}
This resulted in:

2016/10/28 17:36:22 Created role worker-type:aws-provisioner-v1/ami-test-win2012r2 on 2016-10-28T15:36:22.195Z
2016/10/28 17:36:23 Created role worker-type:aws-provisioner-v1/gecko-t-win10-64 on 2016-10-28T15:36:23.077Z
2016/10/28 17:36:23 Created role worker-type:aws-provisioner-v1/gecko-t-win10-64-gpu on 2016-10-28T15:36:23.600Z
2016/10/28 17:36:24 Created role worker-type:aws-provisioner-v1/gecko-t-win7-32 on 2016-10-28T15:36:24.019Z
2016/10/28 17:36:25 Created role worker-type:aws-provisioner-v1/gecko-t-win7-32-gpu on 2016-10-28T15:36:24.437Z
2016/10/28 17:36:25 Created role worker-type:aws-provisioner-v1/gecko-1-b-win2012 on 2016-10-28T15:36:25.224Z
2016/10/28 17:36:26 Created role worker-type:aws-provisioner-v1/gecko-2-b-win2012 on 2016-10-28T15:36:25.623Z
2016/10/28 17:36:26 Created role worker-type:aws-provisioner-v1/gecko-3-b-win2012 on 2016-10-28T15:36:26.151Z
2016/10/28 17:36:26 Created role worker-type:aws-provisioner-v1/win2012r2 on 2016-10-28T15:36:26.547Z
2016/10/28 17:36:27 Created role worker-type:aws-provisioner-v1/win2012r2-mediatests on 2016-10-28T15:36:27.075Z
2016/10/28 17:36:27 Created role worker-type:aws-provisioner-v1/win2012r2_gw-test on 2016-10-28T15:36:27.488Z
2016/10/28 17:36:27 All done.
I've rolled out release 6.1.0alpha1 to worker type ami-test-win2012r2 and have triggered a job, which is running.

I've then updated the deploymentId of the worker type, so within 30 mins, we should see in the logs that the worker spots a new deploymentId, and shuts down.

This worker type isn't currently logging to Papertrail, so I've ssh'd onto the machine and am tailing the logs in a terminal - so I should catch the log entries before it shuts down.

The arbitrary task it is running (which is designed to fail in less than 30 mins) is https://tools.taskcluster.net/task-inspector/#DdtqIRH3SxO3JnSQKpGUFA/0
So it really did terminate after 30 mins!

Unfortunately for me, the machine was shut down so quickly, my tail command didn't catch the pertinent lines of the log before the ssh session dropped:

2016/10/28 20:12:04 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 20:12:04 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 20:12:04 No task claimed...
2016/10/28 20:12:05 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 20:12:05 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 20:12:05 No task claimed...
packet_write_wait: Connection to 54.67.61.92: Broken pipe

I'm running a second test now where I won't update the deploymentId, to see whether this time it doesn't shut down the machine after 30 mins. If this test passes as well, I'd say we are good to release 6.1.0. It seems a bit scary, though, that it worked first time!
Frustratingly, I only realised an hour into my next test that when I'd updated the "deploymentId" in the worker type config, my browser had a cached version of the worker type config, and I had reverted the AMI IDs. So now I have to run again!

Anyways, here is the second arbitrary, designed-to-fail task that will cause a worker to get spawned:
https://tools.taskcluster.net/task-inspector/#LnwfHZZ6SO259KfW9qX9Rg/0

When the task is claimed, I'll ssh onto the worker as before and tail the log, to check that after 30 mins it checks for a new config and determines that this time the deploymentId hasn't been updated...
Wow, it really worked!

2016/10/28 22:14:42 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 22:14:42 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 22:14:42 No task claimed...
2016/10/28 22:14:43 Checking if there is a new deploymentId...
2016/10/28 22:14:44 No change to deploymentId - "second deployment" == "second deployment"
2016/10/28 22:14:44 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 22:14:44 Zero tasks returned in Azure XML QueueMessagesList
2016/10/28 22:14:44 No task claimed...
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/9ba431b5fb3043735be989e67d04ea5bb6705e4a
Bug 1298010 - shut down worker if deploymentId updated in worker type config

https://github.com/taskcluster/generic-worker/commit/b662c514cc6b954665451c00d39040b0423bbe06
Merge pull request #27 from taskcluster/bug1298010

Bug 1298010 - shut down worker type if deploymentId has updated in worker type config
Releasing 6.1.0 now with the new feature!
Attachment #8805757 - Flags: review?(rthijssen)
Attachment #8805757 - Flags: review?(rthijssen) → review+
Can this be closed?
Looks good. It appears that all workers currently running are from the latest AMIs, which suggests that the patch has worked. I'll keep an eye on it over the coming days and reopen if necessary.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Flags: needinfo?(jhford)
Blocks: 1293783
Component: Generic-Worker → Workers