Closed Bug 1588834 Opened 5 years ago Closed 5 years ago

Add support for aws-provider to generic-worker.

Categories

(Taskcluster :: Workers, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tomprince, Assigned: pmoore)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

Generic-worker has support for running under aws-provisioner and worker-manager/gcp-provider, but not worker-manager/aws-provider. This would be solved by Bug 1558532, but reading the requirements there, it seems like it may be a large amount of work to support that. In light of needing to migrate to aws-provider before Nov 9, adding the support to generic-worker directly seems like a good interim solution.

Agreed, that is certainly the shorter path.

FWIW, I see from the taskcluster-worker-runner implementation that we can tell whether we are running under AWS Provider or AWS Provisioner based on the properties in the user data. They use completely different property names.

AWS Provisioner:

  • data
  • workerType
  • provisionerId
  • region
  • taskclusterRootUrl
  • securityToken
  • capacity

AWS Provider:

  • workerPoolId
  • providerId
  • rootUrl
  • workerGroup

So it is pretty straightforward to determine at runtime which service spawned the worker, without needing to add additional command line options to generic-worker (we can reuse the existing --configure-for-aws option to handle both).
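
For illustration, here is a minimal Go sketch of that runtime check. The JSON property names come from the two lists above; the type and function names (and the plain HTTP fetch of the user data) are hypothetical and not taken from generic-worker's actual code:

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
)

// userData holds only the fields needed to tell the two services apart.
// The JSON property names are taken from the lists above; everything else
// in this sketch is made up.
type userData struct {
	WorkerPoolID  string `json:"workerPoolId"`  // present under AWS Provider
	ProvisionerID string `json:"provisionerId"` // present under AWS Provisioner
	WorkerType    string `json:"workerType"`    // present under AWS Provisioner
}

// detectProvisioningService fetches the EC2 user data and reports which
// service spawned the instance, based on which properties are present.
func detectProvisioningService() (string, error) {
	resp, err := http.Get("http://169.254.169.254/latest/user-data")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	var ud userData
	if err := json.Unmarshal(body, &ud); err != nil {
		return "", err
	}
	switch {
	case ud.WorkerPoolID != "":
		return "aws-provider", nil
	case ud.ProvisionerID != "" || ud.WorkerType != "":
		return "aws-provisioner", nil
	default:
		return "unknown", nil
	}
}

func main() {
	svc, err := detectProvisioningService()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("worker spawned by: %s", svc)
}
```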

From :dustin in slack:

one thing to check when implementing this is that TASKCLUSTER_WORKER_LOCATION is set correctly, otherwise sccache won't work
and be really sure it's the same format as implemented in worker runner, or we'll be sadfaces later :slightly_smiling_face:
otherwise, I think the registerWorker code that landed in gcp.go can be copy/pasta'd to aws.go
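
As a rough illustration of those two points, here is a hedged sketch: the TASKCLUSTER_WORKER_LOCATION shape shown (a JSON object with cloud, region and availabilityZone) is an assumption about what worker-runner emits for AWS and should be verified against worker-runner itself; the payload fields follow the worker-manager registerWorker API; and all Go names and placeholder values are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// workerLocation mirrors the assumed TASKCLUSTER_WORKER_LOCATION value for
// AWS workers (assumption: a JSON object with cloud, region and
// availabilityZone); check worker-runner for the authoritative format.
type workerLocation struct {
	Cloud            string `json:"cloud"`
	Region           string `json:"region"`
	AvailabilityZone string `json:"availabilityZone"`
}

// registerWorkerRequest carries the fields the worker-manager registerWorker
// endpoint expects; for AWS the identity proof is built from the signed EC2
// instance identity document.
type registerWorkerRequest struct {
	WorkerPoolID        string          `json:"workerPoolId"`
	ProviderID          string          `json:"providerId"`
	WorkerGroup         string          `json:"workerGroup"`
	WorkerID            string          `json:"workerId"`
	WorkerIdentityProof json.RawMessage `json:"workerIdentityProof"`
}

func main() {
	// Placeholder values; in generic-worker these would come from the EC2
	// metadata service and the user data shown earlier in this bug.
	loc, _ := json.Marshal(workerLocation{
		Cloud:            "aws",
		Region:           "us-east-1",
		AvailabilityZone: "us-east-1a",
	})
	os.Setenv("TASKCLUSTER_WORKER_LOCATION", string(loc))

	req := registerWorkerRequest{
		WorkerPoolID: "gecko-t/t-win10-64",
		ProviderID:   "aws",
		WorkerGroup:  "us-east-1",
		WorkerID:     "i-0123456789abcdef0",
		// For AWS this would contain the instance identity document and its
		// signature; left empty in this sketch.
		WorkerIdentityProof: json.RawMessage(`{}`),
	}
	payload, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(payload)) // submitted via worker-manager's registerWorker (call elided)
}
```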

(In reply to Tom Prince [:tomprince] from comment #0)

Generic-worker has support for running under aws-provisioner and worker-manager/gcp-provider, but not worker-manager/aws-provider. This would be solved by Bug 1558532, but reading the requirements there, it seems like it may be a large amount of work to support that. In light of needing to migrate to aws-provider before Nov 9, adding the support to generic-worker directly seems like a good interim solution.

I think all generic-worker worker types that will be migrated to the community cluster and currently run in AWS can already run in Google Cloud, so maybe this isn't a blocker for the migration.

Tom, which worker types were you thinking of?

Flags: needinfo?(mozilla)

This is for all the stuff managed by OCC for firefox-ci.

Flags: needinfo?(mozilla)
Depends on: 1518507

Agreed, it would be good to have an AWS fallback to mitigate any of the following potential scenarios:

  • We have problems with licensing Windows 7 / Windows 10 workers in Google Cloud
  • We have problems under load in GCP
  • We have problems greening up jobs in GCP

We only avoid needing AWS Provider support if none of these things goes wrong, which is a considerable risk to take.

The alternatives to adding support in generic-worker natively are also considerably more complex:

  • Getting generic-worker working with worker-runner on Windows (plus having Windows releases of worker-runner that run as a Windows service, plus the changes needed to OpenCloudConfig, etc.)
  • Migrating AWS Provisioner to the new firefox-ci cluster (a huge change)
  • Continuing to run AWS Provisioner under taskcluster.net but getting it talking to the firefox CI cluster (huge job, lot of risk)

So I agree that adding the support natively in generic-worker is relatively straightforward, key to mitigating the risk of the issues listed above, and much easier than the alternative approaches to achieving the same result.

In other words, we should totally do this - so I will look into it in the coming days.

Although first we might need bug 1518507 to be completed (including child bug 1588625).

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

Agreed, it would be good to have an AWS fallback to mitigate any of the following potential scenarios:

  • We have problems with licensing Windows 7 / Windows 10 workers in Google Cloud
  • We have problems under load in GCP
  • We have problems greening up jobs in GCP

We only avoid needing AWS Provider support if none of these things goes wrong, which is a considerable risk to take.

No Firefox CI workloads are migrating to GCP before Nov 9, with the possible exception of some builds (pending hg optimizations).

However, we're planning to turn off aws-provisioner and ec2-manager on Nov 9. Generic-worker will absolutely need to support AWS provider so that we can continue running existing workloads in AWS.

Note that I hope to do a staging release in a staging cluster late Friday, which is blocked on this.

I've implemented this, and am currently testing it.

Most of the diff is shuffling code around.

I think I should be able to release on Monday.

Note, the generic-worker repo currently has integrations with both the taskcluster.net taskcluster-github and the community taskcluster-github, so I expect the community taskcluster tasks to fail.

Assignee: nobody → pmoore
Comment on attachment 9102636 [details] [review]
GitHub Pull Request for generic-worker

Hi Brian,

I've successfully deployed to [pmoore-test/gwci-linux-beta](https://taskcluster-ui.herokuapp.com/worker-manager/pmoore-test%2Fgwci-linux-beta) from this PR branch, so the linux CI workers for this branch actually run the same version of generic-worker as the one they are testing, under the AWS Provider.

Note, the win10 workers are not starting due to an OCC issue from today. I'll create a separate bug for that.

Many thanks!
Attachment #9102636 - Flags: review?(bstack)
Attachment #9102636 - Flags: review?(bstack) → review+

This deploys generic-worker 16.4.0 to production.

Successful try push here.

Attachment #9103338 - Flags: review?(rthijssen)

This migrates the generic-worker linux CI worker pool from GCP to AWS Provider, running generic-worker 16.4.0, which is the first release to support running under AWS Provider.

Attachment #9103357 - Flags: review?(mozilla)
Attachment #9103338 - Flags: review?(rthijssen) → review+

Many thanks Rob. I've merged the changes, but not deployed as I'm not sure I have the latest L3 chain of trust key.

Are you happy to deploy it?

deploy: gecko-1-b-win2012 gecko-2-b-win2012 gecko-3-b-win2012 gecko-t-win10-64-gpu gecko-t-win10-64 gecko-t-win7-32-gpu gecko-t-win7-32

Many thanks.

Flags: needinfo?(rthijssen)

Tom, as you migrate worker types in AWS Provisioner to worker pools running under AWS Provider, it might be a good opportunity to rename the staging worker types we have for Windows. These are typically used for testing generic-worker updates.

What do you think of the following names?

Production Windows worker pools
===============================

aws-provisioner-v1/gecko-1-b-win2012       =>  gecko-1/b-win2012
aws-provisioner-v1/gecko-t-win10-64        =>  gecko-t/t-win10-64
aws-provisioner-v1/gecko-t-win10-64-gpu    =>  gecko-t/t-win10-64-gpu
aws-provisioner-v1/gecko-t-win7-32         =>  gecko-t/t-win7-32
aws-provisioner-v1/gecko-t-win7-32-gpu     =>  gecko-t/t-win7-32-gpu

Staging Windows worker pools
============================

aws-provisioner-v1/gecko-1-b-win2012-beta  =>  staging-gecko-1/b-win2012
aws-provisioner-v1/gecko-t-win10-64-beta   =>  staging-gecko-t/t-win10-64
aws-provisioner-v1/gecko-t-win10-64-gpu-b  =>  staging-gecko-t/t-win10-64-gpu
aws-provisioner-v1/gecko-t-win7-32-beta    =>  staging-gecko-t/t-win7-32
aws-provisioner-v1/gecko-t-win7-32-gpu-b   =>  staging-gecko-t/t-win7-32-gpu

The following are used by the generic-worker CI to run the generic-worker unit and integration tests on production-like worker environments:

aws-provisioner-v1/gecko-t-win10-64-cu
aws-provisioner-v1/gecko-t-win7-32-cu
aws-provisioner-v1/win2012r2-cu

We could potentially run these workers in the community cluster rather than the firefox-ci cluster, although it would be beneficial to have a means of keeping the images in sync with their firefox-ci counterparts, so that we detect integration issues as swiftly/efficiently as possible. Since the mercurial ci-configuration repository is separate from the github community cluster config repository (whose name I've unfortunately forgotten), I'm not sure how easy it will be to keep the two in sync. It may therefore be more practical to leave the generic-worker CI workers in the firefox-ci deployment, so that the images can be easily shared across the worker pools. What are your thoughts on this?

Note, we need a separate worker pool from the staging worker pool, since the generic-worker config differs between the staging pool and the CI pool (in CI, generic-worker needs to run tasks as root/LocalSystem, so it runs with the config setting runTasksAsCurrentUser set to true, unlike the staging and production workers, which have it set to false).

If it is ok to have some worker pools in firefox-ci for the specific purpose of integration testing worker changes, I would propose the following names. What do you think?

aws-provisioner-v1/gecko-t-win10-64-cu     => taskcluster-ci/gecko-t-win10-64
aws-provisioner-v1/gecko-t-win7-32-cu      => taskcluster-ci/gecko-t-win7-32
aws-provisioner-v1/win2012r2-cu            => taskcluster-ci/gecko-1-b-win2012
Flags: needinfo?(mozilla)

Hey Dustin, please see comment 15. I realise I should have requested your feedback on this too.

Flags: needinfo?(dustin)

Having dedicated workers for testing worker changes makes sense -- I'll leave it to Tom whether those make sense in the staging deployment or the firefox-ci deployment, and if the latter whether "staging" is a confusing name for them.

We discussed the integration testing in slack and pete is going to go ahead with it.

Flags: needinfo?(dustin)

I haven't had a chance to think about naming yet. I did try out the worker image, but it seemed that occ didn't like something about the config.

I've attached the ci-config patch I used to create it.

Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)

i think we'll need to patch occ to work with worker-manager's ec2 provider. it's probably not going to just work without tweaking a few bits around where occ looks for instance metadata.

Flags: needinfo?(rthijssen)

A couple of things (none of which need to be addressed here, but would be nice for the future):

  • Most of the generic-worker config appears to be about how the AMI is configured, and isn't really something that somebody using the AMI should have to configure. I think these configuration options should be baked into the AMI.
  • I'm guessing OCC looks at the aws-provisioner/worker-manager metadata to determine the worker type. If this is being used to pull configuration from OCC, I think it would be useful to de-couple the identifier used for that from the worker name. I'd like to be able to use an AMI+config on a staging worker, and then, after verification, switch the production workers to point at the same AMI+config, rather than having a separate OCC config for the two worker types.
Flags: needinfo?(pmoore)

(In reply to Tom Prince [:tomprince] from comment #19)

but it seemed that occ didn't like something about the config.

What did it not like? Do you have the logs?

:markco discovered that (one) issue was the call to http://169.254.169.254/latest/meta-data/public-keys. Since aws-provider-based images don't have keys set, that returns a 404, which causes the call, and thus the script, to fail. I tried https://github.com/mozilla-releng/OpenCloudConfig/commit/0e3ce7f3ab8c2be14936b1deef8cddaf200824ba to handle it, but it looks like that isn't enough to handle the errors from the requests.

Flags: needinfo?(rthijssen)

i don't think the public key lookup is the issue. we only use that metadata url to determine if the instance occ is running on is an ami-build instance (a special case for image building). when occ is running on a worker instance, we expect a 404 on that metadata http lookup and fall back to the instance userdata where we look for a json object containing a workerType property.

i see that the gecko-t/t-win10-64-beta-3 worker pool definition now has an additionalUserData.workerType field, which i'm guessing is worker-manager's syntax for sending userdata to the instances it spawns. that should work well, but in my testing this morning i haven't observed worker-manager spawning one of these instance types, and i see no errors from attempts to spawn this worker type. to debug this we'll need to watch the instance logs when worker-manager starts one of these, which it doesn't seem to be doing right now, for reasons i don't understand.

please ping me this evening (morning in america) and maybe we can figure this out if we can get worker-manager to fire up an instance.
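
A minimal sketch of the fallback logic described above: OCC itself is PowerShell, so Go is used here only for consistency with the other sketches in this bug, the metadata URLs follow the ones mentioned in this thread, and the function names are made up; this is not OCC's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const metadataBase = "http://169.254.169.254/latest"

// classifyInstance illustrates the described decision flow: a 404 on the
// public-keys metadata URL is the expected case on a worker instance, and
// the workerType is then read from the JSON user data.
func classifyInstance() (amiBuild bool, workerType string, err error) {
	resp, err := http.Get(metadataBase + "/meta-data/public-keys")
	if err != nil {
		return false, "", err
	}
	resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		// Keys present: the special case for AMI-build instances.
		return true, "", nil
	}

	// Expected 404 on worker instances: fall back to the user data.
	udResp, err := http.Get(metadataBase + "/user-data")
	if err != nil {
		return false, "", err
	}
	defer udResp.Body.Close()
	body, err := io.ReadAll(udResp.Body)
	if err != nil {
		return false, "", err
	}
	var ud struct {
		WorkerType string `json:"workerType"`
	}
	if err := json.Unmarshal(body, &ud); err != nil {
		return false, "", err
	}
	return false, ud.WorkerType, nil
}

func main() {
	amiBuild, workerType, err := classifyInstance()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ami-build:", amiBuild, "workerType:", workerType)
}
```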

Flags: needinfo?(rthijssen)

(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #24)

i don't think the public key lookup is the issue. we only use that metadata url to determine if the instance occ is running on is an ami-build instance (a special case for image building). when occ is running on a worker instance, we expect a 404 on that metadata http lookup and fall back to the instance userdata where we look for a json object containing a workerType property.

Hey Rob,

Indeed, the metadata for AWS Provider is a little different to the metadata from AWS Provisioner, and unfortunately it no longer contains the workerType property.

In AWS Provider, the JSON object in userdata contains these properties:

  • workerPoolId
  • providerId
  • workerGroup
  • rootUrl
  • workerConfig

For AWS Provisioner, the JSON object in userdata contained these properties:

  • data
  • capacity
  • workerType
  • provisionerId
  • region
  • availabilityZone
  • instanceType
  • spotBid
  • price
  • launchSpecGenerated
  • lastModified
  • provisionerBaseUrl
  • taskclusterRootUrl
  • securityToken

Note that the workerPoolId is essentially <provisionerId>/<workerType> so if you need the worker type, it should be possible to scrape it from the worker pool ID (if that helps).
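
For example, recovering the worker type from the worker pool ID is a simple string split (a sketch only; the function name here is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// splitWorkerPoolID splits a workerPoolId of the form
// "<provisionerId>/<workerType>" into its two components.
func splitWorkerPoolID(workerPoolID string) (provisionerID, workerType string) {
	parts := strings.SplitN(workerPoolID, "/", 2)
	if len(parts) != 2 {
		return workerPoolID, ""
	}
	return parts[0], parts[1]
}

func main() {
	p, wt := splitWorkerPoolID("gecko-t/t-win10-64")
	fmt.Println(p, wt) // gecko-t t-win10-64
}
```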

Flags: needinfo?(mozilla)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Blocks: 1592844

OpenCloudConfig looks for a top-level workerType key in the instance user-data. Since AWS Provider doesn't set this, we set it here to tell OpenCloudConfig the appropriate manifest to apply.

ami deployment in progress for worker types:

  • gecko-3-b-win2012
  • gecko-3-b-win2012-c4
  • gecko-3-b-win2012-c5
  • mpd001-3-b-win2012

https://tools.taskcluster.net/groups/Bgu28fWQTKudoPl12rB7Xw

Comment on attachment 9103652 [details]
Bug 1588834: [WIP] g-w on worker-manager

Revision D50290 was moved to bug 1589706. Setting attachment 9103652 [details] to obsolete.

Attachment #9103652 - Attachment is obsolete: true
Attachment #9103357 - Flags: review?(mozilla)