Closed Bug 1588834 Opened 5 years ago Closed 5 years ago

Add support for aws-provider to generic-worker.

Categories

(Taskcluster :: Workers, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tomprince, Assigned: pmoore)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

Generic-worker has support for running under aws-provisioner and worker-manager/gcp-provider, but not worker-manager/aws-provider. This would be solved by Bug 1558532, but reading the requirements there, it seems like it may be a large amount of work to support that. In light of needing to migrate to aws-provider before Nov 9, adding the support to generic-worker directly seems like a good interim solution.

Agreed, that is certainly the shorter path.

FWIW, I see from the taskcluster-worker-runner implementation that we can tell whether we are running under AWS Provider or AWS Provisioner based on the properties in the user data. They use completely different property names.

AWS Provisioner:

  • data
  • workerType
  • provisionerId
  • region
  • taskclusterRootUrl
  • securityToken
  • capacity

AWS Provider:

  • workerPoolId
  • providerId
  • rootUrl
  • workerGroup

So it is pretty straightforward to determine at runtime which service spawned the worker, without needing to add additional command line options to generic-worker (we can reuse the existing --configure-for-aws option to handle both).
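
For illustration, here is a minimal Go sketch of that runtime check. The JSON property names come from the two lists above; the type and function names (and the plain HTTP fetch of the user data) are hypothetical and not taken from generic-worker's actual code:

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
)

// userData holds only the fields needed to tell the two services apart.
// The JSON property names are taken from the lists above; everything else
// in this sketch is made up.
type userData struct {
	WorkerPoolID  string `json:"workerPoolId"`  // present under AWS Provider
	ProvisionerID string `json:"provisionerId"` // present under AWS Provisioner
	WorkerType    string `json:"workerType"`    // present under AWS Provisioner
}

// detectProvisioningService fetches the EC2 user data and reports which
// service spawned the instance, based on which properties are present.
func detectProvisioningService() (string, error) {
	resp, err := http.Get("http://169.254.169.254/latest/user-data")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	var ud userData
	if err := json.Unmarshal(body, &ud); err != nil {
		return "", err
	}
	switch {
	case ud.WorkerPoolID != "":
		return "aws-provider", nil
	case ud.ProvisionerID != "" || ud.WorkerType != "":
		return "aws-provisioner", nil
	default:
		return "unknown", nil
	}
}

func main() {
	svc, err := detectProvisioningService()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("worker spawned by: %s", svc)
}
```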

From :dustin in slack:

one thing to check when implementing this is that TASKCLUSTER_WORKER_LOCATION is set correctly, otherwise sccache won't work
and be really sure it's the same format as implemented in worker runner, or we'll be sadfaces later :slightly_smiling_face:
otherwise, I think the registerWorker code that landed in gcp.go can be copy/pasta'd to aws.go
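
As a rough illustration of those two points, here is a hedged sketch: the TASKCLUSTER_WORKER_LOCATION shape shown (a JSON object with cloud, region and availabilityZone) is an assumption about what worker-runner emits for AWS and should be verified against worker-runner itself; the payload fields follow the worker-manager registerWorker API; and all Go names and placeholder values are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// workerLocation mirrors the assumed TASKCLUSTER_WORKER_LOCATION value for
// AWS workers (assumption: a JSON object with cloud, region and
// availabilityZone); check worker-runner for the authoritative format.
type workerLocation struct {
	Cloud            string `json:"cloud"`
	Region           string `json:"region"`
	AvailabilityZone string `json:"availabilityZone"`
}

// registerWorkerRequest carries the fields the worker-manager registerWorker
// endpoint expects; for AWS the identity proof is built from the signed EC2
// instance identity document.
type registerWorkerRequest struct {
	WorkerPoolID        string          `json:"workerPoolId"`
	ProviderID          string          `json:"providerId"`
	WorkerGroup         string          `json:"workerGroup"`
	WorkerID            string          `json:"workerId"`
	WorkerIdentityProof json.RawMessage `json:"workerIdentityProof"`
}

func main() {
	// Placeholder values; in generic-worker these would come from the EC2
	// metadata service and the user data shown earlier in this bug.
	loc, _ := json.Marshal(workerLocation{
		Cloud:            "aws",
		Region:           "us-east-1",
		AvailabilityZone: "us-east-1a",
	})
	os.Setenv("TASKCLUSTER_WORKER_LOCATION", string(loc))

	req := registerWorkerRequest{
		WorkerPoolID: "gecko-t/t-win10-64",
		ProviderID:   "aws",
		WorkerGroup:  "us-east-1",
		WorkerID:     "i-0123456789abcdef0",
		// For AWS this would contain the instance identity document and its
		// signature; left empty in this sketch.
		WorkerIdentityProof: json.RawMessage(`{}`),
	}
	payload, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(payload)) // submitted via worker-manager's registerWorker (call elided)
}
```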

(In reply to Tom Prince [:tomprince] from comment #0)

Generic-worker has support for running under aws-provisioner and worker-manager/gcp-provider, but not worker-manager/aws-provider. This would be solved by Bug 1558532, but reading the requirements there, it seems like it may be a large amount of work to support that. In light of needing to migrate to aws-provider before Nov 9, adding the support to generic-worker directly seems like a good interim solution.

I think all generic-worker worker types that will be migrated to the community cluster and currently run in AWS can already run in Google Cloud, so maybe this isn't a blocker for the migration.

Tom, which worker types were you thinking of?

Flags: needinfo?(mozilla)

This is for all the stuff managed by OCC for firefox-ci.

Flags: needinfo?(mozilla)
Depends on: 1518507

Agreed, it would be good to have an AWS fallback to mitigate any of the following potential scenarios:

  • We have problems with licensing Windows 7 / Windows 10 workers in Google Cloud
  • We have problems under load in GCP
  • We have problems greening up jobs in GCP

We only avoid needing AWS Provider support if none of these things goes wrong, which is a considerable risk to take.

The alternatives to adding support in generic-worker natively are also considerably more complex:

  • Getting generic-worker working with worker-runner on Windows (plus having Windows releases of worker-runner that run as a Windows service, plus the changes needed to OpenCloudConfig, etc.)
  • Migrating AWS Provisioner to the new firefox-ci cluster (a huge change)
  • Continuing to run AWS Provisioner under taskcluster.net but getting it talking to the firefox CI cluster (huge job, lot of risk)

So I agree that adding the support natively in generic-worker is relatively straightforward, key to mitigating the risk of the issues listed above, and much easier than the alternative approaches to achieving the same result.

In other words, we should totally do this - so I will look into it in the coming days.

Although first we might need bug 1518507 to be completed (including child bug 1588625).

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

Agreed, it would be good to have an AWS fallback to mitigate any of the following potential scenarios:

  • We have problems with licensing Windows 7 / Windows 10 workers in Google Cloud
  • We have problems under load in GCP
  • We have problems greening up jobs in GCP

We only avoid needing AWS Provider support if none of these things goes wrong, which is a considerable risk to take.

No Firefox CI workloads are migrating to GCP before Nov 9, with the possible exception of some builds (pending hg optimizations).

However, we're planning to turn off aws-provisioner and ec2-manager on Nov 9. Generic-worker will absolutely need to support AWS provider so that we can continue running existing workloads in AWS.

Note that I hope to do a staging release in a staging cluster late Friday, which is blocked on this.

I've implemented this, and am currently testing it.

Most of the diff is shuffling code around.

I think I should be able to release on Monday.

Note, the generic-worker repo currently has integrations with both the taskcluster.net taskcluster-github and the community taskcluster-github, so I expect the community taskcluster tasks to fail.

Assignee: nobody → pmoore
Comment on attachment 9102636 [details] [review]
GitHub Pull Request for generic-worker

Hi Brian,

I've successfully deployed to [pmoore-test/gwci-linux-beta](https://taskcluster-ui.herokuapp.com/worker-manager/pmoore-test%2Fgwci-linux-beta) from this PR branch, so the linux CI workers for this branch actually run the same version of generic-worker as the one they are testing, under the AWS Provider.

Note, the win10 workers are not starting due to an OCC issue from today. I'll create a separate bug for that.

Many thanks!
Attachment #9102636 - Flags: review?(bstack)
Attachment #9102636 - Flags: review?(bstack) → review+

This deploys generic-worker 16.4.0 to production.

Successful try push here.

Attachment #9103338 - Flags: review?(rthijssen)

This migrates the generic-worker linux CI worker pool from GCP to AWS Provider, running generic-worker 16.4.0, which is the first release to support running under AWS Provider.

Attachment #9103357 - Flags: review?(mozilla)
Attachment #9103338 - Flags: review?(rthijssen) → review+

Many thanks Rob. I've merged the changes, but not deployed as I'm not sure I have the latest L3 chain of trust key.

Are you happy to deploy it?

deploy: gecko-1-b-win2012 gecko-2-b-win2012 gecko-3-b-win2012 gecko-t-win10-64-gpu gecko-t-win10-64 gecko-t-win7-32-gpu gecko-t-win7-32

Many thanks.

Flags: needinfo?(rthijssen)

Tom, as you migrate worker types in AWS Provisioner to worker pools running under AWS Provider, it might be a good opportunity to rename the staging worker types we have for Windows. These are typically used for testing generic-worker updates.

What do you think of the following names?

Production Windows worker pools
===============================

aws-provisioner-v1/gecko-1-b-win2012       =>  gecko-1/b-win2012
aws-provisioner-v1/gecko-t-win10-64        =>  gecko-t/t-win10-64
aws-provisioner-v1/gecko-t-win10-64-gpu    =>  gecko-t/t-win10-64-gpu
aws-provisioner-v1/gecko-t-win7-32         =>  gecko-t/t-win7-32
aws-provisioner-v1/gecko-t-win7-32-gpu     =>  gecko-t/t-win7-32-gpu

Staging Windows worker pools
============================

aws-provisioner-v1/gecko-1-b-win2012-beta  =>  staging-gecko-1/b-win2012
aws-provisioner-v1/gecko-t-win10-64-beta   =>  staging-gecko-t/t-win10-64
aws-provisioner-v1/gecko-t-win10-64-gpu-b  =>  staging-gecko-t/t-win10-64-gpu
aws-provisioner-v1/gecko-t-win7-32-beta    =>  staging-gecko-t/t-win7-32
aws-provisioner-v1/gecko-t-win7-32-gpu-b   =>  staging-gecko-t/t-win7-32-gpu

The following are used by the generic-worker CI to run the generic-worker unit and integration tests on production-like worker environments:

aws-provisioner-v1/gecko-t-win10-64-cu
aws-provisioner-v1/gecko-t-win7-32-cu
aws-provisioner-v1/win2012r2-cu

We could potentially run these workers in the community cluster rather than the firefox-ci cluster, although it would be beneficial to have a means of keeping the images in sync with their firefox-ci counterparts, so that we detect integration issues as swiftly/efficiently as possible. Since the mercurial ci-configuration repository is separate from the github community cluster config repository (whose name I've unfortunately forgotten), I'm not sure how easy it will be to keep the two in sync. It may therefore be more practical to leave the generic-worker CI workers in the firefox-ci deployment, so that the images can be easily shared across the worker pools. What are your thoughts on this?

Note, we need a separate worker pool from the staging worker pool, since the generic-worker config differs between the staging pool and the CI pool (in CI, generic-worker needs to run tasks as root/LocalSystem, so it runs with the config setting runTasksAsCurrentUser set to true, unlike the staging and production workers, which have it set to false).

If it is ok to have some worker pools in firefox-ci for the specific purpose of integration testing worker changes, I would propose the following names. What do you think?

aws-provisioner-v1/gecko-t-win10-64-cu     => taskcluster-ci/gecko-t-win10-64
aws-provisioner-v1/gecko-t-win7-32-cu      => taskcluster-ci/gecko-t-win7-32
aws-provisioner-v1/win2012r2-cu            => taskcluster-ci/gecko-1-b-win2012
Flags: needinfo?(mozilla)

Hey Dustin, please see comment 15. I realise I should have requested your feedback on this too.

Flags: needinfo?(dustin)

Having dedicated workers for testing worker changes makes sense -- I'll leave it to Tom whether those make sense in the staging deployment or the firefox-ci deployment, and if the latter whether "staging" is a confusing name for them.

We discussed the integration testing in slack and pete is going to go ahead with it.

Flags: needinfo?(dustin)

I haven't had a chance to think about naming yet. I did try out the worker image, but it seemed that occ didn't like something about the config.

I've attached the ci-config patch I used to create it.

Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)

i think we'll need to patch occ to work with worker-manager's ec2 provider. it's probably not going to just work without tweaking a few bits around where occ looks for instance metadata.

Flags: needinfo?(rthijssen)

A couple of things (none of which need to be addressed here, but would be nice for the future):

  • Most of the generic-worker config appears to be about how the AMI is configured, and isn't really something that somebody using the AMI should have to configure. I think these configuration options should be baked into the AMI.
  • I'm guessing OCC looks at the aws-provisioner/worker-manager metadata to determine the worker type. If this is being used to pull configuration from OCC, I think it would be useful to de-couple the identifier used for that from the worker name. I'd like to be able to use an AMI+config on a staging worker, and then, after verification, switch the production workers to point at the same AMI+config, rather than having a separate OCC config for the two worker types.
Flags: needinfo?(pmoore)

(In reply to Tom Prince [:tomprince] from comment #19)

but it seemed that occ didn't like something about the config.

What did it not like? Do you have the logs?

:markco discovered that (one) issue was the call to http://169.254.169.254/latest/meta-data/public-keys. Since aws-provider-based images don't have keys set, that returns a 404, which causes the call, and thus the script, to fail. I tried https://github.com/mozilla-releng/OpenCloudConfig/commit/0e3ce7f3ab8c2be14936b1deef8cddaf200824ba to handle it, but it looks like that isn't enough to handle the errors from the requests.

Flags: needinfo?(rthijssen)

i don't think the public key lookup is the issue. we only use that metadata url to determine if the instance occ is running on is an ami-build instance (a special case for image building). when occ is running on a worker instance, we expect a 404 on that metadata http lookup and fall back to the instance userdata where we look for a json object containing a workerType property.

i see that the gecko-t/t-win10-64-beta-3 worker pool definition now has an additionalUserData.workerType field, which i'm guessing is worker-manager's syntax for sending userdata to the instances it spawns. that should work well, but in my testing this morning i haven't observed worker-manager spawning one of these instance types, and i see no errors from attempts to spawn this worker type. to debug this we'll need to watch the instance logs when worker-manager starts one of these, which it doesn't seem to be doing right now, for reasons i don't understand.

please ping me this evening (morning in america) and maybe we can figure this out if we can get worker-manager to fire up an instance.
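
A minimal sketch of the fallback logic described above: OCC itself is PowerShell, so Go is used here only for consistency with the other sketches in this bug, the metadata URLs follow the ones mentioned in this thread, and the function names are made up; this is not OCC's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const metadataBase = "http://169.254.169.254/latest"

// classifyInstance illustrates the described decision flow: a 404 on the
// public-keys metadata URL is the expected case on a worker instance, and
// the workerType is then read from the JSON user data.
func classifyInstance() (amiBuild bool, workerType string, err error) {
	resp, err := http.Get(metadataBase + "/meta-data/public-keys")
	if err != nil {
		return false, "", err
	}
	resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		// Keys present: the special case for AMI-build instances.
		return true, "", nil
	}

	// Expected 404 on worker instances: fall back to the user data.
	udResp, err := http.Get(metadataBase + "/user-data")
	if err != nil {
		return false, "", err
	}
	defer udResp.Body.Close()
	body, err := io.ReadAll(udResp.Body)
	if err != nil {
		return false, "", err
	}
	var ud struct {
		WorkerType string `json:"workerType"`
	}
	if err := json.Unmarshal(body, &ud); err != nil {
		return false, "", err
	}
	return false, ud.WorkerType, nil
}

func main() {
	amiBuild, workerType, err := classifyInstance()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ami-build:", amiBuild, "workerType:", workerType)
}
```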

Flags: needinfo?(rthijssen)

(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #24)

i don't think the public key lookup is the issue. we only use that metadata url to determine if the instance occ is running on is an ami-build instance (a special case for image building). when occ is running on a worker instance, we expect a 404 on that metadata http lookup and fall back to the instance userdata where we look for a json object containing a workerType property.

Hey Rob,

Indeed, the metadata for AWS Provider is a little different to the metadata from AWS Provisioner, and unfortunately it no longer contains the workerType property.

In AWS Provider, the JSON object in userdata contains these properties:

  • workerPoolId
  • providerId
  • workerGroup
  • rootUrl
  • workerConfig

For AWS Provisioner, the JSON object in userdata contained these properties:

  • data
  • capacity
  • workerType
  • provisionerId
  • region
  • availabilityZone
  • instanceType
  • spotBid
  • price
  • launchSpecGenerated
  • lastModified
  • provisionerBaseUrl
  • taskclusterRootUrl
  • securityToken

Note that the workerPoolId is essentially <provisionerId>/<workerType> so if you need the worker type, it should be possible to scrape it from the worker pool ID (if that helps).
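
For example, recovering the worker type from the worker pool ID is a simple string split (a sketch only; the function name here is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// splitWorkerPoolID splits a workerPoolId of the form
// "<provisionerId>/<workerType>" into its two components.
func splitWorkerPoolID(workerPoolID string) (provisionerID, workerType string) {
	parts := strings.SplitN(workerPoolID, "/", 2)
	if len(parts) != 2 {
		return workerPoolID, ""
	}
	return parts[0], parts[1]
}

func main() {
	p, wt := splitWorkerPoolID("gecko-t/t-win10-64")
	fmt.Println(p, wt) // gecko-t t-win10-64
}
```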

Flags: needinfo?(mozilla)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Blocks: 1592844

OpenCloudConfig looks for a top-level workerType key in the instance user-data. Since AWS Provider doesn't set this, we set it here to tell OpenCloudConfig the appropriate manifest to apply.

ami deployment in progress for worker types:

  • gecko-3-b-win2012
  • gecko-3-b-win2012-c4
  • gecko-3-b-win2012-c5
  • mpd001-3-b-win2012

https://tools.taskcluster.net/groups/Bgu28fWQTKudoPl12rB7Xw

Comment on attachment 9103652 [details]
Bug 1588834: [WIP] g-w on worker-manager

Revision D50290 was moved to bug 1589706. Setting attachment 9103652 [details] to obsolete.

Attachment #9103652 - Attachment is obsolete: true
Attachment #9103357 - Flags: review?(mozilla)