Closed Bug 1480412 Opened 6 years ago Closed 6 years ago

Upgrade generic-worker to at least version 10.11.3 on macOS workers (releng-hardware/gecko-t-osx-1010)

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

Production
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: dhouse)

References

Details

Attachments

(3 files, 1 obsolete file)

First, of course, we'll need to test 10.11.2 in staging; if the staging tests are ok, we should roll it out to production.

This will pick up the latest worker fixes, including bug 1475689.
Added the package to the server:
```
-rw-r--r-- 1 puppetsync puppetsync 13704784 Aug  2 05:43 generic-worker-v10.11.2-darwin-amd64
```
I did not get this tested on staging today. (I attempted to clone mozilla-central three times, and each attempt failed. Maybe the CDN bundles are old? There are 455955 file changes after bundle 3cb90f1, and the pull times out. I'll try again this evening when things may be faster.)
I've kicked off some tests on the mac staging pool:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ad701fbf0e4adfb006d0d096f65ea9cc10af1c0

(The clone failed again last night due to timeouts, so this morning I manually pulled the bundle from the CDN and updated without trouble.)
The staging (beta) workers are running the tasks: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010-beta
I need to check/upgrade the generic-worker on these beta workers to make sure I'm testing it (I do not know yet how it is upgraded -- through puppet, or was it automatically pulled when it became available in the repo?).
Confirmed staging is running the same as prod.
I'll pin them to my puppet environment with the new generic-worker.
```
[taskcluster 2018-08-03T16:06:32.930Z] Worker Type (gecko-t-osx-1010-beta) settings:
[taskcluster 2018-08-03T16:06:32.931Z]   {
[taskcluster 2018-08-03T16:06:32.931Z]     "config": {
[taskcluster 2018-08-03T16:06:32.931Z]       "deploymentId": "",
[taskcluster 2018-08-03T16:06:32.931Z]       "runTasksAsCurrentUser": true
[taskcluster 2018-08-03T16:06:32.931Z]     },
[taskcluster 2018-08-03T16:06:32.931Z]     "generic-worker": {
[taskcluster 2018-08-03T16:06:32.931Z]       "go-arch": "amd64",
[taskcluster 2018-08-03T16:06:32.931Z]       "go-os": "darwin",
[taskcluster 2018-08-03T16:06:32.931Z]       "go-version": "go1.10.2",
[taskcluster 2018-08-03T16:06:32.931Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.10.0",
[taskcluster 2018-08-03T16:06:32.931Z]       "revision": "2b70c93b13e56ff4c31e197904da38c23e0fa09e",
[taskcluster 2018-08-03T16:06:32.931Z]       "source": "https://github.com/taskcluster/generic-worker/commits/2b70c93b13e56ff4c31e197904da38c23e0fa09e",
[taskcluster 2018-08-03T16:06:32.931Z]       "version": "10.10.0"
[taskcluster 2018-08-03T16:06:32.931Z]     },
[taskcluster 2018-08-03T16:06:32.931Z]     "machine-setup": {
[taskcluster 2018-08-03T16:06:32.931Z]       "config": "https://github.com/mozilla-releng/build-puppet/raw/master/modules/generic_worker/templates/generic-worker.config.erb",
[taskcluster 2018-08-03T16:06:32.931Z]       "docs": "https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Modules/generic_worker"
[taskcluster 2018-08-03T16:06:32.931Z]     }
[taskcluster 2018-08-03T16:06:32.931Z]   }
[taskcluster 2018-08-03T16:06:32.931Z] Task ID: C6MHLeuSQ8GZLFCyKLM0Dg
```

Verified the darwin binary is in place:
```
[dhouse@releng-puppet2.srv.releng.mdc1.mozilla.com puppet]$ ls -la /data/repos/EXEs/generic-worker-v10.11.2-*
-rw-r--r-- 1 puppetsync puppetsync 13704784 Aug  2 05:43 /data/repos/EXEs/generic-worker-v10.11.2-darwin-amd64
-rw-r--r-- 1 puppetsync puppetsync 13821142 Aug  2 05:45 /data/repos/EXEs/generic-worker-v10.11.2-linux-amd64
```

Will be upgrading from v10.8.4 to v10.11.2
Attached file upgrade osx generic-worker to v10.11.2 (obsolete) —
Assignee: relops → dhouse
Blocks: 1480839
My switch to v10.11.2 caused a tree closure: the machines switched over to pull from the non-beta workerType queue when I moved their puppet pin from Dragos's environment to my own. I need to fix a variable name mixup.
No longer blocks: 1480839
Depends on: 1480839
With that issue fixed, I'm still seeing the malformed payload for my tests running on v10.11.2:
```
[taskcluster:error] TASK FAIL since the task payload is invalid. See errors:
[taskcluster:error] - osGroups: Additional property osGroups is not allowed
[taskcluster:error] Validation of payload failed for task XkV5V1BPQWi_5DIr7WlOLw
```

Quarantined the workers again. It looks like a change may be needed in this generic-worker version to handle the current task definitions that include osGroups.
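
For reference, the kind of task payload fragment that trips this check looks roughly like the following (a sketch only; the command and maxRunTime values are illustrative, the relevant part is the empty osGroups list that the gecko task generation emits for mac/linux tasks):
```
{
  "command": [["/bin/echo", "hello"]],
  "maxRunTime": 3600,
  "osGroups": []
}
```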
Attachment #8997536 - Flags: checked-in+
The underlying cause of the payload exceptions was the removal of the 'osGroups' feature from non-Windows platforms in generic-worker 10.11.2.

It was removed because the osGroups feature is not supported on non-Windows platforms: on those platforms, tasks run as the same user as the generic-worker process, so that OS user cannot add itself to other groups.

However, what I hadn't considered was that there might be tasks being generated for non-Windows platforms (mac, linux) that specify an empty list of groups. Indeed this was the case, so removing the feature on mac/linux broke these tasks.

Rather than fix the task generation code not to specify osGroups on platforms that don't implement the feature, I've instead allowed an empty list of groups to be included in the task payload for mac/linux. This avoids the need to uplift task generation fixes to all gecko trees/branches.
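
A minimal sketch of the rule described above, not the actual generic-worker source (names and structure here are illustrative): an osGroups key in the payload is tolerated on mac/linux only when the list is empty, while non-empty lists remain Windows-only.
```
// Illustrative sketch only (not the generic-worker code): osGroups may
// appear in a task payload on any platform, but may only be non-empty
// where the feature is implemented (Windows).
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// taskPayload keeps just the field this example cares about.
type taskPayload struct {
	OSGroups []string `json:"osGroups,omitempty"`
}

// validateOSGroups returns nil for an empty (or absent) list everywhere,
// and an error for a non-empty list on platforms without the feature.
func validateOSGroups(p taskPayload) error {
	if runtime.GOOS == "windows" {
		return nil // feature implemented: groups are handled elsewhere
	}
	if len(p.OSGroups) > 0 {
		return errors.New("malformed-payload: osGroups not supported on " + runtime.GOOS)
	}
	return nil // empty list is tolerated, matching the fix in this bug
}

func main() {
	fmt.Println(validateOSGroups(taskPayload{OSGroups: []string{}}))        // <nil> on mac/linux
	fmt.Println(validateOSGroups(taskPayload{OSGroups: []string{"admin"}})) // error on mac/linux
}
```
Specifying a non-empty list on mac/linux would still produce the malformed-payload exception, as the commit message below notes.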
Attachment #8997822 - Flags: review?(bstack)
(In reply to Pete Moore [:pmoore][:pete] from comment #10)
> Created attachment 8997822 [details] [review]
> Github Pull Request for generic-worker

Thanks Pete for proactively finding and fixing this!
Attachment #8997822 - Flags: review?(bstack) → review+
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/aa934cb2a5d356c421544782a3748add7f906dce
Bug 1480412 - allow empty osGroups list on non-Windows platforms

Since the osGroups feature isn't supported on non-Windows platforms currently,
I had moved the code into the Windows codebase. I hadn't realised that the
gecko task generation code was creating non-Windows tasks specifying an
empty osGroups list. Rather than fix the task generation in all trees and
branches of gecko (and any other projects that might be doing the same)
not to specify osGroups, this commit allows an empty list to be supplied.
Specifying a non-empty list will result in a malformed-payload exception.

https://github.com/taskcluster/generic-worker/commit/9c251ae47a23769980bfa5bced1b5f0b74665265
Merge pull request #119 from taskcluster/bug1480412

Bug 1480412 - allow empty osGroups list on non-Windows platforms
(In reply to Dave House [:dhouse] from comment #11)
> (In reply to Pete Moore [:pmoore][:pete] from comment #10)
> > Created attachment 8997822 [details] [review]
> > Github Pull Request for generic-worker
> 
> Thanks Pete for proactively finding and fixing this!

No worries! And sorry for the bug. :-)

I've triggered a new release (10.11.3) which contains the fix from comment 10, which should be published to github shortly; updating bug title to reflect this.

Thanks.
Summary: Upgrade generic-worker to at least version 10.11.2 on macOS workers (releng-hardware/gecko-t-osx-1010) → Upgrade generic-worker to at least version 10.11.3 on macOS workers (releng-hardware/gecko-t-osx-1010)
I downloaded the v10.11.3 release for darwin and linux to the releng distinguished puppetmaster (it will copy out to the other puppet masters within about an hour):
```
-rw-r--r-- 1 puppetsync puppetsync 13713600 Aug  7 01:55 generic-worker-v10.11.3-darwin-amd64
-rw-r--r-- 1 puppetsync puppetsync 13830051 Aug  7 01:58 generic-worker-v10.11.3-linux-amd64
```
I kicked off a try build for testing v10.11.3 on the staging(beta) pool. First, I applied puppet pinned to my environment and confirmed the version and workerType on all four darwin beta workers:
```
  "workerType": "gecko-t-osx-1010-beta",
  generic-worker 10.11.3 [ revision: https://github.com/taskcluster/generic-worker/commits/fc610c547908a843f5cf922a164e0d406de9d3f2 ]
```
(In reply to Dave House [:dhouse] from comment #16)
> I kicked off a try build for testing v10.11.3 on the staging(beta) pool.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cac7f55fb1c0603f123002c4a7a6aece138431c1
Attachment #8997482 - Attachment is obsolete: true
I've run the full set of tests over the staging/beta machines multiple times in the last few days and none have been left in the bad state. I've also decreased the max-run-time/timeouts and not seen any stuck from timeouts.

Danut collected a list of the tests that often cause the mac testers to get stuck (and not reboot themselves):
I think this includes them:
try: -b do -p macosx64 -u mochitest-gl[10.10],mochitest-e10s-1[10.10],mochitest-e10s-bc[10.10],mochitest-media-e10s[10.10] -t none --artifact

I'll run those specific tests over the machines a few more times to make sure none get stuck.
(In reply to Dave House [:dhouse] from comment #19)
> I've run the full set of tests over the staging/beta machines multiple times
> in the last few days and none have been left in the bad state. I've also
> decreased the max-run-time/timeouts and not seen any stuck from timeouts.
> 
> Danut collected a list of the tests that often cause the mac testers to get
> stuck (and not reboot themselves):
> I think this includes them:
> try: -b do -p macosx64 -u
> mochitest-gl[10.10],mochitest-e10s-1[10.10],mochitest-e10s-bc[10.10],
> mochitest-media-e10s[10.10] -t none --artifact
> 
> I'll run those specific tests over the machines a few more times to make
> sure none get stuck.

Hi Dave,

How did the tests go?

Thanks!
Flags: needinfo?(dhouse)
(In reply to Pete Moore [:pmoore][:pete] from comment #20)
> (In reply to Dave House [:dhouse] from comment #19)
> > I've run the full set of tests over the staging/beta machines multiple times
> > in the last few days and none have been left in the bad state. I've also
> > decreased the max-run-time/timeouts and not seen any stuck from timeouts.
> > 
> > Danut collected a list of the tests that often cause the mac testers to get
> > stuck (and not reboot themselves):
> > I think this includes them:
> > try: -b do -p macosx64 -u
> > mochitest-gl[10.10],mochitest-e10s-1[10.10],mochitest-e10s-bc[10.10],
> > mochitest-media-e10s[10.10] -t none --artifact
> > 
> > I'll run those specific tests over the machines a few more times to make
> > sure none get stuck.
> 
> Hi Dave,
> 
> How did the tests go?
> 
> Thanks!

The tests showed no workers getting stuck on the new generic-worker. I ran the same tests against production (the old version) and multiple machines got stuck. (I re-ran this set today on production and staging again to verify: REV=bdab63898372039a3c7ba6e2539ca9e5147a7e6a (prod) and REV=9e41c5982f0714847274caad00ac6d7a6879d319 (stage).)

Joel, do you want to run some tests against staging also, or can you sign off on upgrading to this version of generic-worker?
Flags: needinfo?(dhouse) → needinfo?(jmaher)
I added more tasks to both pushes and the talos jobs are failing.

I would prefer if you do a:
./mach try -b do -p macosx -u all -t all
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #22)
> I added more tasks to both pushes and the talos jobs are failing.
> 
> I would prefer if you do a:
> ./mach try -b do -p macosx -u all -t all

Thanks. I've retried with all tests:

current production:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9c48ac6843272da994db4c54384a87b7519f3c9

staging (new version):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=45696c25993d913e7af27cc598d8ee536e572fad
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #22)
> I added more tasks to both pushes and the talos jobs are failing.
> 
> I would prefer if you do a:
> ./mach try -b do -p macosx -u all -t all

I think this needs to be "macosx64" rather than "macosx" ...

Dave, can you try again with "macosx64"?

I wonder why `./mach try` doesn't throw an error when an unrecognised platform is passed in...
Flags: needinfo?(dhouse)
(In reply to Pete Moore [:pmoore][:pete] from comment #24)
> (In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #22)
> > I added more tasks to both pushes and the talos jobs are failing.
> > 
> > I would prefer if you do a:
> > ./mach try -b do -p macosx -u all -t all
> 
> I think this needs to be "macosx64" rather than "macosx" ...
> 
> Dave, can you try again with "macosx64"?
> 
> I wonder why `./mach try` doesn't throw an error when an unrecognised
> platform is passed in...

I made new builds with "macosx64". I pushed try syntax in a commit message rather than running `./mach try`, so mach itself might still catch the unrecognized platform.

staging:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d969124d5b5b5396f2da5cca577d7d5e16af7693

prod:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b07afa5221b617de2b30aa017b9210b8c45ee3cc
Flags: needinfo?(dhouse)
Joel, could you check these talos timings (this is the generic-worker upgrade for osx):

Another try run, without adjusted timeouts, to verify talos timings:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=adea5fb9d06047574441cebaae0f0478e225f15f
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=adea5fb9d06047574441cebaae0f0478e225f15f
Flags: needinfo?(jmaher)
The noise seems to be cut in half and there are no measurable changes, so this looks good; thanks for pushing to try and gathering enough data points.
Flags: needinfo?(jmaher)
Dragos, could you move this to production since I'm out? Joel signed off, so I think we just need to notify CiDuty to watch for issues and let Joel know it is going to production.
Flags: needinfo?(dcrisan)
Merged the branch into master.
Flags: needinfo?(dcrisan)
Comment on attachment 8998247 [details] [review]
upgrade generic-worker to v10.11.3

Thanks Dragos!
Attachment #8998247 - Flags: checked-in+
The workers are running generic-worker v10.11.3. However, while checking I found some without the worker process running, and the last failed job was a timeout, so the problem may have reappeared. I'll check more and follow up in bug 1475689.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED