Closed Bug 1360539 Opened 7 years ago Closed 7 years ago

generic-worker 8.3.0 to 8.4.1 causing process failures on win2012r2 worker types (runTasksAsCurrentUser: false)

Categories: Taskcluster :: Workers (defect)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: pmoore
Assignee: Unassigned

When I last rebuilt the AMIs for nss-win2012r2 worker type, we consistently got this strange error when running tasks:

[taskcluster 2017-04-25T17:37:53.181Z] === Task Starting ===
[taskcluster 2017-04-25T17:37:53.185Z] Executing command 0: bash -c "hg clone -r $NSS_HEAD_REVISION $NSS_HEAD_REPOSITORY nss || (sleep 2; hg clone -r $NSS_HEAD_REVISION $NSS_HEAD_REPOSITORY nss) || (sleep 5; hg clone -r $NSS_HEAD_REVISION $NSS_HEAD_REPOSITORY nss)"

C:\Users\task_1493141864>bash -c "hg clone -r $NSS_HEAD_REVISION $NSS_HEAD_REPOSITORY nss || (sleep 2; hg clone -r $NSS_HEAD_REVISION $NSS_HEAD_REPOSITORY nss) || (sleep 5; hg clone -r $NSS_HEAD_REVISION $NSS_HEAD_REPOSITORY nss)" 
c:\mozilla-build\msys\bin\bash.exe: *** CreateFileMappingA, Win32 error 0.  Terminating.
[taskcluster 2017-04-25T17:37:53.422Z]   Exit Code: 1
[taskcluster 2017-04-25T17:37:53.423Z]   User Time: 0s
[taskcluster 2017-04-25T17:37:53.424Z] Kernel Time: 0s
[taskcluster 2017-04-25T17:37:53.426Z]   Wall Time: 160.9993ms
[taskcluster 2017-04-25T17:37:53.427Z] Peak Memory: 1703936
[taskcluster 2017-04-25T17:37:53.428Z]      Result: FAILED
[taskcluster 2017-04-25T17:37:53.429Z] === Task Finished ===

I am not sure at this moment what caused this.

This happened when rolling out the generic-worker 8.3.0 release - however, we did not hit it on other platforms/workers, so it may be something specific to the command that ran.

I was able to resolve the situation by rolling back to a previous AMI that had been built for win2012r2 worker type.

The change could be one of the following:

1) the base AMIs we build from are regularly updated by Amazon, so a change there could have caused this
2) something could have gone wrong (and unnoticed) during AMI creation
3) the userdata script we run against the Amazon base images to create the worker type image could be downloading content from a link that now returns different content (e.g. we may take the latest version of a third-party tool which has since been updated)
4) it could be something in the updated generic worker that broke things

In order to see if this is number 4 I'm creating two worker types in parallel:

win2012r2
win2012r2-gw-8-0-1

They should be identical to each other, other than the version of generic worker, which is 8.3.0 on win2012r2 and 8.0.1 on win2012r2-gw-8-0-1.

Once these have deployed, I'll try rerunning a recent NSS build directly from Task Creator against both worker types, to see if win2012r2 exhibits the same problem that NSS had, and win2012r2-gw-8-0-1 doesn't - which would indicate that the problem lies in generic worker 8.3.0.

If both work, I'll put it down to option 2) above, assume it was an intermittent problem that has since disappeared and isn't worth worrying about too much, and then I'll upgrade NSS to 8.3.0 again and run a couple of tests.

For any other combination of failure/success with these two worker types, I'll have to investigate further.

Just to reiterate: NSS are not currently blocked, as I rolled back to an old AMI - but I'd like to get to the bottom of this issue, because at the moment it isn't clear whether we can cleanly upgrade them, or whether a problem will occur that requires a change in their environment (of course we can always monkey-patch).
So in the end I rolled out a couple more fixes, and ended up upgrading win2012r2 to 8.4.1. When rerunning NSS builds, I was able to reproduce the problem:

https://tools.taskcluster.net/task-inspector/#ceAdW_fsQuC46nsbiTkWlQ/0

Yet against 8.0.1 it ran fine:

https://tools.taskcluster.net/task-inspector/#KJ3U6sGxQ3SQd_87yuhxVg/0

This leads me to believe that a generic-worker change is causing the problem. Furthermore, I believe the only change that could be responsible is the one introduced in bug 1343582, namely:

https://github.com/taskcluster/runlib/commit/d5b357573f26f1ed9f2523367533eaea838b78a2

Although several commands function without failure, it seems certain commands consistently fail, so I suspect this could be related to a particular permission that an executable requires, or a feature it uses.
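
To make the suspected change concrete: as far as I can tell, it boils down to pointing the child process at a particular desktop via the STARTUPINFO structure passed to the process-creation call. A minimal Go sketch of that pattern (illustrative only - not the actual runlib code; the command line and desktop name are just examples):

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

func main() {
	cmd := syscall.StringToUTF16Ptr(`c:\mozilla-build\msys\bin\bash.exe -c "echo hello"`)
	// `winsta0\default` is the default interactive desktop; the worker would
	// pass the name of its own, dedicated desktop here instead.
	desktop := syscall.StringToUTF16Ptr(`winsta0\default`)

	si := &syscall.StartupInfo{Desktop: desktop}
	si.Cb = uint32(unsafe.Sizeof(*si))
	pi := &syscall.ProcessInformation{}

	if err := syscall.CreateProcess(nil, cmd, nil, nil, false, 0, nil, nil, si, pi); err != nil {
		fmt.Println("CreateProcess failed:", err)
		return
	}
	defer syscall.CloseHandle(pi.Process)
	defer syscall.CloseHandle(pi.Thread)

	syscall.WaitForSingleObject(pi.Process, syscall.INFINITE)
	var code uint32
	syscall.GetExitCodeProcess(pi.Process, &code)
	fmt.Println("exit code:", code)
}
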
Similarly, in a separate repo, we are seeing failures on 8.4.1 that pass on 8.0.1:

Failing on 8.4.1:
https://tools.taskcluster.net/task-inspector/#D328fToWTKuLC6p1d_4qKg/0

Passing on 8.0.1:
https://tools.taskcluster.net/task-inspector/#KJ3U6sGxQ3SQd_87yuhxVg/0

This is the exact same task, just with a different worker type - and the only difference between the two worker types is the generic worker version.

Therefore I am going to roll back all Windows Server 2012 R2 worker types to 8.0.1, and test in a different worker type.
Component: Integration → Generic-Worker
Summary: nss-win2012r2 AMIs did not work on latest release → generic-worker 8.3.0 to 8.4.1 causing process failures on win2012r2 worker types (runTasksAsCurrentUser: false)
I've added a test for this condition, which successfully reproduces the issue:

https://tools.taskcluster.net/task-group-inspector/#/TePNaU6pQheSDpzQ8_5Rqg/TIJyajJCSPK75QQfDmUIYA?_k=hjcdzg

So when tests pass again, we can be confident we have fixed the issue.
I've rolled back nss-win2012r2 and win2012r2 to generic-worker 8.0.1, which is a known stable version on Windows.

NSS Windows builds now ok on nss-win2012r2 worker type:

  * https://tools.taskcluster.net/task-inspector/#dnIhftvJTaq5j38c3xgksQ

QBRT Windows tests now ok on win2012r2 worker type:

  * https://tools.taskcluster.net/task-inspector/#Xzd8ZcczSvqrhEegRqz7hw

Note, this only affects workers that have `runTasksAsCurrentUser: false` - which in gecko, is currently only:

  * gecko-1-b-win2012
  * gecko-2-b-win2012
  * gecko-3-b-win2012

All of these are currently on version 8.2.0, which seems to be fine.

Therefore there should be no other worker types that need rolling back at this time.

(In reply to Pete Moore [:pmoore][:pete] from comment #1)
> This leads me to believe, this is a generic-worker change causing the
> problem. Furthermore, I believe the only change that could be the problem is
> the change introduced in bug 1343582 which is namely:
> 
> https://github.com/taskcluster/runlib/commit/d5b357573f26f1ed9f2523367533eaea838b78a2

I'm not so sure about this hypothesis now, since Windows 7 exhibited this failure too (see link in comment 4) and this change should only have affected Windows 8 and higher. I will need to investigate further why Windows 7, Windows 10, and Windows Server 2012 R2 have all been affected.
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> (In reply to Pete Moore [:pmoore][:pete] from comment #1)
> > This leads me to believe, this is a generic-worker change causing the
> > problem. Furthermore, I believe the only change that could be the problem is
> > the change introduced in bug 1343582 which is namely:
> > 
> > https://github.com/taskcluster/runlib/commit/d5b357573f26f1ed9f2523367533eaea838b78a2
> 
> I'm not so sure about this hypothesis now, since Windows 7 exhibited this
> failure too (see link in comment 4) and this change should only have
> affected Windows 8 and higher. I will need to investigate further why
> Windows 7,  Windows 10, Windows Server 2012 R2 have all been affected.

So the plot thickens. In my latest experiment I reverted the above commit from the runlib library, and this indeed fixed the issue on Windows 10 and Windows Server 2012 R2; however, the test is still failing on Windows 7:

  * https://tools.taskcluster.net/task-group-inspector/#/YSo1SRj3T_KScxRsTeEWyA/UGjP2HKUQW6i2Z0aaGwsWw?_k=lw0gc0 

The particular Win7 test failure can be seen here:

  * https://public-artifacts.taskcluster.net/fWV0meMlQlGwW3oN8jglpg/0/public/logs/live_backing.log

In case that artifact expires, this is the content:

[taskcluster 2017-05-03T11:22:58.334Z] Uploading file public/logs/live.log as artifact public/logs/live.log
[taskcluster 2017-05-03T11:22:59.171Z] Worker Type (cew6g40dThu-pTUMAi3LKw) settings:
[taskcluster 2017-05-03T11:22:59.171Z]   {
[taskcluster 2017-05-03T11:22:59.171Z]     "aws": {
[taskcluster 2017-05-03T11:22:59.171Z]       "ami-id": "test-ami",
[taskcluster 2017-05-03T11:22:59.171Z]       "availability-zone": "outer-space",
[taskcluster 2017-05-03T11:22:59.171Z]       "instance-id": "test-instance-id",
[taskcluster 2017-05-03T11:22:59.171Z]       "instance-type": "p3.enormous",
[taskcluster 2017-05-03T11:22:59.171Z]       "local-ipv4": "87.65.43.21",
[taskcluster 2017-05-03T11:22:59.171Z]       "public-ipv4": "12.34.56.78"
[taskcluster 2017-05-03T11:22:59.171Z]     },
[taskcluster 2017-05-03T11:22:59.172Z]     "generic-worker": {
[taskcluster 2017-05-03T11:22:59.172Z]       "go-arch": "386",
[taskcluster 2017-05-03T11:22:59.172Z]       "go-os": "windows",
[taskcluster 2017-05-03T11:22:59.172Z]       "go-version": "go1.7.5",
[taskcluster 2017-05-03T11:22:59.172Z]       "release": "test-release-url",
[taskcluster 2017-05-03T11:22:59.172Z]       "version": "8.4.1"
[taskcluster 2017-05-03T11:22:59.172Z]     },
[taskcluster 2017-05-03T11:22:59.172Z]     "machine-setup": {
[taskcluster 2017-05-03T11:22:59.172Z]       "maintainer": "pmoore@mozilla.com",
[taskcluster 2017-05-03T11:22:59.172Z]       "script": "test-script-url"
[taskcluster 2017-05-03T11:22:59.172Z]     }
[taskcluster 2017-05-03T11:22:59.172Z]   }
[taskcluster 2017-05-03T11:22:59.172Z] === Task Starting ===
[taskcluster 2017-05-03T11:22:59.172Z] Executing command 0: c:\mozilla-build\msys\bin\bash.exe -c "echo hello"

C:\Users\task_1493810577>c:\mozilla-build\msys\bin\bash.exe -c "echo hello" 
c:\mozilla-build\msys\bin\bash.exe: *** CreateFileMappingA, Win32 error 0.  Terminating.
[taskcluster 2017-05-03T11:22:59.202Z]   Exit Code: 1
[taskcluster 2017-05-03T11:22:59.202Z]   User Time: 0s
[taskcluster 2017-05-03T11:22:59.202Z] Kernel Time: 0s
[taskcluster 2017-05-03T11:22:59.202Z]   Wall Time: 23ms
[taskcluster 2017-05-03T11:22:59.202Z] Peak Memory: 2273280
[taskcluster 2017-05-03T11:22:59.202Z]      Result: FAILED
[taskcluster 2017-05-03T11:22:59.202Z] === Task Finished ===
[taskcluster 2017-05-03T11:22:59.202Z] Task Duration: 30ms



So it looks like there is now a general problem on all Windows platforms when a different desktop is used in the process-creation (CreateProcessWithLogon / CreateProcessAsUser) syscalls. However, only some commands (including this one) exhibit the failure.

For the time being, all gecko Windows workers either use the CreateProcess system call (Windows 7, Windows 10) or are on an older version of generic-worker that doesn't set the desktop handle during process creation (Windows Server 2012 R2), so they aren't affected. This means screenshots won't work on Windows Server 2012 R2, but that isn't much of a problem, as we only use it for builds in gecko.

The code which creates the desktop (and applies permissions) is here:

  * https://github.com/taskcluster/runlib/blob/0099c90690c9ee45d7851936276fd14c5bda9aa8/platform/extra_windows.go#L25-L86
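
For reference, the desktop-creation step essentially amounts to a CreateDesktopW call followed by some ACL adjustments so that the task user can access the new window station and desktop. A much-simplified Go sketch of just the creation call (illustrative only; the desktop name is made up and the permission handling is omitted):

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

const genericAll = 0x10000000 // GENERIC_ALL access mask

var (
	user32         = syscall.NewLazyDLL("user32.dll")
	createDesktopW = user32.NewProc("CreateDesktopW")
	closeDesktop   = user32.NewProc("CloseDesktop")
)

func main() {
	// Hypothetical desktop name, for illustration only.
	name, _ := syscall.UTF16PtrFromString("GenericWorkerDesktop")

	// HDESK CreateDesktopW(lpszDesktop, lpszDevice, pDevmode, dwFlags,
	//                      dwDesiredAccess, lpsa)
	hDesk, _, callErr := createDesktopW.Call(
		uintptr(unsafe.Pointer(name)), // desktop name
		0, 0, 0,                       // no device, DEVMODE or flags
		uintptr(genericAll),           // desired access
		0,                             // default security attributes
	)
	if hDesk == 0 {
		fmt.Println("CreateDesktopW failed:", callErr)
		return
	}
	defer closeDesktop.Call(hDesk)

	fmt.Println("created desktop, handle:", hDesk)
	// A task process would then be started with StartupInfo.Desktop set to
	// `winsta0\GenericWorkerDesktop` (see the earlier sketch).
}
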
This is now fixed with a major rewrite of the engine: rather than creating processes in the window station associated with the running worker process, we now obtain an access token for the Windows logon session of the interactive user and create processes using it.
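
For the record, the shape of the new approach is roughly as follows (a minimal sketch, assuming WTSQueryUserToken-style token acquisition and the Token field of syscall.SysProcAttr from more recent Go releases; the real implementation differs in the details and handles user profiles, privileges and error cases properly):

package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"unsafe"
)

var (
	kernel32                     = syscall.NewLazyDLL("kernel32.dll")
	wtsapi32                     = syscall.NewLazyDLL("wtsapi32.dll")
	wtsGetActiveConsoleSessionId = kernel32.NewProc("WTSGetActiveConsoleSessionId")
	wtsQueryUserToken            = wtsapi32.NewProc("WTSQueryUserToken")
)

// interactiveUserToken fetches the access token of the user logged on to the
// interactive (console) session. The caller must run as LocalSystem (e.g. as
// a Windows service) and hold the TCB privilege.
func interactiveUserToken() (syscall.Token, error) {
	sessionID, _, _ := wtsGetActiveConsoleSessionId.Call()
	var token syscall.Token
	// BOOL WTSQueryUserToken(ULONG SessionId, PHANDLE phToken)
	ret, _, err := wtsQueryUserToken.Call(sessionID, uintptr(unsafe.Pointer(&token)))
	if ret == 0 {
		return 0, err
	}
	return token, nil
}

func main() {
	token, err := interactiveUserToken()
	if err != nil {
		fmt.Println("could not get interactive user token:", err)
		return
	}
	defer token.Close()

	// Run the task command in the interactive user's logon session; under the
	// hood this results in a CreateProcessAsUser call with the supplied token.
	cmd := exec.Command(`c:\mozilla-build\msys\bin\bash.exe`, "-c", "echo hello")
	cmd.SysProcAttr = &syscall.SysProcAttr{Token: token}
	out, err := cmd.CombinedOutput()
	fmt.Printf("output: %q, err: %v\n", out, err)
}
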
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers