Closed Bug 1439588 Opened 6 years ago Closed 6 years ago

Add feature to support running Windows tasks in an elevated process (Administrator)

Categories

(Taskcluster :: Workers, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED
mozilla61

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(4 files)

Support for adding a task user to a system group was added with the osGroups feature of generic-worker - however, there is currently no mechanism to escape UAC constraints and run elevated (as Administrator) processes.

This page on UAC explains the issue: https://msdn.microsoft.com/en-us/library/aa826699(v=vs.85).aspx

> User Account Control
>
> Under UAC, accounts in the local Administrators group have two access tokens, one with
> standard user privileges and one with administrator privileges. Because of UAC access
> token filtering, a script is normally run under the standard user token, unless it is
> run "as an Administrator" in elevated privilege mode.

The token we intercept from the interactive logon session seems to be the filtered token, without admin permissions. We can see this by executing a task which requires administrator rights, setting osGroups to ["Administrators"] and seeing that it fails with an access denied error.

The code where we obtain the access token is here:

https://github.com/taskcluster/runlib/blob/7fa257e0aaed6b00edeea02f62c38e67ff4b4924/win32/extra_windows.go#L1078-L1093

From reading https://blogs.msdn.microsoft.com/winsdk/2013/03/22/how-to-launch-a-process-as-a-full-administrator-when-uac-is-enabled/ it looks like we should be able to obtain the unfiltered access token from the filtered access token:

> You can programmatically retrieve the Administrator Token via the GetTokenInformation()
> call and specifying TokenLinkedToken as the token information class.  This operation
> requires the caller to have the SeTcbPrivilege so not any user can make this call.

Since we run generic-worker as LocalSystem, we already have the SeTcbPrivilege, so it looks like we should be able to make a call to GetTokenInformation() passing in the token we got from WTSQueryUserToken, specifying TokenLinkedToken as the token information class, and that we'll get the fully scoped unfiltered access token back for running fully elevated processes.

I would propose we add this as an explicit opt-in feature, e.g. requiring

> "features": {
>   "runAsAdministrator": true
> }

This could be protected behind scope "generic-worker:runAsAdministrator:<provisionerId>/<workerType>". 

Normally to add a task user to Administrators group, you would require "generic-worker:os-group:Administrators", and to add "osGroups": ["Administrators"] to the payload. Maybe we should keep this requirement, and return a malformed-payload exception if the runAsAdministrator feature is enabled, without the osGroups containing "Administrators".
QA Contact: pmoore
Assignee: nobody → pmoore
A safer way to implement this feature is to restrict which commands can be run as administrator.

My new proposal is to create a program/script which makes an HTTP POST to http://localhost:<some port>/run-as-administrator (or similar) with a request body containing a command to run as administrator.

This script would be similar conceptually to runas/sudo/su/... in that it would allow you to run a command as administrator (elevated).

The worker would listen on the given port for requests, and would use the mechanics described in comment 0 to obtain an auth token to execute the process with the elevated UAC rights.

The commands that can be called would be restricted to known "safe" commands, and configured on the worker, not editable by the task user.

To invoke the script with an allowed command, we have two options I can think of:

1) The script is called e.g. run-as-administrator, and you prefix the command and arguments you want to run by "run-as-administrator" (much like you would prefix a command with "sudo" on other systems). In this case we would need a configuration file to list which commands are whitelisted (like /etc/sudoers).

2) We could create a shortcut to the script for each command we allow, and put it in the PATH - so you still run e.g. xperf, but xperf is just a shortcut/symbolic link to the wrapper script described here. In this case, we probably would still want a configuration file that contains a whitelist of commands, and the script would need to check the command is listed in the whitelist, since otherwise someone could just create a symbolic link of their own with the name of an evil command they want to run.
(In reply to Pete Moore [:pmoore][:pete] from comment #1)
> 1) The script is called e.g. run-as-administrator, and you prefix the
> command and arguments you want to run by "run-as-administrator" (much like
> you would prefix a command with "sudo" on other systems). In this case we
> would need a configuration file to list which commands are whitelisted (like
> /etc/sudoers).
> 
> 2) We could create a shortcut to the script for each command we allow, and
> put it in the PATH - so you still run e.g. xperf, but xperf is just a
> shortcut/symbolic link to the wrapper script described here. In this case,
> we probably would still want a configuration file that contains a whitelist
> of commands, and the script would need to check the command is listed in the
> whitelist, since otherwise someone could just create a symbolic link of
> their own with the name of an evil command they want to run.

I wrote that too quickly!

Clearly the worker needs to validate that the command is allowed, not the script, so that validation would occur in the worker, regardless of how the script was invoked. So in any case we'd need a whitelist of allowed commands that the worker can interrogate, and trust the validity of.
:pmoore, this bug has been idle- I admit I got busy and forgot about it, I assume you did too.  Can we get this fixed soon?
Flags: needinfo?(pmoore)
:jmaher, apologies; indeed this bug has been sat idle for a couple of weeks. I'll be returning to it in the next few days...
Flags: needinfo?(pmoore)
:pmoore, can you provide an update here?  We are 5 weeks and counting without data from the xperf task
Flags: needinfo?(pmoore)
Hi Joel,

Apologies for delay. I'm starting on this today, and will post progress updates...
Flags: needinfo?(pmoore)
Hi Joel,

I'm afraid I haven't managed to make progress on this, and I am going to be on PTO next week for one week.

I think the best thing for now, if this is urgently needed, is to create a new worker type just for these talos tasks, that use the old generic-worker. Note the win7 manifest has two discrete commits that updated from generic-worker 8 to generic-worker 10:

https://github.com/mozilla-releng/OpenCloudConfig/commit/49c903137670d6afe5262c2c0ec386070db44654
https://github.com/mozilla-releng/OpenCloudConfig/commit/902d24ab34314b314ec98a18512cfc12ed4fcfa0

Maybe you could create a worker type like talos-win7 that was based on gecko-t-win7-32 but with these two commits reverted, and hopefully that should just work. This would be a stop gap until this feature is implemented.

Another option would be to use taskcluster-worker, but I suspect that is a much bigger project than just rolling out a custom worker type using an old version of our win7 manifest which had the older version of generic-worker running on it.

Otherwise, I'll be back in a week, and this bug has pretty much made it to the top of my priority queue (we had a bunch of tree closures to take care of which consumed most of my attention).

Thanks, and good luck!
Pete, I keep pestering you on this, please make this a P1 priority to fix ASAP.
Flags: needinfo?(pmoore)
Flags: needinfo?(pmoore)
Priority: -- → P1
Started work on this - not ready yet.
(In reply to Pete Moore [:pmoore][:pete] from comment #0)
> Support for adding a task user to a system group was added with the osGroups
> feature of generic-worker - however, there is currently no mechanism to
> escape UAC constraints and run elevated (as Administrator) processes.
> 
> This page on UAC explains the issue:
> https://msdn.microsoft.com/en-us/library/aa826699(v=vs.85).aspx
> 
> > User Account Control
> >
> > Under UAC, accounts in the local Administrators group have two access tokens, one with
> > standard user privileges and one with administrator privileges. Because of UAC access
> > token filtering, a script is normally run under the standard user token, unless it is
> > run "as an Administrator" in elevated privilege mode.
> 
> The token we intercept from the interactive logon session seems to be the
> filtered token, without admin permissions. We can see this by executing a
> task which requires administrator rights, setting osGroups to
> ["Administrators"] and seeing that it fails with an access denied error.
> 
> The code where we obtain the access token is here:
> 
> https://github.com/taskcluster/runlib/blob/
> 7fa257e0aaed6b00edeea02f62c38e67ff4b4924/win32/extra_windows.go#L1078-L1093
> 
> From reading
> https://blogs.msdn.microsoft.com/winsdk/2013/03/22/how-to-launch-a-process-
> as-a-full-administrator-when-uac-is-enabled/ it looks like we should be able
> to obtain the unfiltered access token from the filtered access token:
> 
> > You can programmatically retrieve the Administrator Token via the GetTokenInformation()
> > call and specifying TokenLinkedToken as the token information class.  This operation
> > requires the caller to have the SeTcbPrivilege so not any user can make this call.
> 
> Since we run generic-worker as LocalSystem, we already have the
> SeTcbPrivilege, so it looks like we should be able to make a call to
> GetTokenInformation() passing in the token we got from WTSQueryUserToken,
> specifying TokenLinkedToken as the token information class, and that we'll
> get the fully scoped unfiltered access token back for running fully elevated
> processes.

This part (retrieval of elevated token from filtered token) has now been implemented in https://github.com/taskcluster/generic-worker/pull/98/commits/d2c8c34c2b1d016a4b8df96d332ff8bb7134322c

I've managed to test this method works successfully using the following gist:
https://gist.github.com/petemoore/8e43861943d175076e4b365039c8d5cd

So the remaining parts are those described in comment 1 and comment 2.
(In reply to Pete Moore [:pmoore][:pete] from comment #10)

> So the remaining parts are those described in comment 1 and comment 2.

I've gone for a simpler solution initially, in order that we can unblock the talos xperf tests.

The solution is to add `runTaskAsAdministrator` as a feature flag, and if this is set, all commands will be run with UAC process elevation.

The features is guarded by scopes, which we will need to grant in a few places in order that these test tasks get them..

I'll follow up in a separate bug for the full sudo-like solution.

I'm just working through the final test-fix iterations, but will be ready to request a code review shortly.
!! Houston, we have a (chicken and egg) problem... !!

One of the new unit tests is failing, and I think I might know what the cause is...

1) a new task user is created
2) the worker sets the autologon registry entries to the new username/password
3) the worker reboots the machine
4) the worker windows service starts up
5) the worker waits for the interactive logon, and captures the active console session id
6) the worker gets the auth token of the logon session from the session id
7) the worker claims a task
8) the worker discovers the task runs as administrator, so places the user in the Administrators group
9) the worker discovers the task should run with UAC elevation, so queries the elevated auth token from the regular auth token from step 6 (as per comment 0)
10) the call fails since when the auth token was created, it wasn't associated to an elevated token, as at that time, the user wasn't in the Administrators group


The failure we are getting is:

Making system call GetTokenInformation with args: [884 13 1169B888 4 119C408C]
  Result: 0 520 A specified logon session does not exist. It may already have been terminated.

We can see that the arguments passed are correct, since:

  * 0x884 is the same handle that is used in successful calls to e.g. GetUserProfileDirectoryW
  * 0x13 is correct for TokenLinkedToken
  * 0x4 is correct size for TOKEN_LINKED_TOKEN

The other two values are pointers to memory to receive values, so would not be the cause of the error.


Possible solutions

1) An extra reboot

This is kind of complex. At this point we are already reclaiming the task. If we reboot, we have to be sure that we don't lose the claim on the task during the reboot, and do quite some work to enable the worker to reclaim a task it already had a claim on after it gets rebooted. I don't like this idea. It feels dirty and fragile.

2) Dedicated worker types for UAC privileged tasks

This means more worker types to manage, but other than that, is pretty simple to implement.

3) Adding Administrator group proactively, and then removing it if not required

At first I thought this would not be good, because then the UAC elevated token would still be valid, even if the group was removed, but then I realised the worker would never need to hand it out. So this could be a possibility. It feels a little dangerous, as if something goes wrong removing the Administrative privileges of the user, it could still have them. Never granting Administrative privileges from the outset is a lot safer in this regard. This could be mitigated by putting some checks in place that Administrative privileges really are removed, but I'm not 100% comfortable with this as it opens up possible avenues of attack if the token can be intercepted before the task run somehow.

4) Calling LogonUser to create a new logon session and getting an elevated token from there

This probably would restrict UI capabilities for the commands executed as Administrator - this could be a problem for tests. High risk the token just wouldn't be suitable.

5) Find some additional / alternative syscalls to refresh the token

For example, maybe we can fetch a new auth token from the existing session id (perhaps even a new call to WTSQueryUserToken after adding user to local Administrators group will return a new auth token).

6) Do some more debugging and discover that the problem isn't this at all

The above suggestions were all based on the assumption that the correct diagnosis of the problem has been made. We can do a bit more digging to be sure this really is the case, just in case I've misdiagnosed the problem here, and the originally linked token can be used after all, if we just are doing something wrong when fetching it.
Option 7) Disable UAC

This solves the problem, but it does inhibit our ability to implement a sudo-like feature to whitelist specific commands to run with Administrator privileges. With UAC disabled, once you have Administrator rights, you have them for every command you run, and we can't control or limit that.
I've experimented a little on Windows 7, by doing the following

i) logging into an admin user, and creating a new (non-admin) test user
ii) logging into that new test user (via "Switch user")
iii) switching back to the admin user, and making the new non-admin user an admin
iv) switching back to the new test user, and seeing if I can run as admin, without needing to log out and back in again

I was able to run as admin, but by virtue of being presented the possibility to reauthenticate as any admin user on the machine. This is a different mechanism to grabbing the linked auth token from the primary auth token of the active logon session. So this doesn't tell me a lot.

So my plan to proceed is as follows:

Step 1) Set windows worker pool size to 1 for generic-worker CI worker types, and set a long idle time so they don't die for several hours
Step 2) Submit jobs to trigger provisioner to spawn an instance of each worker type
Step 3) Log onto each of the three workers manually (win7, win10, win2012), add the CI task test user to the Administrators group, and reboot the machine
Step 4) Resubmit the latest CI jobs, to see if the failing test (TestRunAsAdministratorEnabled) now passes.

This is a trick that takes advantage of the fact that tasks are run in the context of the task user created by the CI, rather than a user created by the generic-worker which is being tested in the CI (in the unit tests, a stable version of generic-worker is running the CI, in other words, generic-worker is running generic-worker; the "parent" generic-worker is the one that creates the task users, since the "child" generic-worker mustn't reboot the machine between tests).

After I've done this, I will know if I can rule out option 6 from comment 12 or not, since if these actions now cause the test to pass, the only difference is that the task user was already an admin when the logon session occurred, so must indeed must be the cause.
(In reply to Pete Moore [:pmoore][:pete] from comment #14)

> So my plan to proceed is as follows:
> 
> Step 1) Set windows worker pool size to 1 for generic-worker CI worker
> types, and set a long idle time so they don't die for several hours
> Step 2) Submit jobs to trigger provisioner to spawn an instance of each
> worker type
> Step 3) Log onto each of the three workers manually (win7, win10, win2012),
> add the CI task test user to the Administrators group, and reboot the machine
> Step 4) Resubmit the latest CI jobs, to see if the failing test
> (TestRunAsAdministratorEnabled) now passes.
> 
> This is a trick that takes advantage of the fact that tasks are run in the
> context of the task user created by the CI, rather than a user created by
> the generic-worker which is being tested in the CI (in the unit tests, a
> stable version of generic-worker is running the CI, in other words,
> generic-worker is running generic-worker; the "parent" generic-worker is the
> one that creates the task users, since the "child" generic-worker mustn't
> reboot the machine between tests).
> 
> After I've done this, I will know if I can rule out option 6 from comment 12
> or not, since if these actions now cause the test to pass, the only
> difference is that the task user was already an admin when the logon session
> occurred, so must indeed must be the cause.

Today has been enlightening!

I wasn't able to get TestRunAsAdministratorEnabled to pass this way, and after a *lot* of digging, I finally discovered that UAC is disabled on at least Windows 7 (which is the platform I have been concentrating on). I suspect it is also disabled on Windows 10 and Windows Server 2012 R2.

However, this was quite unexpected, since if UAC is disabled, I would not expect another of our unit tests to pass (TestChainOfTrustWithAdministratorPrivs), that relies on UAC being enabled in order to pass. This particular CI test places the task user in the Administrators group, and then tries to read the chain of trust private key, and requires that the task fails. Without UAC I'm not sure why that test passes. So I am investigating!

What I have discovered so far is that when I create a task that adds the task user to the Administrators group, I can see from the worker logs, that it is successful:



i-0165679b0d2fd6723.gecko-t-win7-32.use1.mozilla.com generic-worker: 2018/06/28 16:32:54 Running command: 'net' 'localgroup' 'Administrators' '/add' 'task_1530202904' 
i-0165679b0d2fd6723.gecko-t-win7-32.use1.mozilla.com generic-worker: The command completed successfully. 

However looking at the task log[1] I see that the Administrators group isn't listed as a group of the user:

[taskcluster 2018-06-28T16:32:54.315Z] Executing command 0: whoami /groups

Z:\task_1530202904>whoami /groups 

GROUP INFORMATION
-----------------

Group Name                             Type             SID          Attributes                                        
====================================== ================ ============ ==================================================
Everyone                               Well-known group S-1-1-0      Mandatory group, Enabled by default, Enabled group
BUILTIN\Remote Desktop Users           Alias            S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
BUILTIN\Users                          Alias            S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
NT AUTHORITY\INTERACTIVE               Well-known group S-1-5-4      Mandatory group, Enabled by default, Enabled group
CONSOLE LOGON                          Well-known group S-1-2-1      Mandatory group, Enabled by default, Enabled group
NT AUTHORITY\Authenticated Users       Well-known group S-1-5-11     Mandatory group, Enabled by default, Enabled group
NT AUTHORITY\This Organization         Well-known group S-1-5-15     Mandatory group, Enabled by default, Enabled group
LOCAL                                  Well-known group S-1-2-0      Mandatory group, Enabled by default, Enabled group
NT AUTHORITY\NTLM Authentication       Well-known group S-1-5-64-10  Mandatory group, Enabled by default, Enabled group
Mandatory Label\Medium Mandatory Level Label            S-1-16-8192  Mandatory group, Enabled by default, Enabled group


So my new working hypothesis is that the osGroup membership is stored statically in the auth token (which we fetch in step 6 of comment 12), so that as far as the processes executed by the task are concerned, any added group memberships do not apply.

I will have to test this theory of course, but based on the two output excerpts above, it is hard to see another possible explanation.

What I hope is that I can fetch a new auth token after applying group memberships, that will reflect the group membership changes.

I will also possibly need to add some code to check with UAC is enabled or not, before deciding to fetch the UAC linked token.

I've also discovered today that being UAC elevated doesn't necessarily mean running as Administrator, although I'm not too clear on the details at the moment. The difference may relate to whether UAC is enabled or not (because a process can be elevated, even if UAC is disabled, in which case it is elevated, but not running as Administrator - this is my guess).

--
[1] https://tools.taskcluster.net/groups/Tssdtir7QHS28fhwPP8fGA/tasks/Tssdtir7QHS28fhwPP8fGA/runs/0
(In reply to Pete Moore [:pmoore][:pete] from comment #15)

> I've also discovered today that being UAC elevated doesn't necessarily mean
> running as Administrator, although I'm not too clear on the details at the
> moment. The difference may relate to whether UAC is enabled or not (because
> a process can be elevated, even if UAC is disabled, in which case it is
> elevated, but not running as Administrator - this is my guess).

Context: https://code.msdn.microsoft.com/windowsapps/CppUACSelfElevation-5bfc52dd
So this is really strange now.

I have a loaner for a task with admin privileges ("osGroups": ["Administrator"]):

  https://tools.taskcluster.net/groups/VR9efY32RSCDvH52lGX3NA/tasks/VR9efY32RSCDvH52lGX3NA/runs/0/artifacts


I open up a new cmd.exe shell, and check that the Administrators group contains my task user, i.e. that the worker was able to add the task user to the group:

> C:\Users\task_1530205824>net localgroup Administrators
> Alias name     Administrators
> Comment
> 
> Members
> 
> -------------------------------------------------------------------------------
> root
> task_1530205824
> The command completed successfully.

Indeed, we see my task user in the list.

Next I prove that I am that task user:

> C:\Users\task_1530205824>whoami
> i-080461f667b7d\task_1530205824

Looking good so far! So then surely, if I show my group memberships, Administrators group will appear, right? Wrong!

> C:\Users\task_1530205824>whoami /groups
> 
> GROUP INFORMATION
> -----------------
> 
> Group Name                             Type             SID          Attributes
> ====================================== ================ ============ ==================================================
> Everyone                               Well-known group S-1-1-0      Mandatory group, Enabled by default, Enabled group
> BUILTIN\Remote Desktop Users           Alias            S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
> BUILTIN\Users                          Alias            S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
> NT AUTHORITY\INTERACTIVE               Well-known group S-1-5-4      Mandatory group, Enabled by default, Enabled group
> CONSOLE LOGON                          Well-known group S-1-2-1      Mandatory group, Enabled by default, Enabled group
> NT AUTHORITY\Authenticated Users       Well-known group S-1-5-11     Mandatory group, Enabled by default, Enabled group
> NT AUTHORITY\This Organization         Well-known group S-1-5-15     Mandatory group, Enabled by default, Enabled group
> LOCAL                                  Well-known group S-1-2-0      Mandatory group, Enabled by default, Enabled group
> NT AUTHORITY\NTLM Authentication       Well-known group S-1-5-64-10  Mandatory group, Enabled by default, Enabled group
> Mandatory Label\Medium Mandatory Level Label            S-1-16-8192  Mandatory group, Enabled by default, Enabled group
> 
> C:\Users\task_1530205824>


The reason I find this very strange, is that with the loaner procedure, the "stolen" auth token that the worker uses isn't involved, I am simply RDP'd into the interactive desktop.

So I am baffled why the command shell that was created *after* the group membership was applied, shows that the Administrators group contains my user, shows that I am the user I think I am, but then does not list the Administrators group in my list of groups, and shows "Mandatory Label\Medium Mandatory Level" (S-1-16-8192) rather than "Mandatory Label\High Mandatory Level" (S-1-16-12288).
So I think the problem is that the group membership of the logged in user is calculated at logon, and isn't refreshed during the session. Even though the user can see that he/she is added to a group by querying the group, as far as Windows is concerned, those groups don't apply until he/she logs in and gets a new auth token.

So this explains why TestRunAsAdministratorEnabled was passing in the CI - it wasn't because UAC was preventing the chain of trust certificate from being read, it was because user didn't have Administrator group permissions, despite successfully being added to the group.

This also explains why the attempt to fetch the linked token wasn't successful, since the primary auth token would not be a split logon token if the user was not seen to be in the Administrators group.

And finally, the last part of the puzzle. Why did TestRunAsAdministratorEnabled not start passing when manually putting the user in the Administrators group, and then rebooting? I guess because UAC was disabled, so there was no split token.


So that explains what has been going on, and after a full day of digging, I think I just about got to the bottom of it.

However, the questions now are:

1) How to solve the problem with needing to logout/login to pick up *local group membership changes*
2) Should we support both UAC-enabled and UAC-disabled environments in the worker, or just one of them?

And finally I'd like to produce some examples of making some manual alterations to an environment to prove the conclusion is correct - i.e. to get RunAsAdministrator to work. Doing this means we can be sure when we invest the effort to solve 1) that it will work, and we won't get a nasty surprise that it still doesn't solve the problem.

I'm going to sleep on all this and come back with a fresh head tomorrow...
Looking at AdjustTokenGroups it looks like I can only enable/disable groups already associated to the token, rather than dynamically add new groups to the existing auth token:

https://msdn.microsoft.com/en-us/library/Aa375199(v=VS.85).aspx

This means I'll need to create a new auth token to get new groups, which probably means a new logon.

It may be possible to produce a fully enabled access token with LogonUser - but it may be that doing this I'll get something back that I can't augment to contain everything that was in the token I already had. I will investigate! If this does turn out to be possible, and I can find some utilities that allow me to easily inspect all the data attached to a given auth token, then this would hopefully avoid the need to log out and log back in again.

Otherwise we're just left with two options as far as I can see:
1) dedicated worker types for admin task users
2) an extra reboot and the risk of not reclaiming a task in time if the reboot takes too long. note, claims last 20 minutes, so we should never require 20 mins downtime for a reboot, but it is a risk worth highlighting (e.g. some kind of disk frag maintenance tool kicking in on a reboot, etc)
(In reply to Pete Moore [:pmoore][:pete] from comment #19)
> Looking at AdjustTokenGroups it looks like I can only enable/disable groups
> already associated to the token, rather than dynamically add new groups to
> the existing auth token:
> 
> https://msdn.microsoft.com/en-us/library/Aa375199(v=VS.85).aspx
> 
> This means I'll need to create a new auth token to get new groups, which
> probably means a new logon.
> 
> It may be possible to produce a fully enabled access token with LogonUser -
> but it may be that doing this I'll get something back that I can't augment
> to contain everything that was in the token I already had. I will
> investigate! If this does turn out to be possible, and I can find some
> utilities that allow me to easily inspect all the data attached to a given
> auth token, then this would hopefully avoid the need to log out and log back
> in again.

I found a rather basic tool (https://docs.microsoft.com/en-us/sysinternals/downloads/accesschk) that gave me some data back, but not as much as I'd like. I ran a few tasks that dump output, but the information was pretty basic. See e.g. https://taskcluster-artifacts.net/DgIlvOmmQ0a3KTAvAGsY1g/0/public/logs/live_backing.log that I ran on gecko-t-win7-32:

Z:\task_1530701859>bin\accesschk.exe -accepteula -p -f explorer.exe 

Accesschk v6.12 - Reports effective permissions for securable objects
Copyright (C) 2006-2017 Mark Russinovich
Sysinternals - www.sysinternals.com

[1244] explorer.exe
  RW I-0900BE1836D97\task_1530701859
  RW NT AUTHORITY\SYSTEM

  Token security:
  RW I-0900BE1836D97\task_1530701859
  RW NT AUTHORITY\SYSTEM
  RW I-0900BE1836D97\task_1530701859-S-1-5-5-0-68062
  R  BUILTIN\Administrators

  Token contents:
    User:
      I-0900BE1836D97\task_1530701859
    Groups:
      I-0900BE1836D97\None                             MANDATORY
      Everyone                                         MANDATORY
      BUILTIN\Remote Desktop Users                     MANDATORY
      BUILTIN\Users                                    MANDATORY
      NT AUTHORITY\INTERACTIVE                         MANDATORY
      CONSOLE LOGON                                    MANDATORY
      NT AUTHORITY\Authenticated Users                 MANDATORY
      NT AUTHORITY\This Organization                   MANDATORY
      I-0900BE1836D97\task_1530701859-S-1-5-5-0-68062  LOGONID,MANDATORY
      LOCAL                                            MANDATORY
      NT AUTHORITY\NTLM Authentication                 MANDATORY
      Mandatory Label\Medium Mandatory Level           INTEGRITY
    Privileges:
      I-0900BE1836D97\task_1530701859                  DISABLED
      I-0900BE1836D97\task_1530701859                  ENABLED
      I-0900BE1836D97\task_1530701859                  DISABLED
      I-0900BE1836D97\task_1530701859                  DISABLED
      I-0900BE1836D97\task_1530701859                  DISABLED
      I-0900BE1836D97\task_1530701859                  DISABLED

Whats more, it looks rather buggy, since the listed privileges are not privileges at all, but the (same) local username (which is listed 6 times).

However, I realised that the win32 LogonUser / LoadUserProfile approach could still work, since the winlogon session has already created securable objects (desktop, windows station, etc) which the user should be authorised to modify according to the user ACLs - therefore it seems reasonable that a new primary access token would have permissions to interact with these objects in the same way that the access token derived from the user of the interactive desktop session has. However, from reading MSDN win32 API docs, I'm guessing we might need to directly set the TokenSessionId of the primary access token, which can be done via SetTokenInformation.

I've made changes to the osGroups feature in generic-worker to retrieve a new access token after adding user to task groups, and to refresh the task commands to use this new access token. I've pushed this in the PR (see bug attachment) and will see if this gets us further.

Note, I think we can (and should) enable UAC on our workers if this all works, so that we can control access to privileged functions via taskcluster auth controls and worker features. UAC is really a helpful tool for us in this regard, and enables us to implement the win-sudo feature discussed in this bug, which I'm not yet rolling out for this first pass.

> Otherwise we're just left with two options as far as I can see:
> 1) dedicated worker types for admin task users
> 2) an extra reboot and the risk of not reclaiming a task in time if the
> reboot takes too long. note, claims last 20 minutes, so we should never
> require 20 mins downtime for a reboot, but it is a risk worth highlighting
> (e.g. some kind of disk frag maintenance tool kicking in on a reboot, etc)

Assuming this all works out, these two options are no longer needed.
(In reply to Pete Moore [:pmoore][:pete] from comment #20)
> I've made changes to the osGroups feature in generic-worker to retrieve a
> new access token after adding user to task groups, and to refresh the task
> commands to use this new access token. I've pushed this in the PR (see bug
> attachment) and will see if this gets us further.
> 
> Note, I think we can (and should) enable UAC on our workers if this all
> works, so that we can control access to privileged functions via taskcluster
> auth controls and worker features. UAC is really a helpful tool for us in
> this regard, and enables us to implement the win-sudo feature discussed in
> this bug, which I'm not yet rolling out for this first pass.

We now have the following failures, which I expect are mostly since UAC is disabled on our workers, but I am now investigating...


Windows 7
=========

--- FAIL: TestRunAsAdministratorEnabled (5.22s)
	helper_test.go:210: Scheduled task IUbxGvTqSBCOqD2sEYrX8g
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task IUbxGvTqSBCOqD2sEYrX8g to resolve as 'completed/completed' but resolved as 'exception/malformed-payload'

--- FAIL: TestRunAsAdministratorDisabled (7.04s)
	helper_test.go:210: Scheduled task cR9A0cW_RgSCf6O2T9pngg
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task cR9A0cW_RgSCf6O2T9pngg to resolve as 'failed/failed' but resolved as 'completed/completed'

--- FAIL: TestChainOfTrustWithoutRunAsAdministrator (5.17s)
	helper_test.go:210: Scheduled task EKgPoHW-SYOfjah-nYCHYQ
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task EKgPoHW-SYOfjah-nYCHYQ to resolve as 'failed/failed' but resolved as 'exception/malformed-payload'


Windows 10
==========

--- FAIL: TestRunAsAdministratorEnabled (5.05s)
	helper_test.go:210: Scheduled task TdxMgkjWTEuZCLpcl25fZA
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task TdxMgkjWTEuZCLpcl25fZA to resolve as 'completed/completed' but resolved as 'exception/malformed-payload'

--- FAIL: TestRunAsAdministratorMissingOSGroup (5.34s)
	helper_test.go:210: Scheduled task Tsn7Kx_FQfCl4xVAmlrQpg
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task Tsn7Kx_FQfCl4xVAmlrQpg to resolve as 'exception/malformed-payload' but resolved as 'completed/completed'

--- FAIL: TestChainOfTrustWithoutRunAsAdministrator (5.26s)
	helper_test.go:210: Scheduled task MvVar3RgSI6IgQBiLlWECg
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task MvVar3RgSI6IgQBiLlWECg to resolve as 'failed/failed' but resolved as 'exception/malformed-payload'


Windows Server 2012 R2
======================

--- FAIL: TestRunAsAdministratorEnabled (5.20s)
	helper_test.go:210: Scheduled task fZ4JYPwKQ_G8WZsPfTdkuA
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task fZ4JYPwKQ_G8WZsPfTdkuA to resolve as 'completed/completed' but resolved as 'exception/malformed-payload'

--- FAIL: TestRunAsAdministratorMissingOSGroup (5.20s)
	helper_test.go:210: Scheduled task YqnXrmaTR8iO8Wp63NsaSA
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task YqnXrmaTR8iO8Wp63NsaSA to resolve as 'exception/malformed-payload' but resolved as 'completed/completed'

--- FAIL: TestChainOfTrustWithoutRunAsAdministrator (5.19s)
	helper_test.go:210: Scheduled task QVtxYhSGRMmiHrbinkyjvg
	helper_test.go:225: Worker exited with exit code 0 as required.
	helper_test.go:370: Expected task QVtxYhSGRMmiHrbinkyjvg to resolve as 'failed/failed' but resolved as 'exception/malformed-payload'
I checked our Windows 7 image, and it looks like Windows Security Center Service won't start, which explains why UAC isn't enabled.

Rather than trying to fix this, since we have UAC disabled globally across all our workers, I have for now commented out the code that does the token swap. In a separate effort, we can try to detect if UAC is required or not to run as administrator, and only fetch the linked token if it is required.

There are several registry keys under:

  HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System

which affect the status of UAC, it isn't a binary affair. Different internet sources suggest different combinations of those registry subkeys are relevant, and I'm having a tough time getting an authoritative answer to which is correct.

Since UAC seems to only cover part of the Windows Security Center subsystem, I suspect even if I was to obtain the correct registry settings for enabling UAC appropriately, the since Windows Security Center Service doesn't run, it may not resolve the issue.

In the meantime, I've been making more pushes and fixing issues one by one. Hopefully close to the end now.
It turns out we have UAC disabled on Windows 7 and enabled on Windows 10 and Windows Server 2012 R2. With more investigation I was able to determine how to reason from Local Policy whether UAC is enabled or not globally, and have therefore been able to handle both cases of it being enabled or disabled on the system.

In the case that it is disabled, yet "runAsAdministrator" is included in the task payload, the worker now returns a MalformedPayloadError explaining that this property can only be used on worker types that have UAC enabled.

I've written a bunch of tests, and the code is passing on all platforms. I've put it up for review, and :jhford is taking a look at it.

I've squashed the commits into a much smaller set, but as a consequence some of the commits (especially the last) are rather large. It didn't feel worthwhile further splitting the commits at this stage, but it should be noted the PR now contains a considerable amount of change.

In the process of writing this feature I discovered a couple of unrelated bugs which I have fixed at the same time, and added tests to catch the issues if they appear again as regressions.
Attachment #8984120 - Flags: review?(jhford)
Comment on attachment 8984120 [details] [review]
Github Pull Request for generic-worker

r+ yesterday in PR.
Attachment #8984120 - Flags: review?(jhford) → review+
Checking on this work, are we ready to land?
Hi Jim,

We're in the process of releasing the worker, and testing it out on try. If that all goes well, the next step is adjusting roles to grant permissions where required for the feature to be enabled for the talos jobs, and then updating gecko to generate task definitions for xperf tasks to enable the feature. So we're approaching the finishing line. :)
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/615ab6213436ca093e75f02488eb2b12c023fa92
Bug 1439588 - Support running task commands with UAC 'Run As Administrator' on Windows

https://github.com/taskcluster/generic-worker/commit/fe7c8f8ec63bd65f8160180135bec66b0658eaf9
Merge pull request #98 from taskcluster/bug1439588

Bug 1439588 - support running whitelisted commands as Administrator
I've hit an unwelcome issue when testing on try, related to tests that require desktop resizing and/or programmatically moving the mouse pointer.

I've found the source of the problem, which seems to be that the auth token we get back from LogonUser is restricted in ways that the token we get back from WTSQueryUserToken isn't.

I've implemented the GetTokenInformation and SetTokenInformation calls in order that I can audit the various Token Information Class data of the respective tokens, to see if I can find the anomaly. I'm hoping the difference is in one of the Token Information data structures, rather than any particular securable objects (like the desktop) being explicitly associated to an access token rather than to a user - inspecting and adapting ACLs I suspect to be more challenging than setting token information.

I'm explicitly setting the TokenSessionId, so I'm pretty sure it isn't that, but I think with about a day (or two) of work I should be able to implement most (if not all) of the Token Information Class data structure representations in go, and thus generate a report to audit the differences between the two tokens, and I'm hoping that will shed light on the required change.

I'm actually on sick leave at the moment, so this may take a few days longer due to that.

Will keep the bug posted with updates.
I've written some code to query the token information, and found one difference so far that may be significant:

Original token:

2018/07/20 19:24:04 ==================================================
2018/07/20 19:24:04 Token Primary Group (S-1-5-21-458149028-3186198778-2564299442-513): None/I-0B0C6521B1950 (0X2)
2018/07/20 19:24:04 Token User (S-1-5-21-458149028-3186198778-2564299442-1010): task_1532113032/I-0B0C6521B1950 (0X1) - with attributes: 0X0
2018/07/20 19:24:04 Token Session ID: 0X1
2018/07/20 19:24:04 Token UI Access: 0X0
2018/07/20 19:24:04 Token Group (S-1-5-21-458149028-3186198778-2564299442-513): None/I-0B0C6521B1950 (0X2) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-1-0): Everyone/ (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-32-555): Remote Desktop Users/BUILTIN (0X4) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-32-545): Users/BUILTIN (0X4) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-4): INTERACTIVE/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-2-1): CONSOLE LOGON/ (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-11): Authenticated Users/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-15): This Organization/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-113): Local account/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-5-0-134062): LogonSessionId_0_134062/NT AUTHORITY (0XB) - with attributes: 0XC0000007
2018/07/20 19:24:04 Token Group (S-1-2-0): LOCAL/ (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-64-10): NTLM Authentication/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-16-8192): Medium Mandatory Level/Mandatory Label (0XA) - with attributes: 0X60
2018/07/20 19:24:04 ==================================================


Refreshed token:

2018/07/20 19:24:04 ==================================================
2018/07/20 19:24:04 Token Primary Group (S-1-5-21-458149028-3186198778-2564299442-513): None/I-0B0C6521B1950 (0X2)
2018/07/20 19:24:04 Token User (S-1-5-21-458149028-3186198778-2564299442-1010): task_1532113032/I-0B0C6521B1950 (0X1) - with attributes: 0X0
2018/07/20 19:24:04 Token Session ID: 0X1
2018/07/20 19:24:04 Token UI Access: 0X0
2018/07/20 19:24:04 Token Group (S-1-5-21-458149028-3186198778-2564299442-513): None/I-0B0C6521B1950 (0X2) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-1-0): Everyone/ (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-32-555): Remote Desktop Users/BUILTIN (0X4) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-32-545): Users/BUILTIN (0X4) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-4): INTERACTIVE/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-2-1): CONSOLE LOGON/ (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-11): Authenticated Users/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-15): This Organization/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-113): Local account/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-5-5-0-41431533): LogonSessionId_0_41431533/NT AUTHORITY (0XB) - with attributes: 0XC0000007
2018/07/20 19:24:04 Token Group (S-1-5-64-10): NTLM Authentication/NT AUTHORITY (0X5) - with attributes: 0X7
2018/07/20 19:24:04 Token Group (S-1-16-8192): Medium Mandatory Level/Mandatory Label (0XA) - with attributes: 0X60
2018/07/20 19:24:04 ==================================================


The new token has a different logon SID

2018/07/20 19:24:04 Token Group (S-1-5-5-0-134062): LogonSessionId_0_134062/NT AUTHORITY (0XB) - with attributes: 0XC0000007
<- vs ->
2018/07/20 19:24:04 Token Group (S-1-5-5-0-41431533): LogonSessionId_0_41431533/NT AUTHORITY (0XB) - with attributes: 0XC0000007


I'm wondering if there are specific ACEs that include the logon SID e.g. in a DACL of the windows desktop.

It may be that I need to explicitly add ACEs with the new logon SID...
(In reply to Pete Moore [:pmoore][:pete] from comment #29)
> The new token has a different logon SID
> 
> 2018/07/20 19:24:04 Token Group (S-1-5-5-0-134062):
> LogonSessionId_0_134062/NT AUTHORITY (0XB) - with attributes: 0XC0000007
> <- vs ->
> 2018/07/20 19:24:04 Token Group (S-1-5-5-0-41431533):
> LogonSessionId_0_41431533/NT AUTHORITY (0XB) - with attributes: 0XC0000007
> 
> 
> I'm wondering if there are specific ACEs that include the logon SID e.g. in
> a DACL of the windows desktop.
> 
> It may be that I need to explicitly add ACEs with the new logon SID...

This article suggests this hunch may be correct:

https://www.installsetupconfig.com/win32programming/accesscontrollistaclexample3_4.html

Probably flow is something like:

Get handle to Windows Station with:
 https://msdn.microsoft.com/en-us/library/windows/desktop/ms683225(v=vs.85).aspx
 -> GetProcessWindowStation 

Get handle to desktop with:
 https://docs.microsoft.com/en-us/windows/desktop/api/processthreadsapi/nf-processthreadsapi-getcurrentthreadid
 -> GetCurrentThreadId
 https://msdn.microsoft.com/en-us/library/windows/desktop/ms683232(v=vs.85).aspx
 -> GetThreadDesktop
In https://github.com/taskcluster/generic-worker/pull/108#partial-pull-merging I've managed to get the desktop resizing and mouse pointer moving working on Windows 7 and Windows 10. I have an open issue on Windows Server 2012 R2 which I still need to fix.

In short, my solution is to launch a new process using the interactive session primary access token which creates a new DACL for the current process windows stations and thread desktop granting full control to Everyone (S-1-1-0). The generic-worker launches this process and waits for it to terminate, checking that it has a 0 exit code signifying success.

After this, it creates a secondary login token for the task user, and runs task commands using this new token. This new token may have additional group memberships than the original token, if task.payload.osGroups contains any OS group names.

Things still to fix:

1) Investigate and fix the current failure on Windows Server 2012 R2
2) Find a way to launch the process to adjust DACL of windows station/desktop that works both from running application and from tests (launching os.Args[0] with different args doesn't work when running as a test due to arg parsing)
3) Rather than create new DACLs, see if it is possible to modify existing DACLs
4) Don't grant access to Everyone, only to logon SID of access token returned by secondary LogonUser call
(In reply to Pete Moore [:pmoore][:pete] from comment #31)

> Things still to fix:
> 
> 1) Investigate and fix the current failure on Windows Server 2012 R2

On win2012 only, the issue happens when the CI mounts an existing cache of the generic-worker source code checkout. When there is no existing cache, the issue does not occur.

The problem comes when the worker comes to run the generated wrapper script, since it is not executable for the current task user.

Example from https://taskcluster-artifacts.net/RK60Wa_OSUeo3T1CozXRSw/0/public/logs/live_backing.log
====================================================================================================

Using icacls we can see the problem with the following DACL:

icacls C:\Users\task_1532449724\gopath1.10.3\src\github.com\taskcluster\generic-worker\testdata\TestProtectedArtifactsReplaced\task_1532449724\command_000000_wrapper.bat

NT AUTHORITY\SYSTEM:(F)
BUILTIN\Administrators:(F)
S-1-5-21-2074816669-3677499281-3203234625-1003:(F)


The machine users are:

Name             SID    
Administrator    S-1-5-21-2074816669-3677499281-3203234625-500   
cygwinsshd       S-1-5-21-2074816669-3677499281-3203234625-1002  
Guest            S-1-5-21-2074816669-3677499281-3203234625-501   
sshd             S-1-5-21-2074816669-3677499281-3203234625-1001  
task_1532449724  S-1-5-21-2074816669-3677499281-3203234625-1010  


The SID S-1-5-21-2074816669-3677499281-3203234625-1003 is presumably the task user that created the cache.


Since the CI generic-worker is running as current user, it intentionally doesn't modify any ACLs when it mounts and unmounts caches, since the cache should be owned by the current user, which is LocalSystem. This should be fine.

The wrapper script is in a new location in the new task run to the cached version, since the task id is new, so even though C:\Users\task_1532449724\gopath1.10.3\src\github.com\taskcluster\generic-worker\testdata\TestProtectedArtifactsReplaced\task_1532449724\command_000000_wrapper.bat is inside the cache folder, it should be newly generated when the new task (RK60Wa_OSUeo3T1CozXRSw) runs.

So the question is, why, and only on win2012, does it get created with the above DACL, referring to the previous task user, when it is created by the LocalSystem user, without a reference to the task user at all, but globally executable by all users?

See https://github.com/taskcluster/generic-worker/blob/98a6ca7f702edcb9d0ec8d136fa1ca2e212c6666/plat_windows.go#L312-L317

At the time the wrapper script is written, it should be executable for all users, so I am not sure why it is getting:

NT AUTHORITY\SYSTEM:(F)
BUILTIN\Administrators:(F)
S-1-5-21-2074816669-3677499281-3203234625-1003:(F)

especially since *generic-worker* is running as LocalSystem, not running as the task user.



> 2) Find a way to launch the process to adjust DACL of windows
> station/desktop that works both from running application and from tests
> (launching os.Args[0] with different args doesn't work when running as a
> test due to arg parsing)

Done in https://github.com/taskcluster/generic-worker/blob/98a6ca7f702edcb9d0ec8d136fa1ca2e212c6666/plat_windows.go#L729-L734

Note, I may change this to use ImpersonateLoggedOnUser / RevertToSelf instead to avoid launching an additional process ( https://github.com/taskcluster/generic-worker/blob/cd24966b60b09419d6882a1737a114af13a8e867/win32/win32_windows.go#L781-L803 ). However I need to first check this will work, since GetProcessWindowsStation might use the access token associated to the process rather than the thread which is impersonating the task user, in which case that approach wouldn't work. Testing needed.

> 3) Rather than create new DACLs, see if it is possible to modify existing
> DACLs

This is something of an optimisation that could be implemented later.

> 4) Don't grant access to Everyone, only to logon SID of access token
> returned by secondary LogonUser call

I am already iterating through the token groups of the access token, so I should be able to fish out the logon SID (S-1-5-5-X-Y) when doing that.
(In reply to Pete Moore [:pmoore][:pete] from comment #32)

> At the time the wrapper script is written, it should be executable for all
> users, so I am not sure why it is getting:
> 
> NT AUTHORITY\SYSTEM:(F)
> BUILTIN\Administrators:(F)
> S-1-5-21-2074816669-3677499281-3203234625-1003:(F)
> 
> especially since *generic-worker* is running as LocalSystem, not running as
> the task user.

Added more debug to CI, hope it will shed some light...
Here are the results with the extra debug.

1) We see that the file command_000000_wrapper.bat does not exist in the cache from the previous task run, as intended, by running 'dir' against the task directory:


> 2018/07/27 18:57:28 Dir before creating wrapper script...
>  Volume in drive C has no label.
>  Volume Serial Number is B2AF-A34A
> 
>  Directory of C:\Users\task_1532717137\gopath1.10.3\src\github.com\taskcluster\generic-worker\testdata\TestProtectedArtifactsReplaced\task_1532717137
> 
> 07/27/2018  06:57 PM    <DIR>          .
> 07/27/2018  06:57 PM    <DIR>          ..
> 07/27/2018  06:57 PM    <DIR>          generic-worker
>                0 File(s)              0 bytes
>                3 Dir(s)  52,564,131,840 bytes free


So this rules out the possibility that the wrapper file was the cached version.

Before we create the file in this directory, we also check the permissions of the directory:


> 2018/07/27 18:57:28 icacls of parent dir before creating wrapper script...
> C:\Users\task_1532717137\gopath1.10.3\src\github.com\taskcluster\generic-worker\testdata\TestProtectedArtifactsReplaced\task_1532717137
>   NT AUTHORITY\SYSTEM:(OI)(CI)(F)
>   BUILTIN\Administrators:(OI)(CI)(F)                                                                                                                                        >   S-1-5-21-1597670785-231816465-330029964-1003:(OI)(CI)(F)
> 
> Successfully processed 1 files; Failed processing 0 files


Above we see that Administrators and the original task user have full control of the directory, and that files and directories inside this directory will inherit these permissions.

Note, the original task user needn't be in this list, but should also do no harm. However, I will check why the ACE is not getting removed from the DACL.


Next, we create the wrapper script with *read* and *execute* permissions for *all users* and write permission for the LocalSystem user:

https://github.com/taskcluster/generic-worker/blob/1d03c93d102188ed32d90128b2f90e16e257f115/plat_windows.go#L330-L335

> 	// now generate the .bat script that runs all of this
> 	err = ioutil.WriteFile(
> 		wrapper,
> 		[]byte(contents),
> 		0755,
> 	)


Now when we 'dir' the directory again, we see the file has been created:


> 2018/07/27 18:57:28 Dir after creating wrapper script...
>  Volume in drive C has no label.
>  Volume Serial Number is B2AF-A34A
> 
>  Directory of C:\Users\task_1532717137\gopath1.10.3\src\github.com\taskcluster\generic-worker\testdata\TestProtectedArtifactsReplaced\task_1532717137
> 
> 07/27/2018  06:57 PM    <DIR>          .
> 07/27/2018  06:57 PM    <DIR>          ..
> 07/27/2018  06:57 PM               788 command_000000_wrapper.bat
> 07/27/2018  06:57 PM    <DIR>          generic-worker
>                1 File(s)            788 bytes
>                3 Dir(s)  52,564,127,744 bytes free



When we check the permissions of the parent directory, there are no changes, as we would expect:


> 2018/07/27 18:57:28 icacls of parent dir after creating wrapper script...
> C:\Users\task_1532717137\gopath1.10.3\src\github.com\taskcluster\generic-worker\testdata\TestProtectedArtifactsReplaced\task_1532717137
>   NT AUTHORITY\SYSTEM:(OI)(CI)(F)
>   BUILTIN\Administrators:(OI)(CI)(F)
>   S-1-5-21-1597670785-231816465-330029964-1003:(OI)(CI)(F)
> 
> Successfully processed 1 files; Failed processing 0 files


However, when we check the DACL of the file we just created, we don't see all users as having execute permission on the file:


> 2018/07/27 18:57:28 icacls of wrapper script after creating it...
> C:\Users\task_1532717137\gopath1.10.3\src\github.com\taskcluster\generic-worker\testdata\TestProtectedArtifactsReplaced\task_1532717137\command_000000_wrapper.bat
>   NT AUTHORITY\SYSTEM:(F)
>   BUILTIN\Administrators:(F)
>   S-1-5-21-1597670785-231816465-330029964-1003:(F)
> 
> Successfully processed 1 files; Failed processing 0 files


Instead, we see that it has inherited the exact same permissions its parent directory has. In other words, the '0755' that we specified in https://github.com/taskcluster/generic-worker/blob/1d03c93d102188ed32d90128b2f90e16e257f115/plat_windows.go#L334 seems to have had little (no) effect.


Finally, we check which users are active on the system, to see what the SID is of the currently active task user:

> 2018/07/27 18:57:28 wmic useraccount get name,sid...
> Name             SID                                           
> Administrator    S-1-5-21-1597670785-231816465-330029964-500   
> cygwinsshd       S-1-5-21-1597670785-231816465-330029964-1002  
> Guest            S-1-5-21-1597670785-231816465-330029964-501   
> sshd             S-1-5-21-1597670785-231816465-330029964-1001  
> task_1532717137  S-1-5-21-1597670785-231816465-330029964-1004  


Here we see that the new task user has a different SID to the original task user, as expected (ending in -1004 instead of -1003).


So the big question - why is the 0755 on line 334 of plat_windows.go having no effect?! And why only on win2012?

The investigation continues...
The 0755 is passed into 'perm' parameter of https://github.com/golang/go/blob/go1.10.3/src/syscall/syscall_windows.go#L247-L291
 
> func Open(path string, mode int, perm uint32) (fd Handle, err error) {
>     if len(path) == 0 {
>         return InvalidHandle, ERROR_FILE_NOT_FOUND
>     }    
>     pathp, err := UTF16PtrFromString(path)
>     if err != nil {
>         return InvalidHandle, err
>     }    
>     var access uint32
>     switch mode & (O_RDONLY | O_WRONLY | O_RDWR) {
>     case O_RDONLY:
>         access = GENERIC_READ
>     case O_WRONLY:
>         access = GENERIC_WRITE
>     case O_RDWR:
>         access = GENERIC_READ | GENERIC_WRITE
>     }    
>     if mode&O_CREAT != 0 {
>         access |= GENERIC_WRITE
>     }    
>     if mode&O_APPEND != 0 {
>         access &^= GENERIC_WRITE
>         access |= FILE_APPEND_DATA
>     }    
>     sharemode := uint32(FILE_SHARE_READ | FILE_SHARE_WRITE)
>     var sa *SecurityAttributes
>     if mode&O_CLOEXEC == 0 {
>         sa = makeInheritSa()
>     }    
>     var createmode uint32
>     switch {
>     case mode&(O_CREAT|O_EXCL) == (O_CREAT | O_EXCL):
>         createmode = CREATE_NEW
>     case mode&(O_CREAT|O_TRUNC) == (O_CREAT | O_TRUNC):
>         createmode = CREATE_ALWAYS
>     case mode&O_CREAT == O_CREAT:
>         createmode = OPEN_ALWAYS
>     case mode&O_TRUNC == O_TRUNC:
>         createmode = TRUNCATE_EXISTING
>     default:
>         createmode = OPEN_EXISTING
>     }    
>     h, e := CreateFile(pathp, access, sharemode, sa, createmode, FILE_ATTRIBUTE_NORMAL, 0)
>     return h, e 
> }


The property mode is set as the union of the following flags:

os.O_WRONLY
os.O_CREATE
os.O_TRUNC
syscall.O_CLOEXEC

If I've interpreted the code correctly, this implies that a kernel32.dll CreateFileW syscall is made ( https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-createfilew ) with the parameters set as follows:

  *lpFileName = <wrapper script file location>
  dwDesiredAccess = GENERIC_WRITE
  dwShareMode = FILE_SHARE_READ | FILE_SHARE_WRITE
  *lpSecurityAttributes = 0
  dwCreationDisposition = CREATE_ALWAYS
  dwFlagsAndAttributes = FILE_ATTRIBUTE_NORMAL
  hTemplateFile = NULL
Hunch - https://docs.microsoft.com/en-us/windows/desktop/api/winnt/ns-winnt-_token_default_dacl is different in token from winlogon session vs token returned by LogonUser.

Just parking this thought so I can return to it later while I continue investigating why the last '5' (all: read|execute) in '0755' file permission seems not to be respected in the generated file/DACL...
(In reply to Pete Moore [:pmoore][:pete] from comment #34)
> Here are the results with the extra debug.

Note, these came from https://taskcluster-artifacts.net/NiRdLGZzS0K_r_of7OYenA/0/public/logs/live_backing.log

and the go test that failure being analysed was: https://tools.taskcluster.net/groups/VLHBh5zERbWm4Z7d09FxNw/tasks/DVmfmhGJTZ26VsycRWz7Zw/runs/0/logs/public%2Flogs%2Flive.log
WOW. The file system flags aren't respected on Windows. I finally found this in the docs for Chmod (incidentally not in os.Mkdir / os.Open / os.Create / <other file functions in standard library>):

https://golang.org/pkg/os/#Chmod

> On Windows, the mode must be non-zero but otherwise only the 0200 bit (owner writable) of mode is used; it
> controls whether the file's read-only attribute is set or cleared. attribute. The other bits are currently unused. Use
> mode 0400 for a read-only file and 0600 for a readable+writable file. 


... I'm ... speechless.

Previously I had read https://golang.org/pkg/os/#FileMode which says:

> A FileMode represents a file's mode and permission bits. The bits have the same definition on all systems, so
> that information about files can be moved from one system to another portably. Not all bits apply to all systems.
> The only required bit is ModeDir for directories. 

The sentence "Not all bits apply to all systems." is somewhat of an understatement in this case, given that of the standard 9 permission bits (rwxrwxrwx) only one of them applies on Windows, and the remaining eight are ignored!

... So it looks like I'll need to manage the DACLs directly myself.
That should do it. :-)
Attachment #8996438 - Flags: review?(bstack)
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/033a26b51599325b5817a0228868106b57395048
Bug 1439588 - grant full control of desktop/windows station to everyone

https://github.com/taskcluster/generic-worker/commit/404a101c19202a81298364a2777d86325731cf87
Merge pull request #116 from taskcluster/bug1439588-part2

Bug 1439588 - grant full control of desktop/windows station to everyone
Testing new worker in:

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=f2bf1f4f7b0f6c046b7ae42398b940cafef5333d

Note, win2012 builders are not updated due to bug 1480108 but this should at least test testers upgrade.

@jmaher, can you provide me with a list of task names that need to run as administrator?

If you have links to existing tasks, even better. Thanks!
(In reply to Pete Moore [:pmoore][:pete] from comment #41)

> @jmaher, can you provide me with a list of task names that need to run as
> administrator?
> 
> If you have links to existing tasks, even better. Thanks!

No worries - I found the tasks, e.g. test-windows7-32/opt-talos-xperf-e10s

I see they are still running, but reporting values of '0' so the tasks are still enabled and running, just reporting 0's instead of the correct values.

I'll work on a patch to run these tasks as Administrator, and to grant permissions to any roles that are required in order for these tasks to have the elevated privileges required.
Attachment #8996438 - Flags: review?(bstack) → review+
Depends on: 1480723
(In reply to Pete Moore [:pmoore][:pete] from comment #42)
> (In reply to Pete Moore [:pmoore][:pete] from comment #41)
> 
> > @jmaher, can you provide me with a list of task names that need to run as
> > administrator?
> > 
> > If you have links to existing tasks, even better. Thanks!
> 
> No worries - I found the tasks, e.g. test-windows7-32/opt-talos-xperf-e10s
> 
> I see they are still running, but reporting values of '0' so the tasks are
> still enabled and running, just reporting 0's instead of the correct values.
> 
> I'll work on a patch to run these tasks as Administrator, and to grant
> permissions to any roles that are required in order for these tasks to have
> the elevated privileges required.

In the latest nightly push, I see references to the following xperf tasks (in https://taskcluster-artifacts.net/fjkQG77mQUakA8GdV6Wj5w/0/public/full-task-graph.json):

test-windows10-64-ccov/debug-talos-xperf-e10s
test-windows10-64-msvc/opt-talos-xperf-e10s
test-windows10-64-pgo/opt-talos-xperf-e10s
test-windows10-64-qr/opt-talos-xperf-e10s
test-windows10-64/opt-talos-xperf-e10s
test-windows7-32-msvc/opt-talos-xperf-e10s
test-windows7-32-pgo/opt-talos-xperf-e10s
test-windows7-32/opt-talos-xperf-e10s

It looks like the xperf tasks are defined here:
https://dxr.mozilla.org/mozilla-central/rev/a2d65d03e46a9a42b5bee5c2a7864d3f987a8ca7/taskcluster/ci/test/talos.yml#600-620

From the second link I only see references to win7, so I'm not sure where the win10 tasks come from in the nightly push above.

If I remember correctly we have UAC disabled on win7, and enabled on win10. In any case, generic-worker handles both UAC enabled and disabled, so this isn't an issue. I'll submit some tasks with the new feature enabled, and see if I can get some non-zero results!
xperf tasks are only on windows 7, we need to find the windows 10 version of the toolchain and api/tool changes; ideally we would move to win10, it is just a low priority.
Thanks Joel!

Trying an xperf task here:
  https://tools.taskcluster.net/groups/T1UR9XINS6SaIx1kehDM9Q/tasks/N7d4ZvefQGi4lMV-S7WEiw/details

You came back from holiday just at the right time! ;-)
Joel, I managed to get a passing run, but like you said there seem to be some suspicious zero values.

How should we proceed from here?

https://tools.taskcluster.net/groups/T1UR9XINS6SaIx1kehDM9Q/tasks/Bi1bGCY8TMWWChX5HgSV5w/details

Maybe getting a loaner is the next step?

https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance&oldid=1194794#For_generic-worker_10.5.0_onwards

Let me know what you think, or how we can best troubleshoot this issue. Thanks!
Flags: needinfo?(jmaher)
I am excited we are this far- I believe this is working in general, we probably have a small tweak to make.  Moving forward reproducing this on a loaner from a command prompt would be excellent as that would ensure we can see errors that might not be reported or run each subcommand by itself and verify inputs and outputs.

I suspect the problem is related to the output files.  xperf runs at the kernel level and records all types of events.  We have it configured to record network and file IO (reads, writes, cache hits, etc.)  How this works is we start xperf recording (which will create a large .etl file), then we do some actions in the browser.  Finally when the browser test is done, we run xperf again to save the file and stop recording.  Now the fun begins, where we run xperf to analyze the file and output intermediate files (reading in the output file, filtering for events, outputting to a .csv file).  Finally we parse the .csv file and summarize events/counters so we have data to report to tree herder.

Given that explanation, I suspect we are having issues writing or reading certain files.

take the link given above:
https://taskcluster-artifacts.net/Bi1bGCY8TMWWChX5HgSV5w/0/public/logs/live_backing.log

I see this line in it:
Warning: we are looking for xperf results file etl_output_thread_stats.csv, and didn't find it

so somehow we didn't read the .etl file or write the .csv file.  In looking at what is running on trunk, we get the same things as try is showing.  There is no data from xperf.

I suspect xperf is not running properly as looking back 2 weeks in mozilla-central I see similar results in the log files.
Flags: needinfo?(jmaher)
working together on vidyo, Pete and I think we figured it out.  We are accessing xperf.exe properly, but when we parse it to generate human readable summaries, it fails to create a file.  I am not sure why exactly, but in debugging this in an interactive loaner, we saw that when we did:
xperf -i firefox.etl -o firefox.csv

we failed to write firefox.csv as this was trying to write to the xperf directory.
The solution we found together was to run xperf from the task directory. In our session we did this using the full xperf path, it is probably also possible to do this just by setting the PATH to include the xperf directory ("C:\Program Files\Microsoft Windows Performance Toolkit"). This could either be done on the worker[1] or in the task. Probably doing it on the worker is the cleanest option, in case that location changes, or other tasks wish to run xperf.

Joel, I think we do that, we should be able to call xperf directly from the task directory and not need to specify any full paths - not for the xperf executable, and not for output directory either - the output directory could remain a relative path to the working directory. I think the only change we need is to make sure that xperf isn't called from the xperf directory (which I guess is handled in mozharness code(?)).

--
[1] https://github.com/mozilla-releng/OpenCloudConfig/blob/9a43518f8ccf75e6077b0348483ddec0aed32f2f/userdata/Manifest/gecko-t-win7-32.json#L440-L472
Flags: needinfo?(jmaher)
I'm not quite sure which OCC step installs "C:\Program Files\Microsoft Windows Performance Toolkit" - Rob, do you know?
Flags: needinfo?(rthijssen)
Hey Joel, can you review this before Rob, to check you are happy with this approach?

My idea is that we call xperf from the task directory, and its location is determined via the PATH.

If you're happy with this, then if Rob is happy too, I think we can try it out in staging.
Attachment #8998212 - Flags: review?(jmaher)
Attachment #8998212 - Flags: feedback?(rthijssen)
Comment on attachment 8998212 [details] [diff] [review]
Github Pull Request for OpenCloudConfig

Review of attachment 8998212 [details] [diff] [review]:
-----------------------------------------------------------------

this sounds like a lovely idea
Attachment #8998212 - Attachment is patch: true
Attachment #8998212 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8998212 - Flags: review?(jmaher) → review+
Comment on attachment 8998212 [details] [diff] [review]
Github Pull Request for OpenCloudConfig

lgtm
Flags: needinfo?(rthijssen)
Attachment #8998212 - Flags: feedback?(rthijssen) → feedback+
Summary for people that might need to support this bug while I'm on PTO (mostly for taskcluster team):

xperf tasks need to run as administrator on Windows 7. The following steps are needed to achieve this:

1) generic-worker 10.11.2 or higher needs to be deployed on gecko-t-win7-32 - see bug 1480723

2) xperf tasks need to stop running from folder "C:\Program Files\Microsoft Windows Performance Toolkit".

jmaher is working on that (at the moment in this bug I believe, I don't think we have a different bug yet for the xperf changes). The reason for this is the task currently writes files in the current directory, and tasks shouldn't write files to system folders. Usually they can't because they are not administrators, but in the case of a task that runs as Administrator, this is harder to control. Instead they should write to files under the task directory. The creation of the csv file - see comment 50 - is failing because it fails to write this csv file to this system folder. From getting a loaner (comment 51 and comment 52), we could see that by running xperf from the task directory we could get it to write the csv file successfully.

In comment 54 I made an OCC patch to add xperf to the system PATH so that the full xperf PATH does not need to be specified any more, which should aid with this talos runner refactor.

3) xperf task definitions generated by the decision task need to be slightly different - they need to include scope "generic-worker:os-group:aws-provisioner-v1/gecko-t-win7-32/Administrators" in task.scopes and they need to include "Administrators" in task.payload.osGroups. See e.g. https://queue.taskcluster.net/v1/task/Bi1bGCY8TMWWChX5HgSV5w as an example.

4) We need to decide which scm level(s) the xperf tasks should be allowed to run against, considering they can run as Administrator, it could be wise to disallow them on try, for example. This could make it harder for jmaher to make try pushes though, so maybe we should initially allow on all scm levels, and then remove the scope from scm level 1 after it has landed. Or maybe we just leave it - we probably should consult with Joel and security team on this.

5) We may need to grant the scope "generic-worker:os-group:aws-provisioner-v1/gecko-t-win7-32/Administrators" to various roles/clients in order that there aren't scope permission failures when these tasks get submitted.

Note, I still intend to "tighten up" the runAsAdministrator feature in future, so that only whitelisted commands can run as administrator, a bit like having an /etc/sudoers file. That work isn't done yet.
this morning I landed the changes for making xperf run assuming the scopes are set properly and I have also verified that things run with the regular win7 machines, not win7-beta (i.e. all other work is rolled out)

Now I need to figure out how to make the scopes and osgroups be properly set automatically.  Where I get stuck is I tried adding 'scopes' to the xperf task and it failed in the decision task, is this something we need to do in a transform or somewhere else in the backend of taskcluster.
Flags: needinfo?(jmaher) → needinfo?(dustin)
:pmoore, that didn't work for me and I haven't heard from Dustin (I think he is on PTO)
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #60)
> :pmoore, that didn't work for me and I haven't heard from Dustin (I think he
> is on PTO)

Yes, he was on PTO, but he's just got back now, so I expect he can help.
Thanks for finding the try push, Pete.

 talos-xperf:
..
+    scopes: ['generic-worker:os-group:aws-provisioner-v1/gecko-t-win7-32/Administrators']
+    os-groups: ['Administrators']

Navigating from there to the failed task, it looks like the error was:

[task 2018-08-09T14:44:05.452Z] Exception: In test 'talos-xperf':
[task 2018-08-09T14:44:05.452Z] extra keys not allowed @ data['scopes']

And indeed, that does not appear in the test schema
  https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py?q=taskcluster%2Ftaskgraph%2Ftransforms%2Ftests.py&redirect_type=direct#140

but note the comment for os-groups:

    # os user groups for test task workers; required scopes, will be
    # added automatically
    Optional('os-groups'): optionally_keyed_by(
        'test-platform',
        [basestring]),

However, I think those "required scopes" are the old kind -- "generic-worker:os-group:Administrators".
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/mozharness_test.py?q=taskcluster%2Ftaskgraph%2Ftransforms%2Fjob%2Fmozharness_test.py&redirect_type=direct#206
    taskdesc['scopes'].extend(
        ['generic-worker:os-group:{}'.format(group) for group in test['os-groups']])

so I think we'll need to modify that to apply either the old or new scope format.  Pete, how do we know when to apply one or the other?

Secondarily, we'll probably need to issue some more scopes to repos.  What repos should have which scopes?
Flags: needinfo?(dustin) → needinfo?(pmoore)
Nice sleuthing Dustin!

Although the osGroups feature has existed for a while, it has never been used in gecko, so we probably just need to fix the scope generation to include the provisionerId and workerType (it was an oversight of mine when I originally created the feature, that I forgot to include them).

I'm actually surprised to see there is code in gecko to support os-groups - I'm now wondering if I implemented that myself and forgot about it, or if someone else did it.

I'll make a patch that should do the trick...
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #64)
> I've made a first stab here, let's see how it goes...
> 
>   *
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=c3c03cde681671327660a7abac1e972deb23b509

That failed due to role moz-tree:level:1:* not having enough scopes. I've fixed that, and will retrigger...
(In reply to Pete Moore [:pmoore][:pete] from comment #65)
> That failed due to role moz-tree:level:1:* not having enough scopes. I've
> fixed that, and will retrigger...

Retrigger of decision task worked! So now just need to wait for xperf tasks to run.

I can see that the task definitions are correct though, so the changes to /taskcluster seem to be correct. Now it is just a matter of seeing that the changes to xperf tests are sufficient.
How's are the xperf results here Joel?

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb99fccb6115fcda820807ff32ad42550f6fe613

This is the same as the previous try job, but with the flake8 linting issue resolved.
Flags: needinfo?(jmaher)
yes, this shows all the proper data I am looking for- lets land it or whatever we need to do :)
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #68)
> yes, this shows all the proper data I am looking for- lets land it or
> whatever we need to do :)

This is the diff I applied on top of your try push:

https://hg.mozilla.org/try/rev/bb99fccb6115fcda820807ff32ad42550f6fe613

If you add that diff to your changes, that should be all you need. Don't worry about using my actual commit - probably best for the commits to come from you.

Are you happy to take care of it from here?
Flags: needinfo?(jmaher)
If you feel it is preferable, we can put the changes to taskcluster/taskgraph/transforms/job/mozharness_test.py in a separate bug, but that may be overkill. That is just a fix for the scope generation when OS groups are specified.
run xperf in os groups=administrators and support os_groups in taskcluster
ok, this works for me- verified on try and pushed a phabricator diff for review.
Flags: needinfo?(jmaher)
Comment on attachment 9003224 [details]
Bug 1439588 - run xperf in os administrator group. r=pmoore

Pete Moore [:pmoore][:pete] has approved the revision.
Attachment #9003224 - Flags: review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/65db8f04be61
run xperf in os administrator group. r=pmoore
https://hg.mozilla.org/mozilla-central/rev/65db8f04be61
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: