Closed Bug 1303455 Opened 8 years ago Closed 8 years ago

TC Windows tests run without JOB_OBJECT_LIMIT_BREAKAWAY_OK

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: pmoore)

References

Details

Attachments

(4 files)

In bug 1280474, we are trying to run Windows tests in taskcluster with generic-worker. A lot of tests are failing in this environment; some of the problems appear to be related to process management.

The most obvious error is 

20:29:11     INFO -  ProcessManager UNABLE to use job objects to manage child processes
20:29:11     INFO -  ProcessManager NOT managing child processes

That's generated by mozprocess [1] because winprocess.CanCreateJobObject() returns False. CanCreateJobObject() returns False on taskcluster because JOB_OBJECT_LIMIT_BREAKAWAY_OK is not set in the limit flags [2]; JOB_OBJECT_LIMIT_BREAKAWAY_OK is set during buildbot tests.

[1] https://dxr.mozilla.org/mozilla-central/rev/f1dbeb5dee22cabc56b2cf25d2eab1435abbf131/testing/mozbase/mozprocess/mozprocess/processhandler.py#263

[2] https://dxr.mozilla.org/mozilla-central/rev/f1dbeb5dee22cabc56b2cf25d2eab1435abbf131/testing/mozbase/mozprocess/mozprocess/winprocess.py#432
It's not clear to me what determines whether a process has JOB_OBJECT_LIMIT_BREAKAWAY_OK set.

I'm wondering if it is a security/privilege issue. I called "whoami /all" from the mozharness script and note that buildbot's cltbld user appears to be an Administrator, while the taskcluster user is not.

BUILDBOT

13:14:56     INFO - Running command: ['whoami', '/all']
13:14:56     INFO - Copy/paste: whoami /all
13:14:56     INFO -  USER INFORMATION
13:14:56     INFO -  ----------------
13:14:56     INFO -  User Name              SID
13:14:56     INFO -  ====================== ==============================================
13:14:56     INFO -  t-w732-spot-112\cltbld S-1-5-21-2192203079-3471170106-2880594189-1000
13:14:56     INFO -  GROUP INFORMATION
13:14:56     INFO -  -----------------
13:14:56     INFO -  Group Name                           Type             SID          Attributes
13:14:56     INFO -  ==================================== ================ ============ ===============================================================
13:14:56     INFO -  Everyone                             Well-known group S-1-1-0      Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  BUILTIN\Administrators               Alias            S-1-5-32-544 Mandatory group, Enabled by default, Enabled group, Group owner
13:14:56     INFO -  BUILTIN\Remote Desktop Users         Alias            S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  BUILTIN\Users                        Alias            S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  NT AUTHORITY\INTERACTIVE             Well-known group S-1-5-4      Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  CONSOLE LOGON                        Well-known group S-1-2-1      Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  NT AUTHORITY\Authenticated Users     Well-known group S-1-5-11     Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  NT AUTHORITY\This Organization       Well-known group S-1-5-15     Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  LOCAL                                Well-known group S-1-2-0      Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  NT AUTHORITY\NTLM Authentication     Well-known group S-1-5-64-10  Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  Mandatory Label\High Mandatory Level Label            S-1-16-12288 Mandatory group, Enabled by default, Enabled group
13:14:56     INFO -  PRIVILEGES INFORMATION
13:14:56     INFO -  ----------------------
13:14:56     INFO -  Privilege Name                  Description                               State
13:14:56     INFO -  =============================== ========================================= ========
13:14:56     INFO -  SeIncreaseQuotaPrivilege        Adjust memory quotas for a process        Disabled
13:14:56     INFO -  SeSecurityPrivilege             Manage auditing and security log          Disabled
13:14:56     INFO -  SeTakeOwnershipPrivilege        Take ownership of files or other objects  Disabled
13:14:56     INFO -  SeLoadDriverPrivilege           Load and unload device drivers            Disabled
13:14:56     INFO -  SeSystemProfilePrivilege        Profile system performance                Disabled
13:14:56     INFO -  SeSystemtimePrivilege           Change the system time                    Disabled
13:14:56     INFO -  SeProfileSingleProcessPrivilege Profile single process                    Disabled
13:14:56     INFO -  SeIncreaseBasePriorityPrivilege Increase scheduling priority              Disabled
13:14:56     INFO -  SeCreatePagefilePrivilege       Create a pagefile                         Disabled
13:14:56     INFO -  SeBackupPrivilege               Back up files and directories             Disabled
13:14:56     INFO -  SeRestorePrivilege              Restore files and directories             Disabled
13:14:56     INFO -  SeShutdownPrivilege             Shut down the system                      Disabled
13:14:56     INFO -  SeDebugPrivilege                Debug programs                            Disabled
13:14:56     INFO -  SeSystemEnvironmentPrivilege    Modify firmware environment values        Disabled
13:14:56     INFO -  SeChangeNotifyPrivilege         Bypass traverse checking                  Enabled
13:14:56     INFO -  SeRemoteShutdownPrivilege       Force shutdown from a remote system       Disabled
13:14:56     INFO -  SeUndockPrivilege               Remove computer from docking station      Disabled
13:14:56     INFO -  SeManageVolumePrivilege         Perform volume maintenance tasks          Disabled
13:14:56     INFO -  SeImpersonatePrivilege          Impersonate a client after authentication Enabled
13:14:56     INFO -  SeCreateGlobalPrivilege         Create global objects                     Enabled
13:14:56     INFO -  SeIncreaseWorkingSetPrivilege   Increase a process working set            Disabled
13:14:56     INFO -  SeTimeZonePrivilege             Change the time zone                      Disabled
13:14:56     INFO -  SeCreateSymbolicLinkPrivilege   Create symbolic links                     Disabled
13:14:56     INFO - Return code: 0

TASKCLUSTER

20:29:08     INFO - Running command: ['whoami', '/all']
20:29:08     INFO - Copy/paste: whoami /all
20:29:08     INFO -  USER INFORMATION
20:29:08     INFO -  ----------------
20:29:08     INFO -  User Name                       SID
20:29:08     INFO -  =============================== ==============================================
20:29:08     INFO -  i-0d2eae1827188\task_1474055472 S-1-5-21-2192203079-3471170106-2880594189-1010
20:29:08     INFO -  GROUP INFORMATION
20:29:08     INFO -  -----------------
20:29:08     INFO -  Group Name                             Type             SID          Attributes
20:29:08     INFO -  ====================================== ================ ============ ==================================================
20:29:08     INFO -  Everyone                               Well-known group S-1-1-0      Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  BUILTIN\Users                          Alias            S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  BUILTIN\Remote Desktop Users           Alias            S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  NT AUTHORITY\INTERACTIVE               Well-known group S-1-5-4      Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  CONSOLE LOGON                          Well-known group S-1-2-1      Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  NT AUTHORITY\Authenticated Users       Well-known group S-1-5-11     Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  NT AUTHORITY\This Organization         Well-known group S-1-5-15     Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  LOCAL                                  Well-known group S-1-2-0      Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  NT AUTHORITY\NTLM Authentication       Well-known group S-1-5-64-10  Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  Mandatory Label\Medium Mandatory Level Label            S-1-16-8192  Mandatory group, Enabled by default, Enabled group
20:29:08     INFO -  PRIVILEGES INFORMATION
20:29:08     INFO -  ----------------------
20:29:08     INFO -  Privilege Name                Description                          State
20:29:08     INFO -  ============================= ==================================== ========
20:29:08     INFO -  SeShutdownPrivilege           Shut down the system                 Disabled
20:29:08     INFO -  SeChangeNotifyPrivilege       Bypass traverse checking             Enabled
20:29:08     INFO -  SeUndockPrivilege             Remove computer from docking station Disabled
20:29:08     INFO -  SeIncreaseWorkingSetPrivilege Increase a process working set       Disabled
20:29:08     INFO -  SeTimeZonePrivilege           Change the time zone                 Disabled
20:29:08     INFO - Return code: 0
See Also: → 1280474
:pmoore, do you think that the task user groups could be made a configurable option in generic-worker so that we can test the effect of adding the user to the admin group?
https://github.com/taskcluster/generic-worker/blob/master/plat_windows.go#L145

i've also seen http://serverfault.com/questions/198197/what-restrictions-does-winrm-winrs-place-on-the-processes-that-it-runs the answers hint at restrictions on the cmd shell from where processes are run, so that's another thing i'm investigating.
Flags: needinfo?(pmoore)
for talos we run xperf which requires admin privileges.  I think it is expected from our tests that we run with admin credentials, we have a lot of tools that manage a set of processes and tweak settings and simulate stuff.
I'm not going to lie, this isn't pretty.

What we stand to lose from running a task as administrator, is safe sandboxing. Currently, users are like "guest users" whereby when their session ends, all trace of their existence disappears. The user is deleted, the registry changes are unmounted, the home folder is deleted, and the next task starts in a pristine clean environment, or at least in theory. The user shouldn't be able to alter system state in order that the worker is reusable between tasks.

When we allow a user to be an administrator, it means it will have access to any secrets on the worker (including taskcluster credentials, private gpg keys - which would break chain of trust, the livelog secret - which could enable malicious streaming of fake logs, and potentially any other secrets that might be used for other integrations). In addition, it could mean that a task could leave the worker in an altered state for a subsequent task, which also would be a security risk.

Admittedly, the situation in buildbot is not better - however, this is one of the reasons we wanted to move to a safer setup.

I think it would make most sense if we could dig deeper into what specific privileges are required, and see if it might be possible to grant a limited set, but not full administrator control. I think for every required privilege, we should also audit whether it is really required, or the test(s) can be adapted to work without it.

If this is an unsatisfactory answer, let's set up a meeting to discuss how we manage these conflicting requirements.
Flags: needinfo?(pmoore)
I would vote for an alternative masure- getting things running now so we can migrate over and then work on investigating this guest mode usage.  Possibly we can restrict admin to a few jobs.  What I see is that we have issues spawning e10s child process with our current worker, that is a non starter.  Possibly we can tweak the worker to make this work smoother.

I have always maintained since the beginning that we require admin access for the xperf talos tests to work- this has to hook into the lower levels of the OS to get raw fileIO and processIO counters.  Possibly there is a middle ground somewhere- but talos will be on hardware.

Pete, what type of secrets do we risk getting out?  we have credentials to upload/interact with s3, treeherder, taskcluster for each task- these tests are not run on build machines, so we don't have concerns with signing keys.
Pete, even if we determine that a subset of permissions need to be added, we'll still need to be able to adjust the task users groups. Eg: maybe they could be added to a dedicated testers group. In any case, having the ability to adjust group memberships looks like it's going to be necessary. The answer isn't unsatisfactory, it's a valid point, but our testing and progress is currently blocked by a limitation in the generic worker which could be patched in a way that would allow us to make progress. Can you consider supporting us in this?
Flags: needinfo?(pmoore)
I'll add the feature, and we'll create a separate security bug to track the security implications, which we'll link to this one.
Flags: needinfo?(pmoore)
I have collected some more info about root cause of our tc windows 7 process management failures. For all of our e10s jobs and at least one xpcshell test, failure occurs when CreateProcessAsUserW is called at [1]; it sets GetLastError() to 5 - access denied. In all cases I have observed, we are calling CreateProcessAsUserW with the CREATE_BREAKAWAY_FROM_JOB flag (I'm not sure if that is the cause of the access denied, but it seems problematic in light of https://bugzilla.mozilla.org/show_bug.cgi?id=1303455#c0).

[1] https://hg.mozilla.org/mozilla-central/annotate/62f79d676e0e11b3ad59a5425b3ebb3ec5bbefb5/security/sandbox/chromium/sandbox/win/src/target_process.cc#l149
Blocks: 1304040
The reasoning behind adding this feature is understanding that this worker will be run on machines not managed by the TaskCluster team.  As such, we need to expand the flexibility to match the environment that one chooses to run the worker in.  This might include elevating privileges of the user used for tasks.
Following on from Geoffs look at access denied issues, I'm looking at a specific test failure.

https://treeherder.allizom.org/#/jobs?repo=try&revision=544ec1d97d4b4f066113eca3f764f96f1812e1dc&filter-tier=2&group_state=expanded&selectedJob=28218592
https://public-artifacts.taskcluster.net/JX9yYSRrR2-PHb4kfA97TQ/0/public/logs/live_backing.log
https://papertrailapp.com/systems/win7-i-0fa18b2c1d79d7690/events?focus=714927988487282692

The raw log from the test contains a lot of errors around access to paths on the c: drive. eg:

16:49:13     INFO -  Failed to remove directory: c:\users\task_1474389326\appdata\local\temp\xpc-plugins-7f8dao. Waiting.

or:

17:55:55     INFO -  "CONSOLE_MESSAGE: (info) 1474394154967	Toolkit.Telemetry	TRACE	TelemetryStorage::loadPingfile - unreadable ping c:\\users\\task_1474389326\\appdata\\local\\temp\\xpc-profile-zpettu\\datareporting\\aborted-session-ping: Win error 2 during operation open on file c:\\users\\task_1474389326\\appdata\\local\\temp\\xpc-profile-zpettu\\datareporting\\aborted-session-ping (The system cannot find the file specified.


I'm not 100% on this, but I hope it's a safe guess that the tests are using the environment configuration to determine where to look for (or create/delete) these files. Papertrail logs show that the generic worker provides environment configuration to the task:

Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  2016/09/20 16:35:35 Generating script: 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  2016/09/20 16:35:35 :: This script runs command 2 defined in TaskId JX9yYSRrR2-PHb4kfA97TQ... 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  @echo off 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set ALLUSERSPROFILE=C:\ProgramData 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set APPDATA=C:\Users\task_1474389326\AppData\Roaming 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set ChocolateyInstall=C:\ProgramData\chocolatey 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set CommonProgramFiles=C:\Program Files\Common Files 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set COMPUTERNAME=I-0FA18B2C1D79D 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set ComSpec=C:\windows\system32\cmd.exe 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set DCLOCATION=SCL3 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set DNSSUFFIX=wintest.releng.scl3.mozilla.com 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set FP_NO_HOST_CHECK=NO 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set HOMEDRIVE=Z: 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set HOMEPATH=\task_1474389326 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set KTS_HOME=C:\Program Files\KTS 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set KTS_VERSION=1.19c 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set LOCALAPPDATA=C:\Users\task_1474389326\AppData\Local 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set MOZILLABUILD=C:\mozilla-build 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set NUMBER_OF_PROCESSORS=8 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set OS=Windows_NT 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set Path=C:\mozilla-build\7zip;C:\mozilla-build\info-zip;C:\mozilla-build\kdiff3;C:\mozilla-build\moztools-x64\bin;C:\mozilla-build\mozmake;C:\mozilla-build\msys\bin;C:\mozilla-build\msys\local\bin;C:\mozilla-build\nsis-3.0b3;C:\mozilla-build\nsis-2.46u;C:\mozilla-build\python;C:\mozilla-build\python\Scripts;C:\mozilla-build\upx391w;C:\mozilla-build\wget;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\mozilla-build\python27;C:\mozilla-build\python27\Scripts;C:\mozilla-build\vim\vim72;C:\CoreUtils\bin;C:\mozilla-build\buildbotve\scripts;c:\mozilla-build\hg\;c:\Program Files\Microsoft Windows Performance Toolkit\;C:\Users\cltbld\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0;C:\ProgramData\chocolatey\bin;C:\Program Files\Puppet Labs\Puppet\bin;C:\mozilla-build\hg 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PROCESSOR_ARCHITECTURE=x86 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PROCESSOR_IDENTIFIER=x86 Family 6 Model 62 Stepping 4, GenuineIntel 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PROCESSOR_LEVEL=6 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PROCESSOR_REVISION=3e04 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set ProgramData=C:\ProgramData 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set ProgramFiles=C:\Program Files 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PROMPT=$P$G 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PSModulePath=C:\windows\system32\WindowsPowerShell\v1.0\Modules\;C:\Program Files\AWS Tools\PowerShell\ 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set PUBLIC=C:\Users\Public 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set SystemDrive=C: 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set SystemRoot=C:\windows 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set tcexitcode=0 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set TEMP=C:\Users\task_1474389326\AppData\Local\Temp 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set test1=testie 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set TMP=C:\Users\task_1474389326\AppData\Local\Temp 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set USERDOMAIN=I-0FA18B2C1D79D 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set USERNAME=task_1474389326 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set USERPROFILE=C:\Users\task_1474389326 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set windir=C:\windows 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set windows_tracing_flags=3 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set windows_tracing_logfile=C:\BVTBin\Tests\installpackage\csilogfile.log 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  cd Z:\task_1474389326 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set errorlevel= 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  call Z:\task_1474389326\command_000002.bat 2>&1 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set tcexitcode=%errorlevel% 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  set > Z:\task_1474389326\env.txt 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  cd > Z:\task_1474389326\dir.txt 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  exit /b %tcexitcode% 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  2016/09/20 16:35:35 Script "Z:\\task_1474389326\\command_000002.bat": 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  2016/09/20 16:35:35 Contents: 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  2016/09/20 16:35:35 @echo on 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  c:\mozilla-build\python\python.exe -u mozharness\scripts\desktop_unittest.py --cfg mozharness\configs\unittests\win_taskcluster_unittest.py --xpcshell-suite=xpcshell --no-read-buildbot-config --installer-url https://queue.taskcluster.net/v1/task/TNYaBfxDTWGI0hB0PRjHGw/artifacts/public/build/firefox-51.0a1.en-US.win32.zip --test-packages-url https://queue.taskcluster.net/v1/task/TNYaBfxDTWGI0hB0PRjHGw/artifacts/public/build/firefox-51.0a1.en-US.win32.test_packages.json --download-symbols true 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  @echo off 
Sep 20 17:35:37 win7-i-0fa18b2c1d79d7690 generic-worker:  2016/09/20 16:35:35 Running command: 'Z:\task_1474389326\command_000002_wrapper.bat'

Some of the environment is correct (HOMEDRIVE, HOMEPATH), but a lot isn't (APPDATA, LOCALAPPDATA, TEMP, TMP, USERPROFILE).

I think that the generic-worker should be explicitly setting the values of APPDATA, LOCALAPPDATA, TEMP, TMP and USERPROFILE to directories under the users home path on the z: drive (or wherever the config states the homedir should be) and creating those directories if they don't exist. When we start to correct permissions for tests, it won't be helpful to have to make allowances for appdata and temp folders that are on the c: drive when they should be under the users home directory.
I am curious why z: vs c:?  I see that we are using c:\users\task_xxx\..., to me I interpret that as 100% in the scope of the generic worker and should be fully readable/writable/executable?
g-w takes a config parameter to tell it under what directory it should create task user home directories. gecko-* windows worker types use z:\ because it is mapped to the ephemeral drive which is free, rather than the c:\users directory which costs money. note that the task work directory is correctly set to Z:\task_1474389326, where the work is all taking place. it's just the temp and appdata folders that are referencing c:\Users\task_1474389326
thanks for the explanation.  To confirm, outside of c:\mozilla-build we should see no c:\ access in the log file?

outside of the above mentioned, I see some env vars set:
'WINDOWS_TRACING_LOGFILE': 'C:\\BVTBin\\Tests\\installpackage\\csilogfile.log'
'KTS_HOME': 'C:\\Program Files\\KTS'
'PUBLIC': 'C:\\Users\\Public'
'ALLUSERSPROFILE': 'C:\\ProgramData'
'CHOCOLATEYINSTALL': 'C:\\ProgramData\\chocolatey' <- who names a progam chocolatey ???

and in path, I see:
C:\\CoreUtils\\bin
C:\\Users\\cltbld\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0
C:\\ProgramData\\chocolatey\\bin
C:\\Program Files\\Puppet Labs\\Puppet\\bin
I'm not sure. It's one of the things I want to understand better through this migration as well. If we have tests reading from or writing to locations on the C: drive, I want to understand what locations. Using Z: gives us a simple mechanism for locking down permissions (if we want to). There might be cases where we explicitly grant access to specific paths on C: but we/I have yet to establish what and why.
Thanks for this Rob.

I suspect the problem lies somewhere deep under the covers of https://github.com/taskcluster/generic-worker/blob/7edf51d0f80732396412db68fbfe7d2bc188c7f0/plat_windows.go#L153

There is probably an assumption somewhere in the code that the Users parent directory is C:\Users - and we'll need to dig into this to understand where that comes from in the calling library (a hardcoded reference, something configured in the registry, an environment variable, etc).
Probably when the local user is created, something is determining where the home folder should be created, which is incorrectly using C:\Users instead of Z:\Users... I'll try to trace this through the various syscalls when the user is created and when the process is created, to find how the USERPROFILE is set (and associated variables).

It might be that the location is frozen to this location by the OS, but I suspect even if that was the case, I can probably adjust the registry entry/entries after user creation to relocate to the preferred configured location for the worker type.

There are a chain of syscalls, so I'll need to dig through all of them.
The log in comment 10 shows tests putting things in the temp directory. Specifically, that looks to be from an xpcshell test. The xpcshell harness just calls mkdtemp to create these:
https://dxr.mozilla.org/mozilla-central/rev/29beaebdfaccbdaeb4c1ee5a43a9795ab015ef49/testing/xpcshell/runxpcshelltests.py#364

Looking at the Python 2.7 source for mkdtemp:
http://svn.python.org/view/python/branches/release27-maint/Lib/tempfile.py?revision=82503&view=markup#l296

It looks like it uses _candidate_tempdir_list to find the tempdir:
http://svn.python.org/view/python/branches/release27-maint/Lib/tempfile.py?revision=82503&view=markup#l140

Its first choice is whatever's in the first of the $TMPDIR / $TEMP / $TMP environment variables.
Is this now resolved with the resolution of bug 1306988 (the "osGroups" feature in generic worker) and the workaround described in https://bugzilla.mozilla.org/show_bug.cgi?id=1306989#c3 ?
Flags: needinfo?(rthijssen)
this is not resolved. we still see the error but it is not due to problems in g-w. the ability to test different permission groups is helping with debugging and eliminating possible causes.
Assignee: nobody → rthijssen
Component: Generic-Worker → General Automation
Flags: needinfo?(rthijssen)
Product: Taskcluster → Release Engineering
QA Contact: catlee
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=029efe60b61e6ac8b9897defe87f9dcf79ef4f19, I repeated my experiment from comment 1, now using "osGroups" to join Administrators. Now whoami reports privileges on taskcluster Windows 7 jobs very similar to corresponding jobs in buildbot.

19:28:25     INFO - Copy/paste: whoami /all
19:28:25     INFO -  USER INFORMATION
19:28:25     INFO -  ----------------
19:28:25     INFO -  User Name                       SID
19:28:25     INFO -  =============================== ==============================================
19:28:25     INFO -  i-0d0fbe76c0089\task_1477507637 S-1-5-21-2192203079-3471170106-2880594189-1005
19:28:25     INFO -  GROUP INFORMATION
19:28:25     INFO -  -----------------
19:28:25     INFO -  Group Name                           Type             SID          Attributes
19:28:25     INFO -  ==================================== ================ ============ ===============================================================
19:28:25     INFO -  Everyone                             Well-known group S-1-1-0      Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  BUILTIN\Users                        Alias            S-1-5-32-545 Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  BUILTIN\Remote Desktop Users         Alias            S-1-5-32-555 Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  BUILTIN\Administrators               Alias            S-1-5-32-544 Mandatory group, Enabled by default, Enabled group, Group owner
19:28:25     INFO -  NT AUTHORITY\INTERACTIVE             Well-known group S-1-5-4      Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  CONSOLE LOGON                        Well-known group S-1-2-1      Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  NT AUTHORITY\Authenticated Users     Well-known group S-1-5-11     Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  NT AUTHORITY\This Organization       Well-known group S-1-5-15     Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  LOCAL                                Well-known group S-1-2-0      Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  NT AUTHORITY\NTLM Authentication     Well-known group S-1-5-64-10  Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  Mandatory Label\High Mandatory Level Label            S-1-16-12288 Mandatory group, Enabled by default, Enabled group
19:28:25     INFO -  PRIVILEGES INFORMATION
19:28:25     INFO -  ----------------------
19:28:25     INFO -  Privilege Name                  Description                               State
19:28:25     INFO -  =============================== ========================================= ========
19:28:25     INFO -  SeIncreaseQuotaPrivilege        Adjust memory quotas for a process        Disabled
19:28:25     INFO -  SeSecurityPrivilege             Manage auditing and security log          Disabled
19:28:25     INFO -  SeTakeOwnershipPrivilege        Take ownership of files or other objects  Disabled
19:28:25     INFO -  SeLoadDriverPrivilege           Load and unload device drivers            Disabled
19:28:25     INFO -  SeSystemProfilePrivilege        Profile system performance                Disabled
19:28:25     INFO -  SeSystemtimePrivilege           Change the system time                    Disabled
19:28:25     INFO -  SeProfileSingleProcessPrivilege Profile single process                    Disabled
19:28:25     INFO -  SeIncreaseBasePriorityPrivilege Increase scheduling priority              Disabled
19:28:25     INFO -  SeCreatePagefilePrivilege       Create a pagefile                         Disabled
19:28:25     INFO -  SeBackupPrivilege               Back up files and directories             Disabled
19:28:25     INFO -  SeRestorePrivilege              Restore files and directories             Disabled
19:28:25     INFO -  SeShutdownPrivilege             Shut down the system                      Disabled
19:28:25     INFO -  SeDebugPrivilege                Debug programs                            Disabled
19:28:25     INFO -  SeSystemEnvironmentPrivilege    Modify firmware environment values        Disabled
19:28:25     INFO -  SeChangeNotifyPrivilege         Bypass traverse checking                  Enabled
19:28:25     INFO -  SeRemoteShutdownPrivilege       Force shutdown from a remote system       Disabled
19:28:25     INFO -  SeUndockPrivilege               Remove computer from docking station      Disabled
19:28:25     INFO -  SeManageVolumePrivilege         Perform volume maintenance tasks          Disabled
19:28:25     INFO -  SeImpersonatePrivilege          Impersonate a client after authentication Enabled
19:28:25     INFO -  SeCreateGlobalPrivilege         Create global objects                     Enabled
19:28:25     INFO -  SeIncreaseWorkingSetPrivilege   Increase a process working set            Disabled
19:28:25     INFO -  SeTimeZonePrivilege             Change the time zone                      Disabled
19:28:25     INFO -  SeCreateSymbolicLinkPrivilege   Create symbolic links                     Disabled
19:28:25     INFO - Return code: 0

That's great to see!

But, JOB_OBJECT_LIMIT_BREAKAWAY_OK is still not set in the limit flags, so CanCreateJobObject() still returns False, and tests still cannot create processes, as in Comment 0.
It is worth noting that JOB_OBJECT_LIMIT_BREAKAWAY_OK is also not set on Windows 10 -- but that does not cause any test failures. On Windows 6.2 (note that our Windows 7 is 6.1.7601) and above, nested jobs can be used. In terms of our python code, see for instance:

https://dxr.mozilla.org/mozilla-central/rev/f9f3cc95d7282f1fd83f66dd74acbcdbfe821915/testing/mozbase/mozprocess/mozprocess/processhandler.py#267 and

https://dxr.mozilla.org/mozilla-central/rev/f9f3cc95d7282f1fd83f66dd74acbcdbfe821915/testing/mozbase/mozprocess/mozprocess/processhandler.py#366

On Windows 10, we do not have can_create_jobs (because JOB_OBJECT_LIMIT_BREAKAWAY_OK is missing) but we can_nest_jobs, so no warning is printed (and future operations succeed).

On Windows 7, we do not have either can_create_jobs nor can_nest_jobs (and there is no hope of getting can_nest_jobs on Windows 7).
so how do we successfully run tests on the buildbot AMIs?
Pete, I am not sure where this problem is defined at, but if you could log into a machine and run the command manually from a command prompt, that would help us determine if this is a generic-worker.exe issue or an AMI issue.

I did poke around in the generic worker code, and was not able to determine how we launch a process.  I am not familiar with GO as a programming language and was not in a position to learn the differences of GO.

In reading Microsoft's CreateProcess() documentation it looks like we need to create a job group with the proper flags set, then associate the processes with that.  Once flags are set properly for the job, then any CreateProcess calls from a process in the job group will inherit the proper flags.  If that is the case, then it is safe to assume that we are launching the worker incorrectly, or doing something in process creation from the worker to remove/restrict the process/job flags.

As it stands this is blocking us from doing any work in windows 7.  I would not trust any results we get on what few jobs might pass now until this is resolved as it will change the process scope running for all tasks.
Flags: needinfo?(pmoore)
(In reply to Joel Maher ( :jmaher) from comment #23)
> Pete, I am not sure where this problem is defined at, but if you could log
> into a machine and run the command manually from a command prompt, that
> would help us determine if this is a generic-worker.exe issue or an AMI
> issue.

Sure - I can take a look at this on Monday. Unfortunately it's 9pm on Friday night now.

> 
> I did poke around in the generic worker code, and was not able to determine
> how we launch a process.  I am not familiar with GO as a programming
> language and was not in a position to learn the differences of GO.

See https://github.com/taskcluster/generic-worker/blob/c61c8950947e1f62f078599210ea828060389f44/syscall/exec_windows.go#L19

> 
> In reading Microsoft's CreateProcess() documentation it looks like we need
> to create a job group with the proper flags set, then associate the
> processes with that.  Once flags are set properly for the job, then any
> CreateProcess calls from a process in the job group will inherit the proper
> flags.  If that is the case, then it is safe to assume that we are launching
> the worker incorrectly, or doing something in process creation from the
> worker to remove/restrict the process/job flags.
> 
> As it stands this is blocking us from doing any work in windows 7.  I would
> not trust any results we get on what few jobs might pass now until this is
> resolved as it will change the process scope running for all tasks.

Let's continue working together to solve this problem, as none of us knows the source of the problem yet. I'll let you know how I get on on Monday; please also update the bug if you make any progress.
Flags: needinfo?(pmoore)
this can easily wait until Monday- lets spend time on Monday working on this together- possibly we can narrow down the root cause- I have some time in your afternoon :)
MSDN's "Job Objects" page has some possible helpful info: https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx. I'm thinking we might be able to QueryInformationJobObject(NULL, ...) to get our current process' job handle, then use SetInformationJobObject() to set the limits we need on our job -- but that's very speculative and I'm no expert in this area.
Let's go through this together this evening.

I'll be running some job errands this afternoon, which will also mean we have more crossover time this evening.

Thanks Geoff for the msdn link - that page is really useful - indeed, it looks like I should be able to set some process options when I spawn the process, so I'm sure we can get this resolved swiftly.
logging into a machine, I was able to confirm a few things:
1) running the raw commands from a regular admin cmd.exe prompt, they run fast, either on c:\ or z:\
2) I can reproduce the problem from the generic-worker.exe (I hack a command prompt), but cannot reproduce from a cmd.exe
3) I created a new scheduled task to mirror the settings of the generic worker and instead ran cmd.exe- I was able run processes normally.

Basically we have the condition where when running jobs from the generic-worker.exe we are running inside a job object.  When we run from a regular shell or a scheduled task shell, we are not in a job object.

This is pointing towards that of the generic-worker.exe process launching needs to run without a jobObject, or adjust flags.  There is a good chance that the same fix will also remedy the performance issues we are seeing.

in going through this, it because evident that we could quickly iterate on a loaner box by getting access to the AMI, then relaunching the scheduled task with new workers.

One other inconsistency that I saw in the scheduled task definition [1] was that we have the user as |GenericWorker|, but for manually created tasks they are |i-<nodeid>\GenericWorker|.  I don't think this is a problem since importing a similar task but with just cmd.exe works as expected.

[1] https://pastebin.mozilla.org/8923714
Attached file jobobject.py
python script to validate if we are in a job object or not
Attached file testcreateprocess.go
my attempt at writing a go program so it will launch a cmd.exe shell to run |python jobobject.py| from either createprocess or createprocesswithlogon.

When I use createprocess, we are not in a jobobject
When I use createprocesswithlogon, we are IN a jobobject

I would expect us to ensure that we pass our tests (ideally NOT IN a job object) prior to calling this fixed.

:pmoore, let me know if you need more information.
Flags: needinfo?(pmoore)
It looks like CreateProcessAsUser[1] system call might avoid the job creation - I'm going to have a play swapping out CreateProcessWithLogonW[2] with CreateProcessAsUser.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/ms682429(v=vs.85).aspx
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/ms682431(v=vs.85).aspx

This is based on an optimistic interpretation of https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L140 and https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L166-L195 whereby I assume that sub.NoJob is set to true if a job already exists (and therefore no new job should be created).
Flags: needinfo?(pmoore)
I have a build currently running (7.0.0alpha1) using CreateProcessAsUser instead of CreateProcessWithLogonW.

I will test this when I've set up a win7 environment to use it.

----------------------

In addition, I discovered the following, which supports the hypothesis that moving to CreateProcessAsUser is the way to go for Windows 7:

http://stackoverflow.com/questions/27339341/job-control-with-createprocesswithlogonw

--------------

    Interactions with Win32 Job Objects

    CreateProcessWithLogonW executes the new process as a child of the Secondary Logon service, which has the outcome of making the process escape any Job Object membership/restrictions even if the Job Object did not allow breakaway.

    Furthermore, the Secondary Logon service automatically creates its own new Job Object and assigns the new process into it. As such, it is not possible for the caller to explicitly assign the new process to any other Job Object (since a process may only be assigned to one Job Object, and can never be removed from a Job Object once it has been assigned to one).
Blast - will need to debug this....

2016/11/02 15:39:27 Running command: 'cmd' '/c' 'del' '/q' '/f' 'Z:\task_1478100703'
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x10 pc=0x75e0d2]

goroutine 1 [running]:
panic(0x7bb080, 0x11e92030)
        /home/travis/go/src/runtime/panic.go:500 +0x331
github.com/taskcluster/generic-worker/syscall.StartProcess(0x12047e00, 0x1b, 0x11ff3a40, 0x6, 0x6, 0x11f59838, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, ...)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/syscall/exec_windows.go:144 +0x6f2
github.com/taskcluster/generic-worker/os.startProcess(0x12047e00, 0x1b, 0x11ff3a40, 0x6, 0x6, 0x11f5998c, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, ...)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/os/exec_windows.go:41 +0x2d3
github.com/taskcluster/generic-worker/os.StartProcess(0x12047e00, 0x1b, 0x11ff3a40, 0x6, 0x6, 0x11f5998c, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, ...)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/os/doc_windows.go:17 +0x7f
github.com/taskcluster/generic-worker/os/exec.(*Cmd).Start(0x11eb7080, 0x0, 0x0)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/os/exec/exec_windows.go:153 +0x620
github.com/taskcluster/generic-worker/os/exec.(*Cmd).Run(0x11eb7080, 0x0, 0x0)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/os/exec/exec_windows.go:354 +0x2b
main.runCommands(0x0, 0x11fe4cc0, 0xf, 0x11ff1a40, 0x1d, 0x11f59b20, 0x1, 0x1, 0x0, 0x0)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:547 +0x2cf
main.deleteHomeDir(0x11ff19a0, 0x12, 0x11fe4cc0, 0xf, 0x0, 0x0)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:85 +0x4b7
main.deleteHomeDirs()
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:219 +0x584
main.deleteExistingOSUsers()
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:186 +0x1e
main.taskCleanup(0x0, 0x0)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:383 +0xf8
main.startup(0x0, 0x0)
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:68 +0x9e
main.runWorker()
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x38
main.main()
        /home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:285 +0x648
The above issue was fixed relatively easily, but now we have a tricky problem....

-------------------------------------------------------

2016/11/03 12:33:48 Running command: 'cmd' '/c' 'md' 'Z:\task_1478176427\public\logs'

username: "task_1478176427"
password: "pWd0_KDzLh3SFtTIlI7pa6xjyKizo"
LOGON_WITH_PROFILE: '\x01'
argv0p: %!q(*uint16=0x1202ad00)
argvp: %!q(*uint16=0x120088a0)
flags: 'Ѐ'
attr.Env: ["=C:=C:\\gopath\\src\\github.com\\taskcluster\\generic-worker" "=ExitCode=00000000" "ALLUSERSPROFILE=C:\\ProgramData" "APPDATA=C:\\Users\\GenericWorker\\AppData\\Roaming" "CLIENTNAME=Petes-iMac.loca" "CommonProgramFiles=C:\\Program Files\\Common Files" "COMPUTERNAME=I-00107EBFEFA9C" "ComSpec=C:\\windows\\system32\\cmd.exe" "DCLOCATION=SCL3" "DNSSUFFIX=use1.mozilla.com" "FP_NO_HOST_CHECK=NO" "GOPATH=C:\\gopath" "GOROOT=C:\\go" "HOMEDRIVE=C:" "HOMEPATH=\\Users\\GenericWorker" "KTS_HOME=C:\\Program Files\\KTS" "KTS_VERSION=1.19c" "LOCALAPPDATA=C:\\Users\\GenericWorker\\AppData\\Local" "LOGONSERVER=\\\\I-00107EBFEFA9C" "MOZILLABUILD=C:\\mozilla-build" "NUMBER_OF_PROCESSORS=8" "OS=Windows_NT" "Path=C:\\Program Files\\Windows Resource Kits\\Tools\\;C:\\Program Files\\Mercurial;C:\\mozilla-build\\7zip;C:\\mozilla-build\\info-zip;C:\\mozilla-build\\kdiff3;C:\\mozilla-build\\moztools-x64\\bin;C:\\mozilla-build\\mozmake;C:\\mozilla-build\\msys\\bin;C:\\mozilla-build\\msys\\local\\bin;C:\\mozilla-build\\nsis-3.0b3;C:\\mozilla-build\\nsis-2.46u;C:\\mozilla-build\\python;C:\\mozilla-build\\python\\Scripts;C:\\mozilla-build\\upx391w;C:\\mozilla-build\\wget;C:\\mozilla-build\\yasm;C:\\windows\\system32;C:\\windows;C:\\windows\\System32\\Wbem;C:\\windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\mozilla-build\\python27;C:\\mozilla-build\\python27\\Scripts;C:\\mozilla-build\\vim\\vim72;C:\\CoreUtils\\bin;C:\\mozilla-build\\buildbotve\\scripts;c:\\Program Files\\Microsoft Windows Performance Toolkit\\;C:\\Users\\cltbld\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0;C:\\ProgramData\\chocolatey\\bin;C:\\Program Files\\Puppet Labs\\Puppet\\bin;C:\\mozilla-build\\hg" "PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC" "PIP_DOWNLOAD_CACHE=y:\\pip-cache" "PROCESSOR_ARCHITECTURE=x86" "PROCESSOR_IDENTIFIER=x86 Family 6 Model 62 Stepping 4, GenuineIntel" "PROCESSOR_LEVEL=6" "PROCESSOR_REVISION=3e04" "ProgramData=C:\\ProgramData" "ProgramFiles=C:\\Program Files" "PROMPT=$P$G" "PSModulePath=C:\\windows\\system32\\WindowsPowerShell\\v1.0\\Modules\\;C:\\Program Files\\AWS Tools\\PowerShell\\" "PUBLIC=C:\\Users\\Public" "SESSIONNAME=RDP-Tcp#0" "SystemDrive=C:" "SystemRoot=C:\\windows" "TEMP=C:\\Users\\GenericWorker\\AppData\\Local\\Temp" "TMP=C:\\Users\\GenericWorker\\AppData\\Local\\Temp" "TOOLTOOL_CACHE=y:\\tooltool-cache" "USERDOMAIN=I-00107EBFEFA9C" "USERNAME=GenericWorker" "USERPROFILE=C:\\Users\\GenericWorker" "windir=C:\\windows" "windows_tracing_flags=3" "windows_tracing_logfile=C:\\BVTBin\\Tests\\installpackage\\csilogfile.log"]
dirp: %!q(*uint16=<nil>)
si: &{'D' %!q(*uint16=<nil>) %!q(*uint16=<nil>) %!q(*uint16=<nil>) '\x00' '\x00' '\x00' '\x00' '\x00' '\x00' '\x00' 'Ā' '\x00' '\x00' %!q(*uint8=<nil>) 'Ɛ' '\x0f' '\x13'}
pi: &{'\x00' '\x00' '\x00' '\x00'}

2016/11/03 12:33:48 fork/exec C:\windows\system32\cmd.exe: A required privilege is not held by the client.

-------------------------------------------------------

I installed Windows Server 2003 Resource Kit Tools from https://www.microsoft.com/en-us/download/confirmation.aspx?id=17657 and ran:

C:\>ntrights -u GenericWorker +r SeIncreaseQuotaPrivilege
 Granting SeIncreaseQuotaPrivilege to GenericWorker   ... successful

C:\>ntrights -u GenericWorker +r SeAssignPrimaryTokenPrivilege
 Granting SeAssignPrimaryTokenPrivilege to GenericWorker   ... successful

C:\>

However, that also did not fix it. Some reasons I can think of that might be the cause of this:

1) There are other privileges needed
2) ntrights technically isn't supported on windows 7, so maybe it didn't work even though output suggested it did (unlikely)
3) The task_* user accounts need the privilege(s) rather than the GenericWorker account
4) The user accounts are ok, but the access token retrieved in the login system call requires the privileges
5) The generic-worker.exe process also requires the privileges, which were given to GenericWorker user that launched the process

The contester library https://github.com/contester/runlib doesn't seem to refer explicitly to these privileges, and presumably works, so this leads me to suspect this is more likely related to how the environment is set up, rather than requiring runtime manipulation of policy.

This is a classic case of an error message not describing the instance data it relates to (i.e. not mentioning which privilege is not held by which client). :(
I've managed to reproduce this issue when using the runexe library directly (see https://github.com/contester/runlib):

C:\>runexe -l mozilla -p qwe12345!@#$% C:\Windows\System32\notepad.exe
Invocation failed: Program
Comment: CreateFrozen/CreateProcessAsUser/syscall:CreateProcessAsUser: A required privilege is not held by the client.

The runlib/runexe tool uses the same runlib/subprocess library that we are using to manage process execution. I've therefore contacted the author of runlib to ask if he knows what the problem might be. The runlib library is used for testing code submissions on http://codeforces.com/ so is known to work - so hopefully the resolution is just granting some permissions somewhere, to some object in the environment, in order to get it working.

In parallel, I will see if I can grant the local security policy privileges by some other means than using Windows Server 2003 Resource Kit, in case the problem is running this on Windows 7 (remote test worker) / Windows Server 2008 (local development VM).
Note, this is what I've already tried, which hasn't fixed it so far. "peter james moore" is the admin account that is running runexe, "mozilla" is the test account that the notepad.exe process should be executed under (from the example in comment 35).


C:\Windows\system32>ntrights -u "peter james moore" +r SeIncreaseQuotaPrivilege
 Granting SeIncreaseQuotaPrivilege to peter james moore   ... successful

C:\Windows\system32>ntrights -u "peter james moore" +r SeAssignPrimaryTokenPrivilege
 Granting SeAssignPrimaryTokenPrivilege to peter james moore   ... successful

C:\Windows\system32>ntrights -u mozilla +r SeIncreaseQuotaPrivilege
 Granting SeIncreaseQuotaPrivilege to mozilla   ... successful

C:\Windows\system32>ntrights -u mozilla +r SeAssignPrimaryTokenPrivilege
 Granting SeAssignPrimaryTokenPrivilege to mozilla   … successful

C:\Windows\system32>wmic useraccount where name='mozilla' set passwordexpires=false
Updating property(s) of '\\WIN-7LN26LS103P\ROOT\CIMV2:Win32_UserAccount.Domain="WIN-7LN26LS103P",Name="mozilla"'
Property(s) update successful.

C:\Windows\system32>net localgroup "Remote Desktop Users" /add "peter james moore"
The command completed successfully.
Agggghhhhhhhhhhhh

A restart fixed the runexe problem!!!!

Let me see if it fixes the problem in generic worker too.............
Assignee: rthijssen → pmoore
Now I'm closer - getting

2016/11/03 19:17:18 ERROR encountered: Task not successful due to following exception(s):
Exception 1)
exit status -1073741502

I believe this is "STATUS_DLL_INIT_FAILED" but I'll need to look into this tomorrow.
So ........

It turns out that using the mechanics of the runlib library for also creating desktop, windows station etc (not just logging in and loading user profile) will resolve the problem.

I've spent considerable time working my way through the entire code base to understand all these additional parts, and have reached the conclusion that the best approach would be to use higher level functions provided by this library that take care of these concerns under the hood. Fortunately the library is very well written (if not very well documented) such that the code for running processes under different user accounts is cleanly separated from the parts which take care of testing software submissions for software contests. However, this still requires quite a bit of rework in the generic worker to abstract the concepts around execution of commands, resulting processes, and results from completed processes, since runlib is only designed for windows/linux and has no darwin support. This isn't too complicated, but requires a bit of work. In any case, this was in need of doing anyway, as it wasn't very well architected as it was.

Advantages that this approach gives us:

* we should now have separate desktops for each task - this provides a framework for building the "interactive" feature on top of, for windows loaners
* we can potentially run the generic worker as a service, which is cleaner on many counts (and has profound security benefits)
* it makes having a native engine on linux with separate user accounts a step closer
* we can differentiate between wall time / kernel time / user time when running processes, which means regardless of other load on a machine, we can get accurate data for performance of a task
* we have the possibility to add new features to generic worker to take advantage of the runlib features such as:
  o limit memory usage
  o restrict UI access
  o manage process affinity
  o apply limits for process user time / kernel time
  o inject DLL
  o pause processes

Regardless, solving this bug should also solve the performance problems we've been witnessing, as a result of running in a job group.
this sounds freaking awesome \o/
So I'm pretty close! I finished writing the code, now I am just debugging it / fixing it.

Have a problem with the stdout/stderr/stdin management which I haven't quite got to the bottom of yet. In runlib they are files, rather than io.Reader/io.Writer objects. This shouldn't be too much of a problem, but it isn't quite working yet - so I'll have to part this until Monday morning.

But code is implemented, so just a case of debugging. This is all going on the createprocessasuser git branch of generic worker...
The unit/integration tests are all passing now!
https://github.com/taskcluster/generic-worker/pull/28

I then manually tested by deploying on ami-test-win2012r2, but discovered there is still a permissions problem:


Task log:

[taskcluster 2016-11-06T12:50:01.162Z] goroutine 1 [running]:
[taskcluster 2016-11-06T12:50:01.162Z] runtime/debug.Stack(0xc04233b860, 0x7f23c0, 0xc0421221e0)
[taskcluster 2016-11-06T12:50:01.162Z] 	C:/Go/src/runtime/debug/stack.go:24 +0x80
[taskcluster 2016-11-06T12:50:01.162Z] main.(*TaskRun).run.func2(0xc042214c18, 0xc0422a5000, 0xa46460, 0xc042214ca8)
[taskcluster 2016-11-06T12:50:01.162Z] 	C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1024 +0x25e
[taskcluster 2016-11-06T12:50:01.162Z] panic(0x7f23c0, 0xc0421221e0)
[taskcluster 2016-11-06T12:50:01.162Z] 	C:/Go/src/runtime/panic.go:458 +0x251
[taskcluster 2016-11-06T12:50:01.162Z] main.(*TaskRun).run(0xc0422a5000, 0x0)
[taskcluster 2016-11-06T12:50:01.162Z] 	C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1045 +0x1738
[taskcluster 2016-11-06T12:50:01.162Z] main.FindAndRunTask(0x3b9aca00)
[taskcluster 2016-11-06T12:50:01.162Z] 	C:/gopath/src/github.com/taskcluster/generic-worker/main.go:517 +0x235
[taskcluster 2016-11-06T12:50:01.162Z] main.runWorker()
[taskcluster 2016-11-06T12:50:01.162Z] 	C:/gopath/src/github.com/taskcluster/generic-worker/main.go:447 +0x43b
[taskcluster 2016-11-06T12:50:01.162Z] main.main()
[taskcluster 2016-11-06T12:50:01.162Z] 	C:/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x7fa
[taskcluster 2016-11-06T12:50:01.162Z] 
[taskcluster 2016-11-06T12:50:01.162Z] &os.SyscallError{Syscall:"CreateDesktop", Err:0x5}


Worker log:

2016/11/06 13:53:12 Shutting down immediately - panic occurred!
2016/11/06 13:53:12 goroutine 1 [running]:
runtime/debug.Stack(0x87efd1, 0x2b, 0x0)
        C:/Go/src/runtime/debug/stack.go:24 +0x80
main.runWorker.func1()
        C:/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x95
panic(0x7f23c0, 0xc042146920)
        C:/Go/src/runtime/panic.go:458 +0x251
main.(*TaskRun).run.func1(0xc04201e0b8, 0xc042403000)
        C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1012 +0x202
panic(0x7f23c0, 0xc042146920)
        C:/Go/src/runtime/panic.go:458 +0x251
main.(*TaskRun).run.func2(0xc04201e0b8, 0xc042403000, 0xa46460, 0xc04201e128)
        C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1030 +0x132
panic(0x7f23c0, 0xc042146920)
        C:/Go/src/runtime/panic.go:458 +0x251
main.(*TaskRun).run(0xc042403000, 0x0)
        C:/gopath/src/github.com/taskcluster/generic-worker/main.go:1045 +0x1738
main.FindAndRunTask(0x3b9aca00)
        C:/gopath/src/github.com/taskcluster/generic-worker/main.go:517 +0x235
main.runWorker()
        C:/gopath/src/github.com/taskcluster/generic-worker/main.go:447 +0x43b
main.main()
        C:/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x7fa

2016/11/06 13:53:12 Cause: CreateDesktop: Access is denied.


I'll have a look into this on Monday. In the morning I have an appointment with the city council, so will be starting a bit later.

Note - the unit tests run everything as current user (since unit tests run without admin privileges) which is why the unit tests/integration tests didn't pick this up. Anyway, hope to get to the bottom of this on Monday, then I can make a new release.
So, unbelievably, I have it working!!!

I can locally run the generic worker on a local VM, it is now using the runlib library to handle desktop/session creation and process execution, and ... works!

In order to have this running in production, tomorrow I just need to:

1) automate assigning required local security privileges to GenericWorker user during AMI creation for all the windows aws provisioner worker types (both relops-managed OpenCloudConfig worker types and taskcluster internally managed worker types)
2) minor task logging fixes (to better show what commands are being executed, and give useful feedback after commands complete)
3) make a new release, and roll it out to the same worker types that were tweaked in step 1 to have extra required permissions

At a point in the future (in other bugs) we can look at running generic worker as a service rather than a login item, and also implementing interactive feature for instant loaners, and even maybe adding more controls around processes (user time / kernel time timeouts, memory caps, etc)....
I made release 7.0.0alpha 4 and am currently rolling it out on ami-test-win2012r2 worker type for manual testing, before rolling out globally...
I'm hitting a problem in production I'm not hitting locally, and am not sure of the cause yet.

I've made a release 7.0.0alpha5 with some more debugging info.

Very strange.

Locally I'm running Windows Server 2008 R2 - the remote test worker is Windows Server 2012 R2 - I don't expect that to be the reason, but I'm not sure yet.

The worker is not able to take a task, I'm getting

2016/11/08 19:25:28 Running command: 'wmic useraccount where name='task_1478633128' set passwordexpires=false'

panic: CreateFrozen/CreateProcess: Access is denied.

goroutine 1 [running]:
panic(0x7edfa0, 0xc0420064e0)
	C:/go/src/runtime/panic.go:500 +0x1af
main.runCommands(0x0, 0x0, 0xc0421419a8, 0x2, 0x2, 0x86ad73, 0x1)
	C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:548 +0x473
main.(*OSUser).createOSUserAccountForce(0xc042321c80, 0x836400, 0xc04234ba01, 0xc042321c80)
	C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:148 +0x3ae
main.(*OSUser).createNewOSUser(0xc042321c80, 0xc042321c80, 0x2)
	C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:130 +0x37
main.createNewTaskUser(0x1d, 0x31)
	C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:106 +0x220
main.taskCleanup(0xc042141c28, 0x1)
	C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:371 +0x113
main.startup(0x0, 0x8ef)
	C:/gopath/src/github.com/taskcluster/generic-worker/plat_windows.go:70 +0xab
main.runWorker()
	C:/gopath/src/github.com/taskcluster/generic-worker/main.go:392 +0x46
main.main()
	C:/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x7fa

It seems some more permissions are needed, which is a little baffling, but I'm sure I'll be able to get to the bottom of it. A little frustrating, but since it is working locally, we can't be far from success now.
I've fixed the problem, I believe - but making a new release to be absolutely sure.

My hypothesis is that, during AMI setup, there are some process / start up flag(s) in the runlib library[1] that assume either a graphical environment or impose some other requirement, which are not required/assumed by the go standard library call to CreateProcess[2]. Therefore, during *installation* of generic worker, I have reverted to using the standard go library for executing processes. I now only use the runlib library for executing task processes while the worker is running. This seems to have resolved the problem.

Whether it is related to the execution context of the bootstrapping process having a graphical context or not, the problem is most likely that the AMI bootstrapping process is executed under a system/headless context, which seems not to play nicely when invoking the subprocess functions of the runlib library directly - therefore switching seems the most convenient and simplest solution. It is quite likely the runlib library would also be capable of executing the installation processes, given permission tweaks, or adjusted library calls, but the investment of investigating the root cause seems not to be worth the effort considering we have a workable solution using the standard library for these calls.

To reiterate, the problem was about executing processes during the *installation* of generic worker on the "golden image" (generic worker is a self-installing executable), *not* the execution or processes of the generic worker when running tasks on a worker.

[1] https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L198-L210
[2] https://github.com/golang/go/blob/410974497365b2e6e23811dbc58475b00c69d0db/src/syscall/exec_windows.go#L324
Here is a production job that ran with the new worker (in this case for NSS).

The only strange thing I see is that the reported User Time / Kernel Time seems to be a bit off:

https://tools.taskcluster.net/task-inspector/#RsFuem4OTNipJ1H1TMeg2A/0

I'm considering pulling out the output of User Time / Kernel Time until it is understood why it is reporting strange results.

We explicitly avoid creating a job group - i'm wondering if this has a consequence that user time / kernel time is only tracked in the initiating process, rather than any sub processes created. This might explain why the numbers are so low.

Another option we have is to make the creation of a job group an attribute of the task payload, such that you choose either to opt in or opt out (not sure if "in" or "out" is best for default). If the wonky times are a consequence of explicitly not being in a job group, it might make sense to create a job group by default, and only opt out when it is known a job group is not allowed (such as the tests that caused this bug in the first place).

Maybe for now I'll roll out as is, and we can discuss / investigate in the coming days/weeks and decide how to proceed.
While I'm waiting for the AMI generation process for ami-test-win2012r2 worker type, I'll see if I can upgrade a win7 machine to use the new worker, and rerun an old task, such as this one:

https://queue.taskcluster.net/v1/task/DvCA9fJdQSWNuZemCVh_HQ
So I hijacked a running machine, killed the running worker, upgraded the binary to v7.0.0alpha7, changed the provisionerId (so another machine wouldn't take the required job), upgraded the credentials (since with the new provisionerId, the old ones weren't enough), changed the run script to not use --configure-for-aws (and thereby undo my config change), and manually ran the adjusted run-generic-worker.bat script (rather than rebooting, otherwise DSC would reset the generic worker binary) and then manually touched the dsc flag file C:\dsc\task-claim-state.valid to trigger the updated worker to start.

Then I modified the task from comment 49 to have updated timestamps and the altered provisionerId, and triggered this task run:

https://tools.taskcluster.net/task-inspector/#bdpo2BhkQZOt6G6ja6eucQ/0
The test failed due to screen resolution or mouse pointer location issues - no doubt a consequence of me being logged in.

However, it didn't fail because of being in a job group, yippee!!

In any case, I will submit a PR to OpenCloudConfig to pick this up and build new win7/win10 amis.

If that all goes well, I'll submit a second one to update win2012 amis (builders).
NSS job running against latest ami-test-win2012r2 image: https://tools.taskcluster.net/task-inspector/#fJflHoN-TMOUR1gdYC63pg/0
This is a huge change, but has been tested in production using a test worker type and making an alpha release or two (the most recent of which was 7.0.0alpha7).

This is currently blocking testing - I'd propose we roll out only on win7/win10 machines initially, as testing is currently blocked, so we can't lose anything.

Thanks!
Pete
Attachment #8808987 - Flags: review?(wcosta)
:grenade currently on PTO so assigning to rail.

Thanks!
Attachment #8809007 - Flags: review?(rail)
Comment on attachment 8809007 [details]
Github Pull Request for OpenCloudConfig

stamp
Attachment #8809007 - Flags: review?(rail) → review+
Attachment #8808987 - Flags: review?(wcosta) → review+
I'm heading off shortly, but please be aware, when https://tools.taskcluster.net/push-inspector/#/b8YWY5zaT52XeBmMUog85g?_k=6hjtgc completes, the win 7 / win 10 worker types will be available.

It can take up to 30 mins for the older workers to terminate, so after 30 mins of this task graph completing, you should be able to resubmit a task and it will pick up the new worker.

You will know it worked, if at the top of the log file, it says the generic worker version is 7.0.0.
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Wonderful! I will follow-up with some tests today.
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d46c58f0d8f7995438feae255fc749e479c08fa4 appears to use generic worker version 7.0.0 for most (maybe all) tests, but the same process launching problems are evident.

https://public-artifacts.taskcluster.net/O5Lu--5DTjS3yNF3fugtpQ/0/public/logs/live_backing.log


[taskcluster 2016-11-09T18:34:15.725Z]     "generic-worker": {
[taskcluster 2016-11-09T18:34:15.725Z]       "go-arch": "386",
[taskcluster 2016-11-09T18:34:15.725Z]       "go-os": "windows",
[taskcluster 2016-11-09T18:34:15.725Z]       "go-version": "go1.7.3",
[taskcluster 2016-11-09T18:34:15.725Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v7.0.0",
[taskcluster 2016-11-09T18:34:15.725Z]       "version": "7.0.0"

ProcessManager UNABLE to use job objects to manage child processes
ProcessManager NOT managing child processes

19:16:53     INFO -  [Parent 3416] WARNING: Failed to launch tab subprocess: file z:/task_1478711748/build/src/ipc/glue/GeckoChildProcessHost.cpp, line 576
19:16:53     INFO -  [Parent 3416] WARNING: Failed to launch tab subprocess: file z:/task_1478711748/build/src/ipc/glue/GeckoChildProcessHost.cpp, line 576
19:16:53     INFO -  [Parent 3416] WARNING: Failed to launch tab subprocess: file z:/task_1478711748/build/src/ipc/glue/GeckoChildProcessHost.cpp, line 576
19:16:53     INFO -  JavaScript error: chrome://global/content/bindings/remote-browser.xml, line 376: NS_ERROR_UNEXPECTED: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIScriptSecurityManager.getLoadContextCodebasePrincipal]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c937f1a5580136ae32bdc1c5b63cfeb4806e0925 adds my traditional process debug logging. 

Comparing bb to tc, I note that bb runs as an Administrator and tc does not; I cannot seem to get the os-groups to work now -- scope problems -- but maybe privileges are not really an issue? 

Anyway, the mozharness script is still having trouble with process management because at https://hg.mozilla.org/try/rev/f752479203c3a0999d8abeed4f8c4436b95889b4#l3.12, on bb we get limitflags == 2048 == JOB_OBJECT_LIMIT_BREAKAWAY_OK while on tc we get limitflags == 0.
Hey guys - I'm really sorry this didn't fix it first time - however, I'm sure we can get to the bottom of it, as all of the machinery is in place to make this possible now. I'll deep dive and when I come up for air, I'll let you know what I found!
d'oh - so it looks like the attribute "NoSub" is the inverse of what we might expect, based on

https://github.com/contester/runlib/blob/e1c5f73778c7081e9ce20b930999d03f51e78f4d/subprocess/subprocess_windows.go#L140

In this statement, it would use CreateProcessWithLogonW if the NoSub boolean is true - and we want to avoid using CreateProcessWithLogonW. The check for being Windows 8 or higher is also ok, because then it is fine to use CreateProcessWithLogonW (such as for Windows 10). So basically, I'm guessing I just need to invert the boolean here (so it is false, not true):

https://github.com/taskcluster/generic-worker/blob/7f6a71f0b6f05d3de37396e3b0fc77dc646a26ae/process/process_windows.go#L163

and then we are done! Hopefully!

So maybe "NoJob" doesn't mean "don't create a job" but means "calling process not already in a job"? Who knows. But the logic appears correct, if we invert it.

I'll hijack a worker, update the worker, rerun the test, and see!

Watch this space...
Good news! With the switch, I now see we are using CreateProcessAsUser instead of CreateProcessWithLogonW:

Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 2016/11/10 11:34:35 Shutting down immediately - panic occurred! 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 2016/11/10 11:34:35 goroutine 1 [running]: 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: runtime/debug.Stack(0x0, 0x0, 0x0) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/go/src/runtime/debug/stack.go:24 +0x80 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.runWorker.func1() 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:407 +0x66 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: panic(0x7b5cc0, 0x11e30e80) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/go/src/runtime/panic.go:458 +0x40b 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).run.func1(0x12016028, 0x11cecc00) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1012 +0x162 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: panic(0x7b5cc0, 0x11e30e80) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/go/src/runtime/panic.go:458 +0x40b 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).run.func2(0x12016028, 0x11cecc00, 0x9a14c0, 0x120171f0) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1030 +0x1c7 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: panic(0x7b5cc0, 0x11e30e80) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/go/src/runtime/panic.go:458 +0x40b 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).ExecuteCommand(0x11cecc00, 0x0, 0xce2f340) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:893 +0x43b 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.(*TaskRun).run(0x11cecc00, 0x0) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1097 +0xba0 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.FindAndRunTask(0x3b9aca00) 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:517 +0x2a9 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.runWorker() 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:447 +0x625 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: main.main() 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:284 +0x648 
Nov 10 12:34:37 i-06f7f245ad10f3176.gecko-t-win7-32.usw1.mozilla.com generic-worker: 2016/11/10 11:34:35 Cause: CreateFrozen/CreateProcessAsUser/syscall:CreateProcessAsUser: A required privilege is not held by the client. 

This is great, and also explains why I did not need to set the privileges in comment 34, as it was previously calling CreateProcessWithLogonW instead of CreateProcessAsUser.

Will hijack a new machine, add these privileges, and try again...
Hmmm, this is a little bit tricky - I need to reboot after adding permissions, but then DSC must be running at boot up, because after rebooting, the privilege is revoked. :/
I've managed to set the first privilege[1], trying again:

https://tools.taskcluster.net/task-inspector/#IJsfJe6pTo6QkO4BpVWHCQ/0

[1] "C:\Program Files\Windows Resource Kits\Tools\ntrights.exe" -u GenericWorker -r SeIncreaseQuotaPrivilege
Now trying again without logging in (to avoid screen resolution problems etc).
Stupidly forgot to upgrade the worker!

Trying again:

https://tools.taskcluster.net/task-inspector/#CA6cSzTMSkK7wWaTt4tm5Q/0
So I've spent MANY hours looking into this today, and basically, rolling out the new worker is pretty simple - and indeed the boolean flag change was required.

*** HOWEVER ***

Amazingly, assigning the security policy privilege to the generic worker user is non-trivial.

Locally I've been using ntrights.exe from Windows Server 2003 - but this isn't supported on other versions of windows.

In theory, a tool called secedit can be used - I haven't managed to get this to work comfortably - see https://github.com/taskcluster/generic-worker/blob/b8d2251c25979f011bcca65f4519c54464248a83/plat_windows.go#L438-L466

However, I have just discovered a system call I can use:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms721786(v=vs.85).aspx

I'm going to try to use this system call directly, rather than call out to secedit which isn't working for me at this moment in time.

I had some troubles testing on a live worker, as I can't get updates to stop rolling my local changes back (no doubt I'm doing something stupid). I spent quite a while trying to do this, but was unsuccessful, so I'm trying to write the patch so that I don't need to manually alter a machine, but we can roll with some functioning AMIs.

I've been testing locally on a VM and can confirm that granting this permission with ntrights.exe is sufficient to resolve the issue.

So I'm hoping tomorrow this issue will be gone...
:grenade kindly applied some updates to OpenCloudConfig to get the permissions working and upgraded the worker on win7 - now running again .....

https://tools.taskcluster.net/task-inspector/#Mc0T0zYnQTSxn-6RM2CUTQ/0
We are now failing with:


11:41:33     INFO - Running pre test command run mouse & screen adjustment script with 'c:\mozilla-build\python\python.exe Z:\task_1478862106\mozharness\external_tools\mouse_and_screen_resolution.py --configuration-file Z:\task_1478862106\mozharness\external_tools\machine-configuration.json'
11:41:33     INFO - Running command: ('c:\\mozilla-build\\python\\python.exe', 'Z:\\task_1478862106\\mozharness\\external_tools\\mouse_and_screen_resolution.py', '--configuration-file', 'Z:\\task_1478862106\\mozharness\\external_tools\\machine-configuration.json') in Z:\task_1478862106\build
11:41:33     INFO - Copy/paste: c:\mozilla-build\python\python.exe Z:\task_1478862106\mozharness\external_tools\mouse_and_screen_resolution.py --configuration-file Z:\task_1478862106\mozharness\external_tools\machine-configuration.json
11:41:36     INFO -  Screen resolution (current): (1600, 1200)
11:41:36     INFO -  Changing the screen resolution...
11:41:36     INFO -  Screen resolution (new): (1600, 1200)
11:41:36     INFO -  Mouse position (current): (0, 0)
11:41:36     INFO -  Mouse position (new): (0, 0)
11:41:36     INFO -  INFRA-ERROR: The new screen resolution or mouse positions are not what we expected


It might be that the test needs to be adjusted, as maybe it makes an assumption that the system display is attached to the process user's desktop - which is no longer the case. Perhaps the python code that changes the screen resolution needs to first get a handle on the current process's desktop handle, and adjust that desktop size.

This is just a hunch.

Of course, assuming this is the problem, this will be backwardly compatible with buildbot, so is more about fixing a test assumption that previously didn't cause a problem because it was running in a one-shared-desktop environment, whereas now we have secure the system by using sandboxed tasks running as non-privileged users with their own desktops.
Indeed, this looks to be the cause.

I'm not sure if it is possible to resize a desktop that isn't attached to a display, but I did see it is possible to switch desktops using

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686347(v=vs.85).aspx

Therefore the python code could probably make use of this syscall, to switch the active desktop to the desktop associated to the current running process, before it adjusts the desktop size.

I think we can probably close this bug now, as this is a new/different problem. Everyone ok with that?
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
See https://msdn.microsoft.com/en-us/library/windows/desktop/ms682573(v=vs.85).aspx for an overview of handling of desktops in Windows.

Note, we create a dedicated desktop for the given task using CreateDesktop call (https://msdn.microsoft.com/en-us/library/windows/desktop/ms682124(v=vs.85).aspx).
(In reply to Pete Moore [:pmoore][:pete] from comment #73)
> Indeed, this looks to be the cause.
> 
> I'm not sure if it is possible to resize a desktop that isn't attached to a
> display, but I did see it is possible to switch desktops using
> 
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms686347(v=vs.85).
> aspx

On closer reading, SwitchDesktop creates and opens a desktop. Since the desktop is created, we just want to open it.

https://msdn.microsoft.com/en-us/library/windows/desktop/ms684303(v=vs.85).aspx

This looks to be the better call.
running this on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f325de71041c07f65f155975ad78ff7cf46a390&selectedJob=30908170

I see win7 vm running (very slow compared to win10):
https://tools.taskcluster.net/task-inspector/#MLUpDxU0TTm3hnZ6n15-tA/0

this has worker:
[taskcluster 2016-11-11T15:03:11.838Z]     "generic-worker": {
[taskcluster 2016-11-11T15:03:11.838Z]       "go-arch": "386",
[taskcluster 2016-11-11T15:03:11.838Z]       "go-os": "windows",
[taskcluster 2016-11-11T15:03:11.838Z]       "go-version": "go1.7.3",
[taskcluster 2016-11-11T15:03:11.838Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v7.0.1alpha1",
[taskcluster 2016-11-11T15:03:11.838Z]       "version": "7.0.1alpha1"
[taskcluster 2016-11-11T15:03:11.838Z]     },


with my try push, I disable the mouse and screen resolution preflight tool to run, so now we try to run firefox and get the same error:
ProcessManager UNABLE to use job objects to manage child processes
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
I've taken another look.

In release 7.0.0 we set `NoJob` to true, to avoid creating a job. In release 7.0.1, we set `NoJob` to false. It looks like neither work, and digging a bit deeper, I found out why:

If we set `NoJob` to true (as we did in 7.0.0), CreateProcessWithLogonW syscall gets used - which creates a job group.

If we set `NoJob` to false, we use the correct syscall, but a job gets created manually:
https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L249-L272

In other words, switching this from true to false got us out of the frying pan, and into the fire.

Now taking a close look: https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L190 suggests that we will still be ok, even if we manually create a job, since CREATE_BREAKAWAY_FROM_JOB is set here in the CreateProcess syscall: https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L190 

In other words - being tied to a job is ok, if we can breakaway from it, because aiui the mozprocess library only throws its hands up in the air if the process is already associated to a job, *and* it can't break away from it.

So on first glance, this seems ok, but then https://msdn.microsoft.com/en-us/library/windows/desktop/ms684147(v=vs.85).aspx says:

"JOB_OBJECT_LIMIT_BREAKAWAY_OK: If any process associated with the job creates a child process using the CREATE_BREAKAWAY_FROM_JOB flag while this limit is in effect, the child process is not associated with the job."

In other words, it is not sufficient to just set this flag on the process, the JOBOBJECT_BASIC_LIMIT_INFORMATION structure associated to the job group needs to have JOB_OBJECT_LIMIT_BREAKAWAY_OK set too.

And we discover on line 303, that this flag is *NOT* being set:
https://github.com/contester/runlib/blob/master/subprocess/subprocess_windows.go#L303

Therefore, the flag CREATE_BREAKAWAY_FROM_JOB on line 190 has little effect.

I believe this is the root cause.

Since the subprocess library is a third party library, I think the swiftest solution is to fork it, fix the issue, and then submit a PR back to the original author, to see if he is interested in adopting the change too.
I've forked the runlib library and adapted as follows:
https://github.com/taskcluster/runlib/commit/89093431f1d669a007d99ef3fa7362f2fc5faa69

And in generic worker, I now refer to our fork, rather than the original library:
https://github.com/taskcluster/generic-worker/commit/4106ae398c2a2f89d292cafd73f1188c87a0aaa1

There were several options for implementing, including setting the JOB_OBJECT_LIMIT_BREAKAWAY_OK limit flag explicitly in the job - but I decided to opt to not to create a job at all, for simplicity, so set NoJob back to true, and made sure CreateProcessAsUser was used instead of CreateProcessWithLogonW when this flag is set on Windows 7.

I do want to check that without a job, the task still terminates when it times out, as this is no longer managed by the job.
Thanks :grenade for mering OCC update.

New AMIs being generated here: https://tools.taskcluster.net/task-group-inspector/#/b_xzcV7OR0ugm9Z8PVDdtw

When this task is complete, we should be able to post new jobs - HOWEVER ....

Note - *THE WORKER TYPE HAS CHANGED* so tasks should use worker type "gecko-t-win7-32-beta" (note the "-beta" suffix).

This worker type is something grenade has set up so we can test AMI changes without impacting other work. Thanks grenade! ++
So logging onto a live machine, and playing around with Process Explorer and Process Monitor sadly did not reveal a lot.

Processes were run with low priority, but bumping the priority higher did not improve the situation (this can be done interactively in Process Explorer).

Process Monitor revealed a lot of file access to mozharness log files, but that seems reasonable, as there are writes across multiple files for every line logged.

From Process Explorer, it was possible to see that no job group is getting created, which is what we wanted.

I am now not quite sure which avenues to explore to understand why the processes are running so slowly.

I checked MSDN to see if I could find extra parameters / settings in StartupInfo or ProcessInfo that might be relevant, but couldn't see anything suspicious.

Maybe a good night's sleep will clear my head and in the morning I'll have a bright fancy idea.

Feel free to play around with workers and let me know if you find anything / have any ideas about how we might proceed.

Maybe a dump from process monitor might be useful...
Maybe we should have another face-to-face tomorrow to brainstorm / live debug.
pmoore: this is fixed now right?
Flags: needinfo?(pmoore)
\o/

Yes, I believe bug 1312383 solved this for us. Thanks Rob!
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(pmoore)
Resolution: --- → FIXED
Blocks: 1316858
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: