generic-worker: "Not able to secure config file" on Windows
Categories
(Taskcluster :: Workers, defect)
Tracking
(Not tracked)
People
(Reporter: pmoore, Assigned: pmoore)
Details
Attachments
(1 file)
On nss-win2012r2-new (nss staging windows worker type) I deployed generic-worker 16.5.4 and hit the following failure in the logs of all workers that were started up:
2019/11/06 10:43:33 UTC Not able to secure config file
2019/11/06 10:43:33 UTC The operation completed successfully.
This is strange for a number of reasons:
-
Nothing related to securing config files changed in the new release, so indicates this may be a change to the base image that the worker was built on.
-
The actual error is
syscall.Errno(0)
which typically means "no error", however the go library we are using to apply ACLs previously returnednil
when there was no error, and since we vendor this library, there is no reason it should have changed. It also hasn't been modified since June 2019, so even if the vendoring was broken, it would not account for the change. -
Even though we get back an error to reflect that no error had occurred (
syscall.Errno(0)
) we can see that the ACL had not been applied to the file that should have been updated. Compare the results from a "good" worker (different worker type):
Good worker (aws-provisioner-v1/nss-win2012r2):
Administrator@WIN-UM4KQUJ7J6D:~ $ icacls 'C:\generic-worker\generic-worker.config'
C:\generic-worker\generic-worker.config BUILTIN\Administrators:(F)
Successfully processed 1 files; Failed processing 0 files
Bad worker (aws-provisioner-v1/nss-win2012r2-new):
Administrator@WIN-3PS5T72K022:~ $ icacls 'c:\generic-worker\generic-worker.config'
c:\generic-worker\generic-worker.config NT AUTHORITY\SYSTEM:(I)(F)
BUILTIN\Administrators:(I)(F)
BUILTIN\Users:(I)(RX)
Successfully processed 1 files; Failed processing 0 files
I checked that the Windows Service was running as LocalSystem
account, to be sure that the issue wasn't that an unprivileged user was trying to apply the ACL, but indeed it was.
Then as Administrator, I manually set the ACL:
Administrator@WIN-3PS5T72K022:~ $ icacls 'C:\generic-worker\generic-worker.config' /grant:r 'Administrators:(GA)' /inheritance:r
processed file: C:\generic-worker\generic-worker.config
Successfully processed 1 files; Failed processing 0 files
Then I checked that the permissions were correct:
Administrator@WIN-3PS5T72K022:~ $ icacls 'C:\generic-worker\generic-worker.config'
C:\generic-worker\generic-worker.config BUILTIN\Administrators:(F)
Successfully processed 1 files; Failed processing 0 files
Administrator@WIN-3PS5T72K022:~ $
Finally, since the permissions were now ok, I started the worker again to check if the problem was resolved, but to my surprise I got exactly the same failure.
I can only conclude that for some reason there may have been a subtle change to the system call interface in the version of Windows Server 2012 R2 that this worker runs on, such that the previously working third party code that executes the system calls, no longer works.
I think the most pragmatic solution is for us to remove the dependency on the go-acl library and call out to the system icacls
executable instead. This has two other advantages:
- it reduces the generic-worker footprint by depending on fewer go libraries
- in case of failures,
icacls
output is probably more descriptive / well understood / better documented than the third party go library icacls
will likely continue being supported/maintained by Microsoft as a standard system tool, whereas there aren't really guarantees of support for thego-acl
library
Assignee | ||
Comment 1•5 years ago
|
||
Assignee | ||
Comment 2•5 years ago
|
||
Another thing that I can't work out yet, is that the go-acl library already replaces syscall.Errno(0)
with nil
so we should never be able to get syscall.Errno(0)
back from a function call:
Assignee | ||
Comment 3•5 years ago
|
||
Now things get even weirder.
If I replace generic-worker 16.5.4 with generic-worker 16.5.3 on the same worker, it does not hit the issue.
Now I'm ..... confused.
Assignee | ||
Comment 4•5 years ago
|
||
And even more bizarrely - we are running 16.5.4 successfully on Windows 10. So it seems not all Windows versions are affected:
Assignee | ||
Comment 5•5 years ago
|
||
OK, mystery solved. At least, most of it.
There was a bug where the generic-worker config file filename wasn't set, so the ACL was being executed against the file "".
I have no idea why this didn't cause a problem on Windows 7 and Windows 10, and I also still don't yet understand why the error was syscall.Errno(0)
but I'm sure with a bit of digging, now the root cause is understood, we can find out the reason for both of these open matters.
In the meantime, I've made a fix and pushed it to the PR branch in this bug.
In hindsight, with closer inspection of the log, I should have seen that the config file is logged in the message, and from the output in comment 0, that it was the empty string.
Assignee | ||
Comment 6•5 years ago
|
||
Released in generic-worker 16.5.5.
Description
•