Closed Bug 1594353 Opened 5 years ago Closed 5 years ago

generic-worker: "Not able to secure config file" on Windows

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

Details

Attachments

(1 file)

On nss-win2012r2-new (nss staging windows worker type) I deployed generic-worker 16.5.4 and hit the following failure in the logs of all workers that were started up:

2019/11/06 10:43:33 UTC Not able to secure config file 
2019/11/06 10:43:33 UTC The operation completed successfully.

This is strange for a number of reasons:

  1. Nothing related to securing config files changed in the new release, so indicates this may be a change to the base image that the worker was built on.

  2. The actual error is syscall.Errno(0) which typically means "no error", however the go library we are using to apply ACLs previously returned nil when there was no error, and since we vendor this library, there is no reason it should have changed. It also hasn't been modified since June 2019, so even if the vendoring was broken, it would not account for the change.

  3. Even though we get back an error to reflect that no error had occurred (syscall.Errno(0)) we can see that the ACL had not been applied to the file that should have been updated. Compare the results from a "good" worker (different worker type):

Good worker (aws-provisioner-v1/nss-win2012r2):

Administrator@WIN-UM4KQUJ7J6D:~ $ icacls 'C:\generic-worker\generic-worker.config'
C:\generic-worker\generic-worker.config BUILTIN\Administrators:(F)

Successfully processed 1 files; Failed processing 0 files

Bad worker (aws-provisioner-v1/nss-win2012r2-new):

Administrator@WIN-3PS5T72K022:~ $ icacls 'c:\generic-worker\generic-worker.config'
c:\generic-worker\generic-worker.config NT AUTHORITY\SYSTEM:(I)(F)
                                        BUILTIN\Administrators:(I)(F)
                                        BUILTIN\Users:(I)(RX)

Successfully processed 1 files; Failed processing 0 files

I checked that the Windows Service was running as LocalSystem account, to be sure that the issue wasn't that an unprivileged user was trying to apply the ACL, but indeed it was.

Then as Administrator, I manually set the ACL:

Administrator@WIN-3PS5T72K022:~ $ icacls 'C:\generic-worker\generic-worker.config' /grant:r 'Administrators:(GA)' /inheritance:r
processed file: C:\generic-worker\generic-worker.config
Successfully processed 1 files; Failed processing 0 files

Then I checked that the permissions were correct:

Administrator@WIN-3PS5T72K022:~ $ icacls 'C:\generic-worker\generic-worker.config'
C:\generic-worker\generic-worker.config BUILTIN\Administrators:(F)

Successfully processed 1 files; Failed processing 0 files
Administrator@WIN-3PS5T72K022:~ $ 

Finally, since the permissions were now ok, I started the worker again to check if the problem was resolved, but to my surprise I got exactly the same failure.

I can only conclude that for some reason there may have been a subtle change to the system call interface in the version of Windows Server 2012 R2 that this worker runs on, such that the previously working third party code that executes the system calls, no longer works.

I think the most pragmatic solution is for us to remove the dependency on the go-acl library and call out to the system icacls executable instead. This has two other advantages:

  1. it reduces the generic-worker footprint by depending on fewer go libraries
  2. in case of failures, icacls output is probably more descriptive / well understood / better documented than the third party go library
  3. icacls will likely continue being supported/maintained by Microsoft as a standard system tool, whereas there aren't really guarantees of support for the go-acl library
Assignee: nobody → pmoore
Status: NEW → ASSIGNED

Another thing that I can't work out yet, is that the go-acl library already replaces syscall.Errno(0) with nil so we should never be able to get syscall.Errno(0) back from a function call:

Now things get even weirder.

If I replace generic-worker 16.5.4 with generic-worker 16.5.3 on the same worker, it does not hit the issue.

Now I'm ..... confused.

And even more bizarrely - we are running 16.5.4 successfully on Windows 10. So it seems not all Windows versions are affected:

https://tools.taskcluster.net/groups/cyc19w3gQM2RmZ0pgoEr7g/tasks/YME8LhlXSWuk1kEbc6xZTQ/runs/0/logs/public%2Flogs%2Flive.log#L24

OK, mystery solved. At least, most of it.

There was a bug where the generic-worker config file filename wasn't set, so the ACL was being executed against the file "".

I have no idea why this didn't cause a problem on Windows 7 and Windows 10, and I also still don't yet understand why the error was syscall.Errno(0) but I'm sure with a bit of digging, now the root cause is understood, we can find out the reason for both of these open matters.

In the meantime, I've made a fix and pushed it to the PR branch in this bug.

In hindsight, with closer inspection of the log, I should have seen that the config file is logged in the message, and from the output in comment 0, that it was the empty string.

Released in generic-worker 16.5.5.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: