Closed Bug 1458873 Opened 6 years ago Closed 6 years ago

Process termination when aborting task not always successful on Windows

Categories

(Taskcluster :: Workers, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

When a task exceeds its maximum run time, the worker aborts it: running processes are killed and the task is resolved.

It seems that in some situations the processes continue running, which prevents the task from being resolved by the worker, since resolution happens on return from the synchronous process termination call.
On initial inspection, it looks like https://golang.org/pkg/os/#Process.Kill returns without error, but the process does not *actually* get killed.

Digging deeper. A bug in the Go standard library?
The parent process is a batch script that runs other processes. Clearly, killing the parent process in isolation isn't enough to kill the entire process tree, which explains why processes keep running.

However, it isn't clear why execution does not continue once the process is killed: the worker waits for the parent process to complete before resolving the task, so even if subprocesses hadn't been killed, I would expect the task resolution workflow to kick in when the parent process dies.

Still investigating...
It looks like the problem is with the way a command is aborted.

The currently running process is terminated, and this happens immediately: inspecting the process table confirms that the process has disappeared.

However, the call to https://golang.org/pkg/os/exec/#Cmd.Wait does not only wait for the process to die; it also waits for I/O to complete on the process's file handles (stdin/stdout/stderr).

My current hypothesis is that subprocesses have inherited these handles, and since subprocesses are not automatically terminated when the parent is, Cmd.Wait will only return once all subprocesses that inherited the handles have also completed and the handles have been released.

On Windows you would typically create a job object, associate it with the spawned process, and set subprocesses to automatically join the job, so that the entire process tree can be terminated. This would certainly solve the issue on Windows 10, but on Windows 7 life is slightly more complicated: nested job objects are not supported there, and some tasks already use job objects, so those would need to be allowed to "break away" from the job the worker creates, ultimately meaning that we wouldn't be able to terminate all subprocesses. See https://bugzilla.mozilla.org/show_bug.cgi?id=1330718#c9 for more detail on this topic. https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161%28v=vs.85%29.aspx also contains general information on job objects.

I'm going to do some more testing to validate this hypothesis, and to evaluate the options available to us:

1) See if it is possible to kill all processes that inherit the console file handles, so that Cmd.Wait can return promptly, even if some processes have "broken away" and do not get terminated
2) Consider moving task resolution out of the goroutine that waits for the command to complete
3) See if it is possible to prevent file handles being inherited by subprocesses at all, so that Cmd.Wait returns promptly after task abortion
4) Investigate terminating the process and releasing its handles by creating a job object with a hard time limit.

Note that workers necessarily reboot between tasks: new tasks require new OS users, which requires a new winlogon session for the new task user, which in turn requires a reboot (winlogon is initiated by the boot sequence in the kernel; it can't be initiated from a userspace process call). Since the reboot takes place, all processes will invariably be terminated, so we shouldn't leak resources by leaving processes running.
Bingo - the problem was slightly more subtle.

The standard out and standard error of task command processes (and their subprocesses) are attached to the writer returned by io.MultiWriter (which implements the io.Writer interface) rather than to e.g. an *os.File. Internally, the standard library launches goroutines to copy data from the process's standard out/err into that writer. Since the process's standard out/err outlives the process itself (subprocesses inherit it), the pipes remain open until the subprocesses terminate, and therefore Cmd.Wait continues to block.

There are a few options worth investigating...

1) Rather than calling cmd.Wait(), call cmd.Process.Wait() and take care of closing file handles, terminating goroutines, etc. as needed
2) See if it is possible to close file handles that are associated with the running processes
3) See if it is possible to associate the command with different file handles after the process has started running
4) Instead of using an io.MultiWriter, see if I can use a different writer that can be closed on the receiving end
5) See if I can close an io.MultiWriter somehow
6) See if I can manipulate the Cmd's internally used goroutines
7) Do a better job of terminating the entire process tree (see comment 4 for complications with this on Windows 7)
8) Don't use a MultiWriter but log to an actual file, so that no copying goroutines are created by the Cmd.
9) Iterate through the process table, find processes that share a console handle with the process being killed, and kill those too.

In the end I went for option 1) and am testing it now.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8973517 - Flags: review?(jhford)
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/23fa87a5c957618a9ba5636368bb7f5c175c50ae
Bug 1458873 - Improve TestAbortAfterMaxRunTime to detect delayed task resolution

https://github.com/taskcluster/generic-worker/commit/30a4573d2a6c9c6f4770ec940a48efe828bcad17
Bug 1458873 - Fix delayed task resolution by waiting for process, not command

https://github.com/taskcluster/generic-worker/commit/41f8b46065135509c96de6799bf218c5c68a8817
Bug 1458873 - use exec.Command rather than exec.CommandContext on Windows

This makes Windows consistent with other platforms. Maybe at some
point we'll switch back to using exec.CommandContext but the
implementation wasn't very clean (context was stored in a struct
rather than passed explicitly to methods, which is an antipattern)
so better to remove it for now.

https://github.com/taskcluster/generic-worker/commit/73c67bb24abdcb6192762f97cd2a509548ab22ed
Merge pull request #87 from taskcluster/bug1458873

Bug 1458873 - Fix task abortion
Released in generic-worker 10.7.12.
I'm struggling to roll this out to staging at the moment as I can't seem to get win7 workers running:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=72f36810a3677b4a07ffa44c4372b0ef8423895e

I will create a separate blocking bug for this.
Blocks: 1460084
Depends on: 1460178
Comment on attachment 8973517 [details] [review]
Github Pull Request for generic-worker

(was r+'d in gh)
Attachment #8973517 - Flags: review?(jhford) → review+
Depends on: 1461901
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers