Closed Bug 1602512 Opened 5 years ago Closed 4 years ago

generic-worker: command abortion race condition

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

Details

Attachments

(1 file)

The TestReclaimCancelledTask integration test has picked up a race condition. See task log lines 17620-18068.

This task is created that in command 6 cancels itself. Command 7 should not run since the task abortion took place in command 6, but since command 7 has not started at that point, the abortion of command 7 has no effect, and the worker goes on to run command 7 (which sleeps for 5 mins).

The command abortion code needs some cleanup. Currently, when we discover that a task has been cancelled, we abort all commands in the task definition. We expect that only one is running, so we don't fail if a command can't be aborted (i.e. process killed) we don't expect the kill to be successful against commands that have already completed or commands that haven't run yet.

Since killing the command will result in it getting a non zero exit code, we also assume that the main thread that is executing tasks won't execute any more, since it stops after a failed command.

However, this is pretty ugly, and it lead to a race condition, despite us using a mutex around the command object when killing it.

The problem occurs when the race is won as follows:

  • main thread starts command 1
  • command 1 completes successfully
  • abortion thread attempts to kill command 1, but sees it is not running, so continue to next command
  • abortion thread tries to kill command 2 but it hasn't started, so ignores and continues...
  • main thread starts command 2

The problem we see here is that the task abortion occurs precisely between command 1 completing and command 2 starting, rather than during execution of command 1. In this rare case, command 1 will not have a bad exit code, and so the main thread will schedule command 2 to run, but the kill of command 2 has already happened, so it won't get killed.

Instead the task abortion should operate at a task level, the task should be marked as aborted, its state guarded by a mutex, and the main thread should not try to execute new commands when the task has been marked for abortion.

Additionally, at any time, the task should know which was the most recent command it executed, and only if there is a command that has been started, and not completed, should the worker attempt to kill the command.

This isn't really a critical bug, since when a task is aborted, it is resolved as a failure, and expected to fail. However, it would be nice if the worker didn't consume more resources than it needed to by starting new commands when previous commands have terminated prematurely. Only artifacts of a failed task are ultimately affected, but we should still clean this up.

Assignee: nobody → pmoore
Status: NEW → ASSIGNED

Released in generic-worker 16.6.0.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: