Closed Bug 1329617 Opened 7 years ago Closed 7 years ago

generic-worker 7.2.5 stops claiming tasks after a task it is running gets cancelled

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

Details

Attachments

(4 files)

From inspecting a live worker, I can confirm generic-worker.exe process is still running, also livelog.exe is still running, but no new tasks are getting claimed. The last thing to be logged is that a task cannot be reclaimed because it was cancelled.

Since the generic worker continues to run, we continue to pay for the idle instances, until after 96 hours the provisioner kills them. Therefore cancelling tasks is currently costing us money, rather than saving us money. :((
Here we see generic-worker.exe is still running, and so is livelog.exe from the previous task.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Here we see the last lines of C:\generic-worker\generic-worker.log. The system date shows as 9 Jan, and the last log entry is from 6 Jan.
I'm reasonably confident the problem is that the main thread is blocked waiting here:

https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L1109

This is because, from the logs, I can see that this defer block *is* called successfully:

https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L1114-L1118

However, the maxRunTime timer doesn't get cancelled, as we see it kick in later in the logs.

Timing looks like this:

03:35:16 UTC: Task starts
03:35:19 UTC: Start executing command 10 (the python command in https://tools.taskcluster.net/task-inspector/#LlUymtVBRZCC6QEUVgTS6A/)
03:56:17 UTC: User cancels task
04:09:18 UTC: Task reclaim go routine notices that task can't be reclaimed because it has been cancelled. It does not terminate the processes (empty implementation: still a TODO...)
04:10:44 UTC: Task completes naturally, successfully but knows it cannot resolve task as task was cancelled

....

max runtimer is not cancelled, and we can see that run() method does not complete execution of defer blocks since we also do not get e.g. this log line: https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L541

....

05:05:16 max run timer, having not been cancelled, reports it set status to aborted, because task has already been cancelled


Looking at the sequence of defer blocks in the run() method - it looks like this must be the place the go routine gets blocked.
https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L762 reassigns the timer, so this should be a parameter of the defer function.

I'll make a patch and test.
Attachment #8825110 - Flags: review?(jopsen)
(In reply to Pete Moore [:pmoore][:pete] from comment #4)
> https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L762
> reassigns the timer, so this should be a parameter of the defer function.
> 
> I'll make a patch and test.

Just to be clear - this wasn't the reason. In the code there was a mix up between the reclaim timer and the max run timer (whoops) which caused some confusion, but the main problem was that as per https://golang.org/pkg/time/#Timer.Stop we were running essentially:

if !t.Stop() {
	<-t.C
}

Unfortunately this doesn't work, if your timer was created using https://golang.org/pkg/time/#AfterFunc as can be seen by https://play.golang.org/p/6pqRerGVcg

Unfortunately the API docs are not so explicit about this, but this can be inferred by the trailing line "This cannot be done concurrent to other receives from the Timer's channel." - and AboutFunc must be internally receiving from the channel.

This patch fixes the mix up between max run timer and reclaim timer, and makes sure not to drain either timer channel when attemping to stop the timers when the run() method completes...
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/b8ecf6f9f7c476ec2bfbee47843789900e3e9c71
Bug 1329617 - never drain timer channel when stopping a timer returned by time.AboutFunc

https://github.com/taskcluster/generic-worker/commit/5859baf95225d5ffef0eb1863beff0b4050770c7
Merge pull request #32 from taskcluster/bug1329617

Bug 1329617 - never drain timer channel when stopping a timer returned by time.AboutFunc
Hey Rob!

Please don't merge this until CET working hours, so I can be around to support. But feel free to review the changes already.

Thanks!
Pete
Attachment #8825127 - Flags: review?(rthijssen)
Comment on attachment 8825110 [details] [review]
Github Pull Request for generic-worker

Looks okay to me, but I might not be the best reviewer here..
Attachment #8825110 - Flags: review?(jopsen) → review+
All landed in production now, closing.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Attachment #8825127 - Flags: review?(rthijssen) → review+
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: