Closed
Bug 1329617
Opened 7 years ago
Closed 7 years ago
generic-worker 7.2.5 stops claiming tasks after a task it is running gets cancelled
Categories
(Taskcluster :: Workers, defect)
Taskcluster
Workers
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Assigned: pmoore)
Details
Attachments
(4 files)
From inspecting a live worker, I can confirm generic-worker.exe process is still running, also livelog.exe is still running, but no new tasks are getting claimed. The last thing to be logged is that a task cannot be reclaimed because it was cancelled. Since the generic worker continues to run, we continue to pay for the idle instances, until after 96 hours the provisioner kills them. Therefore cancelling tasks is currently costing us money, rather than saving us money. :((
Assignee | ||
Comment 1•7 years ago
|
||
Here we see generic-worker.exe is still running, and so is livelog.exe from the previous task.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Assignee | ||
Comment 2•7 years ago
|
||
Here we see the last lines of C:\generic-worker\generic-worker.log. The system date shows as 9 Jan, and the last log entry is from 6 Jan.
Assignee | ||
Comment 3•7 years ago
|
||
I'm reasonably confident the problem is that the main thread is blocked waiting here: https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L1109 This is because, from the logs, I can see that this defer block *is* called successfully: https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L1114-L1118 However, the maxRunTime timer doesn't get cancelled, as we see it kick in later in the logs. Timing looks like this: 03:35:16 UTC: Task starts 03:35:19 UTC: Start executing command 10 (the python command in https://tools.taskcluster.net/task-inspector/#LlUymtVBRZCC6QEUVgTS6A/) 03:56:17 UTC: User cancels task 04:09:18 UTC: Task reclaim go routine notices that task can't be reclaimed because it has been cancelled. It does not terminate the processes (empty implementation: still a TODO...) 04:10:44 UTC: Task completes naturally, successfully but knows it cannot resolve task as task was cancelled .... max runtimer is not cancelled, and we can see that run() method does not complete execution of defer blocks since we also do not get e.g. this log line: https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L541 .... 05:05:16 max run timer, having not been cancelled, reports it set status to aborted, because task has already been cancelled Looking at the sequence of defer blocks in the run() method - it looks like this must be the place the go routine gets blocked.
Assignee | ||
Comment 4•7 years ago
|
||
https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L762 reassigns the timer, so this should be a parameter of the defer function. I'll make a patch and test.
Assignee | ||
Comment 5•7 years ago
|
||
Attachment #8825110 -
Flags: review?(jopsen)
Assignee | ||
Comment 6•7 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #4) > https://github.com/taskcluster/generic-worker/blob/v7.2.5/main.go#L762 > reassigns the timer, so this should be a parameter of the defer function. > > I'll make a patch and test. Just to be clear - this wasn't the reason. In the code there was a mix up between the reclaim timer and the max run timer (whoops) which caused some confusion, but the main problem was that as per https://golang.org/pkg/time/#Timer.Stop we were running essentially: if !t.Stop() { <-t.C } Unfortunately this doesn't work, if your timer was created using https://golang.org/pkg/time/#AfterFunc as can be seen by https://play.golang.org/p/6pqRerGVcg Unfortunately the API docs are not so explicit about this, but this can be inferred by the trailing line "This cannot be done concurrent to other receives from the Timer's channel." - and AboutFunc must be internally receiving from the channel. This patch fixes the mix up between max run timer and reclaim timer, and makes sure not to drain either timer channel when attemping to stop the timers when the run() method completes...
Comment 7•7 years ago
|
||
Commits pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/b8ecf6f9f7c476ec2bfbee47843789900e3e9c71 Bug 1329617 - never drain timer channel when stopping a timer returned by time.AboutFunc https://github.com/taskcluster/generic-worker/commit/5859baf95225d5ffef0eb1863beff0b4050770c7 Merge pull request #32 from taskcluster/bug1329617 Bug 1329617 - never drain timer channel when stopping a timer returned by time.AboutFunc
Assignee | ||
Comment 8•7 years ago
|
||
Hey Rob! Please don't merge this until CET working hours, so I can be around to support. But feel free to review the changes already. Thanks! Pete
Attachment #8825127 -
Flags: review?(rthijssen)
Comment 9•7 years ago
|
||
Comment on attachment 8825110 [details] [review] Github Pull Request for generic-worker Looks okay to me, but I might not be the best reviewer here..
Attachment #8825110 -
Flags: review?(jopsen) → review+
Assignee | ||
Comment 10•7 years ago
|
||
All landed in production now, closing.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Attachment #8825127 -
Flags: review?(rthijssen) → review+
Assignee | ||
Comment 11•6 years ago
|
||
Released in https://github.com/taskcluster/generic-worker/releases/tag/v7.2.6
QA Contact: pmoore
Updated•5 years ago
|
Component: Generic-Worker → Workers
You need to log in
before you can comment on or make changes to this bug.
Description
•