Bug 1180187: generic-worker: listen for and handle worker shutdown
Status: RESOLVED FIXED
Opened 9 years ago, closed 6 years ago
Categories: Taskcluster :: Workers, defect
Tracking: Not tracked
People: Reporter: pmoore, Assigned: pmoore
Whiteboard: [generic-worker]
Attachments: 1 file
Spot nodes can terminate: https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/

The generic worker should have a dedicated goroutine that polls http://169.254.169.254/latest/meta-data/spot/termination-time at 5s intervals in order to catch spot node termination notices.

Once it is established that the node will be terminated, the generic worker should promptly terminate any running tasks, with appropriate reportException handling, giving a reason of "worker-shutdown". The log should have "[taskcluster] Spot node shutdown" added on a new line, and the log should be uploaded. If other artifacts already exist, it may be reasonable to upload them too.
Comment 1 • 9 years ago (Assignee)
Assigning all generic worker bugs to myself for now. If anyone wants to take this bug, feel free to add a comment to request it. I can provide context.
Assignee: nobody → pmoore
Updated • 9 years ago (Assignee)
Component: TaskCluster → Generic-Worker
Product: Testing → Taskcluster
Updated • 8 years ago
Component: Generic-Worker → Worker
Whiteboard: [generic-worker]
Updated • 7 years ago (Assignee)
Component: Worker → Generic-Worker
Updated • 6 years ago (Assignee)
QA Contact: pmoore
Updated • 6 years ago (Assignee)
Comment 3 • 6 years ago (Assignee)
Drafted an implementation - still need to add test(s).
Attachment #8957670 -
Flags: review?(jhford)
Comment 4 • 6 years ago
Commits pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/92d7f96555267d3cba10ca890289f4a65c6a8499
Bug 1180187 - Resolve worker-shutdown for spot terminations
https://github.com/taskcluster/generic-worker/commit/0593d903d56475ec5ffe65fbd8d942ebc02b5d6c
Merge pull request #78 from taskcluster/bug1180187 Bug 1180187 - listen for spot termination notice and abort task on discovery
Comment 5 • 6 years ago (Assignee)
This was released in 10.7.0 but doesn't seem to be catching all spot terminations. I've added extra debugging in 10.7.2: https://github.com/taskcluster/generic-worker/commit/a0d5a75dd704a16fa11b035b6d492d0696b6a577 I've rolled that out to staging, so will wait for some reports to come in on papertrail.
Comment 6 • 6 years ago
Commit pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/3b2e6b4e69d24e830724f918220b20ef34de60d7
Bug 1180187 - add more logs to aid debugging
Comment 7 • 6 years ago (Assignee)
Looks to me like a deadlock occurring...

Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Spot request has MAYBE been issued??? Decide for yourself!
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07 HTTP/1.0 200 OK
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Content-Length: 20
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Accept-Ranges: bytes
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Connection: close
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Content-Type: text/plain
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Date: Fri, 16 Mar 2018 21:07:08 GMT
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Etag: "3708311465"
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Last-Modified: Fri, 16 Mar 2018 21:07:05 GMT
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Server: EC2ws
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018-03-16T21:09:05Z
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07 resp.StatusCode = 200
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07 WARNING: ABORTING task since an imminent spot termination notice has been received!
The last line we see in the logs comes from: https://github.com/taskcluster/generic-worker/blob/v10.7.3/aws.go#L323

The abort function it calls is: https://github.com/taskcluster/generic-worker/blob/v10.7.3/main.go#L1283-L1289

This needs to update the task status, which is protected with a mutex. I suspect something else holds the mutex. The machine shuts down a couple of minutes later, and the task run is eventually resolved by the queue as exception/claim-expired rather than by the worker as exception/worker-shutdown, as is intended...

Strangely the unit test passes: https://github.com/taskcluster/generic-worker/blob/v10.7.3/aws_test.go#L8-L23

So something must be different between running as a unit test and running in production for real...
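To make the suspicion above concrete, here is a rough sketch of a mutex-protected task status with an abort path. It is not the code at the links above: `taskRun`, `resolve`, `abort` and `status` are hypothetical names chosen for illustration. In a design like this, `abort` blocks for as long as any other goroutine holds the lock, which is the kind of hang suspected here.

```go
package main

import (
	"fmt"
	"sync"
)

// taskRun is a hypothetical stand-in for the worker's task status bookkeeping.
type taskRun struct {
	mu       sync.Mutex
	resolved bool
	state    string // "running", "completed" or "exception"
	reason   string // only set for exceptions, e.g. "worker-shutdown"
}

func newTaskRun() *taskRun {
	return &taskRun{state: "running"}
}

// resolve records the final state exactly once. It reports whether this call
// was the one that resolved the run; later calls leave the state untouched.
func (t *taskRun) resolve(state, reason string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.resolved {
		return false
	}
	t.resolved, t.state, t.reason = true, state, reason
	return true
}

// abort is what a spot termination watcher would call. It waits for the mutex,
// so it cannot make progress while another goroutine holds the lock.
func (t *taskRun) abort() bool {
	return t.resolve("exception", "worker-shutdown")
}

// status returns the current state and reason under the lock.
func (t *taskRun) status() (string, string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.state, t.reason
}

func main() {
	run := newTaskRun()

	// The abort path and the normal completion path race; whichever takes the
	// lock first decides the final status.
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); run.abort() }()
	go func() { defer wg.Done(); run.resolve("completed", "") }()
	wg.Wait()

	state, reason := run.status()
	fmt.Println("final status:", state, reason)
}
```

Keeping the critical section limited to the field updates, with no blocking calls made while the lock is held, is one way to keep the abort path from stalling.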
Comment 8 • 6 years ago (Assignee)
Papertrail link for above output: https://papertrailapp.com/systems/1695099951/events?focus=911410543708000271&selected=911410543708000271
Comment 9 • 6 years ago
:pmoore, what is the status here? This bug is marked as the root cause for bug 1444168, which had 80+ failures in the last week.
Flags: needinfo?(pmoore)
Comment 10 • 6 years ago (Assignee)
Hi Joel, It turned out not to be a deadlock; rather, process termination wasn't implemented. This then got fixed in bug 1447265, but it hasn't been rolled out yet as I am on PTO. It will be rolled out next week in bug 1399401 when I'm back.
Depends on: 1447265
Flags: needinfo?(pmoore)
Updated • 6 years ago (Assignee)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Attachment #8957670 -
Flags: review?(jhford)
Comment 11 • 6 years ago (Assignee)
Released in https://github.com/taskcluster/generic-worker/releases/tag/v10.7.3
Updated • 5 years ago
Component: Generic-Worker → Workers