Closed
Bug 1503062
Opened 6 years ago
Closed 5 years ago
dependencyResolver stopped working
Categories
(Taskcluster :: Operations and Service Requests, task, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: bstack)
References
Details
Oct 29 23:52:42 taskcluster-queue app/dependencyResolver.4: 2018-10-29T23:52:42.015Z base:entity TIMING: getEntity on QueueTasks took 135.544098 milliseconds.
Oct 29 23:52:42 taskcluster-queue app/dependencyResolver.3: 2018-10-29T23:52:42.481Z base:entity TIMING: deleteEntity on QueueTaskGroupActiveSets took 94.896785 milliseconds.
Oct 29 23:52:43 taskcluster-queue app/dependencyResolver.2: 2018-10-29T23:52:43.215Z base:entity TIMING: queryEntities on QueueTaskDependency took 97.040273 milliseconds.
Oct 29 23:52:43 taskcluster-queue app/dependencyResolver.3: 2018-10-29T23:52:43.595Z base:entity TIMING: deleteEntity on QueueTaskGroupActiveSets took 92.485661 milliseconds.
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.4: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.4: Failed to log error to Sentry: Error: HTTP Error (429): undefined
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.1: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.1: Failed to log error to Sentry: Error: HTTP Error (429): undefined
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.2: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.2: Failed to log error to Sentry: Error: HTTP Error (429): undefined
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.3: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.3: Failed to log error to Sentry: Error: HTTP Error (429): undefined
Reporter
Comment 1•6 years ago
https://github.com/taskcluster/taskcluster-lib-monitor/issues/76
Reporter
Comment 2•6 years ago
Looks like there was an Azure downtime around the same time. 429 is a rate limit, meaning Sentry is tired of hearing from us. Perhaps from this very error.
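The feedback loop described above (Sentry returns 429 because we report too fast, and then the failure to report is itself reported) is usually broken with backoff. As a minimal sketch, not the raven API and not the queue's actual code, a retry wrapper with exponential backoff might look like this; `send`, `err.status`, and the parameter names are all hypothetical:

```javascript
// Hypothetical sketch: retry a rate-limited send with exponential backoff,
// so an error reporter hit with HTTP 429 does not amplify the problem by
// retrying immediately. `send` is a stand-in for any async reporting call.
async function sendWithBackoff(send, maxRetries = 3, baseDelayMs = 100) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await send();
    } catch (err) {
      if (err.status !== 429 || attempt === maxRetries) {
        throw err; // not rate-limited, or out of retries: give up
      }
      // back off 100ms, 200ms, 400ms, ... before trying again
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```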
Reporter
Comment 3•6 years ago
Azure was returning OperationTimedOut errors for a brief time around when this failed (23:52:43). But the error "stuck" because it killed all four dependencyResolver processes. I restarted all dynos and it's up and running again. It's really not clear to me why this error caused the resolver to exit and not restart. John, is that something you could have a look at?
Assignee: dustin → nobody
Flags: needinfo?(jhford)
Comment 4•6 years ago
Looking at the metrics dashboard, there was a Heroku issue which happened at a similar time. I'm not sure if the timelines match up perfectly, but it looks close enough to be related. I also don't see any crashes in the Heroku metrics, which suggests to me that this was something which didn't crash the process. It's probable that it was waiting around forever for a promise that never resolved. Let's see if this happens again, and also get the new PR from comment 1 landed in the meantime.
Flags: needinfo?(jhford)
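The failure mode comment 4 suspects, a promise that never settles leaving the worker hung instead of crashed, can be guarded against by racing each operation with a timeout, so the hang becomes a crash that the dyno supervisor can restart. This is a generic sketch, not the queue's actual code; `withTimeout` and its parameters are hypothetical names:

```javascript
// Hypothetical sketch: convert a silently-hung promise into a rejection.
// If `promise` never settles, the timer rejects after `ms` milliseconds,
// turning an invisible hang into a loud, restartable failure.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  // whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Wrapping each iteration of a worker loop this way trades a hung dyno for a crashed one, which (unlike the hang) Heroku's metrics and restart machinery would actually notice.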
Assignee
Comment 5•6 years ago
I think this reared its head today again. I believe bug 1503430 was caused by this.
Updated•5 years ago
Severity: normal → major
Priority: P1 → P2
Assignee
Comment 7•5 years ago
I guess not!
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(bstack)
Resolution: --- → FIXED
Reporter
Comment 9•5 years ago
From bug 1521453, we should probably look at why Azure errors are causing this worker dyno to hang -- that's an issue that will likely follow us into Kubernetes.
Reporter
Updated•5 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•5 years ago
Component: Operations → Operations and Service Requests
Updated•5 years ago
Assignee
Comment 10•5 years ago
We've added monitoring around this that should email the tc team when this happens again.
Status: REOPENED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → FIXED