Queue doesn't take jobs from "unscheduled" to "pending" - Trees closed
Categories
(Taskcluster :: Operations and Service Requests, task, P1)
Tracking
(Not tracked)
People
(Reporter: jlorenzo, Assigned: dustin)
Attachments
(1 file, 1 obsolete file)
First discovered task: https://tools.taskcluster.net/groups/JZJ2hNJESBGMNPLcno8ing/tasks/d3Y4qjbZTf2q7BSgRGG7yg/details
It now impacts gecko: https://tools.taskcluster.net/groups/VN32mlc7QQCTbS-3ygjRfA/tasks/UQFewG6ZTliV9FzE3_xw8g/details
This may be either a bug in the Queue or just slowness. I don't have access to the Queue's logs to check.
Comment 1•5 years ago (Assignee)
https://app.signalfx.com/#/detector/CjefBMEAYAQ/edit?incidentId=DzJF2hjAYH8&is=anomalous
Looks like the dependency resolver failed and was revived later, all before I was awake. So, I think this is resolved now.
John, I think the issue is that the process hangs. What do you think about using lib-iterate here?
Comment 2•5 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #1)
> John, I think the issue is that the process hangs. What do you think about
> using lib-iterate here?
I'm in favour of using lib-iterate wherever we have these sort of polling loops.
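For readers unfamiliar with the idea, here is a minimal sketch of a polling loop that treats a hung iteration as a failure instead of stalling silently. It is not the lib-iterate API; `pollLoop`, `runWithWatchdog`, and all option names are hypothetical and chosen only for illustration.

```javascript
// Hypothetical sketch, NOT the lib-iterate API: a polling loop with a
// per-iteration watchdog so that a hung handler surfaces as an error.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Run handler once; reject if it takes longer than maxIterationTime ms.
async function runWithWatchdog(handler, maxIterationTime) {
  let timer;
  const watchdog = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`iteration exceeded ${maxIterationTime}ms`)),
      maxIterationTime);
  });
  try {
    return await Promise.race([handler(), watchdog]);
  } finally {
    clearTimeout(timer); // never leave the watchdog timer pending
  }
}

// Call handler maxIterations times, sleeping waitTime ms between calls.
// Any handler error or watchdog timeout rejects the returned promise.
async function pollLoop({handler, waitTime, maxIterationTime, maxIterations}) {
  for (let i = 0; i < maxIterations; i++) {
    await runWithWatchdog(handler, maxIterationTime);
    await sleep(waitTime);
  }
}
```

Note that the watchdog cannot cancel a hung handler; it only makes the hang visible, so the caller can crash the process and let the platform restart it.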
Comment 3•5 years ago
I've created a PR to switch the dependency resolver to using lib-iterate. To make the start/terminate calls block until they've completed, I had to change them to async functions, but I'm getting unit test failures that seem to be related to that change. I tried to figure out how the polling management works in the unit tests, but it's a bit complicated.
Dustin, could you take a look at this change from sync to async start/terminate functions in the helper.js file?
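To illustrate why the switch from sync to async start/terminate affects callers, here is a minimal sketch, assuming a generic poller. The `Poller` class and its method names are hypothetical and do not reflect the actual dependency resolver or helper.js code.

```javascript
// Hypothetical sketch: async start()/stop() that only resolve once the
// loop has actually started or fully stopped.
class Poller {
  constructor(pollOnce, waitTime) {
    this.pollOnce = pollOnce; // async function doing one unit of work
    this.waitTime = waitTime; // ms to sleep between iterations
    this.running = false;
    this.done = null;         // promise for the whole loop
    this._wake = null;        // cuts the current sleep short
  }

  async start() {
    if (this.running) return;
    this.running = true;
    this.done = this._loop(); // keep the promise so stop() can await it
  }

  async stop() {
    this.running = false;
    if (this._wake) this._wake(); // do not wait out the sleep
    await this.done;              // wait for any in-flight iteration
  }

  async _loop() {
    while (this.running) {
      await this.pollOnce();
      if (!this.running) break;
      await new Promise(resolve => {
        const timer = setTimeout(resolve, this.waitTime);
        this._wake = () => { clearTimeout(timer); resolve(); };
      });
    }
  }
}
```

With this shape, a test that previously called a synchronous terminate must now `await poller.stop()`. Otherwise an iteration can still be in flight when the next test begins, which is one plausible source of the kind of failures described above.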
Comment 4•5 years ago
Issue is back, trees are closed for it.
Comment 5•5 years ago (Assignee)
I restarted the dynos and things should be back up and running. I'll work on landing that PR.
Comment 6•5 years ago
Issue came back again just now, I restarted dynos.
Comment 8•5 years ago (Assignee)
PR is landed, so hopefully we'll get some better behavior if this happens again.
Comment 9•5 years ago (Assignee)
We haven't seen this in two weeks. Here's wondering if the improvements to tc-lib-iterate that landed since then have helped.
Comment 10•5 years ago (Reporter)
Crap, :jcristau just noticed [1] is still unscheduled, even though all its dependencies are resolved. Do you think I should reopen this bug, Dustin?
[1] https://tools.taskcluster.net/groups/SPmSTdYTT2udPMXsetxccg/tasks/FT9aUmVLS1OU0oAtrtCEyg/details
Comment 11•5 years ago (Assignee)
No, this bug was about a complete failure of the dependency resolver, which has not happened here. I can't see from our logs what did happen with this task, though (they just show it being created and then later manually cancelled and re-run).
Comment 12•5 years ago
iiuc, this is the same as Bug 1477097 (task stuck as unscheduled even with all deps resolved, possible issue with queue dependency resolver), where we had a task that was unscheduled with all deps resolved. So we did our cancel and rerun trick.
Comment 13•5 years ago (Assignee)
Yes, the issue you had sounds like bug 1477097. This bug is about something different :)
Comment 14•5 years ago (Assignee)
..and cancelling and rerunning was the right trick to apply.
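As a rough illustration of the "unscheduled with all deps resolved" condition discussed in the last few comments, here is a sketch of a check one might script. The `queue` argument is assumed to be a Queue client exposing `status(taskId)` and `task(taskId)` with the response shapes used below; verify those against the client you actually use. `isStuckUnscheduled` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns true when a task is still 'unscheduled'
// although its dependencies already satisfy its `requires` policy.
// Assumes queue.status(id) resolves to {status: {state}} and
// queue.task(id) resolves to a definition with `dependencies`/`requires`.
const RESOLVED = new Set(['completed', 'failed', 'exception']);

async function isStuckUnscheduled(queue, taskId) {
  const {status} = await queue.status(taskId);
  if (status.state !== 'unscheduled') return false;

  const task = await queue.task(taskId);
  const requires = task.requires || 'all-completed';
  const states = await Promise.all(
    (task.dependencies || []).map(
      async dep => (await queue.status(dep)).status.state));

  // 'all-resolved' accepts any resolved state; otherwise require success.
  return states.every(state =>
    requires === 'all-resolved' ? RESOLVED.has(state) : state === 'completed');
}
```

A task for which this returns true is a candidate for the manual cancel and rerun described above.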