Closed Bug 1527583 Opened 5 years ago Closed 5 years ago

Queue doesn't take jobs from "unscheduled" to "pending" - Trees closed

Categories

(Taskcluster :: Operations and Service Requests, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlorenzo, Assigned: dustin)

Attachments

(1 file, 1 obsolete file)

First discovered task: https://tools.taskcluster.net/groups/JZJ2hNJESBGMNPLcno8ing/tasks/d3Y4qjbZTf2q7BSgRGG7yg/details
It now impacts gecko: https://tools.taskcluster.net/groups/VN32mlc7QQCTbS-3ygjRfA/tasks/UQFewG6ZTliV9FzE3_xw8g/details

This may be either a bug in the Queue or just some slowness. I don't have access to the queue's logs to check.

Summary: Queue doesn't take jobs from "unscheduled" to "pending" → Queue doesn't take jobs from "unscheduled" to "pending" - trees closed
See Also: → 1503062
Severity: blocker → normal
Summary: Queue doesn't take jobs from "unscheduled" to "pending" - trees closed → Queue doesn't take jobs from "unscheduled" to "pending"

https://app.signalfx.com/#/detector/CjefBMEAYAQ/edit?incidentId=DzJF2hjAYH8&is=anomalous

Looks like the dependency resolver failed and was revived later, all before I was awake. So, I think this is resolved now.

John, I think the issue is that the process hangs. What do you think about using lib-iterate here?

Flags: needinfo?(jhford)

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #1)

> John, I think the issue is that the process hangs. What do you think about using lib-iterate here?

I'm in favour of using lib-iterate wherever we have this sort of polling loop.
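
To illustrate the failure mode being discussed (a polling loop whose iteration hangs silently), here is a minimal sketch of a loop with a per-iteration watchdog. It is not the lib-iterate API; `withTimeout`, `pollWithWatchdog`, and all option names are hypothetical and exist only for this example.

```javascript
// Hypothetical sketch: NOT the taskcluster-lib-iterate API.
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`iteration exceeded ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the watchdog timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run `handler` up to `maxIterations` times, sleeping `waitMs` between runs.
// A hung iteration surfaces as a rejection instead of stalling forever.
async function pollWithWatchdog({handler, maxIterations, maxIterationMs, waitMs}) {
  for (let i = 0; i < maxIterations; i++) {
    await withTimeout(Promise.resolve().then(() => handler(i)), maxIterationMs);
    await new Promise(resolve => setTimeout(resolve, waitMs));
  }
}
```

With a loop like this, a stuck iteration becomes an error that the caller can log or use to exit the process so it gets restarted, rather than a process that looks alive but does nothing.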

Flags: needinfo?(jhford)
Attached file GitHub Pull Request (obsolete) —

I've created a PR to switch the dependency resolver to using lib-iterate. In order to correctly block the start/terminate calls until they've completed, I had to change them to async functions, but I'm getting unit test failures which seem to be related to that change. I tried to figure out how the polling management works in the unit tests, but it's a bit complicated.

Dustin, could you take a look at this change from sync to async start/terminate functions in the helper.js file?
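
For readers unfamiliar with the sync-to-async change described above, the following is a rough sketch of why `terminate` needs to be awaited: it should resolve only after the in-flight iteration has finished. The `PollingLoop` class and its fields are hypothetical and do not reflect the actual resolver or helper.js code.

```javascript
// Hypothetical sketch: NOT the actual dependency resolver or helper.js code.
class PollingLoop {
  constructor({handler, waitMs}) {
    this.handler = handler;
    this.waitMs = waitMs;
    this.running = false;
    this.done = null;   // promise that settles when the loop has exited
    this.wake = null;   // cuts the current sleep short, if one is in progress
    this.error = null;  // first error thrown by the handler, if any
  }

  // Launch the loop; resolves once the loop has been started.
  async start() {
    if (this.running) return;
    this.running = true;
    this.done = this._run();
  }

  async _run() {
    try {
      while (this.running) {
        await this.handler();
        if (!this.running) break;
        await new Promise(resolve => {
          const timer = setTimeout(resolve, this.waitMs);
          this.wake = () => { clearTimeout(timer); resolve(); };
        });
        this.wake = null;
      }
    } catch (err) {
      this.error = err;
      this.running = false;
    }
  }

  // Resolves only after any in-flight iteration has completed.
  async terminate() {
    this.running = false;
    if (this.wake) this.wake();
    await this.done;
    if (this.error) throw this.error;
  }
}
```

A test helper that previously called a synchronous `terminate()` would then need `await loop.terminate()`; without the `await`, the test can move on while an iteration is still running.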

Assignee: nobody → jhford
Status: NEW → ASSIGNED
Attachment #9043845 - Flags: feedback?(dustin)

Issue is back, trees are closed for it.

Severity: normal → blocker
Flags: needinfo?(dustin)
Summary: Queue doesn't take jobs from "unscheduled" to "pending" → Queue doesn't take jobs from "unscheduled" to "pending" - Trees closed

I restarted the dynos and things should be back up and running. I'll work on landing that PR.

Assignee: jhford → dustin
Flags: needinfo?(dustin)

Issue came back again just now; I restarted dynos.

See IRC #taskcluster logs.

PR is landed, so hopefully we'll get some better behavior if this happens again.

Component: General → Operations and Service Requests
Attachment #9043845 - Attachment is obsolete: true

We haven't seen this in two weeks. I wonder whether the improvements to tc-lib-iterate that landed since then have helped.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Crap, :jcristau just noticed [1] is still unscheduled, even though all its dependencies are resolved. Do you think I should reopen this bug, Dustin?

[1] https://tools.taskcluster.net/groups/SPmSTdYTT2udPMXsetxccg/tasks/FT9aUmVLS1OU0oAtrtCEyg/details

Flags: needinfo?(dustin)

No, this bug was about a complete failure of the dependency resolver, which has not happened here. I can't see from our logs what did happen with this task, though (they just show it being created and then later manually cancelled and re-run).

Flags: needinfo?(dustin)

iiuc, this is the same as Bug 1477097 ("task stuck as unscheduled even with all deps resolved, possible issue with queue dependency resolver"), where we had a task that was unscheduled with all deps resolved. So we did our cancel and rerun trick.

Yes, the issue you had sounds like bug 1477097. This bug is about something different :)

...and cancelling and rerunning was the right trick to apply.
