Closed Bug 1356800 Opened 7 years ago Closed 7 years ago

[generic-worker] Handle some errors during artifact uploading more gracefully

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Unassigned)

Details

Some errors that are caught when trying to upload an artifact result in the worker to panic.  The queue has a few scenarios where an error (such as 400) is returned because of input errors (such as artifact expiration > task expiration) where it would probably be better to log that to the task log and fail the task.

One such example is here:
https://queue.taskcluster.net/v1/task/V-qWmL-PSzaogy8YTcOqaQ

It looks like the situation is handled here:
https://github.com/taskcluster/generic-worker/blob/master/artifacts.go#L372-L378

In the situation where the task log shows that the task it was successful (like for this task), but was resolved with worker-shutdown it can seem that the automatic retrigger for this exception will fix the issue, but really it's a problem with the task definition to begin with, and subsequent retriggers will fail as well.
(In reply to Greg Arndt [:garndt] from comment #0)
> Some errors that are caught when trying to upload an artifact result in the
> worker to panic.  The queue has a few scenarios where an error (such as 400)
> is returned because of input errors (such as artifact expiration > task
> expiration) where it would probably be better to log that to the task log
> and fail the task.

That particular check is handled up front, when the payload is validated, so does not cause a worker panic.

> One such example is here:
> https://queue.taskcluster.net/v1/task/V-qWmL-PSzaogy8YTcOqaQ
> 
> It looks like the situation is handled here:
> https://github.com/taskcluster/generic-worker/blob/master/artifacts.go#L372-
> L378
> 
> In the situation where the task log shows that the task it was successful
> (like for this task), but was resolved with worker-shutdown it can seem that
> the automatic retrigger for this exception will fix the issue, but really
> it's a problem with the task definition to begin with, and subsequent
> retriggers will fail as well.

I suspect the problem was that https://github.com/taskcluster/generic-worker/blob/bf89d92b6fec8eac4eacbcc11affcea3b128e5db/artifacts.go#L178-L180 (incorrectly) assuming that `putResp` is not nil. I have a fix for that I'll add here.

I'm not sure papertrail still has the logs for that particular event, but I'll see if I can get the logs, to confirm.
Component: Worker → Generic-Worker
(In reply to Pete Moore [:pmoore][:pete] from comment #1)
> (In reply to Greg Arndt [:garndt] from comment #0)
> > Some errors that are caught when trying to upload an artifact result in the
> > worker to panic.  The queue has a few scenarios where an error (such as 400)
> > is returned because of input errors (such as artifact expiration > task
> > expiration) where it would probably be better to log that to the task log
> > and fail the task.
> 
> That particular check is handled up front, when the payload is validated, so
> does not cause a worker panic.

I was lying! So it turns out, I was confusing the check with another one. The worker was already checking that artifacts do not expire before task deadline, but were not checking that they do not expire after task expiry.

This came up in bug 1380978 and I made a fix for it (and added some tests) in:
  * https://github.com/taskcluster/generic-worker/commit/e4a56acf286dd972ed328789dbc1c6473e3b4486

Lastly, I also improved the logging (to show queue error messages) in:
  * https://github.com/taskcluster/generic-worker/commit/57b5079d50951558c737b592dee59e8c6396aa0b

Note, this depending on some changes in taskcluster-client-go (such that the errors returned from API methods retain the http request/response when an API call is not successful).
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Included in generic-worker 10.1.4
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.