Bug 1333957 (Closed) - Opened 7 years ago, Closed 6 years ago

Make "Aborting task - max run time exceeded!" a Treeherder-parseable message

Categories: Taskcluster :: Workers (defect)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: philor; Assigned: pmoore
Attachments: 1 file

When a job exceeds the max run time, like with bug 1312255, the logged output is

[taskcluster 2017-01-25T23:31:13.765Z] Aborting task - max run time exceeded!
[taskcluster 2017-01-25T23:31:13.787Z]   Exit Code: 0
[taskcluster 2017-01-25T23:31:13.787Z]   User Time: 0s
[taskcluster 2017-01-25T23:31:13.787Z] Kernel Time: 0s
[taskcluster 2017-01-25T23:31:13.787Z]   Wall Time: 2h54m16.0809961s
[taskcluster 2017-01-25T23:31:13.787Z] Peak Memory: 1687552
[taskcluster 2017-01-25T23:31:13.787Z]      Result: IDLENESS_LIMIT_EXCEEDED

no part of which contains a message that treeherder knows to highlight as a failure (or, equally possible, some part of it does but the "[taskcluster 2017-01-25T23:31:13.765Z]" prefix isn't something that treeherder knows to strip out), so the result is a failure with no suggested bug to star it as, and not even a line to search for bugs mentioning.
Blocks: 1374170
Blocks: 1372229
Component: General → Generic-Worker
Pete, this appears to originate with "max run time exceeded", but the IDLENESS_LIMIT_EXCEEDED has confused people into thinking the task ran too long without producing output. I can't quite make out what it means from the source, but perhaps the message could be changed?
Flags: needinfo?(pmoore)
I ran into this failure not being flagged when working on bug 1380081.

The `IDLENESS_LIMIT_EXCEEDED` part of the result was very confusing, and led a few of us in #taskcluster into a wild goose chase trying to figure out if the process was shut down due to a hard time limit or due to some sort of "idleness" limit being exceeded.

The answer was that a hard limit had been hit - it would be really awesome if the error message could be cleaned up to make that clear, and shown in the treeherder UI :-).
Assignee: nobody → pmoore
Flags: needinfo?(pmoore)
I started a discussion with Ed Morley about logging format here: http://logs.glob.uno/?c=mozilla%23treeherder#c139994

I'll work through the options with Ed to see if we can come to a logging pattern that makes sense. I definitely like having the timestamp in there; it would feel like a shame to move the timestamp out of the square brackets, as it then isn't clear whether the timestamp was added by the worker or is part of the standard out of the task. But we'll clean this up.

The IDLENESS_LIMIT_EXCEEDED message is indeed confusing and inappropriate - I will fix this too. It comes originally from the third party library we forked; however, this is internal state that we shouldn't display. Apologies for the confusion this message caused.

https://github.com/contester/runlib/blob/90fe2e89f927e36e634e8e61cdc3d45b1fd26877/runexe/runexe_results.go#L38
It occurred to me that we probably don't want to auto-handle these failures at all, since there is no common root cause.

When a task exceeds its maximum run time, it indicates that some real problem occurred, or that the task simply wasn't given enough time to run. What went wrong (in the case that it isn't a simple not-given-enough-time situation) will be entirely task-specific.

It is trivial to bump the maximum run time of a task, so if that is the cause of a particular failure, it is very quick and easy to resolve. If that is not the root cause, the failure can potentially be anything.

Therefore I'm not sure that it would make sense to have this automatically starred as a general purpose "task took too long" bug, as that bug would then be associated with a myriad of tasks, all performing different operations and failing for different reasons.
I think it would still be useful to make this error appear in the Treeherder error summary. Just because an error appears there doesn't mean it has to match a bug (and in fact there is a blacklist for terms that shouldn't be searched in the bugscache).
Blocks: 1420078
Bug https://bugzilla.mozilla.org/show_bug.cgi?id=1420078, which is blocked by this one, has 69 failures in the past 7 days and 193 failures in the past 2 weeks.
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> I started a discussion with Ed Morley about logging format here:
> http://logs.glob.uno/?c=mozilla%23treeherder#c139994

For future reference, this link has moved to:
https://mozilla.logbot.info/treeherder/20170905#c139994
No longer blocks: 1372229
In the end, I think this might be pretty simple.

It looks like the generic-worker internal logs are prefixed:

> [taskcluster 2017-01-25T23:31:13.787Z]

The docker-worker internal logs are prefixed:

> [taskcluster 2018-02-08 10:13:16.591Z]

The simplest change is to just format the timestamps in generic-worker like we do in docker worker (i.e. 'T' -> ' ').

Does that work for you Ed?
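
For illustration, a minimal Go sketch of the one-character layout change being discussed (a standalone example, not the actual generic-worker code):

package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now().UTC()
	// Current generic-worker style prefix, with the RFC 3339 'T' separator.
	fmt.Printf("[taskcluster %v] some worker message\n", now.Format("2006-01-02T15:04:05.000Z"))
	// docker-worker style prefix: the same layout with the 'T' replaced by a space.
	fmt.Printf("[taskcluster %v] some worker message\n", now.Format("2006-01-02 15:04:05.000Z"))
}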
Flags: needinfo?(pmoore) → needinfo?(emorley)
(note, I prefer with the 'T' as it is RFC3339 - but easier to adapt just generic-worker than to change behaviour of both docker-worker and treeherder at this point)
Either sounds fine to me. I'm happy to have to make more changes to Treeherder if it means a better end state fwiw.

I'm torn as to the 'T' vs not - adhering to an RFC sounds sensible, but the alternative format looks more human readable to me (and the timestamps are used more by humans than code at the moment).
Flags: needinfo?(emorley)
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
> In the end, I think this might be pretty simple.
> 
> It looks like the generic-worker internal logs are prefixed:
> 
> > [taskcluster 2017-01-25T23:31:13.787Z]
> 
> The docker-worker internal logs are prefixed:
> 
> > [taskcluster 2018-02-08 10:13:16.591Z]
> 
> The simplest change is to just format the timestamps in generic-worker like
> we do in docker worker (i.e. 'T' -> ' ').
> 
> Does that work for you Ed?

OK that turned out to be a red herring. :-)

The logging format is already consistent with the regular expressions in treeherder, and treeherder is able to correctly deduce which lines are from the task process, and which are from the worker reporting on the task.

The problem is more subtle: currently the design is that treeherder strips out worker-internal log messages from the log file, and parses only the log lines that come from the process's standard out/err. It explicitly searches for lines with the [taskcluster <timestamp>] format, and removes them, before parsing.
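
As a rough illustration of that stripping behaviour, here is a Go sketch with an approximate regex (the real patterns live in treeherder's Python log parser, so this is only a model of the behaviour, not the implementation):

package main

import (
	"fmt"
	"regexp"
)

// Rough approximation of the worker-internal prefix that treeherder strips;
// the real patterns live in treeherder's log parser, not here.
var workerPrefix = regexp.MustCompile(`^\[taskcluster [^\]]+\] `)

func main() {
	lines := []string{
		"[taskcluster 2017-01-25T23:31:13.765Z] Aborting task - max run time exceeded!",
		"TEST-UNEXPECTED-FAIL | some/test.html | something broke",
	}
	for _, line := range lines {
		if workerPrefix.MatchString(line) {
			// Worker-internal line: removed before error parsing, so
			// "max run time exceeded" never reaches the error summary.
			continue
		}
		fmt.Println("parsed for errors:", line)
	}
}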

I'm not sure how task abortion is detected in the docker-worker case, as I'd expect it to hit the same problem. Perhaps in such cases the lines are reported with [taskcluster:error <timestamp>] or something similar. If that is the case, I can make generic-worker do the same.

Philor, do you know what the equivalent log lines from docker-worker look like? Is there a bug with links to timed-out docker worker logs? Thanks!
Flags: needinfo?(philringnalda)
With Ed's help (thanks Ed!) and looking more thoroughly into this, it looks like the timestamp isn't a problem - it will get stripped ok. The problem is that there is no standard error indicator on the line, like 'taskcluster:error', to show that this is an error line as opposed to a regular, unimportant logging line.

Ed did unearth a problem: there is a bug in the treeherder parsing which means that [taskcluster <timestamp>] will get stripped, and [taskcluster:error <timestamp>] will be recognised as an error line, but in the latter case the timestamp won't get stripped.

Therefore we are going to implement as follows:

1) I'll update these logging lines to have format:

[taskcluster:error] .....

2) I'll create a bug for treeherder to strip the timestamp out for lines like `[<any:system:descriptor> <timestamp>]`.

3) When that is fixed/deployed, I'll add timestamp back in, i.e.

[taskcluster:error <timestamp>]
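
A minimal sketch of step 1 (hypothetical helper name, not the actual generic-worker code), showing the form of line the worker will write to the task log:

package main

import (
	"fmt"
	"io"
	"os"
)

// writeTaskError writes a line in the "[taskcluster:error]" form that
// treeherder recognises as an error line (step 1 above; the timestamp is
// added back in step 3, once treeherder strips it correctly).
func writeTaskError(taskLog io.Writer, message string) {
	fmt.Fprintf(taskLog, "[taskcluster:error] %v\n", message)
}

func main() {
	writeTaskError(os.Stdout, "Aborting task - max run time exceeded!")
}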
Flags: needinfo?(philringnalda)
That's actually an interesting question, because the answer is no, I don't know what they look like. Maybe, possibly, they look like the task being reclaimed after 24 hours.
Oh, no, they don't run away, I just don't remember things first thing in the morning. "[taskcluster:error] Task timeout after 7200 seconds. Force killing container."
(In reply to Nika Layzell [:mystor] from comment #2)
> I ran into this failure not being flagged when working on bug 1380081.
> 
> The `IDLENESS_LIMIT_EXCEEDED` part of the result was very confusing, and
> lead a few of us in #taskcluster into a wild goose chase trying to figure
> out if the process was shut down to a hard time limit or due to some sort of
> "idleness" limit being exceeded.
> 
> The answer was that a hard limit had been hit - it would be really awesome
> if the error message could be cleaned up to make that clear, and shown in
> the treeherder UI :-).

I've taken the opportunity to fix this in the same PR (commit https://github.com/taskcluster/generic-worker/commit/5d101105d32bfe027661641739bb0c6bfd49d2e3). Sorry for delay and confusion!
Comment on attachment 8949478 [details] [diff] [review]
Github Pull Request for generic-worker

Review of attachment 8949478 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me!  I wonder at which point Go will get a real logging library in the standard library.

::: main.go
@@ +1153,4 @@
>  			// public/ directory artifact that includes
>  			// public/logs/live_backing.log inadvertently.
>  			if feature := task.featureArtifacts[artifact.Base().Name]; feature != "" {
> +				task.Infof("WARNING - not uploading artifact %v found in task.payload.artifacts section, since this will be uploaded later by %v", artifact.Base().Name, feature)

Is there value in having task.Warn and task.Warnf functions as well?  That might be overengineering it in this case though.

::: taskstatus.go
@@ +147,4 @@
>  		aborted,
>  		func(task *TaskRun) error {
>  			log.Printf("Aborting task %v - max run time exceeded!", task.TaskID)
> +			task.Error("Aborting task - max run time exceeded!")

just curious why the taskId is in the log message but not the task log message.  Is this for the treeherder parser?  I noticed this in a couple other places (e.g. main.go).  Any value in having the task.{Error,info}{,f} functions automatically calling log.Printf as well?
Attachment #8949478 - Attachment is patch: true
Attachment #8949478 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8949478 - Flags: review?(jhford) → review+
Thanks John!


(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #17)
> Comment on attachment 8949478 [details] [diff] [review]
> Github Pull Request for generic-worker
> 
> Review of attachment 8949478 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good to me!  I wonder at which point Go will get a real logging
> library in the standard library.
> 
> ::: main.go
> @@ +1153,4 @@
> >  			// public/ directory artifact that includes
> >  			// public/logs/live_backing.log inadvertently.
> >  			if feature := task.featureArtifacts[artifact.Base().Name]; feature != "" {
> > +				task.Infof("WARNING - not uploading artifact %v found in task.payload.artifacts section, since this will be uploaded later by %v", artifact.Base().Name, feature)
> 
> Is there value in having task.Warn and task.Warnf functions as well?  That
> might be overengineering it in this case though.

Yeah, I'll add it, to keep it clean. I checked https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/parsers.py and it doesn't treat `^[taskcluster:warn` specially, so I'll make it:

> `[taskcluster:warn <date>] <message>`

I'll add some test cases too.

> ::: taskstatus.go
> @@ +147,4 @@
> >  		aborted,
> >  		func(task *TaskRun) error {
> >  			log.Printf("Aborting task %v - max run time exceeded!", task.TaskID)
> > +			task.Error("Aborting task - max run time exceeded!")
> 
> just curious why the taskId is in the log message but not the task log
> message.  Is this for the treeherder parser?  I noticed this in a couple
> other places (e.g. main.go).  Any value in having the task.{Error,info}{,f}
> functions automatically calling log.Printf as well?

Nice spot. It is indeed intentional: the worker logs include references to multiple tasks, so the taskId makes it clear which task a message relates to. The task logs are always for one specific task, and are uploaded as a task artifact, so it should always be clear which task the message relates to. The task log itself contains a reference to the taskId in the header section, in case someone is looking at the log file out of context from the task url etc.
> Any value in having the task.{Error,info}{,f} functions automatically calling log.Printf as well?

The log.Printf call occurs inside task.Log as a backup, for when the task can't log to the task logger for any reason. This helps to keep the papertrail logs a little shorter, as workers tend to produce a large amount of logging; and in the case of a worker problem which is related to a task problem, the worker logs which task it was running at the time it crashed, which helps tie the worker events to the task events at the time of the problem.
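
To tie the two answers above together, here is a hedged sketch (hypothetical names, field layout and values - not the actual generic-worker source) of task log helpers that emit [taskcluster:warn <timestamp>] / [taskcluster:error <timestamp>] lines and fall back to log.Printf when the task logger is unavailable:

package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"time"
)

type TaskRun struct {
	TaskID  string
	taskLog io.Writer // nil if the task log can't be written to
}

// logLine writes "[taskcluster:<level> <timestamp>] <message>" to the task
// log; if that isn't possible it falls back to the worker log (log.Printf),
// including the taskId, since the worker log covers many tasks.
func (task *TaskRun) logLine(level, message string) {
	if task.taskLog == nil {
		log.Printf("Task %v: [%v] %v", task.TaskID, level, message)
		return
	}
	timestamp := time.Now().UTC().Format("2006-01-02T15:04:05.000Z")
	fmt.Fprintf(task.taskLog, "[taskcluster:%v %v] %v\n", level, timestamp, message)
}

func (task *TaskRun) Warnf(format string, args ...interface{}) {
	task.logLine("warn", fmt.Sprintf(format, args...))
}

func (task *TaskRun) Errorf(format string, args ...interface{}) {
	task.logLine("error", fmt.Sprintf(format, args...))
}

func main() {
	// Hypothetical taskId and writer, for demonstration only.
	task := &TaskRun{TaskID: "abc123", taskLog: os.Stdout}
	task.Warnf("not uploading artifact %v found in task.payload.artifacts section", "public/logs/live_backing.log")
	task.Errorf("Aborting task - max run time exceeded!")
}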
(In reply to Pete Moore [:pmoore][:pete] from comment #9)
> (note, I prefer with the 'T' as it is RFC3339 - but easier to adapt just
> generic-worker than to change behaviour of both docker-worker and treeherder
> at this point)

I spotted in the end that technically a ' ' is allowed, so docker-worker timestamps do seem to be compliant with RFC3339 after all.

https://tools.ietf.org/html/rfc3339#section-5.6 says:

> NOTE: ISO 8601 defines date and time separated by "T".
>       Applications using this syntax may choose, for the sake of
>       readability, to specify a full-date and full-time separated by
>       (say) a space character.
This will get rolled out in bug 1399401.
Depends on: 1399401
QA Contact: pmoore
This should be fixed now with the rollout of bug 1399401. I'll leave it open for a while, just in case!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers