Bug 1333957 (Closed) - Opened 7 years ago, Closed 6 years ago

Make "Aborting task - max run time exceeded!" a Treeherder-parseable message

Categories: Taskcluster :: Workers (defect)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: philor; Assigned: pmoore
Attachments: 1 file

When a job exceeds the max run time, like with bug 1312255, the logged output is

[taskcluster 2017-01-25T23:31:13.765Z] Aborting task - max run time exceeded!
[taskcluster 2017-01-25T23:31:13.787Z]   Exit Code: 0
[taskcluster 2017-01-25T23:31:13.787Z]   User Time: 0s
[taskcluster 2017-01-25T23:31:13.787Z] Kernel Time: 0s
[taskcluster 2017-01-25T23:31:13.787Z]   Wall Time: 2h54m16.0809961s
[taskcluster 2017-01-25T23:31:13.787Z] Peak Memory: 1687552
[taskcluster 2017-01-25T23:31:13.787Z]      Result: IDLENESS_LIMIT_EXCEEDED

no part of which contains a message that treeherder knows to highlight as a failure (or, equally possible, some part of it does but the "[taskcluster 2017-01-25T23:31:13.765Z]" prefix isn't something that treeherder knows to strip out), so the result is a failure with no suggested bug to star it as, and not even a line to search for bugs mentioning.
Blocks: 1374170
Blocks: 1372229
Component: General → Generic-Worker
Pete, this appears to originate with "max run time exceeded", but the IDLENESS_LIMIT_EXCEEDED has confused people into thinking the task ran too long without producing output. I can't quite make out what it means from the source, but perhaps the message could be changed?
Flags: needinfo?(pmoore)
I ran into this failure not being flagged when working on bug 1380081.

The `IDLENESS_LIMIT_EXCEEDED` part of the result was very confusing, and led a few of us in #taskcluster into a wild goose chase trying to figure out if the process was shut down due to a hard time limit or due to some sort of "idleness" limit being exceeded.

The answer was that a hard limit had been hit - it would be really awesome if the error message could be cleaned up to make that clear, and shown in the treeherder UI :-).
Assignee: nobody → pmoore
Flags: needinfo?(pmoore)
I started a discussion with Ed Morley about logging format here: http://logs.glob.uno/?c=mozilla%23treeherder#c139994

I'll work through the options with Ed to see if we can come to a logging pattern that makes sense. I definitely like having the timestamp in there; it would feel like a shame to move the timestamp out of the square brackets, as it then isn't clear whether the timestamp was added by the worker or is part of the standard out of the task. But we'll clean this up.

The IDLENESS_LIMIT_EXCEEDED message is indeed confusing and inappropriate - I will fix this too. It comes originally from the third party library we forked; however, this is internal state that we shouldn't display. Apologies for the confusion this message caused.

https://github.com/contester/runlib/blob/90fe2e89f927e36e634e8e61cdc3d45b1fd26877/runexe/runexe_results.go#L38
It occurred to me that we probably don't want to auto-handle these failures at all, since there is no common root cause.

When a task exceeds its maximum run time, it indicates that some real problem occurred, or that the task simply wasn't given enough time to run. What went wrong (in the case that it isn't a simple not-given-enough-time situation) will be entirely task-specific.

It is trivial to bump the maximum run time of a task, so if that is the cause of a particular failure, it is very quick and easy to resolve. If that is not the root cause, the failure can potentially be anything.

Therefore I'm not sure that it would make sense to have this automatically starred as a general purpose "task took too long" bug, as that bug would then be associated with a myriad of tasks, all performing different operations and failing for different reasons.
I think it would still be useful to make this error appear in the Treeherder error summary. Just because an error appears there doesn't mean it has to match a bug (and in fact there is a blacklist for terms that shouldn't be searched in the bugscache).
Blocks: 1420078
Bug https://bugzilla.mozilla.org/show_bug.cgi?id=1420078, which is blocked by this one, has 69 failures in the past 7 days and 193 failures in the past 2 weeks.
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> I started a discussion with Ed Morley about logging format here:
> http://logs.glob.uno/?c=mozilla%23treeherder#c139994

For future reference, this link has moved to:
https://mozilla.logbot.info/treeherder/20170905#c139994
No longer blocks: 1372229
In the end, I think this might be pretty simple.

It looks like the generic-worker internal logs are prefixed:

> [taskcluster 2017-01-25T23:31:13.787Z]

The docker-worker internal logs are prefixed:

> [taskcluster 2018-02-08 10:13:16.591Z]

The simplest change is to just format the timestamps in generic-worker like we do in docker worker (i.e. 'T' -> ' ').

Does that work for you Ed?
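
For illustration, a minimal Go sketch of the one-character layout change being discussed (a standalone example, not the actual generic-worker code):

package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now().UTC()
	// Current generic-worker style prefix, with the RFC 3339 'T' separator.
	fmt.Printf("[taskcluster %v] some worker message\n", now.Format("2006-01-02T15:04:05.000Z"))
	// docker-worker style prefix: the same layout with the 'T' replaced by a space.
	fmt.Printf("[taskcluster %v] some worker message\n", now.Format("2006-01-02 15:04:05.000Z"))
}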
Flags: needinfo?(pmoore) → needinfo?(emorley)
(note, I prefer with the 'T' as it is RFC3339 - but easier to adapt just generic-worker than to change behaviour of both docker-worker and treeherder at this point)
Either sounds fine to me. I'm happy to have to make more changes to Treeherder if it means a better end state fwiw.

I'm torn as to the 'T' vs not - adhering to an RFC sounds sensible, but the alternative format looks more human readable to me (and the timestamps are used more by humans than code at the moment).
Flags: needinfo?(emorley)
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
> In the end, I think this might be pretty simple.
> 
> It looks like the generic-worker internal logs are prefixed:
> 
> > [taskcluster 2017-01-25T23:31:13.787Z]
> 
> The docker-worker internal logs are prefixed:
> 
> > [taskcluster 2018-02-08 10:13:16.591Z]
> 
> The simplest change is to just format the timestamps in generic-worker like
> we do in docker worker (i.e. 'T' -> ' ').
> 
> Does that work for you Ed?

OK that turned out to be a red herring. :-)

The logging format is already consistent with the regular expressions in treeherder, and treeherder is able to correctly deduce which lines are from the task process, and which are from the worker reporting on the task.

The problem is more subtle: currently the design is that treeherder strips out worker-internal log messages from the log file, and parses only the log lines that come from the process's standard out/err. It explicitly searches for lines with the [taskcluster <timestamp>] format, and removes them, before parsing.
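
As a rough illustration of that stripping behaviour, here is a Go sketch with an approximate regex (the real patterns live in treeherder's Python log parser, so this is only a model of the behaviour, not the implementation):

package main

import (
	"fmt"
	"regexp"
)

// Rough approximation of the worker-internal prefix that treeherder strips;
// the real patterns live in treeherder's log parser, not here.
var workerPrefix = regexp.MustCompile(`^\[taskcluster [^\]]+\] `)

func main() {
	lines := []string{
		"[taskcluster 2017-01-25T23:31:13.765Z] Aborting task - max run time exceeded!",
		"TEST-UNEXPECTED-FAIL | some/test.html | something broke",
	}
	for _, line := range lines {
		if workerPrefix.MatchString(line) {
			// Worker-internal line: removed before error parsing, so
			// "max run time exceeded" never reaches the error summary.
			continue
		}
		fmt.Println("parsed for errors:", line)
	}
}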

I'm not sure how task abortion is detected in the docker-worker case, as I'd expect it to hit the same problem. Perhaps in such cases the lines are reported with [taskcluster:error <timestamp>] or something similar. If that is the case, I can make generic-worker do the same.

Philor, do you know what the equivalent log lines from docker-worker look like? Is there a bug with links to timed-out docker worker logs? Thanks!
Flags: needinfo?(philringnalda)
With Ed's help (thanks Ed!) and looking more thoroughly into this, it looks like the timestamp isn't a problem - it will get stripped ok. The problem is that there is no standard error indicator on the line, like 'taskcluster:error', to show that this is an error line as opposed to a regular, unimportant logging line.

Ed did unearth a problem: there is a bug in the treeherder parsing which means that [taskcluster <timestamp>] will get stripped, and [taskcluster:error <timestamp>] will be recognised as an error line, but in the latter case the timestamp won't get stripped.

Therefore we are going to implement as follows:

1) I'll update these logging lines to have format:

[taskcluster:error] .....

2) I'll create a bug for treeherder to strip the timestamp out for lines like `[<any:system:descriptor> <timestamp>]`.

3) When that is fixed/deployed, I'll add timestamp back in, i.e.

[taskcluster:error <timestamp>]
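
A minimal sketch of step 1 (hypothetical helper name, not the actual generic-worker code), showing the form of line the worker will write to the task log:

package main

import (
	"fmt"
	"io"
	"os"
)

// writeTaskError writes a line in the "[taskcluster:error]" form that
// treeherder recognises as an error line (step 1 above; the timestamp is
// added back in step 3, once treeherder strips it correctly).
func writeTaskError(taskLog io.Writer, message string) {
	fmt.Fprintf(taskLog, "[taskcluster:error] %v\n", message)
}

func main() {
	writeTaskError(os.Stdout, "Aborting task - max run time exceeded!")
}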
Flags: needinfo?(philringnalda)
That's actually an interesting question, because the answer is no, I don't know what they look like. Maybe, possibly, they look like the task being reclaimed after 24 hours.
Oh, no, they don't run away, I just don't remember things first thing in the morning. "[taskcluster:error] Task timeout after 7200 seconds. Force killing container."
(In reply to Nika Layzell [:mystor] from comment #2)
> I ran into this failure not being flagged when working on bug 1380081.
> 
> The `IDLENESS_LIMIT_EXCEEDED` part of the result was very confusing, and
> lead a few of us in #taskcluster into a wild goose chase trying to figure
> out if the process was shut down to a hard time limit or due to some sort of
> "idleness" limit being exceeded.
> 
> The answer was that a hard limit had been hit - it would be really awesome
> if the error message could be cleaned up to make that clear, and shown in
> the treeherder UI :-).

I've taken the opportunity to fix this in the same PR (commit https://github.com/taskcluster/generic-worker/commit/5d101105d32bfe027661641739bb0c6bfd49d2e3). Sorry for delay and confusion!
Comment on attachment 8949478 [details] [diff] [review]
Github Pull Request for generic-worker

Review of attachment 8949478 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me!  I wonder at which point Go will get a real logging library in the standard library.

::: main.go
@@ +1153,4 @@
>  			// public/ directory artifact that includes
>  			// public/logs/live_backing.log inadvertently.
>  			if feature := task.featureArtifacts[artifact.Base().Name]; feature != "" {
> +				task.Infof("WARNING - not uploading artifact %v found in task.payload.artifacts section, since this will be uploaded later by %v", artifact.Base().Name, feature)

Is there value in having task.Warn and task.Warnf functions as well?  That might be overengineering it in this case though.

::: taskstatus.go
@@ +147,4 @@
>  		aborted,
>  		func(task *TaskRun) error {
>  			log.Printf("Aborting task %v - max run time exceeded!", task.TaskID)
> +			task.Error("Aborting task - max run time exceeded!")

just curious why the taskId is in the log message but not the task log message.  Is this for the treeherder parser?  I noticed this in a couple other places (e.g. main.go).  Any value in having the task.{Error,info}{,f} functions automatically calling log.Printf as well?
Attachment #8949478 - Attachment is patch: true
Attachment #8949478 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8949478 - Flags: review?(jhford) → review+
Thanks John!


(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #17)
> Comment on attachment 8949478 [details] [diff] [review]
> Github Pull Request for generic-worker
> 
> Review of attachment 8949478 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good to me!  I wonder at which point Go will get a real logging
> library in the standard library.
> 
> ::: main.go
> @@ +1153,4 @@
> >  			// public/ directory artifact that includes
> >  			// public/logs/live_backing.log inadvertently.
> >  			if feature := task.featureArtifacts[artifact.Base().Name]; feature != "" {
> > +				task.Infof("WARNING - not uploading artifact %v found in task.payload.artifacts section, since this will be uploaded later by %v", artifact.Base().Name, feature)
> 
> Is there value in having task.Warn and task.Warnf functions as well?  That
> might be overengineering it in this case though.

Yeah, I'll add it, to keep it clean. I checked https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/parsers.py and it doesn't treat `^[taskcluster:warn` specially, so I'll make it:

> `[taskcluster:warn <date>] <message>`

I'll add some test cases too.

> ::: taskstatus.go
> @@ +147,4 @@
> >  		aborted,
> >  		func(task *TaskRun) error {
> >  			log.Printf("Aborting task %v - max run time exceeded!", task.TaskID)
> > +			task.Error("Aborting task - max run time exceeded!")
> 
> just curious why the taskId is in the log message but not the task log
> message.  Is this for the treeherder parser?  I noticed this in a couple
> other places (e.g. main.go).  Any value in having the task.{Error,info}{,f}
> functions automatically calling log.Printf as well?

Nice spot. It is indeed intentional: the worker logs include references to multiple tasks, so the taskId makes it clear which task a message relates to. The task logs are always for one specific task, and are uploaded as a task artifact, so it should always be clear which task the message relates to. The task log itself contains a reference to the taskId in the header section, in case someone is looking at the log file out of context from the task url etc.
> Any value in having the task.{Error,info}{,f} functions automatically calling log.Printf as well?

The log.Printf call occurs inside task.Log as a backup, for when the task can't log to the task logger for any reason. This helps to keep the papertrail logs a little shorter, as workers tend to produce a large amount of logging; and in the case of a worker problem which is related to a task problem, the worker logs which task it was running at the time it crashed, which helps tie the worker events to the task events at the time of the problem.
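
To tie the two answers above together, here is a hedged sketch (hypothetical names, field layout and values - not the actual generic-worker source) of task log helpers that emit [taskcluster:warn <timestamp>] / [taskcluster:error <timestamp>] lines and fall back to log.Printf when the task logger is unavailable:

package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"time"
)

type TaskRun struct {
	TaskID  string
	taskLog io.Writer // nil if the task log can't be written to
}

// logLine writes "[taskcluster:<level> <timestamp>] <message>" to the task
// log; if that isn't possible it falls back to the worker log (log.Printf),
// including the taskId, since the worker log covers many tasks.
func (task *TaskRun) logLine(level, message string) {
	if task.taskLog == nil {
		log.Printf("Task %v: [%v] %v", task.TaskID, level, message)
		return
	}
	timestamp := time.Now().UTC().Format("2006-01-02T15:04:05.000Z")
	fmt.Fprintf(task.taskLog, "[taskcluster:%v %v] %v\n", level, timestamp, message)
}

func (task *TaskRun) Warnf(format string, args ...interface{}) {
	task.logLine("warn", fmt.Sprintf(format, args...))
}

func (task *TaskRun) Errorf(format string, args ...interface{}) {
	task.logLine("error", fmt.Sprintf(format, args...))
}

func main() {
	// Hypothetical taskId and writer, for demonstration only.
	task := &TaskRun{TaskID: "abc123", taskLog: os.Stdout}
	task.Warnf("not uploading artifact %v found in task.payload.artifacts section", "public/logs/live_backing.log")
	task.Errorf("Aborting task - max run time exceeded!")
}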
(In reply to Pete Moore [:pmoore][:pete] from comment #9)
> (note, I prefer with the 'T' as it is RFC3339 - but easier to adapt just
> generic-worker than to change behaviour of both docker-worker and treeherder
> at this point)

I spotted in the end that technically a ' ' is allowed, so docker-worker timestamps do seem to be compliant with RFC3339 after all.

https://tools.ietf.org/html/rfc3339#section-5.6 says:

> NOTE: ISO 8601 defines date and time separated by "T".
>       Applications using this syntax may choose, for the sake of
>       readability, to specify a full-date and full-time separated by
>       (say) a space character.
This will get rolled out in bug 1399401.
Depends on: 1399401
QA Contact: pmoore
This should be fixed now with the rollout of bug 1399401. I'll leave it open for a while, just in case!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers