Closed Bug 1147867 Opened 9 years ago Closed 9 years ago

Run failed with no usable log output to Treeherder

Categories

(Taskcluster :: General, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: garndt)

References

Details

https://treeherder.mozilla.org/logviewer.html#?job_id=8075503&repo=mozilla-inbound

Looks like TC timed out?
http://docs.taskcluster.net/tools/task-inspector/#5u6mL0PtQWSw4QBMe6qf6w/0

[taskcluster] Task timeout after 3600 seconds. Force killing container.
[taskcluster] Unsuccessful task run with exit code: -1 completed in 3601.9

As an aside - copying/pasting output from "Inspect Task" is pretty poor UX.
Flags: needinfo?(jlal)
See Also: → 1147977
> Looks like TC timed out?
The task failed to complete before its "task.payload.maxRunTime" was exhausted.
This is configurable in-tree. But in this case I doubt it was the limit that was the problem.
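For reference, a rough way to check the limit a given run actually had is to pull the task definition from the queue and look for payload.maxRunTime. This is only a sketch using the queue URL pattern that appears later in this bug; the grep is approximate:
----------------------------
curl -s https://queue.taskcluster.net/v1/task/5u6mL0PtQWSw4QBMe6qf6w \
  | grep -o '"maxRunTime":[[:space:]]*[0-9]*'
----------------------------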

From the log:
----------------------------
curl -L -o /home/worker/.tc-vcs/clones/hg.mozilla.org/integration/gaia-central.tar.gz \
https://queue.taskcluster.net/v1/task/agraUjZRTVCmbx6R0Y5fZQ/artifacts/public/hg.mozilla.org/integration/gaia-central.tar.gz                                                                                                           
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current                                         
                                 Dload  Upload   Total   Spent    Left  Speed                                           
100    29  100    29    0     0     24      0  0:00:01  0:00:01 --:--:--    25                                          
 11 2021M   11  233M    0     0  68153      0  8:38:29  0:59:52  7:38:37     0
----------------------------

It timed out while downloading a 2 GB artifact from S3.
I've filed bug 1147977 to look into this issue; presumably either S3 or
TCP congestion control is playing tricks on us. I suspect restarting slow downloads
would do the trick. And doing downloads in parallel would be even more awesome :)
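A minimal sketch of the restart idea, reusing the URL and output path from the log above: abort a stalled transfer early with --speed-limit/--speed-time and resume with -C - instead of re-downloading the whole 2 GB each time:
----------------------------
url=https://queue.taskcluster.net/v1/task/agraUjZRTVCmbx6R0Y5fZQ/artifacts/public/hg.mozilla.org/integration/gaia-central.tar.gz
out=/home/worker/.tc-vcs/clones/hg.mozilla.org/integration/gaia-central.tar.gz
# Abort if the transfer drops below ~100 KB/s for 60s, then retry and resume
# (-C -) from the bytes already on disk instead of starting over.
for attempt in 1 2 3 4 5; do
  curl -L -C - --speed-limit 102400 --speed-time 60 -o "$out" "$url" && break
  echo "download attempt $attempt failed, retrying" >&2
done
----------------------------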


### On-Topic:
From an infrastructure perspective this is a poorly coded task: the error happens
inside the task, so it becomes a task failure.
The fix is also in-tree, as it involves updating tc-vcs, which lives in the docker
image that is referenced in-tree.

That said, it's clearly a bug in taskcluster-specific utilities used inside the task.
But I don't think we can draw a meaningful distinction between these kinds of failures
and actual test failures. We could possibly do an artifact, but it gets complicated.
Note,
All of these are S3 being slow for some weird reason.

Except 5, which is a 500 error while downloading docker images. This is clearly a TC problem.
I believe we're planning to fix it in Q2 using S3 to store docker images.
As an update, we landed an improvement to our timeout logic here: https://bugzilla.mozilla.org/show_bug.cgi?id=1147977 ... It's hard to say if this is fixed until we monitor it for a while.
Flags: needinfo?(jlal)
James, this lack of usable failure lines makes sheriffing TC jobs a big pain and leads to dumping ground bugs like these. What can we do to better ensure that we get usable logs in the situations raised here?
Flags: needinfo?(jlal)
Priority: -- → P1
Depends on: 1174557
Depends on: 1174559
(In reply to Treeherder Robot from comment #411)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=10825454
> repository: mozilla-inbound
> start_time: 2015-06-16T06:57:21
> who: rvandermeulen[at]mozilla[dot]com
> machine: unknown
> revision: f5ce4d1cb3c2

The log file https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/sG2ZZj0WR_2oNcaI4wQ0kA/0/public/logs/live_backing.log ends with:

[taskcluster] Task timeout after 3600 seconds. Force killing container.
[taskcluster] Unsuccessful task run with exit code: -1 completed in 3688.478
seconds

This was fixed here to include the 'Error: ' prefix:
https://github.com/taskcluster/docker-worker/commit/dc5d27b5d056ecb745dddf486c290716fd21fb80?diff=unified#diff-af02ac312a10ad7d038707b65a14a7abL451

This is the task definition, showing the worker type (b2gtest):
https://queue.taskcluster.net/v1/task/sG2ZZj0WR_2oNcaI4wQ0kA

Greg,
Do we need to roll a new ami for the aws-provisioner-v1/b2gtest worker to pick up the new docker-worker change from bug 1174559, or is this done and we just need to wait for some spot instances to die that are running the old version of docker-worker?
Thanks,
Pete
Flags: needinfo?(garndt)
> Greg,
> Do we need to roll a new ami for the aws-provisioner-v1/b2gtest worker to
> pick up the new docker-worker change from bug 1174559, or is this done and
> we just need to wait for some spot instances to die that are running the old
> version of docker-worker?
> Thanks,
> Pete


The new AMI was deployed, but some machines are still around as the AMI slowly rolls out to newly spun-up instances.

Also, I'm waiting on the TH change to roll out to see how this is handled on TH.  I was told it might take a few days before the changes appear.
Flags: needinfo?(garndt)
(In reply to Greg Arndt [:garndt] from comment #413)

> Also, I'm waiting on the TH change to roll out to see how this is handled on
> TH.  I was told it might take a few days before the changes appear.

Do we have a bug for the TH change we can link?
Flags: needinfo?(jlal) → needinfo?(garndt)
Depends on: 1179835
(In reply to Selena Deckelmann :selenamarie :selena from comment #454)
> (In reply to Greg Arndt [:garndt] from comment #413)
> 
> > Also, I'm waiting on the TH change to roll out to see how this is handled on
> > TH.  I was told it might take a few days before the changes appear.
> 
> Do we have a bug for the TH change we can link?

It's listed on this bug as a blocker, https://bugzilla.mozilla.org/show_bug.cgi?id=1174557

It's been fixed, but after it was rolled out it was determined that we do not hit this log parsing logic unless there is a "finished" buildbot step, which there will not be if the task is ended prematurely.

I thought there was a bug for mozilla-taskcluster, but I guess not. I just entered one (bug 1147867) and after I roll out some changes for Edgar and Dustin today, it's next on the list to look into.
Component: TaskCluster → General
Product: Testing → Taskcluster
Version: Trunk → unspecified
Depends on: 1188363
See Also: → 1189830
(In reply to Treeherder Robot from comment #1199)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=12618932
> repository: mozilla-inbound
> start_time: 2015-08-07T01:31:24
> who: tomcat[at]mozilla[dot]com
> machine: unknown
> buildname: non-buildbot b2g-linux64 test [TC] B2G Desktop ( debug )
> revision: 7e9801d71ae1

Error was:

[taskcluster] Error uploading "public/build/target.linux-x86_64.tar.bz2" artifact. Error: Could not upload artifact. Error: read ECONNRESET
[taskcluster] Error: Unknown taskcluster error encountered.  Ask administrator to lookup incidentId in log-file. Incident ID: 27a5a52e-6d4e-40d1-a91d-7a710bbf4300

This answer looks like it might be helpful in diagnosing the issue:
http://stackoverflow.com/questions/17245881/node-js-econnreset

Strangely I can't find the Incident ID in papertrail.
from IRC:

14:12 <&pmoore> garndt: jonasfj: i see under https://www.joyent.com/developers/node/debug#mdb
14:12 <&pmoore> "There are two ways you can programmatically induce your Node process to abort, the first is process.abort which you can call at anytime in your code. The other is to run your Node process with --abort-on-uncaught-exception which tells V8 that if your process encounters an uncaught exception it should generate a core file."
14:13 <&pmoore> i wonder if we should use this in the docker worker and/or provisioner, so we get a core file
14:15 <&pmoore> e.g. if docker worker is upstart job, that automatically respawns, it could do a startup check to see if there is a core dump, which it uploads if it finds one and then terminates itself
14:16 <&pmoore> at least this way we capture uncaught exceptions and can fix code as/when we find them
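A rough sketch of what such a startup check could look like (the paths and the upload command are assumptions, not how docker-worker is actually launched):
----------------------------
#!/bin/sh
# On (re)start, check for a core file left by a previous crash, upload it for
# inspection, then run node so V8 aborts (and dumps core) on uncaught exceptions.
ulimit -c unlimited
if ls /var/tmp/core.* >/dev/null 2>&1; then
  upload-core-dump /var/tmp/core.*   # hypothetical upload step
  rm -f /var/tmp/core.*
fi
exec node --abort-on-uncaught-exception /path/to/docker-worker/bin/worker
----------------------------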
Depends on: 1189830
No longer depends on: 1189830
Flags: needinfo?(garndt)
See Also: → 1188444
See Also: → 1186619
Overall Bug 1188444 would reduce a lot of the issues with no usable output being reported to treeherder's log viewer.  Ultimately it won't fix the underlying issues though.

Some of these failed because the max run time was exceeded.  An error (that should be parsed by treeherder) is in the logs, and bug 1188444 should address that.

Bug 1188930 is for tracking tc-vcs errors.  Bug 1188444 hopefully will put the errors in the log viewer, but this bug will be for tracking the various errors and resolving them.

Bug 1186619 should address retrying artifacts, and while doing so, should also provide a compatible message for treeherder error log parsing.
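For illustration, the kind of retry that could help with the upload errors listed below is just backoff around the artifact PUT. This is only a sketch; docker-worker's real upload code is Node, and the SIGNED_PUT_URL variable here is an assumption:
----------------------------
# retry-with-backoff sketch; SIGNED_PUT_URL stands in for whatever signed S3
# URL the queue hands back for the artifact
upload_artifact() {
  file="$1"; url="$2"; delay=2
  for attempt in 1 2 3 4 5; do
    curl --fail -T "$file" "$url" && return 0
    sleep "$delay"
    delay=$((delay * 2))   # back off 2, 4, 8, 16, 32 seconds
  done
  return 1
}
upload_artifact public/build/emulator.tar.gz "$SIGNED_PUT_URL"
----------------------------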

Here is a list of the most common issues I've seen so far:
Artifact Upload (bug 1186619)
  * Error uploading "public/logs//log_raw.log" artifact. Error: Could not upload artifact. undefined Status Code: 400
  * Error uploading "public/logs//localconfig.json" artifact. Error: Could not upload artifact. undefined Status Code: 500
  * Error uploading "public/build/emulator.tar.gz" artifact. Error: Could not upload artifact. Error: write ECONNRESET

tc-vcs (bug 1189830)
  * Error: Error running command: ./repo sync -j100
  * Error: Error running command: tar -x -z -C /home/worker/workspace/B2G -f /home/worker/.tc-vcs-repo/sources/git.mozilla.org/external/caf/platform/prebuilts/gcc/linux-x86/arm/arm-eabi-4.7/master.tar.gz
  * Error: unknown type of repository (/home/worker/workspace/B2G)(this command must be run in root of repository)

Worker internal (should be resolved by TH bug 1188444)
  * [taskcluster] Error: Task timeout after 14400 seconds. Force killing container.
See Also: → 1187960
With the fix of bug 1188444, reports of "no usable output" have greatly decreased, and at the time of this writing the last job starred against this bug was 5 days ago.  Any jobs that still do not output any errors should be filed separately so we can track the fixes for those specific issues.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1182491
Assignee: nobody → garndt