Open Bug 1273470 Opened 8 years ago Updated 1 year ago

Intermittent Test harness output was not a valid structured log message: | EnvironmentError: http server on port 8000 failed to start

Categories

(Testing :: web-platform-tests, defect)

Version 3
defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: aryx, Unassigned)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell unknown])

https://treeherder.mozilla.org/logviewer.html#?job_id=28002427&repo=mozilla-inbound

09:01:55     INFO - Calling ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/home/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/J7JrVt0-QIm65pSIcQTcMw/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--test-type=reftest', '--disable-e10s', '--prefs-root=/home/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/home/worker/workspace/build/tests/bin/certutil'] with output_timeout 1000
09:01:57     INFO - Using 1 client processes
09:01:59     INFO - Closing logging queue
09:01:59     INFO - queue closed
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL - Traceback (most recent call last):
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -   File "/home/worker/workspace/build/tests/web-platform/runtests.py", line 16, in <module>
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -     rv = wptrunner.main()
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -   File "/home/worker/workspace/build/tests/web-platform/harness/wptrunner/wptrunner.py", line 240, in main
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -     return not run_tests(**kwargs)
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -   File "/home/worker/workspace/build/tests/web-platform/harness/wptrunner/wptrunner.py", line 157, in run_tests
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -     test_environment.ensure_started()
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -   File "/home/worker/workspace/build/tests/web-platform/harness/wptrunner/environment.py", line 207, in ensure_started
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL -     "%s server on port %d failed to start" % (scheme, port))
09:01:59 CRITICAL - Test harness output was not a valid structured log message: 
09:01:59 CRITICAL - EnvironmentError: http server on port 8000 failed to start
Most of these are on aurora and osx. The offending message looks to usually be a "wptserve Traceback", but there is some variation in the tracebacks.

:jgraham - Any insight, or interest?
Flags: needinfo?(james)
Ah, this seems to have morphed into a quite different issue to the one in the bug summary. Maybe uplifiting the fix in 1257249 would help here?
Flags: needinfo?(james)
Looks like it was the wpt update on the 18th of November (bug 1318666) that made this go away on m-c for good. I'll play around on Try to see if I can come up with something upliftable.
On second thought, no it doesn't. I went for a more "full-featured" uplift and that's also hitting the same failures still, so I'm pretty much at a loss at this point. Caring only about the red jobs for these purposes, of course.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c98f3d7985837c6fb3ded6acba95b87962b6941&group_state=expanded&filter-searchStr=osx%20plat

James, I'm afraid this is going to require more debugging expertise than I've got here. Can you please take another look?
Flags: needinfo?(james)
(fyi I haven't forgotten about this)
I think the missing piece here is that gecko logging on OSX is not reliable with very long lines (it's truncating them or mixing them together, or something). On m-c we aren't printing the actual marionette packets due to the changes in bug 1311116. On aurora we are. I've prepared a version of the patch in that bug for uplift.
Flags: needinfo?(james)
Depends on: 1311116
this failure had an uptick on August 20th, and has 42 failures in the last week spread across most if not all our configurations we run on.

:jgraham, this looks to be harness related, I assume you are the right person to look into this, if not, could you help me find the right person?
Flags: needinfo?(james)
Whiteboard: [stockwell needswork]
Looking at a random selection from the last few days, a bunch are apparently misclassified, a couple might be harness bugs and many are marionette not starting after:

[task 2017-08-22T00:28:03.095592Z] 00:28:03     INFO - PID 878 | 1503361683086	addons.xpi	WARN	Error parsing extensions state: [Exception... "Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [amIAddonManagerStartup.readStartupData]"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: resource://gre/modules/addons/XPIProvider.jsm :: loadExtensionState :: line 1523"  data: no] Stack trace: loadExtensionState()@resource://gre/modules/addons/XPIProvider.jsm:1523 < getInstallState()@resource://gre/modules/addons/XPIProvider.jsm:1558 < checkForChanges()@resource://gre/modules/addons/XPIProvider.jsm:3093 < startup()@resource://gre/modules/addons/XPIProvider.jsm:2160 < callProvider()@resource://gre/modules/AddonManager.jsm:263 < _startProvider()@resource://gre/modules/AddonManager.jsm:733 < startup()@resource://gre/modules/AddonManager.jsm:900 < startup()@resource://gre/modules/AddonManager.jsm:3084 < observe()@jar:file:///home/worker/workspace/build/application/firefox/omni.ja!/components/addonManager.js:65

Which I don't know anything about
Flags: needinfo?(james)
I have been seeing that warning when running
"./mach marionettetest --gecko-log -" on central for the past couple
of days, but it doesn’t appear to inhibit Marionette from starting
up.  The tests are running just fine.

I don’t see the same message on "./mach run -marionette", which
could indicate the Marionette harness is doing something to the
profile that the addon manager complains about.  I don’t know
what.

tl;dr I don’t think that warning affects the outcome of the tests.
:jgraham, we have 57 failures in the last week- can you look into fixing this?
Flags: needinfo?(james)
This is not one single issue; it's basically being misstarred although the messages are confusing enough that it's understandable why that's happening. 

On Windows it seems we are seeing:
00:56:19 CRITICAL - WindowsError: [Error 5] Access is denied
which seems to happen during shutdown. It might be possible just to ignore that error since we're shutting down anyway.

On Linux/OSX I see some duplicates of Bug 1393366, and a potpourri of other random errors each at low frequency.
Flags: needinfo?(james)

No more intermittent failures within the last two years. Probably because the log output has been changed. See bug 1561224 for a failure as it can be seen those days.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=261153368&repo=autoland&lineNumber=579

Log snippet:
[task 2019-08-12T18:55:12.316Z] 18:55:12 INFO - mozversion platform_version: 70.0a1
[task 2019-08-12T18:55:19.821Z] 18:55:19 INFO - Using 1 client processes
[task 2019-08-12T18:55:23.981Z] 18:55:23 INFO - Starting http server on 127.0.0.1:8001
[task 2019-08-12T18:55:23.981Z] 18:55:23 INFO - Starting http server on 127.0.0.1:8000
[task 2019-08-12T18:55:23.981Z] 18:55:23 INFO - Starting https server on 127.0.0.1:8443
[task 2019-08-12T18:55:54.174Z] 18:55:54 INFO - Closing logging queue
[task 2019-08-12T18:55:54.189Z] 18:55:54 INFO - queue closed
[task 2019-08-12T18:55:54.189Z] 18:55:54 CRITICAL - Test harness output was not a valid structured log message:
[task 2019-08-12T18:55:54.189Z] 18:55:54 CRITICAL - Traceback (most recent call last):
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - File "/Users/cltbld/tasks/task_1565635994/build/tests/web-platform/runtests.py", line 16, in <module>
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - rv = wptrunner.main()
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - File "/Users/cltbld/tasks/task_1565635994/build/tests/web-platform/tests/tools/wptrunner/wptrunner/wptrunner.py", line 357, in main
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - return start(**kwargs)
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - File "/Users/cltbld/tasks/task_1565635994/build/tests/web-platform/tests/tools/wptrunner/wptrunner/wptrunner.py", line 344, in start
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - return not run_tests(**kwargs)
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - File "/Users/cltbld/tasks/task_1565635994/build/tests/web-platform/tests/tools/wptrunner/wptrunner/wptrunner.py", line 195, in run_tests
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - test_environment.ensure_started()
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - File "/Users/cltbld/tasks/task_1565635994/build/tests/web-platform/tests/tools/wptrunner/wptrunner/environment.py", line 218, in ensure_started
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - ", ".join("%s:%s" % item for item in failed))
[task 2019-08-12T18:55:54.190Z] 18:55:54 CRITICAL - EnvironmentError: Servers failed to start: 127.0.0.1:8888
[task 2019-08-12T18:55:54.518Z] 18:55:54 ERROR - Return code: 1
[task 2019-08-12T18:55:54.519Z] 18:55:54 ERROR - No checks run.
[task 2019-08-12T18:55:54.519Z] 18:55:54 ERROR - No suite end message was emitted by this harness.
[task 2019-08-12T18:55:54.519Z] 18:55:54 CRITICAL - # TBPL FAILURE #
[task 2019-08-12T18:55:54.519Z] 18:55:54 WARNING - setting return code to 2
[task 2019-08-12T18:55:54.519Z] 18:55:54 INFO - Running post-action listener: _package_coverage_data
[task 2019-08-12T18:55:54.519Z] 18:55:54 INFO - Running post-action listener: _resource_record_post_action
[task 2019-08-12T18:55:54.519Z] 18:55:54 INFO - Running post-action listener: process_java_coverage_data
[task 2019-08-12T18:55:54.519Z] 18:55:54 INFO - Running post-action listener: stop_device
[task 2019-08-12T18:55:54.519Z] 18:55:54 INFO - [mozharness: 2019-08-12 18:55:54.519576Z] Finished run-tests step (success)
[task 2019-08-12T18:55:54.519Z] 18:55:54 INFO - Running post-run listener: _resource_record_post_run
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - Total resource usage - Wall time: 63s; CPU: 18.0%; Read bytes: 48353280; Write bytes: 276287488; Read time: 364; Write time: 593
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - TinderboxPrint: CPU usage<br/>18.5%
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - TinderboxPrint: I/O read bytes / time<br/>48,353,280 / 364
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - TinderboxPrint: I/O write bytes / time<br/>276,287,488 / 593
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - TinderboxPrint: CPU idle<br/>206.2 (81.5%)
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - TinderboxPrint: CPU system<br/>5.5 (2.2%)
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - TinderboxPrint: CPU user<br/>41.4 (16.4%)
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - TinderboxPrint: Swap in / out<br/>307,433,472 / 0
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2019-08-12T18:55:54.576Z] 18:55:54 INFO - verify-device - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2019-08-12T18:55:54.577Z] 18:55:54 INFO - install - Wall time: 21s; CPU: 39.0%; Read bytes: 277878272; Write bytes: 265076736; Read time: 17466; Write time: 522
[task 2019-08-12T18:55:54.577Z] 18:55:54 INFO - run-tests - Wall time: 43s; CPU: 8.0%; Read bytes: 33624064; Write bytes: 8794112; Read time: 265; Write time: 53
[task 2019-08-12T18:55:54.590Z] 18:55:54 WARNING - returning nonzero exit status 2
[taskcluster 2019-08-12T18:55:54.644Z] Exit Code: 2

Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.