Closed
Bug 1407383
Opened 7 years ago
Closed 6 years ago
Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-'] due to hang in mozlog
Categories
(Testing :: web-platform-tests, defect, P1)
Tracking
(firefox58 disabled, firefox59 disabled, firefox60 fixed)
RESOLVED
FIXED
mozilla60
People
(Reporter: intermittent-bug-filer, Assigned: whimboo)
References
(Depends on 1 open bug)
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed][wptsync upstream])
Attachments
(6 files, 4 obsolete files)
59 bytes,
text/x-review-board-request
|
ato
:
review+
|
Details |
59 bytes,
text/x-review-board-request
|
jgraham
:
review+
|
Details |
1.39 KB,
text/plain
|
Details | |
3.06 KB,
text/plain
|
Details | |
59 bytes,
text/x-review-board-request
|
jgraham
:
review+
|
Details |
59 bytes,
text/x-review-board-request
|
jgraham
:
review+
|
Details |
Filed by: archaeopteryx [at] coole-files.de https://treeherder.mozilla.org/logviewer.html#?job_id=135990490&repo=mozilla-inbound https://queue.taskcluster.net/v1/task/eHR4XOXGRtmWRls2xymnfw/runs/0/artifacts/public/logs/live_backing.log [task 2017-10-10T17:37:20.644Z] 17:37:20 INFO - TEST-START | /webdriver/tests/actions/key.py [task 2017-10-10T17:54:00.677Z] 17:54:00 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/DcV_FpLGTCOfRXX2wSyYiA/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil'] [task 2017-10-10T17:54:00.714Z] 17:54:00 ERROR - timed out after 1000 seconds of no output [task 2017-10-10T17:54:00.715Z] 17:54:00 ERROR - Return code: -15 [task 2017-10-10T17:54:00.715Z] 17:54:00 ERROR - No suite end message was emitted by this harness.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 3•7 years ago
|
||
This particular failure is the earliest one, I couldn't find any other before that: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=656d53c7fd810507564844a6b0b732c116ad8efd&filter-searchStr=linux%20x64%20web-platform%20wdspec&selectedJob=135990490 :ato Could you please take a look at this bug?
Flags: needinfo?(ato)
Whiteboard: [stockwell needswork]
Comment 4•7 years ago
|
||
The orange factor report [1] seems to indicate that this is due to a timeout, and because it mostly occurs on pgo and debug builds it is probable to think this is because the timeout of the test is too short. I will try increasing it by marking it as a ‘long’ test in https://bugzil.la/1409049. [1] https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1407383&entireHistory=true&tree=all
Flags: needinfo?(ato)
Comment 5•7 years ago
|
||
(In reply to Andreas Tolfsen ‹:ato› from comment #4) > The orange factor report [1] seems to indicate that this is due to > a timeout, and because it mostly occurs on pgo and debug builds it > is probable to think this is because the timeout of the test is > too short. Scratch that. This is an overall 1000 seconds (16.6 minutes) timeout from mozprocess, meaning wptrunner or something else stops responding. I would expect wptrunner to abort the test much sooner and return with a TEST-TIMEOUT status.
Comment hidden (Intermittent Failures Robot) |
Comment 8•7 years ago
|
||
(In reply to Henrietta Maior [:henrietta_maior] from comment #7) > :ato Will you be able to look into this? I already replied to your needinfo request. I believe this is something more systemic to wptrunner or an infrastructure issue. Cc’ing jgraham to see if he can shed some light on this, or at least move the bug to the correct component.
Flags: needinfo?(ato)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(james)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 12•7 years ago
|
||
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/879f2ee477b4 Disable /webdriver/tests/actions/key.py on linux for frequent failures. r=me, a=testonly
Comment 13•7 years ago
|
||
I disabled this test for having >200 failures in the last 30 days, please remember to enable this when testing a fix
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Comment 14•7 years ago
|
||
maja_zf: Just to make you aware, the WPT key tests are now disabled due to some problem in wptrunner.
Comment 15•7 years ago
|
||
To be clear, the diagnosis that the problem's in wptrunner doesn't strike me as very certain. wptrunner tries to run a timer thread that will stop waiting for the pytest results if they exceed the timeout. And presumably that works at least sometimes since we get TIMEOUT results. But in this case we apparently don't get that far, which probably is a bug in wptrunner in that it's not handling some behaviour of pytest, but I expect that something bad is happening when the test runs so that the python interpreter never gets to run the main thread and so we never report an EXTERNAL-TIMEOUT for this and so kill the TestRunner process and continue. Without someone managing to debug the problem I don't think it's clear what the underlying fault is.
Flags: needinfo?(james)
Assignee | ||
Updated•7 years ago
|
status-firefox58:
--- → disabled
Keywords: test-disabled
Comment 16•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/879f2ee477b4
Assignee | ||
Comment 17•7 years ago
|
||
Please also see bug 1391545 for a similar output affecting Marionette tests.
Comment hidden (Intermittent Failures Robot) |
I don't think this failure has anything to do with the test that got disabled (key.py). It just happened to be ordered as the first test to run. While investigating another intermittent, I'm still seeing this failure a lot despite key.py being disabled.
Comment hidden (Intermittent Failures Robot) |
Comment 22•7 years ago
|
||
now these are failing on: TEST-START | /webdriver/tests/actions/modifier_click.py So comment 19 is spot on that we are not executing any tests, we are just hanging.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 26•7 years ago
|
||
This failure has occurred 100 times in the last 7 days. It has failed only on Linux x64 and on opt,pgo,debug build types. Failing test suites: tests-wdspec-e10s, marionette-e10s, opt-mochitest-3, platform-tests-e10s. Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=144012458&lineNumber=1229 Part of the log: task 2017-11-11T19:21:33.560Z] 19:21:33 INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/ 1228 [task 2017-11-11T19:21:33.560Z] 19:21:33 INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py 1229 [task 2017-11-11T19:38:13.591Z] 19:38:13 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/daJK4fMkTQC-LumF6CRNug/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil'] 1230 [task 2017-11-11T19:38:13.625Z] 19:38:13 ERROR - timed out after 1000 seconds of no output 1231 [task 2017-11-11T19:38:13.626Z] 19:38:13 ERROR - Return code: -15 1232 [task 2017-11-11T19:38:13.626Z] 19:38:13 ERROR - No suite end message was emitted by this harness. 1233 [task 2017-11-11T19:38:13.627Z] 19:38:13 ERROR - # TBPL FAILURE # 1234 [task 2017-11-11T19:38:13.627Z] 19:38:13 WARNING - setting return code to 2 1235 [task 2017-11-11T19:38:13.628Z] 19:38:13 INFO - Running post-action listener: _package_coverage_data 1236 [task 2017-11-11T19:38:13.628Z] 19:38:13 INFO - Running post-action listener: _resource_record_post_action 1237 [task 2017-11-11T19:38:13.628Z] 19:38:13 INFO - [mozharness: 2017-11-11 19:38:13.628489Z] Finished run-tests step (success) 1238 [task 2017-11-11T19:38:13.628Z] 19:38:13 INFO - Running post-run listener: _resource_record_post_run 1239 [task 2017-11-11T19:38:13.728Z] 19:38:13 INFO - Total resource usage - Wall time: 1022s; CPU: 1.0%; Read bytes: 1077248; Write bytes: 1633046528; Read time: 60; Write time: 1293480 1240 [task 2017-11-11T19:38:13.728Z] 19:38:13 INFO - TinderboxPrint: CPU usage<br/>0.9% 1241 [task 2017-11-11T19:38:13.728Z] 19:38:13 INFO - TinderboxPrint: I/O read bytes / time<br/>1,077,248 / 60 1242 [task 2017-11-11T19:38:13.729Z] 19:38:13 INFO - TinderboxPrint: I/O write bytes / time<br/>1,633,046,528 / 1,293,480 1243 [task 2017-11-11T19:38:13.729Z] 19:38:13 INFO - TinderboxPrint: CPU idle<br/>4,038.3 (99.1%) 1244 [task 2017-11-11T19:38:13.729Z] 19:38:13 INFO - TinderboxPrint: Swap in / out<br/>0 / 0 1245 [task 2017-11-11T19:38:13.730Z] 19:38:13 INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0 1246 [task 2017-11-11T19:38:13.732Z] 19:38:13 INFO - install - Wall time: 18s; CPU: 28.0%; Read bytes: 139264; Write bytes: 33554432; Read time: 28; Write time: 110332 1247 [task 2017-11-11T19:38:13.743Z] 19:38:13 INFO - run-tests - Wall time: 1004s; CPU: 0.0%; Read bytes: 937984; Write bytes: 1595269120; Read time: 32; Write time: 1182940 1248 [task 2017-11-11T19:38:14.070Z] 19:38:14 INFO - Running post-run listener: _upload_blobber_files 1249 [task 2017-11-11T19:38:14.070Z] 19:38:14 WARNING - Blob upload gear skipped. Missing cmdline options. 1250 [task 2017-11-11T19:38:14.070Z] 19:38:14 INFO - Running post-run listener: copy_logs_to_upload_dir 1251 [task 2017-11-11T19:38:14.071Z] 19:38:14 INFO - Copying logs to upload dir... 1252 [task 2017-11-11T19:38:14.071Z] 19:38:14 INFO - mkdir: /builds/worker/workspace/build/upload/logs 1253 [task 2017-11-11T19:38:14.073Z] 19:38:14 INFO - Copying logs to upload dir... 1254 [task 2017-11-11T19:38:14.075Z] 19:38:14 WARNING - returning nonzero exit status 2 1255 [task 2017-11-11T19:38:14.099Z] cleanup 1256 [task 2017-11-11T19:38:14.099Z] + cleanup 1257 [task 2017-11-11T19:38:14.099Z] + local rv=2 1258 [task 2017-11-11T19:38:14.099Z] + [[ -s /builds/worker/.xsession-errors ]] 1259 [task 2017-11-11T19:38:14.099Z] + cp /builds/worker/.xsession-errors /builds/worker/artifacts/public/xsession-errors.log 1260 [task 2017-11-11T19:38:14.101Z] + true 1261 [task 2017-11-11T19:38:14.101Z] + cleanup_xvfb 1262 [task 2017-11-11T19:38:14.102Z] pidof Xvfb 1263 [task 2017-11-11T19:38:14.102Z] ++ pidof Xvfb 1264 [task 2017-11-11T19:38:14.105Z] + local xvfb_pid=25 1265 [task 2017-11-11T19:38:14.105Z] + local vnc=false 1266 [task 2017-11-11T19:38:14.105Z] + local interactive=false 1267 [task 2017-11-11T19:38:14.106Z] + '[' -n 25 ']' 1268 [task 2017-11-11T19:38:14.106Z] + [[ false == false ]] 1269 [task 2017-11-11T19:38:14.106Z] + [[ false == false ]] 1270 [task 2017-11-11T19:38:14.106Z] + kill 25 1271 [task 2017-11-11T19:38:14.107Z] + screen -XS xvfb quit 1272 [task 2017-11-11T19:38:14.107Z] XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0" 1273 [task 2017-11-11T19:38:14.107Z] after 783 requests (783 known processed) with 0 events remaining. 1274 [task 2017-11-11T19:38:14.111Z] compizconfig - Info: Backend : ini 1275 [task 2017-11-11T19:38:14.112Z] compizconfig - Info: Integration : true 1276 [task 2017-11-11T19:38:14.112Z] compizconfig - Info: Profile : default 1277 [task 2017-11-11T19:38:14.157Z] No screen session found. 1278 [task 2017-11-11T19:38:14.157Z] + true 1279 [task 2017-11-11T19:38:14.157Z] + exit 2 1280 [taskcluster 2017-11-11 19:38:14.457Z] === Task Finished === 1281 [taskcluster 2017-11-11 19:38:19.671Z] Unsuccessful task run with exit code: 2 completed in 2514.307 seconds :whimboo Do you have any updates?
Flags: needinfo?(hskupin)
Assignee | ||
Comment 27•7 years ago
|
||
(In reply to OrangeFactor Robot from comment #24) > For more details, see: > https://brasstacks.mozilla.com/orangefactor/ > ?display=Bug&bugid=1407383&startday=2017-10-30&endday=2017-11-05&tree=all If that is becoming a blocking issue James, Andreas, or Maja should have a look at this bug. They worked most on wptrunner. But I'm not sure whom of them have the time right now.
Flags: needinfo?(mjzffr)
Flags: needinfo?(james)
Flags: needinfo?(ato)
Comment 28•7 years ago
|
||
Something that's happening on mochitest, marionette and wpt clearly isn't a single bug in the wpt test harness. There are either multiple unrelated issues being bucketed together here, or it's in a lower layer like mozharness or marionette.
Assignee | ||
Comment 29•7 years ago
|
||
Correct, but someone would have to investigate that first. wptrunner handles the process itself, and as such it cannot be related to the process handling in Marionette. Although we have a similar issue there (bug 1391545).
Flags: needinfo?(hskupin)
Comment hidden (Intermittent Failures Robot) |
Comment 31•7 years ago
|
||
Has anyone found a reliable way to reproduce this locally?
Comment 32•7 years ago
|
||
looking at this bug, Firefox is not starting up. There is 0 output from firefox or from Marionette after geckdriver starts up for wdspec tests. For mochitest we dont use geckodriver so thats not the cause there
Assignee | ||
Comment 33•7 years ago
|
||
If this is happening fairly consistent the best option is really to just request a one click loaner and observe the tests via VNC. Given that those failures are on Linux, it is very easy to handle via Treeherder/Taskcluster.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 36•7 years ago
|
||
There were 144 failures related to this bug in the last 7 days on Linux. :whimboo any updates for this, it has spiked pretty high lately?
Updated•7 years ago
|
Whiteboard: [stockwell disabled] → [stockwell needswork]
Assignee | ||
Comment 37•7 years ago
|
||
(In reply to Andrei Ciure[:andrei_ciure_] from comment #36) > :whimboo any updates for this, it has spiked pretty high lately? I'm not the only one who has to work on all the intermittent failures. That doesn't scale. As such I put ni? for James, Andreas, and Maja a while ago. I hope one of those can find the time to check specifically this bug.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Flags: needinfo?(mjzffr)
Comment 40•7 years ago
|
||
James, Andreas, Looking at the log it looks like we never get to the point where we call Firefox to startup (if I understand the flow). # An Example of a test running [task 2017-11-20T10:46:49.696Z] 10:46:49 INFO - Starting http server on 127.0.0.1:8000 [task 2017-11-20T10:46:49.697Z] 10:46:49 INFO - Starting http server on 127.0.0.1:8443 [task 2017-11-20T10:46:49.706Z] 10:46:49 INFO - Starting http server on 127.0.0.1:8001 [task 2017-11-20T10:46:49.717Z] 10:46:49 INFO - PID 852 | 1511174809713 geckodriver INFO geckodriver 0.19.1 ( '589acfe789d8' '2017-11-20 13:59 +0800') [task 2017-11-20T10:46:49.721Z] 10:46:49 INFO - PID 852 | 1511174809716 geckodriver INFO Listening on 127.0.0.1:4444 [task 2017-11-20T10:46:50.220Z] 10:46:50 INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/ [task 2017-11-20T10:46:50.221Z] 10:46:50 INFO - TEST-START | /webdriver/tests/actions/modifier_click.py [task 2017-11-20T10:46:50.540Z] 10:46:50 INFO - STDOUT: ============================= test session starts ============================== [task 2017-11-20T10:46:50.540Z] 10:46:50 INFO - STDOUT: platform linux2 -- Python 2.7.12, pytest-2.9.1, py-1.4.31, pluggy-0.3.1 -- /builds/worker/workspace/build/venv/bin/python [task 2017-11-20T10:46:50.541Z] 10:46:50 INFO - STDOUT: rootdir: /builds/worker/workspace/build/tests/web-platform/tests/webdriver/tests/actions, inifile: [task 2017-11-20T10:46:50.541Z] 10:46:50 INFO - STDOUT: collecting ... [task 2017-11-20T10:46:50.558Z] 10:46:50 INFO - STDOUT: collected 9 items [task 2017-11-20T10:46:50.558Z] 10:46:50 INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/modifier_click.py::test_modifier_click[modifier0-ctrlKey] [task 2017-11-20T10:46:50.558Z] 10:46:50 INFO - PID 852 | 1511174810555 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.IRSxkIla0Kpi" [task 2017-11-20T10:46:52.629Z] 10:46:52 INFO - PID 852 | [865, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /builds/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2872 [task 2017-11-20T10:47:19.603Z] 10:47:19 INFO - PID 852 | 1511174839597 Marionette INFO Enabled via --marionette # An example of a failure [task 2017-11-20T10:44:56.874Z] 10:44:56 INFO - TEST-SKIP | /webdriver/tests/contexts/maximize_window.py | took 0ms [task 2017-11-20T10:44:56.891Z] 10:44:56 INFO - Starting runner [task 2017-11-20T10:44:56.899Z] 10:44:56 INFO - Starting http server on 127.0.0.1:8443 [task 2017-11-20T10:44:56.900Z] 10:44:56 INFO - Starting http server on 127.0.0.1:8000 [task 2017-11-20T10:44:56.917Z] 10:44:56 INFO - PID 852 | 1511174696914 geckodriver INFO geckodriver 0.19.1 ( '589acfe789d8' '2017-11-20 13:59 +0800') [task 2017-11-20T10:44:56.921Z] 10:44:56 INFO - PID 852 | 1511174696918 geckodriver INFO Listening on 127.0.0.1:4444 [task 2017-11-20T10:44:57.421Z] 10:44:57 INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/ [task 2017-11-20T10:44:57.422Z] 10:44:57 INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py [task 2017-11-20T11:01:37.471Z] 11:01:37 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/O13YgUo-SwOE8BF5SXh-kw/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil'] [task 2017-11-20T11:01:37.505Z] 11:01:37 ERROR - timed out after 1000 seconds of no output [task 2017-11-20T11:01:37.506Z] 11:01:37 ERROR - Return code: -15 [task 2017-11-20T11:01:37.508Z] 11:01:37 ERROR - No suite end message was emitted by this harness. The only obvious this for me is that Firefox doesnt start up but also looking we don't see any logging from mozrunner being emitted either from GeckoDriver. E.g. mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.IRSxkIla0Kpi" is not shown anywhere.
Comment 41•7 years ago
|
||
(In reply to David Burns :automatedtester from comment #40) > Looking at the log it looks like we never get to the point where > we call Firefox to startup (if I understand the flow). So based on the log you pasted, geckodriver starts but never accepts a new connection. It would be extremely useful here, I think, with a trace-level log from geckodriver. I wonder if doing a try build with lots of Wd retriggers and a patch to pass --webdriver-arg='-vv' by default would show us anything interesting.
Flags: needinfo?(ato)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment 44•7 years ago
|
||
try run with verbose logging https://treeherder.mozilla.org/#/jobs?repo=try&revision=2fd843112adc&selectedJob=148141221
Updated•7 years ago
|
Flags: needinfo?(james)
Comment 45•7 years ago
|
||
looking at the log with more verbose logging we don't get pytest starting up and collecting tests Good run [task 2017-11-28T12:25:29.916Z] 12:25:29 INFO - PID 847 | 1511871929913 geckodriver INFO geckodriver 0.19.1 ( '2fd843112adc' '2017-11-28 12:01 +0000') [task 2017-11-28T12:25:29.916Z] 12:25:29 INFO - Starting http server on 127.0.0.1:8443 [task 2017-11-28T12:25:29.919Z] 12:25:29 INFO - PID 847 | 1511871929913 webdriver::httpapi DEBUG Creating routes [task 2017-11-28T12:25:29.926Z] 12:25:29 INFO - PID 847 | 1511871929918 geckodriver INFO Listening on 127.0.0.1:4444 [task 2017-11-28T12:25:30.418Z] 12:25:30 INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/ [task 2017-11-28T12:25:30.418Z] 12:25:30 INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py [task 2017-11-28T12:25:30.762Z] 12:25:30 INFO - STDOUT: ============================= test session starts ============================== [task 2017-11-28T12:25:30.763Z] 12:25:30 INFO - STDOUT: platform linux2 -- Python 2.7.12, pytest-2.9.1, py-1.4.31, pluggy-0.3.1 -- /builds/worker/workspace/build/venv/bin/python [task 2017-11-28T12:25:30.763Z] 12:25:30 INFO - STDOUT: rootdir: /builds/worker/workspace/build/tests/web-platform/tests/webdriver/tests/actions, inifile: [task 2017-11-28T12:25:30.764Z] 12:25:30 INFO - STDOUT: collecting ... [task 2017-11-28T12:25:30.780Z] 12:25:30 INFO - STDOUT: collected 3 items [task 2017-11-28T12:25:30.780Z] 12:25:30 INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key_shortcuts.py::test_mod_a_and_backspace_deletes_all_text [task 2017-11-28T12:25:30.781Z] 12:25:30 INFO - PID 847 | 1511871930773 webdriver::server DEBUG -> POST /session {"capabilities": {"alwaysMatch": {"moz:firefoxOptions": {"binary": "/builds/worker/workspace/build/application/firefox/firefox", "prefs": {"network.dns.localDomains": "web-platform.test,www.web-platform.test,www1.web-platform.test,www2.web-platform.test,xn--n8j6ds53lwwkrqhv28a.web-platform.test,xn--lve-6lad.web-platform.test"}}}}} [task 2017-11-28T12:25:30.784Z] 12:25:30 INFO - PID 847 | 1511871930777 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.hpPAvne6jv1N" [task 2017-11-28T12:25:30.785Z] 12:25:30 INFO - PID 847 | 1511871930780 geckodriver::marionette TRACE connection attempt 0/600 [task 2017-11-28T12:25:30.886Z] 12:25:30 INFO - PID 847 | 1511871930880 geckodriver::marionette TRACE connection attempt 1/600 [task 2017-11-28T12:25:30.988Z] 12:25:30 INFO - PID 847 | 1511871930981 geckodriver::marionette TRACE connection attempt 2/600 [task 2017-11-28T12:25:31.089Z] 12:25:31 INFO - PID 847 | 1511871931081 geckodriver::marionette TRACE connection attempt 3/600 bad run [task 2017-11-28T13:27:29.259Z] 13:27:29 INFO - TEST-SKIP | /webdriver/tests/element_click/select.py | took 0ms [task 2017-11-28T13:27:29.282Z] 13:27:29 INFO - Starting http server on 127.0.0.1:8001 [task 2017-11-28T13:27:29.282Z] 13:27:29 INFO - Starting runner [task 2017-11-28T13:27:29.285Z] 13:27:29 INFO - Starting http server on 127.0.0.1:8000 [task 2017-11-28T13:27:29.294Z] 13:27:29 INFO - Starting http server on 127.0.0.1:8443 [task 2017-11-28T13:27:29.306Z] 13:27:29 INFO - PID 849 | 1511875649302 geckodriver INFO geckodriver 0.19.1 ( '2fd843112adc' '2017-11-28 12:01 +0000') [task 2017-11-28T13:27:29.307Z] 13:27:29 INFO - PID 849 | 1511875649302 webdriver::httpapi DEBUG Creating routes [task 2017-11-28T13:27:29.310Z] 13:27:29 INFO - PID 849 | 1511875649306 geckodriver INFO Listening on 127.0.0.1:4444 [task 2017-11-28T13:27:29.809Z] 13:27:29 INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/ [task 2017-11-28T13:27:29.810Z] 13:27:29 INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py [task 2017-11-28T13:44:09.832Z] 13:44:09 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/UBMTC4VvSEOM_TnO_q7h0A/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil'] [task 2017-11-28T13:44:09.883Z] 13:44:09 ERROR - timed out after 1000 seconds of no output [task 2017-11-28T13:44:09.883Z] 13:44:09 ERROR - Return code: -15 [task 2017-11-28T13:44:09.883Z] 13:44:09 ERROR - No suite end message was emitted by this harness. [task 2017-11-28T13:44:09.884Z] 13:44:09 ERROR - # TBPL FAILURE # [task 2017-11-28T13:44:09.884Z] 13:44:09 WARNING - setting return code to 2 [task 2017-11-28T13:44:09.885Z] 13:44:09 INFO - Running post-action listener: _package_coverage_data because we dont get oytest starting we then never get a session request which then times out Dave, can you suggest how we can delve deeper into this?
Flags: needinfo?(dave.hunt)
Assignee | ||
Comment 46•7 years ago
|
||
Maybe this is also all related to bug 1421289, which is causing various 1000s timeouts for Mochitests, and Marionette tests.
Depends on: 1421289
Comment 47•7 years ago
|
||
> Dave, can you suggest how we can delve deeper into this? I don't think I have enough context to add value here. The output in comment 60 suggests a TEST-START is being reported before the pytest session starts, which seems odd. Could you direct me to where pytest is invoked? I would suggest that either the issue is occurring before this, or perhaps pytest is throwing an exception that's being ignored.
Flags: needinfo?(dave.hunt)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment 52•7 years ago
|
||
So it looks like we are hitting some external timeout if we look at https://treeherder.mozilla.org/logviewer.html#?job_id=148432942&repo=try&lineNumber=1223 More debug statements going in.
Comment hidden (mozreview-request) |
Comment hidden (Intermittent Failures Robot) |
Comment 55•7 years ago
|
||
Ok, I have managed to track it down to the invocaction of Pytest and then it hangs so it looks like my initial gut of Pytest being the cause is likely. Now... to understand why... We get to https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/pytestrunner/runner.py#54-62 before we hang. See log with my statements in https://treeherder.mozilla.org/logviewer.html#?job_id=148724638&repo=try&lineNumber=1251
Comment 56•7 years ago
|
||
Not knowing the best way to get through all of this I don't really know what are the next steps. I definitely know that _something_ in pytest is not returning. We pass in a file and other arguments, see comment above for link to code, and then wait for pytest. We then timeout, somehow, and realise its a timeout but mozprocess comes a long and kills us for taking too long. Dave, if you could give me pointers on where to start looking in pytest, I would be grateful. We are using the vendored version in third_party/python/pytest in mozilla central. James, should we, when hitting EXTERNAL-TIMEOUT do `result = Stop` in wptrunner as there seems to be an error we can't recover from?
Flags: needinfo?(james)
Flags: needinfo?(dave.hunt)
Assignee | ||
Comment 57•7 years ago
|
||
pytest in the following folder is kinda old: https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/pytest Maybe it should be upgraded to a newer release?
Comment 58•7 years ago
|
||
It looks to me like pytest isn't even starting, as we'd expect to see some output in the console. You could try the the pytest-instafail plugin (https://pypi.python.org/pypi/pytest-instafail/), which shows errors/failures immediately rather than wait until the end of the suite. Alternatively, pytest-faulthandler (https://pypi.python.org/pypi/pytest-faulthandler) might shed some light on what's happening. You could also try adding some debug in WdspecRun.run(), where the thread managing the run is created and ultimately timed out. I would also suggest trying a later pytest release, as 3.x has been available for more than a year.
Flags: needinfo?(dave.hunt)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Updated•7 years ago
|
Attachment #8934146 -
Attachment is obsolete: true
Updated•7 years ago
|
Attachment #8933670 -
Attachment is obsolete: true
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 81•7 years ago
|
||
so from try runs the vendor update drops the chances of it happening but doesnt outright solve it
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 86•7 years ago
|
||
(In reply to David Burns :automatedtester from comment #81) > so from try runs the vendor update drops the chances of it > happening but doesnt outright solve it Do you have an indication of the improvement? Do you think there are more bugs in pytest downstream that causes it to occasionally freeze up?
Comment 87•7 years ago
|
||
mozreview-review |
Comment on attachment 8935728 [details] Bug 1407383: Wait for the URL to change in test before asserting https://reviewboard.mozilla.org/r/206624/#review212366 ::: testing/web-platform/tests/webdriver/tests/navigation/current_url.py:82 (Diff revision 2) > > def test_get_current_url_after_modified_location(session): > + start = session.transport.send("GET", "session/%s/url" % session.session_id) > session.execute_script("window.location.href = 'about:blank#wd_test_modification'") > - > + wait(session, > + lambda d: session.transport.send("GET", "session/%s/url" % session.session_id) != start.body["value"], d is unused.
Attachment #8935728 -
Flags: review?(ato) → review+
Comment 88•7 years ago
|
||
(In reply to Andreas Tolfsen ‹:ato› from comment #86) > (In reply to David Burns :automatedtester from comment #81) > > > so from try runs the vendor update drops the chances of it > > happening but doesnt outright solve it > > Do you have an indication of the improvement? Do you think there > are more bugs in pytest downstream that causes it to occasionally > freeze up? we go from ~15% intermittent rate to ~2.3%. I feel like it would be worth watching after this lands to know if its the next top prioritty issue As for pytest, I am not sure. It could be due to the other mozprocess intermittents we have on file.
Comment hidden (mozreview-request) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 92•6 years ago
|
||
mozreview-review |
Comment on attachment 8935420 [details] Bug 1407383 - Update metadata for WebDriver https://reviewboard.mozilla.org/r/206312/#review212780 I'l l apply this during the next wpt udpate.
Attachment #8935420 -
Flags: review?(james) → review+
Comment 93•6 years ago
|
||
mozreview-review |
Comment on attachment 8932433 [details] Bug 1407383 - Update Vendored packages for Web-Platform-Tests https://reviewboard.mozilla.org/r/203478/#review212782 I think it's easier to start a wpt update rather than pre-apply this.
Attachment #8932433 -
Flags: review?(james)
Comment hidden (mozreview-request) |
Updated•6 years ago
|
Attachment #8932433 -
Attachment is obsolete: true
Updated•6 years ago
|
Attachment #8935420 -
Attachment is obsolete: true
Comment 95•6 years ago
|
||
Pushed by dburns@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/81d47aa03334 Wait for the URL to change in test before asserting r=ato
Comment 96•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/81d47aa03334
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 101•6 years ago
|
||
I can’t say I see any improvement from the OrangeFactor bot.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 106•6 years ago
|
||
There are 71 failures in the last 7 days. They occur on Linux32, Linux x64 (ccov, jsdcov, nightly) nd the affected builds type are debug, opt, pgo. Log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=153448184&lineNumber=5348 [task 2017-12-27T20:18:52.540Z] 20:18:52 INFO - TEST-START | toolkit/mozapps/extensions/test/xpinstall/browser_navigateaway3.js 5340 [task 2017-12-27T20:18:52.622Z] 20:18:52 INFO - GECKO(4930) | Sending InstallTriggered 5341 [task 2017-12-27T20:18:52.666Z] 20:18:52 INFO - GECKO(4930) | 1514405932656 addons.xpi DEBUG Download started for http://example.com/browser/toolkit/mozapps/extensions/test/xpinstall/amosigned.xpi to file /tmp/tmp-bvk.xpi 5342 [task 2017-12-27T20:18:52.666Z] 20:18:52 INFO - GECKO(4930) | 1514405932662 addons.xpi DEBUG Download of http://example.com/browser/toolkit/mozapps/extensions/test/xpinstall/amosigned.xpi completed. 5343 [task 2017-12-27T20:18:52.708Z] 20:18:52 INFO - GECKO(4930) | 1514405932704 addons.repository DEBUG cacheAddons: enabled false IDs ["unsigned-xpi@tests.mozilla.org"] 5344 [task 2017-12-27T20:18:52.733Z] 20:18:52 INFO - GECKO(4930) | Sending PageLoaded 5345 [task 2017-12-27T20:26:21.477Z] 20:26:21 INFO - GECKO(4930) | 1514406381473 addons.manager INFO Skipping background update check 5346 [task 2017-12-27T20:28:21.479Z] 20:28:21 INFO - GECKO(4930) | *** Blocklist::notify: Requesting http://127.0.0.1:8888/extensions-dummy/blocklistURL 5347 [task 2017-12-27T20:28:21.495Z] 20:28:21 INFO - GECKO(4930) | *** Blocklist::onXMLLoad: there was an error during load 5348 [task 2017-12-27T20:45:01.531Z] 20:45:01 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/mochitest/runtests.py', '--disable-e10s', '--total-chunks', '7', '--this-chunk', '3', '--jscov-dir-prefix=/builds/worker/workspace/build/blobber_upload_dir', '--appname=/builds/worker/workspace/build/application/firefox/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/a7vaVafyQsW-qx46KzZi2w/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--setpref=webgl.force-enabled=true', '--quiet', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/browser-chrome-coverage_raw.log', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/browser-chrome-coverage_errorsummary.log', '--use-test-media-devices', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/builds/worker/workspace/build', '--log-raw=-', '--flavor=browser', '--chunk-by-runtime', '--timeout=1200'] 5349 [task 2017-12-27T20:45:01.556Z] 20:45:01 ERROR - timed out after 1000 seconds of no output 5350 [task 2017-12-27T20:45:01.556Z] 20:45:01 ERROR - Return code: -15 5351 [task 2017-12-27T20:45:01.557Z] 20:45:01 ERROR - No suite end message was emitted by this harness. 5352 [task 2017-12-27T20:45:01.557Z] 20:45:01 INFO - TinderboxPrint: mochitest-browser-chrome-coverage<br/>213/0/0 5353 [task 2017-12-27T20:45:01.557Z] 20:45:01 ERROR - # TBPL FAILURE # 5354 [task 2017-12-27T20:45:01.558Z] 20:45:01 WARNING - setting return code to 2 5355 [task 2017-12-27T20:45:01.558Z] 20:45:01 ERROR - The mochitest suite: browser-chrome-coverage ran with return status: FAILURE 5356 [task 2017-12-27T20:45:01.559Z] 20:45:01 INFO - Running post-action listener: _package_coverage_data 5357 [task 2017-12-27T20:45:01.559Z] 20:45:01 INFO - Beginning compression of JSDCov artifacts... 5358 [task 2017-12-27T20:45:01.559Z] 20:45:01 INFO - Running command: ['zip', '-r', '/builds/worker/workspace/build/blobber_upload_dir/jsdcov_artifacts.zip', '.', '-i', 'jscov*.json'] in /builds/worker/workspace/build/blobber_upload_dir 5359 [task 2017-12-27T20:45:01.559Z] 20:45:01 INFO - Copy/paste: zip -r /builds/worker/workspace/build/blobber_upload_dir/jsdcov_artifacts.zip . -i jscov*.json
Flags: needinfo?(ato)
Comment hidden (Intermittent Failures Robot) |
Comment 108•6 years ago
|
||
It doesn’t look like the pytest/py vendor upgrade made any difference, but I don’t have any better ideas what this could be and I don’t currently have time allocated to dig in.
Flags: needinfo?(ato)
Comment 109•6 years ago
|
||
The latest version of pytest captures logging, and includes it in the output when a test fails. It's possible that this could reveal more information when the process is appearing to hang. To enable live logging you require pytest 3.3.1 or later, and the command line options: -s and --log-cli-level with the appropriate log level. See https://docs.pytest.org/en/3.3.1/logging.html for more information.
Comment 110•6 years ago
|
||
:automatedtester, can you find someone to look into this high frequency failure that we have had for 3+ months?
Flags: needinfo?(dburns)
Comment 111•6 years ago
|
||
(In reply to Andreas Tolfsen ‹:ato› from comment #108) > It doesn’t look like the pytest/py vendor upgrade made any > difference, but I don’t have any better ideas what this could be > and I don’t currently have time allocated to dig in. the changes I have done for vendored code has not been downstreamed from the w3c. I will get James to schedule a sync
Flags: needinfo?(dburns)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 114•6 years ago
|
||
I still see failures here, I assume a sync took place sometime in the last day or so, :automatedtester, can you follow up here?
Flags: needinfo?(dburns)
Assignee | ||
Comment 115•6 years ago
|
||
No, AFAIK the sync has caused a lot of failures to appear, so it might not have been done yet. At least I don't see an updated pytest: https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/pytest James, can you supply details regarding the sync?
Flags: needinfo?(dburns) → needinfo?(james)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
|
Component: geckodriver → web-platform-tests
Comment hidden (Intermittent Failures Robot) |
Comment 120•6 years ago
|
||
:jgraham, this still shows a lot of failures- can you help prioritize this?
Comment 121•6 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #120) > :jgraham, this still shows a lot of failures- can you help prioritize this? the wpt sync should finish today or first thing tomorrow.
Flags: needinfo?(james)
Comment 122•6 years ago
|
||
It looks like this isn't fixed by the sync: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=83bfdce958e7de595edd901f1c60686127ca22e9 I think someone needs to get a loaner, run the test in a loop until it fails and then use gdb to figure out what's going on.
Comment 123•6 years ago
|
||
(In reply to James Graham [:jgraham] from comment #122) > It looks like this isn't fixed by the sync: > https://treeherder.mozilla.org/#/jobs?repo=mozilla- > inbound&revision=83bfdce958e7de595edd901f1c60686127ca22e9 > > I think someone needs to get a loaner, run the test in a loop until it fails > and then use gdb to figure out what's going on. I can't see this error on that push My original work saw it move from >15% down to 2%(comment 88). I would like to see the numbers over the next few days and then see what our next steps are.
Assignee | ||
Comment 124•6 years ago
|
||
Please note that the affected test now seems to be key_shortcuts.py and not key.py. At least in those two failure cases on inbound.
Comment hidden (Intermittent Failures Robot) |
Comment 126•6 years ago
|
||
I can reproduce a hang on TC, but not on my machine (so far) by running something like: mach wpt --test-type wdspec --repeat 1000 Unfortunately I can't figure out how to debug it on TC: * gdb -p (to attach to the existing processes) seems to be disabled on taskcluster since it would allow interacting with processes outside the container * gdb python --args <command line above> doesen't provide me with a way to look at the child processes (afaik) * rr doesn't work on AWS I'm pretty sure there's some kind of deadlock here. It could be something to do with the log output, but that's only a guess. Debugging this blind is hard, so any suggestions would be appreciated.
Assignee | ||
Comment 127•6 years ago
|
||
Similar to David I see this on and off on MacOS. gdb shouldn't be possible to use anymore. So any instructions in how to best debug this with lldb? Just in case I see it again.
Comment hidden (Intermittent Failures Robot) |
Comment 129•6 years ago
|
||
I ran these tests all night on an OSX debug build and didn't see a hang, so if you can reproduce that's very interesting. In after starting lldb it looks like you do something like process attach --pid <pid> I don't know if there's lldb extensions for python like the gdb ones that allow py-bt, but in any case backtraces from each running python processs seem interesting.
Assignee | ||
Comment 130•6 years ago
|
||
So I hit this today myself again, and I tried to debug it via lldb. But the stack looks pretty useless because it is not symbolicated. I also tried with pyrasite to connect to the hanging processes but it was also not working.
At least the last spawned process looks a bit different for the first thread's last frames:
+ 2646 connection_recv_obj (in _multiprocessing.so) + 90 [0x10e52545a]
+ 2646 conn_recv_string (in _multiprocessing.so) + 68 [0x10e5259c5]
+ 2646 _conn_recvall (in _multiprocessing.so) + 49 [0x10e525a9c]
+ 2646 read (in libsystem_kernel.dylib) + 10
So it seems to hang in communicating with the other processes?
What I can say that it stops in between those two lines. Means when the issue occurs pytest does not print it's info line:
> 0:04.67 TEST_START: /webdriver/tests/actions/key_shortcuts.py
> ============================================================== test session starts ==============================================================
Assignee | ||
Comment 131•6 years ago
|
||
(In reply to James Graham [:jgraham] from comment #129) > I ran these tests all night on an OSX debug build and didn't see a hang, so > if you can reproduce that's very interesting. Btw how did you run those? With `--repeat-until-unexpected`? If yes, that should not work because if we really hang that early, you should always execute the full command and run the test only once. And that until it hangs.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 137•6 years ago
|
||
:jgraham this failure is showing up at a very high rate still, can we disable all tests that affect this on linux until this is resolved? do you have other ideas?
Flags: needinfo?(james)
Assignee | ||
Comment 138•6 years ago
|
||
Please note that there are still a lot of falsely starred entries in the list of OF. Also this is not only related to this specific test but it happens for the very first test as run in the job. So disabling just the first test doesn't help. We would have to completely turn of wdspec tests.
Comment hidden (Intermittent Failures Robot) |
Comment 140•6 years ago
|
||
(In reply to OrangeFactor Robot from comment #139) > 23 failures in 141 pushes (0.163 failures/push) were associated with this > bug yesterday. > so its 18 failures in 141 pushes(0.127 failures/push)
Comment 141•6 years ago
|
||
a 12%+ failure rate- should we hide this test or restrict it to a single config until this is fixed?
Flags: needinfo?(dburns)
Comment 142•6 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #141) > a 12%+ failure rate- should we hide this test or restrict it to a single > config until this is fixed? As mentioned in comment 138 that would turn off all wdspec tests as its not related to a test. From a quick search of Bugzilla[1] I think there might be an underlying issue in mozprocess issue or there might be a resource issue from TC as it appears, anecdotally, that we are hitting a deadlock probably due to I/O [1] https://bugzilla.mozilla.org/buglist.cgi?quicksearch=mozprocess%20timed%20out&list_id=13974869
Flags: needinfo?(dburns)
Comment 143•6 years ago
|
||
I asked if we should restrict this to a single config until this is fixed, for example linux64/debug only. Who is working on the mozprocess issue? Is there a bug we can track instead of this?
Flags: needinfo?(dburns)
Assignee | ||
Comment 144•6 years ago
|
||
David, you said that the underlying issue might be in pytest. How safe is this assumption? I mean in relation to a possible issue with mozprocess, which both would be mutual exclusive.
Flags: needinfo?(james)
Comment 145•6 years ago
|
||
Sorry if I’m missing some context here, but what uses mozprocess here?
Comment 146•6 years ago
|
||
(In reply to Andreas Tolfsen ‹:ato› from comment #145) > Sorry if I’m missing some context here, but what uses mozprocess here? Mozprocess starts everything and makes sure that they don't just hang by killing jobs that don't output for 1000s.
Flags: needinfo?(dburns)
Comment 147•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #144) > David, you said that the underlying issue might be in pytest. How safe is > this assumption? I mean in relation to a possible issue with mozprocess, > which both would be mutual exclusive. I know that pytest is not returning. I am 100% confident in that. Why it's not returning I have no idea and I know James hasnt been able to debug either. James' theory, which makes the most sense, is there is an I/O deadlock somewhere. If that is the case then there might be a bigger issue which is why I searched for mozprocess intermittents today which shocked me
Comment 148•6 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #143) > I asked if we should restrict this to a single config until this is fixed, > for example linux64/debug only. Who is working on the mozprocess issue? Is > there a bug we can track instead of this? No one as far as I know, I thought that I would look it up today to see if there is a trend and clearly there is. I will raise one and put all the mozprocess bugs there as deps that we can then clear up as we know that its not something more sinister underlying
Comment 149•6 years ago
|
||
I have put some time into this e.g. I set up the docker image from Taskcluster locally and tried to reproduce. But so far it seems to be easy to reproduce in CI and hard to debug there, and hard to reproduce outside CI where it would be eas[y|ier] to debug. AFAIK only whimboo has reported seeing the hang outside the CI environment, and it's not 100% clear if that's the same issue. I agree that the situation in CI is currently untenable. I don't really object to reducing the number of Wd jobs in the short term, but ultimately someone needs to dedicate some time to actually resolving the issue, and I haven't been able to do that recently.
Assignee | ||
Comment 150•6 years ago
|
||
David could you please give me some links in where I would have to add some debug print lines to actually check if I see the identical problem with pytest? I don't know where to start, and your input could shorten the time drastically.
Flags: needinfo?(dburns)
Assignee | ||
Comment 151•6 years ago
|
||
And one note... we actually use mozprocess to start the wpt tests via mozharness. So keep in mind that it has its own copy of mozprocess in place. To reproduce this output, you should use mozharness. Also this 1000s timeout means that no output has been given in that time, which underpins the comment from David about the deadlock above. Yes, I was able to see the deadlock one and off locally, and if I know where to get started with debugging, I can jump into that.
Comment 152•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #151) > And one note... we actually use mozprocess to start the wpt tests via > mozharness. So keep in mind that it has its own copy of mozprocess in place. > To reproduce this output, you should use mozharness. Also this 1000s timeout > means that no output has been given in that time, which underpins the > comment from David about the deadlock above. > > Yes, I was able to see the deadlock one and off locally, and if I know where > to get started with debugging, I can jump into that. If its pytest as the cause then https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/pytestrunner/runner.py#54-63 is what will hang.
Flags: needinfo?(dburns)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 156•6 years ago
|
||
Fun discovery from the weekend: add enough logging and this stops happening.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 162•6 years ago
|
||
Using a specific test in the summary is misleading as mentioned a couple of times earlier here. So we should really get rid of it. Instead lets prefix it with the test job symbol for better starring.
Summary: Intermittent /webdriver/tests/actions/key.py Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd failure: Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-',
Assignee | ||
Updated•6 years ago
|
Summary: Intermittent Wd failure: Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-',
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 167•6 years ago
|
||
There are 87 failures in the past 7 days. Platforms: Linux opt/ debug, Linux x64 opt/ debug/ pgo, linux32-nightly opt, linux32-devedition opt, linux64-ccov opt, linux64-jsdcov opt, linux64-nightly opt. Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=160524801 Relevant part of the log: [task 2018-02-06T01:40:01.065Z] 01:40:01 INFO - SUITE-START | Running 54 tests [task 2018-02-06T01:40:01.065Z] 01:40:01 INFO - Running wdspec tests [task 2018-02-06T01:40:01.065Z] 01:40:01 INFO - TEST-START | /webdriver/tests/actions/key.py [task 2018-02-06T01:40:01.065Z] 01:40:01 INFO - TEST-SKIP | /webdriver/tests/actions/key.py | took 0ms [task 2018-02-06T01:40:01.066Z] 01:40:01 INFO - TEST-START | /webdriver/tests/contexts/maximize_window.py [task 2018-02-06T01:40:01.066Z] 01:40:01 INFO - TEST-SKIP | /webdriver/tests/contexts/maximize_window.py | took 0ms [task 2018-02-06T01:40:01.066Z] 01:40:01 INFO - Starting http server on web-platform.test:8001 [task 2018-02-06T01:40:01.070Z] 01:40:01 INFO - Starting runner [task 2018-02-06T01:40:01.070Z] 01:40:01 INFO - Starting http server on web-platform.test:8443 [task 2018-02-06T01:40:01.097Z] 01:40:01 INFO - PID 840 | 1517881201093 geckodriver INFO geckodriver 0.19.1 ( 'e6138707451a' '2018-02-05 20:17 -0500') [task 2018-02-06T01:40:01.101Z] 01:40:01 INFO - PID 840 | 1517881201096 geckodriver INFO Listening on 127.0.0.1:4444 [task 2018-02-06T01:40:01.602Z] 01:40:01 INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/ [task 2018-02-06T01:40:01.602Z] 01:40:01 INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py [task 2018-02-06T01:56:41.662Z] 01:56:41 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/XXPgFAK3QL-xbZQq71hzZA/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil'] [task 2018-02-06T01:56:41.663Z] 01:56:41 ERROR - timed out after 1000 seconds of no output [task 2018-02-06T01:56:41.663Z] 01:56:41 ERROR - Return code: -15 [task 2018-02-06T01:56:41.664Z] 01:56:41 ERROR - No suite end message was emitted by this harness. [task 2018-02-06T01:56:41.664Z] 01:56:41 ERROR - # TBPL FAILURE # [task 2018-02-06T01:56:41.665Z] 01:56:41 WARNING - setting return code to 2 [task 2018-02-06T01:56:41.665Z] 01:56:41 INFO - Running post-action listener: _package_coverage_data [task 2018-02-06T01:56:41.665Z] 01:56:41 INFO - Running post-action listener: _resource_record_post_action [task 2018-02-06T01:56:41.665Z] 01:56:41 INFO - [mozharness: 2018-02-06 01:56:41.665687Z] Finished run-tests step (success) [task 2018-02-06T01:56:41.666Z] 01:56:41 INFO - Running post-run listener: _resource_record_post_run [task 2018-02-06T01:56:41.738Z] 01:56:41 INFO - Total resource usage - Wall time: 1021s; CPU: 1.0%; Read bytes: 442368; Write bytes: 444174336; Read time: 24; Write time: 24936 [task 2018-02-06T01:56:41.738Z] 01:56:41 INFO - TinderboxPrint: CPU usage<br/>0.6% [task 2018-02-06T01:56:41.738Z] 01:56:41 INFO - TinderboxPrint: I/O read bytes / time<br/>442,368 / 24 [task 2018-02-06T01:56:41.739Z] 01:56:41 INFO - TinderboxPrint: I/O write bytes / time<br/>444,174,336 / 24,936 [task 2018-02-06T01:56:41.739Z] 01:56:41 INFO - TinderboxPrint: CPU idle<br/>4,038.5 (99.4%) [task 2018-02-06T01:56:41.739Z] 01:56:41 INFO - TinderboxPrint: Swap in / out<br/>0 / 0 [task 2018-02-06T01:56:41.740Z] 01:56:41 INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0 [task 2018-02-06T01:56:41.741Z] 01:56:41 INFO - install - Wall time: 17s; CPU: 25.0%; Read bytes: 0; Write bytes: 9338880; Read time: 0; Write time: 484 [task 2018-02-06T01:56:41.751Z] 01:56:41 INFO - run-tests - Wall time: 1004s; CPU: 0.0%; Read bytes: 204800; Write bytes: 434835456; Read time: 16; Write time: 24452 [task 2018-02-06T01:56:42.030Z] 01:56:42 INFO - Running post-run listener: _upload_blobber_files [task 2018-02-06T01:56:42.030Z] 01:56:42 WARNING - Blob upload gear skipped. Missing cmdline options. [task 2018-02-06T01:56:42.030Z] 01:56:42 INFO - Running post-run listener: copy_logs_to_upload_dir [task 2018-02-06T01:56:42.031Z] 01:56:42 INFO - Copying logs to upload dir... [task 2018-02-06T01:56:42.031Z] 01:56:42 INFO - mkdir: /builds/worker/workspace/build/upload/logs [task 2018-02-06T01:56:42.032Z] 01:56:42 INFO - Copying logs to upload dir... [task 2018-02-06T01:56:42.034Z] 01:56:42 WARNING - returning nonzero exit status 2
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 173•6 years ago
|
||
This bug has failed 85 times in the last 7 days. It failed only on Linux affecting debug, pgo and opt build types. Failing tests: debug-web-platform-tests-wdspec-e10, opt-web-platform-tests-wdspec-e10, opt-web-platform-tests-wdspec-e10, opt-test-verify-wpt-e10. Link to a recent log:https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=162310358&lineNumber=1222 Part of that log: 01:30:07 INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py [task 2018-02-15T01:46:47.542Z] 01:46:47 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/YmxWjjgmTZy5PdEjMJy1FQ/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil'] [task 2018-02-15T01:46:47.559Z] 01:46:47 ERROR - timed out after 1000 seconds of no output [task 2018-02-15T01:46:47.560Z] 01:46:47 ERROR - Return code: -15 [task 2018-02-15T01:46:47.560Z] 01:46:47 ERROR - No suite end message was emitted by this harness. [task 2018-02-15T01:46:47.561Z] 01:46:47 ERROR - # TBPL FAILURE # [task 2018-02-15T01:46:47.561Z] 01:46:47 WARNING - setting return code to 2 [task 2018-02-15T01:46:47.562Z] 01:46:47 INFO - Running post-action listener: _package_coverage_data [task 2018-02-15T01:46:47.562Z] 01:46:47 INFO - Running post-action listener: _resource_record_post_action [task 2018-02-15T01:46:47.562Z] 01:46:47 INFO - [mozharness: 2018-02-15 01:46:47.561867Z] Finished run-tests step (success) [task 2018-02-15T01:46:47.562Z] 01:46:47 INFO - Running post-run listener: _resource_record_post_run [task 2018-02-15T01:46:47.646Z] 01:46:47 INFO - Total resource usage - Wall time: 1022s; CPU: 1.0%; Read bytes: 241664; Write bytes: 524836864; Read time: 4; Write time: 93468 The WPT key tests are disabled due to some problem in wptrunner. new file mode 100644 --- /dev/null +++ b/testing/web-platform/meta/webdriver/tests/actions/key.py.ini @@ -0,0 +1,4 @@ +[key.py] + type: wdspec + disabled: + if (os == "linux"): https://bugzilla.mozilla.org/show_bug.cgi?id=1407383 Bug 1421289 needs to be fixed in order to make progress on the actual bug.
Assignee | ||
Comment 174•6 years ago
|
||
I really propose that we re-enable actions/key.py given that the test module itself is not the problem here. We shouldn't see an increase of failures. The test was just inappropriately disabled a long time ago.
Comment 175•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #174) > I really propose that we re-enable actions/key.py given that the test module > itself is not the problem here. We shouldn't see an increase of failures. > The test was just inappropriately disabled a long time ago. Sounds reasonable to me.
Comment 176•6 years ago
|
||
will we re-enable that in this bug or somewhere else?
Assignee | ||
Comment 177•6 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #176) > will we re-enable that in this bug or somewhere else? You skipped the test without a patch attached to this bug, so I assume we can do the same to unskip?
Flags: needinfo?(jmaher)
Assignee | ||
Updated•6 years ago
|
status-firefox59:
--- → disabled
status-firefox60:
--- → disabled
Assignee | ||
Comment 179•6 years ago
|
||
We should do it here. See your comment 12.
Flags: needinfo?(jmaher)
Comment 180•6 years ago
|
||
for disabling tests we often just disable the test, not add a patch; if you want to revert that patch, go ahead! if there are other factors other than removing the .ini file, it probably is worth a review.
Flags: needinfo?(jmaher)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 183•6 years ago
|
||
Ok, we cannot re-enable key.py yet due to perma timeout failures. There are no restarts of Firefox in-between each sub-test but the very first start of Firefox seems to take extra-ordinary long: [task 2018-02-16T12:24:42.511Z] 12:24:42 INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key.py::test_lone_keyup_sends_no_events [task 2018-02-16T12:25:09.626Z] 12:25:09 INFO - PID 843 | GLib-GIO-Message: Using the 'memory' GSettings backend. Your settings will not be saved or shared with other applications. [task 2018-02-16T12:25:10.320Z] 12:25:10 INFO - PID 843 | 1518783910314 Marionette INFO Listening on port 2828 [task 2018-02-16T12:25:11.003Z] 12:25:11 INFO - STDOUT: PASSED Here exactly 28 seconds until Firefox was ready. It is actually a different problem.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 188•6 years ago
|
||
This bug - Depends on: 1421289, 1418227 - Blocks: 1429391, 1433861 I can see there is a lot work on this bug, as an update I want to add the following: There are 81 failures in the last 7 days and 242 failures in the last 21 days. - Linux x64: 50 - Linux: 23 - linux32-nightly: 2 - linux64-ccov: 3 - linux64-jsdcov: 1 - linux64-nightly: 1 - OS X 10.10: 1 - opt: 41 - debug: 27 - pgo: 13 Here is a recent log file and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164092702&lineNumber=2952 [task 2018-02-24T05:37:25.838Z] 05:37:25 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/Eq0ruhAHRN6GlG2-xuqcqQ/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil'] [task 2018-02-24T05:37:25.846Z] 05:37:25 ERROR - timed out after 1000 seconds of no output [task 2018-02-24T05:37:25.846Z] 05:37:25 ERROR - Return code: -15 [task 2018-02-24T05:37:25.846Z] 05:37:25 ERROR - No suite end message was emitted by this harness. [task 2018-02-24T05:37:25.846Z] 05:37:25 ERROR - # TBPL FAILURE # [task 2018-02-24T05:37:25.846Z] 05:37:25 WARNING - setting return code to 2
Flags: needinfo?(james)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 193•6 years ago
|
||
:jgraham, can we disable wdspec on linux?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 196•6 years ago
|
||
I was trying to reproduce this hang with a 1-click loaner and got it fairly quickly. Sadly I cannot use GDB to get stacktraces of each of the processes because ptrace is not allowed: > # sudo gdb --pid=3586 > GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1 > [..] > Attaching to process 3586 > Could not attach to process. If your uid matches the uid of the target > process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try > again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf > ptrace: Operation not permitted. I also cannot turn off the restriction: > # echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope > tee: /proc/sys/kernel/yama/ptrace_scope: Read-only file system > 0 At least by inspecting the log I noticed that the hang happens after `Executor setup done`, which is NOT printed to the public logs as uploaded to taskcluster: {"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520242899539, "action": "log", "message": "Executor setup done", "level": "DEBUG", "pid": 3586} ^C{"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520243299983, "action": "log", "message": "TestRunnerManager main loop terminating, starting cleanup", "level": "DEBUG", "pid": 3586}
Assignee | ||
Comment 197•6 years ago
|
||
Here the instructions in how to reproduce the hang on a one-click loaner: 1) Create a one-click loaner based on https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=e0a1ba39c48913e4e88efe4fa289c23454ff5181&selectedJob=165824714 2) Run the execution step and stop it when tests are started 3) Save this attachment as run.sh and make it executable 4) Run ./run.sh and wait until it hangs
Comment 198•6 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #193) > :jgraham, can we disable wdspec on linux? Redirecting this to maja_zf.
Flags: needinfo?(james) → needinfo?(mjzffr)
Assignee | ||
Comment 199•6 years ago
|
||
The hang is actually caused for `__import__(modname)` where `modname` is `tests.conftest`: https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/third_party/py/py/_path/local.py#668 https://searchfox.org/mozilla-central/source/testing/web-platform/tests/webdriver/tests/conftest.py I'm going to reduce the content of this file now to see what causes it. I suspect it is the import of pytest.
Rather than disabling the entire suite, I like the idea of setting it to Tier-2 for a couple of months until we get the most broad, infrastructure-related intermittents figured out. The individual tests, for the most part, are fine and provide useful information to developers working on the features being tested.
Flags: needinfo?(mjzffr)
Assignee | ||
Comment 201•6 years ago
|
||
Reducing conftest.py showed that the hang comes actually with `fixtures.py`: https://searchfox.org/mozilla-central/source/testing/web-platform/tests/webdriver/tests/support/fixtures.py In that file I can remove everything except the following line: > import webdriver So something in that package keeps us hanging. More updates follow...
Assignee | ||
Comment 202•6 years ago
|
||
I finally traced it down to StructuredLog and here specifically the `__init__` method: https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozlog/mozlog/structuredlog.py#149 The hang happens in `with self._lock`. James given that you know mozlog better than me, any advice in what would be good as next steps?
Flags: needinfo?(james)
Comment 203•6 years ago
|
||
So, I would really like to see what happens if you apply the following patch to log all activity on that lock: diff --git a/testing/mozbase/mozlog/mozlog/structuredlog.py b/testing/mozbase/mozlog/mozlog/structuredlog.py index 74a7ef55e3f4..712560a0eeda 100644 --- a/testing/mozbase/mozlog/mozlog/structuredlog.py +++ b/testing/mozbase/mozlog/mozlog/structuredlog.py @@ -7,6 +7,7 @@ from __future__ import absolute_import, print_function, unicode_literals from multiprocessing import current_process from threading import current_thread, Lock import json +import os import sys import time import traceback @@ -69,6 +70,27 @@ Subfields for all messages: _default_logger_name = None +class LoggedLock(object): + def __init__(self): + self._lock = Lock() + + def acquire(self): + os.write(2, "Lock requested %s %s\n" % (current_process().pid, current_thread().name)) + rv = self._lock.acquire() + os.write(2, "Lock acquired %s %s\n" % (current_process().pid, current_thread().name)) + return rv + + def release(self): + os.write(2, "Lock released by %s %s\n" % (current_process().pid, current_thread().name)) + return self._lock.release() + + def __enter__(self): + return self.acquire() + + def __exit__(self, *args, **kwargs): + return self.release() + + def get_default_logger(component=None): """Gets the default logger if available, optionally tagged with component name. Will return None if not yet set @@ -134,7 +156,7 @@ class ComponentState(object): class StructuredLogger(object): - _lock = Lock() + _lock = LoggedLock() _logger_states = {} """Create a structured logger with the given name diff --git a/testing/web-platform/tests/tools/webdriver/webdriver/client.py b/testing/web-platform/tests/tools/webdriver/webdriver/client.py index 622c5fce76ac..86209a366691 100644 --- a/testing/web-platform/tests/tools/webdriver/webdriver/client.py +++ b/testing/web-platform/tests/tools/webdriver/webdriver/client.py @@ -6,7 +6,10 @@ import transport from six import string_types +import os +os.write(2, "Before get_default_logger\n") from mozlog import get_default_logger +os.write(2, "After get_default_logger\n") logger = get_default_logger() That's against m-c so you may need to edit the paths to get it to apply. I suspect what's going on is that we are randomly inheriting a locked lock into the subprocess due to the Incredible Brokenness of Python Multithreading. But I'd like to confirm that's what's going on. Irrespective of that, we don't actually seem to use the logger anywhere? So we could delete that line and stop using mozlog in the conftest file and the problem will go away. Apologies for not spotting this earlier. In retrospect it seems like exactly the kind of thing that coculd cause exactly this kind of problem
Flags: needinfo?(james)
Comment 204•6 years ago
|
||
*Multiprocessing. Multithreading is less broken, just not fit for (many) purpose(s).
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: P5 → P1
Assignee | ||
Comment 205•6 years ago
|
||
Assignee | ||
Comment 206•6 years ago
|
||
Also I don't see why we are hanging here in trying to aquire a lock. As it looks like the same lock instance gets released right before: Lock requested 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0> Lock acquired 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0> {"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520281029594, "action": "log", "message": "Executor setup done ", "level": "DEBUG", "pid": 14850} Lock released by 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0> *** before self._lockLock requested 14890 Thread-2 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
Assignee | ||
Comment 207•6 years ago
|
||
(In reply to James Graham [:jgraham] from comment #203) > Irrespective of that, we don't actually seem to use the logger anywhere? So > we could delete that line and stop using mozlog in the conftest file and the > problem will go away. Apologies for not spotting this earlier. In retrospect This doesn't help. We still hang even with the lines removed in `fixture.py`.
Assignee | ||
Comment 208•6 years ago
|
||
The thread which actually hangs here is the one which gets created by `WdspecRun.run()`: https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/base.py#405
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 219•6 years ago
|
||
mozreview-review |
Comment on attachment 8956221 [details] Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock. https://reviewboard.mozilla.org/r/225132/#review231182 ::: testing/web-platform/tests/webdriver/tests/support/fixtures.py:20 (Diff revision 3) > def ignore_exceptions(f): > def inner(*args, **kwargs): > try: > return f(*args, **kwargs) > except webdriver.error.WebDriverException as e: > - logger.warning("Ignored exception %s" % e) > + print >> sys.stderr, "Ignored exception %s" % e Not sure what I should make out of this lint failure: https://treeherder.mozilla.org/#/jobs?repo=try&revision=14d058dbd80c9cb2b5645f55012c61e5b3f8f454&selectedJob=166153169 Does it mean I have to use a print function or not? The wording is kinda strange. Also i don't see this failure when running the linter locally with the exact same arguments.
Assignee | ||
Updated•6 years ago
|
Attachment #8956222 -
Flags: review?(james)
Attachment #8956221 -
Flags: review?(james)
Attachment #8951549 -
Flags: review?(james)
Comment 220•6 years ago
|
||
mozreview-review |
Comment on attachment 8956221 [details] Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock. https://reviewboard.mozilla.org/r/225132/#review231224
Attachment #8956221 -
Flags: review?(james) → review+
Comment 221•6 years ago
|
||
mozreview-review |
Comment on attachment 8956222 [details] Bug 1407383 - [wdspec] Remove unused imports from fixtures module. https://reviewboard.mozilla.org/r/225134/#review231228
Attachment #8956222 -
Flags: review?(james) → review+
Comment 222•6 years ago
|
||
mozreview-review |
Comment on attachment 8951549 [details] Bug 1407383 - [wdspec] Re-enable key.py on Linux64. https://reviewboard.mozilla.org/r/220862/#review231226
Attachment #8951549 -
Flags: review?(james) → review+
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 225•6 years ago
|
||
mozreview-review-reply |
Comment on attachment 8956221 [details] Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock. https://reviewboard.mozilla.org/r/225132/#review231182 > Not sure what I should make out of this lint failure: > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=14d058dbd80c9cb2b5645f55012c61e5b3f8f454&selectedJob=166153169 > > Does it mean I have to use a print function or not? The wording is kinda strange. Also i don't see this failure when running the linter locally with the exact same arguments. The problem was that the file needed to be added to the whitelist for the linter to allow the usage of `print` at all.
Comment 226•6 years ago
|
||
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/880cc555591a [wdspec] Remove unused imports from fixtures module. r=jgraham https://hg.mozilla.org/integration/autoland/rev/c26cb6d150d5 [wdspec] Remove usage of mozlog to prevent process lock. r=jgraham https://hg.mozilla.org/integration/autoland/rev/bf1d69a11d6f [wdspec] Re-enable key.py on Linux64. r=jgraham
Created web-platform-tests PR https://github.com/w3c/web-platform-tests/pull/9872 for changes under testing/web-platform/tests
Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][wptsync upstream]
Can't merge web-platform-tests PR due to failing upstream checks: Github PR https://github.com/w3c/web-platform-tests/pull/9872 * continuous-integration/travis-ci/pr (https://travis-ci.org/w3c/web-platform-tests/builds/349768613?utm_source=github_status&utm_medium=notification)
Whiteboard: [stockwell disable-recommended][wptsync upstream] → [stockwell disable-recommended][wptsync upstream error]
Comment 229•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/880cc555591a https://hg.mozilla.org/mozilla-central/rev/c26cb6d150d5 https://hg.mozilla.org/mozilla-central/rev/bf1d69a11d6f
Whiteboard: [stockwell disable-recommended][wptsync upstream error] → [stockwell disable-recommended][wptsync upstream]
Assignee | ||
Updated•6 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Keywords: leave-open,
test-disabled
Resolution: --- → FIXED
Whiteboard: [stockwell disable-recommended][wptsync upstream] → [stockwell fixed][wptsync upstream]
Target Milestone: --- → mozilla60
Comment 230•6 years ago
|
||
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=167034813&repo=autoland&lineNumber=377485
Flags: needinfo?(hskupin)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 232•6 years ago
|
||
(In reply to Andreea Pavel [:apavel] from comment #230) > Recent failure log: > https://treeherder.mozilla.org/logviewer. > html#?job_id=167034813&repo=autoland&lineNumber=377485 This has nothing to do with this bug, and is something completely different. I will file a new bug.
Flags: needinfo?(hskupin)
Summary: Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-'] due to hang in mozlog
Upstream PR merged
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•