Closed
Bug 1172343
Opened 9 years ago
Closed 9 years ago
Failures in various tests in self.marionette.start_session() : IOError: Connection to Marionette server is lost
Categories
(Firefox OS Graveyard :: Gaia::UI Tests, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: martijn.martijn, Unassigned)
References
Details
(Keywords: regression)
Attachments
(3 files)
I'm seeing a lot of failures in various tests:

http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2/277/HTML_Report/
http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/88/HTML_Report/
http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/282/HTML_Report/

This happened in the Mountain View lab as well as on Bitbar. It looks like when one test fails, all the following tests fail as well.

Example output of this failure:

Traceback (most recent call last):
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_client-0.13-py2.7.egg/marionette/marionette_test.py", line 277, in run
    self.setUp()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 862, in setUp
    self.device.start_b2g()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 667, in start_b2g
    self.marionette.start_session()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 951, in start_session
    self.session = self._send_message('newSession', 'value', capabilities=desired_capabilities, sessionId=session_id)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 687, in _send_message
    response = self.client.send(message)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 87, in send
    self.connect()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 75, in connect
    hello = self.receive()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 60, in receive
    raise IOError(self.connection_lost_msg)
IOError: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.

I guess this could perhaps be reproduced by running all these tests in one run yourself.
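As the traceback shows, the IOError surfaces in receive() during a reconnect attempt inside send(). For anyone poking at this locally, a small generic retry helper (purely illustrative; connect_with_retry and its signature are invented here, not part of marionette_driver) shows one way to tolerate a transiently unreachable Marionette socket:

```python
import time

def connect_with_retry(connect, attempts=3, delay=0.5):
    """Call a flaky zero-argument connect callable, retrying on IOError.

    `connect` stands in for something like MarionetteTransport.connect();
    the name and signature are illustrative only.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except IOError as exc:
            last_error = exc
            time.sleep(delay)
    # All attempts failed: surface the last error, mirroring the
    # "Connection to Marionette server is lost" IOError above.
    raise last_error
```

Note this only papers over transient socket drops; it would not help here, where the server side (b2g) has actually gone away.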
Reporter
Comment 1•9 years ago
I have no idea what's causing these connection issues with the device. I guess perhaps we could find when this started regressing. But anyway, it's pretty serious, because it causes a lot of failures in the test runs.
Reporter
Comment 2•9 years ago
I suspect bug 1172032 might actually be the same issue. That one is also a connection issue. In that case, this might be somehow an issue with b2g. I don't really know how to proceed here.
Depends on: 1172032
Flags: needinfo?(dave.hunt)
Reporter
Comment 3•9 years ago
Ok, just got this issue myself now, while trying to run test_gallery_handle_valid_image_file.py on device. When this happened, the screen turned all black, while the script started to reboot the phone. I've attached the logcat of the phone from when this issue occurred.
Reporter
Comment 4•9 years ago
Ok, that means this is perhaps not related to bug 1172032 after all. My guess is this could be related to bug 1172167 and bug 1172212 (OOM issues).
Reporter
Comment 6•9 years ago
Ok, I understand. But what can we do about this? This is badly interfering with Jenkins results, currently. I guess try to find a regression range?
Depends on: 1172167
Flags: needinfo?(jlorenzo)
Comment hidden (typo)
Comment 8•9 years ago
(In reply to Martijn Wargers [:mwargers] (QA) from comment #6)
> I guess try to find a regression range?

Yes. It appears to be a Gecko regression. You can quickly check this by putting a Gecko from a couple of days ago on the device and seeing if it reproduces when the latest Gaia is on it. Once you know, the fastest thing to do is to use git bisect run [1], like:

> git bisect start firstBadKnownCommit lastKnownGoodCommit
> git bisect run "build.sh && gaiatest yourUsualArguments"

git will do the bisection for you depending on the result of the gaiatest command.

[1] http://git-scm.com/docs/git-bisect#_bisect_run
Reporter
Comment 9•9 years ago
Peter, could someone from your team perhaps find out a regression range for this? (might be tricky)
Flags: needinfo?(pbylenga)
Keywords: regression, regressionwindow-wanted
Comment 10•9 years ago
Oliver was trying to work on this yesterday, NI on him to check in.
Flags: needinfo?(pbylenga) → needinfo?(onelson)
Comment 11•9 years ago
I was beginning this work yesterday, but near the end of the day I was disconnected from Jenkins and unable to reconnect. I was attempting to trace the history of the bug through the b2g-inbound run on Jenkins to narrow down a window, and its first appearance was here:

Loose window
-----------------
First broken:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.smoke.bitbar/1723/
BuildID: 20150603064742
Gaia: 38f2e6bfa819b1417c3cbd4299a8121f5fe4f2a0
Gecko: 04fd9b897dcd

Last working:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.smoke.bitbar/1720/
BuildID: 20150603051843
Gaia: 7eae1cf6c3966e970edba715dccd23f6f41008b2
Gecko: 1dc1a8a61d2d

** Note: there are two builds in between that both failed to run, returning the results.xml file, meaning the error could have begun slightly sooner than indicated.

This still needs considerable research I'm sure, but I won't be able to look into it anymore today. I'll try to pick it up tomorrow.
Flags: needinfo?(onelson)
Reporter
Comment 12•9 years ago
So something like this I guess? https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2015-06-03+16%3A10%3A00&enddate=2015-06-03+19%3A52%3A00 Makes me wonder if the first pull request of bug 1169327 is somehow the cause.
Reporter
Comment 13•9 years ago
Hrm, I get the impression that the 1st pull request of bug 1169327 might have caused this issue. I've locally backed it out and I'm currently further in the smoketest run without getting a black screen than I've got with that pull request in.
Blocks: 1169327
Reporter
Comment 14•9 years ago
Hmm, no, just with the last test of the smoketest run, test_play_3gp_video.py, I still got the black screen :(
No longer blocks: 1169327
Reporter
Comment 15•9 years ago
Ok, when I run the smoketests with 512MB memory on the Flame device, I don't see this issue. So I think this is a memory issue like the other memory issues we've seen recently.
Reporter
Comment 16•9 years ago
Actually, with 512MB memory on the Flame device, I also got it, but by then the tests were finished (not sure when it started happening). When I have the pull from bug 1169327 applied, I got stuck at the last test again, test_play_3gp_video.py, using 512MB RAM. So somehow the pull request from bug 1169327 seems to make things worse, but doesn't appear to be the cause of it.
Reporter
Comment 17•9 years ago
[Blocking Requested - why for this release]:

(In reply to Martijn Wargers [:mwargers] (QA) from comment #16)
> So somehow the pull request from bug 1169327 seems to make things worse, but
> doesn't appear to be the cause of it.

That might make finding a regression range even more tricky. Perhaps if we find a regression range for bug 1172167, it might also be the cause of this bug?
blocking-b2g: --- → 3.0?
Reporter
Comment 18•9 years ago
I tried something like this: https://bugzilla.mozilla.org/attachment.cgi?id=8621321

Not creating new sandboxes for every execute_*sync_script call. It seemed to work on the first smoketest run I did, but on the 2nd one, I got this issue again starting from test_settings_usb_storage.py.

Jonathan, do you know of ways to minimize memory usage here?
Flags: needinfo?(jgriffin)
Reporter
Comment 19•9 years ago
(In reply to Martijn Wargers [:mwargers] (QA) from comment #18)
> It seemed to work the first smoketest run, I did, but on the 2nd one, I got
> this issue again starting from test_settings_usb_storage.py.

Hrm, not sure. I might have had the patch backed out again at that point.
Reporter
Comment 20•9 years ago
Ok, I tested this locally and it completed the run just fine. I don't know if the 'new_sandbox=False' part is necessary, but I think the "sandbox='system'" part makes this bug disappear.
Attachment #8621605 - Flags: review?(jgriffin)
Attachment #8621605 - Flags: review?(dave.hunt)
Reporter
Comment 21•9 years ago
Ok, for some reason test_cost_control_data_alert_mobile.py is failing with the pull request. I can take a look at that later.
Comment 22•9 years ago
(In reply to Martijn Wargers [:mwargers] (QA) from comment #20)
> Created attachment 8621605 [details] [review]
> https://github.com/mozilla-b2g/gaia/pull/30563
>
> Ok, I tested this locally and it completed the run just fine.
> I don't know if the 'new_sandbox=False' part is necessary, but I think the
> "sandbox='system'" part makes this bug disappear.

It would be helpful to know whether it was sandbox='system' or new_sandbox=False. sandbox='system' makes the call happen in a privileged scope, so if some WebAPI has changed and no longer works in content, that might explain it. new_sandbox=False would imply this problem is being caused by a memory leak, either in Marionette or in some WebAPI/Gaia method it uses.
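To make the new_sandbox distinction concrete, here is a toy Python model (the class and its behavior are invented for intuition only; this is not Marionette's implementation) of why a fresh sandbox per execute_script call can accumulate memory on the device while a reused sandbox keeps it flat:

```python
class ToySandboxedRunner:
    """Invented stand-in for Marionette's script execution, for intuition only."""

    def __init__(self):
        self.live_sandboxes = []  # sandboxes the "device" keeps alive until GC
        self._shared = {}         # the single long-lived, reused sandbox

    def execute_script(self, script, new_sandbox=True):
        if new_sandbox:
            sandbox = {}                       # fresh global scope per call
            self.live_sandboxes.append(sandbox)
        else:
            sandbox = self._shared             # reuse one long-lived scope
        sandbox["last_script"] = script        # "run" the script
        return sandbox["last_script"]

runner = ToySandboxedRunner()
for _ in range(100):
    runner.execute_script("return 1;")         # default: a new sandbox each call
print(len(runner.live_sandboxes))              # 100 sandboxes held alive

runner2 = ToySandboxedRunner()
for _ in range(100):
    runner2.execute_script("return 1;", new_sandbox=False)
print(len(runner2.live_sandboxes))             # 0 extra sandboxes
```

In the real client, if the content process doesn't release those per-call sandboxes promptly, a long smoketest run ratchets memory upward until the b2g process is OOM-killed, which matches the pattern of later tests in a run being the ones to fail.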
Flags: needinfo?(jgriffin)
Reporter
Comment 23•9 years ago
Ok, with just sandbox='system', I got stuck at test_power_button_long_press.py (which is the 2nd-to-last test in that run). With just new_sandbox=False, I got stuck at test_play_3gp_video.py (which is the last test in that run). So both settings seem to help somewhat.

Jonathan, is there any harm in using those settings? If not, I'd rather get this in for now, because we currently have terribly unstable (almost unusable) Jenkins runs.
Flags: needinfo?(jgriffin)
Comment 24•9 years ago
No, there's no harm in using both; it just doesn't help us understand what's going on.
Flags: needinfo?(jgriffin)
Comment 25•9 years ago
Comment on attachment 8621605 [details] [review]
https://github.com/mozilla-b2g/gaia/pull/30563

This is a pretty blunt hammer, but if you need it I won't get in your way. I don't own these tests, so we should probably get Dave's opinion too.
Attachment #8621605 - Flags: review?(jgriffin) → review+
Reporter
Comment 26•9 years ago
Ok, I just redid the test run with the pull request applied, but this time this bug occurred already at test_add_photo_to_contact.py, which is 31 tests from the end. I did run something like 12 test_ftu_skip_tour.py tests before this test run (to test the pull request in bug 1134257). Let's see if I can reproduce by running test_ftu_skip_tour.py a whole lot more times.
Reporter
Comment 27•9 years ago
Ok, a test like http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/tests/functional/contacts/test_import_contacts_menu_no_network.py is giving this error after 11 repeats for me. So I guess that's where the memory leak is coming from.
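The repro above amounts to running one test in a loop until the connection drops. A shell sketch of such a loop (the gaiatest invocation and its flags are assumptions based on typical gaia-ui-tests usage, so the real command is left commented out; the loop scaffolding is the point):

```shell
# Run one gaia-ui-test repeatedly until Marionette loses the connection.
# Comment 27 saw the failure after 11 repeats, so 15 iterations gives headroom.
TEST=gaiatest/tests/functional/contacts/test_import_contacts_menu_no_network.py
for i in $(seq 1 15); do
  echo "--- run $i of $TEST ---"
  # Hypothetical invocation; uncomment and adjust for a real device:
  # gaiatest --address=localhost:2828 --testvars=testvars.json "$TEST" || break
done
```

Watching `adb shell b2g-ps` (or logcat for OOM kills) between iterations would show whether memory ratchets upward per run.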
Reporter
Comment 28•9 years ago
Ok, modifying test_import_contacts_menu_no_network.py to something like this makes the phone become unusable after just 1 run, so I guess the memory leak has something to do with the code being touched there.
Comment 29•9 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #25)
> Comment on attachment 8621605 [details] [review]
> https://github.com/mozilla-b2g/gaia/pull/30563
>
> This is a pretty blunt hammer, but if you need it I won't get in your way.
> I don't own these tests, so we should probably get Dave's opinion too.

If our tests are exposing a genuine bug (either in the product or in our automation tools) then I don't feel like we should pave over it. I think we should determine the cause of this, and once we understand it we can either come up with a suitable fix or use a workaround until the underlying issue is resolved. From your last two comments it seems that you may be closing in on something - does this help you to find a regression range?
Updated•9 years ago
Attachment #8621605 - Flags: review?(dave.hunt) → review-
Reporter
Comment 30•9 years ago
(In reply to Dave Hunt (:davehunt) from comment #29)
> From your last two comments it seems that you may be closing in on
> something - does this help you to find a regression range?

I haven't got time to find a better regression range. Peter, Oliver, the regression range in comment 11 might be incorrect. Could you perhaps find a regression range with the modified test I attached in comment 28? You might be getting problems with bug 1094759, btw.
Reporter
Comment 31•9 years ago
(In reply to Martijn Wargers [:mwargers] (QA) from comment #30)
> You might be getting problems with bug 1094759, btw.

I mean I was getting that when trying to run older builds to get a regression range.
Reporter
Comment 32•9 years ago
This is supposedly not an issue anymore on Bitbar, according to Oliver. I tried the test script again with a build from 20150616160206 and an updated Gaia tree, and indeed, after 5 repeats the issue hasn't occurred yet. It really makes me wonder what fixed it.
Comment 33•9 years ago
Confirming with Martijn in comment 32: reports have suddenly stopped showing signs of this error, which had demonstrated extremely consistent behavior before.

Here is the last observed repro of this issue:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2.bitbar/106/#showFailuresLink

And the first pass immediately after:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2.bitbar/107/

Keeping these links readily available in case we need/want to dive into what resolved this. In the meanwhile, resolving as worksforme.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(onelson)
Resolution: --- → WORKSFORME
Comment 34•9 years ago
Clearing tag due to bug being resolved.
Flags: needinfo?(ktucker)
Keywords: regressionwindow-wanted
Updated•9 years ago
blocking-b2g: 3.0? → ---
QA Whiteboard: [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Updated•9 years ago
Flags: needinfo?(pbylenga)
Reporter
Comment 35•9 years ago
It looks like this has started appearing again :( I've filed bug 1190791 for it now.
Reporter
Comment 36•9 years ago
Ok, it looks like this was caused by bug 1115480 and then fixed by bug 1171827. That seems to correspond with the regression range and the fix range.