Closed Bug 1172343 Opened 9 years ago Closed 9 years ago

Failures in various tests in self.marionette.start_session() : IOError: Connection to Marionette server is lost

Categories

(Firefox OS Graveyard :: Gaia::UI Tests, defect)

Hardware: ARM
OS: Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: martijn.martijn, Unassigned)

References

Details

(Keywords: regression)

Attachments

(3 files)

I'm seeing a lot of failures in various tests:
http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2/277/HTML_Report/
http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/88/HTML_Report/
http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/282/HTML_Report/

This happened in the Mountain View lab as well as at Bitbar.
It looks like once one test fails, all the following tests fail as well.

Example output of this failure:
Traceback (most recent call last):
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_client-0.13-py2.7.egg/marionette/marionette_test.py", line 277, in run
    self.setUp()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 862, in setUp
    self.device.start_b2g()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 667, in start_b2g
    self.marionette.start_session()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 951, in start_session
    self.session = self._send_message('newSession', 'value', capabilities=desired_capabilities, sessionId=session_id)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 687, in _send_message
    response = self.client.send(message)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 87, in send
    self.connect()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 75, in connect
    hello = self.receive()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 60, in receive
    raise IOError(self.connection_lost_msg)

IOError: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.

I guess this could perhaps be reproduced by running all of these tests in a single run yourself.
I have no idea what's causing these connection issues with the device.
Perhaps we could find out when this started regressing.
In any case, it's pretty serious, because it causes a lot of failures in the test runs.
I suspect bug 1172032 might actually be the same issue; that one is also a connection issue.

If that's the case, this might somehow be an issue with B2G itself. I don't really know how to proceed here.
Depends on: 1172032
Flags: needinfo?(dave.hunt)
Attached file console_output.txt
Ok, I just got this issue myself, while trying to run test_gallery_handle_valid_image_file.py on the device.
When this happened, the screen turned all black while the script started to reboot the phone.

I've attached the logcat from the phone from when this issue occurred.
Ok, that means this is perhaps not related to bug 1172032 after all.
My guess is that this is related to bug 1172167 and bug 1172212 (OOM issues).
This type of traceback usually suggests a crash.
Flags: needinfo?(dave.hunt)
Ok, I understand. But what can we do about this? It's badly interfering with the Jenkins results currently. I guess try to find a regression range?
Depends on: 1172167
Flags: needinfo?(jlorenzo)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #6)
> I guess try to find a regression range?

Yes. It appears to be a Gecko regression. You can quickly check this by flashing a Gecko build from a couple of days ago and seeing whether it still reproduces with the latest Gaia on it. Once you know, the fastest thing to do is to use git bisect run [1], like:
> git bisect start firstKnownBadCommit lastKnownGoodCommit
> git bisect run sh -c "build.sh && gaiatest yourUsualArguments"

Git will do the bisection for you based on the exit status of the gaiatest command.
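
For example, the run step can be wrapped in a small helper script (the build and test commands below are placeholders for whatever you normally run, not the exact ones from this bug). git bisect run treats exit code 0 as good, 125 as "skip this commit", and most other non-zero codes as bad:

#!/usr/bin/env python
# Hypothetical bisect step script; adjust the build and gaiatest
# invocations to your local setup.
import subprocess
import sys

# Rebuild the current checkout; if the build itself breaks, skip this commit.
if subprocess.call(["./build.sh"]) != 0:
    sys.exit(125)

# Run the usual gaiatest invocation; a non-zero result marks the commit bad.
status = subprocess.call(["gaiatest", "--address", "localhost:2828",
                          "--testvars", "testvars.json",
                          "gaiatest/tests/functional/manifest.ini"])
sys.exit(0 if status == 0 else 1)

Save it as, say, bisect_step.py, make it executable, and run "git bisect run ./bisect_step.py".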

[1] http://git-scm.com/docs/git-bisect#_bisect_run
Peter, could someone from your team perhaps find out a regression range for this? (might be tricky)
Flags: needinfo?(pbylenga)
Oliver was trying to work on this yesterday, NI on him to check in.
Flags: needinfo?(pbylenga) → needinfo?(onelson)
I was beginning this work yesterday, but near the end of the day I was disconnected from Jenkins and unable to reconnect. I was attempting to trace the history of the bug through the b2g-inbound runs on Jenkins to narrow down a window, and its first appearance was here:

Loose window
-----------------
First Broken:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.smoke.bitbar/1723/
BuildID: 20150603064742
Gaia: 38f2e6bfa819b1417c3cbd4299a8121f5fe4f2a0
Gecko: 04fd9b897dcd

Last Working:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.smoke.bitbar/1720/
BuildID: 20150603051843
Gaia: 7eae1cf6c3966e970edba715dccd23f6f41008b2
Gecko: 1dc1a8a61d2d

** Note: there are two builds in between that both failed to run, returning the results.xml file, meaning the error could have begun slightly sooner than indicated.

This still needs considerable research, I'm sure, but I won't be able to look into it any more today. I'll try to pick it up tomorrow.
Flags: needinfo?(onelson)
So something like this I guess? https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2015-06-03+16%3A10%3A00&enddate=2015-06-03+19%3A52%3A00
Makes me wonder if the first pull request of bug 1169327 is somehow the cause.
Hrm, I get the impression that the 1st pull request of bug 1169327 might have caused this issue. I've locally backed it out, and I'm currently further into the smoketest run without getting a black screen than I got with that pull request in.
Blocks: 1169327
Hmm, no: on the last test of the smoketest run, test_play_3gp_video.py, I still got the black screen :(
No longer blocks: 1169327
Ok, when I run the smoketests with 512 MB of memory on the Flame device, I don't see this issue, so I think this is a memory issue like the other memory issues we've seen recently.
Actually, with 512 MB of memory on the Flame device, I also got it, but by then the tests were finished (not sure when it started happening).
With the pull request from bug 1169327 applied, I got stuck on the last test again, test_play_3gp_video.py, using 512 MB of RAM.

So somehow the pull request from bug 1169327 seems to make things worse, but doesn't appear to be the cause of it.
[Blocking Requested - why for this release]:

(In reply to Martijn Wargers [:mwargers] (QA) from comment #16)
> So somehow the pull request from bug 1169327 seems to make things worse, but
> doesn't appear to be the cause of it.

That might make finding a regression range even more tricky.
Perhaps if we find a regression range for bug 1172167, that might also turn out to be the cause of this bug?
blocking-b2g: --- → 3.0?
I tried something like this: https://bugzilla.mozilla.org/attachment.cgi?id=8621321
Not creating new sandboxes for every execute_*sync_script call.
It seemed to work on the first smoketest run I did, but on the second one I got this issue again, starting from test_settings_usb_storage.py.

Jonathan, do you know of ways to minimize memory usage here?
Flags: needinfo?(jgriffin)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #18)
> It seemed to work on the first smoketest run I did, but on the second one I
> got this issue again, starting from test_settings_usb_storage.py.

Hrm, not sure. I might have had the patch backed out again at that point.
Ok, I tested this locally and it completed the run just fine.
I don't know if the 'new_sandbox=False' part is necessary, but I think the "sandbox='system'" part makes this bug disappear.
Attachment #8621605 - Flags: review?(jgriffin)
Attachment #8621605 - Flags: review?(dave.hunt)
Ok, for some reason test_cost_control_data_alert_mobile.py is failing with the pull request. I can take a look at that later.
(In reply to Martijn Wargers [:mwargers] (QA) from comment #20)
> Created attachment 8621605 [details] [review]
> https://github.com/mozilla-b2g/gaia/pull/30563
> 
> Ok, I tested this locally and it completed the run just fine.
> I don't know if the 'new_sandbox=False' part is necessary, but I think the
> "sandbox='system'" part makes this bug disappear.

It would be helpful to know whether it was sandbox='system' or new_sandbox=False. sandbox='system' makes the call happen in a privileged scope, so if some WebAPI has changed and no longer works in content, that might explain it. new_sandbox=False would imply this problem is being caused by a memory leak, either in Marionette or in some WebAPI/Gaia method it uses.
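
For reference, a minimal sketch of what the two options look like on the Marionette Python client (the scripts here are trivial placeholders; the real calls are the ones made from gaia_test.py):

from marionette_driver.marionette import Marionette

marionette = Marionette(host='localhost', port=2828)
marionette.start_session()

# sandbox='system': run the script in the privileged system sandbox
# instead of an ordinary content sandbox.
marionette.execute_script("return true;", sandbox='system')

# new_sandbox=False: reuse the existing sandbox instead of creating a
# fresh one for every execute_(async_)script call; if this alone avoids
# the failures, that points at a per-sandbox memory leak.
marionette.execute_script("return true;", new_sandbox=False)

marionette.delete_session()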
Flags: needinfo?(jgriffin)
Ok, with just sandbox='system', I got stuck at test_power_button_long_press.py (which is the second-to-last test in that run).
With just new_sandbox=False, I got stuck at test_play_3gp_video.py (which is the last test in that run).

So both settings seem to help somewhat.
Jonathan, is there any harm in using those settings? If not, I'd rather get this in for now, because we currently have terribly unstable (almost unusable) Jenkins runs.
Flags: needinfo?(jgriffin)
No, there's no harm in using both; it just doesn't help us understand what's going on.
Flags: needinfo?(jgriffin)
Comment on attachment 8621605 [details] [review]
https://github.com/mozilla-b2g/gaia/pull/30563

This is a pretty blunt hammer, but if you need it I won't get in your way.  I don't own these tests, so we should probably get Dave's opinion too.
Attachment #8621605 - Flags: review?(jgriffin) → review+
Ok, I just redid the test run with the pull request applied, but this time the bug already occurred at test_add_photo_to_contact.py, which is 31 tests before the end.
I did run something like 12 test_ftu_skip_tour.py tests before doing this test run (to test the pull request in bug 1134257).
Let's see if I can reproduce this by running test_ftu_skip_tour.py a whole lot more times.
Blocks: 1134257
Ok, a test like http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/tests/functional/contacts/test_import_contacts_menu_no_network.py is giving this error after 11 repeats for me.
So I guess that's where the memory leak is coming from.
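
For anyone wanting to reproduce this, a rough sketch of that kind of repeat loop (the gaiatest arguments are placeholders, not the exact invocation used here):

import subprocess
import sys

# Rerun one gaia-ui-test repeatedly to see after how many repeats
# Marionette loses its connection to the device.
TEST = ("tests/python/gaia-ui-tests/gaiatest/tests/functional/contacts/"
        "test_import_contacts_menu_no_network.py")

for attempt in range(1, 21):
    print("attempt %d" % attempt)
    status = subprocess.call(["gaiatest", "--address", "localhost:2828",
                              "--testvars", "testvars.json", TEST])
    if status != 0:
        sys.exit("failed on attempt %d" % attempt)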
Attached patch all.diff
Ok, modifying test_import_contacts_menu_no_network.py to something like this makes the phone become unusable after really just one run, so I guess the memory leak has something to do with the code being touched there.
(In reply to Jonathan Griffin (:jgriffin) from comment #25)
> Comment on attachment 8621605 [details] [review]
> https://github.com/mozilla-b2g/gaia/pull/30563
> 
> This is a pretty blunt hammer, but if you need it I won't get in your way. 
> I don't own these tests, so we should probably get Dave's opinion too.

If our tests are exposing a genuine bug (either in the product or in our automation tools) then I don't feel like we should pave over it. I think we should determine the cause of this, and once we understand it we can either come up with a suitable fix or use a workaround until the underlying issue is resolved. From your last two comments it seems that you may be closing in on something - does this help you to find a regression range?
Attachment #8621605 - Flags: review?(dave.hunt) → review-
(In reply to Dave Hunt (:davehunt) from comment #29)
>  From your last two comments it seems that you may be
> closing in on something - does this help you to find a regression range?

I haven't got time to find a better regression range.
Peter, Oliver, the regression range in comment 11 might be incorrect.
Could you perhaps find a regression range with the modified test I attached in comment 28?
You might be getting problems with bug 1094759, btw.
No longer blocks: 1134257
Flags: needinfo?(pbylenga)
Flags: needinfo?(onelson)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #30)
> You might be getting problems with bug 1094759, btw.

I mean, I was getting that when trying to run older builds to get a regression range.
This is supposedly not an issue anymore on Bitbar, according to Oliver.
I tried out the test script again with a build from 20150616160206 and an updated Gaia tree, and indeed, after 5 repeats the issue hasn't occurred yet. It really makes me wonder what fixed it.
No longer depends on: 1172032
Confirming what Martijn said in comment 32: reports have suddenly stopped showing signs of this error, which had demonstrated extremely consistent behavior before.


Here is the last observed REPRO of this issue:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2.bitbar/106/#showFailuresLink

And the first PASS immediately after:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2.bitbar/107/


Keeping these links readily available in case we need or want to dive into what resolved this.
In the meantime, resolving as WORKSFORME.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(onelson)
Resolution: --- → WORKSFORME
Clearing tag due to bug being resolved.
Flags: needinfo?(ktucker)
blocking-b2g: 3.0? → ---
QA Whiteboard: [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: needinfo?(pbylenga)
It looks like this has started appearing again :( I've filed bug 1190791 for it now.
Ok, it looks like this was caused by bug 1115480 and then fixed by bug 1171827.
That seems to correspond with the regression range and the fix range.
Blocks: 1115480
Depends on: 1171827