Closed Bug 1172343 Opened 9 years ago Closed 9 years ago

Failures in various tests in self.marionette.start_session() : IOError: Connection to Marionette server is lost

Categories

(Firefox OS Graveyard :: Gaia::UI Tests, defect)

Hardware: ARM
OS: Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: martijn.martijn, Unassigned)

References

Details

(Keywords: regression)

Attachments

(3 files)

I'm seeing a lot of failures in various tests:
http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2/277/HTML_Report/
http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/88/HTML_Report/
http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/282/HTML_Report/

This happened in the Mountain View lab as well as at Bitbar.
It looks like once one test fails, all the following tests fail as well.

Example output of this failure:
Traceback (most recent call last):
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_client-0.13-py2.7.egg/marionette/marionette_test.py", line 277, in run
    self.setUp()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 862, in setUp
    self.device.start_b2g()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 667, in start_b2g
    self.marionette.start_session()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 951, in start_session
    self.session = self._send_message('newSession', 'value', capabilities=desired_capabilities, sessionId=session_id)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 687, in _send_message
    response = self.client.send(message)
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 87, in send
    self.connect()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 75, in connect
    hello = self.receive()
  File "/var/jenkins/2/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_transport-0.4-py2.7.egg/marionette_transport/transport.py", line 60, in receive
    raise IOError(self.connection_lost_msg)

IOError: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.

I guess this could perhaps be reproduced by running all of these tests in a single run yourself.
I have no idea what's causing these connection issues with the device.
Perhaps we could find out when this started regressing.
In any case, it's pretty serious, because it causes a lot of failures in the test runs.
I suspect bug 1172032 might actually be the same issue; that one is also a connection issue.

If that's the case, this might somehow be an issue with B2G itself. I don't really know how to proceed here.
Depends on: 1172032
Flags: needinfo?(dave.hunt)
Attached file console_output.txt
Ok, I just got this issue myself, while trying to run test_gallery_handle_valid_image_file.py on the device.
When this happened, the screen turned all black while the script started to reboot the phone.

I've attached the logcat from the phone from when this issue occurred.
Ok, that means this is perhaps not related to bug 1172032 after all.
My guess is that this is related to bug 1172167 and bug 1172212 (OOM issues).
This type of traceback usually suggests a crash.
Flags: needinfo?(dave.hunt)
Ok, I understand. But what can we do about this? It's badly interfering with the Jenkins results currently. I guess try to find a regression range?
Depends on: 1172167
Flags: needinfo?(jlorenzo)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #6)
> I guess try to find a regression range?

Yes. It appears to be a Gecko regression. You can quickly check this by flashing a Gecko build from a couple of days ago and seeing whether it still reproduces with the latest Gaia on it. Once you know, the fastest thing to do is to use git bisect run [1], like:
> git bisect start firstKnownBadCommit lastKnownGoodCommit
> git bisect run sh -c "build.sh && gaiatest yourUsualArguments"

Git will do the bisection for you based on the exit status of the gaiatest command.
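
For example, the run step can be wrapped in a small helper script (the build and test commands below are placeholders for whatever you normally run, not the exact ones from this bug). git bisect run treats exit code 0 as good, 125 as "skip this commit", and most other non-zero codes as bad:

#!/usr/bin/env python
# Hypothetical bisect step script; adjust the build and gaiatest
# invocations to your local setup.
import subprocess
import sys

# Rebuild the current checkout; if the build itself breaks, skip this commit.
if subprocess.call(["./build.sh"]) != 0:
    sys.exit(125)

# Run the usual gaiatest invocation; a non-zero result marks the commit bad.
status = subprocess.call(["gaiatest", "--address", "localhost:2828",
                          "--testvars", "testvars.json",
                          "gaiatest/tests/functional/manifest.ini"])
sys.exit(0 if status == 0 else 1)

Save it as, say, bisect_step.py, make it executable, and run "git bisect run ./bisect_step.py".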

[1] http://git-scm.com/docs/git-bisect#_bisect_run
Peter, could someone from your team perhaps find out a regression range for this? (might be tricky)
Flags: needinfo?(pbylenga)
Oliver was trying to work on this yesterday, NI on him to check in.
Flags: needinfo?(pbylenga) → needinfo?(onelson)
I was beginning this work yesterday, but near the end of the day I was disconnected from Jenkins and unable to reconnect. I was attempting to trace the history of the bug through the b2g-inbound runs on Jenkins to narrow down a window, and its first appearance was here:

Loose window
-----------------
First Broken:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.smoke.bitbar/1723/
BuildID: 20150603064742
Gaia: 38f2e6bfa819b1417c3cbd4299a8121f5fe4f2a0
Gecko: 04fd9b897dcd

Last Working:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.b2g-inbound.tinderbox.ui.functional.smoke.bitbar/1720/
BuildID: 20150603051843
Gaia: 7eae1cf6c3966e970edba715dccd23f6f41008b2
Gecko: 1dc1a8a61d2d

** Note: there are two builds in between that both failed to run, returning the results.xml file, meaning the error could have begun slightly sooner than indicated.

This still needs considerable research, I'm sure, but I won't be able to look into it any more today. I'll try to pick it up tomorrow.
Flags: needinfo?(onelson)
So something like this I guess? https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2015-06-03+16%3A10%3A00&enddate=2015-06-03+19%3A52%3A00
Makes me wonder if the first pull request of bug 1169327 is somehow the cause.
Hrm, I get the impression that the 1st pull request of bug 1169327 might have caused this issue. I've locally backed it out, and I'm currently further into the smoketest run without getting a black screen than I got with that pull request in.
Blocks: 1169327
Hmm, no: on the last test of the smoketest run, test_play_3gp_video.py, I still got the black screen :(
No longer blocks: 1169327
Ok, when I run the smoketests with 512 MB of memory on the Flame device, I don't see this issue, so I think this is a memory issue like the other memory issues we've seen recently.
Actually, with 512 MB of memory on the Flame device, I also got it, but by then the tests were finished (not sure when it started happening).
With the pull request from bug 1169327 applied, I got stuck on the last test again, test_play_3gp_video.py, using 512 MB of RAM.

So somehow the pull request from bug 1169327 seems to make things worse, but doesn't appear to be the cause of it.
[Blocking Requested - why for this release]:

(In reply to Martijn Wargers [:mwargers] (QA) from comment #16)
> So somehow the pull request from bug 1169327 seems to make things worse, but
> doesn't appear to be the cause of it.

That might make finding a regression range even more tricky.
Perhaps if we find a regression range for bug 1172167, that might also turn out to be the cause of this bug?
blocking-b2g: --- → 3.0?
I tried something like this: https://bugzilla.mozilla.org/attachment.cgi?id=8621321
Not creating new sandboxes for every execute_*sync_script call.
It seemed to work on the first smoketest run I did, but on the second one I got this issue again, starting from test_settings_usb_storage.py.

Jonathan, do you know of ways to minimize memory usage here?
Flags: needinfo?(jgriffin)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #18)
> It seemed to work on the first smoketest run I did, but on the second one I
> got this issue again, starting from test_settings_usb_storage.py.

Hrm, not sure. I might have had the patch backed out again at that point.
Ok, I tested this locally and it completed the run just fine.
I don't know if the 'new_sandbox=False' part is necessary, but I think the "sandbox='system'" part makes this bug disappear.
Attachment #8621605 - Flags: review?(jgriffin)
Attachment #8621605 - Flags: review?(dave.hunt)
Ok, for some reason test_cost_control_data_alert_mobile.py is failing with the pull request. I can take a look at that later.
(In reply to Martijn Wargers [:mwargers] (QA) from comment #20)
> Created attachment 8621605 [details] [review]
> https://github.com/mozilla-b2g/gaia/pull/30563
> 
> Ok, I tested this locally and it completed the run just fine.
> I don't know if the 'new_sandbox=False' part is necessary, but I think the
> "sandbox='system'" part makes this bug disappear.

It would be helpful to know whether it was sandbox='system' or new_sandbox=False. sandbox='system' makes the call happen in a privileged scope, so if some WebAPI has changed and no longer works in content, that might explain it. new_sandbox=False would imply this problem is being caused by a memory leak, either in Marionette or in some WebAPI/Gaia method it uses.
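
For reference, a minimal sketch of what the two options look like on the Marionette Python client (the scripts here are trivial placeholders; the real calls are the ones made from gaia_test.py):

from marionette_driver.marionette import Marionette

marionette = Marionette(host='localhost', port=2828)
marionette.start_session()

# sandbox='system': run the script in the privileged system sandbox
# instead of an ordinary content sandbox.
marionette.execute_script("return true;", sandbox='system')

# new_sandbox=False: reuse the existing sandbox instead of creating a
# fresh one for every execute_(async_)script call; if this alone avoids
# the failures, that points at a per-sandbox memory leak.
marionette.execute_script("return true;", new_sandbox=False)

marionette.delete_session()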
Flags: needinfo?(jgriffin)
Ok, with just sandbox='system', I got stuck at test_power_button_long_press.py (which is the second-to-last test in that run).
With just new_sandbox=False, I got stuck at test_play_3gp_video.py (which is the last test in that run).

So both settings seem to help somewhat.
Jonathan, is there any harm in using those settings? If not, I'd rather get this in for now, because we currently have terribly unstable (almost unusable) Jenkins runs.
Flags: needinfo?(jgriffin)
No, there's no harm in using both; it just doesn't help us understand what's going on.
Flags: needinfo?(jgriffin)
Comment on attachment 8621605 [details] [review]
https://github.com/mozilla-b2g/gaia/pull/30563

This is a pretty blunt hammer, but if you need it I won't get in your way.  I don't own these tests, so we should probably get Dave's opinion too.
Attachment #8621605 - Flags: review?(jgriffin) → review+
Ok, I just redid the test run with the pull request applied, but this time the bug already occurred at test_add_photo_to_contact.py, which is 31 tests before the end.
I did run something like 12 test_ftu_skip_tour.py tests before doing this test run (to test the pull request in bug 1134257).
Let's see if I can reproduce this by running test_ftu_skip_tour.py a whole lot more times.
Blocks: 1134257
Ok, a test like http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/tests/functional/contacts/test_import_contacts_menu_no_network.py is giving this error after 11 repeats for me.
So I guess that's where the memory leak is coming from.
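
For anyone wanting to reproduce this, a rough sketch of that kind of repeat loop (the gaiatest arguments are placeholders, not the exact invocation used here):

import subprocess
import sys

# Rerun one gaia-ui-test repeatedly to see after how many repeats
# Marionette loses its connection to the device.
TEST = ("tests/python/gaia-ui-tests/gaiatest/tests/functional/contacts/"
        "test_import_contacts_menu_no_network.py")

for attempt in range(1, 21):
    print("attempt %d" % attempt)
    status = subprocess.call(["gaiatest", "--address", "localhost:2828",
                              "--testvars", "testvars.json", TEST])
    if status != 0:
        sys.exit("failed on attempt %d" % attempt)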
Attached patch all.diff
Ok, modifying test_import_contacts_menu_no_network.py to something like this makes the phone become unusable after really just one run, so I guess the memory leak has something to do with the code being touched there.
(In reply to Jonathan Griffin (:jgriffin) from comment #25)
> Comment on attachment 8621605 [details] [review]
> https://github.com/mozilla-b2g/gaia/pull/30563
> 
> This is a pretty blunt hammer, but if you need it I won't get in your way. 
> I don't own these tests, so we should probably get Dave's opinion too.

If our tests are exposing a genuine bug (either in the product or in our automation tools) then I don't feel like we should pave over it. I think we should determine the cause of this, and once we understand it we can either come up with a suitable fix or use a workaround until the underlying issue is resolved. From your last two comments it seems that you may be closing in on something - does this help you to find a regression range?
Attachment #8621605 - Flags: review?(dave.hunt) → review-
(In reply to Dave Hunt (:davehunt) from comment #29)
>  From your last two comments it seems that you may be
> closing in on something - does this help you to find a regression range?

I haven't got time to find a better regression range.
Peter, Oliver, the regression range in comment 11 might be incorrect.
Could you perhaps find a regression range with the modified test I attached in comment 28?
You might be getting problems with bug 1094759, btw.
No longer blocks: 1134257
Flags: needinfo?(pbylenga)
Flags: needinfo?(onelson)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #30)
> You might be getting problems with bug 1094759, btw.

I mean, I was getting that when trying to run older builds to get a regression range.
This is supposedly not an issue anymore on Bitbar, according to Oliver.
I tried out the test script again with a build from 20150616160206 and an updated Gaia tree, and indeed, after 5 repeats the issue hasn't occurred yet. It really makes me wonder what fixed it.
No longer depends on: 1172032
Confirming what Martijn said in comment 32: reports have suddenly stopped showing signs of this error, which had demonstrated extremely consistent behavior before.


Here is the last observed REPRO of this issue:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2.bitbar/106/#showFailuresLink

And the first PASS immediately after:
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.2.bitbar/107/


Keeping these links readily available in case we need or want to dive into what resolved this.
In the meantime, resolving as WORKSFORME.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(onelson)
Resolution: --- → WORKSFORME
Clearing tag due to bug being resolved.
Flags: needinfo?(ktucker)
blocking-b2g: 3.0? → ---
QA Whiteboard: [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: needinfo?(pbylenga)
It looks like this has started appearing again :( I've filed bug 1190791 for it now.
Ok, it looks like this was caused by bug 1115480 and then fixed by bug 1171827.
That seems to correspond with the regression range and the fix range.
Blocks: 1115480
Depends on: 1171827