Closed Bug 1126260 Opened 9 years ago Closed 9 years ago

Outgoing/Incoming call fails intermittently

Categories

(Firefox OS Graveyard :: Gaia::System::Window Mgmt, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.2+, b2g-v2.2 fixed, b2g-master verified)

VERIFIED FIXED
2.2 S13 (29may)
blocking-b2g 2.2+
Tracking Status
b2g-v2.2 --- fixed
b2g-master --- verified

People

(Reporter: RobertC, Assigned: gsvelto)

References

Details

(Keywords: regression, Whiteboard: [fromAutomation])

Attachments

(4 files, 4 obsolete files)

Attached image callscreen.png
We are seeing an intermittent issue in the python automation tests where dialer fails to make an outgoing call.
After typing the phone number and tapping call a different callscreen is displayed (screenshot attached).

Issue reproduced locally with automation: 4 out of 10.
I was not able to reproduce the issue manually after 20 attempts.

Build info m-c v188-1 319MB Flame KK:
Gaia-Rev        b02ec9713e6de8d96c6954d2c0dfd0442b0656ac
Gecko-Rev       https://hg.mozilla.org/mozilla-central/rev/38e4719e71af
Build-ID        20150126160233
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150126.192457
FW-Date         Mon Jan 26 19:25:08 EST 2015
Bootloader      L1TC10011880

STR:
1. open dialer app
2. type phone number
3. tap call button

Expected result:
Callscreen is displayed and outgoing call works as expected

Actual result:
The callscreen has a different layout (phone number missing as well as call type) and the call fails, the displayed app after this is the dialer.

The issue is not reproduced on base build v18D
Attached file logcat.txt (obsolete) —
Logcat of the automated test failing locally.
Would you mind providing a job where we see the call screen failing?
Flags: needinfo?(robert.chira)
http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-inbound.ui.functional.sanity/3274/HTML_Report/

The test fails at a later step, after the call screen closes so the image is different.
The screenshot that is attached was taken manually on my device while the test was running.
Flags: needinfo?(robert.chira)
Summary: Outgoing call fails intermittently → Outgoing/Incoming call fails intermittently
I reproduce this issue manually by chance today on master.

[Blocking Requested - why for this release]: Regression that an end user would hit.
blocking-b2g: --- → 3.0?
QA wanted for a branch check.
QA Contact: ktucker
I was unable to reproduce this issue on Flame 2.2 after 50 plus attempts

The correct call screen was shown and they call went through as expected. 

Environmental Variables:
Device: Flame 2.2 (Full Flash)(KK)(319mb)
BuildID: 20150128002506
Gaia: cd42b034fd2825c3675ace3a67f5775eb61c2d60
Gecko: d824c65a6a2b
Gonk: e7c90613521145db090dd24147afd5ceb5703190
Version: 37.0a2 (2.2) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

Going to get a regression window on this.
triage: blocker
blocking-b2g: 3.0? → 3.0+
I could not get this issue to reproduce on the latest build as it seems to be fixed.  Using the automation tests I was able to reproduce this bug on builds from the 26th easily, but on the latest working central build I repeated the script 40 times with 0 reproductions.  Removing the regression window tag for now as this issue seems to be fixed.  

Robert Chira if you see this issue again please add the tag again and I will find the regression window.

Issue does NOT occur:

Environmental Variables:
Device: Flame 3.0
BuildID: 20150128181137
Gaia: 9d2378a9ef092ab1fc15c3a9f7fc4171aab59d57
Gecko: ea54e6623f59
Version: 38.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0

Issue DOES occur

Environmental Variables:
Device: Flame 3.0
BuildID: 20150126134900
Gaia: b02ec9713e6de8d96c6954d2c0dfd0442b0656ac
Gecko: 38e4719e71af
Version: 38.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(robert.chira)
Flags: needinfo?(ktucker)
QA Contact: ktucker → jmercado
Closing this as worksforme. Please reopen if the issue is encountered again.
Status: NEW → RESOLVED
Closed: 9 years ago
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Resolution: --- → WORKSFORME
Encountered this issue again on today's master, with automation:
Gaia-Rev        8238eeacc7030b2cdbf7ab4eba2f36779b702599
Gecko-Rev       https://hg.mozilla.org/mozilla-central/rev/29b05d283b00
Build-ID        20150130010210
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150130.044333
FW-Date         Fri Jan 30 04:43:44 EST 2015
Bootloader      L1TC000118D0
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Adding qawanted so we can get better logs.
QA Whiteboard: [QAnalyst-Triage+]
Keywords: qawanted
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Keywords: qawanted
Let's try to get a regression window again.
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
My experience with this bug is mainly in v2.2 and with incoming calls. The device is ZTE OPEN. It is 8/10 times reproducible (if not more) when given contact calls me (by chance this is my wife). The contact has image. The image is taken with the device camera and is not resized. My feeling is that it is performance issue.
Flags: needinfo?(robert.chira)
Bug 1116588 seems to be the cause for this issue.

Mozilla-inbound Regression Window

Last Working 
Environmental Variables:
Device: Flame 3.0
BuildID: 20150126062631
Gaia: 793773bb2944b42a85dd160049e605cbd880c4da
Gecko: 0bec74187553
Version: 38.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0

First Broken 
Environmental Variables:
Device: Flame 3.0
BuildID: 20150126065334
Gaia: 793773bb2944b42a85dd160049e605cbd880c4da
Gecko: babd56077826
Version: 38.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0

Last Working gaia / First Broken gecko - Issue DOES occur
Gaia: 793773bb2944b42a85dd160049e605cbd880c4da
Gecko: babd56077826

First Broken gaia / Last Working gecko - Issue does NOT occur
Gaia: 793773bb2944b42a85dd160049e605cbd880c4da
Gecko: 0bec74187553

Gecko Pushlog: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=0bec74187553&tochange=babd56077826
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Kartikaya, can you take a look at this please? Looks like the work done on bug 1116588 might have caused this issue to occur.
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker) → needinfo?(bugmail.mozilla)
1) What STR were used to find the regression window in comment 16?
2) The cset 7c4059ebfcf7 identified in comment 16 was backed out two days later (see bug 1116588 comment 29), so please re-check if this is still happening on master. If it is, it may have come back with the second patch on bug 1116588 which was a modified version of the original patch.
Flags: needinfo?(bugmail.mozilla) → needinfo?(jmercado)
The test_dialer python automation script was used to reproduce this bug for the regression window.  If the issue did not reproduce within 40 retries, the build was considered working for the regression window.

In comment 9 this issue had been fixed on master on the 28th of January, but the issue began occurring again as of comment 11 on the 30th of January.
Flags: needinfo?(jmercado) → needinfo?(bugmail.mozilla)
How can I run the test_dialer python automation script? Are there instructions somewhere?
Flags: needinfo?(bugmail.mozilla) → needinfo?(jmercado)
The steps we use to do this were modified from http://blargon7.com/2013/07/running-firefox-os-ui-tests-without-a-device-revised/ but to use with a device.  Maybe Naoki can give you more correct guidance.
Flags: needinfo?(jmercado) → needinfo?(nhirata.bugzilla)
Flags: needinfo?(bugmail.mozilla)
Comment on attachment 8560438 [details] [review]
[PullReq] bebef1987:dial1 to mozilla-b2g:master

I updated some waits in our pull

We wait for the main window and the call screen to be displayed and then try to interact with it.

I think this will fix our test fails.
Attachment #8560438 - Flags: review?(jlorenzo)
Attachment #8560438 - Flags: review?(gmealer)
Comment on attachment 8560438 [details] [review]
[PullReq] bebef1987:dial1 to mozilla-b2g:master

This still looks like an actual bug in the production code, to me. I ran another adhoc job and test_dialer_clear_miss_call_notification.TestDialerClearMissCallNotification failed 4 out of 11 times with the same screenshot: http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk.ui.adhoc/681/HTML_Report/

By the way, most of the test failed there probably due to connectivity issues. Could it be a race condition depending on the network coverage?
Attachment #8560438 - Flags: review?(jlorenzo) → review-
Comment on attachment 8560438 [details] [review]
[PullReq] bebef1987:dial1 to mozilla-b2g:master

Going to r- based on Johan's estimation that we might be covering up a real product bug.
Attachment #8560438 - Flags: review?(gmealer) → review-
Sorry for the delay. I finally got some time to go through these instructions but for some reason when I run the gaiatest app it can't connect to marionette on the device. The connection to port 2828 fails. Do I need to do something in particular to make something on the device listen to that port? I built gecko with ENABLE_MARIONETTE=1 and as far as I can tell that's all I should need.
Flags: needinfo?(bugmail.mozilla) → needinfo?(nhirata.bugzilla)
Oh, ignore comment 27. I was port-forwarding through SSH and I screwed up the tunneling. It seems to be running now.
Flags: needinfo?(nhirata.bugzilla)
So I ran all of the dialer tests and got 19 failures and 4 passes. Pretty sure that's not normal, so I don't know if my results are representative of what's happening in automation. However test_dialer.py was one of the failures and I can run it in isolation with a 100% failure rate (3/3 so far). It fails with the attached output. Looking up in this bug I don't see any information about how this test is failing in automation so I can't say if debugging the failure I'm seeing locally will be of any use. If anybody has more info please let me know.
A little more investigation indicates that more likely my testvars file doesn't match what is used in automation, and also I don't have a SIM for the phone so I don't know if I can even reproduce the right failure locally.
Not related to the call log db refactor.
No longer blocks: 1112577
Florin or RobertC, can you send kats your testvars file ( you can remove anything that doesn't relate to this particular test )?
Flags: needinfo?(robert.chira)
Flags: needinfo?(bebe)
I just sent a private email to Kats with the information contained in the testvars.json needed here.
Flags: needinfo?(robert.chira)
Flags: needinfo?(bebe)
Attached file Local run output (obsolete) —
I used a Nexus 4 device with a SIM card and the testvars.json that Johan sent me to run through the dialer tests, and stuff was passing. At least, I didn't run into any of the errors in the report that Johan sent me. That report indicated an "Expected Failure" in test_call_log_groups, which I saw. It also indicated "Error" results in test_dialer_find_contact, test_dialer_clear_miss_call_notification, and test_redial_last_number. I didn't see any of these. I did see a failure in test_dialer_clear_miss_call_notification because the python code couldn't find the "plivo" module but that seems like a setup problem. I'm attaching the output I got from running these three tests locally. I also ran test_dialer.py locally a few times and that passed fine.

Unless somebody can point out if I'm doing something wrong, I think the only way forward here is for somebody to debug the issue on a machine that can reproduce it.
Whiteboard: [fromAutomation] → [fromAutomation][planned-sprint c=?]
Target Milestone: --- → 2.2 S10 (17apr)
Johan, could you drive checking if this is still an issue, before we commit to investigating and fixing it?
Flags: needinfo?(jlorenzo)
Whiteboard: [fromAutomation][planned-sprint c=?] → [fromAutomation]
Target Milestone: 2.2 S10 (17apr) → ---
I ran the tests again, and the issue can't be seen anymore[1]. There are some failures, but none of them shows the "blank" call screen. I also don't remember seeing this issue recently in the reports. Let's close this bug and reopen it if it comes back.

[1] http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk.ui.adhoc/797/HTML_Report/
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Flags: needinfo?(jlorenzo)
Resolution: --- → WORKSFORME
I was able to reproduce this bug today on Flame 3.0. I have been seeing this bug about twice a week for the last couple weeks, but about a month ago I was seeing it almost every day. It seems as though the repro rate has dropped, but I still dont know any manual STR except that it seems to only happen one time during the first couple calls of a fresh flash.

When making an outgoing call the callscreen appeared blank, with broken UI elements

Environmental Variables:
Device: Flame 3.0 (319mb)(Kitkat)(Full Flash)
Build ID: 20150421010201
Gaia: a8e4f95dce9db727dda5d408b038f97fb4296557
Gecko: 7b823253d9f2
Gonk: b83fc73de7b64594cd74b33e498bf08332b5d87b
Version: 40.0a1 (3.0)
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:40.0) Gecko/40.0 Firefox/40.0

Reopening this bug
Status: RESOLVED → REOPENED
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(pbylenga)
Resolution: WORKSFORME → ---
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(pbylenga)
Bhavana, this should block Spark.

Gabriele, could you take a look into this?
Flags: needinfo?(gsvelto)
Flags: needinfo?(bbajaj)
blocking-b2g: 3.0+ → spark+
Flags: needinfo?(bbajaj)
Sure, I'll have a look though I've never encountered this issue in practice.
Flags: needinfo?(gsvelto)
NI on Johan to take a look.
Flags: needinfo?(jlorenzo)
I found an accurate (5/5 times) way to reproduce the problem. Here are the steps with today's build[1]:

1. Set the memory to 260MB (That's correct, this value is not supposed to be supported. I'll explain the reason of this value below).
2. Run test_dialer.py[2]

How did we come to 260MB?
* I tried to repro the problem like usual, by repeating multiple times the test with a restart between each test, on a 319MB device => 0/10 repro.
* Same thing without a restart in between => 0/10 repro
* Still no restart in between, on a 300MB device => 0/10 repro
* Same thing on 280MB => Repro'd at the 3rd attempt
* Same thing on 260MB => 5/5 times reproduced.

I got a logcat with RIL and Gaia debug traces (attached).
At first, the RIL doesn't seem the cause of the issue. Even with the blank call screen, I am able to hear the phone call and I can hang up by pressing the end call button. 

I haven't managed to see something in the gaia traces. Do you, Gabriele?

[1] Build ID               20150503160200
Gaia Revision          e18cce173840d6ff07fb6f1f0e0ffb58b99aab3e
Gaia Date              2015-05-02 04:27:01
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/dc5f85980a82
Gecko Version          40.0a1
Device Name            flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20150503.193941
Firmware Date          Sun May  3 19:39:52 EDT 2015
Bootloader             L1TC000118D0

[2] https://github.com/mozilla-b2g/gaia/blob/master/tests/python/gaia-ui-tests/gaiatest/tests/functional/dialer/test_dialer.py
In gaia:
> cd tests/python/gaia-ui-tests
> virtualenv .env && source .env/bin/activate/ && python setup.py develop && pip install plivo
Edit testvars.json to fill in "remote_phone_number" and the plivo token. PM me if you need these values.
> adb forward tcp:2828 tcp:2828 && gaiatest --testvars=testvars.json --address=localhost:2828 --timeout=30000 gaiatest/tests/functional/dialer/test_dialer.py
Flags: needinfo?(jlorenzo) → needinfo?(gsvelto)
Attachment #8555195 - Attachment is obsolete: true
Attachment #8557392 - Attachment is obsolete: true
Attachment #8560438 - Attachment is obsolete: true
Attachment #8578087 - Attachment is obsolete: true
I've started working on this, hopefully I'll be able to repro soon. BTW I've never seen this while dogfooding but I'm using the full 1024MiB configuration on my Flame and it seems like this is related to a low memory scenario.

I have once met a similar issue which led to the same screen to be displayed but the STR was completely different (two calls were involved with one being held). The log didn't show anything interesting at the time.
Flags: needinfo?(gsvelto)
Assignee: nobody → gsvelto
OK, I've identified the issue. When receiving a 'memorypressure' event we're unconditionally freeing the callscreen window, even if there are ongoing calls:

https://github.com/mozilla-b2g/gaia/blob/bc8be5081676ffed4cf9c2c9c53fc48caa0f1a49/apps/system/js/dialer_agent.js#L60

It's a miracle this hasn't bitten us in the past, though it's perfectly possible that other intermittent dialer bugs were caused by this. For example if the 'memorypressure' condition arises after the callscreen has been shown and populated correctly it turns white even though the call keeps going. I'll cook up a patch shortly.
Moving to the right component, I'm writing unit-tests for the patch and will upload it soon. BTW I think we should backport this to v2.2 too since the change is very small but the impact is significant on memory-constrained devices.
Component: Gaia::Dialer → Gaia::System::Window Mgmt
Comment on attachment 8612304 [details] [review]
[gaia] gabrielesvelto:bug-1126260-callscreen-memory-pressure > mozilla-b2g:master

Here's a patch that fixes the problem and introduces unit-tests to cover this scenario. In a nutshell my change prevents the callscreen from being freed when under memory pressure if there's any ongoing call or if it's visible. The latter condition is important because when we close the callscreen it remains visible for a couple of seconds *after* the last call has been disconnected and we don't want to turn it into a white window while the user is still seeing it.
Attachment #8612304 - Flags: review?(alive)
This affects v2.2 and the repro rate is going to be high on memory constrained devices so nom'ing for 2.2+. When it happens it effectively makes the callscreen unusable, possibly preventing a user from answering a call, preventing the user from making a call or even making the user unable to hang up a call if it happens while one is already connected. I think this should be enough to consider this a cert blocker. The upside is that the patch is only 7 lines, carries very little risk and all possible scenarios are covered with unit-tests.
blocking-b2g: spark+ → 2.2?
Comment on attachment 8612304 [details] [review]
[gaia] gabrielesvelto:bug-1126260-callscreen-memory-pressure > mozilla-b2g:master

LGTM, thanks.
Attachment #8612304 - Flags: review?(alive) → review+
since it's very close to CC June 8th, ni RM Josh, I suggest to make it 2.2+.
Flags: needinfo?(jocheng)
Thanks for the quick review Alive! Ready to land.
Keywords: checkin-needed
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
2.2+ as this could break major function.
Hi Gabriele,
Could you please raise 2.2 approval?
Thanks
blocking-b2g: 2.2? → 2.2+
Flags: needinfo?(jocheng) → needinfo?(gsvelto)
Comment on attachment 8612304 [details] [review]
[gaia] gabrielesvelto:bug-1126260-callscreen-memory-pressure > mozilla-b2g:master

[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): bug 999478
[User impact] if declined: In low-memory situations the callscreen might stop working, preventing the user from receiving a call, making one or hanging up one already connected
[Testing completed]: Tested on a Flame by setting the memory pressure trigger so low as to force memory pressure events to be fired almost permanently, the fix is also covered with unit-tests
[Risk to taking this patch] (and alternatives if risky): Disabling memory pressure detection, but this will hamper applications consuming a lot of memory and memory-constrained devices in general
[String changes made]: None
Flags: needinfo?(gsvelto)
Attachment #8612304 - Flags: approval-gaia-v2.2?
I ran the automation script that often reproduced this issue 40 times on a full flashed engineering build from today and saw 0 reproductions:

Actual Results: All calls were made correctly.
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Keywords: qawanted, verifyme
Forgot to paste the build variables for the above comment.

Environmental Variables:
Device: Flame 3.0
BuildID: 20150529063340
Gaia: a0d4c222c1f1a8d2954539dac1537a1166aab320
Gecko: 9738f055d98c
Gonk: 040bb1e9ac8a5b6dd756fdd696aa37a8868b5c67
Version: 41.0a1 (3.0) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:41.0) Gecko/41.0 Firefox/41.0
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Comment on attachment 8612304 [details] [review]
[gaia] gabrielesvelto:bug-1126260-callscreen-memory-pressure > mozilla-b2g:master

Approving as this causing major function failure.
Attachment #8612304 - Flags: approval-gaia-v2.2? → approval-gaia-v2.2+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: