Closed Bug 1162112 Opened 9 years ago Closed 9 years ago

test_rocketbar_add_collection_save_bookmark.py: " TimeoutException: TimeoutException: Timed out after 30.1 seconds "

Categories

(Firefox OS Graveyard :: Gaia::UI Tests, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: onelson, Unassigned)

References

()

Details

Attachments

(3 files, 1 obsolete file)

Description:
When the test attempts to add a new smart collection to the homescreen from hold-tapping the homescreen, they will observe that upon attempting to load into the menu for 'smart collection' selection, the phone will not navigate past the loading screen. Observed in manually testing, this only appears when the user performs with solely a data connection, no repro on Wi-Fi. However, the test appears to have Wi-Fi enabled and fails this test.

* http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/209/HTML_Report/
* http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk.ui.adhoc/823/HTML_Report/

Repro Steps:
1) Update phone to 20150506010204
2) Connect to Data Connection
3) Progress through FTU and reach homescreen
4) Hold tap empty space on homescreen to enter dialog
5) Tap 'Add Smart Collections'
6) Observe UI

Actual:
Loading occurs indefinitely

Expected: 
Loading maintains until Collections proc in timely manner


Traceback (most recent call last):
 File "/var/jenkins/1/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_client-0.12-py2.7.egg/marionette/marionette_test.py", line 296, in run
testMethod()
File "/var/jenkins/1/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/tests/functional/rocketbar/test_rocketbar_add_collection_save_bookmark.py", line 20, in test_rocketbar_add_collection
collection_activity = contextmenu.tap_add_collection()
File "/var/jenkins/1/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/apps/homescreen/regions/context_menu.py", line 29, in tap_add_collection
return CollectionsActivity(self.marionette)
File "/var/jenkins/1/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/tests/python/gaia-ui-tests/gaiatest/apps/homescreen/regions/collections_activity.py", line 22, in __init__
Wait(self.marionette).until(expected.element_not_displayed(loading))
File "/var/jenkins/1/workspace/flame-kk-319.mozilla-central.nightly.ui.functional.smoke/.env/local/lib/python2.7/site-packages/marionette_driver-0.5-py2.7.egg/marionette_driver/wait.py", line 143, in until
cause=last_exc)
TimeoutException: TimeoutException: Timed out after 30.1 seconds


Environmental Variables:
Device firmware (base) 	L1TC100118D0
Device firmware (date) 	06 May 2015 08:21:32
Device firmware (incremental) 	eng.cltbld.20150506.042120
Device firmware (release) 	4.4.2
Device identifier 	flame
Device memory 	219772 kB
Device serial 	e479da71
Device uptime 	0 days 0 hours 2 minutes 12 seconds
Gaia date 	05 May 2015 21:10:18
Gaia revision 	3e6fd1e0a478
Gecko build 	20150506010204
Gecko revision 	ba44099cbd07
Gecko version 	40.0a1

Reproducible manually: YES, when attempting with only a data connection

Repro frequency: 5/6
QA Whiteboard: [QAnalyst-Triage?][fxosqa-auto-backlog?]
Flags: needinfo?(pbylenga)
I suspect this is just a matter of the external website being very slow to deliver the necessary content, which causes this test to time out.
Attached patch 1162112.diff (obsolete) — Splinter Review
Not even this worked. I now even get a "Connect to the internet to add more Smart Collections" dialog, even though I have a working wifi and internet connection on my phone.
Related to bug 1144557?

I can also reproduce this on b2g desktop on v2.2. Sometimes, the list appears quick, other times very slow.

Kgrandon, is there some way to know if this would be server-side problem?
Flags: needinfo?(kgrandon)
Sounds like it's potentially a server problem. I haven't seen this personally though.

Any chance you could get some logs for me?
Flags: needinfo?(kgrandon)
QA Whiteboard: [QAnalyst-Triage?][fxosqa-auto-backlog?] → [QAnalyst-Triage+][fxosqa-auto-backlog?]
Flags: needinfo?(pbylenga)
Comment 5 is private: false
Attachment #8602197 - Attachment is obsolete: true
Comment on attachment 8602886 [details] [review]
[gaia] mwargers:1162112 > mozilla-b2g:master

Geo, I know that adding time.sleep() stuff is horrible, but in this case it seems necessary again. 
Using Wait().until seems to interfere with network loading here, currently. I noticed this also in bug 1136791, comment 4 (where we used a same time.sleep workaround).


I propose to use this workaround for now and to file a bug on the issue that Wait().until seems to interfere with network loading and then causes certain tests to time out that depend on a certain external site to load.
Attachment #8602886 - Flags: review?(gmealer)
(In reply to Kevin Grandon :kgrandon from comment #5)
> Sounds like it's potentially a server problem. I haven't seen this
> personally though.
> 
> Any chance you could get some logs for me?

Kevin, sorry, what kind of logs are you looking for and how to get them?
In any case, I suspect there is some issue here, related to Marionette Wait() and page loading (see comment 7), which a log would still be useful.
Flags: needinfo?(kgrandon)
Assignee: nobody → martijn.martijn
I think I'm just looking for typical stdout/console.log statements. I'm not really familiar with this harness unfortunately.
Flags: needinfo?(kgrandon)
Comment on attachment 8602886 [details] [review]
[gaia] mwargers:1162112 > mozilla-b2g:master

Martijn, before I r+ this, can you give a shot with a slightly different Wait instead of the sleep:

Wait(self.marionette, timeout=30, interval=5).until(...)

This would go in both cases.

What that does is make us only check every 5 seconds instead of the default ten times a second. Makes it less likely the Wait polling loop is slowing down background operations like network loading.

FWIW, if we're ever having issues Waiting on music playing and stuff, same thing applies. Waits can interfere with background tasks unless you slow the poll down.

If that works, you might be able to drop the interval down to 2 or 1 with no problem, but I'm picking a very long interval for isolation's sake.
Flags: needinfo?(martijn.martijn)
Also, sounds like Oliver might be seeing this repro manually on both wifi and data at times per IRC convo. We should be careful not to change the test to accommodate an actual problem, so that should probably be figured out first.

NI'ed Oliver for more input.
Flags: needinfo?(onelson)
https://bugzilla.mozilla.org/show_bug.cgi?id=1161421

This issue had been bugged previously for reproduction manually, and I just got around to adding a video for the issue [https://youtu.be/3UmMWf4a8fU]. 

I managed to repro this issue with both Wi-Fi and Data connections, but it's intermittent nature makes it really hard to track down. I've observed cases where it would pass 5 times in a row, and others where it would fail 5 times in a row. Maybe that could be an indication that the phone may enter a 'state' where the bug becomes more prone to occur.
Flags: needinfo?(onelson)
(In reply to Oliver Nelson [:oliverthor] from comment #12)

> I managed to repro this issue with both Wi-Fi and Data connections, but it's
> intermittent nature makes it really hard to track down. I've observed cases
> where it would pass 5 times in a row, and others where it would fail 5 times
> in a row. Maybe that could be an indication that the phone may enter a
> 'state' where the bug becomes more prone to occur.

Could also mean availability of the server is going in and out. Think we'd need to isolate that before knowing whether this was a client-side issue. If the phone is sending the request promptly and reacting to response promptly, then it's functionally correct (though there are other questions there around UX behavior for long requests, things like that).

If there are logs for the request/response that's ideal. Otherwise, not sure our best option for monitoring the network traffic. If you have an wireless AP with traffic monitor on it (Asus's APs have this, for example) that might work.
Attached file logcat.txt
Yeah, this seems like a server issue. I can reproduce it manually too, on device and b2g desktop.
There are times, the server responds well and then the test passes, but often, it stalls at some point when it's trying to load something.

This is a logcat attached, when the loading is very slow.

One thing I noticed is that the url that is being loaded is:
https://appsearch.services.mozilla.com/partners/1.0/Categories/list/?apiKey=79011a035b40ef3d7baeabc8f85b862f&deviceId=fxos-app://collection.gaiamobile.org/800e2b2b-f9c9-4a52-9e9f-66dda08aa411&ctx={%22lc%22:%22en-US%22,%22tz%22:%222%22,%22v%22:%223.0.0.0-prerelease%22,%22dn%22:%22flame%22,%22cr%22:%22choozze%22,%22ct%22:%22wifi%22,%22mcc%22:%22204%22,%22mnc%22:%2216%22,%22sr%22:%22480x853.5%22}&

Sometimes it's really slow and sometimes it's quick.
So I guess it's an issue with that site.
Flags: needinfo?(martijn.martijn)
Naoki, do you know who to contact regarding server issues with smart collections (that was everything.me, right?)
Flags: needinfo?(nhirata.bugzilla)
Ran would be the point of contact for everything.me.  I needinfo'ed him on this.

At the same time, we should determine if it's the client ( isolated by using desktop web browser w/ firefox os mobile agent ), the network between the server ( we would have to do some network monitoring of sending packet / ack and the time between), or the server.  I've ran into this before at certain time periods and the server seems fine, it seems like it may be global traffic.  It's hard for the everything.me team to test there because they are at the location so chances are they would only be running things on their local network unless they use some sort of VPN service and route all their network traffic through the VPN.

Also to note, we will be switching to our own snapshot server at some point soon, in which case jlaz would be the point of contact.
Flags: needinfo?(ran)
Flags: needinfo?(nhirata.bugzilla)
Flags: needinfo?(jlaz)
It seems to me that the site https://appsearch.services.mozilla.com should always directly load. If it doesn't load directly always, then I think it means that Smart Collections sometimes don't load well.
When I try to reload https://appsearch.services.mozilla.com a couple of times, there is always one time, where the site doesn't load for me.
Currently https://appsearch.services.mozilla.com is a redirect to the everything.me server hosted at everything.me.  We do not have control of the server, everything.me does.

This is also why you may experience latency especially at peak hours of the internet in different parts of the world.  There is a plan to switch over the redirect to a direct hosting of a snapshot e.me server that we host in the near future.

Regardless, we should first narrow down where the latency comes from as per comment 16.  If something doesn't load well from a server, it usually indicates latency and packet loss.  We need to narrow down if it's client, network or server related.

To note : http://www.internetpulse.net/ and http://www.internettrafficreport.com/ may be good to look at during the times when you have latency
Depends on: 1163040
Possible points of failure:
1. e.me server
2. appsearch redirector
3. Client connectivity
4. Client code bug

I think it's safe to assume that it's not a server problem or code problem since the problem is network bound.
I suggest looking at logs from the appsearch redirector to see if these failed requests 1) reach the redirector 2) e.me response is successfully fetched and sent back to the client.
Flags: needinfo?(ran)
I'm pretty certain it's some server side issue. I see the problem on b2g desktop as well as on device.
The https://appsearch.services.mozilla.com url sometimes works and sometimes refuses to return anything.

This is what I get in logcat:
I/Smart Collections( 7369): Content JS LOG: evme API request: https://appsearch.services.mozilla.com/partners/1.0/Categories/list/?apiKey=79011a035b40ef3d7baeabc8f85b862f&deviceId=fxos-app://collection.gaiamobile.org/f0d2a03d-a3ec-4447-a6a9-e673111ff286&ctx={"lc":"en-US","tz":"2","v":"3.0.0.0-prerelease","dn":"flame","cr":"choozze","ct":"wifi","mcc":"204","mnc":"16","sr":"480x853.5"}& 
I/Smart Collections( 7369):     at log (app://collection.gaiamobile.org/gaia_build_defer_create.js:334:619)
I/Smart Collections( 7369): Content JS LOG: evme create-collection: error network error 

That last part makes it evident, it's a server-side issue, no?
I updated the pull request now to use timeout=30, interval=5.
I still think it would be better to use this pull request. 
External sites might still be slow to respond and this makes it more clear that it's not the test code itself, that could be causing the issue.
Flags: needinfo?(gmealer)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #21)
> I updated the pull request now to use timeout=30, interval=5.
> I still think it would be better to use this pull request. 
> External sites might still be slow to respond and this makes it more clear
> that it's not the test code itself, that could be causing the issue.

Did it make things better? The request itself is simple, and I'm fine r+'ing it, but I am curious whether you saw any change in behavior.
Flags: needinfo?(gmealer)
Comment on attachment 8602886 [details] [review]
[gaia] mwargers:1162112 > mozilla-b2g:master

I'm certainly more comfortable with the slow Wait rather than a sleep. This might still be a bit speculative but r+.
Attachment #8602886 - Flags: review?(gmealer) → review+
Attachment #8602886 - Flags: review?(jlorenzo)
Comment on attachment 8602886 [details] [review]
[gaia] mwargers:1162112 > mozilla-b2g:master

Based on the discussion we had last stand up, it looks like a good fix.
Attachment #8602886 - Flags: review?(jlorenzo) → review+
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #24)
> Comment on attachment 8602886 [details] [review]
> [gaia] mwargers:1162112 > mozilla-b2g:master

Merged this: https://github.com/mozilla-b2g/gaia/commit/7b660dc21aaf05d838f64e01a1f1e4374c03ad59

This will not fix the failures, because this is a server issue, so I'm leaving the bug open. Someone from the server team should take a look.
Assignee: martijn.martijn → nobody
Depends on: 1165370
Depends on: 1175469
Ok, this seems to work again (bug 1175469), so this test can be re-enabled.
(In reply to Martijn Wargers [:mwargers] (QA) from comment #26)
> Created attachment 8624816 [details] [review]
> https://github.com/mozilla-b2g/gaia/pull/30675
> 
> Ok, this seems to work again (bug 1175469), so this test can be re-enabled.

Merged: https://github.com/mozilla-b2g/gaia/commit/77dd6b90d52f94853f11101d574107b873fa01b3

Marking wfm, as this seems to be working again, see bug 1175469.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(jlaz)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: