Closed Bug 1422769 Opened 7 years ago Closed 5 years ago

Marionette incompatiblities during page navigation with multiple content processes

Categories

(Remote Protocol :: Marionette, defect, P3)

59 Branch
defect

Tracking

(firefox58 affected, firefox59 affected)

RESOLVED WORKSFORME
Tracking Status
firefox58 --- affected
firefox59 --- affected

People

(Reporter: whimboo, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The failure happens mainly for test_navigation.py, and here in `setUp()`:

>        self.marionette.navigate(self.marionette.absolute_url("windowHandles.html"))
>    --> self.new_tab = self.open_tab(open_with_link)

In `open_tab` a click on an element is performed which should be immediately present when the loading of "windowHandles.html" has been done. But the following is visible:

05:21:34     INFO -  [Parent 1750, Main Thread] WARNING: [nsFrameLoader] ReallyStartLoadingInternal tried but couldn't show remote browser.
05:21:34     INFO -  : file /builds/worker/workspace/build/src/dom/base/nsFrameLoader.cpp, line 904

It means the content browser is not there.
I added a screenshot as attachment 8934897 [details] accidentally to bug 1422741.

Mike, can you help us with this failure in Firefox?
Flags: needinfo?(mconley)
Since my patch on bug 1417051 has been landed we see it kinda often now, mostly for debug builds across platforms for the test:

https://dxr.mozilla.org/mozilla-central/rev/a928be5dacc3b544e29c0612b3f8cda6447df802/testing/marionette/harness/marionette_harness/tests/unit/test_navigation.py#571

Due to the high intermittent failure rate I have to skip this test now via bug 1422741.
I can now finally reproduce it locally by just running TestPageLoadStrategy.test_strategy_after_remoteness_change in a loop. So this test is doing something which following tests stumble over and cause Firefox to misbehave.
Meanwhile I'm not sure if the warning as logged by Firefox is the real underlying issue. I base this on the following trace:

> 1512560287116	Marionette	TRACE	183 -> [0,11,"findElement",{"using":"id","value":"new-blank-tab"}]
> 1512560287136	Marionette	TRACE	183 <- [1,11,null,{"value":{"element-6066-11e4-a52e-4f735466cecf":"b926a9dd-6be0-4b47-b63b-b05d3e4079b8","ELEMENT":"b926a9dd-6be0-4b47-b63b-b05d3e4079b8"}}]
> 1512560287139	Marionette	TRACE	183 -> [0,12,"clickElement",{"id":"b926a9dd-6be0-4b47-b63b-b05d3e4079b8"}]
> **** Failed with: Web element reference not seen before: b926a9dd-6be0-4b47-b63b-b05d3e4079b8
> stacktrace:
>	WebDriverError@chrome://marionette/content/error.js:172:5
>	NoSuchElementError@chrome://marionette/content/error.js:400:5
>	get@chrome://marionette/content/element.js:231:13
>	clickElement@chrome://marionette/content/listener.js:1318:14

As you can see we find the element with the id `new-blank-tab` but then fail to click on the exact same element. Note that we are still in the same window - there was no call to `switch_to_window` happening.

Now the interesting fact... When this failure gets thrown we have 2(!!) tab open. The first one with the wanted URL containing the element we want to click on, and a second blank tab. More importantly this second tab is selected and in the foreground. It feels that we have raised this exception, but at the same time successfully clicked on the element, which should have opened a new blank tab.

I will add more logging to the test, so that we might get an idea what's up with it.
Something is broken in the element store so the exception gets raised, but we still click the link, so that a new tab with about:blank opens right away:

1512562569959	Marionette	TRACE	57 -> [0,11,"getWindowHandles",{}]
1512562569962	Marionette	TRACE	57 <- [1,11,null,["2147483649"]]
1512562569983	Marionette	TRACE	57 -> [0,12,"getWindowHandle",{}]
1512562569985	Marionette	TRACE	57 <- [1,12,null,{"value":"2147483649"}]
1512562569988	Marionette	TRACE	57 -> [0,13,"findElement",{"using":"id","value":"new-blank-tab"}]
1512562570018	Marionette	TRACE	57 <- [1,13,null,{"value":{"element-6066-11e4-a52e-4f735466cecf":"e1efd178-568b-e748-bed1-3a682fda5f06","ELEMENT":"e1efd178-568b-e748-bed1-3a682fda5f06"}}]
1512562570023	Marionette	TRACE	57 -> [0,14,"getWindowHandles",{}]
1512562570025	Marionette	TRACE	57 <- [1,14,null,["2147483649"]]
1512562570029	Marionette	TRACE	57 -> [0,15,"getWindowHandle",{}]
1512562570031	Marionette	TRACE	57 <- [1,15,null,{"value":"2147483649"}]
1512562570044	Marionette	TRACE	57 -> [0,16,"clickElement",{"id":"e1efd178-568b-e748-bed1-3a682fda5f06"}]
1512562570069	Marionette	TRACE	57 <- [1,16,{"error":"no such element","message":"Web element reference not seen before: e1efd178-568b-e748-bed1-3a682fda5f06","sta ... :5\nget@chrome://marionette/content/element.js:231:13\nclickElement@chrome://marionette/content/listener.js:1318:14\n"},null]
**** Failed with: Web element reference not seen before: e1efd178-568b-e748-bed1-3a682fda5f06
stacktrace:
	WebDriverError@chrome://marionette/content/error.js:172:5
	NoSuchElementError@chrome://marionette/content/error.js:400:5
	get@chrome://marionette/content/element.js:231:13
	clickElement@chrome://marionette/content/listener.js:1318:14

1512562570082	Marionette	TRACE	57 -> [0,17,"getWindowHandles",{}]
1512562570084	Marionette	TRACE	57 <- [1,17,null,["2147483649"]]
[Parent 93387, Main Thread] WARNING: [nsFrameLoader] ReallyStartLoadingInternal tried but couldn't show remote browser.
: file /builds/worker/workspace/build/src/dom/base/nsFrameLoader.cpp, line 904
1512562570160	Marionette	TRACE	57 -> [0,18,"getWindowHandle",{}]
1512562570161	Marionette	TRACE	57 <- [1,18,null,{"value":"2147483649"}]
1512562570161	Marionette	DEBUG	Register listener.js for window 2147483728
1512562573171	Marionette	TRACE	57 -> [0,19,"getWindowHandles",{}]
1512562573173	Marionette	TRACE	57 <- [1,19,null,["2147483649","2147483728"]]
1512562573177	Marionette	TRACE	57 -> [0,20,"getWindowHandle",{}]
1512562573178	Marionette	TRACE	57 <- [1,20,null,{"value":"2147483649"}]

In that case we should not throw the exception because the element has to be present. So I wonder why the element cannot be found. Most likely because it was wrongly or not stored.

On the other side the click should not happen at all if the element retrieval fails!

I don't think that this is a problem with Firefox anymore, but that Marionette is misbehaving here. I will further investigate.

Given that this causes misbehavior for users of Marionette/Geckodriver, and is the reason for one of the top intermittents of Mn jobs since yesterday, this should be a P1.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Flags: needinfo?(mconley)
Priority: -- → P1
Summary: Page load failures due to: "ReallyStartLoadingInternal tried but couldn't show remote browser" → Failure interacting with elements due to: Web element reference not seen before
Ok, so it seems that this is only the click command which sometimes fail:

1512563284055	Marionette	TRACE	189 -> [0,16,"getElementText",{"id":"63ef8e25-5f7d-824b-bcf1-07f147f26788"}]
New blank Tab
1512563284115	Marionette	TRACE	189 <- [1,16,null,{"value":"New blank Tab"}]
1512563284119	Marionette	TRACE	189 -> [0,17,"clickElement",{"id":"63ef8e25-5f7d-824b-bcf1-07f147f26788"}]
1512563284139	Marionette	TRACE	189 <- [1,17,{"error":"no such element","message":"Web element reference not seen before: 63ef8e25-5f7d-824b-bcf1-07f147f26788","sta ... :5\nget@chrome://marionette/content/element.js:231:13\nclickElement@chrome://marionette/content/listener.js:1318:14\n"},null]

I added a dump statement to listener.js:clickElement to check if it is present, and got the following:

1512565871143	Marionette	TRACE	153 -> [0,17,"clickElement",{"id":"cf44f498-6faf-c04d-89a1-c229d7eb742b"}]
** has element [object ContentWebElement uuid=cf44f498-6faf-c04d-89a1-c229d7eb742b]: true
** has element [object ContentWebElement uuid=cf44f498-6faf-c04d-89a1-c229d7eb742b]: false

It means that `clickElement` in listener.js is getting run twice right after each other. The first time we have a valid window reference and element store, but for the second time it fails.

Interestingly I also see duplicated DEBUG output of the page load listener:

1512565870337	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:64698/windowHandles.html"
1512565870338	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:64698/windowHandles.html"
1512565870340	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:64698/windowHandles.html"
1512565870341	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:64698/windowHandles.html"
++DOMWINDOW == 19 (0x11173b400) [pid = 94161] [serial = 207] [outer = 0x110547e20]
1512565870366	Marionette	DEBUG	Received DOM event "DOMContentLoaded" for "http://127.0.0.1:64698/windowHandles.html"
1512565870367	Marionette	DEBUG	Received DOM event "DOMContentLoaded" for "http://127.0.0.1:64698/windowHandles.html"
1512565870375	Marionette	DEBUG	Received DOM event "pageshow" for "http://127.0.0.1:64698/windowHandles.html"
1512565870376	Marionette	DEBUG	Received DOM event "pageshow" for "http://127.0.0.1:64698/windowHandles.html"

Going further back (specifically to the beginning of the test) that output is shown:

1512565870044	Marionette	DEBUG	Register listener.js for window 2147483649
1512565870078	Marionette	DEBUG	Register listener.js for window 2147483656
1512565870090	Marionette	TRACE	153 <- [1,1,null,{"sessionId":"4f16ee80-6943-5e45-9c3f-0ede90db2e90","capabilities":{"browserName":"firefox","browserVersion":"59.0a ... ssID":94159,"moz:profile":"/var/folders/4k/sf4gz5fn3kl9hr3nd7pzbvhc0000gn/T/tmpyGHFva.mozrunner","moz:webdriverClick":true}}]
1512565870102	Marionette	TRACE	153 -> [0,2,"setTimeouts",{"script":30000}]
1512565870106	Marionette	TRACE	153 <- [1,2,null,{}]
1512565870115	Marionette	DEBUG	Register listener.js for window 2147483649
1512565870116	Marionette	TRACE	153 -> [0,3,"setTimeouts",{"pageLoad":300000}]
1512565870120	Marionette	TRACE	153 <- [1,3,null,{}]
1512565870128	Marionette	TRACE	153 -> [0,4,"setTimeouts",{"implicit":0}]
1512565870131	Marionette	TRACE	153 <- [1,4,null,{}]
1512565870136	Marionette	DEBUG	Register listener.js for window 2147483656
1512565870141	Marionette	TRACE	153 -> [0,5,"getCurrentChromeWindowHandle",{}]
1512565870143	Marionette	TRACE	153 <- [1,5,null,{"value":"3"}]
1512565870151	Marionette	TRACE	153 -> [0,6,"getChromeWindowHandles",{}]
1512565870155	Marionette	TRACE	153 <- [1,6,null,["3"]]
1512565870159	Marionette	TRACE	153 -> [0,7,"getWindowHandle",{}]
1512565870160	Marionette	TRACE	153 <- [1,7,null,{"value":"2147483649"}]

There are multiple lines for "Register listener.js" which means that our framescript is loaded more than once while the maybe the tab is getting moved around between different content processes?
Here the different registrations for our framescript:

1512567917769	Marionette	DEBUG	Register listener.js for window 2147483649 - http://127.0.0.1:50369/windowHandles.html
1512567917852	Marionette	DEBUG	Register listener.js for window 2147483656 - about:newtab
1512567917890	Marionette	DEBUG	Register listener.js for window 2147483649 - http://127.0.0.1:50369/windowHandles.html
1512567917915	Marionette	DEBUG	Register listener.js for window 2147483656 - about:newtab

When everything works fine the following output is seen:

1512567915722	Marionette	DEBUG	Register listener.js for window 2147483649 - http://127.0.0.1:50369/windowHandles.html
1512567915734	Marionette	DEBUG	Register listener.js for window 2147483656 - about:newtab

The framescript is getting loaded twice that much as usual in the failing case.
Attached file Marionette test
The failure here is clearly related to multiple content processes. When I set `dom.ipc.processCount` to `1` it's no longer visible, and as higher its value is as better I can reproduce it.

So moving the tabs several times between content processes, which causes our framescript to get registered again and again, creates a situation where we start working with invalid references.

As attached you can find a Marionette test, which should be executed with the following command as best with a debug build:

`mach marionette test %path_to_test_file% --run-until-failure`
I have an explanation of what's going on here...

In Marionette we have the framescript (listener.js) which gets loaded and registered whenever a tab is opened, changes its remoteness state, or when it gets moved to another process. In all those situations the outer window id of the browser changes, and as such the id of the framescript, and all its message listeners. Further we use the broadcast message manager right now to communicate between the parent and the framescript, which is not ideal.

For commands which have to survive a re-registering of the framescript (like navigation commands, and click), a queue called pendingCommands is used. It stores data which has to be communicated to the framescript in case of a reload. This works fine for a single reload of the framescript. But with multiple content processes it can happen that a tab gets moved around between processes more than once during the command execution. In such a situation the pendingCommands queue has been flushed, and we no longer let the framescript know about a still ongoing command.

There are currently two different situations we end-up in:

1) For commands like execute_async_script() Marionette would wait forever and never send a response to the client, until it hits the global timeout and force kills Firefox.

2) Under some situations our framescript is getting loaded twice for the same tab. Here an example when calling newSession with two open tabs:

> 1512648283578	Marionette	DEBUG	Register listener.js for window 2147483649 ([object ContentFrameMessageManager @ 0x117df71c0 (native @ 0x117df14c0)]
> 1512648283585	Marionette	DEBUG	**** start listeners in listener.js
> 1512648283590	Marionette	TRACE	84 <- [1,1,null,{"sessionId":"eff0b5c7-f894-9f48-bf75-eacdb7e94d92","capabilities":{"browserName":"firefox","browserVersion":"59.0a ... essID":1828,"moz:profile":"/var/folders/4k/sf4gz5fn3kl9hr3nd7pzbvhc0000gn/T/tmpF6xB7k.mozrunner","moz:webdriverClick":true}}]
> 1512648283593	Marionette	DEBUG	Register listener.js for window 2147483656 ([object ContentFrameMessageManager @ 0x121730280 (native @ 0x12172ad00)]
> 1512648283600	Marionette	TRACE	84 -> [0,2,"setTimeouts",{"script":30000}]
> 1512648283601	Marionette	TRACE	84 <- [1,2,null,{}]
> 1512648283602	Marionette	DEBUG	**** start listeners in listener.js
> 1512648283603	Marionette	TRACE	84 -> [0,3,"setTimeouts",{"pageLoad":300000}]
> 1512648283606	Marionette	TRACE	84 <- [1,3,null,{}]
> 1512648283610	Marionette	TRACE	84 -> [0,4,"setTimeouts",{"implicit":0}]
> 1512648283612	Marionette	TRACE	84 <- [1,4,null,{}]
> 1512648283612	Marionette	DEBUG	Register listener.js for window 2147483649 ([object ContentFrameMessageManager @ 0x117df71c0 (native @ 0x117df14c0)]
> 1512648283614	Marionette	TRACE	84 -> [0,5,"getCurrentChromeWindowHandle",{}]
> 1512648283615	Marionette	TRACE	84 <- [1,5,null,{"value":"3"}]
> 1512648283615	Marionette	DEBUG	**** start listeners in listener.js
> 1512648283619	Marionette	TRACE	84 -> [0,6,"getChromeWindowHandles",{}]
> 1512648283620	Marionette	TRACE	84 <- [1,6,null,["3"]]
> 1512648283622	Marionette	DEBUG	Register listener.js for window 2147483656 ([object ContentFrameMessageManager @ 0x121730280 (native @ 0x12172ad00)]

For both tabs the framescript is loaded twice, and as such all the event listeners are getting registered twice!
Summary: Failure interacting with elements due to: Web element reference not seen before → Marionette incompatiblities with multiple content processes
It looks similar to what Andreas fixed on bug 1239552, but that code is still in place. So I wonder what else could cause a reload of the framescript in the same tab without a window id change.

Not sure what the best here is. Just to wait for bug 1311041 being fixed, or even to temporarily disable multiple content processes in Marionette until then.

Andreas, given that you refactor that much code in this area right now, could you have a look?
Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) [mostly away Dec 11th - Jan
3rd] from comment #10)

> It looks similar to what Andreas fixed on bug 1239552, but that
> code is still in place. So I wonder what else could cause a reload
> of the framescript in the same tab without a window id change.
> 
> Not sure what the best here is. Just to wait for bug 1311041 being
> fixed, or even to temporarily disable multiple content processes
> in Marionette until then.

First of all, thanks for doing the legwork on researching this.

> Andreas, given that you refactor that much code in this area right
> now, could you have a look?

I don’t have a strong opinion… I do think that the window
refactoring will fix, or make it easier to fix, this bug.  If we
think it will provide a better user experience to momentarily
disable multiprocessing I would be fine with that.

We could set a recommended preference in geckodriver’s
testing/geckodriver/src/prefs.rs, or are you suggesting disabling
e10s also for m-c?
Flags: needinfo?(ato)
Depends on: 1426154
With bug 1426154 being fixed I tried out my testcase as attached to the bug and was first happy that nothing failed. I retriggered a local test run with repeat 100 a couple of times and all was fine. Then I just wanted to see what --repeat 500 brings us. I started this command, and then it failed:

> 1515668925204	Marionette	TRACE	1116 -> [0,13,"get",{"url":"http://127.0.0.1:63310/windowHandles.html"}]
> 1515668925209	Marionette	DEBUG	Received DOM event "beforeunload" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925217	Marionette	DEBUG	Received DOM event "beforeunload" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925218	Marionette	DEBUG	Received DOM event "beforeunload" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925228	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925228	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925229	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925230	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925235	Marionette	DEBUG	Received DOM event "DOMContentLoaded" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925247	Marionette	DEBUG	Received DOM event "DOMContentLoaded" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925253	Marionette	DEBUG	Received DOM event "pageshow" for "http://127.0.0.1:63310/windowHandles.html"
> 1515668925253	Marionette	DEBUG	Received DOM event "pageshow" for "http://127.0.0.1:63310/windowHandles.html"
> [..]
> NoSuchElementException: Failed to trigger opening a new tab: Web element reference not seen before: d96649a4-a0ba-9a4b-9917-ce91b409665a
> [..]
> passed: 371
> failed: 1
> todo: 0

It is very very seldom that this failure is happening now, and this is fantasic! But Marionette still seems to fail, or maybe what we could also consider is a bug in IPC core. But that this hard to tell, and I don't have the time to further dig into this problem.

I would say we first wait for window tracking patch to be landed. Then we can continue here.
I'm going to lower the priority given that this problem doesn't appear that often anymore.
Priority: P1 → P3
Depends on: 1504807
I think that the underlying problem here is bug 1503274.
Depends on: 1503274
I will leave this bug open until we have bug 1311041 fixed, given that it will implement a sane tracking of open windows and tabs.
Note, that also with the just landed fix on bug 1509906 Firefox directly loads the target URL in the correct process type.

All issues as noticed for multiple processes so far were navigation related. With all the fixes I landed in the recent weeks I tried the attached testcase again, and wasn't able to let it fail even once for 500 iterations. Also the referenced test in test_navigation.py (bug 1422741) seems to work fine now.

I think that we can finally close this bug. There is nothing more which I can do for now, but will reopen it in case something else is left over.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
Summary: Marionette incompatiblities with multiple content processes → Marionette incompatiblities during page navigation with multiple content processes

I'm seeing a Marionette failure with multiple windows that has a related error trace.

Would changing dom.ipc.processCount to 1 help debug the issue?

Flags: needinfo?(hskupin)

Feel free to attach the trace output of that failing test. I would be interested to see it. Also if you can reproduce it a testcase would be fantastic.

Flags: needinfo?(hskupin)
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: