Closed Bug 1363368 Opened 7 years ago Closed 2 years ago

Registering a browser doesn't wait for contentBrowser to be ready

Categories

(Remote Protocol :: Marionette, defect, P2)

defect

Tracking

(firefox54 affected, firefox55 affected)

RESOLVED WORKSFORME
Tracking Status
firefox54 --- affected
firefox55 --- affected

People

(Reporter: whimboo, Unassigned)

References

Details

While investigating the issue on bug 1361837 I noticed the following inconsistency in a test after an in_app restart:

> def test_in_app_restart_with_callback(self):
>      self.marionette.restart(in_app=True,
>                              callback=lambda: self.shutdown(restart=True))
>      print "* Current window handle: %s" % self.marionette.current_window_handle
>      print "* Window handles: %s" % self.marionette.window_handles

Result:

* Current window handle: 3
* Window handles: []

Window handles should at least contain '3' because that's the currently selected window.

The interesting part is here:

1494336958544   Marionette      DEBUG   Register listener.js for window id "12"
1494336958577   Marionette      TRACE   2 <- [1,1,null,{"sessionId":"3824cd37-0d2b-4f63-8c3f-58d8b5a8ae95","capabilities":{"browserName":"firefox","browser
Version":"55.0a1","platformName":"linux","platformVersion":"3.13.0-112-generic","pageLoadStrategy":"normal","acceptInsecureCerts":false,"timeouts":{"implic
it":0,"pageLoad":300000,"script":30000},"rotatable":false,"specificationLevel":0,"moz:processID":16614,"moz:profile":"/tmp/tmp5gXR7l.mozrunner","moz:access
ibilityChecks":false}}]
1494336958628   Marionette      TRACE   2 -> [0,2,"setContext",{"value":"content"}]
1494336958630   Marionette      TRACE   2 <- [1,2,null,{}]
1494336958642   Marionette      TRACE   2 -> [0,3,"getWindowHandle",{}]
1494336958644   Marionette      TRACE   2 <- [1,3,null,{"value":"3"}]
1494336958664   Marionette      TRACE   2 -> [0,4,"getWindowHandles",{}]
1494336958665   Marionette      TRACE   2 <- [1,4,null,[]]

We register our frame script for the content browser with the id 12, but getWindowHandle returns the id from the chrome window, which should have indeed a tab and should return that id.

I assume this is a race right after the startup of the application.
Blocks: 1361837
Here from local testing on OS X...

I added two print lines right after the in_app triggered restart via a callback:

* Current chrome window handle: 3
* Chrome window handles: [u'3']

Those report the correct values for the chrome windows. But then I had an assertion line which checks if the currently selected window handle is contained in the known window handles:

 0:37.27 TEST_END: MainThread FAIL, expected PASS
Traceback (most recent call last):
  File "/Volumes/data/code/gecko/testing/marionette/harness/marionette_harness/marionette_test/testcases.py", line 166, in run
    testMethod()
  File "/Volumes/data/code/gecko/testing/marionette/harness/marionette_harness/tests/unit/test_quit_restart.py", line 140, in test_in_app_restart_with_callback
    self.assertIn(self.marionette.current_window_handle, self.marionette.window_handles)
AssertionError: u'3' not found in []

First, the known window handle here is '3', which basically means we have used the chrome window handle. This should only happen when there is a non-browser window or a browser without any tabs. In this specific case the latter assumption would make sense, because window handles reports an empty array, and this only happens when a tabbrowser is present but doesn't list any tabs.

I will add more debugging lines to figure out what's going on here.
This is actually a regression in Firefox between May 5th and May 6th. The only bug which comes into my mind here is bug 1054740, which landed exactly at this time and caused problems for us in various restart tests. See bug 1361837.

It means at the time when Marionette checks for window handles there is a browser with no single tab present. Please note that in the above test there was a single page open before the restart.

Dao, I will see that I can give you more debug output, and a patch so it will be easy for you to reproduce.
Blocks: 1054740
Flags: needinfo?(dao+bmo)
Keywords: regression
Version: Version 3 → 55 Branch
Ok, so it's not a direct regression from bug 1054740, but it just triggered the behavior to occur way more often now.

From more debugging I can see the following:

1494446301147	Marionette	DEBUG	Finding window handles...
1494446301147	Marionette	DEBUG	* Chrome window: chrome://browser/content/browser.xul
1494446301147	Marionette	DEBUG	** Number of open tabs: 1
1494446301147	Marionette	DEBUG	** Tab: about:blank
1494446301147	Marionette	DEBUG	** Window ID: null
1494446301147	Marionette	DEBUG	Done.

Because the outer window id of the open tab is null, we do not add this id to the list of window handles returned. As such the array is empty.

https://dxr.mozilla.org/mozilla-central/rev/b21b974d60d3075ae24f6fb1bae75d0f122f28fc/testing/marionette/driver.js#185-188
https://dxr.mozilla.org/mozilla-central/rev/b21b974d60d3075ae24f6fb1bae75d0f122f28fc/testing/marionette/driver.js#1225-1241

The problematic line is in the code for getIdForBrowser:

> let winId = browser.outerWindowID;

Not sure why this gives us null.
No longer blocks: 1054740
Keywords: regression
Blocks: 1332122
(In reply to Henrik Skupin (:whimboo) from comment #2)
> It means at the time when Marionette checks for window handles there is a
> browser with no single tab present.

That's not really possible.
Flags: needinfo?(dao+bmo)
Here a bit more details:

494490219648	Marionette	TRACE	2 -> [0,5,"getWindowHandle",{}]
1494490219649	Marionette	DEBUG	*** browser outer window id: null
1494490219649	Marionette	TRACE	2 <- [1,5,null,{"value":"3"}]

Getting the outer window id for the contentBrowser returns null. As such we fallback to use the window id of the chrome window.

1494490219655	Marionette	TRACE	2 -> [0,6,"getWindowHandles",{}]
1494490219656	Marionette	DEBUG	Finding window handles...
1494490219656	Marionette	DEBUG	* Chrome window: chrome://browser/content/browser.xul
1494490219656	Marionette	DEBUG	** Number of open tabs: 1
1494490219656	Marionette	DEBUG	** Tab: about:blank (remote: true)
1494490219656	Marionette	DEBUG	*** browser outer window id: null
1494490219656	Marionette	DEBUG	** Window ID: null
1494490219656	Marionette	DEBUG	Done.
1494490219657	Marionette	TRACE	2 <- [1,6,null,[]]

Same problem as above. Given that there is no window id we do not add it.

Dao, do you have an idea why the id could be null at this time? There are lots of other cases when we handle remoteness tabs, and when the id is correctly returned:

1494490216377	Marionette	DEBUG	Finding window handles...
1494490216377	Marionette	DEBUG	* Chrome window: chrome://browser/content/browser.xul
1494490216377	Marionette	DEBUG	** Number of open tabs: 1
1494490216378	Marionette	DEBUG	** Tab: about:blank (remote: true)
1494490216378	Marionette	DEBUG	*** from browser ids: 4294967297
1494490216378	Marionette	DEBUG	** Window ID: 4294967297
1494490216378	Marionette	DEBUG	Done.
1494490216381	Marionette	TRACE	2 <- [1,6,null,["4294967297"]]

Might this happen in a transition phase and we have to wait until a non null value gets returned?
Flags: needinfo?(dao+bmo)
I found the following:

https://dxr.mozilla.org/mozilla-central/rev/d8762cb967423618ff0a488f14745f60964e5c49/toolkit/components/aboutperformance/content/aboutPerformance.js#80

> let id = browser.outerWindowID; // May be `null` if the browser isn't loaded yet

So it indeed looks like that we would have to poll for the id, or maybe let the framescript wait in registering the browser until it indeed has been fully loaded. I would be more in favor of the second option.
Dao, is there an event we could listen for when the content browser is ready?
Andreas, I think I found a race condition in our code which only seems to happen during startup of the application, or when the framescript is getting (re-)loaded.

Here the detailed explanation... The framescript is sending a message to the parent process with the topic `Marionette:register` when it is loaded and `registerSelf` is executed. Then the parent process is doing a couple of checks for the new content browser and finally updates our internal lists (this.wins, this.browsers, this.curBrowser). While Marionette is handling all those steps we do not block the execution of commands. While this is not a problem for navigation commands because they expect a possible reload of the listener (eg. a remoteness change), other commands might operate on an invalid browser id. Here an example for startup:

1. Marionette client starts Firefox, Marionette gets initialized
2. The session start command is sent
3. The command getWindowHandles is send
4. While 3) is running the tab loading for restoring the session is still active and eg causes a remoteness change including a listener reload
5. getWindowHandles returns a list of ids with the new invalid id of the content browser
6. The command switchToWindow is send for the content browser and fails because the window is not present anymore

We could check on startup if all browsers are done with loading their content. But this wouldn't fix a situation like that in the middle of a session, whereby I'm not sure how often this could happen.
Flags: needinfo?(ato)
As more I dive into this code, as more I think we have to do a complete refactoring of the browser registration and handling. Given its complexity by sending messages multiple times forth and back between the parent and the child process, it's actually not easy to understand.

So fixing this issue might actually take a bit longer as expected, but I will get started on that.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Summary: getWindowHandles can lie about existing windows after a restart → Registering a browser doesn't wait for contentBrowser to be ready
(In reply to Henrik Skupin (:whimboo) from comment #7)
> Dao, is there an event we could listen for when the content browser is ready?

Dunno. mconley?
Flags: needinfo?(dao+bmo) → needinfo?(mconley)
The outerWindowId is set using the Browser:Init message:

http://searchfox.org/mozilla-central/rev/cd8c561106d804e26bc09389f18f361846d005eb/toolkit/content/browser-child.js#615

You could perhaps listen for that.
Flags: needinfo?(mconley)
There are a good number of things that contribute to what we
are seeing here.

Since the outerWindowID remains null until the <xul:browser> has
initialised, the immediate fix to this problem appears to be to listen
for Browser:Init before providing a Marionette session.  But because the
<xul:browser> may be blown away by a frame script reload or a remoteness
change, it is possible it is not as straight forward as just listening
for this event.

whimboo’s comment #8 has a good explanation of what is going on, but
I believe the underlying reasons that put us in to this situation have
to do with (1) Marionette’s internal cache of window handles, (2)
fallbacks when identifying the current window handle, (3) failure to
wait for the browser to be in a known safe state when returning a new
session, and (4) inprecise tracking of unique windows.

Before I go through these point by point, I want to clarify the
definitions of ‘window handle’ and ‘chrome window handle’, which
are respectively WebDriver and Marionette specific concepts:

  window handle
    A WebDriver ‘window’ is equivalent to an HTML top-level browsing
    context.  Each top-level browsing context is assigned a unique,
    serialisable handle that is used to identify the browsing context
    when switching the target for Marionette commands.

    Switching “by window”, as WebDriver calls it, involves locating
    the browsing context (the <xul:browser>) by its window handle,
    changing the internal browser target, and focussing the OS window
    and/or tab that is associated with it.  All subsequent RPC commands
    related to DOM querying or interaction will reach the new targetted
    browsing context.

  chrome window handle
    A Marionette ‘chrome window’ should be equivalent to a
    ChromeWindow.  This is used for targetting the DOM of chrome
    context for querying or interaction, for example to click menus or
    manipulate the URL bar.

(1)

    Due to the inherent asynchronous nature of the web, windows and
    browsing contexts appear and disappear all the time.  Marionette’s
    mapping of windows/browsers to unique identifier is lacking because
    it does not itself keep track of appearing or disappearing objects.
    Instead it relies on code scattered throughout Marionette to update
    it, making the cache’s integrity questionable.

    Marionette keeps a list of windows (GeckoDriver#wins in
    testing/marionette/driver.js) that is meant to be updated as
    new windows appear or disappear.  The same is the case for
    <xul:browser>s and <xul:tabbrowser>s (GeckoDriver#browsers),
    except it is only updated after the Marionette:register
    IPC message is received from the content frame script
    (testing/marionette/listener.js).  It currently uses a mixture of
    observer notifications and outerWindowID to determine when to update
    the cache.

    I think a better approach would be to have a service with
    its own observers that would be the one-source-of-truth for
    all window management and which would not allow consumers to
    manipulate its internal state.  It would contain a weak map of
    an internally-assigned UUID (as opposed to outerWindowID) to
    <xul:browser>’s permanentKey.

(2)

    It’s not clear to me what the use case is for falling back to the
    chrome window handle when a browser’s outerWindowID is null.  As
    you will never be able to call the switchToWindow command with a
    chrome window handle, this seems like a complication we can remove.

(3)

    To ensure that the tab restoration code has time to complete before
    the getWindowHandles command is executed, we need to make the
    newSession command delay the response.

(4)

    Marionette currently relies on outerWindowID to track browsers,
    which are lost when a remoteness change occurs.  We ought to use
    permanentKey.

Some of these points have overlapping solutions, so we need to figure
out the right dependency chain to resolve them.  I believe it would
be beneficial to try find an easy fix for this particular bug first
(waiting for Browser:Init?) and then trying to solve the “reliable and
stateless window tracking” and “use permanentKey” together.
Flags: needinfo?(ato)
Blocks: 1366281
As discussed with Andreas, we will use this bug to only fix the session startup issue. A more complete refactoring has to happen in a follow-up bug. This is so we can shorten the time until the regression is fixed.

Now that I'm back I will start to work on this issue.
Here the call stack for the newSession command:

https://dxr.mozilla.org/mozilla-central/source/testing/marionette/driver.js

>|- driver.newSession() [async(generator)]
>  |- driver.newSession().waitForWindow() [async]
>  | |- driver.startBrowser() [sync]
>  |   |- driver.addBrowser() [sync]
>  |   |- browser.startSession() [sync]
>  |     |- driver.whenBrowserStarted() [sync]
>  |- yield driver.registerPromise()
>    |- driver.registerBrowser() [sync]
>  |- yield driver.browserListening

For me `registerBrowser` would be the place where we have to add the logic for waiting of a valid content browser. It means the following conditions have to be met:

* The currently selected (not necessarily focused) tab is valid
* The content browser has been finished loading, and offers a valid outer window id

Andreas, would you mind to give a quick feedback for this approach? I think it's the minimalist change we could do to actually support the new behavior.
Flags: needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #14)

> Here the call stack for the newSession command: […]

Thanks for digging into this and documenting this!  I know how painful
it is to understand this part of our code base.

> For me `registerBrowser` would be the place where we have to add the
> logic for waiting of a valid content browser. It means the following
> conditions have to be met:
>
> * The currently selected (not necessarily focused) tab is valid

When the browser is starting we shouldn’t have to make a distinction
between a selected, but not necessarily focussed, tab.  This is a
WebDriver concept that doesn’t become relevant until a session has
been established.

I also think the term ‘valid’ above deserves a more detailed
explanation.

> * The content browser has been finished loading, and offers a valid
> outer window id

And a ‘valid outerWindowID’ in this case would be not-null, right?
What will the practical implementation of this look like?  Using
wait.until?

> Andreas, would you mind to give a quick feedback for this approach? I
> think it's the minimalist change we could do to actually support the
> new behavior.

Changing registerBrowser makes sense to me.  That means the promise
won’t resolve until all the conditions have been met.
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #15)
> > For me `registerBrowser` would be the place where we have to add the
> > logic for waiting of a valid content browser. It means the following
> > conditions have to be met:
> >
> > * The currently selected (not necessarily focused) tab is valid
> 
> When the browser is starting we shouldn’t have to make a distinction
> between a selected, but not necessarily focussed, tab.  This is a
> WebDriver concept that doesn’t become relevant until a session has
> been established.

I'm not that sure. Which tab Marionette selects when Firefox starts with multiple tabs open? If it's the first tab, there can be another tab in focus. So the first tab doesn't have the focus. This can perfectly happen in sessionrestore scenarios. 

> I also think the term ‘valid’ above deserves a more detailed
> explanation.

Just that it still exists.

> > * The content browser has been finished loading, and offers a valid
> > outer window id
> 
> And a ‘valid outerWindowID’ in this case would be not-null, right?
> What will the practical implementation of this look like?  Using
> wait.until?

Correct. We would need a timer instance, or we could try to listen to `Browser:Init` if possible, and resolve it via a Promise.
As seen in earlier comments we inappropriately add chrome window handles to `getWindowHandles`. I filed bug 1368965 for that and want to get this fixed first.
Depends on: 1368965
(In reply to Henrik Skupin (:whimboo) from comment #16)

> (In reply to Andreas Tolfsen ‹:ato› from comment #15)
> 
> > > For me `registerBrowser` would be the place where we have to add
> > > the logic for waiting of a valid content browser. It means the
> > > following conditions have to be met:
> > > 
> > > * The currently selected (not necessarily focused) tab is valid
> > 
> > When the browser is starting we shouldn’t have to make a
> > distinction between a selected, but not necessarily focussed, tab.
> > This is a WebDriver concept that doesn’t become relevant until a
> > session has been established.
> 
> I'm not that sure. Which tab Marionette selects when Firefox starts
> with multiple tabs open? If it's the first tab, there can be another
> tab in focus. So the first tab doesn't have the focus. This can
> perfectly happen in sessionrestore scenarios.

Fair point, and we should probably make a decision on this.  It probably
makes the most sense to say that the current top-level browsing context
is the selected browser (tab) in the current ChromeWindow.

> > And a ‘valid outerWindowID’ in this case would be not-null,
> > right?  What will the practical implementation of this look like?
> > Using wait.until?
> 
> Correct. We would need a timer instance, or we could try to listen to
> `Browser:Init` if possible, and resolve it via a Promise.

As part of my window tracking patch, we will have one central component
that tracks the state of appearing and disappearing content browsers.
We have discussed this in the past, but I get a sense it would be a
good thing if it was able to emit Marionette-only custom events such as
"Marionette:BrowserOpen", "Marionette:BrowserClose", &c.  Would this be
useful to you in the future?  (Not saying it’s a dependency to fix this.)
(In reply to Andreas Tolfsen ‹:ato› from comment #18)
> Fair point, and we should probably make a decision on this.  It probably
> makes the most sense to say that the current top-level browsing context
> is the selected browser (tab) in the current ChromeWindow.

Yes, and that is what I already used for my patch on bug 1368492.

> > Correct. We would need a timer instance, or we could try to listen to
> > `Browser:Init` if possible, and resolve it via a Promise.
> 
> As part of my window tracking patch, we will have one central component
> that tracks the state of appearing and disappearing content browsers.
> We have discussed this in the past, but I get a sense it would be a
> good thing if it was able to emit Marionette-only custom events such as
> "Marionette:BrowserOpen", "Marionette:BrowserClose", &c.  Would this be
> useful to you in the future?  (Not saying it’s a dependency to fix this.)

I think it would depend on how many modules would use those messages. I could see proxy.js for the currently running command, and the code which keeps the list of browsers up to date. So not sure right now.
I assume that with the additional delay after startup (bug 1369700) which defers the initialization of Marionette further, this issue might be less prominent, or even not visible anymore.
(In reply to Henrik Skupin (:whimboo) from comment #20)
> I assume that with the additional delay after startup (bug 1369700) which
> defers the initialization of Marionette further, this issue might be less
> prominent, or even not visible anymore.

Yes, I thought the same thing when I talked to florian about it.
No longer blocks: 1332122
The offending patch which caused Marionette to fail has been partly reverted for Firefox 55.0. But the tests which have been disabled never got re-enabled. :(

I will have a look if the tests can be re-enabled without having to fix this bug first.
Try looks fine beside a perma failure (probably a crash) in TestQuitRestart.test_in_app_restart_with_callback_no_shutdown. I will file a separate bug to get this test re-enabled.
Bug 1389103 is taking care of the unskip now.
No longer blocks: 1362502, 1362799, 1362917, 1363087, 1366281
Priority: -- → P3
Please also see bug 1387337 which we might want to fix first, because that's clearly reproducible.
Depends on: 1387337
Priority: P3 → P1
I _think_ this should get resolved with https://bugzilla.mozilla.org/show_bug.cgi?id=1311041.

Reducing priority since it is not being actively worked on.
OS: Unspecified → All
Priority: P1 → P2
Hardware: Unspecified → All
Version: 55 Branch → Trunk
Assignee: hskupin → nobody
Status: ASSIGNED → NEW

The problem here is most likely located in registerBrowser (driver.js). As I have noticed lately in some cases it can take a while until the outerWindowId has been made available to the parent process. As such registerBrowser would have to wait for it to become available. But that isn't the case at the moment. It calls Services.wm.getOuterWindowWithId() only once, which will return null, and then sets some invalid data to this.wins:

https://searchfox.org/mozilla-central/rev/b4ebbe90ae4d0468fe6232bb4ce90699738c8125/testing/marionette/driver.js#523-539

It might be possible to verify that by delaying the Browser:Init message here:
https://searchfox.org/mozilla-central/rev/b4ebbe90ae4d0468fe6232bb4ce90699738c8125/toolkit/content/browser-child.js#33

The best strategy would be to have an observer for ChromeWindows
and its browsing contexts. I know nika is working on uplifting
events related to browsing contexts being created from product
specific code (i.e. Firefox’ Browser:Init et al.) into the platform.

This doesn't seem to be no longer an issue since we switched to the JSWindowActor implementation. Closing as WFM.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.