Closed Bug 1323701 Opened 7 years ago Closed 2 years ago

Crashed tabs in Firefox 50 after "WebDriverException: Message: Element reference not seen before"

Categories

(Remote Protocol :: Marionette, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: agibson, Unassigned)

References

Details

Attachments

(5 files)

I have several tests that are crashing tabs in Firefox 50, using the latest docker-selenium node-firefox release [1]. Note the HTML screenshots in the link below (you may need to scroll to the right to see them).

https://ci.us-west.moz.works/job/bedrock_integration_test_runner_agibson/ws/results/index.html

The exception that seems to be thrown is: "WebDriverException: Message: Element reference not seen before". This is happening after clicking on an element that triggers a JS (jQuery) based animation that hides/shows an element.
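For reference, the failing interaction boils down to roughly this pattern (a simplified sketch only; the locators here are placeholders, the real tests live in the bedrock repo):

# Simplified sketch of the failing interaction; locators are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://bedrock:8000/en-US/contribute/stories/shreyas/')

# Clicking the toggle starts a jQuery animation that shows a hidden section.
driver.find_element(By.CSS_SELECTOR, '.story-toggle').click()

# Waiting on the animated element is where the
# "Element reference not seen before" exception surfaces.
story = driver.find_element(By.CLASS_NAME, 'story-more')
WebDriverWait(driver, 10).until(
    lambda d: story.get_attribute('aria-hidden') == 'false')

driver.quit()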

I'm happy to help try and debug.

[1] https://hub.docker.com/r/selenium/node-firefox/
Attached file test-report.zip
I've attached the test report here that isolates the failures in question (the live URL may change as I'm testing further).
Is such a test something we could easily run as a pure Marionette test? It would be great to have a minimized testcase for it.

Andreas, I assume that geckodriver does not handle crashes of Firefox yet? We may need a mozcrash-like cargo package for Rust which checks for minidumps and prints the details.
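For reference, this is roughly what mozcrash gives us on the Python side today (the dump directory and test name below are only examples):

# Rough illustration of what mozcrash does on the Python side today;
# the dump directory path and test name are only examples.
import mozcrash

if mozcrash.check_for_crashes('/tmp/profile/minidumps',
                              symbols_path=None,
                              test_name='test_stories'):
    print('Firefox crashed during the test run')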
Flags: needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Is such a test something we could easily run as a pure Marionette test? It
> would be great to have a minimized testcase for it.

This is a little tricky, as right now I can't reproduce the failures on my local machine (OSX) running the exact same versions of Firefox / Selenium, but they occur 100% of the time in docker-selenium in our Jenkins job. I believe these failures are another manifestation of bug 1307119, which went away for me until I tried running the tests in our pipeline (could it be something e10s-related?)

I'll try and work on something more minimal, but in the meantime one of the tests in question can be found here:

test: https://github.com/mozilla/bedrock/blob/master/tests/functional/contribute/test_stories.py#L12
page object: https://github.com/mozilla/bedrock/blob/master/tests/pages/contribute/stories.py#L19
Maybe try to run with e10s disabled and check how it works. Sadly I don't have the time to convert your example to a Marionette testcase.
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Maybe try to run with e10s disabled and check how it works. Sadly I don't
> have the time to convert your example to a Marionette testcase.

I'll see what I can work out, thanks. Unfortunately this is blocking our test suite from upgrading to Selenium 3 :(
Attached file stories.py
Attached a simpler reduced test case. As previously mentioned, I can't reproduce the failure locally (even with this reduced test case) - but hopefully this will help ease investigation.
Attachment #8818887 - Attachment mime type: text/x-python-script → text/plain
Which platforms are affected by docker-selenium? If it's Linux, which distributions do they use?
(In reply to Henrik Skupin (:whimboo) from comment #7)
> Which platforms are affected by docker-selenium? If it's Linux which
> distributions do they have?

Here is the DockerFile: https://github.com/SeleniumHQ/docker-selenium/blob/master/NodeFirefox/Dockerfile
So it's Ubuntu 16.04:
https://github.com/SeleniumHQ/docker-selenium/blob/master/Base/Dockerfile

For which action does the crash actually happen? The initial page load or after clicking the toggle button? I would assume the latter, given that it waits for the aria-hidden attribute.

How many iterations do you have to run to see the crash? I assume it's not always happening given that there is a loop in the minimized testcase.
(In reply to Henrik Skupin (:whimboo) from comment #9)
> So it's Ubuntu 16.04:
> https://github.com/SeleniumHQ/docker-selenium/blob/master/Base/Dockerfile
> 
> For which action does the crash actually happen? The initial page load or
> after clicking the toggle button?  I would assume for the latter given that
> waits for the aria-hidden attribute.
> 
> How many iterations do you have to run to see the crash? I assume it's not
> always happening given that there is a loop in the minimized testcase.

The crash seems to be happening after clicking the toggle button. It happens 100% of the time in our Jenkins jobs (same for the other 6 tests).

I added a loop in the minimized test case to try and reproduce it more easily (sadly to no avail).
Ok, there is one thing you could try, if possible. My fix on bug 1299216 will improve crash handling for content crashes and has landed in Firefox 52. Could you take this version of Firefox and run the tests against it? It should give us a clear report. Otherwise please really try to run Firefox in non-e10s mode, and check if the crash happens there too. Thanks.
(In reply to Henrik Skupin (:whimboo) from comment #11)
> Ok, so is there one thing you can try if it is possible? My fix on bug
> 1299216 will improve crash handling for content crashes. This has been
> landed on Firefox 52. Could you take this version of Firefox and run the
> tests against? It should give us a clear report. Otherwise please really try
> to run Firefox in non-e10s mode, and check if the crash happens there too.
> Thanks.

Ok, so it looks like e10s may well be the underlying issue.

I added the following preference settings to the test run:

firefox_profile.set_preference('browser.tabs.remote.autostart', False)
firefox_profile.set_preference('browser.tabs.remote.autostart.2', False)
firefox_profile.set_preference('browser.tabs.remote.force-enable', False)

I then ran the test suite three times and all tests passed as expected for each run. You can see the results of the last run here: https://ci.us-west.moz.works/job/bedrock_integration_test_runner_agibson/52/console
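For completeness, a sketch of how prefs like these plug into a Selenium 3 remote session (the hub URL is an assumption; this is not our exact test harness code):

# Sketch of wiring the prefs into a Selenium 3 remote session;
# the hub URL below is an assumption.
from selenium import webdriver

firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('browser.tabs.remote.autostart', False)
firefox_profile.set_preference('browser.tabs.remote.autostart.2', False)
firefox_profile.set_preference('browser.tabs.remote.force-enable', False)

driver = webdriver.Remote(
    command_executor='http://selenium-hub:4444/wd/hub',
    desired_capabilities=webdriver.DesiredCapabilities.FIREFOX,
    browser_profile=firefox_profile)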
Ok, do you have a chance to run Firefox 52 (currently DevEdition) with e10s enabled? I would be interested to see which crash this actually is.
(In reply to Henrik Skupin (:whimboo) from comment #13)
> Ok, do you have a chance to run Firefox 52 (currently DevEdition) with e10s
> enabled? I would be interested to see which crash this actually is.

This isn't so straightforward, as I don't believe dev-edition is available on docker-hub just yet.
So I cannot help further, sorry.

If there is VNC access available, you might want to log in, open about:crashes in Firefox after the content tab crashed, and submit such a crash report. I would need the ID or a link to crash stats.
(In reply to Henrik Skupin (:whimboo) from comment #15)
> So I cannot further help, sorry.
> 
> If there is VNC access available, you might want to login, open
> about:crashes in Firefox after the content tab crashed, and submit such a
> crash report. I would need the ID or link to crash stats.

I'll see what I can do
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Andreas, I assume that geckodriver does not handle crashes of Firefox yet?
> We may need a mozcrash like cargo package for Rust which checks for
> minidumps and print the details.

Sorry, I don’t know the answer to that.
Flags: needinfo?(ato)
It would be good to see a trace-level verbosity log from geckodriver.  See the geckodriver README for how to raise the log level.
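With the Selenium Python client, for example, that roughly means passing a log capability like this (a sketch; the README is authoritative):

# Sketch: raise geckodriver's log level to "trace" via moz:firefoxOptions
# using the Selenium Python client; see the geckodriver README for details.
from selenium import webdriver

caps = webdriver.DesiredCapabilities.FIREFOX.copy()
caps['moz:firefoxOptions'] = {'log': {'level': 'trace'}}
driver = webdriver.Firefox(capabilities=caps)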
Attached file geckodriver.log
Attached geckodriver.log file from the test run as requested.

You can search for "get: http://bedrock:8000/en-US/contribute/stories/shreyas/" to see where the specific test hits the exception.
(In reply to Andreas Tolfsen ‹:ato› from comment #18)
> It would be good to see a trace-level verbosity log from geckodriver.  See
> the geckodriver README for how to raise the log level.

Sorry this took me a while to get back to - please see the attachment in Comment 19. I hope this provides some clue as to the problem.
Flags: needinfo?(ato)
Attachment #8832459 - Attachment mime type: text/x-log → text/plain
The log is not that helpful actually because of the missing crash information. We just see:

> [Parent 409] WARNING: pipe error (55): Connection reset by peer: file /builds/slave/m-rel-l64-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 320
> ###!!! [Parent][MessageChannel] Error: (msgtype=0x2E007D,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

It means the content process died and we lost the connection. A useful crash report is still what I need here.
I somewhat suspect that the original issue reported in comment #0 is to do with remoteness changes throwing the known elements table away.  When a remoteness change occurs (when Firefox is running with e10s), a <xul:browser>’s outerWindowID changes which affects the lookup in the table of known elements that you have seen before.
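As a toy illustration (not the actual Marionette code): think of the known elements as a map keyed by the outer window ID; after a remoteness change the ID is new, so references handed out earlier no longer resolve:

# Toy model only, not Marionette's implementation: element references
# stored per outer window ID.
known_elements = {}  # {outer_window_id: {element_uuid: element}}

def register(outer_window_id, uuid, element):
    known_elements.setdefault(outer_window_id, {})[uuid] = element

def lookup(outer_window_id, uuid):
    try:
        return known_elements[outer_window_id][uuid]
    except KeyError:
        raise Exception("Element reference not seen before")

register(42, "abc-123", object())
lookup(42, "abc-123")   # resolves fine
lookup(43, "abc-123")   # after a remoteness change the ID differs -> error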
Flags: needinfo?(ato)
Just a note that I haven't forgotten about this bug. Unfortunately the situation seems to have gone from bad to worse. Test runs using Firefox 52 and the latest build of docker-selenium seem to never finish. There are no errors reported in the log - jobs just stall and never finish :(

Hopefully I can try and make some sense of what this latest issue is (it may or may not be related to this bug).
(In reply to Alex Gibson [:agibson] from comment #23)
> Just a note that I haven't forgotten about this bug. Unfortunately the
> situation seems to have has gone from bad to worse. Test runs using Firefox
> 52 and the latest build of docker-selenium seem to never finish. There are
> no errors reported in the log - jobs just stall and never finish :( 
> 
> Hopefully I can try and make some sense of what thins latest issue is (it
> may or may not be related to this bug).

OK, if I skip the test that is causing the original issue in this bug, the test run passes and nothing hangs.

I really don't know what else I can do to try and get some action on this bug, as the latest version of Firefox just hangs indefinitely and does not report any errors in the logs.
Can you please create a new trace level log with the current versions of the application and tools? Also please check that you have the latest versions installed.
(In reply to Henrik Skupin (:whimboo) from comment #25)
> Can you please create a new trace level log with the current versions of
> application and tools? Also please check that you have the latest installed.

How do I do this?
Do you mean another geckodriver log? (other than the one provided in Comment 19)
Yes, and in comment 18 Andreas suggested how to do it. Here is a quick link for you: https://github.com/mozilla/geckodriver/#log-object

When you provide the log please also give the name of tests which are causing the browser to hang/crash. It would help us to find the appropriate parts in the log.
Attached file dd580d6ca855.log
Attached geckodriver log from node

You can search for "get: http://bedrock:8000/en-US/contribute/stories/shreyas/" to find the test in question.
Attachment #8861422 - Attachment mime type: text/x-log → text/plain
I see an IPC error in that test:

> ###!!! [Parent][MessageChannel] Error: (msgtype=0x2C0085,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

And this indicates the hang of the tab, and the crash afterward because the tab gets killed by the watchdog timer. The hang actually happens with the following call:

> [find element: By.className: story-more])

It's unclear to me what could actually cause it. It might help to recreate the attached testcase as a plain Marionette test.
I already tried to produce a reduced test case back in Comment 6. I can't, however, recreate the crash on my local machine (macOS), as it seems to happen only in our automation.
Attached file marionette test
So I created a test which uses Marionette directly, and should do everything that the original test is doing.
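In outline, such a Marionette-based test looks roughly like this (a sketch, not the attachment itself; selectors and the wait condition are assumptions carried over from the Selenium version):

# Sketch of a Marionette-based equivalent (not the attachment itself);
# selectors and the wait condition are assumptions.
from marionette_driver import By, Wait
from marionette_driver.marionette import Marionette

client = Marionette(host='127.0.0.1', port=2828)
client.start_session()
client.navigate('http://bedrock:8000/en-US/contribute/stories/shreyas/')

# Click the toggle that starts the jQuery animation, then look the animated
# element up again -- the step where the content process appears to die.
client.find_element(By.CSS_SELECTOR, '.story-toggle').click()
Wait(client, timeout=10).until(
    lambda m: m.find_element(By.CLASS_NAME, 'story-more')
               .get_attribute('aria-hidden') == 'false')

client.delete_session()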

Alex, maybe you can run this yourself by using a checkout of mozilla-central and the version of Firefox? Or don't you have any kind of access to those boxes? You didn't answer my VNC question from 4 months ago yet.
(In reply to Henrik Skupin (:whimboo) from comment #32)
> Created attachment 8861849 [details]
> marionette test
> 
> So I created a test which utilizes Marionette directly, and should do
> everything what the original test is doing.
> 
> Alex, maybe you can run this yourself by using a checkout of mozilla-central
> and the version of Firefox? Or don't you have any kind of access to those
> boxes? You didn't answer my VNC question from 4 months ago yet.

I don't have VNC access currently - sorry this is difficult as I have limited access to these boxes. Let me see if I can try and get a crash report somehow.
With the help of :jgmize we managed to VNC into each node and grab crash reports.

This is the crash report from the test in question:

https://crash-stats.mozilla.com/report/index/e6ec26c9-c7f5-46e8-b420-bd8061170505

Two more nodes had also crashed on two different tests, with what looks like the same issue:

https://crash-stats.mozilla.com/report/index/2f25740b-da39-4e08-b24d-465131170505
https://crash-stats.mozilla.com/report/index/3d0f562e-f978-40fa-a07b-f58ce1170505
Flags: needinfo?(hskupin)
Yes, it's all the same reason why Firefox is crashing for you here. This crash is covered by bug 1338771, and you would have to wait for a fix or apply the workaround. Maybe you should mention on that bug that you have a fairly good reproduction rate.

In general it's just sad that we do not make this info available for users of geckodriver. It would have helped a lot over the past months to at least know that there was a crash.
Depends on: 1338771
Flags: needinfo?(hskupin)
I can confirm that the workaround here [1] seems to solve the crashes for us!

[1] https://github.com/SeleniumHQ/docker-selenium/pull/485
Priority: -- → P2
Priority: P2 → P5

(In reply to Alex Gibson [:agibson] from comment #36)
> I can confirm that the workaround here [1] seems to solve the crashes for us!
> 
> [1] https://github.com/SeleniumHQ/docker-selenium/pull/485

Hi Alex, it's been a while, but it would be great to hear if you can still reproduce this failure with a recent Firefox release, especially since bug 1338771 was marked as WFM. If you still can, we should reopen that bug. Thanks!

Flags: needinfo?(agibson)

I haven't reproduced this bug for a long time, so I think it's safe to assume this is fixed. Thanks

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(agibson)
Resolution: --- → WORKSFORME
Product: Testing → Remote Protocol