Open Bug 1430064 Opened 6 years ago Updated 2 months ago

Firefox does not exit when geckodriver is terminated

Categories

(Testing :: geckodriver, defect, P3)

Version 3
defect

Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned)

References

()

Details

Attachments

(5 files)

In case a Selenium test doesn't explicitly call "driver.quit()" at the end of the test, geckodriver will be terminated by Selenium. Sadly this results in a situation that Firefox continues to run, even without geckodriver running. It also means that the profile geckodriver own continues to be around.

As Andreas mentioned on IRC this could be because of issues with signal handling in Rust.
Selenium advice has always been to call `quit()` at the end of a test. If people are doing this then they are getting themselves into weird edge cases that go against all advice.

Dropping this to P5 and if a contributor wants to fix it then great
Priority: -- → P5
I would be interested in helping to get this resolved.  There are certain cases where disaster prevents an application from properly calling "quit()", like a failure in Selenium Grid.  In such a failure, you might be left with tens of browsers to cleanup.

There are two crates I've earmarked that can handle interrupt or term signals: chan-signal and ctrlc.  Ctrlc provides handling for interrupt and term, with the latter being a configurable option in Cargo.toml.  With ctrlc, all signals must be handled with the same handler, and the process is rather opaque.

Chan-signal provides finer-grained control of signal handling, with the option for separate signal handlers for separate signals, but currently only supports Unix.  It looks like Windows support is in the scope, but there's been no forward development towards it.  It's also based on the chan crate, which provides a more transparent mechanism for handling the signals.

Chan-signal seems like the better option moving forward, but the lack of current Windows support is a miss.  I would be interested in getting feedback on the decision before I move forward with a test patch.
I think Andreas can give some helpful feedback here given that he AFAIR had a quick look at signal handling lately. Maybe there is even another bug open.
Flags: needinfo?(ato)
We ran into another case where signal handling is necessary when
discussing the patch for preserving minidump files when Firefox
crashes.  jgraham has summarised the discussion in
https://bugzilla.mozilla.org/show_bug.cgi?id=1433495#c4.  It is
also my impression that chan-signal is the right path forward, even
if it currently does not give us Windows.

Greg: Can you actually confirm that handling SIGINT and stopping
Firefox solves this problem?  whimboo recently recently changed
geckodriver to more gracefully deal with stopping Firefox.
Flags: needinfo?(ato) → needinfo?(gsfraley)
Summary: Geckodriver doesn't close Firefox if getting terminated → Firefox does not exit when geckodriver is terminated
Sure! I'll get a first-pass patch going to see what impact it has on the browser being left over.  I'll also reach out to the chan-signal dev to see how close Windows support is on the horizon.
Flags: needinfo?(gsfraley)
Even the simplest examples of chan-signal don't seem to work when used in conjunction with hyper, like the following modification of main.rs in GeckoDriver:

```
let listening = webdriver::server::start(addr, handler, &extension_routes()[..])
    .map_err(|err| (ExitCode::Unavailable, err.to_string()))?;
info!("Listening on {}", listening.socket);

let signal_channel = chan_signal::notify(&[Signal::INT, Signal::TERM]);
let signal = signal_channel.recv().unwrap();
println!("Received signal {:?}", signal);
```

Sending a Ctrl+C or TERM doesn't result in the received-signal-message being printed out.  Spinning up a Hyper 0.10 server on a test project results in the same behavior.

I'll check into the ctrlc crate next, but the chan-signal crate has me scratching my in regards to why it won't work.
So one idea I just had is to handle the signal, get Firefox to shut down and then just call `std::process::exit()` to exit without worrying about terminating the hyper event loop.
OK, so I just pushed my bits to mozreview and would appreciate feedback (I marked r=ato, but it's more like f?ato). I didn't extensively test this, but it seems to work at least  a little. Also it's basically just what was in my git stash, so it's not a super-clean patch at this point.
Oh and, Greg, apologies for hijacking this bug; I just happened  to have been looking at a related issue a few weeks ago and so had a nearly working patch. If you can find a way to make this work on Windows, that would be amazing.
No prob!  I've been reading the chan-signal code/Microsoft docs to get an idea of how Windows support would play out.  I don't know what timeframe it would take for me to come up with a solution, but I'll pay attention to this bug as a test point.
Windows support can be added pretty easily to chan-signal.  I'm going to work with the author to get the functionality merged back into his version.  I'd recommend sticking with the crate on crates.io, but if you want to test windows functionality in a pinch, my fork lives at https://github.com/gsfraley/chan-signal.git.
Comment on attachment 8958770 [details]
Bug 1430064 - Use chan_signal to handle signals sent to geckodriver,

https://reviewboard.mozilla.org/r/227680/#review234104

This is a pretty clever patch!  I just have some minor comments and
things you need to fix up.

::: commit-message-d6957:5
(Diff revision 1)
> +Bug 1430064 - Use chan_signal to handle signals sent to geckodriver, r=ato
> +
> +This requires a refactor of the WebDriver main loop so that we are able to send
> +a message from the geckodriver process in. Hyper doesn't allow us to
> +exit it's handler loop cleanly so we take the approach of doing the

its

::: testing/geckodriver/Cargo.toml:13
(Diff revision 1)
> +chan = "0.1.12"
> +chan-signal = "0.3.1"

Let’s depend on chan "0.1" and chan-signal "0.3" so that it is
easier to share the vendored crates with other Rust code in the
tree.

::: testing/geckodriver/Cargo.toml:13
(Diff revision 1)
> +chan = "0.1.12"
> +chan-signal = "0.3.1"

You also need to run "./mach vendor rust" because these are not
currently vendored.

::: testing/geckodriver/src/main.rs:1
(Diff revision 1)
>  extern crate chrono;
> +extern crate chan;
> +extern crate chan_signal;

Surely chan* should come before chrono?

::: testing/geckodriver/src/main.rs:213
(Diff revision 1)
> +    let server =  webdriver::server::WebDriverServer::new();
> +
> +    let builder = thread::Builder::new().name("signal handler".to_string());
> +    let chan = server.send_chan();
> +    builder.spawn(move || {
> +        signal_notify(signal, chan)

Missing semicolon.  I don’t think this is supposed to return anything?

::: testing/geckodriver/src/main.rs:232
(Diff revision 1)
> +{
> +
> +    // Blocks until this process is sent an INT or TERM signal.
> +    signal.recv().unwrap();
> +
> +    info!("Got a signal, quitting");

s/info/debug/

::: testing/webdriver/src/server.rs:64
(Diff revision 1)
>      }
>  
>      fn run(&mut self, msg_chan: Receiver<DispatchMessage<U>>) {
>          loop {
> -            match msg_chan.recv() {
> +            let msg = msg_chan.recv();
> +            info!("Got msg");

Drop this.

::: testing/webdriver/src/server.rs:92
(Diff revision 1)
>                      if resp_chan.send(resp).is_err() {
>                          error!("Sending response to the main thread failed");
>                      };
>                  }
> -                Ok(DispatchMessage::Quit) => break,
> +                Ok(DispatchMessage::Quit) => {
> +                    info!("Got quit");

Drop or lower to debug.

::: testing/webdriver/src/server.rs:259
(Diff revision 1)
> +        let (msg_send, msg_recv) = channel();
> +        WebDriverServer {
> +            send_chan: msg_send,
> +            recv_chan: msg_recv
> +        }

If you name the return values from channel() send_chan and recv_chan,
you can avoid the field assignments when constructing WebDriverServer:

> let (send_chan, recv_chan) = channel();
> WebDriverServer {
>   send_chan,
>   recv_chan,
> }
Attachment #8958770 - Flags: review?(ato) → review-
Comment on attachment 8960180 [details]
Bug 1430064 - Vendor chan_signal crate,

https://reviewboard.mozilla.org/r/228948/#review234652
Attachment #8960180 - Flags: review?(ato) → review+
Comment on attachment 8958770 [details]
Bug 1430064 - Use chan_signal to handle signals sent to geckodriver,

https://reviewboard.mozilla.org/r/227680/#review234654
Attachment #8958770 - Flags: review?(ato) → review+
Flags: needinfo?(ato)
Blocks: 1491288
Flags: needinfo?(ato)
Flags: needinfo?(ato)
Flags: needinfo?(ato)
Priority: P5 → P2
Resetting ni? flag because it was unintentionally removed.
Flags: needinfo?(ato)
I’m going to pick this up and rebase jgraham’s patch.  We now use
a hyper version which _does_ support shutting down.
Assignee: nobody → ato
Status: NEW → ASSIGNED
Flags: needinfo?(ato)
Blocks: 1495062
No longer blocks: 1491288
Blocks: 1433495

(In reply to Andreas Tolfsen ⦗:ato⦘ from comment #20)

I’m going to pick this up and rebase jgraham’s patch. We now use
a hyper version which does support shutting down.

Looks like you missed to follow-up here. Will you be able to do it soon? Otherwise we should move this bug to the next 0.25 release (bug 1520585).

Flags: needinfo?(ato)

No if you want to do it, please go ahead!

Assignee: ato → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(ato)
Priority: P2 → P3
Blocks: 1520585
No longer blocks: 1495062
No longer blocks: 1520585
No longer blocks: 1573798
Blocks: 1584911

(In reply to Andreas Tolfsen ⦗:ato⦘ from comment #20)

I’m going to pick this up and rebase jgraham’s patch. We now use
a hyper version which does support shutting down.

Hello. I'm interested in helping resolve this issue. I have a Python script that uses Selenium+geckodriver+Firefox, and I'd like to gracefully handle ctrl+c events, but I run into this issue where the Firefox windows don't close. Is the current status still that the attached patch needs to be rebased?

James, could you please help Daniel given that were your patches?

Flags: needinfo?(james)

chan-signal/chan have since been deprecated in favour of
signal-hook/crossbeam-channel, so the attached patches can’t
simply be rebased.

The webdriver crate has since also moved to use a Tokio-based
hyper and no longer does its own HTTP request routing but instead
relies on warp.

The new APIs are however strikingly similar and the same fundamental
approach should still be applicable. You will need to spin up a
thread in geckodriver that blocks on receiving the signal and hook
that up to the webdriver::server::Dispatcher thread and warp
in some way using a channel so they know to exit when the interrupt
arrives.

Unfortunately I don’t know the warp API very well, but hyper
can be stopped with hyper::server::Server::with_graceful_shutdown().

Thanks! I'll start working on it.

Flags: needinfo?(james)

Daniel, do you have any further questions? Otherwise do you have an updated status?

Flags: needinfo?(dhertenstein)

No further questions at the moment. I have not had a significant amount of time to work on this yet due to my other responsibilities, but I am committed to fixing this bug. I, unexpectedly, will be on medical leave for 3-6 weeks starting on the 10th, so I may not have a patch before January. If someone else comes along and wants to fix it while I am out, they are welcome to, but, as I said, I am committed to fixing this since it blocks some of my own work and it is an interesting problem to me.

Flags: needinfo?(dhertenstein)

Good to hear, and until then best luck for you. I don't think that during that time anyone else would need it. As such the bug should be waiting for you. :)

Hello, I have hit a wall on this. I am able to spin up a thread that handles the signal and send a message to the webdriver::server::Dispatcher thread, but I have yet to find a way to shutdown the warp server that compiles. The compilation fails saying that I cannot await a future because we're not using Rust 2018. Is that true or have I made a simple mistake somewhere? I have a patch file (along with one for the vendoring of signal-hook) that I can attach if you'd like to look at where I am.

The compilation fails saying that I cannot await a future because we're not using Rust 2018

This is unfortunately true. I don't know how hard it would be to update to the new edition. I"m guessing not that hard.

I have a patch file (along with one for the vendoring of signal-hook) that I can attach if you'd like to look at where I am.

Please! Or push to phabricator if you are able to do that.

This is my first time working in moz-central, so I don't know how to push to phabricator yet. I'll attach the patches now while I read up on how to do that.

And perhaps I am doing the attachment part wrong too because the vendoring patch is too big.

I don't know if the situation has improved recently but the docs for this were previously horrible. The tl;dr version is:

  • Ensure you have Python 3 and pip 3 installed
  • pip3 install mozphab (with --user if you don't want a system-wide install)
  • moz-phab submit <commits>

Don't worry about the vendoring patch.

Our code should compile for 2018. I fixed all what was broken via bug 1508670 a long time ago. Maybe something new sneaked in. Just test it out.

Depends on: 1613975

So features requiring Rust 2018 should now work on central. The patch you attached looks like full files rather than a diff; lets try to make the Phabricator route work.

Changing the edition did work though it uncovered new compilation errors. I have fixed those and am running the webdriver tests now. Then I'll test my script that reproduces the original problem, and see if I've actually got everything working. If I do, I'll push to Phabricator.

Assignee: nobody → dhertenstein
Status: NEW → ASSIGNED

Okay. I have something that I think works. In my repro script, the Firefox window now closes when it didn't before. However, I cannot tell if the warp-driven webdriver server thread is properly joining. If I put a print at the end of the webdriver server thread body, I don't see the output.

James, mind following up on Daniel's question from about 2 weeks ago? Thanks.

Flags: needinfo?(james)

So with a diff like

diff --git a/testing/geckodriver/src/main.rs b/testing/geckodriver/src/main.rs
index cf844bfe2e8ac..bc4eae6a3e4e1 100644
--- a/testing/geckodriver/src/main.rs
+++ b/testing/geckodriver/src/main.rs
@@ -236,7 +236,7 @@ fn inner_main(app: &mut App) -> ProgramResult<()> {
             })?;
         }
     }
-
+    println!("SHUTDOWN inner_main");
     Ok(())
 }
 
diff --git a/testing/webdriver/src/server.rs b/testing/webdriver/src/server.rs
index 2146969ab20fe..63b2981b43199 100644
--- a/testing/webdriver/src/server.rs
+++ b/testing/webdriver/src/server.rs
@@ -183,12 +183,14 @@ where
     let builder = thread::Builder::new().name("webdriver server".to_string());
     let handle = builder.spawn(move || {
         tokio::run(server);
+        println!("SHUTDOWN webdriver server");
     })?;
 
     let builder = thread::Builder::new().name("webdriver dispatcher".to_string());
     builder.spawn(move || {
         let mut dispatcher = Dispatcher::new(handler);
         dispatcher.run(&msg_recv);
+        println!("SHUTDOWN dispatcher");
     })?;
 
     Ok(Listener {

ctrl-c with a Firefox session started shuts down Firefox, and gives the following output:

^CSHUTDOWN webdriver server
SHUTDOWN inner_main
jgraham@flitwick:~/develop/gecko$ Exiting due to channel error.
`
Exiting due to channel error.

So I don't see the SHUTDOWN dispatcher line, and something, either Firefox or geckodriver is producing the Exiting due to channel error. output. So I think maybe something is wrong in the order of shutdown in this case, but basically the patch here works. I'm pretty happy to take it if we solve the excessive CPU usage of the busy loop.

Flags: needinfo?(james)

How does it triggering Firefox to quit? Is it sending the Marionette:Quit message? That is how it should be done because that causes a safe shutdown. Everything else will just be killing the process which is not really wanted.

Hello. I've had to put working on this on hold until a week or two from now to push another project out the door. James has been providing feedback on Phabricator and I do have next steps to solve the last remaining issues. The busy loop can be solved with a sleep in the thread handling the signal, but I would like to see if I can phrase it as a future so it plays along with the event loop. That said, I haven't seen that channel error message. When I get back to this, I'll see if I can reproduce it on my end.

Henrik, the signal handler sends a webdriver::server::DispatchMessage::Quit message to the webdriver dispatcher which causes the dispatcher to call its delete_session() function, which calls the MarionetteHandler's delete_session() function, which passes a WebDriverCommand::DeleteSession message to the MarionetteHandler's handle_command() function, which sends that message to the MarionetteConnection through MarionetteConnection's send_command() function which encodes that message into a Marionette:Quit message. Or at least that's the execution path I traced by reading. I can't confirm that's actually how it works, so it should be confirmed by someone more familiar with the codebase.

Hello again. The current status on this is that shutdown of the dispatcher hangs while waiting for the MarionetteConnection to close (via MarionetteConnection.close() within MarionetteHandler.delete_session(). The close function is an empty function, and I can't figure out what is actually happening with that function.

I was scheduled to go on family leave for a new child next week, but because of COVID-19, those plans have moved up a week, and tomorrow is my last work day for 3 months.

If the source of the hang can be identified, I think (hope) that is the last hurdle. Adding a 1s sleep to the signal handling thread makes CPU usage go back to the level it was before.

Thank you Daniel for the feedback! I haven't looked into that detail myself for several months so hard to say what's going on. Would you mind pushing your latest changes to phabricator so that we could test on our own? Maybe James has some additional feedback here...

Otherwise all best for you and your family. Especially for the upcoming new family member!

Flags: needinfo?(james)

Okay, latest changes are pushed (just the addition of the sleep to address CPU usage). Thanks for the well wishes!

Blocks: 1649094
No longer blocks: 1584911
Assignee: dhertenstein → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(james)

James, would you have the time to check the latest status of the patch? Not sure what's remaining here, and if it's not too much if it could be a potential fix for 0.28.0.

Flags: needinfo?(james)
Assignee: nobody → dhertenstein
Attachment #9126238 - Attachment description: Bug 1430064 - Use signal-hook crate to handle signals in geckodriver, r=whimboo,jgraham → Bug 1430064 - Use signal-hook crate to handle signals in geckodriver,
Status: NEW → ASSIGNED

I rebased/partially rewrote this patch (the webdriver part; the geckodriver part is largely unchanged). I don't see a hang so afaict we can land this subject to review etc.

Flags: needinfo?(james)
Assignee: dhertenstein → james
Priority: P3 → P1
Attachment #9126238 - Attachment description: Bug 1430064 - Use signal-hook crate to handle signals in geckodriver, → Bug 1430064 - Use signal-hook crate to handle signals in geckodriver, r=#webdriver-reviewers
Attachment #9172631 - Attachment description: Bug 1430064 - Update vendored rust packages, → Bug 1430064 - Update vendored rust packages, r=#webdriver-reviewers

So what I didn't notice earlier is that with this patch we only handle SIGINT, which only handles the Ctrl+C case. But we don't setup signal handlers for those cases when the geckodriver process gets killed via SIGTERM and/or SIGKILL. Both signals are limited to non-Windows platforms, but I feel that we should still add them both or only one to handle a graceful shutdown of Firefox when the geckodriver process gets killed.

Also there seems to be a race condition. In most of the cases we are going to quit the session before sendnig the Marionette:Quit command. As such Firefox will still be not closed.

In a working case it looks like:

1601629124353	geckodriver	DEBUG	signal received
1601629124353	webdriver::server	DEBUG	The dispatcher got the Quit message, deleting session
1601629124353	webdriver::server	DEBUG	Deleting session
1601629124353	webdriver::server	DEBUG	shutdown!
1601629124355	Marionette	DEBUG	0 -> [0,3,"Marionette:Quit",{"flags":["eForceQuit"]}]
1601629124355	Marionette	INFO	Stopped listening on port 61295
1601629124377	Marionette	DEBUG	Closed connection 0

And here a failing case:

1601629293053	geckodriver	DEBUG	signal received
1601629293053	webdriver::server	DEBUG	The dispatcher got the Quit message, deleting session
1601629293053	webdriver::server	DEBUG	Deleting session
1601629293054	webdriver::server	DEBUG	shutdown!
1601629293057	Marionette	DEBUG	Closed connection 0
^CDEBUG:urllib3.connectionpool:Resetting dropped connection: localhost

James, what do you think?

Flags: needinfo?(james)

It's too late for 0.28 and as such has to be moved to 0.29.

Blocks: 1668243
No longer blocks: 1649094
No longer blocks: 1668243
Blocks: 1686110
Severity: normal → S3
Priority: P1 → P3
Blocks: 1723202
No longer blocks: 1686110
No longer blocks: 1433495

Too late for the 0.31.0 release and James doesn't work on it at the moment.

Assignee: james → nobody
Blocks: 1750691
No longer blocks: 1723202
Status: ASSIGNED → NEW
Flags: needinfo?(james)

We haven't had the time to continue on this particular issue and most likely we won't be able anytime soon. As such lets drop the bug from any milestone / release tracking.

No longer blocks: 1750691

Note that with bug 336193 fixed Firefox should handle SIGTERM/SIGINT/SIGHUP on all platforms now.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: