Closed Bug 1410366 Opened 7 years ago Closed 7 years ago

Improve socket handling for initial connection

Categories

(Remote Protocol :: Marionette, defect, P2)

57 Branch
defect

Tracking

(firefox-esr52 wontfix, firefox57 fixed, firefox58 fixed)

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- wontfix
firefox57 --- fixed
firefox58 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [17/10])

Attachments

(2 files)

As noticed by Geoff over on bug 1407390, Marionette client creates a new socket object for each of the iterations in the while loop inside of wait_for_port:

https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/testing/marionette/client/marionette_driver/marionette.py#691-710

Doing that very quickly (we have a wait of .5s only) could cause the operating system to run out of resources after a while if existent sockets cannot be closed that quickly.

A better approach here is to create the socket before the while loop, and only try to connect to Marionette server inside the loop. This will ensure that we re-use the same socket for each iteration, and will also lower the amount of time to spend in creating socket objects.
So while trying to get this implemented I actually noticed that this will not work. It's basically that you cannot call `connect` multiple times on the socket in the case no connection could me made. You will clearly have to call `close` each time, but that makes the socket actually invalid. As result a new socket has to be created. 

The following stackoverflow thread sums it up:
https://stackoverflow.com/questions/43643876/python-socket-invalid-argument

Geoff, you proposed this change, maybe you have some other experience with it?


Something else we should do, and what we already do in transport.py would be to call `shutdown(socket.SHUT_RDWR)` before `close`. This would force the socket to immediately shutdown, and not leave it in the TIME_WAIT state..

Also I wonder why we explicitly trying to receive some data, and check for a colon be present. If that should be a handshake, it's very poor. Also what happens if something on the server side does NOT send a message? We wouldn't return here.

I think that the wait_for_port method should return immediately when a connection can be made to the given port. If the server side is the wanted application the command `start_session` will take care of.
Flags: needinfo?(gbrown)
Summary: Fix wait_for_port to not create a socket for each iteration of the while loop → Improve socket handling in wait_for_port
This approach didn't work. The try build is busted for various restart tests. Not sure why receiving data is important, but it might be we have a race somewhere.
(In reply to Henrik Skupin (:whimboo) from comment #1)
> The following stackoverflow thread sums it up:
> https://stackoverflow.com/questions/43643876/python-socket-invalid-argument

I didn't think of that; that's unfortunate.

You could still separate the connect from the recv, like this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9ef99be65888085ac5de074c75fed321f3944831. However, the logs suggest there's little point to that: In all the ones I checked, the first recv() attempt succeeds.
Flags: needinfo?(gbrown)
Bah, this is a kinda nasty race condition which I hit with my latest revision of the patch and would explain some of the hangs we currently see after at least a restart of Firefox. So what's going on...

Basically this is a very bad race condition and manifests in these lines of code:
https://dxr.mozilla.org/mozilla-central/source/testing/marionette/server.js#397-401

When we do a restart the `_acceptConnections` flag is getting set to false, but hey there is no real value in doing that as long as the server socket automatically accepts connections. There is only the `onSocketAccepted` listener available but no other one to simply block new connection requests.

So that means that calling `wait_for_port` during an ongoing restart of Firefox could potentially result in a very short-lived connection. By removing the requirement of receiving a ":" it just continues to `start_session`. Here a new connection also gets accepted and raises after 2s for waiting of the hello string. Then the `do_process_check` decorator kicks in, and we wait 120s for Firefox to shutdown. This actually will not happen because Firefox got restarted, the process id updated, and now runs forever.

As said above this all is fine for restarts, but I'm not sure why we also see that for fresh starts. For now I would say lets fix the known issue, and if it is still present afterward for initial startup, lets further investigate.

To actually solve this problem we will have to stop the server socket from actively listening for new connections. This can be done by calling `close()`, which importantly does not affect already accepted connections! See MDN for details:

https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsIServerSocket#close()

It means this is a regression from bug 1309556 which landed in Firefox 52!
Blocks: 1401498
Attachment #8921035 - Flags: review?(ato)
Attachment #8920560 - Flags: review?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #5)

> To actually solve this problem we will have to stop the server
> socket from actively listening for new connections. This can be
> done by calling `close()`, which importantly does not affect
> already accepted connections!  See MDN for details:

This means a client won’t be able to instruct the server to start
accepting connections again through Marionette:AcceptConnections
unless it is done from an existing socket, and if that loses
connection, Firefox is left in an unrecoverable state?
Flags: needinfo?(hskupin)
(In reply to Andreas Tolfsen ‹:ato› from comment #10)
> This means a client won’t be able to instruct the server to start
> accepting connections again through Marionette:AcceptConnections
> unless it is done from an existing socket, and if that loses
> connection, Firefox is left in an unrecoverable state?

Nothing changes here for the client. With or without my patch another connection attempt will not be successful. The current way we handle that is only busted, because the client would see a successful temporary connect, but immediately afterward the connection will be dropped. With my changes we will not temporarily accept a connection, but generally deny it. See the code in `onSocketAccepted`:

https://dxr.mozilla.org/mozilla-central/rev/69e24c678a28dc46a4c1bda3ff04b2f6106ff71a/testing/marionette/server.js#397
Flags: needinfo?(hskupin)
(In reply to Henrik Skupin (:whimboo) from comment #11)

> (In reply to Andreas Tolfsen ‹:ato› from comment #10)
> 
> > This means a client won’t be able to instruct the
> > server to start accepting connections again through
> > Marionette:AcceptConnections unless it is done from an existing
> > socket, and if that loses connection, Firefox is left in an
> > unrecoverable state?
> 
> Nothing changes here for the client. With or without my patch
> another connection attempt will not be successful.

Yes, right, that’s true.

> The current way we handle that is only busted, because the
> client would see a successful temporary connect, but immediately
> afterward the connection will be dropped.

Well it was designed like that under the expectation that the client
would understand and disconnect, but quite possibly there is not
enough context for it to understand it needs to do that because no
response would be sent, hence why it looked for the ":" (colon)
character in the first 16 bytes.

> With my changes we will not temporarily accept a connection, but
> generally deny it. See the code in `onSocketAccepted`:

I _think_ that makes sense.  An alternative approach would be for
onSocketAccepted to immediately disconnect the client, but I guess
this could cause problems with wait_for_port.
(In reply to Andreas Tolfsen ‹:ato› from comment #12)
> I _think_ that makes sense.  An alternative approach would be for
> onSocketAccepted to immediately disconnect the client, but I guess
> this could cause problems with wait_for_port.

Yes, and that would be still racy. `start_session` is more affected, given that it tries to wait for the hello string afterward, which will never be received.
Comment on attachment 8921035 [details]
Bug 1410366 - Stop socket server from listening for new connections if told so.

https://reviewboard.mozilla.org/r/191988/#review197228

::: commit-message-ce1a8:5
(Diff revision 2)
> +Bug 1410366 - Stop socket server from listening for new connections if told so.
> +
> +Simply checking '_acceptConnections' when clients are trying to
> +connect to Marionette, and revoking the connection request inside
> +of `onSocketAccepted` is plainly wrong, given that a connection is

Avoid Markdown in commit messages.

::: testing/marionette/server.js:340
(Diff revision 2)
> -      logger.info("New connections are accepted again");
> +      logger.info("New connections will no longer be accepted");
> +      this.disconnect();
> +    }
> +  }
> +
> +  connect() {

"connect" is the wrong terminology here: the client connects, but
the server listens.

Perhaps this can be built into the acceptConnections setter, since
that is called by start() and stop()?

::: testing/marionette/server.js:347
(Diff revision 2)
> +      const flags = KeepWhenOffline | LoopbackOnly;
> +      const backlog = 1;
> +      this.socket = new ServerSocket(this.port, flags, backlog);
> +      this.port = this.socket.port;
> +
> +      this.socket && this.socket.asyncListen(this);

What purpose does the this.socket precursor address?

::: testing/marionette/server.js
(Diff revision 2)
> -    this.port = this.socket.port;
>      Preferences.set(PREF_PORT, this.port);
> +    env.set(ENV_ENABLED, "1");
>  
>      this.alive = true;
> -    this._acceptConnections = true;

this._acceptConnections is not used anymore, but as I pointed out in
my earlier issue, I wonder if this.acceptConnections = true/false
could not replace this.{connect,disconnect}?

::: testing/marionette/server.js:413
(Diff revision 2)
>      if (!this._acceptConnections) {
>        logger.warn("New connections are currently not accepted");
>        return;
>      }

As I understand it, this is now unused?
Attachment #8921035 - Flags: review?(ato) → review-
Comment on attachment 8920560 [details]
Bug 1410366 - Improve socket handling in wait_for_port

https://reviewboard.mozilla.org/r/191568/#review197264

::: testing/marionette/client/marionette_driver/marionette.py:706
(Diff revision 3)
> +                    except:
> +                        pass

In transport.py:260 [1] we look at the more specific exception type IOError and only for some specific socket errors:

> try:
>     sock.shutdown(socket.SHUT_RDWR)
> except IOError as exc:
>     # If the socket is already closed, don't care about:
>     #   Errno  57: Socket not connected
>     #   Errno 107: Transport endpoint is not connected
>     if exc.errno not in (57, 107):
>         raise

Should we not do the same here?

  [1] https://searchfox.org/mozilla-central/source/testing/marionette/client/marionette_driver/transport.py#260-267
Attachment #8920560 - Flags: review?(ato) → review+
Comment on attachment 8921035 [details]
Bug 1410366 - Stop socket server from listening for new connections if told so.

https://reviewboard.mozilla.org/r/191988/#review197228

> "connect" is the wrong terminology here: the client connects, but
> the server listens.
> 
> Perhaps this can be built into the acceptConnections setter, since
> that is called by start() and stop()?

We would have to update `start()` and `stop()` so both will set `acceptConnections`. I have no problem with that.

The only difference is that we will see the `New connections are accepted` log entry now for the first server socket initialization too.

> What purpose does the this.socket precursor address?

Ups. It's a left over. It will be removed.
Comment on attachment 8921035 [details]
Bug 1410366 - Stop socket server from listening for new connections if told so.

https://reviewboard.mozilla.org/r/191988/#review197228

> We would have to update `start()` and `stop()` so both will set `acceptConnections`. I have no problem with that.
> 
> The only difference is that we will see the `New connections are accepted` log entry now for the first server socket initialization too.

Andreas agreed on it via IRC. So the next version will have this change included.

> this._acceptConnections is not used anymore, but as I pointed out in
> my earlier issue, I wonder if this.acceptConnections = true/false
> could not replace this.{connect,disconnect}?

Right. The logic looks way simpler now. Thanks!

> As I understand it, this is now unused?

Yes, the listener callback will no longer be called if new connections are not allowed anymore.
Comment on attachment 8920560 [details]
Bug 1410366 - Improve socket handling in wait_for_port

https://reviewboard.mozilla.org/r/191568/#review197264

> In transport.py:260 [1] we look at the more specific exception type IOError and only for some specific socket errors:
> 
> > try:
> >     sock.shutdown(socket.SHUT_RDWR)
> > except IOError as exc:
> >     # If the socket is already closed, don't care about:
> >     #   Errno  57: Socket not connected
> >     #   Errno 107: Transport endpoint is not connected
> >     if exc.errno not in (57, 107):
> >         raise
> 
> Should we not do the same here?
> 
>   [1] https://searchfox.org/mozilla-central/source/testing/marionette/client/marionette_driver/transport.py#260-267

We use those because when we are connected to Marionette server the connection could drop as initiated by the server side (eg. unexpected shutdown, crash...). In those cases we have to raise an excpetion. Here in `wait_for_port` we don't have to care about the process status. It's enough for us to wait until the first connection attempt was successful. Then we close the socket, and do not care about any exception as thrown.

Does that make sense?
Comment on attachment 8920560 [details]
Bug 1410366 - Improve socket handling in wait_for_port

https://reviewboard.mozilla.org/r/191568/#review197264

> We use those because when we are connected to Marionette server the connection could drop as initiated by the server side (eg. unexpected shutdown, crash...). In those cases we have to raise an excpetion. Here in `wait_for_port` we don't have to care about the process status. It's enough for us to wait until the first connection attempt was successful. Then we close the socket, and do not care about any exception as thrown.
> 
> Does that make sense?

Yes, thanks for explaining.
Comment on attachment 8921035 [details]
Bug 1410366 - Stop socket server from listening for new connections if told so.

https://reviewboard.mozilla.org/r/191988/#review197282

I agree, looks much better now.  Thanks for making the changes!
Attachment #8921035 - Flags: review?(ato) → review+
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f4152054eb1b
Stop socket server from listening for new connections if told so. r=ato
https://hg.mozilla.org/integration/autoland/rev/43e45144dd9f
Improve socket handling in wait_for_port r=ato
Backed out for Marionette mass failures, at least on Android 4.3 API16+ debug:

https://hg.mozilla.org/integration/autoland/rev/5861384d1db5d7b81c2b552e5fa9a455a5bdeb75
https://hg.mozilla.org/integration/autoland/rev/879e10581abc1c25051ba1da171dbd73af39e3ef

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=43e45144dd9fffa826b7b4d8289f77ddeea172c6&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=138953411&repo=autoland

[task 2017-10-23T18:14:05.920Z] 18:14:05     INFO -  Using workspace for temporary data: "/builds/worker/workspace/build/tests/marionette"
[task 2017-10-23T18:14:05.960Z] 18:14:05     INFO -  mozdevice Detected adb 1.0.39
[task 2017-10-23T18:14:06.040Z] 18:14:06     INFO -  mozdevice Detected Android sdk 18
[task 2017-10-23T18:14:06.147Z] 18:14:06     INFO -  mozdevice will use zip to push directories
[task 2017-10-23T18:14:13.013Z] 18:14:13     INFO -  timed out waiting for profiles.ini
[task 2017-10-23T18:14:15.416Z] 18:14:15     INFO -  Application command: /builds/worker/workspace/build/android-sdk-linux/platform-tools/adb -s emulator-5554 shell am start -a android.activity.MAIN -n org.mozilla.fennec_aurora/org.mozilla.gecko.BrowserApp --es args '-no-remote -profile /storage/sdcard/tests/profile -marionette' --es env0 MOZ_CRASHREPORTER=1 --es env1 R_LOG_VERBOSE=1 --es env2 MOZ_HIDE_RESULTS_TABLE=1 --es env3 MOZ_LOG=signaling:3,mtransport:4,DataChannel:4,jsep:4,MediaPipelineFactory:4 --es env4 MOZ_CRASHREPORTER_SHUTDOWN=1 --es env5 R_LOG_DESTINATION=stderr --es env6 MOZ_CRASHREPORTER_NO_REPORT=1 --es env7 NO_EM_RESTART=1 --es env8 MOZ_PROCESS_LOG=/tmp/tmpWJQvlBpidlog --es env9 R_LOG_LEVEL=6
[task 2017-10-23T18:14:16.588Z] 18:14:16     INFO -  Starting: Intent { act=android.activity.MAIN cmp=org.mozilla.fennec_aurora/org.mozilla.gecko.BrowserApp (has extras) }
[task 2017-10-23T18:14:19.333Z] 18:14:19     INFO -  Profile path is /tmp/tmp_Rcipl.mozrunner
[task 2017-10-23T18:14:19.335Z] 18:14:19     INFO -  Starting fixture servers
[task 2017-10-23T18:14:24.937Z] 18:14:24     INFO -  {"thread": "MainThread", "level": "ERROR", "pid": 1066, "source": "Marionette test runner", "time": 1508782464936, "action": "log", "message": "Failure during harness execution", "stack": "Traceback (most recent call last):\n\n  File \"/builds/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py\", line 92, in cli\n    failed = harness_instance.run()\n\n  File \"/builds/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py\", line 72, in run\n    runner.run_tests(tests)\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 843, in run_tests\n    self.fixture_servers = self.start_fixture_servers()\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 943, in start_fixture_servers\n    if self.appName == \"fennec\":\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 677, in appName\n    self._appName = self.capabilities.get('browserName')\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/runner/base.py\", line 667, in capabilities\n    self.marionette.start_session()\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/decorators.py\", line 28, in _\n    m._handle_socket_failure()\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/decorators.py\", line 23, in _\n    return func(*args, **kwargs)\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py\", line 1224, in start_session\n    self.protocol, _ = self.client.connect()\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/transport.py\", line 198, in connect\n    raw = self.receive(unmarshal=False)\n\n  File \"/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/transport.py\", line 152, in receive\n    raise socket.error(\"No data received over socket\")\n\nIOError: Process has been unexpectedly closed (Exit code: 0) (Reason: No data received over socket)\n"}
[task 2017-10-23T18:14:24.951Z] 18:14:24    ERROR - Return code: 1
Flags: needinfo?(hskupin)
It's unfortunate that the tests have not been run on Android given that I only used e10s, and didn't force non-e10s. Inspecting the stack the failure happens for the very first retrieval of data which in this case the hello string. Hereby we only allow a timeout of 2s, whereby any other command is allowed to wait for the socket timeout.

Given that builds for Android are running in the emulator and this is absolutely slow, this timeout seems to be inappropriate nowadays. I think that we should remove, or at least extend it to a reasonable amount of time.

Here a try build which removes the restriction completely and adds two logging lines so that we can see how long it usually takes to receive the first data:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a5d66ced6f300e536b3e62c2f0cd7bf79b2eb3a
Flags: needinfo?(hskupin)
Blocks: 1411189
Blocks: 1340621
Blocks: 1397612
Blocks: 1398685
Really strange:

> [task 2017-10-24T09:20:33.529Z] 09:20:33     INFO -  ** before hello string: 2017-10-24 09:20:33.528379
> [task 2017-10-24T09:20:39.823Z] 09:20:39     INFO -  {"thread": "MainThread", "level": "ERROR", "pid": 1068, ...}
> [task 2017-10-24T09:20:39.837Z] 09:20:39    ERROR - Return code: 1

So it's even not working with a 6min socket timeout! I think that I will have to try out how to run the Marionette tests on a TC one click loaner.
Blocks: 1365243
Blocks: 1389103
Blocks: 1407156
Blocks: 1397076
Blocks: 1391545
Blocks: 1400819
(In reply to Henrik Skupin (:whimboo) from comment #28)
> Really strange:
> 
> > [task 2017-10-24T09:20:33.529Z] 09:20:33     INFO -  ** before hello string: 2017-10-24 09:20:33.528379
> > [task 2017-10-24T09:20:39.823Z] 09:20:39     INFO -  {"thread": "MainThread", "level": "ERROR", "pid": 1068, ...}
> > [task 2017-10-24T09:20:39.837Z] 09:20:39    ERROR - Return code: 1
> 
> So it's even not working with a 6min socket timeout!

That's 6 *seconds*, right?
Oh! What a bummer! I cannot read anymore. Thanks a lot for this hint! Now the question is where this obviously very low value comes from.
Here is why we return already after 6s:

[task 2017-10-24T13:21:44.278Z] 13:21:44     INFO -  ** connected to port 2828
[task 2017-10-24T13:21:44.278Z] 13:21:44     INFO -  ** before hello string: 2017-10-24 13:21:44.277271
[task 2017-10-24T13:21:44.278Z] 13:21:44     INFO -  ** socket timeout set to 360.0
[task 2017-10-24T13:21:44.278Z] 13:21:44     INFO -  ** sock timeout set to 360.0
[task 2017-10-24T13:21:44.278Z] 13:21:44     INFO -  ** before sock.recv: 2017-10-24 13:21:44.277378
[task 2017-10-24T13:21:44.284Z] 13:21:44     INFO -  ** after sock.recv: 2017-10-24 13:21:44.283800
[task 2017-10-24T13:21:44.286Z] 13:21:44     INFO -  ** chunk=

The socket timeout is actually set correctly to 360s but recv returns instantly with no single data read. Which indicates the following:

> When a recv returns 0 bytes, it means the other side has closed (or is in the process of closing) the connection. You will not receive any more data on this connection. Ever. You may be able to send data successfully.

Most likely this problem exists because of the very slow startup and Firefox doesn't send anything at this stage. For this bug I think it's overkill to get investigated and fixed.

As a solution I will put back the data check for ":" into `wait_for_port()`, which should solve this problem because then Marionette will wait until data is actually available.
Blocks: 1402138
Summary: Improve socket handling in wait_for_port → Improve socket handling for initial connection
And it works! I will add a comment why we have to keep the data checks in wait_for_page load and push again to autoland.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/30e69377c285
Stop socket server from listening for new connections if told so. r=ato
https://hg.mozilla.org/integration/autoland/rev/11e5f7b5376f
Improve socket handling in wait_for_port r=ato
No longer blocks: 1391545
No longer blocks: 1411189
No longer blocks: 1407156
No longer blocks: 1402138
No longer blocks: 1407390
It's a bit hard to say if the situation has been improved or not, but at least I haven't seen any regressions from landing this patch in the last two days. 

So can someone please uplift this test-only patch to mozilla-beta? Thanks.
Whiteboard: [checkin-needed-beta]
Sorry, I didn't realize that it was a Marionette
To get the patches uplifted to esr52 I would have to make substantial changes. I don't think it warrants the work, also because no-one actually noticed that since the 52 release, and it only affects restart tests.
Priority: -- → P2
Whiteboard: [17/10]
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: