Closed Bug 508128 Opened 15 years ago Closed 13 years ago

httpd.js running in xpcshell GC's too infrequently

Categories

(Core :: XPConnect, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla7

People

(Reporter: cpearce, Assigned: jdm)

References

Details

Attachments

(9 files, 3 obsolete files)

1.08 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
3.01 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
1.09 KB, patch
Details | Diff | Splinter Review
1.57 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
1.61 KB, patch
Details | Diff | Splinter Review
1.37 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
2.70 KB, patch
Details | Diff | Splinter Review
1.42 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
10.42 KB, patch
Details | Diff | Splinter Review
When repeatedly running mochitest on content/media/test/test_seek.html, I can see in Task Manager or by using `top` that xpcshell.exe's memory usage rapidly increases with every run. It expands by a few hundred megabytes every reload of test_seek.html. If you wait long enough, (I assume) a garbage collection is done, and the memory shrinks back down to a few hundred megabytes. When running this many times, it's easy to expand the memory usage to over a gigabyte.

Usually the first run expands the memory usage to a few hundred magabytes, but if I add a call to gc() at the end of _connectionClosed in httpd.js:719, memory always stays at around 9MB.

Note that test_seek is disabled on Windows because as it fails intermittently. This is because either necko or httpd.js can't handle suspending or resuming connections, this is a separate issue, and I'm looking into this. If test_seek.html appears to not terminate, that's not related to this gc issue. If I set the media code to not try to suspend/resume connections, this memory bloat still happens, so that's not the cause.

This bug happens on at least Linux and Windows. You have to edit content/media/test/Makefile.in to enable test_seek.html on Windows if you want to test on Windows.

STR:
1. Run mochitest on content/media/test_seek.html
2. Observe memory usage of xpcshell.exe in `top`/Task Manager.
3. Repeatedly reload test_seek.html page, note ever increasing memory usage.
Please try the patch in bug 506125
(In reply to comment #1)
> Please try the patch in bug 506125

It may help, but not enough; it's still only take a few runs to get over 1GB on Linux running seek_test.html.
We should investigate whether the MaybeGC hook is called properly. It should trigger the GC as your working set size increases (for every 18.5% increase you get a GC).
Does xpcshell do any calling of MaybeGC by default?  It may be that we need to run it with a -b argument to get steadier memory behaviour here, or just add the gc() call to httpd.js as alluded above.
(In reply to comment #4)
> Does xpcshell do any calling of MaybeGC by default?

I tested this with Anreas' patch from bug 506125, and it looks like xpcshell doesn't ever call MaybeGC(). During a mochitest run the browser does, but xpcshell doesn't. It looks like xpcshell only GCs on exit. I've run through a /content mochitest run on Windows with a break point in xpcshell's instance of js_GC() and it's not hit. Memory use for xpcshell.exe peaked at about 885MB, firefox.exe at about 170MB. Remember this was only for the content mochitests!

> It may be that we need to run it with a -b argument to get steadier memory 
> behaviour here, or just add the gc() call to httpd.js as alluded above.

I looks like xpcshell doesn't have a -b argument, could you elaborate please?
That explains it. I will look at xpcshell and see if we can easily add MaybeGC calls.
Assignee: general → nobody
Component: JavaScript Engine → XPConnect
QA Contact: general → xpconnect
Hardware: x86 → All
Patch adds a gc() call, keeps xpcshell memory down to about 15MB or so.

<gal> cpearce: so bent said xpcshell is mostly used for testing
<gal> and if we do an automated gc heuristics, you couldn't test GC stuff easily any more
<gal> so maybe we should simply add that manual gc() call
<cpearce> gal: that's a shame, but ok.
<gal> yeah, I agree
<gal> the regular shell triggers the GC as expected
<cpearce> I'll put a patch up for gc() in httpd.js then.
<cpearce> Are there any other scripts run by xpcshell that we need to do something similar in?
<gal> I wouldn't know
<gal> I know we have tests that hit last ditch gc
<brendan> we shouldn't do too much gc() hacking
Attachment #392624 - Flags: review?
Attachment #392624 - Flags: review? → review?(jwalden+bmo)
Attachment #392624 - Flags: review?(jwalden+bmo) → review-
Comment on attachment 392624 [details] [diff] [review]
Patch httpd.js to gc() in _connectionClosed
[Backout: Comment 28]

I'm sympathetic, but no.  xpcshell should have some way -- I believe by default -- for GC to occur naturally over time, without requiring explicit calls to a gc function.  If there is some reason to need more precise control over GC, that functionality should be opt-in; the common case is not to be testing precise GC behavior.  It is not the duty of every xpcshell user to work around this lack.

As a side effect, your patch probably causes the reftest harness to fail.
Brendan, see comment #8. Can you, Jeff and Andreas reach an agreement on what needs to be done here, and do it? This is a pretty huge problem --- I don't know how we've struggled on this long without noticing it. (OK, I have noticed huge httpd.js memory usage before, but I didn't consider that GC might just not be happening.)
GC will happen, but only at the extreme of $hugemem, not optimistically before then, I wasn't thinking of that, sorry.

If you want to minimize the time it takes to gc, adding |data.length = 0| after |output.writeByteArray(data, data.length)| in httpd.js will probably make a large difference for this particular case; I'd be fine with that addition.
Can we do something expedient in httpd.js ASAP, with a followup bug for the right fix to xpcshell filed and patched soon enough?

/be
(In reply to comment #8)
> (From update of attachment 392624 [details] [diff] [review])
> I'm sympathetic, but no. [...]
> As a side effect, your patch probably causes the reftest harness to fail.

I pushed the patch to tryserver, there were a few random-looking failures but all the tests ran at least, and all tests also run locally. I've pushed to tryserver again to see if the random-looking-failures happen again. How could this change cause the reftest harness to fail?
Looks like I am going to be fixing this. My memory pressure-based GC heuristic patch removes triggering GC from the allocator path and since xpcshell never calls MaybeGC, nochrome talos fails on some platforms now.
Assignee: nobody → gal
Blocks: 506125
Chris, can you get the quick fix in? I will provide a generalized fix.
(In reply to comment #14)
> Chris, can you get the quick fix in? I will provide a generalized fix.

Sure, who can will review+ it?
My first pick would be jwalden, followed by sayrer. Both are cc'd.
Well he knows that corner a lot better me. I can't overrule him. Waldo, whats up with the review? Are you ok with a temporary stop-gap measure until I have the MaybeGC path in place? We can back it out later. It sounds like this issue is biting people right now.
if this is turning tinderboxen orange as I write, we should take the httpd.js bandaid until we get a real fix.

bandaids are ok as long as they get removed properly.
is that an r=sayrer?
I expect this would break reftests because there wouldn't be a gc() function to call in reftest; the server is run there as a component, and I wouldn't expect components to have such a function in scope.  Thinking a little more, it's possible that the method is called late enough that the thrown ReferenceError happens to be lucky and doesn't break anything.  It's also possible my non-fanatical attempts to make httpd.js code exception-safe (or, rather, don't-hang-if-an-exception-is-thrown-safe) were successful in this case; hard to say without investigating, which I'm not inclined to do.

There are bandaids that change the way existing APIs are used to work around problems in them.  There are also gross, awful hacks that perform some completely random change unmotivated by what the surrounding code does.  I think this is an instance of the latter and not an instance of the former.
Here's a hack I'd accept, because it doesn't touch httpd.js itself: if we added testing/mochitest/tests/run-gc.sjs with contents |function handleRequest(request, response) { response.setHeader("Cache-Control", "no-cache", false); Components.utils.forceGC(); }| and occasionally loaded that file, would that also do a JS GC and ameliorate the problem?
(In reply to comment #22)
> Here's a hack I'd accept, because it doesn't touch httpd.js itself: if we added
> testing/mochitest/tests/run-gc.sjs with contents |function
> handleRequest(request, response) { response.setHeader("Cache-Control",
> "no-cache", false); Components.utils.forceGC(); }| and occasionally loaded that
> file, would that also do a JS GC and ameliorate the problem?

gah. way, way too much ink spilled on a trivial, temporary bandaid patch. 

Making an sjs file and calling is more complicated. We should just call Components.utils.forceGC() on connection close until Gal gets us a better fix, if you ask me.
Actually, easier would be to just do it in the server directly, like so.  Rebuilding and testing this now...
Comment on attachment 392624 [details] [diff] [review]
Patch httpd.js to gc() in _connectionClosed
[Backout: Comment 28]

Please guard this with |if (typeof gc === "function)|.  Also, I hate myself for doing this.

My patch didn't work, I'm guessing because by the time it's called the garbage your gc collects is entrained by XPCOM gunk, and I don't see a way to force a cycle collection in xpcshell.  (The usual way requires nsIDOMWindowUtils, which exists in the browser.)
Attachment #392624 - Flags: review- → review+
let's try this on the tracemonkey tree:

http://hg.mozilla.org/tracemonkey/rev/1391e8b39c73
(In reply to comment #26)
> let's try this on the tracemonkey tree:
> 
> http://hg.mozilla.org/tracemonkey/rev/1391e8b39c73

This seems to have made the tracemonkey test boxes consistently orange across the board. :(
Fixed by 506117 as a byproduct.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
Fixed by 506125, not 506117. Apologies.
No longer blocks: 506125
Attachment #392786 - Attachment is obsolete: true
Attachment #392624 - Attachment is obsolete: true
Attachment #392624 - Attachment description: Patch httpd.js to gc() in _connectionClosed → Patch httpd.js to gc() in _connectionClosed [Backout: Comment 28]
Un-duping. There's been no progress on bug 506125 for two months, and we need this fixed ASAP, and surely we can do something simple in httpd.js that doesn't require us to implement intelligent memory-pressure-based GC across the board.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
I pushed the comment #26 patch to the try-servers. It seems have caused a hang in extension manager tests:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257114322.1257122789.7235.gz#err0

TEST-PASS | /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_LightweightThemeManager.js | test passed
TEST-PASS | /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_bug257155.js | test passed

command timed out: 300 seconds without output, killing pid 1083
Summary: xpcshell.exe gc's too infrequently → httpd.js running in xpcshell GC's too infrequently
Windows failed in unit tests, too, but at a different test, so I suppose the hang is probably at a random place:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257114322.1257123928.19656.gz#err0

TEST-PASS | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_dm\unit\test_bug_382825.js | test passed
TEST-PASS | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_dm\unit\test_bug_384744.js | test passed
TEST-PASS | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_dm\unit\test_bug_395092.js | test passed

command timed out: 300 seconds without output

On both Linux and Windows (Mac hasn't reported yet), we have a hang at the end of chrometests, after they've passed:

TEST-PASS | chrome://mochikit/content/browser/xpinstall/tests/browser_whitelist6.js | Install should have been triggered by the right uri

Browser Chrome Test Summary
	Passed: 4000
	Failed: 0
	Todo: 5

*** End BrowserChrome Test Results ***
INFO | automation.py | Application ran for: 0:03:00.187000
SUCCESS: The process with PID 8288 has been terminated.
before 995328, after 995328, break 00000000
Interestingly, Mac hung in exactly the same place as Linux:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257114322.1257125858.8527.gz#err1

TEST-PASS | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_LightweightThemeManager.js | test passed
TEST-PASS | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_bug257155.js | test passed

command timed out: 300 seconds without output, killing pid 16779

The hang after running chrome tests also happened on Mac.
We hang in test_bug299716.js because it does do_timeout(DELAY, "run_test_pt2()"). That timeout never fires, I think because we GC before it fires and the GC cleans up all references to the timer, which cancels the timer.
Fix do_timeout so pending timers aren't GCed and canceled.
Attachment #409642 - Flags: review?
Attachment #409642 - Flags: review? → review?(jwalden+bmo)
The remaining hang is that httpd.js fails to terminate after the browser has exited from the browser mochitests. We get to
>>> shutting down server on port 8888
and then we close the connection to /server/shutdown, but it looks like there are still other connections open. Maybe we missed a connection-closed notification because something got GCed that was supposed to notice when a connection has closed? It's 100% reproducible for me.
Yes, the problem is that the connection for browser/components/privatebrowsing/test/browser/staller.sjs never closes.
It looks like this:

  function stall() {
    response.write("stalling...\n");
  }

  response.setHeader("Content-Type", "text/plain", false);
  stall();

  const nsITimer = Components.interfaces.nsITimer;
  var timer = Components.classes["@mozilla.org/timer;1"]
                        .createInstance(nsITimer);
  timer.initWithCallback(stall, 500, nsITimer.TYPE_REPEATING_SLACK);

so my guess is that this is supposed to write to a closed socket, which will trigger an error that lets httpd.js notice that the connection is closed. But I bet we're GCing away this timer callback so we never write and never notice the socket closed.
Attachment #409642 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 409642 [details] [diff] [review]
fix do_timeout [checked in comment #50]

Seems sensible.

The naming convention for head.js-private variables is non-existent, and I was tempted to grumble here, but this isn't any different than what exists -- filed bug 525819 to hit that.  Also, I filed bug 525820 to make do_timeout take a nullary function (not a string) as an argument.
Attached patch fix staller.sjs (obsolete) — Splinter Review
Similar fix for staller.sjs. This is slightly more complicated, we have to keep firing the timer until the response has ended (the browser closes its side and we get an error writing to the socket). This requires adding an isEnded API to Response.
Attachment #409662 - Flags: review?(jwalden+bmo)
With those patches, I got some new test failures on tryserver on Windows:

TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\unit\test_cookie_header.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_cookie_header.js | Expected to find header 'Cookie' but didn't find it - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_cookie_header.js | Unexpected exception: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_host.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_request_line_split_in_two_packets.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_request_line_split_in_two_packets.js |  == HTTP/1.1 200 TEST PASSED - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_seizepower.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js |  == too-late passed - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - 

Linux and Mac had the test_cookie_header.js failures but not the others.
Attachment #409662 - Flags: review?(jwalden+bmo) → review-
Comment on attachment 409662 [details] [diff] [review]
fix staller.sjs

>diff --git a/netwerk/test/httpserver/httpd.js b/netwerk/test/httpserver/httpd.js

>+  /**
>+   * Returns true when the response has ended (completely constructed if possible
>+   * and the connection closed); further actions on this will fail.
>+   */
>+  get isEnded()
>+  {
>+    return this._ended;
>+  },

I'd rather not add this.  write() is guaranteed to throw if the response is fully constructed (whether completely sent or not), and it is entirely safe to rely on that behavior.


>diff --git a/browser/components/privatebrowsing/test/browser/staller.sjs b/browser/components/privatebrowsing/test/browser/staller.sjs

There seems to be more complexity here than necessary, some to support a repeating timer using that explicit timer type and some to support multiple timers (except this only ever uses one).  I had trouble following the control flow and the timer-release behavior.  I'd rather see repeated reuse of a one-shot timer as below, which I think is clearer and more concise (this semi-relies on a one-shot timer clearing its callback if not reinitialized; feel free to add a try { } catch (e) { timer = null; throw e; } around the contents of stall if you prefer):

var timer;
function handleRequest(request, response)
{
  response.setStatusLine(request.httpVersion, 200, "OK");
  response.setHeader("Content-Type", "text/plain", false);

  const nsITimer = Components.interfaces.nsITimer;
  timer = Components.classes["@mozilla.org/timer;1"]
                    .createInstance(nsITimer);
  function stall(t)
  {
    response.write("stalling...\n");
    timer.initWithCallback(stall, 500, nsITimer.TYPE_ONE_SHOT);
  }

  stall();
  response.processAsync();
}
This one's a doozy. If someone does a QI on JS-wrapped XPCOM object, the JS wrapper remembers that it has the methods of that interface, so if someone else gets that XPCOM object, they'll see those methods without QI'ing. But if you get a GC in between, the JS wrapper goes away, and the "someone else" won't see those methods.

In this case, the test was QIing to the wrong interface but finding the method anyway ... unless we happen to GC in between!
I hope this is what you had in mind? Seems to work.
Attachment #409662 - Attachment is obsolete: true
Attachment #409896 - Flags: review?(jwalden+bmo)
With these patches, things look good locally. My try-server build seems to be in limbo.
Comment on attachment 409896 [details] [diff] [review]
fix staller.sjs (v2) [checked in comment 70]

Well, what I had in mind was exactly what I typed out.  :-)  Maybe that had some thinkos in it?  No matter, this is as good.
Attachment #409896 - Flags: review?(jwalden+bmo) → review+
Checked in patches from comments #37 and #46:
http://hg.mozilla.org/mozilla-central/rev/89bce43a99d1
http://hg.mozilla.org/mozilla-central/rev/4aa00e390ced

My try-server build with the full set of patches did not work. On Windows the errors in comment #43 recurred (other than test_cookie_header errors). They don't occur locally, as I mentioned. So more investigation is needed.
I think I found the bug. RawTest's 'writer' gets a stream from XPCOM in onOutputStreamReady and then sometimes calls waitToWriteOutput, which expects that stream to have been previously QIed to nsIAsyncOutputStream. That QI was done on the stream object previously in performNextTest, but GC might have stripped it. There is a similar issue for 'reader'.

Pushed to try server...
I think our test-running xpcshells need to GC vigorously if only so we can prevent more tests being added with these bugs!
(In reply to comment #52)
> I think our test-running xpcshells need to GC vigorously if only so we can
> prevent more tests being added with these bugs!

"These bugs" are all bug 307441, a bad bug from the First Age. Who will fix it?

/be
My try-server build passed on Linux (again), got what could be a new intermittent failure on Mac:
TEST-UNEXPECTED-FAIL | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_traceable_channel.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_traceable_channel.js | replaced http response body == original http response body - See following stack:

Windows never reported back :-(
Ah, Windows just finally reported back. A couple of the errors are still there :-(.

TEST-UNEXPECTED-FAIL | e:\builds\moz2_slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_host.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\moz2_slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_seizepower.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js | exceptions test passed ==  - See following stack:
Are you manually eliding stacks in the above output, or is the stack-dumping code not working?  (Although, stacks might be fairly generic here given the asynchronicity of the tests and the RawTest gunk, so I'm not sure if they'd actually help or not.)
The first one at least looks like it's what's expected, although there's some fail there because it's catching an exception and throwing a new one (losing the actually interesting stack in the process).  Still examining...
      onOutputStreamReady: function(stream)
      {
        var data = testArray[testIndex].data[dataIndex];

        var written = 0;
        try
        {
          written = stream.write(data, data.length);
          if (written == data.length)
            dataIndex++;
          else
            testArray[testIndex].data = data.substring(written);
        }
        catch (e) { /* stream could have been closed, just ignore */ }

        // Keep reading data until there's no more data to read
        if (written != 0)
          waitToWriteOutput(stream);
        else
          stream.close();
      }

So, |testArray[testIndex].data| is an array of strings.  Initially |data| is the next one of those strings to write out.  We try to write it out.  If we wrote it all out, then we increase |dataIndex| to work on the next string to write out.  If we *didn't*, however, we grab what wasn't written and re-queue it:

            testArray[testIndex].data = data.substring(written);

|data.substring(written)| is a string.  |testArray[testIndex].data| was previously an array.  Type error!  The left-hand side should be |testArray[testIndex].data[dataIndex]|.  The TypeError you're seeing would happen if the remaining substring is small but |dataIndex| was large.
OK, so I've fixed that, and fixed another bug where we should be checking dataIndex < testArray[testIndex].data.length, but we've still got failures.

On Linux:
http://tinderbox.mozilla.org/showlog.cgi?tree=MozillaTry&errorparser=unittest&logfile=1257390383.1257399263.19861.gz&buildtime=1257390383&buildname=Linux%20try%20hg%20unit%20test&fulltext=1#err1
TEST-UNEXPECTED-FAIL | /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js |  == async seizure passed - See following stack:
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/testing/xpcshell/head.js :: do_throw :: line 201
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/testing/xpcshell/head.js :: do_check_eq :: line 253
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js :: checkAsyncSeizure :: line 252
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606

On Windows:
http://tinderbox.mozilla.org/showlog.cgi?tree=MozillaTry&errorparser=unittest&logfile=1257390383.1257402748.25078.gz&buildtime=1257390383&buildname=WINNT%205.2%20try%20hg%20unit%20test&fulltext=1#err0
TEST-INFO | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 585"  data: no], see following stack:
undefined
NEXT ERROR TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js :: check400 :: line 303
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606

TEST-INFO | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 585"  data: no], see following stack:
undefined
NEXT ERROR TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_request_line_split_in_two_packets.js |  == HTTP/1.1 200 TEST PASSED - See following stack:
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_request_line_split_in_two_packets.js :: checkLotsOfLeadingBlankLines :: line 133
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606


TEST-INFO | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 585"  data: no], see following stack:
undefined
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js | exceptions test passed ==  - See following stack:
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js :: checkExceptions :: line 243
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606
All of these errors are consistent with RawTest just dropping data on the floor before calling the data-check callback.
I wonder if it's possible for reader.onInputStreamReady to be fired sometimes when there's no data in the stream yet? I pushed another patch to the try-server to see if I can narrow that down.
Plausible.  :-\

It's depressing how complicated this code is, but on the other hand 1) at least it's not spread across a handful of different tests, and 2) if the tests didn't exist we'd be in an even worse position.

Regarding the written != 0 check you mentioned as looking wrong, the reason for it, after thinking a little more, is that if the stream closes before we've finished writing all data (entirely possible when the test is that the server fails fast with a 400 error), we have to stop writing out data.  Switching to just check whether more data remains to be written seems not enough to handle that case, but I'm not entirely sure what the better thing to do is.
I guess we can check written != 0 too. onOutputStreamReady shouldn't fire unless we can write at least one byte.

The latest push has no errors on Linux or Mac (although there could just be randomness here, of course) but the same 3 errors on Windows. Here's a relevant snippet:
http://tinderbox.mozilla.org/showlog.cgi?tree=MozillaTry&errorparser=unittest&logfile=1257414317.1257428920.30309.gz&buildtime=1257414317&buildname=WINNT%205.2%20try%20hg%20unit%20test&fulltext=1#err0

*** copying undefined bytes to output: 66,97,100,32,114,101,113,117,101,115,116,10
*** _waitForData
*** onInputStreamReady
*** cancel(0)
*** onStopRequest async callback
*** onStopRequest [status=0x0]
*** closing connection 2 on port 2904
before 1060864, after 1052672, break 00000000
TEST-INFO | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 587"  data: no], see following stack:
undefined
*** reader.onInputStreamReady: available=0
NEXT ERROR TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
JS frame :: e:\builds\slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js :: check400 :: line 303
JS frame :: e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 610

I think the key line here is "reader.onInputStreamReady: available=0". That's just wrong.
(In reply to comment #63)
> It's depressing how complicated this code is, but on the other hand 1) at least
> it's not spread across a handful of different tests, and 2) if the tests didn't
> exist we'd be in an even worse position.

If we find out that lack of GC in httpd.js is causing a lot of the orange across the board, then I say if the tests didn't exist we'd be in a much *better* position.

In fact I think we should disable RawTests and land these fixes. The tests are clearly broken and I have had enough of trying to fix them.
I don't know if this patch helps at all, but it might helpful for future work on these tests.
Attachment #410631 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 410632 [details] [diff] [review]
Additional diagnostics to report more data about exceptions in tests

>diff --git a/netwerk/test/httpserver/test/head_utils.js b/netwerk/test/httpserver/test/head_utils.js

Hm, we should have added this noting and reporting functionality ages ago, didn't occur to me that such functionality should be generalized.  I wonder: is this because people aren't testing the exceptions thrown by methods enough?


>@@ -571,45 +571,58 @@ function runRawTests(testArray, done)
>   /** Data received so far from the server. */
>   var received = "";
> 
>   /** Reads data from the socket. */
>   var reader =
>     {
>       onInputStreamReady: function(stream)
>       {
>-        var bis = new BinaryInputStream(stream);
>+        try {

try
{

catch (e)
{

finally
{

throughout.


>+          var av = 0;
>+          try
>+          {
>+            av = bis.available();
>+          }
>+          catch (e) {
>+            do_note_exception(e);
>+            /* default to 0 */

Place the comment before the noting; that seems the more logical ordering (same elsewhere too).  We default to 0, therefore we merely note the exception rather than doing something with it like checking its value.


>diff --git a/netwerk/test/unit/test_traceable_channel.js b/netwerk/test/unit/test_traceable_channel.js

>   onStopRequest: function(request, context, statusCode) {
>-    this.listener.onStopRequest(request, context, statusCode);
>-    httpserver.stop(do_test_finished);
>+    dump("In TracingListener.onStopRequest\n");
>+    try {
>+      this.listener.onStopRequest(request, context, statusCode);
>+      httpserver.stop(do_test_finished);
>+      dump("Did TracingListener.onStopRequest\n");
>+    } catch (e) {
>+      dump("TracingListener.onStopRequest swallowing exception: " + e + "\n");
>+    }
>   },

This formulation is a bad idea because if onStopRequest throws, the server will never be stopped and do_test_finished will never be called.  Instead, move the server-stop call into a finally clause.
Attachment #410632 - Flags: review?(jwalden+bmo) → review+
Assignee: gal → roc
Depends on: 601557
I'm looking into this.  I'm seeing a very repeatable assertion failure in test_async_notifications_404.js which looks like a weakness in the scope of the request proxies, which I've filed as bug 601557.
Assignee: nobody → josh
Having applied attachment 410632 [details] [diff] [review] and made Waldo's requested changes, I see this notice before the test failure occurs:

TEST-INFO | /Users/moco/src/m-c/ff-dbg/_tests/xpcshell/netwerk/test/httpserver/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: /Users/moco/src/m-c/ff-dbg/_tests/xpcshell/netwerk/test/httpserver/test/head_utils.js :: anonymous :: line 532"  data: no], see following stack:
Apparently that's not the cause, however, because the same exception occurs in successful runs as well.
As far as I can tell, we're suffering from a similar problem to bug 601557.  The input stream created in performNextTest goes out of scope, and apparently nobody else bothers to hold a strong reference to it.  When httpd.js dispatches a runnable that ends up triggering a GC, it races against the stream becoming readable.  If the GC wins, the stream is destroyed and in its dying throes triggers a runnable which calls OnInputStreamReady and keeps the stream alive for the callback.  The easiest solution is to retain a strong ref to the stream somewhere in head_utils.js, such as within the reader object.
Here's the updated diagnostic patch for eventual landing.  Carrying forward the previous r+.
Attachment #481089 - Flags: review?(jwalden+bmo)
Attachment #410632 - Attachment is obsolete: true
Attachment #481090 - Attachment description: "httpd.js running in xpcshell GC's too infrequently" [ → Additional diagnostics to report more data about exceptions in tests
I've had several try server runs with the patch from bug 601557 and here applied, and the only failures I've received are known intermittent oranges that don't appear to be any worse with this GC behaviour change.  I'm feeling pretty good about this.
Comment on attachment 481089 [details] [diff] [review]
Ensure input stream lifetime is not cut short by a GC. [checked in comment 80]

>diff --git a/netwerk/test/httpserver/test/head_utils.js b/netwerk/test/httpserver/test/head_utils.js
>@@ -517,16 +518,17 @@ function runRawTests(testArray, done)
>   /** Data received so far from the server. */
>   var received = "";
> 
>   /** Reads data from the socket. */
>   var reader =
>     {
>       onInputStreamReady: function(stream)
>       {
>+        do_check_true(stream == this.stream);

=== please, == is a bit broken (not in this instance, but bad choices can snowball into bad habits).

Is it the case that this issue also needs to be fixed in runHttpTests?  Looks like it to me, but maybe I'm missing something.  If you agree, please add:

* a |listener._channel = ch;| before the |ch.asyncOpen(listener, null)| in
  |performNextTest|,
* a |/** Current channel being observed by this. */ _channel: null,| in the
  |listener| literal,
* a |do_check_true(request === this._channel);| in |onStartRequest|,
* and a |this._channel = null;| in |onStopRequest|

while you're making the changes you've already suggested.  Sorry for the delay in getting to the review...
Attachment #481089 - Flags: review?(jwalden+bmo) → review+
And the actual gc() patch bounced because it lit up the xpcshell tests.  Investigating yet again.  Sigh.
The failures in xpcshell tests all seem to be:

###!!! ASSERTION: nsExceptionManager not thread-safe

at the stack:

 5  libxul.so!nsExceptionManager::Release [nsExceptionService.cpp:031018350d0f :
    eip = 0x02860530   esp = 0xbfe7ac00   ebp = 0xbfe7ac38   ebx = 0x0357750c
    Found by: call frame info
 6  libxul.so!nsExceptionService::DoDropThread [nsExceptionService.cpp:031018350
    eip = 0x0285f744   esp = 0xbfe7ac40   ebp = 0xbfe7ac68   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
 7  libxul.so!nsExceptionService::DropAllThreads [nsExceptionService.cpp:0310183
    eip = 0x0285f782   esp = 0xbfe7ac70   ebp = 0xbfe7ac78   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
 8  libxul.so!nsExceptionService::Shutdown [nsExceptionService.cpp:031018350d0f 
    eip = 0x0285fa47   esp = 0xbfe7ac80   ebp = 0xbfe7ac98   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
 9  libxul.so!nsExceptionService::Observe [nsExceptionService.cpp:031018350d0f :
    eip = 0x0285fa93   esp = 0xbfe7aca0   ebp = 0xbfe7aca8   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
10  libxul.so!nsObserverList::NotifyObservers [nsObserverList.cpp:031018350d0f :
    eip = 0x027fb5e3   esp = 0xbfe7acb0   ebp = 0xbfe7acd8   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
11  libxul.so!nsObserverService::NotifyObservers [nsObserverService.cpp:03101835
    eip = 0x027fc6ff   esp = 0xbfe7ace0   ebp = 0xbfe7ad18   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
12  libxul.so!mozilla::ShutdownXPCOM [nsXPComInit.cpp:031018350d0f : 624 + 0x3b]
    eip = 0x027df6d6   esp = 0xbfe7ad20   ebp = 0xbfe7ad98   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info


Seems like a bug in nsExceptionService, perhaps?
There's also this failure:

TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/mozilla-central_fedora64_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_records_crypto.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>
Setting the identity for passphrase
TEST-INFO | (xpcshell/head.js) | test 1 pending
TEST-INFO | (xpcshell/head.js) | test 2 pending
2010-10-22 13:21:27	Test                 INFO	Setting up server and authenticator
2010-10-22 13:21:27	Test                 INFO	Generating keypair + symmetric key
2010-10-22 13:21:27	PubKeyManager        DEBUG	Generating RSA keypair
2010-10-22 13:21:27	PubKeyManager        DEBUG	Generating RSA keypair... done
2010-10-22 13:21:27	Test                 INFO	Setting up keyring
2010-10-22 13:21:27	Test                 INFO	Creating a record
TEST-PASS | /home/cltbld/talos-slave/mozilla-central_fedora64_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_records_crypto.js | [run_test : 80] http://localhost:8080/crypto/steam == http://localhost:8080/crypto/steam
TEST-PASS | /home/cltbld/talos-slave/mozilla-central_fedora64_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_records_crypto.js | [run_test : 81] ../crypto/steam == ../crypto/steam
2010-10-22 13:21:27	Test                 INFO	Encrypting a record
2010-10-22 13:21:27	PubKeyManager        TRACE	Importing record: http://localhost:8080/keys/pubkey
2010-10-22 13:21:28	Net.Resource         DEBUG	GET success 200 http://localhost:8080/keys/pubkey
2010-10-22 13:21:28	PrivKeyManager       TRACE	Importing record: http://localhost:8080/keys/privkey
before 1576960, after 1417216, break 01db7000
2010-10-22 13:21:28	Net.Resource         DEBUG	GET success 200 http://localhost:8080/keys/privkey
before 1421312, after 1413120, break 01db7000
TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | 2147500037
(In reply to comment #82)
> Seems like a bug in nsExceptionService, perhaps?

I talked to jst about this, and he sounded ok with just making the reference counting threadsafe in this case.

http://hg.mozilla.org/mozilla-central/file/3a5b84a290d7/xpcom/base/nsExceptionService.cpp#l95 already documents that it can be released on a different thread during shutdown.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 606592
Turns out that the patch in bug 606592 to address comment 84 also resolves the quite mysterious test_records_crypto.js failure.  I can't explain where the failure comes from, but I've got multiple try runs demonstrating completely green trees with that patch applied.
Oh joy. Some more try runs revealed that the test_records_crypto.js failure is alive and well, despite the patches in bug 606592 and bug 450812.  More investigation required.
Attachment #392624 - Attachment is obsolete: false
Attachment #409642 - Attachment description: fix do_timeout → fix do_timeout [checked in comment #50]
Attachment #409843 - Attachment description: fix subtle QI bug in test_cookie_header.js → fix subtle QI bug in test_cookie_header.js [checked in comment 50]
Attachment #409896 - Attachment description: fix staller.sjs (v2) → fix staller.sjs (v2) [checked in comment 70]
Attachment #410013 - Attachment description: Fix RawTest QI issue → Fix RawTest QI issue [checked in comment 70]
Attachment #410631 - Attachment description: Fix additional RawTest bugs → Fix additional RawTest bugs [checked in comment 70]
Attachment #481089 - Attachment description: Ensure input stream lifetime is not cut short by a GC. → Ensure input stream lifetime is not cut short by a GC. [checked in comment 80]
Attachment #481090 - Attachment description: Additional diagnostics to report more data about exceptions in tests → Additional diagnostics to report more data about exceptions in tests [checked in comment 80]
Since the last time I sent this to try, test_records_crypto.js has been changed in such a way that we don't hit the mysterious error any longer.  Of course, parser/htmlparser/tests/mochitest/test_bug543062.html now times out, but for a very sensible reason.  In file_bug543062.sjs, the request handler calls armTimer, which creates a timer in a local variable.  When this function finishes and the var goes out of scope, the timer is canceled and destroyed.
Depends on: 619026
Great, if you can fix that maybe we can land this before anyone else accidentally adds new dependencies on not GCing :-)
Depends on: 619164
Depends on: 619167
Depends on: nsITimer-fail
No longer depends on: 619026
I had a fully green try run at http://tbpl.mozilla.org/?tree=Try&pusher=josh@joshmatthews.net&rev=0426e35b1403. I'm going for it!
Whiteboard: fixed-in-cedar
http://hg.mozilla.org/mozilla-central/rev/7ee0fe6aec58
Status: REOPENED → RESOLVED
Closed: 14 years ago13 years ago
Resolution: --- → FIXED
Whiteboard: fixed-in-cedar
Target Milestone: --- → mozilla7
Josh, jsreftests have started intermittently timing out on Android (see bug 662468), with a lot of GC spew from xpcshell. Do you think this bug might have introduced that?
for the tegras we run xpcshell on a host computer and access it remotely from Android.  Maybe it is related to this patch, but it wouldn't be related to httpd.js and xpcshell binary.
If this patch is responsible for a change in behaviour, it would be that httpd.js is now GCing more frequently. As Joel points out, that means a remote machine is doing the GCing, not the fennec process on Android. I'm having a hard time coming up with any reasons this patch could be causing the jsreftest problems.
(In reply to comment #95)
> If this patch is responsible for a change in behaviour, it would be that
> httpd.js is now GCing more frequently. As Joel points out, that means a
> remote machine is doing the GCing, not the fennec process on Android. I'm
> having a hard time coming up with any reasons this patch could be causing
> the jsreftest problems.

See bug 648104, comment 9.  One of the issues with the reftest family of tests on android is that we are forced to run all reftests at a very high resolution (for a tegra), which significantly impacts available memory on the tegras.
(In reply to comment #96)
> See bug 648104, comment 9.  One of the issues with the reftest family of
> tests on android is that we are forced to run all reftests at a very high
> resolution (for a tegra), which significantly impacts available memory on
> the tegras.

Yes, but that was the case before this patch as well, so I don't see how that's relevant.
(In reply to comment #95)
> If this patch is responsible for a change in behaviour, it would be that
> httpd.js is now GCing more frequently. As Joel points out, that means a
> remote machine is doing the GCing, not the fennec process on Android. I'm
> having a hard time coming up with any reasons this patch could be causing
> the jsreftest problems.

Josh, thanks for the explanation: I agree, it is almost certainly not due to the GC change.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: