Closed Bug 924415 Opened 11 years ago Closed 11 years ago

Find out why setting chromemargin to 0,-1,-1,-1 is so expensive for TART on UX branch on OS X.

Categories

(Firefox :: Tabbed Browser, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: mconley, Assigned: mconley)

References

Details

(Keywords: perf)

Attachments

(3 files)

Last night, I pushed some patches to try, measuring some tweaks to see what (if anything) about our tabs on OS X could be related to the TART regression.

I found that if I removed Cocoa from the list of widget backends that CAN_DRAW_IN_TITLEBAR, *the regression went down sharply*.

Here's my evidence...

m-c Baseline: https://tbpl.mozilla.org/?tree=Try&rev=7150cb170352
UX Baseline: https://tbpl.mozilla.org/?tree=Try&rev=363f0ff307d5
No CAN_DRAW_IN_TITLEBAR: https://tbpl.mozilla.org/?tree=Try&rev=fbf27c6d492c

Here's a comparison between m-c and UX to get us all familiar with the current regression: http://compare-talos.mattn.ca/?oldRevs=7150cb170352&newRev=363f0ff307d5&server=graphs.mozilla.org&submit=true

And here's a comparison between m-c and the patch that disables CAN_DRAW_IN_TITLEBAR: http://compare-talos.mattn.ca/?oldRevs=7150cb170352&newRev=fbf27c6d492c&server=graphs.mozilla.org&submit=true

It's not perfect, but it's certainly quite a bit better.

So I want to explore what is expensive about CAN_DRAW_IN_TITLEBAR, and see what we can do about it.
Whiteboard: [Australis:P1] → [Australis:P1][Australis:M?]
Reflow comparison profile for Snow Leopard that sets before as not having CAN_DRAW_IN_TITLEBAR, and after as UX baseline.

That way, slow stuff should, theoretically, float to the top.
And here's an SPS profile using the same technique - before was not having CAN_DRAW_IN_TITLEBAR, and after is UX baseline:

http://tests.themasta.com/cleopatra/?report=e1d3c78a15bd0c5c332232aa2d131ad6b03d6cea
This is really interesting!

The SPS profile shows that almost the whole CAN_DRAW_IN_TITLEBAR regression can be attributed to slower GLContextCGL::SwapBuffers - that is, we're waiting for the GPU to composite the layers into the window.

It's not clear to me why composition is slower with CAN_DRAW_IN_TITLEBAR. Composition should be really quick, the window doesn't consist of all that many layers. One difference in composition between CAN_DRAW_IN_TITLEBAR and not is nsChildView::MaybeDrawTitlebar, which draws a layer with the window buttons on top when we draw in the titlebar. This might account for the difference in composition speed, I guess... but this would still be very surprising to me. Mike, can you get another measurement of a UX build in which you don't turn off CAN_DRAW_IN_TITLEBAR but instead add an early return at the beginning of nsChildView::MaybeDrawTitlebar?

The reflow profile doesn't measure composition speed so it doesn't show this part of the regression. What it does show is more expensive #nav-bar background drawing and more expensive -moz-appearance:titlebar drawing. No idea why that happens yet.
(In reply to Markus Stange [:mstange] from comment #3)
> Mike, can you get another measurement of a UX build in
> which you don't turn off CAN_DRAW_IN_TITLEBAR but instead add an early
> return at the beginning of nsChildView::MaybeDrawTitlebar?
> 

K, done - try build baking here: https://tbpl.mozilla.org/?tree=Try&rev=cb494e531477
OK, so MaybeDrawTitlebar has nothing to do with the slowdown.

Another hypothesis: It could just be the increased OpenGL context size. The outer window bounds are the same in both cases, but when we draw in the titlebar, the OpenGL context is bigger because it has to cover the titlebar, too.
Not sure how we could test if this is what causes the difference. We could maybe instrument TART to reduce the window height by 22px before starting the test... but I wonder if it's worth doing that.
Doesn't CAN_DRAW_IN_TITLEBAR influence the JS we run (specifically, the TabsInTitlebar math) and perhaps also the CSS (assuming we correctly set the tabsintitlebar attribute) ? Do we know whether it's the native part or the JS/CSS part of this that's causing slowdown?
I did an experiment last night, and removed the parts of browser.xul and browser.js that set the chromemargin attribute to "0,-1,-1,-1".

Here is a comparison against UX:

http://compare-talos.mattn.ca/?oldRevs=363f0ff307d5&newRev=905522d61d73&server=graphs.mozilla.org&submit=true

And here is a comparison against m-c:

http://compare-talos.mattn.ca/?oldRevs=7150cb170352&newRev=905522d61d73&server=graphs.mozilla.org&submit=true

This very much resembles the performance win noted in comment 0 with setting CAN_DRAW_IN_TITLEBAR to false.

So I think we've constrained the problem a little more here.
Summary: Find out why CAN_DRAW_IN_TITLEBAR is so expensive for TART on UX branch on OS X. → Find out why setting chromemargin to 0,-1,-1,-1 is so expensive for TART on UX branch on OS X.
(In reply to :Gijs Kruitbosch from comment #6)
> Doesn't CAN_DRAW_IN_TITLEBAR influence the JS we run (specifically, the
> TabsInTitlebar math) and perhaps also the CSS (assuming we correctly set the
> tabsintitlebar attribute) ? Do we know whether it's the native part or the
> JS/CSS part of this that's causing slowdown?

I would venture that this is a native slowdown, triggered when we start drawing in the titlebar by setting that chromemargin attribute.
(In reply to :Gijs Kruitbosch from comment #6)
> Doesn't CAN_DRAW_IN_TITLEBAR influence the JS we run (specifically, the
> TabsInTitlebar math) and perhaps also the CSS (assuming we correctly set the
> tabsintitlebar attribute) ? Do we know whether it's the native part or the
> JS/CSS part of this that's causing slowdown?

The SPS comparison profile from comment 2 is pretty clear about the difference coming from SwapBuffers and not not from any JS or reflow.
So here's more information - here's a recent comparison profile between the tips of UX and m-c. Note that as part of this experiment, I slowed down the animations by 5x in an attempt to gain more samples.

Here's the profile:  http://tests.themasta.com/cleopatra/?report=4a7422ede34e2e563b6b05e0f43e7ae0ad46d868

Invert the call stack, and you'll see that we spend a great deal of time waiting for the GPU to composite in "mach_msg_trap" - about 449ms in total.

Now here's an older comparison profile where I removed the CAN_DRAW_IN_TITLEBAR build pref, compared against UX tip:

http://tests.themasta.com/cleopatra/?report=827030934d27a425cd717389126036d7d1c80f3d

Again, invert the call stack, and then scroll way down to the bottom. The UX branch spends about 248ms less time compositing when we don't draw in the titlebar.

On Tuesday, I'll take one of the builds where I slowed down the tabs by 5x, and run it through the OpenGL Profiler to see if I can determine what's taking a long time to composite.
More data - I used the OpenGL Profiler and collected stats on each OpenGL function that's called during TART. I've put my findings into this spreadsheet:

https://docs.google.com/spreadsheet/ccc?key=0Av64yYvszN34dFlOcnhLRW5MMDVzTy1qRkdlampaWmc&pli=1#gid=19

BenWa and mattwoodrow took a look at it with me, and our observations were as follows:

* UX is making fewer calls to CGLFlushDrawable, but its average time executing it is higher
* UX also seems to spend longer executing glTexImage2D than m-c
* UX never uses CopyTexImage2D, which shows that it's branching quite a bit differently than m-c
* There are definitely differences in which functions UX is executing compared to m-c, and how frequently it's executing them. Here's a spreadsheet showing the differences: https://docs.google.com/spreadsheet/ccc?key=0Av64yYvszN34dFlOcnhLRW5MMDVzTy1qRkdlampaWmc&pli=1#gid=20

Matt Woodrow asked me for some logging output for TART for a debug build with MOZ_DUMP_PAINT_LIST=1. I did that, and extracted only the spew that occurs when TART is recording. I did so for the simple-open-DPI1, simple-close-DPI1, icon-open-DPI1 and icon-close-DPI1 subtests. I'll attach those next.
Anything particularly illuminating in these logs, Matt?
Flags: needinfo?(matt.woodrow)
Ok, let me summarize a conversation that I just had with mattwoodrow in #gfx:

Here's where we stand:

1) We've shown that the 10.6 OS X regression comes almost entirely from the chromemargin attribute being set on the main window. Setting that attribute gives us the ability to draw in the titlebar, which we need to do for both tabs in titlebar and lw-themes.

2) We've shown that what's happening is that UX, when drawing in the titlebar like this (with or without tabs up there), we seem to spend a lot more time waiting for the GPU to composite than m-c does.

3) OpenGL Profiler shows that we've doubled some of our GL calls, and also shows that while we call CGLFlushDrawable less for UX (this is the GL call to paint the frame), we spend more time doing it. Other findings are listed in comment 11, with a full breakdown of the difference in the spreadsheet of comment 11.

4) The MOZ_DUMP_PAINT_LIST logs for both UX and m-c (which dump the layer tree) were examined by mattwoodrow, and he's been unable to find anything particularly interesting in the differences between the two. 

mattwoodrow writes that the facts from (3) and (4) seem to contradict themselves, which suggests we're missing something, but he's not sure what.

According to him, it's unlikely to be a driver or hardware issue on the talos box, and more likely to be a platform issue.

He's suggested I pose this particular problem to Milan and the rest of Graphics in Toronto to see if they can devote some cycles to figuring it out.
Flags: needinfo?(matt.woodrow)
Hey Milan,

Do you know if anybody on the Graphics team has cycles to look at this particular problem, now that we've identified it?

-Mike
Flags: needinfo?(milan)
I'll leave a needinfo on me for this and deal with it next week - graphics is in the work week right now.  Let me know if that's not ok.
st3fan and I were just poking at this.

We've noticed that in Quartz Debugger, if we turn on "Flash screen updates", and turn off Hardware Acceleration, the UX build seems to paint in the region where the window title used to go when opening and closing tabs with TART.

Is it possible we're wasting a great deal of time drawing the title of the window, and then overwriting it with the toolbar gradient? Or is none of this valid because I had to disable HWA in order to reveal it?

Markus - does anything in here sound suspect to you?
Flags: needinfo?(mstange)
I'll leave for Markus to answer this.
(In reply to Mike Conley (:mconley) from comment #18)
> st3fan and I were just poking at this.
> 
> We've noticed that in Quartz Debugger, if we turn on "Flash screen updates",
> and turn off Hardware Acceleration, the UX build seems to paint in the
> region where the window title used to go when opening and closing tabs with
> TART.

This is very interesting!

> Is it possible we're wasting a great deal of time drawing the title of the
> window, and then overwriting it with the toolbar gradient? Or is none of
> this valid because I had to disable HWA in order to reveal it?

I can't say at the moment, but I'll look into it. It's definitely possible that some of this affects us with HWA on, too. I'll keep the needinfo to myself open until I've played with it a little.
Flags: needinfo?(milan)
I should point out that I ran an experiment this weekend where I disabled drawing the title in the titlebar completely for OS X[1], confirmed that the title area wasn't being re-painted during TART via the Quartz Debugger, and then pushed to try. No measurable performance change compared against a baseline UX. :/

Re-setting the needinfo on Milan, since it looks like he wanted this as a reminder for possibly getting us some Graphics-team help on this.

And I guess I'll clear the needinfo on mstange, since this whole title thing sounds like a blind alley.

[1]: Here's the patch: https://pastebin.mozilla.org/3367350
Flags: needinfo?(mstange) → needinfo?(milan)
I'm going to take a look at this today.
Flags: needinfo?(milan)
So we're getting 3 extra draws from RectTextureImage::Draw(). Perhaps this isn't that surprising... mconley is doing a push that will tell us if these draws are what's making the difference.
So this is real interesting - mstange and avih (and maybe others) have been working on getting the frame recorder to work with OMTC enabled. I took the patch from mstange, modified talos to run TART without disabling OMTC, and pushed to try.

Here are those try builds:

m-c: https://tbpl.mozilla.org/?tree=Try&showall=1&rev=166a6634a7b3
UX: https://tbpl.mozilla.org/?tree=Try&showall=1&rev=8c2ff3135e59

And here's a compare-talos between m-c and UX with OMTC enabled (with m-c as "original" and UX as "new"):

http://compare-talos.mattn.ca/?oldRevs=166a6634a7b3&newRev=8c2ff3135e59&server=graphs.mozilla.org&submit=true

I don't see a regression anymore - in fact, I see some pretty nice wins with UX.

So, some questions:

1) Are these numbers relative / to be believed?
2) If not, what are we going to do once OMTC is no longer disable-able?
(In reply to Mike Conley (:mconley) from comment #24)
> So, some questions:
> 
> 1) Are these numbers relative / to be believed?

Yes and No. These numbers do suggest that we're able to paint faster with UX than M-C. This is a good thing.

Also, we saw that RectTextureImage::EndUpdate() was being called a lot on UX non-OMTC and not on OMTC. It seems reasonable that calling EndUpdate() a lot was causing the performance regression we were seeing.

> 2) If not, what are we going to do once OMTC is no longer disable-able?

The current plan is to capture frame times from the compositor. That should give us better numbers than we currently have.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #25)
> (In reply to Mike Conley (:mconley) from comment #24)
> > So, some questions:
> > 
> > 1) Are these numbers relative / to be believed?
> 
> Yes and No. These numbers do suggest that we're able to paint faster with UX
> than M-C. This is a good thing.
> 
> Also, we saw that RectTextureImage::EndUpdate() was being called a lot on UX
> non-OMTC and not on OMTC. It seems reasonable that calling EndUpdate() a lot
> was causing the performance regression we were seeing.

Turns out this was wrong. It looks like we still need to investigate EndUpdate().
Which part is wrong?

The reason I thought our RectTextureImages were not to blame was that disabling nsChildView::MaybeDrawTitlebar did not show any TART wins. The EndUpdate call for the titlebar RectTextureImage is made in that function.

However, EndUpdate should only be called after we get a viewWillDraw call with a dirty region that intersects the titlebar. And even without OMTC, that should only happen very rarely because we don't use native setNeedsDisplayInRect invalidation when using GL.
Oh, right, changing the window title causes native invalidations. But according to comment 21, circumventing those does not help either.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #25)
> (In reply to Mike Conley (:mconley) from comment #24)
> > So, some questions:
> > 
> > 1) Are these numbers relative / to be believed?
> 
> Yes and No. These numbers do suggest that we're able to paint faster with UX
> than M-C. This is a good thing.

Just adding that the "and No" part is because these numbers only reflect the main thread load - where UX is apparently better than m-c.

The actual frame rate is the slower of the two (main/compositor thread), and currently for tab animation it appears that the compositor is the bottleneck. So either syncing the threads (which we initially thought was the case already) or measuring at the compositor would provide more relevant numbers.
(In reply to Avi Halachmi (:avih) from comment #29)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #25)
> > (In reply to Mike Conley (:mconley) from comment #24)
> > > So, some questions:
> > > 
> > > 1) Are these numbers relative / to be believed?
> > 
> > Yes and No. These numbers do suggest that we're able to paint faster with UX
> > than M-C. This is a good thing.
> 
> Just adding that the "and No" part is because these numbers only reflect the
> main thread load - where UX is apparently better than m-c.
> 
> The actual frame rate is the slower of the two (main/compositor thread), and
> currently for tab animation it appears that the compositor is the
> bottleneck. So either syncing the threads (which we initially thought was
> the case already) or measuring at the compositor would provide more relevant
> numbers.

Ok, good to know.

So, more data:

1) Jeff and I have determined that causing nsChildView's isCoveringTitlebar to always return false *does not* affect the regression.
2) We've determined that causing nsChildView's isCoveringTitlebar to always return false causes the OpenGL call frequency to more or less match the call frequency when chromemargin wasn't being set on the main-window.[1]

From #1 and #2, we have to conclude that the OpenGL Profiles / call frequency have nothing to do with this regression.

[1]: See the green column: https://docs.google.com/spreadsheet/ccc?key=0Av64yYvszN34dFlOcnhLRW5MMDVzTy1qRkdlampaWmc&pli=1#gid=22
So I've profiled the WindowServer and noticed that we spend a lot more time calling affineGLAccel with chromemargin set to 0,-1,-1,-1. The window server seems to be iterating over 'shapes'/regions in _CGXGLCompositeLayer_ and calling affineGLAccel for each one. This makes me wonder if we're sending different dirty regions to the window server with chromemargin.
Some pretty big developments on this bug in the last 24 hours:

1) Jeff figured out that the frequent calls to affineGLAccel were related to the position of the window. In particular, when we're drawing in the titlebar[1] of a window, and that titlebar overlaps the titlebar of *another* window (in this case, the smaller talos window that spawns before the browser opens), the OS X WindowServer detects that overlap and iterates over a lot of shapes...and stuff. I'm mostly just piecing together what I've heard Jeff say, and he might want to jump in with more technical details / accuracy.

2) I ran a TART run for OSX where before executing the test, we moved the window down 22px so that the titlebars did not overlap. The regression disappeared. Here's the compare talos between m-c where we move the window down 22px, and a UX where we move the window down 22px:

http://compare-talos.mattn.ca/?oldRevs=306d00762934&newRev=9c093efe5054&server=graphs.mozilla.org&submit=true

So, this is great news - we've really narrowed down where the regression is coming from.

Some questions:

1) Is this something we can write off as a "feature" of OS X's WindowServer compositor? Or is there something we can do to make the WindowServer not call affineGLAccel as often while still drawing in the titlebar?

2) If this *is* something we can write off as an OS X issue that we can't get around, should we modify TART to drop the window 22px before the run in order to make it easier to detect regressions in the future?

3) Also, again, if this *is* something we can write off as an OS X issue that we can't get around, can we go ahead and remove this as a landing blocker?

Sending out a spew of needinfo's to get answers to the above 3 questions.
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(dolske)
Flags: needinfo?(avihpit)
It would be interesting to know if running m-c with the window moved up by 22px causes a similar regression.
(In reply to Mike Conley (:mconley) from comment #32)
> 2) If this *is* something we can write off as an OS X issue that we can't
> get around, should we modify TART to drop the window 22px before the run in
> order to make it easier to detect regressions in the future?

It could happen in practice that the Firefox window overlaps another window, so the talos scenario, while not the only one or the best performing one is still valid. Also, apparently this scenario does regress performance, so I think it's something which should be reflected at the numbers (and might change in the future for whatever reasons).

For these reasons, I don't think we should move the window for TART, and we should let it reflect this perf issue.


> 3) Also, again, if this *is* something we can write off as an OS X issue
> that we can't get around, can we go ahead and remove this as a landing
> blocker?

I'm good with not blocking on this if it turns out to be an OS X issue. As we've mentioned earlier on IRC, on OS X tab animations are currently not visibly slower on UX than on m-c.

I think the most important thing about this regression is being able to attribute it to something, which you've done. So now we know.
Flags: needinfo?(avihpit)
Flags: needinfo?(dolske)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #33)
> It would be interesting to know if running m-c with the window moved up by
> 22px causes a similar regression.

It was not possible to move the m-c window 22px up, since it was already at the top of the screen.

Instead, I moved the other window down 22px, so that the animating section of mozilla-central's tabs overlapped the other window.

No appreciable difference in performance according to compare-talos: http://compare-talos.mattn.ca/?oldRevs=306d00762934&newRev=279b1b57ba6f&server=graphs.mozilla.org&submit=true
Flags: needinfo?(jmuizelaar)
Summary from a conversation I just had with Jeff (and a few others):

* We've found the root cause of this regression - the OS X WindowServer is doing crazy affineGLAccel business when the UX titlebar overlaps the titlebars of other windows of the same process (in the talos case, that's the pageloader window). This is where the regression is coming from. The WindowServer doesn't not seem to regress when the UX titlebar overlaps the titlebars of windows from other processes.
* Jeff is going to look today to see if there's anything we can do to convince the WindowServer to not do this, but he doesn't have high hopes.

His conclusion is that now that we know that this is the cause of the regression, and that it's highly likely that trying to convince the WindowServer to not do this is more trouble than it's worth, that this should not block Australis landing on mozilla-central.
On what versions of OS X does this problematic window server behavior occur?
(In reply to Josh Aas (Mozilla Corporation) from comment #37)
> On what versions of OS X does this problematic window server behavior occur?

The effect is most pronounced in 10.6. Also appears to be present in 10.7, but I would argue that it's a bit better there. For 10.8, the regression is not detectable.
(In reply to Mike Conley (:mconley) from comment #36)
> WindowServer doesn't not seem to regress when the UX titlebar overlaps the
> titlebars of windows from other processes.

Hooo boy, nice double-negative on my part. I should have written: "The WindowServer does not seem to regress when the UX titlebar overlaps the titlebars of windows from other processes."
How do you feel, dolske? With the above information, are you comfortable removing this as a landing blocker?
Flags: needinfo?(dolske)
A summary of a conversation that dolske, jrmuizel and I just had:

* Jeff believes the following:
  * This problem is too hard to be worth spending our time on, considering how small the regression is
  * There are other things we could do to give us 4% improvement across the board
  * Other apps (Safari for example) already hit this problem, but that's when they draw *everything*. We're actually in a better position than them.
  * Blocking on this bug would be foolish / a waste of time

* Dolske believes the following:
  * The severity of the regression, coupled with the difficulty of the fix and what's actually needed to trigger it (same-process titlebars overlapping), he thinks we're solid to remove this as a landing blocker.

Clearing dolske's needinfo, and removing this bug as a landing blocker.
Flags: needinfo?(dolske)
Whiteboard: [Australis:P1][Australis:M?]
So the stated summary of this bug is finding out why setting chromemargin was causing a TART regression. We've found out why, and so I think it's safe to close out this bug.

Marking as WONTFIX, since we're not going to be pursuing this investigation (or a fix) any further at this time.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: