Closed Bug 1525983 Opened 5 years ago Closed 5 years ago

Non-optimal GC behavior on skeletal animation testcase

Categories

(Core :: JavaScript: GC, defect, P2)

defect

Tracking

()

RESOLVED FIXED
Performance Impact low
Tracking Status
firefox-esr60 --- unaffected
firefox67 --- wontfix
firefox68 --- fixed
firefox69 --- fixed

People

(Reporter: mstange, Assigned: pbone)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: perf, perf:responsiveness, regression)

Attachments

(1 file)

The testcase from bug 1517916 shows gradually rising memory in Firefox, whereas in Chrome, memory stays really flat.

There also seem to have been two recent regressions in terms of GC times on the testcase, so much so that it's now impacting the animation frame rate. GC behavior went from almost no GC to quite frequent GC, with a step in between, over the last two or three weeks.

Almost no GC: https://perfht.ml/2WQzR9p
Around one GC per second: https://perfht.ml/2WX2EJI
Very frequent GC: https://perfht.ml/2WNbjy9

(These profiles are captured with very minimal profiler features - no JS stacks so that GC behavior would not be impacted in any way.)

I'm also attaching a screenshot from the Chrome devtools that shows no long-term rise in memory.

My guess (based on bug 1520286) is that the garbage objects are all very small short-lived typed arrays that represent matrices.
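For concreteness, the kind of allocation pattern I have in mind would look roughly like this (a hypothetical sketch of the testcase's shape, not its actual code):

// Hypothetical sketch, not the actual testcase code: each animation frame
// allocates many small, short-lived Float32Arrays for 4x4 bone matrices,
// all of which are garbage by the next frame.
const BONES = 200;

function frame() {
  for (let i = 0; i < BONES; i++) {
    const m = new Float32Array(16);   // a fresh 64-byte typed array per bone
    m[0] = m[5] = m[10] = m[15] = 1;  // identity; stand-in for the real math
    // ... pose the mesh with m, then drop it ...
  }
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);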

Bug 1510560 changes our pretenuring conditions, which can have an effect on how quickly various heaps fill up and therefore when we collect them.

Bug 1497873 is also a suspect, and might be the cause for the nursery size to be large in the last profile.

It's worth trying to find a regression range; we should start with the above two bugs.

NI myself to find the regression range.

Has Regression Range: --- → no
Flags: needinfo?(pbone)
Priority: -- → P2
Whiteboard: [qf?]

(In reply to Paul Bone [:pbone] from comment #1)

Bug 1510560 changes our pretenuring conditions, which can have an effect on how quickly various heaps fill up and therefore when we collect them.

Whoops, that bug has no patches and hasn't even landed. What I was thinking of was the 2nd patch from Bug 1497873 below.

Bug 1497873 is also a suspect, and might be the cause for the nursery size to be large in the last profile.

Hi Markus, I'm confused by the build IDs in these profiles; they don't match what you've described. Could you check my work?

(In reply to Markus Stange [:mstange] from comment #0)

There also seem to have been two recent regressions in terms of GC times on the testcase, so much so that it's now impacting the animation frame rate. GC behavior went from almost no GC to quite frequent GC, with a step in between, over the last two or three weeks.

Almost no GC: https://perfht.ml/2WQzR9p

Name:Firefox
Version:66.0
Build ID:20190110214210

I agree: I see no GC Major/Slice markers, and plenty of GC minor slices, but all are OUT_OF_NURSERY with a 1MB nursery. This is pretty much generational GC working as designed.

There were 5409 nursery collections taking 2% of all the elapsed time.
Mean: 0.07ms ±0.01
Median: 0.07ms
90th percentile: 0.09ms
Max: 0.20ms
Total: 395ms / 19,916ms (2.0%)

Around one GC per second: https://perfht.ml/2WX2EJI

Name:Firefox
Version:67.0
Build ID:20190202094451

4040 pauses for Nursery collections:
Mean: 0.09ms ±0.02
Median: 0.09ms
90th percentile: 0.11ms
Max: 0.21ms
Total: 366ms / 19,636ms (1.9%)

So about the same for nursery behaviour.

204 pauses for Major slices:
Mean: 3.75ms ±2.53
Median: 5.00ms
90th percentile: 5.04ms
Max: 14.0ms
Total: 765ms / 19,636ms (3.9%)

But now there are major GCs as well. They're still fairly quick, and probably only the slowest 10% or so could result in a dropped frame, though that depends on what else is happening (judging from this data alone).

These 204 slices belong to several major GCs; each major GC has between 5 and 10 slices, is started by ALLOC_TRIGGER, and seems to free around half the engine's memory. The memory directly used by the engine is fairly stable, cycling between ~50MB and ~112MB. The increase in total memory usage over time is either elsewhere or indirect, and is not freed by these zonal GCs.

Very frequent GC: https://perfht.ml/2WNbjy9

Name:Firefox
Version:67.0
Build ID:20190131214909

NOTE: this version is OLDER than the previous one.

298 pauses for Nursery collections:
Mean: 7.62ms ±2.92
Median: 8.99ms
90th percentile: 10.2ms
Max: 11.2ms
Total: 2,270ms / 19,998ms (11%)

116 pauses for Major slices:
Mean: 13.5ms ±16.1
Median: 6.58ms
90th percentile: 38.0ms
Max: 86.5ms
Total: 1,570ms / 19,998ms (7.9%)

This profile is spending almost 20% of its time in GC, with a large increase in the time spent collecting the nursery. Looking further, this run is using a much larger nursery (16MB) with more data being tenured (>25%). A reasonable guess is that something is being pretenured, which is then causing more objects to be tenured and creating more work for the GC.

This is why I thought of Bug 1497873, which did land in this time frame, except that this profile is from an earlier build and the previous one is from a later build. So if anything the above bug improved things (which is likely, since one thing it did was avoid any pretenuring when the nursery is small).

Markus, am I understanding this correctly, that this wasn't a regression but an improvement? Or what am I missing?

(The statistics in this comment are from an experimental patch I have for the profiler.)

Flags: needinfo?(pbone) → needinfo?(mstange)

(In reply to Markus Stange [:mstange] from comment #0)

Created attachment 9042214 [details]
Screen Shot 2019-02-02 at 1.57.35 PM.png

The testcase from bug 1517916 shows gradually rising memory in Firefox, whereas in Chrome, memory stays really flat.

I wasn't able to find the cause of this with about:memory; perhaps I need DMD.

https://hg.mozilla.org/mozilla-central/rev/fe279cc0e629 (bug 1520286) added support for inlining the TypedArray constructor when it is called with an object argument (which is what happens in this demo a couple of million times per second). Additionally, when we inline the TypedArray constructor we now also take the ObjectGroup's pre-tenure flag into account. That means that before bug 1520286, all TypedArrays (when the TypedArray constructor was called with an object) were always initially created in the nursery, but now some TypedArrays may get pre-tenured.

That could explain why there are more major GCs in the second profile (the only profile including the patch from bug 1520286), because we now allocate tenured TypedArrays. The second profile includes calls to js::NewObject, which indicates the new-object cache wasn't used; the cache can only be used for cacheable objects, i.e. objects with NewObjectKind::GenericObject. In other words, because the profile indicates the new-object cache sometimes isn't used for TypedArrays, we get a hint that some TypedArray objects in the second profile are pre-tenured, which could have led to more major GCs.
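For illustration, the call shape being discussed looks roughly like this (a hypothetical example, not code taken from the demo):

// The TypedArray constructor called with an object argument (here another
// typed array). Before bug 1520286 these allocations always started in the
// nursery; with the inlined path, a hot allocation site whose ObjectGroup is
// flagged pre-tenure can now allocate directly in the tenured heap.
const source = new Float32Array(16);

function copyMatrix() {
  return new Float32Array(source);   // constructor called with an object
}

for (let i = 0; i < 1000000; i++) {
  copyMatrix();                      // hot site: a pre-tenuring candidate
}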

Whiteboard: [qf?] → [qf:p3:responsiveness]

Removing the regressionwindow-wanted tag as the issue was identified in comment 5.

Jon, could you look at this, or help me find someone on the GC team who can?

Flags: needinfo?(jcoppeard)

In comment 3 I pointed out that the 3rd profile is from a Firefox version older than the 2nd profile. So I think this represents a performance improvement, not a regression. This bug still has an NI for Markus to check this.

I tested this manually on builds downloaded by mozregression. I ran the animation for ~10 (not timed accurately!) and compiled some stats about GC behaviour:

Path:               Major  Minor Average Average  Average  Total
                    Count  Count PR      nursery  time /µs time /ms
                                         size /MB
output-19-01-10.txt     71  4261    0.5%        1      108      460
output-19-01-15.txt     86  4581    0.5%        1      102      469
output-19-01-20.txt    148   338   19.4%       12     5074     1715
output-19-01-25.txt    152   341   26.0%       12     5916     2017
output-19-01-30.txt    139   334   29.3%       12     6150     2054
output-19-02-05.txt    187  4208    0.7%        1      121      509
output-19-02-10.txt    203  4798    1.0%        1      132      633
output-19-02-15.txt    204 18047    1.2%        1       45      820
output-19-02-20.txt    269 21978    0.5%        1       37      830
output-19-02-28.txt    201 15626    0.4%        1       31      488

Paths above are named after the build date.

It looks like there are three changes of GC behaviour in this time frame. I'll investigate further next week.
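For reference, columns like these could be computed from per-collection records roughly as follows (a sketch only; the record field names are assumptions, not the actual tooling or log format used):

// Sketch of deriving the table columns from per-GC records. The field names
// (kind, durationUs, nurseryBytes, tenuredBytes) are assumptions.
function summarize(records) {
  const minors = records.filter(r => r.kind === "minor");
  const majors = records.filter(r => r.kind === "major");
  const sum = (xs, f) => xs.reduce((acc, x) => acc + f(x), 0);
  const n = Math.max(1, minors.length);

  return {
    majorCount: majors.length,
    minorCount: minors.length,
    // Promotion rate: fraction of nursery bytes that survived and were tenured.
    promotionRate: sum(minors, r => r.tenuredBytes) /
                   Math.max(1, sum(minors, r => r.nurseryBytes)),
    avgNurseryMB: sum(minors, r => r.nurseryBytes) / n / (1024 * 1024),
    avgMinorTimeUs: sum(minors, r => r.durationUs) / n,
    totalMinorTimeMs: sum(minors, r => r.durationUs) / 1000,
  };
}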

There are a couple of things going on here:

  1. Sometimes we tenure lots of objects during nursery collection (builds between 19-01-20 and 19-01-30 above). There are two possible reasons for this: either our pre-tenuring heuristics are failing, or the timing of nursery collections is falling into a pattern which causes more to be tenured. If it's the former I'd expect the major GC count to increase when this is happening, but from the results above this isn't clear (the count increases when this behaviour starts, but then increases slightly again when it ends). The latter doesn't seem that plausible either though - I wouldn't expect the behaviour to fall so clearly into two different modes.

  2. Now that we allow smaller nursery sizes we can end up collecting the nursery very frequently if we don't tenure anything (builds from 19-02-15 above). Small nursery sizes are a memory saving but this does mean that fixed overheads in nursery collection (like tracing the runtime) become an increasing burden. Perhaps we should take collection frequency into account when sizing the nursery or perhaps we should not allow the nursery of the foreground runtime to become so small.
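Regarding point 2, a minimal model of what frequency-aware nursery sizing could look like (a sketch of the idea only, not the actual SpiderMonkey heuristic; the thresholds are invented):

// Grow the nursery when minor GCs are too frequent or promotion is high,
// shrink it when collections are cheap and rare. All thresholds are made up.
const MIN_NURSERY_BYTES = 192 * 1024;
const MAX_NURSERY_BYTES = 16 * 1024 * 1024;

function nextNurserySize(currentBytes, promotionRate, msSinceLastMinorGC) {
  let size = currentBytes;
  if (promotionRate > 0.10 || msSinceLastMinorGC < 5) {
    size *= 2;      // tenuring a lot, or collecting too often: grow
  } else if (promotionRate < 0.01 && msSinceLastMinorGC > 100) {
    size /= 2;      // cheap and infrequent collections: shrink to save memory
  }
  return Math.min(MAX_NURSERY_BYTES, Math.max(MIN_NURSERY_BYTES, size));
}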

Flags: needinfo?(jcoppeard)
Depends on: 1532271
Depends on: 1532376

Thanks Jon.

What do the last two columns measure: the time for nursery collections, major slices, elapsed time, etc.? Without knowing that I can't tell which rows are good or bad.

(In reply to Jon Coppeard (:jonco) from comment #10)

  2. Now that we allow smaller nursery sizes we can end up collecting the nursery very frequently if we don't tenure anything (builds from 19-02-15 above). Small nursery sizes are a memory saving but this does mean that fixed overheads in nursery collection (like tracing the runtime) become an increasing burden. Perhaps we should take collection frequency into account when sizing the nursery or perhaps we should not allow the nursery of the foreground runtime to become so small.

The theory is that even with some (but not a lot of) overhead this may be faster, because less memory is touched and therefore the working set is more likely to stay in cache (L2 or L3, since we don't get small enough for L1). What I found was that a 160KB nursery was the sweet spot between the savings from better use of cache and the cost of more frequent collections (even on tests that allocated very frequently, causing very frequent nursery collection; that's the exact usage I was thinking of). I was underwhelmed by this because I expected a smaller sweet spot. I set the limit to 192KB, which I felt was more conservative.
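As a back-of-the-envelope illustration of why collection frequency matters at small sizes (the allocation rate here is an assumption, not a measurement from this testcase):

// How often the nursery fills at an assumed allocation rate of 100 MB/s.
const allocBytesPerSec = 100 * 1024 * 1024;
for (const nurseryKB of [192, 1024, 16 * 1024]) {
  const collectionsPerSec = allocBytesPerSec / (nurseryKB * 1024);
  console.log(`${nurseryKB}KB nursery -> ~${collectionsPerSec.toFixed(0)} minor GCs/s`);
}
// 192KB -> ~533/s, 1MB -> ~100/s, 16MB -> ~6/s: with a small nursery the
// fixed per-collection overhead is paid hundreds of times per second.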

Did you test with poisoning disabled? I believe it still poisons the whole nursery, and so on Nightly frequent nursery collection looks worse than it otherwise would.

If we're finding that these overheads still outweigh the savings, we can increase the minimum nursery size, particularly for foreground tabs. (I do want to reduce it further for background tabs.)

A separate idea:

Since this is an animation I would assume the best place to GC would be after each frame. Can we detect the use of frame animation handlers and adjust our scheduling based on that?
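A hand-wavy sketch of the idea from the content side (this is not a real engine API; the idle callback below just marks where an engine-driven GC slice could run):

// If the work happens in requestAnimationFrame handlers, the gap between the
// end of a frame's callbacks and the next vsync is a natural place for GC.
function frame(timestamp) {
  doAnimationWork(timestamp);
  requestIdleCallback(deadline => {
    // Placeholder: this is where the engine could run a minor GC or an
    // incremental major slice, bounded by deadline.timeRemaining().
  }, { timeout: 16 });
  requestAnimationFrame(frame);
}

function doAnimationWork(timestamp) { /* allocate matrices, draw, etc. */ }
requestAnimationFrame(frame);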

Blocks: GCScheduling

(In reply to Paul Bone [:pbone] from comment #11)

What do the last two columns measure, the time for nursery collections/major slices/elapsed time etc? So I don't know which rows are good/bad.

The last two columns are average time per minor GC and total time spent in minor GC (note different units).

Did you test with poisoning disabled?

Annoyingly I did not, so I've re-run the tests. New results:

Path:               Major  Minor Average Average  Average  Total
                    Count  Count PR      nursery  time /µs time /ms
                                         size /MB
output-19-01-10.txt    117   266   21.6%       13     5123     1362
output-19-01-15.txt     56  3530    0.5%        1       51      180
output-19-01-20.txt     79  3451    0.6%        1       57      200
output-19-01-25.txt    103   253   21.1%       13     5551     1404
output-19-01-30.txt    111   268   22.1%       12     5038     1350
output-19-02-05.txt    140  3017    0.7%        1       61      185
output-19-02-10.txt    157  2735    0.6%        1       57      156
output-19-02-15.txt    149 14934    0.4%        1       20      312
output-19-02-20.txt    169   540    8.6%        6     1397      754
output-19-02-25.txt    155 13807    0.7%        1       21      298
output-19-02-28.txt    145 13355    0.3%        1       20      271

Observations:

  1. It seems random whether or not it goes into the mode where we tenure more and use a larger nursery. I don't think this is related to a change in the code -- in other words, it's still a problem.

  2. Discounting runs where the above problem happened, builds from 19-02-15 with the smaller nursery size do ~4 times more minor GCs and spend ~50% more time in minor GC.

(In reply to Jon Coppeard (:jonco) from comment #12)

Observations:

  1. It seems random whether or not it goes into the mode where we tenure more and use a larger nursery. I don't think this is related to a change in the code -- in other words, it's still a problem.

  2. Discounting runs where the above problem happened, builds from 19-02-15 with the smaller nursery size do ~4 times more minor GCs and spend ~50% more time in minor GC.

I agree. What kind of system did you run this on, and what's its cache hierarchy like?

But the bigger difference may be between the benchmark I used to tune this (to find the 192KB limit) and this benchmark. Maybe runtime marking is slower for this benchmark.

Depends on: 1532838

Steven, could you please find an assignee for this?

Flags: needinfo?(sdetar)

Paul, can you continue to look at this?

Flags: needinfo?(sdetar) → needinfo?(pbone)

Probably next week.

Assignee: nobody → pbone
Status: NEW → ASSIGNED
Depends on: 1538053

A number of ideas for improving this benchmark have been identified. As we complete each one we should check whether the benchmark has improved, and close this bug when we're happy.

Flags: needinfo?(pbone)

I retested this today. In one of my tests I saw large nursery sizes with a significant number of items being tenured (5-10%), probably due to some pretenuring. But in all other attempts I saw small nursery sizes with very little tenuring (<0.3%). This is mostly working properly, rarely pretenuring something it shouldn't have.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Yes, it looks a lot better on my end, too! Thanks!

Flags: needinfo?(mstange)
Performance Impact: --- → P3
Whiteboard: [qf:p3:responsiveness]