Closed Bug 1376883 Opened 7 years ago Closed 7 years ago

stylo: Drop thread pool stack size to 100k

Categories

(Core :: CSS Parsing and Computation, enhancement, P2)

Tracking

RESOLVED FIXED
Tracking Status
firefox57 --- fixed

People

(Reporter: bholley, Assigned: jseward)

References

Details

(Whiteboard: [MemShrink:P1])

Attachments

(1 file, 4 obsolete files)

Once we get rid of DOM-level recursion during the parallel traversal (tracked in the dependent bugs), we should be able to get by with very small stacks for the stylo workers. I was originally thinking 500k, but I think even less would be sufficient.

We should then just be able to call:

stack_size(self, 100 * 1024)
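
For concreteness, a minimal sketch of what that might look like in
global_style_data.rs, assuming rayon's Configuration builder as used there
at the time (num_threads and the thread_name closure are illustrative
stand-ins; newer rayon spells this ThreadPoolBuilder/build):

  // Sketch only: configure the style worker pool with a small stack.
  fn make_style_pool(num_threads: usize) -> Option<rayon::ThreadPool> {
      let config = rayon::Configuration::new()
          .num_threads(num_threads)
          .thread_name(|i| format!("StyleThread#{}", i))
          // Cap each worker's stack at 100 KB.
          .stack_size(100 * 1024);
      rayon::ThreadPool::new(config).ok()
  }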
Depends on: 1376884
Whiteboard: [MemShrink]
What is the benefit of a small stack size if we don't actually fault the memory in?  Seems like some increased stability risk.
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #1)
> What is the benefit of a small stack size if we don't actually fault the
> memory in?  Seems like some increased stability risk.

I agree; this strikes me as dangerous, for (to me) unclear benefit.
Typically we'd only have 6 stylo workers with minimal physical memory use in
each stack.  I would be reluctant to drop the stack size below 1MB.

At 100KB we would only be able to fit about 60 frames of the recursive
parallel traversal, per my measurements from last week, before we'd have to
disable the tail recursion optimisation.  And that's assuming there are no
other frames on the stack.
(In reply to Julian Seward [:jseward] from comment #2)
> At 100KB we would only be able to fit about 60 frames of the recursive
> parallel traversal, per my measurements from last week,

Not even 60 frames.  For an x86_64-linux optimised Gecko build, we use 3472
bytes of stack per recursive invocation.  So a 100KB stack gives about 30
frames in theory, and realistically about 20 once other frames are accounted for.
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #1)
> What is the benefit of a small stack size if we don't actually fault the
> memory in?  Seems like some increased stability risk.

Modulo the dependencies, we don't have any required recursion in the parallel paths of the style system. So the amount of recursion we need is under our control, and the amount of headroom we need is fixed.

The benefits are:
* Saving address space on 32-bit.
* Avoiding dirtying the entire stack with a single long tail call, which would commit the memory for the lifetime of the process.

I think we should:
(1) Fix emilio's invalidation stuff to use iterators (see the dependent bug).
(2) Set the tail call limit to something conservative (~20 is totally fine). Figure out how much stack space this uses.
(3) Measure the maximum stack space used to style an element, and multiply it a few times to be safe.
(4) Set the stack size limit to (2) + (3). This happens at http://searchfox.org/mozilla-central/rev/17ebac68112bd635b458e831670c1e506ebc17ad/servo/components/style/gecko/global_style_data.rs#84
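
To make step (4) concrete, an illustrative back-of-envelope version with
placeholder numbers (the 3472-byte per-frame figure is the one measured
above; the 30KB headroom for (3) is just an assumption):

  // Illustrative only -- the real numbers come from the measurements
  // in steps (2) and (3).
  const FRAME_BYTES: usize = 3472;               // measured cost per recursive frame
  const MAX_TAIL_DEPTH: usize = 20;              // conservative tail-call limit, step (2)
  const HEADROOM_BYTES: usize = 30 * 1024;       // step (3), with safety margin
  const STACK_SIZE: usize = MAX_TAIL_DEPTH * FRAME_BYTES + HEADROOM_BYTES;

  fn main() {
      // 20 * 3472 + 30720 = 100160 bytes, i.e. roughly 98KB.
      println!("stack size: {} bytes", STACK_SIZE);
  }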


Julian is going to look at this.
Assignee: nobody → jseward
Priority: -- → P2
Whiteboard: [MemShrink] → [MemShrink:P1]
Priority: P2 → --
Julian, are you still looking at this?
Flags: needinfo?(jseward)
Priority: -- → P2
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #5)
> Julian, are you still looking at this?

Yes.  Will resume.
Flags: needinfo?(jseward)
I did some initial measurements, changing the stack size in
servo/components/style/gecko/global_style_data.rs (STYLE_THREAD_POOL)
and the recursion limit in servo/components/style/parallel.rs
(RECURSION_DEPTH_LIMIT).  I tested with the deep trees generated
by scripts in bug 1368415.  'ok' means the run completes the test OK,
'fail' means it segfaults.

What I have so far is

128k stack
  10 ok
  20 ok
  40 ok
  60 ok
  70 fail
  80 fail

256k stack
  80 ok
  130 ok
  140 fail
  160 fail

so it looks like we should budget around 2KB per frame.

That said .. this is with Nathan's disable-LTO-for-rust-compilation patch
applied.  I can't guess whether that would increase or decrease stack use
per recursive iteration.  So I will also measure with that removed.
Some more complete numbers.  I binary-searched for the fail points
(allowable recursion depth) for 128k and 256k stack sizes, giving this:

  With Nathan's patch (optimised, but with LTO disabled)

  128k stack
    60 ok
    70 fail

  256k stack
    130 ok
    140 fail

  Without Nathan's patch (optimised, with LTO)

  128k stack
    70 ok
    80 fail

  256k stack
    140 ok
    150 fail

It appears that the non-LTO version requires marginally more stack.  Using
that as a guide, we could for example choose 256KB and a limit of 70, which
would leave us with more than 100KB spare for contingencies/unknowns.

These measurements are for x86_64-Linux.  I can't think of a scenario in
which structures for the 32 bit case would be larger, so I'm assuming that
any choice we make based on the above numbers would be valid for the 32 bit
case too.
Attachment #8896912 - Flags: feedback?(bobbyholley)
Blocks: 1376884
No longer depends on: 1376884
Comment on attachment 8896912 [details] [diff] [review]
bug1376883-reduce_wthread_stacksize-1.diff

Review of attachment 8896912 [details] [diff] [review]:
-----------------------------------------------------------------

So, first of all, I appreciate the patch here, and it's more or less what we talked about.

However, I did some digging and chatted with emilio a bit today, and I think we should change our approach and measure the precise stack depth instead.

The reason for this is that the invalidation machinery isn't easy to make recursive (but we can introduce a bailout condition). So we basically need to do the same depth tracking that parallel.rs does, but then we need to consider the pessimal case where parallel.rs gets to MAX_DEPTH, and _then_ calls over into the invalidation machinery, which also goes to MAX_DEPTH. So if we use a counter, we need to calibrate the counters to handle the worst case for both sets of machinery, which is extremely rare.

Additionally, the guess-work around stack frame sizes is hard to reason about across platforms.

So instead of passing recursion_depth, I think we should just pass a base_stack_address, and compare the current stack address when deciding whether to operate recursively. We can measure the base stack address like so: https://play.rust-lang.org/?gist=f26128cb00e5c8a432146d6dcadb4e4a&version=stable .
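
(The gist itself isn't reproduced here, but the idea is roughly the
following sketch: take the address of a local variable as an approximation
of the current stack pointer, capture it once at the top of the traversal,
and compare later.  Nothing in the language guarantees this, but it's good
enough for a heuristic bailout.)

  #[inline(never)]
  fn approximate_stack_pointer() -> usize {
      // The address of a local is a reasonable stand-in for the
      // stack pointer of the calling frame.
      let marker = 0u8;
      &marker as *const u8 as usize
  }

  fn main() {
      // Captured once, when the traversal starts on this thread.
      let base_stack_address = approximate_stack_pointer();

      // Before recursing: stacks grow downward on the platforms we
      // care about, so the bytes consumed so far is (base - current).
      let used = base_stack_address.saturating_sub(approximate_stack_pointer());
      let ok_to_recurse = used < (128 - 30) * 1024;
      println!("used {} bytes of stack, recursion ok: {}", used, ok_to_recurse);
  }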

Also, I think we should probably do 128k instead of 256k. Extremely deep recursion is very rare, and if it ever happens we'll permanently commit that address range, which isn't great.

So let's switch to precise stack measurement, set the limit to 128k, and prevent recursion if we're within 30k of the limit (which should be plenty for the rest of the style system).
Attachment #8896912 - Flags: feedback?(bobbyholley) → feedback-
Flags: needinfo?(jseward)
I opened https://github.com/servo/servo/pull/18097 to eliminate one other recursion site. Emilio will do the rest in bug 1376884, on top of the patch that Julian will write in this bug.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #10)
> The reason for this is that the invalidation machinery isn't easy to make
> recursive

Err, isn't easy to make _non_ recursive.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #10)
> So let's switch to precise stack measurement, set the limit to 128k, and
> prevent recursion if we're within 30k of the limit (which should be plenty
> for the rest of the style system).

I can do that, no problem.

As a sanity check, though .. this will make Stylo more nondeterministic than
it is now.  Say it fails in the field for some reason.  We try to reproduce
it locally and (for example) it fails in an opt build but not in a debug
build, because the frame sizes are different and so the two builds take
different paths.  Similarly if we change compilers, etc.

So we might lose something in terms of reproducibility of behaviour.  Is
that considered OK?
Flags: needinfo?(jseward) → needinfo?(bobbyholley)
Yeah, I think the hit to reproducibility in those edge cases is probably acceptable given the alternatives. Fair point though!
Flags: needinfo?(bobbyholley)
Here's what I have so far.  It adds "stack_limit: usize" to
ThreadLocalStyleContext (no newtype niceness yet).  In parallel.rs, fn
traverse_nodes(), it then attempts to acquire the stored stack_limit value
and compare it against the current SP -- at the moment, just printing it and
the delta.

But after not long at all, the tls.ensure call I added panics:

   thread 'StyleThread#1' panicked at 'already borrowed: BorrowMutError'.

I can see that it's possibly panicking against the tls.ensure call in
top_down_dom.  But I don't want a mutable borrow of the whole TLSC
structure.  I only want to see the immutable stack_limit value.

I'm stuck at this point.  Did I do something wrong?  I can kinda see that if
someone else has a mutable borrow to the TLSC then I won't even be able to
get an immutable borrow to inspect the stack_limit value.  (is that correct?
my intuition about borrowing is weak at best.)
The problem is that the TLS is already borrowed in top_down_dom, which is the caller of traverse_nodes. You should do the check there, and then simply pass an allow_recursion boolean to traverse_nodes.
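In sketch form (hypothetical, heavily simplified stand-ins for the real
types and functions in parallel.rs / context.rs, just to show where the
check and the flag go):

  struct StackLimitChecker;
  impl StackLimitChecker {
      fn limit_exceeded(&self) -> bool { false } // stub for the sketch
  }
  struct ThreadLocalStyleContext { stack_limit_checker: StackLimitChecker }

  fn traverse_nodes(depth: usize, recursion_ok: bool) {
      // Recurse only while allowed; the real code otherwise hands the
      // remaining work back to the rayon queue.
      if recursion_ok && depth < 3 {
          traverse_nodes(depth + 1, recursion_ok);
      }
  }

  fn top_down_dom(tlc: &ThreadLocalStyleContext) {
      // The one borrow of the thread-local context lives here, so do the
      // stack check here and pass the answer down, rather than trying to
      // borrow the TLS again inside traverse_nodes.
      let recursion_ok = !tlc.stack_limit_checker.limit_exceeded();
      traverse_nodes(0, recursion_ok);
  }

  fn main() {
      let tlc = ThreadLocalStyleContext { stack_limit_checker: StackLimitChecker };
      top_down_dom(&tlc);
  }
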
Here's a working version of the patch.  Some notes:

* I wrapped up the mechanism inside a struct StackLimitChecker, to which
  callers can simply ask ".limit_exceeded() ?".  I believe it is a valid
  instance of the newtype idiom.

* I marked limit_exceeded() as #[inline(never)] so as to force it into its
  own frame.  That avoids the imprecision of the "marker" |get_sp.foo|
  being inlined into some function with a large frame, and so ending up at
  some arbitrary place within that frame.  I don't know if this is really
  worth the perf loss of having to do one function call per overflow check
  -- probably not.

* I tested on a bloom-basic deep tree.  From debug printing, I see that the
  90K active area allows 54 frames of recursion before bailing out.  The
  stack size is set to 128KB now.

* Is it worth adding an assertion, in .limit_exceeded, that the difference
  between the current SP and limit is less than a few megabytes?  This would
  catch cases where we mistakenly query it from a different thread than the
  one that created the StackLimitChecker.  Or does the Rust typesystem
  completely guarantee that we can't use it on a different thread than the
  one it was created on?

  If the typesystem guarantees that, great.  If not I'd feel more comfortable
  with the abovementioned sanity check.
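
For reference, the rough shape of the checker (a reconstruction from the
description above and the review below, not the patch verbatim):

  /// Stack checker for the style worker threads (sketch).
  pub struct StackLimitChecker {
      lower_limit: usize,
  }

  impl StackLimitChecker {
      /// Allow up to |stack_size| further bytes of stack below the
      /// current stack pointer.
      pub fn new(stack_size: usize) -> Self {
          StackLimitChecker {
              lower_limit: StackLimitChecker::get_sp() - stack_size,
          }
      }

      /// Have we gone below the allowed limit?  inline(never) so the
      /// check always runs in its own small frame.
      #[inline(never)]
      pub fn limit_exceeded(&self) -> bool {
          StackLimitChecker::get_sp() <= self.lower_limit
      }

      // Approximate the current stack pointer via the address of a local.
      #[inline(always)]
      fn get_sp() -> usize {
          let foo: usize = 42;
          &foo as *const usize as usize
      }
  }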
Attachment #8896912 - Attachment is obsolete: true
Attachment #8898282 - Attachment is obsolete: true
Attachment #8898719 - Flags: feedback?(bobbyholley)
Comment on attachment 8898719 [details] [diff] [review]
A working direct-measurement patch

Review of attachment 8898719 [details] [diff] [review]:
-----------------------------------------------------------------

Your assertion sounds good. Honestly, we should just debug_assert! that the depth, if greater than the limit, is at least less than the total stack size, otherwise we should have already segfaulted.
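
Roughly, something like this (sketch only, using the names from the patch
under review; 128KB is the total thread stack size):

  // Sketch of limit_exceeded() with the suggested assertion added.
  #[inline(never)]
  pub fn limit_exceeded(&self) -> bool {
      let sp = StackLimitChecker::get_sp();
      if sp <= self.lower_limit {
          // We're past the soft limit.  The overshoot should still be
          // well under the whole 128KB thread stack; anything bigger
          // means we've either already blown the stack or we're being
          // queried from a thread other than the one that created us.
          debug_assert!(self.lower_limit - sp < 128 * 1024);
          return true;
      }
      false
  }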

::: servo/components/style/context.rs
@@ +612,5 @@
> +
> +impl StackLimitChecker {
> +    /// Create a new limit checker, for this thread, allowing further use
> +    /// of up to |stack_size| bytes beyond (below) the current stack pointer.
> +    pub fn new(stack_size: usize) -> Self {

I might call this stack_size_limit to disambiguate from the actual size.

@@ +624,5 @@
> +    pub fn limit_exceeded(&self) -> bool {
> +        StackLimitChecker::get_sp() <= self.lower_limit
> +    }
> +
> +    #[inline(always)]

Please include the original comments in the snippet about how this isn't guaranteed to work but should probably work for now.

@@ +701,5 @@
>              font_metrics_provider: E::FontMetricsProvider::create_from(shared),
> +            // We can use up to 90KB of stack freely.  After that we
> +            // need to limit further stack growth so as to stay within
> +            // a total stack of 128KB.  See bug 1376883.
> +            stack_limit_checker: StackLimitChecker::new(90 * 1024),

Please move this to a constant somewhere near the the place where we define the stack limit for the rayon pool.

::: servo/components/style/parallel.rs
@@ +158,5 @@
>          let mut tlc = tls.ensure(
>              |slot: &mut Option<ThreadLocalStyleContext<E>>| create_thread_local_context(traversal, slot));
> +
> +        // Check that we're not in danger of running out of stack.
> +        if tlc.stack_limit_checker.limit_exceeded() { recursion_ok = false; }

Just do recursion_ok = !limit_exceeded, no need for the if.
Attachment #8898719 - Flags: feedback?(bobbyholley) → feedback+
Addresses all review comments.

Includes a debug_assert!, as discussed, that tries to check that we've been
called from the right thread.  Details in comments.

The stack size has been turned into a constant

  pub const STYLE_THREAD_STACK_SIZE_KB: usize = 128;

defined in components/style/context.rs and used both there to set the stack
limit and in components/style/gecko/global_style_data.rs to set the thread
stack size, so there's a single source of truth for the stack size.

It's still 128KB, but now split 88KB recursion / 40KB reserve, which gives 51
frames before bailout.

One question: impl<E: TElement> ThreadLocalStyleContext<E> in context.rs
only sets |stack_limit_checker| for the case #[cfg(feature = "gecko")].  Is
that OK, or will it cause compile breakage when building for
#[cfg(feature = "servo")]?
Attachment #8898719 - Attachment is obsolete: true
Attachment #8899526 - Flags: review?(bobbyholley)
Comment on attachment 8899526 [details] [diff] [review]
bug1376883-reduce_wthread_stacksize-2.diff

Review of attachment 8899526 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good in general, but you'll want to make sure this compiles with servo. You can do that by running cd servo/ && ./mach check.

::: servo/components/style/context.rs
@@ +730,5 @@
> +            // which we reserve 40KB of stack as a safety buffer.  Currently
> +            // the stack size is 128KB, so this allows 88KB for recursive
> +            // DOM traversal.  See bug 1376883.
> +            stack_limit_checker: StackLimitChecker::new(
> +                (STYLE_THREAD_STACK_SIZE_KB - 40) * 1024),

You'll want to add this to the servo constructor as well.

::: servo/components/style/gecko/global_style_data.rs
@@ +33,5 @@
>      pub style_thread_pool: Option<rayon::ThreadPool>,
>  }
>  
> +/// The minimum stack size for a thread in the styling pool, in kilobytes.
> +pub const STYLE_THREAD_STACK_SIZE_KB: usize = 128;

You should define this outside of gecko/ so that all the relevant machinery will compile for servo as well. Putting it in parallel.rs is probably fine.

@@ +95,5 @@
>                  .breadth_first()
>                  .thread_name(thread_name)
>                  .start_handler(thread_startup)
> +                .exit_handler(thread_shutdown)
> +                // Set thread stack size to 128KB.  See bug 1376883.

I'd say "See gecko bug" to distinguish from servo issues.
Attachment #8899526 - Flags: review?(bobbyholley) → review-
Fixes all review comments in comment 20, and I verified that it compiles
with servo.
Attachment #8899526 - Attachment is obsolete: true
Attachment #8899822 - Flags: review?(bobbyholley)
Comment on attachment 8899822 [details] [diff] [review]
bug1376883-reduce_wthread_stacksize-3.diff

Review of attachment 8899822 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/context.rs
@@ +649,5 @@
> +    }
> +
> +    // Technically, rustc can optimize this away, but shouldn't for now.
> +    // We should fix this once black_box is stable.
> +    #[inline(always)]

Seems like it might be better to make this inline(never) and skip the annotation on limit_exceeded(), so that we accurately measure the stack size in new() as well?

@@ +712,5 @@
>              font_metrics_provider: E::FontMetricsProvider::create_from(shared),
> +            // Threads in the styling pool have small stacks, and we have to
> +            // be careful not to run out of stack during recursion in
> +            // parallel.rs.  Therefore set up a stack limit checker, in
> +            // which we reserve 40KB of stack as a safety buffer.  Currently

Can you measure how many levels of recursion we reach on a perfectly linear DOM before the limiter kicks in, and add it as a comment? I think it's probably useful intuition.

If this would take a lot of time, you can skip it.

@@ +714,5 @@
> +            // be careful not to run out of stack during recursion in
> +            // parallel.rs.  Therefore set up a stack limit checker, in
> +            // which we reserve 40KB of stack as a safety buffer.  Currently
> +            // the stack size is 128KB, so this allows 88KB for recursive
> +            // DOM traversal.  See Gecko bug 1376883.

Nit: Realistically, the stack size stuff is going to be more relevant to people working on Firefox than on Servo. So I'd move the comment to the gecko constructor below, and reference it here.
Attachment #8899822 - Flags: review?(bobbyholley) → review+
Just a poke to get this landed. The sooner everything gets into the tree, the better.
Flags: needinfo?(jseward)
https://hg.mozilla.org/integration/autoland/rev/a5f01ec89dd2342f3d0478020b16aefd1dbed7ff
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This caused crashtest bustage for a few reasons:
* The debug_assert will fire every time we hit the stack limit. It should be comparing against the entire stack size, not against lower_limit.
* The invalidation machinery doesn't support stack limits yet (bug 1376884).

Note that this gets extra complicated because my patches in bug 1393632, which landed immediately afterwards, cause us to stop doing the parallel traversal in a lot of cases, so the failure modes are less obvious. So it's best to test fixes with those patches reverted.

Julian, in the future please push stylo patches to try first, since they're harder for sheriffs to back out.

I'm reverting the stack limit here: https://github.com/servo/servo/pull/18240

We can re-enable it and finish sorting it out in bug 1376884.