Closed Bug 1279389 Opened 8 years ago Closed 8 years ago

Enable osfile.reset_worker_delay on all products

Categories

(Toolkit Graveyard :: OS.File, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox50 fixed)

RESOLVED FIXED
mozilla50

People

(Reporter: bkelly, Assigned: bkelly)

References

Details

(Whiteboard: [MemShrink])

Attachments

(3 files, 1 obsolete file)

In bug 927560 and bug 961317 we added the ability to shut down the osfile worker after a timeout.  This was mainly done to reduce memory usage on B2G.

It would be nice, though, to run this consistently across all our products.  There is no good reason to waste memory on desktop if we have the means to clean it up.  This would save us about 3.5MB on desktop and 1MB on Android.  While these are relatively small amounts, every little bit adds up and can also contribute to fragmentation over time.

Since the pref is already implemented, I think we should consider just turning it on.
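
For anyone who wants to try this locally before the patch lands, here is a minimal sketch of flipping the pref at runtime from privileged chrome code (e.g. the Browser Console).  The pref name is the one this bug is about; treating its value as a delay in milliseconds is my assumption.

  // Sketch only: requires a chrome-privileged scope.
  Components.utils.import("resource://gre/modules/Services.jsm");

  // Shut the OS.File worker down after ~30 seconds of inactivity
  // (assuming the pref is read as milliseconds).
  Services.prefs.setIntPref("osfile.reset_worker_delay", 30000);

  // To return to the default behavior:
  // Services.prefs.clearUserPref("osfile.reset_worker_delay");
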
This enables the osfile.reset_worker_delay pref with a fairly conservative 30 second value.  At this level try is green (with the follow-on P2 patch that tweaks one of the devtools tests):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9eaa178f554c

I also did a talos run before and after the patch to demonstrate that it did not regress performance:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newRevision=9eaa178f554cfc4a7b186db93192b8f9f7510031&framework=1

I did try lower delay values, but that made more tests unstable.  For example, the devtools WebIDE browser_tabs.js test opens and closes WebIDE very quickly.  This tool has an async step to update the package list from disk, but nothing in the test waits for it.  With the added delay of the worker startup and its associated messaging, this package list update would sometimes fire after shutdown leak collection had completed.  I don't think this is a real production bug, but rather an issue with the test not waiting for all operations to complete.

Rather than chase down issues like that just to set an aggressive timeout, I figured we could start with something conservative like this and still reap the benefit of improved memory usage in a typical browsing session.
Attachment #8761885 - Flags: review?(dteller)
Ryan, I needed to relax the timing constraint in browser_telemetry_toolboxtabs_storage.js slightly in order to accommodate the added latency from spinning up the osfile worker in some cases.  Without this the test passes all its cases, but runs into a "this.transport is null" in main.js as it's ending.  I believe it's hitting a similar "async file operation that doesn't block the test" issue as I described in the previous comment.
Attachment #8761886 - Flags: review?(jryans)
Comment on attachment 8761885 [details] [diff] [review]
P1 Enable osfile.reset_worker_delay with a conservative 30 second delay. r=yoric

Actually, I see that the worker is getting stuck open after the crash manager maintenance runs.  It's not getting requests, though.  If I can't fix that then this is moot.
Attachment #8761885 - Flags: review?(dteller)
Attachment #8761886 - Flags: review?(jryans)
It seems we are forgetting to close directory iterators somewhere.
In this case it looks like TelemetryStorage.jsm, but DXR suggests there are more in the tree. :-(
Comment on attachment 8761885 [details] [diff] [review]
P1 Enable osfile.reset_worker_delay with a conservative 30 second delay. r=yoric

See comment 1 for review description.
Attachment #8761885 - Flags: review?(dteller)
Comment on attachment 8761886 [details] [diff] [review]
P2 Relax the timing in browser_telemetry_toolboxtabs_storage.js to account for osfile worker startup. r=jryans

See comment 2 for review description.
Attachment #8761886 - Flags: review?(jryans)
Fix all the places I could find where we forget to call .close() on a DirectoryIterator.  Another advantage of turning on the osfile worker timeout is that we can more easily see if a new leak like this creeps in.
Attachment #8761913 - Flags: review?(dteller)
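
For reference, the pattern the P3 patch moves callers toward looks roughly like this.  It's a sketch rather than an excerpt from the patch, and the helper name and path argument are made up.  The iterator does auto-close once iteration runs to completion, but an explicit close() in a finally block also covers early returns and exceptions, which is what otherwise keeps the worker pinned.

  Components.utils.import("resource://gre/modules/osfile.jsm");
  Components.utils.import("resource://gre/modules/Task.jsm");

  // Hypothetical helper: list the entry names in a directory.
  function listEntryNames(path) {
    return Task.spawn(function* () {
      let iterator = new OS.File.DirectoryIterator(path);
      try {
        let entries = yield iterator.nextBatch();
        return entries.map(entry => entry.name);
      } finally {
        // Without this, bailing out early leaves the iterator (and its
        // state on the osfile worker) open, keeping the worker alive.
        iterator.close();
      }
    });
  }
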
Comment on attachment 8761886 [details] [diff] [review]
P2 Relax the timing in browser_telemetry_toolboxtabs_storage.js to account for osfile worker startup. r=jryans

Review of attachment 8761886 [details] [diff] [review]:
-----------------------------------------------------------------

Sounds fine to me, assuming Talos looks good for the new run.
Attachment #8761886 - Flags: review?(jryans) → review+
In B2G, this feature has been a never-ending cause of race conditions and deadlocks. So I'm actually voting for us to entirely get rid of the ability to shut down OS.File, rather than extending it to other platforms.

Moreover, shutting down OS.File makes sense on B2G, because it is only ever used during the first ~1 second of booting a process, but not nearly as much on Desktop, where it is used all over the place.

Finally, we have had issues in the past with how long it takes to start the OS.File worker in the first place. According to Telemetry, this can take up to 5 seconds (!) on Desktop, depending on cache contention, etc., so we might end up hurting the performance of Firefox for users.
Flags: needinfo?(bkelly)
Comment on attachment 8761885 [details] [diff] [review]
P1 Enable osfile.reset_worker_delay with a conservative 30 second delay. r=yoric

Review of attachment 8761885 [details] [diff] [review]:
-----------------------------------------------------------------

(canceling review until we have discussed the usefulness of this bug)
Attachment #8761885 - Flags: review?(dteller)
Comment on attachment 8761913 [details] [diff] [review]
P3 Fix OS.File.DirectoryIterator leaks throughout tree. r=yoric

Review of attachment 8761913 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the fix.

Note that, in practice, most of these iterators are automatically closed (this happens when we have iterated to the end), but being extra cautious is a good idea.
Attachment #8761913 - Flags: review?(dteller) → review+
(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #13)
> Comment on attachment 8761913 [details] [diff] [review]
> P3 Fix OS.File.DirectoryIterator leaks throughout tree. r=yoric
> 
> Review of attachment 8761913 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks for the fix.
> 
> Note that, in practice, most of these iterators are automatically closed
> (this happens when we have iterated to the end), but being extra cautious is
> a good idea.

From what I have observed, this auto-close is not working.  Those archive telemetry iterators seem like they iterate to the end quickly, but they were not closed.
(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #11)
> In B2G, this feature has been a neverending cause of race conditions and
> deadlocks. So I'm actually voting for us to entirely get rid of the ability
> to shutdown OS.File, rather than extending it to other platforms.

Deadlocks?  You mean within OS.File itself?

> Moreover, shutting down OS.File makes sense on B2G, because it is only ever
> used during the first ~1 second of booting a process, but not nearly as much
> on Desktop, where it is used all over the place.

From what I can tell we don't need this running 95% of the time.  We need it once every 10 minutes or so when sync runs.  Or if the user opens a devtools toolbox that uses it.  It seems a shame to waste resources for no reason.

> Finally, we have had issues in the past with how long it takes to start the
> OS.File worker in the first place. According to Telemetry, this can take up
> to 5 seconds (!) on Desktop, depending on cache contention, etc. so we might
> end up hurting the performance of Firefox for users.

Where is the telemetry for this for desktop?  I did run talos and it did not show any regressions.

In general the time it takes to start a worker thread should be negligible compared to disk IO times.
Flags: needinfo?(bkelly)
See questions in comment 15.
Flags: needinfo?(dteller)
I guess I'd like to propose enabling it in nightly since we're at the beginning of a merge cycle.  If it causes problems we can back it out.
(In reply to Ben Kelly [:bkelly] from comment #15)
> Deadlocks?  You mean within OS.File itself?

Indeed. Not the kind of deadlock that blocks a thread entirely, but the kind of deadlock we get with async programming that prevents progress (of an async codepath) because some resource is not available and never will be.

> > Moreover, shutting down OS.File makes sense on B2G, because it is only ever
> > used during the first ~1 second of booting a process, but not nearly as much
> > on Desktop, where it is used all over the place.
> 
> From what I can tell we don't need this running 95% of the time.  We need it
> once every 10 minutes or so when sync runs.  Or if the user opens a devtools
> toolbox that uses it.  It seems a shame to waste resources for no reason.

I understand that. If you are willing to put the time to debugging the aforementioned deadlocks, go ahead.

Personally, I'd prefer if we rewrote OS.File in Rust. This would save basically all of this memory, with a faster startup and without the need for the C++ fast path designed to work around the slow startup.


> > Finally, we have had issues in the past with how long it takes to start the
> > OS.File worker in the first place. According to Telemetry, this can take up
> > to 5 seconds (!) on Desktop, depending on cache contention, etc. so we might
> > end up hurting the performance of Firefox for users.
> 
> Where is the telemetry for this for desktop?  I did run talos and it did not
> show any regressions.

It's OSFILE_WORKER_READY. You'll see that the 75th percentile is already > 1 second and the 95th percentile almost at 4 seconds.

> In general the time it takes to start a worker thread should be negligible
> compared to disk IO times.

In theory, yes. We were pretty surprised when we realized that using OS.File with Session Restore, which made actual I/O more efficient, caused a multi-second regression in Firefox startup times.
Flags: needinfo?(dteller)
(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #18)
> (In reply to Ben Kelly [:bkelly] from comment #15)
> > Deadlocks?  You mean within OS.File itself?
> 
> Indeed. Not the kind of deadlock that blocks a thread entirely, but the kind
> of deadlock we get with async programming that prevents progress (of an
> async codepath) because some resource is not available nor will ever be.

Are there existing cases I can test?  This is pretty vague and hard to address at the moment.

> > From what I can tell we don't need this running 95% of the time.  We need it
> > once every 10 minutes or so when sync runs.  Or if the user opens a devtools
> > toolbox that uses it.  It seems a shame to waste resources for no reason.
> 
> I understand that. If you are willing to put the time to debugging the
> aforementioned deadlocks, go ahead.

I would, but how can I find them if we don't enable this?

> > > Finally, we have had issues in the past with how long it takes to start the
> > > OS.File worker in the first place. According to Telemetry, this can take up
> > > to 5 seconds (!) on Desktop, depending on cache contention, etc. so we might
> > > end up hurting the performance of Firefox for users.
> > 
> > Where is the telemetry for this for desktop?  I did run talos and it did not
> > show any regressions.
> 
> It's OSFILE_WORKER_READY. You'll see that the 75th percentile is already > 1
> second and the 95th percentile almost at 4 seconds.

This is measured at browser startup today, right?  When the main thread is completely bogged down loading everything, right?  We don't start this at any other time on desktop right now, AFAIK.  How can this be comparable to starting the worker during more typical operation?

> > In general the time it takes to start a worker thread should be negligible
> > compared to disk IO times.
> 
> In theory, yes. We were pretty surprised when we realized that using OS.File
> with Session Restore, which made actual I/O more efficient, caused a
> multi-second regression in Firefox startup times.

Again, startup is a special case where the main thread is buried under runnables.  Avoiding async is a huge win here.  I don't think that has much bearing on performance once we are outside of startup conditions.

How about if I just enable it in non-release builds?  Aurora and nightly?  That would let us find bugs, etc.
Flags: needinfo?(dteller)
(In reply to Ben Kelly [:bkelly] from comment #19)
> Are there existing cases I can test?  This is pretty vague and hard to
> address at the moment?

Well, everything was on B2G, so it's going to be hard to track down these days.

Lissyx, maybe you remember about a semi-recent B2G issue that was caused by OS.File shutdown?

> > I understand that. If you are willing to put the time to debugging the
> > aforementioned deadlocks, go ahead.
> 
> I would, but how can I find them if we don't enable this?

Well, if you're game, go ahead and enable it.

> > It's OSFILE_WORKER_READY. You'll see that the 75th percentile is already > 1
> > second and the 95th percentile almost at 4 seconds.
> 
> This measured at browser start up today, right?  When the main thread is
> completely bogged down loading everything right?  We don't start this at any
> other time on desktop right now AFAIK.  How can this be comparable to
> starting the worker during more typical operation?

This is measured at the first call to OS.File, so yes, it's certainly during startup.

I don't have a good idea of how to measure a restart of the worker during regular operation.

> Again, startup is a special case where the main thread is buried under
> runnables.  Avoiding async is a huge win here.  I don't think that has much
> bearing on performance once we are outside of startup conditions.
> 
> How about if I just enable it in non-release builds?  Aurora and nightly? 
> That would let us find bugs, etc.

Fine by me.
Flags: needinfo?(dteller)
This only enables the pref on non-release builds.
Attachment #8761885 - Attachment is obsolete: true
Attachment #8762131 - Flags: review?(dteller)
Attachment #8762131 - Flags: review?(dteller) → review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2600b18edec8
P1 Enable osfile.reset_worker_delay on non-release builds with a conservative 30 second delay. r=yoric
https://hg.mozilla.org/integration/mozilla-inbound/rev/494ea9227b69
P2 Relax the timing in browser_telemetry_toolboxtabs_storage.js to account for osfile worker startup. r=jryans
https://hg.mozilla.org/integration/mozilla-inbound/rev/f2110c264778
P3 Fix OS.File.DirectoryIterator leaks throughout tree. r=yoric
Ben, is there some form of targeted manual testing that the QA team could perform to help with testing this?
Flags: needinfo?(bkelly)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #24)
> Ben, is there some form of targeted manual testing that the QA team could
> perform to help with testing this?

So to see if this worked, follow these steps:

1) Launch a fresh browser
2) Browse normally for a few minutes
3) Open an about:memory tab and minimize/measure
4) In the main process there should not be an osfile worker

When account sync (or another subsystem that uses the osfile async APIs) runs, however, the osfile worker should be spawned again, and then go away after it's been idle for a while.  Again, you can use about:memory to see if the worker goes away.

Potential problems to look for would be sync or other file consumers intermittently breaking or being abnormally slow.

Again, this is only enabled in non-release builds.  So nightly now and eventually aurora once 50 merges there.
Flags: needinfo?(bkelly)
https://hg.mozilla.org/mozilla-central/rev/2600b18edec8
https://hg.mozilla.org/mozilla-central/rev/494ea9227b69
https://hg.mozilla.org/mozilla-central/rev/f2110c264778
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Depends on: 1284621
Depends on: 1290288
Product: Toolkit → Toolkit Graveyard