Closed Bug 1432484 Opened 6 years ago Closed 6 years ago

Linux Firefox 58 hangs at startup with profile on NFS mount or Windows Share

Categories

(Core :: Security: PSM, defect)

defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 1444943

People

(Reporter: drichard, Unassigned)

References

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0
Build ID: 20180103231032

Steps to reproduce:

We have been running Firefox 57 with excellent success.  I was doing QA work on upgrading to FF 58.  I installed the new package, closed FF 57 and relaunched FF 58. 


Actual results:

The browser hangs for a good while before becoming responsive.  When this is happening, the UI is completely blocked.   I put FF 57 and FF 58 on strace, and the output is very different.  What I am seeing is that the startups are identical, basic libraries, then fonts and then FF 58 repeats these lines over and over again:

stat("/home/drichard/.mozilla/firefox/mn25ti71.default/cert9.db-wal", 0x7fff562dca30) = -1 ENOENT (No such file or directory)
fstat(72, {st_mode=S_IFREG|0600, st_size=1114112, ...}) = 0
fcntl(72, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(72, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(72, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(72, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/cert9.db-journal", 0x7fff562dca30) = -1 ENOENT (No such file or directory)
lseek(72, 24, SEEK_SET)                 = 24
read(72, "\0\0\0\2\0\0\0\"\0\0\0\0\0\0\0\0", 16) = 16
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/cert9.db-wal", 0x7fff562dca30) = -1 ENOENT (No such file or directory)
fstat(72, {st_mode=S_IFREG|0600, st_size=1114112, ...}) = 0
fcntl(72, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(72, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(72, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(72, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0

I will attach the whole strace dump.  My non-expert review of this seems to indicate that these files do not exist in my profile, and that it's looping over and over again looking for them and the UI becomes responsive when it "gives up".  I'm using an identical profile with FF 57 and FF 58.  FF 57 seems not to care that these files do not exist.

Additional note:  Our Firefox profiles are located on a NFS mount.  Maybe the new code introduced is expecting a local disk?

drichard@linux-utz5:/users/drichard/net_folders/.mozilla/firefox/mn25ti71.default> ls -lad cer*
-rw------- 1 drichard drichard 1376256 Jan 23 08:58 cert8.db
-rw------- 1 drichard drichard 1114112 Jan 23 08:19 cert9.db
-rw------- 1 drichard drichard   47878 Jan 22 17:42 cert_override.txt



Expected results:

Firefox should just open :)
Severity: normal → critical
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Ubuntu 14.04 LTS - x64 bit, Firefox 58.0

I tried to reproduce the case you described above but the firefox 58.0 browser performed properly as it should. It didn't hang on my system while opening it. It might be a one-time problem that might have occured, please do check if its happening again so that if its repeating, I can go ahead and change the status of the reported bug.

[bugday-20180122]
This was not a onetime thing.  When I start FF 58, it always hangs.  When I run FF 57 against the same profile, it works perfectly well.  We have 800 users, so I'm concerned that if it happened to me as the first tester...that it might happen to many more.  If users damage their profiles, we'll have a serious support issue.
I put it on the debugger and halted it at the point that it's blocked:

Thread 1 "firefox" received signal SIGINT, Interrupt.
0x00007ffff6c3a005 in __GI___xstat (vers=<optimized out>, 
    name=0x7fffcc97b345 "/home/drichard/.mozilla/firefox/mn25ti71.default/cert9.db-wal", buf=0x7fffffff98c0) at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:35
35      ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c: No such file or directory.
(gdb) 

Testing now to see if NFS is at play.
Ok, I have more information.  Our .mozilla folders are running over NFS.   So I took this folder and made a tarball and moved it to the local disk drive of the local cluster node and FF 58 opened immediately.  I then moved it back to NFS and FF 58 locked again.   I feel like this is proving the issue is related to the new code and interacting with a NFS mount.  When I interrupted with GDB above it's locked on xstat.   I perceive it's having trouble with a profile on a NFS drive.


__xstat
Name
__xstat -- provide inode information
Ubuntu 14.04 LTS - x64 bit, Firefox 58.0

I couldn't reproduce the bug but I have changed the status of the bug to new so that a developer might look into it and get the problem sorted out.

[bugday-20180122]
Status: UNCONFIRMED → NEW
Ever confirmed: true
Is there anything that I can do to assist with this bug?  We have over 800 users that cannot be upgraded to FF 58 until profiles can properly be stored on a NFS mount.
I shared this bug with other Mozilla contributors, they will definitely look into it. If there are any further queries I will let you know. As of now do not upgrade to Firefox 58.0 till the problem is sorted.

[bugday-20180122]
I am having this same problem with Firefox 58 (and 58.0.1) and am also using NFS for my home directories (and problem disappears when I use local disk for my homedir).  I am running this on openSUSE 42.3 x86_64.  It seems that the default DB type went from dbm to sql (sqlite) in Firefox 58 ( reference https://support.mozilla.org/en-US/questions/1201980 and also bugzilla 1377940.).  

And it looks like sqlite on NFS is SLOW according to https://wiki.mozilla.org/NSS_Shared_DB_Samples

Supposedly you can export env var NSS_DEFAULT_DB_TYPE="dbm"  to make it go back to DBM type but I can't get it to work...
This is great investigative work!

There have been some sqlite databases in the profile folder for many years, and they seemed to work OK.  But certainly this could be root cause.  If this is a sqlite issue, I'm wondering if they are trying to work in synchronous mode which is indeed slower over NFS.   One can turn this off with PRAGMA synchronous=OFF;    -- but it doesn't appear there is a way to do this with an environmental variable in a way that would carry to child processes.

So I have confirmed that NSS_DEFAULT_DB_TYPE does not affect Firefox 58 at all.

The strace that I attached to the ticket seems to imply that it's stuck in a loop and continually trying to gain access to files vs being stuck on one sqlite database.  

I was hoping to not have to tinker with mozregression and that someone familiar with the code would just know what's happening.  We're just received a critical security bulletin on FF 57, and we're kind of dead in the water at this time in terms of upgrading sadly.  

NFS mounts are critical for enterprise deployments.  Our load balancers move people between nodes and then obtain their profile via a common NFS drive.
Blocks: 1377940
Is performance improved if you set the environment variable NSS_SDB_USE_CACHE to "yes"?
Flags: needinfo?(drichard)
See Also: → 1434548
Summary: Linux Firefox 58 Hangs At Startup → Linux Firefox 58 hangs at startup with profile on NFS mount
(In reply to David Keeler [:keeler] (use needinfo) from comment #10)
> Is performance improved if you set the environment variable
> NSS_SDB_USE_CACHE to "yes"?

I just tested this and YES, performance is definetly improved.  Firefox UI no longer becomes fully unresponsive for up to 20 seconds (I wasn't able to switch between tabs for example).  When setting that env var, as soon as firefox starts, the tabs are still loading in the background, but the UI is responsive and I can switch between tabs immediately.
Good catch on 1434548, his strace is nearly identical.

As a point of reference, FF 58 takes 19 seconds for the UI to become responsive, whereas FF 57 is 1-2 seconds.



I set NSS_SDB_USE_CACHE=yes and Firefox 58 opens right up in about the same speed as 57.

Strace with NSS_SDB_USE_CACHE=yes is much cleaner, but still a little activity looking for wal's.  Not even close to how it looks with that variable unset.


drichard@desktopnxnode1:/u/firefox_2018_01_23/firefox$ cat /tmp/david.log | grep .wal
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/cookies.sqlite-wal", {st_mode=S_IFREG|0644, st_size=623080, ...}) = 0
open("/home/drichard/.mozilla/firefox/mn25ti71.default/cookies.sqlite-wal", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 32
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/permissions.sqlite-wal", 0x7fff49aca2c0) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/permissions.sqlite-wal", 0x7fff49acb650) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/permissions.sqlite-wal", 0x7fff49acb610) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/permissions.sqlite-wal", 0x7fff49acb500) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/permissions.sqlite-wal", 0x7fff49acb500) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/content-prefs.sqlite-wal", 0x7fff49ac65e0) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/content-prefs.sqlite-wal", 0x7fff49ac76e0) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/places.sqlite-wal", {st_mode=S_IFREG|0644, st_size=3016896, ...}) = 0
open("/home/drichard/.mozilla/firefox/mn25ti71.default/places.sqlite-wal", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 59
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/favicons.sqlite-wal", {st_mode=S_IFREG|0644, st_size=2098720, ...}) = 0
open("/home/drichard/.mozilla/firefox/mn25ti71.default/favicons.sqlite-wal", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 62
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/cert9.db-wal", 0x7fff49ac51a0) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/key4.db-wal", 0x7fff49ac51a0) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/key4.db-wal", 0x7fff49ac5230) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/key4.db-wal", 0x7fff49ac51d0) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/key4.db-wal", 0x7fff49ac5230) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/key4.db-wal", 0x7fff49ac5360) = -1 ENOENT (No such file or directory)
stat("/home/drichard/.mozilla/firefox/mn25ti71.default/key4.db-wal", 0x7fff49ac1d20) = -1 ENOENT (No such file or directory
Flags: needinfo?(drichard)
So, is this the suggested workaround/resolution for now ?  Any way to compile this in (or a config option in about:config)? Otherwise we just have to wrap the firefox start command with this.
I filed bug 1435376 to implement detection/setting the environment variable as appropriate.
I'm interested too in Eric's question about if this solution is recommended.  Nice that this will be automatically detected and variable set, but what will be our impact?  As FF advances, are we going to be running in a way that is not tested and will have possible issues in the future?  

I will be setting up a group of 10 beta testers today and move them to FF 58 with using NSS_SDB_USE_CACHE
Same here, FF 58.01 on Ubuntu 16.04. Loading with NFS(4) stored profile takes 20-25s, with NSS_SDB_USE_CACHE workaround 5-7s
Note that this is the time it takes until FF is usable, which means it takes that long to load the first webpage.

I noticed one thing: Starting with firefox --new-window "about:config" makes FF directly usable without workaround, also if I choose about:config as startpage (which doesn't get displayed though). Maybe that helps.
Where I work, we're having issues with version 58 on RHEL7 and NFS that are probably related. It's not so much the startup times that are the problem (the UI appears quickly and remains responsive), but it takes ages for the first pages to load completely. Especially the images easily take over 30 seconds to appear. Often, the first pages even stop loading after several minutes, with many/most images still missing.
After a while, everything starts to work normal, but the issue can reappear several times a day (same FF instance still running).

With FF 57, we never had this problem.

After reading this bug report, I tried setting the NSS_SDB_USE_CACHE variable and this makes a huge difference: startup and page loading is now extremely fast, just as in FF 57.

So I suspect that the underlying cause for our symptoms is the same: the slow performance of sqlite on NFS.
  Is FF writing to sqlite async?  (PRAGMA synchronous=OFF;)    If not, maybe that approach could be used to increase the speed over NFS?    That would allow us to use the same codebase as everyone else, and we could enable async with an about:config type setting with the understanding of the inherent risks by doing so.

I have 10 beta testers on 58 with NSS_SDB_USE_CACHE and no reports of problems or performance issues seen so far.
(In reply to drichard from comment #18)
> Is FF writing to sqlite async?  (PRAGMA synchronous=OFF;)    If not, maybe
> that approach could be used to increase the speed over NFS?

I suspect NSS is deliberately using synchronous writes, to achieve the goal that multiple applications can do concurrent read/write operations to the NSS sqlite storage. But ideally Bob Relyea should comment on that question.
Flags: needinfo?(rrelyea)
I filed bug #1436298, so that such regressions are hopefully found in the future before a release.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1436298
    "Add user home on NFS mount to QA tests"
We had a conversation with Bob Relyea, and I'd like to give a summarize of the situation.

The slowness on NFS is known.

Users of Firefox, that store their home directory (or firefox profile directory) on a slow network filesystem, should indeed set the environment variable NSS_SDB_USE_CACHE=yes. This should probably get documented somewhere in the Mozilla support knowledge base.

Changing NSS to always use the behavior that NSS_SDB_USE_CACHE=yes trigger is likely not an option, because it causes NSS to be slower for users that are on a fast filesystem.

(The current default is NSS_SDB_USE_CACHE=no.)

NSS has implemented automatic detection of slow filesystems, which can be explicitly requested by setting the variable to NSS_SDB_USE_CACHE=auto and in theory the NSS default could be changed to always perform this test.

Unfortunately the automatic test slows down startup time.

So, the available options seem to be:

(1) accept that users of slow network filesystems must discover on their own that they need to set
    the NSS_SDB_USE_CACHE=yes environment variable

(2) change NSS to always use the automatic detection, and accept the slower startup time for everyone


A third option could be, inspired by Hubert Kario:

(3) Implement platform specific code, that attempts to check if the storage location for the NSS database files
    is a local filesystem, or located on a network filesystem.

    If it is a network filesystem, run the performance test to check if it's a fast or a slow network filesystem.

With option (3), only users of network filesystems would experience the startup cost for the test.


Regarding the suggestion to use async sqlite writes, Bob has confirmed what I said in comment 19. The primary goal of migrating away from the DBM storage was to enable shared NSS data storage and parallel access, where one application can immediately see what changes another application made.
Flags: needinfo?(rrelyea)
Seeing this same behavior with Windows 7 redirected profiles, so it's not just limited to Linux.  We have our profiles on a network share, and ever since the upgrade to FF58, we've seen the same slowdowns and sluggishness. If we move the exact same networked profile to a local disk, the issues go away. A lot of times we see TLS Handshake issues, so I'm guessing we're also having issues with the certs9.db file being a major culprit.   In fact, just removing that file, and reloading the profile, will result in a single fast startup. 
 
Are there workarounds for the NSS_SDB_USE_CACHE variable for Windows users as well for the time being until this gets resolved?
If you configure your Windows system to set the same environment variable, you should notice an improvement.
OS: Linux → All
Hardware: x86_64 → All
Summary: Linux Firefox 58 hangs at startup with profile on NFS mount → Linux Firefox 58 hangs at startup with profile on NFS mount or Windows Share
(In reply to Kai Engert (:kaie:) from comment #24)
> If you configure your Windows system to set the same environment variable,
> you should notice an improvement.

Thanks Kai.  Can confirm this does indeed solve the issue.  Was able to deploy the environmental variable with group policy.
A little more info:

This wasn't a regression. It was purposefully chosen by mozilla. There are 3 options with NSS_SDB_USE_CACHE: yes, no, and auto. The cache is slower when used with local filesystems, and faster if the filesystem is slow (networked). Auto cost startup time and lots of system calls as it measures whether your local filesystem is slow enough to warrant using the cache.

Since networked home filesystems are significantly less common than local home filesystems, and since the environment variable is available for those cases that use networked filesystems, and because mozilla is sensitive to startup time, the default was changed from auto to no.
It's not because it was a conscious decision that it not a regression. When software, which run blazingly fast in version 57, suddenly becomes almost unusable at times in version 58, this is definitely a performance regression. It doesn't matter that this is only the case for a minority of users. And it also doesn't matter that there is a workaround.

Users expect that Firefox runs fine out of the box, not that they have to google to find a bug report with a reference to an obscure environment variable that may solve their problem. (I'm sure that many users don't even know what an environment variable is, or don't even realize that they are on a network mounted filesystem.)

I really appreciate all the good work that has been done by the Firefox team, but IMHO this was a poor decision in which the potential impact on slow filesystems was grossly underestimated. 

Anyway, it's not like users move their profiles between filesystems all the time, so conceptually, the expensive filesystem speed check only needs to run once (or every once in a while) and you can remember the result for the next time. This may require some implementation effort, but it can't be that hard. 

Just my 2 (Euro)cents.
Tentatively moving to NSS.
Assignee: nobody → nobody
Component: Untriaged → Libraries
Product: Firefox → NSS
Version: 58 Branch → other
Bob, Tim,

I have the feeling that we should find a compromise, because both extremes are not working appropriately.

Although the current default (no probing, no cache) is appropriate for the majority of users, it's unfortunate that a minority of people have to suffer these consequences, especially because the workaround is likely difficult to identify.

How about the following approach:


(Step 1) Always start NSS with the cache enabled (equivalent to the NSS_SDB_USE_CACHE=yes setting). This will avoid the blocking for the minority of users, and solve this regression bug.

Bob said, when using a fast filesystem, using NSS_SDB_USE_CACHE=yes is slower than NSS_SDB_USE_CACHE=no.

I don't know if the NSS_SDB_USE_CACHE=yes setting negatively affects the startup time, or if the effects are more noticeable during the remainder of the Firefox process lifetime.

My hope is, this suggested change has only a small impact on startup time, and that it could be acceptable for all users, in order to help the people who suffer from this bug.


(Step 2) After NSS has completed initialization, NSS could start a secondary thread. On this secondary thread, NSS will perform the filesystem probing. Because the probing happens in parallel, and because it doesn't block the startup of Firefox, my hope is that it shouldn't be noticeable to users.


(Step 3) After the secondary thread has completed the probing, the probing result is delivered to NSS. If the probing showed that we are on a fast filesystem, then NSS disables the cache.


Bob, Tim, what do you think about this idea?
Flags: needinfo?(ttaubert)
Flags: needinfo?(rrelyea)
NSS_SDB_USE_CACHE=yes would negatively affect users the majority of users, both at startup and in continued operation. Caching is slower than direct filesystem access if the filesystem has reasonable performance. I think the best we can do is those on network filesystems just need to set the environment variable (probing was the attempt to compromise by taking an up front startup hit). While it's and extreme affect, the number of users on network filesystems (on all platforms) are far smaller than the rest of the user base. The environment variable work on all OS's (including windows).

The issue with the thread approach is you won't be able to display and https:// home page until the probe is complete. Starting with cached and moving to non-cached or vice versa would require quite a bit of complex design (it would affect database consistency if not handled correctly).

RE comment 27. It's not a regression because the effect was known before the decision was made. The decision was based on 2 things: 1) there was a way for affected users to work around the issue, 2) the 'blindly fast' code was also dangerously broken (running on unmaintained code that hasn't been patched for 3 decades), 3) would have been a performance issue for all users. Those arguments were already weighed with the decision. The fix for the users of this bug is to set the environment variable, not to revert the default for everyone.

bob
Flags: needinfo?(rrelyea)
(In reply to Robert Relyea from comment #30)
> NSS_SDB_USE_CACHE=yes would negatively affect users the majority of users,
> both at startup and in continued operation. Caching is slower than direct
> filesystem access if the filesystem has reasonable performance. I think the
> best we can do is those on network filesystems just need to set the
> environment variable (probing was the attempt to compromise by taking an up
> front startup hit). While it's and extreme affect, the number of users on
> network filesystems (on all platforms) are far smaller than the rest of the
> user base. The environment variable work on all OS's (including windows).

I completely agree with what Bob says here.

If this especially hits enterprise deployments I think admins can easily also set the environment variable for all machines.
Flags: needinfo?(ttaubert)
But then make sure that they know about it. I couldn't find a hint in the release notes when I encountered the slow behavior.
(In reply to lars.behrens from comment #32)
> But then make sure that they know about it. I couldn't find a hint in the
> release notes when I encountered the slow behavior.

So much this.  I understand the hesitation about changing the NSS settings on startup, or on the fly, but at least make enterprise users aware of this.  Before we made the environmental variable change, Firefox was nearly unusable.  We we're extremely close to just removing Firefox from our office completely due to all the struggles of our users.

(In reply to Robert Relyea from comment #30)
> It's not a regression because the effect was known before the decision was made. 
> The decision was based on 2 things: 1) there was a way for affected users to work around the issue

The only documentation on this it seems is this ticket right here.   If this effect was known before the decision was made, why was there no documentation on this breaking roaming profile environments?  If the workaround was already known why wasn't that communicated?  Also, why is this something that has to be set via an environmental variable?  This should be something that could be set via one of the Firefox "Advanced Settings", and via the Firefox GPO templates.  There should also be documentation on it that gives guidance for enterprise admins.
FYI, we're working on automatically setting NSS_SDB_USE_CACHE=1 when we detect a profile on a remote drive on Windows: bug 1435376.
Blocks: 1435376
See Also: → 1449158
PSM changes to detect network directories landed in bug 1444943 (linux) and bug 1435376 (windows).
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Changing Product because NSS didn't change.
Assignee: nobody → nobody
Component: Libraries → Security: PSM
Product: NSS → Core
Version: other → unspecified
Resolution: FIXED → DUPLICATE
See Also: → 1456888
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: