Closed Bug 1141962 Opened 9 years ago Closed 9 years ago

[MTBF] Device random reboot with reason wdog_bark

Categories

(Firefox OS Graveyard :: Stability, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ashiue, Assigned: ting)

References

Details

Attachments

(1 file)

Attached file logcat33
MTBF testing aborted(test device has been reboot and marionette connection lost).
After checking log, and found lots of "E MP-Decision: Error 1 setting online status to 1 forcpuX" error in logcat. 

Also see some log "Sending message to frozen Nuwa", not sure if this is expected?

<build info>
Build ID 20150310162504
Gaia Revision 5af6f8d5d6161dea02002634c6d0a570a122e5dd
Gaia Date 2015-03-10 19:17:12
Gecko Revision https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/ec87adb8cf13
Gecko Version 37.0
Device Name hammerhead
Firmware(Release) 5.0
Firmware(Incrmental)  eng.cltbld.20150310.201846
Firmware Date Tue Mar 10 20:19:04 EDT 2015
Bootloader HHZ12d
QA Whiteboard: [COM=MTBF-Nexus5L]
ni? Ting, can you help to see this issue?
Flags: needinfo?(janus926)
Sure, I'll take it.
Flags: needinfo?(janus926)
Assignee: nobody → janus926
Status: NEW → ASSIGNED
A crash dump with symbol will be helpful. I'll check what are the messages about.
The device did not crash, so there was no crash report.
(In reply to Alison Shiue from comment #0)
> MTBF testing aborted(test device has been reboot and marionette connection
> lost).

QA doubt it has been reboot, but there's no crash dump.

"Sending message to frozen Nuwa" means someone is sending message to Nuwa after it's frozen which is not expected. But the log is not sufficient to see what the message is, I'll try to add some more logs.

"MP-Decision: Error 1 setting online status to 1 forcpuX" should be from QC's proprietary module mpdecision, it seems failed setting cpuX online. We need their input for how to debug this.
(In reply to Ting-Yu Chou [:ting] from comment #5)
> "MP-Decision: Error 1 setting online status to 1 forcpuX" should be from
> QC's proprietary module mpdecision, it seems failed setting cpuX online. We
> need their input for how to debug this.

Bhavana, I'm not sure whom to ask. But how can I have QC's comment regarding this?
Flags: needinfo?(bbajaj)
(In reply to Ting-Yu Chou [:ting] from comment #5)
> "Sending message to frozen Nuwa" means someone is sending message to Nuwa
> after it's frozen which is not expected. But the log is not sufficient to
> see what the message is, I'll try to add some more logs.

The message is PHal::Msg_NotifyNetworkChange, and happens whenever enable/disable WiFi. It does not cause any reboot or crash so don't care for MTBF.
(In reply to Ting-Yu Chou [:ting] from comment #6)
> (In reply to Ting-Yu Chou [:ting] from comment #5)
> > "MP-Decision: Error 1 setting online status to 1 forcpuX" should be from
> > QC's proprietary module mpdecision, it seems failed setting cpuX online. We
> > need their input for how to debug this.
> 
> Bhavana, I'm not sure whom to ask. But how can I have QC's comment regarding
> this?

NI :inder from CAF who should be able to here.

Inder, it seems we are seeing this error message in a CAF module that could be relating to aborts, any thoughts ?
Flags: needinfo?(bbajaj) → needinfo?(ikumar)
Bhavana - No, we haven't seen such an error in our setup and not sure what could be causing it.
Flags: needinfo?(ikumar)
blocking-b2g: --- → 2.2?
Since there's no crash dump nor noticable logs, I tried to attach gdb to b2g process see if I can get anything. The device shutdown at the first try, and in dmesg there're a lot of:

  _cpu_up: attempt to bring up CPU 3 failed

So maybe this does not cause the bug. Also I saw this:

  healthd: battery l=70 v=3963 t=44.3 h=3 st=2 c=181 chg=u

But still nothing noticable, gdb showed "Remote connection closed". Will try another run.
Lets try to catch in gdb on the next run and we can try to get help from CAF as necessary.

For now, I am blocking on this as we basically cannot get ahead on MTBF hours on n-5L
blocking-b2g: 2.2? → 2.2+
I hit the issue at 2nd run, but gdb doesn't catch anything, still "Remote conneciton closed"...
I have observed twice that the device ran till out of battery, but I haven't seen it back alive.

I placed MOZ_CRASH() at PowerManager::Shutdown() and PowerManager::Reboot() to make sure it's not our code to reboot/shutdown the device also added some logs to BatteryManager. Alice is trying another run.
(In reply to Ting-Yu Chou [:ting] from comment #13)
> logs to BatteryManager. Alice is trying another run.

It's Alison...
There's a file /sys/class/power_supply/battery/capacity shows the battery's level.
2 devices were out of battery and both in shutdown state when entered the office this morning. Will try another run.
After running about 9.9 hours last Friday, the debug build hit the issue of device connection lost.
Please refer the log info: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oZS1saVM2QmdDVFk&usp=drive_web
The device I left last Firday is off this morning, but it has a crash report, not sure does it relate to this bug:

Operating system: Android
                  0.0.0 Linux 3.4.0-g42750ec-00031-gfa25ffd #9 SMP PREEMPT Wed Nov 19 15:22:54 CST 2014 armv7l Android/full_hammerhead/hammerhead:5.0/LRX21T/ting03171018:eng/test-keys
CPU: arm
     4 CPUs

Crash reason:  SIGSEGV
Crash address: 0x5a5a5a5e

Thread 160 (crashed)
 0  bluetooth.default.so!fixed_queue_enqueue [fixed_queue.c : 88 + 0x0]
     r4 = 0x5a5a5a5a    r5 = 0x5a5a5a66    r6 = 0xb1fc4a99    r7 = 0x00000000
     r8 = 0x00000000    r9 = 0x00000c03   r10 = 0x00000c35    fp = 0xb1fff9ed
     sp = 0x952e3c58    lr = 0xb1fc7599    pc = 0xb1fc76a8
    Found by: given as instruction pointer in context
 1  bluetooth.default.so!thread_post [thread.c : 126 + 0x7]
     r4 = 0x8d19fa98    r5 = 0x84e7fbb0    r6 = 0xb1fc4a99    r7 = 0x00000000
     r8 = 0x00000000    r9 = 0x00000c03   r10 = 0x00000c35    fp = 0xb1fff9ed
     sp = 0x952e3c68    pc = 0xb1fc7599
    Found by: call frame info
 2  bluetooth.default.so!bthc_tx [bt_hci_bdroid.c : 233 + 0xf]
     r4 = 0x8b408388    r5 = 0xb2024738    r6 = 0x00000001    r7 = 0x00000000
     r8 = 0x00000000    r9 = 0x00000c03   r10 = 0x00000c35    fp = 0xb1fff9ed
     sp = 0x952e3c88    pc = 0xb1fc5003
    Found by: call frame info
 3  bluetooth.default.so!transmit_buf [bt_hci_bdroid.c : 428 + 0x3]
     r4 = 0x8b408388    r5 = 0xb210108c    r6 = 0x00000001    r7 = 0x00000000
     r8 = 0x00000000    r9 = 0x00000c03   r10 = 0x00000c35    fp = 0xb1fff9ed
     sp = 0x952e3c98    pc = 0xb1fc502b
    Found by: call frame info
 4  bluetooth.default.so!bte_main_hci_send [bte_main.c : 597 + 0x9]
     r4 = 0x8b408388    r5 = 0xb210108c    r6 = 0x00000001    r7 = 0x00000000
     r8 = 0x00000000    r9 = 0x00000c03   r10 = 0x00000c35    fp = 0xb1fff9ed
     sp = 0x952e3ca0    pc = 0xb1f1a1c7
    Found by: call frame info
 5  bluetooth.default.so!btu_hcif_send_cmd [btu_hcif.c : 502 + 0x9]
     r4 = 0x8b408388    r5 = 0xb210108c    r6 = 0x00000001    r7 = 0x00000000
     r8 = 0x00000000    r9 = 0x00000c03   r10 = 0x00000c35    fp = 0xb1fff9ed
     sp = 0x952e3ca8    pc = 0xb1fa4c69
    Found by: call frame info
 6  bluetooth.default.so!btu_hcif_process_event [btu_hcif.c : 1531 + 0x7]
     r4 = 0x84e80015    r5 = 0x00002006    r6 = 0x00000000    r7 = 0x00000000
     r8 = 0xb21010a8    r9 = 0xb210108c   r10 = 0xb2101098    fp = 0x00000001
     sp = 0x952e3cd8    pc = 0xb1fa52cb
    Found by: call frame info
 7  bluetooth.default.so!btu_task [btu_task.c : 259 + 0x7]
     r4 = 0x84e80008    r5 = 0xb2012ad8    r6 = 0x00000005    r7 = 0xb1fffda4
     r8 = 0xb1fffd0f    r9 = 0xb1fffd45   r10 = 0xb2101008    fp = 0x00000002
     sp = 0x952e3d48    pc = 0xb1fa5d63
    Found by: call frame info
 8  bluetooth.default.so!gki_task_entry [gki_ulinux.c : 263 + 0x5]
     r4 = 0xb2101508    r5 = 0xb210025c    r6 = 0x00000000    r7 = 0xb1ff079e
     r8 = 0xade24e88    r9 = 0xade24e88   r10 = 0xb6eab2c5    fp = 0x952e3db0
     sp = 0x952e3d78    pc = 0xb1f754eb
    Found by: call frame info
 9  libc.so!__pthread_start [pthread_create.cpp : 141 + 0x5]
     r4 = 0xade24e80    r5 = 0xade24ec0    r6 = 0xade24e80    r7 = 0x00000078
     r8 = 0xade24e88    r9 = 0xade24e88   r10 = 0xb6eab2c5    fp = 0x952e3db0
     sp = 0x952e3d98    pc = 0xb6eab2e5
    Found by: call frame info
10  libc.so!__start_thread [clone.cpp : 41 + 0x3]
     r4 = 0xade24e88    r5 = 0xb6eab2c5    r6 = 0xade24e80    r7 = 0x00000078
     r8 = 0xade24e88    r9 = 0xade24e88   r10 = 0xb6eab2c5    fp = 0x952e3db0
     sp = 0x952e3da8    pc = 0xb6ea92d5
    Found by: call frame info
(In reply to Alison Shiue from comment #17)
> After running about 9.9 hours last Friday, the debug build hit the issue of
> device connection lost.
> Please refer the log info:
> https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-
> lNDWe2oZS1saVM2QmdDVFk&usp=drive_web

Since there's no crash dump, so it is not a reboot/poweroff from PowerManager. Also, the last BatteryManager::Notify() call still has 1/4 battery:

  logcat204
  03-21 00:05:53.839  8885  8885 D x       : BatteryManager::Notify pLevel=0.240000 pCharging=1 mLevel=0.230000 mCharging=1

which is about 18 mins ago before the last line of logcat:

  logcat209
  03-21 00:23:01.152   869   884 E MP-Decision: Error 1 setting online status to 1 forcpu3

So it also seems not out of battery.

Inder, are there any suggestions for such a random device reboot that dmesg shows no panic log also there's no crash dump?
Flags: needinfo?(ikumar)
(In reply to Ting-Yu Chou [:ting] from comment #18)
> The device I left last Firday is off this morning, but it has a crash
> report, not sure does it relate to this bug:

This should be a different issue since the bug is device random reboot.
(In reply to Ting-Yu Chou [:ting] from comment #19)
> Inder, are there any suggestions for such a random device reboot that dmesg
> shows no panic log also there's no crash dump?

This is the last message from /proc/last_kmsg:

[36170.628019] PDN configure_ldo_or_hs_one: could not switch krait3 to ldo rc = -6
[36170.675940] _cpu_up: attempt to bring up CPU 1 failed
[36170.677219] _cpu_up: attempt to bring up CPU 2 failed
[36171.745205] Watchdog bark! Now = 36171.745199
[36171.745259] Watchdog last pet at 36160.744938
[36171.745351] cpu alive mask from last pet 0
[36171.745397] Causing a watchdog bite!
No errors detected
Boot info:
Last boot reason: wdog_bark
Summary: [MTBF][Nexus5-L] MTBF testing aborted and found lots of "E MP-Decision: Error 1 setting online status to 1 forcpuX" error in logcat → [MTBF][Nexus5-L] MTBF testing aborted and the device is random reboot with last boot reason wdog_bark
Summary: [MTBF][Nexus5-L] MTBF testing aborted and the device is random reboot with last boot reason wdog_bark → [MTBF][Nexus5-L] MTBF testing aborted and the device random reboot with reason wdog_bark
According to comment 21, should we ask partner to check this bug?
(In reply to Ting-Yu Chou [:ting] from comment #21)
> [36171.745205] Watchdog bark! Now = 36171.745199
> [36171.745259] Watchdog last pet at 36160.744938
> [36171.745351] cpu alive mask from last pet 0
> [36171.745397] Causing a watchdog bite!
> No errors detected
> Boot info:
> Last boot reason: wdog_bark

There're no files under /data/dontpanic.
(In reply to Ken Chang[:ken] from comment #22)
> According to comment 21, should we ask partner to check this bug?

NI'd Inder in comment 19 already.
Sorry, I am not sure what branch Nexus uses nor do I have any clues from the info here. Let me know if you are able to reproduce on the ZTE L based device or on Flame and we can look into it.
Flags: needinfo?(ikumar)
2 devices with Ting-Yu's debug build hit this issue again, please refer log information for both devices: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oaHhmVTVhbHQzOVU&usp=drive_web
(In reply to Alison Shiue from comment #26)
> 2 devices with Ting-Yu's debug build hit this issue again, please refer log
> information for both devices:
> https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-
> lNDWe2oaHhmVTVhbHQzOVU&usp=drive_web

So I put a dump_stack() call in wdog_bark_handler(), and the last_kmsg Alison attached are:

1.
16035.716846] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fec4>] (dump_stack+0x20/0x24)
[16035.716902] [<c0a3fec4>] (dump_stack+0x20/0x24) from [<c016d994>] (wdog_bark_handler+0x128/0x1d0)
[16035.716997] [<c016d994>] (wdog_bark_handler+0x128/0x1d0) from [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4)
[16035.717088] [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4) from [<c01f9c8c>] (handle_irq_event+0x4c/0x6c)
[16035.717181] [<c01f9c8c>] (handle_irq_event+0x4c/0x6c) from [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180)
[16035.717272] [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180) from [<c01f9278>] (generic_handle_irq+0x3c/0x4c)
[16035.717366] [<c01f9278>] (generic_handle_irq+0x3c/0x4c) from [<c0107d88>] (handle_IRQ+0x6c/0xc8)
[16035.717418] [<c0107d88>] (handle_IRQ+0x6c/0xc8) from [<c0100560>] (gic_handle_irq+0x50/0xf4)
[16035.717470] [<c0100560>] (gic_handle_irq+0x50/0xf4) from [<c0107000>] (__irq_svc+0x40/0x70)
[16035.717555] Exception stack(0xebf37d58 to 0xebf37da0)
[16035.717603] 7d40:                                                       c10c9d40 60000013
[16035.717692] 7d60: 47e447e4 00010000 c16fee00 00000400 c5a05ffc 00000400 ebf37dc8 00100100
[16035.717778] 7d80: c10c9ac0 ebf37dac 00000000 ebf37da0 c023bec8 c0a5210c 60000013 ffffffff
[16035.717871] [<c0107000>] (__irq_svc+0x40/0x70) from [<c0a5210c>] (_raw_spin_unlock_irqrestore+0x3c/0x74)
[16035.717930] [<c0a5210c>] (_raw_spin_unlock_irqrestore+0x3c/0x74) from [<c023bec8>] (release_pages+0x1b0/0x1e8)
[16035.718023] [<c023bec8>] (release_pages+0x1b0/0x1e8) from [<c02533a4>] (unmap_single_vma+0x428/0x6e4)
[16035.718113] [<c02533a4>] (unmap_single_vma+0x428/0x6e4) from [<c0253e70>] (unmap_vmas+0x68/0x7c)
[16035.718203] [<c0253e70>] (unmap_vmas+0x68/0x7c) from [<c0253f38>] (zap_page_range+0xb4/0x15c)
[16035.718292] [<c0253f38>] (zap_page_range+0xb4/0x15c) from [<c0251e30>] (sys_madvise+0x260/0x5d0)
[16035.718345] [<c0251e30>] (sys_madvise+0x260/0x5d0) from [<c0107400>] (ret_fast_syscall+0x0/0x30)
[16035.718430] Causing a watchdog bite!
No errors detected
Boot info:
Last boot reason: wdog_bark

2.
10885.616958] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fec4>] (dump_stack+0x20/0x24)
[10885.617089] [<c0a3fec4>] (dump_stack+0x20/0x24) from [<c016d994>] (wdog_bark_handler+0x128/0x1d0)
[10885.617223] [<c016d994>] (wdog_bark_handler+0x128/0x1d0) from [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4)
[10885.617355] [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4) from [<c01f9c8c>] (handle_irq_event+0x4c/0x6c)
[10885.617434] [<c01f9c8c>] (handle_irq_event+0x4c/0x6c) from [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180)
[10885.617565] [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180) from [<c01f9278>] (generic_handle_irq+0x3c/0x4c)
[10885.617697] [<c01f9278>] (generic_handle_irq+0x3c/0x4c) from [<c0107d88>] (handle_IRQ+0x6c/0xc8)
[10885.617827] [<c0107d88>] (handle_IRQ+0x6c/0xc8) from [<c0100560>] (gic_handle_irq+0x50/0xf4)
[10885.617955] [<c0100560>] (gic_handle_irq+0x50/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
[10885.618024] Exception stack(0xebd37fb0 to 0xebd37ff8)
[10885.618146] 7fa0:                                     00100000 ad38ff60 00002b4e 0000ff60
[10885.618271] 7fc0: ad3f9520 b2defcc0 00000001 00110000 00000001 b1e2a6e0 be7fd188 000054dc
[10885.618342] 7fe0: 00000000 be7fd120 b5b40b37 b5b40be2 00070030 ffffffff
[10885.618463] Causing a watchdog bite!
No errors detected
Boot info:
Last boot reason: wdog_bark
I should use abort() instead, will try another run.
(In reply to Ting-Yu Chou [:ting] from comment #28)
> I should use abort() instead, will try another run.

I am not sure how to generate core dump in kernel mode, I placed panic() and die() (with NULL registers) in the watchdog handler.
2 devices with new debug build hit reboot issue again, but the last boot reason are different, please refer: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2odXBsZU1FaHBoLTA&usp=drive_web
(In reply to Alison Shiue from comment #30)
> 2 devices with new debug build hit reboot issue again, but the last boot
> reason are different, please refer:
> https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-
> lNDWe2odXBsZU1FaHBoLTA&usp=drive_web

The one with panic() have some output but still nothing helpful, there's no user space stack:

  [17117.263699] Watchdog bark! Now = 17117.263684
  [17117.263917] Watchdog last pet at 17103.677383
  [17117.264280] cpu alive mask from last pet 0-3
  [17117.264469] Causing a watchdog bite!
  [17117.264662] Kernel panic - not syncing: Panic!
  [17117.265231] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fe68>] (dump_stack+0x20/0x24)
  [17117.265612] [<c0a3fe68>] (dump_stack+0x20/0x24) from [<c0a4076c>] (panic+0xb0/0x224)
  [17117.265842] [<c0a4076c>] (panic+0xb0/0x224) from [<c016d9a4>] (wdog_bark_handler+0x138/0x164)
  [17117.266222] [<c016d9a4>] (wdog_bark_handler+0x138/0x164) from [<c01f99b4>] (handle_irq_event_percpu+0x94/0x2b4)
  [17117.266603] [<c01f99b4>] (handle_irq_event_percpu+0x94/0x2b4) from [<c01f9c20>] (handle_irq_event+0x4c/0x6c)
  [17117.266991] [<c01f9c20>] (handle_irq_event+0x4c/0x6c) from [<c01fca38>] (handle_fasteoi_irq+0xb0/0x180)
  [17117.267381] [<c01fca38>] (handle_fasteoi_irq+0xb0/0x180) from [<c01f920c>] (generic_handle_irq+0x3c/0x4c)
  [17117.267766] [<c01f920c>] (generic_handle_irq+0x3c/0x4c) from [<c0107d88>] (handle_IRQ+0x6c/0xc8)
  [17117.267990] [<c0107d88>] (handle_IRQ+0x6c/0xc8) from [<c0100560>] (gic_handle_irq+0x50/0xf4)
  [17117.268367] [<c0100560>] (gic_handle_irq+0x50/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
  [17117.268718] Exception stack(0xeb46ffb0 to 0xeb46fff8)
  [17117.268918] ffa0:                                     a34bc000 74800000 00000001 b4c97e01
  [17117.269285] ffc0: a34bc030 00000000 00000000 002e001b 00000000 beaa33b4 beaa337c 0000000c
  [17117.269645] ffe0: b6fb3e18 beaa32b8 b4c5b93f b4c97e00 600f0030 ffffffff
  [17117.269850] CPU2: stopping
  [17117.269899] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fe68>] (dump_stack+0x20/0x24)
  [17117.269945] [<c0a3fe68>] (dump_stack+0x20/0x24) from [<c010cd0c>] (handle_IPI+0x1c0/0x1d4)
  [17117.270021] [<c010cd0c>] (handle_IPI+0x1c0/0x1d4) from [<c01005c4>] (gic_handle_irq+0xb4/0xf4)
  [17117.270096] [<c01005c4>] (gic_handle_irq+0xb4/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
  [17117.270168] Exception stack(0xea441fb0 to 0xea441ff8)
  [17117.270209] 1fa0:                                     00000000 b6fc7ec8 00000000 00000000
  [17117.270284] 1fc0: b6858800 00000010 b6fc7ec8 00000001 beaa4ef0 b6804670 beaa4e98 b5f3d5d3
  [17117.270325] 1fe0: b6fb3e54 beaa4db0 b6f855df b6f855d6 600e0030 ffffffff
  [17117.270399] CPU1: stopping
  [17117.270448] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fe68>] (dump_stack+0x20/0x24)
  [17117.270523] [<c0a3fe68>] (dump_stack+0x20/0x24) from [<c010cd0c>] (handle_IPI+0x1c0/0x1d4)
  [17117.270599] [<c010cd0c>] (handle_IPI+0x1c0/0x1d4) from [<c01005c4>] (gic_handle_irq+0xb4/0xf4)
  [17117.270643] [<c01005c4>] (gic_handle_irq+0xb4/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
  [17117.270716] Exception stack(0xe57d1fb0 to 0xe57d1ff8)
  [17117.270789] 1fa0:                                     00000004 20000000 00000000 b6762dcd
  [17117.270831] 1fc0: beaa48c4 b10c47a0 9e3779b9 00000001 b10c47a0 beaa49c4 b32aae40 b32aae6c
  [17117.270904] 1fe0: 00000031 beaa48a0 b4ade257 b4ade8c0 000e0030 ffffffff
  [17118.271469] Rebooting in 5 seconds..
  [17123.273753] Going down for restart now
  [17123.274942] Calling SCM to disable SPMI PMIC arbiter

  No errors detected
  Boot info:
  Last boot reason: kernel_panic

The one with die(), even it is accessing invalid address, there's no core dump:

  [21376.329037] Watchdog bark! Now = 21376.329031
  [21376.329127] Watchdog last pet at 21365.328771
  [21376.329175] cpu alive mask from last pet 0
  [21376.329220] Causing a watchdog bite!
  [21376.329307] Unable to handle kernel NULL pointer dereference at virtual address 00000040
  [21376.329397] pgd = ebcd0000
  [21376.329481] [00000040] *pgd=00000000

  No errors detected
  Boot info:
  Last boot reason: wdog_bite
BTW, not sure if it is related, the message "_cpu_up: attempt to bring up CPU * failed" is because msm_thermal_cpu_callback() returns NOTIFY_BAD due to cpu core overheating (>80°c).
pet_watchdog_work() is enqueued to cpu0 and is expected to be executed after 10s (qcom,pet-time) delay. The watchdog barks if it is not pet for 11s (qcom,bark-time), which I don't think user space can cause this. So I'd stop trying to get a core dump.
(In reply to Ting-Yu Chou [:ting] from comment #33)
> (qcom,bark-time), which I don't think user space can cause this. So I'd stop
> trying to get a core dump.

Scratch this, https://code.google.com/p/android/issues/detail?id=63385 shows it's possible that user space causes it.
Paul & Alison have setup video recording with 3 devices, we will check what are they doing right before reboot. I will also trace workqueue and cpu hotplug in kernel.
Hi Ting-Yu,

3 devices all encountered reboot issue, and unfortunately, the video recording was not long enough that we only recorded one device reboot behaviour.

Here are log information for 3 devices: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oWVNhNktKdlV4QkE&usp=drive_web

Video (device 0ae0abef0337dcbf reboot at about 9:00): https://vreplay.mozilla.com/replay/showRecordDetails.html?sortBy=date&viewCount=1&currentPage=1&groupBy=combo&roomFilter=&usernameFilter=&searchFilter=&usernameFullFilter=&myManager=-1&adminManager=0&webCast=0&command=&recId=1745&auxMessage=&auxMessage1=&lang=en&langChanged=&tenantFilter=&securityTab=
Executed 2 devices with debug again and both encountered reboot issue. 
Besides, one device (0ae0abef0337dcbf) hit one crash when browser play video, and I think this issue is tracked at bug 1146861.

Here are log information and videos for 2 devices: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oSXdSdGVZVkZVNTA&usp=drive_web

* Note:   
[devices 075a1a2000d22892 (left, red)] reboot at capture-2.avi 1:16:39
[devices 0ae0abef0337dcbf (right, white)] reboot at capture-4.avi 4:50:10, and this device executed camera multiple shots before reboot.
This case also found on Flame KK. Update title.
Summary: [MTBF][Nexus5-L] MTBF testing aborted and the device random reboot with reason wdog_bark → [MTBF]MTBF testing aborted and the device random reboot with reason wdog_bark
(In reply to Alison Shiue from comment #36)
> Video (device 0ae0abef0337dcbf reboot at about 9:00):
> https://vreplay.mozilla.com/replay/showRecordDetails.
> html?sortBy=date&viewCount=1&currentPage=1&groupBy=combo&roomFilter=&username
> Filter=&searchFilter=&usernameFullFilter=&myManager=-
> 1&adminManager=0&webCast=0&command=&recId=1745&auxMessage=&auxMessage1=&lang=
> en&langChanged=&tenantFilter=&securityTab=

The device was idle at homescreen before reboot.

(In reply to Alison Shiue from comment #37)
> [devices 075a1a2000d22892 (left, red)] reboot at capture-2.avi 1:16:39

It was doing text input.

> [devices 0ae0abef0337dcbf (right, white)] reboot at capture-4.avi 4:50:10,
> and this device executed camera multiple shots before reboot.

And from last_kmsg, pet_watchdog_work() did enqueue the delayed work, but it was not triggered.
Are there any suggestions to figure out what cause the watchdog bite? Thanks.
Flags: needinfo?(ikumar)
(In reply to Ting-Yu Chou [:ting] from comment #40)
> Are there any suggestions to figure out what cause the watchdog bite? Thanks.

NI mwu, dhyland, gsvelto as well.
Flags: needinfo?(mwu)
Flags: needinfo?(gsvelto)
Flags: needinfo?(dhylands)
I have talked to mwu yesterday, he suggests to make sure the issue can reproduce on Flame so we can have QC to help. QA will try to get last_kmsg from a Flame with the same symptom to see if it is also a watchdog bite.
Flags: needinfo?(mwu)
Flags: needinfo?(gsvelto)
Flags: needinfo?(dhylands)
Flame tested, they were "unable to connect to device" but didn't reboot.  Possibly test harness or adb busy.  I'll add retry and see if I could reproduce.  Can't tell any relation by flame's latest build.
I added some logs to the kernel, it shows the timer function (delayed_work_timer_fn()) for queuing pet work does happen, but pet_watchdog_work() is not called.
From last night's logs, there was a work being processed and queued repeatedly around every 30us before watchdog bite. I'll see if I can figure out what it is.
(In reply to Ting-Yu Chou [:ting] from comment #45)
> From last night's logs, there was a work being processed and queued
> repeatedly around every 30us before watchdog bite. I'll see if I can figure
> out what it is.

Hmm... I thoguht it could be some busy loop cause the watchdog bite, but there were 2~3 seconds before watchdog bite which run_timer_softirq() comes in and pet_watchdog_work() is not called.
Summary: [MTBF]MTBF testing aborted and the device random reboot with reason wdog_bark → [MTBF] Device random reboot with reason wdog_bark
I've seen more than one patterns triggering watchdog bite:

1. workqueue seems block in kgsl_process_events
2. a lot of flush_to_ldisc()
3. only see tick_nohz_stop_sched_tick() logs
I have told to Ting-Yu for this bug.  I suggest to disable watchdog or reduce intervals of the timer.  

By definition, this kernel watchdog is watching if kernel is running correctly.  Once watchdog barks, it means some part of kernel is wrong.  It means kernel do something wrong, it is a bug of kernel.  Any good kernel should not make good watchdog barking.

Since we are meaning to running MTBF on nexus 5, we don't ship nexus 5 and its kernel as a product, it is insane to spend a lot of resource to fix it.  If disabling or reducing intervals could fix MTBF, why not?
(In reply to Thinker Li [:sinker] from comment #48)
> I have told to Ting-Yu for this bug.  I suggest to disable watchdog or
> reduce intervals of the timer.  
> 
> By definition, this kernel watchdog is watching if kernel is running
> correctly.  Once watchdog barks, it means some part of kernel is wrong.  It
> means kernel do something wrong, it is a bug of kernel.  Any good kernel
> should not make good watchdog barking.
> 
> Since we are meaning to running MTBF on nexus 5, we don't ship nexus 5 and
> its kernel as a product, it is insane to spend a lot of resource to fix it. 
> If disabling or reducing intervals could fix MTBF, why not?

Viral, how do you think?
Flags: needinfo?(vwang)
I have talked to Alison yesterday as I noticed Nexus-5-L MTBF reaches 25 hours in a recent update. She told me the number was based on 5.1, which is different from this bug (5.0). She will try more runs and upadte here if it can no longer reproduce.
Flags: needinfo?(vwang)
Flags: needinfo?(ikumar)
(In reply to Ting-Yu Chou [:ting] from comment #49)
> (In reply to Thinker Li [:sinker] from comment #48)
> > I have told to Ting-Yu for this bug.  I suggest to disable watchdog or
> > reduce intervals of the timer.  
> > 
> > By definition, this kernel watchdog is watching if kernel is running
> > correctly.  Once watchdog barks, it means some part of kernel is wrong.  It
> > means kernel do something wrong, it is a bug of kernel.  Any good kernel
> > should not make good watchdog barking.
> > 
> > Since we are meaning to running MTBF on nexus 5, we don't ship nexus 5 and
> > its kernel as a product, it is insane to spend a lot of resource to fix it. 
> > If disabling or reducing intervals could fix MTBF, why not?
> 
> Viral, how do you think?

I agree the watchdog problem comes from kernel side should not relative to gecko.
Yes, we can provide a kernel build disable watchdog, but if we still met the case that kernel didn't kick watchdog in time, the MTBF may stop at the same time (since kernel has no bandwidth to kick watchdog, it can't service gecko/gaia)
I will try it in my local device to see if MTBF can run longer.
Until now, we have executed MTBF with Nexus-5-L-5.1 9 times, did not hit this reboot issue.
Hi! Alison,

Just want to check that do you see this issue again in past three days?
Thanks

--
Keven
Flags: needinfo?(ashiue)
(In reply to Keven Kuo [:kkuo] from comment #53)
> Hi! Alison,
> 
> Just want to check that do you see this issue again in past three days?
> Thanks
> 
> --
> Keven

No, I did not see this issue again in past three days.
Flags: needinfo?(ashiue)
(In reply to Keven Kuo [:kkuo] from comment #53)
> Hi! Alison,
> 
> Just want to check that do you see this issue again in past three days?
> Thanks
> 
> --
> Keven

No, I did not see this issue again in past three days.
mark unconfirmed by comment 52 and comment 55.
Status: ASSIGNED → UNCONFIRMED
Ever confirmed: false
Hi Alison,
Please renominate if you still observe the issue again. Thanks!
blocking-b2g: 2.2+ → ---
Flags: needinfo?(ashiue)
(In reply to Josh Cheng [:josh] from comment #57)
> Hi Alison,
> Please renominate if you still observe the issue again. Thanks!

OK.
Flags: needinfo?(ashiue)
Triage: mark as wontfix. Not to keep track on nexus 5-L. If z3c-L has this issue, please create new one.
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: