Closed Bug 1066765 Opened 10 years ago Closed 10 years ago

please run hardware diagnostics on foopy64 and reimage

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Unassigned)

References

Details

(Whiteboard: iX Systems RMA Case ID #AMA-717-88012)

Attachments

(1 file)

This host is having a bunch of problems that carry over to its attached pandas; let's run diags first.
colo-trip: --- → scl3
running diagnostics
Whiteboard: running diagnostics
memtest found 0 errors after 13 passes, running Western Digital's full media scan on hdd.
passed hd diags, reimaging
Whiteboard: running diagnostics → reimaging
host back up.

sals-MacBook-Pro-3:~ sal$ sudo fping  10.26.19.126
10.26.19.126 is alive
sals-MacBook-Pro-3:~ sal$ sudo fping   10.26.131.21
10.26.131.21 is alive
sals-MacBook-Pro-3:~ sal$ ssh !$
ssh 10.26.131.21
The authenticity of host '10.26.131.21 (10.26.131.21)' can't be established.
RSA key fingerprint is 0f:fb:74:e6:23:32:2b:30:ca:e6:4c:b2:f7:97:f5:26.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
It's still showing load issues. Do we have more invasive disk diags we can run (or other diags)?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Oh, heck yeah, we definitely have drive issues. There are multiple:

Sep 30 18:27:53 foopy64 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 30 18:27:53 foopy64 kernel: ata1.00: irq_stat 0x40000001
Sep 30 18:27:53 foopy64 kernel: ata1.00: failed command: FLUSH CACHE EXT
Sep 30 18:27:53 foopy64 kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 30 18:27:53 foopy64 kernel:         res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Sep 30 18:27:53 foopy64 kernel: ata1.00: status: { DRDY ERR }
Sep 30 18:27:53 foopy64 kernel: ata1.00: error: { ABRT }
Sep 30 18:27:53 foopy64 kernel: ata1.00: configured for UDMA/33
Sep 30 18:27:53 foopy64 kernel: ata1: EH complete

Please contact iX for a drive replacement.
Come to think of it, it may also be the controller. They'll probably ask for some diagnostic data that we can pass on to them to help diagnose the issue.
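For anyone gathering diagnostic data for the vendor, here is a rough sketch of how the libata messages quoted above could be summarized per device. It is not a tool used in this bug; the function name `summarize_ata_errors` and the `SAMPLE` list are hypothetical, and the sketch assumes syslog-style lines in the same format as the excerpt above. Where those lines live on disk is left to the caller.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Sep 30 18:27:53 foopy64 kernel: ata1.00: failed command: FLUSH CACHE EXT
ATA_LINE = re.compile(r"kernel:\s+(ata\d+(?:\.\d+)?):\s+(.*)$")


def summarize_ata_errors(lines):
    """Group libata exception details by device name (e.g. 'ata1.00')."""
    summary = defaultdict(
        lambda: {"exceptions": 0, "failed_commands": [], "errors": []}
    )
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not a libata line (e.g. the indented "res ..." line)
        device, message = match.groups()
        if message.startswith("exception "):
            summary[device]["exceptions"] += 1
        elif message.startswith("failed command:"):
            command = message.split(":", 1)[1].strip()
            summary[device]["failed_commands"].append(command)
        elif message.startswith("error:"):
            summary[device]["errors"].append(message.split(":", 1)[1].strip())
    return dict(summary)


# Hypothetical sample, copied from the log excerpt in this bug.
SAMPLE = [
    "Sep 30 18:27:53 foopy64 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
    "Sep 30 18:27:53 foopy64 kernel: ata1.00: failed command: FLUSH CACHE EXT",
    "Sep 30 18:27:53 foopy64 kernel: ata1.00: status: { DRDY ERR }",
    "Sep 30 18:27:53 foopy64 kernel: ata1.00: error: { ABRT }",
    "Sep 30 18:27:53 foopy64 kernel: ata1: EH complete",
]

if __name__ == "__main__":
    for device, info in summarize_ata_errors(SAMPLE).items():
        print(device, info)
```

A per-device count of exceptions and failed commands is usually easier to hand to a vendor than raw log pages, though the raw lines should still be attached.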
iX Systems RMA Case ID #AMA-717-88012
Whiteboard: reimaging → iX Systems RMA Case ID #AMA-717-88012
I've run HDD diags, but no errors were detected. While we wait for iX to respond, I'll reimage the host.
Reimaging won't help; we've already done that. There are most definitely issues with the disk or controller if you look at the system logs.
The host has been dropped off at iX for 48 hours of burn-in diags.
Feedback from iX support:

The node itself has passed our burn-in test. Have you run any additional tests on the drives associated with it?
Node came back from iX with no errors detected. iX has loaned me a temporary HDD as a replacement. Upon reimaging, I get the following error; see attachment.
Attached image foopy64.JPG
fwiw, we're again getting:

[19:48:59]	nagios-releng	Tue 16:49:02 PDT [4219] foopy64.p4.releng.scl3.mozilla.com:load is CRITICAL: (Return code of 255 is out of bounds) (http://m.mozilla.org/load)
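A note on reading that alert: Nagios plugins are expected to exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A "Return code of 255 is out of bounds" message therefore generally means the check itself could not run or report, rather than that a high load value was measured. The small sketch below illustrates the mapping; `describe_return_code` is a hypothetical helper, not part of Nagios.

```python
# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_return_code(code):
    """Translate a plugin exit code into a human-readable state."""
    if code in NAGIOS_STATES:
        return NAGIOS_STATES[code]
    # Anything outside 0-3 is reported by Nagios as "out of bounds".
    return f"out of bounds ({code}): the check did not run or report normally"


if __name__ == "__main__":
    for code in (0, 2, 255):
        print(code, "->", describe_return_code(code))
```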
(In reply to Vinh Hua [:vinh] from comment #15)
> Created attachment 8501434 [details]
> foopy64.JPG

Vinh: according to dustin and lsscsi, foopies shouldn't have RAID. I think the disk needs to be zeroed to wipe the metadata that is hanging up anaconda.
Host is online now with a loaner HDD. Let me know if issues persist.
:callek - How's foopy64 holding up after the HDD replacement?
I'm not sure if it's been put back into rotation; passing the question off to coop.
Flags: needinfo?(coop)
It's back in rotation, and seems to be working correctly thus far.

I'll reopen if anything goes wrong.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED
:coop - The replacement hard disk came in.  Can you take foopy64 down for the replacement?
Status: RESOLVED → REOPENED
Flags: needinfo?(coop)
Resolution: FIXED → ---
(In reply to Vinh Hua [:vinh] from comment #22)
> :coop - The replacement hard disk came in.  Can you take foopy64 down for
> the replacement?

I've disabled all the pandas on foopy64 and shut down the foopy. It's all yours.
Flags: needinfo?(coop)
:coop - Foopy64 has been reimaged with the replacement drive.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations