Closed
Bug 1407534
Opened 7 years ago
Closed 6 years ago
Please re-image talos-linux64-ix-033
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: aobreja, Unassigned)
References
Details
Please re-image talos-linux64-ix-033
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Machine is unreachable, manual intervention required
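The report above reads like the output of a helper that tries progressively more forceful reboot methods and gives up when all of them fail. The tool that produced it is not shown in this bug, so the following is only a minimal sketch of that escalation pattern. The function name `reboot_with_escalation`, the `(name, callable)` method list, and the "Success" wording are assumptions made for illustration; the real SSH and IPMI calls are left as injected callables.

```python
from typing import Callable, List, Sequence, Tuple

# A reboot method is a label plus a callable that returns True on success.
# Both are hypothetical stand-ins for the real SSH / IPMI helpers.
RebootMethod = Tuple[str, Callable[[str], bool]]


def reboot_with_escalation(
    host: str, methods: Sequence[RebootMethod]
) -> Tuple[bool, List[str]]:
    """Try each reboot method in order and stop at the first one that works.

    Returns (success, log_lines). The log format mirrors the lines quoted
    in the bug description; "Success" is an assumed wording.
    """
    log: List[str] = []
    for name, attempt in methods:
        try:
            ok = attempt(host)
        except Exception:
            # Treat any error from the method (timeout, refused key, ...) as a failure.
            ok = False
        log.append(f"Attempting {name} reboot...{'Success' if ok else 'Failed'}.")
        if ok:
            return True, log
    log.append("Machine is unreachable, manual intervention required")
    return False, log


if __name__ == "__main__":
    # Stub methods that always fail, to reproduce the situation in this bug.
    always_fail = lambda host: False
    _, lines = reboot_with_escalation(
        "talos-linux64-ix-033", [("SSH", always_fail), ("IPMI", always_fail)]
    )
    print("\n".join(lines))
```

Passing the methods in as callables keeps the escalation order explicit and lets the logic be exercised without touching a real machine.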
Reporter
Updated•7 years ago
Blocks: talos-linux64-ix-033
Comment 1•7 years ago
back online after reimage.
mozillas-Air-2:~ vle$ fping talos-linux64-ix-033.test.releng.scl3.mozilla.com
talos-linux64-ix-033.test.releng.scl3.mozilla.com is alive
Assignee: server-ops-dcops → vle
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter
Comment 2•7 years ago
The machine was re-imaged, but I still don't have access to it. I tried to re-image it 2 times, but the issue is still the same:
>Using username "root".
>Server refused our key
Maybe puppet didn't run successfully. Can you run diagnostics on it, or pass it to RelOps? This machine behaves very strangely; I can connect to the other machines in this pool.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 3•7 years ago
can relops run puppet on this server? it boots up fine and shows no issues.
Assignee: vle → nobody
Component: DCOps → General
Product: Infrastructure & Operations → Release Engineering
QA Contact: cshields
Version: unspecified → ---
Comment 4•6 years ago
Not sure if task is still valid but 302'ing to CIduty folks.
Component: General → Buildduty
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee
Comment 5•6 years ago
I have re-imaged this machine; however, the process took more than 2 hours. It should be done in 1 hour at most. After the re-image was done, I was able to log in, but the server then froze. I power cycled it and checked the disk. It looks like the disk has some errors, which is why it is slow. Van, can you swap the disk with a new one so I can re-image once again?

Full SMART log:

smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-76-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4 Serial ATA
Device Model:     WDC WD5003ABYX-01WERA2
Serial Number:    WD-WMAYP3662834
LU WWN Device Id: 5 0014ee 05885c9e2
Firmware Version: 01.01S03
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 15 23:10:22 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 0)   The previous self-test routine completed without error
                                        or no self-test has ever been run.
Total time to complete Offline data collection: ( 7860) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:      ( 2) minutes.
Extended self-test routine recommended polling time:   ( 80) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 1925
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   093   000    Old_age   Always       -       4830
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   117   112   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   001   001   000    Old_age   Offline      -       85013

SMART Error Log Version: 1
ATA Error Count: 2
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 2896 hours (120 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 40 28 57 81 ee  Error: IDNF at LBA = 0x0e815728 = 243357480

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 40 28 57 81 ee 00  20d+11:18:23.842  WRITE DMA
  ca 00 98 88 56 81 ee 00  20d+11:18:23.841  WRITE DMA
  ca 00 70 10 56 81 ee 00  20d+11:18:23.841  WRITE DMA
  ca 00 20 e8 55 81 ee 00  20d+11:18:23.351  WRITE DMA
  ca 00 20 c0 55 81 ee 00  20d+11:18:23.072  WRITE DMA

Error 1 occurred at disk power-on lifetime: 980 hours (40 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 90 a2 44 e3  Error: UNC 8 sectors at LBA = 0x0344a290 = 54829712

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 90 a2 44 e3 08  40d+01:35:57.371  READ DMA
  ca 00 08 18 1b 40 e5 08  40d+01:35:57.368  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error  00%        4                -
# 2  Short offline       Completed without error  00%        2                -
# 3  Extended offline    Completed without error  00%        1                -
# 4  Short offline       Completed without error  00%        0                -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
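For readers less familiar with SMART output, two things in the log above can be checked mechanically. smartctl reports an attribute as failed when its normalized VALUE is at or below THRESH, and the LBA printed for each error is the 28-bit address assembled from the SN, CL, CH and DH registers. The sketch below is illustrative only and is not the tooling used on this machine: the helper names `failing_attributes` and `lba28` are made up here, and the attribute rows are a hand-copied subset of the log.

```python
import re

# A subset of attribute rows, copied by hand from the smartctl log in this comment.
ATTRIBUTE_ROWS = """\
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 1925
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   001   001   000    Old_age   Offline      -       85013
"""

# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-f]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def failing_attributes(text):
    """Return the names of attributes whose normalized VALUE is <= THRESH."""
    failing = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and int(m["value"]) <= int(m["thresh"]):
            failing.append(m["name"])
    return failing


def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from ATA registers.

    Bits 0-7 come from SN, 8-15 from CL, 16-23 from CH, and
    bits 24-27 from the low nibble of DH.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


if __name__ == "__main__":
    print(failing_attributes(ATTRIBUTE_ROWS))   # ['Raw_Read_Error_Rate']
    print(lba28(0x28, 0x57, 0x81, 0xEE))        # 243357480 (Error 2)
    print(lba28(0x90, 0xA2, 0x44, 0xE3))        # 54829712 (Error 1)
    print(divmod(2896, 24))                     # (120, 16): 120 days + 16 hours
```

Note that Multi_Zone_Error_Rate also has a normalized value of 001, but its threshold is 000, so only Raw_Read_Error_Rate counts as failed under this rule. That matches the single FAILING_NOW entry in the log.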
Comment 6•6 years ago
The machine hasn't accepted any jobs since 14 March 2018. :ciduty, can you please re-image it again?
Flags: needinfo?(ciduty)
Comment 7•6 years ago
I spoke with JLund about this in the last meeting, and as far as I can tell, this machine is from the old buildbot infrastructure; apart from t-xp32-ix, we don't need to worry about those machines. Also, at the moment there are 75+ perfectly working machines in that pool that haven't taken jobs for more than 80 days.
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=talos-linux64-ix
I'll close the bug. Feel free to reopen it if something is unclear, and for faster turnaround ping us on the #ci channel.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Flags: needinfo?(ciduty)
Resolution: --- → WONTFIX
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard