Closed Bug 1409349 Opened 7 years ago Closed 7 years ago

More machines from t-w1064, t-w864, t-w732 and t-yosemite pools are unreachable

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aobreja, Assigned: van)

References

Details

Attachments

(1 file)

Please check the list below; all these machines are unreachable:

-t-yosemite-r7-0229
-t-w1064-ix-139
-t-w1064-ix-138
-t-yosemite-r7-0279
-t-yosemite-r7-0225 
-t-yosemite-r7-0225 
-t-yosemite-r7-0137
-t-yosemite-r7-0130
-t-yosemite-r7-0068
-t-yosemite-r7-0048
-t-yosemite-r7-0045
-t-w1064-ix-117 
-t-w1064-ix-312
-t-w1064-ix-313
-t-w732-ix-131
-t-w864-ix-037
-t-w732-ix-130
-t-w732-ix-107 
-t-w732-ix-120
-t-w732-ix-105
-t-w732-ix-096
-t-w732-ix-081
-t-w732-ix-047 
-t-w732-ix-056
-t-w732-ix-054
-t-w732-ix-041
-t-w732-ix-037
-t-w732-ix-011 
-t-w732-ix-031
-t-w732-ix-033
-t-w732-ix-030
-t-w732-ix-022
-t-w732-ix-016

>Attempting SSH reboot...Failed.
>Attempting IPMI reboot...Failed.
>Machine is unreachable, manual intervention required
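For reference, a rough Python sketch of the escalation that produces the output quoted above; the ssh_reboot/ipmi_reboot helpers, credentials, and the "-mgmt" naming are illustrative assumptions, not the actual buildduty tooling:

# Hypothetical sketch of the SSH -> IPMI -> manual-intervention escalation.
# ssh_reboot/ipmi_reboot are stand-ins, not real buildduty tool names.
import subprocess

def ssh_reboot(host):
    # Ask the host to reboot itself; a non-zero exit means we never got in.
    cmd = ["ssh", "-o", "ConnectTimeout=10", host, "shutdown -r now"]
    return subprocess.call(cmd) == 0

def ipmi_reboot(host):
    # Fall back to power-cycling via the management interface
    # (the "-mgmt" suffix and credentials are assumed for illustration).
    cmd = ["ipmitool", "-H", host + "-mgmt", "-U", "admin", "-P", "password",
           "chassis", "power", "cycle"]
    return subprocess.call(cmd) == 0

for host in ["t-yosemite-r7-0229", "t-w1064-ix-139"]:  # etc.
    if ssh_reboot(host):
        continue
    print("Attempting SSH reboot...Failed.")
    if ipmi_reboot(host):
        continue
    print("Attempting IPMI reboot...Failed.")
    print("Machine is unreachable, manual intervention required")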
Also t-yosemite-r7-402
Also the list below:

-t-w732-ix-104 
-t-w732-ix-027
-t-w732-ix-003 
-t-w732-ix-020
-t-w732-ix-024
-t-w732-ix-040
-t-w732-ix-073 
-t-w732-ix-087
-t-w732-ix-086
-t-w732-ix-091
-t-w732-ix-098
-t-w732-ix-065
-t-w732-ix-106
-t-w732-ix-122
-t-w732-ix-141
-t-w732-ix-111
Also:

-t-yosemite-r7-0110  
-t-yosemite-r7-0108
fubar, it was proposed by arr that we stop filing dcops bugs for machines that are no longer reachable (buildduty can't recover them). Do you have context here? One of the primary concerns is to keep low pool counts from getting lower, e.g. yosemite and xp, the reason being that they are our latest pending bottlenecks [1]. Perhaps we should be making exceptions? Any thoughts from you or anyone watching this component?

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1409439
Flags: needinfo?(klibby)
I don't have context, unfortunately. I'd like to understand why machines are going unreachable; it sounds like something is very broken.

OTOH, I expect that response to this bug, at the very least, will be delayed by :van being in MDC1 setting up the minis from move train #2.
Flags: needinfo?(klibby)
okay, thanks. It usually is a sign they are very broken, yes. Unfortunately, though, this is a normal and essential escalation step in recovering the hardware machines we have.

Thank you for letting us know about expected delays; buildduty can act appropriately. If anyone with context on dcops and the data center migration could give us an estimated timeline for actioning this bug, that would be great.
oh wow, there are about 30 machines just in this bug. i'll be on site today and perhaps tomorrow to catch up on these bugs.
Assignee: server-ops-dcops → vle
Thank you Van! If you need anything tested or confirmed on the systems or in the logs after boot/image on these, please ping me. I'd like to help if I can.
:van, can you put a spider kvm on one of those yosemite hosts for me so I can try to determine why its network is going unresponsive? (Assuming they aren't completely hung up.)
:dividehex, i've attached a spider to one of the minis - 10.26.52.254.

going through the list but so far all the minis i've come across are running into the local fw issue. i'm reimaging them and will have a more comprehensive update.

in comment 0, you have -t-yosemite-r7-0225 listed twice; did you mean another host is down?
See Also: → 1401601
:dividehex/markco, it looks like the win7 testers are also running into an APIPA issue. i've had to reimage quite a few today and will continue troubleshooting the rest of the win7 nodes tomorrow. do you want me to attach a kvm or are you able to use IPMI's console redirection?

Decommission:
-t-w1064-ix-138 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-139 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-312 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-313 - these 4 are same chassis, bad backplane, out of warranty

back online:
-t-yosemite-r7-0229 - local fw issue, attached spider kvm for troubleshooting
-t-yosemite-r7-0279 - local fw issue, reimaged
-t-yosemite-r7-0225 - local fw issue, reimaged
-t-yosemite-r7-0137 - local fw issue, reimaged
-t-yosemite-r7-0130 - local fw issue, reimaged
-t-yosemite-r7-0068 - local fw issue, reimaged
-t-yosemite-r7-0048 - local fw issue, reimaged
-t-yosemite-r7-0045 - local fw issue, reimaged
-t-yosemite-r7-0110 - local fw issue, reimaged
-t-yosemite-r7-0108 - local fw issue, reimaged
-t-w1064-ix-117 - back online
-t-w732-ix-131 - private IP addressing, reimaged
-t-w864-ix-037 - back online
-t-w732-ix-130 - private IP addressing, reimaged

pending:
-t-yosemite-r7-402 - MDC1 node, tracked in bug 1409281, opened QTS REQ0194461


still need to troubleshoot the following nodes tomorrow:
-t-w732-ix-107 
-t-w732-ix-120
-t-w732-ix-105
-t-w732-ix-096
-t-w732-ix-081
-t-w732-ix-047 
-t-w732-ix-056
-t-w732-ix-054
-t-w732-ix-041
-t-w732-ix-037
-t-w732-ix-011 
-t-w732-ix-031
-t-w732-ix-033
-t-w732-ix-030
-t-w732-ix-022
-t-w732-ix-016
-t-w732-ix-104 
-t-w732-ix-027
-t-w732-ix-003 
-t-w732-ix-020
-t-w732-ix-024
-t-w732-ix-040
-t-w732-ix-073 
-t-w732-ix-087
-t-w732-ix-086
-t-w732-ix-091
-t-w732-ix-098
-t-w732-ix-065
-t-w732-ix-106
-t-w732-ix-122
-t-w732-ix-141
-t-w732-ix-111
Flags: needinfo?(jwatkins)
Flags: needinfo?(mcornmesser)
Van: Can you change the BIOS graphics card priority on 3 of the w732 machines? Once that is done I can connect through IPMI and do some troubleshooting.
Flags: needinfo?(mcornmesser)
Van, could you check the minis #219 and #120 in scl3, and #444 in mdc1? I think the duplicate entry for t-yosemite-r7-225 was meant for t-yosemite-r7-219.

SCL3:
https://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?hostgroup=t-yosemite-r7-machines&style=detail&limit=100&sorttype=1&sortoption=6&sorttype=2&sortoption=6
Shows these three as down for 10h+:
(expected, on kvm) t-yosemite-r7-0229.test.releng.scl3.mozilla.com  20d 22h 6m 52s 	30/30 	PING CRITICAL - Packet loss = 100% 
t-yosemite-r7-0219.test.releng.scl3.mozilla.com  11d 1h 5m 10s 	30/30 	PING CRITICAL - Packet loss = 100% 
t-yosemite-r7-0120.test.releng.scl3.mozilla.com  0d 15h 0m 48s 	30/30 	PING CRITICAL - Packet loss = 100% 

MDC1:
https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=t-yosemite-r7-machines&style=detail&limit=100&sorttype=1&sortoption=6&sorttype=2&sortoption=6
t-yosemite-r7-444.test.releng.mdc1.mozilla.com  2d 5h 22m 19s 	3/3 	PING CRITICAL - Packet loss = 100% 
(loaner, I think expected) t-yosemite-r7-393.test.releng.mdc1.mozilla.com

These are the mac minis that were down yesterday:
12 minis were non-responsive. Nagios showed 11 with 2 to 19 days unresponsive (plus the one in mdc1)
t-yosemite-r7-0045 reported 10/17. hw reboot fixed 08/02, 07/21, 06/30, 02/13, 2016: 12/21(found off), 09/26, 07/13, 07/05
t-yosemite-r7-0048 reported 10/17
t-yosemite-r7-0068 10/17. hardware reboot fixed 08/08, 08/02, 06/06, 04/24, 02/27, 02/22
t-yosemite-r7-0108 10/17. hardware reboot fixed 08/07, 08/02
t-yosemite-r7-0110 10/17
t-yosemite-r7-0130 10/17. reimage fixed 09/26 (host-based firewall?), hw reboot fixed 08/07, 03/06, 02/08
t-yosemite-r7-0137 10/17. earlier reports cancelled 06/26, 12/22
t-yosemite-r7-0219 10/17 (linked as a bug but not listed in a comment)
t-yosemite-r7-0225 10/17. reimage fixed 02/06
t-yosemite-r7-0229 10/17. reimage fixed 09/28 (host-based firewall?), reimaged 09/07, cancelled 08/10, hw reboot fixed 08/02, 02/22, 01/17, 2016: 12/15, 11/28, hw reboot fixed 05/18, loaner 05/12-05/18
t-yosemite-r7-0279 10/17. hardware reboot fixed 04/24
t-yosemite-r7-402 reported 10/16 (found working)
t-yosemite-r7-444 reported 10/18 (linked as a bug but not listed in a comment)
Flags: needinfo?(vle)
:dhouse, can you open a separate bug for MDC1 minis? it makes it much easier to track since they are different data centers. i'm on site and will continue troubleshooting.
Flags: needinfo?(vle)
(In reply to Van Le [:van] from comment #13)
> pending:
> -t-yosemite-r7-402 - MDC1 node, tracked in bug 1409281, opened QTS REQ0194461
> 

Thx Van! The only problem MDC1 mini is covered already in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1409743 with QTS REQ0194461

So that doesn't need to be handled in this bug.
took care of all the yosemite and w7 hosts left in the bug. can you open new bugs for any future bad hosts? the "one bug to track them all" approach makes it hard to chase down/look up their past issues.

:markco, i changed the resolution on 3 hosts as requested.

t-yosemite-r7-0219 - local fw issue, reimaged
t-yosemite-r7-0120 - local fw issue, reimaged

t-w732-ix-107 - APIPA, rebooted and came back online
t-w732-ix-120 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-105 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-096 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-081 - APIPA, rebooted with no luck, reimaged
t-w732-ix-047 - APIPA, rebooted with no luck, reimaged
t-w732-ix-056 - APIPA, rebooted with no luck, reimaged
t-w732-ix-054 - APIPA, rebooted with no luck, reimaged
t-w732-ix-041 - bad drive, reimaged
t-w732-ix-037 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-011 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-031 - APIPA, rebooted with no luck, reimaged
t-w732-ix-033 - APIPA, rebooted with no luck, reimaged
t-w732-ix-030 - APIPA, rebooted with no luck, reimaged
t-w732-ix-022 - bad drive, reimaged
t-w732-ix-016 - APIPA, rebooted with no luck, reimaged
t-w732-ix-104 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-027 - APIPA, rebooted with no luck, reimaged
t-w732-ix-003 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-020 - APIPA, rebooted with no luck, reimaged
t-w732-ix-024 - APIPA, rebooted with no luck, reimaged
t-w732-ix-040 - APIPA, rebooted with no luck, reimaged
t-w732-ix-073 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-087 - APIPA, rebooted with no luck, reimaged
t-w732-ix-086 - APIPA, rebooted with no luck, reimaged
t-w732-ix-091 - bad drive, reimaged
t-w732-ix-098 - APIPA, rebooted with no luck, reimaged
t-w732-ix-065 - APIPA, rebooted with no luck, reimaged
t-w732-ix-106 - APIPA, rebooted with no luck, reimaged
t-w732-ix-122 - APIPA, rebooted with no luck, reimaged
t-w732-ix-141 - APIPA, rebooted with no luck, reimaged
t-w732-ix-111 - bad drive, reimaged
Flags: needinfo?(mcornmesser)
to note, the bad drives in the previous comment were swapped with drives from machines we decommissioned (same manufacturer/model). all of these machines are out of warranty, and warranty coverage on the drives themselves is hit or miss with the manufacturer.
I took a look at t-w732-ix-120. I suspect that the machines weren't able to communicate on boot, and that is what caused the issue. They were probably looping back to themselves for DNS as well, which is why a reboot did not fix it.
Flags: needinfo?(mcornmesser)
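For anyone else hitting this, a rough stdlib-only Python sketch (run on the affected test machine itself) that simply reports whether any local address has fallen back to APIPA, i.e. the DHCP exchange never completed; using a hostname lookup to enumerate local addresses is an illustrative assumption, not part of our imaging or GPO tooling:

# Flag any local address in the APIPA range (169.254.0.0/16).
import ipaddress
import socket

APIPA = ipaddress.ip_network("169.254.0.0/16")

addrs = {info[4][0] for info in
         socket.getaddrinfo(socket.gethostname(), None, socket.AF_INET)}
for addr in sorted(addrs):
    tag = "APIPA - DHCP never completed" if ipaddress.ip_address(addr) in APIPA else "ok"
    print(addr + ": " + tag)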
:markco, why is this happening in the first place? are we able to fix the issue or is reimaging going to be the fix for the APIPA issues?
Flags: needinfo?(mcornmesser)
(In reply to Van Le [:van] from comment #21)
> :markco, why is this happening in the first place? are we able to fix the
> issue or is reimaging going to be the fix for the APIPA issues?

I will dive through the logs on the local machine tomorrow to see if I can find any kind of root cause.

Once a machine reaches this state, reimaging will be the most straightforward way to get it back into the pool.
Flags: needinfo?(mcornmesser)
Depends on: 1411094
I'm going to go out on a limb here and make an assumption, since I don't really have the time capacity to pin down the cause. My best guess regarding the APIPA issue is that somehow the local firewall is interfering with the DHCP exchange. I'm not sure if it stems from the connection state table failing to keep track of the outgoing packets (such as a timeout or excessive delay between responses), or whether it doesn't handle a new (non-renewal) lease properly, since there is no IP assigned yet and the returning packet comes from a unicast IP while the outgoing packet recorded in the state table was sent to broadcast.

I'm just not sure.

Anyway, if we make this assumption, we should be able to work around it with a more permissive firewall rule that allows DHCP regardless of state tracking.

This also applies to bug 1401601.

markco, Q: you will need to make a rule under the GPO to do the same. Basically, on ingress, allow UDP from any IP with source port 67 to any IP on destination port 68.
Flags: needinfo?(jwatkins)
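For reference, a sketch of how that rule could be expressed on the Windows side; the rule name and applying it via netsh from a Python one-off are my assumptions about how it might be pushed out, not the actual GPO change:

# Hypothetical: add a permissive inbound rule allowing DHCP server replies
# (UDP source port 67 -> local client port 68) regardless of connection state.
# Must run elevated; the rule name below is made up.
import subprocess

subprocess.check_call([
    "netsh", "advfirewall", "firewall", "add", "rule",
    "name=Allow-DHCP-permissive",
    "dir=in", "action=allow", "protocol=UDP",
    "localip=any", "remoteip=any",
    "localport=68",   # DHCP client port on the test machine
    "remoteport=67",  # DHCP server source port
])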
Attachment #8921280 - Flags: review?(dhouse)
Attachment #8921280 - Flags: review?(dhouse) → review+
See Also: → 1413980
closing out the bug; please reopen if you need any additional hands-on help with any of the 50ish hosts listed in this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
See Also: → 1419690
See Also: → t-w1064-ix-312
See Also: → 1419693
See Also: → t-w1064-ix-139