Closed Bug 1464064 Opened 6 years ago Closed 6 years ago

Moonshot Linux nodes stop functioning

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arny, Assigned: dhouse)

References

Details

We have found that the below Linux servers are not visible in TC and are not taking jobs. We will re-image them and update this bug.

t-linux64-ms-193
t-linux64-ms-279
t-linux64-ms-280
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527
t-linux64-ms-580
Can we get a link to the papertrail logs for some of these? This sounds suspiciously like what we're seeing on the w10 nodes, where they suddenly stop working.
Blocks: 1464073
Blocks: 1464080
Good results for t-linux64-ms-495 (https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495). I think we need to do the same reimage, then watch, and possibly cold-reboot if the first boot after the reimage gets stuck.

I'll check 279 and 280 next (they get stuck at the PXE boot menu).
(In reply to Kendall Libby [:fubar] from comment #1)
> Can we get a link to the papertrail logs for some of these? This sounds
> suspiciously like what we're seeing on the w10 nodes, where they suddenly
> stop working.

Here are all of them in papertrail: https://papertrailapp.com/groups/6937292/events?q=t-linux64-ms-193%20OR%20%20t-linux64-ms-279%20OR%20%20t-linux64-ms-280%20OR%20%20t-linux64-ms-484%20OR%20%20t-linux64-ms-495%20OR%20%20t-linux64-ms-527%20OR%20%20t-linux64-ms-580&focus=936334690644828165

Not much for the ones that are stuck at PXE boot, however. But ones like #495 show logs once they are repaired: https://papertrailapp.com/systems/1899813261/events?focus=936334699754856501
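For ad-hoc digging, the same kind of query can be run from a terminal with the papertrail CLI gem. A hedged sketch (the group name "moonshots" is a placeholder; it assumes an API token is configured in ~/.papertrail.yml):

```
# Hedged sketch: requires the papertrail-cli gem and a Papertrail API token.
gem install papertrail
papertrail -g moonshots 't-linux64-ms-279 OR t-linux64-ms-280'
```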
There is a tracking bug for hardware issues on the moonshots:

https://bugzilla.mozilla.org/show_bug.cgi?id=1428159

If we find a hardware issue with any of these we can address it through that bug.
See Also: → 1428159
stuck on pxeboot:
t-linux64-ms-193  (after failing pxeboot, goes into xen currently)
t-linux64-ms-279  (after failing pxeboot, goes into ubuntu but without tc-worker running)
t-linux64-ms-280  (after failing pxeboot, goes into ubuntu but without tc-worker running)

fixed by reimage:
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527

Okay: 571-580 are not in production; they are a development set.
t-linux64-ms-580 (we expect this to be off or not running tc-worker).
t-linux64-ms-488 also came up as not running the tc worker. This one needs to be reimaged as it has a problem with its puppet certificate and cannot update its puppet config.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-488
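The reimage side-steps the bad certificate, but for reference, a hedged sketch of the usual Puppet 3.x certificate reset (the node FQDN and root/master access are assumptions):

```
# On the puppet master (hypothetical FQDN for the node):
puppet cert clean t-linux64-ms-488.test.releng.mdc2.mozilla.com

# On the node: drop the stale SSL state and request a fresh cert on the next run.
rm -rf /var/lib/puppet/ssl
puppet agent --test
```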
I've done the reimage on t-linux64-ms-488. It appears in TC and is taking tasks. We still cannot SSH into it.
The below Linux servers were not present in the TC list; however, I was able to see each one's tasks. I have rebooted all of them and they run tasks successfully.
 
t-linux64-ms-007
t-linux64-ms-057
t-linux64-ms-141
t-linux64-ms-183
t-linux64-ms-189
t-linux64-ms-493
(In reply to Attila Craciun [:arny] from comment #8)
> The below Linux servers were not present in the TC list; however, I was
> able to see each one's tasks. I have rebooted all of them and they run tasks
> successfully.
>  
> t-linux64-ms-007
> t-linux64-ms-057
> t-linux64-ms-141
> t-linux64-ms-183
> t-linux64-ms-189
> t-linux64-ms-493

PXE is also not working for these machines.
The below servers were not visible in TC. After checking them, all were stuck at the grub menu. I rebooted them; they now show up in TC, running and completing jobs successfully.

 t-linux64-ms-272
 t-linux64-ms-273
 t-linux64-ms-274
 t-linux64-ms-275 (need firmware upgrade bug 1464044)
 t-linux64-ms-276
 t-linux64-ms-277
(In reply to Attila Craciun [:arny] from comment #9)
> (In reply to Attila Craciun [:arny] from comment #8)
> > The below Linux servers were not present in the TC list; however, I was
> > able to see each one's tasks. I have rebooted all of them and they run tasks
> > successfully.
> >  
> > t-linux64-ms-007
> > t-linux64-ms-057
> > t-linux64-ms-141
> > t-linux64-ms-183
> > t-linux64-ms-189
> > t-linux64-ms-493
> 
> PXE is also not working for these machines.

I don't have PXE working yet for the mdc1 moonshots.
(In reply to Dave House [:dhouse] from comment #11)
> (In reply to Attila Craciun [:arny] from comment #9)
> > (In reply to Attila Craciun [:arny] from comment #8)
> > > The below Linux servers were not present in the TC list; however, I was
> > > able to see each one's tasks. I have rebooted all of them and they run tasks
> > > successfully.
> > >  
> > > t-linux64-ms-007
> > > t-linux64-ms-057
> > > t-linux64-ms-141
> > > t-linux64-ms-183
> > > t-linux64-ms-189
> > > t-linux64-ms-493
> > 
> > PXE is also not working for these machines.
> 
> I don't have PXE working yet for the mdc1 moonshots.

I changed all of the Linux nodes on the mdc1 and mdc2 moonshots to boot from their local hard disks instead of doing a PXE boot first, so machines that reboot no longer waste time trying to PXE-boot.
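For reference, the equivalent persistent boot-order change with plain ipmitool looks roughly like the sketch below. It assumes the cartridges answer standard IPMI (the actual change was made through the Moonshot chassis manager), and the host and credentials are placeholders:

```
# Set the default boot device to the local disk (persistent across reboots).
ipmitool -I lanplus -H <cartridge-bmc> -U <user> -P <pass> chassis bootdev disk options=persistent
# Verify the boot flags that were set.
ipmitool -I lanplus -H <cartridge-bmc> -U <user> -P <pass> chassis bootparam get 5
```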
As an update on linux moonshots:

t-linux64-ms-193 and t-linux64-ms-275 are out of service. This is also noted in the MS document.

t-linux64-ms-279 and t-linux64-ms-280 were missing from TC. I re-imaged them and the process completed all the way through. Machine 279 was assigned to dividehex when it was last broken, as per bug 1435020.

t-linux64-ms-394, however, won't even get through PXE boot. It looks like it's trying, but it keeps dropping back to the beginning of PXE boot.
t-linux64-ms-257 - rebooted, was not present in TC. Now it is back in business.
I see that t-linux64-ms-394 is working: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394

Danut, could you have someone on your team check over the others to see if they are in the same state now or fixed?
(In reply to Dave House [:dhouse] from comment #15)
> I see that t-linux64-ms-394 is working:
> https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
> 
> Danut, could you have someone on your team check over the others to see if
> they are in the same state now or fixed?

:dhouse, 279 and 280 are missing from TC once again. Shall we keep them in that state for further investigation, or shall we re-image them once again?
Flags: needinfo?(dhouse)
(In reply to Roland Mutter Michael (:rmutter) from comment #16)
> (In reply to Dave House [:dhouse] from comment #15)
> > I see that t-linux64-ms-394 is working:
> > https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> > gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
> > 
> > Danut, could you have someone on your team check over the others to see if
> > they are in the same state now or fixed?
> 
> :dhouse, 279 and 280 are missing from TC once again. Shall we keep them in
> that state for further investigation, or shall we re-image them once again?

:rmutter, please re-image them once again. If this repeats, we can review the logs to see what has happened to cause them to stop taking jobs.
Flags: needinfo?(dhouse)
Adding some new nodes seeing this problem from #ci:
> 21:32:51 <&riman|ciduty> Hello dhouse: The following  t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?

t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Assignee: relops → dhouse
(In reply to Dave House [:dhouse] from comment #18)
> Adding some new nodes seeing this problem from #ci:
> > 21:32:51 <&riman|ciduty> Hello dhouse: The following  t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?
> 
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436

I see the same hang at "Booting PXE over IPv4" in mdc2 chassis 8, 9, and 11 (10, 12, 13, and 14 are not having this problem). I spot-checked across all of these mdc2 chassis.

Also, the 4 above are never pingable (and while spot-checking I found t-linux64-ms-346 had this problem 2 of 3 times I rebooted it, so I think this may be intermittent across others too). When I boot them from their local ubuntu install, they go into the "raise the network interfaces" waiting period and then give up without network (and are not pingable).

So I tried changing back to the VM admin hosts for PXE boot, and the failing machines still did not get any farther in PXE boot (so I reverted back to the correct new admin hosts for pxe/tftp).
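A quick reachability sweep like the sketch below makes this spot-checking less tedious (the mdc2 DNS suffix is inferred from the mdc1 hostnames elsewhere in this bug):

```
# Ping each suspect node once; -W2 gives a 2-second timeout per host.
for n in 351 356 357 436; do
  host="t-linux64-ms-${n}.test.releng.mdc2.mozilla.com"
  if ping -c1 -W2 "$host" >/dev/null 2>&1; then
    echo "$host: up"
  else
    echo "$host: DOWN"
  fi
done
```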
Through troubleshooting in #systems, Van found that the problem chassis needed their second switches restarted; see https://mana.mozilla.org/wiki/display/NETOPS/HP+Switch+Configuration#HPSwitchConfiguration-12.Troubleshooting

```
If you see the switches/chassis complaining of a duplicate IP, that means the switch may have lost its IRF config and will need to be rebooted.
ex: Duplicate address 10.51.16.34 on interface M-GigabitEthernet0/0/0, sourced from 9cb6-54fe-7cca
```

He fixed the moon chassis 8, 9, and 11, and I confirmed by PXE-booting two machines from each chassis.

I need to check through all of the Linux cartridges on these three chassis to make sure none are left thinking that they have no network (or needing to be reimaged).
We also need to set up some sort of monitoring so we are alerted if the switch problem happens again (since we do not know what caused it).
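As a stopgap until proper monitoring exists, something as small as the sketch below could run from cron against the central syslog (the log path, and the switch messages actually landing in syslog, are assumptions):

```
#!/bin/bash
# Alert if any chassis switch has logged the duplicate-address symptom.
if grep -q 'Duplicate address .* on interface M-GigabitEthernet' /var/log/syslog; then
  echo "CRITICAL: duplicate-address messages seen; a chassis switch may have lost its IRF config"
  exit 2
fi
echo "OK: no duplicate-address messages"
exit 0
```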
Went through a full check of which Linux moonshots appear in TC. It seems the following machines are not in TC:
t-linux64-ms-141
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436

Will proceed with a reboot of every machine. If that doesn't work, I'll start a reimage for each one. I'll be back with updates.
(In reply to Roland Mutter Michael (:rmutter) from comment #22)
> Went through a full check of which Linux moonshots appear in TC. It seems
> the following machines are not in TC:
> t-linux64-ms-141
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436
> 
> Will proceed with a reboot of every machine. If that doesn't work, I'll
> start a reimage for each one. I'll be back with updates.

Thank you! I appreciate your work on these.
After rebooting the machines, the following are the candidates for reimage:
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
:dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is disabled. Please ping us whenever they are ready for the reimage. For now, Adrian will reimage t-linux64-ms-351.
(In reply to Roland Mutter Michael (:rmutter) from comment #25)
> :dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is
> disabled. Please ping us whenever they are ready for the reimage. For now,
> Adrian will reimage t-linux64-ms-351.

Thank you. We were able to get the reimaging fixed (the network switches in moon 8/9/11 had lost some config and had to be reconfigured).

I'll reimage 356, 357, and 436 to make sure that works on them.
I've reimaged t-linux64-ms-351; it will need to be checked later to confirm it takes jobs.
(In reply to Adrian Pop from comment #29)
> I've reimaged t-linux64-ms-351; it will need to be checked later to confirm it takes jobs.

Looks good in TC: https://tools.taskcluster.net/groups/W1Mde5F9Rpm8QrPEqMl2Hg/tasks/PEvc-MzET6KLNaGyaclDWg/runs/0
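Rather than clicking through the TC UI, the queue API (same endpoint format as the worker links later in this bug) can confirm whether a worker is registered. A minimal sketch:

```
# Returns worker JSON if the queue knows the worker, or an error body if it does not.
curl -s "https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-351"
```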
All of the machines reported in this bug are accounted for and working correctly now (279 and 280 are loaners, all others were in a good state):

https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-007
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-057
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-141
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-183
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-189
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-193
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-272
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-273
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-274
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-275
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-276
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-277
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-279
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-280
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-351
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-484
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-493
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-580

The two missing from taskcluster are 279 and 280:
t-linux64-ms-279.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c9n1
t-linux64-ms-280.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c10n1

279 was repaired by re-seating the cartridge in bug 1435020 (I created bug 1472727 this morning to track it as a loaner).
280 is a loaner for Dragos (see bug 1464070).
No longer blocks: t-linux64-ms-280
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Depends on: 1473589
I'm going to keep tracking and updating this bug with Linux machines that fail.
t-linux64-ms-527 <-- rebooted, reimaged, back in TC, waiting for jobs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
527 looks good: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
580 is a dev machine.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
I've re-imaged a large number of Linux moonshot machines, all of which apparently failed within 12 hours:

linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092, 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149, 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271, 272, 273, 275, 276, 277, 279, 346, 353, 538}

Dave, could this be related to the firmware upgrade you brought to the moonshots? I also had considerably more W10 workers to deal with today, compared to the last 2 weeks.
Status: RESOLVED → REOPENED
Flags: needinfo?(dhouse)
Resolution: FIXED → ---
(In reply to Zsolt Fay [:zsoltfay] from comment #34)
> I've re-imaged a large number of Linux moonshot machines, all of which
> apparently failed within 12 hours:
> 
> linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092,
> 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149,
> 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271,
> 272, 273, 275, 276, 277, 279, 346, 353, 538}
> 
> Dave, could this be related to the firmware upgrade you brought to the
> moonshots? I also had considerably more W10 workers to deal with today,
> compared to the last 2 weeks.

In the last 48h, 79 Linux moonshots have been re-imaged but have not recovered from that state.
At first I was afraid that the deployment process was bad, since some of the machines have been re-imaged at least once, most of them 2-3 times (tracking for that can be found in the following doc: https://docs.google.com/spreadsheets/d/1A6fU2t3rVY2oAd-U26k4lPZGjfULnh6w5XySqsMofUM), and they still appear in a bad state.
At this point, though, the deploy process looks to be normal, with no obvious reason why they fail to take tasks.
I'll continue investigating the issue and report back if I find something obvious.
Later update: after looking into services, the first 3 machines I checked have this process (PID 719) running:

[root@t-linux64-ms-005 ~]# ps -ef | grep puppet
root       719   714  0 03:09 ?        00:00:00 /bin/bash /root/puppetize.sh
root      5428  5416  0 17:22 pts/0    00:00:00 grep --color=auto puppet
[root@t-linux64-ms-005 ~]#

Doing 
> cat last_run_report.yaml |grep fail
we got
> status: failed

Also:
[root@t-linux64-ms-005 state]# cat /var/lib/puppet/state/last_run_summary.yaml
---
  version:
    config: remotes/origin/HEAD
    puppet: "3.8.5"
  resources:
    changed: 4
    failed: 1
    failed_to_restart: 0
    out_of_sync: 5
    restarted: 0
    scheduled: 0
    skipped: 1
    total: 456
  time:
    anchor: 0.004462988
    augeas: 0.291225276
    config_retrieval: 14.035952311998699
    exec: 0.374154974
    file: 1.0897829879999996
    filebucket: 8.704e-05
    firewall: 0.009833629000000002
    firewallchain: 0.001217386
    group: 0.000156733
    host: 0.000386595
    package: 4.367681283000001
    resources: 0.000127482
    schedule: 0.00052634
    service: 0.8251894579999998
    sysctl: 0.000188186
    total: 21.0020116199987
    user: 0.0010389499999999999
    last_run: 1537144918
  changes:
    total: 4
  events:
    failure: 1
    success: 4
    total: 5
[root@t-linux64-ms-005 state]#
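Given how many nodes are in this state, a sweep like the sketch below saves logging into each one by hand (it assumes root ssh from an admin host and the default Puppet 3 paths shown above; the node list is illustrative):

```
# Report the puppet "failure" event count per node, or flag the node as unreachable.
for n in 001 003 005 008 011; do
  host="t-linux64-ms-${n}.test.releng.mdc1.mozilla.com"
  echo -n "$host: "
  ssh -o ConnectTimeout=5 "root@$host" \
    "grep 'failure:' /var/lib/puppet/state/last_run_summary.yaml" 2>/dev/null \
    || echo "no summary / unreachable"
done
```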


and from papertrail we got this

>  message: "change from stopped to running failed: Could not start Service[mig-agent]: Execution of '/bin/systemctl start mig-agent' returned 5: Failed to start mig-agent.service: Unit mig-agent.service not found."
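A quick way to confirm on a node whether that unit is simply missing (a minimal sketch):

```
# Is the unit file installed at all?
systemctl list-unit-files | grep -i mig-agent || echo "mig-agent.service not installed"
# If it exists, what state is it in?
systemctl status mig-agent.service --no-pager
```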

and puppetize.log contains a lot of:
> Running puppet agent against server 'puppet'
> Puppet run failed; re-trying after 10m

I also started the puppet service on the first machine (t-linux64-ms-001), looked in papertrail, and found this:
> Sep 16 18:04:59 t-linux64-ms-001.test.releng.mdc1.mozilla.com puppet-agent: (/File[/var/lib/puppet/lib]) Could not evaluate: Could not retrieve file metadata for puppet://releng-puppet2.srv.releng.scl3.mozilla.com/plugins: Failed to open TCP connection to releng-puppet2.srv.releng.scl3.mozilla.com:8140 (Connection timed out - connect(2) for "releng-puppet2.srv.releng.scl3.mozilla.com" port 8140) 

Is that server being shut down?
bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com is unreachable
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #36)

> Is that server being shut down?
> bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
> releng-puppet2.srv.releng.scl3.mozilla.com is unreachable

I'm going to answer that: yes, it is down, and the mdc1 puppet server should probably be used instead, because the workers are in MDC1.
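To confirm that theory on one failing node, the agent can be pointed at a reachable master for a single no-op run. A hedged sketch (the mdc1 master hostname below is a guess; substitute the real one):

```
# One-shot, no-change run against a hopefully-reachable master.
puppet agent --test --noop --server releng-puppet2.srv.releng.mdc1.mozilla.com
```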
I've tried to reboot the following; all of them were powered off. After powering them on, the machines restarted a few times without successfully booting the OS. After a few restarts, all of them went back to the powered-off state:

t-linux64-ms-272
t-linux64-ms-273
t-linux64-ms-276
t-linux64-ms-277
(In reply to Dave House [:dhouse] from comment #39)
> We have the moonshots configured to power off after 3 failed boots. 

If we're going to stick with that, then we should be getting alerts when that happens, either from nagios or iLO.
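A nagios-style power check could look like the sketch below; it assumes the cartridges answer standard IPMI (host and credentials are placeholders), otherwise the same query could go through the chassis iLO instead.

```
#!/bin/bash
# Map the node's power state to nagios exit codes.
STATUS=$(ipmitool -I lanplus -H <cartridge-bmc> -U <user> -P <pass> chassis power status)
case "$STATUS" in
  *"is on")  echo "OK: $STATUS";       exit 0 ;;
  *"is off") echo "CRITICAL: $STATUS"; exit 2 ;;
  *)         echo "UNKNOWN: $STATUS";  exit 3 ;;
esac
```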