Closed Bug 1039977 (mac-v2-signing1) Opened 10 years ago Closed 10 years ago

mac-v2-signing1 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

Attachments

(1 file)

Spontaneously combusted:

Wed 23:25:33 PDT [4961] mac-signing1.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Rebooted via PDU. No indication of what went wrong in signing logs, /var/log/system.log or kernel.log. Restarted signing servers.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Depends on: 1041305
Locked up again, nagios started with
Sun 01:41:43 PDT [4407] mac-signing1.srv.releng.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

arr rebooted via pdu, bug 1041305.

Ben, can we send this off for diagnostics as-is? At the moment, the slaves figure out immediately that it's not responsive, presumably because OS X refuses the connection when none of the signing scripts are running.
We probably need to remove mac-signing1 from puppet/modules/buildmaster/templates/passwords.py.erb if it's going to go offline, because I don't see a timeout set in tools/lib/python/signing/client.py (and the default in the socket library is no timeout).
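To illustrate the timeout concern, here is a minimal sketch of a client-side connect with an explicit timeout. It is not the code in tools/lib/python/signing/client.py; the helper names open_signing_connection and try_servers, and the 30 second default, are assumptions made up for this example. It only relies on the standard socket module, where sockets block indefinitely unless a timeout is set.

```python
import socket


# Hypothetical helper, not taken from tools/lib/python/signing/client.py.
def open_signing_connection(host, port, timeout=30.0):
    """Connect to a signing server, giving up after `timeout` seconds."""
    # create_connection() applies the timeout to the connect attempt and
    # leaves it set on the returned socket, so later recv()/send() calls
    # raise socket.timeout instead of blocking indefinitely.
    return socket.create_connection((host, port), timeout=timeout)


# Hypothetical helper: walk a list of (host, port) pairs.
def try_servers(servers, timeout=30.0):
    """Return a socket for the first reachable server, or None."""
    for host, port in servers:
        try:
            return open_signing_connection(host, port, timeout)
        except OSError:
            # Covers refused connections as well as timeouts
            # (socket.timeout is a subclass of OSError); try the next one.
            continue
    return None
```

With something like this in place, a server that accepts the connection but then hangs would cost a client at most the timeout per call rather than stalling it forever. Removing the host from the server list is still the simpler fix while the machine is offline.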
Status: RESOLVED → REOPENED
Flags: needinfo?(bhearsum)
Resolution: FIXED → ---
(In reply to Nick Thomas [:nthomas] from comment #2)
> Locked up again, nagios started with
> Sun 01:41:43 PDT [4407] mac-signing1.srv.releng.scl3.mozilla.com is DOWN
> :PING CRITICAL - Packet loss = 100%
> 
> arr rebooted via pdu, bug 1041305.
> 
> Ben, can we send this off for diagnostics as-is ?

I'd be more comfortable if we could wipe the disk first. Everything secret is passphrase protected, but it's still better to make sure people can't get their hands on the files.

> We probably need to remove mac-signing1 from
> puppet/modules/buildmaster/templates/passwords.py.erb if it's going to go
> offline, because I don't see a timeout set in
> tools/lib/python/signing/client.py (and the default in the socket library is
> no timeout).

Yeah, agreed.
Flags: needinfo?(bhearsum)
(In reply to Ben Hearsum [:bhearsum] from comment #3)
> I'd be more comfortable if we could wipe the disk first. Everything secret
> is passphrase protected, but it's still better to make sure people can't get
> their hands on the files.

We could srm selected keychains and certs instead of doing a full wipe, but I'm not sure where they are all located.
Attachment #8460128 - Flags: review?(bhearsum)
Comment on attachment 8460128 [details] [diff] [review]
[puppet] Remove mac-signing1 while testing

Review of attachment 8460128 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8460128 - Flags: review?(bhearsum) → review+
Comment on attachment 8460128 [details] [diff] [review]
[puppet] Remove mac-signing1 while testing

Landed in puppet, will need a reconfig too:
 https://hg.mozilla.org/build/puppet/rev/fd51ccdb93e8
 https://hg.mozilla.org/build/puppet/rev/38d60c42ecc1
Attachment #8460128 - Flags: checked-in+
Cleaned up secrets.
Depends on: 1047093
Back online after a RAM replacement and a new name (bug 1049546).
Alias: mac-signing1 → mac-v2-signing1
Summary: mac-signing1 problem tracking → mac-v2-signing1 problem tracking
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
The dep-key signing server went down at 2014-09-03 06:28:07, for an (as yet) unknown reason:

pmoore@Elisandra:~ $ ssh root@mac-v2-signing1.srv.releng.scl3.mozilla.com tail -10 /builds/signing/dep-key-signing-server/signing.log
2014-09-03 06:26:12,399 - DEBUG - Cleaning up...
2014-09-03 06:26:12,638 - INFO - Deleting /builds/signing/dep-key-signing-server/unsigned-files/0d92f62009f95636e4dccef61a728a2ff979f424 (too old)
2014-09-03 06:26:12,723 - INFO - Deleting /builds/signing/dep-key-signing-server/unsigned-files/0d92f62009f95636e4dccef61a728a2ff979f424.fn (too old)
2014-09-03 06:26:12,728 - INFO - Deleting /builds/signing/dep-key-signing-server/unsigned-files/cf00cf13318ed9dd0bd5a83121b118986e25c46f (too old)
2014-09-03 06:26:12,734 - INFO - Deleting /builds/signing/dep-key-signing-server/unsigned-files/cf00cf13318ed9dd0bd5a83121b118986e25c46f.fn (too old)
2014-09-03 06:26:12,855 - INFO - Deleting /builds/signing/dep-key-signing-server/signed-files/gpg/0c29eacf41ad6bc3172aac1412bc5fe09cd6aa60.out with no unsigned file
2014-09-03 06:26:12,864 - INFO - Deleting /builds/signing/dep-key-signing-server/signed-files/gpg/0d92f62009f95636e4dccef61a728a2ff979f424 with no unsigned file
2014-09-03 06:26:12,868 - INFO - Deleting /builds/signing/dep-key-signing-server/signed-files/gpg/cf00cf13318ed9dd0bd5a83121b118986e25c46f with no unsigned file
2014-09-03 06:28:07,382 - INFO - pid 99383 exiting normally
2014-09-03 06:28:07,684 - INFO - exiting
pmoore@Elisandra:~ $
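For reference, the shutdown time can be pulled out of a log like the one above with a short script. This is only a sketch: it assumes every line follows the "timestamp - LEVEL - message" layout seen in this excerpt, and the function name last_exit_time is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2014-09-03 06:28:07,684 - INFO - exiting
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) - (\w+) - (.*)$"
)


def last_exit_time(lines):
    """Return the datetime of the last log message mentioning 'exiting'."""
    found = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # Skip anything that is not a log line (e.g. a prompt).
        stamp, millis, _level, message = match.groups()
        if "exiting" in message:
            found = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(
                microsecond=int(millis) * 1000
            )
    return found
```

Passing it the lines of signing.log (for example via open(path)) returns the time of the most recent exit message, or None if the log contains none.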
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I can start it up, but first I'd like to check if I can find any reason this might have intentionally been brought down...
See Also: → 1062302
top -o cpu

doesn't show any obvious CPU eaters, so I'm going to restart it now.
Started back up. However, when restarting, I was expecting to be prompted for passphrases for:

  * gpg
  * signcode
  * mar
  * jar
  * b2gmar
  * dmg

However, I was only prompted for passphrases for:

  * gpg
  * dmg
  * mar

In other words, the following types of signing are probably not available on the dep-key signing server instance on this machine (since I was not prompted for their passphrases):

  * signcode
  * jar
  * b2gmar

This contrasts with bug 1062302, which says gpg and mar should be disabled on mac v2 signing servers (whereas they make up two of the three services available for dep-key signing).

Ben, what are your thoughts?

Pete
Flags: needinfo?(bhearsum)
(In reply to Pete Moore [:pete][:pmoore] from comment #13)
> Started back up. However, when restarting, I was expecting to be prompted
> for passphrases for:
> 
>   * gpg
>   * signcode
>   * mar
>   * jar
>   * b2gmar
>   * dmg
>
> However, I was only prompted for passphrases for:
> 
>   * gpg
>   * dmg
>   * mar

Mac signing machines have only ever done gpg, dmg, and mar signing. It's not surprising at all that you were prompted for only these three. You may have been confused by the list of passphrases in the private repo, but that's simply a list of all possible ones - it doesn't imply that they're all enabled on all signing servers.
 
> This is in contrast to bug 1062302 which says gpg and mar should be disabled
> on mac v2 signing servers (whereas they seem to comprise 2/3rds of the
> available services for dep-key signing).

This part is confusing. We changed Buildbot to stop looking at mac-v2 signing servers for gpg and mar signing, because they were dying under the load. We didn't change the formats that were enabled for the servers themselves because of time constraints. It's likely that we *will* disable those formats, but more investigation is needed first.



I suspect what happened here is that after a very long time, my request for a stop of that instance (made about 24h ago) finally went through. The machine was so heavily loaded at the time that I wasn't sure if python even launched to try to shut it down, and I didn't think to check later. Sorry =(
Flags: needinfo?(bhearsum)
This is lingering in the buildduty queue. It's unclear to me what is left to do, if anything. Can one of you summarize or move/close, please?
Flags: needinfo?(pmoore)
Flags: needinfo?(bhearsum)
I'll look at this soon.
Assignee: nobody → bhearsum
Flags: needinfo?(pmoore)
Flags: needinfo?(bhearsum)
There's nothing left to do here - the critical issue is fixed. We should disable gpg and mar signing on all of the mac signing servers, but that's not a critical issue. I filed bug 1065871 for that.
Assignee: bhearsum → nobody
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard