Closed Bug 1145387 (t-yosemite-r5-0073) Opened 9 years ago Closed 9 years ago

t-yosemite-r5-0073 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

Apparently I broke it by rebooting it: I rebooted the whole 10.10 pool, and most of the rest survived, but this one has been through two PDU reboots and one ssh reboot since without coming back to take jobs.
Re-imaged and returned to production.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Made no difference - if the reimage actually did happen, what survives a reimage?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It's failing to connect to the master in a way I haven't seen before:

2015-03-24 08:52:36-0700 [-] Log opened.
2015-03-24 08:52:36-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz6/bin/python2.7 2.7.3) starting up.
2015-03-24 08:52:36-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2015-03-24 08:52:36-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x10d003c20>
2015-03-24 08:52:36-0700 [-] Connecting to buildbot-master107.bb.releng.scl3.mozilla.com:9201
2015-03-24 08:52:36-0700 [-] Watching /builds/slave/talos-slave/shutdown.stamp's mtime to initiate shutdown
2015-03-24 08:52:36-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2015-03-24 08:52:36-0700 [Broker,client] While trying to connect:
        Traceback from remote host -- Traceback (most recent call last):
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
            d = self.portal.login(self, mind, IPerspective)
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/cred/portal.py", line 116, in login
            ).addCallback(self.realm.requestAvatar, mind, *interfaces
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
            callbackKeywords=kw)
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 498, in requestAvatar
            p = self.botmaster.getPerspective(mind, avatarID)
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 364, in getPerspective
            d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 328, in callRemote
            _name, args, kw)
          File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
            raise DeadReferenceError("Calling Stale Broker")
        twisted.spread.pb.DeadReferenceError: Calling Stale Broker
        
2015-03-24 08:52:36-0700 [Broker,client] Lost connection to buildbot-master107.bb.releng.scl3.mozilla.com:9201
2015-03-24 08:52:36-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x10d003c20>
2015-03-24 08:52:36-0700 [-] Main loop terminated.
2015-03-24 08:52:36-0700 [-] Server Shut Down.

So it keeps looping indefinitely through the list of runner tasks.
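For context, here is a minimal sketch (hypothetical names; the real logic lives in buildbot's `master.py` `getPerspective` and Twisted's `spread/pb.py`) of why a duplicate connection can fail this way: the master keeps one broker per slave name, and when a new connection arrives under an existing name it pings the old broker first. If that old broker's link is already dead, the ping raises `DeadReferenceError`, and that error is what the *new* connection sees as its login failure — so the slave retries forever.

```python
# Sketch of duplicate-slave handling producing "Calling Stale Broker".
# Assumption: this mirrors the shape of the buildbot/Twisted code, not
# the actual implementation.

class DeadReferenceError(Exception):
    """Raised when calling a method over a broker whose link is gone."""

class Broker:
    def __init__(self):
        self.disconnected = False  # set True when the TCP link drops

    def call_remote(self, method, *args):
        if self.disconnected:
            # Mirrors twisted.spread.pb._sendMessage
            raise DeadReferenceError("Calling Stale Broker")
        return "ok"

class BotMaster:
    def __init__(self):
        self.slaves = {}  # slave name -> broker of the current connection

    def get_perspective(self, broker, name):
        old = self.slaves.get(name)
        if old is not None:
            # Ping the old connection; if it is stale, this raises and
            # the NEW slave's login fails -- the loop seen in this bug.
            old.call_remote("print",
                            "master got a duplicate connection; "
                            "keeping this one")
        self.slaves[name] = broker
        return broker
```

Under this reading, the master is stuck holding a stale broker for the slave name, so every fresh connection attempt from the re-imaged machine trips over it.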

cc-ing :kmoir for yosemite insight, and :mrrrgn for runner.
The master had this in its log:
2015-03-24 10:31:38-0700 [Broker,31889,10.26.56.68] duplicate slave t-yosemite-r5-0073; rejecting new slave and pinging old
2015-03-24 10:31:38-0700 [Broker,31889,10.26.56.68] old slave was connected from IPv4Address(TCP, '10.26.56.68', 49239)
2015-03-24 10:31:38-0700 [Broker,31889,10.26.56.68] new slave is from IPv4Address(TCP, '10.26.56.68', 49204)
2015-03-24 10:31:38-0700 [Broker,31889,10.26.56.68] Peer will receive following PB traceback:
2015-03-24 10:31:38-0700 [Broker,31889,10.26.56.68] Unhandled Error

I updated slavealloc and the slave attached to my master and started taking jobs. So I then re-enabled it as a production slave in slavealloc and rebooted it. It connected to buildbot-master107 again and the same error messages occurred.

According to the buildbot issues below, the old TCP connection should eventually time out, but I'm not sure that still applies given how old the reports are:

http://trac.buildbot.net/ticket/887
and
http://trac.buildbot.net/ticket/1856

according to netstat on the master (buildbot-master107.bb.releng.scl3.mozilla.com) there aren't any established connections to this ip (10.26.56.68)
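That check can be scripted; a small helper (hypothetical, written for illustration) that filters `netstat -an`-style output for established connections to a given IP looks like this:

```python
def established_connections(netstat_output, ip):
    """Return lines from `netstat -an`-style output that show an
    ESTABLISHED connection involving `ip` (as either endpoint).

    Assumes one connection per line with the state in the last column,
    which is the usual netstat layout.
    """
    hits = []
    for line in netstat_output.splitlines():
        if "ESTABLISHED" in line and ip in line:
            hits.append(line.strip())
    return hits
```

An empty result for 10.26.56.68, as seen here, would mean the master's kernel no longer holds a live socket to the slave even though buildbot still tracks the old broker.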
(In reply to Kim Moir [:kmoir] from comment #4)
> according to netstat on the master
> (buildbot-master107.bb.releng.scl3.mozilla.com) there aren't any established
> connections to this ip (10.26.56.68)

I'm going to gracefully restart bm107 and see if that helps.
This slave connected to bm108 and is now taking jobs. bm107 has been restarted.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard