Closed Bug 1286942 Opened 8 years ago Closed 8 years ago

Buildbot DB issues

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Unassigned)

References

Details

I'm seeing errors like this:

Jul 14 12:39:30 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector InternalError: (_mysql_exceptions.InternalError) (145, "Table './buildbot_schedulers/buildrequests' is marked as crashed and should be repaired") [SQL: u'SELECT buildrequests.complete \nFROM buildrequests \nWHERE buildrequests.id = %s'] [parameters: (116840858L,)]

Filing as a tracker.
Indeed, the table is crashed:
mysql> select * from buildrequests limit 5;
ERROR 145 (HY000): Table './buildbot_schedulers/buildrequests' is marked as crashed and should be repaired

Repairing:
mysql> repair table buildrequests;
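For anyone following along, this is the usual check/repair sequence for a crashed MyISAM table; a minimal sketch (REPAIR TABLE takes a write lock and rebuilds the data and index files, which is why it is slow on a table this size):

mysql> CHECK TABLE buildrequests;    -- reports the "marked as crashed" state
mysql> REPAIR TABLE buildrequests;   -- rebuilds the .MYD/.MYI files; slow on large tables
mysql> CHECK TABLE buildrequests;    -- should come back with Msg_text: OK once done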
See Also: → 1286963
Let me know when/if I need to start a build 2.
:sheeri - can you add to this bug the # records & oldest one? 

Theoretically, we (releng) prune every weekend via a job we control. The large size reported in #data looks "too big" if that job is running correctly.
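For context, a weekend prune of that table would look something like the sketch below. The retention window and the complete_at column are assumptions (only id, complete and submitted_at are actually quoted in this bug), this is not the real releng job, and a real job would also have to prune the related buildsets/builds rows:

mysql> DELETE FROM buildrequests
    ->  WHERE complete = 1
    ->    AND complete_at < UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH)
    ->  LIMIT 10000;   -- run in batches so the weekly job doesn't hold the table lock for hours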
[13:28] <      sheeri>| sal: still repairing
[13:28] <      sheeri>| build requests is a large table
[13:29] <         sal>| sheeri: ahh ty
[13:40] <      sheeri>| sal: yeah, it’s 31G of data + indexes
Flags: needinfo?(scabral)
(In reply to Hal Wine [:hwine] (use NI) from comment #3)
> [13:40] <      sheeri>| sal: yeah, it’s 31G of data + indexes

I guess this contains data from months ago. Would it be worth archiving part of it in a separate table (or in a dump, or whatever)?

Also, does taskcluster rely on this database being up and running? If not, can Try be opened so that at least taskcluster builds happen?
Since I've now been asked a few times: we don't have an ETA for tree reopening yet; the database team is working on fixing the issue.
[root@buildbot1.db.scl3 buildbot_schedulers]# ls -lrth !$
ls -lrth *.TMD
-rw-rw---- 1 mysql mysql 9.5G Jul 15 05:04 buildrequests.TMD
[root@buildbot1.db.scl3 buildbot_schedulers]# ls -lrth *.TMD
-rw-rw---- 1 mysql mysql 9.8G Jul 15 05:36 buildrequests.TMD

So probably another 30 minutes (the indexes are 10G).

The "Too many connections" issue recovered, at least transiently, and I'm in the database.
Flags: needinfo?(scabral)
the repair on the master is still ongoing, so we failed over to the slave. The *new* master is not replicating from the old master, so the repair won't happen again on the new master.
coop restarted some stuff, and I'm seeing queries come through to buildbot2, which is the new master and is handling both reads and writes until buildbot1 finishes the repair.
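A rough sketch of the checks that go with that kind of failover, assuming a standard MySQL master/slave setup (not necessarily the exact commands the DBAs ran):

mysql> STOP SLAVE;                  -- on buildbot2: stop pulling from buildbot1 so the in-flight REPAIR never replicates over
mysql> SHOW SLAVE STATUS\G          -- confirm the replication threads are stopped
mysql> SET GLOBAL read_only = OFF;  -- only needed if the old slave was running read_only
mysql> SHOW MASTER STATUS;          -- confirm buildbot2 is logging writes at its own binlog position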
coop mentioned a storage engine error and I'm seeing this in the logs:
Got error 127 when reading table './buildbot_schedulers/buildsets'

This is a table with <2G, so I'm repairing it now.
repair of buildsets complete.
scheduler_changes was also corrupt (throwing error 124 in the tests_scheduler log) and was repaired.
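Since these keep turning up one at a time, a quick way to sweep the whole schema for other corrupt MyISAM tables; just a sketch, the DBAs may well have their own tooling for this:

mysql> SELECT table_name FROM information_schema.tables
    ->  WHERE table_schema = 'buildbot_schedulers' AND engine = 'MyISAM';
mysql> CHECK TABLE buildrequests, buildsets, scheduler_changes;   -- then repair anything not reporting OK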
Trees are soft-opened right now as we restart services that were hung on the database corruption. 

Will close this once we're fully open again.
We've now opened Try and autoland, since we also have a backlog of things that need to catch up.
See Also: → 1287117
Things are looking better. We have a high pending count, which is not surprising given the length of the downtime. There is some confusion as to whether we have all platforms actually running something, but at this point I think that is due to the overwhelming delay from the ~25k pending jobs. Pending seems to have stabilized and is decreasing very slowly.

aws instances are spinning up fast now: https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard

current pending: https://secure.pub.build.mozilla.org/buildapi/pending

I think we can close out this bug, but the question is whether we want to let buildbot work through the pending backlog or whether we need to intervene and kill the oldest pending jobs. That depends on how invasive the db action would be and whether sheriffs would prefer longer pending times or having older jobs killed.
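To make the trade-off concrete, a sketch of what that intervention would mean at the db level. The details (submitted_at/complete_at as epoch seconds, claimed_at = 0 meaning unclaimed, and the results code) are assumptions from the stock buildbot schema, and in practice we'd go through self-serve/buildapi rather than raw SQL:

mysql> -- how many unclaimed pending requests are older than the cutoff
mysql> SELECT COUNT(*) FROM buildrequests
    ->  WHERE complete = 0 AND claimed_at = 0
    ->    AND submitted_at < UNIX_TIMESTAMP('2016-07-15');
mysql> -- marking them complete is what "killing the oldest pending jobs" amounts to
mysql> UPDATE buildrequests
    ->    SET complete = 1, complete_at = UNIX_TIMESTAMP(), results = 2
    ->  WHERE complete = 0 AND claimed_at = 0
    ->    AND submitted_at < UNIX_TIMESTAMP('2016-07-15');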
callek cleared ~8000 pending jobs from any job submitted prior to today (< 15th). However, devs and autoland have fought back and pushed our pending to ~30k.

Sheriffs are aware of the delay and we are going to wait a bit to see if coalescing and other natural optimizations (provisioning more instances against demand) kick in.
Okay, I've reopened everything but trunk trees, and I've set those to approval-required so I can meter the reopening. If all goes well, I'll fully reopen the trees in the next couple of hours.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
New table corruption issues; we may need to repair again:

12:26:48 <rail> hmm, selfserve is not happy
12:26:55 <rail> 500 :/
12:27:56 <•catlee> more db issues?
12:28:38 <jlund> wonder if "Pending jobs is UNKNOWN: No JSON object could be decoded" has something to do with it
12:29:35 <Callek> jlund: that may be a symptom
12:29:49 <Callek> if its more DB issues, I'll be *really* sad, tbh
12:29:56 — jlund hops on 81
12:30:29 <jlund> _mysql_exceptions.InternalError: (144, "Table './buildbot_schedulers/buildsets' is marked as crashed and last (automatic?) repair failed")
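(Error 144 means the automatic repair already failed, so a plain REPAIR TABLE may not be enough; the usual escalation, as a sketch:)

mysql> REPAIR TABLE buildsets EXTENDED;   -- slower row-by-row index rebuild
mysql> REPAIR TABLE buildsets USE_FRM;    -- last resort: rebuild the index file from the .frm definition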
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 1287455
goncalves_pythian helped to fix the table and we are not seeing any errors now.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
See Also: → 1287817
FYI, in comment 3 I was asked for the age of the oldest record and the # of records:

oldest "submitted at" time - 2011-06-09 00:00:45 
# records 113,570,887 (113 million)
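(A query along these lines would produce those two numbers; this assumes submitted_at is stored as epoch seconds as in the stock buildbot schema, otherwise drop the FROM_UNIXTIME:)

mysql> SELECT COUNT(*) AS records,
    ->        FROM_UNIXTIME(MIN(submitted_at)) AS oldest_submitted_at
    ->   FROM buildrequests;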
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard