Closed Bug 1286942 Opened 8 years ago Closed 8 years ago

Buildbot DB issues

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Unassigned)

References

Details

I'm seeing errors like this:

Jul 14 12:39:30 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector InternalError: (_mysql_exceptions.InternalError) (145, "Table './buildbot_schedulers/buildrequests' is marked as crashed and should be repaired") [SQL: u'SELECT buildrequests.complete \nFROM buildrequests \nWHERE buildrequests.id = %s'] [parameters: (116840858L,)]

Filing as a tracker.
Indeed, the table is crashed:
mysql> select * from buildrequests limit 5;
ERROR 145 (HY000): Table './buildbot_schedulers/buildrequests' is marked as crashed and should be repaired

Repairing:
mysql> repair table buildrequests;
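For anyone following along, this is the usual check/repair sequence for a crashed MyISAM table; a minimal sketch (REPAIR TABLE takes a write lock and rebuilds the data and index files, which is why it is slow on a table this size):

mysql> CHECK TABLE buildrequests;    -- reports the "marked as crashed" state
mysql> REPAIR TABLE buildrequests;   -- rebuilds the .MYD/.MYI files; slow on large tables
mysql> CHECK TABLE buildrequests;    -- should come back with Msg_text: OK once done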
See Also: → 1286963
Let me know when/if I need to start a build 2.
:sheeri - can you add to this bug the # records & oldest one? 

Theoretically, we (releng) prune every weekend via a job we control. The large size reported in #data looks "too big" if that job is running correctly.
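For context, a weekend prune of that table would look something like the sketch below. The retention window and the complete_at column are assumptions (only id, complete and submitted_at are actually quoted in this bug), this is not the real releng job, and a real job would also have to prune the related buildsets/builds rows:

mysql> DELETE FROM buildrequests
    ->  WHERE complete = 1
    ->    AND complete_at < UNIX_TIMESTAMP(NOW() - INTERVAL 6 MONTH)
    ->  LIMIT 10000;   -- run in batches so the weekly job doesn't hold the table lock for hours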
[13:28] <      sheeri>| sal: still repairing
[13:28] <      sheeri>| build requests is a large table
[13:29] <         sal>| sheeri: ahh ty
[13:40] <      sheeri>| sal: yeah, it’s 31G of data + indexes
Flags: needinfo?(scabral)
(In reply to Hal Wine [:hwine] (use NI) from comment #3)
> [13:40] <      sheeri>| sal: yeah, it’s 31G of data + indexes

I guess this contains data from months ago. Would it be worth archiving part of it in a separate table (or in a dump, or whatever)?

Also, does taskcluster rely on this database being up and running? If not, can Try be opened so that at least taskcluster builds happen?
Since I've now been asked a few times: we don't have an ETA for tree reopening yet; the database team is working on fixing the issue.
[root@buildbot1.db.scl3 buildbot_schedulers]# ls -lrth !$
ls -lrth *.TMD
-rw-rw---- 1 mysql mysql 9.5G Jul 15 05:04 buildrequests.TMD
[root@buildbot1.db.scl3 buildbot_schedulers]# ls -lrth *.TMD
-rw-rw---- 1 mysql mysql 9.8G Jul 15 05:36 buildrequests.TMD

So probably another 30 minutes (the indexes are 10G).

The "Too many connections" issue recovered, at least transiently, and I'm in the database.
Flags: needinfo?(scabral)
the repair on the master is still ongoing, so we failed over to the slave. The *new* master is not replicating from the old master, so the repair won't happen again on the new master.
coop restarted some stuff, and I'm seeing queries come through to buildbot2, which is the new master and is handling both reads and writes until buildbot1 finishes the repair.
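A rough sketch of the checks that go with that kind of failover, assuming a standard MySQL master/slave setup (not necessarily the exact commands the DBAs ran):

mysql> STOP SLAVE;                  -- on buildbot2: stop pulling from buildbot1 so the in-flight REPAIR never replicates over
mysql> SHOW SLAVE STATUS\G          -- confirm the replication threads are stopped
mysql> SET GLOBAL read_only = OFF;  -- only needed if the old slave was running read_only
mysql> SHOW MASTER STATUS;          -- confirm buildbot2 is logging writes at its own binlog position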
coop mentioned a storage engine error and I'm seeing this in the logs:
Got error 127 when reading table './buildbot_schedulers/buildsets'

This is a table with <2G, so I'm repairing it now.
repair of buildsets complete.
scheduler_changes was also corrupt (throwing error 124 in the tests_scheduler log) and was repaired.
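Since these keep turning up one at a time, a quick way to sweep the whole schema for other corrupt MyISAM tables; just a sketch, the DBAs may well have their own tooling for this:

mysql> SELECT table_name FROM information_schema.tables
    ->  WHERE table_schema = 'buildbot_schedulers' AND engine = 'MyISAM';
mysql> CHECK TABLE buildrequests, buildsets, scheduler_changes;   -- then repair anything not reporting OK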
Trees are soft-opened right now as we restart services that were hung on the database corruption. 

Will close this once we're fully open again.
We've now opened Try and autoland, since we also have a backlog of things that need to catch up.
See Also: → 1287117
Things are looking better. We have a high pending count, which is not surprising given the length of the downtime. There is some confusion as to whether we have all platforms actually running something, but at this point I think that is due to the overwhelming delay from the ~25k pending jobs. Pending seems to have stabilized and is decreasing very slowly.

aws instances are spinning up fast now: https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard

current pending: https://secure.pub.build.mozilla.org/buildapi/pending

I think we can close out this bug, but the question is whether we want to let buildbot work through the pending backlog or whether we need to intervene and kill the oldest pending jobs. That depends on how invasive the db action would be and whether sheriffs would prefer longer pending times or having older jobs killed.
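To make the trade-off concrete, a sketch of what that intervention would mean at the db level. The details (submitted_at/complete_at as epoch seconds, claimed_at = 0 meaning unclaimed, and the results code) are assumptions from the stock buildbot schema, and in practice we'd go through self-serve/buildapi rather than raw SQL:

mysql> -- how many unclaimed pending requests are older than the cutoff
mysql> SELECT COUNT(*) FROM buildrequests
    ->  WHERE complete = 0 AND claimed_at = 0
    ->    AND submitted_at < UNIX_TIMESTAMP('2016-07-15');
mysql> -- marking them complete is what "killing the oldest pending jobs" amounts to
mysql> UPDATE buildrequests
    ->    SET complete = 1, complete_at = UNIX_TIMESTAMP(), results = 2
    ->  WHERE complete = 0 AND claimed_at = 0
    ->    AND submitted_at < UNIX_TIMESTAMP('2016-07-15');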
callek cleared ~8000 pending jobs from any job submitted prior to today (< 15th). However, devs and autoland have fought back and pushed our pending to ~30k.

Sheriffs are aware of the delay and we are going to wait a bit to see if coalescing and other natural optimizations (provisioning more instances against demand) kick in.
Okay, I've reopened everything but trunk trees, and I've set those to approval-required so I can meter the reopening. If all goes well, I'll fully reopen the trees in the next couple of hours.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
New table corruption issues; we may need to repair again:

12:26:48 <rail> hmm, selfserve is not happy
12:26:55 <rail> 500 :/
12:27:56 <•catlee> more db issues?
12:28:38 <jlund> wonder if "Pending jobs is UNKNOWN: No JSON object could be decoded" has something to do with it
12:29:35 <Callek> jlund: that may be a symptom
12:29:49 <Callek> if its more DB issues, I'll be *really* sad, tbh
12:29:56 — jlund hops on 81
12:30:29 <jlund> _mysql_exceptions.InternalError: (144, "Table './buildbot_schedulers/buildsets' is marked as crashed and last (automatic?) repair failed")
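(Error 144 means the automatic repair already failed, so a plain REPAIR TABLE may not be enough; the usual escalation, as a sketch:)

mysql> REPAIR TABLE buildsets EXTENDED;   -- slower row-by-row index rebuild
mysql> REPAIR TABLE buildsets USE_FRM;    -- last resort: rebuild the index file from the .frm definition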
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 1287455
goncalves_pythian helped to fix the table and we are not seeing any errors now.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
See Also: → 1287817
FYI, in comment 3 I was asked for the age of the oldest record and the # of records:

oldest "submitted at" time - 2011-06-09 00:00:45 
# records 113,570,887 (113 million)
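(A query along these lines would produce those two numbers; this assumes submitted_at is stored as epoch seconds as in the stock buildbot schema, otherwise drop the FROM_UNIXTIME:)

mysql> SELECT COUNT(*) AS records,
    ->        FROM_UNIXTIME(MIN(submitted_at)) AS oldest_submitted_at
    ->   FROM buildrequests;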
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard