Closed Bug 1497956 Opened 6 years ago Closed 6 years ago

[tracker] upgrade postgres to 9.5

Categories

(Socorro :: Database, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: miles)

References

Details

Attachments

(1 file)

Socorro is currently using Postgres 9.4. This covers the work required to upgrade to 9.5.
Making this a P2. We want to get this done soon.
Priority: -- → P2
We'll want to get to 9.6. That will require two upgrades (9.4 to 9.5, then 9.5 to 9.6), meaning two downtime windows for RDS. We should stop at 9.6 since CloudSQL does not support 10.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.PostgreSQL.html#USER_UpgradeDBInstance.PostgreSQL.MajorVersion

https://cloud.google.com/sql/docs/postgres/db-versions
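
For reference, RDS publishes the valid upgrade targets for each engine version, which confirms that we can't jump from 9.4 straight to 9.6. A minimal sketch of checking that with boto3 (the region and version string below are placeholders, not our actual configuration):

    # Sketch: list valid upgrade targets for the Postgres version we're running.
    # Assumes AWS credentials are already configured; region/version are placeholders.
    import boto3

    rds = boto3.client("rds", region_name="us-west-2")
    resp = rds.describe_db_engine_versions(Engine="postgres", EngineVersion="9.4.18")
    for engine in resp["DBEngineVersions"]:
        targets = [t["EngineVersion"] for t in engine["ValidUpgradeTarget"]]
        print(engine["EngineVersion"], "can upgrade to:", targets)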
Everything that this was blocked on is fixed now.

I have no idea what this upgrade involves other than that it probably requires a site-wide outage. The db should be super small now. Does that help?

Brian, Miles: Can one of you take this on? We could do it in December or wait until January--I don't have a preference or dire need.
Flags: needinfo?(miles)
Flags: needinfo?(bpitts)
I can take this on, and December after all hands sounds good. This will involve a site-wide outage of some kind (we could go read-only), but we can decide on our tolerance.

The simplest option is to have RDS perform the upgrade for us on our existing instance. We'll disable writers (processor, crontabber) to simplify things. This would incur a sitewide downtime window. We'll want to test this in stage first, and potentially even on a DB dump of production to get an idea of the timing. The ops labor here is minimal.
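
For what it's worth, the in-place path boils down to a single modify-db-instance call with the major-version-upgrade flag set. A rough sketch with boto3 (the instance identifier and target version are placeholders; in practice this would go through our normal infra tooling):

    # Sketch: ask RDS to do an in-place major version upgrade, then wait for it.
    # DBInstanceIdentifier and EngineVersion below are placeholders.
    import boto3

    rds = boto3.client("rds", region_name="us-west-2")
    rds.modify_db_instance(
        DBInstanceIdentifier="socorro-stage-db",  # placeholder name
        EngineVersion="9.5.14",                   # placeholder target version
        AllowMajorVersionUpgrade=True,
        ApplyImmediately=True,  # otherwise RDS waits for the maintenance window
    )

    # The time spent waiting here is roughly the downtime window we'd measure in stage.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier="socorro-stage-db")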

Another option would be to disable writers, dump the database to a new instance, and perform the upgrade(s) there. Slightly more ops labor, but no full downtime window. We'd cut over to a new stack of instances pointed at the new DB and could cut back if things went wrong.
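
The dump-and-restore variant would look roughly like the sketch below (hostnames, database name, and user are placeholders, and the new instance would already be created on the target version):

    # Sketch: dump the old 9.4 instance and restore into a new instance on the
    # target version. All connection details are placeholders.
    import subprocess

    OLD_HOST = "old-db.example.rds.amazonaws.com"
    NEW_HOST = "new-db.example.rds.amazonaws.com"

    # Custom-format dump so pg_restore can run the restore with parallel jobs.
    subprocess.run(
        ["pg_dump", "-h", OLD_HOST, "-U", "postgres", "-Fc", "-f", "socorro.dump", "socorro"],
        check=True,
    )
    subprocess.run(
        ["pg_restore", "-h", NEW_HOST, "-U", "postgres", "-d", "socorro", "-j", "4", "socorro.dump"],
        check=True,
    )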

If downtime allows, it would be simpler to use the same instance. Disaster recovery would look a lot like the second approach - we'd spin up a new instance from backup and cut to that. I'm open to either here.

I think the next step would be to run through the first option (the in-place RDS upgrade) in stage and see how long the downtime window actually is. I've been pleasantly surprised by RDS upgrades in the past.

Will: Does that sound good?
Flags: needinfo?(willkg)
Flags: needinfo?(miles)
Flags: needinfo?(bpitts)
Upgrading the db in place sounds fine to me. I don't have a preference one way or the other.

Sounds like the steps are these:

1. delete unnecessary tables in stage and prod

2. upgrade the stage db in place, measure how long it takes, and note any issues we should think about for the prod upgrade (a small verification sketch follows this list)

3. depending on how that goes, schedule a prod upgrade
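
For step 2, a quick way to confirm the upgrade took on stage is to ask the server for its version afterwards; a minimal sketch (the connection string is a placeholder, not our real stage DSN):

    # Sketch: confirm the server version after the stage upgrade.
    # The connection string is a placeholder.
    import psycopg2

    conn = psycopg2.connect("host=stage-db.example.com dbname=socorro user=readonly")
    with conn.cursor() as cursor:
        cursor.execute("SHOW server_version")
        print("server_version:", cursor.fetchone()[0])
    conn.close()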


For timing, this week is short notice, but otherwise fine if we have time. The db is much smaller than it used to be, so maybe the outage would be short?

Next week (Dec 3 -> Dec 7) is all hands week. I'd be game for testing it on stage and doing prod if we wanted.

Next next week (Dec 10 -> Dec 14) is release week--we can't do an outage during release week.

After that, I don't know who's on PTO and who isn't, so it might get hard to schedule.

What do you think?
Flags: needinfo?(willkg)
Commits pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/f8cdac00c15628edd54ed9760d856f8cfeb7bcb7
bug 1497956: upgrade postgres to 9.5 in local dev environment

This updates the local dev and test environments to use PostgreSQL 9.5.

https://github.com/mozilla-services/socorro/commit/f6ae51ee4b03e0e44639b00bb8cbbf78bbfddc72
Merge pull request #4731 from willkg/1497956-pg95

bug 1497956: upgrade postgres to 9.5 in local dev environment
The old tables and stored procedures have been removed in stage and prod. I upgraded the image in the local dev environment and the tests to use PostgreSQL 9.5 and everything looks fine. Yay!
Assignee: willkg → miles
Miles upgraded stage yesterday. It was roughly a 12-minute outage. During that time, we saw errors from the webapp show up in Sentry.

We let that bake for 12 hours. Everything seems fine--nothing unusual in the logs, Sentry, or Datadog.

I think we're good to go for doing a prod upgrade. I'll email the stability list to see whether it would create a hardship if we did that today.
Miles upgraded prod just now. It was also about a 12-minute outage. Everything seems to be doing fine, so I'm going to mark this FIXED.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED