Closed Bug 1094922 Opened 10 years ago Closed 10 years ago

Widespread hg.mozilla.org unresponsiveness

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1096337

People

(Reporter: RyanVM, Assigned: fubar)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4069] [tree-closing])

All trees are currently closed due to hg.mozilla.org being nearly unresponsive.
Lots of self-serve nagios alerts as well, à la:
Thu 08:32:04 PST [4872] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs)
Assignee: nobody → klibby
Servers appear to be happy again.

Early indications are that release automation started sending a lot of traffic to hg.mozilla.org.

You can see things at https://graphite.mozilla.org/render?from=-6hours&until=now&width=586&height=308&_salt=1408009636.045&target=hosts.hgweb*_dmz_scl3_mozilla_com.apache.apache80.apache_scoreboard.sending.count&title=In-flight%20Apache%20Requests&hideLegend=true&_uniq=0.6127992861238261
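For reference, the same series can be pulled programmatically by asking graphite's render API for JSON instead of a PNG (format=json is a standard render option). A minimal sketch, with the target copied from the URL above:

import json
from urllib.request import urlopen

# Same target as the dashboard URL above, but requesting JSON from the
# graphite render API instead of a rendered PNG.
URL = (
    "https://graphite.mozilla.org/render"
    "?from=-6hours&until=now&format=json"
    "&target=hosts.hgweb*_dmz_scl3_mozilla_com.apache.apache80."
    "apache_scoreboard.sending.count"
)

with urlopen(URL) as resp:
    series = json.load(resp)

# Each entry looks like {"target": ..., "datapoints": [[value, unix_ts], ...]}.
for s in series:
    values = [v for v, _ts in s["datapoints"] if v is not None]
    if values:
        print(s["target"], "peak in-flight requests:", max(values))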

It started with a slow ramp-up ~3.5 hours ago, hit peak nearly 2 hours ago, and was mitigated a few minutes ago.
We're still trying to find the smoking gun.

Meltdown of hg.mozilla.org was caused by exhaustion of the WSGI worker pool on the servers. Many incoming requests sat waiting for a WSGI worker to become available, and many of those timed out. hg.mozilla.org was still serving requests during the outage, but since Mercurial operations like `hg pull` and `hg clone` (which are the same operation from the server's perspective) involve multiple HTTP requests to the server, chances are at least one of them failed. Still, enough requests were getting through that not all clients were impacted.
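To put rough numbers on that last point (purely illustrative figures, not measurements from this event): because one pull or clone is several HTTP requests, even a modest per-request timeout rate compounds into a much higher per-operation failure rate, while plenty of individual requests still get through.

# Illustrative only: how a per-request failure rate compounds when a single
# Mercurial pull/clone consists of several HTTP requests. The request count
# and failure rates here are made up, not measured from the event.
def operation_failure_rate(per_request_failure, requests_per_op):
    """Probability that at least one request in the operation fails,
    assuming independent failures."""
    return 1 - (1 - per_request_failure) ** requests_per_op

for p in (0.05, 0.20, 0.50):
    print(f"per-request failure {p:.0%} -> per-pull failure "
          f"{operation_failure_rate(p, 4):.0%}")
# e.g. 5% per-request failure -> ~19% per-pull failure (with 4 requests/pull)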

Increasing the WSGI process pool size (after realizing that the machines were not maxed out on CPU, were not incurring excessive I/O wait, and were not swapping) made the issue go away.

We're still trying to figure out what caused requests to process slower and the WSGI pool to be exhausted.
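One way to think about that, as a back-of-the-envelope sketch rather than data from the incident: by Little's law, the number of busy WSGI workers is roughly arrival rate times service time, so if service time balloons while traffic stays flat, a fixed-size pool saturates even though CPU, I/O, and network all look healthy. The request rate and pool size below are made-up placeholders; the service times echo the getbundle wall times reported later in this bug.

# Back-of-the-envelope sketch using Little's law: busy workers ~= arrival rate
# x service time. ARRIVAL_RATE and POOL_SIZE are hypothetical placeholders;
# the service times roughly echo the getbundle wall times seen in this bug.
ARRIVAL_RATE = 2.0   # requests/second per hgweb host (hypothetical)
POOL_SIZE = 60       # WSGI workers per host (hypothetical)

for label, service_time in (("normal", 9.0), ("during event", 70.0)):
    busy = ARRIVAL_RATE * service_time
    state = "OK" if busy < POOL_SIZE else "pool exhausted, requests queue up"
    print(f"{label}: ~{busy:.0f} workers busy of {POOL_SIZE} -> {state}")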
At this time, there are no more operational issues with the hg server, and sheriffs have been given the go-ahead to open trees at their discretion. (They'll do so after some builds go green.)
Whiteboard: [tree-closing]
Here are the average, per-hour times for "getbundle" (the underlying Mercurial protocol command behind pull/clone) on the build/tools repo today:

Hour                 Wall   CPU
1900-11-06T00:00:00  12.01  5.94
1900-11-06T01:00:00   9.58  5.67
1900-11-06T02:00:00   9.30  5.66
1900-11-06T03:00:00   7.29  4.81
1900-11-06T04:00:00   6.35  4.15
1900-11-06T05:00:00   8.38  5.07
1900-11-06T06:00:00   6.30  4.12
1900-11-06T07:00:00   8.46  5.17
1900-11-06T08:00:00   8.87  5.47
1900-11-06T09:00:00  11.85  5.69
1900-11-06T10:00:00  12.30  5.85
1900-11-06T11:00:00  14.95  6.06
1900-11-06T12:00:00   9.12  5.64
1900-11-06T13:00:00  19.73  5.80
1900-11-06T14:00:00  37.05  4.78
1900-11-06T15:00:00  70.34  5.81
1900-11-06T16:00:00  46.13  5.96
1900-11-06T17:00:00  39.59  6.11
1900-11-06T18:00:00   7.21  4.36

(Ignore the 1900 year part. The logs aren't capturing the year and the date parsing fills in 1900 as the default. Sad panda.)
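For the curious, that 1900 default is standard Python strptime behavior: when the format string has no year directive, the missing year defaults to 1900. A minimal sketch of that, plus the per-hour averaging, using made-up records rather than the real hgweb log format:

from collections import defaultdict
from datetime import datetime

# With no %Y in the format string, strptime fills in 1900 as the year,
# which is exactly why the table above reads 1900-11-06.
print(datetime.strptime("11-06T15:23:41", "%m-%dT%H:%M:%S"))
# -> 1900-11-06 15:23:41

# Per-hour averaging over (timestamp, wall, cpu) records; the records here
# are made up for illustration, not taken from the server logs.
records = [
    (datetime(1900, 11, 6, 15, 23, 41), 68.2, 5.7),
    (datetime(1900, 11, 6, 15, 40, 2), 72.5, 5.9),
]
buckets = defaultdict(list)
for ts, wall, cpu in records:
    buckets[ts.replace(minute=0, second=0)].append((wall, cpu))
for hour, rows in sorted(buckets.items()):
    print(hour.isoformat(),
          round(sum(w for w, _ in rows) / len(rows), 2),
          round(sum(c for _, c in rows) / len(rows), 2))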

The times captured include just the time that the HTTP request was actively being processed in a WSGI process: they do *not* include the time between when the HTTP request arrived and when it was successfully dispatched to a WSGI worker (seconds could have passed there).

The following things could influence getbundle times:

1) The size of data in the bundle
2) The state of the server
3) The state of the client
4) The network in between the server and client

We know from the consistent CPU times (which shouldn't vary no matter what state things are in: serving a bundle is an essentially deterministic operation, with CPU usage varying by only ±1% in a controlled environment) that the size of the bundle was consistent before and during the event. So #1 did not influence the event.

build/tools is a small repository (5147 changesets with 10394 changes to 1212 files, ~27 MB). Clients should have no problem cloning it quickly. Furthermore, since dozens of clients inside automation clone this repo and experienced the same thing, I *think* we can rule out some kind of systemic client-centric event causing their clone times to increase.

That leaves the state of the server and the network.

We didn't see a major increase in clone/pull events compared to before the outage, so it doesn't look like a sudden surge of incoming traffic caused this.

https://nigelbabu.github.io/hgstats/#hours/10 shows that the servers were behaving well, even as load grew. It wasn't until we reached a tipping point a few hours in that the servers fell over.

https://graphite.mozilla.org/render?from=-10hours&until=now&width=586&height=308&_salt=1408009636.045&target=sumSeries%28hosts.hgweb*_dmz_scl3_mozilla_com.interface.if_octets.bond0.tx%29&title=Outbound%20Network%20Traffic&hideLegend=true&_uniq=0.6127992861238261 shows that our total outbound network traffic was also pretty consistent during the event.

I'm kind of scratching my head here. I'd like to point in the direction of a network event, but I would expect an event that reduced transfer speed to also reduce total egress from hg.mozilla.org, and we don't see a major loss in outbound traffic.

So weird.
I'm not sure I'd call that graphite graph "pretty consistent." That's a drop of roughly 100M octets/sec from the start to bottoming out. The relevant Zeus graph shows the same thing in different units: http://people.mozilla.org/~klibby/hg/hg-http-data-to-client-24h.png

An 800 Mbps drop over 90 minutes seems fairly significant.
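For the unit math (assuming, as the matching Zeus figure suggests, that the graphite if_octets series is in octets per second): a 100M octets/sec drop is the same ~800 Mbps drop the Zeus graph shows, so the two graphs agree.

# Sanity check that the graphite and Zeus graphs agree, assuming the graphite
# if_octets series is in octets (bytes) per second.
drop_octets_per_sec = 100e6                    # ~100M octets/sec drop in graphite
print(drop_octets_per_sec * 8 / 1e6, "Mbps")   # 800.0 Mbps, matching the Zeus graph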
Whiteboard: [tree-closing] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4069] [tree-closing]
The event is resolved. Things seem to be humming along fine this morning, although we do still see elevated in-flight requests compared to the days before.

I wrote up a summary of events at http://gregoryszorc.com/blog/2014/11/07/mercurial-server-hiccup-2014-11-06/

I reckon Hal will be filing a new bug to track an official RFO.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
There was a continuation of this event today. More details are in the duped bug.
Resolution: FIXED → DUPLICATE