Closed Bug 1438828 Opened 6 years ago Closed 6 years ago

Periodic Treeherder HTTP 503s/H12s due to Heroku routing issues

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

For the 2nd/3rd time, we've just had a brief spell of HTTP 503s for Treeherder web requests, which Heroku reports as error code H12:
https://screenshots.firefox.com/hBLcO4Cyvuup3SJw/dashboard.heroku.com

The symptoms are:
* sudden jump from normal request times to the 30s cut-off + HTTP 503s / H12s, with no ramp up
* all failed requests are made to the same single dyno

Normally an H12 would be an app issue, however:
* H12s are normally caused by too much request queueing, and so would typically correlate with throughput increases or an increase in dynamic request times, which I don't see on New Relic at all (plus no ramp up)
* this looks like it's the same issue being seen by the taskcluster apps in bug 1434338 (for which they've opened https://help.heroku.com/tickets/550490 but not had a great response)

From the logs for this most recent instance (which affected web.2)...

# Last successful request:
Feb 16 11:59:12 treeherder-prod app/web.2: 10.65.108.122 - - [16/Feb/2018:11:59:12 +0000] "GET /api/project/mozilla-inbound/bug-job-map/?job_id=162623817 HTTP/1.1" 200 2 "https://treeherder.mozilla.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0" 
Feb 16 11:59:12 treeherder-prod heroku/router: at=info method=GET path="/api/project/mozilla-inbound/bug-job-map/?job_id=162623817" host=treeherder.mozilla.org request_id=0c052efd-5557-4f3c-af14-11270ea65d5f fwd="<IP REMOVED>" dyno=web.2 connect=1ms service=15ms status=200 bytes=376 protocol=https 

# Errors seen (380 failed requests total):
Feb 16 11:59:47 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/revision.txt" host=treeherder.mozilla.org request_id=c72f806a-af2b-4367-8ac2-6665bfba6bdf fwd="<IP REMOVED>" dyno=web.2 connect=4ms service=30001ms status=503 bytes=0 protocol=https 
Feb 16 11:59:47 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/project/try/jobs/?count=2000&last_modified__gt=2018-02-12T19:17:01.018&result_set_id__in=305266&return_type=list" host=treeherder.mozilla.org request_id=5af20039-a1cd-45ec-b963-b521c3d3ee67 fwd="<IP REMOVED>" dyno=web.2 connect=3ms service=30000ms status=503 bytes=0 protocol=https 
Feb 16 11:59:48 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/" host=treeherder.mozilla.org request_id=2b97da6b-9fd2-4ee8-a634-190d036e4f04 fwd="<IP REMOVED>" dyno=web.2 connect=3ms service=30000ms status=503 bytes=0 protocol=https 
...
Feb 16 12:03:43 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/project/mozilla-release/resultset/?fromchange=586e26fcca79a844c6f5fcebf17068423b283d02" host=treeherder.mozilla.org request_id=fe191dea-bfd3-42da-bef4-3e2fb79f0064 fwd="<IP REMOVED>" dyno=web.2 connect=6ms service=30000ms status=503 bytes=0 protocol=https 
Feb 16 12:03:43 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/project/autoland/resultset/?fromchange=f1487d25e019dd13454840e29b416ca441e79344" host=treeherder.mozilla.org request_id=535b6b32-3a46-4e29-8bf0-9cb684b7956d fwd="<IP REMOVED>" dyno=web.2 connect=1ms service=30001ms status=503 bytes=0 protocol=https 
Feb 16 12:03:45 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/revision.txt" host=treeherder.mozilla.org request_id=a9d95bda-8711-4ffd-ac7c-553b8755c1cb fwd="<IP REMOVED>" dyno=web.2 connect=3ms service=30001ms status=503 bytes=0 protocol=https 

# And buried in that:

Feb 16 12:03:14 treeherder-prod heroku/web.2: Relocating dyno to a new server 
Feb 16 12:03:15 treeherder-prod heroku/web.2: State changed from up to down 
Feb 16 12:03:15 treeherder-prod heroku/web.2: State changed from down to starting 
Feb 16 12:03:19 treeherder-prod heroku/web.2: Starting process with command `newrelic-admin run-program gunicorn treeherder.config.wsgi:application --timeout 20` 
Feb 16 12:03:22 treeherder-prod app/web.2: [2018-02-16 12:03:21 +0000] [4] [INFO] Starting gunicorn 19.7.1 
Feb 16 12:03:22 treeherder-prod heroku/web.2: State changed from starting to up 

...so this appears to be a "Heroku routing layer not handling health checks correctly" type issue.
I've filed our own ticket to try and give support another nudge:
https://help.heroku.com/tickets/555426
Heroku have made a few tweaks/fixes which seem to have helped.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
This happened again today on prod between 16:22 and 16:27 UTC.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
First H12 at 16:23:32:

Mar 05 16:23:32 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/project/mozilla-inbound/resultset/?fromchange=d50f11db7d6479c39125a4db75a8a31daeb8fba3" host=treeherder.mozilla.org request_id=244cfc07-0df9-4b37-8212-789634b636df fwd="REDACTED" dyno=web.1 connect=3ms service=30001ms status=503 bytes=0 protocol=https 

Last H12 at 16:27:15:

Mar 05 16:27:15 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/project/try/jobs/?count=2000&last_modified__gt=2018-03-04T05:03:04.993&result_set_id__in=312331,312324,312319,312317,312313,312312,312308,312304,312291,312269,312133,311838,311821,311153,311129,311053,311050,310972,310971,310942&return_type=list" host=treeherder.mozilla.org request_id=c116dce0-34d9-4f67-bfc2-5e4c78da6319 fwd="REDACTED" dyno=web.1 connect=1ms service=30001ms status=503 bytes=0 protocol=https

The dyno was relocated, as before:

Mar 05 16:26:45 treeherder-prod heroku/web.1: Relocating dyno to a new server 
Mar 05 16:26:45 treeherder-prod heroku/web.1: State changed from up to down 
Mar 05 16:26:45 treeherder-prod heroku/web.1: State changed from down to starting 
Mar 05 16:26:51 treeherder-prod heroku/web.1: Starting process with command `newrelic-admin run-program gunicorn treeherder.config.wsgi:application --timeout 20` 
Mar 05 16:26:54 treeherder-prod heroku/web.1: State changed from starting to up 

The other two dynos continued functioning, so would have affected only a proportion of requests, but still.
I've left another comment on https://help.heroku.com/tickets/550490
They left another reply on 6th March (ask me for the share link if you want to read it). Work is ongoing on their side, and a proportion of the causes have been identified and fixed.

We haven't seen an issue since 5th March, so closing this out for now.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.