1231918 - incorrect http/200 responses from jsonrpc.cgi

Reporter

Description

•

9 years ago

when the following request is sent to bmo:

--
POST https://bugzilla.mozilla.org/jsonrpc.cgi
Content-Length: 2
Content-Type: application/json

{}
--

bugzilla returns http/500.  this isn't ideal but it's currently expected.


however bugzilla.mozilla.org intermittently returns a http/200 response:

--
HTTP/1.1 200 OK
Connection: close
Date: Fri, 11 Dec 2015 13:32:18 GMT
Server: Apache
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Client-Date: Mon, 07 Dec 2015 17:42:40 GMT
Keep-Alive: timeout=5, max=1000
X-Backend-Server: web5.bugs.scl3.mozilla.com
--

it's wildly variable when this may happen - sometimes i'll get a http/200 after a handful of requests, sometimes it's over a hundred.  i've seen the http/200 response from all of the backend servers.

this is very weird, and slightly concerning.  note the content-length of these invalid "ok" responses is always zero bytes.

significantly more concerning is occasionally (perhaps one in five of the http/200 responses) we see a set-cookie header:

--
HTTP/1.1 200 OK
Connection: close
Date: Fri, 11 Dec 2015 13:39:58 GMT
Server: Apache
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Keep-Alive: timeout=5, max=1000
Set-Cookie: BUGLIST=448493-510945-661739-667865-705617-839470-923774-1043837-1093462-1101296-1103323-1142139-1183663-1190333-890504-250667-315415-369955-531199-563881-565458-578828-676975-682026-697948-796848-895511-958564-969787-981574-995953-996754-1003476-1087418-1123986-1123987-1125285-1126996-1131187-1163631-1167896-1198975-1202496-1211270-1215800-391684-751274-559296-873055-968065; path=/; expires=Fri, 01-Jan-2038 00:00:00 GMT; secure
X-Backend-Server: web5.bugs.scl3.mozilla.com
--

there's no way this cookie should be returned as a result of the request - it's generated by buglist.cgi and stores the search result.

i've only ever seen he BUGLIST cookie returned; i've never seen a session or any other cookie.  however i'm still concerned that something weird is going on.


i've tried to replicate this issue without success under the following conditions:
- local dev environment (both mod_perl and mod_cgi)
- bugzilla-dev.allizom.org (bmo dev) and bugzilla.allizom.org (bmo stage)
- on a production webhead taken out of the zeus pool, requests generated from localhost

so this appears to be something triggered by concurrent requests, or zeus.


i'm not sure what's going on, and will need some lower-level assistance to figure out if apache/mod_perl/bugzilla is generating the invalid responses, or if it's something that zeus is doing.

:atoll indicated that packet sniffing magic may shed some light on the origin of the response.

as i'm taking pto after mozlando, :dylan and :fubar are the points of contact for this issue after today.

Dylan Hardison [:dylan] (he/him)

Assignee

Updated

•

9 years ago

Assignee: nobody → dylan

Jake Maul [:jakem]

Comment 1

•

8 years ago

Did some digging here, and I'm able to replicate this direct to one of the web nodes, without Zeus:

curl -v -H 'Host: bugzilla.mozilla.org' http://10.22.82.29:443/jsonrpc.cgi -d '{}' -H 'Content-Type: application/json' -o /dev/null

Run that a few times and eventually I was able to get some 200's out of it. Did not take a whole lot, less than 20 for me. Given what I'm seeing, I do suspect that parallelism is relevant, so peak traffic times are probably more susceptible to this happening.

I won't paste the output here because it contains seemingly arbitrary JSON data from other users' queries. Here's an example with the variable data cut out and replaced with "arbitrary bugzilla json data here".

> POST /jsonrpc.cgi HTTP/1.1
> Host: bugzilla.mozilla.org
> User-Agent: curl/7.43.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 2
>
} [2 bytes data]
* upload completely sent off: 2 out of 2 bytes
< HTTP/1.1 200 OK
< Date: Wed, 06 Jan 2016 17:55:32 GMT
< Server: Apache
< malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)")arbitrary bugzilla json data here">
< X-Backend-Server: web4.bugs.scl3.mozilla.com
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8

Note how this is in the *header* block, suggesting that something was trying to generate some sort of header and failed, reading the wrong query's data and generating an error instead of a valid header. The body of the response is indeed empty, hence Content-Length: 0.


I see at least 2 problems here:

1) Getting data from other queries. This is the part that I think is caused by parallelism. Thread safety or shared storage space (memcache, database) perhaps.

2) Whatever is trying to generate the header that goes there is not doing enough error checking and returning something completely bonkers.

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 2

•

8 years ago

http://modperlbook.org/html/16-2-2-Content-Length-Header.html

> "According to section 14.13 of the HTTP specification, the Content-Length header is the number of octets (8-bit bytes) in the 
> body of a message. If the length can be determined prior to sending, it can be very useful to include it. The most important 
> reason is that KeepAlive requests (when the same connection is used to fetch more than one object from the web server) work 
> only with responses that contain a Content-Length header"

It so happens that for BMO we:
have KeepAlive turned on
and we never ever set a Content-Length header (in fact, from perl most of the time we don't know this.)

As Jake was seeing body-content in the headers... this makes me think that might be the problem.

gozer: The documentation says "Keep Alives work only with responses that contain a Content-Length header",
what do they mean by "Work"? I've looked at the code, and as far as I can tell there is nothing stopping you from not sending a content-length *and* the connection being kept open by the KeepAlive setting...

Flags: needinfo?(gozer)

:Atoll

Comment 3

•

8 years ago

Note that it's possible to disable keepalives to the backend nodes, at the cost of performance.

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 4

•

8 years ago

KeepAlive is not the issue at hand -- because we see this even running curl against the webheads. My mistake.
After talking with gozer I have identified several other possibilities. Working through those now.

Flags: needinfo?(gozer)

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 5

•

8 years ago

Here's what I know so far:

When this happens, we see this in the logs:

Jan  8 17:00:02 web2.bugs.scl3.mozilla.com apache[10370]: [Fri Jan 08 09:00:02 2016] [error] [client 96.58.191.108] malformed header from script. Bad header=malformed JSON string, neither: jsonrpc.cgi

We see other mailformed headers in the log:

[client 202.67.42.34] malformed header from script. Bad header=<!DOCTYPE html PUBLIC "-//W3C/: buglist.cgi, referer: https://bugzilla.mozilla.org/query.cgi

That is why we sometimes see the buglist header.

Something could be happening in XS land, or it could be something happening in a $SIG{__DIE__} handler. I'm narrowing it down.

With a kept-alive socket, printing out garbage -- it will sit there until the next request. So turning off keep alive would alleviate this slightly (but obviously this is still a problem! jsonrpc should not be omitting headers!)

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 6

•

8 years ago

I'm thinking this is server push + keepalives.
https://github.com/mozilla/webtools-bmo-bugzilla/blob/master/buglist.cgi#L730-L747

should disable keep alives. If that works, we should either call close() for server pushes, force the process
that does a server push to exit, or remove the server push feature. I strongly endorse the last option.

:glob ✱

Reporter

Comment 7

•

8 years ago

dylan - great work; that code does indeed look suspect.

fubar - can you disable keepalives in the dev and stage configs.  we'll do a quick sanity check to make sure nothing broke, and then we can disable it on production.

Flags: needinfo?(klibby)

Kendall Libby [:fubar] (he/him)

Comment 8

•

8 years ago

ok, dev and staging both have KeepAlive disabled.

Flags: needinfo?(klibby)

:glob ✱

Reporter

Comment 9

•

8 years ago

thanks fubar - things look good to me.  please flip the switch on prod.

Flags: needinfo?(klibby)

Kendall Libby [:fubar] (he/him)

Comment 10

•

8 years ago

(In reply to Byron Jones ‹:glob› from comment #9)
> thanks fubar - things look good to me.  please flip the switch on prod.

done.

Flags: needinfo?(klibby)

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 11

•

8 years ago

Attached patch 1231918_1.patch — Details — Splinter Review

At least in part, the problem is that sentry adds a __DIE__ handler which calls ThrowTemplateError(). That function expects the http headers to already be sent! Examining the call sites, I concluded this was always the case (except in our __DIE__ handler). This patch fixes that call to ThrowTemplateError().

Meanwhile, we need to eventually not use a __DIE__ handler. It would be better to have an eval { } block in mod_perl.pl, and only install the __DIE__ handler on dev machines.

Note that bz-dev-manager adds its own __DIE__ handler -- you will need to revert that change to test this patch.

For testing, I used http://bugzilla.local/bmo/buglist.cgi?bugidtype=include&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=XML&keywords=css1%3C%00ScRiPt%20%0D%0A%3Eprompt%28964655%29%3C%2FScRiPt%3E&keywords_type=anywords&list_id=396

It remains to be seen if this is *only* problem we're encountering for the random intermixed HTTP responses. We may need to call CORE::GLOBAL::exit() when a fatal error happens, because mod_perl2 gets into an inconsistent state.

Attachment #8707628 - Flags: review?(dkl)

David Lawrence [:dkl]

Comment 12

•

8 years ago

Comment on attachment 8707628 [details] [diff] [review]
1231918_1.patch

Review of attachment 8707628 [details] [diff] [review]:
-----------------------------------------------------------------

r=dkl

::: Bugzilla/Sentry.pm
@@ +336,5 @@
> +            my $cgi = Bugzilla->cgi;
> +            $cgi->close_standby_message('text/html', 'inline', 'error', 'html');
> +            Bugzilla::Error::ThrowTemplateError($message);
> +            print $cgi->multipart_final() if $cgi->{_multipart_in_progress};
> +

nit: remove extra newline

Attachment #8707628 - Flags: review?(dkl) → review+

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 13

•

8 years ago

So this will have unintended consequences because of the die handler we use to send stuff to sentry.

To explain how __DIE__ handlers work, see this code:

https://gist.github.com/dylanwh/8e46c22bda50eb21e60f

It outputs this:
die handler: error message at eval.pl line 12.
warn handler: error message at eval.pl line 12.

What does this mean?

Even with my fix, any code that does this:

eval { operation_that_may_die };
if ($@) {
    # ...
}

will first output an HTTP error page, and then continue on from the if block!

I'm going to examine the code to see if this happens anywhere.

Note this could even happen in code we don't control...

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 14

•

8 years ago

Okay, thankfully _in_eval() guards against this (which I didn't think eval { } block added (eval) to the stack.
Note this means exception handling is very slow (because we walk the stack for every exception)

:Atoll

Comment 15

•

8 years ago

Do we need a separate bug for "stop walking the stack for every exception", to track that perf issue?

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 16

•

8 years ago

I don't have proof that it is very slow, other than remembering that you're not really encouraged to use caller()... I think I'd like to land this patch and see what happens.

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 17

•

8 years ago

Aha! We need to use $^S instead of _in_eval().

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 18

•

8 years ago

I thought maybe that wasn't a read-only SV, but it is so there's no fixing that under mod_perl.

:glob ✱

Reporter

Updated

•

8 years ago

Component: Infrastructure → General

QA Contact: mcote

:glob ✱

Reporter

Comment 19

•

8 years ago

To ssh://gitolite3@git.mozilla.org/webtools/bmo/bugzilla.git
   ace6728..15a6c7e  master -> master

Dylan Hardison [:dylan] (he/him)

Assignee

Updated

•

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

:Atoll

Comment 20

•

8 years ago

Did we reenable keepalives as part of shipping the fix in comment 19, or does this fix require *both* the comment 19 *and* keepalives off solutions in tandem?

Kendall Libby [:fubar] (he/him)

Comment 21

•

8 years ago

keepalives have NOT been re-enabled, but I don't know if that is still required.

Status: RESOLVED → REOPENED

Flags: needinfo?(dylan)

Resolution: FIXED → ---

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 22

•

8 years ago

I'm fairly certain we can re-enable them. 98% sure.

Flags: needinfo?(dylan)

Kendall Libby [:fubar] (he/him)

Comment 23

•

8 years ago

re-enabled keepalives on dev and stage. please test and let me know if I can re-enable on prod.

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 24

•

8 years ago

Looks good. We never were able to trigger this on dev or stage though.

Kendall Libby [:fubar] (he/him)

Comment 25

•

8 years ago

ok, re-enabled on prod.

Dylan Hardison [:dylan] (he/him)

Assignee

Comment 26

•

8 years ago

This seems to be resolved.

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → FIXED

Dylan Hardison [:dylan] (he/him)

Assignee

Updated

•

7 years ago

Group: bmo-infra

:glob ✱

Reporter

Updated

•

7 years ago

Group: infra

Dylan Hardison [:dylan] (he/him)

Assignee

Updated

•

7 years ago

Group: bmo-infra