Closed Bug 1218943 Opened 9 years ago Closed 7 years ago

Pushlog returned invalid JSON causing buildbot to not schedule try jobs for multiple hours

Categories

(Developer Services :: Mercurial: Pushlog, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Callek, Assigned: gps)

Details

So today I got poked that there was a try job with no jobs associated with it.

After poking we found http://hg.mozilla.org/try/rev/edc128273da0 as the problem rev, note the non-ascii...

Buildbots traceback was:

/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client] <HgPoller for https://hg.mozilla.org/try>: polling failed, result Invalid control character at: line 1 column 4462 (char 4462)
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client] Traceback (most recent call last):
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/builds/buildbot/build_scheduler/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     self.result = callback(self.result, *args, **kw)
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/builds/buildbot/build_scheduler/lib/python2.7/site-packages/buildbotcustom/changes/hgpoller.py", line 124, in succeeded
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     self.d.callback(result)
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/builds/buildbot/build_scheduler/lib/python2.7/site-packages/twisted/internet/defer.py", line 318, in callback
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     self._startRunCallbacks(result)
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/builds/buildbot/build_scheduler/lib/python2.7/site-packages/twisted/internet/defer.py", line 424, in _startRunCallbacks
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     self._runCallbacks()
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client] --- <exception caught here> ---
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/builds/buildbot/build_scheduler/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     self.result = callback(self.result, *args, **kw)
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/builds/buildbot/build_scheduler/lib/python2.7/site-packages/buildbotcustom/changes/hgpoller.py", line 258, in processData
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     pushes = _parse_changes(query)
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/builds/buildbot/build_scheduler/lib/python2.7/site-packages/buildbotcustom/changes/hgpoller.py", line 100, in _parse_changes
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     pushes = json.loads(data).values()
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/tools/python27/lib/python2.7/json/__init__.py", line 326, in loads
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     return _default_decoder.decode(s)
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/tools/python27/lib/python2.7/json/decoder.py", line 366, in decode
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]   File "/tools/python27/lib/python2.7/json/decoder.py", line 382, in raw_decode
/builds/buildbot/build_scheduler/master/twistd.log-2015-10-27 08:57:20-0700 [HTTPPageGetter,client]     obj, end = self.scan_once(s, idx)
/builds/buildbot/build_scheduler/master/twistd.log:2015-10-27 08:57:20-0700 [HTTPPageGetter,client] exceptions.ValueError: Invalid control character at: line 1 column 4462 (char 4462)

This was on bm81...

it got an invalid json response from hg at 06:18:19  and continued being invalid until  08:57:20.

https://hg.mozilla.org/try/json-pushes?full=1&fromchange=331921314e2d1f01f38f197ac2115ae0225b1560

The invalid push is id 94805 (even though buildbot lumped it in with a pushlog for id 94804).

We should make pushlog not output invalid json like this. Otherwise we risk hitting larger scale issues like this that go unnoticed for longer than expected.
To mitigate today, I did:

>>> for i in master.scheduler_manager.change_svc.services:
...     if not isinstance(i, buildbotcustom.changes.hgpoller.HgPoller):
...         continue
...     if i.baseURL == "https://hg.mozilla.org/try":
...         print repr(i), i.baseURL
...         i.lastChangeset = u"0cbba4676e25b151ae9b57c5524f07900f3f097f"
... 
<buildbotcustom.changes.hgpoller.HgPoller object at 0x60b7790> https://hg.mozilla.org/try
This /might/ be regression from https://hg.mozilla.org/hgcustom/version-control-tools/rev/22031500436f. If so, it is an upstream bug. It could also be us not accessing the raw bytes from the commit message properly. We do that incorrectly in a lot of places...
Assignee: nobody → gps
Status: NEW → ASSIGNED
Priority: -- → P1
Yeah, the extension isn't accessing the raw bytes of the commit message properly: it gets converted to the local encoding and we use that.
For posterity, the raw bytes of the commit message:

$ hg log -r edc128273da0 -T '{desc}' | hexdump -C
00000000  4d 54 44 54 20 65 6e 61  62 6c 65 64 3a 20 1b 5b  |MTDT enabled: .[|
00000010  32 30 30 7e 74 72 79 3a  20 2d 62 20 6f 20 2d 70  |200~try: -b o -p|
00000020  20 6d 61 63 6f 73 78 36  34 2c 77 69 6e 33 32 2c  | macosx64,win32,|
00000030  77 69 6e 36 34 2c 61 6e  64 72 6f 69 64 2d 61 70  |win64,android-ap|
00000040  69 2d 31 31 20 2d 75 20  6e 6f 6e 65 20 2d 74 20  |i-11 -u none -t |
00000050  73 76 67 72 2c 73 76 67  72 2d 65 31 30 73 2c 72  |svgr,svgr-e10s,r|
00000060  65 6d 6f 74 65 2d 74 73  76 67                    |emote-tsvg|

The 0x1b (escape) is the first "fun" character.
So, there are two bugs.

1) We're not using encoding.fromlocal() to obtain raw bytes before passing the description and author fields to the JSON encoder. Fortunately, the server is using utf-8, so it's interpretation of the bytes should be mostly sane.
2) Mercurial's JSON encoder isn't properly escaping some values. This is https://bz.mercurial-scm.org/show_bug.cgi?id=4926.

#2 will likely get fixed in Mercurial 3.7.

Since it apparently took months for this bug to surface, I'm inclined to fix #1 and ignore #2. If #2 really becomes a problem, we can screen commits via a hook or modify things to use a proper JSON encoder.
On further examination of Mercurial's source code, us not passing encoding.fromlocal() is acceptable. Mercurial should be doing the escaping from the raw value stored within. It isn't. So I've filed https://bz.mercurial-scm.org/show_bug.cgi?id=4933.

There isn't much we can do here except revert to use a custom JSON encoder until upstream is fixed.

Again, since this appears to be a once-in-several-months issue, I don't think it is worth the effort. If it happens again I'll up the priority.
Priority: P1 → P5
(In reply to Gregory Szorc [:gps] from comment #6)
> On further examination of Mercurial's source code, us not passing
> encoding.fromlocal() is acceptable. Mercurial should be doing the escaping
> from the raw value stored within. It isn't. So I've filed
> https://bz.mercurial-scm.org/show_bug.cgi?id=4933.
> 
> There isn't much we can do here except revert to use a custom JSON encoder
> until upstream is fixed.
> 
> Again, since this appears to be a once-in-several-months issue, I don't
> think it is worth the effort. If it happens again I'll up the priority.

While it is once in several months, it certainly ate up multiple hours of my day, and that was not counting the hours of developer time wasted, or others on my team assisting me in understanding what may have broken try scheduling.

I think we do need to have a guarantee that JSON returned by pushlog is always decodeable as JSON for our infra. And thus a higher priority than P5.
Or a hook that rejects such commits in the first place.
We now know that invalid pushlog JSON could result in jobs not getting scheduled. We know to look for that next time.

If it happens again, we'll fix it. I don't think it will happen again.
https://hg.mozilla.org/try/json-pushes?full=1&startID=94804&endID=94806 contains the "trouble" changeset (edc128273da0) and it is now rendering as valid JSON. This likely fixed itself after upgrading to a new Mercurial release.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.