Closed Bug 1181153 Opened 9 years ago Closed 9 years ago

Port treestatus to relengapi

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(11 files)

51 bytes, text/x-github-pull-request (dustin: review+)
51 bytes, text/x-github-pull-request (dustin: review+)
1.88 KB, text/plain
40 bytes, text/x-review-board-request (emorley: feedback+)
46 bytes, text/x-github-pull-request (dustin: review+, emorley: checkin+)
50 bytes, text/x-github-pull-request (freddy: review+)
40 bytes, text/x-review-board-request (catlee: review+)
58 bytes, text/x-github-pull-request (automatedtester: review+)
54 bytes, text/x-github-pull-request (jsantell: review+)
48 bytes, text/x-github-pull-request (abr: review+)
43 bytes, text/x-github-pull-request
Catlee and I got a good start on this at Whistler in
  https://github.com/catlee/build-relengapi/tree/treestatus
but it needs to be finished up.
Component: Other → TreeStatus
Product: Release Engineering → Tree Management
QA Contact: mshal
Version: unspecified → ---
That's landed, but I still need to make the transition.
Here's what I find in the access logs:

2620:101:80fc:224:baac:6fff:fe38:f64e - - [24/Aug/2015:14:24:28 +0000] "GET /?format=json HTTP/1.1" 200 7642 "-" "Python-urllib/2.7" 
  mtv2 corp network

63.245.214.82 - - [24/Aug/2015:14:16:58 +0000] "GET /b2g-inbound?format=json HTTP/1.1" 200 346 "-" "Python-urllib/2.6"
63.245.214.162 - - [24/Aug/2015:14:18:40 +0000] "GET /try?format=json HTTP/1.1" 200 83 "-" "Python-urllib/2.7"
63.245.214.82 - - [24/Aug/2015:14:26:42 +0000] "GET /b2g-inbound?format=json HTTP/1.1" 200 346 "-" "Python-urllib/2.6"
63.245.214.82 - - [24/Aug/2015:14:31:44 +0000] "GET /b2g-inbound?format=json HTTP/1.1" 200 346 "-" "Python-urllib/2.6"
63.245.214.162 - - [24/Aug/2015:14:41:15 +0000] "GET /mozilla-inbound?format=json HTTP/1.1" 200 323 "-" "Python-urllib/2.7"
63.245.214.162 - - [24/Aug/2015:14:53:30 +0000] "GET /try?format=json HTTP/1.1" 200 83 "-" "Python-urllib/2.7"
  scl3/releng NAT, urllib UA

69.59.28.19 - - [24/Aug/2015:14:06:32 +0000] "GET / HTTP/1.1" 200 16590 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" 
  pingdom

<mumble> - - [24/Aug/2015:14:28:26 +0000] "GET /mozilla-inbound?format=json HTTP/1.1" 200 323 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:41.0) Gecko/20100101 Firefox/41.0" 
  users with browsers

63.245.214.162 - - [24/Aug/2015:14:52:31 +0000] "GET /gaia?format=json HTTP/1.0" 200 212 "-" "Twisted PageGetter" 
  scl3 NAT, always /gaia


I'm betting that the urllib requests are from the hg hook (63.245.214.162) and b2g-bumper (.82).  Pingdom is easy.  Users with browsers will follow the redirect.  The mtv2 requests baffle me a little: they appear roughly every 1-3 minutes, so I don't think they're on a cron task.  I have no idea what's using the Twisted PageGetter -- is there something in the Buildbot code that consults the tree status?
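
A grouping like the one above can be reproduced from the raw access log with a few lines of Python. This is only a sketch, assuming the log is in Apache's combined format as in the excerpts; `LOG_LINE` and `summarize` are names made up for illustration, and the sample lines are copied from this comment.

```python
import re
from collections import Counter

# Combined Log Format:
#   host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Count requests per (client host, user agent, request target)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip annotations and blank lines
        counts[(match["host"], match["agent"], match["target"])] += 1
    return counts


sample = [
    '63.245.214.82 - - [24/Aug/2015:14:16:58 +0000] "GET /b2g-inbound?format=json HTTP/1.1" 200 346 "-" "Python-urllib/2.6"',
    '63.245.214.162 - - [24/Aug/2015:14:18:40 +0000] "GET /try?format=json HTTP/1.1" 200 83 "-" "Python-urllib/2.7"',
    '63.245.214.82 - - [24/Aug/2015:14:26:42 +0000] "GET /b2g-inbound?format=json HTTP/1.1" 200 346 "-" "Python-urllib/2.6"',
    '63.245.214.162 - - [24/Aug/2015:14:52:31 +0000] "GET /gaia?format=json HTTP/1.0" 200 212 "-" "Twisted PageGetter"',
]

for (host, agent, target), n in summarize(sample).most_common():
    print(n, host, agent, target)
```

Keying on host, agent, and target together is what separates the two urllib clients behind the NAT from the Twisted client that only ever asks for /gaia.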
Flags: needinfo?(bugspam.Callek)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Here's what I find in the access logs:
> 
> 2620:101:80fc:224:baac:6fff:fe38:f64e - - [24/Aug/2015:14:24:28 +0000] "GET
> /?format=json HTTP/1.1" 200 7642 "-" "Python-urllib/2.7" 
>   mtv2 corp network

*spitballing* maybe nagios?

> 63.245.214.162 - - [24/Aug/2015:14:52:31 +0000] "GET /gaia?format=json
> HTTP/1.0" 200 212 "-" "Twisted PageGetter" 
>   scl3 NAT, always /gaia

This is not something in releng-controlled buildbot. I vaguely recall :Pike using buildbot for l10n reasons, and maybe jhford/someone for other b2g reasons (note this is requesting gaia).


But I'm not really certain of the cause of either of those.
Flags: needinfo?(bugspam.Callek)
The v6 stuff is not nagios -- nagios is not in mtv2, and anyway is only pinging this host.

Thanks, Ed -- I'll check through that list and see what I can figure out.  We could leave something in place to transform requests to http://treestatus.mozilla.org/<tree>?format=json into an appropriate call to the new API, and return the result, but that will mean that existing code doesn't change to point to the new service, and we're running two services indefinitely.  My feeling is that we should get the stuff we know about, including tree-critical stuff, shifted to use the new API and then shut off the old host so that anything else will break and alert its author.
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Here's what I find in the access logs:
> 
>            The mtv2 requests baffle me a little: they appear roughly every
> 1-3 minutes, so I don't think they're on a crontask.  I have no idea what's
> using the Twisted PageGetter -- is there something in the Buildbot code that
> consults the tree status?

Unsure if related - MOC does monitor tree-status, so they will need to change their tool.
That's pingdom, and already on the list.
The Twisted PageGetter is https://github.com/mozfreddyb/treestatusbot/blob/master/irc.py

It's possible that the mtv2 requests are just from a host running one of the relevant Firefox extensions.

I'm starting to change my mind about breaking the old site (comment 7) after modifying the known uses to hit RelengAPI.  Keeping the old site running as a translator has the disadvantage of keeping an old service around (with attendant disk, memory, and CPU usage on servers), but the advantage of not disturbing a lot of people with something that is ultimately pretty trivial.

I'm going to change the approach, then: I'll build a replacement for the existing treestatus.mozilla.org which redirects / to the RelengAPI UI but handles /<tree>?format=json and /?format=json as described in comment 7.  I'll deploy this first (after testing), and only then migrate as many services as possible (comment 6, mozharness in-tree, mozharness out-of-tree, treeherder, hg hook) to look at RelengAPI.  The replacement will be hosted on the releng cluster in scl3, since the ulterior motive for all of this is to get the service out of phx1.

I'll see if I can implement the old site with just Apache directives, to avoid the need for an additional WSGI daemon.
The Apache config looks like

    RewriteEngine On
    SSLProxyEngine On
   
    # proxy requests with ?format=json
    RewriteCond %{QUERY_STRING} ^format=json$
    RewriteRule "^/(.*)" https://api.pub.build.mozilla.org/treestatus/compat/trees/$1 [P,L]
    ProxyPassReverse / https://api.pub.build.mozilla.org/treestatus/compat/trees/
   
    # and redirect everything else
    RewriteRule "^/(.*)$" https://api.pub.build.mozilla.org/treestatus/$1 [R,L]

but because the target is https, this requires mod_ssl be loaded, which is a little bit complicated.
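
For checking expectations against a staging host, the routing that the two RewriteRules express can be restated as a small Python function. This is a sketch, not what is deployed: `route` and `API_BASE` are names invented here, and it assumes mod_rewrite's default behaviour of passing the original query string through unchanged on both the proxied and the redirected request.

```python
API_BASE = "https://api.pub.build.mozilla.org/treestatus"


def route(path, query=""):
    """Mirror the two RewriteRules: return (action, target URL)."""
    rest = path[1:] if path.startswith("/") else path
    suffix = "?" + query if query else ""
    if query == "format=json":
        # First rule: the RewriteCond is anchored, so only the exact
        # query string "format=json" is proxied ([P]).
        return ("proxy", f"{API_BASE}/compat/trees/{rest}{suffix}")
    # Second rule: everything else gets an external redirect ([R]).
    return ("redirect", f"{API_BASE}/{rest}{suffix}")


print(route("/try", "format=json"))
print(route("/"))
```

Note that because the condition is anchored at both ends, a request such as /try?format=json&x=1 would fall through to the redirect rather than the proxy.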
Depends on: 1198837
Bug 1198837 has the TrafficScript I used to accomplish the same thing.

Unfortunately, that does require a DNS change.  I may need to use the Apache approach as well, to handle the DNS propagation interval, but at least that's just temporary.

https://treestatus.allizom.org currently has this applied.
Attached file sync.py
One-way sync from old to new; this uses transactions to safely delete and re-insert everything without a "blip" of lost data in the interim.  It takes about 15 seconds to run on the current production data (I practiced by mirroring that to the relengapi staging instance).

Note that this assumes direct access to both databases, which is unusual since they're in different datacenters!
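
The delete-and-reinsert-in-one-transaction idea looks roughly like the following. This is not the attached sync.py: it is a minimal sketch using in-memory SQLite as a stand-in for the real databases, and the `trees` table with its `tree`, `status`, and `reason` columns is a hypothetical schema chosen for illustration.

```python
import sqlite3


def sync_one_way(old, new, table="trees"):
    """Replace every row of `table` in `new` with the rows from `old`."""
    rows = old.execute(f"SELECT tree, status, reason FROM {table}").fetchall()
    # The connection context manager commits on success and rolls back on
    # error, so the DELETE and the INSERTs become visible together or not
    # at all.
    with new:
        new.execute(f"DELETE FROM {table}")
        new.executemany(
            f"INSERT INTO {table} (tree, status, reason) VALUES (?, ?, ?)", rows
        )
    return len(rows)


def make_db(rows):
    """Build a throwaway in-memory database with the hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE trees (tree TEXT PRIMARY KEY, status TEXT, reason TEXT)"
    )
    with db:
        db.executemany("INSERT INTO trees VALUES (?, ?, ?)", rows)
    return db


old_db = make_db([("try", "open", ""), ("mozilla-inbound", "closed", "bustage")])
new_db = make_db([("stale-tree", "open", "")])
print(sync_one_way(old_db, new_db))
```

Because the whole replacement is a single transaction, the sync can be re-run as often as needed before the cutover; each run simply leaves the new side as a copy of the old.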
Attachment #8652898 - Flags: review?(bugspam.Callek) → review+
Bug 1181153: use the new RelengAPI-based tree status; r?emorley
Attachment #8656663 - Flags: review?(emorley)
Comment on attachment 8656663 [details]
MozReview Request: Bug 1181153: use the new RelengAPI-based tree status; r?emorley

Looks fine to me, but deferring to an hg.m.o peer (owner?) :-)
Attachment #8656663 - Flags: review?(gps)
Attachment #8656663 - Flags: review?(emorley)
Attachment #8656663 - Flags: feedback+
Attachment #8656663 - Flags: review?(gps)
Comment on attachment 8656663 [details]
MozReview Request: Bug 1181153: use the new RelengAPI-based tree status; r?emorley

https://reviewboard.mozilla.org/r/18229/#review16413

Aside from the API issue, this is good.

::: hghooks/mozhghooks/treeclosure.py:25
(Diff revision 1)
> -treestatus_base_url = "https://treestatus.mozilla.org"
> +treestatus_base_url = "https://api.pub.build.mozilla.org/treestatus/trees/%s"

https://api.pub.build.mozilla.org/treestatus/trees/mozilla-inbound is failing for me. This change as-is will break the hook.
Well the change isn't at fault - there's just no data there yet.  But as I said in the review req, this patch won't land until that is the authoritative data source.
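
To make the failure mode concrete, here is a sketch of a lookup against the new URL pattern that reports a failed request as `None` instead of raising, leaving the policy decision to the caller. It is not the hook's actual code: `fetch_tree_status` and its `opener` parameter are hypothetical, it is written for Python 3's urllib, and it treats the response body as opaque JSON without assuming any field names.

```python
import json
import urllib.error
import urllib.request

# URL pattern taken from the diff above; %s is the tree name.
treestatus_base_url = "https://api.pub.build.mozilla.org/treestatus/trees/%s"


def fetch_tree_status(tree, opener=urllib.request.urlopen, timeout=10):
    """Return the decoded JSON for `tree`, or None if the lookup fails."""
    url = treestatus_base_url % tree
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError):
        # URLError covers HTTP errors and connection failures;
        # ValueError covers undecodable or non-JSON bodies.
        return None
```

A caller would then decide explicitly what a `None` means for a push (allow, or refuse), which is the decision that matters while the new service has no data.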
Bug 1181153: use the new treestatus API; r?catlee
Attachment #8664484 - Flags: review?(catlee)
Comment on attachment 8664484 [details]
MozReview Request: Bug 1181153: use the new treestatus API; r?catlee

https://reviewboard.mozilla.org/r/19961/#review18067
Attachment #8664484 - Flags: review?(catlee) → review+
Depends on: 1207690
Depends on: 1207724
Comment on attachment 8665022 [details] [review]
https://github.com/jsantell/mozilla-tree-status/pull/6

Tree Status addon looks good -- will the old API work for users that don't upgrade? (Not a huge deal, this is all for internal mozilla usage for the most part)
Attachment #8665022 - Flags: review?(jsantell) → review+
Jordan, the old API will keep working for "a while", but I'd like to decommission it within a few months.
Attachment #8664499 - Flags: review?(dburns) → review+
Comment on attachment 8664401 [details] [review]
https://github.com/mozilla/treeherder/pull/995

(from github issue)
Attachment #8664401 - Flags: review?(cdawson) → review+
Depends on: 1209086
When it comes time to make this transition, I'd like to deploy the Apache config in comment 14 on the phx1 generic cluster so that any client with cached DNS gets the right results.  There's a way to do this with TrafficScript (see bug 1198837) but as I understand it the phx1 load balancer can't proxy over to an scl3 VIP (in other words, it supports `pool.select`, but has no equivalent to ProxyPass).  That's what's led me to do this with Apache.

The rub is, in order to proxy to an HTTPS backend, Apache requires mod_ssl, which does not appear to be installed on the phx1 generic cluster.  And that poses a substantial risk to other generic sites.

As I see it, the options are:
 - install mod_ssl and make this Apache config change during the transition
 - do this with trafficscript instead (if possible)
 - accept that, for the duration of the DNS propagation, there are two treestatus instances

The cost of the last is that sheriffs cannot close the trees during that time.  It's likely only a few minutes, so probably not horrible.

Richard, since you helped out on bug 1198837, what do you think?
Flags: needinfo?(rsoderberg)
Short-term? I would migrate treestatus DNS to Dynect (or Route 53) and then take option 3, "two treestatus instances", for ~5 minutes, and then hardhat the old instance permanently.

Long-term? I would migrate treestatus to AWS, since it needs to be up even if a datacenter is down.
Flags: needinfo?(rsoderberg)
How does migrating to dynect or route 53 help over just switching the record in Mozilla's DNS?

And yes, RelengAPI will be migrating to Heroku someday.
Per irc, we can avoid all of this if we just do the transition in a TCW.  Set things up in scl3, then just change the DNS during the TCW when the trees are all marked "closed" anyway.  The DNS propagation will end before the TCW does, so no need to manage the split-brain during that time.  And if the new service fails, just revert the DNS to the un-touched phx1 deployment.
OK, migrated!

I've left the old service in phx1 on for the moment, although all DNS is pointed away from it.  It's still serving ~0.003 rq/s from browsers viewing treestatus, but there's no need to cut those folks off.

All of the patch-landing remains to be done, but can happen at any time now.
The hg hooks are landed.  I don't want to do anything further on a Friday afternoon.
Attachment #8665033 - Flags: review?(mh+mozilla)
:glandium -- does that mean pulsebot won't get patched?  Or that there was a problem with the patch?  Or that you're the wrong person to review?  Or that you landed the patch?
Flags: needinfo?(mh+mozilla)
That there was a problem with the patch and that I landed a fix anyways. I was also assuming you saw that I closed the pull request mentioning it was fixed.
Flags: needinfo?(mh+mozilla)
Attachment #8664401 - Flags: checkin+
OK, this is largely complete.  There is still code out there with the old URLs in it, but (a) it will still work and (b) there are patches on this bug for it.

There's a MOC bug to set up a new pingdom alert.

The next release of relengapi will include a link from the root page.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attachment #8665029 - Flags: review?(adam) → review?
Product: Tree Management → Release Engineering
Comment on attachment 8665029 [details] [review]
https://github.com/adamroach/moz-treestat/pull/1

Had to make a minor tweak here ("Accept: application/json"), but aside from that, this patch works like a charm. Thanks!
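
For other clients that need the same tweak, presumably an explicit Accept header on the request, a sketch in Python looks like this. The `json_request` helper is a made-up name, and the URL simply reuses the pattern from the hg hook patch on this bug; this is not the moz-treestat code.

```python
import urllib.request


def json_request(url):
    """Build a GET request that asks the server for a JSON response."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


req = json_request("https://api.pub.build.mozilla.org/treestatus/trees/try")
print(req.get_header("Accept"))
```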
Attachment #8665029 - Flags: review? → review+
Component: Applications: TreeStatus → General