Closed Bug 1238944 Opened 8 years ago Closed 8 years ago

enable caching for balrog admin app

Categories: Release Engineering Graveyard :: Applications: Balrog (backend)
Type: defect
Priority: Not set
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: bhearsum; Assigned: bhearsum
Attachments: (4 files, 1 obsolete file)

We enabled blob+blob_version caching on the public app a while back, but the admin app still pulls from the database with every single request. It _should_ be simple to enable the cache (just a config change), but I want to do some in-depth testing first to make sure there are no bugs that cause changes to get overwritten, or other weirdness that's unique to using the cache in a read/write app.
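
For illustration, here's a minimal sketch of the blob+blob_version idea in a read/write app (hypothetical names and structure, not Balrog's actual cache code): the expensive blob is cached keyed by release name, but the cheap-to-query data_version is checked before trusting the cached copy, so the admin app doesn't read or act on a stale blob.

import time

class BlobCache:
    def __init__(self, maxsize=500, ttl=60):
        self.maxsize = maxsize
        self.ttl = ttl
        self._store = {}  # name -> (expires_at, data_version, blob)

    def get(self, name, current_version, fetch_blob):
        # current_version is the blob's data_version as it currently exists in
        # the database (a cheap query); fetch_blob pulls the full blob from the db.
        entry = self._store.get(name)
        if entry:
            expires_at, cached_version, blob = entry
            if cached_version == current_version and time.time() < expires_at:
                return blob  # cache hit, and not stale
        blob = fetch_blob(name)  # miss or stale: go to the database
        if len(self._store) >= self.maxsize:
            self._store.pop(next(iter(self._store)))  # crude eviction of the oldest entry
        self._store[name] = (time.time() + self.ttl, current_version, blob)
        return blob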
Attachment #8709177 - Flags: review?(catlee)
Attachment #8709177 - Flags: review?(catlee) → review+
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/8816915ebf5fc7804ff1106d0609382a8dd3bb31
Merge pull request #46 from bhearsum/admin-cache

bug 1238944: enable caching for balrog admin app. r=catlee
Comment on attachment 8709177 [details] [review]
enable caching for admin app

Merged - going to verify in dev today, and push to production next week. (Maybe not until mid-week, to avoid potential interference with releases.)
Attachment #8709177 - Flags: checked-in+
Attachment #8711107 - Flags: review?(catlee)
Attachment #8711107 - Flags: review?(catlee) → review+
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/8cab7e6eac87c5a2c948ce217330cf3f8a5615cb
Merge pull request #48 from bhearsum/enable-admin-cache

bug 1238944: Actually enable admin caching in admin.wsgi. r=catlee
Attachment #8711107 - Flags: checked-in+
This seems to be working well. I did some tests in dev by doing 1,000 retrievals of the same blob in a loop, parallelized 16 ways. Here's a graph of rx/tx: http://people.mozilla.org/~bhearsum/sattap/71dd6e08.png
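
For reference, a rough sketch of how a parallelized retrieval test like this could be scripted (the URL, credentials, and blob name are placeholders; this is not the exact script I used):

from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint and blob name, for illustration only.
URL = "https://aus4-admin-dev.allizom.org/api/releases/Some-Release-blob"

def fetch(_):
    resp = requests.get(URL, auth=("someuser", "somepassword"), timeout=60)
    resp.raise_for_status()
    return len(resp.content)

with ThreadPoolExecutor(max_workers=16) as pool:
    sizes = list(pool.map(fetch, range(1000)))  # 1,000 retrievals, 16-way parallel

print("retrieved %d blobs, %d bytes total" % (len(sizes), sum(sizes)))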

The 4 spikes around 18:00 are without admin caching enabled. They average around 450 packets, with a lot of variance between individual runs.

The 3 spikes around 21:20 are with caching enabled. You can see the first one maxing out at ~250 packets, and subsequent ones just below. This makes sense: the first run had to retrieve the blob from the database, while the subsequent ones hit a cache the first run had already warmed. There's also less variance here, which is no surprise -- fewer network requests mean fewer spots for variance to creep in.

So, this is ready to go to production. I intend to push it sometime next week, when I finish up some other work (and after Firefox 44 ships). Note to self: make sure to update admin.ini in production prior to pushing the code.
I did some additional testing today with a script that attempted to emulate the types of changes Funsize makes. I tweaked balrog-submitter.py to submit a random.randint() value as the filesize, and ran it with:
seq 1 200 | xargs -P8 -n1 retry -a 10 -s 2 -m 2 -- python balrog-submitter.py -a https://aus4-admin-dev.allizom.org/api -c creds.py -u stage-ffxbld -t nightly -p buildprops.json -v

So, that's done 200 times, parallelized 8 ways. This caused a lot of 400s due to races (just like Funsize does), but all submissions eventually succeeded after retrying.

I did this 4 times without caching enabled, and then 4 times with it enabled. Here's a graph of the incoming network traffic to aus1.dev.webapp.phx1.mozilla.com: http://people.mozilla.org/~bhearsum/sattap/b4f4d7b3.png

The tests run without caching peaked at 400-600 packets (per second, I think), with an average of about 535.

The tests run with caching stayed in a much tighter range of ~275-300. So, enabling caching appears to cut the network load between the admin node and the database in half, and provide a bit more consistent performance overall.

I intend to push this to prod after 45.0b1 ships, hopefully tomorrow morning.
So, landing this today completely blew up the public-facing node. I backed out (through an awful "git diff | patch -p1 -R") after being unable to find a fix in the first 15 minutes. My suspicion is that the deepcopy() I added to the cache caused a ton of extra load on the webheads, but it's difficult to confirm. I tried adjusting one node live in production to remove the deepcopy() calls and it didn't seem to help in the few minutes it was active, but there was such a backlog of traffic that it's hard to know if it truly would've made a difference. I'm going to try some tests in dev with and without the deepcopy() and see how that goes.
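
To get a feel for how expensive deepcopy-on-every-get can be for a large release blob, here's a quick synthetic micro-benchmark (the blob shape is made up; real numbers will vary):

import copy
import timeit

# Roughly blob-shaped synthetic data: many locales, each with nested dicts.
blob = {
    "platforms": {
        "WINNT_x86-msvc": {
            "locales": {
                "locale%d" % i: {"partial": {"filesize": i}, "complete": {"filesize": i}}
                for i in range(5000)
            }
        }
    }
}

by_reference = timeit.timeit(lambda: blob, number=1000)
with_deepcopy = timeit.timeit(lambda: copy.deepcopy(blob), number=1000)
print("reference: %.3fs  deepcopy: %.3fs" % (by_reference, with_deepcopy))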

w0ts0n pointed me at this wonderful view in New Relic that shows a lot of information about specific transactions: https://rpm.newrelic.com/accounts/263620/applications/5929969/transactions?type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f6175736c69622e7765622e76696577732e636c69656e743a636c69656e747265717565737433222c22225d&tab-detail_5a7bf2-02f28fae-c683-11e5-ad22-c81f66b8ceca=trace_details&app_trace_id=5a7bf2-02f28fae-c683-11e5-ad22-c81f66b8ceca

From here, I can glean that we got at least as far as getting the blob from the database, but didn't return from ClientRequestView.get. There's very little code in between those two points, which helps significantly in the analysis.
See Also: → 1244100
bug 1240141 is about netflows, not related to this...
This is the captured output from a request that took a long time to complete. I'll add some line-by-line analysis after attaching.
Here's a better log...the first one actually had multiple requests in it.
Attachment #8713732 - Attachment is obsolete: true
Comment on attachment 8713743 [details]
debug logging from a really slow request

>2016-01-29 12:16:13,435 - DEBUG - PID: 5682 - Request: 140482947154192 - ClientRequestView.get#51: Got query: {'product': u'Firefox', 'queryVersion': 3, 'buildID': u'20140825202822', 'locale': u'en-US', 'version': u'32.0', 'headerArchitecture': 'Intel', 'buildTarget': u'WINNT_x86-msvc', 'force': True, 'distribution': u'default', 'osVersion': u'Windows_NT 6.1.7601 (x86)', 'channel': u'release', 'distVersion': u'default'}
>2016-01-29 12:16:14,243 - DEBUG - PID: 5682 - Request: 140482947154192 - AUS.evaluateRules#54: Looking for rules that apply to:
>2016-01-29 12:16:14,243 - DEBUG - PID: 5682 - Request: 140482947154192 - AUS.evaluateRules#55: {'product': u'Firefox', 'queryVersion': 3, 'buildID': u'20140825202822', 'locale': u'en-US', 'version': u'32.0', 'headerArchitecture': 'Intel', 'buildTarget': u'WINNT_x86-msvc', 'force': True, 'distribution': u'default', 'osVersion': u'Windows_NT 6.1.7601 (x86)', 'channel': u'release', 'distVersion': u'default'}
>2016-01-29 12:16:14,387 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#771: where: [<sqlalchemy.sql.expression.BooleanClauseList object at 0x7fc4bc1f5490>, <sqlalchemy.sql.expression.BooleanClauseList object at 0x7fc4bc1f5850>]

These first 2 seconds are the SELECT that grabs rules from the database, done at: https://github.com/mozilla/balrog/blob/master/auslib/db.py#L770

>2016-01-29 12:16:14,433 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#772: Raw matches:
>2016-01-29 12:16:14,433 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#776: {u'comment': "OS Blocking rule for de-supported OS' on all nightly channels", u'product': None, u'data_version': 14L, u'buildID': None, u'distribution': None, u'whitelist': None, u'backgroundRate': 100L, u'mapping': 'No-Update', u'priority': 100L, u'alias': None, u'version': None, u'headerArchitecture': None, u'update_type': 'minor', u'buildTarget': None, u'locale': None, u'osVersion': 'Darwin 6,Darwin 7,Darwin 8,Darwin 9,Windows_95,Windows_98,Windows_NT 4,Windows_NT 5.0,Windows_NT 5.1.0,Windows_NT 5.1.1,GTK 2.0.,GTK 2.1.,GTK 2.2.,GTK 2.3.,GTK 2.4.,GTK 2.5.,GTK 2.6.,GTK 2.7.,GTK 2.8.,GTK 2.9.,GTK 2.10.,GTK 2.11.,GTK 2.12.,GTK 2.13.,GTK 2.14.,GTK 2.15.,GTK 2.16.,GTK 2.17.', u'rule_id': 1L, u'channel': 'nightly*', u'distVersion': None}
<snip>
>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#776: {u'comment': 'Unsupported GTK blocking rule, bug 1227023', u'product': 'Firefox', u'data_version': 2L, u'buildID': None, u'distribution': None, u'whitelist': None, u'backgroundRate': 100L, u'mapping': 'GTK3-Desupport', u'priority': 160L, u'alias': None, u'version': '>=44.0', u'headerArchitecture': None, u'update_type': 'minor', u'buildTarget': None, u'locale': None, u'osVersion': 'GTK 2,GTK 3.0.,GTK 3.1.,GTK 3.2.,GTK 3.3.', u'rule_id': 294L, u'channel': 'beta*', u'distVersion': None}
>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#780: beta* doesn't match release
>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#814: Reduced matches:
>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#817: {u'comment': 'default release rule updated by buildbot, DO NOT DELETE', u'product': 'Firefox', u'data_version': 58L, u'buildID': None, u'distribution': None, u'whitelist': None, u'backgroundRate': 0L, u'mapping': 'Firefox-44.0-build3', u'priority': 90L, u'alias': 'firefox-release', u'version': None, u'headerArchitecture': None, u'update_type': 'minor', u'buildTarget': None, u'locale': None, u'osVersion': None, u'rule_id': 145L, u'channel': 'release', u'distVersion': None}
>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Rules.getRulesMatchingQuery#817: {u'comment': 'SHA1 watershed bug 1234277', u'product': 'Firefox', u'data_version': 4L, u'buildID': None, u'distribution': None, u'whitelist': None, u'backgroundRate': 100L, u'mapping': 'Firefox-43.0.1-build1', u'priority': 100L, u'alias': None, u'version': '<43.0.1', u'headerArchitecture': None, u'update_type': 'minor', u'buildTarget': None, u'locale': None, u'osVersion': 'Windows_NT', u'rule_id': 269L, u'channel': 'release*', u'distVersion': None}

This block is all of the post processing, which finishes at https://github.com/mozilla/balrog/blob/master/auslib/db.py#L814.

>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - AUS.evaluateRules#67: Matching rule: {u'comment': 'SHA1 watershed bug 1234277', u'product': 'Firefox', u'data_version': 4L, u'buildID': None, u'distribution': None, u'whitelist': None, u'backgroundRate': 100L, u'mapping': 'Firefox-43.0.1-build1', u'priority': 100L, u'alias': None, u'version': '<43.0.1', u'headerArchitecture': None, u'update_type': 'minor', u'buildTarget': None, u'locale': None, u'osVersion': 'Windows_NT', u'rule_id': 269L, u'channel': 'release*', u'distVersion': None}

This is the "take the highest priority bit" from https://github.com/mozilla/balrog/blob/master/auslib/AUS.py

>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Releases.getReleases#913: Looking for releases with:
>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Releases.getReleases#914: name: Firefox-43.0.1-build1
>2016-01-29 12:16:15,085 - DEBUG - PID: 5682 - Request: 140482947154192 - Releases.getReleases#915: product: None
>2016-01-29 12:16:15,086 - DEBUG - PID: 5682 - Request: 140482947154192 - Releases.getReleases#916: version: None

This getReleases call originates at https://github.com/mozilla/balrog/blob/master/auslib/AUS.py#L89. It's impossible to tell whether it had to talk to the database or whether the blob came from the cache - there's no log output that differs between those two cases.
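
For comparison, the kind of debug logging that would disambiguate this might look something like the following (a hypothetical wrapper; the names are assumptions, not Balrog's actual code):

import logging

log = logging.getLogger(__name__)

def get_release_blob(name, cache, fetch_from_db):
    blob = cache.get(name)
    if blob is not None:
        log.debug("getReleaseBlob: cache hit for %s", name)
        return blob
    log.debug("getReleaseBlob: cache miss for %s, querying the database", name)
    blob = fetch_from_db(name)
    cache.put(name, blob)
    return blob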

>2016-01-29 12:16:21,340 - DEBUG - PID: 5682 - Request: 140482947154192 - AUS.evaluateRules#94: Returning release Firefox-43.0.1-build1

This is the final line of evaluateRules().

>2016-01-29 12:16:31,972 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#13: Trying to match update query to Firefox-42.0-build2
>2016-01-29 12:16:32,244 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#24: releasePlat buildID is: 20151029151421
>2016-01-29 12:16:35,733 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#13: Trying to match update query to Firefox-41.0.2-build2
>2016-01-29 12:16:35,733 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#24: releasePlat buildID is: 20151014143721
>2016-01-29 12:16:38,902 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#13: Trying to match update query to Firefox-43.0-build1
>2016-01-29 12:16:38,959 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#24: releasePlat buildID is: 20151208100201
>2016-01-29 12:16:42,362 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#13: Trying to match update query to Firefox-40.0.3-build1
>2016-01-29 12:16:42,484 - DEBUG - PID: 5682 - Request: 140482947154192 - auslib.blobs.apprelease.ReleaseBlobV4.matchesUpdateQuery#24: releasePlat buildID is: 20150826023504

These are additional blob retrievals that happen during XML generation, when trying to decide whether or not to serve a partial. Again, it's impossible to know whether the retrievals were served from the cache.

>2016-01-29 12:16:42,485 - DEBUG - PID: 5682 - Request: 140482947154192 - ClientRequestView.get#61: Sending XML: <?xml version="1.0"?>

And this is the start of the XML response.


So: the vast majority of the time ends up in getReleaseBlob. The first call to it appears to take ~6s, and subsequent calls appear to take a few seconds each, totaling about 12s. That means that of the 29 seconds this request took, nearly 20 seconds were spent retrieving releases. The two possible causes are database slowness or my deepcopy() change to the cache. I've dug deeply into the New Relic data, and I'm now pretty certain that the deepcopy() is what caused the issue. This is based on the following:
* Around 7:20am (Eastern) I temporarily changed the code on aus1.webapp.phx1 to remove the deepcopy() calls. The load on that machine went down at that time, and its throughput went up. The other nodes (which didn't have the deepcopy() removed) continued to have high load and bad throughput.
* Around the same time, traffic from the database went up a bit, which is consistent with fulfilling more requests.
* Most of the code change was admin-app only (i.e., not even imported by the aus5 wsgi app). Of the code that could possibly affect it, cache.py is the only part that appears to actually be invoked. The other changes are all to blob validation, and aus5 never calls .validate() on blobs AFAICT.

So, given all of this, there are a few things I want to do:
1) Either get rid of deepcopy() entirely or make it far less expensive. I originally added it mostly as a safeguard against accidentally updating a cached copy of a blob after retrieving it from the cache. That's useful to the admin app (though not the only solution to the problem), but absolutely not useful to the public-facing app. At the very worst, I can ifdef it and turn it off for the public app. (There's a rough sketch of this after the list.)
2) Add something that makes cache stats accessible. One of the angles I investigated was the possibility of the blob cache being too small, with blobs constantly getting pushed out of it. It is 500 items large, and between the blobs that rules.mapping points at and the blobs referenced within those, we may well be over 500 already. Cache stats would help a lot in figuring this out. Once we have them, we can see whether increasing the size of the blob cache would be useful. This is probably something for another bug.
3) Add a bit more debug logging around blob retrievals. Knowing whether a retrieval was a cache hit or a miss would've made debugging individual requests like the one above much easier.
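
A rough sketch of how (1) and (2) could fit together: copying on get/put becomes a flag (the admin app can keep it as a safeguard, the public app can turn it off), and simple counters expose hit/miss stats. This is just an outline under those assumptions -- repoze.lru as the backing store is a guess -- see the attached PR for what actually landed.

import copy
from collections import Counter

from repoze.lru import ExpiringLRUCache

class MaybeCopyingCache:
    def __init__(self, maxsize=500, timeout=60, make_copies=False):
        self._cache = ExpiringLRUCache(maxsize, default_timeout=timeout)
        self.make_copies = make_copies      # off for the public app, on for admin
        self.stats = Counter()              # "hits", "misses", "puts"

    def get(self, key, value_getter):
        value = self._cache.get(key)
        if value is None:
            self.stats["misses"] += 1
            value = value_getter()          # e.g. fetch the blob from the database
            self.put(key, value)
        else:
            self.stats["hits"] += 1
        # Only pay for deepcopy when callers might mutate what they get back.
        return copy.deepcopy(value) if self.make_copies else value

    def put(self, key, value):
        self.stats["puts"] += 1
        if self.make_copies:
            value = copy.deepcopy(value)
        self._cache.put(key, value)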
Lots of details/background in the PR.
Attachment #8714401 - Flags: feedback?(catlee)
Comment on attachment 8714401 [details] [review]
make copying on cache get/put optional

This is ready for proper review now. I intend to do perf checks similar to those in comment #6 once it makes it to dev.
Attachment #8714401 - Flags: feedback?(catlee) → review?(catlee)
Attachment #8714401 - Flags: review?(catlee) → review+
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/bd247f83061d6835d5924319dac048e853e351a2
Merge pull request #50 from bhearsum/optional-cache-copy

bug 1238944: Make copying upon cache get/put optional
Comment on attachment 8714401 [details] [review]
make copying on cache get/put optional

This was deployed today, successfully!
Attachment #8714401 - Flags: checked-in+
I looked over the metrics this morning to make sure this patch had the positive effect we were looking for, and it appears it did. Starting on the 26th, the admin node's received network traffic (the bulk of which comes from the database server) averaged 23MB/sec, 15MB/sec, 19MB/sec, 13MB/sec, and 11MB/sec on successive days. Yesterday's average in the same time window was only 7MB/sec, and today is looking similar.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard