Closed Bug 1204920 Opened 9 years ago Closed 9 years ago

Do we need linux32 talos?

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Unassigned)


We spend a fair amount of time running linux32 talos tests, a platform that probably gets very very little use in the wild. I suspect it also has very similar performance characteristics to 64 bit linux. We should consider turning off these jobs.

We should figure out the following:

1. How many *unique* performance alerts we've seen on linux32 (that we didn't also see on linux64 or some other platform), and what their characteristics/resolutions were.
2. How much machine time we spend on linux32 talos.

Depending on the results we might want to consider turning linux32 off and freeing up resources for other things. :)
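
A minimal sketch of how question 1 could be answered, assuming the alerts can be exported as (test, platform, revision) records. The record layout and the sample values below are hypothetical, not real alert data or a real Perfherder export format:

```python
from collections import defaultdict

# Hypothetical alert records: (test name, platform, revision that triggered it).
# These values are made up for illustration only.
alerts = [
    ("tp5o", "linux32", "rev_a"),
    ("tp5o", "linux64", "rev_a"),
    ("v8_7", "linux32", "rev_b"),
    ("tart", "win7", "rev_c"),
    ("tart", "linux32", "rev_c"),
    ("kraken", "linux64", "rev_d"),
]


def unique_to_platform(alerts, platform):
    """Return (test, revision) pairs that alerted only on `platform`."""
    platforms_by_alert = defaultdict(set)
    for test, plat, rev in alerts:
        platforms_by_alert[(test, rev)].add(plat)
    # Keep only alerts whose full set of platforms is exactly the one we ask about.
    return sorted(
        key for key, plats in platforms_by_alert.items() if plats == {platform}
    )


print(unique_to_platform(alerts, "linux32"))
```

Grouping by (test, revision) treats the same regression seen on several platforms as one alert, so only alerts that no other platform would have caught are counted.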
I'm on board with turning off platforms that don't get us interesting numbers.

Btw, it looks like 64-bit Linux builds are also more popular among users:

Nightly: 6.2k sessions (32-bit on Linux) vs 63.59k sessions (64-bit on Linux)
Aurora: 12.49k (32-bit) vs 108.33k (64-bit)
Beta: 21.87k (32-bit) vs 30.24k (64-bit)
Release (Telemetry is opt-in on Release!): 89.39k (32-bit) vs 130.89k (64-bit)

Data from:
Nightly 43 http://mzl.la/1KkDCoO
Aurora 42 http://mzl.la/1KkECJq
Beta 41 http://mzl.la/1KkE8mH
Release 40 http://mzl.la/1KkEwBB
a11yr:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,8eb746b71df3b2363964eb75023eda6c5cec52bc,0]&series=[mozilla-inbound,a848c3258cb0f7f0bf058cb8936353dc3626b0d5,0]&series=[mozilla-inbound,064f57310c9ff4dd16adbc64ed1cff0a6d4bde51,1]&series=[mozilla-inbound,d4936842754584539928c6f5781106c1b0fc6f96,1]&series=[mozilla-inbound,d94a1100216dba2bacfb063e5315fd25d875ea2c,0]&series=[mozilla-inbound,b332843380d02009805b33015f923af37c63a267,0]&series=[mozilla-inbound,d5e9d9d3c8d0f73ee05561ca2ac922a896e92df4,1]&series=[mozilla-inbound,642725db58061608270f7f6be2b62ffb77890876,1]
* same patterns, pgo linux32 shows a slightly larger change w.r.t. jemalloc4, but we would still detect the regression.

cart:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,5650e71c27b8333b5521da646be37410402de701,0]&series=[mozilla-inbound,ee34d78df835c423867ad248cf12037784844bd4,0]&series=[mozilla-inbound,5b78557f49cddaea96f7ea86feeb7affb8ffaeed,1]&series=[mozilla-inbound,ebb5632547d62331f096633938303bdc84f4dc12,0]&series=[mozilla-inbound,d3e348b81f26ed4ab23597884139faa000e56a4b,0]&series=[mozilla-inbound,eb05ab320cf3b9d062e624276d42337a95bd24bf,0]&series=[mozilla-inbound,01a1bf3dd449675894c6c4c08021e5cdfef20010,1]&series=[mozilla-inbound,fce83bfd6c88974a251b0fda0b15ea80f3f8ae58,0]
* identical data except for a slight shift related to jemalloc4.

damp:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,56d653709121dbd160367e3a56215818fa1ec51e,0]&series=[mozilla-inbound,67f73683523d5d9a74728eb280097d4cb1384101,0]&series=[mozilla-inbound,f72dcf5cde6de923606c6dd325c141abf91c414a,1]&series=[mozilla-inbound,7acb1c86ece27aac58562e8aca66f50ce860d100,1]&series=[mozilla-inbound,25169736932930bc20f0437757e9399dadf68976,0]&series=[mozilla-inbound,8db4184f56e09eb0a34227f5270e345701798ba3,0]&series=[mozilla-inbound,3eca8f3dcb305b6ef6a171a33f9236210581e9ef,1]&series=[mozilla-inbound,abebb26a9ab14bdf494637a4a5ae553055b99f66,1]
* all identical

dromaeo_css:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,49f1c53c911dae318bd601103a46b9ce93bfcc19,0]&series=[mozilla-inbound,ed11c8b349c4bdd3da6e5ef853a0583226cff495,0]&series=[mozilla-inbound,de23cd987fea3de90995a2e9f70a524d0e8d9d6d,1]&series=[mozilla-inbound,1e3cb4d90692c1de2696dda6ad331cb1f15d426c,1]&series=[mozilla-inbound,73c31ca5f967b35041bd3fe80ea6a8f031aea600,0]&series=[mozilla-inbound,02a34099ddca5e91ac2d0b8808543169cd18805d,0]&series=[mozilla-inbound,cc4c110c59af3ba9516fcd8d4cc88222039fd36d,1]&series=[mozilla-inbound,9a6d7da6721c7f4463017cc2d349ce203a5ba553,1]
* non pgo/non e10s is slightly different, same pattern overall, just a few points which look suspect

glterrain:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,b6ffb1402195cb68ba52e6bfd242a4c09acc1488,0]&series=[mozilla-inbound,28e47e1119842831dde52c72d32c793ebca16691,0]&series=[mozilla-inbound,d3c79c87a6cdbf8e03958c34a9f1db83533640ce,0]&series=[mozilla-inbound,ba8a7a57fd6de8e07de6201019b90e7e828f852a,0]&series=[mozilla-inbound,72e07984983c51f486a3cbb36481ae53c9240d5c,1]&series=[mozilla-inbound,dc0ad655d5003896c90e2219f28897b6406acb62,1]&series=[mozilla-inbound,6e44c4c5d72c7245e0dbb1078e11ab6ed9d97fba,0]&series=[mozilla-inbound,93765ad224f7fb2e73b07f94967a5d49e8611210,0]
* non pgo and pgo, linux64 e10s has the same pattern, but it is offset more than linux32 e10s from the non e10s version.

kraken:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,b6cde3c3869f70ac5005123deee73488c65230b3,0]&series=[mozilla-inbound,1574e9b47736f2eb94c4c058a3bd8c566c4a4101,0]&series=[mozilla-inbound,d6871c70abde341a37dbdd001196f0d7e35f3b99,1]&series=[mozilla-inbound,a1ee7e293a10b5979f95a5cf2e6619080ce23835,1]&series=[mozilla-inbound,17ed6349edd0e71d0df2748dc78b1b3fcea9acc9,0]&series=[mozilla-inbound,74fe784935587666235e95682379d1c94cd8c9cd,0]&series=[mozilla-inbound,85fc4181007e5a895b3316a72adea1f1d93dae4b,1]&series=[mozilla-inbound,406b6327e9e043d738112a501e2710c60131427a,1]
* same overall pattern; one difference is that on August 10th we set a new value for kraken and we had different new baselines.  Before we were in the ~100 range for both platforms, after August 10th, we were at ~1450 for linux64 and ~1650 for linux32.
* what is really odd is there is no alert from graphserver for this change.

session restore:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,e9be3e94b1d2eb1dc606b5af74a2b47d890998f5,1]&series=[mozilla-inbound,8144db5677483dd98cde4f412ecf98c5bc6a4fa6,1]&series=[mozilla-inbound,d302a19fe1f655de7f9db1928a97bda4b3274568,1]&series=[mozilla-inbound,36edf5e308da1401e57949a8cc5c32bb6ee0df0b,1]
* slight deviation in results as time goes on, linux64 regresses over time more than linux32, still a similar pattern shape and all larger changes are in parallel.

session restore no auto restore:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,f4c4dae358f6ebf766b1f6aad25cdc2714cb9036,1]&series=[mozilla-inbound,103633b58789e290889e44f5cd59ec459bed9b3d,1]&series=[mozilla-inbound,9a330c1c036fbf8f681b0a4f8b11747ab3101774,1]&series=[mozilla-inbound,5423071f6bfb22895118976d676f4ecb04a91e1e,1]
* all good except jemalloc for linux64 pgo- we didn't see a jump here, whereas the linux32 pgo did.  We would have still seen the non-pgo alerts though.

tart:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,c1053c9ba89dd3e6c19110f68cb3d2fd1e40d26f,0]&series=[mozilla-inbound,1bf8120a8d029026fe5ef70c988448498c905cd1,0]&series=[mozilla-inbound,3a3c00b2b0388b6c8fc356c2708602d089a502d8,1]&series=[mozilla-inbound,49773b71d6259bc3e4831a964ef4313b94ccca52,1]&series=[mozilla-inbound,4a03564179d82c4105af5facba8b3d6fc785415a,0]&series=[mozilla-inbound,65e173fb1a511a28979c769ad2c8a1b31dd358bc,0]&series=[mozilla-inbound,b05aa73ee3a2420a2031b88f12ee6924d6b42583,1]&series=[mozilla-inbound,0fa6186e2d531c0ec871d14a84701477a5792e8b,1]
* all the same!

tcanvasmark:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,b010c0a2506e3032387eb82c3885f6dc4e81809e,0]&series=[mozilla-inbound,0c7d2d03da71ce409e4c7c5cef4a006e53dc6986,0]&series=[mozilla-inbound,dff15dbc5bd51e71d92d53ad16df8c2cf9cc3cc1,1]&series=[mozilla-inbound,9236910c1a438aa553c82983a0c7627d354a8364,1]&series=[mozilla-inbound,df8939dc6e77c3a3c208294042ab6b50013d3966,0]&series=[mozilla-inbound,d534dff4bab8bdb94f4e065c3451ac5ccc45af42,0]&series=[mozilla-inbound,2b8744221788a28e772b0dfe828dabb4d907b69f,1]&series=[mozilla-inbound,1f9a2659a7ef1e059512cac78ae35b3894462e4e,1]
* identical to kraken in that on August 10th we see a different set of values and they are slightly different.  I suspect this is when we started doing the calculations in talos vs inside of perfherder.

tp5o:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,1638ee1a10b220e789ca50026186e933c91b0164,0]&series=[mozilla-inbound,e41666f4a3fef5b172fdadfd488136e6544823c0,0]&series=[mozilla-inbound,23a4c9bc50e1d24b0a562363986025115041bef5,1]&series=[mozilla-inbound,b6e92a65310e0b78e12ef5a850d32e6907d1b28a,1]&series=[mozilla-inbound,bd72d04511c657c5c5040f1633fe73642fcdcb3b,0]&series=[mozilla-inbound,dc5dc84a309cfa3f89307a1d195307e333707a12,0]&series=[mozilla-inbound,8d03711ed5a2816fb76e2b861c7d6658c434c1a2,1]&series=[mozilla-inbound,83109a80c31b59f2f8a748a3bb358fd320615536,1]
* a slight hiccup related to jemalloc4 where linux64 didn't show a visible regression, otherwise the patterns are identical.

tp5o scroll:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,7f5f2778f72104c33f286793377cde13dc1ea1e5,0]&series=[mozilla-inbound,8e5b0e970849b0d53aa4bfc0669d0f69aeb354fd,0]&series=[mozilla-inbound,eef8c8cf03486885672a954b7ae3cd86ed6c569c,1]&series=[mozilla-inbound,31bb11dc46dcea80c70b521b6a6e341b2c078b3c,1]&series=[mozilla-inbound,4c4dfc86eefd577ec8f1a280c4b08d3f6f0f108a,0]&series=[mozilla-inbound,6764e9b1ce6291a6d9683810b584f1f3536362e6,0]&series=[mozilla-inbound,27d16dd5bd8df50025a7c9176b7e61d1632273fb,0]&series=[mozilla-inbound,a184c3ca3ee4bd9f7db02216f5127915ef285ba9,1]
* only deviation is e10s on pgo, luckily we see the same deviation on linux32 and linux64!

tpaint:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,5ad1cf8f223e0d78a3baf2838bb495796a7fdad2,0]&series=[mozilla-inbound,4472e33abb66f6bcece8c79b223b0a052976337f,0]&series=[mozilla-inbound,e64d80388b43fcbabd7825e1fa8076faf8f78bf5,1]&series=[mozilla-inbound,f205b89aeb54bc5885ba3ab8ef39d7005ba908b9,1]&series=[mozilla-inbound,e06918cf794c1b73f831684896b6da10bea4af0b,0]&series=[mozilla-inbound,9cf6695cdf39d34ae8f89862dc60913468f9e426,0]&series=[mozilla-inbound,760f0df94fe8b910a53e2bc5f0b95378d60e058d,1]&series=[mozilla-inbound,a5818940b2aca3a90cc799760ed375d16f0bf229,0]
* perfectly identical
* we should look into why tpaint on e10s is so noisy compared to non e10s

tps:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,71877e4ceaaba35bfd0837c691709d8144db96cc,1]&series=[mozilla-inbound,418050a8050d83a74db6e704eba424f5a1f41cfc,0]&series=[mozilla-inbound,8ffee197f862b38d63d9cb9fac2262de80f05e71,0]&series=[mozilla-inbound,56728e8c7c1f228c0f27a0de8a14812e8f58b3d8,0]&series=[mozilla-inbound,21552af6220f8727499e86b152ff30c82c79612f,1]&series=[mozilla-inbound,637a7f061cf5e18c4a14cf10f342b19a345f8e3c,0]&series=[mozilla-inbound,a68877fb2ed6eece036620b6b7e8fae55e77a293,0]&series=[mozilla-inbound,1a848737aafd7329867a1480164efe4fa9e9265e,0]
* opt non e10s has a difference for 10 days in august
* pgo non e10s has the same difference
* e10s is all identical

tresize:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,90b2964264c531aa801a987595c22c0832c2ac93,0]&series=[mozilla-inbound,10dc340395cfbaf2ce907cc074879f71cef4c203,0]&series=[mozilla-inbound,5ec7ba9772f2592b25d3b420aaf487158e3d06b4,1]&series=[mozilla-inbound,0d518ea8c422c469657f7354249e25e7388bcb26,0]&series=[mozilla-inbound,72f4651f24362c87efb15d5f4113b9ca194d8e3f,0]&series=[mozilla-inbound,55776a2a6808c7c69af642f42e05d0589f4d10d9,0]&series=[mozilla-inbound,86708a260eef1d74b07e38d19095ffae06a3d262,1]&series=[mozilla-inbound,e6e77a8588e6e8636e8dfbb8310d2b307cf9d991,0]
* all good (despite weekend problems)

ts_paint:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,c0f587042737bd4e1d03bdc699afec0faa935cb2,0]&series=[mozilla-inbound,6d7d292a0925eefe76960045c0f6dc4af7b79fd8,0]&series=[mozilla-inbound,376fafd107fa88db4eb7cc323280bc9a34a6a70d,1]&series=[mozilla-inbound,14f06afda8c1a56531d6c8e836e8db692450846a,1]&series=[mozilla-inbound,32206df28ab3b52e15d26f73178ee78e2af7e760,0]&series=[mozilla-inbound,3476c73a12c03397de3c2c251219ed6ce54c0615,0]&series=[mozilla-inbound,a47f61cd357cd9695f71bf700008afb45f80a55b,1]&series=[mozilla-inbound,1c89003681b9e15217b21da1d1668fa4b9821bad,1]
* identical!

tscrollx:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,2c361edff63bf67ec99d3fe7aa5e75f851a92057,0]&series=[mozilla-inbound,4baf82f24547e8e2ffe773188b014f4d16147a07,0]&series=[mozilla-inbound,46eb0291e77dd43122e019f1d4f2dabf7b058aa8,0]&series=[mozilla-inbound,a29880a4d6113a789867e5a04b554ebc991a9646,0]&series=[mozilla-inbound,f58e3f07b738cd3393275d04d37b3333622e0fb7,0]&series=[mozilla-inbound,d1d4c7dc4c8e34bf20e8b723aea64756823211ff,0]&series=[mozilla-inbound,b02b29308f79d2f139a95fcfa3127c928ec1c8f6,1]&series=[mozilla-inbound,ec9a6eab017cc5e0fb8b9d024b4b9788af627d86,1]
* identical!

tsvgr_opacity:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,291f67947a7052fb4d466dca227e1dd86315df9f,1]&series=[mozilla-inbound,4d676cb316d69deb3eff9f587f1d28dc75626331,1]&series=[mozilla-inbound,775e4ada356b88baafb0e4589758b2f199c1e6e8,0]&series=[mozilla-inbound,e0de85d4856c8b01cb435e8a0181929f6cec8c19,0]&series=[mozilla-inbound,6981e256ea8173cdb53dec0741ec05e8ace13f30,0]&series=[mozilla-inbound,505e97c4c50669524afd8d210471ceddf62cfe2a,1]&series=[mozilla-inbound,b98a5a90dfcf07eb3d784ca1cd4ade4de6e6c44a,0]&series=[mozilla-inbound,c5ef5b4dcf32f92aeb1a7972dbcf27e24db58e5a,0]
* linux64 opt + pgo (non e10s) is quite different at the time of jemalloc4; suspicious looking

tsvgx:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,58e5edc18b2f6d4da80a4b7eb985c94ca3bbddbe,0]&series=[mozilla-inbound,ac92564dc2b8d81c0af5b3907d8d8e51a8f8cad8,0]&series=[mozilla-inbound,e2bb1025e0347477bdd4af8a8ed14fa24911e974,1]&series=[mozilla-inbound,d348d06dcf28595d73842e79b5c67c52b5659bbd,1]&series=[mozilla-inbound,0aba8dc69a91eadb973351dd8ad5281c7a62fddd,0]&series=[mozilla-inbound,8b93cefcbc1e94cee0c9bb25f770cd81d5f894ce,0]&series=[mozilla-inbound,9fc1870abb16858b56b5ed5731fea64841d0c8a6,1]&series=[mozilla-inbound,cc1d6cf4246aeb6788716a01963d18708328a47d,1]
* identical

v8_7:
* https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=[mozilla-inbound,f3cdc5746374e146c7cc7c679e89f6dc5931f68e,0]&series=[mozilla-inbound,7c168e7c9a757f6dbeeaeb51829a6b0eb955b5bc,0]&series=[mozilla-inbound,588dd5ccee3d2cca950df7a39e919dc4f1207b9d,1]&series=[mozilla-inbound,5f1eef9adb9c370f20bc9b4082aa7529320625bf,1]&series=[mozilla-inbound,e81533b82bb54362bdb9a24b08da2d09b79c8956,0]&series=[mozilla-inbound,e2ec0b508e7a1418abff1dd16529ac960d7142c8,0]&series=[mozilla-inbound,ddadf0ea14cbb4279acbb9d5ca1c51fabb73f713,1]&series=[mozilla-inbound,90b13d7644d5be6ed8c5a1d16151101239bf0b6e,1]&zoom=1439748880129.0781,1443009643000,10628.03406646286,28019.33841428894
* linux64 opt and pgo are different than linux32
* linux64 e10s is the same as linux32 e10s though (which matches linux32 opt/pgo)
In summary-

differences of noted concern:
* v8_7
* tsvgr_opacity
* tps (short time period)
* tp5o
* session_restore_no_auto

these are not severe differences, but maybe worth looking into in further detail as time permits.  Open to other thoughts and interpretations.
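
The comparisons above were done by eye on the Perfherder graphs. As a rough sketch of how "same pattern" could be checked mechanically, the snippet below flags point-to-point jumps larger than a relative threshold and lists the jumps seen on linux32 but not on linux64. The detector, the 5% threshold, and the sample values are all assumptions for illustration; this is not how graphserver or Perfherder generate alerts.

```python
def step_changes(values, threshold=0.05):
    """Indices where a value differs from the previous one by more than
    `threshold` (relative). A deliberately naive detector."""
    return [
        i
        for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) / values[i - 1] > threshold
    ]


# Made-up series: linux32 shows a ~10% jump at index 3, linux64 stays flat.
linux32 = [100, 101, 100, 110, 111, 110, 110]
linux64 = [90, 91, 90, 90, 91, 90, 91]

# Jumps that only the linux32 series would have caught.
only_on_linux32 = sorted(set(step_changes(linux32)) - set(step_changes(linux64)))
print(only_on_linux32)
```

Real Talos data is noisy enough that a single-point comparison like this would produce false positives; averaging over a window of pushes before and after each point would be a more realistic starting place.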
(In reply to William Lachance (:wlach) from comment #0)
> We spend a fair amount of time running linux32 talos tests, a platform that
> probably gets very very little use in the wild

(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #1)
> Nightly: 6.2k sessions (32-bit on Linux) vs 63.59k sessions (64-bit on Linux)
> ...

If we summarize it over all channels, we get ~130k users on linux 32 and ~330k users on linux 64. I think it's pretty evident that as far as Firefox is concerned, linux 32 is not negligible compared to linux 64.
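
For reference, the totals can be reproduced from the per-channel figures in comment 1 (which are session counts, in thousands). A minimal sketch; the variable names are just for illustration:

```python
# (32-bit, 64-bit) Linux sessions per channel, in thousands, from comment 1.
sessions_k = {
    "Nightly": (6.2, 63.59),
    "Aurora": (12.49, 108.33),
    "Beta": (21.87, 30.24),
    "Release": (89.39, 130.89),
}

total_32 = sum(pair[0] for pair in sessions_k.values())
total_64 = sum(pair[1] for pair in sessions_k.values())
share_32 = total_32 / (total_32 + total_64)

print(f"linux32: {total_32:.2f}k, linux64: {total_64:.2f}k, "
      f"32-bit share: {share_32:.1%}")
```

That works out to roughly 130k vs 333k, so 32-bit accounts for a bit over a quarter of the Linux sessions in this sample. Release dominates both totals, and Telemetry is opt-in there.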

From this perspective only, I think we should not throw away this coverage if we can afford to keep it.

Also, as Joel noted, its regressions are not identical to linux 64, and I clearly recall regression bugs which were unique to linux 32 (or that the biggest regression was on linux 32).

If, OTOH, we're struggling with capacity and feel that dropping some coverage will have a big benefit elsewhere, then I agree linux 32 would be among the candidates to drop.

However, whether or not there's a greater benefit in dropping test platforms - is not a question I can answer personally since I'm not familiar with the costs (monetary or otherwise) of keeping less used platforms.

On its own, if there are no costs to keeping it, I'd say keep it. Otherwise, IMO we need to weigh the cost against the benefit of dropping it, and for that we'd need to first enumerate the factors for both.
Vladan, can you offer your thoughts here?  I really think Avi has great feedback.  Right now a fairly easy option for us to fix our windows machine backlog is to reuse these linux32 hardware machines and reinstall them as windows.  That would solve a lot of problems.
Flags: needinfo?(vladan.bugzilla)
(In reply to Joel Maher (:jmaher) from comment #5)
> Right now a fairly easy option for us to fix our windows machine
> backlog is to reuse these linux32 hardware machines and reinstall them as
> windows.  That would solve a lot of problems.

If this is a temporary thing and we know we get them back to linux32 in a few days at most, personally I think it's a worthy cause.

Otherwise, if the plan is to make it permanent, I still think we should weigh the cost vs the benefit.
The proposal is to make this permanent; Windows slaves are overloaded badly and will only become more so.  We have no other options to increase Windows hardware capacity this year, AFAIUI.
(In reply to Joel Maher (:jmaher) from comment #5)
> Vladan, can you offer your thoughts here.  I really think Avi has great
> feedback.  Right now a fairly easy option for us to fix our windows machine
> backlog is to reuse these linux32 hardware machines and reinstall them as
> windows.  That would solve a lot of problems.

Personally, I'm on board with dropping Linux 32-bit Talos coverage to help with Windows Talos backlogs. Windows is many times more important than Linux + it sounds like we're working with a fixed amount of resources + the insights lost by dropping linux32 don't sound terrible.
But can you first post this question on m.d.platform and link back to this bug? Other people might have concerns we've overlooked.
Flags: needinfo?(vladan.bugzilla)
Blocks: 1208449
Thanks Kim for filing the bug and getting a patch to disable the linux32 tests.  From the full agreement on the dev.platform post:
https://groups.google.com/forum/#!topic/mozilla.dev.platform/G8X_LzQ1yfs

I think we are ready to move forward here.
the answer is no
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED