Closed Bug 1174263 Opened 9 years ago Closed 8 years ago

Update the clobberer tool to support clobbering Taskcluster-based builds

Categories

(Release Engineering :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: garbas)

References

Details

Attachments

(3 files)

Bug 1171809 tracks the Taskcluster side of implementing clobbering support (i.e. forced objdir purges manually requested by someone).

In addition to Taskcluster supporting such an operation, the clobberer tool will need to be updated to recognize the jobs (they aren't currently listed) and send a compatible notification to that infrastructure that such a request has been made.
In bug 1171809, http://docs.taskcluster.net/services/purge-cache/ was deployed, and it is now hooked up to docker-worker. The clobberer service can now hit that API endpoint and purge caches on TaskCluster workers.
<catlee> The api needs to send the right pulse message 
<catlee> And I just realized that we need a way to populate the build names once we're fully on tc
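For illustration, a minimal sketch of what hitting that endpoint could look like. The provisionerId, workerType, and cacheName below are placeholder assumptions, and real calls need TaskCluster credentials with the appropriate purge-cache scopes, which are omitted here:

# Hedged sketch (not the actual clobberer code): POST a purge request to the
# purge-cache service documented above; the service then broadcasts the
# matching pulse message to workers of that workerType.
import requests

PURGE_CACHE = "https://purge-cache.taskcluster.net/v1/purge-cache"

def purge_cache(provisioner_id, worker_type, cache_name):
    url = "%s/%s/%s" % (PURGE_CACHE, provisioner_id, worker_type)
    resp = requests.post(url, json={"cacheName": cache_name})
    resp.raise_for_status()

# Placeholder values for illustration only:
purge_cache("aws-provisioner-v1", "dbg-linux64",
            "level-3-mozilla-inbound-build-linux64-workspace")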
Assignee: nobody → rok
this bug has been idle for almost 4 weeks.  This is 1 of 2 issues remaining before we can make taskcluster linux64 debug tier1.

:garbas, can you update this bug with the latest progress?  Is there help you need?  things you are blocked on (other than time)?
Flags: needinfo?(rok)
Attachment #8723638 - Flags: review?(dustin)
the best way to resolve a needinfo: submit a request for review :)
(In reply to Joel Maher (:jmaher) from comment #3)
> this bug has been idle for almost 4 weeks.  This is 1 of 2 issues remaining
> before we can make taskcluster linux64 debug tier1.
> 
> :garbas, can you update this bug with the latest progress?  Is there help
> you need?  things you are blocked on (other than time)?

i've just submitted an initial version of the PR. i've asked :dustin to review the work. things that still need to be polished:
 - create an autosuggest dropdown with all possible cacheNames for a particular workerType
 - extend tc's purge-cache api to be able to purge multiple caches at once
... and i have this ticket as a high prio on my todo list, i hope to close it tomorrow at the latest.
> - extend tc's purge-cache api to be able to purge multiple caches at once
Why not just call it multiple times... The server can handle concurrent requests :)
If you allow an unbounded number of pulse messages per request, success gets more sketchy; granted, it's
probably not an issue in this case.
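Concretely, "call it multiple times" is just a client-side loop; a sketch reusing the hypothetical purge_cache() helper from the earlier sketch:

# Purging several caches is N independent requests; no API change needed.
# Cache names below are placeholders.
for name in [
    "level-3-mozilla-inbound-build-linux64-workspace",
    "level-3-mozilla-central-build-linux64-workspace",
]:
    purge_cache("aws-provisioner-v1", "dbg-linux64", name)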
:jonasfj i just thought it would be a one/two-liner fix in docker-worker to support this.

since i have your attention here: is there a way (in some part of the api) to get a list of all currently possible cacheNames for a specific workerType? :dustin mentioned that i would need to index the decision tasks and build our own "cache" of available cacheNames. if you have a better idea how to do this, i'd really appreciate it.
Status: NEW → ASSIGNED
the ticket for the decision task indexing mentioned in my previous comment is https://bugzilla.mozilla.org/show_bug.cgi?id=1245538
Depends on: 1245538
> is there a way (in some part of the api) to get a list of all current possible cacheNames for specific workerType?

No, there is no central registry of cacheNames...
You could go through the task-graph from the last decision task, and find caches that way.

> :jonasfj i just thought it would be a one/two-liner fix in docker-worker to support this.
You wouldn't do it in docker-worker, if you wanted to it could easily be done in purge-cache:
https://github.com/taskcluster/taskcluster-purge-cache/blob/6c7bc93d75425547c48c4b16a9c20db187a35f1d/lib/api.js#L55
Instead of taking a single cacheName in the request payload, you could take multiple
and send a pulse message for each of them. But why? It's a small thing and doesn't matter here,
because nobody would request 10k cacheNames to be purged. But if someone did, that would likely
affect robustness; by preventing such unbounded requests, we increase robustness.

(In this case, it probably doesn't matter either way, because sending messages is very fast)
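A sketch of "going through the task-graph from the last decision task" to find caches. It assumes the graph.json artifact has already been fetched and parsed, and that docker-worker tasks declare caches as a {cacheName: mountPoint} mapping under payload.cache (an assumption about the payload layout, not a spec):

def collect_cache_names(graph):
    # Collect every cache name declared by any task in the graph.
    names = set()
    for entry in graph.get("tasks", []):
        payload = entry.get("task", {}).get("payload", {})
        names.update(payload.get("cache", {}).keys())
    return sorted(names)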
Comment on attachment 8723638 [details] [review]
https://github.com/mozilla/build-relengapi/pull/365

r- for needing tests and dropping use of TC creds for getting workerTypes, both minor.

As for the next step and the question of enumerating cache names: I'm growing increasingly fond of the idea of gathering data from decision tasks.  It's similar in effect to the allthethings.json we provide from Buildbot: basically a big JSON summary of how things are currently set up, per tree.  To that end, I had a few additional thoughts:

1. In the second draft of this clobberer project, if you have three inputs, you can give more accurate options:

Branch:      [mozilla-inbound, mozilla-central, try, ..]
Worker Type: [opt-linux64, ..]
Cache Name:  [...]

The options for "Branch" can be enumerated at index namespace `gecko.v2` (without credentials, thus possible from the JS frontend).  Once that is complete, the worker types can be enumerated from artifact `public/graph.json` of the task indexed at `gecko.v2.<branch>.latest.firefox.decision`, which again could be fetched from the frontend.  Once the user has selected one or more worker types, the cache names can be determined from the same artifact.  (A sketch of these lookups appears after this list.)

2. There is the question of caching and filtering these artifacts which, when we are doing in-tree scheduling, will be quite large.  So it may make sense (in another bug, later) to build a releng service or blueprint that can flexibly query the latest task graph for a branch.  Besides allowing some short-term caching, this will give us a single point to modify when task or graph schemas change.
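A sketch of the three lookups from point 1, against the public index and queue HTTP APIs (no credentials needed). The exact endpoint paths reflect my understanding of the v1 routes and should be treated as assumptions:

import requests

INDEX = "https://index.taskcluster.net/v1"
QUEUE = "https://queue.taskcluster.net/v1"

def list_branches():
    # 1. Branch options: the child namespaces of gecko.v2.
    resp = requests.post(INDEX + "/namespaces/gecko.v2", json={})
    return [ns["name"] for ns in resp.json()["namespaces"]]

def decision_graph(branch):
    # 2. Resolve the indexed decision task, then fetch its graph.json artifact.
    task = requests.get(
        INDEX + "/task/gecko.v2.%s.latest.firefox.decision" % branch).json()
    url = "%s/task/%s/artifacts/public/graph.json" % (QUEUE, task["taskId"])
    return requests.get(url).json()

def worker_types(graph):
    # 3. Worker types; cache names come from the same artifact (see the
    #    collect_cache_names() sketch in an earlier comment).
    return sorted({t["task"]["workerType"] for t in graph.get("tasks", [])})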
Attachment #8723638 - Flags: review?(dustin) → review-
:dustin tnx! this makes it very clear how to finish this ticket once the decision task is indexed. i'll move on and work on getting the decision task indexed first, since it is blocking me on this.
Blocks: 1250596
No longer depends on: 1250596
i've talked to :Tomcat about the UI and here is the feedback:
 
 - it would be very nice if worker types were named the same way as in treeherder, eg: "dbg-macosx64" -> "MacOSX64 Dbg"

 - cache names are an implementation detail we are not interested in. we'd rather purge too much than too little. having to purge per worker type is ok.
Worker types and platforms aren't the same thing, so a naming correspondence doesn't make sense.  For example, all linux tests are performed in `desktop-test` or `desktop-test-xlarge`.  So I think those are going to remain.  We'll also be renaming them in bug 1220686 to something a bit more predictable (but still not especially related to platform, except at the very coarsest mac / windows / linux level).

The name of the workspace cache, however, *does* correspond pretty closely with the platform -- more precisely, with the TH job.  For example, linux64 asan will use a different workspace than linux64 dbg, even if both appear on the same platform in TH.

The tooltool cache never needs purging, since its content signatures are checked on every use.  Similarly, the tc-vcs cache "should" never need purging (modulo some issues we've had in the past).  Both contain a massive amount of highly-replicated content, and blowing away those caches is going to cause a major performance dip as 5-10GB (and growing) of additional data per task are downloaded.  At that scale, the S3 transfer costs start to add up, too.

So I think it's important to keep caches visible.  Arguably, it's the workerType that's an implementation detail and one which, until this moment, I had been thinking of changing without bothering to tell anyone.

Just as a thought experiment, which of these makes more sense:

 - purge all caches on desktop-build-level-3-dbg
   (so, every non-try debug build task for linux or os x will have to re-fetch all of its data)

 - purge level-3-mozilla-inbound-build-linux32-workspace on all workerTypes
   (so, just the workspaces for every non-try mozilla-inbound linux32 task)

With Rok's system of scanning recent decision tasks, you can get a nice enumeration of cache names and/or workerTypes, and they sort nicely since the more-specific information is on the right.
Bug 1198374 comment 15 is a good datapoint for over-clobbering -- clobbering the tooltool cache will add 1-2 minutes to every windows build task just for the toolchain download.  The clang and gcc toolchains are about the same size.
Attachment #8723638 - Flags: review- → review?(dustin)
I've landed a few fixes, but there are a few things you may want to look at:

 * the name of the setting TASKCLUSTER_CACHE_DURATION may be a bit too general -- what if we want to cache something else?

 * the list of TASKCLUSTER_CACHES_TO_SKIP includes branch names, and will need tweaking every time a new branch is created.  It might be better to use something more flexible -- perhaps regular expressions? (a rough sketch follows below)

 * it might be useful to put the two different clobberers on different onscreen tabs, rather than stacking them atop one another.
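A rough sketch of the regular-expression idea: the setting name is from this comment, but the patterns and the matching helper are illustrative assumptions, not the deployed code:

import re

TASKCLUSTER_CACHES_TO_SKIP = [
    r"^tooltool-cache",          # content is checksummed on use; never purge
    r"^level-\d+-.*-tc-vcs$",    # tc-vcs caches, shared across branches
]

def should_skip(cache_name):
    return any(re.search(p, cache_name) for p in TASKCLUSTER_CACHES_TO_SKIP)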

at any rate, this is deployed and I successfully purged all caches for the dbg-linux32 workerType.  Tomcat, do you want to have a look and provide feedback?
Flags: needinfo?(cbook)
hey dustin,

i tried to purge the cache in https://tools.taskcluster.net/task-inspector/#VKgwJU63SL2oJsj-aEOY3g/ and i ran into the problem that it was saying i need to log in (that's fine), and then i logged in with my ldap account, and even then clobbering failed with:

401:  Authorization Failed

do we sheriffs need authorization for clobbering, maybe?
Flags: needinfo?(cbook) → needinfo?(dustin)
Attached file error message
I would expect that if you can push, you can clobber.  On weekends, sheriffs are not always around to offer credentials.
In fact, everyone in team_moco has clobber permissions (it's the same permissions as for the buildbot clobberer).

The tool you're using, on the task inspector, is the purge-cache button, which Rok added but which isn't featureful enough to call "Clobberer" (P.S. Rok: we should talk about adding scopes so people can use the button!).

The tool we're working on in this bug is Clobberer, at https://api.pub.build.mozilla.org/clobberer/ -- please give that a shot!
Flags: needinfo?(dustin) → needinfo?(cbook)
Attached image screenshot for branches
hm dustin, are the branches not in sync? when i select inbound like in the screenshot, i get esr45 to clobber?
Flags: needinfo?(cbook)
Comment on attachment 8730696 [details]
screenshot for branches

That's the Buildbot clobberer -- also not the one we're working on here.  See, right above, where it says "TaskCluster"?  That's the new bit :)

Also, that's really weird -- do you have some browser extension installed that affects select boxes?  Or maybe nightly is doing something strange?  There are no checkmarks in that <select> (which is just a regular HTML select... no fanciness) in any of my browsers.
NI'ing Tomcat since this is blocking Tier1... :)
Flags: needinfo?(cbook)
works great, retested on windows and latest Firefox Release and works fine! Thanks dustin!
Flags: needinfo?(cbook)
Great!
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Tools → General