Closed Bug 1204970 Opened 9 years ago Closed 8 years ago

modify check_pending_builds to report more granularly on pending builds/tests

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: aselagea)

References

Details

Attachments

(3 files, 8 obsolete files)

Right now check_pending_builds just sums all pending jobs and isn't a great indicator of where a problem may lie (or if it's just normal load).

In talking with catlee, we should modify this so that we can check per platform to help us pinpoint issues quicker and ignore "normal load." 

He also suggested that we set thresholds differently based on time of day.

My suggestion is that we pass in some base thresholds and a key that matches ... the platform type? Something we can easily correlate to platform, anyway. This way we can break it down by platform (passed into the script on multiple nagios checks) and add some buffer to the threshold based on time of day within the script logic if we want.
Attached patch check_pending_builds.py.patch (obsolete) — Splinter Review
Attached a first draft of the script
Attachment #8670221 - Flags: review?(kmoir)
Assignee: nobody → vlad.ciobancai
Below you can find the ouput from the python script

C:\Python27\python.exe C:/Users/vlad.ciobancai/PycharmProjects/untitled/get_num_pending.py
OK Pending Builds: 48 for platform b2g-inbound
OK Pending Builds: 1 for platform date
OK Pending Builds: 1 for platform Unknown
OK Pending Builds: 2 for platform mozilla-aurora
WARNING Pending Builds: 3772 for platform try
OK Pending Builds: 5 for platform try-comm-central
OK Pending Builds: 2 for platform mozilla-central
OK Pending Builds: 40 for platform comm-central
OK Pending Builds: 63 for platform fx-team
OK Pending Builds: 135 for platform mozilla-inbound
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

What does the output look when you run this? Do the thresholds need to be adjusted?  My understanding is that currently the threshold is for all types of builds so my presumption is that they would need to be set on a per platform basis.
The output from the script is pasted above . I wanted first to be sure if it's ok the approach that I start
Oh okay, sorry I didn't see the earlier output, my bugmail was sorted weird in my inbox.  In any case, 
I think we want to break the pending counts by platform like they are broken down in this report (i.e. t-w732-ix, t-w864-ix and so on)

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/

which is more useful than those by branch (mozilla-central, mozilla-inbound etc)
Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

new patches are being worked on, will review those when they are available
Attachment #8670221 - Flags: review?(kmoir)
Attached file new_check_pending_builds.py (obsolete) —
A sample output of the script:

CRITICAL Pending Builds: 9764
Top Builds by Platform:
win32 --> 3859 builds
win64 --> 2267 builds
linux64 --> 896 builds
linux --> 668 builds
macosx64 --> 522 builds
Attachment #8679426 - Flags: review?(kmoir)
Comment on attachment 8679426 [details]
new_check_pending_builds.py

I looks good,should we adjust the warning thresholds to be smaller because they are broken out by platform?
Attachment #8679426 - Flags: review?(kmoir) → feedback+
Attached file new_check_pending_builds.py (obsolete) —
So I updated the script to do the following: 
    - return a CRITICAL alert when the number of pending builds for a certain platform is equal or greater than 2000 
    - return a WARNING alert when the number of pending builds for a certain platform is equal or greater than 1200
    - return OK if the number of pending builds for a certain platform is lower than 1200

In all of the cases above, it will also print the top 3 platforms by pending jobs.
Attachment #8679426 - Attachment is obsolete: true
Attachment #8679537 - Flags: review?(kmoir)
For example:

CRITICAL Pending Builds: 2365
Top Builds by Platform:
win32 --> 2365
win64 --> 1714
android-api-11 --> 176
Attached file new_check_pending_builds.py (obsolete) —
Adjusted the way the results are reported a little bit.
Example:

CRITICAL Pending Builds: 5583 on 'win32'
Top Builds by Platform:
win32 --> 5583
win64 --> 3471
macosx64 --> 224
Attachment #8679537 - Attachment is obsolete: true
Attachment #8679537 - Flags: review?(kmoir)
Attachment #8680680 - Flags: review?(kmoir)
Attachment #8680680 - Flags: review?(kmoir) → review+
Attached patch bug1204970.patch (obsolete) — Splinter Review
diff between old and new in patch form
Attached patch bug1204970.patch (obsolete) — Splinter Review
Attachment #8682587 - Attachment is obsolete: true
Attachment #8682592 - Flags: checked-in+
Comment on attachment 8682592 [details] [diff] [review]
bug1204970.patch

reverted this because some changes are needed on that nagios alerts before it can be deployed to production
Attachment #8682592 - Flags: checked-in+ → checked-in-
Attached file check_pending_builds.py (obsolete) —
Updated the script to provide some optimization.
Attachment #8680680 - Attachment is obsolete: true
Attached patch bug1204970.patch (obsolete) — Splinter Review
the new version of the patch
Attachment #8682592 - Attachment is obsolete: true
Attachment #8689525 - Flags: review?(kmoir)
Attachment #8689525 - Flags: review?(kmoir) → review+
Assignee: vlad.ciobancai → alin.selagea
The bad news is that this check doesn't appear to be managed by puppet at all. I had :aselagea copy the new file in place, and running puppet didn't overwrite it.

We've seen a couple 10s nrpe timeouts so far, so I'm not sure if this check is taking significantly longer than the old one, or this machine is just crufty and needs a reboot after 266 days of uptime.

For the time being I've copied the old script to my personal tmp directory on cruncher if we need to revert.
Since we're seeing lots of NRPE timeouts, I've set it to 60s in commands.pp

        check_pending_builds                     => '$USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_pending_builds',
Following Callek's suggestions, I refined the script a little bit. Attached the new version.

Sample output:
    OK Pending Builds: 16 on ['t-w732-ix']
Attachment #8670221 - Attachment is obsolete: true
Attachment #8689523 - Attachment is obsolete: true
Attachment #8711618 - Flags: review?(bugspam.Callek)
Here's the new patch.
Attachment #8689525 - Attachment is obsolete: true
Attachment #8711622 - Flags: review?(bugspam.Callek)
Attachment #8711618 - Flags: review?(bugspam.Callek)
Attached the diff between the latest version and the last reviewed version.
Attachment #8711706 - Flags: review?(bugspam.Callek)
Comment on attachment 8711706 [details] [diff] [review]
bug_1204970_new.patch

Review of attachment 8711706 [details] [diff] [review]:
-----------------------------------------------------------------

::: check_pending_builds.py
@@ +1,1 @@
> +#!/usr/local/bin/python2.7

this slightly concerns me, perhaps

#!/usr/bin/env python2.7

instead?

(please verify it works first) -- was there a specific reason for this change? (e.g. if no specific reason might as well undo it)

@@ +65,2 @@
>                  duplicate = 1
>                  break

I think this duplicate check can be improved for speed, but since the basic logic was already here, its not blocking anything in terms of my review.

@@ -118,5 @@
> -            "on '%s'" % sort_count_by_platform[0][0]
> -        print 'Top Builds by Platform:'
> -        for k in range(0, len(sort_count_by_platform)):
> -            if k < 3:
> -                print sort_count_by_platform[k][0], '-->', sort_count_by_platform[k][1]

top builds was a useful output when manually running imo, but not necessary for a script strictly used as a nagios check.
Attachment #8711706 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8711622 [details] [diff] [review]
bug_1204970.patch

clearing review, since I didn't review this one officially. But I agree to just land the final version of the check into braindump/ assuming all lines have had review (which aiui they have)
Attachment #8711622 - Flags: review?(bugspam.Callek)
Updated the script to use "#!/usr/bin/env python" (just as it was on version 1.0) and checked that it works. Pushed the new version to braindump and also replaced the existing script on cruncher with this one.
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #24)
> Updated the script to use "#!/usr/bin/env python" (just as it was on version
> 1.0) and checked that it works. Pushed the new version to braindump and also
> replaced the existing script on cruncher with this one.

Turns out that shebang change doesn't actually work as I suggested, - well it works manually from command line, but not from nagios (perhaps some PATH issue)

Anyway, I reverted it on the live copy, not the repo copy.
What is the state of this bug? Can it be closed?  I noticed in #buildduty that we have alerts per platform for tests working.
Blocks: 1245823
The check now uses the new version of the script.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: