Closed Bug 1204970 Opened 9 years ago Closed 8 years ago

modify check_pending_builds to report more granularly on pending builds/tests

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: arich, Assigned: aselagea)

References

Details

Attachments

(3 files, 8 obsolete files)

check_pending_builds.py.patch 9 years ago Vlad Ciobancai [:vladC] [:vciobancai] 1.27 KB, patch		Details \| Diff \| Splinter Review
new_check_pending_builds.py 9 years ago Alin Selagea [:aselagea] 4.44 KB, text/plain	kmoir : feedback+	Details
new_check_pending_builds.py 9 years ago Alin Selagea [:aselagea] 4.44 KB, text/plain		Details
new_check_pending_builds.py 9 years ago Alin Selagea [:aselagea] 4.62 KB, text/plain	kmoir : review+	Details
bug1204970.patch 9 years ago Kim Moir [:kmoir] ET 4.83 KB, patch		Details \| Diff \| Splinter Review
bug1204970.patch 9 years ago Kim Moir [:kmoir] ET 5.71 KB, patch	kmoir : checked-in-	Details \| Diff \| Splinter Review
check_pending_builds.py 9 years ago Alin Selagea [:aselagea] 5.14 KB, text/plain		Details
bug1204970.patch 9 years ago Alin Selagea [:aselagea] 6.45 KB, patch	kmoir : review+	Details \| Diff \| Splinter Review
check_pending_builds.py 8 years ago Alin Selagea [:aselagea] 4.51 KB, text/plain		Details
bug_1204970.patch 8 years ago Alin Selagea [:aselagea] 5.75 KB, patch		Details \| Diff \| Splinter Review
bug_1204970_new.patch 8 years ago Alin Selagea [:aselagea] 6.74 KB, patch	Callek : review+	Details \| Diff \| Splinter Review

Amy Rich [:arr] [:arich]

Reporter

Description

•

9 years ago

Right now check_pending_builds just sums all pending jobs and isn't a great indicator of where a problem may lie (or if it's just normal load).

In talking with catlee, we should modify this so that we can check per platform to help us pinpoint issues quicker and ignore "normal load." 

He also suggested that we set thresholds differently based on time of day.

My suggestion is that we pass in some base thresholds and a key that matches ... the platform type? Something we can easily correlate to platform, anyway. This way we can break it down by platform (passed into the script on multiple nagios checks) and add some buffer to the threshold based on time of day within the script logic if we want.

Vlad Ciobancai [:vladC] [:vciobancai]

Comment 1

•

9 years ago

Attached patch check_pending_builds.py.patch (obsolete) — Details — Splinter Review

Attached a first draft of the script

Attachment #8670221 - Flags: review?(kmoir)

Vlad Ciobancai [:vladC] [:vciobancai]

Updated

•

9 years ago

Assignee: nobody → vlad.ciobancai

Vlad Ciobancai [:vladC] [:vciobancai]

Comment 2

•

9 years ago

Below you can find the ouput from the python script

C:\Python27\python.exe C:/Users/vlad.ciobancai/PycharmProjects/untitled/get_num_pending.py
OK Pending Builds: 48 for platform b2g-inbound
OK Pending Builds: 1 for platform date
OK Pending Builds: 1 for platform Unknown
OK Pending Builds: 2 for platform mozilla-aurora
WARNING Pending Builds: 3772 for platform try
OK Pending Builds: 5 for platform try-comm-central
OK Pending Builds: 2 for platform mozilla-central
OK Pending Builds: 40 for platform comm-central
OK Pending Builds: 63 for platform fx-team
OK Pending Builds: 135 for platform mozilla-inbound

Kim Moir [:kmoir] ET

Comment 3

•

9 years ago

Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

What does the output look when you run this? Do the thresholds need to be adjusted?  My understanding is that currently the threshold is for all types of builds so my presumption is that they would need to be set on a per platform basis.

Vlad Ciobancai [:vladC] [:vciobancai]

Comment 4

•

9 years ago

The output from the script is pasted above . I wanted first to be sure if it's ok the approach that I start

Kim Moir [:kmoir] ET

Comment 5

•

9 years ago

Oh okay, sorry I didn't see the earlier output, my bugmail was sorted weird in my inbox.  In any case, 
I think we want to break the pending counts by platform like they are broken down in this report (i.e. t-w732-ix, t-w864-ix and so on)

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/

which is more useful than those by branch (mozilla-central, mozilla-inbound etc)

Kim Moir [:kmoir] ET

Comment 6

•

9 years ago

Comment on attachment 8670221 [details] [diff] [review]
check_pending_builds.py.patch

new patches are being worked on, will review those when they are available

Attachment #8670221 - Flags: review?(kmoir)

Alin Selagea [:aselagea]

Assignee

Comment 7

•

9 years ago

Attached file new_check_pending_builds.py (obsolete) — Details

A sample output of the script:

CRITICAL Pending Builds: 9764
Top Builds by Platform:
win32 --> 3859 builds
win64 --> 2267 builds
linux64 --> 896 builds
linux --> 668 builds
macosx64 --> 522 builds

Attachment #8679426 - Flags: review?(kmoir)

Kim Moir [:kmoir] ET

Comment 8

•

9 years ago

Comment on attachment 8679426 [details]
new_check_pending_builds.py

I looks good,should we adjust the warning thresholds to be smaller because they are broken out by platform?

Attachment #8679426 - Flags: review?(kmoir) → feedback+

Alin Selagea [:aselagea]

Assignee

Comment 9

•

9 years ago

Attached file new_check_pending_builds.py (obsolete) — Details

So I updated the script to do the following: 
    - return a CRITICAL alert when the number of pending builds for a certain platform is equal or greater than 2000 
    - return a WARNING alert when the number of pending builds for a certain platform is equal or greater than 1200
    - return OK if the number of pending builds for a certain platform is lower than 1200

In all of the cases above, it will also print the top 3 platforms by pending jobs.

Attachment #8679426 - Attachment is obsolete: true

Attachment #8679537 - Flags: review?(kmoir)

Alin Selagea [:aselagea]

Assignee

Comment 10

•

9 years ago

For example:

CRITICAL Pending Builds: 2365
Top Builds by Platform:
win32 --> 2365
win64 --> 1714
android-api-11 --> 176

Alin Selagea [:aselagea]

Assignee

Comment 11

•

9 years ago

Attached file new_check_pending_builds.py (obsolete) — Details

Adjusted the way the results are reported a little bit.
Example:

CRITICAL Pending Builds: 5583 on 'win32'
Top Builds by Platform:
win32 --> 5583
win64 --> 3471
macosx64 --> 224

Attachment #8679537 - Attachment is obsolete: true

Attachment #8679537 - Flags: review?(kmoir)

Attachment #8680680 - Flags: review?(kmoir)

Kim Moir [:kmoir] ET

Updated

•

9 years ago

Attachment #8680680 - Flags: review?(kmoir) → review+

Kim Moir [:kmoir] ET

Comment 12

•

9 years ago

Attached patch bug1204970.patch (obsolete) — Details — Splinter Review

diff between old and new in patch form

Kim Moir [:kmoir] ET

Comment 13

•

9 years ago

Attached patch bug1204970.patch (obsolete) — Details — Splinter Review

Attachment #8682587 - Attachment is obsolete: true

Kim Moir [:kmoir] ET

Updated

•

9 years ago

Attachment #8682592 - Flags: checked-in+

Kim Moir [:kmoir] ET

Comment 14

•

9 years ago

Comment on attachment 8682592 [details] [diff] [review]
bug1204970.patch

reverted this because some changes are needed on that nagios alerts before it can be deployed to production

Attachment #8682592 - Flags: checked-in+ → checked-in-

Alin Selagea [:aselagea]

Assignee

Comment 15

•

9 years ago

Attached file check_pending_builds.py (obsolete) — Details

Updated the script to provide some optimization.

Attachment #8680680 - Attachment is obsolete: true

Alin Selagea [:aselagea]

Assignee

Comment 16

•

9 years ago

Attached patch bug1204970.patch (obsolete) — Details — Splinter Review

the new version of the patch

Attachment #8682592 - Attachment is obsolete: true

Attachment #8689525 - Flags: review?(kmoir)

Kim Moir [:kmoir] ET

Updated

•

9 years ago

Attachment #8689525 - Flags: review?(kmoir) → review+

Vlad Ciobancai [:vladC] [:vciobancai]

Updated

•

8 years ago

Assignee: vlad.ciobancai → alin.selagea

Amy Rich [:arr] [:arich]

Reporter

Comment 17

•

8 years ago

The bad news is that this check doesn't appear to be managed by puppet at all. I had :aselagea copy the new file in place, and running puppet didn't overwrite it.

We've seen a couple 10s nrpe timeouts so far, so I'm not sure if this check is taking significantly longer than the old one, or this machine is just crufty and needs a reboot after 266 days of uptime.

For the time being I've copied the old script to my personal tmp directory on cruncher if we need to revert.

Amy Rich [:arr] [:arich]

Reporter

Comment 18

•

8 years ago

Since we're seeing lots of NRPE timeouts, I've set it to 60s in commands.pp

        check_pending_builds                     => '$USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_pending_builds',

Alin Selagea [:aselagea]

Assignee

Comment 19

•

8 years ago

Attached file check_pending_builds.py — Details

Following Callek's suggestions, I refined the script a little bit. Attached the new version.

Sample output:
    OK Pending Builds: 16 on ['t-w732-ix']

Attachment #8670221 - Attachment is obsolete: true

Attachment #8689523 - Attachment is obsolete: true

Attachment #8711618 - Flags: review?(bugspam.Callek)

Alin Selagea [:aselagea]

Assignee

Comment 20

•

8 years ago

Attached patch bug_1204970.patch — Details — Splinter Review

Here's the new patch.

Attachment #8689525 - Attachment is obsolete: true

Attachment #8711622 - Flags: review?(bugspam.Callek)

Alin Selagea [:aselagea]

Assignee

Updated

•

8 years ago

Attachment #8711618 - Flags: review?(bugspam.Callek)

Alin Selagea [:aselagea]

Assignee

Comment 21

•

8 years ago

Attached patch bug_1204970_new.patch — Details — Splinter Review

Attached the diff between the latest version and the last reviewed version.

Attachment #8711706 - Flags: review?(bugspam.Callek)

Justin Wood (:Callek)

Comment 22

•

8 years ago

Comment on attachment 8711706 [details] [diff] [review]
bug_1204970_new.patch

Review of attachment 8711706 [details] [diff] [review]:
-----------------------------------------------------------------

::: check_pending_builds.py
@@ +1,1 @@
> +#!/usr/local/bin/python2.7

this slightly concerns me, perhaps

#!/usr/bin/env python2.7

instead?

(please verify it works first) -- was there a specific reason for this change? (e.g. if no specific reason might as well undo it)

@@ +65,2 @@
>                  duplicate = 1
>                  break

I think this duplicate check can be improved for speed, but since the basic logic was already here, its not blocking anything in terms of my review.

@@ -118,5 @@
> -            "on '%s'" % sort_count_by_platform[0][0]
> -        print 'Top Builds by Platform:'
> -        for k in range(0, len(sort_count_by_platform)):
> -            if k < 3:
> -                print sort_count_by_platform[k][0], '-->', sort_count_by_platform[k][1]

top builds was a useful output when manually running imo, but not necessary for a script strictly used as a nagios check.

Attachment #8711706 - Flags: review?(bugspam.Callek) → review+

Justin Wood (:Callek)

Comment 23

•

8 years ago

Comment on attachment 8711622 [details] [diff] [review]
bug_1204970.patch

clearing review, since I didn't review this one officially. But I agree to just land the final version of the check into braindump/ assuming all lines have had review (which aiui they have)

Attachment #8711622 - Flags: review?(bugspam.Callek)

Alin Selagea [:aselagea]

Assignee

Comment 24

•

8 years ago

Updated the script to use "#!/usr/bin/env python" (just as it was on version 1.0) and checked that it works. Pushed the new version to braindump and also replaced the existing script on cruncher with this one.

Justin Wood (:Callek)

Comment 25

•

8 years ago

(In reply to Alin Selagea [:aselagea][:buildduty] from comment #24)
> Updated the script to use "#!/usr/bin/env python" (just as it was on version
> 1.0) and checked that it works. Pushed the new version to braindump and also
> replaced the existing script on cruncher with this one.

Turns out that shebang change doesn't actually work as I suggested, - well it works manually from command line, but not from nagios (perhaps some PATH issue)

Anyway, I reverted it on the live copy, not the repo copy.

Kim Moir [:kmoir] ET

Comment 26

•

8 years ago

What is the state of this bug? Can it be closed?  I noticed in #buildduty that we have alerts per platform for tests working.

Justin Wood (:Callek)

Updated

•

8 years ago

Blocks: 1245823

Alin Selagea [:aselagea]

Assignee

Comment 27

•

8 years ago

The check now uses the new version of the script.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: General Automation → General

You need to log in before you can comment on or make changes to this bug.