Closed Bug 930021 Opened 11 years ago Closed 10 years ago

Monitor free inodes on buildbot masters

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhopkins, Assigned: ashish)

References

Details

Attachments

(1 file)

buildbot-master10 ran out of inodes due to a stale cleanup lock file but nagios did not alert on the inode situation beforehand.

We should make sure we not only monitor and alert on free disk space but free inodes as well.
Nagios is checking free space and inodes for /, /builds/, and /var, which is triplication because those paths are all on the same / partition (wat). The checks have been green for at least 61 days.
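
One way to confirm that several paths share a filesystem (and so only need one check) is to compare their device numbers. This is a minimal sketch, assuming GNU `stat` on a Linux host; the function name `same_filesystem` is illustrative and not part of any Nagios tooling.

```shell
# Return 0 if every path given lives on the same filesystem as the first one,
# 1 if any path differs (or cannot be read), 2 if the first path is unreadable.
# Assumes GNU coreutils stat, where %d prints the device number.
same_filesystem() {
    first=$(stat -c %d "$1" 2>/dev/null) || return 2
    shift
    for path in "$@"; do
        [ "$(stat -c %d "$path" 2>/dev/null)" = "$first" ] || return 1
    done
    return 0
}
```

On a host laid out as described here, `same_filesystem / /builds /var && echo "one partition"` would be expected to print the message.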

Where is the lock file stored?
> Nagios is checking free space and inodes for /, /builds/, and /var, which is triplication because those paths are all on the same / partition (wat). The checks have been green for at least 61 days.

What are the inodes warning/alert thresholds?

> Where is the lock file stored?

The lock file was /etc/cron.d/bm10-tests1-tegra  See also: bug 930216
Something isn't working with the nagios check then. The first indication that something was wrong was at 23:44 ET, when nagios alerted about the number of dead items. The notification for /builds never happened AFAICT.
http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/config.cgi?type=services&expand=buildbot-master10.build.mtv1.mozilla.com says 5% and 10%, which presumably applies for both space and inodes.

ashish, any ideas on this ?
Flags: needinfo?(ashish)
(In reply to Nick Thomas [:nthomas] from comment #4)
> http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/config.
> cgi?type=services&expand=buildbot-master10.build.mtv1.mozilla.com says 5%
> and 10%, which presumably applies for both space and inodes.
> 
> ashish, any ideas on this ?

That is correct. The thresholds apply to inodes as well. I'm unsure why Nagios didn't alert, and I can't verify that now either...
Flags: needinfo?(ashish)
Looking at [1]:

 $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_disk -a $ARG1$ $ARG2$ $ARG3$
 -> $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_disk -a 10% 5% /

The arguments "10% 5% /" get translated on the client via /etc/nagios/nrpe.cfg:

 command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$

So the 'rendered' command that gets run in this case is:

 check_disk -w 10% -c 5% -p /
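
The substitution described above can be reproduced by hand. This is a rough sketch, not NRPE's actual implementation: `render_nrpe_command` is a hypothetical helper that replaces `$ARGn$` macros in a template with positional arguments, in the same order NRPE receives them after `-a`.

```shell
# Substitute NRPE-style $ARGn$ macros in a command template.
# Usage: render_nrpe_command TEMPLATE ARG1 [ARG2 ...]
# Illustration only: arguments containing '|', '&' or '\' would confuse sed.
render_nrpe_command() {
    rendered=$1
    shift
    n=1
    for arg in "$@"; do
        # \$ARGn\$ matches the literal macro text, including both dollar signs.
        rendered=$(printf '%s\n' "$rendered" | sed "s|\\\$ARG${n}\\\$|${arg}|g")
        n=$((n + 1))
    done
    printf '%s\n' "$rendered"
}
```

Calling it with the `check_disk` template from nrpe.cfg and the arguments `'10%' '5%' '/'` yields the rendered command line shown above.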

I did some testing and found that the above command will only check free disk space - there are separate arguments for checking inodes: -W and -C (note capitalization).

If we want to use the same thresholds for inodes as free disk space, we could modify /etc/nagios/nrpe.cfg to read:

 command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -W $ARG1$ -C $ARG2$ -p $ARG3$

Otherwise, we could create a new command like:

 command[check_inodes]=/usr/lib64/nagios/plugins/check_disk -W $ARG1$ -C $ARG2$ -p $ARG3$

[1] http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/config.cgi?type=command&expand=check_nrpe_disk!10%25!5%25!%2F
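
As an independent cross-check of whatever the plugin reports, the free-inode percentage can be read straight from `df -i` and compared against the same 10%/5% thresholds. This is a sketch under assumptions: the function names are made up for illustration, the column layout is that of GNU `df -iP`, and the states mirror the "less than PERCENT free" semantics used for the disk check. The installed plugin's `check_disk --help` output remains the authority on its inode option letters.

```shell
# Parse `df -iP` output on stdin and print the integer percentage of free
# inodes for the first filesystem listed, or "unknown" if none are reported.
# Columns (GNU df): Filesystem Inodes IUsed IFree IUse% Mounted-on
parse_inode_free_pct() {
    awk 'NR == 2 { if ($2 > 0) printf "%d\n", ($4 * 100) / $2; else print "unknown" }'
}

# Print the free-inode percentage for the filesystem holding path $1.
inode_free_pct() {
    df -iP "$1" | parse_inode_free_pct
}

# Map a free percentage onto a Nagios-style state and exit code.
# Usage: inode_state PCT WARN CRIT   (all integers, no % sign)
inode_state() {
    pct=$1 warn=$2 crit=$3
    case $pct in
        ''|*[!0-9]*) echo UNKNOWN; return 3 ;;
    esac
    if [ "$pct" -lt "$crit" ]; then
        echo CRITICAL; return 2
    elif [ "$pct" -lt "$warn" ]; then
        echo WARNING; return 1
    fi
    echo OK
}
```

For example, `inode_state "$(inode_free_pct /)" 10 5` prints the state for the root filesystem using the thresholds discussed in this bug.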
Component: Other → Platform Support
QA Contact: joduinn → coop
(In reply to John Hopkins (:jhopkins) from comment #6)
> Otherwise, we could create a new command like:
> 
>  command[check_inodes]=/usr/lib64/nagios/plugins/check_disk -W $ARG1$ -C
> $ARG2$ -p $ARG3$

I'm all for having a distinct check here.
Attachment #8365187 - Flags: review?(dustin)
Attachment #8365187 - Flags: review?(dustin) → review+
Comment on attachment 8365187 [details] [diff] [review]
[puppet] add check_inodes

https://hg.mozilla.org/build/puppet/rev/c340ec61e2d9

Next, I believe we have to request that IT make use of this new check command.
Attachment #8365187 - Flags: checked-in+
ashish: what do we need to do to have nagios use the check_inodes command?
Flags: needinfo?(ashish)
Blocks: re-nagios
(In reply to John Hopkins (:jhopkins) from comment #10)
> ashish: what do we need to do to have nagios use the check_inodes command?

Which hostgroups should this new check be added to? buildbot-master10 doesn't exist anymore...
Flags: needinfo?(ashish)
At a minimum, these hostgroups:

 dev-buildbot-masters
 scl3-production-buildbot-masters
 use1-production-buildbot-masters
 usw2-production-buildbot-masters

I expect check_inodes would be a good counterpart to most (all?) existing UNIX-based check_disk checks.
Flags: needinfo?(ashish)
The check has been added to the specified hostgroups. However, this lone host has not gotten the NRPE configuration:

dev-master01.build.scl1.mozilla.com

I've acked the host per conversation with :Callek on IRC.
Assignee: nobody → ashish
Status: NEW → RESOLVED
Closed: 10 years ago
Component: Platform Support → Server Operations
Flags: needinfo?(ashish)
Flags: checked-in+
Product: Release Engineering → mozilla.org
QA Contact: coop → shyam
Resolution: --- → FIXED
Version: unspecified → other
Re: comment 13, added the following to /etc/nagios/nrpe.cfg on dev-master01:

command[check_inodes]=/usr/lib64/nagios/plugins/check_disk -W $ARG1$ -C $ARG2$ -p $ARG3$
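
A quick way to spot hosts that are missing the definition, as dev-master01 was, is to grep the NRPE config for it. This is a minimal sketch; `has_nrpe_command` is a hypothetical helper, and the default path is the one used in this bug.

```shell
# Return 0 if FILE defines an NRPE command named NAME.
# Usage: has_nrpe_command NAME [FILE]
has_nrpe_command() {
    name=$1
    file=${2:-/etc/nagios/nrpe.cfg}
    grep -q "^[[:space:]]*command\[${name}][[:space:]]*=" "$file"
}
```

For example, `has_nrpe_command check_inodes || echo "check_inodes missing"` on each master. Once the definition is in place and NRPE has reloaded its config, running check_nrpe from the Nagios server with `-c check_inodes -a 10% 5% /` (mirroring the check_disk invocation quoted earlier) should return a status line rather than an undefined-command error.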
Product: mozilla.org → mozilla.org Graveyard