10. Automation Tools Provided

10.1 Overview

A sample automation directory is shipped with the application. It is pretty simple, but it works. It is provided as a sample structure upon which you can build your own alerts; it is not a full-blown automation tool, although you can make it do anything a script can do.

It is a separate package and can be installed anywhere, but it is shipped under the alert toolkit directory as that's where I think it belongs. If you move it, update the config.txt file.

Basically, as long as the config.txt file points to the correct directory, all the alert toolkit scripts supplied will find these scripts correctly. The automation scripts themselves use paths relative to where they are run from to find their own configuration, so they will run correctly anywhere. The automation file tools_library will need to be customised, as that identifies to the automation scripts where the alert toolkit was installed.
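
As a rough illustration of what finding files relative to the script's own location can look like in a shell script, the sketch below resolves the directory a script lives in. The function name script_home is hypothetical and is not part of the toolkit; the shipped scripts may well do this differently.

```shell
#!/bin/bash
# script_home PATH
# Prints the absolute directory that contains PATH (normally "$0").
# Hypothetical helper for illustration only; not part of the toolkit.
script_home() {
    local dir
    dir=$(dirname "$1") || return 1
    # cd inside a subshell so the caller's working directory is untouched
    (cd "${dir}" && pwd -P)
}

# Typical use inside an automation script:
#   home=$(script_home "$0")
#   table="${home}/automation_table"
```

Building every path from the resolved directory is what lets the whole automation directory be moved without editing the scripts inside it.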

10.2 Key files

It has the following key components:

The file tools_library, which sets up the scripts it will need to use based upon whether it can find my alert toolkit or not. It will run without it. You should customise the directory the alert toolkit has been installed into.
The script file actioncmd.sh, which checks the information passed to it to see if it can find a matching automation rule to run. You should never touch this. See section 10.5 below for more details.
The script file is_automated.sh, which is called by actioncmd.sh to search for a matching automation rule. This can be run manually if you so wish. See section 10.5 below for more details.
The automation rule file automation_table, which contains the automation rules and what should be done with any rules that trigger. See section 10.4 below for more details.
The directory scripts, which must be under the automation directory. This is a sanity check; you don't want automation scripts lying all over the machine, so they must be kept in this one place.

10.3 Requesting an event be automated

The very basic steps to automate an event are:

Add the rule to the automation_table file.

In your script, when an event is deemed to need automating, call actioncmd.sh with the name of the script making the automation request, the subject key, and any parameters you want passed to the script identified in the automation rule.

Note: when calling actioncmd.sh you must pass the name of the script making the call. One subject may be checked for different things by different specialised scripts; using the scriptname as part of the key should be enough to make each event unique, as hopefully you use different scripts to run different types of checks on your servers. It also has the added benefit of letting you know what script calls each rule should your documentation get a little sloppy.

Ensure your rules are correct, as:

If you try to automate an event that does not have a matching rule, then an alert will be raised advising there was no matching automation rule.
If you run the actioncmd.sh automation rule request correctly, but as the wrong user, an alert will be raised indicating the rule failed as it was run under the wrong user; no attempt will be made to run the script.

10.4 The automation_table ruleset

The key file is in the automation directory itself, and is called automation_table. It must contain records in the form below (SPACE SEPARATED, not tabbed).

    # automation_table
    # A list of rules we expect to be coded for various alert threshold events
    #
    # All fields must be provided, and are
    #   alert-key        : A unique message number key for action/error alerts
    #   scriptname       : the name of the script that is checking/calling us
    #   subject          : the subject that will need to be automated
    #   user-needed      : the userid the automation is expected to run under
    #   automation-script: the script to run for automation
    #
    # alert-key scriptname subject user-needed automation-script
    AUTO_HTTPD r030_filesize_check.sh /var/log/httpd/access_log root httpd_rollover.sh
    AUTO_HTTPD r030_filesize_check.sh /var/log/httpd/error_log root httpd_rollover.sh
    AUTO_EMPTY r030_filesize_check.sh /var/log/daemons/errors root empty_out_file.sh
    AUTO_HTTPERR r015_httpd_halt_check.sh /var/log/httpd/error_log apache httpd_restart.sh

Note: I have personally standardised on starting every key with AUTO_, as this allows me to add checks to my daily exception reporting for automated events that occurred. What you actually use is up to you, but you should use a common standard.

The fields in the file are...

The alert-key is used by the alert toolkit applications.
The user-needed is a check; if the automation script is called by any other user the rule will fail, as it is assumed any other user does not have access.
The automation-script is the script to be called; it will be passed all the parameters given to actioncmd.sh.
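
Since every rule must supply all five fields and must be separated by spaces rather than tabs, it can be worth checking the file after editing it. The following is a minimal sketch of such a check. The function name lint_table is hypothetical and nothing like it ships with the toolkit; it also assumes no field (such as the subject) contains a space.

```shell
#!/bin/bash
# lint_table FILE
# Reports any rule line that contains a tab or does not have exactly five
# space-separated fields. Comment lines (#) and blank lines are skipped.
# Returns 0 if every rule line is well formed, 1 otherwise.
# Hypothetical helper for illustration only; not part of the toolkit.
lint_table() {
    awk '
        /^[ ]*#/ || /^[ ]*$/ { next }   # skip comments and blank lines
        /\t/    { printf "line %d: contains a tab\n", NR; bad = 1; next }
        NF != 5 { printf "line %d: expected 5 fields, found %d\n", NR, NF; bad = 1 }
        END     { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running lint_table against the automation_table file prints one line per problem found and nothing at all when the file is clean, so it is easy to call from a script after each edit.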

10.5 The automation scripts in more detail

10.5.1 actioncmd.sh

This is the main interface script to the automation rules. When called from one of your site's health checking scripts it will:

if the tools_library cannot be found, write a message to the system log, return the text string NO to stdout and exit with exit code 1. Note: the error is written to the system log because without the tools_library it has no way of knowing how to raise an alert

raises a warning alert saying automation is beginning

check a matching rule exists, using is_automated.sh

if the exit code was 2, exits with exit code 1; is_automated.sh has already raised any alerts needed

if the stdout check was "NO" then there was no automation rule, so actioncmd.sh raises a critical alert to say so, and exits with exit code 1

otherwise, it carries on with the following steps

raises a warning alert indicating automation is in progress for the subject. This is simply because some automation rules may take a while to run and we want operations to know it is actually happening and not being ignored

executes the automation script identified by the ruleset, passing it all the parameters that were passed to actioncmd.sh

on completion of the automation script checks the exit code

if exit code was 0, cancels the earlier warning alert

if exit code was not 0, replaces the warning alert with a critical alert saying automation failed

always exits with exit code 0 at this point; it has worked OK, and if a rule failed a message was logged correctly indicating so.

Note that in the steps above the actioncmd.sh will always exit 0 if a perfectly good matching rule was found. This is the correct behaviour for my site as I see no point in embedding in my scripts a check to see if an automation step failed and getting the calling script to raise an alert, when if the automation script failed the actioncmd.sh script will already have done so on my behalf. I like to keep my code simple so things like this can be offloaded into tools.
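
The decision steps above can be sketched roughly as follows. This is not the shipped actioncmd.sh: the alert function is a hypothetical stand-in that only prints what would be raised, and handle_request is a hypothetical name that takes the reply and exit code from is_automated.sh as arguments so the flow can be followed without the toolkit installed.

```shell
#!/bin/bash
# Hypothetical stand-in for the toolkit's alerting. It only prints what
# would be raised, so the flow can be followed without the toolkit.
alert() { echo "ALERT: $*" >&2; }

# handle_request CHECK_STR CHECK_RC SCRIPTS_DIR SCRIPTNAME SUBJECT [EXTRA...]
# CHECK_STR and CHECK_RC are the stdout text and exit code obtained from
# is_automated.sh. Everything from SCRIPTNAME onwards is handed to the
# automation script unchanged.
handle_request() {
    local check_str="$1" check_rc="$2" scripts_dir="$3"
    local answer key script
    shift 3

    # Exit code 2: is_automated.sh has already raised its own alerts.
    if [ "${check_rc}" -eq 2 ]; then
        return 1
    fi

    case "${check_str}" in
        YES*) ;;
        *)  alert critical "no automation rule for $1 $2"
            return 1 ;;
    esac

    # The reply has the form: YES alert-key automation-script
    read -r answer key script <<< "${check_str}"

    alert warning "${key}" "automation in progress for $2"
    if "${scripts_dir}/${script}" "$@"; then
        alert cancel "${key}"
    else
        alert critical "${key}" "automation failed for $2"
    fi
    return 0    # a rule was found and run; any failure has been alerted
}
```

Note how the function returns 0 even when the automation script fails, matching the behaviour described above: the failure has already been reported by an alert, so the caller has nothing further to check.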

A sample of how you would use this in your scripts is below...

    #!/bin/bash
    # if /var/log/httpd/access_log is over 1MB raise an alert
    # if it's over 2MB take automated recovery action
    # expected output format is as below...
    # -rw-r--r-- 1 root root 94808 Jul 25 18:01 /var/log/httpd/access_log
    myname=$(basename "$0")
    keyname="CHANGE_ME"    # placeholder: set this to the alert key for this check
    filename="/var/log/httpd/access_log"
    testvar=$(ls -la "${filename}")
    testsize=$(echo "${testvar}" | awk '{print $5}')
    if [ "${testsize:-0}" -gt 2000000 ]; then
       # only the first two parameters are required, but all parameters
       # passed to actioncmd.sh are also passed to the automation script
       # so we can provide additional information.
       actioncmd.sh "${myname}" "${filename}" "${testsize}"
       # the rule above would presumably roll your logs and run a web trend report.
    elif [ "${testsize:-0}" -gt 1000000 ]; then
       raise_alert 9003 127.0.0.1 warning "${keyname}" "${filename} > 1000000"
    # else, no problem
    fi
    exit 0

Or if you use the supplied raise_alert.sh script

    #!/bin/bash
    # if /var/log/httpd/access_log is over 1MB raise an alert
    # if it's over 2MB take automated recovery action
    # expected output format is as below...
    # -rw-r--r-- 1 root root 94808 Jul 25 18:01 /var/log/httpd/access_log
    myname=$(basename "$0")
    keyname="CHANGE_ME"    # placeholder: set this to the alert key for this check
    filename="/var/log/httpd/access_log"
    testvar=$(ls -la "${filename}")
    testsize=$(echo "${testvar}" | awk '{print $5}')
    if [ "${testsize:-0}" -gt 1000000 ]; then
       # All parms after AUTOMATE are used by the automation script to be run if found
       raise_alert.sh "${keyname}" "${filename} > 1000000" "warning" AUTOMATE "${myname}" "${filename}" "${testsize}"
    fi
    exit 0
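
On the receiving end, the automation script named in the rule gets exactly the parameters that were given to actioncmd.sh. The sketch below shows what a simple automation script might do with them. It is a hypothetical example and not the toolkit's own empty_out_file.sh; it is written as a function so it can be tried out safely against a scratch file.

```shell
#!/bin/bash
# Sketch of an automation script body. Arguments arrive exactly as they
# were given to actioncmd.sh:
#   $1    = name of the script that requested automation
#   $2    = subject (here, the file to empty)
#   $3... = any extra parameters (here, the size that triggered the rule)
# Hypothetical example; not the toolkit's own empty_out_file.sh.
empty_out_file() {
    local caller="$1" target="$2" size="${3:-unknown}"

    if [ -z "${target}" ] || [ ! -f "${target}" ]; then
        echo "${target:-<none>}: not a regular file" >&2
        return 1            # non-zero: actioncmd.sh raises the failure alert
    fi

    # Truncate in place so a process holding the file open keeps a valid handle.
    : > "${target}" || return 1

    echo "emptied ${target} (requested by ${caller}, size was ${size})"
    return 0                # zero: actioncmd.sh cancels its warning alert
}

# In a real script file the last lines would be:
#   empty_out_file "$@"
#   exit $?
```

The only contract that matters to actioncmd.sh is the exit code, so whatever the script does it should finish with 0 on success and a non-zero value on failure.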

10.5.2 is_automated.sh

This is called by actioncmd.sh. It is a separate routine as you may wish to call it yourself to test for the existence of an automation rule. It will:

if the tools_library cannot be found, write a message to the system log, return the text string NO to stdout and exit with exit code 1. Note: the error is written to the system log because without the tools_library it has no way of knowing how to raise an alert

if no automation_table file is found raises a critical alert and exits with exit code 2, returns text string NO to stdout

if no matching automation rule is found exits with exit code 0 (this is not an error) and returns the text string NO to stdout

if an automation rule is found but the script the rule is configured to run does not exist, it exits with exit code 2, raises a critical alert and writes the text string NO to stdout

if an automation rule is found but the script was called by the wrong userid, it exits with exit code 2, raises a critical alert and writes the text string NO to stdout

If all is OK, it exits with exit code 0 and writes the string
YES automationkey automation-script
to stdout for the caller to do something useful with.

As a general rule you wouldn't call this from your own scripts, as actioncmd.sh calls it on your behalf. However, there may be occasions where you want one of your own scripts to check whether a rule exists, and to take specialised action for that script if the rule does not exist or does not pass the checks. Note, however, that it will still raise alerts as appropriate.
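
To make the checks listed above concrete, here is a minimal sketch of the same lookup with the alerting and the tools_library check left out. The function name check_rule and its argument order are hypothetical, and it is not the shipped is_automated.sh; the optional fifth argument exists only so the user check can be exercised without switching users.

```shell
#!/bin/bash
# check_rule TABLE SCRIPTS_DIR SCRIPTNAME SUBJECT [USER]
# Mirrors the documented replies: prints "YES alert-key automation-script"
# or "NO", and returns 0 (OK, possibly no rule) or 2 (bad rule or table).
# Hypothetical helper for illustration; not the shipped is_automated.sh.
check_rule() {
    local table="$1" scripts_dir="$2" scriptname="$3" subject="$4"
    local user="${5:-$(id -un)}"
    local rule key needed script

    if [ ! -f "${table}" ]; then
        echo "NO"; return 2             # no automation_table file
    fi

    # First rule whose scriptname and subject both match exactly.
    rule=$(awk -v s="${scriptname}" -v subj="${subject}" \
        '!/^[ ]*#/ && $2 == s && $3 == subj { print $1, $4, $5; exit }' "${table}")
    read -r key needed script <<< "${rule}"

    if [ -z "${key}" ]; then
        echo "NO"; return 0             # no rule is not an error
    fi
    if [ ! -f "${scripts_dir}/${script}" ]; then
        echo "NO"; return 2             # rule names a script that does not exist
    fi
    if [ "${user}" != "${needed}" ]; then
        echo "NO"; return 2             # called by the wrong user
    fi
    echo "YES ${key} ${script}"
    return 0
}
```

Because a missing rule and a usable rule both return 0, a caller always has to look at the text as well as the exit code.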

A sample of how you could use it in your scripts is below...
Remember:
Exit code 0 = OK, but may be no rule
Exit code 1 = missing library
Exit code 2 = bad rule found
Always check the stdout text to see if a rule can be automated in the case of exit code 0 by checking for YES* or NO*.

    #!/bin/bash
    myname=$(basename "$0")
    subject="WHATEVER"
    check_str=$(is_automated.sh "${myname}" "${subject}")
    check_result=$?     # get the exit code also
    case ${check_result} in
       0) if [[ "${check_str}" == NO* ]]; then
             echo "No rule exists for this subject/script pair"
          else
             alert_server_key_to_use=$(echo "${check_str}" | awk '{print $2}')
             automation_script=$(echo "${check_str}" | awk '{print $3}')
             echo "Rule exists: automation script name is ${automation_script}"
          fi
          ;;
       1) echo "Missing required library file - logger message written"
          ;;
       2) echo "Errors in automation rule - critical alert written"
          ;;
       *) echo "Undocumented error code"
          ;;
    esac
    # done
    exit 0

10.6 Examples provided

In the alert toolkit samples directory there is a sample automation script using this library. This is the automation_filesize_check.sh (uses the automation_filesize_check.cfg) which is based upon the sample filesize check script.

This sample is the same as the normal one, with the exception of trying to run an automation rule when a file reaches a critical limit. It assumes that nobody took action when the warning alert was raised however many days ago, so it will take action at the critical level.

You should review this script for ideas on how to write your own.

Personally, I only automate what I consider critical 'should have been recovered' events. This is simply because I generally raise warning events prior to something becoming critical; a critical event should therefore only occur if nobody bothered to fix the warning.
Having said that, there are many critical events I also won't automate unless they would result in loss of service. It's always better to find out what caused a problem rather than just restarting something, or the problem will probably just re-occur and keep re-triggering the automation rule for no benefit.