10. Automation Tools Provided

10.1 Overview

A sample automation directory is shipped with the application. It is simple, but it works. It is provided as a sample structure upon which you can build your own alerts; it is not a full-blown automation tool, although you can make it do anything a script can do.

It is a separate package and can be installed anywhere, but it is shipped under the alert toolkit directory as that is where I think it belongs. If you move it, update the config.txt file.

As long as the config.txt file points to the correct directory, all the alert toolkit scripts supplied will find these scripts correctly. The automation scripts themselves use paths relative to where they are run from to find their own configuration, so they will run correctly anywhere. The automation file tools_library will need to be customised, as it identifies to the automation scripts where the alert toolkit was installed.

10.2 Key files

It has the following key components

10.3 Requesting an event be automated

The very basic steps to automate an event are
  1. Add the rule to the automation_table file
  2. In your script, when the event is deemed to need automating, call actioncmd.sh with the name of the script making the automation request, the subject key, and any parameters you want passed to the script identified in the automation rule.

Note: when calling actioncmd.sh you must pass the name of the script making the call. One subject may be checked for different things by different specialised scripts, so using the scriptname as part of the key should be enough to make each event unique, as hopefully you use different scripts to run different types of checks on your servers. It also has the added benefit of letting you know what script calls each rule should your documentation get a little sloppy.

Ensure your rules are correct, as

10.4 The automation_table ruleset

The key file is in the automation directory itself, and is called automation_table. It must contain records in the form below (SPACE SEPARATED, not tabbed).
# automation_table
# A list of rules we expect to be coded for various alert threshold events
#
# All fields must be provided, and are
#   alert-key        : A unique message number key for action/error alerts
#   scriptname       : the name of the script that is checking/calling us
#   subject          : the subject that will need to be automated
#   user-needed      : the userid the automation is expected to run under
#   automation-script: the script to run for automation
#
# alert-key  scriptname                  subject                      user-needed  automation-script
AUTO_HTTPD   r030_filesize_check.sh      /var/log/httpd/access_log     root        httpd_rollover.sh
AUTO_HTTPD   r030_filesize_check.sh      /var/log/httpd/error_log      root        httpd_rollover.sh
AUTO_EMPTY   r030_filesize_check.sh      /var/log/daemons/errors       root        empty_out_file.sh
AUTO_HTTPERR r015_httpd_halt_check.sh    /var/log/httpd/error_log      apache      httpd_restart.sh

Note: I have personally standardised on starting every key with AUTO_ as this allows my daily exception reporting to check for automated events that occurred. What you actually use is up to you, but you should use a common standard.
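Because every rule must carry all five fields and use spaces rather than tabs, a quick sanity check of the file can catch a malformed rule before it is needed. The following is only a minimal sketch: check_automation_table is a hypothetical helper name and is not part of the shipped toolkit, and it assumes that lines starting with # are comments and that blank lines can be ignored.

```shell
#!/bin/bash
# Sketch only: check_automation_table is a hypothetical helper, not part of
# the shipped toolkit. It assumes lines starting with # are comments.
# Prints one message per bad line and returns 1 if any bad line was found.
check_automation_table() {
   awk '
      BEGIN { bad = 0 }
      /^ *#/ || /^ *$/ { next }                # skip comments and blank lines
      /\t/ { printf "line %d: contains a tab\n", NR; bad = 1; next }
      NF != 5 { printf "line %d: expected 5 fields, found %d\n", NR, NF; bad = 1 }
      END { exit bad }
   ' "$1"
}
```

It could be run as check_automation_table automation_table from within the automation directory after editing the rules. A subject containing spaces would be reported as having too many fields, which matches the space separated format described above.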

The fields in the file are...

10.5 The automation scripts in more detail

10.5.1 actioncmd.sh

This is the main interface script to the automation rules. When called from one of your site's health checking scripts it will
  1. if the tools_library cannot be found, writes a message to the system log, returns the text string NO to stdout and exits with exit code 1. Note: the error is written to the system log because without the tools_library the script has no way to raise an alert
  2. raises a warning alert saying automation is beginning
  3. checks that a matching rule exists, using is_automated.sh
    • if the exit code was 2, exits with exit code 1; is_automated.sh has already raised any alerts needed
    • if the stdout text was "NO" then there was no automation rule, so actioncmd.sh raises a critical alert to say so and exits with exit code 1
    • otherwise, it carries on with the following steps
  4. raises a warning alert indicating automation is in progress for the subject. This is simply because some automation rules may take a while to run and we want operations to know it is actually happening and not being ignored
  5. executes the automation script identified by the ruleset, passing it all the parameters that were passed to actioncmd.sh
  6. on completion of the automation script, checks the exit code
    • if the exit code was 0, cancels the earlier warning alert
    • if the exit code was not 0, replaces the warning alert with a critical alert saying automation failed
  7. always exits 0 at this point; it has worked OK, and if a rule failed a message was logged correctly indicating so.

Note that in the steps above the actioncmd.sh will always exit 0 if a perfectly good matching rule was found. This is the correct behaviour for my site as I see no point in embedding in my scripts a check to see if an automation step failed and getting the calling script to raise an alert, when if the automation script failed the actioncmd.sh script will already have done so on my behalf. I like to keep my code simple so things like this can be offloaded into tools.
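The flow of steps 2 to 7 can be sketched roughly as below. This is not the shipped actioncmd.sh, only an illustration of the documented behaviour: send_alert and lookup_rule are hypothetical stand-ins for the toolkit's alert interface and for is_automated.sh, request_automation is a made-up function name, and the tools_library check of step 1 is left out. The sketch also treats any answer that does not start with YES as "no rule", which is an assumption on my part.

```shell
#!/bin/bash
# Sketch only: follows the documented steps of actioncmd.sh, NOT the shipped script.
# send_alert and lookup_rule are hypothetical stand-ins for the toolkit's alert
# interface and for is_automated.sh; swap in the real calls at your site.
send_alert()  { echo "ALERT $*"; }
lookup_rule() { echo "NO"; return 0; }         # stub: pretend no rule exists

request_automation() {
   # $1 = calling script name, $2 = subject, everything is passed through
   send_alert warning "automation beginning for $2"                  # step 2
   lookup_rc=0
   rule=$(lookup_rule "$1" "$2") || lookup_rc=$?                     # step 3
   if [ "${lookup_rc}" -eq 2 ]; then
      return 1                                 # the lookup raised its own alert
   fi
   case "${rule}" in
      YES*) ;;                                 # usable rule, carry on
      *)    send_alert critical "no automation rule for $1 $2"
            return 1 ;;
   esac
   # the rule text is "YES automationkey automation-script"
   automation_script=$(echo "${rule}" | awk '{print $3}')
   send_alert warning "automation in progress for $2"                # step 4
   if "${automation_script}" "$@"; then                              # steps 5 and 6
      send_alert cancel "automation in progress for $2"
   else
      send_alert critical "automation failed for $2"
   fi
   return 0                                                          # step 7
}
```

The point to notice is the final return 0: once a usable rule has been found, the caller never sees the failure of the automation script itself, because the critical alert has already been raised on its behalf.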

A sample of how you would use this in your scripts is below...
#!/bin/bash
# if /var/log/httpd/access_log is over 1MB raise an alert
# if it's over 2MB take automated recovery action
# expected output format is as below...
# -rw-r--r--    1 root     root        94808 Jul 25  18:01 /var/log/httpd/access_log
myname=`basename "$0"`
keyname="FILESIZE_CHECK"   # example alert key, substitute the key your site uses
filename="/var/log/httpd/access_log"
testvar=`ls -la "${filename}"`
testsize=`echo "${testvar}" | awk '{print $5}'`
if [ "${testsize}" -gt 2000000 ];
then
   # only the first two parameters are required, but all parameters
   # passed to actioncmd.sh are also passed to the automation script
   # so we can provide addditional information.
   actioncmd.sh "${myname}" "${filename}" "${testsize}"
   # the rule above would presumably roll your logs and run a web trend report.
else
   if [ ${testsize} -gt 1000000 ];
   then
      raise_alert 9003 127.0.0.1 warning "${keyname}" "${filename} > 1000000"
   # else, no problem
   fi
fi
exit 0
Or if you use the supplied raise_alert.sh script
#!/bin/bash
# if /var/log/httpd/access_log is over 1MB raise a warning alert and
# ask raise_alert.sh to request automated recovery action
# expected output format is as below...
# -rw-r--r--    1 root     root        94808 Jul 25  18:01 /var/log/httpd/access_log
myname=`basename "$0"`
keyname="FILESIZE_CHECK"   # example alert key, substitute the key your site uses
filename="/var/log/httpd/access_log"
testvar=`ls -la "${filename}"`
testsize=`echo "${testvar}" | awk '{print $5}'`
if [ "${testsize}" -gt 1000000 ];
then
   # All parms after AUTOMATE are used by the automation script to be run if found
   raise_alert.sh "${keyname}" "${filename} > 1000000" "warning" AUTOMATE "${myname}" "${filename}" "${testsize}"
fi
exit 0

10.5.2 is_automated.sh

This is called by actioncmd.sh. It is a separate routine as you may wish to call it yourself to test for the existence of an automation rule. It will
  1. if the tools_library cannot be found, writes a message to the system log, returns the text string NO to stdout and exits with exit code 1. Note: the error is written to the system log because without the tools_library the script has no way to raise an alert
  2. if no automation_table file is found, raises a critical alert, returns the text string NO to stdout and exits with exit code 2
  3. if no matching automation rule is found, returns the text string NO to stdout and exits with exit code 0 (this is not an error)
  4. if an automation rule is found but the script the rule is configured to run does not exist, raises a critical alert, writes the text string NO to stdout and exits with exit code 2
  5. if an automation rule is found but the script was called by the wrong userid, raises a critical alert, writes the text string NO to stdout and exits with exit code 2
  6. if all is OK, exits with exit code 0 and writes the string
    YES automationkey automation-script
    to stdout for the caller to do something useful with.

As a general rule you wouldn't call this from your own scripts, as actioncmd.sh calls it on your behalf. However, there may be occasions where one of your own scripts wants to check whether a rule exists and take specialised action if it does not exist or does not pass the checks. Note, however, that is_automated.sh will still raise alerts as appropriate.
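To make the lookup concrete, here is a minimal sketch of how a rule could be matched on the scriptname and subject pair and answered in the documented YES/NO form. It is not the shipped is_automated.sh: find_rule is a hypothetical name, the optional fourth argument exists only so the userid check can be tried without switching users, and the alert raising and the check that the automation script exists are left out.

```shell
#!/bin/bash
# Sketch only: find_rule is a hypothetical name, not the shipped is_automated.sh.
# Usage: find_rule table-file scriptname subject [userid-to-check]
# Covers the table lookup and the userid check; raising alerts and checking
# that the automation script exists are left out.
find_rule() {
   table_file="$1"
   current_user="${4:-$(id -un)}"              # 4th argument is for testing only
   if [ ! -f "${table_file}" ]; then
      echo "NO"
      return 2                                 # no automation_table
   fi
   # first non-comment line whose scriptname and subject both match
   rule=$(awk -v want_script="$2" -v want_subject="$3" '
             $1 !~ /^#/ && $2 == want_script && $3 == want_subject {
                print $1, $4, $5
                exit
             }' "${table_file}")
   if [ -z "${rule}" ]; then
      echo "NO"
      return 0                                 # no rule is not an error
   fi
   read -r rule_key rule_user rule_script <<EOF
${rule}
EOF
   if [ "${rule_user}" != "${current_user}" ]; then
      echo "NO"
      return 2                                 # called by the wrong userid
   fi
   echo "YES ${rule_key} ${rule_script}"
   return 0
}
```

Matching on both the scriptname and the subject is what lets the same subject, such as /var/log/httpd/error_log in the sample table, carry different rules for different checking scripts.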

A sample of how you could use it in your scripts is below...
Remember:
Exit code 0 = OK, but may be no rule
Exit code 1 = missing library
Exit code 2 = bad rule found
Always check the stdout text to see if a rule can be automated in the case of exit code 0 by checking for YES* or NO*.
#!/bin/bash
myname=`basename "$0"`
subject="WHATEVER"
check_str=`is_automated.sh "${myname}" "${subject}"`
check_result=$?  # get the exit code also
case ${check_result} in
   0) if [ "${check_str}." = "NO." ];
      then
         echo "No rule exists for this subject/script pair"
      else
         # check_str is "YES automationkey automation-script"
         alert_server_key_to_use=`echo "${check_str}" | awk '{print $2}'`
         automation_script=`echo "${check_str}" | awk '{print $3}'`
         echo "Rule exists: automation script name is ${automation_script}"
      fi
      ;;
   1) echo "Missing required library file - logger message written"
      ;;
   2) echo "Errors in automation rule - critical alert written"
      ;;
   *) echo "Undocumented error code"
      ;;
esac
# done
exit 0
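If several of your scripts need the same check, the exit code and stdout tests could be folded into one small function. This is only a sketch under my own naming: interpret_rule_check is hypothetical and is not supplied with the toolkit. It takes the exit code and the stdout text from is_automated.sh, and prints the automation script name only when the text starts with YES.

```shell
#!/bin/bash
# Sketch only: interpret_rule_check is a hypothetical helper name.
# $1 = exit code from is_automated.sh, $2 = the text it wrote to stdout.
# Prints the automation script name and returns 0 only when a usable rule exists.
interpret_rule_check() {
   check_rc="$1"
   check_text="$2"
   if [ "${check_rc}" -ne 0 ]; then
      return 1                                 # 1 = missing library, 2 = bad rule
   fi
   case "${check_text}" in
      YES*) # text is "YES automationkey automation-script"
            read -r answer_flag answer_key answer_script <<EOF
${check_text}
EOF
            echo "${answer_script}"
            return 0 ;;
      *)    return 1 ;;                        # NO, or anything unexpected
   esac
}
```

A caller would capture check_str and check_result exactly as in the sample above and then pass them in as interpret_rule_check "${check_result}" "${check_str}".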

10.6 Examples provided

In the alert toolkit samples directory there is a sample automation script using this library. This is automation_filesize_check.sh (which uses automation_filesize_check.cfg), and it is based upon the sample filesize check script.

This sample is the same as the normal one, with the exception of trying to run an automation rule when a file reaches a critical limit. It assumes that nobody took action when the warning alert was raised however many days ago, so it will take action at the critical level.

You should review this script for ideas on how to write your own.

Personally, I only automate what I consider critical 'should have been recovered' events. This is simply because I generally raise warning events prior to something becoming critical; a critical event should therefore only occur if nobody bothered to fix the warning.
Having said that, there are many critical events I also won't automate unless they would result in loss of service. It's always better to find out what caused a problem rather than just restarting something, or the problem will probably just re-occur and keep re-triggering the automation rule for no benefit.

End of chapter 10