A sample automation directory is shipped with the application. It is simple, but it works. It is provided as a sample structure upon which you can build your own alerts; it is not a full-blown automation tool, although you can make it do anything a script can do.
It is a separate package and can be installed anywhere, but it is shipped under the alert toolkit directory as that's where I think it belongs. If you move it, update the config.txt file.
As long as the config.txt file points to the correct directory, all the supplied alert toolkit scripts will find these scripts correctly. The automation scripts themselves use paths relative to where they are run from to find their own configuration, so they will run correctly anywhere. The automation file tools_library will need to be customised, as that tells the automation scripts where the alert toolkit was installed.
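As a rough illustration of the relative-path idea, here is a minimal sketch of one common way a script can locate files that live beside it. It assumes "where they are run from" means the directory holding the script; if it means the current working directory, a plain relative path would do instead. The function name script_dir is hypothetical and is not part of the shipped package.

```shell
#!/bin/bash
# Hypothetical helper (not part of the shipped package):
# print the absolute directory that holds the given script path.
script_dir() {
    # Run in a subshell so the caller's working directory is untouched:
    # change into the directory part of the path and print where we are.
    ( cd "$(dirname "$1")" && pwd )
}

# Possible use inside an automation script (an assumption, shown as
# comments only; how tools_library is actually loaded is not documented
# here):
#   mydir=$(script_dir "$0")
#   echo "my configuration lives under ${mydir}"
```

Because the directory is worked out at run time, nothing in the script needs editing when the package is moved.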
It has the following key components
The very basic steps to automate an event are
Note: when calling actioncmd.sh you must pass the name of the script making the call. One subject may be checked for different things by different specialised scripts; using the script name as part of the key should be enough to make each event unique, as hopefully you use different scripts to run different types of checks on your servers. It also has the added benefit of letting you know which script calls each rule should your documentation get a little sloppy.
Ensure your rules are correct, as
The key file is in the automation directory itself and is called automation_table. It must contain records in the form below (SPACE SEPARATED, not tabbed).
# automation_table
# A list of rules we expect to be coded for various alert threshold events
#
# All fields must be provided, and are
#   alert-key        : A unique message number key for action/error alerts
#   scriptname       : the name of the script that is checking/calling us
#   subject          : the subject that will need to be automated
#   user-needed      : the userid the automation is expected to run under
#   automation-script: the script to run for automation
#
# alert-key scriptname subject user-needed automation-script
AUTO_HTTPD r030_filesize_check.sh /var/log/httpd/access_log root httpd_rollover.sh
AUTO_HTTPD r030_filesize_check.sh /var/log/httpd/error_log root httpd_rollover.sh
AUTO_EMPTY r030_filesize_check.sh /var/log/daemons/errors root empty_out_file.sh
AUTO_HTTPERR r015_httpd_halt_check.sh /var/log/httpd/error_log apache httpd_restart.sh
Note: I have personally standardised on starting every key with AUTO_, as this allows me to add checks for automated events that occurred to my daily exception reporting. What you actually use is up to you, but you should use a common standard.
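Since the table must be space separated with all five fields present, a quick sanity check before relying on it can save some head scratching. The following is a minimal sketch only; check_table is a hypothetical helper, not something shipped with the toolkit, and the duplicate test assumes that the scriptname/subject pair is meant to identify a rule uniquely.

```shell
#!/bin/bash
# check_table FILE (hypothetical helper, not part of the toolkit)
# Reports tabs, wrong field counts and repeated scriptname/subject pairs.
# Returns 0 if the file looks clean, 1 otherwise.
check_table() {
    awk '
        # skip comment lines and blank lines
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        # the table must be space separated, not tabbed
        /\t/ {
            printf "line %d: contains a tab\n", NR
            bad = 1
            next
        }
        # alert-key scriptname subject user-needed automation-script
        NF != 5 {
            printf "line %d: expected 5 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            # assumption: one rule per scriptname/subject pair
            pair = $2 " " $3
            if (pair in seen) {
                printf "line %d: duplicate rule for %s\n", NR, pair
                bad = 1
            }
            seen[pair] = NR
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running something like check_table automation_table after each edit would print one line per problem found and return non-zero if there were any.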
The fields in the file are...
This is the main interface script to the automation rules. When called from one of your site's health checking scripts it will
Note that in the steps above actioncmd.sh will always exit 0 if a valid matching rule was found. This is the correct behaviour for my site: I see no point in embedding in my scripts a check to see whether an automation step failed and having the calling script raise an alert, because if the automation script fails, actioncmd.sh will already have raised one on my behalf. I like to keep my code simple, so things like this are offloaded into tools.
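The "always exit 0 and raise the alert myself" behaviour described above can be pictured with a small sketch. This is not the shipped actioncmd.sh; run_rule and raise_failure_alert are hypothetical names, and the alert here is just a line written to stderr standing in for whatever your site uses to raise alerts.

```shell
#!/bin/bash
# Hypothetical stand-in for the site's real alerting mechanism.
raise_failure_alert() {
    echo "ALERT: automation script $1 failed with exit code $2" >&2
}

# run_rule SCRIPT [ARGS...] (hypothetical, not the shipped actioncmd.sh)
# Runs the automation script, raises an alert itself if the script
# fails, and always returns 0 so the caller has nothing to check.
run_rule() {
    local script="$1"
    shift
    local rc=0
    # remember the exit code without letting a failure stop us
    "${script}" "$@" || rc=$?
    if [ "${rc}" -ne 0 ]; then
        raise_failure_alert "${script}" "${rc}"
    fi
    return 0
}
```

The design choice is that failure handling lives in one place, so every health checking script that calls the wrapper stays short.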
A sample of how you would use this in your scripts is below...
#!/bin/bash
# if /var/log/httpd/access_log is over 1MB raise an alert
# if it's over 2MB take automated recovery action
# expected output format is as below...
# -rw-r--r-- 1 root root 94808 Jul 25 18:01 /var/log/httpd/access_log
myname=`basename $0`
filename="/var/log/httpd/access_log"
testvar=`ls -la ${filename}`
testsize=`echo "${testvar}" | awk '{print $5}'`
if [ ${testsize} -gt 2000000 ]; then
   # only the first two parameters are required, but all parameters
   # passed to actioncmd.sh are also passed to the automation script
   # so we can provide additional information.
   actioncmd.sh "${myname}" "${filename}" "${testsize}"
   # the rule above would presumably roll your logs and run a web trend report.
else
   if [ ${testsize} -gt 1000000 ]; then
      # NOTE: keyname is not set anywhere in this sample; set it to your own alert key first
      raise_alert 9003 127.0.0.1 warning "${keyname}" "${filename} > 1000000"
   # else, no problem
   fi
fi
exit 0
#!/bin/bash
# if /var/log/httpd/access_log is over 1MB raise an alert
# if it's over 2MB take automated recovery action
# expected output format is as below...
# -rw-r--r-- 1 root root 94808 Jul 25 18:01 /var/log/httpd/access_log
myname=`basename $0`
filename="/var/log/httpd/access_log"
testvar=`ls -la ${filename}`
testsize=`echo "${testvar}" | awk '{print $5}'`
if [ ${testsize} -gt 1000000 ]; then
   # All parms after AUTOMATE are used by the automation script to be run if found
   # NOTE: keyname is not set anywhere in this sample; set it to your own alert key first
   raise_alert.sh "${keyname}" "${filename} > 1000000" "warning" AUTOMATE ${myname} ${filename} ${testsize}
fi
exit 0
This is called by actioncmd.sh. It is a separate routine, as you may wish to call it yourself to test for the existence of an automation rule. It will
As a general rule you wouldn't call this from your own scripts, as actioncmd.sh calls it on your behalf. However, there may be occasions where one of your own scripts needs to check whether a rule exists and take specialised action if it does not exist or does not pass the checks. Note that it will still raise alerts as appropriate for you.
A sample of how you could use it in your scripts is below...
Remember:
Exit code 0 = OK, but may be no rule
Exit code 1 = missing library
Exit code 2 = bad rule found
Always check the stdout text to see if a rule can be automated in the
case of exit code 0 by checking for YES* or NO*.
#!/bin/bash
myname=`basename $0`
subject="WHATEVER"
check_str=`is_automated ${myname} ${subject}`
check_result=$?    # get the exit code also
case ${check_result} in
   0) if [ "${check_str}." = "NO." ]; then
         echo "No rule exists for this subject/script pair"
      else
         alert_server_key_to_use=`echo "${check_str}" | awk '{print $2}'`
         automation_script=`echo "${check_str}" | awk '{print $3}'`
         echo "Rule exists: automation script name is ${automation_script}"
      fi
      ;;
   1) echo "Missing required library file - logger message written"
      ;;
   2) echo "Errors in automation rule - critical alert written"
      ;;
   *) echo "Undocumented error code"
      ;;
esac
# done
exit 0
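To make the YES/NO reply easier to picture, here is a minimal sketch of the kind of lookup such a routine might perform against automation_table. It is not the shipped code. The function name lookup_rule is hypothetical, and the reply layout of "YES alert-key automation-script" is only inferred from the way the sample above picks out fields 2 and 3; the real routine also performs checks and raises alerts, which this sketch does not.

```shell
#!/bin/bash
# lookup_rule TABLE SCRIPTNAME SUBJECT (hypothetical, not the shipped routine)
# Prints "YES <alert-key> <automation-script>" for the first rule whose
# scriptname and subject both match, otherwise prints "NO".
lookup_rule() {
    awk -v script="$2" -v subject="$3" '
        # skip comment lines
        /^[ \t]*#/ { next }
        # fields: alert-key scriptname subject user-needed automation-script
        $2 == script && $3 == subject {
            print "YES", $1, $5
            found = 1
            exit
        }
        END { if (!found) print "NO" }
    ' "$1"
}

# Possible use (assumes automation_table is in the current directory):
#   lookup_rule automation_table r030_filesize_check.sh /var/log/httpd/access_log
```

Matching on both the script name and the subject is what lets the same log file carry different rules for different checking scripts.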
In the alert toolkit samples directory there is a sample automation script using this library: automation_filesize_check.sh (which uses automation_filesize_check.cfg), based upon the sample filesize check script.
This sample is the same as the normal one, except that it tries to run an automation rule when a file reaches a critical limit. It assumes that nobody took action when the warning alert was raised, however many days ago, so it takes action at the critical level.
You should review this script for ideas on how to write your own.
Personally, I only automate what I consider critical 'should have been
recovered' events. This is simply because I generally raise warning events
prior to something becoming critical; a critical event should therefore only
occur if nobody bothered to fix the warning.
Having said that, there are many critical events I also won't automate unless
they would result in loss of service; it's always better to find out what
caused a problem rather than just restarting something, or the problem will
probably just recur and keep re-triggering the automation rule for no
benefit.