This article illustrates how you can create a custom 'program' action that is triggered when any one of a selected set of events is raised. The program then attempts the appropriate debugging or remedial steps for the problem associated with each event.
To help us create and debug the event handler, we'll first create a very simple debugging action. Go to the System -> Alerting -> Manage Actions page. Create a Program Action named 'Debug Problem', and configure it to call /bin/echo:
The program (/bin/echo) is passed two parameters by default: the name of the event type that triggered the action and information about the specific event reported within that event type. This will suffice for now - we will add more arguments later when we have finished writing the program.
Next, we create a set of events (an 'event type') that will trigger the action:
Go to System -> Alerting -> Manage Event Types and create a new event type called 'Problems to Debug'. You will be presented with a list of all the events that Traffic Manager can catch in a tree structure. Select the following events:
Save the event type by clicking 'Update'.
The next step is to configure Traffic Manager to trigger the action when one of the events in our event type occurs.
Go to the System -> Alerting page and select the 'Problems to Debug' event type from the drop-down box at the bottom of the page. The event type will appear in the list of mappings alongside a drop-down box containing a list of all the actions that have been configured. Select the 'Debug Problem' action from the list.
It would also be useful to receive a notification that some debug output has been produced, so select 'E-Mail' from the list of actions as well. Click 'Update' to save the changes and then, if you haven't already done so, configure the E-Mail action to use your mail server and e-mail address.
Currently the 'Debug Problem' action will not do anything useful when it is triggered, so we need to write a program for it to run. The code for this program is attached to this article.
The program examines the event information it receives and, for certain events, performs some debugging actions. The program determines which event it is handling by matching the primary tag (as presented in the 'Event Type' configuration list).
The Perl program looks for the 'nodefail' tag, then extracts the name of the node and its port from the message.
if( $message =~ /\tnodefail\t/ ) {
    # Extract the node name and port from the "nodes/<name>:<port>" field
    my( $node, $port ) = $message =~ /\tnodes\/(\S+):(\d+)\t/;
}
It then starts capturing traffic going between Traffic Manager and that node to see if there are any clues as to what is causing the failure. The node might, for example, be ignoring invalid requests from a particular client, thus causing the passive monitoring feature of Traffic Manager to mark it as failed.
`tcpdump -c 1000 -n -s 0 -i any -w $diagnostic_file host $node`;
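The extraction and capture steps can be sketched together as follows. This is a minimal illustration, not the attached program: the helper names `node_and_port` and `tcpdump_command`, the sample message text and the output path are all hypothetical, and narrowing the capture with `and port` is an optional variation on the command above. Passing the command to `system` as a list avoids shell interpolation of the node name.

```perl
use strict;
use warnings;

# Return (node, port) from an event message, or an empty list if the
# message has no "nodes/<name>:<port>" field.
sub node_and_port {
    my ($message) = @_;
    return $message =~ m{\tnodes/(\S+):(\d+)\t} ? ($1, $2) : ();
}

# Build the tcpdump argument list for capturing traffic to one node.
# The port filter is an optional refinement, not part of the article's command.
sub tcpdump_command {
    my ($file, $node, $port) = @_;
    my @cmd = ('tcpdump', '-c', 1000, '-n', '-s', 0, '-i', 'any',
               '-w', $file, 'host', $node);
    push @cmd, 'and', 'port', $port if defined $port;
    return @cmd;
}

# Hypothetical sample message; the text Traffic Manager produces may differ.
my $sample = "SERIOUS\tpools/web\tnodes/10.0.0.5:80\tnodefail\tNode has failed";

if ($sample =~ /\tnodefail\t/) {
    my ($node, $port) = node_and_port($sample);
    my @cmd = tcpdump_command('/tmp/nodefail.pcap', $node, $port);
    print "@cmd\n";    # a real handler would call system(@cmd) here
}
```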
The captured traffic is then sent to a different machine so it can be analysed.
`scp $diagnostic_file $scpuser\@$scpdest`;
The program uses scp to send the information, which usually requires a password to be entered to access the remote machine. Because scp is being invoked by the program there is no opportunity to enter a password. To get around this problem, you can configure scp to contact a particular remote machine without requiring a password. Alternatively, if no location is passed to the program, it will just write the files to a specific location on the Traffic Manager machine so you can access them manually.
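One way to express this choice between scp and a local copy is sketched below. The helper names, the fallback directory `/tmp/debug-problem` and the sample destination are assumptions made for illustration; the attached program may organise this differently. The sketch passes `-B` to scp, which selects batch mode so that scp fails instead of waiting for a password that nobody can type.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use File::Path qw(make_path);

# Hypothetical fallback directory on the Traffic Manager machine.
my $local_dir = '/tmp/debug-problem';

# Decide how a diagnostic file should be delivered.
# Returns ('scp', '-B', FILE, TARGET) or ('copy', FILE, LOCAL_PATH).
sub delivery_plan {
    my ($file, $scpuser, $scpdest) = @_;
    if (defined $scpdest && length $scpdest) {
        my $target = (defined $scpuser && length $scpuser)
            ? "$scpuser\@$scpdest"
            : $scpdest;
        return ('scp', '-B', $file, $target);
    }
    return ('copy', $file, "$local_dir/" . basename($file));
}

# Carry out the plan; returns true on success.
sub deliver {
    my ($how, @args) = delivery_plan(@_);
    return system('scp', @args) == 0 if $how eq 'scp';
    my ($src, $dst) = @args;
    make_path($local_dir);
    return copy($src, $dst);
}

# Show both plans without running them.
print join(' ', delivery_plan('/tmp/nodefail.pcap', 'diag', 'loghost:/var/tmp')), "\n";
print join(' ', delivery_plan('/tmp/nodefail.pcap', '', '')), "\n";
```

Password-free scp is usually arranged with an SSH key pair whose public key is authorised on the remote machine for the account named in 'scpuser'.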
If there is a problem with Traffic Manager, the program will create a technical support report that you can send to the support team should you need further assistance with the problem. Information about the specific problem that occurred in the software will be sent in the notification e-mail that we configured earlier.
`$ENV{ZEUSHOME}/zxtm/bin/support-report $diagnostic_file`;
If Traffic Manager detects that it is running low on free file descriptors, the program will obtain information about current memory usage, disk usage, active connections and file descriptor settings.
`ulimit -a >> $diagnostic_file`;
`vmstat -s >> $diagnostic_file`;
`df -h >> $diagnostic_file`;
`netstat -an >> $diagnostic_file`;
By examining this information, you should be able to determine why the system is running low on file descriptors. Often it is because the maximum number of file descriptors (as reported by ulimit) is too low, though it could also be caused by the system running out of memory or disk space or there simply being an abnormally high number of active connections.
Finally, if service level monitoring (SLM) fails, the program is triggered with the 'slmnodeinfo' event, which identifies the nodes that contributed to the SLM failure. In this case, the program logs on to the nodes in question and obtains information about the running processes to see what is going wrong. To do this it uses rsh, so you need the appropriate permissions configured in the '.rhosts' file on each node to allow the machine running Traffic Manager to access it without a password.
`rsh -l $rshuser $node "ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=-pcpu" >> $diagnostic_file`;
`rsh -l $rshuser $node "vmstat -s" >> $diagnostic_file`;
The program also looks out for a 'testaction' event, which is reported when you use the 'Update and Test' button on the action page. We will use this later to make sure the program is working correctly and copies the debug output to the correct location.
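The tag matching described above can be pictured as a small lookup table. The sketch below is illustrative only: it covers just the three tags named in this article, the helper name `known_tag` and the sample message are hypothetical, and it compares whole tab-separated fields rather than using the regular expression shown earlier.

```perl
use strict;
use warnings;

# Tags named in this article, mapped to a short description of the response.
my %responses = (
    nodefail    => 'capture traffic between Traffic Manager and the node',
    slmnodeinfo => 'collect process and memory information from the nodes',
    testaction  => 'write test-event.txt and deliver it',
);

# Return the first known tag among the tab-separated fields, or undef.
sub known_tag {
    my ($message) = @_;
    for my $field (split /\t/, $message) {
        return $field if exists $responses{$field};
    }
    return undef;
}

# Hypothetical sample message; the real text may differ.
my $message = "INFO\ttestaction\tTest event generated from the admin interface";
my $tag = known_tag($message);
print defined $tag ? "$tag: $responses{$tag}\n" : "no handler for this event\n";
```

A table like this keeps each event's handling in one place, so supporting a further event means adding one entry rather than another branch.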
We can now configure the 'Debug Problem' action to use the correct program. Upload the program to Traffic Manager's Action Programs catalog (in the 'Extra Files' section).
Go to System -> Alerting -> Manage Actions, and edit the Debug Problem action; change the program from 'Custom...' to the program you just uploaded.
You will have noticed that the program takes several arguments beyond the event information. These arguments include the location to which files should be sent and the scp and rsh usernames to use when connecting to remote machines. You can use the 'Argument Descriptions' section of the page to configure the action to supply these arguments. After expanding the Argument Descriptions section, enter 'rshuser' into the name box and 'Username used to log on to failing nodes' in the description box. Click 'Update' and then add the remaining arguments, scpuser and scpdest, in the same way.
The arguments will appear in the 'Additional Settings' section where you can configure them with the appropriate values for your system. Click 'Update' to save the configuration and scroll down to the Additional Settings section again. The command that will be executed when the action is triggered is shown at the bottom of this section:
It would also be helpful to enable 'Verbose' mode on the action at this point so any problems that occur are reported in the Event Log.
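Inside the program, the configured arguments could be read along the following lines. This sketch assumes that each argument arrives in `--name=value` form (including an `--eventtype` option) and that the event message is the remaining argument; check the command displayed in the Additional Settings section to confirm the exact form before relying on it. The helper name `parse_action_args` is hypothetical.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse the action's command line into a hash.
# Assumes --name=value options followed by the event message.
sub parse_action_args {
    my @argv = @_;
    my %opt = (eventtype => '', rshuser => '', scpuser => '', scpdest => '');
    GetOptionsFromArray(
        \@argv,
        'eventtype=s' => \$opt{eventtype},
        'rshuser=s'   => \$opt{rshuser},
        'scpuser=s'   => \$opt{scpuser},
        'scpdest=s'   => \$opt{scpdest},
    ) or die "Could not parse action arguments\n";
    # Whatever is left after the options is treated as the event message.
    $opt{message} = join ' ', @argv;
    return %opt;
}

my %opt = parse_action_args(@ARGV);
print "$_=$opt{$_}\n" for sort keys %opt;
```

Leaving 'scpdest' empty in this scheme corresponds to the case where the files stay on the Traffic Manager machine.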
To test the program, click 'Update and Test' on the Debug Problem action's page. You should then find a file called 'test-event.txt' in the location you entered for the 'scpdest' argument. If not, double-check that you can use scp to copy files from the Traffic Manager machine to that location without any user interaction.
If the file did arrive, you will receive additional debugging information whenever any of the events in the 'Problems to Debug' event type occurs.