The Stingray Product Documentation (user manual) describes how to configure and use Stingray's Service Level Monitoring capability. This article summarizes how the SLM statistics are calculated so that you can better appreciate what they mean and how they can be used to detect service level problems.
Service Level Monitoring classes are used to determine how well a service is meeting a response-time-based service level. One key goal of the implementation of Service Level Monitoring is that the data it measures is not affected by the performance or reliability of the remote clients; as far as possible, the SLM class gives an accurate measure of performance and reliability that measures only the factors that the application administrator has control over.
By default, connections are not measured using Service Level Monitoring classes. Typically, an administrator would select the types of connections he or she is interested in (for example, just requests for .asp resources) and assign those to a service level monitoring class:
connection.setServiceLevelClass( "My SLM Class Name" );
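For example, a request rule might assign only requests for .asp resources to the class (a sketch; the class name is illustrative):

# Only time requests for .asp resources against the SLM class
$path = http.getPath();
if( string.endsWith( $path, ".asp" ) ) {
   connection.setServiceLevelClass( "My SLM Class Name" );
}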
A virtual server may also be configured with a 'default' service level class, so that all of the connections it manages are timed using that class.
Stingray starts a timer for each connection when it receives the request from the remote client. The timer is stopped when the first data is received from the server, or when the connection is unexpectedly closed (perhaps due to a client or server timeout).
The high-resolution timer measures the time taken to run any TrafficScript request rules (including the delay if these rules perform a blocking action such as communicating with an external server), the time taken to read any additional request data (such as HTTP body data), and the time taken to connect to a node, write the request, and read the first response data.
When the timer is stopped, Stingray checks to see if a Service Level Class was assigned to this connection. If so, the elapsed time is recorded against the SLM class, and the per-second 'max' and 'min' response times managed by the class are updated if necessary. Stingray also maintains a per-second count of how many requests conformed and how many failed to conform.
Note: connections which close unexpectedly before the 'conformance' time limit are disregarded completely because they never completed. Connections which close unexpectedly after the 'conformance' time limit has passed are counted as non-conforming, and the elapsed time is counted towards the performance of the SLM class.
For each service level class, Stingray maintains a list of the last 10 seconds' worth of data in a rolling fashion - min, max and average response times, and the numbers conforming and non-conforming. When asked for the percentage conforming for the SLM class, Stingray sums the results from the last 10 seconds.
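For illustration (with made-up numbers): if the last 10 seconds of data contained 950 conforming and 50 non-conforming requests, the class would report 950 / (950 + 50) = 95% conformance.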
Note: Stingray commonly runs in multi-process mode (one process per CPU core). In that case, each child process counts SLM data in a shared memory segment, so the results should be consistent no matter which process handles a given connection.
Note: When running in a cluster, Stingray automatically shares and merges SLM data from other members of the cluster. There may be a slight time-delay in the state sharing, so the SLM calculations from different cluster members running in active-active mode may be slightly inconsistent. If the cluster has only one active traffic manager for a given SLM class, the passive traffic managers will be able to 'see' the SLM statistics, but they may be delayed by a second or so.
SLM class data may be used in a variety of ways. You can configure 'warning' and 'serious' conformance levels, and Stingray will log whenever the class transitions between the 'ok', 'warning' and 'serious' states. The transition can also trigger an event using Stingray's Event Handling capability, and you can assign custom actions to these events.
You can inspect the state of an SLM class in TrafficScript to return an error message or return different content when a service begins to underperform. Service Level Monitoring is a key measurement tool when determining whether or not to prioritize traffic.
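As a sketch, a request rule could check the conformance of the class and shed low-priority traffic when the service underperforms (the 70% threshold and class name here are illustrative):

# If fewer than 70% of recent requests met the response-time target,
# return an error page instead of forwarding the request
if( slm.conforming( "My SLM Class Name" ) < 70 ) {
   http.sendResponse( "503 Service Unavailable", "text/plain",
                      "Service is busy, please try again later", "" );
}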