What's the best way to monitor services?


What is the best monitoring system to use for service monitoring and failover? The physical servers I am working with host many virtual machines that provide a service my company would like to make highly available. I have been reading about Service Level Monitoring (SLM), and it looks promising if it can be configured to fail over to other servers. However, everything I have read suggests that it only provides monitoring that lets administrators view the servers' response times to clients.

Re: What's the best way to monitor services?

I think there are a number of questions contained in your one question.

First, a quick review of the two monitoring capabilities you might be interested in:

Health monitoring is used to determine if an individual node (back-end server) is functioning correctly or not:

  • Passive Health Monitoring checks that nodes respond when a user request is sent
  • Active Health Monitoring conducts synthetic transactions against a node periodically, and can perform sophisticated tests to make sure that a node is responding with correct data

More details here: Feature Brief: Health Monitoring in Stingray Traffic Manager
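
To make the passive/active distinction concrete, here is a minimal Python sketch. It is not Stingray's implementation: the `Node` class, the `MAX_FAILURES` threshold and the `probe` callable are all hypothetical names chosen for illustration. Passive monitoring would call `record_result` with the outcome of each real user request, while active monitoring calls `active_check` on a timer with a synthetic request and verifies the response content.

```python
from dataclasses import dataclass
from typing import Callable

MAX_FAILURES = 3  # hypothetical threshold, not a product default


@dataclass
class Node:
    name: str
    failures: int = 0   # consecutive failures seen so far
    healthy: bool = True


def record_result(node: Node, ok: bool) -> None:
    """Update node state from a real request (passive) or a probe (active)."""
    if ok:
        node.failures = 0
        node.healthy = True
    else:
        node.failures += 1
        if node.failures >= MAX_FAILURES:
            node.healthy = False


def active_check(node: Node, probe: Callable[[str], str], expected: str) -> bool:
    """Run a synthetic transaction and check that the response contains `expected`."""
    try:
        ok = expected in probe(node.name)
    except Exception:
        # A probe that cannot connect or times out counts as a failure.
        ok = False
    record_result(node, ok)
    return ok
```

A node marked unhealthy this way would simply be skipped by the load balancer until a later check succeeds again.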

Service Level Monitoring is used to determine if a node or service is running slowly. It watches transactions as they happen and gauges the level of service that users receive (based on response times), so that you can raise an alert or take remedial action within TrafficScript.

More details here: Feature Brief: Service Level Monitoring
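
As a rough illustration of the idea (again a sketch, not the product's code), a service level monitor can be thought of as a sliding window of recent response times compared against a target. The class name, the window size and the 80% default below are assumptions made for the example.

```python
from collections import deque


class ServiceLevelMonitor:
    """Tracks what percentage of recent requests met a response-time target."""

    def __init__(self, threshold_s: float, window: int = 100):
        self.threshold_s = threshold_s
        # Each entry is True if that request met the target; old entries drop off.
        self.samples = deque(maxlen=window)

    def record(self, response_time_s: float) -> None:
        self.samples.append(response_time_s <= self.threshold_s)

    def conforming(self) -> float:
        """Percentage of requests in the window that met the target."""
        if not self.samples:
            return 100.0  # no data yet: assume the service is fine
        return 100.0 * sum(self.samples) / len(self.samples)

    def is_ok(self, min_conforming: float = 80.0) -> bool:
        return self.conforming() >= min_conforming
```

The point to notice is that this measures real traffic as it passes through, rather than sending extra test requests the way an active health monitor does.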

Essentially, the monitoring and management of a "service" happens at a number of different levels. The health monitoring capability covers the availability and performance of the individual servers that make up your application.

If you want to take proactive "fixing" action in response to performance issues, then you're correct: this is where our SLM steps in. In its simplest usage, it can alert your operational staff to a performance issue.

However, if you then take this "trigger" into our scripting language, you can describe what you want to do when you see this drop-off in performance. For example, you could start to differentiate between different types of user/request/application, directing less important users elsewhere, or slowing down the rate at which they can make requests (for example, Dynamic rate shaping slow applications, and the other 'service level' examples here: Top Stingray Examples and Use Cases).
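
In the product this logic would be written in TrafficScript. The Python sketch below only illustrates the decision itself, and every name and number in it (the 80% target, the 10 requests/second ceiling, the `premium` flag) is an assumption for the example, not a recommended setting.

```python
from typing import Optional


def rate_limit_for(conforming_pct: float, premium: bool) -> Optional[float]:
    """Return a per-user limit in requests/second, or None for no limit.

    conforming_pct is the percentage of recent requests that met the
    response-time target, as reported by a service level monitor.
    """
    # Important users, and everyone while the service is healthy, are not limited.
    if premium or conforming_pct >= 80.0:
        return None
    # Tighten the limit as the service degrades: just under 10 rps near
    # the 80% target, never lower than 1 rps.
    return max(1.0, 10.0 * conforming_pct / 80.0)
```

For instance, at 40% conformance a non-premium user would be held to 5 requests per second, while premium users carry on unaffected.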

Alternatively, you may wish to provision more resources (assuming your infrastructure has this capability). In a virtual or cloud environment, you could use our Auto-Scaling feature (Feature Brief: Stingray's Autoscaling capability) to request (via an API) a new web/app server to be added to the pool of available machines.
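
The scaling decision itself can be sketched in a few lines of Python. This is a simplified illustration under assumed thresholds and pool limits, not how the Auto-Scaling feature is actually configured. The scale-down branch is my own addition for completeness, and the call to your virtualisation or cloud API that actually creates or destroys the server is left out.

```python
def desired_pool_size(
    current: int,
    conforming_pct: float,
    min_nodes: int = 1,
    max_nodes: int = 4,
    scale_up_below: float = 50.0,    # hypothetical threshold
    scale_down_above: float = 95.0,  # hypothetical threshold
) -> int:
    """Decide how many nodes the pool should have, one step at a time."""
    if conforming_pct < scale_up_below:
        # Service level is poor: ask for one more node, up to the cap.
        return min(current + 1, max_nodes)
    if conforming_pct > scale_down_above:
        # Service level is comfortably met: release one node, down to the floor.
        return max(current - 1, min_nodes)
    return current
```

Changing the pool by one node per decision, with a gap between the two thresholds, helps avoid the pool flapping up and down on short spikes.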