Traffic Manager's autoscaling capability helps you dynamically control the resources that a service uses, so that you can deliver the service to a desired SLA while minimizing cost. With this feature you can:
Define the desired SLA for a service (based on response time of nodes in a pool)
Define the minimum number of nodes needed to deliver the service (e.g. 2 nodes for fault-tolerance reasons)
Define the maximum number of nodes (acting as a brake – this limits how many resources Traffic Manager will deploy in the event of a denial-of-service attack, traffic surge, application fault, etc.)
You also need to configure Traffic Manager to deploy instances of the nodes, typically from a template in Amazon, Rackspace or VMware.
You can then leave Traffic Manager to provision the nodes, and to dynamically scale the number of nodes up or down to minimize the cost (number of nodes) while preserving the SLA.
Autoscaling is a property of a pool.
A pool contains a set of ‘nodes’ – back-end servers that provide a service on an IP address and port. All of the nodes in a pool provide the same service. Autoscaling monitors the service level (i.e. response time) delivered by a pool. It adds nodes when too few transactions meet the desired SLA and removes nodes when the pool has capacity to spare, so that the SLA is met at the lowest cost.
The feature consists of a monitoring and decision engine, and a collection of driver scripts that interface with the relevant platform.
The decision engine
The decision engine monitors the response time from the pool. Configure it with the desired SLA, and the scale-up/scale-down thresholds.
Example: my SLA is 1000 ms. I want to scale up (add nodes) if fewer than 40% of transactions are completed within this SLA, and scale down (remove nodes) if more than 95% of transactions are completed within the SLA. To avoid flip-flopping, I want to wait for 20 seconds before initiating the change (in case the problem is transient and goes away), and I want to wait 180 seconds before considering another change.
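The threshold logic in this example can be sketched in a few lines of Python. This is only an illustration of the comparison, not Traffic Manager's actual implementation: the function name, its parameters and the idea of judging one window of response-time samples are all assumptions, and the 20-second and 180-second timers are not modelled.

```python
from typing import Sequence


def scaling_decision(response_times_ms: Sequence[float],
                     sla_ms: float = 1000.0,
                     scale_up_below_pct: float = 40.0,
                     scale_down_above_pct: float = 95.0) -> str:
    """Return 'scale-up', 'scale-down' or 'hold' for one window of samples.

    Hypothetical helper: the names and the windowing are assumptions made
    for illustration only.
    """
    if not response_times_ms:
        return "hold"  # no traffic observed, so nothing to judge

    # Percentage of transactions that completed within the SLA.
    within_sla = sum(1 for t in response_times_ms if t <= sla_ms)
    conforming_pct = 100.0 * within_sla / len(response_times_ms)

    if conforming_pct < scale_up_below_pct:
        return "scale-up"      # too few transactions meet the SLA
    if conforming_pct > scale_down_above_pct:
        return "scale-down"    # comfortably inside the SLA, capacity to spare
    return "hold"
```

With the example values, a window in which only one transaction in four finishes within 1000 ms would yield 'scale-up', while a window in which 99% finish in time would yield 'scale-down'.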
Other parameters control the minimum and maximum number of nodes in a pool, and how we access the service on new nodes:
Traffic Manager includes drivers for Amazon EC2, Rackspace and VMware vSphere. You will need to configure a set of ‘cloud credentials’ (authentication details for the management API for the virtual platform):
You'll also need to specify the details of the virtual machine template that instantiates the service in the pool:
The decision engine initiates a scale-up or scale-down action by invoking the driver with the configured credentials and parameters. The driver instructs the virtualization layer to deploy or terminate a virtual machine. Once the action is complete, the driver returns the new list of nodes in the pool and the decision engine updates the pool configuration.
You can manually provision nodes by editing the max-nodes and min-nodes settings in the pool. If Traffic Manager notices that there is a mismatch between the max/min and the actual number of nodes active, then it will initiate a series of scale-up or scale-down actions.
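As a rough sketch of that reconciliation, the function below works out which actions would bring the active node count back inside the configured limits. The function name and the idea of returning a list of action labels are assumptions for illustration; the real decision engine's internals are not described here.

```python
from typing import List


def pending_actions(active_nodes: int, min_nodes: int, max_nodes: int) -> List[str]:
    """List the scale actions needed to bring active_nodes into [min, max].

    Hypothetical helper; one list entry per node to add or remove.
    """
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    if active_nodes < min_nodes:
        return ["scale-up"] * (min_nodes - active_nodes)
    if active_nodes > max_nodes:
        return ["scale-down"] * (active_nodes - max_nodes)
    return []  # already within the configured limits
```

For example, raising min-nodes to 4 on a pool with 2 active nodes would produce two scale-up actions in this sketch.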
Creating a custom driver for a new platform
You can create a custom driver for any platform that is capable of deploying new service instances on demand. Creating a new driver involves:
Create the driver script, which must conform to the API below
Upload the script to the Extra Files -> Miscellaneous store using the UI (or copy to $ZEUSHOME/zxtm/conf/extra)
Create a Credentials object that contains the user IDs, passwords, etc. necessary to talk to the cloud platform:
Configure the pool to autoscale, and provide the details of the virtual machine that should be provisioned:
Specification of the driver scripts
The settings in the UI are interpreted by the Cloud API script. Traffic Manager will invoke this script and pass the details in. Use the $ZEUSHOME/zxtm/bin/rackspace.pl or vsphere-client scripts as examples (the $ZEUSHOME/zxtm/bin/awstool script is multi-purpose and is also used by Traffic Manager's handling of EC2 EIPs for failover).
The scripts should support several actions: status, createnode, destroynode, listimageids and listsizeids. Run a script with --help to see its options:
    :/opt/zeus/zxtm/bin# ./rackspace.pl --help
    Usage: ./rackspace.pl [--help] action options
    common options: --verbose=1 --cloudcreds=name
    other valid options depend on the chosen action:
    status:      --deltasince=tstamp  Only report changes since timestamp tstamp (unix time)
    createnode:  --name=newname       Associate name newname (must be unique) with the new instance
                 --imageid=i_id       Create an instance of image uniquely identified by i_id
                 --sizeid=s_id        Create an instance with size uniquely identified by s_id
    destroynode: --id=oldid           destroy instance uniquely identified by oldid
Note: The '--deltasince' option isn't supported by many cloud APIs, but has been added for Rackspace. If the cloud API in question supports reporting only changes since a given date/time, the option should be implemented.
The value of the --name option will be chosen by the autoscaler on the basis of the 'autoscale!name': a different integer will be appended to the name for each node.
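A custom driver needs to accept the same command line. The following is a minimal sketch of the argument handling in Python using the standard argparse module; it mirrors the actions and options listed in the usage text above, but the function names are hypothetical, and treating --name, --imageid, --sizeid and --id as mandatory for their actions is an assumption rather than documented behaviour.

```python
import argparse

# Actions the autoscaler is described as invoking.
ACTIONS = ("status", "createnode", "destroynode", "listimageids", "listsizeids")

# Assumption: these options are mandatory for the given action.
REQUIRED = {
    "createnode": ("name", "imageid", "sizeid"),
    "destroynode": ("id",),
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of an autoscaling driver CLI")
    parser.add_argument("action", choices=ACTIONS)
    # Common options
    parser.add_argument("--verbose", type=int, default=0)
    parser.add_argument("--cloudcreds")
    # status
    parser.add_argument("--deltasince", type=int, help="unix timestamp")
    # createnode
    parser.add_argument("--name")
    parser.add_argument("--imageid")
    parser.add_argument("--sizeid")
    # destroynode
    parser.add_argument("--id")
    return parser


def parse_driver_args(argv):
    """Parse argv (without the program name) and check per-action options."""
    parser = build_parser()
    args = parser.parse_args(argv)
    missing = [opt for opt in REQUIRED.get(args.action, ())
               if getattr(args, opt) is None]
    if missing:
        # parser.error prints a usage message to stderr and exits with status 2.
        parser.error("%s requires --%s" % (args.action, ", --".join(missing)))
    return args
```

A real driver would call parse_driver_args(sys.argv[1:]) and then dispatch on args.action to the platform's management API.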
The script should return a JSON-formatted response for each action:
version and code must be JSON integers in decimal notation; sizeid, uniq_id, imageid can be decimal integers or strings.
name must be a string. Some clouds do not give every instance a name; in this case it should be left out or be set to the empty string. The autoscaler process will then infer the relevance of a node for a pool on the basis of the imageid (must match 'autoscale!imageid' in the pool's configuration).
created is the unix time stamp of when the node was created and hence must be a decimal integer. When the autoscaler destroys nodes, it will try to destroy the oldest node first. Some clouds do not provide this information; in this case it should be set to zero.
complete must be a decimal integer indicating the percentage of progress when a node is created.
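The field rules above can be checked mechanically before a driver emits its response. The sketch below assumes a node is represented as a Python dict carrying all of these fields, and that 'complete' lies between 0 and 100; both are assumptions for illustration, and the enclosing response envelope is deliberately not modelled. It also shows one way of ordering nodes oldest first; note that nodes reporting created as zero sort first here, which may differ from what the autoscaler actually does.

```python
from typing import Any, Dict, List


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def node_field_errors(node: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one node record (empty if none)."""
    errors = []
    for key in ("sizeid", "uniq_id", "imageid"):
        value = node.get(key)
        if not (_is_int(value) or isinstance(value, str)):
            errors.append("%s must be an integer or a string" % key)
    # name may be omitted or empty when the cloud does not name instances.
    if not isinstance(node.get("name", ""), str):
        errors.append("name must be a string")
    created = node.get("created")
    if not _is_int(created) or created < 0:
        errors.append("created must be a unix timestamp (0 if unknown)")
    complete = node.get("complete")
    if not _is_int(complete) or not 0 <= complete <= 100:
        errors.append("complete must be an integer percentage")
    return errors


def oldest_first(nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order nodes by creation time, oldest first (created == 0 sorts first)."""
    return sorted(nodes, key=lambda n: n["created"])
```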
A response code of 304 to a 'status' request with a '--deltasince' option is interpreted as 'no change from last status request'.
The response is a JSON-formatted object as follows:
The 202 corresponds to the HTTP response code 'Accepted'.
The response is a JSON-formatted object as follows:
The autoscaling driver script should communicate error conditions using response codes >= 400 and/or by writing output to stderr. When the autoscaler detects an error from an API script, it disables autoscaling for all pools using the Cloud Credentials in question until an API call using those Cloud Credentials is successful.
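The effect of this rule can be modelled with a small state tracker. The class below is a hypothetical sketch, not the autoscaler's code: it treats a call as failed if the response code is 400 or above or if anything was written to stderr, and re-enables the credentials on the next successful call.

```python
class CredentialHealth:
    """Track which Cloud Credentials currently have autoscaling disabled.

    Hypothetical model of the behaviour described above; the class and
    method names are assumptions for illustration.
    """

    def __init__(self):
        self._failed = set()

    def record_result(self, cloudcreds: str, code: int, stderr_text: str = "") -> None:
        """Record the outcome of one driver call made with cloudcreds."""
        if code >= 400 or stderr_text.strip():
            self._failed.add(cloudcreds)       # error: disable until next success
        else:
            self._failed.discard(cloudcreds)   # success: re-enable

    def autoscaling_enabled(self, cloudcreds: str) -> bool:
        return cloudcreds not in self._failed
```

Because the state is keyed by credentials rather than by pool, a single failed call affects every pool that shares those credentials, matching the behaviour described above.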