This document describes some operating system tunables you may wish to apply to a production Traffic Manager instance. Note that the kernel tunables only apply to Traffic Manager software installed on a customer-provided Linux instance; they do not apply to the Traffic Manager Virtual Appliance or Cloud instances. Consider the tuning techniques in this document when:
For more information on performance tuning, start with the Tuning Pulse Virtual Traffic Manager article.
Most modern Linux distributions have sufficiently large defaults, and many tables are autosized and growable, so it is often not necessary to change these tunings. The values below are recommended for typical deployments on a medium-to-large server (8 cores, 4 GB RAM).
Note: Tech tip: How to apply kernel tunings on Linux
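Settings written directly to /proc/sys (as in the examples below) take effect immediately but are lost on reboot. As a simple example of making a change persistent, on most distributions you can add the equivalent sysctl key to /etc/sysctl.conf and reload it (shown here with the file descriptor limit discussed below):

# echo "fs.file-max = 2097152" >> /etc/sysctl.conf

# sysctl -p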
# echo 2097152 > /proc/sys/fs/file-max
Set a minimum of one million file descriptors unless resources are seriously constrained. See also the setting maxfds below.
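To check how close you are to this limit, /proc/sys/fs/file-nr reports the number of allocated file handles, the number of free handles, and the maximum:

# cat /proc/sys/fs/file-nr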
# echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range
# echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
Each TCP and UDP connection from Traffic Manager to a back-end server consumes an ephemeral port, and that port is retained for the ‘fin_timeout’ period once the connection is closed. If back-end connections are frequently created and closed, it’s possible to exhaust the supply of ephemeral ports. Increase the port range to the maximum (as above) and reduce the fin_timeout to 30 seconds if necessary.
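To gauge whether ephemeral ports are under pressure, check the configured range and the socket summary (ss -s includes a count of connections in the timewait state):

# cat /proc/sys/net/ipv4/ip_local_port_range

# ss -s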
# echo 1 > /proc/sys/net/ipv4/tcp_syncookies
SYN cookies should be enabled on a production system. The Linux kernel will process connections normally until the backlog fills, at which point it will use SYN cookies rather than storing local state. SYN cookies are an effective protection against SYN floods, one of the most common DoS attacks against a server.
If you are seeking a stable test configuration as a basis for other tuning, you should disable SYN cookies. Increase the size of net/ipv4/tcp_max_syn_backlog if you encounter dropped connection attempts.
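For example, to disable SYN cookies and raise the SYN backlog on a test system (the backlog value of 8192 is illustrative only):

# echo 0 > /proc/sys/net/ipv4/tcp_syncookies

# echo 8192 > /proc/sys/net/ipv4/tcp_max_syn_backlog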
# echo 1024 > /proc/sys/net/core/somaxconn
The request backlog contains TCP connections that are established (the 3-way handshake is complete) but have not been accepted by the listening socket (on Traffic Manager). See also the tunable parameter ‘listen_queue_size’. Restart the Traffic Manager software after changing this value.
If the listen queue fills up because the Traffic Manager does not accept connections sufficiently quickly, the kernel will quietly ignore additional connection attempts. Clients will then back off (they assume packet loss has occurred) before retrying the connection.
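To check whether connection attempts are being dropped in this way, the kernel's TCP statistics include counters for overflowed listen queues (the exact wording varies between kernel versions):

# netstat -s | grep -i listen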
In general, it’s rarely necessary to further tune Linux kernel internals because the default values that are selected on a normal-to-high-memory system are sufficient for the vast majority of deployments, and most kernel tables will automatically resize if necessary. Any problems will be reported in the kernel logs; dmesg is the quickest and most reliable way to check the logs on a live system.
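For example, to review only warnings and errors from the kernel ring buffer (the --level option requires a reasonably recent util-linux dmesg):

# dmesg --level=err,warn | tail -n 20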
In 10 GbE environments, you should consider increasing the size of the input queue:
# echo 5000 > /proc/sys/net/core/netdev_max_backlog
TCP connections reside in the TIME_WAIT state in the kernel once they are closed. TIME_WAIT allows the server to time out connections it has closed in a clean fashion.
If you see the error “TCP: time wait bucket table overflow”, consider increasing the size of the table used to store TIME_WAIT connections:
# echo 7200000 > /proc/sys/net/ipv4/tcp_max_tw_buckets
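To see how close you are to this limit, compare the current number of TIME_WAIT sockets against the configured number of buckets:

# ss -tan state time-wait | wc -l

# cat /proc/sys/net/ipv4/tcp_max_tw_buckets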
In earlier Linux kernels (pre-2.6.39), the initial TCP window size was very small. The impact of a small initial window size is that peers communicating over a high-latency network will take a long time (several seconds or more) to scale the window to utilize the full bandwidth available; often the connection will complete (albeit slowly) before an efficient window size has been reached.
The 2.6.39 kernel increases the default initial window size from 2 to 10. If necessary, you can tune it manually:
# ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
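You can confirm the setting by inspecting the route afterwards (the gateway and interface above are examples; substitute your own):

# ip route show | grep initcwnd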
If a TCP connection stalls, even briefly, the kernel may reduce the TCP window size significantly in an attempt to respond to congestion. Many commentators have suggested that this behavior is unnecessary, and that slow start after idle should be disabled:
# echo 0 > /proc/sys/net/ipv4/tcp_slow_start_after_idle
If you are using older Spirent test kit, you may need to set the following tunables to work around optimizations in their TCP stack:
# echo 0 > /proc/sys/net/ipv4/tcp_timestamps
# echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
[Note: See attachments for the above changes in an easy-to-run shell script]
Interrupts (IRQs) are wake-up calls to the CPU when new network traffic arrives. The CPU is interrupted and diverted to handle the new network data. Most NIC drivers will buffer interrupts and distribute them as efficiently as possible. When running on a machine with multiple CPUs/cores, interrupts should be distributed across cores roughly evenly. Otherwise, one CPU can become the bottleneck under high network traffic.
The general-purpose approach in Linux is to deploy irqbalance, which is a standard package on most major Linux distributions. Under extremely high interrupt load, you may see one or more ksoftirqd processes exhibiting high CPU usage. In this case, you should configure your network driver to use multiple interrupt queues (if supported) and then manually map those queues to one or more CPUs using SMP affinity.
Modern network cards can maintain multiple receive queues. Packets within a particular TCP connection can be pinned to a single receive queue, and each queue has its own interrupt. You can map interrupts to CPU cores to control which core each packet is delivered to. This affinity delivers better performance by distributing traffic evenly across cores and by improving connection locality (a TCP connection is processed by a single core, improving CPU affinity).
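As an illustration (the interface name, interrupt numbers, and CPU masks below are examples and will differ on your system; irqbalance may also need to be stopped so that it does not override manual settings), you can find the interrupts used by each receive queue in /proc/interrupts and then pin them to individual cores by writing a CPU bitmask to smp_affinity. In this example, mask 1 pins IRQ 24 to CPU 0 and mask 2 pins IRQ 25 to CPU 1:

# grep eth0 /proc/interrupts

# echo 1 > /proc/irq/24/smp_affinity

# echo 2 > /proc/irq/25/smp_affinity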
For optimal performance, you should:
You should also refer to the technical documentation provided by your network card vendor.
[Updates by Aidan Clarke and Rick Henderson]