Configuring Linux Bonding for High Availability
===============================================

High Availability refers to configurations that provide maximum network availability by having redundant or backup devices,
links or switches between the host and the rest of the world. The
goal is to provide the maximum availability of network connectivity
(i.e., the network always works), even though other configurations
could provide higher throughput.

High Availability in a Single Switch Topology
---------------------------------------------

If two hosts (or a host and a single switch) are directly connected via multiple physical links, then there is no availability
penalty to optimizing for maximum bandwidth. In this case, there is only one switch (or peer), so if it fails, there is no alternative
access to fail over to. Additionally, the bonding load balance modes support link monitoring of their members, so if individual links fail,
the load will be rebalanced across the remaining devices.

See the “Configuring Bonding for Maximum Throughput” section, below, for information on configuring bonding with one peer device.

High Availability in a Multiple Switch Topology
-----------------------------------------------

With multiple switches, the configuration of bonding and the network changes dramatically. In multiple switch topologies, there is
a trade off between network availability and usable bandwidth.

Below is a sample network, configured to maximize the availability of the network:

                |                                     |
                |port3                           port3|
          +-----+----+                          +-----+----+
          |          |port2       ISL      port2|          |
          | switch A +--------------------------+ switch B |
          |          |                          |          |
          +-----+----+                          +-----+----+
                |port1                           port1|
                |              +-------+              |
                +--------------+ host1 +--------------+
                          eth0 +-------+ eth1

In this configuration, there is a link between the two switches (ISL, or inter switch link), and multiple ports connecting to
the outside world (“port3” on each switch). There is no technical reason that this could not be extended to a third switch.

HA Bonding Mode Selection for Multiple Switch Topology
------------------------------------------------------

In a topology such as the example above, the active-backup and broadcast modes are the only useful bonding modes when optimizing for
availability; the other modes require all links to terminate on the same peer for them to behave rationally.

active-backup: This is generally the preferred mode, particularly if the switches have an ISL and play together well. If the
network configuration is such that one switch is specifically a backup switch (e.g., has lower capacity, higher cost, etc.),
then the primary option can be used to ensure that the preferred link is always used when it is available.
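
For example, assuming eth0 is the interface attached to the preferred switch, the mode and primary slave might be specified
as module options when the driver is loaded (the interface name and miimon value are illustrative):

# modprobe bonding mode=active-backup miimon=100 primary=eth0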

broadcast: This mode is really a special purpose mode, and is suitable only for very specific needs. For example, if the two
switches are not connected (no ISL), and the networks beyond them are totally independent. In this case, if it is
necessary for some specific one-way traffic to reach both independent networks, then the broadcast mode may be suitable.

HA Link Monitoring Selection for Multiple Switch Topology
----------------------------------------------------------

The choice of link monitoring ultimately depends upon your switch. If the switch can reliably fail ports in response to other
failures, then either the MII or ARP monitors should work. For example, in the topology above, if the “port3” link fails at the remote
end, the MII monitor has no direct means to detect this. The ARP monitor could be configured with a target at the remote end of port3,
thus detecting that failure without switch support.

In general, however, in a multiple switch topology, the ARP monitor can provide a higher level of reliability in detecting end to
end connectivity failures (which may be caused by the failure of any individual component to pass traffic for any reason). Additionally,
the ARP monitor should be configured with multiple targets (at least one for each switch in the network). This will ensure that,
regardless of which switch is active, the ARP monitor has a suitable target to query.
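
For example, with one target reachable through each switch, the ARP monitor might be enabled as follows (the interval, in
milliseconds, and the target addresses are illustrative):

# modprobe bonding mode=active-backup arp_interval=2000 arp_ip_target=192.168.0.1,192.168.1.1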

Note also that many switches now support a functionality generally referred to as “trunk failover.” This is a feature of the
switch that causes the link state of a particular switch port to be set down (or up) when the state of another switch port goes down (or up).
Its purpose is to propagate link failures from logically “exterior” ports to the logically “interior” ports that bonding is able to monitor via
miimon. Availability and configuration for trunk failover varies by switch, but this can be a viable alternative to the ARP monitor when using
suitable switches.

Configuring Bonding for Maximum Throughput
==============================================

12.1 Maximizing Throughput in a Single Switch Topology
------------------------------------------------------

In a single switch configuration, the best method to maximize throughput depends upon the application and network environment. The
various load balancing modes each have strengths and weaknesses in different environments, as detailed below.

For this discussion, we will break down the topologies into two categories. Depending upon the destination of most traffic, we
categorize them into either “gatewayed” or “local” configurations.

In a gatewayed configuration, the “switch” is acting primarily as a router, and the majority of traffic passes through this router to
other networks. An example would be the following:

     +----------+                     +----------+
     |          |eth0            port1|          | to other networks
     |  Host A  +---------------------+  router  +------------------->
     |          +---------------------+          | Hosts B and C are out
     |          |eth1            port2|          | here somewhere
     +----------+                     +----------+

The router may be a dedicated router device, or another host acting as a gateway. For our discussion, the important point is that
the majority of traffic from Host A will pass through the router to some other network before reaching its final destination.

In a gatewayed network configuration, although Host A may communicate with many other systems, all of its traffic will be sent
and received via one other peer on the local network, the router.

Note that the case of two systems connected directly via multiple physical links is, for purposes of configuring bonding, the
same as a gatewayed configuration. In that case, it happens that all traffic is destined for the “gateway” itself, not some other network
beyond the gateway.

In a local configuration, the “switch” is acting primarily as a switch, and the majority of traffic passes through this switch to
reach other stations on the same network. An example would be the following:

    +----------+            +----------+       +--------+
    |          |eth0   port1|          +-------+ Host B |
    |  Host A  +------------+  switch  |port3  +--------+
    |          +------------+          |                  +--------+
    |          |eth1   port2|          +------------------+ Host C |
    +----------+            +----------+port4             +--------+

Again, the switch may be a dedicated switch device, or another host acting as a gateway. For our discussion, the important point is
that the majority of traffic from Host A is destined for other hosts on the same local network (Hosts B and C in the above example).

In summary, in a gatewayed configuration, traffic to and from the bonded device will be to the same MAC level peer on the network
(the gateway itself, i.e., the router), regardless of its final destination. In a local configuration, traffic flows directly to and
from the final destinations, thus, each destination (Host B, Host C) will be addressed directly by their individual MAC addresses.

This distinction between a gatewayed and a local network configuration is important because many of the load balancing modes
available use the MAC addresses of the local network source and destination to make load balancing decisions. The behavior of each
mode is described below.

balance-rr:

This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple
interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface’s
worth of throughput. This comes at a cost, however: the striping generally results in peer systems receiving packets out
of order, causing TCP/IP’s congestion control system to kick in, often by retransmitting segments.

It is possible to adjust TCP/IP’s congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The
usual default value is 3, and the maximum useful value is 127.
For a four interface balance-rr bond, expect that a single TCP/IP stream will utilize no more than approximately 2.3
interface’s worth of throughput, even after adjusting tcp_reordering.
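
For example, the reordering threshold can be raised at runtime with sysctl (the value shown is the maximum useful value
mentioned above):

# sysctl -w net.ipv4.tcp_reordering=127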

Note that the fraction of packets that will be delivered out of order is highly variable, and is unlikely to be zero. The level
of reordering depends upon a variety of factors, including the networking interfaces, the switch, and the topology of the
configuration. Speaking in general terms, higher speed network cards produce more reordering (due to factors such as packet
coalescing), and a “many to many” topology will reorder at a higher rate than a “many slow to one fast” configuration.

Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses);
for those devices, traffic for a particular connection flowing through the switch to a balance-rr bond will not utilize greater
than one interface’s worth of bandwidth.

If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order
delivery, then this mode can allow for single stream datagram performance that scales near linearly as interfaces are added
to the bond.

This mode requires the switch to have the appropriate ports configured for “etherchannel” or “trunking.”
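
On the Linux side, selecting this mode requires nothing more than the usual module options (the miimon value is
illustrative); the etherchannel/trunking setup itself is switch-vendor specific:

# modprobe bonding mode=balance-rr miimon=100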

active-backup:

There is not much advantage in this network topology to the active-backup mode, as the inactive backup devices are all
connected to the same peer as the primary. In this case, a load balancing mode (with link monitoring) will provide the
same level of network availability, but with increased available bandwidth. On the plus side, active-backup mode
does not require any configuration of the switch, so it may have value if the hardware available does not support any of
the load balance modes.

balance-xor:

This mode will limit traffic such that packets destined for specific peers will always be sent over the same
interface. Since the destination is determined by the MAC addresses involved, this mode works best in a “local” network
configuration (as described above), with destinations all on the same local network. This mode is likely to be suboptimal
if all your traffic is passed through a single router (i.e., a “gatewayed” network configuration, as described above).

As with balance-rr, the switch ports need to be configured for “etherchannel” or “trunking.”
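
The default transmit hash is typically based on the source and destination MAC addresses (roughly, (source MAC XOR
destination MAC) modulo the number of slaves), which is why traffic toward a single gateway MAC tends to collapse onto one
interface. A minimal load example, assuming the default hash policy is acceptable (the miimon value is illustrative; a
layer3+4 policy can be selected with the xmit_hash_policy option where supported):

# modprobe bonding mode=balance-xor miimon=100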

broadcast:

Like active-backup, there is not much advantage to this mode in this type of network topology.

balance-alb:

This mode is everything that balance-tlb is, and more. It has all of the features (and restrictions) of balance-tlb,
and will also balance incoming traffic from local network peers (as described in the Bonding Module Options section,
above).

The only additional down side to this mode is that the network device driver must support changing the hardware address while
the device is open.
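
A minimal load example for this mode (the miimon value is illustrative; as noted above, it will only work if the slave
drivers can change their MAC address while the device is open):

# modprobe bonding mode=balance-alb miimon=100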

Maximum Throughput in a Multiple Switch Topology
------------------------------------------------

Multiple switches may be utilized to optimize for throughput when they are configured in parallel as part of an isolated network
between two or more systems, for example:

                       +-----------+
                       |  Host A   |
                       +-+---+---+-+
                         |   |   |
               +---------+   |   +---------+
               |             |             |
         +-----+----+  +-----+----+  +-----+----+
         | Switch A |  | Switch B |  | Switch C |
         +-----+----+  +-----+----+  +-----+----+
               |             |             |
               +---------+   |   +---------+
                         |   |   |
                       +-+---+---+-+
                       |  Host B   |
                       +-----------+

In this configuration, the switches are isolated from one another. One reason to employ a topology such as this is for an
isolated network with many hosts (a cluster configured for high performance, for example); using multiple smaller switches can be more
cost effective than a single larger switch. For example, on a network with 24 hosts, three 24-port switches can be significantly less
expensive than a single 72-port switch.

If access beyond the network is required, an individual host can be equipped with an additional network device connected to an
external network; this host then additionally acts as a gateway.

12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
--------------------------------------------------------------

In actual practice, the bonding mode typically employed in configurations of this type is balance-rr. Historically, in this
network configuration, the usual caveats about out of order packet delivery are mitigated by the use of network adapters that do not do
any kind of packet coalescing (via the use of NAPI, or because the device itself does not generate interrupts until some number of
packets has arrived). When employed in this fashion, the balance-rr mode allows individual connections between two hosts to effectively
utilize greater than one interface’s bandwidth.

12.2.2 MT Link Monitoring for Multiple Switch Topology
------------------------------------------------------

Again, in actual practice, the MII monitor is most often used in this configuration, as performance is given preference over
availability. The ARP monitor will function in this topology, but its advantages over the MII monitor are mitigated by the volume of probes
needed as the number of systems involved grows (remember that each host in the network is configured with bonding).

Switch Behavior Issues
==========================

13.1 Link Establishment and Failover Delays
--------------------------------------------

Some switches exhibit undesirable behavior with regard to the timing of link up and down reporting by the switch.

First, when a link comes up, some switches may indicate that the link is up (carrier available), but not pass traffic over the
interface for some period of time. This delay is typically due to some type of autonegotiation or routing protocol, but may also occur
during switch initialization (e.g., during recovery after a switch failure). If you find this to be a problem, specify an appropriate
value to the updelay bonding module option to delay the use of the relevant interface(s).

Second, some switches may “bounce” the link state one or more times while a link is changing state. This occurs most commonly while
the switch is initializing. Again, an appropriate updelay value may help.

Note that when a bonding interface has no active links, the driver will immediately reuse the first link that goes up, even if the
updelay parameter has been specified (the updelay is ignored in this case). If there are slave interfaces waiting for the updelay timeout
to expire, the interface that first went into that state will be immediately reused. This reduces down time of the network if the
value of updelay has been overestimated, and since this occurs only in cases with no connectivity, there is no additional penalty for
ignoring the updelay.

In addition to the concerns about switch timings, if your switches take a long time to go into backup mode, it may be desirable
to not activate a backup interface immediately after a link goes down. Failover may be delayed via the downdelay bonding module option.
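
For example, the MII monitor can be combined with both delays as follows (the millisecond values are illustrative; updelay
and downdelay should be multiples of the miimon value):

# modprobe bonding mode=active-backup miimon=100 updelay=2000 downdelay=200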

Duplicated Incoming Packets
---------------------------

NOTE: Starting with version 3.0.2, the bonding driver has logic to suppress duplicate packets, which should largely eliminate this problem.
The following description is kept for reference.

It is not uncommon to observe a short burst of duplicated
traffic when the bonding device is first used, or after it has been
idle for some period of time. This is most easily observed by issuing
a “ping” to some other host on the network, and noticing that the
output from ping flags duplicates (typically one per slave).

For example, on a bond in active-backup mode with five slaves
all connected to one switch, the output may appear as follows:

# ping -n 10.0.4.2
PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms

This is not due to an error in the bonding driver, rather, it
is a side effect of how many switches update their MAC forwarding
tables. Initially, the switch does not associate the MAC address in
the packet with a particular switch port, and so it may send the
traffic to all ports until its MAC forwarding table is updated. Since
the interfaces attached to the bond may occupy multiple ports on a
single switch, when the switch (temporarily) floods the traffic to all
ports, the bond device receives multiple copies of the same packet
(one per slave device).

The duplicated packet behavior is switch dependent; some
switches exhibit this, and some do not. On switches that display this
behavior, it can be induced by clearing the MAC forwarding table (on
most Cisco switches, the privileged command “clear mac address-table
dynamic” will accomplish this).

 

Hardware Specific Considerations
====================================

This section contains additional information for configuring
bonding on specific hardware platforms, or for interfacing bonding
with particular switches or other devices.

14.1 IBM BladeCenter
--------------------

This applies to the JS20 and similar systems.

On the JS20 blades, the bonding driver supports only
balance-rr, active-backup, balance-tlb and balance-alb modes. This is
largely due to the network topology inside the BladeCenter, detailed
below.

JS20 network adapter information
--------------------------------

All JS20s come with two Broadcom Gigabit Ethernet ports
integrated on the planar (that’s “motherboard” in IBM-speak). In the
BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to
I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2.
An add-on Broadcom daughter card can be installed on a JS20 to provide
two more Gigabit Ethernet ports. These ports, eth2 and eth3, are
wired to I/O Modules 3 and 4, respectively.

Each I/O Module may contain either a switch or a passthrough
module (which allows ports to be directly connected to an external
switch). Some bonding modes require a specific BladeCenter internal
network topology in order to function; these are detailed below.

Additional BladeCenter-specific networking information can be
found in two IBM Redbooks (www.ibm.com/redbooks):

“IBM eServer BladeCenter Networking Options”
“IBM eServer BladeCenter Layer 2-7 Network Switching”

BladeCenter networking configuration
------------------------------------

Because a BladeCenter can be configured in a very large number
of ways, this discussion will be confined to describing basic
configurations.

Normally, Ethernet Switch Modules (ESMs) are used in I/O
modules 1 and 2. In this configuration, the eth0 and eth1 ports of a
JS20 will be connected to different internal switches (in the
respective I/O modules).

A passthrough module (OPM or CPM, optical or copper
passthrough module) connects the I/O module directly to an external
switch. By using PMs in I/O module #1 and #2, the eth0 and eth1
interfaces of a JS20 can be redirected to the outside world and
connected to a common external switch.

Depending upon the mix of ESMs and PMs, the network will
appear to bonding as either a single switch topology (all PMs) or as a
multiple switch topology (one or more ESMs, zero or more PMs). It is
also possible to connect ESMs together, resulting in a configuration
much like the example in “High Availability in a Multiple Switch
Topology,” above.

Requirements for specific modes
-------------------------------

The balance-rr mode requires the use of passthrough modules
for devices in the bond, all connected to a common external switch.
That switch must be configured for “etherchannel” or “trunking” on the
appropriate ports, as is usual for balance-rr.

The balance-alb and balance-tlb modes will function with
either switch modules or passthrough modules (or a mix). The only
specific requirement for these modes is that all network interfaces
must be able to reach all destinations for traffic sent over the
bonding device (i.e., the network must converge at some point outside
the BladeCenter).

The active-backup mode has no additional requirements.

Link monitoring issues
----------------------

When an Ethernet Switch Module is in place, only the ARP
monitor will reliably detect link loss to an external switch. This is
nothing unusual, but examination of the BladeCenter cabinet would
suggest that the “external” network ports are the ethernet ports for
the system, when in fact there is a switch between these “external”
ports and the devices on the JS20 system itself. The MII monitor is
only able to detect link failures between the ESM and the JS20 system.

When a passthrough module is in place, the MII monitor does
detect failures to the “external” port, which is then directly
connected to the JS20 system.

Other concerns
--------------

The Serial Over LAN (SoL) link is established over the primary
ethernet (eth0) only; therefore, any loss of link to eth0 will result
in losing your SoL connection. It will not fail over with other
network traffic, as the SoL system is beyond the control of the
bonding driver.

It may be desirable to disable spanning tree on the switch
(either the internal Ethernet Switch Module, or an external switch) to
avoid fail-over delay issues when using bonding.

Frequently Asked Questions
==============================

1. Is it SMP safe?

Yes. The old 2.0.xx channel bonding patch was not SMP safe.
The new driver was designed to be SMP safe from the start.

2. What type of cards will work with it?

Any Ethernet type cards (you can even mix cards – an Intel
EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes,
devices need not be of the same speed.

Starting with version 3.2.1, bonding also supports Infiniband
slaves in active-backup mode.

3. How many bonding devices can I have?

There is no limit.

4. How many slaves can a bonding device have?

This is limited only by the number of network interfaces Linux
supports and/or the number of network cards you can place in your
system.

5. What happens when a slave link dies?

If link monitoring is enabled, then the failing device will be
disabled. The active-backup mode will fail over to a backup link, and
other modes will ignore the failed link. The link will continue to be
monitored, and should it recover, it will rejoin the bond (in whatever
manner is appropriate for the mode). See the sections on High
Availability and the documentation for each mode for additional
information.

Link monitoring can be enabled via either the miimon or
arp_interval parameters (described in the module parameters section,
above). In general, miimon monitors the carrier state as sensed by
the underlying network device, and the arp monitor (arp_interval)
monitors connectivity to another host on the local network.

If no link monitoring is configured, the bonding driver will
be unable to detect link failures, and will assume that all links are
always available. This will likely result in lost packets, and a
resulting degradation of performance. The precise performance loss
depends upon the bonding mode and network configuration.

6. Can bonding be used for High Availability?

Yes. See the section on High Availability for details.

7. Which switches/systems does it work with?

The full answer to this depends upon the desired mode.

In the basic balance modes (balance-rr and balance-xor), it
works with any system that supports etherchannel (also called
trunking). Most managed switches currently available have such
support, and many unmanaged switches as well.

The advanced balance modes (balance-tlb and balance-alb) do
not have special switch requirements, but do need device drivers that
support specific features (described in the appropriate section under
module parameters, above).

In 802.3ad mode, it works with systems that support IEEE
802.3ad Dynamic Link Aggregation. Most managed and many unmanaged
switches currently available support 802.3ad.
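
For example, assuming the switch ports have already been configured for 802.3ad/LACP aggregation, the mode might be
selected as follows (the miimon and lacp_rate values are illustrative):

# modprobe bonding mode=802.3ad miimon=100 lacp_rate=fast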

The active-backup mode should work with any Layer-II switch.

8. Where does a bonding device get its MAC address from?

When using slave devices that have fixed MAC addresses, or when
the fail_over_mac option is enabled, the bonding device’s MAC address is
the MAC address of the active slave.

For other configurations, if not explicitly configured (with
ifconfig or ip link), the MAC address of the bonding device is taken from
its first slave device. This MAC address is then passed to all following
slaves and remains persistent (even if the first slave is removed) until
the bonding device is brought down or reconfigured.

If you wish to change the MAC address, you can set it with
ifconfig or ip link:

# ifconfig bond0 hw ether 00:11:22:33:44:55

# ip link set bond0 address 66:77:88:99:aa:bb

The MAC address can also be changed by bringing down/up the
device and then changing its slaves (or their order):

# ifconfig bond0 down ; modprobe -r bonding
# ifconfig bond0 …. up
# ifenslave bond0 eth…

This method will automatically take the address from the next
slave that is added.

To restore your slaves’ MAC addresses, you need to detach them
from the bond ('ifenslave -d bond0 eth0'). The bonding driver will
then restore the MAC addresses that the slaves had before they were
enslaved.

Resources and Links
===================

The latest version of the bonding driver can be found in the latest
version of the linux kernel, found on http://kernel.org

The latest version of this document can be found in either the latest
kernel source (named Documentation/networking/bonding.txt), or on the
bonding sourceforge site:

http://www.sourceforge.net/projects/bonding

Discussions regarding the bonding driver take place primarily on the
bonding-devel mailing list, hosted at sourceforge.net. If you have
questions or problems, post them to the list. The list address is:

bonding-devel@lists.sourceforge.net

The administrative interface (to subscribe or unsubscribe) can
be found at:

https://lists.sourceforge.net/lists/listinfo/bonding-devel

Donald Becker’s Ethernet Drivers and diag programs may be found at:
– http://www.scyld.com/network/

You will also find a lot of information regarding Ethernet, NWay, MII,
etc. at www.scyld.com.

 
