[[ch-throughput]]
== Optimizing DRBD throughput

This chapter deals with optimizing DRBD throughput. It examines some
hardware considerations with regard to throughput optimization, and
details tuning recommendations for that purpose.

[[s-throughput-hardware]]
=== Hardware considerations

DRBD throughput is affected by both the bandwidth of the underlying
I/O subsystem (disks, controllers, and corresponding caches), and the
bandwidth of the replication network.

.I/O subsystem throughput
indexterm:[throughput]I/O subsystem throughput is determined largely
by the number of disks that can be written to in parallel. A single,
reasonably recent SCSI or SAS disk will typically allow streaming
writes of roughly 40 MB/s. When deployed in a
striping configuration, the I/O subsystem parallelizes writes
across disks, effectively multiplying a single disk's throughput by
the number of stripes in the configuration. Thus the same 40 MB/s
disks will allow effective throughput of 120 MB/s in a RAID-0 or
RAID-1+0 configuration with three stripes, or 200 MB/s with five
stripes.
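
The multiplication above can be sketched in a few lines of Python. This
is an idealized estimate only: the function name +striped_throughput+ is
hypothetical, and the model assumes perfectly parallel streaming writes
with no controller or bus limits.

[source,python]
----------------------------
def striped_throughput(per_disk_mb_s, stripes):
    """Idealized streaming-write throughput of a striped array, in MB/s.

    Assumes writes are spread evenly across all stripes and that
    nothing other than the disks themselves limits throughput.
    """
    if stripes < 1:
        raise ValueError("stripes must be at least 1")
    return per_disk_mb_s * stripes


# The 40 MB/s disk from the text, with one, three, and five stripes.
for stripes in (1, 3, 5):
    print(f"{stripes} stripe(s): {striped_throughput(40, stripes)} MB/s")
----------------------------

Real arrays will usually fall somewhat short of these figures, so treat
the result as an upper bound when sizing hardware.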

NOTE: Disk _mirroring_ (RAID-1) in hardware typically has little, if
any, effect on throughput. Disk _striping with parity_ (RAID-5) does
have an effect on throughput, usually an adverse one when compared to
striping.

.Network throughput
indexterm:[throughput]Network throughput is usually determined by the
amount of traffic present on the network, and by the throughput of any
routing/switching infrastructure present. These concerns are, however,
largely irrelevant for DRBD replication links, which are normally
dedicated, back-to-back network connections. Thus, network throughput
may be improved either by switching to a higher-throughput network
technology (such as 10 Gigabit Ethernet), or by using link aggregation
over several network links, as one may do using the Linux
indexterm:[bonding driver]+bonding+ network driver.

[[s-throughput-overhead-expectations]]
=== Throughput overhead expectations

When estimating the throughput overhead associated with DRBD, it is
important to consider the following natural limitations:

* DRBD throughput is limited by that of the raw I/O subsystem.
* DRBD throughput is limited by the available network bandwidth.

The _minimum_ of the two establishes the theoretical throughput
_maximum_ available to DRBD. DRBD then reduces that throughput maximum
by its additional throughput overhead, which can be expected to be
less than 3 percent.

* Consider the example of two cluster nodes containing I/O subsystems
  capable of 200 MB/s throughput, with a Gigabit Ethernet link
  available between them. Gigabit Ethernet can be expected to produce
  110 MB/s throughput for TCP connections, thus the network connection
  would be the bottleneck in this configuration and one would
  expect about 107 MB/s maximum DRBD throughput.

* By contrast, if the I/O subsystem is capable of only 100 MB/s for
  sustained writes, then it constitutes the bottleneck, and you can
  expect only about 97 MB/s maximum DRBD throughput.
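
The two examples above follow the same arithmetic, which can be
sketched as below. The function name +max_drbd_throughput+ is
hypothetical, and the 3 percent figure is used as a worst-case overhead
taken from the text, not as a measured value.

[source,python]
----------------------------
DRBD_OVERHEAD = 0.03  # upper bound on DRBD overhead quoted in the text


def max_drbd_throughput(io_mb_s, net_mb_s, overhead=DRBD_OVERHEAD):
    """Estimate maximum DRBD throughput in MB/s.

    The slower of the I/O subsystem and the replication network is the
    bottleneck; DRBD's own overhead is then subtracted from it.
    """
    bottleneck = min(io_mb_s, net_mb_s)
    return bottleneck * (1.0 - overhead)


# 200 MB/s storage behind a Gigabit Ethernet link (about 110 MB/s).
print(round(max_drbd_throughput(200, 110)))
# 100 MB/s storage behind the same link.
print(round(max_drbd_throughput(100, 110)))
----------------------------

Because the overhead is stated as "less than 3 percent", the values
computed this way are conservative estimates of the achievable maximum.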

[[s-throughput-tuning]]
=== Tuning recommendations

DRBD offers a number of configuration options which may have an effect
on your system's throughput. This section lists some recommendations
for tuning for throughput. However, since throughput is largely
hardware dependent, the effects of tweaking the options described here
may vary greatly from system to system. It is important to understand
that these recommendations should not be interpreted as "silver
bullets" which would magically remove any and all throughput
bottlenecks.

[[s-tune-max-buffer-max-epoch-size]]
==== Setting +max-buffers+ and +max-epoch-size+

These options affect write performance on the secondary
node. +max-buffers+ is the maximum number of buffers DRBD allocates for
writing data to disk, while +max-epoch-size+ is the maximum number of
write requests permitted between two write barriers. For best
performance, +max-buffers+ must be equal to or greater than
+max-epoch-size+. The default for both is 2048; setting both to around
8000 should be fine for most reasonably high-performance hardware RAID
controllers.

[source,drbd]
----------------------------
resource <resource> {
  net {
    max-buffers 8000;
    max-epoch-size 8000;
    ...
  }
  ...
}
----------------------------
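
When these values are generated by scripts or configuration management,
the relationship between the two options can be checked before the
configuration is written. The following is a minimal sketch; the helper
+check_buffer_settings+ is hypothetical and is not part of DRBD or its
tools.

[source,python]
----------------------------
DEFAULT_VALUE = 2048  # default for both options, as stated in the text


def check_buffer_settings(max_buffers=DEFAULT_VALUE,
                          max_epoch_size=DEFAULT_VALUE):
    """Return a list of warnings for a max-buffers/max-epoch-size pair.

    An empty list means the pair satisfies the rule from the text:
    max-buffers should be equal to or greater than max-epoch-size.
    """
    warnings = []
    if max_buffers < max_epoch_size:
        warnings.append(
            f"max-buffers ({max_buffers}) is smaller than "
            f"max-epoch-size ({max_epoch_size})"
        )
    return warnings


print(check_buffer_settings(8000, 8000))  # no warnings expected
print(check_buffer_settings(2048, 8000))  # one warning expected
----------------------------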

[[s-tune-unplug-watermark]]
==== Tweaking the I/O unplug watermark

The I/O unplug watermark is a tunable which affects how often the I/O
subsystem's controller is "kicked" (forced to process pending I/O
requests) during normal operation. There is no universally recommended
setting for this option; the best value depends heavily on the hardware.

Some controllers perform best when "kicked" frequently, so for these
controllers it makes sense to set this fairly low, perhaps even as low
as DRBD's allowable minimum (16). Others perform best when left alone;
for these controllers a setting as high as +max-buffers+ is advisable.

[source,drbd]
----------------------------
resource <resource> {
  net {
    unplug-watermark 16;
    ...
  }
  ...
}
----------------------------

[[s-tune-sndbuf-size]]
==== Tuning the TCP send buffer size

The TCP send buffer is a memory buffer for outgoing TCP traffic. By
default, it is set to a size of 128 KiB. For use in high-throughput
networks (such as dedicated Gigabit Ethernet or load-balanced bonded
connections), it may make sense to increase this to a size of 512 KiB,
or perhaps even more. Send buffer sizes of more than 2 MiB are
generally not recommended (and are also unlikely to produce any
throughput improvement).

[source,drbd]
----------------------------
resource <resource> {
  net {
    sndbuf-size 512k;
    ...
  }
  ...
}
----------------------------

DRBD also supports TCP send buffer auto-tuning. After enabling this
feature, DRBD will dynamically select an appropriate TCP send buffer
size. TCP send buffer auto-tuning is enabled by simply setting the
buffer size to zero:

[source,drbd]
----------------------------
resource <resource> {
  net {
    sndbuf-size 0;
    ...
  }
  ...
}
----------------------------
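
The guidance in this section (zero enables auto-tuning, and sizes above
2 MiB are generally not recommended) can be summarized in a small
helper. This is an illustrative sketch only; +describe_sndbuf_size+ is a
hypothetical name, and the function takes a size in bytes instead of
parsing DRBD's own size suffixes.

[source,python]
----------------------------
KIB = 1024
MIB = 1024 * KIB
RECOMMENDED_MAX = 2 * MIB  # upper limit suggested in the text


def describe_sndbuf_size(size_bytes):
    """Classify a TCP send buffer size according to the text's guidance."""
    if size_bytes < 0:
        raise ValueError("size must not be negative")
    if size_bytes == 0:
        return "auto-tuned"
    if size_bytes > RECOMMENDED_MAX:
        return "larger than generally recommended"
    return "fixed size"


for size in (0, 128 * KIB, 512 * KIB, 4 * MIB):
    print(size, describe_sndbuf_size(size))
----------------------------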

[[s-tune-al-extents]]
==== Tuning the Activity Log size

If the application using DRBD is write intensive in the sense that it
frequently issues small writes scattered across the device, it is
usually advisable to use a fairly large activity log. Otherwise,
frequent metadata updates may be detrimental to write performance.

[source,drbd]
----------------------------
resource <resource> {
  disk {
    al-extents 3389;
    ...
  }
  ...
}
----------------------------
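
To get a feel for what a given +al-extents+ value means, it can help to
translate it into the amount of storage the activity log covers at any
one time. The sketch below assumes that each activity log extent covers
4 MiB of the backing device; that figure is not stated in this section
and should be verified against the documentation for your DRBD version.
The function name +al_coverage_mib+ is hypothetical.

[source,python]
----------------------------
AL_EXTENT_MIB = 4  # assumption: one activity log extent covers 4 MiB


def al_coverage_mib(al_extents, extent_mib=AL_EXTENT_MIB):
    """Storage area (in MiB) covered by the activity log at one time."""
    if al_extents < 1:
        raise ValueError("al_extents must be at least 1")
    return al_extents * extent_mib


extents = 3389  # value used in the example configuration
coverage = al_coverage_mib(extents)
print(f"{extents} extents cover {coverage} MiB ({coverage / 1024:.1f} GiB)")
----------------------------

Under that assumption, writes scattered across an area smaller than the
computed coverage should cause comparatively few activity log updates.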

[[s-tune-disable-barriers]]
==== Disabling barriers and disk flushes

WARNING: The recommendations outlined in this section should be applied
_only_ to systems with non-volatile (battery-backed) controller caches.

Systems equipped with a battery-backed write cache come with built-in
means of protecting data in the face of power failure. In that case,
it is permissible to disable some of DRBD's own safeguards created for
the same purpose. This may be beneficial in terms of throughput:

[source,drbd]
----------------------------
resource <resource> {
  disk {
    disk-barrier no;
    disk-flushes no;
    ...
  }
  ...
}
----------------------------
