[[ch-ocfs2]]
== Using OCFS2 with DRBD

indexterm:[OCFS2]indexterm:[Oracle Cluster File System]This chapter
outlines the steps necessary to set up a DRBD resource as a block
device holding a shared Oracle Cluster File System, version 2 (OCFS2).


[IMPORTANT]
===============================
All cluster file systems _require_ fencing - not only via the DRBD 
resource, but STONITH! A faulty member _must_ be killed.

You'll want these settings:

	disk {
		fencing resource-and-stonith;
	}
	handlers {
		# Make sure the other node is confirmed
		# dead after this!
		fence-peer "/sbin/kill-other-node.sh";
	}

There must be _no_ volatile caches!
You might take a few hints of the page at https://fedorahosted.org/cluster/wiki/DRBD_Cookbook,
although that's about GFS2, not OCFS2.
===============================



[[s-ocfs2-primer]]
=== OCFS2 primer

The Oracle Cluster File System, version 2 (OCFS2) is a concurrent
access shared storage file system developed by Oracle
Corporation. Unlike its predecessor OCFS, which was specifically
designed and only suitable for Oracle database payloads, OCFS2 is a
general-purpose filesystem that implements most POSIX semantics. The
most common use case for OCFS2 is arguably Oracle Real Application
Cluster (RAC), but OCFS2 may also be used for load-balanced NFS
clusters, for example.

Although originally designed for use with conventional shared storage
devices, OCFS2 is equally well suited to be deployed on
<<s-dual-primary-mode,dual-Primary DRBD>>. Applications reading from
the filesystem may benefit from reduced read latency due to the fact
that DRBD reads from and writes to local storage, as opposed to the
SAN devices OCFS2 otherwise normally runs on. In addition, DRBD adds
redundancy to OCFS2 by adding an additional copy to every filesystem
image, as opposed to just a single filesystem image that is merely
shared.

Like other shared cluster file systems such as <<ch-gfs,GFS>>, OCFS2
allows multiple nodes to access the same storage device, in read/write
mode, simultaneously without risking data corruption. It does so by
using a Distributed Lock Manager (DLM) which manages concurrent access
from cluster nodes. The DLM itself uses a virtual file system
(+ocfs2_dlmfs+) which is separate from the actual OCFS2 file systems
present on the system.

OCFS2 may either use an intrinsic cluster communication layer to
manage cluster membership and filesystem mount and unmount operation,
or alternatively defer those tasks to the
<<ch-pacemaker,Pacemaker>>cluster infrastructure.

OCFS2 is available in SUSE Linux Enterprise Server (where it is the
primarily supported shared cluster file system), CentOS, Debian
GNU/Linux, and Ubuntu Server Edition. Oracle also provides packages
for Red Hat Enterprise Linux (RHEL). This chapter assumes running
OCFS2 on a SUSE Linux Enterprise Server system.

[[s-ocfs2-create-resource]]
=== Creating a DRBD resource suitable for OCFS2

Since OCFS2 is a shared cluster file system expecting concurrent
read/write storage access from all cluster nodes, any DRBD resource to
be used for storing a OCFS2 filesystem must be configured in
<<s-dual-primary-mode,dual-primary mode>>. Also, it is recommended to
use some of DRBD's
<<s-automatic-split-brain-recovery-configuration,features for
automatic recovery from split brain>>. And, it is necessary for the
resource to switch to the primary role immediately after startup. To
do all this, include the following lines in the resource
configuration: indexterm:[drbd.conf]

[source,drbd]
----------------------------
resource <resource> {
  startup {
    become-primary-on both;
    ...
  }
  net {
    # allow-two-primaries yes;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
  ...
}
----------------------------

[WARNING]
===============================
By configuring auto-recovery policies, you are configuring effectively configuring automatic data-loss! Be sure you understand the implications.
===============================


It is not recommended to set the +allow-two-primaries+ option to +yes+
upon initial configuration. You should do so after the initial
resource synchronization has completed.

Once you have added these options to <<ch-configure,your
freshly-configured resource>>, you may <<s-first-time-up,initialize
your resource as you normally would>>. After you set the
indexterm:[drbd.conf]+allow-two-primaries+ option to +yes+ for this
resource, you will be able to <<s-switch-resource-roles,promote the
resource>>to the primary role on both nodes.

[[s-ocfs2-create]]
=== Creating an OCFS2 filesystem

Now, use OCFS2's +mkfs+ implementation to create the file system:
----------------------------
mkfs -t ocfs2 -N 2 -L ocfs2_drbd0 /dev/drbd0
mkfs.ocfs2 1.4.0
Filesystem label=ocfs2_drbd0
Block size=1024 (bits=10)
Cluster size=4096 (bits=12)
Volume size=205586432 (50192 clusters) (200768 blocks)
7 cluster groups (tail covers 4112 clusters, rest cover 7680 clusters)
Journal size=4194304
Initial number of node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Writing lost+found: done
mkfs.ocfs2 successful
----------------------------

This will create an OCFS2 file system with two node slots on
+/dev/drbd0+, and set the filesystem label to +ocfs2_drbd0+. You may
specify other options on +mkfs+ invocation; please see the +mkfs.ocfs2+
system manual page for details.

[[s-ocfs2-pacemaker]]
=== Pacemaker OCFS2 management

[[s-ocfs2-pacemaker-drbd]]
==== Adding a Dual-Primary DRBD resource to Pacemaker

An existing <<s-ocfs2-create-resource,Dual-Primary DRBD resource>>may
be added to Pacemaker resource management with the following
+crm+ configuration:

[source,drbd]
----------------------------
primitive p_drbd_ocfs2 ocf:linbit:drbd \
  params drbd_resource="ocfs2"
ms ms_drbd_ocfs2 p_drbd_ocfs2 \
  meta master-max=2 clone-max=2 notify=true
----------------------------

IMPORTANT: Note the +master-max=2+ meta variable; it enables
dual-Master mode for a Pacemaker master/slave set. This requires that
+allow-two-primaries+ is also set to +yes+ in the DRBD
configuration. Otherwise, Pacemaker will flag a configuration error
during resource validation.

[[s-ocfs2-pacemaker-mgmtdaemons]]
==== Adding OCFS2 management capability to Pacemaker

In order to manage OCFS2 and the kernel Distributed Lock Manager
(DLM), Pacemaker uses a total of three different resource agents:

* +ocf:pacemaker:controld+ -- Pacemaker's interface to the DLM;

* +ocf:ocfs2:o2cb+ -- Pacemaker's interface to OCFS2 cluster
  management;

* +ocf:heartbeat:Filesystem+ -- the generic filesystem management
  resource agent which supports cluster file systems when configured
  as a Pacemaker clone.

You may enable all nodes in a Pacemaker cluster for OCFS2 management
by creating a _cloned group_ of resources, with the following
+crm+ configuration:

[source,drbd]
----------------------------
primitive p_controld ocf:pacemaker:controld
primitive p_o2cb ocf:ocfs2:o2cb
group g_ocfs2mgmt p_controld p_o2cb
clone cl_ocfs2mgmt g_ocfs2mgmt meta interleave=true
----------------------------

Once this configuration is committed, Pacemaker will start instances
of the +controld+ and +o2cb+ resource types on all nodes in the cluster.

[[s-ocfs2-pacemaker-fs]]
==== Adding an OCFS2 filesystem to Pacemaker

Pacemaker manages OCFS2 filesystems using the conventional
+ocf:heartbeat:Filesystem+ resource agent, albeit in clone mode. To
put an OCFS2 filesystem under Pacemaker management, use the following
+crm+ configuration:

[source,drbd]
----------------------------
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
  params device="/dev/drbd/by-res/ocfs2/0" directory="/srv/ocfs2" \
         fstype="ocfs2" options="rw,noatime"
clone cl_fs_ocfs2 p_fs_ocfs2
----------------------------

NOTE: This example assumes a single-volume resource.

[[s-ocfs2-pacemaker-constraints]]
==== Adding required Pacemaker constraints to manage OCFS2 filesystems

In order to tie all OCFS2-related resources and clones together, add
the following contraints to your Pacemaker configuration:

[source,drbd]
----------------------------
order o_ocfs2 inf: ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
colocation c_ocfs2 inf: cl_fs_ocfs2 cl_ocfs2mgmt ms_drbd_ocfs2:Master
----------------------------

[[s-ocfs2-legacy]]
=== Legacy OCFS2 management (without Pacemaker)

IMPORTANT: The information presented in this section applies to legacy
systems where OCFS2 DLM support is not available in Pacemaker. It is
preserved here for reference purposes only. New installations should
always use the <<s-ocfs2-pacemaker,Pacemaker>> approach.

[[s-ocfs2-enable]]
==== Configuring your cluster to support OCFS2

[[s-ocfs2-create-cluster-conf]]
===== Creating the configuration file

OCFS2 uses a central configuration file, +/etc/ocfs2/cluster.conf+.

When creating your OCFS2 cluster, be sure to add both your hosts to
the cluster configuration. The default port (7777) is usually an
acceptable choice for cluster interconnect communications. If you
choose any other port number, be sure to choose one that does not
clash with an existing port used by DRBD (or any other configured
TCP/IP).

If you feel less than comfortable editing the +cluster.conf+ file
directly, you may also use the +ocfs2console+ graphical configuration
utility which is usually more convenient. Regardless of the approach
you selected, your +/etc/ocfs2/cluster.conf+ file contents should look
roughly like this:

[source,drbd]
----------------------------
node:
    ip_port = 7777
    ip_address = 10.1.1.31
    number = 0
    name = alice
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 10.1.1.32
    number = 1
    name = bob
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
----------------------------


When you have configured you cluster configuration, use +scp+ to
distribute the configuration to both nodes in the cluster.

[[s-configure-o2cb-driver]]
===== Configuring the O2CB driver

====== SUSE Linux Enterprise systems

On SLES, you may utilize the +configure+ option of the +o2cb+ init
script:

----------------------------
/etc/init.d/o2cb configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot.  The current values will be shown in brackets ('[]').  Hitting
<ENTER> without typing an answer will keep that current value.  Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [y]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]:
Specify network idle timeout in ms (>=5000) [30000]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Use user-space driven heartbeat? (y/n) [n]:
Writing O2CB configuration: OK
Loading module "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK
----------------------------

====== .Debian GNU/Linux systems
On Debian, the +configure+ option to +/etc/init.d/o2cb+ is not
available. Instead, reconfigure the +ocfs2-tools+ package to enable the
driver:

----------------------------
dpkg-reconfigure -p medium -f readline ocfs2-tools
Configuring ocfs2-tools
Would you like to start an OCFS2 cluster (O2CB) at boot time? yes
Name of the cluster to start at boot time: ocfs2
The O2CB heartbeat threshold sets up the maximum time in seconds that a node
awaits for an I/O operation. After it, the node "fences" itself, and you will
probably see a crash.

It is calculated as the result of: (threshold - 1) x 2.

Its default value is 31 (60 seconds).

Raise it if you have slow disks and/or crashes with kernel messages like:

o2hb_write_timeout: 164 ERROR: heartbeat write timeout to device XXXX after NNNN
milliseconds
O2CB Heartbeat threshold: `31`
		Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Starting O2CB cluster ocfs2: OK
----------------------------

[[s-ocfs2-use]]
==== Using your OCFS2 filesystem

When you have completed cluster configuration and created your file
system, you may mount it as any other file system:
----------------------------
mount -t ocfs2 /dev/drbd0 /shared
----------------------------

Your kernel log (accessible by issuing the command +dmesg+) should
then contain a line similar to this one:

[source,drbd]
----------------------------
ocfs2: Mounting device (147,0) on (node 0, slot 0) with ordered data mode.
----------------------------

From that point forward, you should be able to simultaneously mount
your OCFS2 filesystem on both your nodes, in read/write mode.
