The Enhanced NBD is the result of an industrially funded academic
research project with Realm Software of Atlanta, GA, to toughen up the
kernel's NBD. It started back in 2.0 times, when I back-ported the
nascent NBD by Pavel Machek from the 2.1 development kernel.
An NBD is "a long pair of wires". It makes a remote disk on a different machine act as though it were a local disk on your machine. It looks like a block device on the local machine where it's typically going to appear as /dev/nda. The remote resource doesn't need to be a whole disk or even a partition. It can be a file.
The intended use for ENBD in particular is for RAID over the net.
You can make any NBD device part of a RAID mirror in order to get real
time mirroring to a distant (and safe!) backup. To make it clear: start
up an NBD connection to a distant NBD server, and use its local device
(probably /dev/nda) where you would normally use a local partition in a
RAID setup.
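As a hedged sketch of that setup, assuming the mdadm tool and some hypothetical device names (/dev/md0 for the array, /dev/sda3 for the local partition; only /dev/nda comes from the text above), the function below prints the command that would build the mirror. It deliberately executes nothing, so the line can be checked before it is run as root.

```shell
# Sketch only: print the mdadm command that would build a two-way mirror
# from a local partition and an ENBD device. Nothing is executed.
# /dev/md0 and /dev/sda3 are hypothetical names; /dev/nda is from the text.
mirror_cmd() {
    md=${1:-/dev/md0}            # array device to create (assumed name)
    local_part=${2:-/dev/sda3}   # local half of the mirror (assumed name)
    nbd=${3:-/dev/nda}           # ENBD device used like a local partition
    echo "mdadm --create $md --level=1 --raid-devices=2 $local_part $nbd"
}

mirror_cmd
```

Listing the local partition first is a choice made for the sketch, not something the ENBD documentation prescribes.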
I've seen ENBD running at 70MB/s sustained over Giga-ethernet
(recent enbd 2.4.35a, between two standard 32-bit dual-core 2GHz intel
machines; cpu loading was at about 10%).
The original kernel NBD has been hardened in many ways in moving to ENBD: ENBD uses block-journaled multichannel communications; there is internal failover and automatic balancing between the channels; the client and server daemons restart, authenticate and reconnect after dying or loss of contact; the code can be compiled to take the networking transparently over SSL channels (see the Makefile for the compilation options).
To summarize briefly, the important changes in ENBD with respect to the standard NBD kernel driver are
Sorry for the confusion that results from me having one minor version number (2.4) which covers three (2.4, 2.5, 2.6) kernel version minors!
Here are some now ancient performance measurements from 2000:
These are taken under Linux kernel 2.0.36, as I recall, with
a much older version of ENBD than the current one!
The testbed back then was a pair of 64MB P200s on a 100BT switched
circuit using 3c905 NICs. The best speed I could get out of raw TCP
between them was 58.3Mb/s, tested using netperf.
On a 100BT switch I would expect to get at least 9.4MB/s with TCP under current ENBD, and going faster than the net is possible, indeed usual, under some circumstances, thanks to some tricks that ENBD incorporates. CPU load on 100BT on current multi-GHz 32-bit platforms is of the order of 10%. Recent measurements show 50MB/s in both directions being achieved with TCP on Gigabit ethernet (enbd 2.4.37).
The driver is known to work on intel (and AMD) 32 and 64-bit
platforms, and on ARM, and on sparc and other bigendian platforms. If
you have an unusual architecture please let me know if it does not
compile or run correctly and I'll fix it.
The first and best advice is to do all the compilation on the
client machine that you will be using to host the kernel ENBD devices.
It has the kernel configuration that we need to match during the
compilation - the enbd server never talks to its own kernel so it can
pretty well be compiled against any headers. It's an entirely
user-space application (there's even a Windows server floating around -
the protocol specification is published so servers can be written
fairly easily; there's a copy of the specification in the documentation
directory in the distribution). The ENBD client daemon does speak to
the kernel, however, and it needs to get the shapes and sizes of
several data types just right.
Usually, typing "make" in the enbd source directory
will do all the work automatically. However, that pre-supposes a very
common state of affairs that I'll make explicit, so you can mend it if
it is not so:
A distribution-provided kernel usually requires the kernel source
package to be installed separately. Usually (but not always) that
source is configured in the same way as the running distribution
kernel. In particular ...
I generally do "zcat /proc/config.gz | diff - .config". No output from that command is good. A minor output showing a date difference is harmless.
If there is no /proc/config.gz, well, a distribution kernel usually comes with the original config in /boot/config-`uname -r` (check the kernel package manifest for details).
You can restore the .config from one or another of those places.
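The check can be wrapped up roughly as follows. This is a sketch, not part of the ENBD distribution: the function name and the PROC_CONFIG and BOOT_CONFIG override variables are made up for illustration, and the defaults are just the two locations named above.

```shell
# Sketch: compare the running kernel's configuration with the .config in a
# kernel source tree. PROC_CONFIG and BOOT_CONFIG are hypothetical override
# variables so the function can be tried on sample files; their defaults
# are the two locations named in the text.
check_kernel_config() {
    src=${1:-.}                                   # kernel source directory
    proc_cfg=${PROC_CONFIG:-/proc/config.gz}
    boot_cfg=${BOOT_CONFIG:-/boot/config-$(uname -r)}
    if [ -r "$proc_cfg" ]; then
        zcat "$proc_cfg" | diff - "$src/.config"  # status is diff's status
    elif [ -r "$boot_cfg" ]; then
        diff "$boot_cfg" "$src/.config"
    else
        echo "no reference config found" >&2
        return 2
    fi
}
```

Run as "check_kernel_config /path/to/kernel/source". An exit status of 0 with no output means the two configurations match; 1 means they differ, with the differences printed by diff.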
Usually "make prepare" in the kernel source directory does the trick
(the aim is to establish the correct include/version.h file and a few
other little things).
Notice that this implies that you must have write permission in the
kernel directory. It doesn't mean you need to be root! I usually chown
the kernel source to my userid, recursively ("sudo chown ptb -R ."),
and do all the compiling as myself.
Unfortunately, the only way I know of finding out about this is to compile one kernel module and try to insert it into the running target kernel. If insmod complains about a compiler mismatch, the message will tell you which compiler was used originally. If it doesn't complain, you're using the right compiler. (Rather than be sensible, I usually just compile the whole caboodle, both kernel and enbd, with whatever compiler I happen to have to hand at that moment, and install the newly compiled kernel too!)
The above conditions are usually met just because life is kind. But in case the source has moved or you are compiling for a kernel that is not the running kernel on the compile machine, you can still get away with doing a "make" in the enbd source directory but you need to first set the environment variable KLINUXDIR to the location of the source directory for the target kernel. Typing "make KLINUXDIR=wherever" also works.
Again, the target kernel source directory must contain the correct target kernel .config file, and "make prepare" must have been run there.
So, to summarize, what you need to do in the normal case is, for example:
and provided the conditions enumerated above are met, all will
be compiled correctly. Only if it doesn't work, look harder at what's
going on and perhaps email me about it. I'll quote the INSTALL
directions:
Typing "make" in the enbd source directory will build enbd.ko, enbd_ioctl.ko, enbd-server, enbd-client in /tmp.
Change BUILD in the Makefile to change the build directory or run "make BUILD=wherever".
Run "make clean realclean config all" to really really really make
sure that everything is set up for you, but just "make" should
normally do the job.
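Put together, the build steps described so far might be listed like this. It is only a sketch: the function name is invented, the kernel source path in the example call is a placeholder, and the commands are printed rather than run so that they can be reviewed first. KLINUXDIR, BUILD and the /tmp default are the ones named above.

```shell
# Sketch: list the build steps for a given kernel source directory.
# Commands are printed, not executed. KLINUXDIR and BUILD are the make
# variables named in the text; /tmp is the default build directory.
enbd_build_steps() {
    klinux=$1              # prepared target kernel source (required)
    build=${2:-/tmp}       # where the modules and daemons end up
    echo "make -C $klinux prepare"
    echo "make KLINUXDIR=$klinux BUILD=$build"
}

enbd_build_steps /usr/src/linux
```

The second command is meant to be run from the enbd source directory.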
Then you can go on to do a test as follows ...
install the modules enbd.ko and enbd_ioctl.ko into the
running kernel with
% insmod /tmp/enbd.ko
% insmod /tmp/enbd_ioctl.ko
Then run "make test".
This depends on the presence of sudo.
Observe that the module enbd.ko is loaded (use /sbin/lsmod for that).
Observe also that enbd-server and enbd-client are running (use pstree for that).
Check that server and client daemons have branched off slave server and client daemons to handle the connections (use pstree to visualize the situation).
Check that the state of the device is good by doing a "cat /proc/nbdinfo".
You should see indications of /dev/nda being up and running. If anything is wrong, look in your system logs for error messages and send me the state shown by /proc/nbdinfo.
% mke2fs /dev/nda
% mount /dev/nda /mnt
% cd /mnt
% bonnie ...
The ndxN devices must exist on the client for this to work. I've
provided a script called MAKEDEV to make them. On the client, do
"cd /dev; sh path_to_MAKEDEV".
Be careful ... there is already a script called MAKEDEV in /dev. Name yours something different or look inside it and see what it does and make the devices it makes by hand. You need block devices /dev/nda, /dev/nda1, /dev/nda2, /dev/nda3, etc, with major 43 (or whatever the kernel sets for NBD_MAJOR) and minors 0, 1, 2, 3, etc. "mknod /dev/nda b 43 0; mknod /dev/nda1 b 43 1; ..." should do the trick.
You also need corresponding serial devices /dev/ndaS,
/dev/ndaS1, /dev/ndaS2, etc. The provided MAKEDEV should make them
automatically (but udev-users may need extra tricks to keep them
there permanently).
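For the block devices, the by-hand route can be sketched as a small generator. The function name and the default count of four nodes are assumptions; major 43 is the value quoted above and should be checked against NBD_MAJOR. The serial devices are left out because their numbering is not given here.

```shell
# Sketch: print the mknod commands for the ENBD block devices described in
# the text: /dev/nda with minor 0, then /dev/nda1, /dev/nda2, ... with
# minors 1, 2, ... Major 43 is the value quoted in the text.
nbd_mknod_cmds() {
    count=${1:-4}      # how many nodes to list (assumed default)
    major=${2:-43}     # check against NBD_MAJOR for your kernel
    minor=0
    while [ "$minor" -lt "$count" ]; do
        if [ "$minor" -eq 0 ]; then
            name=/dev/nda
        else
            name=/dev/nda$minor
        fi
        echo "mknod $name b $major $minor"
        minor=$((minor + 1))
    done
}

nbd_mknod_cmds 4
```

The output can be inspected and then fed to a root shell, for example with "nbd_mknod_cmds 4 | sudo sh".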
To abort the test, you can run "make stop" or "make rescue". I don't guarantee a rescue in all circumstances, but it'll try, and you can elaborate the Makefile to suit your circumstances.
The difficulty is in stopping the self-repairing code! Sending a
kill -USR1 to the daemons should shut them down and error out the
pending device queue requests. A kill -USR2 will try even harder to
shut them down. A kill -TERM should then murder the daemons safely,
allowing you to unload the kernel module.
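That escalation can be sketched as a loop over the three signals. The function name, the ENBD_STOP_PAUSE variable and the two second pause are assumptions made for the example; only the signal order comes from the paragraph above.

```shell
# Sketch: escalate through the signals in the order the text gives,
# USR1 then USR2 then TERM, for every process id passed as an argument.
# The pause between rounds is an assumption; tune it to taste.
stop_enbd_daemons() {
    pause=${ENBD_STOP_PAUSE:-2}
    for sig in USR1 USR2 TERM; do
        for pid in "$@"; do
            # A daemon that has already exited is not an error.
            kill -"$sig" "$pid" 2>/dev/null || true
        done
        sleep "$pause"
    done
}
```

Assuming pidof is available, a call might look like "stop_enbd_daemons $(pidof enbd-client enbd-server)", followed by unloading the module once the daemons have gone.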
Look at the output from /proc/nbdinfo to gauge the state of the
device. In particular, you should see the number of active sockets,
and the number of active client threads.
Device a: Open
[a] State: initialized, verify, rw, last error 0
[a] Queued: +0 curr reqs/+0 real reqs/+10 max reqs
[a] Buffersize: 86016 (sectors=168)
[a] Blocksize: 1024 (log=10)
[a] Size: 2097152
[a] Blocks: 2048
[a] Sockets: 4 (+) (+) (*) (+)
[a] Requested: 2048+0 (602) (462) (431) (553)
[a] Dispatched: 2048 (602) (462) (431) (553)
[a] Errored: 0 (0) (0) (0) (0)
[a] Pending: 0+0 (0) (0) (0) (0)
[a] Kthreads: 0 (0 waiting/0 running/1 max)
[a] Cthreads: 4 (+) (+) (+) (+)
[a] Cpids: 4 (9489) (9490) (9491) (9492)
Device b-p: Closed
In the above I see four client threads (Cthreads) all currently
within the kernel (+). They're probably waiting for work to do. I see
four network sockets open and known good (+) with the third of them
having been active last (*). The first socket seems to have taken more
of the work available than the rest, but the difference is not
significant. There are no errors reported and no requests waiting in
internal queues. If you send in a bug report, make sure to include the
output from /proc/nbdinfo.
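For scripted health checks, the two most telling numbers can be pulled out with awk. This is a sketch under the assumption that the layout always matches the one sample shown above; the function name is invented, and it reads standard input so that it can be tried on a saved copy of the file.

```shell
# Sketch: extract the socket count and the total error count per device
# from text in the /proc/nbdinfo layout shown above. The field positions
# are inferred from that single sample and may differ between versions.
nbdinfo_summary() {
    awk '$2 == "Sockets:" { sockets[$1] = $3 }
         $2 == "Errored:" { errors[$1] = $3 }
         END {
             for (d in sockets)
                 print d, "sockets=" sockets[d], "errors=" errors[d]
         }'
}
```

On a live client this would be run as "nbdinfo_summary < /proc/nbdinfo"; for the listing above it reports four sockets and zero errors for device a.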
The server generates a signature that is implanted into the client's nbd device at first contact. Any later attempt to connect to a server with a different signature will be rejected. It's an anti-spoofing device. The client doesn't really know the signature either - it's buried in the kernel and the client can only ask if it's been given the right signature or not.
Some find out that they can remove the kernel module and then start
again successfully. Of course! That wipes the embedded signature. But
it's not the solution. The right thing to do is to
Most people are caught by GOTCHA! #1, but some people hit #2,
which is
why I mention it here.
The signal with SIGPWR is normally taken care of by the assistant
daemons, nbd-sstatd and nbd-cstatd, but ten to one they haven't been
installed yet. I'll explain briefly ... the handshake sequence is
longer for a first contact than for a reconnect, and without the
SIGPWR the clients will try the short sequence instead of the long.
In addition, for enbd-2.4.35a and above, you can alternatively:
If you don't have a native 64 bit server system, from what I can find out from the current confused state of affairs in the linux world ... under glibc2 and kernels 2.2.* and 2.4.* you need to compile the nbd-server code with _LARGEFILE64_SOURCE defined. It's all set up for you from nbd-2.2.26 and nbd-2.4.5 on.
If you do not have Large File Support on your system, ENBD still
supports resource aggregation, via either linear or striping RAID, to
any size, unlimited by the 2GB file size maximum, provided only that
the individual components of the aggregate resource are below 2GB in
size. Check out the command line arguments for the server. Just listing
multiple resources on the command line is enough to cause some form of
aggregation to occur!
... or how to make ENBD work with heartbeat, the well known
failover infrastructure. Matthew Sackman has written a very good HOWTO
on this subject. You'll find his document here. I've added the scripts
necessary to the distribution archive (enbd-2.4.30 on) under the
nbd/etc/ha.d directory. Flash: Steve Purkis has adapted the scripts for
RedHat-based platforms, and I've included his scripts in the latest
archives (enbd-2.4.32 on) in the nbd/etc-RH/ha.d directory.
Sorry about the links. I hate non-inline documentation myself. In
compensation, I'll describe something of what one is trying to achieve
with failover; heartbeat is only a means to an end and in many
instances a simple little shell script will be just as good or better
and this description may help you construct it!
The idea is that server and client are both capable of using a
single "floating IP address". This floating IP is normally held by the
client, but it moves to the server when the client dies, and it moves
back again when the client comes back up and has been brought up to
date again. The floating IP is normally that announced in DNS for some
vital service such as samba or http.
Heartbeat is simply a general mechanism for detecting when the
client or server has failed, and for running the appropriate scripts in
response.
Overview: the client will normally be running a raid1
(mirror) composed of the NBD device and a local resource. When the
client dies, the floating IP is handed off to the server, which then
starts serving from the physical resource of which the NBD device
is/was a virtual image.
When the client comes back up, its local mirror component has to be
resynced from the NBD device component, but the client can take the IP
immediately, as the mirror resyncs in the background while it
continues working.
Abstraction: There are 4 possible states in which the pair
of machines can be: (1) server alive, client alive, (2) server alive,
client dead, (3) server dead, client alive, (4) server dead, client
dead. Of these, (1) is "normal" and (4) is impossible, for our
purposes - failover would have failed. The transitions (1)-(2),
(2)-(1) and (1)-(3), (3)-(1) are what we are interested in. Heartbeat
initiates actions on the surviving machine or machines after each
transition.
More detail: Let's look at the (1)-(2) transition. The
server is the survivor. It will run its 'enbd-client start' script
because it now has to take up the role of the client. If the client got
the chance before dying, it would also have run its 'enbd-client stop'
script. Leave aside what these scripts do for the moment and just focus
on the naming convention. In the (2)-(1) transition, when the client
comes back up, it runs its 'enbd-client start' script, and the server
runs its 'enbd-client stop' script.
Similarly, in the (1)-(3) transition, where the client is the
survivor, it must take up the role of the server and so it runs
'enbd-server start'. The server, if it got the chance before dying,
runs 'enbd-server stop'. On the reverse transition, (3)-(1), the client
runs 'enbd-server stop'
and the server runs 'enbd-server start'.
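The naming convention is easier to keep straight as a lookup table. The sketch below only restates the two paragraphs above; the function name and the "1-2" style labels for the transitions are invented for the example.

```shell
# Sketch: the naming convention above as a lookup table. Arguments are the
# transition ("1-2", "2-1", "1-3" or "3-1") and the machine ("client" or
# "server"); the output is the script invocation that machine runs.
ha_script() {
    case "$1:$2" in
        1-2:server|2-1:client) echo "enbd-client start" ;;
        1-2:client|2-1:server) echo "enbd-client stop" ;;
        1-3:client|3-1:server) echo "enbd-server start" ;;
        1-3:server|3-1:client) echo "enbd-server stop" ;;
        *) echo "unknown transition or machine: $1 $2" >&2
           return 1 ;;
    esac
}

ha_script 1-2 server   # the survivor takes up the client role
```

In every case the machine taking up a role runs the matching 'start' script and the machine giving it up runs the 'stop' script, if it gets the chance.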
What do these scripts do?
Look at (1)-(2) again, where the client dies and the server
survives and takes the client's role. The server has to kill its enbd
server daemon, fsck the raw partition if it wasn't journaled, and then
mount it in the place where its apache and samba services expect to
find it. So that's what 'enbd-client start' does for it.
The matter of taking the IP is normally handled by heartbeat, but
one can do it manually with a simple ifconfig eth0:0 foobar command in
the script. The same goes for starting and stopping the apache and
samba services - i.e. that's handled by heartbeat too.
If the client got a chance to run its 'enbd-client stop' script
before dying, it would have unmounted the raid mirror, then stopped the
mirror and stopped the enbd client daemon that it was running. So
that's what 'enbd-client stop' does do.
The (2)-(1) transition is the one that restarts the client in the
client role. Usually the server will live to see this transition
through, and its 'enbd-client stop' script will unmount the raw
partition, start the enbd-server daemon on it, and that's all. The
client's 'enbd-client start' script, on the other hand, has to
carefully start the enbd-client daemon, wait for the NBD device to come
up, then start the mirror with the NBD device as primary component. Oh
yes, it'll also steal the floating IP address - well, that's normally
handled by heartbeat itself.
The (1)-(3) transition should be thought about in the same way,
but it's linked to an easier set of scripts than (1)-(2), since the
apache and samba services don't need to be relocated - they stay on
the client.
The client is the survivor. It takes the role of the server with
'enbd-server start', so this script should kill its enbd-client daemon
(the mirror component was dead anyway). It does not need to do anything
else since the mirror itself has survived. It could take the NBD
component out of the mirror
with raidhotremove, but it does not need to. If the server got to run
'enbd-server stop' before dying, it should have killed its enbd-server
daemon and that's all.
The reverse transition (3)-(1) is harder. This is where the server
has to be reintegrated. It runs 'enbd-server start', which starts up
its enbd-server daemon. The client does the reintegration work - it
runs 'enbd-server stop', which starts the enbd-client daemon, waits for
the NBD device to come up, then integrates it into the mirror as a
secondary, using raidhotadd.
The scripts are in the HOWTO, and in the distribution archive.
Phew!
In ENBD 2.4.31, you can run under fr1
instead of plain kernel raid1, and get a "fully integrated" networked
RAID1 solution. Get the fr1 patch, apply it to your kernel source,
choose fr1 as a raid module in the kernel config (make menuconfig,
etc.), and recompile (make modules) for the module. Load the new md.o
module, and the fr1.o module on top of it. It's a replacement for
raid1.o. It is fully backward compatible with old kernel RAID1.
The intelligence in fr1 is in what happens when one of the enbd servers
fails, and what happens afterwards. In ordinary RAID1 mirroring, a
failed disk is usually replaced by the operator after a short delay,
and then the RAID controller takes it upon itself to resync the new
disk from the surviving good disk in the background, while the RAID
device pretends to the operating system that all systems are go, as
normal. Unfortunately, in the networked scenario
The reengineered mirroring in fr1 notices exactly what block writes are missing on the missing server, and when it comes back into contact, updates only those blocks.
That's a great speed up and time saver. It can reduce the time period in which the servers don't have redundancy from a matter of hours to a matter of seconds. And the resync is automatic when contact with an enbd server is reestablished. It doesn't require human intervention, because the enbd client issues the hotadd instruction.
To set it up, run an enbd device as one component of a fr1 ("raid1") mirror. That's it (so sue me, MacDonalds).
There is now an fr5 driver too, but even without it you can get at least the automatic resync on reconnect by patching the kernel for fr1 and then using the patched md.o module under ordinary kernel raid5.
From the horse's mouth
Nowadays (kernel 2.6.13 onwards), FR bitmap technology is to be found in the mainline kernel itself. The bitmap is implemented as an on-disk so-called intent log. That name only really refers to the fact that the on-disk bitmap is always a little pessimistic, because it is cleared more slowly than it is marked dirty, in order to avoid too much disk i/o. To put the kernel RAID bitmap in memory only, you will have to specify a bitmap on a ramdisk or tmpfs with mdadm --bitmap=, otherwise you get it on disk.
With the bitmap, one has most of the FR technology. What is missing is a facility to allow enbd to proactively warn the RAID driver when the network has died, and proactively tell it when the network returns and reinsert itself in the RAID array.
So, you should run mdadm --monitor --program= using a program (shell script, most likely) that acts on the Fail event sent it as argument. When Fail of an enbd device in the array happens, the program should run mdadm --remove for the enbd device and start polling the net and the enbd device itself to see when the network returns and the device is working again (the validated and enabled flags will be set in /proc/sys and/or be visible in the /proc/nbdinfo device status line), and then run mdadm --re-add for the enbd device.
If that description's not clear, I'll put up a script. Watch this space.
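In the meantime, here is a rough sketch of what such a handler could look like. It is not the script promised above, only an illustration of the steps just described. mdadm calls the monitor program with the event name, the md device and, where relevant, the component device. The ENBD_SERVER, POLL and RUN variables, the function names, and the use of ping as the only readiness check are all assumptions; a fuller version would also look at the device status in /proc/nbdinfo.

```shell
# Sketch of a handler for "mdadm --monitor --program=...".
# By default the mdadm commands are only printed; set RUN to the empty
# string to have them executed.
RUN=${RUN-echo}

enbd_is_back() {
    # Placeholder check (assumption): does the server answer a ping?
    ping -c 1 "${ENBD_SERVER:?set ENBD_SERVER}" >/dev/null 2>&1
}

handle_md_event() {
    event=$1        # e.g. Fail
    md=$2           # e.g. /dev/md0
    component=$3    # e.g. /dev/nda, may be empty
    [ "$event" = "Fail" ] || return 0
    case "$component" in
        /dev/nd*) ;;            # only act on ENBD components
        *) return 0 ;;
    esac
    $RUN mdadm "$md" --remove "$component"
    until enbd_is_back; do
        sleep "${POLL:-10}"     # polling interval in seconds (assumed)
    done
    $RUN mdadm "$md" --re-add "$component"
}
```

Saved as a script that ends with the line handle_md_event "$@", it would be started with something like "mdadm --monitor --scan --program=/path/to/handler", where the path is a placeholder.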
I started 2.4.35 by splitting off the sysfs code into a separate source file. Before I forked, however, 2.4.34 had had some updates go into the /proc/sys/devices/enbd display area so that the IP number of individual connections can be seen in there. That makes scripting for failover easier to make generic, since one can pick up the address to ping for connectivity from there. More state info is also shown than before I made the change - there are subdirectories dedicated to the individual slots, for example.
In 2.4.35 there is also a relatively minor change to the enbd-test facility, but it makes it exceedingly useful as a replacement for bonnie. Use -t 5 (the seek, write, read test), and specify the amount of i/o in the test with -S. You may want to specify about 50% of the resource size there; -S 50% will be understood. It's the amount to pepper with patterned data at random locations. It gets read back and checked later, and the speed both ways is printed at the end. Don't try for 100%! The test generation algorithm discards double-hits as they turn up, so you can wait forever for 100% coverage to be achieved randomly like that. The change was to use a binary tree algorithm to store the patterns written, so the checking process is almost costless now, computationally (Order n log n, total). Otherwise it would grind to a halt for any significant amount of data.
Other minor differences are in the way tcp level packets are merged, via the MSG_MORE flag on the socket instead of the linux specific TCP_CORK. Internally the work task queue support for multireq has been factored into a separate source file.
In 2.4.36 it is now possible to use SCTP as a transport as well as TCP (or SSL). Use sctp://server:port as the address in the client command line. Improvements will follow (the library seems to use 64bit packets by default so it's deadly slow at the moment). You need libsctp1 and libsctp-dev (debian packages).
The server also clearly benefits from networking from mmapped buffers, so I made at least networked writes from mmapped buffers the default on the server - the -M option now just adds networked reads to the mmapped resource into the mix. Network sends from an mmapped file buffer should be zero-copy, like sendfile is, and indeed it seems to measure out as such in practice. Network reads to an mmapped file buffer don't seem to win anything extra, however.
For these and other improvements I made what backports I could and issued a 2.4.35a, as 2.4.36 continues to be the development vehicle yet it would be a shame to deny the improvements to users of the stable 2.4.35.
Peter T. Breuer ptb@inv.it.uc3m.es