

The Enhanced Network Block Device Linux Kernel Module

(Last revised: 21 Jan. 2010)

Latest news! Enbd 2.4.37a works over SCTP as well as TCP and SSL. Use a server URL of the form sctp://... at the client end.


    The Enhanced NBD is the result of an industrially funded academic research project with Realm Software of Atlanta, GA, to toughen up the kernel's NBD. It started back in 2.0 times, when I back-ported the nascent NBD by Pavel Machek from the 2.1 development kernel.

    What is an NBD?

    An NBD is "a long pair of wires". It makes a remote disk on a different machine act as though it were a local disk on your machine. It looks like a block device on the local machine where it's typically going to appear as /dev/nda. The remote resource doesn't need to be a whole disk or even a partition. It can be a file.

    NBD transports a physical block device over the net

    The intended use for ENBD in particular is for RAID over the net. You can make any NBD device part of a RAID mirror in order to get real time mirroring to a distant (and safe!) backup. To make it clear: start up an NBD connection to a distant NBD server, and use its local device (probably /dev/nda) where you would normally use a local partition in a RAID setup.

    I've seen ENBD running at 70MB/s sustained over Giga-ethernet (recent enbd 2.4.35a, between two standard 32-bit dual-core 2GHz intel machines; cpu loading was at about 10%).

    The original kernel NBD has been hardened in many ways in moving to ENBD: ENBD uses block-journaled multichannel communications; there is internal failover and automatic balancing between the channels; the client and server daemons restart, authenticate and reconnect after dying or loss of contact; the code can be compiled to take the networking transparently over SSL channels (see the Makefile for the compilation options).

    To summarize briefly, the important changes in ENBD with respect to the standard NBD kernel driver are


    I haven't been following Pavel's (now Paul's, etc.) kernel driver closely but we are in friendly contact, and bugfixes pass between the two when the differing architectures permit.

    Requirements

    Kernel version 2.2.10 to 2.2.15 or thereabouts onwards, kernel 2.4.0 onwards, or kernel 2.6.3 onwards. "Legacy" branch versions enbd-2.2.25 onwards also work under at least the early 2.4 kernels. The "current" branch versions enbd-2.4.* work on both 2.2 and 2.4 kernels as well as 2.6 kernels, and probably on 2.0 kernels too (as do the enbd-2.2.* series). Compatibility layers in the driver code serve to emulate the newer kernel interfaces on older kernel architectures.

    Current version as of Aug. 2008

    The legacy code for the 2.2 kernel is to be found at enbd-2.2-current (currently enbd-2.2.29). See the ftp area for the full set of releases. The latest fully qualified stable release is enbd-2.4.35 (which in theory supports the 2.2 kernels as well as the intended 2.4 and 2.6 kernels), available at enbd-2.4-current. The development version is currently enbd-2.4.37a, in the same directory. That directory also holds enbd-2.4.35a and enbd-2.4.35b, which are earlier stable (and development) releases with updates, corrections, and portability changes for compatibility with later kernels than the ones they were developed against. I have tested and run under kernels up to and including 2.6.32. Yes, this information is probably going to be out of date when you see it.

    Sorry for the confusion that results from me having one minor version number (2.4) which covers three (2.4, 2.5, 2.6) kernel version minors!

    Documentation

    I'll refer you for general orientation to the Linux Journal article (vol #73 May 2000), but please regard the documentation bundled with the distribution as authoritative. The journal article is accurate only for the procedures for 2.2 series codes. I keep a copy in postscript format here.

    Here are some now ancient performance measurements from 2000:

    NBD performance figures

    These are taken under Linux kernel 2.0.36, as I recall, with a much older version of ENBD than the current one! The testbed back then was a pair of 64MB P200s on a 100BT switched circuit using 3c905 NICs. The best speed I could get out of raw TCP between them was 58.3Mb/s, tested using netperf.

    On a 100BT switch I would expect to get at least 9.4MB/s with TCP under current ENBD, and going faster than the net is possible, indeed usual, under some circumstances, thanks to some tricks that ENBD incorporates. CPU load on 100BT on current multi-GHz 32-bit platforms is of the order of 10%. Recent measurements show 50MB/s in both directions being achieved with TCP on Gigabit ethernet (enbd 2.4.37).

    The driver is known to work on Intel (and AMD) 32-bit and 64-bit platforms, on ARM, and on sparc and other big-endian platforms. If you have an unusual architecture please let me know if it does not compile or run correctly and I'll fix it.

    HOWTO

    The first and best advice is to do all the compilation on the client machine that you will be using to host the kernel ENBD devices. It has the kernel configuration that we need to match during the compilation - the enbd server never talks to its own kernel so it can pretty well be compiled against any headers. It's an entirely user-space application (there's even a Windows server floating around - the protocol specification is published so servers can be written fairly easily; there's a copy of the specification in the documentation directory in the distribution). The ENBD client daemon does speak to the kernel, however, and it needs to get the shapes and sizes of several data types just right.

    Usually, typing "make" in the enbd source directory will do all the work automatically. However, that pre-supposes a very common state of affairs that I'll make explicit, so you can mend it if it is not so:

    1. The softlink /lib/modules/`uname -r`/build must point to the source directory where the running kernel was compiled - i.e. you (or somebody else) compiled it in that source directory and you haven't moved, changed, removed or cleaned the source afterwards.

      A distribution-provided kernel usually requires the kernel source package to be installed separately. Usually (but not always) that source is configured in the same way as the running distribution kernel. In particular ...

    2. The .config in the kernel source directory must be the one that was used to compile the running kernel. You can make sure it is by comparing it with `zcat /proc/config.gz`, if that file exists.

      I generally do "zcat /proc/config.gz | diff - .config". No output from that command is good. A minor output showing a date difference is harmless.

      If there is no /proc/config.gz, well, a distribution kernel usually comes with the original config in /boot/config-`uname -r` (check the kernel package manifest for details).

      You can restore the .config from one or another of those places.

    3. If you put the right .config back in its place, you will need to let the kernel compile go through again for at least the first few files of the compilation.

      Usually "make prepare" in the kernel source directory does the trick (the aim is to establish the correct include/version.h file and a few other little things).

      Notice that this implies that you must have write permission in the kernel directory. It doesn't mean you need to be root! I usually chown the kernel source to my userid, recursively ("sudo chown ptb -R ."), and do all the compiling as myself.

    4. Make sure you are using the same compiler as was used to compile the running kernel. I.e. if you've upgraded since, reinstall the old compiler somewhere and set the environment variable CC to point to it (e.g. "export CC=gcc-3.4").

      Unfortunately, the only way I know of finding out about this is to compile one kernel module and try to insert it into the running target kernel. If insmod complains about a compiler mismatch, you'll see which compiler was used originally from the message. If it doesn't complain, you're using the right compiler (failing being sensible, I usually just compile the whole caboodle, both kernel and enbd, with the compiler I happen to have to hand at that moment, and install the newly compiled kernel too!).

    The above conditions are usually met just because life is kind. But in case the source has moved or you are compiling for a kernel that is not the running kernel on the compile machine, you can still get away with doing a "make" in the enbd source directory but you need to first set the environment variable KLINUXDIR to the location of the source directory for the target kernel. Typing "make KLINUXDIR=wherever" also works.

    Again, the target kernel source directory must contain the correct target kernel .config file, and "make prepare" must have been run there.
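
    The .config comparison described above can be wrapped in a small helper. This is only a sketch: check_kernel_tree is a hypothetical name, not something shipped with enbd, and it assumes that lines starting with "#" (such as the date header) can be ignored when comparing.

```shell
#!/bin/sh
# Sketch: check that a kernel source tree carries the expected .config
# before building enbd against it.
# check_kernel_tree is a hypothetical helper, not part of the enbd distribution.
# Usage: check_kernel_tree KERNEL_SRC_DIR REFERENCE_CONFIG
check_kernel_tree() {
    ksrc=$1
    ref=$2
    [ -f "$ksrc/.config" ] || { echo "no .config in $ksrc"; return 1; }
    # ASSUMPTION: comment lines (e.g. the date header) are not significant.
    case $ref in
        *.gz) want=$(zcat "$ref" | grep -v '^#') ;;
        *)    want=$(grep -v '^#' "$ref") ;;
    esac
    have=$(grep -v '^#' "$ksrc/.config")
    if [ "$want" = "$have" ]; then
        echo "config matches"
    else
        echo "config differs: restore .config, then run 'make prepare'"
        return 1
    fi
}

# Typical calls, using the paths named in the text:
#   check_kernel_tree "/lib/modules/$(uname -r)/build" /proc/config.gz
#   check_kernel_tree /path/to/target/kernel "/boot/config-$(uname -r)"
```

    If the check passes, "make" (or "make KLINUXDIR=wherever") in the enbd source directory should have what it needs.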

    So, to summarize, what you need to do in the normal case is, for example:

    1. % tar xzvf enbd-2.4.36.tgz
    2. % cd enbd-2.4.36
    3. % make
    4. Provided the conditions enumerated above are met, all will be compiled correctly. Only if it doesn't work, look harder at what's going on and perhaps email me about it. I'll quote the INSTALL directions:

      Typing "make" in the enbd source directory will build enbd.ko, enbd_ioctl.ko, enbd-server, enbd-client in /tmp.

      Change BUILD in the Makefile to change the build directory or run "make BUILD=wherever".

      Run "make clean realclean config all" to really really really make sure that everything is set up for you, but just "make" should normally do the job.

      Then you can go on to do a test as follows ...

    5. Make sure that sudo is installed and that you are a sudoer.

      Install the modules enbd.ko and enbd_ioctl.ko into the running kernel with

      % insmod /tmp/enbd.ko
      % insmod /tmp/enbd_ioctl.ko

      Then run "make test".

      This depends on the presence of sudo.

      Observe that the module enbd.ko is loaded (use /sbin/lsmod for that).

      Observe also that enbd-server and enbd-client are running (use pstree for that).

      Check that server and client daemons have branched off slave server and client daemons to handle the connections (use pstree to visualize the situation).

      Check that the state of the device is good by doing a "cat /proc/nbdinfo".

      You should see indications of /dev/nda being up and running. If anything is wrong, look in your system logs for error messages and send me the state shown by /proc/nbdinfo.

    What happens in the test above?

    Well, "make test" should set up a small file in /tmp on localhost and serve it to the same machine. That'll appear as the /dev/nda device. The makefile will run the "enbd-test" utility on /dev/nda and report the results. If all has gone well, you can make a file system with "mke2fs /dev/nda" and play. I suggest:
     
      % mke2fs /dev/nda
      % mount /dev/nda /mnt
      % cd /mnt
      % bonnie ...


    The ndxN devices must exist on the client for this to work. I've provided a script called MAKEDEV to make them. On the client, do "cd /dev; sh path_to_MAKEDEV".
     

    Be careful ... there is already a script called MAKEDEV in /dev. Name yours something different or look inside it and see what it does and make the devices it makes by hand. You need block devices /dev/nda, /dev/nda1, /dev/nda2, /dev/nda3, etc, with major 43 (or whatever the kernel sets for NBD_MAJOR) and minors 0, 1, 2, 3, etc. "mknod /dev/nda b 43 0; mknod /dev/nda1 b 43 1; ..." should do the trick.

    You also need corresponding serial devices /dev/ndaS, /dev/ndaS1, /dev/ndaS2, etc. The provided MAKEDEV should make them automatically (but udev-users may need extra tricks to keep them there permanently).
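
    For the block devices only, the by-hand mknod sequence can be generated rather than typed. The sketch below just prints the commands; print_enbd_mknod is a hypothetical helper, and the serial devices are left to the provided MAKEDEV script since their numbering is not given here.

```shell
#!/bin/sh
# Sketch: print the mknod commands for the ENBD block devices of device "a".
# print_enbd_mknod is a hypothetical helper; the MAKEDEV script shipped with
# enbd remains the authoritative way to create the device nodes.
# Usage: print_enbd_mknod [MAJOR] [COUNT]
print_enbd_mknod() {
    major=${1:-43}   # 43 unless your kernel sets a different NBD_MAJOR
    count=${2:-4}    # how many nodes to print: nda, nda1, nda2, ...
    minor=0
    while [ "$minor" -lt "$count" ]; do
        if [ "$minor" -eq 0 ]; then
            name=/dev/nda
        else
            name=/dev/nda$minor
        fi
        echo "mknod $name b $major $minor"
        minor=$((minor + 1))
    done
}
```

    Review the printed commands first, then pipe them to a root shell if they look right (for example "print_enbd_mknod | sudo sh").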

    To abort the test, you can run "make stop" or  "make rescue". I don't guarantee a rescue in all circumstances, but it'll try, and you can elaborate the Makefile to suit your circumstances.

    The difficulty is in stopping the self-repairing code! Sending a kill -USR1 to the daemons should shut them down and error out the pending device queue requests. A kill -USR2 will try even harder to shut them down. A kill -TERM should then murder the daemons safely, allowing you to unload the kernel module.
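
    The escalation just described (USR1, then USR2, then TERM) might be scripted roughly as follows. This is a sketch, not the Makefile's "make stop": stop_enbd_daemons is a hypothetical name, the pause between signals is an arbitrary assumption, and you have to supply the daemon pids yourself.

```shell
#!/bin/sh
# Sketch: send USR1, USR2 and TERM in turn to the given enbd daemon pids.
# stop_enbd_daemons is hypothetical. KILL and PAUSE can be overridden,
# e.g. KILL=echo PAUSE=0 for a dry run that only prints what would be sent.
KILL=${KILL:-kill}
PAUSE=${PAUSE:-5}   # seconds between signals; an arbitrary assumption

stop_enbd_daemons() {
    for sig in USR1 USR2 TERM; do
        for pid in "$@"; do
            $KILL -"$sig" "$pid"
        done
        # Give the daemons time to react before escalating further.
        [ "$sig" = TERM ] || sleep "$PAUSE"
    done
}
```

    Once the daemons are gone, the kernel module can be unloaded.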
     

    Look at the output from /proc/nbdinfo to gauge the state of the device. In particular, you should see the number of active sockets, and the number of active client threads.
     

    Device a:       Open
    [a] State:      initialized, verify, rw, last error 0
    [a] Queued:     +0 curr reqs/+0 real reqs/+10 max reqs
    [a] Buffersize: 86016   (sectors=168)
    [a] Blocksize:  1024    (log=10)
    [a] Size:       2097152
    [a] Blocks:     2048
    [a] Sockets:    4       (+)     (+)     (*)     (+)
    [a] Requested:  2048+0  (602)   (462)   (431)   (553)
    [a] Dispatched: 2048    (602)   (462)   (431)   (553)
    [a] Errored:    0       (0)     (0)     (0)     (0)
    [a] Pending:    0+0     (0)     (0)     (0)     (0)
    [a] Kthreads:   0       (0 waiting/0 running/1 max)
    [a] Cthreads:   4       (+)     (+)     (+)     (+)
    [a] Cpids:      4       (9489)  (9490)  (9491)  (9492)
    Device b-p:     Closed


    In the above I see four client threads (Cthreads) all currently within the kernel (+). They're probably waiting for work to do. I see four network sockets open and known good (+) with the third of them having been active last (*). The first socket seems to have taken more of the work available than the rest, but the difference is not significant. There are no errors reported and no requests waiting in internal queues. If you send in a bug report, make sure to include the output from /proc/nbdinfo.
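
    For routine checks, the handful of numbers read off above can be extracted mechanically. A minimal sketch, assuming the field layout matches the sample shown (it may differ between enbd versions); nbdinfo_summary is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: summarize /proc/nbdinfo-style text read from standard input.
# nbdinfo_summary is hypothetical; it assumes lines of the form
# "[a] Sockets:    4 ..." as in the sample output.
# Usage: nbdinfo_summary DEVICE_LETTER < /proc/nbdinfo
nbdinfo_summary() {
    awk -v tag="[$1]" '
        $1 == tag && $2 == "Sockets:"  { sockets  = $3 }
        $1 == tag && $2 == "Cthreads:" { cthreads = $3 }
        $1 == tag && $2 == "Errored:"  { errored  = $3 }
        $1 == tag && $2 == "Pending:"  { pending  = $3 }
        END {
            printf "sockets=%s cthreads=%s errored=%s pending=%s\n",
                   sockets, cthreads, errored, pending
        }'
}
```

    A bug report should still carry the full /proc/nbdinfo output, not just this summary.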
     

    GOTCHA!

    Some people run into trouble just when they've got a bit of confidence and try setting things up for themselves. They run successfully and then stop the server and the client for a while and then try again. They can't reconnect! What's going on?

    The server generates a signature that is implanted into the client's nbd device at first contact. Any attempt to afterwards connect to a server with a different signature will be rejected. It's an anti-spoofing device. The client doesn't really know the signature either - it's buried in the kernel and the client can only ask if it's been given the right signature or not.

    Some find out that they can remove the kernel module and then start again successfully. Of course! That wipes the embedded signature. But it's not the solution. The right thing to do is to

    1. Generate the same signature in the server every time you start it, using its "-i foobar" option.
    2. If you restart the server without restarting the client, signal the client with SIGPWR ("kill -PWR 19645" or whatever the pid is).

    Most people are caught by GOTCHA! #1, but some people hit #2, which is why I mention it here.

    The signal with SIGPWR is normally taken care of by the assistant daemons, nbd-sstatd and nbd-cstatd, but ten to one they haven't been installed yet. I'll explain briefly ... the handshake sequence is longer for a first contact than for a reconnect, and without the SIGPWR the clients will try the short sequence instead of the long.
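
    The two remedies can be captured in a pair of helpers. Treat this as a sketch under assumptions: both function names are hypothetical, and the position of "-i SIGNATURE" on the enbd-server command line is a guess that should be checked against the documentation in the distribution.

```shell
#!/bin/sh
# Sketch of the two remedies for the reconnect GOTCHA. Both helpers are
# hypothetical and only as reliable as the assumptions noted below.

# Print a server command line that always uses the same signature.
# ASSUMPTION: "-i SIGNATURE" goes before the port and resource; the real
# argument order is defined by enbd-server itself.
# $1 = signature, $2 = control port, $3 = resource
enbd_server_cmdline() {
    echo "enbd-server -i $1 $2 $3"
}

# Tell already-running client daemons that the server was restarted.
# KILL can be overridden (e.g. KILL=echo) for a dry run.
KILL=${KILL:-kill}
notify_clients() {
    for pid in "$@"; do
        $KILL -PWR "$pid"
    done
}
```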

    HOWTO-2

    I'll lay out in a bit more detail what "make test" does for you so that you can duplicate it for yourself. The first set of instructions is for an enbd-2.2.* code, and you'll find instructions for the enbd-2.4.* codes immediately after them. Please watch out for command line differences:
     
    1. Choose a resource (file or partition) on the serving machine and choose some ports on which to serve it out to the client. Then start the server:

          enbd-server 1100 1101 1102 1103 /dev/sda1

    2. On the client, load the enbd module (make sure to get the right one, using absolute path names if in doubt):

          insmod enbd.o

    3. On the client machine, start the client:

          enbd-client your.server 1100 1101 1102 1103 /dev/nda

    That was for an enbd-2.2.*. For an enbd-2.4.*, the sequence is as follows:
     
    1. Choose a resource (file or partition) on the serving machine and a single control port. Then start the server. Here the resource is /dev/sda1:

          enbd-server 1099 /dev/sda1

    2. On the client, load the enbd module (make sure to get the right one, using absolute path names if in doubt):

          insmod enbd.ko
          insmod enbd_ioctl.ko

    3. On the client machine, start the client. Note that you give the server control port plus the number of channels you want it to set up. It'll find and set up on its own different ports for the data channels:

          enbd-client your.server:1099 -n 4 /dev/nda
       

    In addition, for enbd-2.4.35a and above, you can alternatively:
     

    1. edit /etc/enbd.conf to contain the lines:

          server test 1099 /dev/sda1
          client test /dev/nda your.server 1099 -n 4

      on server and client machine respectively.

    2. on the client, load the enbd module with insmod:

          % insmod enbd.ko
          % insmod enbd_ioctl.ko

    3. Then run enbd-server and enbd-client without arguments on the respective machines.


    What about resources > 2GB?

    It's really up to the server, and thus a userspace question. Look: if the server system can use lseek() to move across more than 2GB, then the NBD network protocol will support it, because it passes 64 bit offsets. And on the clientside, the daemon certainly interacts with the kernel using 64bit offsets too (whether you can access >2GB files on the client is again a userspace question).

    If you don't have a native 64 bit server system, from what I can find out from the current confused state of affairs in the linux world ... under glibc2 and kernels 2.2.* and 2.4.* you need to compile the nbd-server code with _LARGEFILE64_SOURCE defined. It's all set up for you from nbd-2.2.26 and nbd-2.4.5 on.

    If you do not have Large File Support on your system, the ENBD still supports resource aggregation, via either linear or striping RAID, to any size, unlimited by the 2GB file size maximum, provided only that the individual components of the aggregate resource are below 2GB in size. Check out the command line arguments for the server. Just listing multiple resources on the command line is enough to cause some form of aggregation to occur!

    Setting up for failover

    ... or how to make ENBD work with heartbeat, the well known failover infrastructure. Matthew Sackman has written a very good HOWTO on this subject. You'll find his document here. I've added the scripts necessary to the distribution archive (enbd-2.4.30 on) under the nbd/etc/ha.d directory. Flash: Steve Purkis has adapted the scripts for RedHat-based platforms, and I've included his scripts in the latest archives (enbd-2.4.32 on) in the nbd/etc-RH/ha.d directory.

    Sorry about the links. I hate non-inline documentation myself. In compensation, I'll describe something of what one is trying to achieve with failover; heartbeat is only a means to an end and in many instances a simple little shell script will be just as good or better and this description may help you construct it!

    The idea is that server and client are both capable of using a single "floating IP address". This floating IP is normally held by the client, but it moves to the server when the client dies, and it moves back again when the client comes back up and has been brought up to date again. The floating IP is normally that announced in DNS for some vital service such as samba or http.

    Heartbeat is simply a general mechanism for detecting when the client or server has failed, and for running the appropriate scripts in response.

    Overview: the client will normally be running a raid1 (mirror) composed of the NBD device and a local resource. When the client dies, the floating IP is handed off to the server, which then starts serving from the physical resource of which the NBD device is/was a virtual image. When the client comes back up, its local mirror component has to be resynced from the NBD device component, but the client can take the IP immediately, as the mirror resyncs in the background while it continues working.

    Abstraction: There are 4 possible states in which the pair of machines can be: (1) server alive, client alive, (2) server alive, client dead, (3) server dead, client alive, (4) server dead, client dead. Of these, (1) is "normal" and (4) is impossible, for our purposes - failover would have failed. The transitions (1)-(2), (2)-(1) and (1)-(3), (3)-(1) are what we are interested in. Heartbeat initiates actions on the surviving machine or machines after each transition.

    More detail: Let's look at the (1)-(2) transition. The server is the survivor. It will run its 'enbd-client start' script because it now has to take up the role of the client. If the client got the chance before dying, it would also have run its 'enbd-client stop' script. Leave aside what these scripts do for the moment and just focus on the naming convention. In the (2)-(1) transition, when the client comes back up, it runs its 'enbd-client start' script, and the server runs its 'enbd-client stop' script.

    Similarly, in the (1)-(3) transition, where the client is the survivor, it must take up the role of the server and so it runs 'enbd-server start'. The server, if it got the chance before dying, runs 'enbd-server stop'. On the reverse transition, (3)-(1), the client runs 'enbd-server stop' and the server runs 'enbd-server start'.

    What do these scripts do?

    Look at (1)-(2) again, where the client dies and the server survives and takes the client's role. The server has to kill its enbd server daemon, fsck the raw partition if it wasn't journaled, and then mount it in the place where its apache and samba services expect to find it. So that's what 'enbd-client start' does for it.

    The matter of taking the IP is normally handled by heartbeat, but one can do it manually with a simple ifconfig eth0:0 foobar command in the script. The same goes for starting and stopping the apache and samba services - i.e. that's handled by heartbeat too.

    If the client got a chance to run its 'enbd-client stop' script before dying, it would have unmounted the raid mirror, then stopped the mirror and stopped the enbd client daemon that it was running. So that's what 'enbd-client stop' does do.

    The (2)-(1) transition is the one that restarts the client in the client role. Usually the server will live to see this transition through, and its 'enbd-client stop' script will unmount the raw partition, start the enbd-server daemon on it, and that's all. The client's 'enbd-client start' script, on the other hand, has to carefully start the enbd-client daemon, wait for the NBD device to come up, then start the mirror with the NBD device as primary component. Oh yes, it'll also steal the floating IP address - well, that's normally handled by heartbeat itself.

    The (1)-(3) transition should be thought about in the same way, but it's linked to an easier set of scripts than (1)-(2), since the apache and samba services don't need to be relocated - they stay on the client.

    The client is the survivor. It takes the role of the server with 'enbd-server start', so this script should kill its enbd-client daemon (the mirror component was dead anyway). It does not need to do anything else since the mirror itself has survived. It could take the NBD component out of the mirror with raidhotremove, but it does not need to. If the server got to run 'enbd-server stop' before dying, it should have killed its enbd-server daemon and that's all.

    The reverse transition (3)-(1) is harder. This is where the server has to be reintegrated. It runs 'enbd-server start', which starts up its enbd-server daemon. The client does the reintegration work - it runs 'enbd-server stop', which starts the enbd-client daemon, waits for the NBD device to come up, then integrates it into the mirror as a secondary, using raidhotadd.

    The scripts are in the HOWTO, and in the distribution archive. Phew!
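
    As a memory aid, the description above condenses into a lookup of who does what in which script. The sketch below only prints a plan and runs nothing; ha_steps is a hypothetical helper and is not one of the scripts in the HOWTO or the distribution.

```shell
#!/bin/sh
# Sketch: print the steps each failover script performs, as described in
# the text. ha_steps is hypothetical and executes nothing.
# Usage: ha_steps MACHINE SCRIPT ACTION
#   MACHINE is the machine's normal role: "server" or "client".
ha_steps() {
    case "$1:$2:$3" in
        server:enbd-client:start)   # (1)-(2): server takes the client role
            echo "kill enbd-server daemon"
            echo "fsck raw partition if not journaled"
            echo "mount raw partition for services" ;;
        client:enbd-client:stop)    # (1)-(2): client, if it gets the chance
            echo "unmount raid mirror"
            echo "stop raid mirror"
            echo "stop enbd-client daemon" ;;
        server:enbd-client:stop)    # (2)-(1): server returns to serving
            echo "unmount raw partition"
            echo "start enbd-server daemon" ;;
        client:enbd-client:start)   # (2)-(1): client resumes its role
            echo "start enbd-client daemon"
            echo "wait for NBD device"
            echo "start raid mirror with NBD device as primary" ;;
        client:enbd-server:start)   # (1)-(3): client survives alone
            echo "kill enbd-client daemon" ;;
        server:enbd-server:stop)    # (1)-(3): server, if it gets the chance
            echo "kill enbd-server daemon" ;;
        server:enbd-server:start)   # (3)-(1): server comes back
            echo "start enbd-server daemon" ;;
        client:enbd-server:stop)    # (3)-(1): client reintegrates the server
            echo "start enbd-client daemon"
            echo "wait for NBD device"
            echo "raidhotadd NBD device to mirror as secondary" ;;
        *)
            echo "unknown combination: $1 $2 $3" >&2
            return 1 ;;
    esac
}
```

    For example, "ha_steps client enbd-server stop" lists the reintegration work for the (3)-(1) transition.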

    Intelligent mirroring


    (The FR technology is nowadays to be found in the mainline kernel RAID, since about kernel 2.6.13, so you no longer need the separate FR1 module to get intelligent RAID -- see below instead).

    In ENBD 2.4.31, you can run under fr1  instead of plain kernel raid1, and get a "fully integrated" networked RAID1 solution. Get the fr1 patch, apply it to your kernel source, choose fr1 as a raid module in the kernel config (make menuconfig, etc.), and recompile (make modules) for the module. Load the new md.o module, and the fr1.o module on top of it. It's a replacement for raid1.o. It is fully backward compatible with old kernel RAID1.

    The intelligence in fr1 is in what happens when one of the enbd servers fails, and what happens afterwards. In ordinary RAID1 mirroring, a failed disk is usually replaced by the operator after a short delay, and then the RAID controller takes it upon itself to resync the new disk from the surviving good disk in the background, while the RAID device pretends to the operating system that all systems are go, as normal. Unfortunately, in the networked scenario

    1. temporary network failures are much more common than real disk blow-ups, so we probably only need to catch up with a bit of the data when the network comes back, not write the whole disk from scratch!
    2. the reason you are working over the network is probably to aggregate a large number of disks, with a total size in the terabytes or petabytes (if you can get there; 16TB is currently the aggregated limit on 32-bit systems on Linux), so resyncing all of a disk is something you definitely do not want to do.
    3. unless you are using Gigabit ethernet, or some other very fast medium, the transport is probably slower than the disks, so you want to avoid network transfers when possible.
The reengineered mirroring in fr1 notices exactly what block writes are missing on the missing server, and when it comes back into contact, updates only those blocks.

That's a great speed up and time saver. It can reduce the time period in which the servers don't have redundancy from a matter of hours to a matter of seconds. And the resync is automatic when contact with an enbd server is reestablished. It doesn't require human intervention, because the enbd client issues the hotadd instruction.
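
A toy model may make the saving concrete. It is not how fr1 is implemented (that lives in the kernel); it only illustrates the bookkeeping idea of remembering which blocks were written while a component was away. Both helper names are hypothetical.

```shell
#!/bin/sh
# Toy model of partial resync, NOT fr1's implementation.
# Input to blocks_to_resync: one block number per line, one line for every
# write that happened while the mirror component was out of contact.
# Output: each affected block once, in ascending order.
blocks_to_resync() {
    sort -n -u
}

# Share of the device that a partial resync has to copy, as a whole percentage.
# $1 = number of dirty blocks, $2 = total blocks on the device
resync_percent() {
    echo $(( $1 * 100 / $2 ))
}
```

Blocks written several times during the outage are copied only once, and untouched blocks are not copied at all.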

To set it up, run an enbd device as one component of a fr1 ("raid1") mirror. That's it (so sue me, MacDonalds).

There is now an fr5 driver too, but even without it you can get at least the automatic resync on reconnect by patching the kernel for fr1 and then using the patched md.o module under ordinary kernel raid5.


From the horse's mouth

Nowadays (kernel 2.6.13 onwards), FR bitmap technology is to be found in the mainline kernel itself. The bitmap is implemented as an on-disk so-called intent log. That name only really refers to the fact that the on-disk bitmap is always a little pessimistic, because it is cleared more slowly than it is marked dirty, in order to avoid too much disk i/o. To put the kernel RAID bitmap in memory only, you will have to specify a bitmap on a ramdisk or tmpfs with mdadm --bitmap=, otherwise you get it on disk.
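
A mirror with a bitmap over a local device and an enbd device might be created along the following lines. This is a sketch: mirror_create_cmd is a hypothetical helper that only prints the command, the device names are examples, and the options should be checked against the man page of the mdadm version you have installed.

```shell
#!/bin/sh
# Sketch: print (but do not run) an mdadm command that creates a two-way
# mirror with a write-intent bitmap. mirror_create_cmd is hypothetical.
# Usage: mirror_create_cmd MD_DEVICE LOCAL_DEVICE ENBD_DEVICE [BITMAP]
#   BITMAP defaults to "internal" (bitmap kept on disk); pass a file path
#   on a ramdisk or tmpfs to keep the bitmap in memory only.
mirror_create_cmd() {
    bitmap=${4:-internal}
    echo "mdadm --create $1 --level=1 --raid-devices=2 --bitmap=$bitmap $2 $3"
}
```

For example, "mirror_create_cmd /dev/md0 /dev/sda3 /dev/nda" prints a command using an on-disk bitmap, which you can inspect before running it as root.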

With the bitmap, one has most of the FR technology. What is missing is a facility to allow enbd to proactively warn the RAID driver when the network has died, and proactively tell it when the network returns and reinsert itself in the RAID array.

So, you should run mdadm --monitor --program= using a program (shell script, most likely) that acts on the Fail event sent it as argument. When Fail of an enbd device in the array happens, the program should run mdadm --remove for the enbd device and start polling the net and the enbd device itself to see when the network returns and the device is working again (the validated and enabled flags will be set in /proc/sys and/or be visible in the /proc/nbdinfo device status line), and then run mdadm --re-add for the enbd device.
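
A handler along those lines could look like the sketch below. It is not an official script: handle_md_event and device_is_back are hypothetical names, the poll interval is arbitrary, and the "is it back?" probe simply tries to read one sector, which is an assumption standing in for a proper check of the flags in /proc/nbdinfo.

```shell
#!/bin/sh
# Sketch of a handler for "mdadm --monitor --program=...".
# mdadm calls the program with: EVENT MD_DEVICE [COMPONENT_DEVICE].
# MDADM and POLL can be overridden (e.g. MDADM=echo POLL=0) for a dry run.
MDADM=${MDADM:-mdadm}
POLL=${POLL:-10}   # seconds between probes; an arbitrary assumption

# ASSUMPTION: a successful one-sector read means the device works again.
# Checking the validated/enabled flags in /proc/nbdinfo would be stricter.
device_is_back() {
    dd if="$1" of=/dev/null bs=512 count=1 2>/dev/null
}

handle_md_event() {
    event=$1
    md=$2
    dev=$3
    [ "$event" = Fail ] || return 0
    case $dev in
        /dev/nd*) ;;        # only act on enbd components
        *) return 0 ;;
    esac
    $MDADM "$md" --remove "$dev"
    until device_is_back "$dev"; do
        sleep "$POLL"
    done
    $MDADM "$md" --re-add "$dev"
}

# When installed as the --program script, finish the file with:
#   handle_md_event "$@"
```

Since the polling loop can run for a long time, it may be wise to run it in the background so that the monitor is not held up while the network is away.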

If that description's not clear, I'll put up a script. Watch this space.

    Bugs

    What's that? Oh yes, there are bugs.

    To Do

    Latest News

    Mailing List

    Tummy.com have kindly set up a mailing list for nbd. Send mail containing the word "help" or "subscribe enbd" to enbd-request@lists.community.tummy.com . You will find complete instructions on their web page for the list. The list itself is at enbd@lists.community.tummy.com .
     

    Downloads

    As well as the main site at ftp://nbd.it.uc3m.es/pub/Programs/ , tummy.com have set up a mirror at ftp://mirrors.tummy.com/pub/mirrors/linux-ha/enbd/ .

    Contacts

    Contact me - not least to encourage me to start a mailing list (yay! done, thanks to SuSE and tummy.com) and improve this page. The change list in the driver source is really impressive.
     

    Peter T. Breuer ptb@inv.it.uc3m.es