History of DRBD at ITEG & Clazzes.org

For many years (since CentOS 5) we at ITEG (also the main force behind Clazzes.org) have used DRBD to mirror the partition of each virtual machine (then OpenVZ, now mostly LXC).

Kernel 4.6 hiccups with DRBD 8.4.6 as possible cause

After shredding some data last year, DRBD now (Debian jessie, kernel 4.6 from jessie-backports, DRBD module 8.4.6) seems to be causing trouble again.

Initial problem

Several times a single LXC guest became unreachable and the host's "load" started to rise slowly but continuously up to ~1000 (!) and more. Fortunately the host still allowed logging in via ssh, but unfortunately there was only one way out: the reset button. Ouch. WTF.

With the exception of one case under Kernel 4.4 (DRBD module 8.4.5) the last non-trivial syslog entries always included DRBD:

Case 2
Kernel problem case 2
Aug  6 05:00:14 host8 kernel: [730487.320583] RIP: 0010:[<ffffffff813201e6>]  [<ffffffff813201e6>] memcpy_erms+0x6/0x10
Aug  6 05:00:14 host8 kernel: [730487.339579] FS:  0000000000000000(0000) GS:ffff88103fb00000(0000) knlGS:0000000000000000
Aug  6 05:00:14 host8 kernel: [730487.353038]  000000000000faf0 ffff88203650fc48 0000000000000000 ffffffff81517a12
Aug  6 05:00:14 host8 kernel: [730487.369167]  [<ffffffffc063e119>] ? drbd_send+0xc9/0x1e0 [drbd]
Aug  6 05:00:14 host8 kernel: [730487.387607]  [<ffffffffc063c2f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Aug  6 05:00:14 host8 kernel: [730487.403314]  RSP <ffff88203650fb40>
Case 3
Kernel problem case 3
Aug 27 05:25:43 host9 kernel: [2547757.533648] RSP: 0018:ffff8801c93b3b40  EFLAGS: 00010292
Aug 27 05:25:43 host9 kernel: [2547757.579013]  00004000000005b4 00000000000005b4 00000000000008a0 0000000000000800
Aug 27 05:25:43 host9 kernel: [2547757.623329]  [<ffffffffc060a2f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Case 4
Kernel problem case 4
Aug 27 04:26:29 host4 kernel: [478357.442244] Modules linked in: pci_stub(E) vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) tcp_diag(E) inet_diag(E) ipt_REJECT(E) nf_reject_ipv4(E) nf_log_ipv6(E) ip6t_rt(E) veth(E) drbd(E) ipmi_devintf(E) xt_multiport(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) xt_tcpudp(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip6table_filter(E) xt_conntrack(E) xt_state(E) iptable_filter(E) ip_tables(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_conntrack(E) ip6_tables(E) x_tables(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) sunrpc(E) 8021q(E) garp(E) mrp(E) bridge(E) stp(E) llc(E) lru_cache(E) libcrc32c(E) crc32c_generic(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)<4>[478357.455870] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Aug 27 04:26:29 host4 kernel: [478357.460567] RSP: 0018:ffff8808534f7b40  EFLAGS: 00010292
Aug 27 04:26:29 host4 kernel: [478357.465530] RBP: ffff8808534f7c58 R08: ffff880858d28af0 R09: 0000000000000000
Aug 27 04:26:29 host4 kernel: [478357.470727] FS:  0000000000000000(0000) GS:ffff88085fa00000(0000) knlGS:0000000000000000
Aug 27 04:26:29 host4 kernel: [478357.476167] Stack:
Aug 27 04:26:29 host4 kernel: [478357.483832] Call Trace:
Aug 27 04:26:29 host4 kernel: [478357.489826]  [<ffffffff814acf80>] ? sock_sendmsg+0x30/0x40
Aug 27 04:26:29 host4 kernel: [478357.498117]  [<ffffffffc07b9efd>] ? w_send_dblock+0x9d/0x1c0 [drbd]
Aug 27 04:26:29 host4 kernel: [478357.506710]  [<ffffffffc07d02f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Aug 27 04:26:29 host4 kernel: [478357.515569] Code: 90 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 
Aug 27 04:26:29 host4 kernel: [478357.538514] ---[ end trace 0d23089f3d6f0d23 ]---
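
To see at a glance which DRBD functions show up in traces like these, the call-trace frames can be tallied. This is only a minimal sketch, not something we ran at the time: the function name drbd_trace_symbols is made up for illustration, and the log path in the usage note is an assumption about where kernel messages end up on the host.

```shell
# Hypothetical helper: tally the DRBD functions named in kernel call traces.
# $1 is a syslog-style file; frames belonging to the module end in "[drbd]".
drbd_trace_symbols() {
    grep -F 'kernel:' "$1" \
        | grep -o '[A-Za-z0-9_]*+0x[0-9a-f]*/0x[0-9a-f]* \[drbd\]' \
        | sed 's/+.*//' \
        | sort | uniq -c | sort -rn
}
```

Usage would be something like `drbd_trace_symbols /var/log/kern.log` (path assumed). Frames from other modules or from the core kernel are ignored on purpose.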

Dismissed solution ideas: DRBD9? Commercial support?

Our first idea was to try DRBD9, maybe with commercial support.

So we filled out Linbit's contact form, and got called back quickly.

Conclusion 1: Don't migrate from DRBD8 to DRBD9 unless you need >2 nodes

DRBD9 is for multi-node operation.

For 2-node operation DRBD8 is fine, recommended, and will continue to be supported for years to come (or so they say; this is 2016-08-29).

Conclusion 2: Commercial support prices not for us

We mirror within several pairs of off-the-shelf "pizza boxes", not between 2 top-of-the-line rack-high elephants.
This doesn't fit their pricing, which is per node (and per year).

So we are staying with a pure Open Source approach then.

Trying out DRBD9 for curiosity's sake

One of our server pairs, running Debian jessie, is due to be decommissioned soon and hosts nothing of relevance.

So we used Linbit's semi-official Ubuntu PPA to upgrade to DRBD9 the Open Source way.

We managed to do it, somehow. But I don't recommend it for production systems. There are too many hiccups and not much to gain, unless you want to migrate to a multi-node setup, in which case I strongly recommend using new nodes anyway.

Former solution attempt: Packaging Upstream DRBD 8.x Kernel Module for Debian

Conclusion chain

Even without checking the mailing list archives, it seems unlikely that a possible severe problem in DRBD 8.4.6 (almost 16 months old now) is still unfixed in the current 8.x module version.

DRBD's contact area with the rest of the kernel or userspace (libc6) has always been quite thin (Unix rules, Linux without SystemDisabler is still a Unix).

Building the module via DKMS (Thanks Dell!?) is no rocket science and widely documented, e.g. in Proxmox VE's Wiki page on how to Build DRBD kernel module (Proxmox VE is a nice LXC+KVM+HA virtualization distro, using Debian OS, Ubuntu LTS kernels, and their own unified administration tools & UI).

And, after all, DRBD sources are still Open Source.

So, let's build ;-)

Up-to-date DRBD 8 packages in Clazzes.org's Debian repository

We are using the packages below on 4 nodes so far (as of 2016-08-30), with 6 more nodes going to use them after the next reboot.
Remark: On some nodes dkms triggered the installation of a linux-image-3.2.0-4-rt-amd64 which can be removed afterwards.

Clazzes.org's Deb server deb.clazzes.org contains a repository "jessie-drbdpkg-8" providing 2 DRBD packages:

drbd8-dkms 8.4.8-1

This package contains the most recently available sources of the DRBD8 module, along with DKMS integration for Debian jessie (probably usable for jessie-based derivatives too).

On installation of drbd8-dkms 8.4.8-1 (or on later installation of a new linux-headers-*-amd64 package alongside the matching linux-image-*-amd64 package) the up-to-date DRBD module is automatically built and installed.
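
After a reboot it is worth confirming that the running module really is the DKMS-built one and not the version shipped with the kernel. A minimal sketch, assuming the first line of /proc/drbd has the usual form "version: 8.4.8-1 (api:...)"; the helper name drbd_loaded_version is made up for illustration.

```shell
# Hypothetical helper: print the module version from /proc/drbd-style text
# read on stdin. Assumes a first line such as
#   version: 8.4.8-1 (api:1/proto:86-101)
drbd_loaded_version() {
    sed -n 's/^version: \([^ ]*\).*/\1/p' | head -n 1
}
```

Typical use would be `drbd_loaded_version < /proc/drbd` once the module is loaded. Before loading, `modinfo -F version drbd` should show the version of the file modprobe would pick.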

drbd-utils 8.9.7-1

We 'cheated' a bit here: drbd-utils_8.9.7-1 is the package from Debian's unstable/experimental repositories, re-integrated in our DRBD8-repository.

This way any Debian jessie installation has access to up-to-date DRBD8 packages without the need to care about compiling manually or adding Debian unstable to sources.list.

DKMS problems, solutions, hints

Problem

1 of 5 hosts rebooted so far, going from 4.4 to 4.6 through the reboot, failed to load the DRBD module, claiming

modprobe: ERROR: could not insert 'drbd': Exec format error

Possible root cause

This could be due to ABI changes in the kernel from 4.4 to 4.6. Our other reboots so far did not change the kernel version (4.6 before and after).
However, 2 more of our 8 nodes are running 4.5, and their /var/lib/dkms/drbd/8.4.8-1clazzes1/4.6.0-0.bpo.1-amd64/x86_64/module/drbd.ko has a different size from the one on the nodes running 4.6, although /lib/modules/4.6.0-0.bpo.1-amd64/kernel/drivers/block/drbd/drbd.ko has the same size and MD5 sum on machines running 4.5 or 4.6.
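
One way to narrow this down is to compare a module file's "vermagic" string with the kernel release in question, since a module built for a different kernel is a common reason for "Exec format error". This is a minimal sketch only: vermagic_matches is a made-up helper name, and whether a vermagic mismatch was the actual cause here has not been established.

```shell
# Hypothetical helper: succeed if a vermagic string belongs to a kernel release.
# $1: vermagic string (e.g. from `modinfo -F vermagic <file>.ko`)
# $2: kernel release  (e.g. from `uname -r`)
vermagic_matches() {
    case "$1" in
        "$2"|"$2 "*) return 0 ;;   # release is the first word of vermagic
        *) return 1 ;;
    esac
}
```

For example: `vermagic_matches "$(modinfo -F vermagic "$(modinfo -n drbd)")" "$(uname -r)" || echo mismatch`. Here `modinfo -n drbd` prints the file modprobe would actually load, which also shows whether the DKMS build or the in-tree module wins.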

Solution used

I solved it with ...

apt-get remove drbd8-dkms
apt-get install drbd8-dkms
/etc/init.d/drbd start

Suspected faster solution

It would probably have been sufficient to just ...

dkms autoinstall
/etc/init.d/drbd start

Proposed check command

It might be a good idea to perform this every now and then:

dkms status
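
The check can be narrowed to the one question that matters before a reboot: is the drbd module installed for the kernel that is going to be booted? A minimal sketch, assuming dkms status prints lines of the form "drbd, <version>, <kernel>, <arch>: installed" (other dkms releases may format this differently); the helper name drbd_dkms_ok_for is made up for illustration.

```shell
# Hypothetical helper: read `dkms status` output on stdin and succeed only if
# a drbd line for the given kernel release ($1) reports "installed".
drbd_dkms_ok_for() {
    grep '^drbd' | grep -F -- "$1" | grep -q ': installed'
}
```

For the running kernel this would be `dkms status | drbd_dkms_ok_for "$(uname -r)" || echo "drbd not installed for this kernel"`. Before booting into a new kernel, pass that kernel's release string instead.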

Failure ;-(

Unfortunately the whole problem of an ever-rising load occurred with DRBD module 8.4.8-1 too. Further research TBD.

 

The details and further solution attempts are now described and tracked in Ever-rising load on Debian jessie + DRBD8 + LXC host pairs.

Initial suspicion and upstream DRBD 8 kernel modules

In 3 of the first 4 cases of the problem described above the first unusual syslog entry hinted at DRBD.

Therefore we packaged (Debianized) the most recent upstream kernel module.
We are currently unsure whether DRBD has anything to do with the problem, but are still maintaining those packages, see Debian jessie builds of DKMSed upstream DRBD8 Kernel Module.