The Problem

...

Kernel problem case 3:
Aug 27 05:25:43 host19 kernel: [2547757.533648] RSP: 0018:ffff8801c93b3b40  EFLAGS: 00010292
Aug 27 05:25:43 host19 kernel: [2547757.579013]  00004000000005b4 00000000000005b4 00000000000008a0 0000000000000800
Aug 27 05:25:43 host19 kernel: [2547757.623329]  [<ffffffffc060a2f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Case 4

Debian kernel 4.6, RSP in or around DRBD module:

Aug 27 04:26:29 host14 kernel: [478357.442244] Modules linked in: pci_stub(E) vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) tcp_diag(E) inet_diag(E) ipt_REJECT(E) nf_reject_ipv4(E) nf_log_ipv6(E) ip6t_rt(E) veth(E) drbd(E) ipmi_devintf(E) xt_multiport(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) xt_tcpudp(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip6table_filter(E) xt_conntrack(E) xt_state(E) iptable_filter(E) ip_tables(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_conntrack(E) ip6_tables(E) x_tables(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) sunrpc(E) 8021q(E) garp(E) mrp(E) bridge(E) stp(E) llc(E) lru_cache(E) libcrc32c(E) crc32c_generic(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)<4>[478357.455870] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Aug 27 04:26:29 host14 kernel: [478357.460567] RSP: 0018:ffff8808534f7b40  EFLAGS: 00010292
Aug 27 04:26:29 host14 kernel: [478357.465530] RBP: ffff8808534f7c58 R08: ffff880858d28af0 R09: 0000000000000000
Aug 27 04:26:29 host14 kernel: [478357.470727] FS:  0000000000000000(0000) GS:ffff88085fa00000(0000) knlGS:0000000000000000
Aug 27 04:26:29 host14 kernel: [478357.476167] Stack:
Aug 27 04:26:29 host14 kernel: [478357.483832] Call Trace:
Aug 27 04:26:29 host14 kernel: [478357.489826]  [<ffffffff814acf80>] ? sock_sendmsg+0x30/0x40
Aug 27 04:26:29 host14 kernel: [478357.498117]  [<ffffffffc07b9efd>] ? w_send_dblock+0x9d/0x1c0 [drbd]
Aug 27 04:26:29 host14 kernel: [478357.506710]  [<ffffffffc07d02f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Aug 27 04:26:29 host14 kernel: [478357.515569] Code: 90 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 
Aug 27 04:26:29 host14 kernel: [478357.538514] ---[ end trace 0d23089f3d6f0d23 ]---
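When comparing cases like the ones above, it can help to pull out only the call-trace frames that the kernel attributes to the drbd module. The following is a minimal sketch, not a tool we actually used: the function name `module_frames` and the regular expression are assumptions made for illustration, written against the frame format seen in these excerpts (`[<address>] ? symbol+0xoffset/0xsize [module]`).

```python
import re

# Matches call-trace frames as they appear in the excerpts on this page, e.g.
#   [<ffffffffc07b9efd>] ? w_send_dblock+0x9d/0x1c0 [drbd]
# The trailing "[module]" part is optional: core kernel symbols carry none.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?:\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def module_frames(lines, module="drbd"):
    """Return (symbol, offset) pairs for all frames attributed to `module`."""
    frames = []
    for line in lines:
        for match in FRAME_RE.finditer(line):
            if match.group("module") == module:
                frames.append(
                    (match.group("symbol"), int(match.group("offset"), 16))
                )
    return frames


# Three frames copied from the case 4 excerpt above.
sample = [
    "Aug 27 04:26:29 host14 kernel: [478357.489826]  [<ffffffff814acf80>] ? sock_sendmsg+0x30/0x40",
    "Aug 27 04:26:29 host14 kernel: [478357.498117]  [<ffffffffc07b9efd>] ? w_send_dblock+0x9d/0x1c0 [drbd]",
    "Aug 27 04:26:29 host14 kernel: [478357.506710]  [<ffffffffc07d02f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]",
]

for symbol, offset in module_frames(sample):
    print(f"{symbol}+{offset:#x}")
```

Running the same function over the logs of several cases gives a quick way to see whether the same DRBD functions show up each time.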

...

Sep 30 14:12:45 host14 kernel: [203230.540687] Oops: 0000 [#1] SMP 
Sep 30 14:12:45 host14 kernel: [203230.541997] CPU: 0 PID: 4211 Comm: drbd_w_bs Tainted: G           OE   4.6.0-0.bpo.1-amd64 #1 Debian 4.6.4-1~bpo8+1
Sep 30 14:12:45 host14 kernel: [203230.542186] RIP: 0010:[<ffffffff81320246>]  [<ffffffff81320246>] memcpy_erms+0x6/0x10
Sep 30 14:12:45 host14 kernel: [203230.542344] RDX: 00000000000003b0 RSI: 0000000000000003 RDI: ffff88080a616040
Sep 30 14:12:45 host14 kernel: [203230.542619] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 30 14:12:45 host14 kernel: [203230.542863]  00004000000005b4 00000000000005b4 0000000000000a70 0000000000000a00
Sep 30 14:12:45 host14 kernel: [203230.543108]  [<ffffffffc04fde49>] ? drbd_send+0xc9/0x1e0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.554230]  [<ffffffffc04fbf50>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.564960]  [<ffffffff81099df0>] ? kthread_park+0x50/0x50
Sep 30 14:12:45 host14 kernel: [203230.584805] ---[ end trace 2335d6e97c28a203 ]---

Cases 8-11

Same symptoms as in the previous cases.

Dismissed solution ideas (after case 4): DRBD9? Commercial support?

...

Unfortunately, kernel 3.16 fell victim to someone trying to "fix" the VLAN encapsulation. In fact, that fix made the kernel drop packets occasionally, which was enough to render it unusable.

Other ideas

We have more or less run out of them.

...

Probable Solution

Fixing drbd_main.c for kernels 4.0+

We finally entrusted the problem to a kernel specialist, Richard Weinberger from Sigma-Star.at.

We believe that his patch, 0001-drbd-Fix-kernel_sendmsg-usage.patch, solves the problem for kernels 4.0 to 4.9. We have included it in our drbd8-dkms package (see "Debian jessie builds of DKMSed upstream DRBD8 Kernel Module" and the Debian repository's pool directory, http://deb.clazzes.org/debian/pool/jessie-drbdpkg-8/).

Kernel 4.10 will get a rewrite of that code, which should solve the problem once and for all for everybody.
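To see quickly whether a host runs a kernel in the range named on this page (4.0 to 4.9), a check along the following lines may be useful. This is only a sketch: the helper names `kernel_major_minor` and `in_affected_range` are made up for illustration, and the check says nothing about whether the patched drbd8-dkms module is already installed.

```python
import platform
import re

# Kernel range this page describes as affected (inclusive on both ends).
AFFECTED_MIN = (4, 0)
AFFECTED_MAX = (4, 9)


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '4.6.0-0.bpo.1-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_affected_range(release):
    """True if the release falls within 4.0 to 4.9."""
    return AFFECTED_MIN <= kernel_major_minor(release) <= AFFECTED_MAX


if __name__ == "__main__":
    # platform.release() returns the running kernel's release string on Linux.
    running = platform.release()
    try:
        status = "inside" if in_affected_range(running) else "outside"
    except ValueError:
        status = "not comparable to"
    print(f"{running}: {status} the 4.0-4.9 range")
```

The comparison uses integer tuples rather than strings, so that 4.10 is correctly treated as newer than 4.9.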

Conclusion

DRBD is faulty with Kernels 4.0-4.9.

Linbit didn't believe it and didn't care.

We had a professional kernel developer fix it.