The Problem
...
Code Block |
---|
Aug 27 05:25:43 host19 kernel: [2547757.533648] RSP: 0018:ffff8801c93b3b40 EFLAGS: 00010292
Aug 27 05:25:43 host19 kernel: [2547757.579013] 00004000000005b4 00000000000005b4 00000000000008a0 0000000000000800
Aug 27 05:25:43 host19 kernel: [2547757.623329] [<ffffffffc060a2f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd] |
Case 4
Debian kernel 4.6, RSP in or around DRBD module:
Code Block |
---|
Aug 27 04:26:29 host14 kernel: [478357.442244] Modules linked in: pci_stub(E) vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) tcp_diag(E) inet_diag(E) ipt_REJECT(E) nf_reject_ipv4(E) nf_log_ipv6(E) ip6t_rt(E) veth(E) drbd(E) ipmi_devintf(E) xt_multiport(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) xt_tcpudp(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip6table_filter(E) xt_conntrack(E) xt_state(E) iptable_filter(E) ip_tables(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_conntrack(E) ip6_tables(E) x_tables(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) sunrpc(E) 8021q(E) garp(E) mrp(E) bridge(E) stp(E) llc(E) lru_cache(E) libcrc32c(E) crc32c_generic(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)<4>[478357.455870] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Aug 27 04:26:29 host14 kernel: [478357.460567] RSP: 0018:ffff8808534f7b40 EFLAGS: 00010292
Aug 27 04:26:29 host14 kernel: [478357.465530] RBP: ffff8808534f7c58 R08: ffff880858d28af0 R09: 0000000000000000
Aug 27 04:26:29 host14 kernel: [478357.470727] FS: 0000000000000000(0000) GS:ffff88085fa00000(0000) knlGS:0000000000000000
Aug 27 04:26:29 host14 kernel: [478357.476167] Stack:
Aug 27 04:26:29 host14 kernel: [478357.483832] Call Trace:
Aug 27 04:26:29 host14 kernel: [478357.489826] [<ffffffff814acf80>] ? sock_sendmsg+0x30/0x40
Aug 27 04:26:29 host14 kernel: [478357.498117] [<ffffffffc07b9efd>] ? w_send_dblock+0x9d/0x1c0 [drbd]
Aug 27 04:26:29 host14 kernel: [478357.506710] [<ffffffffc07d02f0>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Aug 27 04:26:29 host14 kernel: [478357.515569] Code: 90 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
Aug 27 04:26:29 host14 kernel: [478357.538514] ---[ end trace 0d23089f3d6f0d23 ]--- |
...
Code Block |
---|
Sep 30 14:12:45 host14 kernel: [203230.540687] Oops: 0000 [#1] SMP
Sep 30 14:12:45 host14 kernel: [203230.541997] CPU: 0 PID: 4211 Comm: drbd_w_bs Tainted: G OE 4.6.0-0.bpo.1-amd64 #1 Debian 4.6.4-1~bpo8+1
Sep 30 14:12:45 host14 kernel: [203230.542186] RIP: 0010:[<ffffffff81320246>] [<ffffffff81320246>] memcpy_erms+0x6/0x10
Sep 30 14:12:45 host14 kernel: [203230.542344] RDX: 00000000000003b0 RSI: 0000000000000003 RDI: ffff88080a616040
Sep 30 14:12:45 host14 kernel: [203230.542619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 30 14:12:45 host14 kernel: [203230.542863] 00004000000005b4 00000000000005b4 0000000000000a70 0000000000000a00
Sep 30 14:12:45 host14 kernel: [203230.543108] [<ffffffffc04fde49>] ? drbd_send+0xc9/0x1e0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.554230] [<ffffffffc04fbf50>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.564960] [<ffffffff81099df0>] ? kthread_park+0x50/0x50
Sep 30 14:12:45 host14 kernel: [203230.584805] ---[ end trace 2335d6e97c28a203 ]--- |
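Since the cases differ mainly in addresses and timestamps, it can help to reduce each oops to its symbol+offset frames before comparing them. The following is only a rough sketch of how that could be done; the regular expression and the function names (`extract_frames`, `drbd_frames`) are our own illustration and assume the 4.x-era trace format shown in the excerpts above.

```python
import re

# Matches call-trace style entries as they appear in the excerpts above, e.g.
#   [<ffffffffc04fde49>] ? drbd_send+0xc9/0x1e0 [drbd]
# Assumption: 16-digit hex addresses and the "symbol+offset/size [module]" layout.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\][ \t]+(?:\?[ \t]+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:[ \t]+\[(?P<module>[\w-]+)\])?"
)


def extract_frames(log_text):
    """Return (symbol, offset, module) tuples; module is None for core kernel symbols."""
    return [
        (m["symbol"], m["offset"], m["module"])
        for m in FRAME_RE.finditer(log_text)
    ]


def drbd_frames(log_text):
    """Keep only the frames that belong to the drbd module."""
    return [frame for frame in extract_frames(log_text) if frame[2] == "drbd"]


if __name__ == "__main__":
    # Two lines copied from the excerpt above.
    sample = (
        "Sep 30 14:12:45 host14 kernel: [203230.543108] [<ffffffffc04fde49>] ? drbd_send+0xc9/0x1e0 [drbd]\n"
        "Sep 30 14:12:45 host14 kernel: [203230.564960] [<ffffffff81099df0>] ? kthread_park+0x50/0x50\n"
    )
    for symbol, offset, module in extract_frames(sample):
        print(symbol, offset, module or "(kernel)")
```

Comparing the resulting frame lists across hosts makes it easier to see that the same drbd functions keep showing up, independent of the load addresses.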
Cases 8-11
Same pattern again.
Dismissed solution ideas (after case 4): DRBD9? Commercial support?
...
Unfortunately, kernel 3.16 fell victim to someone trying to "fix" the VLAN encapsulation. In fact, that fix made the kernel drop packets occasionally, which was enough to render this kernel unusable.
Other ideas
We have more or less run out of them.
...
Probable Solution
Fixing drbd_main.c for kernels 4.0+
We finally engaged a kernel specialist, Richard Weinberger from Sigma-Star.at.
We believe that his 0001-drbd-Fix-kernel_sendmsg-usage.patch solves the problem for kernels 4.0 to 4.9, and we have included it in our drbd8-dkms package (see "Debian jessie builds of DKMSed upstream DRBD8 Kernel Module" and the Debian repository's pool directory, http://deb.clazzes.org/debian/pool/jessie-drbdpkg-8/).
Kernel 4.10 will get a rewrite of that code, which should solve the problem once and for all for everybody.
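To find hosts that still need the patched module, one could check whether the running kernel falls into the 4.0 to 4.9 range named above. This is a minimal sketch, not part of our packages: the helper names are hypothetical, the range is simply the one stated on this page, and the check says nothing about whether the patched drbd8-dkms module is already installed.

```python
import platform
import re

# Assumption: the affected range is the one stated on this page (4.0 through 4.9).
AFFECTED_MIN = (4, 0)
AFFECTED_MAX = (4, 9)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.6.0-0.bpo.1-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_affected_range(release):
    """True if the kernel release lies within the affected 4.0-4.9 range."""
    return AFFECTED_MIN <= parse_kernel_release(release) <= AFFECTED_MAX


def running_kernel_affected():
    """Apply the check to the kernel this script is running on."""
    return in_affected_range(platform.release())
```

For example, `in_affected_range("4.6.0-0.bpo.1-amd64")` is true for the kernel from the oops above, while a 3.16 or 4.10 release string falls outside the range.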
Conclusion
DRBD is faulty with kernels 4.0-4.9.
Linbit didn't believe it and didn't care.
We had a professional kernel developer fix it.