Debugging an evil Go runtime bug

From heat guns to kernel compiler flags

Preface

I’m a big fan of Prometheus and Grafana. As a former SRE at Google I’ve learned to appreciate good monitoring, and this combination has been a winner for me over the past year. I’m using them for monitoring my personal servers (both black-box and white-box monitoring), for the Euskal Encounter external and internal event infra, for work I do professionally for clients, and more. Prometheus makes it very easy to write custom exporters to monitor your own data, and there’s a good chance you’ll find an exporter that already works for you out of the box. For example, we use sql_exporter to make a pretty dashboard of attendee metrics for the Encounter events.

Event dashboard for Euskal Encounter (fake staging data)
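To give an idea of just how little code a custom exporter needs, here is a minimal sketch using the client_golang library. The metric name and port are made up for illustration; the real sql_exporter setup obviously does more than this:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A made-up gauge, just to show the shape of a custom exporter.
	attendees := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "event_attendees",
		Help: "Hypothetical number of registered attendees.",
	})
	prometheus.MustRegister(attendees)
	attendees.Set(1234)

	// Expose the metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9999", nil))
}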

Since it’s so easy to throw node_exporter onto any random machine and have a Prometheus instance scrape it for basic system-level metrics (CPU, memory, network, disk, filesystem usage, etc), I figured, why not also monitor my laptop? I have a Clevo “gaming” laptop that serves as my primary workstation, mostly pretending to be a desktop at home but also traveling with me to big events like the Chaos Communication Congress. Since I already have a VPN between it and one of my servers where I run Prometheus, I can just emerge prometheus-node_exporter, bring up the service, and point my Prometheus instance at it. This automatically configures alerts for it, which means my phone will make a loud noise whenever I open way too many Chrome tabs and run out of my 32GB of RAM. Perfect.

Trouble on the horizon

Barely an hour after setting this up, though, my phone did get a page: my newly added target was inaccessible. Sure enough, I could SSH into the laptop fine, so it was definitely up, but node_exporter had crashed.

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc41ffc7fff pc=0x41439e]

goroutine 2395 [running]:
runtime.throw(0xae6fb8, 0x2a)
        /usr/lib64/go/src/runtime/panic.go:605 +0x95 fp=0xc4203e8be8 sp=0xc4203e8bc8 pc=0x42c815
runtime.sigpanic()
        /usr/lib64/go/src/runtime/signal_unix.go:351 +0x2b8 fp=0xc4203e8c38 sp=0xc4203e8be8 pc=0x443318
runtime.heapBitsSetType(0xc4204b6fc0, 0x30, 0x30, 0xc420304058)
        /usr/lib64/go/src/runtime/mbitmap.go:1224 +0x26e fp=0xc4203e8c90 sp=0xc4203e8c38 pc=0x41439e
runtime.mallocgc(0x30, 0xc420304058, 0x1, 0x1)
        /usr/lib64/go/src/runtime/malloc.go:741 +0x546 fp=0xc4203e8d38 sp=0xc4203e8c90 pc=0x411876
runtime.newobject(0xa717e0, 0xc42032f430)
        /usr/lib64/go/src/runtime/malloc.go:840 +0x38 fp=0xc4203e8d68 sp=0xc4203e8d38 pc=0x411d68
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus.NewConstMetric(0xc42018e460, 0x2, 0x3ff0000000000000, 0xc42032f430, 0x1, 0x1, 0x10, 0x9f9dc0, 0x8a0601, 0xc42032f430)
        /var/tmp/portage/net-analyzer/prometheus-node_exporter-0.15.0/work/prometheus-node_exporter-0.15.0/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/value.go:165 +0xd0 fp=0xc4203e8dd0 sp=0xc4203e8d68 pc=0x77a980

node_exporter, like many Prometheus components, is written in Go. Go is a relatively safe language: while it lets you shoot yourself in the foot if you so wish, and it doesn’t have nearly as strong safety guarantees as, say, Rust does, it is still not easy to accidentally cause a segfault in Go. What’s more, node_exporter is a relatively simple Go app with mostly pure-Go dependencies. This made it an interesting crash to get, especially since it happened inside mallocgc, which should never crash under normal circumstances.

Things got more interesting after I restarted it a few times:

2017/11/07 06:32:49 http: panic serving 172.20.0.1:38504: runtime error: growslice: cap out of range
goroutine 41 [running]:
net/http.(*conn).serve.func1(0xc4201cdd60)
        /usr/lib64/go/src/net/http/server.go:1697 +0xd0
panic(0xa24f20, 0xb41190)
        /usr/lib64/go/src/runtime/panic.go:491 +0x283
fmt.(*buffer).WriteString(...)
        /usr/lib64/go/src/fmt/print.go:82
fmt.(*fmt).padString(0xc42053a040, 0xc4204e6800, 0xc4204e6850)
        /usr/lib64/go/src/fmt/format.go:110 +0x110
fmt.(*fmt).fmt_s(0xc42053a040, 0xc4204e6800, 0xc4204e6850)
        /usr/lib64/go/src/fmt/format.go:328 +0x61
fmt.(*pp).fmtString(0xc42053a000, 0xc4204e6800, 0xc4204e6850, 0xc400000073)
        /usr/lib64/go/src/fmt/print.go:433 +0x197
fmt.(*pp).printArg(0xc42053a000, 0x9f4700, 0xc42041c290, 0x73)
        /usr/lib64/go/src/fmt/print.go:664 +0x7b5
fmt.(*pp).doPrintf(0xc42053a000, 0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2)
        /usr/lib64/go/src/fmt/print.go:996 +0x15a
fmt.Sprintf(0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2, 0x10, 0x9f4700)
        /usr/lib64/go/src/fmt/print.go:196 +0x66
fmt.Errorf(0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2, 0xc420410301, 0xc420410300)
        /usr/lib64/go/src/fmt/print.go:205 +0x5a

Well that’s interesting. A crash in Sprintf this time. What?

runtime: pointer 0xc4203e2fb0 to unallocated span idx=0x1f1 span.base()=0xc4203dc000 span.limit=0xc4203e6000 span.state=3
runtime: found in object at *(0xc420382a80+0x80)
object=0xc420382a80 k=0x62101c1 s.base()=0xc420382000 s.limit=0xc420383f80 s.spanclass=42 s.elemsize=384 s.state=_MSpanInUse
 <snip>
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw(0xaee4fe, 0x3e)
        /usr/lib64/go/src/runtime/panic.go:605 +0x95 fp=0x7f0f19ffab90 sp=0x7f0f19ffab70 pc=0x42c815
runtime.heapBitsForObject(0xc4203e2fb0, 0xc420382a80, 0x80, 0xc41ffd8a33, 0xc400000000, 0x7f0f400ac560, 0xc420031260, 0x11)
        /usr/lib64/go/src/runtime/mbitmap.go:425 +0x489 fp=0x7f0f19ffabe8 sp=0x7f0f19ffab90 pc=0x4137c9
runtime.scanobject(0xc420382a80, 0xc420031260)
        /usr/lib64/go/src/runtime/mgcmark.go:1187 +0x25d fp=0x7f0f19ffac90 sp=0x7f0f19ffabe8 pc=0x41ebed
runtime.gcDrain(0xc420031260, 0x5)
        /usr/lib64/go/src/runtime/mgcmark.go:943 +0x1ea fp=0x7f0f19fface0 sp=0x7f0f19ffac90 pc=0x41e42a
runtime.gcBgMarkWorker.func2()
        /usr/lib64/go/src/runtime/mgc.go:1773 +0x80 fp=0x7f0f19ffad20 sp=0x7f0f19fface0 pc=0x4580b0
runtime.systemstack(0xc420436ab8)
        /usr/lib64/go/src/runtime/asm_amd64.s:344 +0x79 fp=0x7f0f19ffad28 sp=0x7f0f19ffad20 pc=0x45a469
runtime.mstart()
        /usr/lib64/go/src/runtime/proc.go:1125 fp=0x7f0f19ffad30 sp=0x7f0f19ffad28 pc=0x430fe0

And now the garbage collector had stumbled upon a problem. Yet another kind of crash.

At this point, there are two natural conclusions: either I have a severe hardware issue, or there is a wild memory corruption bug in the binary. I initially considered the former unlikely, as this machine has a very heavily mixed workload and no signs of instability that can be traced back to hardware (I have my fair share of crashing software, but it’s never random). Since Go binaries like node_exporter are statically linked and do not depend on any other libraries, I can download the official release binary and try that, which would eliminate most of the rest of my system as a variable. Yet, when I did so, I still got a crash.

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x76b998]

goroutine 13 [running]:
runtime.throw(0xabfb11, 0x5)
        /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420060c40 sp=0xc420060c20 pc=0x42c725
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:374 +0x227 fp=0xc420060c90 sp=0xc420060c40 pc=0x443197
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_model/go.(*LabelPair).GetName(...)
        /go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_model/go/metrics.pb.go:85
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*Desc).String(0xc4203ae010, 0xaea9d0, 0xc42045c000)
        /go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/desc.go:179 +0xc8 fp=0xc420060dc8 sp=0xc420060c90 pc=0x76b998

Yet another completely different crash. At this point there was a decent chance that there was truly an upstream problem with node_exporter or one of its dependencies, so I filed an issue on GitHub. Perhaps the developers had seen this before? It’s worth bringing this kind of issue to their attention and seeing if they have any ideas.

A not-so-brief bare metal detour

Unsurprisingly, upstream’s first guess was that it was a hardware issue. This isn’t unreasonable: after all, I’m only hitting the problem on one specific machine. All my other machines are happily running node_exporter. While I had no other evidence of hardware-linked instability on this host, I also had no other explanation as to what was so particular about this machine that would make node_exporter crash. A Memtest86+ run never hurt anyone, so I gave it a go.

And then this happened:

This is what I get for using consumer hardware

Whoops! Bad RAM. Well, to be more specific, one bit of bad RAM. After letting the test run for a full pass, all I got was that single bad bit, plus a few false positives in test 7 (which moves blocks around and so can amplify a single error).

Further testing showed that Memtest86+ test #5 in SMP mode would quickly detect the error, but usually not on the first pass. The error was always the same bit at the same address. This suggests the problem is a weak or leaky RAM cell, in particular one that gets worse with temperature. That is quite logical: higher temperatures increase leakage in the RAM cells and thus make it more likely that a marginal cell will actually cause a bit flip.

To put this into perspective, this is one bad bit out of 274,877,906,944 (32 GiB is 2^35 bytes, or 2^38 bits). That’s actually a very good error rate! Hard disks and Flash memory have much higher error rates; it’s just that those devices ship with bad blocks marked at the factory that are transparently swapped out without the user ever knowing, and they can transparently mark newly discovered weak blocks as bad and relocate them to a spare area. RAM has no such luxury, so a bad bit sticks forever.

Alas, this is vanishingly unlikely to be the cause of my node_exporter woes. That app uses very little RAM, and so the chances of it hitting the bad bit (repeatedly, at that) are extremely low. This kind of problem would be largely unnoticeable, perhaps causing a pixel error in some graphics, a single letter to flip in some text, an instruction to be corrupted that probably won’t ever be run, and perhaps the rare segfault when something actually important does land on the bad bit. Nonetheless, it does cause long-term reliability issues, and this is why servers and other devices intended to be reliable must use ECC RAM, which can correct this kind of error.

I don’t have the luxury of ECC RAM on this laptop. What I do have, though, is the ability to mark the bad block of RAM as bad and tell the OS not to use it. There is a little-known feature of GRUB 2 which allows you to do just that, by changing the memory map that is passed to the booted kernel. It’s not worth buying new RAM just for a single bad bit (especially since DDR3 is already obsolete, and there’s a good chance new RAM would have weak cells anyway), so this is a good option.

However, there’s one more thing I can do. Since the problem gets worse with temperature, what happens if I heat up the RAM?

🔥🔥🔥memtest86+🔥🔥🔥

A cozy 100°C

Using a heat gun set at a fairly low temperature (130°C), I warmed up two modules at a time (the other two modules are under the rear cover, as my laptop has four SODIMM slots in total). Playing around with module order, I found three additional weak bits that were only detectable at elevated temperature, spread across three of my RAM sticks.

I also found that the location of the errors stayed roughly consistent even as I swapped modules around: the top bits of the address remained the same. This is because the RAM is interleaved: data is spread over all four sticks, instead of each stick being assigned a contiguous quarter of the available address space. This is convenient, because I can just mask a region of RAM large enough to cover all possible addresses for each error bit, and not have to worry that I might swap sticks in the future and mess up the masking. I found that masking a contiguous 128KiB area should cover all possible permutations of addresses for each given bad bit, but, for good measure, I rounded up to 1MiB. This gave me three 1MiB aligned blocks to mask out (one of them covers two of the bad bits, for a total of four bad bits I wanted masked):

  • 0x36a700000-0x36a7fffff
  • 0x460e00000-0x460efffff
  • 0x4ea000000-0x4ea0fffff

This can be specified using the address/mask syntax required by GRUB as follows, in /etc/default/grub:

GRUB_BADRAM="0x36a700000,0xfffffffffff00000,0x460e00000,0xfffffffffff00000,0x4ea000000,0xfffffffffff00000"
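The mask 0xfffffffffff00000 keeps the upper address bits and zeroes the low 20, so each address/mask pair covers a 1MiB-aligned region. As I understand the badram semantics (this is my reading, not GRUB’s actual code), an address is filtered out when its mask-selected bits match the given base address. A quick sketch of that check, in Go just for illustration:

package main

import "fmt"

// isFiltered reports whether addr falls in the bad region described by a
// GRUB_BADRAM base/mask pair (my reading of the semantics, not GRUB's code).
func isFiltered(addr, base, mask uint64) bool {
	return addr&mask == base&mask
}

func main() {
	base, mask := uint64(0x36a700000), uint64(0xfffffffffff00000)
	fmt.Println(isFiltered(0x36a712345, base, mask)) // true: inside the masked 1MiB region
	fmt.Println(isFiltered(0x36a812345, base, mask)) // false: outside it
}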

One quick grub-mkconfig later, I am down 3MiB of RAM and four dodgy bits with it. It’s not ECC RAM, but this should increase the effective reliability of my consumer-grade RAM, since now I know the rest of the memory is fine up to at least 100°C.

Needless to say, node_exporter still crashed, but we knew this wasn’t the real problem, didn’t we?

Digging deeper

The annoying thing about this kind of bug is that it clearly is caused by some kind of memory corruption that breaks code that runs later. This makes it very hard to debug, because we can’t predict what will be corrupted (it varies), and we can’t catch the bad code in the act of doing so.

First I tried some basic bisecting of available node_exporter releases and enabling/disabling different collectors, but that went nowhere. I also tried running an instance under strace. This seemed to stop the crashes, which strongly points to a race-condition kind of problem. strace will usually wind up serializing execution of apps to some extent, by intercepting all system calls run by all threads. I would later find that the strace instance crashed too, but it took much longer to do so. Since this seemed to be related to concurrency, I tried setting GOMAXPROCS=1, which tells Go to only use a single OS-level thread to run Go code. This also stopped the crashes, again pointing strongly to a concurrency issue.

By now I had gathered quite a considerable number of crash logs, and I was starting to notice some patterns. While there was a lot of variation in which parts were crashing and how, the error messages could ultimately be grouped into a few types, and the same kind of error showed up more than once. So I started Googling these errors, and that is how I stumbled upon Go issue #20427. It was an issue in a seemingly unrelated part of Go, but one that had caused similar segfaults and random failures. The issue was closed with no diagnosis after it couldn’t be reproduced with Go 1.9. Nobody knew what the root cause was, just that it had stopped happening.

So I grabbed this sample code from the issue, which claimed to reproduce the problem, and ran it on my machine. Lo and behold, it crashed within seconds. Bingo. This is a lot better than waiting hours for node_exporter to crash.

That doesn’t get me any closer to debugging the issue from the Go side, but it gives me a much faster way to test for it. So let’s try another angle.

Bisecting machines

I know the problem happens on my laptop, but not on any of my other machines. I tried the reproducer on every other machine I have easy access to, and couldn’t get it to crash on any of them. This tells me there’s something special about my laptop. Since Go statically links binaries, the rest of userspace doesn’t matter. This leaves two relevant parts: the hardware, and the kernel.

I don’t have any easy way to test with various hardware other than the machines I own, but I can play with kernels. So let’s try that. First order of business: will it crash in a VM?

To test this, I built a minimal initramfs that would let me very quickly launch the reproducer in a QEMU VM, without having to actually install a distro or boot a full Linux system. My initramfs was built with Linux’s scripts/gen_initramfs_list.sh and contained the following files:

dir /dev 755 0 0
nod /dev/console 0600 0 0 c 5 1
nod /dev/null 0666 0 0 c 1 3
dir /bin 755 0 0
file /bin/busybox busybox 755 0 0
slink /bin/sh busybox 755 0 0
slink /bin/true busybox 755 0 0
file /init init.sh 755 0 0
file /reproducer reproducer 755 0 0

/init is the entry point of a Linux initramfs, and in my case was a simple shellscript to start the test and measure time:

#!/bin/sh
export PATH=/bin

start=$(busybox date +%s)

echo "Starting test now..."
/reproducer
ret=$?
end=$(busybox date +%s)
echo "Test exited with status $ret after $((end-start)) seconds"

/bin/busybox is a statically linked version of BusyBox, often used in minimal systems like this to provide all basic Linux shell utilities (including a shell itself).

The initramfs can be built like this (from a Linux kernel source tree), where list.txt is the file list above:

scripts/gen_initramfs_list.sh -o initramfs.gz list.txt

And QEMU can boot the kernel and initramfs directly:

qemu-system-x86_64 -kernel /boot/vmlinuz-4.13.9-gentoo -initrd initramfs.gz -append 'console=ttyS0' -smp 8 -nographic -serial mon:stdio -cpu host -enable-kvm

This resulted in no output at all to the console… and then I realized I hadn’t even compiled 8250 serial port support into my laptop’s kernel. D’oh. I mean, it doesn’t have a physical serial port, right? After a quick detour to rebuild the kernel with serial support (crossing my fingers that this didn’t change anything important), I tried again, and it successfully booted and ran the reproducer.

Did it crash? Yup. Good, this means the problem is reproducible on a VM on the same machine. I tried the same QEMU command on my home server, with its own kernel, and… nothing. Then I copied the kernel from my laptop and booted that and… it crashed. The kernel is what matters. It’s not a hardware issue.

Juggling kernels

At this point, I knew I was going to be compiling lots of kernels to try to narrow this down. So I decided to move to the most powerful machine I had lying around: a somewhat old 12-core, 24-thread Xeon (now defunct, sadly). I copied the known-bad kernel source to that machine, built it, and tested it.

It didn’t crash.

What?

Some head-scratching later, I made sure the original bad kernel binary crashed (it did). Are we back to hardware? Does it matter which machine I build the kernel on? So I tried building the kernel on my home server, and that one promptly triggered the crash. Building the same kernel on two machines yields crashes; building it on a third doesn’t. What’s the difference?

Well, these are all Gentoo boxes, and all Gentoo Hardened at that. But my laptop and my home server are both ~amd64 (unstable), while my Xeon server is amd64 (stable). That means GCC is different. My laptop and home server were both on gcc (Gentoo Hardened 6.4.0 p1.0) 6.4.0, while my Xeon was on gcc (Gentoo Hardened 5.4.0-r3 p1.3, pie-0.6.5) 5.4.0.

But my home server’s kernel, which was nearly the same version as my laptop’s (though not identical) and built with the same GCC, did not reproduce the crashes. So now we have to conclude that both the compiler used to build the kernel and the kernel itself (or its config?) matter.

To narrow things down further, I compiled the exact kernel tree from my laptop on my home server (linux-4.13.9-gentoo), and confirmed that it indeed crashed. Then I copied over the .config from my home server and compiled that, and found that it didn’t. This means we’re looking at a kernel config difference and a compiler difference:

  • linux-4.13.9-gentoo + gcc 5.4.0-r3 p1.3 + laptop .config - no crash
  • linux-4.13.9-gentoo + gcc 6.4.0 p1.0 + laptop .config - crash
  • linux-4.13.9-gentoo + gcc 6.4.0 p1.0 + server .config - no crash

Two .configs, one good, and one bad. Time to diff them. Of course, the two configs were vastly different (since I tend to tailor my kernel config to only include the drivers I need on any particular machine), so I had to repeatedly rebuild the kernel while narrowing down the differences.

I decided to begin with the “known bad” .config and start removing things. Since the reproducer takes a variable amount of time to crash, it’s easier to test for “still crashes” (just wait for it to crash) than for “doesn’t crash” (how long do I have to wait to convince myself that it doesn’t?). Over the course of 22 kernel builds, I managed to simplify the config so much that the kernel had no networking support, no filesystems, no block device core, and didn’t even support PCI (it still worked fine in a VM though!). My kernel builds now took less than 60 seconds and the kernel was about a quarter of the size of my regular one.

Then I moved on to the “known good” .config and removed all the unnecessary junk while making sure it still didn’t crash the reproducer (which was trickier and slower than the previous test). I had a few false branches, where a change made the reproducer start crashing but I misidentified the build as “no crash”; when I later did get a crash, I had to walk back through the previous kernels I’d built and pin down exactly where the crashes had been introduced. I ended up doing 7 kernel builds.

Eventually, I narrowed it down to a small handful of .config options, and a few of them stood out, in particular CONFIG_OPTIMIZE_INLINING. After carefully testing them I concluded that, indeed, that option was the culprit. Turning it off produced kernels that crashed the reproducer testcase, while turning it on produced kernels that didn’t. This option, when turned on, allows GCC to decide for itself which functions marked inline really need to be inlined, instead of being forced to inline them unconditionally. This also explains the GCC connection: inlining behavior is likely to change between GCC versions. The relevant kernel code looks like this:

/*
 * Force always-inline if the user requests it so via the .config,
 * or if gcc is too old.
 * GCC does not warn about unused static inline functions for
 * -Wunused-function.  This turns out to avoid the need for complex #ifdef
 * directives.  Suppress the warning in clang as well by using "unused"
 * function attribute, which is redundant but not harmful for gcc.
 */
#if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) ||                \
    !defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4)
#define inline inline           __attribute__((always_inline,unused)) notrace
#define __inline__ __inline__   __attribute__((always_inline,unused)) notrace
#define __inline __inline       __attribute__((always_inline,unused)) notrace
#else
/* A lot of inline functions can cause havoc with function tracing */
#define inline inline           __attribute__((unused)) notrace
#define __inline__ __inline__   __attribute__((unused)) notrace
#define __inline __inline       __attribute__((unused)) notrace
#endif

So what next? We know that CONFIG_OPTIMIZE_INLINING makes the difference, but that potentially changes the behavior of every single inline function across the whole kernel. How to pinpoint the problem?

I had an idea.

Hash-based differential compilation

The basic premise is to compile part of the kernel with the option turned on, and part of the kernel with the option turned off. By testing the resulting kernel and checking whether the problem appears or not, we can deduce which subset of the kernel compilation units contains the problem code.

Instead of trying to enumerate all object files and doing some kind of binary search, I decided to go with a hash-based approach. I wrote this wrapper script for GCC:

#!/bin/bash
args=("$@")

doit=
while [ $# -gt 0 ]; do
        case "$1" in
                -c)
                        doit=1
                        ;;
                -o)
                        shift
                        objfile="$1"
                        ;;
        esac
        shift
done

extra=
if [ ! -z "$doit" ]; then
        sha="$(echo -n "$objfile" | sha1sum - | cut -d" " -f1)"
        echo "${sha:0:8} $objfile" >> objs.txt
        if [ $((0x${sha:0:8} & (0x80000000 >> $BIT))) = 0 ]; then
                echo "[n]" "$objfile" 1>&2
        else
                extra=-DCONFIG_OPTIMIZE_INLINING
                echo "[y]" "$objfile" 1>&2
        fi
fi

exec gcc $extra "${args[@]}"

This hashes the object file name with SHA-1, then checks a given bit of the hash out of the first 32 bits (identified by the $BIT environment variable). If the bit is 0, it builds without CONFIG_OPTIMIZE_INLINING. If it is 1, it builds with CONFIG_OPTIMIZE_INLINING. I found that the kernel had around 685 object files at this point (my minimization effort had paid off), so about 10 bits suffice to identify any one of them uniquely (2^10 = 1024). This hash-based approach also has one neat property: I can choose to only worry about crashing outcomes (where the bit is 0), since it is much harder to prove that a given kernel build does not crash (the crashes are probabilistic and can sometimes take quite a while to show up).

I built 32 kernels, one for each bit of the SHA-1 prefix, which only took 29 minutes. Then I started testing them, and every time I got a crash, I narrowed down a regular expression of possible SHA-1 hashes to only those with zero bits at those specific positions. At 8 crashes (and thus 8 known-zero bits), I was down to 4 object files, and a couple were looking promising. Once I hit the 10th crash, there was a single match.
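For the curious, here is a rough Go sketch of how such a character-class regex can be derived from the bit positions known to be zero. The bit positions in main are made up for illustration; the real ones came from whichever BIT builds actually crashed:

package main

import (
	"fmt"
	"strings"
)

// zeroBitRegex builds a regex that matches only hex hash prefixes whose bits
// at the given positions are zero. Bit 0 is the MSB of the 32-bit prefix,
// matching the 0x80000000 >> BIT test in the wrapper script.
func zeroBitRegex(zeroBits []int, hexDigits int) string {
	zero := map[int]bool{}
	for _, b := range zeroBits {
		zero[b] = true
	}
	var sb strings.Builder
	sb.WriteString("^")
	for d := 0; d < hexDigits; d++ {
		var class []byte
		for v := 0; v < 16; v++ {
			ok := true
			for i := 0; i < 4; i++ { // bit i of this nibble, MSB first
				if zero[d*4+i] && v&(8>>i) != 0 {
					ok = false
					break
				}
			}
			if ok {
				class = append(class, "0123456789abcdef"[v])
			}
		}
		sb.WriteString("[" + string(class) + "]")
	}
	return sb.String()
}

func main() {
	// Made-up zero-bit positions, purely to show the shape of the output.
	fmt.Println(zeroBitRegex([]int{0, 2, 5, 9, 12, 14, 18, 21, 24, 27}, 7))
}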

$ grep '^[0246][012389ab][0189][014589cd][028a][012389ab][014589cd]' objs_0.txt
6b9cab4f arch/x86/entry/vdso/vclock_gettime.o

vDSO code. Of course.

vDSO shenanigans

The kernel’s vDSO is not actually kernel code. vDSO is a small shared library that the kernel places in the address space of every process, and which allows apps to perform certain special system calls without ever leaving user mode. This increases performance significantly, while still allowing the kernel to change the implementation details of those system calls as needed.

In other words, vDSO is GCC-compiled code, built with the kernel, that ends up being linked with every userspace app. It’s userspace code. This explains why the kernel and its compiler mattered: it wasn’t about the kernel itself, but about a shared library provided by the kernel! And Go uses the vDSO for performance. Go also happens to have a (rather insane, in my opinion) policy of reinventing its own standard library, so it does not use any of the standard Linux glibc code to call vDSO, but rather rolls its own calls (and syscalls too).
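As an aside, you can see the vDSO mapped into any ordinary process. A small Linux-specific sketch that just prints the current process’s [vdso] mapping from /proc/self/maps:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Every Linux process gets the vDSO mapped in; find it in our own maps.
	f, err := os.Open("/proc/self/maps")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.Contains(sc.Text(), "[vdso]") {
			fmt.Println(sc.Text())
		}
	}
}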

So what does flipping CONFIG_OPTIMIZE_INLINING do to the vDSO? Let’s look at the assembly.

With CONFIG_OPTIMIZE_INLINING=n:

arch/x86/entry/vdso/vclock_gettime.o.no_inline_opt:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <vread_tsc>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	90                   	nop
   5:	90                   	nop
   6:	90                   	nop
   7:	0f 31                	rdtsc  
   9:	48 c1 e2 20          	shl    $0x20,%rdx
   d:	48 09 d0             	or     %rdx,%rax
  10:	48 8b 15 00 00 00 00 	mov    0x0(%rip),%rdx        # 17 <vread_tsc+0x17>
  17:	48 39 c2             	cmp    %rax,%rdx
  1a:	77 02                	ja     1e <vread_tsc+0x1e>
  1c:	5d                   	pop    %rbp
  1d:	c3                   	retq   
  1e:	48 89 d0             	mov    %rdx,%rax
  21:	5d                   	pop    %rbp
  22:	c3                   	retq   
  23:	0f 1f 00             	nopl   (%rax)
  26:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  2d:	00 00 00 

0000000000000030 <__vdso_clock_gettime>:
  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	48 81 ec 20 10 00 00 	sub    $0x1020,%rsp
  3b:	48 83 0c 24 00       	orq    $0x0,(%rsp)
  40:	48 81 c4 20 10 00 00 	add    $0x1020,%rsp
  47:	4c 8d 0d 00 00 00 00 	lea    0x0(%rip),%r9        # 4e <__vdso_clock_gettime+0x1e>
  4e:	83 ff 01             	cmp    $0x1,%edi
  51:	74 66                	je     b9 <__vdso_clock_gettime+0x89>
  53:	0f 8e dc 00 00 00    	jle    135 <__vdso_clock_gettime+0x105>
  59:	83 ff 05             	cmp    $0x5,%edi
  5c:	74 34                	je     92 <__vdso_clock_gettime+0x62>
  5e:	83 ff 06             	cmp    $0x6,%edi
  61:	0f 85 c2 00 00 00    	jne    129 <__vdso_clock_gettime+0xf9>
[...]

With CONFIG_OPTIMIZE_INLINING=y:

arch/x86/entry/vdso/vclock_gettime.o.inline_opt:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <__vdso_clock_gettime>:
   0:	55                   	push   %rbp
   1:	4c 8d 0d 00 00 00 00 	lea    0x0(%rip),%r9        # 8 <__vdso_clock_gettime+0x8>
   8:	83 ff 01             	cmp    $0x1,%edi
   b:	48 89 e5             	mov    %rsp,%rbp
   e:	74 66                	je     76 <__vdso_clock_gettime+0x76>
  10:	0f 8e dc 00 00 00    	jle    f2 <__vdso_clock_gettime+0xf2>
  16:	83 ff 05             	cmp    $0x5,%edi
  19:	74 34                	je     4f <__vdso_clock_gettime+0x4f>
  1b:	83 ff 06             	cmp    $0x6,%edi
  1e:	0f 85 c2 00 00 00    	jne    e6 <__vdso_clock_gettime+0xe6>
[...]

Interestingly, CONFIG_OPTIMIZE_INLINING=y, which is supposed to allow GCC to inline less, actually resulted in it inlining more: vread_tsc is inlined in that version, but not in the CONFIG_OPTIMIZE_INLINING=n version. But vread_tsc isn’t marked inline at all, so GCC is perfectly within its rights to behave like this, as counterintuitive as it may be.

But who cares if a function is inlined? Where’s the actual problem? Well, looking closer at the non-inline version…

  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	48 81 ec 20 10 00 00 	sub    $0x1020,%rsp
  3b:	48 83 0c 24 00       	orq    $0x0,(%rsp)
  40:	48 81 c4 20 10 00 00 	add    $0x1020,%rsp

Why is GCC allocating over 4KiB of stack? That’s not a stack allocation, that’s a stack probe, or more specifically, the result of the -fstack-check GCC feature.

Gentoo Linux enables -fstack-check by default on its hardened profile. This is a mitigation for the Stack Clash vulnerability. While -fstack-check is an old GCC feature and not intended for this, it turns out it effectively mitigates the issue (I’m told proper Stack Clash protection will be in GCC 8). As a side-effect, it causes some fairly silly behavior, where every non-leaf function (that is, a function that makes function calls) ends up probing the stack 4 KiB ahead of the stack pointer. In other words, code compiled with -fstack-check potentially needs at least 4 KiB of stack space, unless it is a leaf function (or a function where every call was inlined).

Go loves small stacks.

TEXT runtime·walltime(SB),NOSPLIT,$16
	// Be careful. We're calling a function with gcc calling convention here.
	// We're guaranteed 128 bytes on entry, and we've taken 16, and the
	// call uses another 8.
	// That leaves 104 for the gettime code to use. Hope that's enough!

Turns out 104 bytes aren’t enough for everybody. Certainly not for my kernel.

It’s worth pointing out that the vDSO specification makes no mention of maximum stack usage guarantees, so this is squarely Go’s fault for making invalid assumptions.

Conclusion

This perfectly explains the symptoms. The stack probe is an orq, a logical OR with 0. This is a no-op, but it effectively probes the target address (if it is unmapped, it will segfault). But we weren’t seeing segfaults in vDSO code, so how was this breaking Go? Well, OR with 0 isn’t really a no-op. Since orq is not an atomic instruction, what really happens is that the CPU reads the memory address and then writes the value back. This creates a race condition: if another thread running on another CPU stores to that address at just the wrong moment, the orq can effectively undo that store. Since the probe landed outside the tiny stack Go had provided, it was likely intruding on other threads’ stacks or other data, and, when the stars lined up, clobbering a concurrent write. This is also why GOMAXPROCS=1 works around the issue: it prevents two threads from running Go code at the same time.
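To make that failure mode concrete, here is a toy Go sketch of the same mechanism: a read-then-write-back that is logically a no-op racing against a real store, and occasionally undoing it. This is a deliberate data race purely for illustration (whether and how quickly it manifests depends on your hardware, scheduler, and compiler), and it is not the actual vDSO or runtime code:

package main

import (
	"fmt"
	"sync"
)

func main() {
	for attempt := 1; attempt <= 1000000; attempt++ {
		var x uint64
		var wg sync.WaitGroup
		wg.Add(2)
		go func() { // the "probe": read x, OR with 0, write it back (not atomic)
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				v := x
				x = v | 0
			}
		}()
		go func() { // the victim: a store that should stick
			defer wg.Done()
			x = 42
		}()
		wg.Wait()
		if x != 42 {
			fmt.Printf("store lost on attempt %d: x = %d\n", attempt, x)
			return
		}
	}
	fmt.Println("no lost store observed this time; races are like that")
}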

What’s the fix? I left that up to the Go devs. Their solution ultimately was to pivot to a larger stack before calling vDSO functions. This introduces a small speed penalty (nanoseconds), but it’s acceptable. After building node_exporter with the fixed Go toolchain, the crashes went away.

2017-12-05 01:20