Finding the right needle in the kernel haystack
Spotify makes a significant investment in writing code to automate the provisioning of servers in our data centers. Our goal is to automate hardware configuration and operating system installation so that we can mark a system for installation, flip the power switch and have it installed and added to the pool of available hardware used to run the Spotify backend service.
One important link in this chain is the ability to automatically configure the RAID hardware to provide the proper logical drives for our various use cases. This is typically done using a binary-only tool provided by the hardware vendor: 3Ware uses tw_cli, Adaptec uses arcconf and some LSI cards use the sas2ircu tool, which will be at the center of the rest of this post. We have been using that card in quite a few servers lately, and it has worked flawlessly. However, when we got a new type of hardware from Dell, the PowerEdge C5220, our RAID volume configuration code suddenly broke. More specifically, the sas2ircu tool died with a segfault when querying the current state of the RAID card by issuing “sas2ircu 0 display”.
We got in touch with Dell support, but since Debian GNU/Linux is not officially supported on their hardware they could not really help out. So, we tried installing Red Hat Enterprise Linux 6 on a server, and that one turned out not to exhibit the problem: the binary tool worked when running the Red Hat kernel. In an ideal world, hardware vendors (Are you listening, LSI?) would release their tools and specifications as free software projects, and it would probably be fairly straightforward to fix the tool. In this case, our options were limited to switching Linux distribution, a monumentally large undertaking, or modifying our kernel to match the behavior of the RHEL kernel.
So, a few days back I started looking into fixing the kernel, more specifically the mpt2sas driver that was used by the LSI Logic / Symbios Logic SAS2008 PCI Express Fusion-MPT card in our machines. It turns out that the driver in our kernel tree, which is pretty close to the standard Debian Squeeze kernel based on upstream Linux 2.6.32, is a fairly large and complex piece of code. At first glance the RHEL6 kernel, more specifically the 2.6.32-279.14.1.el6 version, looked promising. It’s based on the same upstream version, so the mpt2sas driver can’t be that different from the Debian one, right? Well, that assumption turned out to be spectacularly wrong. In fact, the difference is vast. Looking at the source code of the RHEL6 kernel reveals that the mpt2sas driver is backported from somewhere around the upstream Linux release 3.4. The Debian version contains a select few patches on top of the driver that was committed to the mainline kernel in September 2009. In contrast, the RHEL version contains the result of development from LSI as well as hundreds of kernel changes merged into the mainline kernel as late as March 2012. The full driver is around 20 000 lines of code, and the (non-unified) diff between the RHEL and Debian versions of the driver is well north of 10 000 lines. Looking at the diff, there seemed to be a high number of significant changes to the code, as would be expected.
Fortunately, since we had lots of hardware exhibiting the problem lying around, and I had some time to play around, I decided to attempt to track down the change between the driver shipped in Linux version 2.6.32 and the one in 3.4 that fixed the segfault problem. Before I describe my methodology in more detail, a word of warning: If you attempt this, please note that modifying the drivers for the RAID controller you use to access and store data will probably silently corrupt your data in very unpleasant ways. Be careful.
What I did was attempt to get a later driver to compile in our kernel source tree. If I got a driver module built, I could copy it to a test machine running a ramdisk-based system, load it and see if it resolved our problem. A module that did not exhibit the problem would, in itself, not be that valuable, but it would make it possible to binary search for the smallest change that would solve our problem. If such a code change was found, it is entirely possible that the change could be applied on top of the Debian kernel, and we would have a fix for our problem. Provided that the change was simple enough, it might even be considered safe enough to put into our production kernel.
So, only a couple of roadblocks in the way. First off, compile a good version. Fortunately we already have a version-controlled set of scripts that is used to modify the standard Debian kernel building setup and make it build with our patches on our TeamCity build cluster. To speed things up a bit further, I borrowed a fast machine (16 cores, 48 GB RAM, SSD) and made sure that the build ran with the magical DEB_BUILD_OPTIONS=parallel=8 environment variable set.
When I attempted to copy the 3.4 driver straight into the tree, compilation failed with a missing symbol, DEF_SCSI_QCMD. Reading the comments in scsi_host.h about this new macro, it seemed like this was part of some pretty major changes to the SCSI subsystem relating to how locking is done. Without delving too deeply into this, I tracked down the introduction of these changes to 2.6.37. So, I decided to try to get the 2.6.36 driver up and running instead. The compilation failures here were more straightforward. Adding a few defines and removing some code made it build. To my great joy, it also turned out that when loading the updated kernel module, the segfault problem was gone.
With one known good version and one known bad version, it was a simple matter of binary searching for the change that makes the difference. I made a silly error when executing this plan, and ended up building 13 different modules before I found the patch ebda4d38df542e1ff4747c4daadfc7da250b4fa6, a seven-line change to the _ctrl_do_mpt_command() function. Wonderfully enough, this change applies cleanly to the Debian version of the driver, and solves our problem.
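For readers who have not done this kind of search before, here is a minimal sketch of the idea in Python. It is not the code I used; the commit names and the fake_test function are hypothetical stand-ins for "build the module at this commit, load it on the test machine and check whether the tool still segfaults". It assumes the candidate changes are ordered oldest to newest, that the newest one is known good, and that once the fix has landed every later version stays fixed.

```python
from typing import Callable, Sequence


def find_first_fixed(commits: Sequence[str],
                     is_fixed: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_fixed() is True.

    Assumes commits are ordered oldest to newest, that the last one is
    known good, and that every commit after the fix is also good.
    """
    if not commits:
        raise ValueError("need at least one candidate commit")
    lo, hi = 0, len(commits) - 1  # commits[hi] is the known good version
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid          # the fix is at mid or earlier
        else:
            lo = mid + 1      # the fix landed after mid
    return commits[lo]


# Hypothetical history of 200 changes; the names are made up.
history = [f"commit-{i:03d}" for i in range(200)]
tested = []


def fake_test(commit: str) -> bool:
    # Stand-in for: build the module, load it, run the failing command.
    tested.append(commit)
    return commit >= "commit-137"


first_good = find_first_fixed(history, fake_test)
print(first_good, "found after", len(tested), "builds")
```

Each round halves the remaining range, so even a few hundred candidate changes need fewer than ten builds, as long as every test result is recorded correctly.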
We have now been running the patched kernel for a few weeks on a wide variety of hardware, and we have not discovered any problems with it so far.