Yak Shaving Day: Rails Asset Precompilation to Kernel Debugging

Gilden Yak Shaving Day

Yak shaving is usually a huge pain in the ass. Sometimes, though, you can learn from the experience. Based on an actual bug I was recently asked to take a look at, I'm going to take you from an issue that exhibited as JavaScript not loading in the browser, down to Linux kernel debugging over virtualized serial ports.

This is of course, insane. If you ever have to really debug a broken Rails app by running gdb against the kernel, I'm highly interested in whatever awful thing you did to deserve such a fate. Think about your life choices really hard. Did you kick a kid back in kindergarten? He's probably homeless now, all due to that one transformative event that set his life on the wrong path. Dick.

The Bug

A staging setup consisting of two machines with the exact same revision of an app checked out were generating precompiled assets that ended up with different hashes. Cries of “host your assets on a CDN” notwithstanding, this was rather unexpected. The issue manifested itself as a page occasionally not working, as the <script> tag served up by one instance of the application referenced a URL that 404d on another instance.

Don't be afraid.

If you'd like to follow along, I've created a Rails template that sets up an application that might break in the same way. Due to the nature of the issue, I can't guarantee you'll actually be able to reproduce it. I was able to do it using two different Linodes I happened to have access to running different versions of Ubuntu with different kernels, but in the environment this actually happened in, both instances had the same kernel and some OS version. Your mileage, as usual, may vary. You can generate a new Rails app from the template like so:

$ rails new testapp -m https://gist.github.com/gabrielg/b68ea622f6ade2b6a99f/raw/4826ca1ab4725da89e079df0b137846b476d4aa6/recipe.rb

Here are two instances I prepared earlier. Each has the exact same copy of the application, running on the same version of Ruby. It's immediately obvious that application.js is being precompiled and ending up with different checksums on each machine:

# Instance A
ggironda@athena:~/testapp$ bundle exec rake assets:precompile
/home/ggironda/.rbenv/versions/1.9.3-p392/bin/ruby /home/ggironda/.rbenv/versions/1.9.3-p392/bin/rake assets:precompile:all RAILS_ENV=production RAILS_GROUPS=assets
ggironda@athena:~/testapp$ ls public/assets/application*.js
public/assets/application-d948590a01ac46e5fddbbc0e842b9d3e.js  public/assets/application.js

# Instance B
ggironda@annealer:~/testapp$ bundle exec rake assets:precompile
/home/ggironda/.rbenv/versions/1.9.3-p392/bin/ruby /home/ggironda/.rbenv/versions/1.9.3-p392/bin/rake assets:precompile:all RAILS_ENV=production RAILS_GROUPS=assets
ggironda@annealer:~/testapp$ ls public/assets/application*.js
public/assets/application-df1f53d685f287055687bc16383bb856.js  public/assets/application.js

The JavaScript has been minified, so running diff against the two files won't really help with figuring out what's different. Beautifying both files with a tool such as jsbeautify first allows for a useful diff to be produced, giving the first bit of insight into what the problem is:

ggironda@moya /tmp/assets » diff application-d948590a01ac46e5fddbbc0e842b9d3e.js application-df1f53d685f287055687bc16383bb856.js
3642a3643,3645
>     hr: {
>         hello: "hello world from hr"
>     },
3649,3654d3651
<     "es-AR": {
<         hello: "hello world from es-AR"
<     },
<     hr: {
<         hello: "hello world from hr"
<     },
3662a3660,3662
>     },
>     "es-AR": {
>         hello: "hello world from es-AR"

Looks like localization data, right? The app is using a gem called i18n-js to expose translations in the Rails app to the client side. The portions of the two files that are different are the JS representations of these localizations—it's the exact same data though, just in different order. While this works just fine when the JavaScript is loaded in the browser, it's the cause of the different hashes.

Flying Backwards

Are you clinging tenaciously to my buttocks?

I'm going to unwind different parts of this mess really fast. I'll leave the choice of the Benny Hill Theme or Sabre Dance as audio accompaniment up to you, dear reader.

The point at which the JavaScript related to all of this i18n stuff is first pulled into the asset pipeline is via two Sprockets require lines in app/assets/javascripts/application.js. One pulls in the i18n JS library itself, the other pulls in the translation data:

//= require i18n
//= require i18n/translations

i18n/translations references an ERB template in the i18n-js gem itself.

I18n.translations = <%=
  SimplesIdeias::I18n.translation_segments.each_with_object({}) do |(name, segment),translations|
    translations.merge!(segment)
  end.to_json
%>;

SimplesIdeias::I18n.translation_segments is a Hash, and in this dummy application I'm working against, it only has one key—“app/assets/javascripts/i18n/translations.js”. The value corresponding to this key is another Hash. Checking what the keys are on my two instances of the app yields more clues:

# Instance A
irb(main):019:0> SimplesIdeias::I18n.translation_segments["app/assets/javascripts/i18n/translations.js"].keys
=> [:en, :"en-AU", :"en-US", :"es-AR", :hr, :ko, :pl, :ru]

# Instance B
irb(main):018:0> SimplesIdeias::I18n.translation_segments["app/assets/javascripts/i18n/translations.js"].keys
=> [:en, :hr, :"en-AU", :"en-US", :ko, :pl, :ru, :"es-AR"]

In Ruby 1.9, insertion order for a Hash is preserved, unlike Ruby 1.8. Anything that iterates over a Hash will have the keys and values yielded in the order they were inserted. It stands to reason that this might have an effect on the order of keys in the JSON output generated from a Hash, too. This can be quickly confirmed as the behaviour of the JSON library that ships with Ruby 1.9:

irb(main):020:0* require "json"
=> false
irb(main):021:0> hash_a = {}; hash_a[:a] = true; hash_a[:b] = true
=> true
irb(main):022:0> hash_b = {}; hash_b[:b] = true; hash_b[:a] = true
=> true
irb(main):023:0> puts JSON.dump(hash_a)
{"a":true,"b":true}
=> nil
irb(main):024:0> puts JSON.dump(hash_b)
{"b":true,"a":true}
=> nil
irb(main):025:0> hash_a == hash_b
=> true

My word. Two ostensibly equivalent hashes, but JSON.dump is generating different output. What a topsy-turvy world.

Time to figure out why the keys are in a different order in the first place. The translations are loaded in i18n-js in a kinda gross bit of instance_eval hackery:

::I18n.backend.instance_eval do
  init_translations unless initialized?
  translations
end

The Simple I18n backend that's in use mixes in the Base backend module, which loads the translation files in order from I18n.load_path:

def load_translations(*filenames)
  filenames = I18n.load_path if filenames.empty?
  filenames.flatten.each { |filename| load_file(filename) }
end

In Rails 3.2.13, I18n.load_path is set during the application initialization process through i18n_railtie.rb, a part of ActiveSupport:

when :load_path
  I18n.load_path += value
else

According to the documentation in config/application.rb of a fresh Rails app, everything from config/locales/*.rb,yml is auto loaded. This application, however, has changed the default load path so that localizations in subdirectories are loaded as well. As of writing, it's the recommended change to make according to the Rails Guides:

config.i18n.load_path += Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s]

Checking the return value of that Dir.[] call on the two instances of the app shows that the files are returned in different order—an order that looks suspiciously like the order of the keys in the translation Hash:

# Instance A. Full array returned by Dir.[] omitted for brevity
irb(main):024:0> pp Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s][0..1]
["/home/ggironda/testapp/config/locales/en.yml",
 "/home/ggironda/testapp/config/locales/models/en-AU.yml"]

# Instance B. Full array returned by Dir.[] omitted for brevity
irb(main):033:0> pp Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s][0..1]
["/home/ggironda/testapp/config/locales/en.yml",
 "/home/ggironda/testapp/config/locales/models/hr.yml"]

When sorted, the arrays are the same. Great, now what?

The Fix

Well, the quick fix is straightforward: make sure that the collection of translations is sorted somehow. The easiest way is to just sort the paths returned by the Dir.[] call:

config.i18n.load_path += Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s].sort

This results in the Hash of translations kept internally by the i18n backend being sorted the same everywhere.

A longer term fix depends on your perspective:

If we were just interested in a fix, this post could end here. That's boring. Let's go spelunking.

Into The Gaping Maw

Dir.[] isn't written in Ruby. It's implemented in Ruby's C source as a function called dir_s_aref. This is a common quandary folks have with working on Ruby itself—a lot of the time, hacking on parts of the language that aren't the VM still involves knowing more C than it does Ruby. If you're truly, madly, deeply curious about how it's done in other Ruby implementations:

We're dealing with MRI here though, so onwards with that. dir_s_aref eventually ends up calling a function called glob_helper. Doxygen made me a call graph to help illustrate this. It's nice like that.

dir_s_aref call graph

glob_helper ends up calling whatever the READDIR macro expands to in a loop, as you may expect for code that is recursively searching directories. What the READDIR macro expands to is dependent on your system. Regardless of that, we can use strace on Linux (or dtruss on systems that support DTrace, such as OS X and FreeBSD) to figure out what syscalls end up getting invoked.

Running code under strace that performs the same glob without the rest of Rails in the way reveals more of what's going on underneath the covers—calls to getdents followed by calls to lstat for every file.

ggironda@annealer:~/testapp$ ruby -rpathname -e 'puts Process.pid; sleep(10); Dir[Pathname(".").join("config", "locales", "**", "*.{rb,yml}").to_s][0..1]'
5987
^Z
[1]+  Stopped                 ruby -rpathname -e 'puts Process.pid; sleep(10); Dir[Pathname(".").join("config", "locales", "**", "*.{rb,yml}").to_s][0..1]'
ggironda@annealer:~/testapp$ strace -q -f -e trace=desc,file -p 5987
[pid  6007] --- SIGTSTP (Stopped) @ 0 (0) ---
[pid  5987] --- SIGTSTP (Stopped) @ 0 (0) ---
[pid  6007] select(4, [3], NULL, NULL, NULL <unfinished ...>
[pid  5987] openat(AT_FDCWD, "config/locales", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
[pid  5987] getdents(5, /* 5 entries */, 32768) = 144
[pid  5987] lstat("config/locales/en.yml", {st_mode=S_IFREG|0644, st_size=214, ...}) = 0
[pid  5987] lstat("config/locales/models", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid  5987] openat(AT_FDCWD, "config/locales/models", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid  5987] getdents(6, /* 9 entries */, 32768) = 272
[pid  5987] lstat("config/locales/models/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/models/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/models/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] getdents(6, /* 0 entries */, 32768) = 0
[pid  5987] close(6)                    = 0
[pid  5987] lstat("config/locales/views", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid  5987] openat(AT_FDCWD, "config/locales/views", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid  5987] getdents(6, /* 9 entries */, 32768) = 272
[pid  5987] lstat("config/locales/views/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/views/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/views/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] getdents(6, /* 0 entries */, 32768) = 0
[pid  5987] close(6)                    = 0
[pid  5987] getdents(5, /* 0 entries */, 32768) = 0
[pid  5987] close(5)                    = 0
[pid  5987] openat(AT_FDCWD, "config/locales", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
[pid  5987] getdents(5, /* 5 entries */, 32768) = 144
[pid  5987] lstat("config/locales/en.yml", {st_mode=S_IFREG|0644, st_size=214, ...}) = 0
[pid  5987] lstat("config/locales/models", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid  5987] openat(AT_FDCWD, "config/locales/models", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid  5987] getdents(6, /* 9 entries */, 32768) = 272
[pid  5987] lstat("config/locales/models/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/models/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/models/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/models/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] getdents(6, /* 0 entries */, 32768) = 0
[pid  5987] close(6)                    = 0
[pid  5987] lstat("config/locales/views", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid  5987] openat(AT_FDCWD, "config/locales/views", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid  5987] getdents(6, /* 9 entries */, 32768) = 272
[pid  5987] lstat("config/locales/views/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/views/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] lstat("config/locales/views/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid  5987] lstat("config/locales/views/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid  5987] getdents(6, /* 0 entries */, 32768) = 0
[pid  5987] close(6)                    = 0
[pid  5987] getdents(5, /* 0 entries */, 32768) = 0
[pid  5987] close(5)                    = 0
[pid  5987] open("/proc/self/maps", O_RDONLY|O_CLOEXEC) = 5
[pid  5987] fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid  5987] read(5, "00400000-0062c000 r-xp 00000000 "..., 1024) = 1024
[pid  5987] read(5, "       /home/ggironda/.rbenv/ver"..., 1024) = 1024
[pid  5987] read(5, "/1.9.1/x86_64-linux/enc/encdb.so"..., 1024) = 1024
[pid  5987] read(5, "/libm-2.15.so\n7fb3f7d2f000-7fb3f"..., 1024) = 1024
[pid  5987] read(5, "6_64-linux-gnu/libdl-2.15.so\n7fb"..., 1024) = 1024
[pid  5987] read(5, "\n7fb3f878d000-7fb3f878e000 rw-p "..., 1024) = 960
[pid  5987] close(5)                    = 0
[pid  5987] write(4, "!", 1)            = 1
[pid  6007] <... select resumed> )      = 1 (in [3])
[pid  6007] read(3, "!", 1024)          = 1
ggironda@annealer:~/testapp$

Giving a brief thought to how one could implement a simple recursive search of directories, this makes sense:

Running this on the two different instances I have set up shows the calls happening in different order. It turns out that the getdents system call (and by extension, readdir(3)) makes no guarantees as to the order directory entries are returned in. Assuming that it will return them in any specific order at all will probably land you in hot water. If you really want 'em sorted, you're just going to have to do it yourself.

We can go further. Pull up your socks and follow me.

Now This Is Just Getting Ridiculous

The getdents system call in Linux 3.5.0 is defined in readdir.c. You may notice it calls out to a vfs_readdir function defined in the same file, which eventually calls a function pointed to by some readdir pointer in a struct.

# Excerpt of the getdents syscall
buf.current_dir = dirent;
buf.previous = NULL;
buf.count = count;
buf.error = 0;

error = vfs_readdir(file, filldir, &buf);
if (error >= 0)
    error = buf.error;
lastdirent = buf.previous;


# Excerpt of vfs_readdir()
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
    res = file->f_op->readdir(file, buf, filler);
    file_accessed(file);
}
mutex_unlock(&inode->i_mutex);

What function is it calling? Well, let's use gdb on the kernel itself to find out. Ubuntu ships with support for the kernel debugger built in, but it needs to be enabled at boot time. I did this by adding a new GRUB menu entry in /etc/grub.d/40_custom. Mind you, it's been a while since I had to configure a bootloader, so this is mostly copied and pasted then tweaked as necessary:

menuentry 'Ubuntu With KGDB' {
  recordfail
  gfxmode $linux_gfx_mode
  insmod gzio
  insmod part_msdos
  insmod ext2
  set root='hd0,msdos1'
  if [ x$feature_platform_search_hint = xy ]; then
    search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1  ea92d8ea-4d0a-4c67-8b69-ac8f0f9f7ee9
  else
    search --no-floppy --fs-uuid --set=root ea92d8ea-4d0a-4c67-8b69-ac8f0f9f7ee9
  fi
  linux  /vmlinuz root=UUID=ea92d8ea-4d0a-4c67-8b69-ac8f0f9f7ee9 ro kgdboc=ttyS1,115200 kgdbwait
  initrd /initrd.img
}

The kernel debugger is configured to use the serial port ttyS1, at a baud rate of 115200, and to drop to the debugger and wait on boot. “Wait, what serial port? I'm running this in a virtual machine.”, you ask? Right.

VMWare Fusion allows for emulation of serial ports. On my Mac, the VMs are kept in ~/Documents/Virtual Machines. Each VM in there is actually a directory, containing a .vmx file that specifies virtual machine configuration. My serial port configuration for the machine that's going to be debugged looks like so:

ggironda@moya ~ » grep serial1 ~/Documents/Virtual\ Machines.localized/Ubuntu\ 64-bit.vmwarevm/Ubuntu\ 64-bit.vmx
serial1.present = "TRUE"
serial1.fileType = "pipe"
serial1.fileName = "/private/tmp/ubuntu_vmware_serial_port"

You'll also need another VM to run gdb itself on. The configuration for that one is ever so slightly different:

ggironda@moya ~ » grep serial1 ~/Documents/Virtual\ Machines.localized/Ubuntu\ 64-bit\ KGDB\ Client.vmwarevm/Ubuntu\ 64-bit.vmx
serial1.present = "TRUE"
serial1.fileType = "pipe"
serial1.pipe.endPoint = "client"
serial1.fileName = "/private/tmp/ubuntu_vmware_serial_port"

To be able to do anything useful, you'll need debug symbols for the version of the kernel you're debugging. I built my own, but you can also install a package via apt-get. Once that's done, you're ready to fire up both VMs. Going into exact operation of kgdb here is a bit much, but you can find some good writing on the subject.

Once the machines are booted and a remote debugging session has started, getting more answers involves using gdb much like normal.

VM being debugged

GDB debugging session

Note: Executing ls itself may not have even triggered the breakpoint. Any number of other things, like the shell looking in $PATH may have resulted in getdents getting called. Either way, breakpoint triggered.

Setting a breakpoint in vfs_readdir lets us poke around in the kernel's guts. It turns out the brunt of the work of getdents in our case is handed off to ext4_readdir. The filesystem code clearly needs to be able to get directory entries. Going any further than this would really stretch your attention. At this point, I'm satisfied knowing that the filesystem code is responsible for the order directory entries are returned in, and that you shouldn't safely assume that order will be consistent in any way.

Get Me Out Of Here

I went from “this web app doesn't work in a user's browser because of a 404 on the JS” back down to “let's get crazy and debug the kernel over an emulated serial port”, all as part of investigating the one issue. Clearly overkill for fixing the problem—when working on it, I added the .sort and went on with my life. It just goes to show when a little more curiosity is involved, you can learn a whole lot even from some of the most easy to fix bugs.

If you enjoyed this, you'll probably also enjoy my posts on reverse engineering Vine.