Collecting kernel logs from a fleet of servers is inherently difficult, because the messages you're most interested in often result from crashes or other conditions that make userspace collection of those logs unreliable or impossible. The traditional approach is to scrape consoles, but that becomes unworkable at scale, especially when the fleet is composed of many different types of commodity and specialized hardware.
At Facebook, we use netconsole to solve this problem. Because netconsole emits kernel log messages synchronously over UDP, it captures output from nearly all crashes, and it is fantastically easy to deploy and run across a diverse server fleet. We use an open-source daemon called "netconsd" to process these messages at very large scale.
In this talk, we'll discuss how we collect, analyze, and visualize the data from this system at Facebook. We'll briefly cover how to set up and configure netconsole and netconsd in your own datacenter. Finally, we'll walk through some of the problems, errors, and crashes we've seen in production over the past year or so, how we found them, and how we fixed them.