The Byte That Brought Things Down


Sometimes when I review code I find nitpicky issues such as trailing whitespace or a missing newline at the end of a file, and they annoy me. I feel this is something automated tools should catch, but regardless of whether I should comment on it in the PR or not, I’d like to explain why it bothers me.

This was a long time ago, circa 2014, when I was working on a distributed system. The system was in charge of scanning and monitoring ERP servers, and its architecture consisted of a main server (a web service) and several sensors, which were separate agents in charge of the actual scanning and monitoring. The sensors reported their results back to the main server, which held all the data displayed to users.

One day, after a new deployment, the dashboard showed all sensors as offline. New tasks could not be scheduled because no sensors were available, and tasks that were already scheduled were not being triggered either.

At first, it looked like a network issue: sensors were unable to reach the main server and report their heartbeat status. But on closer inspection, the problem was more subtle. Sensors sent their heartbeat status through cron jobs, and cron was also used to trigger all sorts of recurring tasks, such as checking for new scan jobs in the queue.

A new type of scheduled job was needed, so a new entry had to be added to the cron configuration. The developer in charge of the task edited one of the configuration files, saved it, and the change went to prod. The problem? The file did not end with a newline (\n). The Linux version in use (Ubuntu 12.04 LTS) shipped a cron that behaved as if all configuration files were concatenated into one big cron table (i.e. cat /etc/cron.d/*), and because one of these files did not end with a newline, the resulting table was invalid, and therefore no job was running.
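To make the failure mode concrete, here is a small sketch of what concatenation does to a file that lacks a final newline. It only simulates the `cat /etc/cron.d/*` behavior described above; it is not how cron itself is implemented, and the file names and commands are made up for illustration.

```python
# Hypothetical contents of two files under /etc/cron.d/.
# File names, schedules, and script paths are invented for this example.
cron_files = {
    "new-job": b"*/10 * * * * root /opt/sensor/new_job.sh",  # no trailing \n
    "sensor-heartbeat": b"* * * * * root /opt/sensor/heartbeat.sh\n",
}


def concatenate(files):
    """Join file contents in name order, roughly what `cat dir/*` would do."""
    return b"".join(files[name] for name in sorted(files))


table = concatenate(cron_files)
lines = table.decode().splitlines()

# Two entries went in, but only one (malformed) line comes out.
print(len(lines))
print(lines[0])
```

The last line of the first file and the first line of the next one are fused into a single entry, which no longer parses as a valid schedule.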

The fix was obviously simple, but this is not the kind of issue you’d expect from a seemingly innocent change, and it can very easily sneak into production undetected. This case is a reminder of how small details can bring a whole system down.
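For what it's worth, a check along these lines could have flagged the file before it shipped. This is a minimal sketch and not what we actually ran at the time; the function names are my own, and treating empty files as acceptable is an assumption.

```python
from pathlib import Path


def missing_final_newline(path):
    """Return True if the file is non-empty and its last byte is not a newline."""
    data = Path(path).read_bytes()
    return len(data) > 0 and not data.endswith(b"\n")


def check_directory(directory):
    """Return sorted names of regular files in `directory` lacking a final newline."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and missing_final_newline(p)
    )
```

Calling `check_directory("/etc/cron.d")` (or pointing it at the directory of config files in the repository) and failing the build when the returned list is non-empty would be enough to turn this kind of outage into a red CI job.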