I was just reading The State of the Art is Terrible by Zack Morris, which is quite well written. I couldn’t have put it better. In his words:
It really, truly, is all crap. And it’s much worse than anybody realizes.
In my words, the entire software world, from banking to games, is held together with rubber bands and baling wire. It’s beyond comprehension that software written so badly, on systems cobbled together from contracts with 20 different providers, can get your money from your bank account into your hands at a bank machine, or that we trust both hardware and software to fly airplanes and to control trains and traffic lights.
The more I know about the industry, the more I fear for my well-being.
In the article he also mentions:
…your web browser locked up on you because some novice programmer wrote some portion of it in blocking network code that is waiting for the last byte to arrive from the web server, and that the web server is sending that byte over and over again because a router is temporarily overloaded and is dropping packets like crazy…
That made me remember a story about a server-side application I worked on. I was working for a company that had some server software we were selling to police agencies all over the US. The lead programmer was a well-known hack (I didn’t say hacker, just “hack” — he really wasn’t very good). A lot of his code was incomprehensible, and it will be food for many stories for years to come.
A year or so after he left the company we began having an issue with the software. It ran as a Windows service, and every once in a while it would go into a “zombie state” where it would consume 100% of the CPU (or, on a hyperthreaded or multi-processor system, 100% of one of the CPUs). Eventually this problem became so big that it was assigned to me, and I was told to use whatever resources I needed to get it fixed.
The problem was catching it in the act. It was hard to do, since it only seemed to happen at one of our client’s sites. Since we were working in the Justice and Public Safety area, they were justifiably reluctant for us to even access the live system, much less attach a debugger. We did get access, but the tools we could use were limited.
I had my suspicions however. The symptoms were clear: CPU consumption indicated that it was in a “spin wait” cycle, and the fact that it stopped servicing requests indicated that it could be in the request-handling code. I started there, just reading the code and sniffing for the “bad smell” — the sense that a software engineer gets after a while when they read code that’s just wrong.
I also suspected that, since it only happened on the client’s site, it had something to do with their network configuration. Since they were a secure site, I surmised that they would be using some active network monitoring and intrusion detection. This kind of software is known for sending “probe packets” to sense what kind of services are active on the network — when it detects a change it alerts an operator so they can track down the new service and figure out if it’s legit or not.
Putting these two things together I started running little experiments. I treated the service as a “black box” and just wrote some software to send it some bad requests. I knew that it implemented a very limited HTTP server — but a kind of brain-dead one, in that it didn’t attempt to conform to the standard very well. Knowing this, I sent requests with bad data in them, and some that were incorrectly formatted.
Finally, after a week of solid work on this task, I killed the server. I wrote a little program that sent a specially formatted HTTP request that caused the service to go into the zombie state. With this in place, I attached the debugger to the service and watched what was happening.
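The exact request I used is long gone, but the shape of it was simple: an HTTP request whose Content-Length header claims more bytes than the body actually contains. A minimal sketch in Python (the path, host, and field names here are made up for illustration):

```python
def lying_request(host: str, body: bytes, extra: int) -> bytes:
    """Build an HTTP request whose Content-Length overstates the body size.

    `extra` is how many phantom bytes the header claims beyond what we
    actually send. A naive server that reads until Content-Length bytes
    arrive will wait forever for the missing `extra` bytes.
    """
    claimed = len(body) + extra  # the lie: claim more bytes than we send
    header = (
        f"POST /service HTTP/1.0\r\n"   # hypothetical path
        f"Host: {host}\r\n"
        f"Content-Length: {claimed}\r\n"
        f"\r\n"
    )
    return header.encode("ascii") + body

# A request claiming 13 body bytes while sending only 3:
req = lying_request("victim.example", b"x=1", extra=10)
```

Sent over a raw socket to the service and left open, a request like this was enough to wedge the handler.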
It was the request-handling code, as suspected. As I read it, I had a sense that it was different from most of the other code in the application. Programmers have coding styles just the same as an author has a writing style, or a Morse code operator used to have a “fist”, and this code did not look like it had been written by our illustrious hack, but by a different, even less experienced one.
It turned out that the server thread was reading the HTTP request length, and waiting in a “for” loop for the next byte. If you sent a request and “lied” about how many bytes were in the body (claimed there were more than there actually were), the loop would wait forever trying to read the missing bytes. Rather than using a blocking wait, the loop did continuous reads and counted how many bytes had been read. If no more bytes were ever written to that port, the loop would spin forever, burning CPU. Since the code was single-threaded at this point, no other request could be serviced.
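The original code is lost to history, but the shape of the bug, and one way to fix it, can be sketched in Python (assuming, as described, a non-blocking socket polled in a tight loop):

```python
import socket

def read_body_buggy(sock: socket.socket, claimed_length: int) -> bytes:
    """The shape of the bug: spin on a non-blocking socket until
    claimed_length bytes arrive. If the client lied about the length,
    the missing bytes never arrive and this loop burns 100% CPU forever."""
    sock.setblocking(False)
    data = b""
    while len(data) < claimed_length:
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:
            continue          # no data yet: immediately try again (spin!)
        if not chunk:
            continue          # peer closed: still keeps spinning
        data += chunk
    return data

def read_body_fixed(sock: socket.socket, claimed_length: int,
                    timeout: float = 5.0) -> bytes:
    """One fix: block with a timeout, and treat a closed connection or a
    timeout as an error instead of waiting forever for bytes that will
    never come."""
    sock.settimeout(timeout)  # recv now blocks, but only for `timeout` secs
    data = b""
    while len(data) < claimed_length:
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            raise ValueError("fewer bytes arrived than Content-Length claimed")
        if not chunk:         # peer closed before sending the full body
            raise ValueError("connection closed before full body arrived")
        data += chunk
    return data
```

The fixed version yields the CPU while waiting, and, just as importantly, gives up: a request that lies about its length now fails cleanly instead of wedging the whole single-threaded service.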
This shortcut cost the company about $15,000 for the months of effort it took to isolate, reproduce and fix the code.
I never found out what it was about the client’s site that was causing the trouble. I still suspect the network intrusion-detection software, and I also suspect it was sending bad requests because it probably had some similarly bad code in it, lying about the request length.