Product Planning System

This is the first of two major systems I wrote for a company producing food. I was working in a small team. We added and lost members along the way, but the core team consisted of me, writing the business logic, a guy who wrote the database, a product manager, and a software architect.

The software we were writing was to replace a very old software system that the company relied on completely. The old software was running on an old AIX mainframe computer. It was written in a dead language called NATURAL, and it used a not-quite-dead database named Adabas for storage.

It was written almost 30 years ago by people who are now dead. In its humble beginnings, it served a client computer with a text prompt. The user could type their user ID into it, and it would tell them which pallets they should load onto their trucks for delivery. Over the decades, it evolved into a multi-user system with a Telnet frontend. When I arrived on the scene, it was hundreds of thousands of lines of code long, and it had many concurrent users working with it around the clock in multiple cities. It was capable of advanced product planning: it could plan when to send trucks, which pallets to put in each one, and which stores to send them to.

It had many problems, one being that the Telnet-based frontend was almost impossible to use unless you had received months of training. It was also slow. Very slow.

We had access to two people who knew the old system. One was generally helpful, but retired very soon after we started. The other felt threatened by us and decided to be very unhelpful. She would frequently claim that we were wasting everyone’s time and money by writing a replacement, because the old system was mathematically perfect, completely without bugs, and had been optimized for speed for decades, so it couldn’t possibly get any faster. She also refused to help us understand the NATURAL code, and would only share it with us on paper. We were given hundreds of pages, and needless to say, they didn’t help one bit. Her general unhelpfulness eventually got her fired, and we had to resort to other means of understanding the old program.

The first thing we did was to interview its users. I was in charge of business logic, so I did most of the interviewing. We were assured that the users knew best, and that asking them would give us all the information we needed. It turned out that this was absolutely not the case: most users would give conflicting answers, and those answers would in turn conflict with how the program actually behaved. Reference implementations based on user testimony would be ripped apart by other users. I tried to get around this by interviewing them in groups. That led to groups disagreeing with each other.

In the end, I abandoned the user interviews entirely and adopted another method: I decided to reverse engineer the program, starting with known good cases. At this point I still didn’t have a good grasp of how the Telnet application worked, so I scheduled a meeting with a power user. I laid out many valid scenarios that should work, along with my hypothesis of how the program would behave in each. I noted what the program actually did, and whenever it differed from my hypothesis, I documented it thoroughly.

This still wasn’t enough. I hadn’t caught any of the corner cases. I sat down again and devised a long list of scenarios that shouldn’t work. Things like “you have three trucks, but only one pallet”, and “a product needs to go to a location, but no truck is going there”. The power user was very annoyed with me and refused to help at first, because these things would never happen, and nobody would ever put this into the system. I insisted, and he complained to my boss, who at first told me to stand down. I explained that I was trying to find important corner cases, and begged them for a short session of this kind. In order to learn how an algorithm works, it’s incredibly important to know how it behaves when given bad data.
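The scenario-and-hypothesis approach described above can be sketched as a small log that pairs each scenario with an expected and an observed behaviour. This is only an illustration of the bookkeeping, not the tooling actually used: the `Scenario` class, its fields, and the `discrepancies` helper are all hypothetical names, and the hypotheses and observations in the sample entries are invented placeholders. Only the two scenario descriptions come from the text.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One row of a reverse-engineering log (all field names are hypothetical)."""
    name: str
    description: str
    hypothesis: str      # what I expect the old system to do
    observed: str = ""   # what it actually did, filled in during a session

    @property
    def status(self) -> str:
        if not self.observed:
            return "untested"
        return "confirmed" if self.observed == self.hypothesis else "discrepancy"


def discrepancies(log: list[Scenario]) -> list[Scenario]:
    """Return the scenarios where the old system contradicted the hypothesis."""
    return [s for s in log if s.status == "discrepancy"]


# The descriptions are taken from the text; the hypotheses and
# observations below are made up purely for illustration.
log = [
    Scenario(
        name="three_trucks_one_pallet",
        description="you have three trucks, but only one pallet",
        hypothesis="one truck used",
    ),
    Scenario(
        name="no_truck_to_location",
        description="a product needs to go to a location, but no truck is going there",
        hypothesis="order rejected",
    ),
]

# During a session with the power user, record what actually happened.
log[0].observed = "one truck used"
log[1].observed = "order silently dropped"

# Every discrepancy is a candidate rule (or bug) to document thoroughly.
for s in discrepancies(log):
    print(f"{s.name}: expected {s.hypothesis!r}, saw {s.observed!r}")
```

Keeping the bad-data scenarios in the same log as the valid ones means each surprising result ends up written down next to the assumption it broke, which is where the rules come from.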

Remember how the unhelpful developer said there were no bugs? It turns out that she was wrong. We uncovered a great many errors in this process, and we learned a lot about the algorithm. Using what we learned, we could answer several questions I had, and I could finally write a reference implementation that a lot of people could agree was somewhat working the way they expected. It still had bugs and issues, but by now the power user was much more comfortable letting me do preposterous things to the system. In total, I uncovered almost 800 rules using reverse engineering.

We also managed to speed the algorithm up substantially. The original NATURAL code had a runtime measured in half-hours. The slowest operation of the day happened in the morning, when a new set of orders would drop in. It took hours. With our improved algorithm, we got the time down to minutes in the worst case, and less than one second in the best case.

Initially this made our users think something had silently gone wrong, and they would complain to us. When you make an algorithm so much faster that people think it’s not working, you know you’ve made an improvement.

The second thing we drastically improved was correctness. We identified many subtle problems when we subjected the code to my tests, and once those were fixed, the code produced fewer pallets. Fewer pallets means fewer trucks, and fewer trucks means major savings for the company.

In the end people were very happy with it, and did not want to go back to the old way of doing things. For software developers, that is the highest possible praise. We are always prepared to hear users complain about how much better the old system was, and how we failed to take some obscure user or use case into account.