
druware · 6 years ago


This Shouldn't Be a Problem

... often turns out to be. Write code for long enough and you will run into a development pattern that leads to the dreaded unintended consequences, and to many hours lost tracking down the how and why. Unfortunately, this often leads to the other programmer problem: the belief that complicated problems must have complicated solutions. In fact, the problem was simple, the lost hours only made it feel complicated, and the solution is likewise far simpler than it seemed.

I trapped myself in one of these this week, and I figure I will share in the hopes that someone else can learn from my series of mistakes that led me to a frustrating weekend and a couple of missed bike rides, because I got mired in the hunt for the problem.

There is a program that I maintain for customers that is essentially the heart of their businesses. When it hiccups, they get frustrated, and rightly so. Making changes in that code is usually not a big deal, as the codebase really is not overly complex on the surface. However, there are some things under the surface that are a bit more hairy. Many years ago, it moved from a model where most of the code existed within the user interface to a more modern model-view-controller design ( this is code that has been evolving since the early '90s and has its roots in 16-bit Windows ).

The business object layer is where most of the brains are, and the user experience is largely a thin veneer over it. I am the core maintainer of the business object layer, and because of my history with it, I am also the one who knows where most of the skeletons are within that code. So when things go wrong, they escalate back to me when others cannot find the problem reasonably quickly.

This was one of those problems.

The Problem

The original support ticket on this issue is now months old. It started like so many do: "The program randomly encounters an error and stops running. The error message was that a list index is out of bounds". Unfortunately, this information is not enough to reproduce the bug, so it is difficult to trace. Slowly, over the ensuing weeks, a few more reports come in. There is definitely a problem, but not enough of a pattern to find it. So the hunt begins. What changed in the last release prior to the first reports? The changelog doesn't show anything that should be of significance. Almost everything is an enhancement. Nothing that should touch lists. Go back a few revisions. Still nothing.

Time passes. More error reports come in, but still nothing reliably reproducible.

Each time an error report comes in, spend a couple of hours digging at it, but still nothing.

Another report comes in, and this one has been escalated. Now it's MY problem. Spend a morning digging, literally randomly clicking between the couple of UX elements that are the most likely candidates: two tab pages with list views on them. An hour of clicking produces a few errors, but the debugger drops into Windows runtime assembler, not user space code. Cookies, M&Ms and Diet Coke acquired, office door closed, music cranked up, phone on Do Not Disturb. It is on. This bug is going down. More clicking and finally a reproducible pattern is identified. Select an item in a list view on tab page A. Switch to tab page B. Select an item in the list on that page. Switch back to tab page A and boom, the error occurs and dumps into Windows runtime land, with no reference in the stack to user space code.

With a pattern in hand, now it is time to figure out the reasons. Back to the changelogs to figure out when a change occurred that impacted these user interface elements. 6 months prior, a change was made to improve performance of several list views, including one of these. However, that change had been in deployment for 3 months before the first error report came in. It can't be the problem, can it?

More digging and refining the test case that triggers the failure. Hours pass. Frustration sets in. Resort to calling a colleague in to just listen while I talk through it. Somewhere in the middle of that conversation, the answer hits like a lightning bolt. It is a problem with the 6-month-old performance change, but it was only exposed by a tangential change made months later, one that should have had no effect on the lists, at least on the surface.

Unintended, Unplanned for, and Unexpected Consequences

It turns out that the root of the problem was a premature optimization decision made several years prior. The relevant code comments point to the author ( me ), and a date of 2015. Evolution of code and responding to customer requests often creates these very situations.

The user interface element at the heart of this issue is a list of notes pertaining to the document currently open. As any developer who has worked with customer service type operations can attest, notes are the lifeblood of the operation. The constant customer requests mean that notes keep expanding and need to be displayed like Chicago voting: early, often and everywhere. Because of this, the same user interface element is heavily reused to display these notes wherever they are needed.

The performance change months ago was to switch the lists from a traditional listview/listitem model to a listview/virtual list model operating directly against the business object store representing the array of notes. Everything worked fine until a single, subtle change was made months later: one instance of the reusable notes user interface element had the option to apply user filters to the list enabled. Since all of these reused display elements shared a single instance of the business object as their data store, filters applied in one place applied across all of them. This was fine so long as there was never any potential for two of these instances to be visible at the same time, or, as it turns out, for switching directly between two of them.
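The sharing described above can be sketched in a few lines. This is only an illustration in Python, not the actual code: the real program is a Windows application, and `Note`, `NoteStore`, `NotesView` and their methods are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Note:
    author: str
    text: str


class NoteStore:
    """Hypothetical business object: one filterable array of notes."""

    def __init__(self, notes):
        self._all = list(notes)
        self._visible = list(notes)  # what every attached view displays

    def apply_filter(self, keep):
        # The filter lives in the store, so it changes what ALL views see.
        self._visible = [n for n in self._all if keep(n)]

    def count(self):
        return len(self._visible)

    def item(self, index):
        return self._visible[index]


class NotesView:
    """Hypothetical virtual list: holds no rows, asks the store on demand."""

    def __init__(self, store):
        self.store = store

    def row_count(self):
        return self.store.count()


store = NoteStore([Note("ann", "called"), Note("bob", "invoiced"), Note("ann", "refund")])
view_a = NotesView(store)  # notes list on tab page A
view_b = NotesView(store)  # the same element reused on tab page B

# A user filter enabled on view B silently changes view A as well.
view_b.store.apply_filter(lambda n: n.author == "bob")
print(view_a.row_count(), view_b.row_count())
```

Both views report a single row here, even though nobody touched view A.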

Results

As I am sure the more experienced developers have already deduced, when one list was being swapped out for the other, there was a brief period where it was possible (though not always true) that the state of the data store was inconsistent with the expected content of the displayable lists. The result: seemingly random failures to read a memory address that was no longer valid.
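For anyone who wants to see the failure in miniature, here is a rough Python sketch of the tab A, tab B, tab A sequence. All names ( `SharedNotes`, `VirtualListView`, `repaint_selected` ) are made up for illustration, and where the real program died on an invalid memory read inside the Windows runtime, Python merely raises an `IndexError`.

```python
class SharedNotes:
    """Hypothetical shared store: the filter state lives here."""

    def __init__(self, notes):
        self._all = list(notes)
        self.visible = list(notes)

    def apply_filter(self, keep):
        self.visible = [n for n in self._all if keep(n)]


class VirtualListView:
    """Hypothetical virtual list: remembers a selected row index only."""

    def __init__(self, store):
        self.store = store
        self.selected = None

    def select(self, index):
        self.selected = index

    def repaint_selected(self):
        # Looks the row up in the store at paint time, trusting the old index.
        return self.store.visible[self.selected]


store = SharedNotes(["call back", "invoice sent", "refund issued"])
tab_a = VirtualListView(store)
tab_b = VirtualListView(store)

tab_a.select(2)                                # select an item on tab page A
store.apply_filter(lambda n: "invoice" in n)   # tab page B applies its filter
tab_b.select(0)                                # select an item on tab page B

try:
    tab_a.repaint_selected()                   # switch back to tab page A
    outcome = "ok"
except IndexError:
    outcome = "list index out of bounds"
print(outcome)
```

Tab A still believes row 2 exists, but the shared store now holds a single visible row, so the lookup lands past the end of the list.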

How Did We Get Here?

Honestly, it was a failure to communicate the risks. I wrote the code initially many moons ago. I authored the performance improvements, and then I approved the recent change in passing, without taking the time to fully consider the ramifications. I am reasonably certain I would have caught the problem given the chance to actually think about it, but in the press of a customer request, while I was focused on another pressing task, I did not take the time to do so. To make matters worse, the situation was one that the unit tests, as they existed at that point, would never have trapped.

The Solution, and Not the Solution, and then the Solution

All of this leads to the other developer problem, a disconnect in how we think. In typical fashion, the immediate conclusion is that a hard-to-find problem does not have an easy solution. So a solution is formulated, planned, implemented, and enters testing. It works, but it is expensive in terms of resources. It creates separate copies of the data store per reused list view. It is a hammer, when really all that is needed is a gentle nudge.

Sometime after testing starts, the developer, who has at this point left to do something else because decompression is needed, has the "oh shit" moment. Yeah, brute force works, but it is fundamentally flawed, both in the wasted resources and in the user experience. Worse, the business object layer already supports what is needed.

Instead of creating copies of the underlying data store, all that is needed is to maintain multiple filtered lists that reference the allocated objects in the underlying data store. It is a 5 minute change involving about 3 lines of code ( and 30 lines of documentation to explain what is happening ). Dump it into the testing rig, and add a new test case to cover this particular scenario.
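As a minimal sketch of that idea, assuming nothing about the real business object layer: each view keeps its own filtered list of references into the one shared store, so a filter on one tab never changes what another tab sees, and no note is ever copied. The names `NoteStore`, `FilteredView` and `rows` are hypothetical, and Python stands in for the real language.

```python
class Note:
    def __init__(self, author, text):
        self.author = author
        self.text = text


class NoteStore:
    """Hypothetical shared store: owns the note objects, holds no filter state."""

    def __init__(self, notes):
        self.notes = list(notes)


class FilteredView:
    """Hypothetical per-view list of references into the shared store."""

    def __init__(self, store):
        self.store = store
        self.rows = list(store.notes)  # references to the same objects, not copies

    def apply_filter(self, keep):
        # Only this view's list changes; the store and other views are untouched.
        self.rows = [n for n in self.store.notes if keep(n)]


store = NoteStore([Note("ann", "called"), Note("bob", "invoiced"), Note("ann", "refund")])
tab_a = FilteredView(store)
tab_b = FilteredView(store)

tab_b.apply_filter(lambda n: n.author == "bob")
print(len(tab_a.rows), len(tab_b.rows))

# Still one set of objects: an edit made through tab B shows up on tab A.
tab_b.rows[0].text = "invoice paid"
print(tab_a.rows[1].text)
```

The last two lines are the point of the design: the views disagree about which rows to show, but they still agree about what each note says, which separate copies of the data store would not give you for free.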

Side Note

Unfortunately, there is no test case that can test for developerDidAStupidThing(); so in the interim, these types of things will keep happening. This is one of those problems that will crop up in customer-driven solutions with short turnaround times. No amount of automation will completely solve the problem so long as users are driving the evolution and developers are operating in a respond-to-the-customer model.

#code developerstories
