At some point in your professional programming journey, your software will have defects.
And I say “professional” because, although your academic projects have defects, they are no match for the challenging defects that you will create and face in real-world applications.
Although everything I am going to talk about in this post will seem obvious, I’ve found that it is not obvious to 9 out of 10 professional programmers, even those with many years of experience under their belts.
The Scientific Method Distilled
The scientific method consists of a few clear and concise steps. These steps are used by scientists to investigate phenomena. Today I was thinking about a recent issue we had and realized that the scientific method is a perfect fit for troubleshooting.
The steps are:
- Make observations
- Think of interesting questions
- Formulate a hypothesis
- Develop testable predictions
- Gather data to test predictions
- Repeat the previous two steps to refine, alter, expand or reject the hypothesis
- Test predictions
Whenever something is wrong with software, an observation is made and reported. This usually comes in the form of a user or tester reporting a problem.
Making the observation is a critical step in the process.
Gather as much information about the problem as you can.
This will help you, the developer, reproduce the problem, and a problem that cannot be reproduced cannot be fixed.
The result of this step should be the exact steps and conditions to reproduce this problem.
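One way to hold yourself to "exact steps and conditions" is to write them down in a fixed shape and check that nothing is missing. The sketch below is only an illustration: the `ReproReport` class, its field names and the sample data are all made up for this post, not taken from any real tracker.

```python
from dataclasses import dataclass, field


@dataclass
class ReproReport:
    """Hypothetical record of everything needed to reproduce a defect."""
    summary: str
    environment: dict = field(default_factory=dict)  # e.g. browser, app version
    steps: list = field(default_factory=list)        # exact actions, in order
    expected: str = ""                               # what should have happened
    actual: str = ""                                 # what happened instead

    def missing_fields(self):
        """Return the names of the fields that are still empty."""
        required = ("environment", "steps", "expected", "actual")
        return [name for name in required if not getattr(self, name)]

    def is_actionable(self):
        """A report is actionable only when nothing is missing."""
        return not self.missing_fields()


# Made-up example data, for illustration only.
report = ReproReport(
    summary="Saving an order fails",
    environment={"browser": "Firefox", "app_version": "2.4"},
    steps=["Log in as a regular user", "Open an existing order", "Click Save"],
    expected="Order is saved",
)
print(report.missing_fields())  # ['actual']
print(report.is_actionable())   # False
```

Here the report is not ready yet, because nobody wrote down what actually happened. Until that gap is filled, you are still in the observation step.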
Asking the right questions
With the exact steps gathered, you can move on to the next step and take the time to think about the problem.
Since you know the system intimately, you should be able to come up with some possible causes.
At this point, you can start analyzing some of the possible causes and even eliminate some ideas based on facts. Keep in mind I said facts. Consider every assumption bad until proven correct.
Some things could be purely coincidental. The fact that the problem occurs every day at 1 PM could have little to do with time in the code and more to do with a janitor unplugging the server at that time to connect his vacuum cleaner.
I have found that in most cases a new defect is the result of a recent change. The change can be in the code or in the system outside the code. Even so, you cannot assume that a particular change caused it. There may be other changes that you either forgot or are unaware of.
So one of your first questions should be “What changed recently?” You can then test that hypothesis later in the process.
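Answering that question is easier if you keep a record of changes, code and system alike. As a minimal sketch, assuming you have such a record as a list of `(timestamp, kind, description)` tuples, you could pull out everything that changed shortly before the defect was first seen. The `recent_changes` helper, the seven-day window and the sample entries are all invented for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(changes, first_seen, window=timedelta(days=7)):
    """Return changes made within `window` before `first_seen`, newest first."""
    start = first_seen - window
    candidates = [c for c in changes if start <= c[0] <= first_seen]
    return sorted(candidates, key=lambda c: c[0], reverse=True)


# Made-up change log: (when, kind, description).
changes = [
    (datetime(2024, 3, 1, 9, 0), "code", "deploy app v2.3"),
    (datetime(2024, 3, 4, 22, 0), "system", "OS security patches"),
    (datetime(2024, 2, 1, 10, 0), "database", "add index on orders"),
    (datetime(2024, 3, 6, 8, 0), "code", "deploy app v2.4"),
]
first_seen = datetime(2024, 3, 5, 13, 0)

suspects = recent_changes(changes, first_seen)
for when, kind, description in suspects:
    print(when.isoformat(), kind, description)
```

The output is a list of suspects, not a verdict. Each entry is still only an assumption until you test it, and the list is only as complete as the record behind it.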
Formulating a hypothesis
As you formulate theories on what could be possible causes, make sure that you consider the pre-conditions required.
Create plans for testing your theories.
You may be wondering why I say create plans, but keep in mind that I’m talking about those tough defects, not any old trivial defects.
For example, let’s say that your defect is in a web application and the server recently received a series of system updates as well as a new version of the application.
It may not be possible to revert the server to a previous state, application-wise or system-wise, on a dime. You have to plan this out with the system administrators.
Testing the hypothesis
Test your theories. I cannot emphasize this enough. I have seen this too many times: a developer works out a likely cause on paper, assumes it is correct and never tests it.
Instead, he implements a fix and re-deploys only to discover, some hours later or even worse, several days later, that the problem persists.
If your theories fail your test, at least you can leverage the process of elimination and try out other theories.
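The process of elimination can be made explicit. In the sketch below, each theory is paired with a check that returns `True` only if the evidence is consistent with it. The `eliminate` function, the theory names and the observations are hypothetical. In practice each observation would come from a real experiment, such as running the previous application version in a test environment.

```python
def eliminate(hypotheses):
    """Run each check and split the theories into surviving and rejected."""
    surviving, rejected = [], []
    for name, is_consistent in hypotheses:
        if is_consistent():
            surviving.append(name)
        else:
            rejected.append(name)
    return surviving, rejected


# Made-up experiment results, for illustration only.
observations = {
    "persists_on_previous_app_version": True,   # old app, problem still there
    "persists_without_system_updates": False,   # updates removed, problem gone
}

# A theory survives only if removing its suspected cause removes the problem.
hypotheses = [
    ("new application version",
     lambda: not observations["persists_on_previous_app_version"]),
    ("system updates",
     lambda: not observations["persists_without_system_updates"]),
]

surviving, rejected = eliminate(hypotheses)
print("still possible:", surviving)
print("ruled out:", rejected)
```

With these observations the new application version is ruled out and the system updates remain a suspect. A surviving theory is not proven, it simply has not been rejected yet.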
But don’t blindly go out trying theories. Slice the problem. Start broad and narrow the focus as you go.
For example, let’s say that you have a performance problem and you want to know if it’s a database change, a system change or a software change. Set up the conditions to rule out each system.
Once you figure out which system is the culprit, you can focus on that system and begin narrowing down further.
This is a divide-and-conquer approach, which is much better than the “piñata approach”.
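When the suspects are an ordered series of changes, divide and conquer becomes a binary search, the same idea behind `git bisect`. The sketch below assumes that every change before the culprit is good, that the culprit and everything after it are bad, and that the last change is known to be bad. The `first_bad_change` function and the `c1` to `c8` change names are made up for illustration.

```python
def first_bad_change(changes, is_bad):
    """Binary-search an ordered list of changes for the first bad one.

    Assumes all changes before the culprit are good, the culprit and all
    later changes are bad, and the last change is known to be bad.
    """
    if not changes:
        raise ValueError("no changes to search")
    low, high = 0, len(changes) - 1
    while low < high:
        mid = (low + high) // 2
        if is_bad(changes[mid]):
            high = mid          # culprit is at mid or earlier
        else:
            low = mid + 1       # culprit is after mid
    return changes[low]


# Made-up history of eight changes; pretend the defect arrived with "c6".
changes = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
tested = []


def is_bad(change):
    """Stand-in for deploying a change and checking whether the defect occurs."""
    tested.append(change)
    return changes.index(change) >= changes.index("c6")


culprit = first_bad_change(changes, is_bad)
print(culprit)       # c6
print(len(tested))   # 3
```

Three tests instead of up to eight may not sound like much, but when each test means a deployment planned with the system administrators, halving the search space at every step adds up quickly.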
Through all of this, be systematic. It’s hard to do for many of us. We want to run around in our Superman cape and yell “I solved it, here is the solution”.
Resist that urge. Test your theory. Discuss it with fellow developers. This is much better than announcing it to the entire team only to discover you are completely wrong.
By not being systematic, you are likely to look clueless and create all sorts of confusion for you and the team.
Good luck on your troubleshooting journey.