Random bugs are the nightmare of every programmer. They are problems in software that occur from time to time and whose root causes are unclear.
What is a software bug?
A software bug is a colloquial name for a failure whose root cause is a mistake made by a programmer in the program's code. When someone discovers a software failure, they may say “I found a bug”, describe it in a ticket (also called a bug) and pass it to the programmers to fix.
Strictly speaking, the programmer's mistake in the code is a fault. The fault makes the program wrong, and in some situations it causes a failure. In everyday use, people call both the failure and its root cause, the fault, a “bug”.
What is a random bug?
It is a bug (failure) that gives the impression of occurring at a random moment of program execution. It can occur frequently, rarely, or extremely rarely, e.g. once in a few weeks of test execution.
Sources of random bugs
To fix random bugs effectively, it is good to know the potential root causes of their randomness.
Unknown start state of the program
Depending on the programming language, memory may be initialized differently. In C/C++ it is possible to read a variable that the programmer never initialized, which yields an unpredictable value. I have seen this hundreds of times in C++: a program sometimes crashes because reading an uninitialized variable gives a value outside the expected range. Fortunately, this situation can be avoided with static code analyzers or with memory checkers like Valgrind. Nowadays, if you suffer from this kind of problem, please think about introducing analyzers into your coding process.
Cooperation with other systems
As I wrote at the beginning, random bugs give the impression of occurring at random moments. Very often it is only an impression: the issue is in fact deterministic and occurs when the ecosystem in which the program works is in some specific state that was ignored or not discovered during testing. Sometimes the solution is to observe the program's environment to detect the events that cause the bug. I remember once having to solve a problem with a randomly rebooting router (it rebooted itself a few times a day). After investigation it turned out that the router was under an ineffective attack in which the attackers (probably a botnet of already infected routers) tried to install malicious software on it. Fortunately, the system on the router was a customized piece of software and the attackers' software could not be installed: the installation attempt corrupted the system in a way that forced it back to default settings and a reboot. The bug turned out to be a security issue. Changing the password from “admin” to a strong one prevented the attacks and solved the problem.
Random value generators
When I first read about random generators as a root cause of random bugs, I thought it was a rare situation, but life proved otherwise. I was surprised how often random value generators cause problems. I remember that when I was working on an IPv6 stack on Linux, I fixed a number of random bugs caused by the generator of random delays for message retransmissions (the IPv6 standard requires such delays). I spent long hours trying to reproduce the issues, and my PC spent whole nights and weekends executing automatic tests; in the end, very simple corrections to the configuration of the random number generator solved them. Since then I check whether random generators may be involved in code that fails randomly, and I frequently save time by finding a direct path from a random value generator to the bug. When I started to work in the space industry the situation repeated itself: here too, some of the random bugs were caused by random generators.
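One habit that makes such bugs easier to reproduce is to give the generator an explicit seed and to log it with every test run. The sketch below only illustrates that idea; the function name `retransmission_delays` and the values of `base_ms` and `jitter_ms` are my assumptions for the example and are not taken from any IPv6 stack.

```python
import random


def retransmission_delays(seed, count=5, base_ms=1000, jitter_ms=100):
    """Return `count` retransmission delays with random jitter.

    The generator is built from an explicit seed instead of relying on the
    global, implicitly seeded one, so the same seed always gives the same
    sequence of delays.
    """
    rng = random.Random(seed)  # private generator, isolated from other code
    return [base_ms + rng.randint(-jitter_ms, jitter_ms) for _ in range(count)]


# In a test run, pick a fresh seed and log it together with the result.
seed = random.SystemRandom().randrange(2**32)
first_run = retransmission_delays(seed)
print(f"seed={seed} delays={first_run}")

# Replaying with the logged seed reproduces exactly the same delays.
replay = retransmission_delays(seed)
print("replay identical:", replay == first_run)
```

With the seed in the log, a failure seen once during a weekend of automatic tests can be replayed on the desk with the same sequence of random values.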
Race conditions in multi-threaded code are the king of random bugs. When two parallel threads use and modify the same resource, the results of the operations can be unpredictable. Program hangs, crashes, incorrect computation results, memory leaks: all of these problems can occur when there is a bug in multi-threaded code.
The occurrence of such a bug is surprising, and the bug is usually not easy to reproduce. That is because programmers usually think deeply about their code, predict possible race conditions and prevent them, but some erroneous situations are missed during analysis because of their subtlety. To reproduce this kind of bug, a good idea is to run the program on a heavily loaded CPU (e.g. with a stress program) or, if possible, to force some latency in the thread suspected of causing the problem.
It is worth remembering that we need to ensure that the bug was fixed, and not only masked by reducing how often it occurs. The programmer must deeply understand the condition that causes the problem, and only then may a correction be applied.
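Forcing latency in the suspected thread can be sketched as follows. This is a minimal illustration, not code from any real project: the `Counter` class, the `run` helper and the 1 ms delay are all hypothetical. The delay is placed between reading and writing the shared value, which is exactly where the race lives, so lost updates become very likely instead of very rare.

```python
import threading
import time


class Counter:
    """Shared counter; `use_lock` switches between the buggy and fixed variant."""

    def __init__(self, use_lock, delay_s):
        self.value = 0
        self._use_lock = use_lock
        self._delay_s = delay_s
        self._lock = threading.Lock()

    def _read_modify_write(self):
        current = self.value         # read the shared resource
        time.sleep(self._delay_s)    # forced latency widens the race window
        self.value = current + 1     # write back a possibly stale result

    def increment(self):
        if self._use_lock:
            with self._lock:         # only one thread at a time may update
                self._read_modify_write()
        else:
            self._read_modify_write()


def run(use_lock, threads=2, increments=20, delay_s=0.001):
    """Run `threads` workers and return the final counter value."""
    counter = Counter(use_lock, delay_s)

    def worker():
        for _ in range(increments):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return counter.value


expected = 2 * 20
print("expected:", expected)
print("without lock:", run(use_lock=False))  # usually far below expected
print("with lock:", run(use_lock=True))      # always equal to expected
```

Note that the locked variant gives the right total with the delay still in place. If a fix only works after the injected latency is removed, the race has probably been made rarer, not removed.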
Malicious random bugs are hard to reproduce and debug. A debug session or additional logs affect program execution and may make the bug impossible to reproduce. A bug that cannot be observed with the debug tools available to the programmer is called a “heisenbug”, to emphasize the similarity of this situation to Heisenberg’s uncertainty principle.
How to know if a random bug is fixed?
I don’t believe that a random bug is fixed as long as its root cause is unknown. Personally, I always prepare a test that deterministically reproduces the random bug, then I analyze it to find the root cause, make a fix and run the test again to verify that the fix helps. A frequent situation is that a bug was reported on version A and, after some time, the failure cannot be reproduced on version B at the programmer's desk. Programmers tend to consider such a bug solved indirectly, but this can only be accepted when exactly the same test detects the problem on version A and does not detect it on version B.
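The rule about versions A and B can be written down as a small check. The two `average_delay_*` functions below are hypothetical stand-ins for two versions of a program, and the empty list plays the role of the triggering condition found during root-cause analysis; none of this comes from a real project.

```python
def average_delay_version_a(delays):
    """Hypothetical version A: fails when the list of delays is empty."""
    return sum(delays) / len(delays)


def average_delay_version_b(delays):
    """Hypothetical version B: handles the empty list explicitly."""
    return sum(delays) / len(delays) if delays else 0.0


def reproduces_bug(average_delay):
    """Deterministic test: True if the failure occurs in the given version."""
    try:
        average_delay([])  # the exact condition identified as the root cause
    except ZeroDivisionError:
        return True
    return False


def fix_is_confirmed(old_version, new_version):
    """Accept the fix only if the same test fails on old and passes on new."""
    return reproduces_bug(old_version) and not reproduces_bug(new_version)


print(fix_is_confirmed(average_delay_version_a, average_delay_version_b))  # True
print(fix_is_confirmed(average_delay_version_b, average_delay_version_b))  # False
```

The second call shows the trap: if the test does not fail on the old version either, a passing result on the new version proves nothing.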
Spacecraft software is threatened by soft errors. In short: space radiation may change the state of digital memory cells, which may have a big impact on the execution of programs on on-board computers. Soft errors occur at random moments; it is impossible to predict when and where particles from space will hit computer elements. Soft errors are similar to random bugs, but the difference is that they are not the effect of a mistake in the code. Both random bugs and soft errors are faults, with different root causes. If the hardware of a computer system is not resistant to soft errors, then the only solution is to implement a fault-tolerant system. A fault-tolerant system does not distinguish whether a fault comes from wrong program code or from space radiation.
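To get a feeling for how much a single changed memory cell can matter, here is a tiny simulation of one flipped bit. The variable `altitude_m` and the choice of bit 30 are made up for the example; real on-board data and real upsets will of course look different.

```python
def flip_bit(value, bit):
    """Return `value` with one bit inverted, as a single upset would leave it."""
    return value ^ (1 << bit)


altitude_m = 5                      # hypothetical value stored in memory
corrupted = flip_bit(altitude_m, 30)
print(altitude_m, "->", corrupted)  # 5 -> 1073741829

# Flipping the same bit again restores the original value.
print(flip_bit(corrupted, 30) == altitude_m)  # True
```

The code that later reads the value is correct, yet it works on wrong data, which is why such a fault looks so much like a random bug from the outside.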
- Debug It! Find, Repair and Prevent Bugs in Your Code. Pragmatic Bookshelf, 2009.