Random bugs are every programmer's nightmare. They are software problems that occur only from time to time and whose root causes are unclear.
What is a software bug?
A software bug is a colloquial name for a failure whose root cause is a mistake made by a programmer in the code of a program. When someone discovers a software failure, they may say “I found a bug”, describe it in a ticket (also called a bug), and pass it to the programmers to fix.
The programmer's mistake in the code is a fault. The fault makes the program incorrect, and in some situations it causes a failure. In everyday use, people call both the failure and its root cause, the fault, a “bug”.
What is a random bug?
It is a bug (failure) that gives the impression of occurring at a random moment of program execution. It can occur frequently, rarely, or very rarely, e.g. once per few weeks of test execution.
Sources of random bugs
To fix random bugs effectively, it helps to know the potential root causes of their randomness.
Unknown start state of the program
Depending on the programming language, memory may be initialized differently. In C/C++ it is possible to read a variable that the programmer never initialized, which yields an unpredictable value. I have seen this hundreds of times in C++: a program sometimes crashes because reading an uninitialized variable gives a value outside the expected range. Fortunately, this can be avoided with static code analyzers or with memory checkers like Valgrind. If you suffer from this kind of problem today, consider introducing such analyzers into your coding process.
Cooperation with other systems
As I wrote at the beginning, random bugs give the impression that they occur at a random moment. Very often it is only an impression: in fact, the issue occurs deterministically when the ecosystem in which the program works is in some specific state that was ignored or not discovered during testing. Sometimes the solution is to observe the program's environment to detect the events that trigger the bug. I remember once having to solve a problem with a randomly rebooting router (it rebooted itself a few times a day). After investigation, it turned out that the router was under an unsuccessful attack in which the attackers (probably a botnet of already infected routers) tried to install malicious software on it. Fortunately, the system on the router was a customized piece of software, so the attackers’ software could not be installed: the installation attempt damaged the system in a way that forced it back to default settings and a reboot. The bug turned out to be a security issue. Changing the password from “admin” to a strong one prevented the attacks and solved the problem.
Random value generators
When I first read about random generators as a root cause of random bugs, I thought it was a rare situation, but life proved otherwise. I was surprised by how often random value generators cause problems. I remember that when I was working on an IPv6 stack on Linux, I fixed a number of random bugs that were caused by the generator of random delays for message retransmissions (the IPv6 standard required this). I spent long hours trying to reproduce the issues, and my PC spent whole nights and weekends executing automatic tests; in the end, very simple corrections to the configuration of the random number generator solved them. Since then I check whether random generators may be involved in code that fails randomly, and I frequently save time by finding a direct path from a random value generator to the bug. When I started to work in the space industry, the situation repeated itself: here too, some of the random bugs were caused by random generators.
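One way to shorten such a hunt can be sketched as follows, assuming the code under test can receive its random generator as a parameter: create the generator from a known seed and log that seed, so a failing run can be replayed with exactly the same "random" delays. The names `retransmission_delay`, `simulate_delays`, `base_s` and `jitter_s` are hypothetical and are not taken from any real IPv6 stack.

```python
import random


def retransmission_delay(rng, base_s=1.0, jitter_s=0.1):
    """Return the base delay plus a random jitter in [-jitter_s, +jitter_s]."""
    return base_s + rng.uniform(-jitter_s, jitter_s)


def simulate_delays(seed, count=5):
    """Produce a sequence of delays from a generator created with a known seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [retransmission_delay(rng) for _ in range(count)]


# Log the seed of every test run; when a run fails, replay it with the same seed.
seed = 12345
first_run = simulate_delays(seed)
replayed_run = simulate_delays(seed)
print(first_run == replayed_run)  # same seed, same sequence of delays
```

With the seed recorded in the test log, a failure that depends on one particular sequence of delays no longer needs nights of repeated test runs to show up again.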
Race conditions
Race conditions are the king of random bugs. When two parallel threads use the same resource and modify it, the results of the operations can be unpredictable. Program hangs, crashes, incorrect results of computation, memory leaks: all of these problems can occur when there is a bug in running multi-threaded code.
The occurrence of such a bug is surprising, and the bug is usually not easy to reproduce. That is because programmers usually think deeply about their code, predict possible race conditions and prevent them, but some erroneous situations are missed during analysis because of their subtlety. To reproduce this kind of bug, it is a good idea to run the program on a heavily loaded CPU (e.g. with the stress program) or, if possible, to force some latency in the thread suspected of causing the problem.
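The idea of forcing latency can be illustrated with a small Python sketch. It is only a toy model under my own assumptions: the `Counter` class and the `run` helper are hypothetical, and the `time.sleep` call stands in for whatever delay you can inject into the suspected thread. The delay sits between the read and the write of a shared value, which widens the race window so that lost updates appear on almost every run instead of once per few weeks.

```python
import threading
import time


class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def unsafe_increment(self, delay_s):
        current = self.value       # read
        time.sleep(delay_s)        # injected latency widens the race window
        self.value = current + 1   # write back a possibly stale value

    def safe_increment(self, delay_s):
        with self.lock:            # read-modify-write is one critical section
            current = self.value
            time.sleep(delay_s)
            self.value = current + 1


def run(increment_name, workers=4, delay_s=0.05):
    """Run `workers` threads that each increment a shared counter once."""
    counter = Counter()
    increment = getattr(counter, increment_name)
    threads = [threading.Thread(target=increment, args=(delay_s,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print("unsafe:", run("unsafe_increment"))  # usually less than 4: updates are lost
print("safe:  ", run("safe_increment"))    # 4: the lock serializes the updates
```

Once the failure shows up reliably with the injected delay, the same delay can stay in the test to check that the fix really removes the race.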
It is worth remembering that we need to make sure the bug was really fixed, and not only masked by reducing how often it occurs. The programmer must deeply understand the condition that causes the problem, and only then should a correction be applied.
Nasty random bugs are hard to reproduce and debug. A debug session or additional logging affects program execution and may make the bug impossible to reproduce. A bug that cannot be observed with the debugging tools available to the programmer is called a “heisenbug”, to emphasize the similarity of this situation to Heisenberg’s uncertainty principle.
How to know if a random bug is fixed?
I don’t believe that a random bug is fixed as long as its root cause is unknown. Personally, I always prepare a test that deterministically reproduces the random bug, then I analyze it to find the root cause, make a fix, and run the test again to verify that the fix helps. A frequent situation is that a bug was reported on version A and, after some time, the failure cannot be reproduced on version B at the programmer’s desk. Programmers tend to consider such a bug indirectly solved, but this can only be accepted when exactly the same test detects the problem on version A and does not detect it on version B.
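The acceptance rule for versions A and B can be written down as a short check. This is a minimal sketch with hypothetical names: `version_a` and `version_b` are tiny stand-ins for two builds of a program, and `fix_is_verified` is not part of any real test framework.

```python
def fix_is_verified(reproducing_test, version_a, version_b):
    """Accept a fix only if the same test fails on A and passes on B.

    `reproducing_test` takes a program version and returns True when the
    program behaves correctly, False when the bug shows up.
    """
    fails_on_a = not reproducing_test(version_a)
    passes_on_b = reproducing_test(version_b)
    return fails_on_a and passes_on_b


# Hypothetical stand-ins for two builds of the program under test.
def version_a(values):
    return sum(values) / len(values[:-1])   # bug: wrong divisor


def version_b(values):
    return sum(values) / len(values)        # fixed


def average_test(program):
    return program([2, 4, 6]) == 4          # detects the bug


def weak_test(program):
    return isinstance(program([2, 4, 6]), float)  # passes on both versions


print(fix_is_verified(average_test, version_a, version_b))  # True
print(fix_is_verified(weak_test, version_a, version_b))     # False
```

The second call shows why "it does not fail on version B" is not enough: a test that never detected the problem on version A says nothing about whether version B fixed it.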
Spacecraft software is threatened by soft errors. In short: space radiation may change the state of digital memory cells, which may have a big impact on the execution of programs on on-board computers. Soft errors occur at random moments; it is impossible to predict when and where space particles will hit computer elements. Soft errors are similar to random bugs, but the difference is that they are not the effect of a mistake in the code. Both random bugs and soft errors are faults, but with different root causes. If the hardware of a computer system is not resistant to soft errors, then the only solution is to implement a fault-tolerant system. A fault-tolerant system does not distinguish whether a fault comes from wrong program code or from space radiation.
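The text above does not name a specific fault-tolerance technique. One classic option, shown here only as an illustration, is to keep three redundant copies of a value and take a bitwise majority vote when reading it. The functions `majority_vote` and `flip_bit` are hypothetical helpers written for this sketch, and the bit flip simulates a soft error.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three redundant copies of the same word."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word, bit):
    """Simulate a soft error: invert one bit of a stored word."""
    return word ^ (1 << bit)


stored = 0b10110010
copies = [stored, stored, stored]
copies[1] = flip_bit(copies[1], 5)       # a particle hits the second copy

print(copies[1] == stored)               # False: the copy is corrupted
print(majority_vote(*copies) == stored)  # True: the vote masks the fault
```

The vote masks a single corrupted copy regardless of what corrupted it, which matches the point that a fault-tolerant system does not care about the origin of the fault. It does not help when the same bit is damaged in two copies.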
- Debug It! Find, Repair and Prevent Bugs in Your Code. Pragmatic Bookshelf, 2009.