Troubleshooting software running on satellites
When a satellite is in space, the possibilities for detecting and fixing problems in its on-board software are very limited. Narrow communication windows, low transfer bandwidth and the limited set of software tools available on board make troubleshooting very hard.
We need to investigate software failures
There are situations when software fails. The faults that cause failures may come from mistakes in hardware design or manufacturing, from errors made during software implementation, or from a harsh environment. The consequences of a software failure may be dramatic and expensive: loss of valuable data, loss of the time in which the software provides services to users, loss of control over the device, and so on.
In practice, we cannot prevent all failures, no matter how much time people spend on design, implementation and testing. From time to time we hear about spectacular incidents or disasters caused by software failure. Even big organizations that spend a lot of money on software development fail, e.g. Boeing 737 Max and Ariane 5. We need to assume that unexpected problems will happen while any software system is running.
Not every failure has dramatic consequences, but every such incident shows that the software has problems. A failure means that something is wrong with the software and, in consequence, that the whole system is not reliable: we cannot be sure whether it is threatened by a disaster or not. We need to understand each failure, because we need to know whether errors in the software design may lead to expensive consequences, and we need to prepare any necessary corrections for further versions of the software.
As I wrote above, we need to investigate each software failure. This means we need to detect the failure, register it, collect all useful information about the state of the system when the problem occurred, and then start an investigation. Investigating a failure is quite a complex task which may involve many people, systems and tools. It is better to control such a complex process within a well-defined framework than to take chaotic actions which may cause more problems.
I define a ‘Troubleshooting System’ as the resources (systems, tools, people) and the process that steer the investigation of problems occurring in the software. The main goal of the system is to execute an investigation effectively and safely, so that it discovers the root cause of the problem and its consequences.
Troubleshooting system for a satellite
To investigate software problems on a working satellite in orbit, the satellite must be equipped with tools which are able to signal problems to the operator on Earth and to collect all necessary information about emergency situations.
Who is involved in the troubleshooting process?
We can generalize and present a satellite system as follows: a satellite is controlled by its operator from Earth and delivers services to users. By the operator, I mean the part of the company which controls the space assets. Both users and the operator may spot an issue and report it to the company’s troubleshooting system.
Users don’t have direct access to company internals; they are in contact with the user’s support. The user’s support writes down users’ complaints and puts them into the company CRM system, which may start the investigation process.
In the context of satellite software troubleshooting, the user’s support helps users fix their problems with using the system and passes to the internal company troubleshooting system only those issues which break the system specification. In other words, the support works as a filter which distinguishes between a malfunction of the system and a user’s misunderstanding.
The operator is someone who controls the systems which maintain the satellite. The operator can execute operations which change the state of the spacecraft and can communicate with the satellite to collect maintenance information not available to the users. The operator is aware of the satellites’ positions and statuses. Because the operator continuously monitors the satellites’ statuses, he will usually be the first person who spots an issue and reports it.
The DevTeam is the team of people who design and implement everything on the satellites. They are the ‘producers’ of the satellites. They don’t have control over the satellites once they are deployed, but they know how every detail works on board. They fix bugs and prepare software updates for the satellites. They are also the experts who can answer questions about the satellite details.
The satellite is a communication device. Malicious persons (attackers) may want to get access to the spacecraft and its network to disturb traffic or collect secret information. If a malicious infiltration is discovered, a security investigation must start. The goal of the Security Specialists is to collect all possible information about the attack and to find out how deeply the network was infiltrated, which entry point was used, etc. When the investigation is finished, the report is passed to the DevTeam to prepare a solution which prevents the same attack in the future. Security Specialists do not develop the system; they only observe it.
The DevTeam is expensive and busy. The engineers work on the next version of the satellite and they don’t want to waste time on ineffective communication. When the DevTeam is informed about a new issue, they will require a lot of information about the satellite state, the history of operations on it, the history of communication, etc. All of this information must be collected from the User’s Support and from the Operator. Here we can introduce the TroubleShooter: a person who ensures that all necessary and available information is collected, confirms that the issue is a real problem with the system, and communicates it to the DevTeam or the Security Specialists, depending on the issue type. The TroubleShooter ensures that all actions required to fix an issue are executed.
Above I presented the interactions between the stakeholders of the troubleshooting process. The main goal of all these actions is to collect the information required by the DevTeam to reproduce a problem.
The DevTeam needs information about the situation in which the issue occurred: a set of documents, log files, configurations, etc. Let’s name this set “The DebugPack”. The DebugPack consists of information from many sources, and we can divide it into separate parts.
The Board DebugPack is a set of files which describe the status of a satellite’s computer board at the moment of interest. It is collected on the spacecraft’s board, so it contains the most accurate information about its state. To predict what information is required, we need to know what problems can occur on the board:
- board software crash
There are situations when the software fails: processes or even the whole operating system stop because of fatal errors. Commonly, the board’s software works under the supervision of a watchdog, which restarts failed processes. The watchdog is very important for the troubleshooting system because it detects problems and may collect information about the board’s stability.
- performance problems (CPU load, memory usage etc.)
There are situations when the software doesn’t fail but works inefficiently. Heavy CPU load, high RAM consumption or running out of persistent storage may have an impact on the quality of services delivered by the satellite. Moreover, performance problems may lead to failures, and frequently they are the root cause of fatal errors.
- radio link problems
When the radio link doesn’t work well, we lose contact with the satellite or the contact is limited. Any problem with the radio can have an impact on the quality of services delivered by the satellite. Even when the rest of the system works correctly, problems with the radio may mask this, and it may seem that the system suffers from failures.
- problems with persistent data storage (loss of files, databases etc.)
System and process configuration is saved on persistent data storage such as flash memory. Compiled code, the boot loader and data processed by the system are also saved in persistent memory. A problem with persistent storage leads to satellite failure and may be a root cause of loss of control over the spacecraft.
- problems with power supply
Without electric power, the satellite’s board won’t work. The power budget is a crucial part of the satellite’s design. The batteries are expensive, so they are carefully selected for the mission, and any disturbance in their operation has an impact on the quality of delivered services. Improper use of the batteries may shorten their lifetime and, in consequence, the lifetime of the whole satellite.
- problems with I/O hardware connected to the mainboard
A satellite is quite a complicated device. To carry out its mission, a spacecraft is equipped with a set of devices which allow it to travel in space. If some of these devices or subsystems fail, then the mission will fail. It is important to know the status of the subsystems.
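The watchdog mentioned under "board software crash" above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Watchdog` class, its heartbeat mechanism and the `restart` callback are hypothetical names, not part of any real flight software. It shows both roles described in the text: restarting failed services and keeping restart counts as a rough measure of board stability.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Watchdog:
    """Toy watchdog: restarts services with a stale heartbeat and counts restarts."""
    timeout_s: float
    last_beat: dict = field(default_factory=dict)       # service name -> last heartbeat time
    restart_counts: dict = field(default_factory=dict)  # service name -> number of restarts

    def heartbeat(self, service, now=None):
        """Called by a service to signal that it is still alive."""
        self.last_beat[service] = time.monotonic() if now is None else now

    def check(self, restart, now=None):
        """Restart every service whose heartbeat is older than timeout_s.

        `restart` is a caller-supplied callback (hypothetical); on a real board
        it would re-launch the process. Returns the names of restarted services.
        """
        now = time.monotonic() if now is None else now
        restarted = []
        for service, beat in self.last_beat.items():
            if now - beat > self.timeout_s:
                restart(service)
                self.restart_counts[service] = self.restart_counts.get(service, 0) + 1
                self.last_beat[service] = now  # fresh deadline after the restart
                restarted.append(service)
        return restarted
```

The `restart_counts` dictionary is the kind of stability information that could later be copied into a Board DebugPack.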
The spacecraft’s software may collect a set of statistics which describe its operation. This kind of information may be very useful when reproducing the issue on Earth. For example, the number of broken connections may draw the team’s attention to a problem with the radio link or network stack. So it is a good idea to include statistics in the Board DebugPack.
Apart from the information reported by a satellite, the operator has information collected on the ground. The operator should collect the history of communication with the spacecraft, its expected position in orbit, etc. From these, the operator can deduce the expected state of the satellite. All of this may be delivered to the DevTeam in the Mission DebugPack. The Mission DebugPack can be obtained even when communication with the satellite cannot be established.
Strict technical information from the Mission and Board DebugPacks is the most valuable for the DevTeam, but in the complicated process of investigating an issue, some soft information may help to solve the problem faster. Reports from users, the operator’s thoughts and ideas from other people involved in the project may help the DevTeam to better understand the situation of the space system.
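The three parts of the DebugPack described above can be pictured as a small data structure. The sketch below is a minimal illustration; all class and field names (`BoardDebugPack`, `MissionDebugPack`, `soft_info`, and so on) are assumptions made for this example. Note that the board part is optional, because the Mission DebugPack can exist even when the satellite cannot be reached.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BoardDebugPack:
    """Files collected on the satellite's board (hypothetical layout)."""
    pack_id: int
    files: dict = field(default_factory=dict)  # file name -> raw content


@dataclass
class MissionDebugPack:
    """Information the operator already holds on the ground."""
    communication_history: list = field(default_factory=list)
    expected_position: Optional[str] = None


@dataclass
class DebugPack:
    """Everything handed to the DevTeam for one issue."""
    issue_id: str
    mission: MissionDebugPack
    board: Optional[BoardDebugPack] = None         # None if the satellite is unreachable
    soft_info: list = field(default_factory=list)  # user reports, operator notes

    def available_parts(self):
        """Names of the parts that are actually present."""
        parts = ["mission"]
        if self.board is not None:
            parts.append("board")
        if self.soft_info:
            parts.append("soft_info")
        return parts


pack = DebugPack(issue_id="ISSUE-1", mission=MissionDebugPack(["pass 12: no reply"]))
pack.soft_info.append("User reports intermittent service")
print(pack.available_parts())  # ['mission', 'soft_info']
```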
The Board DebugPack service
As I explained above, the Board DebugPack is collected on a satellite’s board. We need to define how the DebugPack service should work both on the spacecraft and on the ground.
Creating a Board DebugPack
There are only two situations in which the Board DebugPack is created on a satellite:
- The software detects a system failure: internal monitoring on the board triggers the DebugPack creation process.
- The operator requests the creation of a DebugPack: the operator may suspect that some problem has occurred on a satellite (for example after getting information about problems from the User’s Support), even though the monitoring on the board did not detect it.
The Board DebugPack may contain a lot of information and may require a lot of space in the satellite’s persistent memory. Moreover, the radio link bandwidth may be narrow, so downloading the pack to the ground may take a long time and take resources away from other services. Here is an example list of files that may be included in the pack:
- List of processes running on the operating system
- Callstacks/core files of crashed processes
- Log files from services
- The operating system’s log
- Radio transmission dumps
- The satellite configuration
- Report about states of the satellite’s equipment
- Set of persistent data saved on the board (e.g. data transferred from the ground)
It makes no sense to collect all possible information for every detected problem. There should be a configuration file which describes what kind of information is required for a given case. For example, it may not make sense to collect radio transmission dumps when the ADCS (attitude determination and control system) fails.
The configuration is also useful when the Operator requests the creation of a DebugPack: he will be able to configure which information should be included in the pack.
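One possible shape for such a configuration is a mapping from problem kind to the list of items to collect. The sketch below assumes a JSON file; the problem kinds, item names and the `items_to_collect` helper are all invented for illustration and are not taken from a real system.

```python
import json

# Hypothetical configuration: problem kind -> items to put into the pack.
CONFIG_TEXT = """
{
  "default":       ["process_list", "os_log", "service_logs"],
  "process_crash": ["process_list", "core_files", "service_logs", "os_log"],
  "adcs_failure":  ["equipment_report", "service_logs", "configuration"],
  "radio_link":    ["radio_dumps", "os_log", "configuration"]
}
"""


def items_to_collect(config, problem_kind, operator_override=None):
    """Return the items to collect for a problem.

    An explicit list from the operator wins; otherwise the entry for the
    problem kind is used, falling back to the "default" entry.
    """
    if operator_override is not None:
        return list(operator_override)
    return list(config.get(problem_kind, config["default"]))


config = json.loads(CONFIG_TEXT)
print(items_to_collect(config, "adcs_failure"))
print(items_to_collect(config, "unknown_problem"))
print(items_to_collect(config, "adcs_failure", operator_override=["radio_dumps"]))
```

With this layout an ADCS failure does not pull in the large radio dumps, but the operator can still ask for them explicitly.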
How to get the Board DebugPack from a satellite?
When a Board DebugPack is created, the satellite needs to add it to a list of DebugPacks. Each pack has a unique id, and the list of ids should be broadcast over the radio. The operator can check the list of DebugPacks each time a satellite contacts him, and then download a DebugPack with a radio command which contains the desired DebugPack id.
After downloading a Board DebugPack, the operator sends a request to the satellite to remove it from persistent memory. The Board DebugPack id is used to identify the particular DebugPack.
When the creation of a DebugPack fails
Any piece of software can fail, including the one responsible for helping to fix software problems. The DebugPack creation process can fail for any of the reasons mentioned above in the list of possible problems. If this happens, the system can generate only a new problem id with an empty or almost empty DebugPack: just terse information that some problem occurred but the DebugPack creation failed. Such a solution signals to the operator that DebugPacks cannot be created, and the operator can tune the configuration to reduce the resources required to obtain a DebugPack.
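The list, download and remove cycle described above can be sketched as a small in-memory store. This is only an illustration: `DebugPackStore` and its methods are hypothetical, and a real board would keep packs in persistent memory and expose these operations as radio commands.

```python
from itertools import count


class DebugPackStore:
    """Toy on-board store of DebugPacks, addressed by unique id."""

    def __init__(self):
        self._next_id = count(1)  # ids are never reused, even after removal
        self._packs = {}          # pack id -> packed bytes

    def add(self, payload: bytes) -> int:
        """Register a newly created pack and return its id."""
        pack_id = next(self._next_id)
        self._packs[pack_id] = payload
        return pack_id

    def list_ids(self):
        """Ids to broadcast to the operator."""
        return sorted(self._packs)

    def download(self, pack_id: int) -> bytes:
        """Return the pack content; the pack stays on board."""
        return self._packs[pack_id]

    def remove(self, pack_id: int) -> None:
        """Free the space once the operator confirms the download."""
        self._packs.pop(pack_id, None)


store = DebugPackStore()
first = store.add(b"core dump ...")
second = store.add(b"os log ...")
data = store.download(first)
store.remove(first)
print(store.list_ids())  # [2]
```

Keeping removal as a separate command means an interrupted download does not lose the pack; the operator can simply retry during the next communication window.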
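The fallback described above, where a failed creation still yields an id and a terse note, might look like the sketch below. The dictionary layout, the status strings and the collector callables are assumptions made for illustration.

```python
def build_pack(pack_id, collectors):
    """Run every collector; any exception aborts the whole pack."""
    files = {name: collect() for name, collect in collectors.items()}
    return {"id": pack_id, "status": "ok", "files": files}


def build_pack_or_stub(pack_id, collectors):
    """Create a DebugPack, or an almost empty stub if creation fails."""
    try:
        return build_pack(pack_id, collectors)
    except Exception as exc:  # broad on purpose: any failure must still produce a stub
        reason = f"{type(exc).__name__}: {exc}"[:80]  # keep the note terse
        return {"id": pack_id, "status": "creation_failed", "reason": reason, "files": {}}


def failing_collector():
    # Simulates running out of persistent storage while dumping radio traffic.
    raise OSError("no space left")


stub = build_pack_or_stub(42, {"os_log": lambda: b"...", "radio_dumps": failing_collector})
print(stub["status"], "-", stub["reason"])  # creation_failed - OSError: no space left
```

The stub still appears in the list of pack ids, so the operator learns that something happened and that pack creation itself is in trouble.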
Requirements for software running on a satellite
Now that we understand what information is required, we can summarize the software running on a satellite in the context of troubleshooting. Below, I present a container diagram which consists of the required basic elements and the interactions between them.
In short: the Maintenance Interface is responsible for collecting DebugPacks, managing their lifetime, and interacting with the SatelliteOperationCenter on the ground. The other containers are responsible for observing particular elements of the system and gathering information from them.
To effectively solve problems with software on a satellite, the DevTeam should get a pack of information (the DebugPack) which contains as complete a description as possible of the status of the software when the problem occurred. This requires collecting data from the ground operation centres and from the satellite’s board. The information must be accurate enough to allow the team to reproduce the problem with the software running in a laboratory on the ground. How the problem will be fixed and how to apply the fix to software running in space is a different story.
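The division of responsibilities between the Maintenance Interface and the observing containers could be expressed roughly as follows. This is a sketch under assumptions: the `Observer` protocol and the two example observers are invented here, and the actual containers in the diagram may differ.

```python
from typing import Protocol


class Observer(Protocol):
    """Anything that watches one element of the system."""
    name: str

    def snapshot(self) -> dict: ...


class PowerObserver:
    name = "power"

    def __init__(self, battery_voltage: float):
        self.battery_voltage = battery_voltage

    def snapshot(self) -> dict:
        return {"battery_voltage": self.battery_voltage}


class StorageObserver:
    name = "storage"

    def __init__(self, free_bytes: int):
        self.free_bytes = free_bytes

    def snapshot(self) -> dict:
        return {"free_bytes": self.free_bytes}


class MaintenanceInterface:
    """Gathers snapshots from all observers into one report."""

    def __init__(self, observers):
        self._observers = list(observers)

    def collect(self) -> dict:
        return {obs.name: obs.snapshot() for obs in self._observers}


report = MaintenanceInterface([PowerObserver(7.4), StorageObserver(1024)]).collect()
print(report)  # {'power': {'battery_voltage': 7.4}, 'storage': {'free_bytes': 1024}}
```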