On-line testing and recovery of FPGA-based systems
U. Legat
abstract: SRAM-based FPGAs have become an attractive solution for many applications where short development time, low cost at low production volumes, and in-field programmability are important. The flexibility of SRAM-based FPGAs comes from a configuration memory that defines the operation of the circuit the FPGA implements. It is therefore essential that the configuration memory retains the correct values while the FPGA is operating. The main threat to the reliability and dependability of SRAM-based FPGAs is radiation-induced soft errors, which corrupt the configuration memory by flipping bits. These errors occur often in the space environment; however, because of increasing integration density, they are also not uncommon at sea level.
Various fault-tolerance techniques are being developed to increase the reliability and dependability of applications on FPGAs. These techniques run concurrently (on-line) with the system and monitor its operation. On-line testing techniques detect errors in the system, error-mitigation techniques allow the system to keep working despite faults, and error-recovery techniques remove the faults from the system. The goals of fault-tolerance techniques are to minimize the hardware, timing, and power overhead while maximizing the reliability of the system. This dissertation presents our advances in fault-tolerance techniques.
We have developed an on-line testing technique for an Advanced Encryption Standard (AES) core implemented on an FPGA. This 32-bit AES core is the smallest reported AES core with error detection. Error detection is implemented in all the AES processes: encryption, decryption, and key schedule. Besides the on-line error-detection mode, our AES core can also be tested in an efficient off-line built-in self-test (BIST) mode. This novel smart BIST solution generates random test vectors by performing the AES processes in a loop and reuses the existing on-line error-detection circuit to analyze the outputs.
Additionally, we improved on existing error-recovery techniques. The first error-recovery technique is applicable to multiprocessor systems on FPGAs. We developed a software algorithm that removes soft errors from the FPGA configuration memory. A recovery algorithm for a single processor has already been reported in the literature. The advance in our algorithm is that it can move to another processor if the processor currently running the test is itself corrupted.
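The hand-over behaviour described in this paragraph can be pictured with a minimal sketch. It is not the dissertation's algorithm: the processor identifiers, the `is_healthy` callback, and the fixed search order are all assumptions made purely for illustration.

```python
def choose_tester(processors, current, is_healthy):
    """Pick the processor that should run the recovery routine.

    processors: list of processor ids (hypothetical).
    current:    id of the processor running the routine now.
    is_healthy: callback returning True if a processor is not corrupted
                (hypothetical; how health is judged is not modelled here).
    Returns the chosen id, or None if no healthy processor remains.
    """
    # Stay on the current processor while it is still trustworthy.
    if is_healthy(current):
        return current
    # Otherwise hand the routine over to the first healthy alternative.
    for candidate in processors:
        if candidate != current and is_healthy(candidate):
            return candidate
    return None
```

For example, with processors `[0, 1, 2]` and processor 0 marked as corrupted, `choose_tester([0, 1, 2], 0, lambda p: p != 0)` selects processor 1.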
The second error-recovery technique is a hardware-based error-recovery mechanism. This is the smallest and fastest controller that checks the configuration memory of the FPGA device and corrects any soft errors it finds. The mechanism is small enough to be included in almost any FPGA design without the need to move to a larger FPGA device. We integrated it into several self-recovery architectures that offer different levels of reliability.
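The check-and-repair cycle of such a controller can be illustrated in software. The sketch below is a simulation under stated assumptions, not the hardware design from the dissertation: the configuration memory is modelled as a list of byte frames, corruption is detected by comparing a CRC-32 of each frame against a stored reference, and repair copies the frame back from a golden image. The function names and the choice of CRC-32 are illustrative.

```python
import zlib


def make_golden_crcs(golden_frames):
    """Precompute one CRC-32 per frame of the golden (fault-free) image."""
    return [zlib.crc32(frame) for frame in golden_frames]


def scrub(config_memory, golden_frames, golden_crcs):
    """Check every frame once and rewrite the corrupted ones.

    config_memory: list of bytearray frames (simulated device contents).
    golden_frames: list of bytes frames (assumed fault-free reference).
    golden_crcs:   list of CRC-32 values from make_golden_crcs.
    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for index, frame in enumerate(config_memory):
        # A mismatching checksum means at least one bit in the frame changed.
        if zlib.crc32(frame) != golden_crcs[index]:
            config_memory[index] = bytearray(golden_frames[index])
            repaired.append(index)
    return repaired
```

Running `scrub` repeatedly corresponds to periodic checking: a pass over an uncorrupted memory returns an empty list and changes nothing.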
All the fault-tolerance techniques we developed were validated by fault-injection experiments. For this purpose we developed a fault-injection tool that automatically injects faults into the FPGA using partial run-time reconfiguration.
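A fault-injection campaign of this kind can be sketched as a simulation. The code below does not reproduce the tool described here and does not touch a real device: the configuration memory is a list of byte frames, a fault is a single bit flip at a random position, and `detects_fault` is a hypothetical callback standing in for whichever detection technique is under test.

```python
import random


def inject_bit_flip(config_memory, frame, byte, bit):
    """Flip one bit of the simulated configuration memory in place."""
    config_memory[frame][byte] ^= 1 << bit


def run_campaign(golden_frames, detects_fault, n_faults, seed=0):
    """Inject n_faults single bit flips, one per run, and count detections.

    golden_frames: list of bytes frames (assumed fault-free image).
    detects_fault: callback taking the faulty memory and returning True
                   if the technique under test notices the fault.
    """
    rng = random.Random(seed)  # fixed seed keeps the campaign repeatable
    detected = 0
    for _ in range(n_faults):
        # Start every run from a fresh copy so faults do not accumulate.
        memory = [bytearray(frame) for frame in golden_frames]
        frame = rng.randrange(len(memory))
        byte = rng.randrange(len(memory[frame]))
        bit = rng.randrange(8)
        inject_bit_flip(memory, frame, byte, bit)
        if detects_fault(memory):
            detected += 1
    return detected
```

Dividing the returned count by `n_faults` gives the detection coverage for the simulated campaign.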