Research output: Contribution to Journal/Magazine › Journal article › peer-review
}
TY - JOUR
T1 - Using underutilized CPU resources to enhance its reliability
AU - Timor, A.
AU - Mendelson, A.
AU - Birk, Y.
AU - Suri, Neeraj
PY - 2010/1/1
Y1 - 2010/1/1
N2 - Soft errors (or Transient faults) are temporary faults that arise in a circuit due to a variety of internal noise and external sources such as cosmic particle hits. Though soft errors still occur infrequently, they are rapidly becoming a major impediment to processor reliability. This is due primarily to processor scaling characteristics. In the past, systems designed to tolerate such faults utilized costly customized solutions, entailing the use of replicated hardware components to detect and recover from microprocessor faults. As the feature size keeps shrinking and with the proliferation of multiprocessor on die in all segments of computer-based systems, the capability to detect and recover from faults is also desired for commodity hardware. For such systems, however, performance and power constitute the main drivers, so the traditional solutions prove inadequate and new approaches are required. We introduce two independent and complementary microarchitecture-level techniques: Double Execution and Double Decoding. Both exploit the typically low average processor resource utilization of modern processors to enhance processor reliability. Double Execution protects the Out-Of-Order part of the CPU by executing each instruction twice. Double Decoding uses a second, low-performance low-power instruction decoder to detect soft errors in the decoder logic. These simple-to-implement techniques are shown to improve the processor's reliability with relatively low performance, power, and hardware overheads. Finally, the resulting excessive reliability can even be traded back for performance by increasing clock rate and/or reducing voltage, thereby improving upon single execution approaches. © 2006 IEEE.
AB - Soft errors (or Transient faults) are temporary faults that arise in a circuit due to a variety of internal noise and external sources such as cosmic particle hits. Though soft errors still occur infrequently, they are rapidly becoming a major impediment to processor reliability. This is due primarily to processor scaling characteristics. In the past, systems designed to tolerate such faults utilized costly customized solutions, entailing the use of replicated hardware components to detect and recover from microprocessor faults. As the feature size keeps shrinking and with the proliferation of multiprocessor on die in all segments of computer-based systems, the capability to detect and recover from faults is also desired for commodity hardware. For such systems, however, performance and power constitute the main drivers, so the traditional solutions prove inadequate and new approaches are required. We introduce two independent and complementary microarchitecture-level techniques: Double Execution and Double Decoding. Both exploit the typically low average processor resource utilization of modern processors to enhance processor reliability. Double Execution protects the Out-Of-Order part of the CPU by executing each instruction twice. Double Decoding uses a second, low-performance low-power instruction decoder to detect soft errors in the decoder logic. These simple-to-implement techniques are shown to improve the processor's reliability with relatively low performance, power, and hardware overheads. Finally, the resulting excessive reliability can even be traded back for performance by increasing clock rate and/or reducing voltage, thereby improving upon single execution approaches. © 2006 IEEE.
KW - Double execution
KW - Fault tolerance
KW - Microarchitecture
KW - Soft errors
KW - Superscalar
KW - Transient faults
KW - Clock rate
KW - Commodity hardware
KW - Computer-based system
KW - Cosmic particles
KW - CPU resources
KW - Customized solutions
KW - External sources
KW - Feature sizes
KW - Hardware components
KW - Hardware overheads
KW - Internal noise
KW - Low Power
KW - Micro architectures
KW - Microprocessor faults
KW - Modern processors
KW - New approaches
KW - Out of order
KW - Processor reliability
KW - Processor resources
KW - Soft error
KW - Temporary fault
KW - Computer hardware
KW - Cosmology
KW - Decoding
KW - Error correction
KW - Fault tolerant computer systems
KW - Microprocessor chips
KW - Quality assurance
U2 - 10.1109/TDSC.2008.31
DO - 10.1109/TDSC.2008.31
M3 - Journal article
VL - 7
SP - 94
EP - 109
JO - IEEE Transactions on Dependable and Secure Computing
JF - IEEE Transactions on Dependable and Secure Computing
SN - 1545-5971
IS - 1
ER -