Ultra low-cost defect protection for microprocessor pipelines
- ISSN: 01635964
- ISBN: 1595934510
- DOI: 10.1145/1168919.1168868
Abstract
The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the yield and product lifetime of future systems. In this paper we introduce the BulletProof pipeline, the first ultra low-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects. To achieve this goal we combine area-frugal on-line testing techniques and system-level checkpointing to provide the same guarantees of reliability found in traditional solutions, but at much lower cost. Our approach utilizes a microarchitectural checkpointing mechanism which creates coarse-grained epochs of execution, during which distributed on-line built-in self-test (BIST) mechanisms validate the integrity of the underlying hardware. In case a failure is detected, we rely on the natural redundancy of instruction-level parallel processors to repair the system so that it can still operate in a degraded performance mode. Using detailed circuit-level and architectural simulation, we find that our approach provides very high coverage of silicon defects (89%) with little area cost (5.8%). In addition, when a defect occurs, the subsequent degraded mode of operation was found to have only moderate performance impacts (from 4% to 18% slowdown).
Smitha Shyam Kypros Constantinides Sujay Phadke
Valeria Bertacco Todd Austin
Advanced Computer Architecture Lab
University of Michigan, Ann Arbor, MI 48109
{smithash, kypros, sphadke, valeria, austin}@umich.edu
Categories and Subject Descriptors B.8.1 [Hardware]: Perfor-
mance and Reliability—Reliability, Testing, and Fault-Tolerance
General Terms Reliability, Design
Keywords Reliability, Defect-Protection, Low-Cost, Pipelines
1. Introduction
As silicon technologies move into the nanometer regime, there is
growing concern for the reliability of transistor devices. Leading
technology experts have begun to warn designers that device reli-
ability will wane in the 45nm regime and beyond [7, 6]. In fact,
device scaling aggravates a number of long standing silicon failure
mechanisms, and it introduces a number of new non-trivial failure
modes. Unless these reliability concerns are addressed, either
through on-line detection and correction, or with the introduction
of more robust devices, component yield and lifetime will soon be
compromised. In this paper, we introduce a low-cost mechanism
for tolerating a small number of silicon failures that occur in the
field, i.e., while the device is in operation.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
ASPLOS’06 October 21–25, 2006, San Jose, California, USA.
Copyright © 2006 ACM 1-59593-451-0/06/0010...$5.00
1.1 The (Bumpy) Road Ahead
The following list highlights the types of silicon failures addressed
by the reliable solution presented in this work. Each of these failure
mechanisms has received significant attention in the process technology
literature, and each has been identified as a growing concern
for deep-submicron silicon.
Device Wear-out. Metal electro-migration and hot carrier
degradation are traditional mechanisms that lead to eventual device
failure [28]. While these mechanisms continue to be a problem for
deep-submicron silicon, new concerns arise due to the extremely
thin gate oxides utilized in current and future process technologies,
which lead to gate oxide wear-out (or Time Dependent Dielectric
Breakdown, TDDB). Over time, gate oxides can break and become
conductive, essentially shorting the transistor and rendering it use-
less. Fast clocks, high temperatures, and voltage scaling limitations
are well-established architectural trends that conspire to aggravate
this failure mode [33].
Transistor Infant Mortality. Extreme device scaling also ex-
acerbates early transistor failures, due to weak transistors that es-
cape post-manufacturing testing. These weak transistors work ini-
tially, but they have dimensional and doping deficiencies that sub-
ject them to much higher stress than normal. Quickly (within days
to months from component deployment) they break down and ren-
der the device unusable. Traditionally, early transistor failures have
been addressed with aggressive burn-in testing, where, before be-
ing placed in the field, devices are subjected to high voltage and
temperature testing, to accelerate the failure of weak transistors
[4]. Those transistors that survive this grueling birth are likely to
be robust devices, thereby ensuring a long product lifetime. In the
deep-submicron regime, burn-in becomes less effective as devices
are subject to thermal run-away effects, where increased tempera-
ture leads to increased leakage current, which in turn leads to yet
higher temperatures and further increases in leakage current [23].
The end result is that aggressive burn-in can destroy even robust
devices. Manufacturers are forced to either sacrifice yield with an
aggressive burn-in or experience more frequent early transistor fail-
ures in the field.
Manufacturing Defects that Escape Testing. Optical proxim-
ity effects, airborne impurities, and processing material defects can
all lead to the manufacturing of faulty transistors and interconnect
[28]. Moreover, deep-submicron gate oxides have become so thin
that manufacturing variation can lead to currents penetrating the
gate, rendering it unusable [30]. In current 90nm devices these oxides
are only about 20 atoms thick. In 45nm technology,
this thickness is expected to reduce below 10 atoms. This prob-
which makes it more difficult to test for defects during manufactur-
ing. Vendors are forced to either spend more time with parts on the
tester, or risk having untested defects escape into the field.
1.2 Contributions of This Paper
While there is no consensus on the absolute rate of defects in future
technologies, or as to when these problems will potentially derail
the silicon manufacturing industry, there is broad agreement
that device reliability will begin to wane in the 45nm regime
and beyond [13, 33, 17]. In this paper, we introduce the Bullet-
Proof pipeline. It is the first ultra low-cost defect protection mecha-
nism for microprocessor pipelines and on-chip cache memories. In
this work, we target specifically low in-field defect rates. The us-
age mode we envision for our technology is that it will be installed
into a microprocessor product. The technology will continuously
monitor the system’s health until the first defect is encountered.
At that point, the system will stay operative but at a lower perfor-
mance level. The user (and/or system controller) will be notified
and will have to choose to either: i) live with the degraded mode
performance, or ii) repair the system. Above all, our goal is
to provide all of these capabilities at minimal cost. Specifically,
this research paper makes the following contributions to the area of
reliable microarchitecture design:
• We present the first low-cost reliable system design approach
which provides fine-grained detection, diagnosis, recovery, and
repair of silicon defects that occur while the system is in oper-
ation in the field. While traditional approaches require at least
100% overhead due to duplication of critical resources, our on-
line testing-based approach provides the same level of protec-
tion with an overhead of 5.8%.
• We provide a physical-level analysis of coverage and perfor-
mance impact of our technique, in the context of a low-cost
embedded VLIW processor design. We chose this target design
because it is i) an important target due to its high reliability
needs for safety-critical applications, and ii) a challenging en-
vironment to implement defect tolerance due to its high cost
sensitivity. Moreover, it should be noted that very few relia-
bility solutions in the computer architecture literature quantify
the corresponding fault coverage. In contrast, we evaluate the
coverage of our solution through a physical-level analysis (syn-
thesized gate-level netlist) and find that it provides coverage
against 89% of potential defect locations.
Our approach to defect detection is markedly different from previous
solutions utilizing spatially or temporally redundant computation.
We instead leverage a combination of on-line distributed
testing with microarchitectural checkpointing to efficiently iden-
tify defects, and recover from their impact. The microarchitectural
checkpointing mechanism provides a computational epoch, which
is a period of computation over which the processor’s hardware is
checked. During a computational epoch, on-line distributed built-
in self-testing (BIST) techniques exploit idle cycles to completely
verify the functional integrity of the underlying hardware. When
the on-line testing completes without finding faults, the underly-
ing hardware is known to be free of silicon defects and the epoch’s
computation is allowed to safely retire to non-speculative state. By
contrast, if the underlying hardware is found to be faulty, the results
of the computational epoch are thrown away, and the system’s state
is restored to the last known-good machine state at the start of the
epoch. Before continuing execution from this point, the defective
component is disabled and the system continues in a performance
degraded mode without the broken resource.
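The epoch mechanism described above can be illustrated with a small software model. This is only a minimal sketch under stated assumptions, not the hardware design: the names `Machine`, `run_bist`, `run_epoch`, and the unit labels `alu0` to `alu3` are hypothetical, and the `defective` set stands in for real silicon faults that the BIST logic would uncover. The sketch shows the three outcomes the text describes: retire on a clean test, roll back on a failed test, and continue in degraded mode with the broken unit disabled.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy machine model (hypothetical): committed state plus functional units."""
    state: dict = field(default_factory=dict)  # non-speculative state
    enabled: set = field(default_factory=lambda: {"alu0", "alu1", "alu2", "alu3"})
    defective: set = field(default_factory=set)  # stand-in for real silicon defects


def run_bist(machine):
    """Stand-in for distributed BIST: return the enabled units that fail testing."""
    return machine.enabled & machine.defective


def run_epoch(machine, work):
    """Run one computational epoch. Return True if it retired, False if rolled back."""
    checkpoint = dict(machine.state)   # last known-good state at epoch start
    speculative = dict(checkpoint)
    work(speculative)                  # epoch results stay speculative for now
    faulty = run_bist(machine)         # hardware is checked during the epoch
    if not faulty:
        machine.state = speculative    # hardware verified: retire the epoch
        return True
    machine.state = checkpoint         # discard results, restore the checkpoint
    machine.enabled -= faulty          # disable broken units (degraded mode)
    return False


def increment(state):
    """Example epoch workload: bump a counter in the given state."""
    state["counter"] = state.get("counter", 0) + 1


m = Machine()
first = run_epoch(m, increment)    # no defects: epoch retires
m.defective.add("alu2")            # a defect appears in the field
second = run_epoch(m, increment)   # BIST fails: rollback, alu2 disabled
third = run_epoch(m, increment)    # re-execution succeeds on remaining units
```

In this model the rolled-back epoch leaves the counter unchanged, and the retry completes on the three remaining units, which mirrors the degraded-mode behavior described in the text.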
Figure 1. BulletProof pipeline architecture. Part a) shows
how we equip a microprocessor pipeline for defect protection:
component-specific hardware testing blocks are associated with
each design component to implement test generation and checking
mechanisms. When a failure occurs, it is possible that results
computed in the microprocessor core are incorrect. However, the
speculative "epoch"-based execution guarantees that the computation
can always be reversed to the last known-correct state. Part b)
shows three possible epoch scenarios.
Relying on on-line testing, rather than traditional redundancy
techniques, allows us to achieve dramatically lower overhead
than previously proposed techniques. Redundant approaches such
as triple-modular redundancy [31] and N-version hardware [31]
utilize redundant hardware on a cycle-by-cycle basis to detect and
correct errant computation resulting from silicon defects. For each
of these previous techniques, redundant hardware is used to ver-
ify the integrity of computation on the baseline hardware com-
ponent, resulting in cost overheads of 100% or more [11]. An
on-line testing-based approach, in contrast, is much less expen-
sive because the hardware necessary to verify integrity and provide
checkpoint/recovery is quite modest. Our entire facility only adds
5.8% additional hardware to a 4-wide VLIW processor with 32-
KBytes of instruction and data cache.
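To put the two overhead figures side by side, the short calculation below compares the added hardware of a redundancy-based design with that of the on-line testing approach. It is a back-of-the-envelope sketch: the 100% figure is the lower bound quoted above for redundant approaches, the 5.8% figure is the one reported for the 4-wide VLIW design, and the function and variable names are chosen here for illustration only.

```python
def relative_area(overhead_pct):
    """Total area relative to an unprotected baseline of 1.0."""
    return 1.0 + overhead_pct / 100.0


REDUNDANCY_OVERHEAD_PCT = 100.0   # lower bound quoted for redundant approaches
BULLETPROOF_OVERHEAD_PCT = 5.8    # reported for the 4-wide VLIW with caches

redundant_area = relative_area(REDUNDANCY_OVERHEAD_PCT)      # at least 2x baseline
bulletproof_area = relative_area(BULLETPROOF_OVERHEAD_PCT)   # about 1.058x baseline

# How many times more added hardware redundancy needs than on-line testing.
added_hardware_ratio = REDUNDANCY_OVERHEAD_PCT / BULLETPROOF_OVERHEAD_PCT

print(f"redundant design:   {redundant_area:.3f}x baseline area")
print(f"BulletProof design: {bulletproof_area:.3f}x baseline area")
print(f"added-hardware ratio: {added_hardware_ratio:.1f}")
```

Under these figures, a redundant design adds at least about 17 times as much hardware as the on-line testing approach.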
The remainder of this paper introduces our approach to defect
tolerance and evaluates its impacts on design cost and performance.
In Section 2 we describe in detail how on-line testing can be combined
with checkpoint recovery to provide high levels of defect tolerance
at low cost. Section 3 presents a detailed simulation-based
evaluation of the approach, using physical design analysis to gauge
area costs and architectural simulation to judge performance im-
pacts. Section 4 details previous work in the areas of defect tol-
erant microarchitecture design, on-line defect testing, and microar-
chitectural checkpoint recovery techniques. Finally, Section 5 gives
conclusions and suggests future directions.
2. Testing and Recovery
Figure 1 illustrates the high-level system architecture of our de-
fect tolerance approach, and it shows a timeline of execution that
demonstrates its operation. At the base of our approach is a mi-
croarchitectural checkpoint and recovery mechanism that creates
computational epochs. A computational epoch is a protected re-


