In recent decades, computing technology has experienced tremendous development.
For instance, transistor feature sizes have shrunk with every technology generation,
roughly every two years, since Moore first stated his law. Consequently, the number of
transistors and the core count per chip double with each generation. Similarly, petascale
systems capable of performing more than one quadrillion (10^15) calculations per
second have been developed, and exascale systems are predicted to
be available by 2020.
However, these developments in computer systems face a reliability wall. For instance,
transistor feature sizes are becoming so small that it is easier for high-energy
particles to temporarily flip the state of a memory cell from 1 to 0 or from 0 to 1.
Also, even if we assume that the fault rate per transistor stays constant with scaling,
the increase in total transistor and core count per chip will significantly increase the
number of faults in future desktop and exascale systems. Moreover, circuit ageing
is exacerbated by increased manufacturing variability and thermal stress;
as a result, the lifetime of processor structures is becoming shorter.
On the other hand, due to the limited power budget of computer systems such
as mobile devices, it is attractive to scale down the supply voltage. However, when the
voltage is scaled below the safe margin, especially to ultra-low levels, the
error rate increases drastically.
In addition, new memory technologies such as NAND flash offer only a limited
nominal lifetime; once they exceed this lifetime, they cannot
guarantee that data are stored correctly, which leads to data retention problems.
Due to these issues, reliability has become a first-class design constraint for contemporary
computing, in addition to power and performance. Moreover, reliability
plays an increasingly important role when computer systems handle sensitive
and life-critical tasks involving health records, financial information, power
regulation, and transportation.
In this thesis, we present several reliability designs for detecting and correcting
errors that occur in processor pipelines, L1 caches and non-volatile NAND
flash memories for various reasons. We design these reliability solutions
to serve three main purposes. Our first goal is to improve the reliability of computer
systems by detecting and correcting random and unpredictable errors such as bit flips or ageing errors. Second, we aim to reduce the energy consumption
of computer systems by allowing them to operate reliably at ultra-low voltage
levels. Third, we aim to increase the lifetime of new memory technologies by
implementing efficient and low-cost reliability schemes.