by Muzzafer Simsir, Srihari Cadambi, Franjo Ivancic, Martin Rötteler,
and Niraj Jha
As the end of the semiconductor roadmap for CMOS approaches,
architectures based on nanoscale molecular devices are attracting
attention. Among several alternatives, silicon nanowires and carbon
nanotubes are the two most promising nanotechnologies according to the
ITRS. These technologies may enable scaling deep into the nanometer
regime. However, they suffer from very defect-prone manufacturing
processes. Although the reconfigurability property of the nanoscale
devices can be used to tolerate high defect rates, it may not be
possible to detect all defects. With very high device densities,
testing each component may not be possible because of time or
technology restrictions. This points to a scenario in which even
though the devices are tested, the tests are not very comprehensive at
locating defects, and hence the chips are still defective. Moreover,
the devices in the nanometer range will be susceptible to transient
faults which can produce arbitrary soft errors. Despite these
drawbacks, it is possible to make nanoscale architectures practical
and realistic by introducing defect and fault tolerance. In this
paper, we propose and evaluate a hybrid nanowire-CMOS architecture
that addresses all three problems -- namely high defect rates,
undetected defects and transient faults -- at the same time. This goal
is achieved by using multiple levels of redundancy and majority
voters. A key aspect of the architecture is that it contains a
judicious balance of both nanoscale and traditional CMOS
components. Companion to the architecture is a compiler with
heuristics to quickly determine if logic can be mapped onto partially
defective nanoscale elements. The heuristics make it possible to
introduce defect-awareness in placement and routing. The architecture
and compiler are evaluated by applying the complete design flow to
several benchmarks.
ACM Journal on Emerging Technologies in Computing Systems, vol. 5,
no. 3, pp. 14:2-14:26, 2009.
Shantanu Gupta, Florin Sultan, Srihari Cadambi, Franjo Ivancic,
and Martin Rötteler
Widespread emergence of multicore processors will spur
development of parallel applications, exposing programmers to degrees
of hardware concurrency hitherto unavailable. Dependable multithreaded
software will have to rely on the ability to dynamically detect
nondeterministic and notoriously hard to reproduce synchronization
bugs manifested through data races. Previous solutions to dynamic data
race detection have required specialized hardware, at additional
power, design and area costs. We propose RaceTM, a novel approach to
data race detection that exploits hardware that will likely be present
in future multiprocessors, albeit for a different purpose. In
particular, we show how emerging hardware support for transactional
memory can be leveraged to aid data race detection. We propose the
concept of lightweight debug transactions that exploit the conflict
detection mechanisms of transactional memory systems to perform data
race detection. We present a proof-of-concept simulation prototype,
and evaluate it on data races injected into applications from the
SPLASH-2 suite. Our experiments show that this technique is effective
at discovering data races and has low performance overhead.
Proceedings IEEE International Parallel and Distributed Processing
Symposium (IPDPS'09), pp. 1-11, 2009.
by Markus Püschel and Martin Rötteler
Recently, we introduced the framework for signal processing on a
nonseparable 2-D hexagonal spatial lattice including the associated
notion of Fourier transform called discrete triangle transform
(DTT). Spatial means that the lattice is undirected in contrast to
earlier work by Mersereau introducing hexagonal discrete Fourier
transforms. In this paper we derive a general-radix algorithm for the
DTT of an n x n 2-D signal, focusing on the radix-2 x 2 case. The
runtime of the algorithm is O(n^2 log(n)), which is the same as for
commonly used separable 2-D transforms. The DTT algorithm derivation
is based on the algebraic signal processing theory. This means that
instead of manipulating transform coefficients, the algorithm is
derived through a stepwise decomposition of its underlying polynomial
algebra based on a general theorem that we introduce. The theorem
shows that the obtained DTT algorithm is the precise equivalent of the
well-known Cooley-Tukey fast Fourier transform, which motivates the
title of this paper.
Applicable Algebra in Engineering, Communication and Computing,
vol. 19, Issue 3, pp. 259-292. Springer, 2008.
by S. Gupta, F. Sultan, S. Cadambi, F. Ivancic, and M. Rötteler
Widespread emergence of multicore processors will spur
development of parallel applications, exposing programmers to more
hardware concurrency. Dependable multithreaded software will have to
rely on the ability to dynamically detect data races, which are
non-deterministic and notoriously hard to reproduce symptoms of
synchronization bugs. In this paper, we propose RaceTM, a novel
approach that exploits transactional memory support to detect data
races. We introduce the concept of lightweight debug
transactions that exploit the conflict detection mechanisms of
transactional memory systems to perform data race detection. Debug
transactions differ from regular transactions in that they do not need
to be rolled back, and therefore require no versioning or
checkpointing support. Debug transactions do not overlap with a
regular transaction, thus providing a transparent mechanism to
leverage existing transactional memory support for data race
detection.
Proceedings 20th ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA'08), pp. 104-106, 2008.
by M. Simsir, S. Cadambi, F. Ivancic, M. Rötteler, and N. Jha
Architectures based on nanoscale molecular devices are attracting
attention for replacing CMOS architectures at the end of the
semiconductor roadmap. The two most promising nanotechnologies,
according to ITRS, are silicon nanowires and carbon nanotubes.
Although they offer unmatched densities for building logic,
interconnect and memory, they do suffer from very defect-prone
manufacturing processes. This is further exacerbated by testing
complexities where it is nearly impossible to detect and localize all
faults in a large nanoscale chip. As a result, fault tolerance is
necessary to make nanoscale architectures practical and
realistic. Although reconfiguration can be used to compile around
known defects on a chip, a large fraction of defects may remain hidden
if they cannot be discovered by testing. Furthermore, the small
structures in nanoscale architectures are particularly susceptible to
transient faults which can produce arbitrary soft errors. For the
first time, we propose an architecture that can tolerate a large
number of undetected manufacturing faults as well as a large rate of
transient faults. Our architecture is characterized by multiple levels
of redundancy and majority voting to correct errors caused by such
faults. A key aspect of the architecture is that it contains a
judicious balance of both nanoscale and traditional CMOS components. A
companion to the architecture is a compiler with heuristics tailored
to quickly and compactly map logic onto partially defective
components. Experimental results demonstrate the efficacy of the
architecture.
Proceedings 21th Conference on VLSI Design (VLSI'08), pp. 435-440,
2008.
by Markus Püschel and Martin Rötteler
We develop the framework for signal processing on a spatial, or
undirected, 2-D hexagonal lattice for both an infinite and a finite
array of signal samples. This framework includes the proper notions of
z-transform, boundary conditions, filtering or convolution,
spectrum, frequency response, and Fourier transform. In the finite
case, the Fourier transform is called discrete triangle transform
(DTT). Like the hexagonal lattice, this transform is nonseparable.
The derivation of the framework makes it a natural extension of the
algebraic signal processing theory that we recently introduced.
Namely, we construct the proper signal models, given by polynomial
algebras, bottom-up from a suitable definition of hexagonal space
shifts using a procedure provided by the algebraic theory. These
signal models, in turn, then provide all the basic signal processing
concepts. The framework developed in this paper is related to
Mersereau's early work on hexagonal lattices in the same way as the
discrete cosine and sine transforms are related to the discrete
Fourier transform - a fact that is made rigorous in this paper.
IEEE Transactions on Image Processing, vol. 16, no. 6, pp. 1506-1521, 2007.
by Markus Püschel and Martin Rötteler
We derive a new, two-dimensional nonseparable signal transform for
computing the spectrum of spatial signals residing on a finite
quincunx lattice. The derivation uses the connection between
transforms and polynomial algebras, which has long been known for the
discrete Fourier transform (DFT), and was extended to other transforms
in recent research. We also show that the new transform can be
computed with O(n^2 log(n)) operations, which puts it in the same
complexity class as its separable counterparts.
Proceedings of the International Conference on Image Processing,
(ICIP'05), vol. 2, pp. 494-497, Genova, September 11-14, 2005.
by Markus Püschel and Martin Rötteler
We introduce a new signal transform for computing the spectrum of
a signal given on a two-dimensional directional quincunx lattice. The
transform is non-separable, but closely related to a two-dimensional
(separable) discrete Fourier transform. We derive the transform using
recently discovered connections between signal transforms and
polynomial algebras. These connections also yield several important
properties of the new transform.
Proceedings of the International Conference on Acoustics, Speech, and
Signal Processing (ICASSP'05), vol. 4, pp. 401-404, Philadelphia,
March 18-23, 2005.
by Markus Püschel and Martin Rötteler
The discrete triangle transform (DTT) was recently introduced in the
paper "The discrete triangle transform"
(Proceedings of ICASSP'04) as an example of a non-separable transform
for signal processing on a two-dimensional triangular grid. The DTT
is built from Chebyshev polynomials in two variables in the same way
as the DCT, type III, is built from Chebyshev polynomials in one
variable. We show that, as a consequence, the DTT has, as the DCT,
type III, a Cooley-Tukey FFT type fast algorithm. We derive this
algorithm and an upper bound for the number of complex operations it
requires. Similar to most separable two-dimensional transforms, the
operations count of this algorithm is O(n^2 log(n)) for an input of
size n x n.
Proceedings of the 11th Digital Signal Processing Workshop
3rd Signal Processing Education Workshop, Taos, August 1-4, 2004.
by Markus Püschel and Martin Rötteler
We introduce the discrete triangle transform (DTT), a
non-separable transform for signal processing on a two-dimensional
equispaced triangular grid. The DTT is, in a strict mathematical
sense, a generalization of the DCT, type III, to two dimensions, since
the DTT is built from Chebyshev polynomials in two variables in the
same way as the DCT, type III, is built from Chebyshev polynomials in
one variable. We provide boundary conditions, signal extension, and
diagonalization properties for the DTT. Finally, we give evidence that
the DTT has Cooley-Tukey FFT like algorithms that enable its efficient
computation.
Proceedings of the International Conference on Acoustics, Speech, and
Signal Processing (ICASSP'04), vol. 3, pp. 44-48, Montreal, May 17-21,
2004.