Simulation on Reconfigurable Heterogeneous Computer Architectures
06.2008 - 10.2017, SimTech Cluster of Excellence
Overview
Since the beginning of the DFG Cluster of Excellence "Simulation Technology" (SimTech) (EXC 310/1) at the University of Stuttgart in 2008, the Institute of Computer Architecture and Computer Engineering (ITI, RA) has been an active part of the research within the Stuttgart Research Center for Simulation Technology (SRC SimTech). The institute's research includes the development of fault-tolerant simulation algorithms for new, tightly coupled many-core computer architectures such as GPUs, the acceleration of existing simulations on such architectures, and the mapping of complex simulation applications to innovative reconfigurable heterogeneous computer architectures.
Within the research cluster, Hans-Joachim Wunderlich acts as a principal investigator (PI) and co-coordinates the research activities of the SimTech Project Network PN2 "High-Performance Simulation across Computer Architectures". This project network is unique in terms of its interdisciplinary nature and its interfaces between the participating researchers and projects. Scientists from computer science, chemistry, physics and chemical engineering work together to develop and provide new solutions for some of the major challenges in simulation technology. The classes of computational problems treated within project network PN2 comprise quantum mechanics, molecular mechanics, electronic structure methods, molecular dynamics, Markov-chain Monte-Carlo simulations and polarizable force fields.
Project Overview
Ongoing semiconductor technology scaling drives the integration of highly diversified computer architectures with different kinds of processing cores, communication channels and embedded memories. Besides classic CPU cores and data-parallel GPU cores, runtime-reconfigurable units are emerging as an integral part of such architectures.
Simulation technology will benefit significantly from these emerging computer architectures since they will close the gap between serial or coarse-grained parallel tasks on CPU cores and highly data-parallel tasks on GPU cores. Reconfigurable units can change their functionality at runtime and hence adapt dynamically to the needs of simulation applications. However, the upcoming architectural advances will be accompanied by a significant increase in complexity on the software side. For example, the shift from serial programming to parallel programming of multiple processors (CPUs) or the use of graphics processing units (GPUs) introduces new programming paradigms, which increasingly reflect and exhibit particular aspects of the underlying hardware structures.
Consequently, algorithms have to be analyzed with a much stronger focus on the available hardware structures. Furthermore, algorithmic parts have to be identified and isolated to deduce compute modules for optimally matching architectures. The combination of different computing resources in a reconfigurable heterogeneous architecture demands sophisticated load balancing and adaptation to changing system conditions (e.g. changing availability of computing resources).
In this project, we develop new methods that enable the direct mapping of simulation applications to innovative reconfigurable heterogeneous computer architectures. This includes methods for the assisted analysis and partitioning of algorithms, the deduction and design of compute modules, and an integrated software infrastructure for runtime load balancing and adaptation.
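One simple way to adapt a work split between heterogeneous devices to changing system conditions is to re-measure per-device throughput in each round and shift work toward the faster device. The following is a minimal illustrative sketch of such a feedback rule, not the project's actual scheduler; the function `rebalance` and its damping parameter are assumptions introduced for illustration:

```python
def rebalance(split, t_cpu, t_gpu, damping=0.5):
    """Shift the CPU/GPU work split toward the faster device.

    split:  fraction of the work currently assigned to the GPU (0..1).
    t_cpu:  measured time of the CPU part in the last round.
    t_gpu:  measured time of the GPU part in the last round.
    """
    # Throughput = work fraction processed per unit time in the last round.
    thr_gpu = split / t_gpu if t_gpu > 0 else 0.0
    thr_cpu = (1.0 - split) / t_cpu if t_cpu > 0 else 0.0
    # Ideal split assigns work proportionally to throughput.
    target = thr_gpu / (thr_gpu + thr_cpu)
    # Damping avoids oscillation when timings are noisy.
    return split + damping * (target - split)

# Example: GPU finished its half in 1 s, CPU needed 2 s -> shift work to GPU.
new_split = rebalance(0.5, t_cpu=2.0, t_gpu=1.0)
```

With the numbers above, the GPU's throughput is twice the CPU's, so the ideal split is 2/3; damping moves the current split halfway there, to 7/12.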
Project Overview
Computer simulations drive innovations in science and industry, and they are gaining more and more importance. However, their extraordinarily high computational demand generates significant challenges for contemporary computing systems. Typical high-performance computing systems, which provide sufficient performance and high reliability, are extremely expensive.
Modern many-core processor architectures like graphics processors (GPUs) offer high computational performance at very low costs, and they enable scientific simulation applications on the researcher's desktop. However, being designed for the graphics mass-market, GPUs offer only limited fault tolerance measures (e.g. ECC-protected memory) to cope with the increasing vulnerability to transient effects (soft errors) and other reliability threats. To fulfill the strict reliability requirements in scientific computing and simulation technology, appropriate fault tolerance measures have to be integrated into simulation algorithms and applications on GPUs. Algorithm-Based Fault Tolerance has the potential to meet these requirements.
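The underlying principle of Algorithm-Based Fault Tolerance for linear algebra is the classic checksum scheme of Huang and Abraham: the operands are extended with checksum rows/columns, and errors in the result reveal themselves as checksum mismatches. The following NumPy sketch illustrates that general principle only; it is not the GPU implementation developed in this project, and the tolerance parameter is an illustrative assumption:

```python
import numpy as np

def abft_matmul(A, B, tol=1e-8):
    """Checksum-based ABFT matrix multiplication (Huang/Abraham scheme).

    A is extended with a row of column sums, B with a column of row sums.
    The checksum row/column of C = A @ B must then equal the column/row
    sums of the data part, up to floating-point rounding.
    """
    A_c = np.vstack([A, A.sum(axis=0)])                 # column-checksum matrix
    B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])  # row-checksum matrix
    C_full = A_c @ B_r                                  # (m+1) x (n+1) result
    C = C_full[:-1, :-1]                                # data part
    # Verify: checksum row/column vs. recomputed sums of the data part.
    row_ok = np.allclose(C_full[-1, :-1], C.sum(axis=0), atol=tol)
    col_ok = np.allclose(C_full[:-1, -1], C.sum(axis=1), atol=tol)
    return C, (row_ok and col_ok)

rng = np.random.default_rng(0)
A, B = rng.random((4, 3)), rng.random((3, 5))
C, ok = abft_matmul(A, B)   # ok is True for an error-free multiplication
```

In a floating-point setting, choosing `tol` correctly is exactly the hard part addressed by the A-ABFT work listed below: the bound must separate harmless rounding errors from critical errors at runtime.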
The research within the first project phase (Mapping Simulation Algorithms to NoC-MPSoC Computers) concentrates on the development of fault-tolerant algorithms for GPU architectures and their integration into scientific simulation applications. Moreover, sophisticated simulation tasks from partners within the Cluster and Project Network PN2 are analyzed and adapted or re-designed for GPU architectures.
Acceleration of Monte-Carlo Molecular Simulations on Hybrid Computing Architectures
Stochastic simulation methods play an important role since they allow the solution of problems that are very hard to solve with deterministic algorithms. For search and optimization problems, evolutionary and genetic algorithms have been applied. Simulated annealing has been used to localize globally optimal problem solutions. One of the most important classes of such techniques are Monte Carlo (MC) methods, which approximate solutions to quantitative problems with multiple coupled degrees of freedom by random sampling. The problem targeted in this work is the parallelization of molecular simulations of the grand canonical ensemble, from the field of thermodynamics, on hybrid computing systems.
It can be shown that these simulations are an instance of a special case of MC methods, the Markov-Chain Monte-Carlo (MCMC) simulation. Being the core of many tasks in thermodynamics, Monte-Carlo molecular simulations often form the major bottleneck, which is typically tackled by coarse-grained parallelization and the distribution of simulation instances over clusters or workstation grids. Commonly, this is associated with considerable overhead and costs. In our interdisciplinary collaboration with the Institute of Thermodynamics and Thermal Process Engineering, we developed new methods for the parallel mapping and implementation of Markov-Chain Monte-Carlo molecular simulations on hybrid CPU-GPGPU systems. The mapping is characterized by data-parallel energy calculations and speculative computations in each Monte-Carlo step, and it directly exploits the different architectural characteristics of hybrid computing systems.
It was shown that the parallel mapping achieves speedups of more than 87x. This significant speedup enables MCMC molecular simulations at workstation level and the investigation of problem sizes that previously required computing clusters or grid-based systems.
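The core idea behind the data-parallel energy calculation can be sketched in a few lines: in each Metropolis step, the interaction energy of a trial move against all other particles is evaluated in one vectorized operation, which is exactly the part that maps well to a GPU. The following is a toy single-displacement sketch in NumPy, not the project's grand-canonical GPU implementation; the Lennard-Jones potential and the helper `pair_energy` are illustrative assumptions:

```python
import numpy as np

def pair_energy(pos, i, x_i, eps=1.0, sigma=1.0):
    """Lennard-Jones energy of particle i (at position x_i) with all others.

    The pairwise terms are computed in one vectorized sweep; this is the
    "data-parallel" part that a GPU kernel would execute.
    """
    d = np.delete(pos, i, axis=0) - x_i
    r2 = (d * d).sum(axis=1)
    sr6 = (sigma * sigma / r2) ** 3
    return np.sum(4.0 * eps * (sr6 * sr6 - sr6))

def metropolis_step(pos, beta, step, rng):
    """One displacement move with the Metropolis acceptance criterion."""
    i = rng.integers(len(pos))
    trial = pos[i] + rng.uniform(-step, step, 3)
    dE = pair_energy(pos, i, trial) - pair_energy(pos, i, pos[i])
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        pos[i] = trial  # accept the move
    return pos

rng = np.random.default_rng(0)
pos = rng.random((10, 3)) * 5.0
for _ in range(100):
    pos = metropolis_step(pos, beta=1.0, step=0.1, rng=rng)
```

A grand-canonical simulation additionally performs particle insertion and deletion moves, and the speculative computations mentioned above precompute several candidate moves in parallel before the sequential accept/reject decision.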
Evaluation of the Apoptotic Receptor-Clustering Process
Apoptosis, the prototype of programmed cell death, allows multi-cellular organisms to remove damaged or infected cells. A profound understanding of the molecular mechanisms involved in this important physiological process is required for the control of cell death, especially with a focus on the initiation of the apoptotic signaling pathways. One of these is the extrinsic pro-apoptotic signaling pathway, which is initiated by signal-competent clusters of, e.g., tumor necrosis factor (TNF) receptors and the corresponding TNF ligands.
In recent years, different mathematical models have been developed in order to describe and simulate the formation of signal-competent clusters consisting of receptors and their ligands. In our interdisciplinary collaboration with the Institute of Analysis, Dynamics, and Modeling and the Institute of Cell Biology and Immunology, we developed an efficient, parallel mapping of a novel mathematical model to a modern GPGPU many-core architecture. This model evaluates the apoptotic receptor-clustering on the cell membrane. Besides the translation of the receptors and ligands, the model additionally incorporates rotations. The model is based on a derivation of a nonlinearly coupled system of stochastic differential equations for the motion of the receptors and ligands. The system is solved by an Euler-Maruyama approximation. Due to the high computational cost of the simulation, tailoring it to GPU many-core architectures was inevitable. Our efficient, parallel mapping exploits fine-grained intra-GPU parallelism with multiple active simulation instances per GPGPU device, as well as coarse-grained inter-GPU parallelism by utilizing all available GPGPU devices within a system.
The parallel evaluation algorithm for the mathematical model yields peak speedups of up to 400x relative to a grid-based implementation on a multi-core CPU, which reduces the computation times from months to days or hours.
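The Euler-Maruyama scheme mentioned above advances a stochastic differential equation dX = b(X) dt + s(X) dW by the update X_{k+1} = X_k + b(X_k) dt + s(X_k) dW_k, where each increment dW_k is drawn from N(0, dt). The following minimal serial sketch illustrates the scheme only; the actual model is a nonlinearly coupled high-dimensional system, and its parallel GPU evaluation is far more involved:

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, T, n_steps, rng):
    """Integrate dX = drift(X) dt + diffusion(X) dW on [0, T].

    Each step adds the deterministic drift term and a Gaussian
    increment dW ~ N(0, dt) scaled by the diffusion coefficient.
    """
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + drift(x) * dt + diffusion(x) * dW
        path.append(x.copy())
    return np.array(path)

# Illustrative test process (Ornstein-Uhlenbeck): dX = -theta*X dt + sigma dW
rng = np.random.default_rng(1)
path = euler_maruyama(lambda x: -1.0 * x, lambda x: 0.3, [1.0], T=1.0,
                      n_steps=1000, rng=rng)
```

Because every receptor and ligand performs such an update in every step, the per-step work is embarrassingly parallel across particles, which is what the fine-grained intra-GPU mapping exploits.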
Activities
- H.-J. Wunderlich: "Fault Tolerance Meets Diagnosis", Keynote at the 21st IEEE International On-Line Testing Symposium (IOLTS), Elia, Halkidiki, Greece, July 6-8, 2015
Publications
Journals and Conference Proceedings
13. Energy-efficient and Error-resilient Iterative Solvers for Approximate Computing
Schöll, Alexander; Braun, Claus; Wunderlich, Hans-Joachim
Proceedings of the 23rd IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS'17), Thessaloniki, Greece, 3-5 July 2017, pp. 237-239
2017 | DOI | PDF
Keywords: Approximate Computing, Energy-efficiency, Fault Tolerance, Quality Monitoring
Abstract: Iterative solvers like the Preconditioned Conjugate Gradient (PCG) method are widely-used in compute-intensive domains including science and engineering that often impose tight accuracy demands on computational results. At the same time, the error resilience of such solvers may change in the course of the iterations, which requires careful adaption of the induced approximation errors to reduce the energy demand while avoiding unacceptable results. A novel adaptive method is presented that enables iterative Preconditioned Conjugate Gradient (PCG) solvers on Approximate Computing hardware with high energy efficiency while still providing correct results. The method controls the underlying precision at runtime using a highly efficient fault tolerance technique that monitors the induced error and the quality of intermediate computational results.
BibTeX:
@inproceedings{SchoeBW2017, author = {Schöll, Alexander and Braun, Claus and Wunderlich, Hans-Joachim}, title = {{Energy-efficient and Error-resilient Iterative Solvers for Approximate Computing}}, booktitle = {Proceedings of the 23rd IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS'17)}, year = {2017}, pages = {237--239}, keywords = {Approximate Computing, Energy-efficiency, Fault Tolerance, Quality Monitoring}, abstract = {Iterative solvers like the Preconditioned Conjugate Gradient (PCG) method are widely-used in compute-intensive domains including science and engineering that often impose tight accuracy demands on computational results. At the same time, the error resilience of such solvers may change in the course of the iterations, which requires careful adaption of the induced approximation errors to reduce the energy demand while avoiding unacceptable results. A novel adaptive method is presented that enables iterative Preconditioned Conjugate Gradient (PCG) solvers on Approximate Computing hardware with high energy efficiency while still providing correct results. The method controls the underlying precision at runtime using a highly efficient fault tolerance technique that monitors the induced error and the quality of intermediate computational results.}, doi = {http://dx.doi.org/10.1109/IOLTS.2017.8046244}, file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2017/IOLTS_SchoeBW2017.pdf} }
12. Applying Efficient Fault Tolerance to Enable the Preconditioned Conjugate Gradient Solver on Approximate Computing Hardware
Schöll, Alexander; Braun, Claus; Wunderlich, Hans-Joachim
Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'16), University of Connecticut, USA, 19-20 September 2016, pp. 21-26
DFT 2016 Best Paper Award
2016 | DOI | PDF
Keywords: Approximate Computing, Fault Tolerance, Sparse Linear System Solving, Preconditioned Conjugate Gradient
Abstract: A new technique is presented that allows to execute the preconditioned conjugate gradient (PCG) solver on approximate hardware while ensuring correct solver results. This technique expands the scope of approximate computing to scientific and engineering applications. The changing error resilience of PCG during the solving process is exploited by different levels of approximation which trade off numerical accuracy and hardware utilization. Such approximation levels are determined at runtime by periodically estimating the error resilience. An efficient fault tolerance technique allows reductions in hardware utilization by ensuring the continued exploitation of maximum allowed energy-accuracy trade-offs. Experimental results show that the hardware utilization is reduced on average by 14.5% and by up to 41.0% compared to executing PCG on accurate hardware.
BibTeX:
@inproceedings{SchoeBW2016, author = {Schöll, Alexander and Braun, Claus and Wunderlich, Hans-Joachim}, title = {{Applying Efficient Fault Tolerance to Enable the Preconditioned Conjugate Gradient Solver on Approximate Computing Hardware}}, booktitle = {Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'16)}, year = {2016}, pages = {21-26}, keywords = {Approximate Computing, Fault Tolerance, Sparse Linear System Solving, Preconditioned Conjugate Gradient}, abstract = {A new technique is presented that allows to execute the preconditioned conjugate gradient (PCG) solver on approximate hardware while ensuring correct solver results. This technique expands the scope of approximate computing to scientific and engineering applications. The changing error resilience of PCG during the solving process is exploited by different levels of approximation which trade off numerical accuracy and hardware utilization. Such approximation levels are determined at runtime by periodically estimating the error resilience. An efficient fault tolerance technique allows reductions in hardware utilization by ensuring the continued exploitation of maximum allowed energy-accuracy trade-offs. Experimental results show that the hardware utilization is reduced on average by 14.5% and by up to 41.0% compared to executing PCG on accurate hardware.}, doi = {http://dx.doi.org/10.1109/DFT.2016.7684063}, file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2016/DFT_SchoeBW2016.pdf} }
11. Pushing the Limits: How Fault Tolerance Extends the Scope of Approximate Computing
Wunderlich, Hans-Joachim; Braun, Claus; Schöll, Alexander
Proceedings of the 22nd IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS'16), Sant Feliu de Guixols, Catalunya, Spain, 4-6 July 2016, pp. 133-136
2016 | DOI | PDF
Keywords: Approximate Computing, Variable Precision, Metrics, Characterization, Fault Tolerance
Abstract: Approximate computing in hardware and software promises significantly improved computational performance combined with very low power and energy consumption. This goal is achieved by both relaxing strict requirements on accuracy and precision, and by allowing a deviating behavior from exact Boolean specifications to a certain extent. Today, approximate computing is often limited to applications with a certain degree of inherent error tolerance, where perfect computational results are not always required. However, in order to fully utilize its benefits, the scope of applications has to be significantly extended to other compute-intensive domains including science and engineering. To meet the often rather strict quality and reliability requirements for computational results in these domains, the use of appropriate characterization and fault tolerance measures is highly required. In this paper, we evaluate some of the available techniques and how they may extend the scope of application for approximate computing.
BibTeX:
@inproceedings{WundeBS2016, author = {Wunderlich, Hans-Joachim and Braun, Claus and Schöll, Alexander}, title = {{Pushing the Limits: How Fault Tolerance Extends the Scope of Approximate Computing}}, booktitle = {Proceedings of the 22nd IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS'16)}, year = {2016}, pages = {133--136}, keywords = {Approximate Computing, Variable Precision, Metrics, Characterization, Fault Tolerance}, abstract = {Approximate computing in hardware and software promises significantly improved computational performance combined with very low power and energy consumption. This goal is achieved by both relaxing strict requirements on accuracy and precision, and by allowing a deviating behavior from exact Boolean specifications to a certain extent. Today, approximate computing is often limited to applications with a certain degree of inherent error tolerance, where perfect computational results are not always required. However, in order to fully utilize its benefits, the scope of applications has to be significantly extended to other compute-intensive domains including science and engineering. To meet the often rather strict quality and reliability requirements for computational results in these domains, the use of appropriate characterization and fault tolerance measures is highly required. In this paper, we evaluate some of the available techniques and how they may extend the scope of application for approximate computing.}, doi = {http://dx.doi.org/10.1109/IOLTS.2016.7604686}, file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2016/IOLTS_WundeBS2016.pdf} }
10. Efficient Algorithm-Based Fault Tolerance for Sparse Matrix Operations
Schöll, Alexander; Braun, Claus; Kochte, Michael A.; Wunderlich, Hans-Joachim
Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'16), Toulouse, France, 28 June-1 July 2016, pp. 251-262
2016 | DOI | PDF
Keywords: Fault Tolerance, Sparse Linear Algebra, ABFT, Online Error Localization
Abstract: We propose a fault tolerance approach for sparse matrix operations that detects and implicitly locates errors in the results for efficient local correction. This approach reduces the runtime overhead for fault tolerance and provides high error coverage. Existing algorithm-based fault tolerance approaches for sparse matrix operations detect and correct errors, but they often rely on expensive error localization steps. General checkpointing schemes can induce large recovery cost for high error rates. For sparse matrix-vector multiplications, experimental results show an average reduction in runtime overhead of 43.8%, while the error coverage is on average improved by 52.2% compared to related work. The practical applicability is demonstrated in a case study using the iterative Preconditioned Conjugate Gradient solver. When scaling the error rate by four orders of magnitude, the average runtime overhead increases only by 31.3% compared to low error rates.
BibTeX:
@inproceedings{SchoeBKW2016, author = {Schöll, Alexander and Braun, Claus and Kochte, Michael A. and Wunderlich, Hans-Joachim}, title = {{Efficient Algorithm-Based Fault Tolerance for Sparse Matrix Operations}}, booktitle = {Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'16)}, year = {2016}, pages = {251--262}, keywords = {Fault Tolerance, Sparse Linear Algebra, ABFT, Online Error Localization}, abstract = {We propose a fault tolerance approach for sparse matrix operations that detects and implicitly locates errors in the results for efficient local correction. This approach reduces the runtime overhead for fault tolerance and provides high error coverage. Existing algorithm-based fault tolerance approaches for sparse matrix operations detect and correct errors, but they often rely on expensive error localization steps. General checkpointing schemes can induce large recovery cost for high error rates. For sparse matrix-vector multiplications, experimental results show an average reduction in runtime overhead of 43.8%, while the error coverage is on average improved by 52.2% compared to related work. The practical applicability is demonstrated in a case study using the iterative Preconditioned Conjugate Gradient solver. When scaling the error rate by four orders of magnitude, the average runtime overhead increases only by 31.3% compared to low error rates.}, doi = {http://dx.doi.org/10.1109/DSN.2016.31}, file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2016/DSN_SchoeBKW2016.pdf} }
9. Low-Overhead Fault-Tolerance for the Preconditioned Conjugate Gradient Solver
Schöll, Alexander; Braun, Claus; Kochte, Michael A.; Wunderlich, Hans-Joachim
Proceedings of the International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'15), Amherst, Massachusetts, USA, 12-14 October 2015, pp. 60-65
2015 | DOI | PDF
Keywords: Fault Tolerance, Sparse Linear System Solving, Preconditioned Conjugate Gradient, ABFT
Abstract: Linear system solvers are an integral part for many different compute-intensive applications and they benefit from the compute power of heterogeneous computer architectures. However, the growing spectrum of reliability threats for such nano-scaled CMOS devices makes the integration of fault tolerance mandatory. The preconditioned conjugate gradient (PCG) method is one widely used solver as it finds solutions typically faster compared to direct methods. Although this iterative approach is able to tolerate certain errors, latest research shows that the PCG solver is still vulnerable to transient effects. Even single errors, for instance, caused by marginal hardware, harsh environments, or particle radiation, can considerably affect execution times, or lead to silent data corruption. In this work, a novel fault-tolerant PCG solver with extremely low runtime overhead is proposed. Since the error detection method does not involve expensive operations, it scales very well with increasing problem sizes. In case of errors, the method selects between three different correction methods according to the identified error. Experimental results show a runtime overhead for error detection ranging only from 0.04% to 1.70%.
BibTeX:
@inproceedings{SchoeBKW2015a, author = {Schöll, Alexander and Braun, Claus and Kochte, Michael A. and Wunderlich, Hans-Joachim}, title = {{Low-Overhead Fault-Tolerance for the Preconditioned Conjugate Gradient Solver}}, booktitle = {Proceedings of the International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'15)}, year = {2015}, pages = {60-65}, keywords = {Fault Tolerance, Sparse Linear System Solving, Preconditioned Conjugate Gradient, ABFT}, abstract = {Linear system solvers are an integral part for many different compute-intensive applications and they benefit from the compute power of heterogeneous computer architectures. However, the growing spectrum of reliability threats for such nano-scaled CMOS devices makes the integration of fault tolerance mandatory. The preconditioned conjugate gradient (PCG) method is one widely used solver as it finds solutions typically faster compared to direct methods. Although this iterative approach is able to tolerate certain errors, latest research shows that the PCG solver is still vulnerable to transient effects. Even single errors, for instance, caused by marginal hardware, harsh environments, or particle radiation, can considerably affect execution times, or lead to silent data corruption. In this work, a novel fault-tolerant PCG solver with extremely low runtime overhead is proposed. Since the error detection method does not involve expensive operations, it scales very well with increasing problem sizes. In case of errors, the method selects between three different correction methods according to the identified error. Experimental results show a runtime overhead for error detection ranging only from 0.04% to 1.70%.}, doi = {http://dx.doi.org/10.1109/DFT.2015.7315136}, file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2015/DFTS_SchoeBKW2015.pdf} }
8. Efficient On-Line Fault-Tolerance for the Preconditioned Conjugate Gradient Method
Schöll, Alexander; Braun, Claus; Kochte, Michael A.; Wunderlich, Hans-Joachim
Proceedings of the 21st IEEE International On-Line Testing Symposium (IOLTS'15), Elia, Halkidiki, Greece, 6-8 July 2015, pp. 95-100
2015 | DOI | PDF
Keywords: Sparse Linear System Solving, Fault Tolerance, Preconditioned Conjugate Gradient, ABFT
Abstract: Linear system solvers are key components of many scientific applications and they can benefit significantly from modern heterogeneous computer architectures. However, such nano-scaled CMOS devices face an increasing number of reliability threats, which make the integration of fault tolerance mandatory. The preconditioned conjugate gradient method (PCG) is a very popular solver since it typically finds solutions faster than direct methods, and it is less vulnerable to transient effects. However, as latest research shows, the vulnerability is still considerable. Even single errors caused, for instance, by marginal hardware, harsh operating conditions or particle radiation can increase execution times considerably or corrupt solutions without indication. In this work, a novel and highly efficient fault-tolerant PCG method is presented. The method applies only two inner products to reliably detect errors. In case of errors, the method automatically selects between roll-back and efficient on-line correction. This significantly reduces the error detection overhead and expensive re-computations.
BibTeX:
@inproceedings{SchoeBKW2015, author = {Schöll, Alexander and Braun, Claus and Kochte, Michael A. and Wunderlich, Hans-Joachim}, title = {{Efficient On-Line Fault-Tolerance for the Preconditioned Conjugate Gradient Method}}, booktitle = {Proceedings of the 21st IEEE International On-Line Testing Symposium (IOLTS'15)}, year = {2015}, pages = {95--100}, keywords = {Sparse Linear System Solving, Fault Tolerance, Preconditioned Conjugate Gradient, ABFT}, abstract = {Linear system solvers are key components of many scientific applications and they can benefit significantly from modern heterogeneous computer architectures. However, such nano-scaled CMOS devices face an increasing number of reliability threats, which make the integration of fault tolerance mandatory. The preconditioned conjugate gradient method (PCG) is a very popular solver since it typically finds solutions faster than direct methods, and it is less vulnerable to transient effects. However, as latest research shows, the vulnerability is still considerable. Even single errors caused, for instance, by marginal hardware, harsh operating conditions or particle radiation can increase execution times considerably or corrupt solutions without indication. In this work, a novel and highly efficient fault-tolerant PCG method is presented. The method applies only two inner products to reliably detect errors. In case of errors, the method automatically selects between roll-back and efficient on-line correction. This significantly reduces the error detection overhead and expensive re-computations.}, doi = {http://dx.doi.org/10.1109/IOLTS.2015.7229839}, file = {http://www.iti.uni-stuttgart.de/fileadmin/rami/files/publications/2015/IOLTS_SchoeBKW2015.pdf} }
7. Adaptive Parallel Simulation of a Two-Timescale-Model for Apoptotic Receptor-Clustering on GPUs
Schöll, Alexander; Braun, Claus; Daub, Markus; Schneider, Guido; Wunderlich, Hans-Joachim
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'14), Belfast, United Kingdom, 2-5 November 2014, pp. 424-431
SimTech Best Paper Award
2014 | DOI | PDF
Keywords: Heterogeneous computing, GPU computing, parallel particle simulation, multi-timescale model, adaptive Euler-Maruyama approximation, ligand-receptor aggregation
Abstract: Computational biology contributes important solutions for major biological challenges. Unfortunately, most applications in computational biology are highly compute-intensive and associated with extensive computing times. Biological problems of interest are often not treatable with traditional simulation models on conventional multi-core CPU systems. This interdisciplinary work introduces a new multi-timescale simulation model for apoptotic receptor-clustering and a new parallel evaluation algorithm that exploits the computational performance of heterogeneous CPU-GPU computing systems. For this purpose, the different dynamics involved in receptor-clustering are separated and simulated on two timescales. Additionally, the time step sizes are adaptively refined on each timescale independently. This new approach improves the simulation performance significantly and reduces computing times from months to hours for observation times of several seconds.
BibTeX:
@inproceedings{SchoeBDSW2014, author = {Schöll, Alexander and Braun, Claus and Daub, Markus and Schneider, Guido and Wunderlich, Hans-Joachim}, title = {{Adaptive Parallel Simulation of a Two-Timescale-Model for Apoptotic Receptor-Clustering on GPUs}}, booktitle = {Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'14)}, year = {2014}, pages = {424--431}, keywords = {Heterogeneous computing, GPU computing, parallel particle simulation, multi-timescale model, adaptive Euler-Maruyama approximation, ligand-receptor aggregation}, abstract = {Computational biology contributes important solutions for major biological challenges. Unfortunately, most applications in computational biology are highly compute-intensive and associated with extensive computing times. Biological problems of interest are often not treatable with traditional simulation models on conventional multi-core CPU systems. This interdisciplinary work introduces a new multi-timescale simulation model for apoptotic receptor-clustering and a new parallel evaluation algorithm that exploits the computational performance of heterogeneous CPU-GPU computing systems. For this purpose, the different dynamics involved in receptor-clustering are separated and simulated on two timescales. Additionally, the time step sizes are adaptively refined on each timescale independently. This new approach improves the simulation performance significantly and reduces computing times from months to hours for observation times of several seconds.} }
6. | A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units Braun, Claus; Halder, Sebastian; Wunderlich, Hans-Joachim Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14), Atlanta, Georgia, USA, 23-26 June 2014, pp. 443-454 |
2014 DOI PDF |
Keywords: Algorithm-Based Fault Tolerance, Rounding Error Estimation, GPU, Matrix Multiplication | ||
Abstract: Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of software-based fault tolerance is attractive. Algorithm-Based Fault Tolerance (ABFT) protects important scientific operations like matrix multiplications. However, the application to floating-point operations necessitates the runtime classification of errors into inevitable rounding errors, allowed compute errors in the magnitude of such rounding errors, and into critical errors that are larger than those and not tolerable. Hence, an ABFT scheme needs suitable rounding error bounds to detect errors reliably. The determination of such error bounds is a highly challenging task, especially since it has to be integrated tightly into the algorithm and executed autonomously with low performance overhead. In this work, A-ABFT for matrix multiplications on GPUs is introduced, which is a new, parallel ABFT scheme that determines rounding error bounds autonomously at runtime with low performance overhead and high error coverage.
BibTeX:
@inproceedings{BraunHW2014, author = {Braun, Claus and Halder, Sebastian and Wunderlich, Hans-Joachim}, title = {{A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units}}, booktitle = {Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14)}, year = {2014}, pages = {443--454}, keywords = {Algorithm-Based Fault Tolerance, Rounding Error Estimation, GPU, Matrix Multiplication}, abstract = {Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of software-based fault tolerance is attractive. Algorithm-Based Fault Tolerance (ABFT) protects important scientific operations like matrix multiplications. However, the application to floating-point operations necessitates the runtime classification of errors into inevitable rounding errors, allowed compute errors in the magnitude of such rounding errors, and into critical errors that are larger than those and not tolerable. Hence, an ABFT scheme needs suitable rounding error bounds to detect errors reliably. The determination of such error bounds is a highly challenging task, especially since it has to be integrated tightly into the algorithm and executed autonomously with low performance overhead. In this work, A-ABFT for matrix multiplications on GPUs is introduced, which is a new, parallel ABFT scheme that determines rounding error bounds autonomously at runtime with low performance overhead and high error coverage.} }
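Several of the entries above apply Algorithm-Based Fault Tolerance to matrix multiplication. As background, here is a minimal pure-Python sketch of the classic checksum encoding (due to Huang and Abraham) that such schemes build on. The function names are our own illustrative choices; the papers' actual contributions (GPU mapping, autonomous rounding-error bounds) are not reproduced here.

```python
def matmul(A, B):
    # plain triple-loop matrix product on nested lists
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def with_column_checksum(A):
    # append a row holding the column sums of A
    return A + [[sum(col) for col in zip(*A)]]

def with_row_checksum(B):
    # append to each row its row sum as an extra column
    return [row + [sum(row)] for row in B]

def abft_detects_error(C, tol=1e-9):
    # C is the (n+1) x (p+1) encoded product; compare the sums of the
    # data part against the carried checksums (tol absorbs rounding)
    n, p = len(C) - 1, len(C[0]) - 1
    bad_row = any(abs(sum(C[i][:p]) - C[i][p]) > tol for i in range(n))
    bad_col = any(abs(sum(C[i][j] for i in range(n)) - C[n][j]) > tol
                  for j in range(p))
    return bad_row or bad_col
```

Multiplying the encoded matrices yields a product whose last row and column are checksums of the data part, so a single corrupted element violates at least one checksum and is detected.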
5. Efficacy and Efficiency of Algorithm-Based Fault Tolerance on GPUs
Wunderlich, Hans-Joachim; Braun, Claus; Halder, Sebastian
Proceedings of the IEEE International On-Line Testing Symposium (IOLTS'13), Crete, Greece, 8-10 July 2013, pp. 240-243
Keywords: Scientific Computing, GPGPU, Soft Errors, Fault Simulation, Algorithm-based Fault Tolerance
Abstract: Computer simulations drive innovations in science and industry, and they are gaining more and more importance. However, their high computational demand generates extraordinary challenges for computing systems. Typical high-performance computing systems, which provide sufficient performance and high reliability, are extremely expensive. Modern GPUs offer high performance at very low costs, and they enable simulation applications on the desktop. However, they are increasingly prone to transient effects and other reliability threats. To fulfill the strict reliability requirements in scientific computing and simulation technology, appropriate fault tolerance measures have to be integrated into simulation applications for GPUs. Algorithm-Based Fault Tolerance on GPUs has the potential to meet these requirements. In this work we investigate the efficiency and the efficacy of ABFT for matrix operations on GPUs. We compare ABFT against fault tolerance schemes that are based on redundant computations and we evaluate its error detection capabilities.
BibTeX:
@inproceedings{WundeBH2013, author = {Wunderlich, Hans-Joachim and Braun, Claus and Halder, Sebastian}, title = {{Efficacy and Efficiency of Algorithm-Based Fault Tolerance on GPUs}}, booktitle = {Proceedings of the IEEE International On-Line Testing Symposium (IOLTS'13)}, year = {2013}, pages = {240--243}, keywords = {Scientific Computing, GPGPU, Soft Errors, Fault Simulation, Algorithm-based Fault Tolerance}, abstract = {Computer simulations drive innovations in science and industry, and they are gaining more and more importance. However, their high computational demand generates extraordinary challenges for computing systems. Typical high-performance computing systems, which provide sufficient performance and high reliability, are extremely expensive. Modern GPUs offer high performance at very low costs, and they enable simulation applications on the desktop. However, they are increasingly prone to transient effects and other reliability threats. To fulfill the strict reliability requirements in scientific computing and simulation technology, appropriate fault tolerance measures have to be integrated into simulation applications for GPUs. Algorithm-Based Fault Tolerance on GPUs has the potential to meet these requirements. In this work we investigate the efficiency and the efficacy of ABFT for matrix operations on GPUs. We compare ABFT against fault tolerance schemes that are based on redundant computations and we evaluate its error detection capabilities.} }
4. Parallel Simulation of Apoptotic Receptor-Clustering on GPGPU Many-Core Architectures
Braun, Claus; Daub, Markus; Schöll, Alexander; Schneider, Guido; Wunderlich, Hans-Joachim
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'12), Philadelphia, Pennsylvania, USA, 4-7 October 2012, pp. 1-6
Keywords: GPGPU; parallel particle simulation; numerical modeling; apoptosis; receptor-clustering
Abstract: Apoptosis, the programmed cell death, is a physiological process that handles the removal of unwanted or damaged cells in living organisms. The process itself is initiated by signaling through tumor necrosis factor (TNF) receptors and ligands, which form clusters on the cell membrane. The exact function of this process is not yet fully understood and currently the subject of basic research. Different mathematical models have been developed to describe and simulate the apoptotic receptor-clustering. In this interdisciplinary work, a previously introduced model of the apoptotic receptor-clustering has been extended by a new receptor type to allow a more precise description and simulation of the signaling process. Due to the high computational requirements of the model, an efficient algorithmic mapping to a modern many-core GPGPU architecture has been developed. Such architectures enable high-performance computing (HPC) simulation tasks on the desktop at low costs. The developed mapping reduces average simulation times from months to days (peak speedup of 256x), allowing the productive use of the model in research.
BibTeX:
@inproceedings{BraunDSSW2012, author = {Braun, Claus and Daub, Markus and Schöll, Alexander and Schneider, Guido and Wunderlich, Hans-Joachim}, title = {{Parallel Simulation of Apoptotic Receptor-Clustering on GPGPU Many-Core Architectures}}, booktitle = {Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'12)}, year = {2012}, pages = {1--6}, keywords = {GPGPU; parallel particle simulation; numerical modeling; apoptosis; receptor-clustering}, abstract = {Apoptosis, the programmed cell death, is a physiological process that handles the removal of unwanted or damaged cells in living organisms. The process itself is initiated by signaling through tumor necrosis factor (TNF) receptors and ligands, which form clusters on the cell membrane. The exact function of this process is not yet fully understood and currently the subject of basic research. Different mathematical models have been developed to describe and simulate the apoptotic receptor-clustering. In this interdisciplinary work, a previously introduced model of the apoptotic receptor-clustering has been extended by a new receptor type to allow a more precise description and simulation of the signaling process. Due to the high computational requirements of the model, an efficient algorithmic mapping to a modern many-core GPGPU architecture has been developed. Such architectures enable high-performance computing (HPC) simulation tasks on the desktop at low costs. The developed mapping reduces average simulation times from months to days (peak speedup of 256x), allowing the productive use of the model in research.} }
3. Acceleration of Monte-Carlo Molecular Simulations on Hybrid Computing Architectures
Braun, Claus; Holst, Stefan; Wunderlich, Hans-Joachim; Castillo, Juan Manuel; Gross, Joachim
Proceedings of the 30th IEEE International Conference on Computer Design (ICCD'12), Montreal, Canada, 30 September-3 October 2012, pp. 207-212
Keywords: Hybrid Computer Architectures; GPGPU; Markov-Chain Monte-Carlo; Molecular Simulation; Thermodynamics
Abstract: Markov-Chain Monte-Carlo (MCMC) methods are an important class of simulation techniques, which execute a sequence of simulation steps, where each new step depends on the previous ones. Due to this fundamental dependency, MCMC methods are inherently hard to parallelize on any architecture. The upcoming generations of hybrid CPU/GPGPU architectures with their multi-core CPUs and tightly coupled many-core GPGPUs provide new acceleration opportunities especially for MCMC methods, if the new degrees of freedom are exploited correctly. In this paper, the outcomes of an interdisciplinary collaboration are presented, which focused on the parallel mapping of a MCMC molecular simulation from thermodynamics to hybrid CPU/GPGPU computing systems. While the mapping is designed for upcoming hybrid architectures, the implementation of this approach on an NVIDIA Tesla system already leads to a substantial speedup of more than 87x despite the additional communication overheads.
BibTeX:
@inproceedings{BraunHWCG2012, author = {Braun, Claus and Holst, Stefan and Wunderlich, Hans-Joachim and Castillo, Juan Manuel and Gross, Joachim}, title = {{Acceleration of Monte-Carlo Molecular Simulations on Hybrid Computing Architectures}}, booktitle = {Proceedings of the 30th IEEE International Conference on Computer Design (ICCD'12)}, publisher = {IEEE Computer Society}, year = {2012}, pages = {207--212}, keywords = {Hybrid Computer Architectures; GPGPU; Markov-Chain Monte-Carlo; Molecular Simulation; Thermodynamics}, abstract = {Markov-Chain Monte-Carlo (MCMC) methods are an important class of simulation techniques, which execute a sequence of simulation steps, where each new step depends on the previous ones. Due to this fundamental dependency, MCMC methods are inherently hard to parallelize on any architecture. The upcoming generations of hybrid CPU/GPGPU architectures with their multi-core CPUs and tightly coupled many-core GPGPUs provide new acceleration opportunities especially for MCMC methods, if the new degrees of freedom are exploited correctly. In this paper, the outcomes of an interdisciplinary collaboration are presented, which focused on the parallel mapping of a MCMC molecular simulation from thermodynamics to hybrid CPU/GPGPU computing systems. While the mapping is designed for upcoming hybrid architectures, the implementation of this approach on an NVIDIA Tesla system already leads to a substantial speedup of more than 87x despite the additional communication overheads.} }
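The sequential dependency that the MCMC entry above refers to is easy to see in code. Below is a generic random-walk Metropolis sampler, not the paper's molecular simulation; all names are our own illustrative choices.

```python
import math
import random

def metropolis_chain(log_target, x0, n_steps, step=1.0, seed=0):
    # Minimal random-walk Metropolis sampler. Note the loop-carried
    # dependency: each proposal is drawn around the *current* state,
    # which is why MCMC is inherently hard to parallelize across steps.
    # Parallelism must come from within a step (e.g. the energy
    # evaluation) or from running independent chains.
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        candidate = x + rng.uniform(-step, step)
        log_alpha = log_target(candidate) - log_target(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = candidate  # accept; otherwise keep the current state
        samples.append(x)
    return samples
```

For example, `metropolis_chain(lambda x: -0.5 * x * x, 0.0, 20000)` draws correlated samples whose empirical mean and variance approach those of a standard normal distribution.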
2. Algorithmen-basierte Fehlertoleranz für Many-Core-Architekturen; Algorithm-based Fault-Tolerance on Many-Core Architectures
Braun, Claus; Wunderlich, Hans-Joachim
it - Information Technology Vol. 52(4), August 2010, pp. 209-215
Keywords: Reliability; fault tolerance; parallel architectures; parallel programming
Abstract: Modern many-core architectures provide a high computational potential, which makes them particularly interesting for applications from the fields of scientific high-performance computing and simulation technology. The execution paradigm of these architectures is best described as "Many-Threading". Like all nano-scaled semiconductor devices, many-core processors are prone to transient errors (soft errors) and different kinds of variations that can have severe impact on the reliability of such systems. Therefore, fault tolerance has to be incorporated at all levels, from the hardware up to the software. On the software side, Algorithm-based Fault Tolerance (ABFT) is a mature technique to improve the reliability. However, significant effort is required to adapt this technique to modern many-threading architectures. In this article, an efficient and fault-tolerant mapping of the matrix multiplication to a modern many-core architecture is presented. Fault tolerance is thereby an integral part of the mapping and implemented through an ABFT scheme with marginal impact on the overall performance.
||
BibTeX:
@article{BraunW2010a, author = {Braun, Claus and Wunderlich, Hans-Joachim}, title = {{Algorithmen-basierte Fehlertoleranz für Many-Core-Architekturen; Algorithm-based Fault-Tolerance on Many-Core Architectures}}, journal = {it - Information Technology}, year = {2010}, volume = {52}, number = {4}, pages = {209--215} }
1. Algorithm-Based Fault Tolerance for Many-Core Architectures
Braun, Claus; Wunderlich, Hans-Joachim
Proceedings of the 15th IEEE European Test Symposium (ETS'10), Praha, Czech Republic, 24-28 May 2010, pp. 253-253
Abstract: Modern many-core architectures with hundreds of cores provide a high computational potential. This makes them particularly interesting for scientific high-performance computing and simulation technology. Like all nano-scaled semiconductor devices, many-core processors are prone to reliability-harming factors like variations and soft errors. One way to improve the reliability of such systems is software-based hardware fault tolerance. Here, the software is able to detect and correct errors introduced by the hardware. In this work, we propose a software-based approach to improve the reliability of matrix operations on many-core processors. These operations are key components in many scientific applications.
BibTeX:
@inproceedings{BraunW2010, author = {Braun, Claus and Wunderlich, Hans-Joachim}, title = {{Algorithm-Based Fault Tolerance for Many-Core Architectures}}, booktitle = {Proceedings of the 15th IEEE European Test Symposium (ETS'10)}, publisher = {IEEE Computer Society}, year = {2010}, pages = {253--253}, abstract = {Modern many-core architectures with hundreds of cores provide a high computational potential. This makes them particularly interesting for scientific high-performance computing and simulation technology. Like all nano-scaled semiconductor devices, many-core processors are prone to reliability-harming factors like variations and soft errors. One way to improve the reliability of such systems is software-based hardware fault tolerance. Here, the software is able to detect and correct errors introduced by the hardware. In this work, we propose a software-based approach to improve the reliability of matrix operations on many-core processors. These operations are key components in many scientific applications.}, doi = {http://dx.doi.org/10.1109/ETSYM.2010.5512738}, file = {http://www.iti.uni-stuttgart.de//fileadmin/rami/files/publications/2010/ETS_BraunW2010.pdf} }
Workshop Contributions
3. Hardware/Software Co-Characterization for Approximate Computing
Schöll, Alexander; Braun, Claus; Wunderlich, Hans-Joachim
Workshop on Approximate Computing, Pittsburgh, Pennsylvania, USA, 6 October 2016
BibTeX:
@inproceedings{SchoeBW2016, author = {Schöll, Alexander and Braun, Claus and Wunderlich, Hans-Joachim}, title = {{Hardware/Software Co-Characterization for Approximate Computing}}, booktitle = {Workshop on Approximate Computing}, year = {2016} }
2. ABFT with Probabilistic Error Bounds for Approximate and Adaptive-Precision Computing Applications
Braun, Claus; Wunderlich, Hans-Joachim
Workshop on Approximate Computing, Paderborn, Germany, 15-16 October 2015
BibTeX:
@inproceedings{BraunW2015, author = {Braun, Claus and Wunderlich, Hans-Joachim}, title = {{ABFT with Probabilistic Error Bounds for Approximate and Adaptive-Precision Computing Applications}}, booktitle = {Workshop on Approximate Computing}, year = {2015} }
1. A-ABFT: Autonomous Algorithm-Based Fault Tolerance on GPUs
Braun, Claus; Halder, Sebastian; Wunderlich, Hans-Joachim
International Workshop on Dependable GPU Computing, in conjunction with the ACM/IEEE DATE'14 Conference, Dresden, Germany, 28 March 2014
Keywords: Algorithm-Based Fault Tolerance, Graphics Processing Units, Scientific Computing, Simulation Technology, Floating-Point Arithmetic, Roundoff Error Analysis, Error Tolerance Determination
Abstract: General-purpose computations on graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. Such applications typically have high performance and reliability requirements. For GPUs, which are still designed for the graphics mass-market, hardware-based fault tolerance measures often do not have the highest priority, which makes the application of appropriate software-based fault tolerance mandatory. Algorithm-based Fault Tolerance (ABFT) allows the efficient and effective protection of important kernels from scientific computing. Some ABFT schemes have already been adapted for GPU architectures. However, due to roundoff error introduced by floating-point arithmetic, ABFT requires the determination of tight error bounds for the error detection. The determination of such error bounds is a highly challenging task. In this work, we introduce A-ABFT for GPUs, a new parallel ABFT scheme that determines appropriate error bounds for the checksum comparison step autonomously and which therefore enables the transparent operation of ABFT without any user interaction.
BibTeX:
@inproceedings{BraunHW2014a, author = {Braun, Claus and Halder, Sebastian and Wunderlich, Hans-Joachim}, title = {{A-ABFT: Autonomous Algorithm-Based Fault Tolerance on GPUs}}, booktitle = {International Workshop on Dependable GPU Computing, in conjunction with the ACM/IEEE DATE'14 Conference}, year = {2014}, keywords = {Algorithm-Based Fault Tolerance, Graphics Processing Units, Scientific Computing, Simulation Technology, Floating-Point Arithmetic, Roundoff Error Analysis, Error Tolerance Determination}, abstract = {General-purpose computations on graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. Such applications typically have high performance and reliability requirements. For GPUs, which are still designed for the graphics mass-market, hardware-based fault tolerance measures often do not have the highest priority, which makes the application of appropriate software-based fault tolerance mandatory. Algorithm-based Fault Tolerance (ABFT) allows the efficient and effective protection of important kernels from scientific computing. Some ABFT schemes have already been adapted for GPU architectures. However, due to roundoff error introduced by floating-point arithmetic, ABFT requires the determination of tight error bounds for the error detection. The determination of such error bounds is a highly challenging task. In this work, we introduce A-ABFT for GPUs, a new parallel ABFT scheme that determines appropriate error bounds for the checksum comparison step autonomously and which therefore enables the transparent operation of ABFT without any user interaction.} }
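The core difficulty these A-ABFT entries address is choosing a checksum-comparison tolerance that separates rounding noise from real faults. As background, the standard a priori dot-product bound from rounding-error analysis (Higham) states |fl(aᵀb) − aᵀb| ≤ γₙ·|a|ᵀ|b| with γₙ = nu/(1 − nu) and unit roundoff u. The sketch below illustrates only this a priori bound, not the paper's autonomous runtime scheme; all function names are our own.

```python
from fractions import Fraction

U = 2.0 ** -53  # unit roundoff for IEEE 754 double precision

def apriori_dot_bound(a, b):
    # gamma_n * |a|^T|b| with gamma_n = n*u / (1 - n*u): a classic
    # worst-case rounding bound for a naively summed dot product
    n = len(a)
    gamma_n = n * U / (1.0 - n * U)
    return gamma_n * sum(abs(x) * abs(y) for x, y in zip(a, b))

def float_dot(a, b):
    # naive left-to-right accumulation, matching the bound's model
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def exact_dot(a, b):
    # exact rational reference: floats convert to Fraction losslessly
    return sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))

def bound_is_safe_tolerance(a, b):
    # the actual rounding error never exceeds the a priori bound, so
    # the bound is a safe (if pessimistic) ABFT checksum tolerance
    err = abs(Fraction(float_dot(a, b)) - exact_dot(a, b))
    return err <= Fraction(apriori_dot_bound(a, b))
```

A runtime scheme like the paper's can be much tighter than this worst-case bound, since it sees the actual operand magnitudes instead of assuming the worst case for every operation.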
SimTech Conference 2018
Fast and Accurate Characterization of Approximate Computing Designs on Heterogeneous Computer Architectures (2018)
SimTech Conference 2011
Reliable Simulations on Many-Core Architectures (2011)
SimTech-Status-Seminar Contributions
- Status-Seminar 2017
- Status-Seminar 2016
- Status-Seminar 2015
- Status-Seminar 2014
- Status-Seminar 2013
- Status-Seminar 2012
- Status-Seminar 2011
- Status-Seminar 2010
- Status-Seminar 2009
- Status-Seminar 2008
Contact People
- Prof. Dr. rer. nat. habil. Hans-Joachim Wunderlich
Tel.: +49-711-685-88-391
wu@informatik.uni-stuttgart.de
- Dr. rer. nat. Claus Braun
Tel.: +49-711-685-88-407
claus.braun@informatik.uni-stuttgart.de
- Dipl.-Inf. Alexander Schöll
Tel.: +49-711-685-88-279
alexander.schoell@informatik.uni-stuttgart.de