Active Research Projects
A Cross-Layer Distributed Control and Optimization Framework for Cloud Computing Systems
Collaborators: Dr. Tyrone Vincent (Control Theory, Colorado School of Mines), Dr. Dinesh Mehta (Graph Algorithms, Colorado School of Mines), and Dr. Bharat Joshi (System Modeling, UNC Charlotte)
The research goal is to establish a cross-layer distributed control and optimization framework for real-time, energy-efficient operation of the next generation of cloud computing systems. We hypothesize that time-sensitive and energy-optimal operation of cloud computing systems is possible if they are treated as large-scale dynamical systems subjected to unexpected disturbances, including variations in the workload, changes in the cost of energy, and hardware failures.
We are currently working on developing and experimentally demonstrating a resource allocation and hierarchical distributed cooperative control framework for big-data applications with special emphasis on big-graph analytics.
The proposed hierarchical distributed control leverages the cross-layer flow of information to enable tunable parameters across the computing stack.
A Cross-Stack Predictive Control Framework for Multicore Real Time Applications
Student: Guangyi Cao, Ph.D. (Graduated Aug 2014)
Many of the next-generation applications in entertainment, human-computer interaction, infrastructure, security, and medical systems are computationally intensive, always-on, and have soft real-time (SRT) requirements. While failure to meet deadlines is not catastrophic in SRT systems, missing deadlines can result in an unacceptable degradation in the quality of service (QoS). To ensure acceptable QoS under dynamically changing operating conditions such as changes in the workload, energy availability, and thermal constraints, systems are typically designed for worst-case conditions. Unfortunately, such overdesign increases cost and overall power consumption.
In this work we formulate real-time task execution as a Multiple-Input, Single-Output (MISO) optimal control problem that tracks a desired system utilization set point with control inputs derived from across the computing stack. We assume that an arbitrary number of SRT tasks may join and leave the system at arbitrary times. The tasks are scheduled on multiple cores by a dynamic-priority multiprocessor scheduling algorithm. We use a model predictive controller (MPC) to realize optimal control. MPCs are easy to tune and can handle multiple control variables as well as constraints on both the dependent and independent variables. We experimentally demonstrate the operation of our controller on a video encoder application and a computer vision application executing on a dual-socket, quad-core Xeon system with a total of 8 processing cores. We establish that using DVFS and application quality as control variables enables operation at a lower power operating point while meeting real-time constraints, compared to non-cross-stack control approaches. We also evaluate the role of scheduling algorithms in the control of homogeneous and heterogeneous workloads. Additionally, we propose a novel adaptive control technique for time-varying workloads.
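The receding-horizon idea behind the controller can be illustrated with a minimal Python sketch. It is not the controller used in this work: the plant model (utilization = demand x quality / frequency), the cost weights, the discrete DVFS and quality levels, and the names `mpc_step`, `stage_cost`, and `predict_utilization` are all assumptions made for illustration, and the optimization is done by exhaustive enumeration rather than by a numerical solver.

```python
from itertools import product

FREQS = (1.0, 1.5, 2.0, 2.5)   # hypothetical DVFS levels (GHz)
QUALITIES = (0.6, 0.8, 1.0)    # hypothetical application quality levels
CANDIDATES = tuple(product(FREQS, QUALITIES))


def predict_utilization(demand, freq, quality):
    """Toy plant model (assumption): work scales with quality, capacity with frequency."""
    return demand * quality / freq


def stage_cost(util, freq, quality, prev_freq, setpoint):
    """Cost of one control period; all weights are illustrative assumptions."""
    tracking = (util - setpoint) ** 2          # distance from the utilization set point
    power = 0.01 * freq ** 3                   # toy power model, cubic in frequency
    quality_loss = 0.05 * (1.0 - quality)      # penalty for degrading application quality
    switching = 0.02 * abs(freq - prev_freq)   # penalty for changing frequency
    return tracking + power + quality_loss + switching


def mpc_step(demand_forecast, setpoint, prev_freq):
    """Search every input sequence over the horizon and return the first move of the best."""
    best_cost, best_first = float("inf"), None
    for seq in product(CANDIDATES, repeat=len(demand_forecast)):
        cost, last_freq = 0.0, prev_freq
        for demand, (freq, quality) in zip(demand_forecast, seq):
            util = predict_utilization(demand, freq, quality)
            cost += stage_cost(util, freq, quality, last_freq, setpoint)
            last_freq = freq
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first
```

In a closed loop, `mpc_step` would be called once per control period with a fresh demand forecast; only the returned (frequency, quality) pair is applied before the horizon shifts forward by one period. Both inputs act on a single tracked output, which mirrors the MISO structure described above.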
Statistical machine learning based modeling framework for design space exploration and run-time cross-stack energy optimization for many-core processors
Student: Changshu Zhang, Ph.D. (Graduated May 2012)
Synopsis: The complexity of many-core processors continues to grow as a larger number of heterogeneous cores are integrated on a single chip. Such systems-on-chip contain computing structures including complex out-of-order cores, simple in-order cores, digital signal processors (DSPs), graphics processing units (GPUs), application-specific processors, hardware accelerators, I/O subsystems, network-on-chip interconnects, and large caches arranged in complex hierarchies. While the industry focus is on putting a higher number of cores on a single chip, the key challenge is to optimally architect these many-core processors such that performance, energy, and area constraints are satisfied. The traditional approach to processor design through extensive cycle-accurate simulations is ill-suited for designing many-core processors due to the large microarchitecture design space that must be explored. Additionally, it is hard to optimize such complex processors and the applications that run on them statically at design time such that performance and energy constraints are met under dynamically changing operating conditions.
This project seeks to establish a statistical machine learning based modeling framework that enables the efficient design and operation of many-core processors that meet performance, energy, and area constraints. We apply the proposed framework to rapidly design the microarchitecture of a many-core processor for multimedia, finance, and data mining applications derived from the PARSEC benchmark suite. We further demonstrate the application of the framework in the joint run-time adaptation of both the application and the microarchitecture such that energy availability constraints are met.
Effective data parallel computing on multicore processors
Student: Jong Ho Byun, Ph.D. (Graduated Dec 2010, Currently postdoc at UC Berkeley)
Synopsis: The rise of chip multiprocessing, or the integration of multiple general-purpose processing cores on a single chip (multicores), has impacted all computing platforms, including high-performance systems, servers, desktops, mobile devices, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory-hierarchy-friendly software that can effectively exploit the chip-level multiprocessing paradigm of multicores. The goal of this project is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplication (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and a Power Flow Solver based on Gauss-Seidel (PFS-GS). These applications are popular computation methods in science and engineering and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access patterns. The main contributions of this project include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore-efficient implementations of the applications described above outperform naïve parallel implementations by at least 2x and scale well with problem size and with the number of processing cores.
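As an illustration of what memory-hierarchy-friendly restructuring can look like for MM, the following Python sketch shows generic loop tiling (cache blocking) with a unit-stride inner loop. This is a standard textbook technique and not necessarily the algorithm model developed in this project; the function name `blocked_matmul` and the default block size are assumptions, and a real implementation would be written in a compiled language and tuned to the cache sizes of the target processor.

```python
def blocked_matmul(a, b, block=32):
    """Multiply row-major matrices a (n x m) and b (m x p) one tile at a time."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer loops walk over tiles so that the working set of the three
    # inner loops stays small enough to remain in cache.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik, row_b = row_a[k], b[k]
                        # Innermost loop reads row_b and updates row_c with unit stride.
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

The tiles are independent along the `ii` dimension, so in a parallel version each core could be assigned a disjoint set of row tiles of the result without any write conflicts.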
Design and Implementation of a High Performance Finite Difference Time Domain Algorithm on the Amazon Cloud Using Hadoop
Student: Aby Kuruvilla, M.S. (Graduated Dec 2010, Currently at Envisage Information Systems)
Synopsis: The National Institute of Standards and Technology defines cloud computing as a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources such as networks, servers, storage, applications and services, that can be rapidly provisioned and released with minimal management effort or service provider interaction. With the ever increasing amounts of scientific data, and the expense, complexity and maintenance concerns involved in managing in-house compute clusters, considerable interest exists in utilizing the cloud computing paradigm for High Performance Computing (HPC). Programming models such as MapReduce, and its open source implementation, Hadoop, give researchers the ability to develop parallel, fault tolerant applications capable of exploiting the computing power of the cloud with relative ease.
This project investigates the design and performance issues involved in implementing a high performance 3D Finite Difference Time Domain (FDTD) algorithm on the Amazon public cloud. FDTD is an example of the iterative stencil algorithms common in scientific computing. The algorithm is parallelized in Hadoop using a partitioning scheme that divides the data among multiple reducers. The scalability of the implementation with an increasing number of reducers and nodes is examined. For a problem size of n = 256 (6 GB), the 16-instance, 16-reducer configuration was found to be 6X faster, in average execution time per iteration, than the sequential 1-instance, 1-reducer configuration. Adding more instances and carefully choosing the number of reducers would allow even larger problem sizes to be computed without any changes to the implementation. An estimate of the cost-performance tradeoffs of using the Amazon cloud compared to in-house compute cluster solutions is also determined.
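The idea of dividing a stencil grid among reducers can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Hadoop implementation from this project: it uses a 1D grid and a 3-point averaging stencil as a stand-in for the 3D FDTD update equations, and the names `map_partition`, `reduce_partition`, and `parallel_step` are hypothetical. Each slab carries one extra "halo" cell per side so that a reducer can update the cells it owns without seeing the rest of the grid.

```python
def stencil_step(cells):
    """Reference update: each interior cell is a weighted average of its neighbours."""
    out = list(cells)
    for i in range(1, len(cells) - 1):
        out[i] = 0.25 * cells[i - 1] + 0.5 * cells[i] + 0.25 * cells[i + 1]
    return out


def map_partition(grid, num_reducers):
    """Map phase: emit one slab per reducer, padded with one halo cell on each side."""
    n = len(grid)
    size = (n + num_reducers - 1) // num_reducers
    for reducer_id in range(num_reducers):
        start, stop = reducer_id * size, min((reducer_id + 1) * size, n)
        if start >= stop:
            continue  # more reducers than slabs: this reducer gets no work
        lo, hi = max(start - 1, 0), min(stop + 1, n)
        yield reducer_id, (start, stop, lo, grid[lo:hi])


def reduce_partition(start, stop, lo, slab):
    """Reduce phase: update the slab, then drop the halo cells it does not own."""
    updated = stencil_step(slab)
    return updated[start - lo:stop - lo]


def parallel_step(grid, num_reducers):
    """One iteration: partition, update each slab independently, concatenate in order."""
    result = []
    for _, (start, stop, lo, slab) in sorted(map_partition(grid, num_reducers)):
        result.extend(reduce_partition(start, stop, lo, slab))
    return result
```

Because every slab is updated independently, the calls to `reduce_partition` could run on separate nodes; the only data shared between neighbouring slabs is the halo, which has to be re-exchanged at every iteration.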
A stream multiprocessor architecture for bioinformatics applications
Student: Ravikiran Karanam, M.S. (Graduated Dec 2007, Currently at Qualcomm)
Synopsis: Bioinformatics applications such as gene and protein sequence matching algorithms are characterized by the need to process large amounts of data. While uni-processor performance growth is slowing due to an increasing gap between processor and memory speeds and the saturation of processor clock frequencies, GenBank data is doubling every 15 months. With the advent of chip multiprocessor systems, great improvements in processor performance could potentially be achieved by taking advantage of the high inter-processor communication bandwidth and new models of programming. This project proposes a stream chip multiprocessor architecture customized for bioinformatics applications that takes advantage of the memory bandwidth available in chip multiprocessors by exploiting the data parallelism available in bioinformatics applications. The proposed stream chip multiprocessor architecture, implemented on a Xilinx Virtex-5 FPGA with 70 customized MicroBlaze soft processor cores running at 100 MHz and a PCI-bus-limited main memory bandwidth of 0.12 GB/s, achieves a speed-up of 68% for the Smith-Waterman sequence matching algorithm compared to an AMD Opteron. Extrapolating the performance to 1 GHz processor cores with a main memory bandwidth of 2.7 GB/s, a speed-up of 1650% is possible.