Emerging cyber-physical systems such as autonomous vehicles, surveillance, medical monitoring, and smart power grids are compute intensive (tens of teraflops), have tight latency requirements (10–100 ms), and generate massive amounts of data (tens of GB/s). Additionally, privacy, security, and energy consumption are important considerations. As a possible way to meet these requirements, the edge computing paradigm seeks to bring the compute and storage currently done in the cloud to the edge of the network, close to the data source. Our research in this area seeks to rethink computer system design for the edge. Questions of interest include – 1) What is the best way to cooperatively process and store data at the edge and in the cloud such that performance requirements are met energy-efficiently? 2) What type of coupling is required between the traditional computing abstractions to handle the dynamic nature of the edge system? 3) How can security and privacy be ensured at the edge?
We seek to demonstrate solutions to these questions by constructing working prototypes of novel designs.
Current Project – Data Storage Architectures for Pervasive Machine Vision at the Edge
Collaborators – Drs. Tao Han and Hamed Tabkhi (ECE, UNC Charlotte), Dr. Shannon Reid (Criminal Justice, UNC Charlotte), and Dr. Srinivas Pulugurtha (Transportation, UNC Charlotte)
Funding – SCC-Planning: Pedestrian Safe and Secure Communities with Ambient Machine Vision, National Science Foundation, PIs – Tabkhi (lead), Ravindran, Han, Reid, and Pulugurtha.
Paper – Yang Deng, Arun Ravindran, and Tao Han, “Edge Datastore for Distributed Vision Analytics”, Second ACM/IEEE International Symposium on Edge Computing, San Jose/Fremont, CA, October 2017.
As the size and complexity of computing systems continue to scale, from massive data centers to large-scale IoT systems, the ability to meet performance, energy, security, and reliability requirements depends increasingly on responding dynamically to changing operating conditions. Our research in this area investigates the applicability of systems theory techniques (feedback control, modeling, optimization) and artificial intelligence techniques (machine learning, reinforcement learning, deep learning) to building autonomic computing systems. Questions of interest include – 1) What are effective ways to measure/detect events in modern computing systems? 2) What combination of systems theory and artificial intelligence techniques should be used in run-time decision making? 3) What tuning “knobs” should the system be designed with for guiding the system execution trajectory?
We seek to demonstrate solutions to these questions through a combination of theoretical, simulation, and experimental results.
Current Project – Low Latency Microservices Cloud Architectures
Paper – Arun Ravindran and Tyrone Vincent, “Thrifty: A Machine Learning Based Feedforward-Feedback Optimal Control Runtime System for Energy-Efficient Latency-Critical Systems”, 12th International Workshop on Feedback Computing, Columbus OH, July 2017.
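The feedback component of such a runtime can be illustrated with a toy example – an integral controller that adjusts the number of active cores to hold measured latency at a set point, allocating only as many cores as the load requires. This is an illustrative sketch, not the Thrifty controller itself; the latency model, controller gain, and load trace below are all invented for illustration.

```python
# Toy sketch of a feedback runtime (not the Thrifty controller):
# an integral controller adjusts the active core count to track a
# latency set point. The plant model and gain are assumed.

def measured_latency(load, cores):
    """Assumed plant model: latency (ms) scales with load per core."""
    return 20.0 * load / cores

def run(steps=40, target=100.0, ki=0.01):
    acc, history = 0.0, []
    for t in range(steps):
        load = 10.0 if t < 20 else 40.0        # workload step change at t = 20
        cores = min(16.0, max(1.0, 1.0 + acc)) # actuate within the knob's range
        lat = measured_latency(load, cores)
        acc += ki * (lat - target)             # integrate the tracking error
        history.append((cores, lat))
    return history

hist = run()
```

After the load quadruples at t = 20, the integrator ramps the core count up until latency settles back near the 100 ms target; under light load it holds the core count low, which is where the energy savings come from.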
A Cross-Stack Predictive Control Framework for Multicore Real Time Applications
Student: Guangyi Cao, Ph.D. (Graduated Aug 2014)
Many of the next generation applications in entertainment, human computer interaction, infrastructure, security and medical systems are computationally intensive, always-on, and have soft real time (SRT) requirements. While failure to meet deadlines is not catastrophic in SRT systems, missing deadlines can result in an unacceptable degradation in the quality of service (QoS). To ensure acceptable QoS under dynamically changing operating conditions such as changes in the workload, energy availability, and thermal constraints, systems are typically designed for worst case conditions. Unfortunately, such overdesigning of systems increases costs and overall power consumption.
In this work we formulate real-time task execution as a Multiple-Input, Single-Output (MISO) optimal control problem involving tracking a desired system utilization set point with control inputs derived from across the computing stack. We assume that an arbitrary number of SRT tasks may join and leave the system at arbitrary times. The tasks are scheduled on multiple cores by a dynamic-priority multiprocessor scheduling algorithm. We use a model predictive controller (MPC) to realize optimal control. MPCs are easy to tune, can handle multiple control variables, and can enforce constraints on both the dependent and independent variables. We experimentally demonstrate the operation of our controller on a video encoder application and a computer vision application executing on a dual-socket quad-core Xeon processor with a total of 8 processing cores. We establish that the use of DVFS and application quality as control variables enables operation at a lower-power operating point while meeting real-time constraints, as compared to non-cross-stack control approaches. We also evaluate the role of scheduling algorithms in the control of homogeneous and heterogeneous workloads. Additionally, we propose a novel adaptive control technique for time-varying workloads.
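The receding-horizon idea behind MPC can be sketched in a few lines – at each step, candidate control sequences are evaluated against a predictive model, and only the first input of the best sequence is applied. The sketch below is a toy, not the dissertation's controller: the linear utilization model, the DVFS levels, and the cost weights are all assumed, and the discrete level set lets a small exhaustive search stand in for a proper optimizer.

```python
# Toy receding-horizon (MPC-style) controller: track a utilization set
# point by choosing a DVFS frequency level each step. Model coefficients,
# frequency levels, and cost weights are assumed for illustration.
import itertools

A, B = 0.7, 0.5                       # assumed dynamics: u' = A*u + B*(load/f)
FREQ_LEVELS = [1.0, 1.5, 2.0, 2.5]    # hypothetical DVFS levels (GHz)
HORIZON = 3

def predict(u, f, load):
    """One-step utilization model: higher frequency lowers utilization."""
    return A * u + B * (load / f)

def mpc_step(u, load, u_ref):
    """Search all frequency sequences over the horizon; apply only the
    first frequency of the best one (receding horizon)."""
    best_cost, best_f = float("inf"), FREQ_LEVELS[-1]
    for seq in itertools.product(FREQ_LEVELS, repeat=HORIZON):
        cost, u_pred = 0.0, u
        for f in seq:
            u_pred = predict(u_pred, f, load)
            cost += (u_pred - u_ref) ** 2 + 0.01 * f   # tracking + energy cost
        if cost < best_cost:
            best_cost, best_f = cost, seq[0]
    return best_f

# Closed-loop simulation from an over-utilized start.
u = 0.9
for _ in range(10):
    f = mpc_step(u, load=1.0, u_ref=0.6)
    u = predict(u, f, load=1.0)
```

A real MPC would solve a constrained quadratic program rather than enumerate, but the structure – predict over a horizon, optimize, apply the first input, repeat – is the same.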
Publications – (Coming soon)
Statistical machine learning based modeling framework for design space exploration and run-time cross-stack energy optimization for many-core processors
Student: Changshu Zhang, Ph.D. (Graduated May 2012)
Synopsis: The complexity of many-core processors continues to grow as a larger number of heterogeneous cores are integrated on a single chip. Such systems-on-chip contain computing structures ranging from complex out-of-order cores, simple in-order cores, digital signal processors (DSPs), graphics processing units (GPUs), application-specific processors, hardware accelerators, I/O subsystems, and network-on-chip interconnects, to large caches arranged in complex hierarchies. While the industry focus is on putting a larger number of cores on a single chip, the key challenge is to optimally architect these many-core processors such that performance, energy, and area constraints are satisfied. The traditional approach to processor design through extensive cycle-accurate simulations is ill-suited for designing many-core processors due to the large microarchitecture design space that must be explored. Additionally, it is hard to statically optimize such complex processors, and the applications that run on them, at design time such that performance and energy constraints are met under dynamically changing operating conditions.
This project seeks to establish a statistical machine learning based modeling framework that enables the efficient design and operation of many-core processors that meet performance, energy, and area constraints. We apply the proposed framework to rapidly design the microarchitecture of a many-core processor for multimedia, finance, and data mining applications derived from the PARSEC benchmark suite. We further demonstrate the application of the framework in the joint run-time adaptation of both the application and the microarchitecture such that energy availability constraints are met.
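The core workflow – simulate a small sample of the design space, fit a cheap statistical surrogate, then rank the full space with the surrogate – can be sketched as below. This is a minimal illustration, not the dissertation's framework: the synthetic performance function, the two-parameter design space, and the simple least-squares model are all assumed stand-ins for cycle-accurate simulation and the real predictors.

```python
# Sketch of surrogate-based design space exploration: fit a regression
# model on a few "expensive" simulations, then search the whole space
# with the cheap model. The cost model below is synthetic.

def simulate(cores, cache_kb):
    """Stand-in for a cycle-accurate simulation (assumed cost model)."""
    return 10 * cores - 0.05 * cores ** 2 + 0.002 * cache_kb

DESIGN_SPACE = [(c, k) for c in range(2, 65, 2) for k in (256, 512, 1024, 2048)]

# "Simulate" only a thin, spread-out sample of the 128-point space.
train = DESIGN_SPACE[::11]
X = [(1.0, c, c * c, k) for c, k in train]
y = [simulate(c, k) for c, k in train]

def fit_least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y by Gaussian elimination."""
    n = len(X[0])
    A = [[sum(xi[r] * xi[c] for xi in X) for c in range(n)] for r in range(n)]
    b = [sum(xi[r] * yi for xi, yi in zip(X, y)) for r in range(n)]
    for i in range(n):                        # forward elimination with pivoting
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * ai for a, ai in zip(A[r], A[i])]
            b[r] -= f * b[i]
    w = [0.0] * n
    for i in reversed(range(n)):              # back substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

w = fit_least_squares(X, y)

def predict(c, k):
    return w[0] + w[1] * c + w[2] * c * c + w[3] * k

# Rank the full design space with the cheap surrogate.
best = max(DESIGN_SPACE, key=lambda d: predict(*d))
```

The payoff is that only 12 of the 128 configurations are ever "simulated"; the surrogate ranks the rest at negligible cost, which is what makes large microarchitecture spaces tractable.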
Publications (Coming soon)
Effective data parallel computing on multicore processors
Student: Jong Ho Byun, Ph.D. (Graduated Dec 2010, Currently Research Scientist with Republic of Korea Armed Forces)
Synopsis: The rise of chip multiprocessing – the integration of multiple general-purpose processing cores on a single chip (multicores) – has impacted all computing platforms, including high-performance, server, desktop, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory hierarchy friendly software that can effectively exploit the chip-level multiprocessing paradigm of multicores. The goal of this project is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplication (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access patterns. The main contributions of this project include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore-efficient implementations of the above described applications outperform naïve parallel implementations by at least 2x and scale well with problem size and with the number of processing cores.
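The cache-efficiency idea at the heart of this work can be illustrated with loop tiling for matrix multiplication – restructuring the loops so each tile of the operands is reused while it is still hot in cache. The sketch below is in plain Python for clarity; the project's actual implementations add SIMD, prefetching, and per-core partitioning on real hardware.

```python
# Tiled (cache-blocked) matrix multiply: process the matrices in
# tile x tile blocks so each block of B is reused while cached.
def matmul_tiled(A, B, tile=32):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Accumulate the product of the (ii,kk) tile of A
                # and the (kk,jj) tile of B into the (ii,jj) tile of C.
                for i in range(ii, min(ii + tile, n)):
                    Ai, Ci = A[i], C[i]
                    for k in range(kk, min(kk + tile, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + tile, n)):
                            Ci[j] += a * Bk[j]
    return C
```

Choosing `tile` so that three tiles fit in cache turns the memory traffic from O(n³) down toward O(n³/tile), which is the same working-set reasoning the project's algorithm model formalizes.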
Publications (Coming soon)
Design and Implementation of a High Performance Finite Difference Time Domain Algorithm on the Amazon Cloud Using Hadoop
Student: Aby Kuruvilla, M.S. (Graduated, Dec 2010, Currently at Envisage Information Systems)
Synopsis: The National Institute of Standards and Technology defines cloud computing as a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources, such as networks, servers, storage, applications, and services, that can be rapidly provisioned and released with minimal management effort or service provider interaction. With the ever-increasing amounts of scientific data, and the expense, complexity, and maintenance concerns involved in managing in-house compute clusters, considerable interest exists in utilizing the cloud computing paradigm for High Performance Computing (HPC). Programming models such as MapReduce, and its open source implementation, Hadoop, give researchers the ability to develop parallel, fault-tolerant applications capable of exploiting the computing power of the cloud with relative ease.
This project investigates the design and performance issues involved in implementing a high performance 3D Finite Difference Time Domain (FDTD) algorithm on the Amazon public cloud. The FDTD is an example of the iterative stencil algorithms common in scientific computing. The algorithm is parallelized in Hadoop using a partitioning scheme that divides the data among multiple reducers. The scalability of the implementation with an increasing number of reducers and nodes is examined. The average execution time per iteration for a problem size of n = 256 (6 GB) using a 16-instance, 16-reducer configuration was found to be 6x lower than that of the sequential 1-instance, 1-reducer configuration. Adding more instances and carefully choosing the number of reducers would allow the computation of even larger problem sizes without any changes in the implementation. An estimate of the cost-performance tradeoffs in using the Amazon cloud compared to in-house compute cluster solutions is also presented.
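The essence of the partitioning scheme – each reducer updates one slab of the grid but needs a ghost copy of its neighbor's boundary cells every iteration – can be sketched in a few lines. The toy below uses a 1D FDTD-style leapfrog update in plain Python rather than 3D Hadoop code; the coefficient and partition counts are assumed, and each partition redundantly recomputes its ghost cell in place of an actual shuffle phase.

```python
# 1D FDTD-style stencil, serial vs. partitioned-with-ghost-cells.
# The partitioned version mimics what each reducer would compute
# from its slab plus boundary (halo) data. Coefficient C is assumed.
C = 0.5

def fdtd_serial(E, H, steps=1):
    """Reference leapfrog update: H half-step, then E half-step."""
    n = len(E)
    for _ in range(steps):
        for i in range(n - 1):
            H[i] += C * (E[i + 1] - E[i])
        for i in range(1, n):
            E[i] += C * (H[i] - H[i - 1])

def fdtd_partitioned(E, H, parts=4, steps=1):
    """Same update with the domain split into chunks. Each chunk copies
    one ghost cell on the left and recomputes the halo H value locally,
    standing in for the boundary exchange between reducers."""
    n = len(E)
    for _ in range(steps):
        out_E, out_H = E[:], H[:]
        for p in range(parts):
            lo, hi = p * n // parts, (p + 1) * n // parts
            g = max(lo - 1, 0)                      # left ghost index
            e, h = E[g:min(hi + 1, n)], H[g:min(hi + 1, n)]
            for i in range(g, min(hi, n - 1)):      # H update incl. ghost
                h[i - g] += C * (e[i - g + 1] - e[i - g])
            for i in range(max(lo, 1), hi):         # E update, owned cells only
                e[i - g] += C * (h[i - g] - h[i - g - 1])
            out_H[lo:hi] = h[lo - g:hi - g]
            out_E[lo:hi] = e[lo - g:hi - g]
        E[:], H[:] = out_E, out_H
```

Because each partition only ever reads one cell beyond its own slab, the per-iteration data exchanged between reducers is proportional to the surface of the partition, not its volume – the property that makes the MapReduce formulation scale.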
A stream multiprocessor architecture for bioinformatics applications
Student: Ravikiran Karanam, M.S. (Graduated, Dec 2007, Currently at Qualcomm)
Synopsis: Bioinformatics applications such as gene and protein sequence matching algorithms are characterized by the need to process large amounts of data. While uni-processor performance growth is slowing due to an increasing gap between processor and memory speeds and saturation of processor clock frequencies, GenBank data is doubling every 15 months. With the advent of chip multiprocessor systems, great improvements in processor performance could potentially be achieved by taking advantage of the high inter-processor communication bandwidth and new models of programming. This project proposes a stream chip multiprocessor architecture customized for bioinformatics applications that takes advantage of the memory bandwidth available in chip multiprocessors by exploiting the data parallelism available in bioinformatics applications. The proposed stream chip multiprocessor architecture, on a Xilinx Virtex-5 FPGA with 70 customized MicroBlaze soft processor cores running at 100 MHz and a PCI-bus-limited main memory bandwidth of 0.12 GB/s, achieves a speed-up of 68% for the Smith-Waterman sequence matching algorithm compared to an AMD Opteron. Extrapolating the performance for 1 GHz processor cores with a main memory bandwidth of 2.7 GB/s, a speed-up of 1650% is possible.
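For reference, the Smith-Waterman local alignment recurrence that the architecture accelerates is sketched below. The scoring values are assumed for illustration; the key property the stream architecture exploits is that every cell on the same anti-diagonal of the score matrix depends only on earlier anti-diagonals, so those cells can be streamed across the processor cores in parallel.

```python
# Smith-Waterman local alignment score (illustrative scoring scheme).
# Each cell H[i][j] depends on its left, upper, and upper-left
# neighbors, so anti-diagonals can be computed in parallel.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] + gap,     # deletion
                          H[i][j - 1] + gap)     # insertion
            best = max(best, H[i][j])
    return best
```

The zero floor is what makes the alignment local – poor prefixes are discarded rather than accumulated – and the O(len(a) × len(b)) cell updates are the data-parallel workload the 70-core design distributes.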
Publications (Coming soon)