## 30.10 A 1TOPS/W Analog Deep Machine-Learning Engine with Floating-Gate Storage in 0.13µm CMOS

Junjie Lu, Steven Young, Itamar Arel, Jeremy Holleman

University of Tennessee, Knoxville, TN

Direct processing of raw high-dimensional data such as images and video by machine learning systems is impractical both due to prohibitive power consumption and the "curse of dimensionality," which makes learning tasks exponentially more difficult as dimension increases. Deep machine learning (DML) mimics the hierarchical presentation of information in the human brain to achieve robust automated feature extraction, reducing the dimension of such data. However, the computational complexity of DML systems limits large-scale implementations in standard digital computers. Custom analog or mixed-mode signal processors have been reported to yield much higher energy efficiency than DSP [1-4], presenting the means of overcoming these limitations. However, the use of volatile digital memory in [1-3] precludes their use in intermittently-powered devices, and the required interfacing and internal A/D/A conversions add power and area overhead. Nonvolatile storage is employed in [4], but the lack of learning capability requires task-specific programming before operation, and precludes online adaptation.

The feasibility of analog clustering, a key component of DML, has been demonstrated in [5]. In this paper, we present an analog DML engine (ADE) that implements DeSTIN [6], a state-of-the-art DML framework, and features online unsupervised trainability. Floating-gate nonvolatile memory facilitates operation with intermittent harvested energy. An energy efficiency of 1TOPS/W is achieved through massive parallelism, deep weak-inversion biasing, current-mode analog arithmetic, distributed memory, and power gating applied to per-operation partitions. Additionally, algorithm-level feedback desensitizes the system to errors such as offset and noise, allowing reduced device sizes and bias currents.

Figure 30.10.1 shows the architecture of the ADE, in which seven identical cortical circuits (nodes) form a 4-2-1 hierarchy. Each node captures regularities in its inputs through an unsupervised learning process. The lowest layer receives raw data (e.g., the pixels of an image) and continuously constructs belief states that characterize the observed sequence. The inputs of the nodes on the $2^{nd}$ and $3^{rd}$ layers are the belief states of the nodes on their respective lower layers. The beliefs of the top layer are then used as rich features for a classifier.
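The data flow through the 4-2-1 hierarchy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each node takes an 8-D input and emits 4 beliefs (as described for the node), the 32-D raw input is split into four contiguous 8-D slices, and each upper node is fed by two adjacent lower nodes. The slicing, the pairing, and the names `run_hierarchy` and `uniform_node` are hypothetical, and the sketch evaluates layers one after another rather than in parallel as the chip does.

```python
from typing import Callable, List

Vector = List[float]
NodeFn = Callable[[Vector], Vector]  # maps an 8-D input to a 4-D belief state


def uniform_node(inputs: Vector) -> Vector:
    """Placeholder node (assumption): ignores its input, returns uniform beliefs."""
    assert len(inputs) == 8
    return [0.25] * 4


def run_hierarchy(raw: Vector, node: NodeFn = uniform_node) -> List[Vector]:
    """Propagate one 32-D observation through a 4-2-1 hierarchy.

    Returns the belief vectors of all seven nodes, bottom layer first.
    """
    assert len(raw) == 32
    # Layer 1: four nodes, each seeing an 8-D slice of the raw input.
    layer1 = [node(raw[8 * i: 8 * i + 8]) for i in range(4)]
    # Layer 2: two nodes, each fed by the concatenated beliefs of two children.
    layer2 = [node(layer1[2 * i] + layer1[2 * i + 1]) for i in range(2)]
    # Layer 3: a single node fed by the beliefs of both layer-2 nodes.
    top = node(layer2[0] + layer2[1])
    return layer1 + layer2 + [top]


beliefs = run_hierarchy([0.0] * 32)
```

With four beliefs per node, two children supply exactly the 8 inputs an upper node expects, which is why the same node circuit can be reused at every layer.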

The node (Fig. 30.10.2) incorporates an 8×4 array of reconfigurable analog computation cells (RAC), grouped into 4 centroids, each with an 8-dimensional input. The centroids are characterized by their mean $\mu$ and variance $\sigma^2$ in each dimension, stored in their respective floating-gate memories (FGM). In a training cycle, the analog arithmetic elements (AAE) calculate a simplified Mahalanobis distance D<sub>MAH</sub> (assuming a diagonal covariance matrix) between the input observation **o** and each centroid. The 8-D distances are formed by joining, and thereby summing, the per-dimension output currents. A distance processing unit (DPU) performs an inverse-normalization (IN) operation on the 4 distances to construct the belief states, which are the likelihoods that the input belongs to each centroid. The centroid parameters $\mu$ and $\sigma^2$ are then adapted using the online clustering algorithm. The centroid with the smallest Euclidean distance D<sub>EUC</sub> to the input is selected (classification). The errors between the selected centroid and the input are loaded into the training control (TC), and the centroid's memories are then updated proportionally. In recognition mode, only the belief states are constructed and the memories are not adapted. Intra-cycle power gating reduces the power consumption by up to 37%.
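A software sketch of one node cycle may help make the sequence of operations concrete. It assumes the diagonal-covariance distance $\sum_i (o_i-\mu_i)^2/\sigma_i^2$, an inverse normalization of the form $(1/D_k)/\sum_j (1/D_j)$, and a simple proportional update with a hypothetical learning rate `rate`; the chip's actual update magnitudes and circuit non-idealities are not modeled, and all function names are illustrative.

```python
from typing import List, Tuple

Vector = List[float]


def mahalanobis_diag(o: Vector, mu: Vector, var: Vector) -> float:
    """Simplified Mahalanobis distance, assuming a diagonal covariance matrix."""
    return sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))


def euclidean_sq(o: Vector, mu: Vector) -> float:
    """Squared Euclidean distance (preserves the ordering of the true distance)."""
    return sum((x - m) ** 2 for x, m in zip(o, mu))


def inverse_normalize(distances: Vector, eps: float = 1e-12) -> Vector:
    """Invert each distance, then scale the results so they sum to 1."""
    sims = [1.0 / (d + eps) for d in distances]
    total = sum(sims)
    return [s / total for s in sims]


def node_step(o: Vector, mus: List[Vector], variances: List[Vector],
              rate: float = 0.1, train: bool = True) -> Tuple[Vector, int]:
    """One cycle: build beliefs, pick the nearest centroid, optionally adapt it."""
    beliefs = inverse_normalize(
        [mahalanobis_diag(o, m, v) for m, v in zip(mus, variances)])
    winner = min(range(len(mus)), key=lambda k: euclidean_sq(o, mus[k]))
    if train:  # recognition mode skips the memory update
        mu, var = mus[winner], variances[winner]
        for i, x in enumerate(o):
            err = x - mu[i]
            mu[i] += rate * err                   # move the mean toward the input
            var[i] += rate * (err * err - var[i])  # move the variance toward err^2
    return beliefs, winner


mus = [[float(k)] * 8 for k in range(4)]   # 4 centroids, 8 dimensions each
variances = [[1.0] * 8 for _ in range(4)]
beliefs, winner = node_step([0.9] * 8, mus, variances)
```

Calling `node_step(..., train=False)` corresponds to recognition mode, in which only the beliefs are produced.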

Figure 30.10.3 shows the schematic of the RAC, which performs three different operations through reconfigurable current routing. Two embedded FGMs provide nonvolatile storage for the centroid parameters. Capacitive feedback stabilizes the floating-gate voltage ($V_{FG}$) to yield pulse-width-controlled updates. Tunneling is enabled by lowering its supply to bring down $V_{FG}$, increasing the voltage across the tunneling junction. Injection is enabled by raising the source of the injection transistor. This allows randomly accessible bidirectional updates without on-chip high-voltage switches or a charge pump. A 2-T V-I converter then provides a current output and a sigmoid update rule. The FGM consumes 0.5nA of bias current and shows 8b programming accuracy and a 46dB SNR at full scale. The absolute value circuit (ABS) in the AAE rectifies the bidirectional difference current between **o** and $\mu$. Class-B operation and the virtual ground provided by amplifier A allow high-speed resolution of small current differences. The rectified currents are then fed into a translinear $X^2/Y$ circuit, which simulations indicate operates with more than an order of magnitude higher energy efficiency than its digital equivalent.

In the belief construction phase, the DPU (Fig. 30.10.4) inverts the distance outputs from the 4 centroids to calculate similarities, and normalizes them to yield a valid probability distribution. The output belief states are sampled and then held for the rest of the cycle to allow parallel operation of all layers. The sampling switch reduces the current sampling error due to charge injection: a diode-connected PMOS provides a reduced $V_{GS}$ to the switch NMOS to turn it on with minimal channel charge. The S/H achieves less than 0.7mV of charge injection error (2% current error) and less than 14µV of droop, using parasitic capacitance as the holding capacitor. In the classification phase, the IN circuits are reused together with the winner-take-all (WTA) network to assign the observation to the nearest centroid. A starvation trace (ST) circuit addresses unfavorable initial conditions in which some centroids are starved of nearby inputs and never updated. The ST provides starved centroids with a small but increasing additional current to force their occasional selection and pull them into more populated areas of the input space. The lower right of Fig. 30.10.4 shows the TC circuit, which performs current-to-pulse-width conversion using a $V_{DD}$-referenced comparison. Proportional updates cause the mean and variance memories to converge to the sample mean and variance, respectively.
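The interaction between the WTA selection and the starvation trace can be illustrated behaviorally. The sketch below assumes the additional "current" is a bonus added to each centroid's similarity, that it grows linearly by a hypothetical `step` per cycle, and that it resets when the centroid wins; the text only states that the extra current is small and increasing, so the growth law and the reset are assumptions.

```python
from typing import List


class StarvationTrace:
    """Behavioral model of WTA selection with a starvation bonus (assumed form)."""

    def __init__(self, n_centroids: int = 4, step: float = 0.01) -> None:
        self.step = step
        self.bonus = [0.0] * n_centroids  # extra "current" per centroid

    def select(self, distances: List[float]) -> int:
        """Pick the winner by similarity plus bonus, then update the traces."""
        scores = [1.0 / (d + 1e-12) + b
                  for d, b in zip(distances, self.bonus)]
        winner = max(range(len(scores)), key=scores.__getitem__)
        for k in range(len(self.bonus)):
            # The winner's trace resets; every loser's trace keeps growing.
            self.bonus[k] = 0.0 if k == winner else self.bonus[k] + self.step
        return winner


st = StarvationTrace(n_centroids=2, step=0.25)
# Centroid 1 is always far from the input, yet it is eventually selected.
winners = [st.select([1.0, 10.0]) for _ in range(10)]
```

In this toy run the distant centroid loses the first few cycles and is then selected once its accumulated bonus exceeds the similarity gap, which is the occasional forced selection the ST is meant to provide.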

The ADE is evaluated on a custom test board with data acquisition hardware connected to a host PC. The waveforms in Fig. 30.10.5 show the measured beliefs, one from each layer. The sampling of beliefs proceeds from the top layer to the bottom to avoid delays due to output settling. The performance of the node is demonstrated by solving a clustering problem. The input data consist of 4 underlying clusters, each drawn from a Gaussian distribution with a different mean and variance. The node accurately extracts the cluster parameters ($\mu$ and $\sigma^2$), and the ST ensures robust operation against unfavorable initial conditions.
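Why proportional updates recover the cluster parameters can be checked numerically for a single dimension. The sketch below is a software analogue, not a model of the measured chip: it assumes the update rules `mu += rate * (x - mu)` and `var += rate * ((x - mu)**2 - var)` with a hypothetical rate of 0.01, and feeds them seeded Gaussian samples with mean 2.0 and standard deviation 0.5.

```python
import random
from typing import Iterable, Tuple


def track_statistics(samples: Iterable[float], rate: float = 0.01,
                     mu0: float = 0.0, var0: float = 1.0) -> Tuple[float, float]:
    """Run proportional updates over a sample stream; return (mean, variance)."""
    mu, var = mu0, var0
    for x in samples:
        err = x - mu
        mu += rate * err                 # mean moves a fraction of the error
        var += rate * (err * err - var)  # variance tracks the squared error
    return mu, var


rng = random.Random(0)                   # fixed seed for repeatability
data = [rng.gauss(2.0, 0.5) for _ in range(20000)]
mu, var = track_statistics(data)
```

Both estimates settle near the generating values (2.0 and 0.25), fluctuating around them by an amount that shrinks as the update rate is reduced.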

We demonstrate feature extraction for pattern recognition with the setup shown in Fig. 30.10.6. The input patterns are $16 \times 16$ symbol bitmaps corrupted by random pixel errors. An $8 \times 4$ moving window defines the pixels applied to the ADE's 32-D input. First, the ADE is trained unsupervised with examples of the patterns. After the training converges, the 4 belief states from the top layer are used as rich features and classified with a neural network implemented in software, achieving a dimension reduction from 32 to 4. Recognition accuracies of 100% with corruption below 10% and 95.4% with 20% corruption are obtained, comparable to a software baseline, demonstrating robustness to the non-idealities of analog computation.
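The input preparation for this experiment can be sketched as follows. Several details are assumptions, since the text does not specify them: the window is taken as 8 pixels wide by 4 pixels tall, it moves in raster order with non-overlapping strides, and corruption flips each pixel independently with probability `p`. The function names are illustrative.

```python
import random
from typing import Iterator, List

Bitmap = List[List[int]]  # rows of 0/1 pixels


def corrupt(bitmap: Bitmap, p: float, rng: random.Random) -> Bitmap:
    """Flip each pixel independently with probability p (assumed error model)."""
    return [[1 - px if rng.random() < p else px for px in row]
            for row in bitmap]


def sliding_windows(bitmap: Bitmap, width: int = 8, height: int = 4,
                    stride_x: int = 8, stride_y: int = 4) -> Iterator[List[float]]:
    """Yield flattened width*height vectors as the window moves in raster order."""
    rows, cols = len(bitmap), len(bitmap[0])
    for top in range(0, rows - height + 1, stride_y):
        for left in range(0, cols - width + 1, stride_x):
            yield [float(bitmap[r][c])
                   for r in range(top, top + height)
                   for c in range(left, left + width)]


rng = random.Random(1)
symbol = [[0] * 16 for _ in range(16)]  # blank 16x16 bitmap as a stand-in
noisy = corrupt(symbol, p=0.1, rng=rng)
frames = list(sliding_windows(noisy))   # each frame is one 32-D observation
```

With these strides a 16×16 bitmap yields 8 frames of 32 values each; a smaller stride would produce overlapping windows and a longer observation sequence.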

The ADE was fabricated in a 0.13µm CMOS process with thick-oxide I/O FETs. The die micrograph is shown in Fig. 30.10.7, together with a performance summary and a comparison with state-of-the-art bio-inspired parallel processors utilizing analog computation. We achieve 1TOPS/W peak energy efficiency in recognition mode. Compared to the state of the art, this work achieves very high energy efficiency in both training and recognition modes. This, combined with the advantages of nonvolatile memory and unsupervised online trainability, makes it a general-purpose feature extraction engine well suited to autonomous sensory applications or as a building block for large-scale learning systems.
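As a rough consistency check, the efficiency and power figures listed in Fig. 30.10.7 can be converted into throughput and energy per operation. The calculation below assumes that the 11.4µW entry corresponds to recognition mode (it appears next to the 1.04TOPS/W peak figure) and that 27µW and 480GOPS/W both refer to training mode; the helper name is illustrative.

```python
def throughput_ops_per_s(efficiency_ops_per_w: float, power_w: float) -> float:
    """Throughput implied by an energy efficiency (OPS/W) and a power (W)."""
    return efficiency_ops_per_w * power_w


recognition_ops = throughput_ops_per_s(1.04e12, 11.4e-6)  # about 11.9 MOPS
training_ops = throughput_ops_per_s(480e9, 27e-6)         # about 13.0 MOPS

# Energy per operation is the reciprocal of the efficiency.
recognition_pj_per_op = 1e12 / 1.04e12                    # about 0.96 pJ
training_pj_per_op = 1e12 / 480e9                         # about 2.08 pJ
```

Under these assumptions both modes sustain on the order of 10^7 operations per second at microwatt power levels, with training costing roughly twice the energy per operation of recognition.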

## References:

[1] J. Park, et al., "A 646GOPS/W Multi-Classifier Many-Core Processor with Cortex-Like Architecture for Super-Resolution Recognition," *ISSCC Dig. Tech. Papers*, pp. 168-169, Feb. 2013.

[2] J. Oh, et al., "A 57mW Embedded Mixed-Mode Neuro-Fuzzy Accelerator for Intelligent Multi-Core Processor," *ISSCC Dig. Tech. Papers*, pp. 130-132, Feb. 2011.

[3] J.-Y. Kim, et al., "A 201.4GOPS 496mW Real-Time Multi-Object Recognition Processor With Bio-Inspired Neural Perception Engine," *ISSCC Dig. Tech. Papers*, pp. 150-151, Feb. 2009.

[4] S. Chakrabartty and G. Cauwenberghs, "Sub-Microwatt Analog VLSI Trainable Pattern Classifier," *IEEE J. Solid-State Circuits*, vol. 42, no. 5, pp. 1169-1179, May 2007.

[5] J. Lu, et al., "An Analog Online Clustering Circuit in 130nm CMOS," *IEEE Asian Solid-State Circuits Conference*, Nov. 2013.

[6] S. Young, et al., "Hierarchical Spatiotemporal Feature Extraction using Recurrent Online Clustering," *Pattern Recognition Letters*, Oct. 2013.





Figure 30.10.5: Measured output waveforms, clustering and ST test results. For clustering, 2-D results are shown for better visualization.




Figure 30.10.6: Pattern recognition test setup and results, demonstrating accuracy comparable to baseline software simulation.


| Parameter | Value |
|---|---|
| Technology | 0.13µm CMOS |
| Power Supply | 3V |
| Active Area | 0.9mm×0.4mm |
| Memory SNR | 46dB |
| Training Algorithm | Unsupervised Online Clustering |
| Output Feature | Inverse-Normalized Mahalanobis Distance |
| Input-Referred Noise | 56.23pA<sub>rms</sub> |
| System SNR | 45dB |
| I/O Type | Analog Current |
| Operating Frequency (Training Mode) | 4.5kHz |
| Power (Training Mode) | 27µW |
| Energy Efficiency (Training Mode) | 480GOPS/W |
| Energy Efficiency (Recognition Mode) | 1.04TOPS/W |

| | This work | [1] | [2] | [3] |
|---|---|---|---|---|
| Process | 0.13µm | 0.13µm | 0.13µm | 0.13µm |
| Purpose | DML Feature Extraction | Object Recognition | Neural-Fuzzy Accelerator | Object Recognition |
| Non-volatile Memory | Floating Gate | NA | | |
| Power | 11.4µW | 260mW | 57mW | 496mW |
| Peak Energy Efficiency | 1.04TOPS/W | 646GOPS/W | | |

Figure 30.10.7: Die micrograph, performance summary and comparison table.