

## RRAM Fabric for Neuromorphic and Reconfigurable Compute-In-Memory Systems

#### <u>Wei D. Lu</u>

#### University of Michigan Electrical Engineering and Computer Science Ann Arbor, MI, USA



## Outline



- Introduction RRAM Devices
- Improving Computing Efficiency using RRAM Arrays
  - Bring memory as close to logic as possible
  - Neuromorphic computing in artificial neural networks
  - More bio-inspired networks, taking advantage of the internal ionic dynamics
  - In-memory computing for logic and arithmetic operations
- Future reconfigurable systems based on a common physical fabric





# **Need to Rethink Computing**

#### Observations

- Compute is cheap ( < 1pJ)
- Programming an instruction is very expensive (70pJ - fetching an instruction alone is 25pJ)
- DRAM access is another 10-100x more expensive

| Operation    | Integer width | Integer energy | FP width | FP energy |
|--------------|---------------|----------------|----------|-----------|
| Add / FAdd   | 8 bit         | 0.03 pJ        | 16 bit   | 0.4 pJ    |
| Add / FAdd   | 32 bit        | 0.1 pJ         | 32 bit   | 0.9 pJ    |
| Mult / FMult | 8 bit         | 0.2 pJ         | 16 bit   | 1.1 pJ    |
| Mult / FMult | 32 bit        | 3.1 pJ         | 32 bit   | 3.7 pJ    |

| Memory (64-bit access) | Energy     |
|------------------------|------------|
| Cache, 8 KB            | 10 pJ      |
| Cache, 32 KB           | 20 pJ      |
| Cache, 1 MB            | 100 pJ     |
| DRAM                   | 1.3-2.6 nJ |
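To make the gap between compute and data movement concrete, the short calculation below compares the DRAM access energy with a few of the other figures quoted on this slide. It is only arithmetic on the listed numbers; the constant and function names are illustrative and not from the talk. With these values, a DRAM access comes out at roughly 19x to 37x the cost of one instruction, which falls inside the 10-100x range stated above.

```python
# Energy per operation in picojoules, copied from the figures on this slide (45 nm, 0.9 V).
INT32_ADD_PJ = 0.1
FP32_MULT_PJ = 3.7
INSTRUCTION_PJ = 70.0
DRAM_ACCESS_PJ = (1300.0, 2600.0)  # 1.3-2.6 nJ per 64-bit access


def cost_ratio(cost_range_pj, baseline_pj):
    """Return the (low, high) ratio of a cost range to a baseline cost."""
    low, high = cost_range_pj
    return low / baseline_pj, high / baseline_pj


dram_vs_instruction = cost_ratio(DRAM_ACCESS_PJ, INSTRUCTION_PJ)
dram_vs_fp32_mult = cost_ratio(DRAM_ACCESS_PJ, FP32_MULT_PJ)
dram_vs_int32_add = cost_ratio(DRAM_ACCESS_PJ, INT32_ADD_PJ)

for name, (low, high) in [
    ("one instruction", dram_vs_instruction),
    ("32-bit FP multiply", dram_vs_fp32_mult),
    ("32-bit integer add", dram_vs_int32_add),
]:
    print(f"DRAM access vs {name}: {low:.0f}x to {high:.0f}x")
```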



#### Solutions

- Keep data local
- Non-instruction based? how?
- Get rid of DRAM!



Figure 1.1.9: Rough energy costs for various operations in 45nm 0.9V.



Mark Horowitz, ISSCC 2017

# **Rethinking Computing**



#### Before

Optimize architecture and circuit design to minimize compute cost

#### Now

Compute is *cheap*; data and data routing are *expensive*

It's time to rethink computing: fundamentally redesign the architecture from the data-routing point of view, not from the compute point of view

Lu Group U. Michigan

## **Two-Terminal Memory Devices and Crossbar Arrays**



#### » <u>Resistive memory (RRAM), memory + resistor (memristor)</u>

- Simple structure
  - Formed by two-terminal devices
  - Not limited by transistor scaling
- Ultra-high density
  - NAND-like layout, cell size 4F<sup>2</sup>
  - Terabit potential
- Large connectivity
- Application:
  - Memory
  - Neuromorphic
  - General Purpose Computing










#### **Coupled electronic/ionic effects**



#### ElectroChemical Metallization Cell (ECM, CBRAM)

- Creating "new" materials on the fly
- Active electrode material + inert dielectric
- "Filament" based on electrode material injection and redox at electrodes
- Switching layer facilitates ionic movement

#### Valency Change Cell (VCM)

- Modulating existing material properties
- Filament based on oxygen exchange between two oxide layers
- Electrode plays minor role

Y. Yang and W. Lu, Nanoscale, 5, 10076 (2013)







- Ag/SiO<sub>2</sub>/Pt structure, sputtered SiO<sub>2</sub> film
- The filament grows from the inert electrode (IE) backwards toward the active electrode (AE)
- Branched structures were observed, with wider branches pointing toward the AE

Y. Yang, Gao, Chang, Gaba, Pan, and W. Lu, Nature Communications, 3, 732, 2012.



## Microscopic Origin of Dynamic Filament Formation





- Metal inclusions form bipolar electrodes, with redox processes happening at opposite sides
- Dissolution of the original Ag particle leads to new particle nucleation and growth at downstream position
- Resulting in effective Ag particle migration in the electric field direction

Y. Yang, Gao, Li, Pan, Tappertzhofen, Choi, Waser, Valov, W. Lu, Nature Communications 5, 4232, 2014


## **RRAM Resistance Switching Characteristics**







- 1e6 on/off ratio
- 1e8 W/E endurance
- Switching speed ~10 ns

Jo, Kim, W. Lu, Nano Lett., 8, 392 (2008) Kim, Jo, W. Lu, Appl. Phys. Lett. 96, 053106 (2010)


# Integrated RRAM Crossbar/CMOS System





- RRAM array: 100 nm pitch, 50 nm linewidth, with a density of 10 Gbits/cm<sup>2</sup>
- CMOS units are larger, but fewer are needed: 2n CMOS cells control n<sup>2</sup> memory cells

Kim, Gaba, Wheeler, Cruz-Albrecht, Srinivasa, W. Lu, Nano Lett., 12, 389–395 (2012).



## Outline



- Introduction RRAM Devices
- Improving Computing Efficiency using RRAM Arrays
  - Bring memory as close to logic as possible
  - Neuromorphic computing in artificial neural networks
  - More bio-inspired networks, taking advantage of the internal ionic dynamics
  - In-memory computing for logic and arithmetic operations
- Future reconfigurable systems based on a common physical fabric





#### **RRAM as Embedded NVM**

- CMOS Compatible
- **3D** stackable, scalable architecture
- Low thermal budget process
- Architectures proven include multiple Via schemes and Subtractive etching
- Commercial Products offered by several fabs







## Outline



- Introduction RRAM Devices
- Improving Computing Efficiency using RRAM Arrays
  - Bring memory as close to logic as possible
  - Neuromorphic computing in artificial neural networks
  - More bio-inspired networks, taking advantage of the internal ionic dynamics
  - In-memory computing for logic and arithmetic operations
- Future reconfigurable systems based on a common physical fabric



### **RRAM Based Neural Network Hardware**



#### » <u>Synapse – reconfigurable two-terminal resistive switches</u>

» Goal: building bio-inspired, efficient artificial neural networks



## Neuromorphic Computing with RRAM Arrays



### » RRAM arrays perform learning and inference functions

- RRAM weights form dictionary elements (features)
- Image input, pixel intensity represented by widths of pulses
- RRAM array natively performs matrix operation:  $\vec{I} = \vec{v} \cdot \vec{\Phi}$
- Integrate and fire neurons
- Learning achieved by backpropagating spikes
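The matrix operation in the list above can be sketched numerically. The snippet below is a minimal model of an ideal crossbar, assuming each row is driven by a read pulse of fixed amplitude whose width encodes the pixel intensity, so that each column collects a charge proportional to the dot product of the input with that column's conductances. The function name `crossbar_charge`, the read voltage, and all device values are hypothetical and chosen only for illustration; wire resistance, device nonlinearity, and the integrate-and-fire neurons are not modeled.

```python
import numpy as np


def crossbar_charge(pulse_widths, conductances, v_read=0.5):
    """Charge collected on each column of an ideal crossbar.

    Row i receives a pulse of fixed amplitude v_read and width pulse_widths[i];
    column j collects Q_j = v_read * sum_i pulse_widths[i] * conductances[i, j].
    """
    t = np.asarray(pulse_widths, dtype=float)
    g = np.asarray(conductances, dtype=float)
    if t.ndim != 1 or g.ndim != 2 or t.shape[0] != g.shape[0]:
        raise ValueError("expected shapes (rows,) and (rows, cols)")
    return v_read * (t @ g)


# Four pixels with intensities in [0, 1], mapped to pulse widths up to 100 ns.
pixels = np.array([0.0, 0.25, 0.5, 1.0])
pulse_widths = pixels * 100e-9

# Two dictionary elements (columns) stored as conductances in siemens.
dictionary = np.array([
    [1e-6, 5e-6],
    [2e-6, 5e-6],
    [4e-6, 1e-6],
    [8e-6, 1e-6],
])

charge = crossbar_charge(pulse_widths, dictionary)
print(charge)  # one value per column, in coulombs
```

The column with the larger collected charge is the dictionary element that best matches the input, which is the quantity a downstream integrate-and-fire neuron would accumulate.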

M. A. Zidan, J. P. Strachan, and W. D. Lu, Nature Electronics 1: 22–29 (2018)








### **RRAM Network for Image Processing**






## Sparse Coding with RRAM Crossbar






#### Fully integrated RRAM/CMOS chip



Fully integrated chip with all required ADCs, DACs, digital buses, and an on-chip OpenRISC Processor

Can run simple ML models end-to-end

Can be reprogrammed to run different ML models

Cai et al. Nature Electronics, DOI: 10.1038/s41928-019-0270-x

## Outline



- Introduction RRAM Devices
- Improving Computing Efficiency using RRAM Arrays
  - Bring memory as close to logic as possible
  - Neuromorphic computing in artificial neural networks
  - More bio-inspired networks, taking advantage of the internal ionic dynamics
  - In-memory computing for logic and arithmetic operations
- Future reconfigurable systems based on a common physical fabric





### In-memory Arithmetic Computing

- Parallel write enables efficient programming/storage for new functions
- "Computation" involves a simple read operation
- A (binary) RRAM device with low power (I<sub>on</sub> < 100 nA) and high on/off ratio is used for the arithmetic operations



B. Chen, F. Cai, W. Ma, P. Sheridan, W. Lu, IEDM 2015




#### **High Precision Arithmetic Computing**

Solving partial-differential equations (PDEs)



#### Solving an $A \cdot x = b$ problem in matrix form

- Requires high precision and accurate solutions vs. neural networks which can tolerate low precision and inaccurate solutions
- Numerical simulation of water drop in a shallow pool
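As a rough illustration of why this kind of problem maps onto a crossbar, the sketch below solves a small $A \cdot x = b$ system with Jacobi iteration, where the dominant cost of each step is a matrix-vector product of the kind an RRAM array performs natively. This is a generic textbook method chosen for illustration, not necessarily the solver used in the cited work, and the test matrix is a hypothetical diagonally dominant stencil rather than the shallow-water system from the slide.

```python
import numpy as np


def jacobi_solve(a, b, max_iterations=500, tol=1e-12):
    """Solve a @ x = b by Jacobi iteration (converges for diagonally dominant a)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diag = np.diag(a)
    off_diag = a - np.diagflat(diag)
    x = np.zeros_like(b)
    for _ in range(max_iterations):
        # off_diag @ x is the matrix-vector product a crossbar would evaluate.
        x_new = (b - off_diag @ x) / diag
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x


# Hypothetical tridiagonal system, similar in shape to a discretized PDE stencil.
n = 8
a = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

x = jacobi_solve(a, b)
residual = np.max(np.abs(a @ x - b))
print(f"max residual: {residual:.2e}")
```

The residual check is what distinguishes this workload from neural-network inference: the iteration only converges to a useful answer if each matrix-vector product is evaluated with sufficient precision.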

M. A. Zidan, Y.J. Jeong, J. Lee, B. Chen, S. Huang, M. J. Kushner, & W. D. Lu, Nature Electronics, 1, 411-420 (2018)

![](_page_24_Picture_7.jpeg)

#### Hardware Acceleration of Simulated Annealing

![](_page_25_Figure_1.jpeg)

J. Shin, et al. IEDM 2018


## Outline

![](_page_26_Picture_1.jpeg)

- Introduction RRAM Devices
- Improving Computing Efficiency using RRAM Arrays
  - Bring memory as close to logic as possible
  - Neuromorphic computing in artificial neural networks
  - More bio-inspired networks, taking advantage of the internal ionic dynamics
  - In-memory computing for logic and arithmetic operations

- Future reconfigurable systems based on a common physical fabric

![](_page_26_Picture_9.jpeg)

## Dynamically reconfigurable Computing Fabric

![](_page_27_Picture_1.jpeg)

» A reconfigurable hardware system with modular reconfigurable blocks

![](_page_27_Figure_3.jpeg)

- Hierarchically structured interconnects: locally dense connection + globally asynchronous serial link
- Reconfigurable computing modules at both fine-grained and coarse-grained levels

M. Zidan, Y. Jeong, J. H. Shin, C. Du, Z. Zhang, and W. D. Lu, IEEE Trans Multi-Scale Comp Sys, DOI 10.1109/TMSCS.2017.2721160 (2017)

![](_page_27_Picture_7.jpeg)

## Dynamically reconfigurable Computing Fabric

![](_page_28_Picture_1.jpeg)

![](_page_28_Figure_2.jpeg)

- "General" purpose by design: the same hardware supports different tasks (image, video, speech, ...)
- Dense local connection, sparse global connection
- Run-time, dynamically reconfigurable; function defined by software

M. Zidan, Y. Jeong, J. H. Shin, C. Du, Z. Zhang, and W. D. Lu, IEEE Trans Multi-Scale Comp Sys, DOI 10.1109/TMSCS.2017.2721160 (2017)

#### **Possible evolutions**

![](_page_29_Picture_1.jpeg)

![](_page_29_Figure_2.jpeg)

- Memory bottleneck
- Coarse-grained cores → finer-grained cores
  - Faster memory access
- **Device-level computing, in-memory computing**

M. A. Zidan, J. P. Strachan, and W. D. Lu, Nature Electronics 1: 22–29 (2018)

![](_page_29_Picture_9.jpeg)

#### Implementing Large Networks: Modular Systems

![](_page_30_Figure_1.jpeg)

![](_page_30_Picture_2.jpeg)

X. Wang

![](_page_30_Picture_4.jpeg)

Tiled architecture for practical model implementation:

- Weight mapping
- ADC quantization, partial products
- Device and circuit nonideality
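The three items above can be sketched together in a few lines. The example below splits a weight matrix into fixed-size tiles, computes each tile's partial product, passes it through a uniform ADC model, and accumulates the quantized partial sums digitally. It is a simplified sketch under stated assumptions: the tile size, the ADC range rule, and the function names `adc_quantize` and `tiled_vmm` are hypothetical, and device-level nonidealities are left out.

```python
import numpy as np


def adc_quantize(values, bits, full_scale):
    """Uniform signed ADC model: clip to +/- full_scale, round to the nearest step."""
    step = 2.0 * full_scale / (2**bits - 1)
    clipped = np.clip(values, -full_scale, full_scale)
    return np.round(clipped / step) * step


def tiled_vmm(x, weights, tile_rows, tile_cols, adc_bits=8):
    """Compute x @ weights tile by tile, quantizing each tile's partial product."""
    n_in, n_out = weights.shape
    y = np.zeros(n_out)
    for r in range(0, n_in, tile_rows):
        x_slice = x[r:r + tile_rows]
        for c in range(0, n_out, tile_cols):
            tile = weights[r:r + tile_rows, c:c + tile_cols]
            # Assumed ADC range: the largest partial sum this tile could produce.
            full_scale = np.abs(x_slice).sum() * np.abs(tile).max()
            if full_scale == 0.0:
                continue  # an all-zero slice or tile contributes nothing
            partial = adc_quantize(x_slice @ tile, adc_bits, full_scale)
            y[c:c + tile_cols] += partial  # digital accumulation across row tiles
    return y


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=128)
weights = rng.uniform(-1.0, 1.0, size=(128, 64))

y_tiled = tiled_vmm(x, weights, tile_rows=64, tile_cols=32, adc_bits=8)
y_exact = x @ weights
max_error = np.max(np.abs(y_tiled - y_exact))
print(f"max deviation from the exact product: {max_error:.4f}")
```

Because every row tile adds its own quantization error before accumulation, the deviation grows with the number of tiles a layer spans, which is one reason the tiling has to be considered when evaluating model accuracy.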

Wang et al. IEDM 14.4 (2019)

#### **TAICHI: General RRAM IMC Chip Design**

- The chip design should be compatible with a wide range of models.
- A general RRAM IMC chip based on analog RRAM tiles and a heterogeneous NoC structure
- Optimally designed blocks based on four types of compute arrays work well for different popular models.

![](_page_31_Figure_4.jpeg)

#### **TAICHI: Chip Performance Analysis**

- Register and ADC dominate the chip area and power.
- 70TOPS/W (int-8) estimated at 28nm.
- High throughput and energy efficiency for common models based on the single chip design: 1391 FPS/W (ResNet-50), 4602 FPS/W (MobileNet), 646 FPS/W (Inception-v4) and 12911 FPS/W (Transformer).

![](_page_32_Picture_4.jpeg)

![](_page_32_Figure_5.jpeg)

![](_page_32_Figure_6.jpeg)

#### **Effects of Device and Architecture Non-idealities**

![](_page_33_Figure_1.jpeg)

#### **Architecture-Aware Training Topology**

![](_page_34_Picture_1.jpeg)

![](_page_34_Figure_2.jpeg)

- Architecture details need to be included in the training process to produce accurate inference results

Q. Wang, Y. Park, and W. D. Lu, "Device Non-Ideality Effects and Architecture-Aware Training in RRAM In-Memory Computing Modules," ISCAS, 2021

![](_page_34_Picture_5.jpeg)

#### **Inference Accuracy**

- Training: Levels 0-3
- Inference: Tiled Architecture (Level-3)
- Weight Precision: 4bits
- On/off Ratio: 10

- Array size: 256x64
- ADC: 8bit
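To show how the weight precision and on/off ratio listed above constrain what a device can store, the sketch below maps signed weights onto pairs of conductances. It assumes a differential-pair encoding (weight proportional to G+ minus G-) with 4-bit magnitude levels per device and a conductance window of g_max / g_min = 10. The encoding scheme, the normalized conductance units, and both function names are assumptions made for illustration; the slide does not specify how weights are mapped.

```python
import numpy as np


def weights_to_conductance_pairs(w, bits=4, on_off_ratio=10.0, g_max=1.0):
    """Map signed weights to (g_pos, g_neg) pairs inside [g_max / on_off_ratio, g_max]."""
    w = np.asarray(w, dtype=float)
    g_min = g_max / on_off_ratio
    levels = 2**bits - 1
    w_max = np.abs(w).max()
    if w_max == 0.0:
        w_max = 1.0  # avoid division by zero for an all-zero matrix
    # Quantize the magnitude to one of `levels` uniform steps in [0, 1].
    q = np.round(np.abs(w) / w_max * levels) / levels
    g_mag = g_min + q * (g_max - g_min)
    g_pos = np.where(w >= 0, g_mag, g_min)
    g_neg = np.where(w < 0, g_mag, g_min)
    return g_pos, g_neg, w_max


def effective_weights(g_pos, g_neg, w_max, on_off_ratio=10.0, g_max=1.0):
    """Recover the weights represented by a differential conductance pair."""
    g_min = g_max / on_off_ratio
    return (g_pos - g_neg) / (g_max - g_min) * w_max


rng = np.random.default_rng(1)
w = rng.normal(size=(256, 64))  # same shape as the array size listed above

g_pos, g_neg, w_max = weights_to_conductance_pairs(w)
w_eff = effective_weights(g_pos, g_neg, w_max)
quantization_error = np.max(np.abs(w_eff - w))
print(f"largest weight error after mapping: {quantization_error:.4f}")
```

In this model the differential pair cancels the nonzero off-state conductance, so the finite on/off ratio only narrows the usable conductance window; a single-device encoding would instead carry a constant offset that has to be removed elsewhere.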

![](_page_35_Figure_8.jpeg)

- In relatively complex models, the tiled architecture has to be accounted for during training to achieve acceptable accuracy

Q. Wang, Y. Park, and W. D. Lu, "Device Non-Ideality Effects and Architecture-Aware Training in RRAM In-Memory Computing Modules," ISCAS, 2021

#### **Structured Pruning to Fit Larger Models on Chip**

![](_page_36_Figure_1.jpeg)

- Memory capacity is fixed on CIM chips after fabrication, but model sizes keep increasing
- Structured pruning can allow mapping of larger models – tradeoff of compression ratio, pruning granularity, and accuracy
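The tradeoff between compression ratio and pruning granularity can be illustrated with a generic block-wise magnitude pruning routine. The sketch below is not the method from the cited unpublished work; it simply removes whole blocks of weights ranked by their L2 norm, so that pruned blocks need not be mapped to the array at all. The function name `block_prune`, the block shape, and the keep fraction are hypothetical.

```python
import numpy as np


def block_prune(w, block_shape, keep_fraction):
    """Zero out the weight blocks with the smallest L2 norm.

    Returns the pruned matrix and a boolean mask with one entry per block.
    """
    br, bc = block_shape
    rows, cols = w.shape
    if rows % br or cols % bc:
        raise ValueError("matrix dimensions must be multiples of the block shape")
    blocks = w.reshape(rows // br, br, cols // bc, bc)
    norms = np.sqrt((blocks**2).sum(axis=(1, 3)))  # one norm per block
    n_keep = max(1, int(round(keep_fraction * norms.size)))
    threshold = np.sort(norms, axis=None)[-n_keep]
    mask = norms >= threshold
    pruned = blocks * mask[:, None, :, None]
    return pruned.reshape(rows, cols), mask


rng = np.random.default_rng(2)
w = rng.normal(size=(40, 40))

pruned, mask = block_prune(w, block_shape=(4, 4), keep_fraction=0.1)
compression_ratio = mask.size / mask.sum()
print(f"kept {mask.sum()} of {mask.size} blocks, compression {compression_ratio:.1f}x")
```

Smaller blocks give the pruning step more freedom to keep important weights, which tends to help accuracy, while larger blocks align better with whole array rows or columns; choosing the block shape is the granularity tradeoff named above.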

F. Meng et al., unpublished

#### **Fine-Grained Structured Pruning**

![](_page_37_Figure_1.jpeg)

- The proposed fine-grained structured pruning improves accuracy and allows compression ratios up to 10x, enabling the mapping of larger models

F. Meng et al., unpublished

#### Larger Scale Implementations (> 10M devices)

![](_page_38_Picture_1.jpeg)

Fully mapped MobileNet v2 on RRAM chip (no external DRAM)

Streaming images in, streaming classification out (batch = 1)

Nonvolatile - instant on, no data lost during power interrupts

## Conclusions

![](_page_39_Picture_1.jpeg)

- At the module level and at the system architecture level.
- on the cusp of commercialization
- Challenges for large scale implementation can be mitigated through multiple approaches
  - Tiled-architecture implementation
  - Architecture-aware training
  - Fine-grained structured pruning

wluee@umich.edu

![](_page_39_Picture_9.jpeg)

![](_page_39_Picture_10.jpeg)

#### **Acknowledgements**

![](_page_40_Picture_1.jpeg)

#### Grad students:

- \*Sung-Hyun Jo
- \*Kuk-Hwan Kim
- \*Siddharth Gaba
- \*Ting Chang
- \*Patrick Sheridan
- \*ShinHyun Choi
- \*Jiantao Zhou
- \*Chao Du
- Jihang Lee
- Wen Ma
- Fuxi Cai
- Yeonjoo Jeong
- Jong Hong Shin
- John Moon
- Billy Schell
- Qiwen Wang
- \*Eric Dattoli
- \*Wayne Fung
- Lin Chen
- \*Seok-Youl Choi
- \*Woo Hyung Lee

#### **PostDocs:**

- Dr. Mohammed Zidan
- Dr. Xiaojian Zhu
- \*Dr. Yuchao Yang
- \*Dr. Sungho Kim
- \*Dr. Bing Chen
- \*Dr. Taeho Moon
- \*Dr. Zhongqing Ji
- \*Dr. Qing Wan

- Prof. Z. Zhang, Prof. M. Flynn, UM
- Dr. G. Kenyon, LANL
- Prof. C. Teuscher, PSU
- Prof. D. Strukov, UCSB
- Prof. J. Hasler, GeorgiaTech
- Dr. I. Valov, Prof. R. Waser

**Funding:**

- ADA Center through the SRC/DARPA JUMP Program
- DARPA UPSIDE program, DARPA ACCESS program
- National Science Foundation (ECS-0601478, CCF-0621823, ECCS-0804863, CNS-0949667, ECCS-0954621)
- DARPA SyNAPSE program
- Air Force HyNano MURI program, Air Force q-2DEG program

#### **References:**

- M. A. Zidan, J. P. Strachan, and W. D. Lu, Nature Electronics 1: 22–29 (2018)
- Y. Yang and W. Lu, Nanoscale, 5, 10076 (2013)
- Y. Yang, Gao, Chang, Gaba, Pan, and W. Lu, Nature Communications, 3, 732 (2012)
- Y. Yang, Gao, Li, Pan, Tappertzhofen, Choi, Waser, Valov, W. Lu, Nature Communications 5, 4232 (2014)
- S. H. Jo, K.-H. Kim, W. Lu, Nano Lett. 9, 496-500 (2009)
- S. H. Jo, Kim, W. Lu, Nano Lett., 8, 392 (2008)
- K.-H. Kim, S. H. Jo, W. Lu, Appl. Phys. Lett. 96, 053106 (2010)
- K.-H. Kim, S. Gaba, D. Wheeler, J. Cruz-Albrecht, N. Srinivasa, W. Lu, Nano Lett., 12, 389–395 (2012)
- S. H. Jo, T. Kumar, S. Narayanan, W. D. Lu, H. Nazarian, 6.7, IEDM 2014
- S. Kim, S. Choi, W. Lu, ACS Nano, 8, 2369–2376 (2014)
- P. M. Sheridan, F. Cai, C. Du, W. Ma, Z. Zhang, W. D. Lu, Nature Nanotechnology, 12, 784–789 (2017)
- S. Choi, P. Sheridan, J. Shin, W. Lu, Nano Lett. 17, 3113–3118 (2017)
- S. H. Jo, T. Chang, I. Ebong, B. Bhavitavya, P. Mazumder, W. Lu, Nano Lett. 10, 1297 (2010)
- C. Du, W. Ma, T. Chang, P. Sheridan, W. D. Lu, Adv. Func. Mater., 25, 4290 (2015)
- S. Kim, C. Du, P. Sheridan, W. Ma, S. Choi, W. D. Lu, Nano Lett., 15, 2203 (2015)
- C. Du, F. Cai, M. Zidan, W. Ma, W. Lu, Nature Communications, 8: 2204 (2017)
- B. Chen, F. Cai, W. Ma, P. Sheridan, W. Lu, 17.5, IEDM 2015
- M. A. Zidan, Y.J. Jeong, J. Lee, B. Chen, S. Huang, M. J. Kushner, & W. D. Lu, Nature Electronics, 1, 411–420 (2018)
- J. H. Shin, Y. J. Jeong, M. A. Zidan, Q. Wang, W. D. Lu, 3.3, IEDM 2018
- J. Moon, W. Ma, J. H. Shin, F. Cai, C. Du, S. H. Lee, W. D. Lu, *Nature Electronics* https://doi.org/10.1038/s41928-019-0313-3
- M. Zidan, Y. Jeong, J. H. Shin, C. Du, Z. Zhang, and W. D. Lu, IEEE Trans Multi-Scale Comp Sys, 4, 698-710 (2017)
- X. Wang, Q. Wang, S. H. Lee, F. S. Meng, W. D. Lu, IEDM 2019