RTeAAL Sim: Using Tensor Algebra to Represent and Accelerate RTL Simulation

Published in ASPLOS, 2026

@inproceedings{gofetch,
  title = {GoFetch: Breaking Constant-Time Cryptographic Implementations Using Data Memory-Dependent Prefetchers},
  author = {Boru Chen and Yingchen Wang and Pradyumna Shome and Christopher W. Fletcher and David Kohlbrenner and Riccardo Paccagnella and Daniel Genkin},
  booktitle = {USENIX Security},
  year = {2024},
}

@inproceedings{peekawalk,
  author = {Wang, Alan and Chen, Boru and Wang, Yingchen and Fletcher, Christopher and Genkin, Daniel and Kohlbrenner, David and Paccagnella, Riccardo},
  booktitle = {IEEE S\&P},
  title = ,
  year = {2025},
}

@inproceedings{controlled-preemtion,
  author = {Zhu, Yongye and Chen, Boru and Zhao, Zirui Neil and Fletcher, Christopher W.},
  title = {Controlled Preemption: Amplifying Side-Channel Attacks from Userspace},
  year = {2025},
  booktitle = {ASPLOS},
}

@inproceedings{ustt,
  title = {µSTT: Microarchitecture Design for Speculative Taint Tracking},
  author = {Boru Chen and Rutvik Choudhary and Kaustubh Khulbe and Archie Lee and Adam Morrison and Christopher W. Fletcher},
  booktitle = {ICCD},
  year = {2025},
}

@inproceedings{rteaalsim,
  title = {RTeAAL Sim: Using Tensor Algebra to Represent and Accelerate RTL Simulation},
  author = {Yan Zhu and Boru Chen and Christopher W. Fletcher and Nandeeka Nayak},
  year = {2026},
  booktitle = {ASPLOS},
}

RTL simulation on CPUs remains a persistent bottleneck in hardware design. State-of-the-art simulators embed the circuit directly into the simulation binary, resulting in long compilation times and execution that is fundamentally CPU frontend-bound, with severe instruction-cache pressure.

This work proposes RTeAAL Sim, which reformulates RTL simulation as a sparse tensor algebra problem. By representing RTL circuits as tensors and simulation as a sparse tensor algebra kernel, RTeAAL Sim decouples simulation behavior from binary size and makes RTL simulation amenable to well-studied tensor algebra optimizations. We demonstrate that a prototype of our tensor-based simulator, even with a subset of these optimizations, already mitigates the compilation overhead and frontend pressure and achieves performance competitive with the highly optimized Verilator simulator across multiple CPUs and ISAs.
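To make the "circuit as data, simulation as a fixed kernel" idea concrete, the following is a minimal Python sketch. It is not the RTeAAL Sim implementation, and every name in it (`simulate`, `OPS`, `ROW_PTR`, `COL_IDX`, `GATE_OPS`, `full_adder`) is hypothetical. It assumes a purely combinational, single-bit netlist whose gates are already in topological order, and it stores gate connectivity in a CSR-style sparse layout chosen only for illustration. The circuit lives entirely in the arrays, so changing the design changes the data but not the code that runs.

```python
from functools import reduce
import operator

# Gate types are stored as data; the kernel below never changes with the circuit.
OPS = {
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}

def simulate(num_inputs, row_ptr, col_idx, ops, inputs):
    """Evaluate a combinational circuit stored as sparse rows (assumed layout).

    Signals 0..num_inputs-1 are primary inputs; gate g drives signal
    num_inputs + g. Row g, col_idx[row_ptr[g]:row_ptr[g + 1]], lists the
    signals that gate g reads. Gates must be in topological order.
    """
    values = list(inputs) + [0] * len(ops)
    for g, op in enumerate(ops):
        # Gather the operands named by this sparse row, then reduce them.
        operands = [values[c] for c in col_idx[row_ptr[g]:row_ptr[g + 1]]]
        values[num_inputs + g] = reduce(OPS[op], operands)
    return values

# A 1-bit full adder as data. Inputs: a = signal 0, b = signal 1, cin = signal 2.
#   signal 3 = a ^ b        signal 4 = signal 3 ^ cin   (sum)
#   signal 5 = a & b        signal 6 = signal 3 & cin
#   signal 7 = signal 5 | signal 6                      (carry out)
ROW_PTR = [0, 2, 4, 6, 8, 10]
COL_IDX = [0, 1, 3, 2, 0, 1, 3, 2, 5, 6]
GATE_OPS = ["xor", "xor", "and", "and", "or"]

def full_adder(a, b, cin):
    v = simulate(3, ROW_PTR, COL_IDX, GATE_OPS, [a, b, cin])
    return v[4], v[7]  # (sum, carry out)
```

Calling `full_adder(1, 0, 1)` walks the same loop that any other netlist would use. A simulator that compiles the circuit into the binary would instead emit one statement per gate, which is where the code-size and instruction-cache pressure described above comes from.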