
Jason Lau - An FPGA HLS Researcher

Jason Lau is an FPGA HLS researcher whose work focuses on High-Level Synthesis, Customized Computing, and FPGA Abstraction. He currently builds low-latency hardware logic at Jump Trading Group.

Publications (Cited 1000 times)

Automated Design Space Exploration in High-Level Physical Synthesis

Linfeng Du, Jiawei Liang, Jason Lau, Yuze Chi, Yutong Xie, Chunyou Su, Afzal Ahmad, Zifan He, Jake Ke, Jinming Ge, Jason Cong, Wei Zhang, Licheng Guo

ICCAD '25 - The 2025 IEEE/ACM International Conference on Computer-Aided Design

We propose a robust Design Space Exploration (DSE) framework to address the instability and manual complexity of existing High-Level Physical Synthesis (HLPS) flows for multi-die FPGAs. By automating iterative parameter tuning driven by physical implementation metrics and tailored heuristics, our framework delivers consistent timing closure without manual intervention. In evaluations on large-scale designs, the framework achieved an average frequency of 311.06 MHz, outperforming the AMD Vitis/Vivado toolchain by 2.42× and leading academic solutions by 1.67×.
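
The iterative tuning loop such a framework automates can be sketched generically; the names, the cost model, and the hill-climbing heuristic below are hypothetical stand-ins for illustration, not the paper's method:

```cpp
#include <functional>
#include <random>
#include <vector>

// Hypothetical DSE loop sketch: a candidate is a vector of tool parameters;
// evaluate() stands in for a full physical implementation run that reports
// achieved frequency (higher is better).
struct Candidate {
    std::vector<int> params;
    double freq_mhz = 0.0;
};

Candidate explore(const std::function<double(const std::vector<int>&)>& evaluate,
                  const std::vector<int>& seed, int iterations) {
    std::mt19937 rng(42);
    Candidate best{seed, evaluate(seed)};
    for (int it = 0; it < iterations; ++it) {
        // Perturb one parameter at a time (simple hill climbing; a real
        // framework would use implementation metrics to guide the moves).
        Candidate next = best;
        std::uniform_int_distribution<size_t> pick(0, next.params.size() - 1);
        next.params[pick(rng)] += (rng() % 2 == 0) ? 1 : -1;
        next.freq_mhz = evaluate(next.params);
        if (next.freq_mhz > best.freq_mhz) best = next;  // keep only improvements
    }
    return best;
}
```

Real HLPS tuning replaces the toy perturbation with heuristics tailored to floorplanning and pipelining parameters, and each evaluation is an hours-long place-and-route run, which is why automation matters.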

RapidStream IR: Infrastructure for FPGA High-Level Physical Synthesis

Jason Lau, Yuanlong Xiao, Yutong Xie, Yuze Chi, Linghao Song, Shaojie Xiang, Michael Lo, Zhiru Zhang, Jason Cong, Licheng Guo

ICCAD '24 - The 43rd IEEE/ACM International Conference on Computer-Aided Design

We present the concept of high-level physical synthesis (HLPS) and a practical infrastructure for representing the composition of complex FPGA designs and exploring physical optimizations. Our approach introduces a flexible intermediate representation that captures interconnection protocols at arbitrary hierarchical levels, coarse-grained pipelining, and spatial information, enabling reusable passes for design frequency optimization. RapidStream IR improves the frequency of a broad set of mixed-source designs by 7% to 62%.

CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Shixin Ji, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Yiyu Shi, Deming Chen, Jason Cong, Peipei Zhou

TRETS '24 - ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 3, Article No. 51, Pages 1 - 31

We propose CHARM, a framework designed to optimize throughput for end-to-end deep learning applications on the AMD/Xilinx Versal ACAP. To overcome the performance bottlenecks caused by executing small matrix multiply (MM) layers on monolithic accelerators, CHARM composes multiple diverse, concurrent accelerator architectures tailored to varied layer sizes. By utilizing analytical models for design space exploration and automated code generation, CHARM achieves throughput gains of up to 32.51× over monolithic designs across models like BERT and ViT.

Enabling Heterogeneous Computing for Software Developers

Jason Lau (advised by Jason Cong)

Ph.D. Dissertation, University of California, Los Angeles

We introduce Heterosys, an end-to-end framework designed to bridge the gap between high-level software and efficient FPGA implementation. By decoupling algorithmic descriptions from underlying hardware, Heterosys utilizes three core components—HeteroRefactor for automated refactoring and selective offloading, Adroit for frequency-driven architectural optimization, and RapidIR for high-level physical synthesis and floorplanning. Our research demonstrates frequency improvements of 30% to over 100%, resource reductions up to 90%, and a 51% decrease in manual code effort, significantly lowering the barrier to entry for heterogeneous computing.

TAPA: A Scalable Task-Parallel Dataflow Programming Framework for Modern FPGAs with Co-Optimization of HLS and Physical Design

{Licheng Guo*, Yuze Chi*, Jason Lau*}, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong

TRETS '23 - ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 4, Article No. 63, Pages 1 - 31

We propose TAPA, an end-to-end framework for compiling C++ task-parallel programs into high-frequency FPGA accelerators. By utilizing flexible communication APIs and coarse-grained floorplanning during compilation, TAPA enables accurate pipelining of critical paths and optimizes designs for HBM-based FPGAs. Experimental results across 43 designs demonstrate a 102% average frequency improvement, including successful implementation of previously unroutable designs with minimal resource impact.

Cited 50+ times
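
The task-parallel style described above can be illustrated with a plain-C++ analogy: tasks as functions communicating over FIFO streams. This sketch uses std::thread and a mutex-guarded queue in place of TAPA's stream and task APIs, so all names and mechanics here are illustrative, not TAPA's implementation:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Minimal blocking FIFO standing in for a hardware stream channel.
template <typename T>
class Stream {
 public:
  void write(const T& v) {
    std::lock_guard<std::mutex> lk(m_);
    q_.push(v);
    cv_.notify_one();
  }
  T read() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return !q_.empty(); });
    T v = q_.front();
    q_.pop();
    return v;
  }
 private:
  std::queue<T> q_;
  std::mutex m_;
  std::condition_variable cv_;
};

// Two tasks form a dataflow pipeline: a producer streams values and a
// consumer accumulates them. In a task-parallel HLS framework the same
// shape is expressed with stream and task-invocation APIs, and each task
// becomes a concurrently running hardware module.
long run_pipeline(int n) {
  Stream<int> ch;
  long sum = 0;
  std::thread producer([&] {
    for (int i = 0; i < n; ++i) ch.write(i);
  });
  std::thread consumer([&] {
    for (int i = 0; i < n; ++i) sum += ch.read();
  });
  producer.join();
  consumer.join();
  return sum;
}
```

The software analogy also hints at why unconstrained simulation is useful: the same task bodies run as threads on a CPU for fast functional testing before synthesis.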

RapidStream 2.0: Automated Parallel Implementation of Latency-Insensitive FPGA Designs Through Partial Reconfiguration

Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Eddie Hung, Wuxi Li, Jason Lau, Weikang Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, Alireza Kaviani, Zhiru Zhang, Jason Cong

TRETS '23 - ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 4, Article No. 59, Pages 1 - 30

We present RapidStream, a parallelized, physically integrated compilation framework designed to drastically reduce FPGA compile times. By co-optimizing HLS with back-end physical implementation, RapidStream partitions latency-insensitive C/C++ programs for parallel placement and routing. RapidStream achieves a 5–7× reduction in compile time and up to a 1.3× increase in frequency compared to commercial toolchains.

CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou

FPGA '23 - The 2023 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

We propose CHARM, a framework for composing heterogeneous matrix multiply (MM) accelerators to optimize deep learning throughput on platforms like the AMD/Xilinx Versal ACAP. While monolithic accelerators struggle with the small, diverse MM layers found in models like BERT—often achieving less than 5% of peak performance—CHARM utilizes analytical models to partition resources and schedule layers across multiple concurrent, specialized architectures. By automating code generation and design space exploration, CHARM achieves up to 32.5× throughput gains over monolithic designs across BERT, ViT, NCF, and MLP benchmarks.

Cited 50+ times

FPGA HLS Today: Successes, Challenges, and Opportunities

Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, Zhiru Zhang

TRETS '22 - ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4, Article No. 51, Pages 1 - 4

We evaluate the decade-long evolution of FPGA HLS from prototype to industrial deployment across domains like deep learning and genomics. While highlighting HLS successes, we identify critical bottlenecks in clock frequency, system integration, and legacy code support, proposing a roadmap for future research centered on open-source infrastructures and standardization.

Cited 200+ times

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

Linghao Song, Yuze Chi, Atefeh Sohrabizadeh, Young-kyu Choi, Jason Lau, Jason Cong

FPGA '22 - The 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

We present Sextans, a flexible FPGA accelerator for general-purpose SpMM that addresses challenges in random memory access, data movement, and workload balancing. By leveraging HBM and PE-aware scheduling, Sextans enables streaming access and balanced pipelining for arbitrary matrix sizes. Evaluation on 1,400 benchmarks shows that Sextans achieves up to a 2.50× speedup over K80 GPUs, with projected optimizations outperforming V100 GPUs.

Cited 100+ times

TARO: Automatic Optimization for Free-Running Kernels in FPGA High-Level Synthesis

Young-kyu Choi, Yuze Chi, Jason Lau, Jason Cong

TCAD '22 - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

We present TARO, an automated framework for applying free-running optimization to HLS-based streaming applications. By regulating tasks via data streams rather than complex global control, TARO simplifies hardware logic while preserving original functionality and performance. On the Alveo U250, TARO achieves an average reduction of 16% in LUTs and 45% in FFs for systolic array designs.

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs

Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, Jason Cong

FPGA '21 - The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

We introduce AutoBridge, an automated framework that closes the frequency gap between HLS-generated and handcrafted RTL designs by integrating coarse-grained floorplanning with pipelining during HLS compilation. By providing global layout awareness, AutoBridge identifies and pipelines long interconnects—particularly die-crossing wires in multi-die FPGAs—while preventing routing congestion. Across 43 configurations, AutoBridge increased average frequency from 147 MHz to 297 MHz (102% improvement) and successfully routed previously unroutable designs with negligible resource overhead and zero throughput loss.

Best Paper Award
Cited 100+ times

TAPA: Extending High-Level Synthesis for Task-Parallel Programs

Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong

FCCM '21 - The 29th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines

We present TAPA, an automated HLS framework designed to enhance the productivity of task-parallel FPGA accelerators. While traditional HLS struggles with the complexities of parallel task communication and slow development cycles, TAPA introduces a programmer-friendly C++ interface, unconstrained software simulation, and fast hierarchical code generation. By streamlining the development and verification process, TAPA reduces kernel and host code by 22% and 51% respectively, while accelerating correctness verification by 3.2× and QoR tuning by 6.8×.

Cited 50+ times

HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA

{Jason Lau*, Aishwarya Sivaraman*, Qian Zhang*}, Muhammad Ali Gulzar, Jason Cong, Miryung Kim

ICSE '20 - The ACM/IEEE 42nd International Conference on Software Engineering

We propose HeteroRefactor, an automated refactoring framework that enables HLS for traditionally incompatible C/C++ programs containing recursion and dynamic memory. By monitoring FPGA-specific dynamic invariants—such as variable bitwidths and data structure sizes—HeteroRefactor automatically transforms kernels into synthesizable, resource-optimized hardware while using selective CPU offloading to ensure correctness. On Xilinx FPGAs, HeteroRefactor reduces BRAM usage by up to 83% and improves clock frequency by 42%, eliminating the need for extensive manual code refactoring by hardware experts.
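
The kind of dynamic invariant monitored here can be sketched with a small, hypothetical helper: profile the values a variable actually takes at runtime, then size the hardware integer to the observed maximum.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a bitwidth invariant (not the tool's code): given
// the values a variable was observed to hold during profiling, compute the
// narrowest unsigned bitwidth that still represents all of them. An HLS
// flow can then declare, say, a 7-bit integer instead of a 32-bit int,
// trimming FPGA resources, with a runtime guard for out-of-range values.
int observed_bitwidth(const std::vector<uint64_t>& observed) {
  uint64_t max_val = 0;
  for (uint64_t v : observed)
    if (v > max_val) max_val = v;
  int bits = 1;                  // even the value 0 needs one bit
  while (max_val >>= 1) ++bits;  // count significant bits of the maximum
  return bits;
}
```

Since the invariant is observed rather than proven, a production flow pairs the narrowed kernel with a fallback—here, the paper's selective CPU offloading—to preserve correctness when an input exceeds the profiled range.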

Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency

{Licheng Guo*, Jason Lau*}, Yuze Chi, Jie Wang, Cody Hao Yu, Zhe Chen, Zhiru Zhang, Jason Cong

DAC '20 - The 2020 57th ACM/IEEE Design Automation Conference

We present a study on frequency degradation in HLS-generated FPGA designs, identifying broadcast structures—specifically high-fanout data, flow control, and synchronization signals—as the primary bottlenecks. By addressing HLS compiler limitations through broadcast-aware scheduling, synchronization pruning, and skid-buffer integration, our approach improves the maximum frequency of representative benchmarks by 53% on average, with gains exceeding 100 MHz in several cases.

Best Paper Candidate
Cited 50+ times

Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing

{Licheng Guo*, Jason Lau*}, Zhenyuan Ruan, Peng Wei, Jason Cong

FCCM '19 - The 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines

We present a high-performance acceleration framework for chaining, the primary bottleneck in genome sequencing, accounting for 70% of execution time. By reordering operation sequences for hardware-friendly parallelism and implementing a fine-grained task dispatching scheme, we overcome the challenges of variable input sizes and irregular data dependencies. Our fully pipelined streaming architecture on FPGA achieves a 28× speedup over a highly optimized multi-threaded CPU program and 4× over a fully utilized GPU, providing a quantitative guide for selecting optimal hardware platforms for genomic workloads.

Cited 100+ times
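
Chaining itself is a dynamic program over anchor matches. A simplified scalar form, loosely modeled on minimap2-style chaining with an illustrative scoring function (not the accelerator's code), shows the loop-carried dependency the hardware design must break:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Simplified chaining DP with an illustrative score. Each anchor i is a
// point (x[i], y[i]) on the reference/read plane; f[i] is the best chain
// score ending at anchor i:
//   f[i] = max( w, max_{j<i} f[j] + w - gap(j, i) )
// where w is a fixed per-anchor weight and gap penalizes diagonal drift.
// The inner max over all j < i is the sequential bottleneck that the FPGA
// design reorders into a fully pipelined stream.
int chain_score(const std::vector<int>& x, const std::vector<int>& y, int w) {
  int n = static_cast<int>(x.size());
  std::vector<int> f(n, w);
  for (int i = 1; i < n; ++i)
    for (int j = 0; j < i; ++j) {
      if (x[j] >= x[i] || y[j] >= y[i]) continue;  // chain must move forward
      int gap = std::abs((x[i] - x[j]) - (y[i] - y[j]));
      f[i] = std::max(f[i], f[j] + w - gap);
    }
  return *std::max_element(f.begin(), f.end());
}
```

Production chaining bounds the inner loop to a window of predecessors, which is what makes a streaming, fixed-depth hardware pipeline feasible.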

Reproducing Vectorization of the Tersoff Multi-Body Potential on the Intel Skylake and NVIDIA Volta Architectures

Jason Lau, Yuxuan Li, Lei Xie, Qian Xie, Beichen Li, Yu Chen, Guanyu Feng, Jiping Yu, Xinjian Yu, Miao Wang, Wentao Han, Jidong Zhai

Parallel Computing, Volume 78, October 2018, Pages 47-53

We present an evaluation of the Tersoff potential's performance portability on Intel Skylake and NVIDIA Volta architectures. While the original study claimed high efficiency and scalability through reduced precision and cross-platform vectorization, our experiments with updated datasets show inconsistent results. Analysis reveals that communication bottlenecks, triggered by specific characteristics of the new input data, significantly limit the reproducibility of the original performance gains.