Publications

PIGEON: A High Throughput Framework for Private Inference of Neural Networks using Secure Multiparty Computation

Published in 2025 25th Privacy Enhancing Technologies Symposium (PETS), 2025

Privacy-Preserving Machine Learning (PPML) is one of the most relevant use cases for Secure Multiparty Computation (MPC). While private training of large neural networks such as VGG16 or ResNet50 on state-of-the-art datasets such as ImageNet is still out of reach, given the performance overhead of MPC, GPU-based MPC frameworks are starting to achieve practical runtimes for private inference. However, we show that in contrast to plaintext machine learning, the usage of GPU acceleration for both linear (e.g., convolutions) and nonlinear neural network layers (e.g., ReLU) is actually counterproductive in PPML: While GPUs effectively accelerate linear layers compared to CPU-based MPC implementations, the MPC circuits required to evaluate non-linear layers introduce memory overhead and frequent data movement between the GPU and the CPU to handle network communication. This results in slow ReLU performance and high GPU memory requirements in state-of-the-art GPU-based PPML frameworks, hindering them from scaling to an inference throughput of multiple images per second and to more than eight images per batch on ImageNet.
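
To illustrate the core observation, here is a minimal toy sketch (not PIGEON's protocol; it uses two-party additive sharing over the reals purely for exposition): linear layers can be evaluated locally on each party's share, whereas a nonlinear layer such as ReLU cannot and must be evaluated with an interactive sub-protocol, which is the source of the communication and CPU-GPU data movement discussed above.

```python
# Toy illustration (not PIGEON's protocol): with additive secret sharing,
# linear layers are local per-party operations, but ReLU is not.
import numpy as np

rng = np.random.default_rng(0)

def share(x):
    """Split x into two additive shares (over the reals, for illustration only)."""
    r = rng.standard_normal(x.shape)
    return x - r, r

x = rng.standard_normal((4, 8))   # private activations
W = rng.standard_normal((8, 3))   # weights, treated as public in this toy example
x0, x1 = share(x)

# Linear layer: each party multiplies its own share by W; no interaction needed.
y0, y1 = x0 @ W, x1 @ W
assert np.allclose(y0 + y1, x @ W)

# Nonlinear layer: applying ReLU to each share independently is simply wrong, so
# MPC frameworks evaluate a comparison circuit that takes several communication rounds.
wrong = np.maximum(y0, 0) + np.maximum(y1, 0)
right = np.maximum(y0 + y1, 0)
print(np.allclose(wrong, right))  # almost surely False
```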

Recommended citation: Christopher Harth-Kitzerow, Yongqin Wang, Rachit Rajat, Georg Carle, Murali Annavaram, "PIGEON: A High Throughput Framework for Private Inference of Neural Networks using Secure Multiparty Computation," 2025 25th Privacy Enhancing Technologies Symposium (PETS). https://eprint.iacr.org/2024/1371

High-Throughput Secure Multiparty Computation with an Honest Majority in Various Network Settings

Published in 2025 25th Privacy Enhancing Technologies Symposium (PETS), 2025

In this work, we present novel protocols over rings for semi-honest secure three-party computation (3PC) and malicious four-party computation (4PC) with one corruption. While most existing works focus on improving total communication complexity, challenges such as network heterogeneity and computational complexity, which impact MPC performance in practice, remain underexplored.
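
As background for readers unfamiliar with honest-majority protocols over rings, the sketch below shows generic replicated secret sharing over the ring of integers modulo 2^64, a common starting point for semi-honest 3PC; the protocols proposed in the paper are not reproduced here.

```python
# Background sketch (not the paper's protocols): replicated secret sharing over
# the ring Z_{2^64}; each party holds two of the three additive shares.
import secrets

MASK = (1 << 64) - 1  # arithmetic modulo 2^64

def share(x):
    """Split x into three ring shares; party i holds (s[i], s[(i+1) % 3])."""
    s0, s1 = secrets.randbits(64), secrets.randbits(64)
    s2 = (x - s0 - s1) & MASK
    s = [s0, s1, s2]
    return [(s[i], s[(i + 1) % 3]) for i in range(3)]

def reconstruct(party0, party1):
    """Any two parties together hold all three shares."""
    s0, s1 = party0
    _, s2 = party1
    return (s0 + s1 + s2) & MASK

parties = share(123456789)
assert reconstruct(parties[0], parties[1]) == 123456789
# Additions of shared values are purely local; multiplications are where
# different protocols trade off communication and computation.
```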

Recommended citation: Christopher Harth-Kitzerow, Ajith Suresh, Yongqin Wang, Hossein Yalame, Georg Carle, Murali Annavaram, "High-Throughput Secure Multiparty Computation with an Honest Majority in Various Network Settings," 2025 25th Privacy Enhancing Technologies Symposium (PETS). https://arxiv.org/abs/2206.03776

MPC-Pipe: An Efficient Pipeline Scheme for Semi-honest MPC Machine Learning

Published in 2024 ACM 29th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024

Multi-party computation (MPC) has been gaining popularity as a secure computing model over the past few years. However, prior works have demonstrated that MPC protocols still pay substantial performance penalties compared to plaintext execution, particularly when applied to ML algorithms. The overhead is due to added computation and communication costs. Prior studies, as well as our own analysis, found that most MPC protocols today perform communication and computation sequentially: the participating parties must compute on their shares first and then perform data communication to distribute new secret shares before proceeding to the next computation step. In this work, we show that this serialization is unnecessary, particularly in the context of ML computations (both in convolutional neural networks and in Transformer-based models). We demonstrate that the computation and communication steps can be carefully orchestrated to overlap. We propose MPC-Pipe, an efficient MPC system for both training and inference of ML workloads, which pipelines computations and communications in an MPC protocol during the online phase. MPC-Pipe introduces three pipeline schemes to optimize the online phase of ML in the semi-honest majority adversary setting: (1) an inter-linear pipeline, (2) an inner-layer pipeline, and (3) an inter-batch pipeline. The inter-linear pipeline targets linear layers; the inner-layer pipeline targets non-linear layers; and the inter-batch pipeline overlaps communication and computation across different input batches. We implement MPC-Pipe by augmenting a modified version of CrypTen, which separates the online and offline phases. We evaluate the end-to-end system performance benefits of the MPC online phase using deep neural networks (VGG16, ResNet50) and Transformers under different network settings. We show that MPC-Pipe can improve both the throughput and the latency of ML workloads.
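
The sketch below illustrates the inter-batch idea with ordinary Python threads (the function names and timings are placeholders, not MPC-Pipe's implementation): while the shares of batch i are on the wire, the local computation for batch i+1 proceeds.

```python
# Minimal inter-batch pipelining sketch: overlap communication of batch i with
# computation of batch i+1 using a background thread. Placeholders only.
import threading
import time

def compute(batch):
    time.sleep(0.05)          # stand-in for local work on secret shares
    return batch * 2

def communicate(result):
    time.sleep(0.05)          # stand-in for exchanging shares over the network

def run_pipelined(batches):
    pending = None
    for batch in batches:
        result = compute(batch)            # compute the next batch ...
        if pending is not None:
            pending.join()                 # ... while the previous one is on the wire
        pending = threading.Thread(target=communicate, args=(result,))
        pending.start()
    if pending is not None:
        pending.join()

run_pipelined(list(range(8)))   # roughly 8 x 0.05 s instead of 8 x 0.10 s end to end
```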

Recommended citation: Yongqin Wang, Rachit Rajat, Murali Annavaram, "MPC-Pipe: An Efficient Pipeline Scheme for Semi-honest MPC Machine Learning," 2024 ACM 29th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). https://arxiv.org/abs/2209.13643

LAORAM: A Look Ahead ORAM Architecture for Training Large Embedding Tables

Published in 2023 IEEE/ACM 50th International Symposium on Computer Architecture (ISCA), 2023

Data confidentiality is becoming a significant concern, especially in the cloud computing era. Memory access patterns have been demonstrated to leak critical information such as security keys and a program’s spatial and temporal information. This information leak poses an even more significant privacy challenge in machine learning models with embedding tables. Embedding tables are routinely used to learn categorical features from training data. Even knowing the locations of the embedding table entries accessed, not the data within the embedding table, will compromise categorical input data to the model. Embedding entries are privacy-sensitive since they disclose valuable properties about the user. Oblivious RAM (ORAM) and its enhanced variants such as PathORAM have emerged as viable solutions to hide leakage from memory access streams. In this work, we present LAORAM, an ORAM framework explicitly designed to protect user privacy during embedding table training. LAORAM exploits a unique property of training: the training samples used in the future are known beforehand. LAORAM preprocesses the training samples to identify the memory blocks that are accessed together in the near future and tries to assign these blocks to as few paths as possible within the PathORAM infrastructure. LAORAM does this by combining multiple blocks accessed together into superblocks. To further increase performance, LAORAM uses a fat-tree structure for PathORAM, reducing the number of background evictions required, which improves stash usage. We have evaluated LAORAM using embedding table configurations from both a recommendation model (DLRM) and an NLP model (XLM-R). LAORAM performs 5x faster than PathORAM on a recommendation dataset (Kaggle) and 5.4x faster on an NLP dataset (XNLI), while providing the same security guarantees as the original PathORAM.
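
A rough sketch of the preprocessing idea is shown below (the grouping heuristic and names are illustrative; LAORAM's actual superblock construction differs): because the upcoming training samples are known, embedding indices that will be accessed together can be packed into superblocks ahead of time.

```python
# Illustrative superblock formation: greedily pair embedding indices that
# co-occur most often in the known upcoming batches.
from collections import defaultdict

def form_superblocks(future_batches):
    co_count = defaultdict(int)
    for batch in future_batches:
        batch = sorted(set(batch))
        for i, a in enumerate(batch):
            for b in batch[i + 1:]:
                co_count[(a, b)] += 1

    assigned, superblocks = set(), []
    # Most frequently co-accessed pairs become superblocks first.
    for (a, b), _ in sorted(co_count.items(), key=lambda kv: -kv[1]):
        if a in assigned or b in assigned:
            continue
        superblocks.append([a, b])
        assigned.update((a, b))
    return superblocks

future = [[3, 7, 11], [3, 7, 42], [11, 42, 7], [5, 3, 7]]
print(form_superblocks(future))   # [[3, 7], [11, 42]]
```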

Recommended citation: Yongqin Wang*, Rachit Rajat*, Murali Annavaram, "LAORAM: A Look Ahead ORAM Architecture for Training Large Embedding Tables," 2023 IEEE/ACM 50th International Symposium on Computer Architecture (ISCA). [* Equal Contributions] https://arxiv.org/abs/2107.08094

PageORAM: An Efficient DRAM Page Aware ORAM Strategy

Published in 2022 IEEE/ACM 55th International Symposium on Microarchitecture (MICRO), 2022

Leaking memory access addresses can significantly empower an adversary in several computing usage scenarios, from key extraction to disclosing private information. Oblivious RAM has been proposed as a solution to this problem. Oblivious RAM involves reading multiple blocks instead of a single block for each memory access and changing the address of the data block being read after each access. State-of-the-art ORAMs (PathORAM and RingORAM) use a tree-based structure for storing the data. However, ORAM designs pay a performance penalty. One reason is the strict requirement to evict stash blocks onto the particular path that was previously read from. In tree-based ORAMs, each memory block is assigned a random path number, and when accessing a single block, one must fetch all the blocks on that path. Once the path is fetched into a client, the computation on the block is performed and that block is assigned a new random path number. All the blocks that were fetched into the client should be evicted back to memory. However, the eviction process must place the unmodified blocks on the same path that the prior read fetched data from. This eviction requirement may cause a block not to be placed back in the ORAM tree due to limited space in a given tree node. As a result, the client must temporarily hold the block in its stash, which is a secure storage. Every fetch request for a block must search the stash before issuing a request to the ORAM. As the stash size grows, the stash search process becomes a substantial latency hurdle. On the other hand, if the stash is small then the client has to issue dummy reads, which are useless reads of the tree whose sole purpose is to create more opportunities to place stash data back in the tree. An alternate approach used in prior works is to embed dummy data blocks to create large bucket sizes at each tree level to enable a better stash eviction probability. Neither of the above two solutions is palatable in practice. Dummy reads increase memory access latency, while dummy blocks increase the fetch bandwidth needed to bring large buckets from each level into the stash. Furthermore, dummy blocks also decrease the effective memory size available. To solve this problem we propose PageORAM, a novel block eviction and placement strategy. PageORAM makes the critical observation that DRAM is accessed at the granularity of a page (also referred to as a row buffer), which is at least an order of magnitude larger than the tree node size. Thus, a page may hold data blocks from multiple sub-trees. Hence, when fetching a path, PageORAM fetches a few additional sub-paths from the tree that are already present in an open DRAM page. These additional fetches vastly increase stash eviction options by opening up exponentially more data block placement choices. Thus, PageORAM enables a dramatic reduction in stash size without increasing page access counts in DRAM. While this observation may be counter-intuitive, we note that PageORAM reduces the overall bandwidth even after accounting for the increased fetches along the sub-paths. The reason is that by vastly improving stash block placement possibilities, PageORAM can significantly reduce the bucket size of the tree. Our implementation of PageORAM demonstrates an order of magnitude slower stash growth, increased bucket occupancy with useful data, and correspondingly improved memory access latency and reduced memory bandwidth.
In our experiments, we find that PageORAM can either reduce the memory space requirement of tree-based ORAMs by up to 40% compared to a baseline tree-based ORAM or provide a performance improvement of up to 7.8x for a tree-based ORAM with the same structure.
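
The sketch below conveys the page-locality observation with a simplified flat layout (bucket and page sizes are hypothetical, and PageORAM's actual placement strategy differs): buckets near the root of a path land in the same DRAM page, so extra sub-path buckets can be fetched without opening additional pages.

```python
# Simplified page-locality sketch: count how many DRAM pages a single
# PathORAM-style path fetch touches under a flat level-order bucket layout.
BUCKET_BYTES = 256            # hypothetical bucket size
PAGE_BYTES = 8 * 1024         # hypothetical DRAM page (row buffer) size
BUCKETS_PER_PAGE = PAGE_BYTES // BUCKET_BYTES

def page_of(bucket_index):
    """DRAM page holding a bucket, assuming buckets are stored contiguously."""
    return bucket_index // BUCKETS_PER_PAGE

def path_to_root(leaf_index):
    """Bucket indices from a leaf to the root in a 1-indexed heap layout."""
    path, node = [], leaf_index
    while node >= 1:
        path.append(node)
        node //= 2
    return path

leaf = (1 << 14) + 123        # an arbitrary leaf of a 15-level tree
pages = {page_of(b) for b in path_to_root(leaf)}
print(len(pages), "pages opened for a 15-bucket path")
# Buckets that share an already-open page can be fetched and evicted to
# opportunistically, which is what shrinks the stash.
```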

Recommended citation: Rachit Rajat, Yongqin Wang and Murali Annavaram, "PageORAM: An Efficient DRAM Page Aware ORAM Strategy," 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). https://ieeexplore.ieee.org/abstract/document/9923803

Characterization of MPC-based Private Inference for Transformer-based Models

Published in 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2022

In this work, we provide an in-depth characterization study of the performance overhead of running Transformer models with secure multi-party computation (MPC). MPC is a cryptographic framework for protecting both model and input data privacy in the presence of untrusted compute nodes. Our characterization study shows that Transformers introduce several performance challenges for MPC-based private machine learning inference. First, Transformers rely extensively on “softmax” functions. While softmax functions are relatively cheap in a non-private execution, softmax dominates the MPC inference runtime, consuming up to 50% of the total inference runtime. Further investigation shows that computing the maximum, needed to provide numerical stability to softmax, is a key culprit for the increase in latency. Second, MPC relies on approximating the non-linear functions that are part of the softmax computation, and the narrow dynamic ranges make optimizing softmax while maintaining accuracy quite difficult. Finally, unlike CNNs, Transformer-based NLP models use large embedding tables to convert input words into embedding vectors. Accesses to these embedding tables can disclose inputs; hence, additional obfuscation of embedding access patterns is required to guarantee input privacy. One approach to hide the accessed addresses is to convert an embedding table lookup into a matrix multiplication. However, this naive approach increases MPC inference runtime significantly. We then apply tensor-train (TT) decomposition, a lossy compression technique for representing embedding tables, and evaluate its performance on embedding lookups. We show the trade-off between performance improvements and the corresponding impact on model accuracy using detailed experiments.
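
The naive oblivious lookup mentioned above can be sketched in a few lines (plain NumPy, not the MPC implementation): selecting a row via a one-hot matrix multiplication hides which entry is accessed, but it touches the entire table on every lookup, which is why it is so expensive under MPC.

```python
# Sketch of the naive "lookup as matrix multiplication" approach.
import numpy as np

vocab, dim = 10_000, 64
table = np.random.default_rng(1).standard_normal((vocab, dim))

def oblivious_lookup(index):
    one_hot = np.zeros(vocab)
    one_hot[index] = 1.0
    # Under MPC, the one-hot vector and the table would be secret-shared and the
    # product computed with a multiplication protocol, so the access pattern leaks nothing.
    return one_hot @ table

assert np.allclose(oblivious_lookup(42), table[42])
```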

Recommended citation: Yongqin Wang, G. Edward Suh, Wenjie Xiong, Benjamin Lefaudeux, Brian Knott, Murali Annavaram, Hsien-Hsin S. Lee, "Characterization of MPC-based Private Inference for Transformer-based Models," 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). https://ieeexplore.ieee.org/abstract/document/9804616

DarKnight: An accelerated framework for privacy and integrity preserving deep learning using trusted hardware

Published in 2021 IEEE/ACM 54th International Symposium on Microarchitecture (MICRO), 2021

Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train or infer with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. Cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. Tackling such a challenge requires unifying theoretical privacy algorithms with hardware security capabilities. This paper presents DarKnight, a framework for large DNN training while protecting input privacy and computation integrity. DarKnight relies on cooperative execution between trusted execution environments (TEE) and accelerators, where the TEE provides privacy and integrity verification, while accelerators perform the bulk of the linear algebraic computation to optimize the performance. In particular, DarKnight uses a customized data encoding strategy based on matrix masking to create input obfuscation within a TEE. The obfuscated data is then offloaded to GPUs for fast linear algebraic computation. DarKnight’s data obfuscation strategy provides provable data privacy and computation integrity in the cloud servers. While prior works tackle inference privacy and cannot be utilized for training, DarKnight’s encoding scheme is designed to support both training and inference.
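
A heavily simplified sketch of the blinding idea follows (DarKnight's actual encoding combines several inputs via matrix masking and adds integrity verification; the single-input version below only illustrates the linearity trick): the TEE masks the input with noise, the GPU performs the expensive matrix multiplication, and the TEE removes the noise from the result.

```python
# Simplified single-input blinding sketch (not DarKnight's actual encoding).
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((512, 256))   # layer weights offloaded to the GPU
x = rng.standard_normal(512)          # private input, visible only inside the TEE

r = rng.standard_normal(512)          # blinding noise generated inside the TEE
correction = r @ W                    # correction term, can be prepared ahead of time

blinded_out = (x + r) @ W             # this heavy matmul is what runs on the GPU
y = blinded_out - correction          # unblinding happens back inside the TEE
assert np.allclose(y, x @ W)          # the TEE recovers the true layer output
```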

Recommended citation: Hanieh Hashemi, Yongqin Wang, and Murali Annavaram. 2021. DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). https://dl.acm.org/doi/abs/10.1145/3466752.3480112

Origami inference: Private inference using hardware enclaves

Published in 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021

This work presents Origami, a framework which provides privacy-preserving inference for large deep neural network (DNN) models through a combination of enclave execution and cryptographic blinding, interspersed with accelerator-based computation. Origami partitions the ML model into multiple partitions. The first partition receives the encrypted user input within an SGX enclave. The enclave decrypts the input and then applies cryptographic blinding to the input data and the model parameters. The layer computation is offloaded to a GPU/CPU and the computed output is returned to the enclave, which decodes the computation on noisy data using the unblinding factors privately stored within SGX. This process may be repeated for each DNN layer, as has been done in the prior work Slalom. However, the overhead of blinding and unblinding the data is a limiting factor for scalability. Origami relies on the empirical observation that the feature maps after the first several layers cannot be used, even by a powerful conditional GAN adversary, to reconstruct the input. Hence, Origami dynamically switches to executing the rest of the DNN layers directly on an accelerator. We empirically demonstrate that with Origami, a conditional GAN adversary, even with an unlimited inference budget, cannot reconstruct the input. Compared to running the entire VGG-19 model within SGX, Origami improves the performance of private inference from 11x (when using Slalom) to 15.1x.
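
The control flow can be summarized with a small placeholder sketch (the layer functions and the switch point are hypothetical): only the first partition goes through the blinded enclave-assisted path, and the remaining layers run directly on the accelerator.

```python
# Toy partitioned-inference sketch; blinded_layer is a placeholder for the
# blind-offload-unblind round trip described above.
def blinded_layer(x, layer):
    return layer(x)   # placeholder: blind in SGX, offload matmul, unblind in SGX

def run_partitioned(x, layers, switch_at=3):
    for i, layer in enumerate(layers):
        x = blinded_layer(x, layer) if i < switch_at else layer(x)
    return x

print(run_partitioned(1.0, [lambda v: v + 1] * 6))   # 7.0
```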

Recommended citation: Krishna Giri Narra, Zhifeng Lin, Yongqin Wang, Keshav Balasubramanian and Murali Annavaram, "Origami Inference: Private Inference Using Hardware Enclaves," 2021 IEEE 14th International Conference on Cloud Computing (CLOUD) https://ieeexplore.ieee.org/abstract/document/9582200

Byzantine-Robust and Privacy-Preserving Framework for FedML

Published in Security and Safety in Machine Learning Systems ICLR 2021 Workshop, 2021

Federated learning has emerged as a popular paradigm for collaboratively training a model from data distributed among a set of clients. This learning setting presents, among others, two unique challenges: how to protect the privacy of the clients’ data during training, and how to ensure the integrity of the trained model. We propose a two-pronged solution that aims to address both challenges under a single framework. First, we propose to create secure enclaves using a trusted execution environment (TEE) within the server. Each client can then encrypt their gradients and send them to verifiable enclaves. The gradients are decrypted within the enclave without the fear of privacy breaches. However, robustness check computations in a TEE are computationally prohibitive. Hence, in the second step, we perform a novel gradient encoding that enables TEEs to encode the gradients and then offload Byzantine check computations to accelerators such as GPUs. Our proposed approach provides theoretical bounds on information leakage and offers a significant speed-up over the baseline in empirical evaluation.
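
As a rough illustration of the kind of Byzantine check that would be offloaded to a GPU (the paper's gradient encoding and its specific robustness test are not reproduced here), the sketch below filters client updates by their distance to the coordinate-wise median.

```python
# Generic distance-based Byzantine filter on stacked client gradients.
import numpy as np

def filter_outliers(grads, keep_fraction=0.8):
    """Average the gradients closest (in L2) to the coordinate-wise median."""
    grads = np.asarray(grads)
    median = np.median(grads, axis=0)
    dists = np.linalg.norm(grads - median, axis=1)
    keep = np.argsort(dists)[: int(len(grads) * keep_fraction)]
    return grads[keep].mean(axis=0)

rng = np.random.default_rng(3)
honest = rng.normal(0.0, 0.1, size=(8, 4))
poisoned = rng.normal(5.0, 0.1, size=(2, 4))   # Byzantine updates
print(filter_outliers(np.vstack([honest, poisoned])))
```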

Recommended citation: Hanieh Hashemi, Yongqin Wang, and Murali Annavaram. 2021. Byzantine-Robust and Privacy-Preserving Framework for FedML. https://arxiv.org/pdf/2105.02295

The BlackParrot Processor: An Open-Source Industrial-Strength RV64G Multicore Processor

Published in Defense Technical Information Center, 2019

In this paper, we introduce the BlackParrot multicore processor, a mainstream industrial-strength open-source implementation of the RISC-V RV64G architecture. BlackParrot is a clean-slate processor design with a lean, energy-efficient, and highly performant implementation. The key differentiator between BlackParrot and prior RISC-V processor efforts is that our goal is to distribute stewardship across industry and government stakeholders instead of adopting a freemium model where the source is controlled by a private startup that does not release the actual code it tapes out. Our approach enables a pathway for creating the equivalent of Linux for RISC-V: a truly open RISC-V processor to power the open-source hardware revolution and the Age of Bespoke Silicon.
