Sreepathi Pai

Postdoctoral Fellow (ISS)
ICES, The University of Texas, Austin
Nullius in verba

I have moved to the University of Rochester.


I am an experimental computer systems researcher interested in the performance of computer programs. To that end, I work in computer architecture, compilers and the implementation of programming languages.

I work on heterogeneous accelerator-based systems consisting of CPUs and GPUs. My most recent work has revolved around optimizing compilers for high-performance irregular/graph algorithms on GPUs. I am also developing performance models for graph algorithms on GPUs. If you are developing hand-written irregular/graph algorithm implementations for GPUs, please consider dropping me a line to get access to our compiler.

At ISS, I worked on the IrGL compiler and the LonestarGPU suite, among other things.

I received my PhD from the Indian Institute of Science, where I was advised by Prof. R. Govindarajan and Prof. Matthew Jacob T.

Publications


  1. Sreepathi Pai, M. Amber Hassaan, Keshav Pingali, An Operational Performance Model of Breadth-First Search, AGP@ISCA 2017, Toronto, Canada, June 2017 [pdf]

  2. Tal Ben-Nun, Michael Sutton, Sreepathi Pai, Keshav Pingali, Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations, PPoPP 2017, Austin, TX, USA, February 2017 (Best Paper Nominee) [pdf] [source]

  3. Xulong Tang, Ashutosh Pattnaik, Huaipan Jiang, Onur Kayiran, Adwait Jog, Sreepathi Pai, Mohamed Ibrahim, Mahmut T. Kandemir, Chita Das, Controlled Kernel Launch for Dynamic Parallelism in GPUs, HPCA 2017, Austin, TX, USA, February 2017

  4. Sreepathi Pai, Keshav Pingali, A Compiler for Throughput Optimization of Graph Algorithms on GPUs, OOPSLA '16, Amsterdam, Netherlands, November 2016 [preprint]

  5. Sreepathi Pai, Keshav Pingali, Modeling Performance of Graph Programs on GPUs in a Compiler, ModSim 2016, Seattle, WA, USA, August 2016 [abstract pdf] [slides]

  6. Rashid Kaleem, Anand Venkat, Sreepathi Pai, Mary Hall, Keshav Pingali, Synchronization Trade-offs in GPU implementations of Graph Algorithms, IPDPS '16, Chicago, IL, USA, May 25, 2016 [pdf]

  7. Rashid Kaleem, Sreepathi Pai, Keshav Pingali, Stochastic gradient descent on GPUs, Proceedings of the 8th Workshop on General Purpose Processing using GPUs, GPGPU 8, San Francisco, CA, USA, February 8, 2015 [pdf]

  8. Sreepathi Pai, R. Govindarajan, Matthew J. Thazhuthaveetil, Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels (Poster), PACT '14, Edmonton, AB, Canada, August 24, 2014 [extended abstract pdf] [poster pdf] [full pre-review paper]

  9. Sreepathi Pai, Matthew J. Thazhuthaveetil, R. Govindarajan, Improving GPGPU Concurrency with Elastic Kernels, ASPLOS '13, Houston, USA, March 20, 2013 [abstract] [pdf] [source code]

  10. Sreepathi Pai, R. Govindarajan, Matthew J. Thazhuthaveetil, Fast and Efficient Automatic Memory Management for GPUs using Compiler-Assisted Runtime Coherence Scheme, PACT '12, Minneapolis, USA, September 19, 2012 [abstract] [pdf] [source code]

  11. Sreepathi Pai, R. Govindarajan, M. J. Thazhuthaveetil, PLASMA: Portable Programming for SIMD Heterogeneous Accelerators, Workshop on Language, Compiler, and Architecture Support for GPGPU, held in conjunction with HPCA/PPoPP 2010, Bangalore, India, January 9, 2010 [abstract] [pdf]

  12. Sreepathi Pai, R. Govindarajan, M. J. Thazhuthaveetil, Limits of Data-Level Parallelism, Poster session at the Fourteenth International Conference on High Performance Computing (HiPC 2007), Goa, India, December 18–21, 2007 [abstract] [pdf]


Recent Invited Talks

Other Technical Stuff

"Microbenchmarking Unified Memory in CUDA 6.0", which looks at CUDA Unified Memory performance on the Kepler K20Xm.

"How the Fermi Thread Block Scheduler Works (Illustrated)", if you've ever wondered.