1. Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor;Proceedings of the 51st International Conference on Parallel Processing;2022-08-29
2. Optimizing GPU Deep Learning Operators with Polyhedral Scheduling Constraint Injection;2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2022-04-02
3. LoopOpt;Proceedings of the 24th International Workshop on Software and Compilers for Embedded Systems;2021-11
4. Inter-loop optimization in RAJA using loop chains;Proceedings of the ACM International Conference on Supercomputing;2021-06-03
5. Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture;The Journal of Supercomputing;2021-05-15