Affiliation:
1. University of Michigan, Ann Arbor, Michigan
2. Intel Corporation, Santa Clara, California
Abstract
In the design of mobile systems, hardware/software (HW/SW) co-design offers important advantages: specialized hardware can be created for performance or power optimization. Dynamic binary translation (DBT) is a key component of co-design. During translation, a dynamic optimizer in the DBT system applies various software optimizations to improve the quality of the translated code. Because optimization happens at run time, its cost is an exposed overhead, and useful analyses are often ruled out as too expensive. A dynamic optimizer must therefore make smart decisions with limited analysis information, which complicates the design of optimization decision models and often causes human-made heuristics to fail. In mobile systems, the problem is even harder because of strict constraints on compute capability and memory size.
To overcome this challenge, we investigate the opportunity to build practical optimization decision models for DBT using machine learning techniques. As a first step, loop unrolling is chosen as the representative optimization. We base our approach on an industrial-strength DBT infrastructure and evaluate it on 17,116 unrollable loops collected from 200 benchmarks and real-life programs across various domains. Using all available features that are potentially important for the loop unrolling decision, we identify the best classification algorithm for our infrastructure, considering both prediction accuracy and cost. A greedy feature selection algorithm is then applied to that classifier to identify its significant features and cut down the feature space. Keeping only the significant features, the best affordable classifier, which fits within the budget allocated to the decision process, predicts the optimal unroll factor with 74.5% accuracy and achieves an average 20.9% reduction in dynamic instruction count during steady-state execution of the translated code. For comparison, the best baseline heuristic achieves 46.0% prediction accuracy and an average 13.6% instruction count reduction. Given that the infrastructure is already highly optimized and the ideal upper bound on instruction reduction is 23.8%, we believe this result is noteworthy.
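To make the representative optimization concrete, the sketch below illustrates what an unroll factor means; it is a hedged, Python-level illustration only, not code from the paper's DBT infrastructure, and the function names are hypothetical.

```python
def sum_rolled(xs):
    # Original loop: one element per iteration, so the loop-control
    # overhead (bounds check, increment) is paid on every element.
    total = 0
    for x in xs:
        total += x
    return total

def sum_unrolled_by_4(xs):
    # The same loop unrolled with factor 4: four additions per
    # iteration amortize the loop-control overhead, at the cost of
    # larger code. A cleanup loop handles the len(xs) % 4 leftovers.
    total = 0
    n = len(xs)
    i = 0
    while i + 4 <= n:
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
        i += 4
    while i < n:  # cleanup for the remaining iterations
        total += xs[i]
        i += 1
    return total
```

The trade-off sketched here (reduced per-iteration overhead versus code growth and cleanup cost) is exactly why choosing the optimal unroll factor per loop is a non-trivial prediction problem.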
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Software
Cited by 2 articles.