Abstract
SIMD extensions play an increasingly important role in high-performance computing and artificial intelligence. To exploit these components fully, manufacturers and research institutions have developed many optimizations for SIMD extensions, one of which is dual SIMD extension pipeline optimization. This method unrolls vectorizable loops so that the compiler generates instructions suited to parallel execution on the processor's integrated dual SIMD extension. It is implemented in GCC, a mainstream compiler, as an optimization pass and can be enabled with a single compilation option. Experiments were conducted on an SW421 processor using standard benchmark suites such as SPEC CPU 2006 and NPB. The results show that, with the pass enabled, programs generated by the compiler can fully utilize the dual SIMD extension during execution. Compared with enabling only the auto-vectorization option, the method accelerates multiple applications in the benchmark suites, improving execution efficiency by 5.6% on average and by up to 14%.
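To make the unrolling idea concrete, the following is a minimal source-level sketch, not the GCC pass itself. It assumes a vector width of four floats (`VLEN`, chosen only for illustration and not taken from the SW421 specification), and the function names `saxpy_simple` and `saxpy_unrolled2` are hypothetical. The unrolled version exposes two independent vector-sized blocks per iteration, which is the kind of instruction-level parallelism a dual SIMD pipeline could issue side by side.

```c
#include <stddef.h>

/* Assumed vector width in floats; illustrative only, not the SW421's actual width. */
#define VLEN 4

/* Baseline loop: an auto-vectorizer would typically turn this into
 * one vector operation per vectorized iteration. */
void saxpy_simple(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Unrolled by two vector widths. The two inner blocks touch disjoint
 * elements, so neither depends on the other's result; a compiler could
 * map each block to one vector instruction and schedule the pair on
 * two SIMD pipelines. */
void saxpy_unrolled2(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;

    for (; i + 2 * VLEN <= n; i += 2 * VLEN) {
        /* First vector-sized block. */
        for (size_t k = 0; k < VLEN; k++)
            y[i + k] += alpha * x[i + k];
        /* Second, independent vector-sized block. */
        for (size_t k = 0; k < VLEN; k++)
            y[i + VLEN + k] += alpha * x[i + VLEN + k];
    }

    /* Scalar remainder for the elements that do not fill a full pair of blocks. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

Both functions compute the same result; the difference lies only in how much independent work each loop iteration presents to the instruction scheduler. Whether this yields a speedup depends on the target's pipelines and on memory bandwidth.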