Abstract
In this paper, we conduct a measurement study to comprehensively compare the accuracy impacts of multiple embedding options in cryptographic API completion tasks. Embedding is the process of automatically learning vector representations of program elements. Our measurement focuses on design choices in three important aspects: program analysis preprocessing, token-level embedding, and sequence-level embedding. Our findings show that program analysis is necessary even under advanced embedding. The results show a 36.20% accuracy improvement on average when program analysis preprocessing is applied to transform bytecode sequences into API dependence paths. With program analysis and token-level embedding training, the embedding dep2vec improves the task accuracy from 55.80% to 92.04%. Moreover, training the expensive sequence-level embedding yields only a slight accuracy advantage (0.55% on average) over the token-level embedding. Our experiments also suggest that the data make a difference. In the cross-app learning setup and a data scarcity scenario, sequence-level embedding matters more and results in a more pronounced accuracy improvement (5.10%).
Publisher
Association for Computing Machinery (ACM)