Measurement of Embedding Choices on Cryptographic API Completion Tasks

Published:2023-10-17
ISSN:1049-331X
Container-title:ACM Transactions on Software Engineering and Methodology
language:en
Short-container-title:ACM Trans. Softw. Eng. Methodol.

Author:

Xiao Ya¹, Song Wenjia¹, Ahmed Salman¹, Ge Xinyang², Viswanath Bimal¹, Meng Na¹, Yao Danfeng (Daphne)¹

Affiliation:

1. Virginia Tech, USA

2. Databricks, USA

Abstract

In this paper, we conduct a measurement study to comprehensively compare the accuracy impacts of multiple embedding options in cryptographic API completion tasks. Embedding is the process of automatically learning vector representations of program elements. Our measurement focuses on design choices in three important aspects: program analysis preprocessing, token-level embedding, and sequence-level embedding. Our findings show that program analysis is necessary even under advanced embedding. The results show a 36.20% accuracy improvement on average when program analysis preprocessing is applied to transform bytecode sequences into API dependence paths. With program analysis and token-level embedding training, the embedding dep2vec improves the task accuracy from 55.80% to 92.04%. Moreover, only a slight accuracy advantage (0.55% on average) is observed when training the expensive sequence-level embedding compared with the token-level embedding. Our experiments also suggest that the data makes a difference: in the cross-app learning setup and a data scarcity scenario, sequence-level embedding is more necessary and yields a more pronounced accuracy improvement (5.10%).
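
To make the idea of a token-level embedding concrete, the following is a minimal sketch in Python. It is not the paper's dep2vec or its training procedure: it swaps in simple co-occurrence counts as the "embedding", and the API paths in PATHS are hypothetical examples written for illustration, not data from the study. The function names build_embeddings, cosine, and most_similar are likewise assumptions.

```python
from collections import Counter
from math import sqrt

# Hypothetical API paths (illustrative only, not taken from the paper's dataset).
PATHS = [
    ["KeyGenerator.getInstance", "KeyGenerator.init",
     "KeyGenerator.generateKey", "Cipher.init"],
    ["Cipher.getInstance", "Cipher.init", "Cipher.doFinal"],
    ["MessageDigest.getInstance", "MessageDigest.update", "MessageDigest.digest"],
    ["KeyGenerator.getInstance", "KeyGenerator.generateKey",
     "Cipher.init", "Cipher.doFinal"],
]


def build_embeddings(paths, window=1):
    """Map each API token to a vector of neighbor co-occurrence counts.

    This is a stand-in for a learned embedding: tokens that appear next to
    the same neighbors end up with similar vectors.
    """
    vocab = sorted({tok for path in paths for tok in path})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = {tok: Counter() for tok in vocab}
    for path in paths:
        for i, tok in enumerate(path):
            lo, hi = max(0, i - window), min(len(path), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tok][index[path[j]]] += 1
    # Expand the sparse counters into dense vectors of length len(vocab).
    return {tok: [float(counts[tok][k]) for k in range(len(vocab))]
            for tok in vocab}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(token, vectors):
    """Return the other token whose vector is closest to `token`'s vector."""
    return max((t for t in vectors if t != token),
               key=lambda t: cosine(vectors[token], vectors[t]))


vectors = build_embeddings(PATHS)
print(most_similar("Cipher.init", vectors))
```

In this sketch, tokens from unrelated paths (for example a MessageDigest call and a Cipher call) share no neighbors and therefore have zero cosine similarity, which is the kind of signal a completion model can exploit when ranking candidate APIs.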

Publisher

Association for Computing Machinery (ACM)

Subject

Software

Link

https://dl.acm.org/doi/pdf/10.1145/3625291

Reference: 60 articles.

1. Evaluation of static vulnerability detection tools with Java cryptographic API benchmarks; Afrose Sharmin; IEEE Transactions on Software Engineering, 2022

2. The adverse effects of code duplication in machine learning models of code

3. Suggesting accurate method and class names

4. Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to represent programs with graphs. In International Conference on Learning Representations (ICLR).

5. code2vec: learning distributed representations of code