Affiliation:
1. Department of Cyber Security, Air University, Islamabad, ICT, Pakistan
Abstract
Data collection methods and techniques are increasingly used to address intelligence needs, in particular to train models that predict accurate information. Open-source intelligence (OSINT) can now incorporate machine learning (ML) to correlate diverse data types, such as text, images, audio, and video. In this research, we focused on an essential yet underdeveloped aspect of OSINT: extracting insights from audio data for military intelligence, especially in the context of Pakistan's defence. We developed tools for analyzing the growing volume of audio data and propose a novel method to extract accurate information for intelligence purposes, specifically targeting key entities such as Location, Rank, Operation, Date, and Weapon in military contexts. First, we developed a unique dataset of 2000 transcribed sentences, annotated for these entities using an open-source NER annotator. Then, we trained four customized models using advanced NLP frameworks such as Hugging Face's Transformers (DistilBERT), spaCy, NLTK, and Stanford CoreNLP, and assessed them to determine their practical use in intelligence contexts. The evaluation of the selected models indicates that AI-based techniques are crucial for enhancing intelligence gathering in the dynamic OSINT landscape. The results also demonstrate the potential of AI integration in OSINT for audio data processing in military intelligence.