Abstract
The exponential growth of video‐sharing platforms such as YouTube and Netflix has made video content available to virtually everyone with minimal restrictions. While this proliferation offers a wide variety of content, it also introduces challenges, most notably the increased exposure of children and adolescents to potentially harmful material, particularly explicit content. Despite ongoing efforts to develop content moderation tools, a research gap remains in creating comprehensive solutions that can reliably estimate users’ ages and accurately classify the many forms of inappropriate video content. This study aims to bridge this gap by introducing VideoTransformer, which combines the strengths of two existing models: AgeNet and MobileNetV2. To evaluate the effectiveness of the proposed approach, this study uses a manually annotated video dataset collected from YouTube, spanning multiple categories: safe, real violence, drugs, nudity, simulated violence, kissing, pornography, and terrorism. Compared with existing models, the proposed VideoTransformer demonstrates significant performance improvements in two distinct accuracy evaluations. It achieves an accuracy of 96.89% in a 5‐fold cross‐validation setup, outperforming NasNet (92.6%), EfficientNet‐B7 (87.87%), GoogLeNet (85.1%), and VGG‐19 (92.83%). In a single run, it maintains a consistent accuracy of 90%. The proposed model also attains an F1‐score of 90.34%, indicating a well‐balanced trade‐off between precision and recall. These findings highlight the potential of the proposed approach to advance content moderation and enhance user safety on video‐sharing platforms. We envision deploying the proposed methodology in real‐time video streaming to mitigate the spread of inappropriate content, thereby raising online safety standards.
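The F1‐score cited above is the harmonic mean of precision and recall, which is why it signals a balanced trade‐off between the two. The following minimal sketch illustrates this property; the precision and recall values used are hypothetical examples, not figures reported in this study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced case: when precision and recall are equal, F1 equals both.
print(f1_score(0.90, 0.90))          # 0.9

# Imbalanced case: high precision with poor recall drags F1 down,
# because the harmonic mean is dominated by the smaller value.
print(round(f1_score(0.95, 0.50), 4))
```

Because the harmonic mean penalizes disparity, an F1‐score near the reported overall accuracy (90.34% vs. 90%) suggests the classifier is neither over‐precise at the expense of recall nor vice versa.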