E-commerce continues to grow rapidly, and computer vision has become an important part of how e-commerce platforms operate. Product images carry information that is useful for predicting sales, and deep learning offers a practical way to extract it. Unlike pipelines built on hand-crafted feature extractors, deep learning-based computer vision models learn image features directly from data. Visual cues such as color, shape, and texture can thus be extracted from product images, giving sales prediction models more representative and diverse inputs. This study proposes ResNet-101 as the image feature extractor, which learns rich visual features automatically and supplies high-quality image representations for subsequent analysis. In addition, a bidirectional attention mechanism is introduced to dynamically capture correlations between modalities and to fuse the multimodal features.
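To make the fusion idea concrete, the following is a minimal sketch of one possible bidirectional attention scheme, not the study's actual implementation. It assumes plain scaled dot-product cross-attention without learned projections, mean pooling, and concatenation. The function names (`cross_attend`, `bidirectional_fusion`) and the toy feature vectors are hypothetical; in practice the image vectors would come from ResNet-101 and the second set from another modality.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query row attends over keys_values.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the other modality's vectors for this query.
        attended.append([
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(d)
        ])
    return attended


def mean_pool(rows):
    """Average a list of equal-length vectors into a single vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def bidirectional_fusion(image_feats, other_feats):
    """Attend in both directions, pool each result, and concatenate."""
    image_ctx = cross_attend(image_feats, other_feats)  # image queries the other modality
    other_ctx = cross_attend(other_feats, image_feats)  # other modality queries the image
    return mean_pool(image_ctx) + mean_pool(other_ctx)  # list concatenation


# Toy inputs (hypothetical): 3 image region vectors and 2 vectors from a
# second modality, all of dimension 4.
image_feats = [[0.1, 0.3, 0.0, 0.5], [0.9, 0.1, 0.2, 0.0], [0.4, 0.4, 0.4, 0.4]]
other_feats = [[0.2, 0.0, 0.7, 0.1], [0.5, 0.5, 0.0, 0.3]]
fused = bidirectional_fusion(image_feats, other_feats)
print(len(fused))
```

The fused vector is twice the feature dimension because the two attention directions are pooled separately and concatenated. A trained model would typically add learned projections and multiple heads, which this sketch omits.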