Emotion recognition research is increasingly shifting from single-modal to multi-modal methods to improve accuracy. Despite the advantages of multimodality, challenges such as effective feature fusion and feature redundancy remain. In this study, we propose a multi-modal multi-label emotion recognition decision system based on graph convolution. Our approach extracts features from text, speech, and video data and incorporates label attention to capture fine-grained dependencies between modalities and labels. A two-stage feature reconstruction module enables complementary feature fusion while preserving modality-specific information. Final emotion decisions are produced by a fully connected layer, improving performance without adding model complexity. Experimental results on the IEMOCAP, CMU-MOSEI, and MELD datasets show that our method achieves higher accuracy than existing models, demonstrating the effectiveness of the proposed approach.
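To make the described pipeline concrete, the following is a minimal PyTorch sketch of the kind of architecture the abstract outlines: modality-specific encoders, a two-stage reconstruction/fusion step, label attention, graph convolution over a label graph, and a fully connected decision layer. All module names, dimensions, the attention formulation, and the label adjacency are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only; names, dimensions, and graph construction are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution layer over a label graph: H' = relu(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        # adj is assumed to be a (row-normalized) label adjacency matrix.
        return F.relu(adj @ self.weight(h))


class MultiModalMultiLabelERC(nn.Module):
    def __init__(self, text_dim, audio_dim, video_dim, hidden=256, num_labels=6):
        super().__init__()
        # Modality-specific encoders (feature extraction).
        self.enc = nn.ModuleDict({
            "text": nn.Linear(text_dim, hidden),
            "audio": nn.Linear(audio_dim, hidden),
            "video": nn.Linear(video_dim, hidden),
        })
        # Learnable label embeddings used for label attention.
        self.label_emb = nn.Parameter(torch.randn(num_labels, hidden))
        # Two-stage feature reconstruction: (1) cross-modal reconstruction,
        # (2) fusion of reconstructed and original (modality-specific) features.
        self.recon = nn.Linear(2 * hidden, hidden)
        self.fuse = nn.Linear(3 * hidden, hidden)
        # Graph convolution over the label graph, then per-label classification.
        self.gcn = GCNLayer(hidden, hidden)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, text, audio, video, label_adj):
        feats = {
            "text": torch.relu(self.enc["text"](text)),
            "audio": torch.relu(self.enc["audio"](audio)),
            "video": torch.relu(self.enc["video"](video)),
        }
        # Stage 1: reconstruct each modality from the other two (complementary cues).
        recon = {}
        for m in feats:
            others = [feats[o] for o in feats if o != m]
            recon[m] = torch.relu(self.recon(torch.cat(others, dim=-1)))
        # Stage 2: fuse reconstructed features while keeping modality-specific ones.
        fused = torch.relu(self.fuse(
            torch.cat([feats[m] + recon[m] for m in feats], dim=-1)))  # (B, H)
        # Label attention: each label attends to the fused utterance representation.
        attn = torch.softmax(fused @ self.label_emb.t(), dim=-1)       # (B, L)
        label_feats = attn.unsqueeze(-1) * self.label_emb              # (B, L, H)
        # Propagate label dependencies with graph convolution.
        label_feats = self.gcn(label_feats, label_adj)                 # (B, L, H)
        # Fully connected decision layer -> one logit per emotion label.
        joint = torch.cat(
            [label_feats, fused.unsqueeze(1).expand_as(label_feats)], dim=-1)
        return self.classifier(joint).squeeze(-1)                      # (B, L)


if __name__ == "__main__":
    model = MultiModalMultiLabelERC(text_dim=768, audio_dim=128, video_dim=512)
    adj = torch.eye(6)  # placeholder adjacency; typically built from label co-occurrence
    logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512), adj)
    print(logits.shape)  # torch.Size([4, 6])
```

The per-label sigmoid over these logits would yield the multi-label predictions; the actual feature extractors, attention mechanism, and graph construction in the paper may differ substantially from this sketch.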