Author:
Li Yanjin, Xu Linchuan, Yamanishi Kenji
Abstract
Graph data augmentation (GDA), which manipulates graph structure and/or attributes, has been demonstrated as an effective method for improving the generalization of graph neural networks on semi-supervised node classification. For any data augmentation technique, label preservation is critical, that is, node labels should not change after data manipulation. However, most existing methods overlook the label preservation requirement. Determining whether a GDA method is label-preserving is highly challenging, owing to the non-Euclidean nature of the graph structure. In this study, for the first time, we formulate a label-preserving problem (LPP) in the context of GDA. The LPP is formulated as an optimization problem in which, given a fixed augmentation budget, the objective is to find an augmented graph with minimal difference in data distribution compared to the original graph. To solve the LPP, we propose GMMDA, a generative data augmentation (DA) method based on Gaussian mixture modeling (GMM) of a graph in a latent space. We designed a novel learning objective that jointly learns a low-dimensional graph representation and estimates the GMM. The learning is followed by sampling from the GMM, and the samples are converted back to the graph as additional nodes. To uphold label preservation, we designed a minimum description length (MDL)-based method to select a set of samples that produces the minimum shift in the data distribution captured by the GMM. Through experiments, we demonstrate that GMMDA can improve the performance of a graph convolutional network on Cora, Citeseer and Pubmed by as much as 7.75%, 8.75% and 5.87%, respectively, significantly outperforming the state-of-the-art methods.
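The sample-then-select idea in the abstract can be illustrated with a minimal sketch. It is not the GMMDA implementation: the mixture parameters below are hypothetical and assumed to be already fitted, the latent space is reduced to one dimension, and the selection rule (keep the candidates with the shortest code length, -log p(x), under the mixture) is a simplified stand-in for the paper's MDL-based criterion. All function names are illustrative.

```python
import math
import random


def gmm_log_density(x, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture at x (log-sum-exp for stability)."""
    logs = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def sample_gmm(n, weights, means, stds, rng):
    """Draw n (value, component) pairs; the component index stands in for a label."""
    out = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        out.append((rng.gauss(means[k], stds[k]), k))
    return out


def select_by_code_length(candidates, budget, weights, means, stds):
    """Keep the `budget` candidates with the shortest code length -log p(x)."""
    scored = sorted(
        candidates,
        key=lambda c: -gmm_log_density(c[0], weights, means, stds),
    )
    return scored[:budget]


# Hypothetical, already-fitted mixture: two well-separated components
# in a 1-D latent space (assumed values, not taken from the paper).
WEIGHTS, MEANS, STDS = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]

rng = random.Random(0)
candidates = sample_gmm(200, WEIGHTS, MEANS, STDS, rng)
selected = select_by_code_length(candidates, 20, WEIGHTS, MEANS, STDS)
```

Here `budget` plays the role of the fixed augmentation budget: only the 20 candidates that the mixture already explains best are kept, so adding them should perturb the modeled distribution less than adding arbitrary samples would.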
Funder
Japan Science and Technology Agency
The University of Tokyo
Publisher
Springer Science and Business Media LLC