Abstract
A domain-specific corpus can be used to build a domain ontology, which is useful in many areas such as IR, NLP, and web mining. We propose a multi-root method for building a domain-specific corpus from Wikipedia resources. First, we select some top-level nodes (Wikipedia category articles) as root nodes and traverse Wikipedia using a BFS-like algorithm. The traversal yields a directed Wikipedia graph (Wiki-graph). Next, an algorithm based mainly on Kosaraju's algorithm is proposed to remove the cycles in the Wiki-graph. Finally, a topological sort is used to traverse the Wiki-graph, with ranking and filtering performed during the traversal. When computing a node's ranking score, both the in-degree of the node itself and the out-degree of its parents are considered. The experimental evaluation shows that our method can produce a high-quality domain-specific corpus.
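The pipeline described in the abstract can be sketched roughly as follows. This is only an illustration on a toy link table, not the authors' implementation: the abstract does not give the cycle-removal rule or the scoring formula, so both are assumptions here. The sketch drops an edge inside a strongly connected component when it does not lead to a strictly deeper BFS level, and it scores a node as the average of its parents' scores divided by each parent's out-degree. All function names, the `links` table, and the threshold are hypothetical.

```python
from collections import deque, defaultdict

def bfs_multi_root(links, roots, max_depth):
    """Collect nodes and edges reachable from any root within max_depth hops."""
    depth = {r: 0 for r in roots}
    queue, edges = deque(roots), set()
    while queue:
        u = queue.popleft()
        if depth[u] == max_depth:
            continue
        for v in links.get(u, ()):
            edges.add((u, v))
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth, edges

def kosaraju_scc(nodes, edges):
    """Map each node to a representative of its strongly connected component."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    order, seen = [], set()
    for s in nodes:                      # pass 1: record DFS finishing order
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(fwd[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(fwd[v])))
                    break
            else:                        # all successors done: u is finished
                order.append(u)
                stack.pop()
    comp = {}
    for s in reversed(order):            # pass 2: DFS on the reversed graph
        if s in comp:
            continue
        comp[s] = s
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rev[u]:
                if v not in comp:
                    comp[v] = s
                    stack.append(v)
    return comp

def break_cycles(depth, edges):
    """ASSUMED rule: inside an SCC keep only edges that go strictly deeper."""
    comp = kosaraju_scc(depth, edges)
    return {(u, v) for u, v in edges
            if comp[u] != comp[v] or depth[u] < depth[v]}

def rank_and_filter(roots, depth, edges, threshold):
    """Score nodes in topological order (Kahn's algorithm), then filter."""
    parents, children = defaultdict(list), defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
        children[u].append(v)
    indeg = {n: len(parents[n]) for n in depth}
    ready = deque(n for n in depth if indeg[n] == 0)
    score = {}
    while ready:
        v = ready.popleft()
        if v in roots:
            score[v] = 1.0
        else:
            # ASSUMED formula: parents' shares (score / out-degree),
            # averaged over the node's own in-degree.
            share = sum(score[p] / len(children[p]) for p in parents[v])
            score[v] = share / len(parents[v])
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return {n: s for n, s in score.items() if s >= threshold}

# Toy link table (hypothetical data, not real Wikipedia links).
links = {
    "Category:Computing": ["Category:Software", "Category:Algorithms"],
    "Category:Software": ["Category:Algorithms", "Category:Computing"],
    "Category:Algorithms": ["Sorting algorithm"],
    "Category:Mathematics": ["Category:Algorithms", "Category:Geometry"],
    "Category:Geometry": ["Category:Mathematics"],
}
roots = ["Category:Computing", "Category:Mathematics"]
depth, edges = bfs_multi_root(links, roots, max_depth=3)
dag = break_cycles(depth, edges)
kept = rank_and_filter(roots, depth, dag, threshold=0.6)
```

Restricting edge removal to edges inside a strongly connected component leaves every cross-component link untouched, since such links cannot lie on a cycle. Each non-root node also keeps the edge through which the BFS first reached it, so no node is orphaned before scoring.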
Publisher
Trans Tech Publications, Ltd.