Abstract
Purpose
The paper explores the automated construction of multilingual thesauri from freely available digital library resources. It presents the key methods and results of the study, and proposes a way to extract terms automatically from a multilingual parallel corpus.
Design/methodology/approach
The study uses natural language processing to analyze the linguistic characteristics of terms and combines this with statistical analysis to extract terms from technological documents. The methods comprise automatically extracting and filtering terms, identifying and building relationships among terms, building the multilingual parallel corpus, and extracting term pairs between Chinese and foreign languages by calculating their association probability. The experiments were run on a Java test platform.
Findings
The study identifies the similarities and differences between the Chinese thesaurus standard and the international thesaurus standard, and presents methods for automatically extracting terms and building relationships among them. Multilingual term translation sets are then generated from real corpora. The experimental results show that the proposed methods improve performance; in particular, the automatic term translation alignment method outperforms the traditional IBM model method.
Practical implications
The results can serve as a reference for further study and application of automated multilingual thesaurus construction using Chinese as a pivot.
Originality/value
The paper proposes new ideas on automated thesaurus construction in the digital age. The presented method, which combines linguistics and statistics, is a new attempt, and the experimental results indicate that the exploration is innovative and valuable. In addition, these ideas and methods provide a good starting point for improving the information services of the PRC's National Science and Technology Digital Library.
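The term-pair extraction step named under Design/methodology/approach (pairing Chinese and foreign-language terms by an association probability computed over a parallel corpus) can be illustrated with a minimal sketch. The abstract does not state which association measure the authors use, so the Dice coefficient over sentence-level co-occurrence is an assumption made here for illustration; the function name `dice_term_pairs`, the `min_score` threshold, and the toy corpus are likewise hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product


def dice_term_pairs(aligned_pairs, min_score=0.5):
    """Score candidate (source, target) term pairs with the Dice coefficient.

    aligned_pairs: iterable of (source_terms, target_terms), one entry per
    aligned sentence pair of the parallel corpus. The Dice measure is an
    assumption; the paper's actual association probability is not specified.
    """
    src_count = Counter()    # sentence pairs in which a source term occurs
    tgt_count = Counter()    # sentence pairs in which a target term occurs
    joint_count = Counter()  # sentence pairs in which both terms occur

    for src_terms, tgt_terms in aligned_pairs:
        src_set, tgt_set = set(src_terms), set(tgt_terms)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint_count.update(product(src_set, tgt_set))

    scores = {}
    for (s, t), joint in joint_count.items():
        score = 2.0 * joint / (src_count[s] + tgt_count[t])
        if score >= min_score:
            scores[(s, t)] = score
    return scores


# Hypothetical toy corpus: terms already extracted from each aligned sentence.
corpus = [
    (["叙词表", "语料库"], ["thesaurus", "corpus"]),
    (["叙词表"], ["thesaurus"]),
    (["语料库", "术语"], ["corpus", "term"]),
]

pairs = dice_term_pairs(corpus)

# Keep the highest-scoring target term for each source term.
best = {}
for (s, t), score in pairs.items():
    if s not in best or score > best[s][1]:
        best[s] = (t, score)
```

On this toy input, each Chinese term is paired with the English term it co-occurs with most consistently. A real pipeline would need the term extraction and filtering steps to run first, and a larger corpus for the scores to be meaningful.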
Subject
Library and Information Sciences, Computer Science Applications
Cited by
3 articles.