Author:
Asahara Masayuki,Maekawa Kikuo,Imada Mizuho,Kato Sachi,Konishi Hikari
Abstract
In 2011, the National Institute for Japanese Language and Linguistics (NINJAL) launched a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words by 2016. The project is divided into four categories: Page Collection, Linguistic Annotation, Release and Preservation. For Page Collection, web crawlers are employed to collect web text by crawling 100 million pages every three months and retaining several versions of the text for three-month periods. For Linguistic Annotation, the linguistic studies web corpus contains annotated linguistic information. To improve the usability of these linguistic resources, normalization tasks such as tag removal, word segmentation, dependency parsing, and register estimation are performed. For Release, word lists and n-gram data are published based on the crawled and annotated text corpus. In addition, applications are being developed to enable searching for morphosyntax patterns in the ten-billion-word corpus. For Preservation, crawled web pages are preserved in chronological order as web archives primarily to support the survey of ongoing linguistic changes. In this paper, we present the basic design of the four categories. Additionally, we report the current status of the corpus using basic statistics of the crawled data and discuss the importance of deduplicating sentences.
Cited by
12 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献