Archiving and Analysing Techniques of the Ultra-Large-Scale Web-Based Corpus Project of NINJAL, Japan

Published:2014-08
Issue:1-2
Volume:25
Page:129-148
ISSN:0955-7490
Container-title:Alexandria: The Journal of National and International Library and Information Issues
language:en
Short-container-title:Alexandria

Author:

Asahara Masayuki,Maekawa Kikuo,Imada Mizuho,Kato Sachi,Konishi Hikari

Abstract

In 2011, the National Institute for Japanese Language and Linguistics (NINJAL) launched a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words by 2016. The project is divided into four categories: Page Collection, Linguistic Annotation, Release and Preservation. For Page Collection, web crawlers are employed to collect web text by crawling 100 million pages every three months and retaining several versions of the text for three-month periods. For Linguistic Annotation, the linguistic studies web corpus contains annotated linguistic information. To improve the usability of these linguistic resources, normalization tasks such as tag removal, word segmentation, dependency parsing, and register estimation are performed. For Release, word lists and n-gram data are published based on the crawled and annotated text corpus. In addition, applications are being developed to enable searching for morphosyntax patterns in the ten-billion-word corpus. For Preservation, crawled web pages are preserved in chronological order as web archives primarily to support the survey of ongoing linguistic changes. In this paper, we present the basic design of the four categories. Additionally, we report the current status of the corpus using basic statistics of the crawled data and discuss the importance of deduplicating sentences.

Publisher

SAGE Publications

Link

http://journals.sagepub.com/doi/pdf/10.7227/ALX.0024

References: 8 articles.

Cited by 12 articles.

1. Improvements in Similarity Measurement Method for Emotion Classification of Japanese Sentences;Information;2024-03-15

2. Commas as a constructional resource: the use of a comma in a formulaic expression in Japanese social media texts;Journal of Japanese Linguistics;2023-05-01

3. Investigation of the Relationship Between Animacy and L2 Learners’ Acquisition of the English Plural Morpheme;Journal of Psycholinguistic Research;2022-10-29

4. Development of the Japanese Version of the Linguistic Inquiry and Word Count Dictionary 2015;Frontiers in Psychology;2022-03-07

5. Opposite Information Annotation on ‘Word List by Semantic Principles’;Journal of Natural Language Processing;2021