Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off

Published: 2015-02-24 Issue: 1 Volume: 43 Page: 107-134
ISSN: 0305-0009
Container-title: Journal of Child Language
Language: en
Short-container-title: J. Child Lang.

Author:

HIDAKA SHOHEI

Abstract

The number of unique words in children's speech is one of the most basic statistics indicating their language development. We may, however, face difficulties when trying to accurately evaluate the number of unique words in a child's growing corpus over time with a limited sample size. This study proposes a novel technique to estimate the latent number of words from a series of words uttered by children. This technique utilizes statistical properties of the number of types as a function of the number of sampled tokens. We tested the practical effectiveness of the proposed method in empirical analyses of cross-sectional and longitudinal samples. The converging empirical evidence indicates that the proposed estimator improves the accuracy of vocabulary size estimation over a set of existing estimators. Utilizing this efficient estimator, we propose a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy compared to existing methods.
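The quantity the abstract refers to, the number of types as a function of the number of sampled tokens, can be illustrated with a minimal sketch. This is not the paper's estimator; it only computes the observed type-token curve that such an estimator would take as input. The function name `type_token_curve` and the toy word sequence are assumptions made for illustration.

```python
def type_token_curve(tokens):
    """Return a list whose i-th entry is the number of distinct types
    observed among the first i + 1 tokens (hypothetical helper)."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)            # a repeated token does not add a new type
        curve.append(len(seen))  # cumulative type count so far
    return curve


# Toy utterance sequence (made up for illustration, not data from the study).
sample = ["the", "dog", "the", "ball", "dog", "go", "the"]
curve = type_token_curve(sample)
print(curve)  # [1, 2, 2, 3, 3, 4, 4]
```

The curve flattens as tokens repeat, which is why the observed type count in a small sample understates the latent vocabulary size.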

Publisher

Cambridge University Press (CUP)

Subject

General Psychology, Linguistics and Language, Developmental and Educational Psychology, Experimental and Cognitive Psychology, Language and Linguistics

Reference: 61 articles.

1. General type-token distribution

2. The Child Language Data Exchange System: an update

3. Shifting ontological boundaries: how Japanese- and English-speaking children generalize names for animals and artifacts

Cited by 1 article.

1. Wordbank: an open repository for developmental vocabulary data; Journal of Child Language; 2016-05-18