Affiliation:
1. Columbia University
2. Carnegie Mellon University
Abstract
Probabilistic grammars are generative statistical models of compositional and sequential structures, used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars under the log-loss. Within this framework we derive sample complexity bounds that apply to both the supervised and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization in this framework in both settings. In the unsupervised case, we show that minimizing the empirical risk is NP-hard; we therefore suggest an approximate algorithm, similar to expectation-maximization, to minimize the empirical risk.
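To make the supervised case concrete, here is a minimal sketch (not code from the paper): under the log-loss, empirical risk minimization for a probabilistic grammar reduces to relative-frequency estimation of rule probabilities from observed derivations. The data representation and the toy treebank below are illustrative assumptions, not the paper's notation.

```python
# Sketch: supervised ERM for a probabilistic grammar under log-loss.
# Derivations are observed, so minimizing empirical risk reduces to
# relative-frequency (maximum likelihood) estimation of rule probabilities.
from collections import Counter
import math

def erm_supervised(derivations):
    """Relative-frequency estimation of rule probabilities.

    `derivations` is a list of derivations, each a list of
    (lhs, rhs) rule tokens, e.g. ("S", ("NP", "VP")).
    """
    rule_counts = Counter()
    lhs_totals = Counter()
    for deriv in derivations:
        for lhs, rhs in deriv:
            rule_counts[(lhs, rhs)] += 1
            lhs_totals[lhs] += 1
    return {rule: c / lhs_totals[rule[0]] for rule, c in rule_counts.items()}

def empirical_log_loss(derivations, probs):
    """Average negative log-likelihood of the sample under `probs`."""
    total = 0.0
    for deriv in derivations:
        for rule in deriv:
            total -= math.log(probs[rule])
    return total / len(derivations)

# Toy treebank over a tiny CFG (illustrative only).
data = [
    [("S", ("NP", "VP")), ("NP", ("D", "N")), ("VP", ("V",))],
    [("S", ("NP", "VP")), ("NP", ("Pro",)), ("VP", ("V", "NP")), ("NP", ("D", "N"))],
]
theta = erm_supervised(data)
print(empirical_log_loss(data, theta))
```

In the unsupervised setting the derivations are latent, which is what makes exact empirical risk minimization NP-hard and motivates the EM-style approximation described in the abstract.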
Subject
Artificial Intelligence, Computer Science Applications, Linguistics and Language, Language and Linguistics
Cited by
5 articles.
1. Employee welfare financing system with support vector machine and Naïve Bayes to Syariah banking;INTERNATIONAL CONFERENCE ON BIOMEDICAL ENGINEERING (ICoBE 2021);2023
2. Consistent Unsupervised Estimators for Anchored PCFGs;Transactions of the Association for Computational Linguistics;2020-07-01
3. Learnability;The Handbook of Language Emergence;2015-01-02
4. The forest for the trees;Physics of Life Reviews;2014-09
5. Complexity in Language Acquisition;Topics in Cognitive Science;2013-01