Affiliation:
1. Hungarian Research Centre for Linguistics, Institute for Lexicology, Hungary
Abstract
Nowadays, it is quite common in linguistics to base research on data rather than introspection. Countless corpora, both raw and linguistically annotated, are available and provide the data needed. Most corpora are large, ranging from several million to a few billion words, so they clearly cannot be investigated word by word through close reading. There are basically two ways to retrieve data from them: (1) through a query interface or (2) directly by automatic text processing. Here we present principles for collecting linguistic data from corpora soundly and effectively by querying, i.e. without the programming knowledge needed to manipulate the data directly. We discuss what is worth thinking about, which tools to use, what to do by default, and how to solve problematic cases: in sum, how to obtain correct and complete data from corpora for linguistic research.
Subject
Literature and Literary Theory, Linguistics and Language, Language and Linguistics, Cultural Studies
Cited by
2 articles.