Abstract
This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.
Publisher
John Benjamins Publishing Company
Cited by
7 articles.