On Methods of Data Standardization of German Social Media Comments

Author:

Melnyk Lidiia, Feld Linda

Abstract

This article is part of a larger project that aims to identify discursive strategies in social media discourse on gender diversity. For that project, roughly 350,000 comments were scraped from the comment sections of YouTube videos on the topic. The article compares methods for standardizing social media data to facilitate further processing; specifically, the data are corrected for casing, spelling, and punctuation. Several tools and models (LanguageTool, T5, seq2seq, GPT-2) were tested. The German GPT-2 model achieved the best outcome: it scored highest on all applied metrics (ROUGE, GLEU, BLEU), making it the best of the tested models for grammatical error correction in German social media data.
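To illustrate how an overlap metric such as BLEU can rank a raw comment against a standardized one, the following is a minimal sketch and not the authors' evaluation code. It assumes whitespace tokenization, a single reference, and add-one smoothing; the function names and the German example sentences are hypothetical, and a real evaluation would more likely rely on an established library implementation.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified single-reference sentence BLEU with add-one smoothing.

    This is an illustrative variant, not the exact metric used in the article.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps short comments from collapsing to zero.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example: a gold-standard comment, the raw comment,
# and a model output that restores casing and punctuation.
reference = "Das ist ein guter Kommentar."
raw = "das ist ein guter kommentar"
corrected = "Das ist ein guter Kommentar."

print(sentence_bleu(reference, raw))
print(sentence_bleu(reference, corrected))
```

Because the comparison is case-sensitive and punctuation stays attached to tokens, the raw comment scores lower than the corrected one, which matches the reference exactly. This sensitivity to surface form is what makes such metrics usable for judging casing, spelling, and punctuation corrections.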

Publisher

Universitat Politècnica de València

Subject

General Medicine

