Swiss-AL: Platform for Language Data in Applied Sciences

Published: 2023-09-07 Volume: 1
ISSN: 2941-296X
Container-title: Proceedings of the Conference on Research Data Infrastructure
Short-container-title: Proc Conf Res Data Infrastr

Author:

Krasselt Julia, Dreesen Philipp, Stücheli-Herlach Peter, Lemmenmeier Dolores, Cho Sooyeon, Rothenhäusler Klaus, Fluor Matthias

Abstract

Open Science is transforming the way researchers collect, process, analyze, and store empirical research data, particularly in the social sciences and humanities, where language data is crucial. This transformation especially concerns developers and providers of large language corpora and manifests itself in at least three challenges when providing these corpora as Open Research Data (ORD). These challenges concern the heterogeneous practices that researchers apply when working with language data, the research data lifecycle, and legal and ethical aspects. In this paper, we present Swiss-AL, a language data platform developed in Switzerland that is being transformed into an Open Research Data resource for Applied Sciences within the Swiss Open Science Strategy. The paper gives an overview of the data contained in Swiss-AL and the infrastructure used to process and analyze the data. Furthermore, it presents approaches to the three above-mentioned challenges to language ORD.

Publisher

TIB Open Publishing

References: 8 articles.

1. P. Dreesen and P. Stücheli-Herlach, "Diskurslinguistik in Anwendung. Ein transdisziplinäres Forschungsdesign für korpuszentrierte Analysen zu öffentlicher Kommunikation", Zeitschrift für Diskursforschung, vol. 7, no. 2, pp. 123–162, 2019, doi: https://doi.org/10.3262/ZFD1902123

2. J. Krasselt, P. Dreesen, M. Fluor, C. Mahlow, K. Rothenhäusler, and M. Runte, "Swiss-AL: A Multilingual Swiss Web Corpus for Applied Linguistics", in Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 4138–4144. https://aclanthology.org/2020.lrec-1.510/ [26.04.2023]

3. M. Theobald, J. Siddharth, and A. Paepcke, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections", in 31st annual international ACM SIGIR conference on Research and development in information retrieval 2008 (SIGIR 2008), Singapore, Singapore, 2008.

4. D. Ferrucci and A. Lally, "UIMA: an architectural approach to unstructured information processing in the corporate research environment", Natural Language Engineering, vol. 10, no. 3–4, pp. 327–348, 2004, doi: https://doi.org/10.1017/S1351324904003523.

5. H. Schmid, "Probabilistic Part-of-Speech Tagging Using Decision Trees", in Proceedings of the international conference on new methods in language processing, Manchester, United Kingdom, 1994, pp. 44–49. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.1139