Swiss-AL: Platform for Language Data in Applied Sciences


Krasselt Julia, Dreesen Philipp, Stücheli-Herlach Peter, Lemmenmeier Dolores, Cho Sooyeon, Rothenhäusler Klaus, Fluor Matthias


Open Science is transforming the way researchers collect, process, analyze, and store empirical research data, particularly in the social sciences and humanities, where language data is crucial. This transformation especially concerns developers and providers of large language corpora, who face at least three challenges when providing these corpora as Open Research Data (ORD): the heterogeneous practices that researchers apply when working with language data, the research data lifecycle, and legal and ethical aspects. In this paper, we present Swiss-AL, a language data platform developed in Switzerland that is being transformed into an Open Research Data Resource for Applied Sciences within the Swiss Open Science Strategy. The paper gives an overview of the data contained in Swiss-AL and the infrastructure used to process and analyze it. Furthermore, it presents approaches to the three above-mentioned challenges for language ORD.


TIB Open Publishing







