Wikipedia citations: A comprehensive data set of citations with identifiers extracted from English Wikipedia

Published:2021 Issue:1 Volume:2 Page:1-19
ISSN:2641-3337
Container-title:Quantitative Science Studies
language:en
Author:

Singh Harshdeep¹, West Robert¹, Colavizza Giovanni²

Affiliation:

1. Data Science Laboratory, EPFL

2. Institute for Logic, Language and Computation, University of Amsterdam

Abstract

Wikipedia’s content is based on reliable and published sources. To date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive data set of citations extracted from Wikipedia. We extracted 29.3 million citations from 6.1 million English Wikipedia articles as of May 2020, and classified them as books, journal articles, or Web content. We were thus able to extract 4.0 million citations to scholarly publications with known identifiers (including DOI, PMC, PMID, and ISBN) and to equip a further 261 thousand citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. We release our code to allow the community to extend our work and update the data set in the future.
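As a rough illustration of what extracting citations "with known identifiers" can involve, the sketch below pulls DOI, PMID, PMC, and ISBN fields out of a single wikitext citation template. It is not the authors' pipeline: the function name `extract_identifiers`, the field list, and the example template are all assumptions made for illustration, and the simple pipe-splitting approach does not handle nested templates.

```python
# Hypothetical field names, based on common citation-template parameters;
# the released Wikipedia Citations code may use a different approach.
ID_FIELDS = ("doi", "pmid", "pmc", "isbn")


def extract_identifiers(template: str) -> dict:
    """Return the identifier fields found in one wikitext citation template."""
    body = template.strip()
    # Strip the surrounding double braces if present.
    if body.startswith("{{"):
        body = body[2:]
    if body.endswith("}}"):
        body = body[:-2]

    found = {}
    # The first pipe-separated part is the template name, so skip it.
    for part in body.split("|")[1:]:
        key, sep, value = part.partition("=")
        key = key.strip().lower()
        value = value.strip()
        if sep and key in ID_FIELDS and value:
            found[key] = value
    return found


# Made-up example template (the DOI and PMID are placeholders, not real records).
example = (
    "{{cite journal |last=Doe |title=An example |journal=Example Journal "
    "|year=2020 |doi=10.1234/example.5678 |pmid=12345678}}"
)
ids = extract_identifiers(example)
print(ids)
```

A citation with no recognized identifier field yields an empty dictionary. In a fuller pipeline such citations would be candidates for a lookup against an external service, as the abstract describes for Crossref.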

Publisher

MIT Press - Journals

Subject

Aerospace Engineering

Link

http://direct.mit.edu/qss/article-pdf/2/1/1/1906624/qss_a_00105.pdf

References: 57 articles.

1. Science through Wikipedia: A novel representation of open knowledge through co-citation networks;Arroyo-Machado;PLOS ONE,2020

2. A graph-structured dataset for Wikipedia Research;Aspert,2019

3. Web of Science as a data source for research on scientific and scholarly activity;Birkle;Quantitative Science Studies,2020

4. Enriching word vectors with subword information;Bojanowski;Transactions of the Association for Computational Linguistics,2017

Cited by 23 articles.

1. Providing Citations to Support Fact-Checking: Contextualizing Detection of Sentences Needing Citation on Small Wikipedias;Natural Language Processing Journal;2024-09

2. Knowledge is power: Open-world knowledge representation learning for knowledge-based visual reasoning;Artificial Intelligence;2024-08

3. The many publics of science: using altmetrics to identify common communication channels by scientific field;Scientometrics;2024-06-20

4. Sourcing public policy: organisation publishing in Wikipedia;New Review of Hypermedia and Multimedia;2024-05-20

5. The Most Cited Scientific Information Sources in Wikipedia Articles Across Various Languages;Biblioteka;2024-03-07