Addressing structural hurdles for metadata extraction from environmental impact statements

Author:

Laparra Egoitz (1), Binford‐Walsh Alex (2), Emerson Kirk (3), Miller Marc L. (4), López‐Hoffman Laura (2), Currim Faiz (5), Bethard Steven (1)

Affiliation:

1. School of Information University of Arizona Tucson Arizona USA

2. School of Natural Resources and the Environment and the Udall Center for Studies in Public Policy University of Arizona Tucson Arizona USA

3. School of Government and Public Policy University of Arizona Tucson Arizona USA

4. James E. Rogers College of Law University of Arizona Tucson Arizona USA

5. Department of Management Information Systems University of Arizona Tucson Arizona USA

Abstract

Natural language processing techniques can be used to analyze the linguistic content of a document to extract missing pieces of metadata. However, accurate metadata extraction may depend not only on linguistic content but also on structural problems such as extremely large documents, unordered multi‐file documents, and inconsistency in manually labeled metadata. In this work, we start from two standard machine learning solutions to extract pieces of metadata from Environmental Impact Statements, environmental policy documents that are regularly produced under the US National Environmental Policy Act of 1969. We present a series of experiments in which we evaluate how these standard approaches are affected by different issues derived from real‐world data. We find that metadata extraction can be strongly influenced by nonlinguistic factors such as document length and volume ordering, and that the standard machine learning solutions often do not scale well to long documents. We demonstrate how such solutions can be better adapted to these scenarios, and conclude with suggestions for other NLP practitioners cataloging large document collections.

Funder

National Science Foundation

Publisher

Wiley

Subject

Library and Information Sciences, Information Systems and Management, Computer Networks and Communications, Information Systems

