Quality Control for Distantly-Supervised Data-to-Text Generation via Meta Learning

Published: 2023-04-30 Issue: 9 Volume: 13 Page: 5573
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Gong Heng¹, Feng Xiaocheng¹,², Qin Bing¹,²

Affiliation:

1. Harbin Institute of Technology, Harbin 150001, China

2. Peng Cheng Laboratory, Shenzhen 518000, China

Abstract

Data-to-text generation plays an important role in natural language processing: it takes structured data as input and helps people understand those data by generating user-friendly descriptive text. It can be applied to news generation, financial report generation, customer service, etc. In practice, however, it must adapt to different domains that may lack an annotated training corpus. To alleviate this dataset scarcity problem, distantly-supervised data-to-text generation has emerged, which constructs a training corpus automatically and is more practical to apply to new domains when well-aligned data is expensive to obtain. However, training with distant supervision induces an over-generation problem, since the automatically aligned text includes hallucinations: expressions that cannot be inferred from the data and that misguide the model into producing unfaithful text. To exploit the noisy dataset while maintaining faithfulness, we empower the neural data-to-text model by dynamically increasing the weights of well-aligned training instances and reducing the weights of low-quality ones via meta learning. To the best of our knowledge, we are the first to alleviate the noise in distantly-supervised data-to-text generation via meta learning. In addition, we rewrite the low-quality texts to provide better training instances. Finally, we construct a new distantly-supervised dataset, DIST-ToTTo (abbreviation for Distantly-supervised Table-To-Text), and conduct experiments on both the benchmark WITA (abbreviation for the data source Wikipedia and Wikidata) and DIST-ToTTo datasets. The evaluation results show that our model improves on the state-of-the-art DSG (abbreviation for Distant Supervision Generation) model across all automatic evaluation metrics, with an improvement of 3.72% on the WITA dataset and 3.82% on the DIST-ToTTo dataset in terms of the widely used metric BLEU (abbreviation for BiLingual Evaluation Understudy). Furthermore, based on human evaluation, our model generates more grammatically correct and more faithful text than the state-of-the-art DSG model.
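
The abstract does not spell out how the instance weights are computed. As a rough illustration only, the sketch below uses one common meta-learning heuristic: an instance is up-weighted when its gradient agrees with the gradient on a small trusted set, and its weight is clipped to zero otherwise. The toy one-parameter model, the data, and the names `grad`, `meta_weights`, and `train_step` are all assumptions made for this example and are not taken from the paper.

```python
# Toy sketch of gradient-agreement instance reweighting (an assumed rule,
# not necessarily the one used in the paper). Model: y = theta * x.

def grad(theta, x, y):
    """Gradient of the squared error 0.5 * (theta * x - y) ** 2 w.r.t. theta."""
    return (theta * x - y) * x

def meta_weights(theta, train, trusted):
    """Weight each training instance by how well its gradient agrees with
    the mean gradient on the trusted set; disagreeing instances get 0."""
    g_trusted = sum(grad(theta, x, y) for x, y in trusted) / len(trusted)
    raw = [max(0.0, g_trusted * grad(theta, x, y)) for x, y in train]
    total = sum(raw)
    if total == 0.0:
        return [0.0] * len(train)
    return [r / total for r in raw]

def train_step(theta, train, trusted, lr=0.1):
    """One weighted gradient step; returns the new theta and the weights used."""
    weights = meta_weights(theta, train, trusted)
    step = sum(w * grad(theta, x, y) for w, (x, y) in zip(weights, train))
    return theta - lr * step, weights

# Hypothetical data: the true relation is y = 2x; the third training pair
# is a noisy, "hallucinated" target that contradicts it.
train = [(1.0, 2.0), (2.0, 4.0), (1.0, -1.0), (3.0, 6.0)]
trusted = [(1.0, 2.0), (2.0, 4.0)]

theta = 0.0
for _ in range(20):
    theta, weights = train_step(theta, train, trusted)

print(round(theta, 4), [round(w, 3) for w in weights])
```

In this toy setting the noisy pair pulls the parameter away from what the trusted set supports, so its weight stays at zero and the fit is driven by the well-aligned pairs. A real data-to-text model would apply the same idea to per-instance sequence losses rather than a scalar regression.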

Funder

National Key R&D Program of China

National Natural Science Foundation of China

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/9/5573/pdf
