Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings
Authors:
Blanco-Fernández Yolanda,
Gil-Solla Alberto,
Pazos-Arias José J.,
Quisi-Peralta Diego
Abstract
Embedding models map words and documents to real-valued vectors by exploiting co-occurrence statistics gathered from large collections of unrelated texts. Building domain-specific embeddings from such general-purpose corpora is challenging because the in-domain vocabulary is poorly represented. Existing solutions retrain general models on small domain datasets, overlooking the potential of automatically gathering a rich collection of in-domain texts. We exploit Named Entity Recognition and Doc2Vec to autonomously assemble an in-domain training corpus. Our experiments compare embedding models learned from general-purpose and in-domain corpora, showing that training on the domain-specific corpus attains the best results.
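The abstract gives no implementation details, so the following is only a rough illustration of the general idea (not the authors' pipeline): named entities extracted from a handful of seed in-domain documents pre-filter candidate texts, and a Doc2Vec model trained on the seeds keeps only candidates that are close to them in embedding space. It assumes spaCy for NER and Gensim for Doc2Vec, and the inputs seed_docs, candidate_docs, and similarity_threshold are hypothetical names.

```python
# Illustrative sketch (not the paper's implementation) of assembling an
# in-domain corpus with NER + Doc2Vec.
# Assumes: pip install spacy gensim ; python -m spacy download en_core_web_sm
import spacy
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

nlp = spacy.load("en_core_web_sm")


def domain_entities(seed_docs):
    """Collect named entities mentioned in the seed (in-domain) documents."""
    entities = set()
    for text in seed_docs:
        for ent in nlp(text).ents:
            entities.add(ent.text.lower())
    return entities


def build_corpus(seed_docs, candidate_docs, similarity_threshold=0.4):
    """Return seed docs plus candidates that mention in-domain entities and
    are close to the seeds in Doc2Vec space."""
    entities = domain_entities(seed_docs)

    # Pre-filter: a candidate must mention at least one in-domain entity.
    mentions = [t for t in candidate_docs
                if any(e in t.lower() for e in entities)]

    # Train Doc2Vec on the seed documents only.
    tagged = [TaggedDocument(words=text.lower().split(), tags=[i])
              for i, text in enumerate(seed_docs)]
    model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

    # Keep candidates whose inferred vector is similar enough to some seed.
    corpus = list(seed_docs)
    for text in mentions:
        vec = model.infer_vector(text.lower().split())
        sims = model.dv.most_similar([vec], topn=1)  # [(seed_tag, cosine)]
        if sims and sims[0][1] >= similarity_threshold:
            corpus.append(text)
    return corpus
```

The resulting corpus could then be used to train the word/document embeddings that the paper evaluates against models learned from general-purpose corpora; the threshold and model hyperparameters above are placeholders, not values reported in the article.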
Publisher
Vilnius University Press
Subject
Applied Mathematics, Information Systems, General Medicine