Authors:
Costa Wagner M., Pedrosa Glauco V.
Abstract
Retrieving similar textual documents is a challenging task in the legal domain because legal language has distinctive characteristics of its own. This paper presents a new approach, called BoC-Th, for representing legal documents. It builds on the Bag-of-Concept (BoC) approach, which generates concepts by clustering word vectors produced by a basic neural network model and then computes the frequencies of these concept clusters to build document vectors. The novel contribution of BoC-Th is to generate weighted histograms of concepts, where the weights are defined from the distance between each word and its respective similar term within a thesaurus. The idea is to emphasize the words that are most significant for the context, thus generating more discriminative vectors. Experimental evaluations compared the proposed approach with the traditional BoW and BoC approaches, both popular techniques for document representation. The proposed method obtained the best result among the evaluated techniques for retrieving judgments and jurisprudence documents. BoC-Th increased the mAP (mean Average Precision) by 51% compared to the traditional BoC approach, while being up to 3.4 times faster than the traditional BoW representation.
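The weighted concept histogram described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the function `thesaurus_weight` (an inverse-distance weight with a default for words missing from the thesaurus), the toy word vectors, the fixed centroids standing in for a clustering step, and all other names here are assumptions made purely for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def assign_concepts(word_vectors, centroids):
    """Map each word to the index of its nearest centroid (its concept)."""
    return {
        word: min(range(len(centroids)),
                  key=lambda k: euclidean(vec, centroids[k]))
        for word, vec in word_vectors.items()
    }

def thesaurus_weight(word, word_vectors, thesaurus, default=0.5):
    """Hypothetical weight: the closer a word lies to its thesaurus term,
    the nearer the weight is to 1. Words without an entry get `default`."""
    term = thesaurus.get(word)
    if term is None or term not in word_vectors:
        return default
    return 1.0 / (1.0 + euclidean(word_vectors[word], word_vectors[term]))

def weighted_concept_histogram(tokens, concepts, word_vectors,
                               thesaurus, n_concepts):
    """Accumulate a weight per token into its concept bin, then normalize."""
    hist = [0.0] * n_concepts
    for tok in tokens:
        if tok in concepts:  # skip out-of-vocabulary tokens
            hist[concepts[tok]] += thesaurus_weight(tok, word_vectors, thesaurus)
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy data (assumed): 2-D word vectors and two pre-computed cluster centroids.
word_vectors = {
    "judge": [1.0, 0.0], "court": [0.9, 0.1],
    "appeal": [0.0, 1.0], "recourse": [0.1, 0.9],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]
thesaurus = {"court": "judge", "recourse": "appeal"}  # word -> similar term

concepts = assign_concepts(word_vectors, centroids)
hist = weighted_concept_histogram(
    ["court", "court", "appeal"], concepts, word_vectors, thesaurus, len(centroids)
)
print(hist)
```

In a plain BoC histogram every token would add 1 to its concept bin; here each token adds a thesaurus-derived weight instead, which is the only point where this sketch departs from simple frequency counting.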
Publisher
Sociedade Brasileira de Computação - SBC
Cited by
1 article.