A new method for Arabic/Farsi numeral data set size reduction via modified frequency diagram matching
Author:
Amin Shayegan Mohammad, Aghabozorgi Saeed
Abstract
Purpose
– Pattern recognition systems often have to handle large training data sets that include duplicate and similar training samples. This leads to large memory requirements for storing and processing data, and to high time complexity for training algorithms. The purpose of the paper is to reduce the volume of the training part of a data set in order to increase system speed without any significant decrease in system accuracy.
Design/methodology/approach
– A new technique for data set size reduction, based on a modified version of the frequency diagram approach, is presented. To reduce processing time, the proposed method compares each sample only with other samples of the same class, rather than with samples from different classes. Within each class, it removes only patterns that are similar to the generated class template. No feature extraction was carried out, so that the proposed size reduction technique could be assessed more precisely.
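The within-class template matching described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the modified frequency diagram or the removal rule, so everything here is an assumption made for illustration. The sketch assumes size-normalized binary images, builds the class template as the per-pixel ink frequency, scores each sample by its mean agreement with the template, and removes samples whose score falls in a hypothetical band `[low, high)`. The function names and both thresholds are hypothetical.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed format)


def class_template(samples: List[Image]) -> List[List[float]]:
    """Per-pixel ink frequency over all samples of one class (values in [0, 1])."""
    rows, cols = len(samples[0]), len(samples[0][0])
    n = len(samples)
    return [
        [sum(s[r][c] for s in samples) / n for c in range(cols)]
        for r in range(rows)
    ]


def similarity(sample: Image, template: List[List[float]]) -> float:
    """Mean per-pixel agreement between one sample and the template, in [0, 1]."""
    total = 0.0
    count = 0
    for sample_row, template_row in zip(sample, template):
        for pixel, freq in zip(sample_row, template_row):
            # An ink pixel agrees with the template in proportion to the ink
            # frequency; a background pixel agrees in proportion to 1 - frequency.
            total += freq if pixel else 1.0 - freq
            count += 1
    return total / count


def reduce_class(samples: List[Image], low: float, high: float) -> List[Image]:
    """Drop samples whose similarity lies in [low, high); keep all others.

    The band is a hypothetical rule chosen for this sketch. Samples are only
    ever compared with the template of their own class.
    """
    template = class_template(samples)
    return [s for s in samples if not (low <= similarity(s, template) < high)]
```

Calling `reduce_class` once per digit class and concatenating the results would give a reduced training set. Because each sample is compared only with its own class template, the cost grows linearly with the number of samples in a class rather than with the number of sample pairs.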
Findings
– Experiments on Hoda, one of the largest standard handwritten numeral optical character recognition (OCR) data sets, show a 14.88 percent decrease in data set volume without a significant decrease in performance.
Practical implications
– The proposed technique is effective for reducing the size of all pictorial databases, such as OCR data sets.
Originality/value
– State-of-the-art algorithms currently used for data set size reduction usually remove samples near class centers, or support vector (SV) samples between different classes. However, samples near a class center carry valuable information about class characteristics and are necessary for building a system model. SVs are likewise important samples for evaluating system efficiency. Unlike other available methods, the proposed technique keeps both outlier samples and samples close to the class centers.
Subject
Computer Science (miscellaneous), Social Sciences (miscellaneous), Theoretical Computer Science, Control and Systems Engineering, Engineering (miscellaneous)