Multimodal feature fusion technique for robust classification using Natural Scene Text Binary Images

Shahbaz Hassan Wasti; Ghulam Jillani Ansari; Sajid Ali

doi:10.71085/sss.04.01.447

Authors

Shahbaz Hassan Wasti, Department of Information Sciences, Division of Science and Technology, University of Education, Lahore, 54770, Pakistan
Ghulam Jillani Ansari, Assistant Professor (Computer Science), Department of Information Sciences, Division of Science and Technology, University of Education, Lahore, 54770, Pakistan
Sajid Ali, Associate Professor (IT), Department of Information Sciences, Division of Science and Technology, University of Education, Lahore, 54770, Pakistan

Keywords:

Multimodal Feature Fusion, LTP, HOG, Ensemble Classifiers, Non-optimized

Abstract

Scene text classification requires an effective representation of diverse visual properties, including geometry, appearance, contour, and shape. This paper presents an ensemble classification framework that incorporates multimodal feature fusion for robust classification of scene text images. Histogram of Oriented Gradients (HOG), geometric, Local Ternary Pattern (LTP), and contour-based features are extracted from segmented binary images to capture complementary statistical and structural information. These features are then serially fused to form a comprehensive multimodal representation. The fused representation is further optimized through feature selection to retain the most discriminative features for classification. Multiple ensemble classifiers, including k-Nearest Neighbors (KNN), Decision Tree (DT) models, and Support Vector Machine (SVM) variants, are evaluated to determine the best classification performance. The experiments are conducted on the Street View Text dataset, which is considered very challenging because of its scene variability. Among all classifiers, the L-SVM trained on the optimized multimodal features consistently outperforms the competing classifiers. The proposed framework achieves substantial and progressive improvements in precision, recall, F-measure, accuracy, and AUC over single-feature and non-optimized baselines. The findings indicate that ensemble learning trained on optimized multimodal feature fusion yields a discriminative and reliable solution for natural scene text recognition under non-standard real-world imaging conditions.
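
As a rough illustration of the serial fusion and feature selection steps described in the abstract, the following Python sketch builds a fused feature vector from a binary image. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the simplified descriptors (a single global orientation histogram standing in for HOG, a basic 8-neighbor LTP, three bounding-box measures, and two boundary measures), the synthetic test images, and the variance-based ranking used as a stand-in for the feature selection method, which the abstract does not name.

```python
import numpy as np

def orientation_histogram(img, bins=9):
    """Global gradient-orientation histogram (simplified stand-in for HOG)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

def ltp_histogram(img, t=0.5):
    """Basic 8-neighbor LTP, split into upper and lower binary patterns."""
    img = img.astype(float)
    h, w = img.shape
    c = img[1:-1, 1:-1]
    upper = np.zeros(c.shape, dtype=np.int64)
    lower = np.zeros(c.shape, dtype=np.int64)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        upper += (n >= c + t).astype(np.int64) << bit  # ternary code +1
        lower += (n <= c - t).astype(np.int64) << bit  # ternary code -1
    hist = np.concatenate([np.bincount(upper.ravel(), minlength=256),
                           np.bincount(lower.ravel(), minlength=256)])
    return hist / (hist.sum() + 1e-9)

def geometric_features(img):
    """Area fraction, bounding-box aspect ratio, and extent."""
    ys, xs = np.nonzero(img)
    if ys.size == 0:
        return np.zeros(3)
    bh = ys.max() - ys.min() + 1
    bw = xs.max() - xs.min() + 1
    return np.array([ys.size / img.size, bw / bh, ys.size / (bh * bw)], dtype=float)

def contour_features(img):
    """Boundary-pixel fraction and a perimeter^2 / area compactness measure."""
    fg = img.astype(bool)
    p = np.pad(fg, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    perimeter = (fg & ~interior).sum()
    return np.array([perimeter / img.size, perimeter ** 2 / (fg.sum() + 1e-9)], dtype=float)

def fuse(img):
    """Serial fusion: concatenate all descriptors into one vector."""
    return np.concatenate([orientation_histogram(img), geometric_features(img),
                           ltp_histogram(img), contour_features(img)])

def select_top_variance(X, k):
    """Keep the k highest-variance columns (assumed stand-in for feature selection)."""
    idx = np.sort(np.argsort(X.var(axis=0))[::-1][:k])
    return X[:, idx], idx

# Two synthetic binary "characters": a vertical bar and a filled square.
bar = np.zeros((32, 32), dtype=np.uint8)
bar[4:28, 14:18] = 1
square = np.zeros((32, 32), dtype=np.uint8)
square[8:24, 8:24] = 1

X = np.stack([fuse(bar), fuse(square)])   # one fused row per image
X_sel, kept = select_top_variance(X, k=32)
```

In a full pipeline, the selected matrix `X_sel` would be passed, together with class labels, to the classifiers being compared, for example a linear SVM.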