Kilho Shin, Tetsuji Kuboyama, Takako Hashimoto, Dave Shepard
PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA 61-67 2015 (peer-reviewed)
Feature selection is a useful tool for identifying which features, or attributes, of a dataset cause or explain phenomena, and for improving the efficiency and accuracy of learning algorithms that discover such phenomena. Consequently, feature selection has been studied intensively in machine learning research. However, advanced feature selection algorithms that avoid redundant selection of features and detect interacting features generally require heavy computation, and hence are seldom used for big data analysis. To remove this limitation, we improved the run-time performance of two of the most advanced feature selection algorithms known in the literature. The result is two accurate and extremely fast algorithms, namely Super-CWC and Super-LCC. In experiments with multiple real datasets that are actually studied in big data research, we demonstrate that our algorithms remarkably improve on the run-time performance of the original algorithms. For example, for two datasets, one with 15,568 instances and 15,741 features and another with 200,569 instances and 99,672 features, Super-CWC performed feature selection in 1.4 seconds and 405 seconds, respectively. This is a remarkable improvement, because the original algorithms are estimated to need from several hours to a few tens of days to perform feature selection on the same datasets.