Author/Editor     Blagus, Rok; Lusa, Lara
Title     Class prediction for high-dimensional class-imbalanced data
Type     article
Source     BMC Bioinformatics
Vol. and no.     Volume 11
Year of publication     2010
Extent     p. 523 (1-27)
Language     eng
Abstract     The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate whether high dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. Results: Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. (Abstract truncated at 2000 characters)
Descriptors     DATA COLLECTION
SAMPLE SIZE
BREAST NEOPLASMS
GENE EXPRESSION
OLIGONUCLEOTIDE PROBES
SELECTION BIAS