Machine learning on imbalanced data
Imbalanced data sets arise in many practically important classification problems. Typically, in these types of problems we have a sufficient number of majority class instances, while the class value of interest is rare. An example of this problem is the diagnosis of rare diseases, where the vast majority of tested patients are negative, while we want to learn the characteristics of the rare positive cases. Similar imbalances arise in genetics, detection of illegal stock market transactions, insurance fraud, production faults, etc. These problems are difficult for general data analytics approaches, but due to their importance there are many specialized approaches for tackling them. We present sampling-based approaches, cost-sensitive learning, and some adaptations of well-known learning algorithms intended to cope with data imbalance. In the last part we focus on the methods that are the topic of our research, namely feature evaluation with imbalanced data, generation of semi-artificial data, and ensemble methods.