A new genetic algorithm based clustering for binary and imbalanced class data sets (2016)
Type of Content: Electronic Thesis or Dissertation
Degree Name: Doctor of Philosophy
Publisher: University of Canterbury
Authors: Saharan, Sabariah
This research was initially driven by the lack of clustering algorithms that focus specifically on binary data. To address this gap, the research centres on a promising technique for analysing this type of data, namely the Genetic Algorithm (GA). This type of algorithm has an intrinsic search parallelism that helps it avoid becoming stuck at local optima and reduces its sensitivity to poor initialization. For the purpose of this research, GA was combined with the Incremental K-means (IKM) algorithm to cluster binary data streams. Before the proposed method was developed, a well-known GA-based clustering method, GCUK, was applied to binary data sets, a new application for this method, to gauge how well it clusters binary data. This led to the proposed new method, Genetic Algorithm-Incremental K-means (GAIKM), whose objective function is based on a few sufficient statistics that can be calculated easily and quickly on binary numbers. Unlike other clustering algorithms for binary data, the proposed method has the advantage of fast convergence through its use of IKM. Additionally, the use of GA provides a continuous search for the best solutions, which allows the method to escape the local optima that can trap other clustering methods. The results show that GAIKM is an efficient and effective new clustering algorithm compared to other clustering algorithms and to IKM itself. The other main contribution of this research is the ability of the proposed GAIKM to cluster imbalanced data sets, to which standard clustering algorithms cannot simply be applied because they may produce misclassified results. In conclusion, GAIKM outperformed other clustering algorithms and paves the way for future research on missing data and outliers, as well as on implementing GA multi-objective optimization.
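As a rough illustration of the sufficient-statistics idea mentioned in the abstract, the sketch below runs a single-pass incremental K-means over a 0/1 matrix, keeping only a point count and a per-dimension count of ones for each cluster. It is not the thesis's GAIKM: the GA search is omitted entirely, and the function name `incremental_kmeans_binary`, the `init` parameter, the seeding rule, and the squared Euclidean distance are all assumptions made for this example.

```python
import numpy as np

def incremental_kmeans_binary(X, k, init=None, seed=0):
    """Single-pass incremental K-means on binary rows (hypothetical sketch).

    Each cluster j is summarised by two sufficient statistics:
    counts[j] (number of points) and sums[j] (count of ones per dimension).
    The centroid is sums[j] / counts[j], so no point needs to be revisited.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    if init is None:
        # Assumption: seed clusters with k distinct randomly chosen rows.
        init = np.random.default_rng(seed).choice(n, size=k, replace=False)
    init = np.asarray(init)

    counts = np.ones(k)       # one seed point per cluster
    sums = X[init].copy()     # per-dimension ones contributed by the seeds
    labels = np.full(n, -1)
    labels[init] = np.arange(k)

    for i in range(n):
        if labels[i] != -1:
            continue          # seed rows are already assigned
        centroids = sums / counts[:, None]
        dist = ((centroids - X[i]) ** 2).sum(axis=1)
        j = int(np.argmin(dist))
        # Update the sufficient statistics immediately after assignment.
        counts[j] += 1
        sums[j] += X[i]
        labels[i] = j

    return labels, sums / counts[:, None]
```

Called with `init=[0, 5]` on a matrix whose first five rows are `[1, 1, 0, 0]` and last five are `[0, 0, 1, 1]`, the function assigns the two groups to separate clusters. A GA layer, as the abstract describes, would search over candidate solutions rather than rely on a single seeding.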