BEACON Researchers at Work: The Encoding Scheme based on Anchors for Scalable Multi-Objective Clustering Algorithms

This BEACON Researchers at Work blog post is by Dr. Shuwei Zhu, Prof. Lihong Xu, and Prof. Erik D. Goodman.

Fig. 1. The correct clustering solution often corresponds to a tradeoff between two or more clustering objectives

BEACON’s Greenhouse research team, led by Prof. Lihong Xu (Tongji University, China) and Prof. Erik D. Goodman, has been working on BEACON’s international collaboration projects to model, optimize, and control the microclimate inside greenhouses with the aid of evolutionary techniques. We explore the capabilities of Multi-Objective Evolutionary Algorithms (MOEAs) for solving problems with multiple conflicting objectives, explicitly trading off greenhouse production against the associated energy cost. Recently, we have been working on data-driven MOEAs that use surrogate models to address this task, as the collected data are very helpful in guiding the search toward the true Pareto front (PF) of the problem. However, the huge amount of data makes building data-driven models time-consuming, so we resort to data-mining techniques as preprocessing.

Clustering is a widely researched topic in data mining, pattern recognition, and machine learning (ML); it aims at partitioning a given dataset into clusters (groups or categories). Generally, data clustering can be regarded as an optimization problem targeted at particular types of data. Moreover, the correct clustering solution, including the number of clusters (k), often corresponds to a tradeoff between two or more clustering objectives. For example, as shown in Fig. 1, minimizing overall deviation and minimizing connectivity are conflicting goals, yet each contributes to recovering the correct clustering solution, as well as the true k in this case. Evolutionary multiobjective clustering (MOC) algorithms have shown promising potential to outperform conventional single-objective clustering algorithms, especially when the number of clusters k is not set before clustering. However, the computational burden becomes a tricky problem due to the extensive search space and the fitness-evaluation time of the evolving population, especially when the data size is large.
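To make the two conflicting objectives of Fig. 1 concrete, here is a minimal NumPy sketch of one common formulation in the MOC literature: overall deviation (cluster compactness) and connectivity (agreement with nearest neighbors). The exact formulas used in our work may differ; this is only an illustrative version.

```python
import numpy as np

def overall_deviation(X, labels, centers):
    """Sum of distances from each point to its cluster center (compactness)."""
    return sum(np.linalg.norm(X[labels == k] - c, axis=1).sum()
               for k, c in enumerate(centers))

def connectivity(X, labels, L=5):
    """Penalize points whose L nearest neighbors fall in other clusters."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    penalty = 0.0
    for i in range(n):
        neighbors = np.argsort(D[i])[1:L + 1]  # skip the point itself
        penalty += sum(1.0 / (j + 1)           # closer neighbors weigh more
                       for j, nb in enumerate(neighbors)
                       if labels[nb] != labels[i])
    return penalty
```

Minimizing overall deviation alone favors many small, tight clusters, while minimizing connectivity alone favors few large ones; the tradeoff between them is what an MOC algorithm explores.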

Most existing MOC methods define their model structure primarily through the choice of suitable clustering criteria (objectives) in order to address specific types of data. In fact, the choice of cluster representation (encoding scheme) plays an equally central role, since it determines the search space and the applicable reproduction operators. For MOC methods based on evolutionary computation (EC), three prevalent cluster representations (encoding schemes) exist in the literature: 1) the graph-based (sometimes called locus-based adjacency) scheme, which represents clusters with links between genes; 2) the prototype-based scheme, which uses cluster representatives (e.g., centroids, medoids, or modes of a cluster) as the individuals; and 3) the label-based scheme, which directly encodes each gene with its class label. The first two encoding schemes (more common in the literature) are presented in Fig. 2.
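As an illustration of the first scheme, a locus-based adjacency genome stores, for each data point, a link to another point; the clusters are then the connected components of the resulting graph. A minimal decoder (using union-find) might look like:

```python
def decode_locus_based(genome):
    """Decode a locus-based adjacency genome into cluster labels.

    genome[i] = j means point i is linked to point j; clusters are the
    connected components of the resulting undirected link graph.
    """
    n = len(genome)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in enumerate(genome):       # union every (i, genome[i]) link
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    roots, labels = {}, []
    for i in range(n):                   # relabel components 0, 1, 2, ...
        r = find(i)
        labels.append(roots.setdefault(r, len(roots)))
    return labels
```

Note that the genome length equals the number of data points n, which is exactly the scalability issue discussed below.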

(a) Graph-based encoding (b) Cluster center-based encoding
Fig. 2. Encoding schemes of MOC

The existing encoding schemes have, to a greater or lesser degree, limitations regarding their scalability when used in an EC-based framework. The graph-based and label-based representations belong to the category of point-based approaches, in which the class assignments of data points (objects) are explicitly or implicitly encoded in the individuals, so the genome length equals the number of data points n; this makes large datasets difficult to handle. On the other hand, prototype-based methods are limited to hyper-spherical clusters, which may significantly degrade their performance on data that are not linearly separable, whereas graph-based and label-based representations can overcome this problem because they are independent of cluster shape.

Fig. 3. Hierarchical structure of training seed nodes

Besides the encoding scheme, we also develop a new reproduction operation that produces high-quality solutions more efficiently by taking advantage of clustering ensemble techniques. Previous MOC approaches use genetic-operator-based reproduction (e.g., crossover and mutation) inherited from traditional MOEAs, without considering many properties of clustering tasks. To address this, we design an ensemble-based reproduction operator. However, most clustering ensemble techniques cannot serve as reproduction operators in the proposed framework, mainly because a suitable ensemble technique must: 1) flexibly output a solution with a specified k; and 2) be time efficient. Fortunately, we find that the bipartite graph partitioning strategy is well suited to both purposes. Moreover, we also apply the bipartite-graph-based cluster ensemble strategy to the final decision making over the set of non-dominated solutions obtained by MOC, a step that is underexplored in existing methods, whether k is provided or not. Specifically, a bipartite graph integrating information from multiple solutions is used to generate a clustering solution with a given k. If k is not provided, a seed-points-based cluster validity measure can effectively and efficiently select the appropriate result among the solutions of different k generated from the obtained bipartite graph.
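To convey the bipartite-graph idea in a few lines: each data point is connected to every base cluster that contains it, giving an n × m incidence matrix, and a consensus solution with a specified k is obtained by partitioning this graph. The sketch below clusters the incidence rows with a tiny k-means, which is only a simplified stand-in for the actual bipartite graph partitioning used in the paper:

```python
import numpy as np

def incidence_matrix(base_labelings):
    """n x m binary matrix: one column per cluster of every base clustering."""
    cols = []
    for labels in base_labelings:
        labels = np.asarray(labels)
        for c in np.unique(labels):
            cols.append((labels == c).astype(float))
    return np.stack(cols, axis=1)

def consensus_clustering(base_labelings, k, n_iter=20):
    """Consensus with a specified k by clustering the bipartite incidence rows.

    A rough stand-in for bipartite graph partitioning, for illustration only.
    """
    B = incidence_matrix(base_labelings)
    # farthest-first initialization keeps the k centers distinct
    centers = [B[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(B - c, axis=1) for c in centers], axis=0)
        centers.append(B[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(n_iter):  # standard k-means updates
        d = np.linalg.norm(B[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = B[labels == c].mean(axis=0)
    return labels.tolist()
```

Points that most base clusterings group together share nearly identical incidence rows, so they end up in the same consensus cluster, and k can be chosen freely at decoding time.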

From the above analysis, we see that most existing cluster representations either have scalability limitations, leading to huge search spaces on large datasets, or are sensitive to the shape of clusters. In view of this, we propose a novel cluster representation for developing scalable MOC models, which simplifies the search procedure and decreases computational overhead. For our purposes, a set of seed nodes (anchors) is found to represent the local structure of small subgroups, and these nodes are then used to encode the individuals of the proposed MOC method. Instead of directly generating sufficient seed points distributed uniformly over the entire dataset, we locate them within a hierarchical topology (a coarse-to-fine-trained topological structure), so that there are more seed points in denser regions and fewer in sparse regions. The hierarchical structure of training seed nodes is shown in Fig. 3, where a set of seed nodes (the red points) is generated to approximately represent the dataset according to its density distribution, which benefits the performance of clustering conducted on these seed nodes. (Its one-layer version, in contrast, can only generate uniformly distributed seed points.) A graph over the seed points can then be built (e.g., using minimum spanning trees) to form the proposed hierarchical topology-based cluster representation, which reduces both time and space complexity significantly.
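The following sketch conveys the idea (not the paper's exact coarse-to-fine training): seed nodes are placed with a small k-means, which naturally puts more centroids in denser regions, and the seeds are then connected by a minimum spanning tree built with Prim's algorithm. The genome then only needs to operate on the m seeds rather than all n points.

```python
import numpy as np

def seed_points(X, m, n_iter=15, seed=0):
    """Pick m seed nodes via a small k-means; denser regions attract
    more centroids, roughly mimicking a density-aware anchor set."""
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - S[None], axis=2)
        a = d.argmin(axis=1)              # assign points to nearest seed
        for j in range(m):
            if np.any(a == j):
                S[j] = X[a == j].mean(axis=0)
    return S

def mst_edges(S):
    """Prim's algorithm over the complete graph of seed nodes."""
    m = len(S)
    D = np.linalg.norm(S[:, None] - S[None], axis=2)
    in_tree, edges = {0}, []
    while len(in_tree) < m:
        best = None                       # cheapest edge leaving the tree
        for i in in_tree:
            for j in range(m):
                if j not in in_tree and (best is None or D[i, j] < best[2]):
                    best = (i, j, D[i, j])
        edges.append((best[0], best[1]))
        in_tree.add(best[1])
    return edges
```

Since m can be far smaller than n, both the genome length and the pairwise computations shrink accordingly, which is the source of the time and space savings.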

Comparison experiments conducted on a series of datasets with different distributions reveal the superiority of the proposed MOC algorithm in terms of both clustering performance and computational efficiency.

Fig. 4. Different types of combinations of clusterings

We illustrate three different types of combinations of clusterings in Fig. 4: (a) sequential clustering, in which each method in turn may use information provided by the previous clustering system while possibly exploiting new additional data; (b) cooperative clustering, in which each clustering algorithm produces its result independently, the final clustering is computed in a post-processing step, and the only information exchanged is when the individual processes are completed; and (c) collaborative clustering, in which the group solves problems together, defined and imposed by a central controller that assigns an individual task to each learner; interactions between team members are recurrent, responsibility is collective, and the action of each teammate contributes to the performance of the clustering at each iteration. Most classical clustering methods, which work sequentially, belong to the first type, while ensemble clustering techniques, which work in parallel and combine results at the end, are a kind of cooperative clustering. MOC is actually a kind of collaborative clustering, since the individuals exchange information each generation in order to evolve a set of high-quality and diverse clustering solutions to aid the final decision making. As the scalability issue of MOC models has been dramatically alleviated in our work by the new cluster representation, there is great potential to explore MOC methods on more complicated datasets with either huge size or irregular clustering structures.

Notes: The main result above was published in the journal IEEE Transactions on Cybernetics, 2021, doi: 10.1109/TCYB.2021.3081988.


[1] S. Zhu, L. Xu, and E. D. Goodman. Hierarchical Topology-Based Cluster Representation for Scalable Evolutionary Multiobjective Clustering. IEEE Transactions on Cybernetics, 2021, doi: 10.1109/TCYB.2021.3081988.

For more information about this work, you can contact Prof. Lihong Xu at email:

This entry was posted in BEACON Researchers at Work. Bookmark the permalink.
