| Data Clustering |
Website Links For Cluster |
Information AboutData Clustering |
| CATEGORIES ABOUT CLUSTER ANALYSIS | |
| data mining | |
| machine learning | |
| statistical data sets | |
| knowledge discovery in databases | |
|
Besides the term ''data clustering'' (or just ''clustering''), there are a number of terms with similar meanings, including ''cluster analysis'', ''automatic classification'', ''numerical taxonomy'', ''botryology'' and ''typological analysis''. TYPES OF CLUSTERING Data clustering algorithms can be Hierarchical or Partitional . Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. ''Two-way clustering'', ''co-clustering'' or ''bi-clustering'' are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a Data Matrix , the row and columns are clustered simultaneously. Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean Space is that distances are symmetric (the distance from object'' A'' to ''B'' is the same as the distance from ''B'' to ''A''). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. DISTANCE MEASURE An important step in any clustering is to select a Distance Measure , which will determine how the ''similarity'' of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance. Common distance functions:
HIERARCHICAL CLUSTERING Creating clusters Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The traditional representation of this hierarchy is a Tree (called a Dendrogram ), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.) Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters. Agglomerative hierarchical clustering For example, suppose this data is to be clustered, and the Euclidean Distance is the Distance Metric . The hierarchical clustering Dendrogram would be as such: This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance. Optionally, one can also construct a Distance Matrix at this stage, where the number in the ''i''-th row ''j''-th column is the distance between the ''i''-th and ''j''-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of catching distances between clusters. Suppose we have merged the two closest elements ''b'' and ''c'', we now have the following clusters {''a''}, {''b'', ''c''}, {''d''}, {''e''} and {''f''}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters and is one of the following:
::
::
::
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). Concept clustering Another variation of the agglomerative clustering approach is Conceptual Clustering . PARTITIONAL CLUSTERING ''K''-means and derivatives ''K''-means clustering The ''K''-means Algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster... Example: The algorithm steps are (J. MacQueen, 1967):
The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Fuzzy ''c''-means clustering In Fuzzy Clustering , each point has a degree of belonging to clusters, as in Fuzzy Logic , rather than belonging completely to just one cluster. Thus, points on the edge of a cluster, may be ''in the cluster'' to a lesser degree than points in the center of cluster. For each point ''x'' we have a coefficient giving the degree of being in the ''k''th cluster . Usually, the sum of those coefficients is defined to be 1: : With fuzzy ''c''-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: : The degree of belonging is related to the inverse of the distance to the cluster : then the coefficients are normalized and fuzzyfied with a real parameter so that their sum is 1. So : For ''m'' equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When ''m'' is close to 1, then cluster center closest to the point is given much more weight than the others, and the algorithm is similar to ''k''-means. The fuzzy ''c''-means algorithm is very similar to the ''k''-means algorithm:
The algorithm minimizes intra-cluster variance as well, but has the same problems as ''k''-means, the minimum is a local minimum, and the results depend on the initial choice of weights. The Expectation-maximization Algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. It has better convergence properties and is in general preferred to fuzzy-k-means. QT clustering algorithm QT (quality threshold) clustering (Heyer et al, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than ''k''-means, but does not require specifying the number of clusters ''a priori'', and always returns the same result when run several times. The algorithm is:
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters). Graph-theoretic methods Formal Concept Analysis is a technique for generating clusters of objects and attributes, given a Bipartite Graph representing the relations between the objects and attributes. Other methods for generating ''overlapping clusters'' (a Cover rather than a Partition ) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970). ELBOW CRITERION The elbow criterion is a common Rule Of Thumb to determine what number of clusters should be chosen, for example for ''k''-means and agglomerative hierarchical clustering. It should also be noted that the initial assignment of cluster seeds has bearing on the final model performance. Thus, it is appropriate to re-run the cluster analysis multiple times. The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). This elbow can not always be unambigiously identified. On the following graph, the elbow is indicated by the red circle. The number of clusters chosen should therefore be 4. Percent of variance explained or percent of total variance is the ratio of within-group variance to total variance. SPECTRAL CLUSTERING Given a set of data points A, the Similarity Matrix may be defined as a matrix where represents a measure of the similarity between points . Spectral clustering techniques make use of the Spectrum of the similarity matrix of the data to perform Dimensionality Reduction for clustering in fewer dimensions. One such technique is the ''Shi-Malik algorithm'', commonly used for Image Segmentation . It partitions points into two sets based on the Eigenvector corresponding to the second-smallest Eigenvalue of the Laplacian Matrix : of , where is the diagonal matrix : This partitioning may be done in various ways, such as by taking the median of the components in , and placing all points whose component in is greater than in , and the rest in . The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion. A related algorithm is the ''Meila-Shi algorithm'', which takes the Eigenvector s corresponding to the ''k'' largest Eigenvalue s of the matrix for some ''k'', and then invokes another (e.g. ''k''-means) to cluster points by their respective ''k'' components in these eigenvectors. APPLICATIONS Biology In Biology clustering has many applications
Market research Cluster analysis is widely used in Market Research when working with multivariate data from Surveys and test panels. Market researchers use cluster analysis to partition the general Population of Consumers into market segments and to better understand the relationships between different groups of consumers/potential Customers .
Other applications Social network analysis: In the study of Social Networks , clustering may be used to recognize Communities within large groups of people. Image segmentation: Clustering can be used to divide a Digital Image into distinct regions for Border Detection or Object Recognition . Data mining: Many Data Mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres. Slippy map optimization: Flickr 's map of photos and other map sites use clustering to reduce the number of markers on a map. This makes it both faster and reduces the amount of visual clutter. COMPARISONS BETWEEN DATA CLUSTERINGS There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. Marina Meila 's Variation of Information metric is a more recent approach for comparing distance between clusterings. It uses Mutual Information and Entropy to approximate the distance between two clusterings across the lattice of possible clusterings. ALGORITHMS In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). Among the most popular are ''CLARANS'' (Ng and Han,1994), ''DBSCAN'' (Ester et al., 1996) and ''BIRCH'' (Zhang et al., 1996). SEE ALSO
BIBLIOGRAPHY Others
For spectral clustering :
For estimating number of clusters:
For discussion of the elbow criterion:
EXTERNAL LINKS
|
|
|