![]() |
![]() |
|
![]() |
![]() |
Encyclopedia :
D :
DA :
DAT :
Data clustering |
|
|
Data clusteringData clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.Besides the term data clustering, there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis. Data clustering algorithms can be hierarchical or partitional. With hierarchical algorithms, successive clusters are found using previously established clusters, whereas partitional algorithms determine all clusters in one go. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them in successively larger clusters. The divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Hierarchical clusteringIntroductionHierarchical clustering builds a hierarchy of clusters. The traditional representation of this hierarchy is a tree, with individual elements at one end and a single cluster with every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.)
Cutting the tree at a given height will give a clustering at a selected precision. In the above example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, but with fewer clusters. Agglomerative hierarchical clusteringThis method builds the hierarchy from the individual elements by progressively merging clusters. Again, we have six elements {a} {b} {c} {d} {e} and {f}. the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance between elements. Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. But to do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters and is one of the following: Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). Partitional clusteringk-means and derivativesk-means clusteringThe k-means algorithm assigns each point to the cluster whose center (or centroid) is nearest. The centroid is the point generated by computing the arithmetic mean for each dimension separately for all the points in the cluster.
The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm. Rather, the resulting clusters depend on the initial assignments. Fuzzy c-means clusteringOne of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of the cluster, or near another cluster may not be as much in the cluster as point in the center of cluster. Therefore, in fuzzy clustering, each point does not pertain to a given cluster, but has a degree of belonging to a certain cluster, as in fuzzy logic. For each point x we have a coefficient giving the degree of being in the kth cluster . Usually, the sum of those coefficients has to be one, so that denotes a probability of belonging to a certain cluster:
: The degree of being in a certain cluster is the inverse of the distance to the cluster
The fuzzy c-means algorithm is greatly similar to the k-means algorithm; when , algorithms are equivalent: The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is local minimum, and the results depend on the initial choice of weights. ApplicationsIn biology has two main applications in the fields of computational biology and bioinformatics. External linksSeealsok-means, ANN, Self-organizing map, other clustering and mixture model links.
|
|
|
This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License. |
|
| © 2008 Chamas Enterprises Inc. |